## CHENNAI MATHEMATICAL INSTITUTE (M.Sc. / Ph.D. Programme in Computer Science)15 May 2013

Question 1 |

Ball Mart has 10

^{ 7}different items in stock across all its stores worldwide. The company has collected billing data for 10^{10}customer transactions. Each individual bill has at most 10 distinct items in it. Ball Mart’s CEO wants to optimize the company’s inventory and has asked for a list of those items that appear in at least 2% of the billed transactions. Which of the following is the most precise upper bound one can compute for the number of such items, given the data?500 | |

1000 | |

5000 | |

20000 |

Question 1 Explanation:

An item that is in 2% of the bills must appear in 2 × 10

^{ 8}bills. Across all bills, there are at most (10^{10}) × 10 = 10^{11}items mentioned. So at most (10^{11})/(2 × 10^{ 8}) = 500 items can appear in 2% of the bills. The number of items in stock is irrelevant.Question 2 |

10% of all email you receive is spam. Your spam filter is 90% reliable: that is, 90% of the mails it marks as spam are indeed spam and 90% of spam mails are correctly labelled as spam. If you see a mail marked spam by your filter, what is the probability
that it really is spam?

10% | |

50% | |

70% | |

90% |

Question 2 Explanation:

Out of 100 mails, 10 are spam. The filter will label 9 of 10 spam as spam and 9 of 90 non-spam as spam. So 18 are labelled spam, of which 9 are actually spam. You can compute the same result more formally using conditional probabilities.

Question 3 |

When a user submits a query, a search engine does the following. For every webpage that has been visited by the search engine, it computes a score indicating how relevant that page is to the query. Finally, it reports the pages with the top k scores on the screen, for a number k specified by the user. A good data structure for accumulating the scores and ranking them is:

a queue | |

a heap | |

a stack | |

a binary search tree |

Question 3 Explanation:

Let n be the number of pages visited by the search engine at the time a query is submitted. Assume that it takes constant time to compute the relevance score for each page w.r.t. a query. Then it takes O(n) time to compute the relevance scores, a further O(n) time to build a heap of n relevance scores, and O(k · log n) time for k delete-max operations to return the top k scores.

Question 4 |

Consider the set of all words over the alphabet {x, y, z} where the number of y’s is not divisible by 2 or 7 and no x appears after a z. This language is:

regular | |

not known to be regular | |

context-free but not regular | |

recursively enumerable but not context-free |

Question 4 Explanation:

Let Σ = {x, y, z}. The language in question is (Σ ∗ \ (L

L

L

L

It suffices to show that all these three languages are regular, and appeal to the fact that regular languages are closed under boolean operations.

They are described by the following regular expressions, respectively.

r

r

r

_{1}∪ L_{2})) ∩ L_{3}whereL

_{1}= {w ∈ Σ ∗ | the number of y’s is divisible by 2},L

_{2}= {w ∈ Σ ∗ | the number of y’s is divisible by 7}, andL

_{3}= {w ∈ Σ ∗ | no x appears after a z}.It suffices to show that all these three languages are regular, and appeal to the fact that regular languages are closed under boolean operations.

They are described by the following regular expressions, respectively.

r

_{1}= ((x + z) ∗ y(x + z) ∗ y(x + z) ∗ ) ∗r

_{2}= ((x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ ) ∗r

_{3}= (x + y) ∗ (y + z) ∗
There are 4 questions to complete.