## CHENNAI MATHEMATICAL INSTITUTE (M.Sc. / Ph.D. Programme in Computer Science)15 May 2013

 Question 1
Ball Mart has 10 7 different items in stock across all its stores worldwide. The company has collected billing data for 10 10 customer transactions. Each individual bill has at most 10 distinct items in it. Ball Mart’s CEO wants to optimize the company’s inventory and has asked for a list of those items that appear in at least 2% of the billed transactions. Which of the following is the most precise upper bound one can compute for the number of such items, given the data?
 A 500 B 1000 C 5000 D 20000
Algorithms       Time-Complexity
Question 1 Explanation:
An item that is in 2% of the bills must appear in 2 × 10 8 bills. Across all bills, there are at most (10 10 ) × 10 = 10 11 items mentioned. So at most (10 11 )/(2 × 10 8 ) = 500 items can appear in 2% of the bills. The number of items in stock is irrelevant.
 Question 2
10% of all email you receive is spam. Your spam filter is 90% reliable: that is, 90% of the mails it marks as spam are indeed spam and 90% of spam mails are correctly labelled as spam. If you see a mail marked spam by your filter, what is the probability that it really is spam?
 A 10% B 50% C 70% D 90%
Engineering-Mathematics       Probability
Question 2 Explanation:
Out of 100 mails, 10 are spam. The filter will label 9 of 10 spam as spam and 9 of 90 non-spam as spam. So 18 are labelled spam, of which 9 are actually spam. You can compute the same result more formally using conditional probabilities.
 Question 3
When a user submits a query, a search engine does the following. For every webpage that has been visited by the search engine, it computes a score indicating how relevant that page is to the query. Finally, it reports the pages with the top k scores on the screen, for a number k specified by the user. A good data structure for accumulating the scores and ranking them is:
 A a queue B a heap C a stack D a binary search tree
Data-Structures       Heap-Tree
Question 3 Explanation:
Let n be the number of pages visited by the search engine at the time a query is submitted. Assume that it takes constant time to compute the relevance score for each page w.r.t. a query. Then it takes O(n) time to compute the relevance scores, a further O(n) time to build a heap of n relevance scores, and O(k · log n) time for k delete-max operations to return the top k scores.
 Question 4
Consider the set of all words over the alphabet {x, y, z} where the number of y’s is not divisible by 2 or 7 and no x appears after a z. This language is:
 A regular B not known to be regular C context-free but not regular D recursively enumerable but not context-free
Theory-of-Computation       Regular-Language
Question 4 Explanation:
Let Σ = {x, y, z}. The language in question is (Σ ∗ \ (L 1 ∪ L 2 )) ∩ L 3 where
L 1 = {w ∈ Σ ∗ | the number of y’s is divisible by 2},
L 2 = {w ∈ Σ ∗ | the number of y’s is divisible by 7}, and
L 3 = {w ∈ Σ ∗ | no x appears after a z}.
It suffices to show that all these three languages are regular, and appeal to the fact that regular languages are closed under boolean operations.
They are described by the following regular expressions, respectively.
r 1 = ((x + z) ∗ y(x + z) ∗ y(x + z) ∗ ) ∗
r 2 = ((x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ y(x + z) ∗ ) ∗
r 3 = (x + y) ∗ (y + z) ∗
There are 4 questions to complete.

Register Now