Title: Finding duplicates in a data stream

Parikshit Gopalan, MSR Silicon Valley and Jaikumar Radhakrishnan, TIFR Mumbai.

Abstract:

Given a data stream of length $n$ over an alphabet $[m]$ where $n > m$, we 
consider the problem of finding a duplicate in a single pass. We give a 
randomized algorithm for this problem that uses $O((\log m)^3)$ space. This 
answers a question of Muthukrishnan and Tarui, who asked if this problem could 
be solved using sub-linear space and one pass over the input. Our algorithm 
solves the more general problem of finding a positive frequency element in a 
stream given by frequency updates where the sum of all frequencies is positive. 
Our main tool is an Isolation Lemma that reduces this problem to the task of 
detecting and identifying a dictatorial variable in a Boolean halfspace. We 
present various relaxations of the condition $n > m$, under which one can find 
duplicates efficiently.
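
One natural way to see why the frequency-update problem generalizes duplicate finding is sketched below in Python. The reduction is an assumption made for illustration and is not spelled out in the abstract: give every symbol of $[m]$ an initial update of $-1$, then add $+1$ for each stream item. The frequencies then sum to $n - m > 0$, and a symbol has positive frequency exactly when it occurs at least twice. The helper names `to_frequency_updates` and `positive_frequency_element` are hypothetical, and the solver is an exact reference that uses linear space; it is not the $O((\log m)^3)$-space randomized algorithm.

```python
from collections import Counter


def to_frequency_updates(stream, m):
    """Turn a stream over [m] = {1, ..., m} into (item, delta) frequency updates.

    Assumed reduction: one -1 update per symbol, then one +1 update per stream item.
    """
    for symbol in range(1, m + 1):
        yield (symbol, -1)
    for item in stream:
        yield (item, +1)


def positive_frequency_element(updates):
    """Exact reference solver (linear space): return an item with positive net frequency."""
    freq = Counter()
    for item, delta in updates:
        freq[item] += delta
    for item, f in freq.items():
        if f > 0:
            return item
    return None


stream = [3, 1, 4, 1, 2]  # n = 5 items over an alphabet of size m = 4
dup = positive_frequency_element(to_frequency_updates(stream, 4))
```

Here the net frequencies sum to $5 - 4 = 1$, and the only symbol with positive net frequency is the repeated one. When $n \le m$ and the stream has no repeats, no symbol ends with positive frequency and the solver returns `None`.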