Finding surprising patterns in textual data streams


We address the task of detecting surprising patterns in large textual data streams. When such streams are generated by online news media, emails, Twitter feeds, movie subtitles, scientific publications, and similar sources, these patterns can reveal events in the real world. The volume of such text streams often exceeds human capacity for analysis, so automatic pattern recognition tools are indispensable. In particular, we are interested in surprising changes in the frequency of n-grams of words or, more generally, of symbols from an alphabet of unbounded size. Although the number of possible n-grams grows exponentially with n, and the alphabet is itself unbounded, we show how such changes can be detected efficiently. To this end, we rely on a data structure known as a generalised suffix tree, which is additionally annotated with a limited amount of statistical information. Crucially, we show how the generalised suffix tree and these statistical annotations can be updated efficiently in an on-line fashion.
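
To make the notion of a surprising change in n-gram frequency concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes a plain dictionary of n-gram counts in place of the annotated generalised suffix tree, and it scores each n-gram in a recent window by a standardised deviation from the count expected under its smoothed past rate. The function names, the `smoothing` parameter, and the choice of score are all hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def surprise_scores(past_tokens, recent_tokens, max_n=2, smoothing=1.0):
    """Score each recent n-gram by how far its count exceeds expectation.

    Assumption: the expected count comes from the smoothed past rate, and
    the score is (observed - expected) / sqrt(expected). This is one simple
    choice of statistic, not the one used in the paper.
    """
    past = ngram_counts(past_tokens, max_n)
    recent = ngram_counts(recent_tokens, max_n)
    scores = {}
    for gram, observed in recent.items():
        n = len(gram)
        # Number of positions at which an n-gram of this length can start.
        past_positions = max(len(past_tokens) - n + 1, 1)
        recent_positions = max(len(recent_tokens) - n + 1, 1)
        # Additive smoothing keeps the rate positive for unseen n-grams.
        rate = (past[gram] + smoothing) / (past_positions + smoothing)
        expected = rate * recent_positions
        scores[gram] = (observed - expected) / sqrt(expected)
    return scores


past = "the cat sat on the mat".split() * 5
recent = "the cat sat on the volcano volcano volcano".split()
scores = surprise_scores(past, recent)
most_surprising = max(scores, key=scores.get)
print(most_surprising, round(scores[most_surprising], 2))
```

This brute-force count stores every n-gram explicitly and recomputes from scratch for each window, which is exactly the cost that a suffix tree with incrementally updated annotations is meant to avoid.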

2010 2nd International Workshop on Cognitive Information Processing, CIP2010