Sliding Window Joins over Data Streams


Dr. Ling Liu / Bugra Gedik
{lingliu, bgedik}
CCB 216 / CCB 263




Boosted by the proliferation of Internet connections and availability of network connected sensing devices, we are experiencing an ever increasing rate of digital information available from on-line sources. As a result of this, management of fast data streams has become a challenging problem, which requires solutions that will enable applications to effectively access and extract information from such data streams. Recently, database researchers have come up with the idea of Data Stream Management Syst ems (DSMS), which try to solve this problem in a very similar way the DBMSs have solved the data management problem for traditional data bases. Some examples of prototype DSMSs are Aurora (Brown, Brandeis, MIT), STREAM (Stanford) and Telegraph (Berkeley).

Joins are key operations in any query processing engine. However, it is not possible to apply well-known join algorithms designed for traditional database relations, without any changes, to data streams. Traditional joins are blocking operations. The y need to perform a scan on one of the inputs to produce all result tuples that match with a given tuple from the other input. However, data streams are unbounded, thus blocking is not an option. One natural way of handling joins on infinite streams is to use sliding windows. In a windowed stream join, a tuple from one stream is joined with only the tuples currently available in the window of another stream. The windows defined over the streams are bounded, as new tuples get inserted into the windows, old ones are removed. Some examples of such windows are "Last 100 tuples" or "Last 10seconds' tuples". The figure on the right illustrates a two-way windowed stream join.

Some application examples, that can make use of sliding window joins are:

  1. Finding similar news items from different sources (ex: CNN, Reuters, BBC)
    1. News items are represented by weighted keywords
    2. Perform windowed inner product join on weighted keywords

  2. Finding correlation between phone calls and stock trading
    1. Phone call streams {… (Pa,Pb)…} -> Pa calls Pb
    2. Stock trading streams {…, (Pb, Sx) , …} -> Pb trades Sx
    3. Perform windowed equi-join on person
    4. Pa hints Sx to Pb in the phone call

Students are expected to get familiar with the topic through reading the references [1,2,3,4]. The aim of this project is to study research issues, in particular, following issues related with sliding window joins over data streams are of interest:

A report describing your approach to performing sliding window joins over data streams. Make sure you describe your data model, assumptions about workload properties, semantics of your join operation, type of processing you perform for evaluating the j oin and how you handle load shedding. The report should describe why your approach is better than the others, ex. more generic, more efficient, more adaptive.

You will be graded on the quality of your report.

B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2002.
[2] J. Kang, J.F. Naughton and S.D. Viglas. Evaluating Window Joins over Unbounded Streams. Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2003.
[3] L. Golab and M.T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. Proceedings of the International Conference on Very Large Data Bases (VLDB), 2003.
A. Das, J. Gehrke and M. Riedewald. Approximate Join Processing over Data Streams. Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2003.
[5] L. Golab, S. Garg, and M.T. Ozsu. On Indexing Sliding Windows over Online Data Streams. Proceedings of the International Conference on Extending Database Technology (EDBT), 2004.
[6] U. Srivastava and J. Widom. Memory-Limited Execution of Windowed Stream Joins. Proceedings of the International Conference on Very Large Databases (VLDB), 2004.
more available on request