Load Shedding for Sliding Window Multi-Joins over Data Streams

Data Model: The commonly used model for data streams is that, each stream is a time ordered set of tuples. Each tuple is composed of possibly more than one attribute as defined by a schema. For instance, in the example 2 above, the phone ca ll stream has the schema (Caller: String, Callee: String) . Most of the work on data streams has so far considered simple types for the attributes in a tuple. However, the example in 1 requires representing a news item as a tuple. This can be achieved by defining a complex attribute, such as a weighted set valued attribute. In this case the join condition can be a threshold function defined over the similarity between the tuples based on a similarity metric.

Workload Properties: Different assump tions about the properties of the streams can be made, depending on the application at hand. For instance, some tuples may be more important than others; as a result, their inclusion in the output of the join may be more preferable. This information can b ecome handy if it is not possible to compute the join exactly (possibly due to insufficient system resources) . In the example 1 above, news concerning national security may be of higher importance. In example 2, calls from insiders may be of higher inter est. Another assumption can be that, there is a non-flat match probability distribution function for a newly received tuple that is compared to a tuple in a window defined over another stream, based on the difference between the timestamps of the tuples. In example 2, there will be a time delay between a hint call and the action of trading a stock based on the hint. As a result, it is more profitable to compare a tuple received from the stock stream with tuples that are further inside the window defined o ver the call stream. Another application which better illustrates the usefulness of this assumption is traffic monitoring. Assume there are three streams corresponding to three different locations on the traffic network. Each stream consists of object ide ntifiers that indicate presence of the objects at the location represented by the stream. If we want to perform an equi-join on these streams to find the objects that has passed through all three points, we'll observe a difference between the timestamps of the matching tuples from different streams. You can refer to [6] for more details about the latter assumption.

Join Semantics: Windows can be user defined (time-based or count-based), in which case we have fixed windows and there is a well defined output of the join. Or they can be flexible, in which case the system uses the available memory or CPU resources to maximize the output size (or the utility) of the join.

Join Processing: Different types of processing can be used to perform the join operation. For instance, in example 1, an inverted index can be built over window contents, while in example 2 a hash-based index can be more suitable. It is imp ortant to note that, as opposed to traditional joins, in stream joins we have to deal with the issue of dynamically maintaining the indexes built over the sliding windows as tuples enter and exit the window. This brings additional costs to indexing. Using basic windows is an effective technique for dealing with such costs. See [5] for a detailed discussion of indexing sliding windows defined over data streams.

Load Shedding Techniques: Data sources are push based in DSMSs. Which means that we do not have control over input data rates. They can be high and bursty. As a result, load on the system can exceed the capacity of the system, especially du ring spikes in data rates. This results in increased output latency and also tuple dropping as a result of overflowing buffers. In order to keep up with the incoming rates of the streams, load shedding is usually needed in data stream processing systems. Load shedding means doing less work, thus producing fewer (or in other words approximate) results. Load shedding is required for stream joins, when (i) stream rates are high, (ii) windows are large, (iii) join conditions are costly to evaluate. Different techniques can be applied to shed load in stream joins, such as Tuple Dropping (random, low selectivity first, low valued first), Window Reduction [6], or Selective Processing (processing only certain portions of the windows). It is important to make use of the assumptions described in Workload Parameters bullet when thinking about load shedding. Also, it is crucial that such load shedding techniques be adaptive to stream properties as these properties may change dynamically.

Sponsor	Dr. Ling Liu / Bugra Gedik {lingliu, bgedik}@cc.gatech.edu CCB 216 / CCB 263
Area	Databases

Sliding Window Joins over Data Streams