CS8803DS Algorithmic Techniques for Mananging
Massive Data Sets
Spring 2008, College of Computing, Georgia Tech
Office: KCAB 3350
Office phone: 52168
Office hours: TTh after class or by appointment
Email: jx at cc
Teaching Assistants: Nan
Hua
Office hours:
Email:
Annoucements
Table of Contents
- Class meeting times: TR 3:05-4:25
- Classroom: CCB 101
- In this course, we will introduce the most important problems and
results in data streaming, with applications in networking, systems,
information security, databases, and theoretical computer science (no
kidding). Data streaming is concerned with processing a long
stream of data items in one pass using a small working memory in order
to answer a class of queries regarding the stream. The trick is to
``remember'', in this small memory, as much information pertinent to
the queries as possible. In this course, we will present both the
elegance and the depth of data streaming algorithms. We will
cover the theoretical foundation of data streaming, and time
permitting, the theory of communication complexity for proving lower
bounds. Database, systems, security, and networking
students are strongly encouraged to take this course as data streaming
has many applications in these areas and this course will improve their
mathematics and algorithm skills. Theory students are also
strongly encouraged to enroll since it introduces them to real world
problems where advanced algorithms can apply and make a real
impact. This is the first time such a course has been offered and
only a couple of universities have offered a similar course, due
largely to the scarcity of experts in this area.
- The pre-requisite for this course is graduate-level algorithm
course (CS6505 or CS6550) or equivalent.
- Every student MUST fill in the course survey form and print out
and submit the "receipt page".
- No textbook is needed for this course. However, Prof. Muthukrishnan's book (a
refined
version if his free online survey) can be a very helpful reference.
- The information of the book could be found in Prof Muthukrishnan's homepage.
Some previous online version
could be also be found.
- Most of the class materials will come
from the research literature.
The reading list will be updated every
week, before we cover the
corresponding
topic.
First part: network data streaming
- Week 1&2&3, cover Prof. Xu's ACM
Sigmetrics'07 tutorial on network data streaming
- Collections for F0 estimation link (collected
by Tongqing Qiu)
- Supplementary Materials: Order Statistics pdf
(Scanned from "Statistical Inference", 2nd ed., by George
Casella and Roger L. Berger)
- Answers to Questions: The origin of the term "Tug of War" txt
- Statistic Counter slide slide
- Compressed Bloom Filter slide
- Scalable Query routing for Unstructured P2P Networks
(Exponential Decay Bloom Filter (EDBF)) slide
- Global Iceberg slide
- Data Streaming Algorithms for Accurate and Efficient
Measurement of Traffic and Flow Matrices pdf slide
- Joint Streaming and Sampling Techniques for Accurate
Identification of Super Sources/Destinations pdf slide
- Scalable and Efficient Data Streaming Algorithms for Detecting
Common Contents in Internet Traffic pdf
slide
- Space-Code Bloom Filter for Efficient Per-Flow Traffic
Measurement pdf slide
- Large-Scale IP Traceback in High-Speed Internet: Practical
Techniques and Theoretical Foundation pdf slide
- Entropy paper pdf
slide
Second part: data streaming in the
theoretical context
- The space complexity of approximating the frequency moments
(The origin of "Tug of War") pdf
- Join Size (inner product) estimation using Tug of
War Algorithm pdf
- CM sketch paper pdf
- Estimating Dominance Norms of Multiple Data Streams ps
- Stable distributions, pseudorandom generators,
embeddings, and data stream computation (pdf)
- Loglog Counting of Large
Cardinalities pdf
- Probabilistic
Counting Algorithms for Data Base
Applications
- Counting
distinct
elements in a data stream
- Space
efficient algorithms for distinct elements in a data stream
- Approximate
aggregation techniques for
sensor database pdf
- Near-optimal sparse fourier representations via sampling ps
- Improved Time Bounds for Near-Optimal Sparse Fourier
Representations pdf
- what's
new finding significant
difference in network data streams pdf
- find
hierarchical heavy hitters in data
streams pdf
- one-pass
wavelet decompositions of data
streams pdf
- distinct sampling for highly-accurate
answers to distinct values queries and event reports pdf
- what's
hot and what's not: tracking
most frequent items dynamically pdf
Third part: machine learning foundation
of data streaming
Fourth part: database data streaming
(time permitting)
The course will include an individual or small-group (1 to 2 students)
research project. Will discuss the detail in class.
Every student (or group) need to scribe one class (hopefully related to
his/her research project). Latex template zip.