Data Mining Algorithms
Sponsor
Edward Omiecinski
edwardo@cc.gatech.edu
138 CoC
Area
Database Systems
Problem
Data mining is the process of discovering interesting knowledge from large
amounts of data stored in databases. Typically, the knowledge is in the form
of patterns or associations.
In this project, you will get an introduction to one of the main data mining
problems, called association rule mining, and two algorithms that can be used
to solve this problem.
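To make the problem concrete: given a collection of baskets and a minimum support count, the goal is to find every itemset that is contained in at least that many baskets. The following is a minimal levelwise sketch in the spirit of Apriori, not a reference implementation. The function name `apriori`, the parameter `min_support` (taken here as an absolute basket count rather than a fraction), and the in-memory list of baskets are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: count} for every itemset found in >= min_support baskets.

    baskets     -- iterable of item collections (hypothetical input format)
    min_support -- absolute number of baskets (an assumption; a fraction also works)
    """
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count individual items in one pass over the data.
    counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # One pass over the data per level to count the surviving candidates.
        counts = defaultdict(int)
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result
```

The pruning step relies on the fact that every subset of a frequent itemset must itself be frequent, so a candidate with any infrequent subset can be discarded without counting it.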
Tasks:
- Read the notes on association rule mining by Jeff Ullman (Stanford),
  assocrules.ps,
  which give a good conceptual description of the problem and the algorithms.
- Implement the standard levelwise Apriori algorithm and a variation of it based
  on sampling (i.e., Toivonen's algorithm), both of which are described in the notes.
- Write a simple data generation program to produce market-basket data. It should
  produce a set of M records (i.e., baskets), where each record contains a unique
  record identifier and a set of N items purchased. N should lie between
  specified minimum and maximum values.
- Devise and run a set of experiments with your two algorithms.
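The data generator described in the tasks above could be sketched as follows. This is only one possible reading of the requirement: the function name `generate_baskets`, the item catalogue passed in as a list, the uniform choice of items, and the optional `seed` parameter are assumptions, not part of the assignment.

```python
import random


def generate_baskets(m, items, n_min, n_max, seed=None):
    """Return a list of m (record_id, item_list) pairs.

    Each basket holds N distinct items with n_min <= N <= n_max.
    Items are drawn uniformly at random (an assumption; see the note below).
    """
    items = list(items)
    if not 1 <= n_min <= n_max <= len(items):
        raise ValueError("need 1 <= n_min <= n_max <= number of distinct items")

    rng = random.Random(seed)  # a fixed seed makes experiments repeatable
    baskets = []
    for record_id in range(1, m + 1):
        n = rng.randint(n_min, n_max)        # inclusive on both ends
        chosen = sorted(rng.sample(items, n))  # n distinct items
        baskets.append((record_id, chosen))
    return baskets


if __name__ == "__main__":
    catalogue = [f"item{i}" for i in range(20)]  # hypothetical item names
    for record_id, basket in generate_baskets(5, catalogue, 2, 4, seed=0):
        print(record_id, basket)
```

Uniformly random baskets tend to contain few large itemsets at any high support threshold, so for the experiments it may be worth skewing the item distribution or planting a few item combinations on purpose.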
Deliverables
- Write a 3-5 page report addressing the following issues:
  - How did you decide on the number of experiments to run?
  - How do the sets of association rules (actually, large itemsets) produced
    by the two algorithms compare?
  - How do the execution times of the two algorithms compare?
  - If the sampling approach does not produce the same output as Apriori,
    suggest a couple of ways to rectify this problem.
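For the comparison questions above, a small helper along the following lines may be useful. It is a sketch under stated assumptions: both algorithms are assumed to return a collection of itemsets represented as frozensets, and the names `compare_itemsets` and `timed` are hypothetical.

```python
import time


def compare_itemsets(exact, sampled):
    """Compare two collections of itemsets (each itemset a frozenset).

    Returns the itemsets only the exact run found, the itemsets only the
    sampling run found, and the Jaccard similarity of the two result sets.
    """
    exact, sampled = set(exact), set(sampled)
    union = exact | sampled
    return {
        "missed": exact - sampled,    # found by the exact run only
        "spurious": sampled - exact,  # found by the sampling run only
        "jaccard": len(exact & sampled) / len(union) if union else 1.0,
    }


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Repeating each timed run several times on the same generated data and reporting the mean or median reduces the effect of noise on the execution-time comparison.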
Evaluation
Evaluation is based on the report, which must be turned in to the project
sponsor by the due date.