Exploring Skewed Data Clusters via Visualization

Sponsor	Ling Liu (lingliu@cc.gatech.edu)
Area	Information and Knowledge Management

Problem
Data clustering is the unsupervised classification of patterns into groups, such that patterns in the same group are more similar to each other than to patterns in other groups. Cluster analysis is an important technique for understanding large multi-dimensional datasets. Most clustering research to date has focused on developing automatic clustering algorithms and cluster validation methods. The automatic algorithms are known to work well on clusters of regular shape, e.g. compact spherical shapes, but may incur higher error rates on arbitrarily shaped clusters. Arbitrarily shaped clusters also complicate cluster validation and the definition of cluster hierarchies.

Interactive data visualization has proven effective for understanding complicated multi-dimensional datasets in many applications, and it applies well to cluster analysis. We have developed an interactive visual data-clustering tool called VISTA. VISTA maps multi-dimensional (>3D) data onto a 2D visual space via the VISTA visualization model, which also supports a user interface for manipulating the mapping conveniently. With the VISTA tool, it becomes possible to observe the distribution of irregularly shaped clusters.
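To make the idea of an adjustable mapping from many dimensions to 2D more concrete, here is a minimal sketch in Python. It is a generic star-coordinates style projection, not VISTA's actual code: the function names, the normalization of each column to [-1, 1], the evenly spaced default angles, and the division by the number of dimensions are all assumptions made for illustration. The alpha, theta, and zoom parameters only loosely mirror the alpha values, theta values, and zooming factor that VISTA exposes; consult the paper for the exact model.

```python
import math


def normalize_columns(rows):
    """Scale each column to [-1, 1] by its min and max (assumed preprocessing).

    Constant columns carry no information and are mapped to 0.
    """
    out_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi == lo:
            out_cols.append([0.0] * len(col))
        else:
            out_cols.append([2.0 * (v - lo) / (hi - lo) - 1.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def project_2d(rows, alphas, thetas=None, zoom=1.0):
    """Project k-dimensional rows to 2D (hypothetical helper, not VISTA's API).

    Each dimension i gets an axis direction theta_i on the plane and a
    weight alpha_i; a point is the weighted sum of its coordinates along
    those axes, averaged over k and scaled by zoom.
    """
    k = len(alphas)
    if thetas is None:
        # Assumption: spread the k axes evenly around the circle.
        thetas = [2.0 * math.pi * i / k for i in range(k)]
    points = []
    for row in rows:
        x = sum(a * v * math.cos(t) for a, v, t in zip(alphas, row, thetas))
        y = sum(a * v * math.sin(t) for a, v, t in zip(alphas, row, thetas))
        points.append((zoom * x / k, zoom * y / k))
    return points
```

Changing one alpha value rescales the contribution of that dimension, which is the kind of interactive adjustment that can pull apart clusters overlapping in one view.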

The tool is available at:

http://disl.cc.gatech.edu/VISTA/

Here is what you need to do.

Read the following paper on the VISTA system:

Keke Chen and Ling Liu: A Visual Framework Invites Human into the Clustering Process. SSDBM 2003, Boston, MA, 2003

Download and install VISTA. Note that you need to install Java 2 in order to run the VISTA system.
Read the online documentation and try the tool with the sample datasets in the VISTA package. Learn how to manipulate the visualization and find interesting results.
Find a dataset from an interesting area, such as intrusion detection, security, web systems, or bioinformatics, whose cluster distribution appears irregular in the VISTA system. The dataset should be in row-by-column format, where each column represents a feature and each row holds the feature values of one entity; check the sample datasets for the exact format. Numerical data is preferred for your experiment, although VISTA can also handle some categorical values. If documented clusters exist, compare them with the clusters you observe in VISTA and identify the differences: Are any clusters irregularly shaped? Which clusters are close to each other? Which clusters appear to form one big cluster? Is there any hierarchical cluster structure? Can you use domain knowledge about the data to explain your findings? If not, what is the inconsistency between your findings and the domain knowledge, and how would you interpret it?
Save the interesting screenshots.
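The row-by-column layout described in the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `write_row_by_column` is hypothetical, and the comma delimiter is a guess, so check the sample datasets in the VISTA package for the delimiter and any header the tool actually expects.

```python
import csv
import io


def write_row_by_column(rows, out, delimiter=","):
    """Write one entity per row and one numeric feature per column.

    Hypothetical helper: the delimiter is an assumption, not VISTA's
    documented format. Raises ValueError on empty or ragged input, or
    on a value that cannot be converted to a number.
    """
    if not rows:
        raise ValueError("dataset is empty")
    width = len(rows[0])
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for i, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"row {i} has {len(row)} columns, expected {width}")
        # float() rejects non-numeric strings, keeping the output numeric.
        writer.writerow([float(v) for v in row])


# Usage: two entities with three features each, written to an in-memory buffer.
buffer = io.StringIO()
write_row_by_column([[5.1, 3.5, 1.4], [4.9, 3.0, 1.4]], buffer)
```

Validating the column count up front catches ragged rows before the file is loaded into a visualization tool, where such errors are harder to diagnose.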

Deliverables

The dataset you used and a report including:

· a description of the dataset you used,

· the interpretation of clusters,

· any domain knowledge about the dataset,

· your findings and the screenshots,

· a discussion of the differences and consistencies between your findings and the documentation, if the cluster distribution is documented,

· the parameter settings of the interesting visualizations (alpha values, theta values, and zooming factor; note that alpha and theta values are estimated from the visualization).

Evaluation
Evaluation is based on the report submitted to the project sponsor by the due date.