This course introduces programming techniques for data analytics. Topics include programming languages, relevant software packages, good programming practices, linear algebra in data analytics, numerical computing, and four to five machine learning algorithms used as running problems. After completing the course, students will be able to implement a data analytics pipeline (data collection, data retrieval, data analysis, data visualization) and several "handy" machine learning algorithms.
Piazza Discussion Forum
We will use Piazza for discussion (e.g., homework, project).
Post your questions there, and the teaching staff and your fellow classmates will be able to help answer them quickly.
You can also use Piazza to find project teammates.
T-Square will be used only for submitting assignments and projects.
Office Hours
Prerequisites
Students should have some programming experience in any language (for example, working knowledge of variables, operators, statements, control flow, references, functions, and classes) and should feel comfortable reading documentation.
Additional formal prerequisites
Undergraduate semester level CS 1371 (Computing for Engineers) or a different programming course, minimum grade of D.
Schedule (tentative)
Date 
Topic 
Wed 
Fri 
Events 
Aug 
20, 22 
* Course introduction
* Course survey
* Introduction to Python and its data structures

Slides 
Slides 


27, 29 
* Python exercises Q&A
* Data collection
 wget, urllib/urllib2, API


Slides 

Sep 
3, 5 
* Data collection (cont'd)
* Topics in Python

Slides 
Slides 
HW1 out (Wed) 

10, 12 
* Charting/Visualization

R resources
(Link 1)
(Link 2)
(Link 3)

Slides 


17, 19 
* Data storage and retrieval in sqlite
* Basic linear algebra overview

Slides 
Notes 
HW1 due (Fri) 

24, 26 
* Dense and sparse matrices (including Numpy)
* Good programming practices




Oct 
1, 3 
* Basic linear algebra overview
 Matrix-vector multiplication
 Norms





8, 10 
* Linear regression
 Least squares
 Computing least squares





15, 17 
* Logistic regression
 Regression vs. Classification
 Gradient descent
* Vectorization in Numpy and R





22, 24 
* Project proposal presentations





29, 31 
* K-means clustering
 K-means and its variations
 Computer architecture overview




Nov 
5, 7 
* K-means clustering
 Numerical software stack
 Case study: Efficient implementation of K-means





12, 14 
* Singular value decomposition
 SVD and dimension reduction
 Interpreting SVD geometrically





19, 21 
* Singular value decomposition
 Direct and iterative methods for computing SVD
 SVD applications: PCA, LSI





26, 28 
(Thanksgiving holiday)




Dec 
3, 5 
* Final project presentations 



Grading
 10% Class and Piazza participation
 20% Midterm
 30% Homework
 40% Final project
Late Submissions Policy
 Late homework is not accepted. The only exceptions are medical reasons or emergencies, which incur no penalty; you must submit a doctor's note or an official letter explaining the emergency.
Textbooks, references, and reading materials
 None required.
 References for programming:
 References for linear algebra / numerical computing:
 References for "practical" machine learning:
Homework (tentative)
Please note that while collaboration is allowed, individual collaborators *must* write up their own answers.
All GT students must observe the honor code.
Project
Team project: 2-3 people.
Description and grading policy coming soon (proposal + presentation, progress report, final report + presentation).
Dataset Ideas
 Freebase
 Yelp
 Numerous APIs from Google (e.g., Maps, Freebase, YouTube)
 Trulia, Zillow: real estate listing sites
 Movies data: Rotten Tomatoes, IMDB
 Dataset about soccer games, players, clubs.
  No API, but easy to scrape.
  For a soccer player: transfer history, performance, nationality, birth date, etc.
  For a soccer club: performance, squad, etc.
  Thanks Ding!
 Article Search API from the New York Times (all the way back to 1851!) (Thanks Guido!)
Auditors
Auditors must first obtain the instructor's permission, then enroll in the course.
Auditors must attend all lectures and may optionally complete the assignments.
Acknowledgements & Related Classes
Many thanks to our colleagues for sharing their course materials:
Prof. Le Song, Introduction to Computational Data Analysis, Spring 2014