This course introduces programming techniques for data analytics. Topics include programming languages, relevant software packages, good programming practices, linear algebra in data analytics, numerical computing, and four or five machine learning algorithms used as running examples. After completing the course, students will be able to implement a data analytics pipeline (data collection, data retrieval, data analysis, data visualization) and several "handy" machine learning algorithms.

Piazza Discussion Forum

We will use Piazza for discussion (e.g., homework, project). Post your questions there, and the teaching staff and your fellow classmates will be able to help answer them quickly. You can also use Piazza to find project teammates.

T-square will be used only for submitting assignments and projects.

Office Hours

Instructor: Da Kuang, Thu 4-5pm, Klaus 1315
Instructor: Polo Chau, Thu 4-5pm, Klaus 1315
TA: Lianxiao (Shawn) Qiu, Mon 1-2pm, Klaus 2108

Prerequisites

Students should have some programming experience in any language (for example, working knowledge of variables, operators, statements, control flow, references, functions, and classes) and should feel comfortable reading documentation.
Additional formal prerequisites:
Undergraduate semester-level CS 1371 (Computing for Engineers) or another programming course, with a minimum grade of D.

Schedule (tentative)

Date Topic Wed Fri Events
Aug 20, 22 * Course introduction
* Course survey
* Introduction to Python and its data structures
Slides Slides  
27, 29 * Python exercises Q&A
* Data collection
  • wget, urllib/urllib2, API
  Slides  
Sep 3, 5 * Data collection (cont'd)
  • BeautifulSoup
* Topics in Python
Slides Slides HW1 out (Wed)
10, 12 * Charting/Visualization
  • Charting in R
R resources
(Link 1) (Link 2) (Link 3)
Slides  
17, 19 * Data storage and retrieval in SQLite
* Basic linear algebra overview
  • Vectors, matrices
Slides Notes HW1 due (Fri)
24, 26 * Dense and sparse matrices (including NumPy)
* Good programming practices
     
Oct 1, 3 * Basic linear algebra overview
  • Matrix-vector multiplication
  • Norms
     
8, 10 * Linear regression
  • Least squares
  • Computing least squares
     
15, 17 * Logistic regression
  • Regression vs. Classification
  • Gradient descent
* Vectorization in NumPy and R
     
22, 24 * Project proposal presentations      
29, 31 * K-means clustering
  • K-means and its variations
  • Computer architecture overview
     
Nov 5, 7 * K-means clustering
  • Numerical software stack
  • Case study: Efficient implementation of K-means
     
12, 14 * Singular value decomposition
  • SVD and dimension reduction
  • Interpreting SVD geometrically
     
19, 21 * Singular value decomposition
  • Direct and iterative methods for computing SVD
  • SVD applications: PCA, LSI
     
26, 28 (Thanksgiving holiday)      
Dec 3, 5 * Final project presentations      

Grading

Late Submissions Policy

Textbooks, references, and reading materials

Homework (tentative)

Please note that while collaboration is allowed, each collaborator *must* write up their own answers. All GT students must observe the honor code.

Project

Team project: 2-3 people. Description and grading policy coming soon (proposal + presentation, progress report, final report + presentation).

Dataset Ideas

Auditors

Auditors must first obtain the instructor's permission, then enroll in the course. Auditors must attend all lectures and may optionally complete the assignments.

Acknowledgements & Related Classes

Many thanks to our colleagues for sharing their course materials:
Prof. Le Song - Introduction to Computational Data Analysis - Spring 2014