Introduction to Data Science

Instructor

Gerard de Melo
CBIM 8, Dept. of Computer Science
gerard.demelo@rutg...

Office hours: By appointment (due to the pandemic). Please send an email or use the Sakai forums.

All emails must have "[CS439]" in the subject line to be considered.

Teaching Assistants

Abu Shoeb
as2352@scarletmail.rutgers.[...]
Office hours: Tue 3–4 PM
WebEx Meeting# 790 739 681
See announcement for password.

Shahab Raji
sr1101@rutgers.[...]
Office hours: Thu 3–4 PM
WebEx Meeting# 790 739 681
See announcement for password.

Liqin Long (Grader)
ll741@scarletmail.rutgers.[...]

Time and Location

Tuesdays and Thursdays, 18:40 – 20:00
Online via Sakai (formerly Richard Weeks Hall (RWH) 102, Busch Campus)

Recitations:
Section 1 (Tuesdays, 20:25 - 21:20)
Section 2 (Thursdays, 20:25 - 21:20)
WebEx Meeting# 791 508 063
See announcement for password.

Announcements

The Sakai site for this course is now available. Slides and announcements are posted there.

Overview

Our modern world is increasingly driven by data. Data now helps determine which companies succeed, who wins elections, and even who marries whom. In this course, we will cover fundamental techniques in the emerging field of Data Science. The course is aimed at computer science students, so we will focus in particular on important computational aspects such as working with massive amounts of data ("Big Data") and learning from data ("machine learning").

Schedule/Syllabus

Date	Topics
01-21 Tue	Introduction
01-23 Thu	Data Collection: Gathering Data
01-28 Tue	Data Collection: Parsing Data
01-30 Thu	Data Analysis: Data Frames and Preprocessing
02-04 Tue	Data Analysis: Basic Statistics
02-06 Thu	Big Data Analysis (Hadoop)
02-11 Tue	Big Data Analysis (Spark)
02-13 Thu	Big Data Analysis (Spark, Ethical Aspects)
02-18 Tue	Textual Data
02-20 Thu	Data Visualization (Guest lecture by Professor James Abello)
02-25 Tue	Textual Data, Data Visualization
02-27 Thu	Social Networks, Link Analysis, Graph Data Mining
03-03 Tue	In-Class Mid-Term Exam
03-05 Thu	Finding Groups (Clustering)
03-10 Tue	Learning from Data: Feature Extraction and Dimensionality Reduction
03-12 Thu	Spring Recess (extended)
03-17 Tue	Spring Recess
03-19 Thu	Spring Recess
03-24 Tue	Learning from Data: Predicting Numbers (Regression)
03-26 Thu	Learning from Data: Predicting Numbers (Regression)
03-31 Tue	Learning from Data: Simple Classification Algorithms
04-02 Thu	Learning from Data: Essential Practices, Modern Classification Algorithms
04-07 Tue	Learning from Data: Modern Classification Algorithms
04-09 Thu	Learning from Data: Evaluating Machine Learning Models
04-14 Tue	Learning from Data: Multi-Class Classification
04-16 Thu	Learning from Data: Fairness and Bias, Interpretability and Explainability
04-21 Tue	Learning from Data: Deep Learning
04-23 Thu	Data Mining Algorithms
04-28 Tue	Recommending Items
04-30 Thu	Project Presentations
05-07 Thu	Final Exam (8:00 PM, officially until 11:00 PM)

Slides, Discussion Forum

Sakai is used to host slides and to provide a forum for discussions.

Grading and Course Project

The grades will be determined as follows:

30% Assignments
30% Course Project
20% Mid-Term Exam
20% Final Exam
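
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name `weighted_grade` and the component keys are hypothetical, and the sketch assumes every component is already expressed as a percentage between 0 and 100. It does not model the recitation attendance bonus, since the exact way that bonus is applied is not specified here.

```python
from typing import Mapping

# Weights taken from the grading breakdown above.
WEIGHTS = {
    "assignments": 0.30,
    "project": 0.30,
    "midterm": 0.20,
    "final": 0.20,
}


def weighted_grade(scores: Mapping[str, float]) -> float:
    """Combine per-component percentages (0-100) into one course percentage.

    Assumption (not an official rule): each component score is already a
    percentage, and no bonus or curve is applied.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, component scores of 90, 80, 70, and 60 (in the order listed above) combine to 0.3 * 90 + 0.3 * 80 + 0.2 * 70 + 0.2 * 60 = 77.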

We will occasionally sample attendance in recitations. Students who attend and participate regularly can receive a grade bonus of up to 10%.

Graded homework assignments will be announced on Sakai. Make sure to enable e-mail notifications.
A set of slides with details about the course project requirements will be on Sakai as well.

Policies:

Course projects can be done in teams of two, but all homework must be done entirely on your own. Late submissions will be accepted for up to 3 days after the deadline, but with a grade penalty of 20% for each late day.
This class adopts a zero-tolerance stance towards cheating and plagiarism. Please refer to Rutgers' academic integrity policy. Changing the wording or reimplementing something in a different programming language is not enough to avoid charges of plagiarism.
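
To make the late-submission rule concrete, here is a minimal Python sketch. It rests on two assumptions that the policy does not spell out: any started 24-hour period counts as a full late day, and the 20% penalty per day is taken from the score the submission would otherwise have earned. The function name `late_adjusted_score` is hypothetical. Check with the instructor for the authoritative interpretation.

```python
import math

LATE_PENALTY_PER_DAY = 0.20  # 20% per late day, from the policy above
MAX_LATE_DAYS = 3            # later submissions are not accepted


def late_adjusted_score(raw_score: float, hours_late: float) -> float:
    """Return the score after applying the late penalty.

    Assumptions (not stated in the policy): a started day counts as a
    full late day, and the penalty is a fraction of the earned score.
    """
    if hours_late <= 0:
        return raw_score  # submitted on time, no penalty
    days_late = math.ceil(hours_late / 24)
    if days_late > MAX_LATE_DAYS:
        return 0.0  # past the 3-day window, not accepted
    return raw_score * (1 - LATE_PENALTY_PER_DAY * days_late)
```

Under these assumptions, a submission that would have earned 90 points and arrives 30 hours late counts as two late days and keeps 60% of its score, or about 54 points.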

References

Since we are focusing on the latest developments, this course does not strictly follow any designated textbook. Instead, specific references for further reading will be posted at the end of the slides for each unit (typically on the last slide). Still, the following optional books may be useful.

Ian Langmore, Daniel Krasner. Applied Data Science.
Holden Karau et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.
Note: The book is helpful but not required, especially since this is a fast-paced field and some of the latest changes to Spark are not yet covered in the book.
Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets.
Note: Available for free online.

Contact

For problems or questions about this site, please contact Gerard de Melo. Rutgers is an equal access/equal opportunity institution. Individuals with disabilities are encouraged to direct suggestions, comments, or complaints concerning any accessibility issues with Rutgers web sites to: accessibility@rutgers.edu or complete the Report Accessibility Barrier / Provide Feedback form.