Office hours: Thursdays, 2-4 PM in CBIM
In today's world of massive amounts of data, new methods and techniques are needed to process, search, and understand it. In this course, we discuss methods to dive deeper into such data. In terms of areas, the course will focus on techniques for information retrieval, natural language processing, and the indexing methods that enable retrieval based on the semantics of text (or multimodal content).
In terms of methods, the course will focus on recent deep learning and neural network approaches to these areas.
The course will include hands-on practical work on real data sets, based on deep learning frameworks, as well as, optionally, the Apache Spark platform.
Basic familiarity with data structures (from an Introduction to Computer Science class) and with basic mathematics and probability theory.
Basic programming ability. Some of our examples will use deep learning tools, most of which require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (especially through the Scala programming language) is not required, but it certainly won't hurt.
|09-07||Logistics, Introduction to Massive Data and Deep Learning||
Reading: Chapter 1 from Data-Intensive Text Processing with MapReduce|
Assignment: Install Apache Zeppelin. This requires Java 1.7+. Download and install from http://zeppelin.apache.org (the smaller net-install version is sufficient, but you can also install the full version of Zeppelin). Make sure that you are able to start Zeppelin and then (perhaps a few seconds later) access it via your Web browser on http://localhost:8080/. Finally, verify that it is working by creating a new notebook, entering the following code and then pressing Shift-Enter to run it.
Array("Hello", "World").mkString(" ")
|09-14||Textual Data Processing with Spark||
Chapter 2 from Data-Intensive Text Processing with MapReduce
Scala crash course
Twitter's Scala School
|09-21||Text Processing with Spark||Natural Language Processing with Spark Text Processing Pipeline|
|09-28||Information Retrieval and Textual Data Processing with Spark||
Boolean Model, Classic Vector Model, Term Weighting|
Feature Engineering with Spark, Spark ML
Relevance feedback, Query Expansion, Web Search, Learning to Rank
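As a small preview of the term-weighting and vector-model topics above, here is a minimal TF-IDF sketch in plain Python. This is a toy illustration only; for real data sets, Spark ML provides feature transformers such as HashingTF and IDF.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weight vectors (as dicts) for a list of tokenized docs."""
    n = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Classic weighting: tf * log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

docs = [["deep", "learning", "for", "text"],
        ["spark", "for", "text", "processing"],
        ["deep", "neural", "networks"]]
vecs = tf_idf(docs)
```

Terms that appear in every document get weight zero, so ranking is driven by the more discriminative terms.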
|10-05||Search and Storage||
Inverted Indexes for Search, Elasticsearch|
Semantic Hashing, Vector Representations of Data
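The inverted-index idea above can be sketched in a few lines of plain Python (a toy illustration; production systems such as Elasticsearch add compression, scoring, and distribution on top of this basic structure):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, *terms):
    """Docs containing every query term (Boolean AND = postings intersection)."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = [["spark", "text", "search"],
        ["deep", "learning", "search"],
        ["spark", "deep", "learning"]]
index = build_index(docs)
```

Intersecting postings lists instead of scanning every document is what makes Boolean retrieval fast at scale.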
|10-12||Embeddings and Learning with SGD||
|10-19||Storage and Algorithms||
Duplicate Detection, Nearest-Neighbor/Similarity Search with LSH, FAISS, etc.|
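One classic LSH family for cosine similarity, random-hyperplane hashing, can be sketched in plain Python. This is a toy illustration (vectors, weights, and parameters are made up); libraries such as FAISS implement similarity search at scale.

```python
import random

def signature(vec, planes):
    """Bit signature: which side of each random hyperplane the vector lies on."""
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

def lsh_buckets(vectors, num_planes=8, seed=0):
    """Hash vectors so that vectors with a small angle tend to collide."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(signature(v, planes), []).append(i)
    return buckets

vectors = [[1.0, 0.0, 0.0],
           [0.99, 0.01, 0.0],   # nearly the same direction as vector 0
           [-1.0, 0.0, 0.0]]    # opposite direction
buckets = lsh_buckets(vectors)
```

Vectors 0 and 1 usually land in the same bucket, while vector 2 (pointing the opposite way) differs from vector 0 in every signature bit, so candidate pairs can be found without comparing all pairs.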
|10-26||Deep Learning||Backpropagation, Feedforward Networks, Activation Functions, Softmax|
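As a taste of the softmax and backpropagation topics above: a numerically stable softmax (shift by the max before exponentiating), plus the well-known gradient of the cross-entropy loss with respect to the logits. This is an illustrative plain-Python sketch, not a framework implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(probs, target):
    """d(cross-entropy)/d(logits) for a softmax output: p_i - 1[i == target]."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
grad = cross_entropy_grad(probs, 0)
```

The gradient's simple form (predicted probability minus one-hot target) is why softmax plus cross-entropy is the standard output layer for classification networks.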
|11-02||Deep Learning for Classification, Retrieval, and Labeling||
Convolutional Neural Networks|
Sentiment Analysis, Text Classification, Neural Information Retrieval, Named Entity Recognition
|11-09||Language Models for Retrieval and Prediction||
Recurrent Neural Networks, LSTMs, Seq2Seq|
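The recurrence underlying RNNs, LSTMs, and Seq2Seq models can be sketched as a single vanilla (Elman) RNN step in plain Python; the toy weights below are made up for illustration:

```python
import math

def rnn_step(x, h, Wxh, Whh, b):
    """One vanilla RNN step: h' = tanh(Wxh @ x + Whh @ h + b)."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    return [math.tanh(a + c + bi)
            for a, c, bi in zip(matvec(Wxh, x), matvec(Whh, h), b)]

# The hidden state h carries information forward across the sequence.
h = [0.0, 0.0]
Wxh = [[0.5, -0.2], [0.1, 0.3]]  # input-to-hidden weights (toy values)
Whh = [[0.4, 0.0], [0.0, 0.4]]   # hidden-to-hidden weights (toy values)
b = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):
    h = rnn_step(x, h, Whh=Whh, Wxh=Wxh, b=b)
```

LSTMs replace this single tanh update with gated updates to fight vanishing gradients, and Seq2Seq chains an encoder RNN to a decoder RNN.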
|11-16||Semantic Analysis||LSTMs with Attention, Neural Machine Translation|
|11-21 (Tuesday!)||Question Answering and Structured Data||
IBM's Watson, Question Answering over Structured Data|
Deep Learning approaches
|11-23||No class: Thanksgiving Recess|
|11-30||Short Project Presentations|
|12-07||Question Answering, Recap|
See also: Rutgers Academic Calendar.
Sakai is used to host slides, as well as to provide a forum for discussions.
The main course requirement will be a semester-long course project involving Apache Spark and/or Deep Learning. See the Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.
Since we are focusing on the latest research and technology, this course does not strictly follow any designated textbook. However, the following (optional) books may be useful.
Note: The book is helpful but not required, especially since this is a fast-paced field and some of the latest changes to Spark are not yet covered in the book.
Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. Mining of Massive Datasets
Note: Available for free online.