Instructor

Gerard de Melo
CBIM 8, Dept. of Computer Science
Office hours: Wednesdays, 6-7 PM

Teaching Assistant

Rajarshi Bhowmik
Email: rajarshi.bhowmik@rutg...
Office hours: Thursdays, 2-4 PM in Hill 410

Announcements

Overview

In today's world of massive amounts of data, new methods and techniques are needed to make sense of it. In this course, we discuss methods for analyzing such data in depth. In terms of areas, the course will focus on information retrieval, natural language processing, and the indexing methods needed to retrieve text (or multimodal content) based on its semantics.

In terms of methods, the course will focus on recent Deep Learning and neural network methods for these areas.

The course will include hands-on practical work on real data sets, based on deep learning frameworks and, optionally, the Apache Spark platform.

Prerequisites

Basic familiarity with data structures (from an Introduction to Computer Science class), basic mathematics, and probability theory.


Basic programming ability. Some of our examples will use Deep Learning tools, most of which require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (especially using the Scala programming language) is not required, but certainly won't hurt.

Topics

Date Topics Further Reading/References
09-07 Logistics, Introduction to Massive Data and Deep Learning Reading: Chapter 1 from Data-Intensive Text Processing with MapReduce

Assignment: Install Apache Zeppelin. This requires Java 1.7+. Download and install it from http://zeppelin.apache.org (the smaller net-install version is sufficient, but you can also install the full version of Zeppelin). Make sure that you are able to start Zeppelin and then (perhaps a few seconds later) access it in your web browser at http://localhost:8080/. Finally, verify that it is working by creating a new notebook, entering the following code, and pressing Shift-Enter to run it:
Array("Hello", "World").mkString(" ")
09-14 Textual Data Processing with Spark Reading:
Chapter 2 from Data-Intensive Text Processing with MapReduce
Scala crash course
Twitter's Scala School
09-21 Text Processing with Spark Reading:
J. Laskowski. Mastering Apache Spark 2.0. RDD — Resilient Distributed Dataset Chapter.
J. Laskowski. Mastering Apache Spark 2.0. RDD Transformations Chapter.
J. Laskowski. Mastering Apache Spark 2.0. RDD Actions Chapter.

For Reference Only:
More of J. Laskowski. Mastering Apache Spark 2.0.
Holden Karau et al. Learning Spark. Lightning-Fast Data Analysis. O'Reilly, 2015.
09-28 Information Retrieval and Textual Data Processing with Spark Reading:
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
Chapter 2 (The term vocabulary & postings lists)
Chapter 6 (Scoring, term weighting & the vector space model)
Chapter 21 (Link Analysis)
10-05 Search and Storage Reading:
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
Chapter 5 (Index compression)
Chapter 8 (Evaluation in information retrieval)
Chapter 9 (Relevance feedback & query expansion)
10-12 Embeddings and Representation Learning Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft):
Chapter 15 (Vector Semantics)
Chapter 16 (Semantics with Dense Vectors)
10-19 Representation Learning: Gradient-based Learning Stanford CS 231n Course Notes on Optimization
10-26 Search with Vectors Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Data Sets. Chapter 3.
Example of how LSH is used at Uber (using Spark)
Code example
11-02 Deep Learning for Classification and Labeling Y. Goldberg. A Primer on Neural Network Models for Natural Language Processing: Sections 4 and 6
11-09 Sequence Modeling Y. Goldberg. A Primer on Neural Network Models for Natural Language Processing: Section 9 (Convolutional Layers, p.42ff)
11-16 Semantic Analysis PyTorch
G. Neubig (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial: Sections 6 and 7
11-21 (Tuesday!) Semantic Analysis and Question Answering
Sentence and Document Models
11-23 No class: Thanksgiving Recess
11-30 Short Project Presentations
12-07 Question Answering, Recap IBM's Watson
Deep Learning approaches
Conversational Agents (e.g., Siri)

See also: Rutgers Academic Calendar.

Slides, Discussion Forum

Sakai is used to host slides, as well as to provide a forum for discussions.

Grading and Course Project

The main course requirement will be a semester-long course project, involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (Note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.

References

Since we are focusing on the latest research and technology, this course does not strictly follow any designated textbook. However, the following (optional) books may be useful.