Instructor

Gerard de Melo
CBIM 8, Dept. of Computer Science
Office hours: Wednesdays, 6-7 PM

Teaching Assistant

Rajarshi Bhowmik
Email: rajarshi.bhowmik@rutg...
Office hours: Thursdays, 2-4 PM in CBIM

Announcements

Overview

Today's massive volumes of data call for new methods and techniques. In this course, we discuss methods for analyzing such data in greater depth. In terms of areas, the course will focus on techniques for information retrieval, natural language processing, and relevant indexing methods that enable retrieval based on the semantics of text (or multimodal content).

In terms of methods, the course will focus on recent Deep Learning and neural network methods for these areas.

The course will include hands-on practical work on real data sets, based on deep learning frameworks and, optionally, the Apache Spark platform.

Prerequisites

Basic familiarity with data structures (from an Introduction to Computer Science class), basic mathematics, and probability theory.


Basic programming ability. Some of our examples will use Deep Learning tools, most of which require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (especially using the Scala programming language) is not required but is certainly helpful.

Topics

Date Topics Further Reading/References
09-07 Logistics, Introduction to Massive Data and Deep Learning Reading: Chapter 1 from Data-Intensive Text Processing with MapReduce

Assignment: Install Apache Zeppelin, which requires Java 1.7 or later. Download and install it from http://zeppelin.apache.org (the smaller net-install version is sufficient, but you can also install the full version of Zeppelin). Make sure that you can start Zeppelin and then (perhaps a few seconds later) access it in your Web browser at http://localhost:8080/. Finally, verify that it is working by creating a new notebook, entering the following code, and pressing Shift-Enter to run it:
Array("Hello", "World").mkString(" ")
09-14 Textual Data Processing with Spark Reading:
Chapter 2 from Data-Intensive Text Processing with MapReduce
Scala crash course
Twitter's Scala School
09-21 Text Processing with Spark Natural Language Processing with Spark Text Processing Pipeline
09-28 Information Retrieval and Textual Data Processing with Spark Boolean Model, Classic Vector Model, Term Weighting
Feature Engineering with Spark, Spark ML
Relevance feedback, Query Expansion, Web Search, Learning to Rank
10-05 Search and Storage Inverted Indexes for Search, Elasticsearch
Semantic Hashing, Vector Representations of Data
10-12 Embeddings and Learning with SGD Word Vectors
SGD
10-19 Storage and Algorithms Duplicate Detection, Nearest-Neighbor/Similarity Search with LSH, FAISS, etc.
Spatial Indexing
NoSQL
10-26 Deep Learning Backpropagation, Feedforward Networks, Activation Functions, Softmax
11-02 Deep Learning for Classification, Retrieval, and Labeling Convolutional Neural Networks
Sentiment Analysis, Text Classification, Neural Information Retrieval, Named Entity Recognition
11-09 Sequence Modeling Language Models for Retrieval and Prediction
Recurrent Neural Networks, LSTMs, Seq2Seq
11-16 Semantic Analysis LSTMs with Attention, Neural Machine Translation
11-21 (Tuesday!) Question Answering and Structured Data IBM's Watson, Question Answering over Structured Data
Deep Learning approaches
Siri
11-23 No class: Thanksgiving Recess
11-30 Short Project Presentations
12-07 Question Answering, Recap

See also: Rutgers Academic Calendar.
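
Once the one-line check from the 09-07 assignment runs, a slightly larger notebook paragraph can confirm that the Scala interpreter handles collections as expected. The following is only a minimal sketch and not part of the assignment: it counts words using the Scala standard library alone (no Spark context is assumed), and the sample sentences and the names lines, tokens, counts, and ranked are made up for illustration.

```scala
// Hypothetical sample input; any short strings will do.
val lines = Seq(
  "Hello World",
  "hello massive data"
)

// Lowercase each line, split on runs of non-word characters, drop empty tokens.
val tokens = lines.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty)

// Count how often each token occurs.
val counts = tokens.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

// Most frequent words first; ties are broken alphabetically.
val ranked = counts.toSeq.sortBy { case (word, n) => (-n, word) }

ranked.foreach { case (word, n) => println(s"$word\t$n") }
```

With the sample input above, "hello" should be listed first with a count of 2, followed by the remaining words with a count of 1 each.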

Slides, Discussion Forum

Sakai is used to host slides, as well as to provide a forum for discussions.

Grading and Course Project

The main course requirement will be a semester-long course project, involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (currently just the intermediate report).
Additionally, there will be small in-class quizzes.

References

Since we are focusing on the latest research and technology, this course does not strictly follow any designated coursebook. However, the following (optional) books may be useful.