Office hours: Thursdays, 2-4 PM in CBIM
In today's world of massive amounts of data, new methods and techniques are needed to process, search, and understand it. In this course, we discuss methods to dive deeper into such data. In terms of areas, the course will focus on techniques for information retrieval, natural language processing, and the indexing methods that enable retrieval based on the semantics of text (or multimodal content).
In terms of methods, the course will focus on recent deep learning and neural network approaches to these areas.
The course will include hands-on practical work on real data sets, based on deep learning frameworks, as well as, optionally, the Apache Spark platform.
Basic familiarity with data structures (from an Introduction to Computer Science class) and with basic mathematics and probability theory.
Basic programming ability. Some of our examples will use deep learning tools, most of which require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (especially through the Scala programming language) is not required, but it certainly won't hurt.
|09-07||Logistics, Introduction to Massive Data and Deep Learning||
Reading: Chapter 1 from Data-Intensive Text Processing with MapReduce|
Assignment: Install Apache Zeppelin. This requires Java 1.7+. Download and install from http://zeppelin.apache.org (the smaller net-install version is sufficient, but you can also install the full version of Zeppelin). Make sure that you are able to start Zeppelin and then (perhaps a few seconds later) access it via your Web browser on http://localhost:8080/. Finally, verify that it is working by creating a new notebook, entering the following code and then pressing Shift-Enter to run it.
Array("Hello", "World").mkString(" ")
|09-14||Textual Data Processing with Spark||
Chapter 2 from Data-Intensive Text Processing with MapReduce
Scala crash course
Twitter's Scala School
|09-21||Text Processing with Spark||Natural Language Processing with Spark Text Processing Pipeline|
|09-28||Information Retrieval and Textual Data Processing with Spark||
Boolean Model, Classic Vector Model, Term Weighting|
Feature Engineering with Spark, Spark ML
Relevance feedback, Query Expansion, Web Search, Learning to Rank
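As a small preview of the term-weighting and vector-model topics above, here is a minimal TF-IDF sketch in plain Python. This is a toy illustration only; for real data sets, Spark ML provides feature transformers such as HashingTF and IDF.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weight vectors (as dicts) for a list of tokenized docs."""
    n = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Classic weighting: tf * log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

docs = [["deep", "learning", "for", "text"],
        ["spark", "for", "text", "processing"],
        ["deep", "neural", "networks"]]
vecs = tf_idf(docs)
```

Terms that appear in every document get weight zero, so ranking is driven by the more discriminative terms.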
|10-05||Search and Storage||
Inverted Indexes for Search, Elasticsearch|
Semantic Hashing, Vector Representations of Data
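The inverted-index idea above can be sketched in a few lines of plain Python (a toy illustration; production systems such as Elasticsearch add compression, scoring, and distribution on top of this basic structure):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, *terms):
    """Docs containing every query term (Boolean AND = postings intersection)."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = [["spark", "text", "search"],
        ["deep", "learning", "search"],
        ["spark", "deep", "learning"]]
index = build_index(docs)
```

Intersecting postings lists instead of scanning every document is what makes Boolean retrieval fast at scale.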
|10-12||Embeddings and Learning with SGD||
|10-19||Storage and Algorithms||
Duplicate Detection, Nearest-Neighbor/Similarity Search with LSH, FAISS, etc.|
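One classic LSH family for cosine similarity, random-hyperplane hashing, can be sketched in plain Python. This is a toy illustration (vectors, weights, and parameters are made up); libraries such as FAISS implement similarity search at scale.

```python
import random

def signature(vec, planes):
    """Bit signature: which side of each random hyperplane the vector lies on."""
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

def lsh_buckets(vectors, num_planes=8, seed=0):
    """Hash vectors so that vectors with a small angle tend to collide."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(signature(v, planes), []).append(i)
    return buckets

vectors = [[1.0, 0.0, 0.0],
           [0.99, 0.01, 0.0],   # nearly the same direction as vector 0
           [-1.0, 0.0, 0.0]]    # opposite direction
buckets = lsh_buckets(vectors)
```

Vectors 0 and 1 usually land in the same bucket, while vector 2 (pointing the opposite way) differs from vector 0 in every signature bit, so candidate pairs can be found without comparing all pairs.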
|10-26||Deep Learning||Backpropagation, Feedforward Networks, Activation Functions, Softmax|
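As a taste of the softmax and backpropagation topics above: a numerically stable softmax (shift by the max before exponentiating), plus the well-known gradient of the cross-entropy loss with respect to the logits. This is an illustrative plain-Python sketch, not a framework implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(probs, target):
    """d(cross-entropy)/d(logits) for a softmax output: p_i - 1[i == target]."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
grad = cross_entropy_grad(probs, 0)
```

The gradient's simple form (predicted probability minus one-hot target) is why softmax plus cross-entropy is the standard output layer for classification networks.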
|11-02||Deep Learning for Classification, Retrieval, and Labeling||
Convolutional Neural Networks|
Sentiment Analysis, Text Classification, Neural Information Retrieval, Named Entity Recognition
|11-09||Language Models for Retrieval and Prediction||
Recurrent Neural Networks, LSTMs, Seq2Seq|
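The recurrence underlying RNNs, LSTMs, and Seq2Seq models can be sketched as a single vanilla (Elman) RNN step in plain Python; the toy weights below are made up for illustration:

```python
import math

def rnn_step(x, h, Wxh, Whh, b):
    """One vanilla RNN step: h' = tanh(Wxh @ x + Whh @ h + b)."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    return [math.tanh(a + c + bi)
            for a, c, bi in zip(matvec(Wxh, x), matvec(Whh, h), b)]

# The hidden state h carries information forward across the sequence.
h = [0.0, 0.0]
Wxh = [[0.5, -0.2], [0.1, 0.3]]  # input-to-hidden weights (toy values)
Whh = [[0.4, 0.0], [0.0, 0.4]]   # hidden-to-hidden weights (toy values)
b = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):
    h = rnn_step(x, h, Whh=Whh, Wxh=Wxh, b=b)
```

LSTMs replace this single tanh update with gated updates to fight vanishing gradients, and Seq2Seq chains an encoder RNN to a decoder RNN.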
|11-16||Semantic Analysis||LSTMs with Attention, Neural Machine Translation|
|11-21 (Tuesday!)||Question Answering and Structured Data||
IBM's Watson, Question Answering over Structured Data|
Deep Learning approaches
|11-23||No class: Thanksgiving Recess|
|11-30||Short Project Presentations|
|12-07||Question Answering, Recap|
See also: Rutgers Academic Calendar.
Sakai is used to host slides, as well as to provide a forum for discussions.
The main course requirement will be a semester-long course project involving Apache Spark and/or Deep Learning. See the Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.
Since we are focusing on the latest research and technology, this course does not strictly follow any designated textbook. However, the following (optional) books may be useful.
Note: The book is helpful but not required, especially since this is a fast-paced field and some of the latest changes to Spark are not yet covered in the book.
Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. Mining of Massive Datasets
Note: Available for free online.