Instructor

Gerard de Melo
CBIM 8, Dept. of Computer Science
Office hours: Wed 5:10pm-6:10pm
(note: no office hours on April 26)

Teaching Assistant

Fangda Han
Email: fh199@scarletmail...
Office hours: Tue 2pm-4pm at Hill 270

Announcements

Overview

In recent years, we've witnessed an explosion in the amount of available data. This brings us both novel challenges and opportunities. The challenges include developing new frameworks to manage and process such large amounts of data (e.g., Hadoop, Apache Spark, cloud technology, “NoSQL” databases, and scalable data structures for Big Data).

The opportunities include our ability to draw on these massive amounts of data to enable stunning new advances in data mining and in creating intelligent systems. Examples include deep neural networks that can automatically describe a picture and large systems such as IBM's Watson that outperform even the best humans at answering quiz show questions.

In this course, we will study some of the latest advances in these areas, focusing both on foundations and on methods to acquire insights and knowledge from Big Data. Examples include deep learning, learning and extracting knowledge from text, compiling knowledge to create large knowledge graphs, and using all of these techniques in intelligent applications, e.g., for question answering and intelligent agents such as Facebook's "M".

The course will include hands-on practical work on real data sets, using, among other tools, the Apache Spark platform.

Prerequisites

Basic familiarity with data structures (from an Introduction to Computer Science class), basic mathematics, and probability theory.


Basic programming ability. Many of our examples will be based on Apache Spark. Prior knowledge of Spark (especially using the Scala programming language) is not required, but certainly won't hurt.

Topics

Date | Topics | Further Reading/References
01-18 Wed | Logistics, Introduction to Big Data, Overview | Chapter 1 from Data-Intensive Text Processing with MapReduce
01-23 Mon | Big Data Infrastructure | The NIST Definition of Cloud Computing
Above the Clouds: A Berkeley View of Cloud Computing
01-25 Wed | Distributed Processing 1 | Chapter 2 from Data-Intensive Text Processing with MapReduce
Scala crash course
Twitter's Scala School
01-30 Mon | Distributed Processing 2 | Chapter 3 from Data-Intensive Text Processing with MapReduce
Notes: If you're on Windows, try the latest master build, where the globalScope issue has supposedly been fixed. A precompiled version of this latest master build is available for download. If you're having trouble running Spark examples due to network issues, try starting Spark Notebook with:
bin/spark-notebook -Dmanager.tachyon.enabled=false
Finally, if you're still facing issues getting Spark Notebook to work, you can also try the free Databricks Community Edition cloud service, which provides an environment that is similar to Spark Notebook, but runs in the cloud using Amazon servers. When creating a new notebook on Databricks, select Scala to run the Scala examples from our class. To upload a text file to your server instance, click on "Tables", then "Create Table" (more information here).
02-01 Wed | Distributed Processing 3 | J. Laskowski. Mastering Apache Spark 2.0
Holden Karau et al. Learning Spark. Lightning-Fast Data Analysis. O'Reilly, 2015.

Exercise (for practice, not graded): Download the salary data file and use Spark (via Spark Notebook) to determine the average salary for every company. For reference, look at the exercise code PDF from our class, and consider searching the Web for how to perform basic operations in Scala, such as accessing the elements of an array.
02-06 Mon | Distributed Processing 4 | Spark SQL, DataFrames and Datasets guide
The references from last week also remain relevant.
02-08 Wed | Data Analytics 1 | Statistics in Spark
Important: See Course Project Slides (course-project.pdf on Sakai) for important deadlines.
02-13 Mon | Data Analytics 2 | Check out the "vis" folder in Spark Notebook
Vegas is a library for more advanced visualization (currently best used in a separate project outside of Spark Notebook)
02-15 Wed | Structured Data and Graphs | F. Gessert. NoSQL Databases: a Survey and Decision Guidance
Holden Karau et al. Learning Spark. Chapter 9 for the concepts, but see the Spark SQL, DataFrames and Datasets Guide for the latest Spark 2.x syntax.
02-20 Mon | Structured Data and Graphs | Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
02-22 Wed | Structured Data on the Web | Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
02-27 Mon | Data Streams and Data Structures for Big Data | Chapter 4 of Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets.
Spark Streaming guide
03-01 Wed | Web Mining and Information Extraction 1 | Chapter 9 of Bing Liu. Web Data Mining. 2nd Ed.

Extracting text from PDFs and other formats: Add Apache Tika's JAR file to your classpath and then run:
import java.io.FileInputStream
val tika = new org.apache.tika.Tika()
val stream = new FileInputStream("/path/to/file.pdf")
// parseToString auto-detects the file format and returns the plain text
val text = try tika.parseToString(stream) finally stream.close()
03-06 Mon | Web Mining and Information Extraction 2 | Chapter 21 of Daniel Jurafsky and James H. Martin. Speech and Language Processing. Upcoming 3rd Ed.
03-08 Wed | Scalable Data Mining | Machine Learning for Developers
A Course in Machine Learning by Hal Daumé III
03-13 Mon | Spring recess
03-15 Wed | Spring recess
03-20 Mon | Representation Learning and Deep Learning 1 | Word2Vec with DeepLearning4J
Stanford CS 224n Word2Vec Slides
03-22 Wed | Representation Learning and Deep Learning 2 | Stanford CS 231n Slides 1
Stanford CS 231n Slides 2
03-27 Mon | Deep Learning 1 | Stanford CS 231n Slides 3
03-29 Wed | Deep Learning 2 | Stanford CS 231n Slides 4
04-03 Mon | Deep Learning 3 | Configuring IntelliJ for MXNet
MXNet Char-Level Language Model
Basic MXNet Scala Examples
Awesome MXNet Examples
04-05 Wed | No class
04-10 Mon | Natural Language Processing and Deep Learning | Spark Notebook showing how to implement a neural network from scratch
DeepLearning4J
Keras
G. Neubig (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
04-12 Wed | Natural Language Understanding and Semantics with Big Data | Intel's BigDL library for Deep Learning in Spark
Language model-based text generation (Scala)
Language model-based text generation (Java)
Language model-based text generation (PyTorch)
Ready-to-use Sequence-to-Sequence learning tool
04-17 Mon | Scalable Machine Learning | Intel's BigDL library for Deep Learning in Spark
Junto - Semi-supervised Learning
Mahout
Olivier Chapelle, Bernhard Schölkopf, Alexander Zien. Introduction to Semi-Supervised Learning. Chapter 1
04-19 Wed | Data Integration | Entity Resolution tutorial
04-24 Mon | Short Project Presentations
04-26 Wed | Short Project Presentations
05-01 Mon | Review/Applications/Outlook | Chapter 28 of Daniel Jurafsky and James H. Martin. Speech and Language Processing. Upcoming 3rd Ed.

See also: Rutgers Academic Calendar.
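
For the ungraded practice exercise listed under 02-01 (average salary per company), one possible approach is sketched below. It is not the reference solution from class. The line format ("company,salary"), the file path, and the names AverageSalary, parseLine and averageByCompany are assumptions made for illustration; the actual salary data file may be laid out differently. In Spark Notebook a Spark context is normally already available, so only the body of main would be needed there.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Try

object AverageSalary {
  // Assumed (hypothetical) line format: "<company>,<salary>", e.g. "Acme,52000".
  // Lines that do not match, such as a header row, yield None and are skipped.
  def parseLine(line: String): Option[(String, Double)] =
    line.split(",").map(_.trim) match {
      case Array(company, salary) => Try(salary.toDouble).toOption.map(s => (company, s))
      case _                      => None
    }

  // Sum the salaries and count the rows per company in one pass, then divide.
  def averageByCompany(records: RDD[(String, Double)]): RDD[(String, Double)] =
    records
      .mapValues(salary => (salary, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

  def main(args: Array[String]): Unit = {
    // A local session is created so the sketch also runs as a standalone program.
    val spark = SparkSession.builder().appName("AverageSalary").master("local[*]").getOrCreate()
    try {
      val lines   = spark.sparkContext.textFile("/path/to/salaries.csv") // hypothetical path
      val records = lines.flatMap(line => parseLine(line).toList)
      averageByCompany(records).collect().foreach { case (company, avg) =>
        println(f"$company%s: $avg%.2f")
      }
    } finally spark.stop()
  }
}
```

Carrying a (sum, count) pair through reduceByKey avoids collecting all salaries of a company in one place, which is what a groupByKey-based version would do.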

Slides, Discussion Forum

Sakai is now used to host slides, as well as to provide a forum for discussions.

Grading and Course Project

The main course requirement will be a semester-long course project, involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (Note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.

Course Project Resources

The following resources may be helpful for those still undecided about their course project:

Datasets:

Topics: Have a look at some reports by Stanford students (CS224N, CS224D) to get some general inspiration (but obviously, do not plagiarize their work). You may also consider current academic research published at conferences such as ACL and COLING. Finally, you may get in touch with Gerard de Melo to get some feedback or discuss ideas.

Coding: See the resources in the syllabus for examples of working with Spark. You may also find it helpful to read a basic introduction to Scala, but note that it's really not necessary to become an expert on all the more advanced features of the language. Instead, just use Google to figure things out along the way.
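
To give a feel for how little Scala is needed to get started, here is a minimal sketch of a few basic operations (splitting a line, accessing array elements, converting a string to a number, and applying map and filter). The object name ScalaBasics, the helper fields, and the sample record are made up for illustration and do not come from the course materials.

```scala
object ScalaBasics {
  // Split a comma-separated line into trimmed fields.
  def fields(line: String): Array[String] = line.split(",").map(_.trim)

  def main(args: Array[String]): Unit = {
    val line  = "Acme, Alice, 52000" // made-up sample record
    val parts = fields(line)

    // Array elements are accessed with parentheses, not square brackets.
    val company = parts(0)
    val salary  = parts(2).toDouble // String to Double conversion

    // map transforms every element; filter keeps those matching a predicate.
    val lengths    = parts.map(_.length)
    val longFields = parts.filter(_.length > 4)

    println(s"$company pays $salary")
    println(lengths.mkString(", "))
    println(longFields.mkString(", "))
  }
}
```

The same map and filter calls carry over almost unchanged to Spark RDDs, which is why this small subset of the language goes a long way in the course examples.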

Report writing: Later, when preparing your submissions, ShareLaTeX may be useful as a simple cloud-based platform for multiple people to concurrently edit a LaTeX document.

References

Please set up Spark Notebook following the instructions in the slides. Download it from http://spark-notebook.io/dl/zip/0.7.0/2.11/2.1.0/2.7.2/false/true. After that, check out some of the example notebooks included with Spark Notebook.

Since we are focusing on the latest research and technology, this course does not strictly follow any designated coursebook. However, the following (optional) books may be useful.