Office hours: Tue 2pm-4pm at Hill 270
In recent years, we've witnessed an explosion in the amount of available data. This brings us both novel challenges and opportunities. The challenges include developing new frameworks to manage and process such large amounts of data (e.g., Hadoop, Apache Spark, cloud technology, “NoSQL” databases, and scalable data structures for Big Data).
The opportunities include our ability to draw on these massive amounts of data to enable stunning new advances in data mining and in creating intelligent systems. Examples include deep neural networks that can automatically describe a picture and large systems such as IBM's Watson that outperform even the best humans at answering quiz show questions.
In this course, we will study some of the latest advances in these areas, focusing both on foundations and on methods to acquire insights and knowledge from Big Data. Examples include deep learning, learning and extracting knowledge from text, compiling knowledge to create large knowledge graphs, and using all of these techniques in intelligent applications, e.g. for question answering and intelligent agents such as Facebook's "M".
The course will include hands-on practical work on real data sets, using the Apache Spark platform among other tools.
Prerequisites: basic familiarity with data structures (from the Introduction to Computer Science class), basic mathematics, and probability theory.
|01-18 Wed||Logistics, Introduction to Big Data, Overview||Chapter 1 from Data-Intensive Text Processing with MapReduce|
|01-23 Mon||Big Data Infrastructure||The NIST Definition of Cloud Computing; Above the Clouds: A Berkeley View of Cloud Computing|
|01-25 Wed||Distributed Processing 1||Chapter 2 from Data-Intensive Text Processing with MapReduce; Scala crash course; Twitter's Scala School|
|01-30 Mon||Distributed Processing 2||Chapter 3 from Data-Intensive Text Processing with MapReduce|
Notes: If you're on Windows, try the latest master build, where the globalScope issue has supposedly been fixed. A precompiled version of this latest master build is available for download. If you're having trouble running Spark examples due to network issues, try running it with:
bin/spark-notebook -Dmanager.tachyon.enabled=false
Finally, if you're still facing issues getting Spark Notebook to work, you can also try the free Databricks Community Edition cloud service, which provides an environment that is similar to Spark Notebook, but runs in the cloud using Amazon servers. When creating a new notebook on Databricks, select Scala to run the Scala examples from our class. To upload a text file to your server instance, click on "Tables", then "Create Table" (more information here).
|02-01 Wed||Distributed Processing 3||J. Laskowski. Mastering Apache Spark 2.0; Holden Karau et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.|
Exercise (for practice, not graded): Download the salary data file and use Spark (via Spark Notebook) to determine the average salary for every company. For reference, look at the exercise code PDF from our class, and consider searching the Web about how to achieve basic operations in Scala, such as accessing the elements of an array.
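One possible way to approach this exercise is sketched below. It is only a sketch under stated assumptions, not the reference solution from class: the record format ("company,salary", one record per line) and the file name `salaries.csv` are hypothetical, so adjust both to the actual data file. Inside Spark Notebook you can use the notebook's existing Spark context instead of building a new session.

```scala
import org.apache.spark.sql.SparkSession

object AverageSalary {
  // Assumed record format (not taken from the course materials): "company,salary".
  // Returns None for lines that do not fit, e.g. a header row or a blank line.
  def parseRecord(line: String): Option[(String, Double)] =
    line.split(",").map(_.trim) match {
      case Array(company, salary) =>
        scala.util.Try(salary.toDouble).toOption.map(s => (company, s))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AverageSalary")
      .master("local[*]")
      .getOrCreate()

    // "salaries.csv" is a placeholder path for the downloaded data file.
    val records = spark.sparkContext
      .textFile("salaries.csv")
      .flatMap(line => parseRecord(line).toList)

    // Carry a (sum, count) pair per company so the average is computed in one pass.
    val averages = records
      .mapValues(salary => (salary, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    averages.collect().foreach { case (company, avg) => println(s"$company\t$avg") }
    spark.stop()
  }
}
```

Reducing (sum, count) pairs rather than grouping all salaries per company keeps the amount of data shuffled between workers small, which matters once the file no longer fits on one machine.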
|02-06 Mon||Distributed Processing 4||Spark SQL, DataFrames and Datasets Guide. The references from last week also remain relevant.|
|02-08 Wed||Data Analytics 1||Statistics in Spark|
Important: See the Course Project Slides (course-project.pdf on Sakai) for deadlines.
|02-13 Mon||Data Analytics 2||Check out the "vis" folder in Spark Notebook. Vegas is a library for more advanced visualization (currently best used in a separate project outside of Spark Notebook).|
|02-15 Wed||Structured Data and Graphs||F. Gessert. NoSQL Databases: a Survey and Decision Guidance; Holden Karau et al. Learning Spark, Chapter 9, for the concepts (but see the Spark SQL, DataFrames and Datasets Guide for the latest Spark 2.x syntax)|
|02-20 Mon||Structured Data and Graphs||Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.|
|02-22 Wed||Structured Data on the Web||Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.|
|02-27 Mon||Data Streams and Data Structures for Big Data||Chapter 4 of Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets; Spark Streaming guide|
|03-01 Wed||Web Mining and Information Extraction 1||Chapter 9 of Bing Liu. Web Data Mining. 2nd Ed.|
Extracting text from PDFs and other formats: Add Apache Tika's JAR file to your class path and then:
val tika = new org.apache.tika.Tika()
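With a `Tika` instance in hand, the facade's `parseToString` method detects the file type and returns the extracted plain text. The following is a minimal sketch, assuming the Tika JAR (including its parsers) is on the classpath; the object name `ExtractText` and the file name `paper.pdf` are hypothetical placeholders.

```scala
import java.io.File
import org.apache.tika.Tika

object ExtractText {
  // Detects the document type and returns its text content as a String.
  def extract(path: String): String = {
    val tika = new Tika()
    tika.parseToString(new File(path))
  }

  def main(args: Array[String]): Unit = {
    // "paper.pdf" is a placeholder; pass a real path as the first argument.
    val path = if (args.nonEmpty) args(0) else "paper.pdf"
    // Print only the beginning of the extracted text.
    println(extract(path).take(500))
  }
}
```

Note that the facade limits the length of the returned string by default; for long documents, `setMaxStringLength` on the `Tika` instance adjusts that limit.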
|03-06 Mon||Web Mining and Information Extraction 2||Chapter 21 of Daniel Jurafsky and James H. Martin. Speech and Language Processing. Upcoming 3rd Ed.|
|03-08 Wed||Scalable Data Mining||Machine Learning for Developers; A Course in Machine Learning by Hal Daumé III|
|03-13 Mon||Spring recess|
|03-15 Wed||Spring recess|
|03-20 Mon||Representation Learning and Deep Learning 1|
|03-22 Wed||Representation Learning and Deep Learning 2|
|03-27 Mon||Representation Learning and Deep Learning 3|
|03-29 Wed||Natural Language Processing and Deep Learning 1||G. Neubig (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial|
|04-03 Mon||Deep Learning|
|04-05 Wed||No class|
|04-10 Mon||Natural Language Processing and Deep Learning 2|
|04-12 Wed||Natural Language Understanding and Semantics|
|04-17 Mon||Data Integration and Knowledge Graphs|
|04-19 Wed||Applications (incl. Search, Question Answering)||Chapter 28 of Daniel Jurafsky and James H. Martin. Speech and Language Processing. Upcoming 3rd Ed.|
|04-24 Mon||Short Project Presentations|
|04-26 Wed||Short Project Presentations|
See also: Rutgers Academic Calendar.
Sakai is now used to host slides, as well as to provide a forum for discussions.
The main course requirement will be a semester-long course project involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (Note: currently just the intermediate report). Additionally, there will be small in-class quizzes.
The following resources may be helpful for those still undecided about their course project:
Topics: Have a look at some reports by Stanford students (CS224N, CS224D) to get some general inspiration (but obviously, do not plagiarize their work). You may also consider current academic research published at conferences such as ACL and COLING. Finally, you may get in touch with Gerard de Melo to get some feedback or discuss ideas.
Coding: See the resources in the syllabus for examples of working with Spark. You may also find it helpful to read a basic introduction to Scala, but note that it's really not necessary to become an expert on all the more advanced features of the language. Instead, just use Google to figure things out along the way.
Report writing: Later, when preparing your submissions, ShareLaTeX may be useful as a simple cloud-based platform for multiple people to concurrently edit a LaTeX document.
Please set up Spark Notebook following the instructions in the slides. Download from http://spark-notebook.io/dl/zip/0.7.0/2.11/2.1.0/2.7.2/false/true. After that, check out some of the example notebooks included with Spark Notebook.
Since we are focusing on the latest research and technology, this course does not strictly follow any designated coursebook. However, the following (optional) books may be useful.
Note: The book is helpful but not required, especially since this is a fast-paced field and some of the latest changes to Spark are not yet covered in the book.
Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets
Note: Available for free online.