Big Data Analytics and Text Mining

Instructor

Gerard de Melo
CBIM 8, Dept. of Computer Science
Office hours: Wed 5:10pm-6:10pm
(note: no office hours on April 26)

Teaching Assistant

Fangda Han
Email: fh199@scarletmail...
Office hours: Tue 2pm-4pm at Hill 270

Time and Location

Mondays and Wednesdays, 1:40pm - 3:00pm
Till 116, Tillett Hall, Livingston Campus

Announcements

Important: See the Course Project Slides (course-project.pdf on Sakai) for important deadlines. Note, however, that the deadline for short project proposals has been extended to Wednesday, Feb. 22. See Course Project Resources below for possible datasets and topics.
The Sakai site is finally available. Enrolled students are already in the system. Others may get in touch with the TA Fangda Han to get access.

Overview

In recent years, we've witnessed an explosion in the amount of available data. This brings us both novel challenges and opportunities. The challenges include developing new frameworks to manage and process such large amounts of data (e.g., Hadoop, Apache Spark, cloud technology, “NoSQL” databases, and scalable data structures for Big Data).

The opportunities include our ability to draw on these massive amounts of data to enable stunning new advances in data mining and in creating intelligent systems. Examples include deep neural networks that can automatically describe a picture and large systems such as IBM's Watson that outperform even the best humans at answering quiz show questions.

In this course, we will study some of the latest advances in these areas, focusing both on foundations and on methods to acquire insights and knowledge from Big Data. Examples include deep learning, learning and extracting knowledge from text, compiling knowledge to create large knowledge graphs, and using all of these techniques in intelligent applications, e.g. for question answering and intelligent agents such as Facebook's "M".

The course will include hands-on practical work on real data sets, using the Apache Spark platform among other tools.

Prerequisites

Basic familiarity with data structures (from an Introduction to Computer Science class), as well as basic mathematics and probability theory.

Basic programming ability. Many of our examples will be based on Apache Spark. Prior knowledge of Spark (especially using the Scala programming language) is not required, but certainly won't hurt.

Topics

Date	Topics	Further Reading/References
01-18 Wed	Logistics, Introduction to Big Data, Overview	Chapter 1 from Data-Intensive Text Processing with MapReduce
01-23 Mon	Big Data Infrastructure	The NIST Definition of Cloud Computing; Above the Clouds: A Berkeley View of Cloud Computing
01-25 Wed	Distributed Processing 1	Chapter 2 from Data-Intensive Text Processing with MapReduce; Scala crash course; Twitter's Scala School
01-30 Mon	Distributed Processing 2	Chapter 3 from Data-Intensive Text Processing with MapReduce. Notes: If you're on Windows, try the latest master build, where the globalScope issue has supposedly been fixed; a precompiled version of this latest master build is available for download. If you're having trouble running Spark examples due to network issues, try starting Spark Notebook with the option -Dmanager.tachyon.enabled=false (i.e., bin/spark-notebook -Dmanager.tachyon.enabled=false). Finally, if you're still having trouble getting Spark Notebook to work, you can also try the free Databricks Community Edition cloud service, which provides an environment similar to Spark Notebook but runs in the cloud on Amazon servers. When creating a new notebook on Databricks, select Scala to run the Scala examples from our class. To upload a text file to your server instance, click on "Tables", then "Create Table" (more information here).
02-01 Wed	Distributed Processing 3	J. Laskowski. Mastering Apache Spark 2.0; Holden Karau et al. Learning Spark. Lightning-Fast Data Analysis. O'Reilly, 2015. Exercise (for practice, not graded): Download the salary data file and use Spark (via Spark Notebook) to determine the average salary for every company. For reference, look at the exercise code PDF from our class, and consider searching the Web for how to perform basic operations in Scala, such as accessing the elements of an array.
02-06 Mon	Distributed Processing 4	Spark SQL, DataFrames and Datasets guide. The references from last week also remain relevant.
02-08 Wed	Data Analytics 1	Statistics in Spark. Important: See Course Project Slides (course-project.pdf on Sakai) for important deadlines.
02-13 Mon	Data Analytics 2	Check out the "vis" folder in Spark Notebook. Vegas is a library for more advanced visualization (currently best used in a separate project outside of Spark Notebook).
02-15 Wed	Structured Data and Graphs	F. Gessert. NoSQL Databases: a Survey and Decision Guidance; Holden Karau et al. Learning Spark, Chapter 9, for the concepts, but see the Spark SQL, DataFrames and Datasets guide for the latest Spark 2.x syntax.
02-20 Mon	Structured Data and Graphs	Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
02-22 Wed	Structured Data on the Web	Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
02-27 Mon	Data Streams and Data Structures for Big Data	Chapter 4 of Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets; Spark Streaming guide
03-01 Wed	Web Mining and Information Extraction 1	Chapter 9 of Bing Liu. Web Data Mining. 2nd Ed. Extracting text from PDFs and other formats: Add Apache Tika's JAR file to your class path and then run: val tika = new org.apache.tika.Tika(); val stream = new java.io.FileInputStream("/path/to/file.pdf"); val text = tika.parseToString(stream); stream.close()
03-06 Mon	Web Mining and Information Extraction 2	Chapter 21 of Daniel Jurafsky and James H. Martin. Speech and Language Processing. Upcoming 3rd Ed.
03-08 Wed	Scalable Data Mining	Machine Learning for Developers; A Course in Machine Learning by Hal Daumé III
03-13 Mon	Spring recess
03-15 Wed	Spring recess
03-20 Mon	Representation Learning and Deep Learning 1	Word2Vec with DeepLearning4J; Stanford CS 224n Word2Vec Slides
03-22 Wed	Representation Learning and Deep Learning 2	Stanford CS 231n Slides 1; Stanford CS 231n Slides 2
03-27 Mon	Deep Learning 1	Stanford CS 231n Slides 3
03-29 Wed	Deep Learning 2	Stanford CS 231n Slides 4
04-03 Mon	Deep Learning 3	Configuring IntelliJ for MXNet; MXNet Char-Level Language Model; Basic MXNet Scala Examples; Awesome MXNet Examples
04-05 Wed	No class
04-10 Mon	Natural Language Processing and Deep Learning	Spark Notebook showing how to implement a neural network from scratch; DeepLearning4J; Keras; G. Neubig (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
04-12 Wed	Natural Language Understanding and Semantics with Big Data	Intel's BigDL library for Deep Learning in Spark; Language model-based text generation (Scala); Language model-based text generation (Java); Language model-based text generation (PyTorch); Ready-to-use Sequence-to-Sequence learning tool
04-17 Mon	Scalable Machine Learning	Intel's BigDL library for Deep Learning in Spark; Junto - Semi-supervised Learning; Mahout; Olivier Chapelle, Bernhard Schölkopf, Alexander Zien. Introduction to Semi-Supervised Learning. Chapter 1
04-19 Wed	Data Integration	Entity Resolution tutorial
04-24 Mon	Short Project Presentations
04-26 Wed	Short Project Presentations
05-01 Mon	Review/Applications/Outlook	Chapter 28 of Daniel Jurafsky and James H. Martin. Speech and Language Processing. Upcoming 3rd Ed.

Slides, Discussion Forum

Sakai is now used to host slides, as well as to provide a forum for discussions.

Grading and Course Project

The main course requirement will be a semester-long course project, involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (Note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.

Course Project Resources

The following resources may be helpful for those still undecided about their course project.

Datasets:

DataHub has a lot of structured data in formats such as RDF and CSV.
Text collections (not all of which are freely available) are listed at the University of Oxford Text Archive.
Wikipedia has a lot of content, and you can focus on particular topics by selecting a specific set of categories. The structured data in Wikipedia may be easier to access via DBpedia, while the original Wikipedia dumps also give you the textual content, which can be accessed e.g. via the DKPro library. The Simple English Wikipedia may be easier to start with due to its small size.
Lastly, you can simply google "<topic> dataset" or "<topic> corpus", e.g. "inauguration speech corpus".

Topics: Have a look at some reports by Stanford students (CS224N, CS224D) to get some general inspiration (but obviously, do not plagiarize their work). You may also consider current academic research published at conferences such as ACL and COLING. Finally, you may get in touch with Gerard de Melo to get some feedback or discuss ideas.

Coding: See the resources in the syllabus for examples of working with Spark. You may also find it helpful to read a basic introduction to Scala, but note that it's really not necessary to become an expert on all the more advanced features of the language. Instead, just use Google to figure things out along the way.
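
As a small taste of the basic Scala operations mentioned in the syllabus (splitting a line, accessing the elements of an array, aggregating per key), here is a minimal sketch using plain Scala collections rather than Spark. The "key,value" line format, the sample records, and the helper name averageByKey are all assumptions made for illustration; they are not taken from the class materials or any actual data file.

```scala
object AverageByKey {
  // Hypothetical helper: averages the numeric second field for every distinct
  // value of the first field. Assumes comma-separated "key,value" lines
  // (an illustrative format); lines that do not parse are skipped.
  def averageByKey(lines: Seq[String]): Map[String, Double] = {
    val pairs: Seq[(String, Double)] = lines.flatMap { line =>
      val fields = line.split(",")            // Array[String]
      if (fields.length >= 2)
        // fields(0) and fields(1) access array elements by index
        scala.util.Try(fields(1).trim.toDouble).toOption
          .map(value => (fields(0).trim, value))
      else None
    }
    // Group the pairs by key, then average the values within each group
    pairs.groupBy(_._1).map { case (key, group) =>
      val values = group.map(_._2)
      key -> values.sum / values.size
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample records, including one malformed line
    val sample = Seq("Acme,100", "Acme,200", "Globex,50", "not a record")
    averageByKey(sample).foreach { case (key, avg) => println(s"$key: $avg") }
  }
}
```

In Spark, the same idea would typically be expressed on an RDD of pairs, for instance with map followed by reduceByKey, so that the per-key sums and counts are combined across the cluster instead of in a local groupBy.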

Report writing: Later, when preparing your submissions, ShareLaTeX may be useful as a simple cloud-based platform for multiple people to concurrently edit a LaTeX document.

References

Please set up Spark Notebook following the instructions in the slides. Download from http://spark-notebook.io/dl/zip/0.7.0/2.11/2.1.0/2.7.2/false/true. After that, check out some of the example notebooks included with Spark Notebook.

Since we are focusing on the latest research and technology, this course does not strictly follow any designated coursebook. However, the following (optional) books may be useful.

Holden Karau et al. Learning Spark. Lightning-Fast Data Analysis. O'Reilly, 2015.

Note: The book is helpful but not required, especially since this is a fast-paced field and some of the latest changes to Spark are not yet covered in the book.
Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets
Note: Available for free online.
Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing