Monday, January 28, 2019
8:30 AM – 12:30 PM
(with 10:30 – 11:00 AM break)


Hilton Hawaiian Village
Honolulu, Hawaii, USA


While word embeddings such as those produced by word2vec and GloVe are widely known as a simple means of working with textual data, there has recently been substantial progress on improved methods that yield better embeddings. In particular, one may wish to induce neural vector representations not just of individual words but also of longer units of language, including 1) multi-word phrases, 2) entire sentences, or even 3) complete documents.
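To make the notion of word embeddings concrete, the sketch below shows the standard way of comparing them: cosine similarity between vectors. The vector values here are made up for illustration; in practice they would come from a trained model such as word2vec or GloVe.

```python
import numpy as np

# Toy 4-dimensional word vectors (illustrative values, not trained embeddings)
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(u, v):
    """Cosine similarity, the usual measure of closeness in embedding space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_royal = cosine(embeddings["king"], embeddings["queen"])
sim_fruit = cosine(embeddings["king"], embeddings["apple"])
# Semantically related words end up closer together in the vector space
print(sim_royal > sim_fruit)  # with these toy vectors: True
```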

Algorithms for these settings can draw on large corpora, but may also exploit supervision from other kinds of data, such as document labels, lexical resources, or natural language inference datasets. Sentence embeddings are of particular interest, because they may need to properly account for quite subtle distinctions between otherwise rather similar sentences. Moreover, new techniques have been developed to produce embeddings for multilingual and cross-lingual settings.
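One of the simplest sentence embedding baselines covered under "Simple Aggregation" is mean-pooling: averaging the word vectors of a sentence. The sketch below illustrates this with hypothetical 2-dimensional vectors; real systems would use pretrained embeddings and typically add weighting or post-processing.

```python
import numpy as np

# Hypothetical word vectors; in practice, load pretrained word2vec/GloVe vectors
vectors = {
    "the": np.array([0.1, 0.0]),
    "cat": np.array([0.9, 0.2]),
    "sat": np.array([0.3, 0.8]),
}

def sentence_embedding(tokens, vectors, dim=2):
    """Simple aggregation baseline: average the word vectors of a sentence.
    Out-of-vocabulary tokens are skipped; an empty result yields a zero vector."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

emb = sentence_embedding(["the", "cat", "sat"], vectors)
```

Despite its simplicity, this kind of averaging is a surprisingly strong baseline, which is why more sophisticated sentence encoders are usually evaluated against it.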

This tutorial will thus provide an overview of recent state-of-the-art methods that go beyond word2vec and better model the semantics of longer units such as sentences and documents, both monolingually and cross-lingually. The tutorial will start with a brief refresher of word2vec and how it relates to classic methods for distributional semantics, so no prior knowledge is required.

Topics and Slides

1. Introduction, Words
   Motivation
   History, Distributional vs. Distributed Semantics
   Refresher: word2vec
   Coping with rare words
2. Phrase Vectors
   Phrase Detection in word2vec
   External Supervision
3. Sentence Vectors
   word2vec-inspired Approaches
   Supervision from Various Sources
   Simple Aggregation
4. Document Vectors
   Word Vector-based
   Deep IR Methods
5. Applications, Conclusion
   Applications, e.g. Matching, IR, Unsupervised NMT
   Sentiment Embeddings, Visual Font Embeddings, Graph Embeddings
   Common Sense



For problems or questions about this site, please contact Gerard de Melo.