Gerard de Melo's Projects and Resources

Initiatives

Mittelstand-Digital
Together with a consortium of other partners, we run the German BMWK-funded Mittelstand-Digital Zentrum Berlin, a center that brings AI and IT to small and medium-sized enterprises.
AI Service Center
The AI Service Center Berlin-Brandenburg conducts research on how to best provide AI cloud and consultancy services to third-party organizations. Funded by BMBF.
Lexvo.org
Contributes information about words and other language-related entities to the Linked Data Web and Semantic Web. The British Library, the Spanish National Library, and others have linked their data to Lexvo.org, which in turn connects its own data to other valuable resources.

Natural Language Processing

Universal Wordnet (UWN)
One of the largest multilingual knowledge graphs, transforming the well-known WordNet database into a massively multilingual resource covering over 1 million words and several million named entities in a single semantically organized hierarchy. It is built using machine learning, together with the MENTA extension, which draws on Wikipedia. Our derivative project OpenWordNet-PT (GitHub) is used by Google Translate.
Sentiment/Emotion
Datasets and resources for sentiment analysis and fine-grained emotion analysis, in part available for multiple languages.
Etymological Wordnet
A database of etymological and derivational relationships between words in different languages, mined from Wiktionary.
NL-Augmenter
We contributed to this massive data augmentation library.
BIG-bench
A community effort to create a massive evaluation suite for large language models.
PEAK
Pyramid Evaluation of summary quality using Automated Knowledge extraction — A method for evaluating the quality of a summary (e.g., one written by students) using the Pyramid method, which is known to be significantly more reliable than the ROUGE method when evaluating individual summaries.
Biomedical Embeddings
Vector embeddings of words and concepts from the biomedical domain. The source code is a part of AiTextML.
NomLex-PT
Lexical resource providing information about Portuguese nominalizations.
Good, Great, Excellent: Semantic Intensity Information
System that scores the relative intensities of different words.
MASC Word Sense Alignment Visualizations
Visualizations of non-one-to-one alignments between word senses from two sense inventories. See also our blog post about this project.
MTRoget
Thesauri in many languages, obtained by translating Roget's Thesaurus using task-specific statistical techniques.
Typo Correction Data
Large spelling correction training datasets that enable deep learning-powered context-sensitive spelling correction.
Chinese Poetry Generation Dataset
A dataset for training and evaluating Mandarin Chinese poetry generation, as described in our ACL paper.
Cross-Lingual Code-Switching Dataset
A dataset to evaluate cross-lingual representation learning and text classification systems. This benchmark requires training on English training data but testing on documents that mix English and non-English words.
FrameNet Browsing Interface
A new, more user-friendly browsing interface for the FrameNet lexical semantic resource, which describes the semantic roles of sentences and words.
Swedish Blingbring Thesaurus
A Swedish thesaurus based on Sven Casper Bring's Svenskt ordförråd ordnat i begreppsklasser, reorganized and modernized.

Information Extraction and Information Retrieval

PACRR
Source code for a deep neural information retrieval system.
Multi-Document Relationships
System that visualizes information from across multiple documents using a graph-based user interface to browse relationships.
Taxonomy Prediction Benchmark Data
An open-domain taxonomy prediction benchmark dataset covering a much more diverse set of domains than previous datasets.

Knowledge and Data Resources

FrameBase
FrameBase uses frame semantics, a theory of natural language semantics, to represent knowledge about the world in a consistent way.
WebChild / Knowlywood
Large amounts of common-sense knowledge extracted from the Web.
YAGO-SUMO Integration
An ontology providing an enormous body of axiomatized world knowledge based on YAGO as well as the Suggested Upper Merged Ontology (SUMO). YAGO was used in IBM's famous Jeopardy!-winning system Watson.
Entity Type Description Generator
Source code for a deep learning-based system that generates natural language descriptions of entities, along with corresponding benchmark datasets (based on Wikidata).
Inductive Knowledge Graph Completion Datasets
Benchmark datasets for knowledge graph completion in an inductive setting, in which the test set contains previously unseen entities: WN18RR-Inductive, FB15k-237-Inductive, NELL-995-Inductive.
SPASS-XDB
Online interface to the SPASS-XDB reasoning system, which combines state-of-the-art theorem proving with support for large-scale knowledge sources.
Wikipedia IMDb Mappings
Mappings between Wikipedia articles and the corresponding Internet Movie Database entries, available for download.
GD View
File viewer (for DOS) supporting over 400 different file formats.
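
For the inductive knowledge graph completion datasets listed above, the defining property is that the test set mentions entities never seen during training. The following is a minimal sketch of how one might check a split for this property. The function names and the toy triples are hypothetical illustrations and are not taken from the datasets or from any accompanying code.

```python
def entities(triples):
    """Collect every head and tail entity from (head, relation, tail) triples."""
    return {e for head, _, tail in triples for e in (head, tail)}


def unseen_test_entities(train, test):
    """Return the test entities that never occur in the training triples."""
    return entities(test) - entities(train)


# Toy, made-up triples used only for illustration.
train = [("alice", "knows", "bob"), ("bob", "works_at", "acme")]
test = [("carol", "knows", "dave"), ("carol", "works_at", "acme")]

unseen = unseen_test_entities(train, test)
print(sorted(unseen))
```

A non-empty result indicates that a model cannot rely on embeddings learned for those entities during training and has to generalize from other signals.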

Multimodality

Masked-Piper
Tool for anonymizing video while retaining human pose and gesture information, to enable sharing of data for multimodal analysis.
Flash-MNIST
Challenge dataset for video classification.
CITE
Dataset providing discourse relationships between images and text.

Source Code

Source Code
We have published the source code for a number of different research projects. Follow the link for a list of available code bases.
spaCy Tutorial
Tutorial for the spaCy NLP library.

Misc.

Conway's Game of Life
Online version of the Conway's Game of Life cellular automaton.
Wire Game
Simple wire game that runs in the browser.
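
For readers unfamiliar with Conway's Game of Life, mentioned above, one generation of the automaton can be sketched as follows. This is an illustrative implementation of the standard rules (a live cell survives with two or three live neighbours, a dead cell becomes live with exactly three) on an unbounded grid. It is not the code behind the online version, and the names `step` and `blinker` are chosen here for illustration.

```python
from collections import Counter


def step(live):
    """Advance one generation; `live` is a set of (row, col) live cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly three neighbours; survival on two or three.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}


# A "blinker": three cells in a row that oscillate with period two.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(step(blinker)))
```

Representing the board as a set of live coordinates avoids fixing a grid size, at the cost of not modelling wrap-around edges.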
