Gerard de Melo's Projects and Resources

Initiatives

Mittelstand-Digital Zentrum Berlin

Together with a consortium of partners, we run the Mittelstand-Digital Zentrum Berlin, a center funded by the German BMWK that brings AI and IT to small and medium-sized enterprises.

AI Service Center

The AI Service Center Berlin-Brandenburg conducts research on how best to provide AI cloud and consultancy services to third-party organizations. Funded by BMBF.

Gorilla Conservation AI

Developing computer vision techniques to support gorilla conservation efforts in the Congo rainforest in joint work with Sabine Plattner African Charities (thanks to Magdalena Bermejo) and Conservation X Labs (thanks to Dante Wasmuth). HPI PI Maximilian Schall presented this to German Chancellor Olaf Scholz.

Sidewalk Ballet

We are investigating computer vision techniques to support urban studies on pedestrian walkability (Jane Jacobs's "Sidewalk Ballet") in a collaboration with Andres Sevtsuk (MIT) and Maryam Hosseini (Berkeley).

Lexvo.org

Lexvo.org contributes information about words and other language-related entities to the Linked Data Web and Semantic Web. The British Library, the Spanish National Library, and others have linked their data to Lexvo.org, and Lexvo.org in turn connects its own data to other valuable resources.

aideadlines.org

The aideadlines.org site lists call-for-papers deadlines of AI conferences for the research community. It is based on the original site created by Abhishek Das, who no longer maintains it.

Large Language Models and NLP

Universal Wordnet (UWN)

One of the largest multilingual knowledge graphs, transforming the well-known WordNet database into a massively multilingual resource covering over 1 million words and several million named entities in a single semantically organized hierarchy. It was built using machine learning, together with the MENTA extension, which draws on Wikipedia. Our derivative project OpenWordNet-PT (GitHub) is being used by Google Translate.

Sentiment/Emotion

Datasets and resources for sentiment analysis and fine-grained emotion analysis, some of which are available for multiple languages.

Etymological Wordnet

A database of etymological and derivational relationships between words in different languages, mined from Wiktionary.

NL-Augmenter

We contributed to this massive data augmentation library.

BIG-bench

A community effort to create a massive evaluation suite for large language models.

PEAK

Pyramid Evaluation of summary quality using Automated Knowledge extraction: a method for evaluating the quality of a summary (e.g., one written by students) using the Pyramid method, which is known to be significantly more reliable than the ROUGE method when evaluating individual summaries.

Biomedical Embeddings

Vector embeddings of words and concepts from the biomedical domain. The source code is a part of AiTextML.

NomLex-PT

Lexical resource providing information about Portuguese nominalizations.

Good, Great, Excellent: Semantic Intensity Information

System that scores the relative intensities of different words.

MASC Word Sense Alignment Visualizations

Visualizations of non-one-to-one alignments between word senses from two inventories. See also our blog post about this project.

MTRoget

Thesauri in many languages, obtained by translating Roget's Thesaurus using task-specific statistical techniques.

Typo Correction Data

Large spelling correction training datasets that enable deep learning-powered context-sensitive spelling correction.

Chinese Poetry Generation Dataset

A dataset for training and evaluating Mandarin Chinese poetry generation, as described in our ACL paper.

Cross-Lingual Code-Switching Dataset

A dataset to evaluate cross-lingual representation learning and text classification systems. This benchmark requires training on English training data but testing on documents that mix English and non-English words.

FrameNet Browsing Interface

A new, more user-friendly browsing interface for the FrameNet lexical semantic resource, which describes the semantic roles of sentences and words.

Swedish Blingbring Thesaurus

A Swedish thesaurus based on Sven Casper Bring's Svenskt ordförråd ordnat i begreppsklasser, reorganized and modernized.

Multimodality

PubMedCLIP

A vision-language model for the medical domain led by Sedigheh Eslami. Also available on HuggingFace.

MaskAnyone

Tool for anonymizing video while retaining human pose and gesture information, to enable sharing of data for multimodal analysis.

Flash-MNIST

Challenge dataset for video classification.

CITE

Dataset providing discourse relationships between images and text.

Knowledge and Data Resources

Knowledge Graphs book

Our book on knowledge graphs teaches both fundamental principles and current trends. It is available for purchase and can also be browsed freely online.

FrameBase

FrameBase uses frame semantics, a theory of natural language semantics, to represent knowledge about the world in a consistent way.

WebChild / Knowlywood

Large amounts of common-sense knowledge extracted from the Web.

YAGO-SUMO Integration

An ontology providing an enormous body of axiomatized world knowledge based on YAGO as well as the Suggested Upper Merged Ontology (SUMO). YAGO was used in IBM's famous Jeopardy!-winning system Watson.

Entity Type Description Generator

Source code for a deep learning-based system that generates natural language descriptions of entities, along with corresponding benchmark datasets (based on Wikidata).

Inductive Knowledge Graph Completion Datasets

Benchmark datasets for knowledge graph completion in an inductive setting (previously unseen entities in test set): WN18RR-Inductive, FB15k-237-Inductive, NELL-995-Inductive.

Visualizing and Curating Knowledge Graphs over Time and Space