Fast, DB-backed pretrained word embeddings for natural language processing.

Last update: Nov 21, 2022

Overview

Embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file into memory to query for embeddings, embeddings is backed by a database, which makes it fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
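
The timings above reflect a design in which each lookup is a keyed database query rather than a scan of a large text file. The following is a minimal sketch of that idea using only the standard library. The table layout, the float32 packing, and the TinySqliteEmbedding class are assumptions made for illustration; they are not taken from this package's actual schema or code.

```python
import sqlite3
import struct


def pack(vec):
    # Serialize a list of floats as little-endian float32 bytes.
    return struct.pack('<%df' % len(vec), *vec)


def unpack(blob):
    # Inverse of pack(): 4 bytes per float32 value.
    return list(struct.unpack('<%df' % (len(blob) // 4), blob))


class TinySqliteEmbedding:
    """Hypothetical word -> vector store backed by SQLite."""

    def __init__(self, path=':memory:'):
        # Opening the connection is cheap: no vectors are read until queried.
        self.db = sqlite3.connect(path)
        self.db.execute(
            'CREATE TABLE IF NOT EXISTS embeddings '
            '(word TEXT PRIMARY KEY, emb BLOB)'
        )

    def insert(self, word, vec):
        self.db.execute(
            'INSERT OR REPLACE INTO embeddings VALUES (?, ?)',
            (word, pack(vec)),
        )
        self.db.commit()

    def emb(self, word, default=None):
        # The primary key index makes this a single keyed lookup.
        row = self.db.execute(
            'SELECT emb FROM embeddings WHERE word = ?', (word,)
        ).fetchone()
        return unpack(row[0]) if row else default


store = TinySqliteEmbedding()
store.insert('canada', [0.5, -1.25, 2.0])  # toy values, exact in float32
print(store.emb('canada'))    # [0.5, -1.25, 2.0]
print(store.emb('atlantis'))  # None
```

Because only the requested row is read, the cost of constructing the object stays small regardless of vocabulary size, which is consistent with the millisecond load times shown above.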

Installation

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage

On first use, the embeddings are downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups are queried directly against the database. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

# The first construction of each embedding triggers the download described above.
g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
# Combine the three embeddings above into a single embedding.
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
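
If the default location is on NFS, the storage directory can be redirected through $EMBEDDINGS_ROOT. The sketch below mirrors the lookup rule stated above (the environment variable if set, otherwise ~/.embeddings). The resolve_embeddings_root helper and the temporary directory are hypothetical and exist only for illustration; setting the variable before importing the package is a precaution, since this README does not say when the package reads it.

```python
import os
import tempfile


def resolve_embeddings_root(env=None):
    """Hypothetical helper: the directory the documented rule would select."""
    env = os.environ if env is None else env
    default_root = os.path.join(os.path.expanduser('~'), '.embeddings')
    return env.get('EMBEDDINGS_ROOT', default_root)


# Choose a directory on local disk (an assumed location, adjust as needed).
local_root = os.path.join(tempfile.gettempdir(), 'embeddings')
os.makedirs(local_root, exist_ok=True)

# Set the variable before the package is imported and used.
os.environ['EMBEDDINGS_ROOT'] = local_root

print(resolve_embeddings_root())
```

The same effect can be had by exporting EMBEDDINGS_ROOT in the shell that launches the Python process.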

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character ngram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

Owner

Victor Zhong
