Lbl2Vec learns jointly embedded label, document and word vectors to retrieve documents with predefined topics from an unlabeled document corpus.

Last update: Dec 20, 2022

Overview

Lbl2Vec

Lbl2Vec is an algorithm for unsupervised document classification and unsupervised document retrieval. It automatically generates jointly embedded label, document and word vectors and returns documents of topics modeled by manually predefined keywords. Once you train the Lbl2Vec model you can:

Classify documents as related to one of the predefined topics.
Get similarity scores between each document and each predefined topic.
Get the most similar predefined topic for each document.

See the paper for more details on how it works.

A corresponding Medium post describes the use of Lbl2Vec for unsupervised text classification.

Benefits

No need to label the whole document dataset for classification.
No stop word lists required.
No need for stemming/lemmatization.
Works on short text.
Creates jointly embedded label, document, and word vectors.

How does it work?

The key idea of the algorithm is that many semantically similar keywords can represent a topic. In the first step, the algorithm creates a joint embedding of document and word vectors. Once documents and words are embedded in a vector space, the goal of the algorithm is to learn label vectors from previously manually defined keywords representing a topic. Finally, the algorithm can predict the affiliation of documents to topics from document vector <-> label vector similarities.

The Algorithm

0. Use the manually defined keywords for each topic of interest.

Domain knowledge is needed to define keywords that describe topics and are semantically similar to each other within the topics.

Basketball	Soccer	Baseball
NBA	FIFA	MLB
Basketball	Soccer	Baseball
LeBron	Messi	Ruth
...	...	...
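As a rough illustration, the keyword table above could be written as the list of lists that the keywords_list parameter expects. The variable names and the lowercasing step are assumptions made for this sketch (the keywords probably need the same normalization as the corpus tokens), and the elided table rows are left out.

```python
# Hypothetical names, chosen to match the usage examples further down.
# One inner list per topic; each inner list holds that topic's keywords.
topic_names = ["Basketball", "Soccer", "Baseball"]
descriptive_keywords = [
    ["NBA", "Basketball", "LeBron"],  # Basketball
    ["FIFA", "Soccer", "Messi"],      # Soccer
    ["MLB", "Baseball", "Ruth"],      # Baseball
]

# Assumption: the corpus is lowercased during tokenization, so the keywords
# are lowercased as well to match the learned vocabulary.
descriptive_keywords = [[kw.lower() for kw in kws] for kws in descriptive_keywords]
```

The order of the inner lists determines which label each keyword group belongs to, so it helps to keep a parallel list of topic names as shown.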

1. Create jointly embedded document and word vectors using Doc2Vec.

Documents will be placed close to other similar documents and close to the most distinguishing words.

2. Find document vectors that are similar to the keyword vectors of each topic.

Each color represents a different topic described by the respective keywords.

3. Clean outlier document vectors for each topic.

Red documents are outlier vectors that are removed and do not get used for calculating the label vector.

4. Compute the centroid of the outlier-cleaned document vectors as the label vector for each topic.

Points represent the label vectors of the respective topics.

5. Compute label vector <-> document vector similarities for each label vector and document vector in the dataset.

Each document is classified as the topic with the highest label vector <-> document vector similarity.
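Steps 2 to 5 can be sketched in plain Python on toy data. This is only an illustration under stated assumptions, not the library's implementation: the 2-D vectors are made up and stand in for Doc2Vec embeddings, the thresholds are arbitrary, and the outlier cleaning is replaced by a simple "similarity to the candidate centroid" cutoff.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up 2-D embeddings standing in for the output of step 1.
doc_vectors = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.3],
    "doc_c": [0.1, 0.9],
    "doc_d": [0.2, 0.8],
    "doc_e": [0.6, 0.6],
}
keyword_vectors = {
    "basketball": [[1.0, 0.0], [0.9, 0.2]],
    "soccer": [[0.0, 1.0], [0.2, 0.9]],
}

def learn_label_vector(kw_vecs, docs, similarity_threshold=0.75, outlier_threshold=0.94):
    # Step 2: keep documents that are similar to the averaged keyword vectors.
    query = centroid(kw_vecs)
    candidates = [v for v in docs.values() if cosine(query, v) >= similarity_threshold]
    # Step 3 (simplified stand-in): drop candidates far from the candidate centroid.
    center = centroid(candidates)
    cleaned = [v for v in candidates if cosine(center, v) >= outlier_threshold]
    # Step 4: the label vector is the centroid of the remaining documents.
    return centroid(cleaned)

label_vectors = {
    topic: learn_label_vector(vecs, doc_vectors)
    for topic, vecs in keyword_vectors.items()
}

# Step 5: assign each document the topic with the most similar label vector.
predictions = {
    key: max(label_vectors, key=lambda topic: cosine(label_vectors[topic], vec))
    for key, vec in doc_vectors.items()
}
```

In this toy setup doc_e is similar enough to both keyword groups to become a candidate for each topic, but it is dropped in the cleaning step, so it does not pull either label vector toward the middle.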

Installation

pip install lbl2vec

Usage

For detailed information visit the Lbl2Vec API Guide and the examples.

from lbl2vec import Lbl2Vec

Learn new model from scratch

Learns word vectors, document vectors and label vectors from scratch during Lbl2Vec model training.

# init model
model = Lbl2Vec(keywords_list=descriptive_keywords, tagged_documents=tagged_docs)
# train model
model.fit()

Important parameters:

keywords_list: iterable list of lists with descriptive keywords of type str. For each label at least one descriptive keyword has to be added as list of str.
tagged_documents: iterable list of gensim.models.doc2vec.TaggedDocument elements. The input corpus can simply be a list of elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network. If you wish to train a new Doc2Vec model, this parameter cannot be None and the doc2vec_model parameter must be None. If you use a pretrained Doc2Vec model, this parameter has to be None.
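The snippets in this section use a tagged_docs variable without showing how it is built. A minimal sketch of one way to create it from raw strings follows. The sample sentences, the regex tokenizer and the choice of string indices as tags are assumptions for illustration. The namedtuple fallback mirrors the two fields of gensim's TaggedDocument (words, tags) so that the sketch also runs where gensim is not installed; for real training the gensim class should be used.

```python
import re
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, only so this sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Made-up example documents.
raw_texts = [
    "The NBA game went to overtime.",
    "The soccer match ended in a draw.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simple assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Each document gets a unique tag; here the position in the list is used as a string key.
tagged_docs = [
    TaggedDocument(words=tokenize(text), tags=[str(i)])
    for i, text in enumerate(raw_texts)
]
```

The tags act as document keys, so they should be unique per document if individual documents need to be looked up later.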

Use word and document vectors from pretrained Doc2Vec model

Uses word vectors and document vectors from a pretrained Doc2Vec model to learn label vectors during Lbl2Vec model training.

# init model
model = Lbl2Vec(keywords_list=descriptive_keywords, doc2vec_model=pretrained_d2v_model)
# train model
model.fit()

Important parameters:

keywords_list: iterable list of lists with descriptive keywords of type str. For each label at least one descriptive keyword has to be added as list of str.
doc2vec_model: pretrained gensim.models.doc2vec.Doc2Vec model. If given, Lbl2Vec uses the word and document vectors of this model instead of training a new Doc2Vec model. If this parameter is defined, the tagged_documents parameter has to be None. For optimal Lbl2Vec results, the given Doc2Vec model should be trained with the parameters "dbow_words=1" and "dm=0".

Predict label similarities for documents used for training

Computes the similarity scores for each document vector stored in the model to each of the label vectors.

# get similarity scores from trained model
model.predict_model_docs()

Important parameters:

doc_keys: list of document keys (optional). If None: return the similarity scores for all documents that are used to train the Lbl2Vec model. Else: only return the similarity scores of training documents with the given keys.

Predict label similarities for new documents that are not used for training

Computes the similarity scores for each given and previously unknown document vector to each of the label vectors from the model.

# get similarity scores for each new document from trained model
model.predict_new_docs(tagged_docs=tagged_docs)

Important parameters:

tagged_docs: iterable list of gensim.models.doc2vec.TaggedDocument elements

Save model to disk

model.save('model_name')

Load model from disk

model = Lbl2Vec.load('model_name')

Citing Lbl2Vec

When citing Lbl2Vec in academic papers and theses, please use this BibTeX entry:

@conference{webist21,
author={Tim Schopf and Daniel Braun and Florian Matthes},
title={Lbl2Vec: An Embedding-based Approach for Unsupervised Document Retrieval on Predefined Topics},
booktitle={Proceedings of the 17th International Conference on Web Information Systems and Technologies - WEBIST},
year={2021},
pages={124-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010710300003058},
isbn={978-989-758-536-4},
issn={2184-3252},
}


Comments

ValueError: cannot compute similarity with no input

Hi Team,

I am getting following error while running model fit:

2022-04-08 14:19:04,344 - Lbl2Vec - INFO - Train document and word embeddings
2022-04-08 14:19:09,992 - Lbl2Vec - INFO - Train label embeddings

ValueError Traceback (most recent call last) in

~/SageMaker/lbl2vec/lbl2vec.py in fit(self)
    248 # get doc keys and similarity scores of documents that are similar to
    249 # the description keywords
--> 250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
    251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
    252

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4211 else:
   4212 values = self.astype(object)._values
-> 4213 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4214
   4215 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

~/SageMaker/lbl2vec/lbl2vec.py in (row)
    249 # the description keywords
    250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
--> 251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
    252
    253 # validate that documents to calculate label embeddings from are found

~/SageMaker/lbl2vec/lbl2vec.py in _get_similar_documents(self, doc2vec_model, keywords, num_docs, similarity_threshold, min_num_docs)
    625 for word in cleaned_keywords_list]
    626 similar_docs = doc2vec_model.dv.most_similar(
--> 627 positive=keywordword_vectors, topn=num_docs)
    628 except KeyError as error:
    629 error.args = (

~/anaconda3/envs/python3/lib/python3.6/site-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, restrict_vocab, indexer)
    775 all_keys.add(self.get_index(key))
    776 if not mean:
--> 777 raise ValueError("cannot compute similarity with no input")
    778 mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    779

ValueError: cannot compute similarity with no input

help wanted

opened by TechyNilesh 3

pip install doesn't work

Hello I'm trying to install the package but I get an error.

pip install lbl2vec

Collecting lbl2vec
ERROR: Could not find a version that satisfies the requirement lbl2vec (from versions: none)
ERROR: No matching distribution found for lbl2vec

I searched a bit on google and couldn't find a solution.

Python 3.7.4, pip 19.2.3

help wanted

opened by veiro 2

Is paragraph classification possible?

Hello and thanks for sharing this. A question: can Lbl2Vec perform well when the "documents" are paragraph-sized? For example 3-5 sentences? Would we need to change Doc2Vec that Lbl2Vec currently uses into Sent2Vec or some other equivalent? Your thoughts?

Thanks!

opened by stelmath 0

Releases (v1.0.2)

v1.0.2 (Dec 29, 2022)

Add Lbl2TransformerVec and change multiprocessing module from swifter to ray.
Source code (tar.gz)
Source code (zip)
lbl2vec-1.0.2-py3-none-any.whl (24.16 KB)
lbl2vec-1.0.2.tar.gz (24.60 KB)

v1.0.1 (Jul 20, 2021)

Updated the long_description in setup.py to create a project description from the README.md on PyPI.
Source code (tar.gz)
Source code (zip)
lbl2vec-1.0.1-py3-none-any.whl (12.50 KB)
lbl2vec-1.0.1.tar (100.00 KB)

v1.0 (Jul 20, 2021)

Initial Lbl2Vec release version
Source code (tar.gz)
Source code (zip)
lbl2vec-1.0-py3-none-any.whl (10.44 KB)
lbl2vec-1.0.tar (80.00 KB)

Owner

sebis - TUM - Germany

Official account of sebis chair

GitHub Repository https://wwwmatthes.in.tum.de
