Biterm Topic Model (BTM): modeling topics in short texts

Overview

Biterm Topic Model

CircleCI Documentation Status Codacy Badge Issues Downloads PyPI

Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actually, it is a cythonized version of BTM. This package is also capable of computing perplexity and semantic coherence metrics.

Development

Please note that bitermplus is actively improved. Refer to documentation to stay up to date.

Requirements

  • cython
  • numpy
  • pandas
  • scipy
  • scikit-learn
  • tqdm

Setup

Linux and Windows

There should be no issues with installing bitermplus under these OSes. You can install the package directly from PyPi.

pip install bitermplus

Or from this repo:

pip install git+https://github.com/maximtrp/bitermplus.git

Mac OS

First, you need to install XCode CLT and Homebrew. Then, install libomp using brew:

xcode-select --install
brew install libomp
pip3 install bitermplus

Example

Model fitting

import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# PREPROCESSING
# Obtaining terms frequency in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_

Results visualization

You need to install tmplot first.

import tmplot as tmp
tmp.report(model=model, docs=texts)

Report interface

Tutorial

There is a tutorial in documentation that covers the important steps of topic modeling (including stability measures and results visualization).

Comments
  • the topic distribution for all doc is similar

    the topic distribution for all doc is similar

    topic

    [9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07  3.12592152e-07] [9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08  2.43742411e-08] [9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07  1.83996702e-07] [9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07  2.77376339e-07] [9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10  3.94318712e-10] [9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07  3.92884503e-07]

    bug help wanted good first issue 
    opened by JennieGerhardt 11
  • ERROR: Failed building wheel for bitermplus

    ERROR: Failed building wheel for bitermplus

    creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found #include <omp.h> ^~~~~~~ 1 error generated. error: command '/usr/bin/clang' failed with exit code 1 [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for bitermplus Failed to build bitermplus ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    bug documentation 
    opened by QinrenK 9
  • Got an unexpected result in marked sample

    Got an unexpected result in marked sample

    Hi, @maximtrp, I am trying to use bitermplus for topic modeling. However, when i use the marked sample to train the model. i got the unexpeted result. Firstly, the marked samples contain 5 types, but trained model get a huge perlexity when the the number of topic is 5. Secondly, when i test the topic parameter from 1 to 20, the perplexity was reduced following the increase of topic number. my code is following: df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts']) texts = df['texts'].str.strip().tolist() print(df) stop_words = segmentWord.stopwordslist() perplexitys = [] coherences = []

    for T in range(1,21,1): print(T) X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words) # Vectorizing documents docs_vec = btm.get_vectorized_docs(texts, vocabulary) # Generating biterms biterms = btm.get_biterms(docs_vec) # INITIALIZING AND RUNNING MODEL model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01) model.fit(biterms, iterations=2000) p_zd = model.transform(docs_vec) perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T) coherence = model.coherence_ perplexitys.append(perplexity) coherences.append(coherence)

    ``

    opened by Chen-X666 7
  • Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

    Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

    Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code shows the error I mentioned in the title. Seems sth in get_words_freqs function goes wrong. I appreciate if you advise how I can fix that.

    opened by Sajad7010 4
  • Cannot find Closest topics and Stable topics

    Cannot find Closest topics and Stable topics

    Hello there, I am able to generate the model and visualize it. But when I tried to find the closest topics and stable topics, I get the error for code line:

    closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)
    

    The error is:

    IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
    

    This is despite me separately checking the array size and it is 2-D. I am pasting the code below. Pl. can you check if I am doing anything wrong.

    Thank you.

    X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))
    
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)
    
    # Generating biterms
    Y = X.todense()
    biterms = btm.get_biterms(docs_vec, 15)
    
    # INITIALIZING AND RUNNING MODEL
    model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
    model.fit(biterms, iterations=500, verbose=True)
    p_zd = model.transform(docs_vec,verbose=True)  
    print(p_zd) 
    
    # matrix of document-topics; topics vs. documents, topics vs. words probabilities 
    matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
    topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
    matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.
    
    # Getting stable topics
    print("Array Dimension = ",len(matrix_topic_words.shape))
    closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
    stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)
    
    # Stable topics indices list
    print(stable_topics)
    
    help wanted question 
    opened by RashmiBatra 4
  • Questions regarding Perplexity and Model Comparison with C++

    Questions regarding Perplexity and Model Comparison with C++

    I have two questions regarding this mode. First of all, I noticed that the evaluation metric perplexity was implemented. However, traditionally, the perplexity was mostly computed on the held-out dataset. Does that mean that when using this model, we should leave out certain proportion of the data and compute the perplexity on those samples that have not been used for training the model? My second question was that I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which part was implemented differently?

    help wanted question 
    opened by orpheus92 3
  • How do I get the topic words?

    How do I get the topic words?

    Hi,

    Firstly, thanks for sharing your code.

    Not an issue, just a question. I'm able to see the relevant words for a topic in the tmplot report. How do I get those words? I need to get at least the most three relevant terms.

    Thanks in advance.

    question 
    opened by aguinaldoabbj 3
  • failed building wheels

    failed building wheels

    Hi!

    I've got an error when running pip3 install bitermplus on MacOS (intel-based, Ventura), using python 3.10.8 in a separate venv (not anaconda):

    Building wheels for collected packages: bitermplus
      Building wheel for bitermplus (pyproject.toml) ... error
      error: subprocess-exited-with-error
    
      × Building wheel for bitermplus (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [34 lines of output]
          Error in sitecustomize; set PYTHONVERBOSE for traceback:
          AssertionError:
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build/lib.macosx-12-x86_64-cpython-310
          creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          running egg_info
          writing src/bitermplus.egg-info/PKG-INFO
          writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
          writing requirements to src/bitermplus.egg-info/requires.txt
          writing top-level names to src/bitermplus.egg-info/top_level.txt
          reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          adding license file 'LICENSE'
          writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
          copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
          running build_ext
          building 'bitermplus._btm' extension
          creating build/temp.macosx-12-x86_64-cpython-310
          creating build/temp.macosx-12-x86_64-cpython-310/src
          creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
          clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
          src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
          #include <omp.h>
                   ^~~~~~~
          1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for bitermplus
    Failed to build bitermplus
    ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
    

    Could this error be related to #29? I've tested on a PC and it worked though.

    bug documentation 
    opened by alanmaehara 2
  • Failed building wheel for bitermplus

    Failed building wheel for bitermplus

    Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    When I try to install bitermplus with pip install bitermplus there is an error massage like this : note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for bitermplus ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

    bug 
    opened by novra 2
  • Calculation of nmi,ami,ri

    Calculation of nmi,ami,ri

    I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply nmi, ami and ri so I'm wondering how to get the labels from the model. @maximtrp

    opened by gitassia 2
  • Implementation Guide

    Implementation Guide

    I was wondering is there any way to print the the topics generate by the BTM model, just like how I can do it with Gensim. In addition to that, I am getting all negative coherence values in the range of -500 or -600. I am not sure if I am doing something wrong. The issues is, I am not able to interpret the results, even plotting gives some strange output.

    image

    The following image show what is held by the variable adobe, again I am not sure if it needs to be in this manner or each row here needs to a list

    image
    opened by neel6762 2
Releases(v0.6.12)
Owner
Maksim Terpilowski
Research scientist
Maksim Terpilowski
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Accuracy of BBC Weather forecasts for Honolulu This repository records the forecasts made by BBC Weather for the city of Honolulu, USA. Essentially, t

Max Halford 12 Oct 15, 2022
NLP techniques such as named entity recognition, sentiment analysis, topic modeling, text classification with Python to predict sentiment and rating of drug from user reviews.

This file contains the following documents sumbited for Baruch CIS9665 group 9 fall 2021. 1. Dataset: drug_reviews.csv 2. python codes for text classi

Aarif Munwar Jahan 2 Jan 04, 2023
MiCECo - Misskey Custom Emoji Counter

MiCECo Misskey Custom Emoji Counter Introduction This little script counts custo

7 Dec 25, 2022
Code for Editing Factual Knowledge in Language Models

KnowledgeEditor Code for Editing Factual Knowledge in Language Models (https://arxiv.org/abs/2104.08164). @inproceedings{decao2021editing, title={Ed

Nicola De Cao 86 Nov 28, 2022
The Easy-to-use Dialogue Response Selection Toolkit for Researchers

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

GMFTBY 32 Nov 13, 2022
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TweebankNLP This repo contains the new Tweebank-NER dataset and off-the-shelf Twitter-Stanza pipeline for state-of-the-art Tweet NLP, as described in

Laboratory for Social Machines 84 Dec 20, 2022
Build Text Rerankers with Deep Language Models

Reranker is a lightweight, effective and efficient package for training and deploying deep languge model reranker in information retrieval (IR), question answering (QA) and many other natural languag

Luyu Gao 140 Dec 06, 2022
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

deepset 1.6k Dec 27, 2022
The swas programming language

The Swas programming language This is a language that was made for fun. Installation Step 0: Make sure you have python installed Step 1. Clone this re

Swas.py 19 Jul 18, 2022
A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion

List Of English Words A text file containing over 466k English words. While searching for a list of english words (for an auto-complete tutorial) I fo

dwyl 8.5k Jan 03, 2023
[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

UVA Computer Vision 86 Dec 28, 2022
This project consists of data analysis and data visualization (done using python)of all IPL seasons from 2008 to 2019 and answering the most asked questions about the IPL.

IPL-data-analysis This project consists of data analysis and data visualization of all IPL seasons from 2008 to 2019 and answering the most asked ques

Sivateja A T 2 Feb 08, 2022
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

VILLA: Vision-and-Language Adversarial Training This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports

Zhe Gan 109 Dec 31, 2022
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering.

SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering. Contents Inst

0 Oct 21, 2021
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
Simple translation demo showcasing our headliner package.

Headliner Demo This is a demo showcasing our Headliner package. In particular, we trained a simple seq2seq model on an English-German dataset. We didn

Axel Springer News Media & Tech GmbH & Co. KG - Ideas Engineering 16 Nov 24, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022