Pipeline for training LSA models using Scikit-Learn.

Last update: Sep 05, 2022

Overview

Latent Semantic Analysis

Pipeline for training LSA models using Scikit-Learn.

Usage

Instead of writing custom code for latent semantic analysis, you just need to:

Install the pipeline:

pip install latent-semantic-analysis

Run the pipeline:

either in the terminal:

lsa-train --path_to_config config.yaml

or in Python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: see the Config section below for more about the config file.

No data preparation is needed: the only input is a CSV file with a column of raw text. The column can have any name; set it with text_column in the config.
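As an illustration of the expected input, here is a minimal sketch that writes raw texts to a CSV file with a single text column, using only the standard library. The helper name write_corpus and the sample sentences are hypothetical; the path data/data.csv, the column name text, and the ',' separator simply mirror the default config.

```python
import csv
from pathlib import Path


def write_corpus(path, texts, text_column="text", sep=","):
    """Write raw texts to a CSV file with a header row and one text column.

    Hypothetical helper for illustration; it is not part of the package.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow([text_column])  # header: the name used as text_column
        for text in texts:
            writer.writerow([text])  # csv quotes values containing the separator
    return path


if __name__ == "__main__":
    write_corpus(
        "data/data.csv",
        ["The cat sat on the mat.", "Dogs chase cats, usually."],
    )
```

Texts that contain the separator are quoted automatically by the csv module, so no manual escaping is needed.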

Config

The user interface consists of only one file:

config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Edit config.yaml to create the desired configuration, then train the LSA model with the following command:

terminal:

lsa-train --path_to_config config.yaml

python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: the tf-idf and svd sections are passed as parameters to sklearn's TfidfVectorizer and TruncatedSVD respectively, so you can set any parameter these classes accept.
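To make the mapping concrete, here is a rough sketch of how the tf-idf and svd sections could be turned into an sklearn pipeline. It is not the package's actual code: build_lsa_pipeline is a hypothetical helper, the config is written as a plain dict instead of being loaded from YAML, and converting ngram_range from the string "(1, 1)" with ast.literal_eval is an assumption about how the YAML value is handled. n_components is lowered from 10 to 2 because ARPACK needs fewer components than documents and the toy corpus has only five.

```python
import ast

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Plain-dict mirror of the default config.yaml (n_components lowered to 2).
config = {
    "seed": 42,
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    "svd": {"n_components": 2, "algorithm": "arpack"},
}


def build_lsa_pipeline(config):
    """Hypothetical helper: build a TF-IDF + SVD pipeline from a config dict."""
    tfidf_params = dict(config["tf-idf"])
    # Assumption: a YAML loader reads "(1, 1)" as a string, so parse it to a tuple.
    if isinstance(tfidf_params.get("ngram_range"), str):
        tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])
    return Pipeline(
        [
            ("tf-idf", TfidfVectorizer(**tfidf_params)),
            ("svd", TruncatedSVD(random_state=config["seed"], **config["svd"])),
        ]
    )


texts = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the dog slept on the sofa",
]

pipeline = build_lsa_pipeline(config)
embeddings = pipeline.fit_transform(texts)
print(embeddings.shape)  # (5, 2): one 2-dimensional vector per document
```

Any other keyword accepted by TfidfVectorizer or TruncatedSVD (for example stop_words or n_iter) could be added to the corresponding dict in the same way.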

Output

After training the model, the pipeline saves the following files:

model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
config.yaml - config that was used to train the model
logging.txt - logging file
doc2topic.json - document embeddings
term2topic.json - term embeddings
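For intuition about the two embedding files, the sketch below shows one common way to derive document and term embeddings from a fitted TF-IDF + SVD pipeline. The JSON layout used here (document index or term as key, list of floats as value) is an assumption for illustration; the files written by the package may be structured differently.

```python
import json

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

texts = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("svd", TruncatedSVD(n_components=2, random_state=42)),
    ]
)

# Document embeddings: one row per document, one column per latent topic.
doc_vectors = pipeline.fit_transform(texts)

tfidf = pipeline.named_steps["tfidf"]
svd = pipeline.named_steps["svd"]

# Term embeddings: components_ has shape (n_components, n_terms), so transpose it.
term_vectors = svd.components_.T
# Vocabulary terms ordered by their column index in the TF-IDF matrix.
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

# Assumed layout: key -> list of floats.
doc2topic = {str(i): vec.tolist() for i, vec in enumerate(doc_vectors)}
term2topic = {term: vec.tolist() for term, vec in zip(terms, term_vectors)}

print(json.dumps(doc2topic["0"]))
print(json.dumps(term2topic["markets"]))
```

Documents and terms live in the same latent space, so vectors from the two mappings can be compared directly, for example with cosine similarity.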

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}

Releases

v0.1.0 (Oct 8, 2021)

First Release! 🥳🎉🍾

Owner

Dani El-Ayyass
