Pipeline for fast building text classification TF-IDF + LogReg baselines.

Overview

tests linter codecov

python 3.6 release (latest by date) license

pre-commit code style: black

pypi version pypi downloads

Text Classification Baseline

Pipeline for fast building text classification TF-IDF + LogReg baselines.

Usage

Instead of writing custom code for specific text classification task, you just need:

  1. install pipeline:
pip install text-classification-baseline
  1. run pipeline:
  • either in terminal:
text-clf-train
  • or in python:
import text_clf

text_clf.train()

No data preparation is needed, only a csv file with two raw columns (with arbitrary names):

  • text
  • target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.

Config

The user interface consists of only one file config.yaml.

Change config.yaml to create the desired configuration and train text classification model with the following command:

  • terminal:
text-clf-train --path_to_config config.yaml
  • python:
import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: tf-idf and logreg are sklearn TfidfVectorizer and LogisticRegression parameters correspondingly, so you can parameterize instances of these classes however you want.

Output

After training the model, the pipeline will return the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from encoded target labels from 0 to n_classes-1 to it names
  • config.yaml - config that was used to train the model
  • logging.txt - logging file

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTex entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}
You might also like...
Code for EMNLP 2021 main conference paper
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

Glow-Speak glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end. Installation git clone https://g

Pipeline for chemical image-to-text competition
Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

Text-Summarization-using-NLP - Text Summarization using NLP  to fetch BBC News Article and summarize its text and also it includes custom article Summarization A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Text vectorization tool to outperform TFIDF for classification tasks
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

Text vectorization tool to outperform TFIDF for classification tasks
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

Comments
  • release v0.1.4

    release v0.1.4

    • fixed load_20newsgroups.py (#65 #71)
    • added Makefile (#71)
    • added logging confusion matrix (#72)
    • replaced all "valid" occurrences with "test" (#74)
    • updated docstrings (#77)
    • changed python interface - train function returns model and target_names_mapping (#78)
    enhancement 
    opened by dayyass 1
  • release v0.1.6

    release v0.1.6

    fixed token frequency support (add token frequency support #85) fixed threshold selection for binary classification (add threshold selection for binary classification #86)

    bug enhancement 
    opened by dayyass 0
  • release v0.1.5

    release v0.1.5

    • added lemmatization (#66)
    • added token frequency support (#84)
    • added threshold selection for binary classification (#79)
    • added arbitrary save folder name (#80)
    enhancement 
    opened by dayyass 0
  • release v0.1.5

    release v0.1.5

    • added lemmatization (#81)
    • added token frequency support (#85)
    • added threshold selection for binary classification (#86)
    • added arbitrary save folder name (#83)
    enhancement 
    opened by dayyass 0
Releases(v0.1.6)
  • v0.1.6(Nov 6, 2021)

    Release v0.1.6

    • fixed token frequency support (add token frequency support #85)
    • fixed threshold selection for binary classification (add threshold selection for binary classification #86)
    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(Oct 21, 2021)

    Release v0.1.5 🥳🎉🍾

    • added pymorphy2 lemmatization (#81)
    • added token frequency support (#85)
    • added threshold selection for binary classification (#86)
    • added arbitrary save folder name (#83)

    pymorphy2 lemmatization (config.yaml)

    # preprocessing
    # (included in resulting model pipeline, so preserved for inference)
    preprocessing:
      lemmatization: pymorphy2
    

    token frequency support

    • text_clf.token_frequency.get_token_frequency(path_to_config) -
      get token frequency of train dataset according to the config file parameters

    threshold selection for binary classification

    • text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
      get precision and recall metrics for precision-recall curve
    • text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
      get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve
    • text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
      plot precision-recall curve
    • text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
      plot roc curve
    • text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
      plot precision, recall, f1-score curves for probability thresholds

    arbitrary save folder name (config.yaml)

    experiment_name: model
    
    Source code(tar.gz)
    Source code(zip)
  • v0.1.4(Oct 10, 2021)

    • fixed load_20newsgroups.py (#65 #71)
    • added Makefile (#71)
    • added logging confusion matrix (#72)
    • replaced all "valid" occurrences with "test" (#74)
    • updated docstrings (#77)
    • changed python interface - train function returns model and target_names_mapping (#78)
    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Sep 2, 2021)

  • v0.1.2(Aug 19, 2021)

  • v0.1.1(Aug 11, 2021)

  • v0.1.0(Aug 7, 2021)

Owner
Dani El-Ayyass
NLP Tech Lead @ Sber AI, Master Student in Applied Mathematics and Computer Science @ CMC MSU
Dani El-Ayyass
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

1.4k Dec 30, 2022
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
Protein Language Model

ProteinLM We pretrain protein language model based on Megatron-LM framework, and then evaluate the pretrained model results on TAPE (Tasks Assessing P

THUDM 77 Dec 27, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

186 Dec 29, 2022
Checking spelling of form elements

Checking spelling of form elements. You can check the source files of external workflows/reports and configuration files

СКБ Контур (команда 1с) 15 Sep 12, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
Sequence-to-Sequence Framework in PyTorch

nmtpytorch allows training of various end-to-end neural architectures including but not limited to neural machine translation, image captioning and au

LIUM 395 Nov 21, 2022
A python package for deep multilingual punctuation prediction.

This python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022
A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

NEC Laboratories Europe 13 Sep 08, 2022
PRAnCER is a web platform that enables the rapid annotation of medical terms within clinical notes.

PRAnCER (Platform enabling Rapid Annotation for Clinical Entity Recognition) is a web platform that enables the rapid annotation of medical terms within clinical notes. A user can highlight spans of

Sontag Lab 39 Nov 14, 2022
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
Python powered crossword generator with database with 20k+ polish words

crossword_generator Generate simple crossword puzzle from words and definitions fetched from krzyżowki.edu.pl endpoints -/ string:word - returns js

0 Jan 04, 2022
To classify the News into Real/Fake using Features from the Text Content of the article

Hoax-Detector Authenticity of news has now become a major problem. The Idea is to classify the News into Real/Fake using Features from the Text Conten

Aravindhan 1 Feb 09, 2022
Index different CKAN entities in Solr, not just datasets

ckanext-sitesearch Index different CKAN entities in Solr, not just datasets Requirements This extension requires CKAN 2.9 or higher and Python 3 Featu

Open Knowledge Foundation 3 Dec 02, 2022
Speech Recognition Database Management with python

Speech Recognition Database Management The main aim of this project is to recogn

Abhishek Kumar Jha 2 Feb 02, 2022
ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in

241 Jan 04, 2023
Open source annotation tool for machine learning practitioners.

doccano doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequ

7.1k Jan 01, 2023
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31th, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022