scikit-learn wrappers for Python fastText.

Last update: Sep 09, 2022

Overview

skift

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

Contents

1 Installation
2 Configuration
3 Features
4 Wrappers
- 4.1 Standard wrappers
- 4.2 pandas-dependent wrappers
5 Contributing
6 Credits

1 Installation

Dependencies:

- numpy
- scipy
- scikit-learn
- The fasttext Python package

pip install skift

2 Configuration

Because fasttext reads input data from files, skift has to dump the input data into temporary files for fasttext to use. A dedicated folder is created for those files on the filesystem. By default, this folder is created in the system temporary storage location (e.g. /tmp on *nix systems). To override this default location, set the SKIFT_TEMP_DIR environment variable:

export SKIFT_TEMP_DIR=/path/to/desired/temp/folder

NOTE: The directory will be created if it does not already exist.
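The lookup described above can be sketched roughly as follows. This is a minimal illustration, not skift's actual code: the function name resolve_temp_dir and the fallback subfolder name "skift" are assumptions; only the SKIFT_TEMP_DIR variable comes from the text.

```python
import os
import tempfile


def resolve_temp_dir(env=None):
    """Return the folder to use for temporary files, creating it if needed."""
    env = os.environ if env is None else env
    # SKIFT_TEMP_DIR is the variable named in the text; the fallback subfolder
    # name "skift" is an assumption made for this sketch.
    path = env.get("SKIFT_TEMP_DIR") or os.path.join(tempfile.gettempdir(), "skift")
    # Mirrors the note above: the directory is created if it does not exist.
    os.makedirs(path, exist_ok=True)
    return path
```

Passing a plain dict as env makes the behaviour easy to check without touching the real environment.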

3 Features

- Adheres to the scikit-learn classifier API, including predict_proba.
- Also caters to the common use case of pandas.DataFrame inputs.
- Enables easy stacking of fastText with other types of scikit-learn-compliant classifiers.
- Pickle-able classifier objects.
- Built around the official fasttext Python package.
- Pure Python.
- Supports Python 3.5+.
- Fully tested on Linux, OSX and Windows operating systems.

4 Wrappers

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

skift includes several scikit-learn-compatible wrappers (for the official fastText Python package) which cater to these use cases.

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fasttext.train_supervised function on every call to fit.
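The forwarding behaviour in the notice can be pictured with a small stand-in. Everything here is hypothetical: KwargsForwardingClassifier and fake_train are illustration-only names, and the trainer is injected so the sketch runs without fasttext installed.

```python
class KwargsForwardingClassifier:
    """Hypothetical stand-in for a wrapper; not skift's actual class."""

    def __init__(self, train_fn, **kwargs):
        self.train_fn = train_fn  # injected so the sketch runs without fasttext
        self.kwargs = kwargs      # e.g. lr=0.3, epoch=10
        self.model = None

    def fit(self, train_path):
        # Every call to fit forwards the stored keyword arguments to the trainer.
        self.model = self.train_fn(input=train_path, **self.kwargs)
        return self


def fake_train(**params):
    """Stub trainer that simply echoes the parameters it received."""
    return params


clf = KwargsForwardingClassifier(fake_train, lr=0.3, epoch=10).fit("train.txt")
print(clf.model)  # prints {'input': 'train.txt', 'lr': 0.3, 'epoch': 10}
```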

4.1 Standard wrappers

These wrappers make no assumptions about the input beyond those commonly made by scikit-learn classifiers, i.e. that the input is a 2d ndarray object and such.

FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.

>>> import pandas
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> # Rows passed to predict need the text at index 1, matching input_ix.
>>> sk_clf.predict([[5, 'woof']])
[0]

4.2 pandas-dependent wrappers

These wrappers assume the X parameter given to fit, predict, and predict_proba methods is a pandas.DataFrame object:

FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.

>>> import pandas
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> # X must be a DataFrame for the pandas-dependent wrappers.
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]

ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.

>>> import pandas
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> # X must be a DataFrame that contains the 'txt' column.
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]

5 Contributing

The package author and current maintainer is Shay Palachy ([email protected]). You are more than welcome to approach him for help. Contributions are very welcome.

5.1 Installing for development

Clone:

git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

cd skift
pip install -e '.[test]'

To also install fasttext, see instructions in the Installation section.

5.2 Running the tests

To run the tests use:

cd skift
pytest

5.3 Adding documentation

The project is documented using the numpy docstring conventions, chosen because they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and produce human-readable docstrings. When documenting code you add to this project, follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate that it compiles.

6 Credits

Created by Shay Palachy ([email protected]).

Fixes: uniaz, crouffer, amirzamli and sgt.

Comments

Fix temp dir permission docker error

Remove dependence on the user home directory for temporary storage. User directories ("~/") are not always created for Unix service accounts.

Create the temporary directory using tempfile.mkdtemp()

Store the directory path in a singleton-like structure accessed via a function call

This fixes issue https://github.com/shaypal5/skift/issues/6 by creating the tempdir in an OS/environment agnostic way, and does not rely on the users' home directory being writeable.
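The approach described in this pull request might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the names temp_dir and temp_file_path and the "skift_" prefix are hypothetical, and functools.lru_cache stands in for the "singleton-like structure".

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=None)
def temp_dir():
    """Create one temporary directory on first call and reuse it afterwards."""
    # mkdtemp picks a writable, OS-appropriate location, so nothing depends on
    # the user's home directory. The "skift_" prefix is an assumption.
    return tempfile.mkdtemp(prefix="skift_")


def temp_file_path(name="trainset.txt"):
    """Hypothetical helper: path of a file inside the shared directory."""
    return os.path.join(temp_dir(), name)
```

Because the directory is only created on the first call, importing the module has no filesystem side effects.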
opened by crouffer 12

Installing fasttext with skift doesn't work

Tried running this from the README:

pip install skift[fasttext] --process-dependency-links

Got this error:

Collecting fasttext==0.1.0+git.3b5fd29; extra == "fasttext" (from skift[fasttext])
  Could not find a version that satisfies the requirement fasttext==0.1.0+git.3b5fd29; extra == "fasttext" (from skift[fasttext]) (from versions: 0.2.0, 0.2.1, 0.3.0, 0.3.1, 0.4.0, 0.5.0, 0.5.1, 0.5.12, 0.5.13, 0.5.14, 0.5.15, 0.5.16, 0.5.17, 0.5.18, 0.5.19, 0.6.0, 0.6.1, 0.6.2, 0.6.4, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4, 0.7.5, 0.7.6, 0.8.0, 0.8.1, 0.8.2, 0.8.3)
 No matching distribution found for fasttext==0.1.0+git.3b5fd29; extra == "fasttext" (from skift[fasttext])

Tried with Python 3.6.4 in and out of a virtualenv. Seems skift expects to find a version of fasttext that's not available in pypi?

bug

opened by polm 10

error returned during training due to wrong default encoder on Windows 10

Hello!

I am trying to train a supervised text classification model on some text that contains also non-alphanumeric characters

from skift import FirstColFtClassifier
sk_clf = FirstColFtClassifier(lr=0.25, dim=100, epoch=100, minCount=5, 
                              minn=3, maxn=6, wordNgrams=3, loss='softmax')
sk_clf.fit(X_train, y_train)

As soon as the first non alphanumeric character occurs during training I get the following error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-05c208efc7be> in <module>()
      4                               minn=3, maxn=6, wordNgrams=3, loss='softmax')
      5 # Train fastText classifier
----> 6 sk_clf.fit(X_train, y_train)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\skift\core.py in fit(self, X, y)
    117         temp_trainset_fpath = temp_dataset_fpath()
    118         input_col = self._input_col(X)
--> 119         dump_xy_to_fasttext_format(input_col, y, temp_trainset_fpath)
    120         # train
    121         self.model = train_supervised(

~\AppData\Local\Continuum\anaconda3\lib\site-packages\skift\util.py in dump_xy_to_fasttext_format(X, y, filepath)
     68     with open(filepath, 'w+') as wfile:
     69         for text, label in zip(X, y):
---> 70             wfile.write('__label__{} {}\n'.format(label, text))
     71 
     72 

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 493: character maps to <undefined>

As the error clearly shows, this is due to the fact that cp1252.py is the default encoder used by skift. Even though I am on a Windows OS, I am using Python 3.7 installed with Anaconda 5.3.0, and the standard encoding as far as I know should be UTF-8. (I have already verified that, by simply renaming the utf_8.py encoder as cp1252.py, the model training completes without any error. This is a dirty hack I would like to avoid though, because I plan to operationalize the model in production on Azure ML Studio).

Is there a way to enforce skift to use as default the utf_8.py encoder?

Any help appreciated!

Kind regards

bug good first issue

opened by 86mm86 9

Adding model tuning.

The fasttext CLI can do parameter tuning and model quantization:

fasttext supervised -input model_train.train -output model_tune -autotune-validation model_train.valid -autotune-modelsize 100M -autotune-duration 1200 -loss one-vs-all
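For comparison, the flags in this command correspond to keyword arguments of fasttext.train_supervised in the official Python bindings. The sketch below calls fasttext directly rather than skift, and the helper name autotune_params is hypothetical. Training only runs if fasttext is installed and the training file exists.

```python
import os


def autotune_params(validation_path, duration_s=1200, model_size="100M"):
    """Map the CLI autotune flags to fasttext Python keyword arguments."""
    return {
        "autotuneValidationFile": validation_path,  # -autotune-validation
        "autotuneDuration": duration_s,             # -autotune-duration (seconds)
        "autotuneModelSize": model_size,            # -autotune-modelsize
        "loss": "ova",                              # -loss one-vs-all
    }


params = autotune_params("model_train.valid")

try:
    import fasttext  # optional: only needed to actually train
except ImportError:
    fasttext = None

if fasttext is not None and os.path.exists("model_train.train"):
    model = fasttext.train_supervised(input="model_train.train", **params)
```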

Do you plan to implement it in your package at some point? I can make a PR with a piece of code that does the job.

enhancement help wanted good first issue

opened by robinicole 7

WIP: core: support autotune

Hi, added support for auto-tuning. Please LMK if you support this direction, and I'll add documentation and more tests to make it a mergeable PR.

Signed-off-by: Dimid Duchovny [email protected]

opened by dimidd 4

Return ndarrays instead of lists while predicting

The functions predict and predict_proba return lists instead of numpy arrays, which makes them unusable with classifiers like sklearn.multiclass.OneVsRestClassifier. GridSearch and other similar functionality also don't work.

This is a quick fix.

bug good first issue

opened by uniaz 4

Support for string labels

skift seems to expect integer labels and will fail when using string labels.

For instance, when running

from skift import FirstColFtClassifier
import pandas as pd
df = pd.DataFrame(
    data=[
        ['woof', 'a'],
        ['meow', 'b'],
        ['squick', 'c'],
    ],
    columns=['txt', 'lbl'],
)
sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
sk_clf.fit(df[['txt']], df['lbl'])
sk_clf.predict([['squick']])

I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-52a73258e761> in <module>
----> 1 sk_clf.predict([['squick']])

/usr/local/Caskroom/miniconda/base/envs/base/lib/python3.7/site-packages/skift/core.py in predict(self, X)
    165         return np.array([
    166             self._clean_label(res[0][0])
--> 167             for res in self._predict(X)
    168         ], dtype=np.float_)
    169 

/usr/local/Caskroom/miniconda/base/envs/base/lib/python3.7/site-packages/skift/core.py in <listcomp>(.0)
    165         return np.array([
    166             self._clean_label(res[0][0])
--> 167             for res in self._predict(X)
    168         ], dtype=np.float_)
    169 

/usr/local/Caskroom/miniconda/base/envs/base/lib/python3.7/site-packages/skift/core.py in _clean_label(ft_label)
    135     @staticmethod
    136     def _clean_label(ft_label):
--> 137         return int(ft_label[9:])
    138 
    139     def _predict_on_str_arr(self, str_arr, k=1):

ValueError: invalid literal for int() with base 10: 'c'

This is a bit unexpected since neither sklearn nor fasttext require integer labels.

I guess skift could handle that either by:

- passing the string labels directly to fasttext (caveat: might require some cleaning)
- automatically calling LabelEncoder (e.g. as in sklearn's code for LR)
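Until something like the second option exists, the same idea can be applied outside the classifier. The sketch below uses a small stdlib-only stand-in for sklearn.preprocessing.LabelEncoder; the class name SimpleLabelEncoder is hypothetical, and how the integer labels are then passed to skift is left to the caller.

```python
class SimpleLabelEncoder:
    """Minimal stand-in for sklearn.preprocessing.LabelEncoder (stdlib only)."""

    def fit(self, labels):
        # Sorted unique labels get consecutive integer codes.
        self.classes_ = sorted(set(labels))
        self._to_int = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._to_int[label] for label in labels]

    def inverse_transform(self, codes):
        # Accepts ints or floats, since predictions may come back as floats.
        return [self.classes_[int(code)] for code in codes]


encoder = SimpleLabelEncoder().fit(["a", "b", "c"])
y_int = encoder.transform(["a", "b", "c"])       # integer labels for fit
decoded = encoder.inverse_transform([2.0, 0.0])  # predictions back to strings
print(y_int, decoded)  # prints [0, 1, 2] ['c', 'a']
```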

enhancement help wanted good first issue

opened by michelole 3

utf-8 encoding for xy input file

fastText assumes UTF-8 encoded text (see fastText Python README).

Without the encoding flag, the xy input file is written using the system's locale, which is problematic, especially on Windows. Attempting to train a model with text which uses utf-8 symbols results in an exception.

Passing the flag to open when writing the input file solves this issue.
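A minimal sketch of that fix, modeled on the dump_xy_to_fasttext_format lines visible in the traceback of the Windows encoding issue. The exact body in skift may differ; the sample data and the read-back check are illustration only.

```python
import os
import tempfile


def dump_xy_to_fasttext_format(X, y, filepath):
    """Write one '__label__<label> <text>' line per sample, always as UTF-8."""
    # encoding="utf-8" avoids falling back to the locale codec (cp1252 in the
    # reported traceback), so non-ASCII text no longer raises on Windows.
    with open(filepath, "w", encoding="utf-8") as wfile:
        for text, label in zip(X, y):
            wfile.write("__label__{} {}\n".format(label, text))


path = os.path.join(tempfile.mkdtemp(), "train.txt")
dump_xy_to_fasttext_format(["\u010daj woof", "meow"], [0, 1], path)
with open(path, encoding="utf-8") as rfile:
    lines = rfile.read().splitlines()
```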

opened by sgt 3

1D array input for training

Hi,

I'm very sorry for asking such a basic question but can't work this one out! Usually, I see other text classifiers taking one of three forms;

(1D) List of strings, if it performs tokenisation and vectorisation itself

(2D) List of tokens if it performs vectorisation itself

(2D) List of vectors if it is just a classifier

I'm a little confused as the readme does not have a case where multiple tokens are inputted into the model. However, in the tests it appears that it is trained on a pd.DataFrame for X and a pd.Series for y. I believe fasttext does the tokenisation and vectorisation itself, so why do we need a two dimensional input instead of a 1D list of strings? Is there benefit to doing it that way over something like this;

FtClassifier().fit( ['Input 1', 'Input 2'], [1, 0] )

or the equivalent but with 1D numpy arrays?
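For context, scikit-learn estimators conventionally take X with shape (n_samples, n_features), which is why the README examples pass rows such as [['woof']]. A 1D list of strings can be wrapped into that shape with a one-line helper; the name as_single_column is hypothetical and this is only a sketch of the conversion, not part of skift.

```python
def as_single_column(texts):
    """Turn a 1D sequence of strings into a 2D, one-column list of rows."""
    return [[text] for text in texts]


X = as_single_column(["Input 1", "Input 2"])
y = [1, 0]
print(X)  # prints [['Input 1'], ['Input 2']]
```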

Many thanks! Dom

question

opened by DomHudson 3

os.makedirs(TEMP_DIR, exist_ok=True) causes PermissionError in docker container

Running skift in a docker container results in permission errors when trying to load previously generated models.

File "/usr/local/lib/python3.5/dist-packages/skift/util.py", line 10, in <module>
PermissionError: [Errno 13] Permission denied: '/root/.temp'

The problem is the docker container is running as user 'root', but the /root/ folder is not writable.

I have a fix, and will open a pull request shortly.

bug

opened by crouffer 2

hyperparameter tuning

How can we tune parameters? https://fasttext.cc/docs/en/autotune.html uses autotuneValidationFile to feed a validation set to the model. How can we set this parameter?

question

opened by Alihjt 1

Add multi-label support

Add support for providing multi-label labels in a scikit-learn-compliant format, utilizing (under the hood) fasttext's support for multi-label scenarios.

enhancement help wanted

opened by shaypal5 4
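As background for this request: fastText's supervised input format allows several __label__ prefixes on one line, which is what a multi-label wrapper would need to emit. The helper below is a hypothetical sketch of that formatting step only, not skift code.

```python
def to_fasttext_line(text, labels):
    """Format one sample with several labels in fastText's supervised format."""
    prefix = " ".join("__label__{}".format(label) for label in labels)
    return "{} {}".format(prefix, text)


line = to_fasttext_line("woof meow", ["dog", "cat"])
print(line)  # prints __label__dog __label__cat woof meow
```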

Releases (v0.0.23)

- v0.0.23 (Feb 14, 2022)
- v0.0.22 (Jan 20, 2022)
- v0.0.21 (Dec 13, 2021)

Owner

Shay Palachy

Interested in doing data science and developing open source tools in Python.
