Yet Another Sequence Encoder - Encode sequences to a vector of vectors in Python!

Last update: Aug 19, 2021

Overview

Yase

Why Yase?

Yase lets you encode any sequence that can be represented as a string into a list of word vectors.

When I looked for a tool to encode a sentence as a list of word vectors, it was clear that no simple tool existed. So I decided to create Yase.

Note: if you only want the word vector of a single word, or the average of the word vectors in a sentence, you should probably look at spaCy instead.

Requirements

Yase requires:

numpy
tqdm

Mapping file

The mapping should be a column-oriented text file such as:

<token> <vector value>
token1 0.1 0.6 -1.2
token2 0.6 -2.3 3.4

All fields must be separated by a space, so tokens cannot contain spaces. You should be able to use Facebook's fastText pretrained word vectors directly as a mapping.
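As an illustration of the format above, here is a minimal sketch of how such a file could be parsed. The function name `load_mapping` is hypothetical and this is not necessarily how Yase reads the file internally; it uses plain Python lists rather than numpy arrays to stay dependency-free.

```python
import io


def load_mapping(lines):
    """Parse 'token v1 v2 ...' lines into a dict of token -> list of floats.

    Hypothetical helper for illustration; not part of the Yase API.
    """
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        # Ignore empty strings produced by a trailing space.
        mapping[token] = [float(v) for v in values if v]
    return mapping


# The sample rows from the mapping example above.
sample = io.StringIO("token1 0.1 0.6 -1.2\ntoken2 0.6 -2.3 3.4\n")
mapping = load_mapping(sample)
print(mapping["token1"])
```

Because the token is everything before the first space, a token containing a space would be split incorrectly, which is why the format forbids it.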

Input file

The input file should be a list of texts, with one sample per line.

hello world
Yase is awesome !

The default separator is a space " ", but any regular expression can be provided.

Note that Yase is case-insensitive.
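The splitting and lower-casing described above can be sketched as follows. The `tokenize` function is a hypothetical illustration, and dropping empty tokens is an assumption rather than documented Yase behaviour.

```python
import re


def tokenize(line, separator=r" "):
    """Lower-case a line and split it on a regular expression.

    Hypothetical helper for illustration; not part of the Yase API.
    """
    tokens = re.split(separator, line.strip().lower())
    # Assumption: empty strings left by adjacent separators are dropped.
    return [token for token in tokens if token]


print(tokenize("Yase is awesome !"))
# A custom separator that also splits on "." and ",".
print(tokenize("a.b,c", separator=r"\ |\.|\,"))
```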

How to use

Yase is a command-line tool. You can install it with:

pip install git+https://github.com/PPACI/yase.git

>> yase
usage: yase [-h] --input input.txt [--input-encoding UTF8] --output
               output.txt --mapping mapping.vec [--mapping-encoding UTF8]
               [--separator \ |\.|\,] [--no-replace]
               [--cleaning-json cleaning.json]

Yet Another Sequence Encoder

optional arguments:
  -h, --help            show this help message and exit
  --input input.txt     Path to file to encode
  --input-encoding UTF8
                        encoding of input file. UTF8 by default
  --output output.txt   Path to output file
  --mapping mapping.vec
                        Path to mapping file
  --mapping-encoding UTF8
                        encoding of mapping file. UTF8 by default
  --separator \ |\.|\,  regular expression used to split the input sequence
  --no-replace          don't clean input data
  --cleaning-json cleaning.json
                        Path to your own json replacement file for cleaning.
                        Will use the included replacement file otherwise.

To use the English word vectors with an input file like the one described above:

yase --input "input.txt" --output "output.csv" --mapping "wiki.en.vec"

Output format

The idea behind Yase is to make it as easy as possible to integrate into any data science pipeline.

Yase outputs your data as CSV.

The only problem with CSV is that it is difficult to store multi-dimensional arrays in it, so we had to find a compromise.

Yase encodes the vectors column in JSON format, which is easily readable and very similar to Python's list representation.

The output file will look similar to:

inputs	vectors
hello world	[[1,1,1],[2,2,2]]
yase is awesome !	[[3,3,3],[4,4,4]]
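To make the compromise concrete, here is a minimal sketch of writing one JSON-encoded vectors cell per CSV row. The `write_rows` helper is hypothetical, and the comma delimiter is an assumption; Yase's own writer may differ in its details.

```python
import csv
import io
import json


def write_rows(handle, rows):
    """Write (text, vectors) pairs as CSV with the vectors JSON-encoded.

    Hypothetical helper for illustration; not part of the Yase API.
    """
    writer = csv.writer(handle)
    writer.writerow(["inputs", "vectors"])
    for text, vectors in rows:
        # The nested list becomes a single string cell; the csv module
        # quotes it because it contains commas.
        writer.writerow([text, json.dumps(vectors, separators=(",", ":"))])


buffer = io.StringIO()
write_rows(buffer, [("hello world", [[1, 1, 1], [2, 2, 2]])])
print(buffer.getvalue())
```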

Cleaning

Yase will automatically try to clean your input file by applying a series of regular expressions in order.

For example, "Hello I'm yase.Nice to meet you" becomes "Hello I m yase . Nice to meet you".

Remember that Yase is case-insensitive, so it will read this as "hello i m yase . nice to meet you".

Lastly, if your mapping doesn't include an entry for ".", you will obtain vectors for "hello i m yase nice to meet you".

Of course, you can disable this behaviour by passing the --no-replace argument.
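The effect of a token that has no entry in the mapping, as in the "." case above, can be sketched like this. The `encode` function and the toy mapping are hypothetical and only illustrate the skipping behaviour described above.

```python
def encode(tokens, mapping):
    """Return the vectors of known tokens, skipping tokens without a mapping.

    Hypothetical helper for illustration; not part of the Yase API.
    """
    return [mapping[token] for token in tokens if token in mapping]


# Toy mapping with no entry for ".".
mapping = {"hello": [1, 1, 1], "world": [2, 2, 2]}
tokens = ["hello", ".", "world"]
print(encode(tokens, mapping))
```

One consequence is that the number of vectors in a row can be smaller than the number of tokens in the input.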

Providing your own replacement file

You can do this by providing a path to your file with --cleaning-json.

The replacement file is a JSON file such as:

{
  "\"": "",
  "'": "",
  ",": " , ",
  "\\.": " . ",
  "  ": " "
}

Keys are regular expressions, so remember to escape characters such as . or *.

Note that replacements are applied in the same order as in the JSON file. So here, the first replacement removes the " character.
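A minimal sketch of applying such a file in order, using the example JSON above. It assumes each key is applied with `re.sub`, which may differ from Yase's actual implementation, and the `clean` function is hypothetical. Python's `json` module keeps keys in file order, so iterating over the loaded dict preserves the replacement order.

```python
import json
import re

# The example replacement file from above, embedded as a raw string.
RULES_JSON = r'''
{
  "\"": "",
  "'": "",
  ",": " , ",
  "\\.": " . ",
  "  ": " "
}
'''


def clean(text, rules):
    """Apply each regex replacement in the order the rules were declared.

    Hypothetical helper for illustration; not part of the Yase API.
    """
    for pattern, replacement in rules.items():
        text = re.sub(pattern, replacement, text)
    return text


rules = json.loads(RULES_JSON)  # dict keys keep the order of the JSON file
print(clean('He said "hi",then left.', rules))
```

Order matters here: the double-space rule comes last so that it can collapse any extra spaces introduced by the earlier rules.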

How to load a Yase output?

As mentioned previously, the choices made in Yase make loading the output as simple as:

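import json

import pandas as pd

# Read the CSV written by yase, then decode the JSON "vectors" column
# into nested Python lists.
df = pd.read_csv("output.csv")
df["vectors"] = df["vectors"].apply(json.loads)

df.head()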

Note that pandas is not mandatory, but it is highly recommended for data science.
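If pandas is not available, the same output can be read with the standard library alone. This is a sketch that assumes a comma-separated file with the "inputs" and "vectors" header columns shown earlier; `load_output` is a hypothetical helper.

```python
import csv
import io
import json


def load_output(handle):
    """Read a Yase-style CSV into a list of (text, vectors) pairs.

    Hypothetical helper for illustration; not part of the Yase API.
    """
    reader = csv.DictReader(handle)
    return [(row["inputs"], json.loads(row["vectors"])) for row in reader]


# In-memory stand-in for open("output.csv", newline="").
sample = io.StringIO('inputs,vectors\nhello world,"[[1,1,1],[2,2,2]]"\n')
for text, vectors in load_output(sample):
    print(text, len(vectors))
```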

TODO

Optimize mapping loading time
Optional argument to output fixed-size vectors for all input sequences
Surely lots of other things!

Can I contribute?

Of course! If you want to improve Yase, your ideas, pull requests, and issues are welcome!

Owner

Pierre PACI
