ICE Tokenizer

Token ids in [0, 20000) are image tokens.
Token ids in [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == ' ', icetk[20003] == ' ', icetk[20006] == ','.
Token ids in [20100, 83823) are English tokens.
Token ids in [83823, 145653) are Chinese tokens.
Token ids in [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
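
The ranges above can be turned into a small lookup for checking which part of the vocabulary an id belongs to. This is only a sketch: `token_category`, `ID_RANGES` and `VOCAB_SIZE` are hypothetical names introduced here for illustration, not part of the icetk package, and the boundaries are copied from the list above.

```python
from bisect import bisect_right

# (first id of the range, label); boundaries copied from the list above.
# These names are illustrative and are NOT part of the icetk API.
ID_RANGES = [
    (0, "image"),
    (20000, "common"),
    (20100, "english"),
    (83823, "chinese"),
    (145653, "rare"),
]
VOCAB_SIZE = 150000  # exclusive upper bound of the last range
_STARTS = [start for start, _ in ID_RANGES]


def token_category(token_id: int) -> str:
    """Return the vocabulary range label that a token id falls into."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside [0, {VOCAB_SIZE})")
    # bisect_right finds the first range start greater than token_id;
    # the range containing token_id is the one just before it.
    return ID_RANGES[bisect_right(_STARTS, token_id) - 1][1]
```

For example, `token_category(20006)` gives `'common'` (the id of ',' above) and `token_category(145803)` gives `'rare'` (the id of 'α' above).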

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.'  # always recovers the input exactly (provided it contains no unknown tokens)

ids = icetk.encode('你好世界！这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
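
The shapes in the image example are consistent with one token per compress_rate × compress_rate pixel patch: 256 / 8 = 32 patches per side, and 32 × 32 = 1024 tokens. The helper below is a sketch under that assumption. The formula is inferred from the shapes shown above rather than stated by the package, and `image_token_count` is a hypothetical name, not an icetk function.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Estimate the number of image tokens for a square image.

    Assumption (inferred from the example shapes, not confirmed by icetk):
    the image is split into a square grid with image_size / compress_rate
    cells per side, and each cell becomes one token.
    """
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be divisible by compress_rate")
    side = image_size // compress_rate
    return side * side
```

With image_size=256 and compress_rate=8 this returns 1024, matching ids.shape == torch.Size([1, 1024]) above.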

A unified tokenization tool for Images, Chinese and English.

Owner

THUDM
