Transformation spoken text to written text

Last update: Dec 28, 2022

Related tags

Overview

Transformation spoken text to written text

This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, id, ...). It also supports formatting "out of vocab" by using external vocabulary.

Some of examples:

input  : tám giờ chín phút ngày mười tám tháng năm năm hai nghìn không trăm hai mươi hai
output : 8h9 18/5/2022

input  : mã số quy đê tê tê đê hai tám chéo hai không không ba
output : mã số qdttd28/2003

input  : thể tích tám mét khối trọng lượng năm mươi ki lô gam
output : thể tích 8 m3 trọng lượng 50 kg

input    : ngày hai tám tháng tư cô vít bùng phát ở sờ cốt lờn chiếm tám mươi phần trăm là biến chủng đen ta và bê ta
ex_vocab : ['scotland', 'covid', 'delta', 'beta']
output   : 28/4 covid bùng phát ở scotland chiếm 80 % là biến chủng delta và beta

Model architecture

Infer model

Play around at Huggingface Space

import torch
import model_handling
from data_handling import DataCollatorForNormSeq2Seq
from model_handling import EncoderDecoderSpokenNorm
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

Init tokenizer and model

tokenizer = model_handling.init_tokenizer()
model = EncoderDecoderSpokenNorm.from_pretrained('nguyenvulebinh/spoken-norm', cache_dir=model_handling.cache_dir)
data_collator = DataCollatorForNormSeq2Seq(tokenizer)

Infer sample

bias_list = ['scotland', 'covid', 'delta', 'beta']
input_str = 'ngày hai tám tháng tư cô vít bùng phát ở sờ cốt lờn chiếm tám mươi phần trăm là biến chủng đen ta và bê ta'

inputs = tokenizer([input_str])
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
if len(bias_list) > 0:
    bias = data_collator.encode_list_string(bias_list)
    bias_input_ids = bias['input_ids']
    bias_attention_mask = bias['attention_mask']
else:
    bias_input_ids = None
    bias_attention_mask = None

inputs = {
    "input_ids": torch.tensor(input_ids),
    "attention_mask": torch.tensor(attention_mask),
    "bias_input_ids": bias_input_ids,
    "bias_attention_mask": bias_attention_mask,
}

Format input text with bias phrases

outputs = model.generate(**inputs, output_attentions=True, num_beams=1, num_return_sequences=1)

for output in outputs.cpu().detach().numpy().tolist():
    # print('\n', tokenizer.decode(output, skip_special_tokens=True).split(), '\n')
    print(tokenizer.sp_model.DecodePieces(tokenizer.decode(output, skip_special_tokens=True).split()))

28/4 covid bùng phát ở scotland chiếm 80 % là biến chủng delta và beta

Format input text without bias phrases

outputs = model.generate(**{
    "input_ids": torch.tensor(input_ids),
    "attention_mask": torch.tensor(attention_mask),
    "bias_input_ids": None,
    "bias_attention_mask": None,
}, output_attentions=True, num_beams=1, num_return_sequences=1)

for output in outputs.cpu().detach().numpy().tolist():
    # print('\n', tokenizer.decode(output, skip_special_tokens=True).split(), '\n')
    print(tokenizer.sp_model.DecodePieces(tokenizer.decode(output, skip_special_tokens=True).split()))

28/4 cô vít bùng phát ở sờ cốt lờn chiếm 80 % là biến chủng đen ta và bê ta

Contact

[email protected]

Transformation spoken text to written text

Related tags

Overview

Transformation spoken text to written text

Model architecture

Infer model

Init tokenizer and model

Infer sample

Format input text with bias phrases

Format input text without bias phrases

Contact

Owner

Nguyen Binh

A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Code to reprudece NeurIPS paper: Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Google AI 2018 BERT pytorch implementation

GNES enables large-scale index and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content form

A retro text-to-speech bot for Discord

Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

Wind Speed Prediction using LSTMs in PyTorch

PyABSA - Open & Efficient for Framework for Aspect-based Sentiment Analysis

Open source code for AlphaFold.

It analyze the sentiment of the user, whether it is postive or negative.

PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation

PG-19 Language Modelling Benchmark

Estimation of the CEFR complexity score of a given word, sentence or text.

The projects lets you extract glossary words and their definitions from a given piece of text automatically using NLP techniques

Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing