This repository contains code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transformer"

Last update: Nov 28, 2022

FlatTN

This repository contains code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transformer", published at ICASSP 2022.

Requirements

Python: 3.7.3
PyTorch: 1.2.0
FastNLP: 0.5.0
Numpy: 1.16.4
fitlog

For more about FastNLP and fitlog, please refer to their respective project documentation.

Dataset download

We release a large-scale Chinese Text Normalization (TN) dataset in cooperation with Databaker (Beijing) Technology Co., Ltd.

To download the dataset, please visit https://www.data-baker.com/en/#/data/index/TNtts.

(For Chinese version of the download page, please visit https://www.data-baker.com/data/index/TNtts.)

Data preprocessing

The raw dataset, in JSONL format, is saved at: dataset/processed/CN_TN_epoch-01-28645_2.jsonl

We preprocessed the data into the BMES format and split it into train, dev, and test sets at a ratio of 8:1:1.

dataset/processed/shuffled_BMES
                      ├── train.char.bmes
                      ├── dev.char.bmes
                      └── test.char.bmes
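
For readers who want to see what an 8:1:1 split amounts to, here is a minimal sketch in Python. It is not the logic of preprocess.py, which should be consulted for the real behavior; the function name `split_8_1_1`, the fixed seed, and the choice to give the remainder to the test set are all assumptions made for illustration.

```python
import random


def split_8_1_1(sentences, seed=0):
    """Shuffle sentences and split them into train/dev/test at 8:1:1.

    Hypothetical helper: the name, the seed, and the rounding policy are
    illustrative assumptions, not taken from preprocess.py.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)  # a fixed seed keeps the split reproducible
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever is left goes to the test set
    return train, dev, test


train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Shuffling before slicing matters here: without it, the three sets would follow the order of the raw file rather than being random samples.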

An example of the processed data in BMES format is as follows:

2 B-DIGIT
0 M-DIGIT
1 M-DIGIT
5 E-DIGIT
年 S-SELF
， S-PUNC
只 S-SELF
剩 S-SELF
3 B-CARDINAL
9 E-CARDINAL
天 S-SELF
。 S-PUNC
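
To make the tagging scheme concrete, the following sketch groups character-level BMES tags back into labeled spans: B, M, and E mark the beginning, middle, and end of a multi-character span, and S marks a single-character span. The function name `bmes_to_spans` is hypothetical and does not come from this repository; the sketch also assumes well-formed tag sequences with one whitespace-separated "character tag" pair per line, as in the example above.

```python
def bmes_to_spans(pairs):
    """Group (char, tag) pairs into (text, category) spans.

    Hypothetical helper for illustration; assumes well-formed BMES input
    with tags such as B-DIGIT, M-DIGIT, E-DIGIT, or S-SELF.
    """
    spans, buffer = [], []
    for char, tag in pairs:
        position, _, category = tag.partition("-")
        if position == "S":      # single-character span
            spans.append((char, category))
        elif position == "B":    # start a new multi-character span
            buffer = [char]
        elif position == "M":    # continue the current span
            buffer.append(char)
        elif position == "E":    # close the current span
            buffer.append(char)
            spans.append(("".join(buffer), category))
            buffer = []
    return spans


lines = """2 B-DIGIT
0 M-DIGIT
1 M-DIGIT
5 E-DIGIT
年 S-SELF
， S-PUNC
只 S-SELF
剩 S-SELF
3 B-CARDINAL
9 E-CARDINAL
天 S-SELF
。 S-PUNC""".splitlines()

pairs = [tuple(line.rsplit(" ", 1)) for line in lines]
for text, category in bmes_to_spans(pairs):
    print(text, category)
```

On the example sentence this yields `2015` as a DIGIT span and `39` as a CARDINAL span, with every other character as its own SELF or PUNC span.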

You can re-run our code to preprocess and divide the raw dataset again:

cd dataset/processed
python preprocess.py

You can also use the following code to get statistics for all NSW (non-standard word) categories in the data:

cd dataset/processed
python stat.py
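
As a rough idea of what per-category statistics over BMES tags could look like, here is a small sketch. It is not the content of stat.py; `count_categories` is a hypothetical name. It counts each span once by looking only at B- and S- tags, and it counts every category it sees, including SELF and PUNC, which a real NSW count may well exclude.

```python
from collections import Counter


def count_categories(tags):
    """Count one occurrence per span; a span starts at a B- or S- tag.

    Hypothetical helper for illustration, not taken from stat.py.
    """
    counts = Counter()
    for tag in tags:
        position, _, category = tag.partition("-")
        if position in ("B", "S"):
            counts[category] += 1
    return counts


tags = ["B-DIGIT", "M-DIGIT", "M-DIGIT", "E-DIGIT", "S-SELF", "S-PUNC",
        "S-SELF", "S-SELF", "B-CARDINAL", "E-CARDINAL", "S-SELF", "S-PUNC"]
counts = count_categories(tags)
print(counts["DIGIT"], counts["CARDINAL"], counts["SELF"], counts["PUNC"])  # 1 1 4 2
```

Counting only span-initial tags avoids inflating the totals for long spans, since a four-character DIGIT span contributes one count rather than four.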

Training

Our code is in the V1 directory. To run training:

cd V1
python flat_main.py --dataset databaker

Our proposed rule base is saved in a Python file: V1/add_rule.py

Acknowledgement

Our code is based on Flat-Lattice-Transformer (FLAT) from LeeSureman.

For more information about FLAT, please refer to LeeSureman/Flat-Lattice-Transformer.


Owner

THUHCSI
