LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation


Tasks | Datasets | LongLM | Baselines | Paper

Introduction

LOT is a benchmark for evaluating Chinese long text understanding and generation. It consists of two understanding tasks and two generation tasks, with new datasets constructed for these tasks from human-written Chinese stories.

Furthermore, we release LongLM, an encoder-decoder Chinese long-text pretrained model with up to 1 billion parameters. LongLM is pretrained on 120G of Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms similar-sized pretrained models on both the understanding and generation tasks in LOT.

Tasks

We design LOT as an aggregation of two understanding tasks, Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks, Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). We show the task descriptions in the table below.

Datasets

We show the data statistics in the table below. The abbreviations sent and len stand for sentence and length, respectively. The datasets and evaluation scripts can be downloaded from THUCloud.

LongLM

1. Parameters

  • $d_m$: the dimension of hidden states
  • $d_{ff}$: the dimension of feed forward layers
  • $d_{kv}$: the dimension of the keys/values in the self-attention layers
  • $n_h$: the number of attention heads
  • $n_e$: the number of hidden layers of the encoder
  • $n_d$: the number of hidden layers of the decoder
  • #P: the number of parameters
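
These symbols correspond to fields of the model's T5Config, so they can be checked directly after loading a checkpoint. A minimal sketch (the checkpoint path is an assumption; see the Checkpoints section below):

    from transformers import T5Config

    # Inspect the architecture hyperparameters of a downloaded LongLM checkpoint
    config = T5Config.from_pretrained('./LongLM-small')
    print('d_m  =', config.d_model)            # dimension of hidden states
    print('d_ff =', config.d_ff)               # dimension of feed-forward layers
    print('d_kv =', config.d_kv)               # dimension of keys/values in self-attention
    print('n_h  =', config.num_heads)          # number of attention heads
    print('n_e  =', config.num_layers)         # number of encoder layers
    print('n_d  =', config.num_decoder_layers) # number of decoder layers

The total number of parameters (#P) can be obtained from the loaded model itself, e.g. sum(p.numel() for p in model.parameters()).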

2. Pretraining Tasks
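
As described above, LongLM is pretrained with two generative tasks: text infilling, where masked spans of the input must be reconstructed, and conditional continuation, where the rest of a text is generated from its beginning. The sketch below illustrates how such training pairs could be built; the concrete span-sampling procedure and sentinel format used for LongLM are assumptions here (modeled on T5-style infilling), not the official preprocessing code.

    # Hypothetical construction of pretraining examples for the two tasks described above.

    def make_infilling_example(text):
        """Text infilling: mask a span of the input and reconstruct it (T5-style sentinel assumed)."""
        span_start, span_end = len(text) // 3, len(text) // 2  # arbitrary span, for illustration only
        source = text[:span_start] + '<extra_id_0>' + text[span_end:]
        target = '<extra_id_0>' + text[span_start:span_end]
        return source, target

    def make_continuation_example(text):
        """Conditional continuation: generate the second half of a text given the first half."""
        mid = len(text) // 2
        return text[:mid], text[mid:]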

3. Pretraining Data

We collect 120G novels as the pretraining data for LongLM. The pretraining data will be publicly available soon.

4. Checkpoints

  1. Download: The checkpoints and example data can be downloaded from THUCloud. The training and generation scripts are under the directory longlm. You can also use the official script provided by Transformers to fine-tune the model.

  2. Model Loading:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    # Load the tokenizer and model from a downloaded LongLM checkpoint directory
    tokenizer = T5Tokenizer.from_pretrained('LongLM-large')
    model = T5ForConditionalGeneration.from_pretrained('LongLM-large')
    
    • Dependencies: torch=1.8.1, transformers=4.6.1
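
    To run the generation example below on a GPU, a usual follow-up (a minimal sketch using standard PyTorch calls, not repo-specific code) is to move the model to a device and switch to evaluation mode:

    import torch

    # Pick a GPU if one is available; `device` is reused by the generation snippet below
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()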
  3. Training:

    Execute bash ./finetune.sh to fine-tune LongLM. If DeepSpeed is available, you can execute bash ./finetune_deepspped.sh to accelerate training. The fine-tuning command is shown below:

    env CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 CUDA_LAUNCH_BLOCKING=1 python3 -m torch.distributed.launch --nproc_per_node=8 \
    finetune_trainer.py \
    --data_dir=./data \ # directory of the data
    --train_name=train \ # file prefix of the training data
    --output_dir=./save_model \ # output directory to save the checkpoints
    --save_total_limit=10 \ # maximum number of saved checkpoints
    --per_gpu_train_batch_size=3 \ # batch size for training
    --per_gpu_eval_batch_size=3 \ # batch size for evaluation
    --num_train_epochs=1 \ # number of training epochs
    --logging_steps=5 \ # number of steps between logging the loss value
    --model_name_or_path=./LongLM-small \ # path to the pretrained model
    --warmup_steps=100 \ # number of warmup steps
    --learning_rate=1e-4 \ # learning rate
    --n_val=100 \ # number of examples for validation
    --do_train --do_eval \ # whether to run training/evaluation
    --evaluation_strategy steps \ # evaluation strategy
    --gradient_accumulation_steps=40 \ # number of steps for gradient accumulation
    --overwrite_output_dir \
    --load_best_model_at_end
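
    After fine-tuning, the weights saved under --output_dir can be loaded back with the same API as the pretrained checkpoint. A minimal sketch; the checkpoint-<step> sub-directory name is an assumption about the default Trainer saving convention, so adjust it to your run:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    # The tokenizer is unchanged by fine-tuning, so it can be loaded from the original checkpoint
    tokenizer = T5Tokenizer.from_pretrained('./LongLM-small')
    # Hypothetical checkpoint directory written by the Trainer under --output_dir
    model = T5ForConditionalGeneration.from_pretrained('./save_model/checkpoint-500')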
  4. Generation:

    ",return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device) gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512) ">
    input_ids = tokenizer("小咕噜对,
         
          "
         ,return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device)
    
    gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512)
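
    The sampled ids can be decoded back into text with the tokenizer; a minimal follow-up sketch:

    # Decode the generated ids, dropping special tokens such as padding and </s>
    texts = tokenizer.batch_decode(gen, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(texts[0])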

Baselines

1. Understanding Tasks

The example data and the training and evaluation scripts for fine-tuning LongLM on the understanding tasks are under the directory ./baselines/understanding. You can execute bash ./finetune.sh to fine-tune LongLM and bash ./eval.sh to evaluate the fine-tuned model.

2. Generation Tasks

The training script of LongLM for the generation tasks is the same as the pretraining script. The generation script and example data can be found under ./baselines/generation. You can execute bash ./gen.sh for generation.

Citation

@misc{guan2021lot,
      title={LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation}, 
      author={Jian Guan and Zhuoer Feng and Yamei Chen and Ruilin He and Xiaoxi Mao and Changjie Fan and Minlie Huang},
      year={2021},
      eprint={2108.12960},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}