Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

Data Augmentation using Pre-trained Transformer Models

The code implements the following data augmentation methods:

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

Datasets

In the paper, we use three datasets: STSA-2, TREC, and SNIPS.

Low-data regime experiment setup

Run src/utils/download_and_prepare_datasets.sh to prepare all datasets.
The script performs the following steps:

Downloads the data from GitHub
Replaces numeric labels with text for the STSA-2 and TREC datasets
Creates 15 random splits of train and dev data for each dataset
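The relabeling and splitting steps can be illustrated with a small Python sketch. This is not the code in download_and_prepare_datasets.sh: the function names, the label names, the split sizes, and the uniform sampling strategy are all assumptions chosen for illustration, and the actual script may sample differently (for example, per class).

```python
import random

# Hypothetical label names for a binary sentiment dataset; the names the
# actual script writes are an assumption here.
LABEL_NAMES = {"0": "Negative", "1": "Positive"}


def relabel(examples, label_names):
    """Replace each numeric label with its text name."""
    return [(label_names[label], text) for label, text in examples]


def make_random_splits(examples, num_splits=15, train_size=4, dev_size=2, seed=0):
    """Draw num_splits (train, dev) pairs; train and dev never overlap within a pair."""
    if train_size + dev_size > len(examples):
        raise ValueError("not enough examples for the requested split sizes")
    splits = []
    for split_id in range(num_splits):
        rng = random.Random(seed + split_id)  # one reproducible generator per split
        sampled = rng.sample(examples, train_size + dev_size)
        splits.append((sampled[:train_size], sampled[train_size:]))
    return splits


# Toy data standing in for a downloaded dataset: (numeric label, text) pairs.
toy = [(str(i % 2), f"sentence {i}") for i in range(20)]
splits = make_random_splits(relabel(toy, LABEL_NAMES))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # prints: 15 4 2
```

Seeding each split separately keeps every split reproducible on its own, which is convenient when a single split has to be regenerated.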

Dependencies

To run this code, you need the following dependencies:

PyTorch 1.5
fairseq 0.9
transformers 2.9
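Before running the experiments, it can help to confirm that the installed versions match the ones listed above. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes the packages are installed under the distribution names torch, fairseq, and transformers.

```python
from importlib import metadata

# Version prefixes taken from the dependency list above.
REQUIRED = {"torch": "1.5", "fairseq": "0.9", "transformers": "2.9"}


def matches(installed, required_prefix):
    """True if installed is the required release or a patch of it (1.5, 1.5.1, ...)."""
    return installed == required_prefix or installed.startswith(required_prefix + ".")


def check_dependencies(required=REQUIRED):
    """Map each package to (installed version or None, whether it matches)."""
    report = {}
    for package, prefix in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, installed is not None and matches(installed, prefix))
    return report


for package, (installed, ok) in check_dependencies().items():
    status = "ok" if ok else "mismatch or missing"
    print(f"{package}: required {REQUIRED[package]}, found {installed} ({status})")
```

The check compares on a prefix followed by a dot so that a version such as 1.50 is not mistaken for 1.5.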

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the SNIPS dataset:

run scripts/bart_snips_lower.sh for the BART experiment
run scripts/bert_snips_lower.sh for the rest of the data augmentation methods
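The choice between the two SNIPS scripts can be sketched as a small Python helper. The function names are hypothetical and not part of the repository, and the sketch assumes that bash is available and that repo_root points at a checkout of this code.

```python
import subprocess
from pathlib import Path


def snips_script_for(method):
    """Pick the SNIPS script: the BART script for BART, the BERT script for everything else."""
    is_bart = method.strip().lower().startswith("bart")
    name = "bart_snips_lower.sh" if is_bart else "bert_snips_lower.sh"
    return Path("scripts") / name


def run_snips_experiment(method, repo_root="."):
    """Run the selected script from the repository root; raises if the script fails."""
    script = snips_script_for(method)
    subprocess.run(["bash", script.as_posix()], cwd=repo_root, check=True)


print(snips_script_for("BART Prepend").as_posix())  # prints: scripts/bart_snips_lower.sh
print(snips_script_for("CBERT").as_posix())         # prints: scripts/bert_snips_lower.sh
```

Calling run_snips_experiment("BART Prepend") from a checkout would then launch the BART experiment; nothing is executed when the helpers are only imported.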

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] for any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
