Implementation of ICLR 2020 paper "Revisiting Self-Training for Neural Sequence Generation"

Last update: Dec 31, 2022

Overview

Self-Training for Neural Sequence Generation

This repo includes instructions for running noisy self-training algorithms from the following paper:

Revisiting Self-Training for Neural Sequence Generation
Junxian He*, Jiatao Gu*, Jiajun Shen, Marc'Aurelio Ranzato
ICLR 2020

Requirements

fairseq (please see the fairseq repo for other requirements on Python and PyTorch versions)

fairseq can be installed with:

pip install fairseq

Data

Download and preprocess the WMT'14 En-De dataset:

# Download and prepare the data
wget https://raw.githubusercontent.com/pytorch/fairseq/master/examples/translation/prepare-wmt14en2de.sh
bash prepare-wmt14en2de.sh --icml17

TEXT=wmt14_en_de
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir wmt14_en_de_bin --thresholdtgt 0 --thresholdsrc 0 \
    --joined-dictionary --workers 16

We then mimic a semi-supervised setting: 100K training pairs are randomly selected as the parallel corpus, and the English side of the remaining training pairs is treated as an unannotated monolingual corpus:

bash extract_wmt100k.sh
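
The split described above can be sketched in Python as follows. This is a minimal illustration of the idea, not the contents of extract_wmt100k.sh; the function name `split_parallel_and_mono` and the `seed` parameter are assumptions introduced here.

```python
import random


def split_parallel_and_mono(src_lines, tgt_lines, n_parallel, seed=0):
    """Randomly keep n_parallel sentence pairs as the parallel corpus.

    The source side of every remaining pair is returned as monolingual text.
    Hypothetical helper for illustration only.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_parallel > len(src_lines):
        raise ValueError("n_parallel exceeds the corpus size")

    rng = random.Random(seed)
    # Sample indices without replacement so that no pair is selected twice.
    chosen = set(rng.sample(range(len(src_lines)), n_parallel))

    parallel = [
        (s, t)
        for i, (s, t) in enumerate(zip(src_lines, tgt_lines))
        if i in chosen
    ]
    # The targets of the unselected pairs are discarded on purpose.
    mono_src = [s for i, s in enumerate(src_lines) if i not in chosen]
    return parallel, mono_src
```

For the setting in this README, `n_parallel` would be 100000 and the inputs would be the lines of the English and German training files.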

Preprocess WMT100K:

bash preprocess.sh 100ken 100kde

Add noise to the monolingual corpus for later use:

TEXT=wmt14_en_de
python paraphrase/paraphrase.py \
  --paraphraze-fn noise_bpe \
  --word-dropout 0.2 \
  --word-blank 0.2 \
  --word-shuffle 3 \
  --data-file ${TEXT}/train.mono_en \
  --output ${TEXT}/train.mono_en_noise \
  --bpe-type subword
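
For intuition, the three noise types named by the flags above (word dropout, word blanking, local word shuffling) can be sketched in Python. This is a simplified illustration on whitespace-separated tokens, not the code in paraphrase/paraphrase.py: the placeholder token `<blank>`, the order in which the operations are applied, and all function names are assumptions, and the sketch ignores BPE boundaries.

```python
import random

BLANK = "<blank>"  # hypothetical placeholder; the real script may use another token


def word_shuffle(tokens, k, rng):
    """Locally shuffle tokens; each token moves by at most k positions."""
    if k <= 0:
        return list(tokens)
    # Sort positions by "index + random offset in [0, k]".
    keys = [i + rng.uniform(0, k) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=keys.__getitem__)
    return [tokens[i] for i in order]


def word_dropout(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


def word_blank(tokens, p, rng):
    """Replace each token with the placeholder with probability p."""
    return [BLANK if rng.random() < p else t for t in tokens]


def add_noise(tokens, dropout=0.2, blank=0.2, shuffle=3, rng=None):
    """Apply shuffle, then dropout, then blanking (the order is an assumption)."""
    rng = rng or random.Random()
    tokens = word_shuffle(tokens, shuffle, rng)
    tokens = word_dropout(tokens, dropout, rng)
    return word_blank(tokens, blank, rng)
```

The default arguments mirror the values passed on the command line above (0.2, 0.2, 3).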

Train the base supervised model

Train the translation model with 30K updates:

bash supervised_train.sh 100ken 100kde 30000

Self-training as pseudo-training + fine-tuning

Translate the monolingual data to train.[suffix] to form a pseudo parallel dataset:

bash translate.sh [model_path] [suffix]

Assuming the pseudo target-language suffix is mono_de_iter1 (the default), preprocess the pseudo-parallel data:

bash preprocess.sh mono_en_noise mono_de_iter1

Pseudo-training + fine-tuning:

bash self_train.sh mono_en_noise mono_de_iter1

The above command trains the model on the pseudo-parallel corpus formed by train.mono_en_noise and train.mono_de_iter1, and then fine-tunes it on the real parallel data.

This self-training process can be repeated for multiple iterations to improve performance.
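
The overall procedure (base training, translation of the monolingual data, pseudo-training on noisy sources, fine-tuning, repeat) can be summarized as a short Python sketch. It only shows the control flow: `train`, `translate`, `noise`, and `fine_tune` are hypothetical callables standing in for the shell scripts above, and whether pseudo-training restarts from scratch in each iteration is an assumption of this sketch rather than something stated in this README.

```python
def noisy_self_training(parallel, mono_src, train, translate, noise,
                        fine_tune, iterations=1):
    """Control-flow sketch of noisy self-training.

    parallel:  list of (source, target) pairs with real targets
    mono_src:  list of unannotated source sentences
    train, translate, noise, fine_tune: hypothetical user-supplied callables
    """
    # Base supervised model trained on the real parallel data.
    model = train(parallel)

    # Noise is added to the monolingual sources once, before the loop.
    noisy_src = [noise(src) for src in mono_src]

    for _ in range(iterations):
        # Translate the clean monolingual sources with the current model.
        pseudo_tgt = [translate(model, src) for src in mono_src]

        # Pair noisy sources with pseudo targets to form the pseudo-parallel corpus.
        pseudo_parallel = list(zip(noisy_src, pseudo_tgt))

        # Pseudo-training (assumed here to start a fresh model),
        # followed by fine-tuning on the real parallel data.
        model = train(pseudo_parallel)
        model = fine_tune(model, parallel)

    return model
```

In terms of the commands above, one loop iteration corresponds to running translate.sh, preprocess.sh, and self_train.sh with a new suffix for the pseudo targets.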

Reference

@inproceedings{He2020Revisiting,
title={Revisiting Self-Training for Neural Sequence Generation},
author={Junxian He and Jiatao Gu and Jiajun Shen and Marc'Aurelio Ranzato},
booktitle={Proceedings of ICLR},
year={2020},
url={https://openreview.net/forum?id=SJgdnAVKDH}
}
