
Overview

pretraining-learning-curves

This is the repository for the paper When Do You Need Billions of Words of Pretraining Data?

Edge Probing

We use jiant1 for our edge probing experiments. This tutorial can help you set up the environment and get started with jiant.
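If you are setting up jiant v1 from scratch, the sketch below shows one conda-based setup (the repository URL and environment file name are assumptions based on the standard jiant v1 layout; follow the linked tutorial for the authoritative steps):

# Assumed jiant v1 setup; see the jiant tutorial for the exact commands.
git clone https://github.com/nyu-mll/jiant.git
cd jiant
conda env create -f environment.yml   # environment file name assumed
conda activate jiant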

Below is an example of how to reproduce our dependency labelling experiment with roberta-base-1B-3, which is one of the MiniBERTas we probe.

Download and Preprocess the Data

The commands below download and tokenize the data for the dependency labelling task. Remember to change directory to the root of the jiant repository and activate your jiant environment first.

mkdir data

mkdir data/edges

probing/data/get_ud_data.sh data/edges/dep_ewt

python probing/get_edge_data_labels.py -o data/edges/dep_ewt/labels.txt -i data/edges/dep_ewt/*.json

python probing/retokenize_edge_data.py -t nyu-mll/roberta-base-1B-3  data/edges/dep_ewt/*.json

Run the Experiment

If you have not used jiant before, you will probably need to set two critical environment variables:

$JIANT_PROJECT_PREFIX: the directory where logs and model checkpoints will be saved.

$JIANT_DATA_DIR: the data directory. Set it to PATH/TO/LOCAL/REPO/data
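For example, in your shell (the output path below is a placeholder):

export JIANT_PROJECT_PREFIX=/path/to/experiment/outputs   # logs and checkpoints go here
export JIANT_DATA_DIR=PATH/TO/LOCAL/REPO/data             # the data directory created above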

Now, you are ready to run the probing program:

python main.py --config_file jiant/config/edgeprobe/edgeprobe_miniberta.conf \
    --overrides "exp_name=DL_tutorial, target_tasks=edges-dep-ud-ewt, transformers_output_mode=mix, input_module=nyu-mll/roberta-base-1B-3, target_train_val_interval=1000, batch_size=32, target_train_max_vals=130, lr=0.0005"

A logging message will be printed after each validation. You should expect validation F1 to exceed 90 within only a few validations.

The final validation result is printed once the experiment finishes, and can also be found in $JIANT_PROJECT_PREFIX/DL_tutorial/results.tsv. You should expect a final validation F1 of around 95.
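If you want to check the scores without scrolling through the log, one option is to look at the last line of results.tsv (the column layout is whatever jiant writes; it is not defined here):

tail -n 1 $JIANT_PROJECT_PREFIX/DL_tutorial/results.tsv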

Minimum Description Length Probing with Edge Probing tasks

For this experiment, we use this fork of jiant1.

BLiMP

The code for our BLiMP experiments can be found here. Results for our MiniBERTas are already available there.

If you want to rerun the experiments yourself, we have prepared the BLiMP data, so you only need to install all the dependencies for the environment and run the scripts following the tutorial here. Note that your CUDA version may cause problems when installing mxnet.
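For example, on a machine running CUDA 10.2, one option is to install the CUDA-specific mxnet wheel instead of the default CPU build (the exact package name depends on your CUDA version; check the mxnet installation guide):

pip install mxnet-cu102   # pick the wheel matching your CUDA version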

SuperGLUE

We use jiant2 for our SuperGLUE experiments. Get started with jiant2 using this guide and examples.
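If you only need the released package rather than a development checkout, a minimal starting point is the sketch below (the guide above also covers installing jiant2 from source):

pip install jiant   # jiant 2.x from PyPI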

Owner

ML² AT CILVR, the Machine Learning for Language Group at NYU CILVR