Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

Last update: Dec 05, 2022

Overview

This is the PyTorch implementation of sparse progressive distillation (SPD). For details on the motivation, techniques, and experimental results, refer to our paper here.

Running

Environment Preparation (using python3)
```
pip install -r requirements.txt
```
Dataset Preparation

The original GLUE dataset can be downloaded here.

BERT_base fine-tuning on GLUE

We use a finetuned BERT_base as the teacher. For each task in the GLUE benchmark, we obtain the finetuned model by running the original Hugging Face Transformers code with the following script.

```
python run_glue.py \
  --model_name_or_path $INT_DIR \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 3e-5 \
  --num_train_epochs 4.0 \
  --output_dir $OUT_DIR \
  --evaluate_during_training \
  --overwrite_output_dir \
  --logging_steps 400 \
  --logging_dir $OUT_DIR \
  --save_steps 10000
```
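
Since the teacher is finetuned once per GLUE task, the command above can be generated for every task from a small helper. The following is only a sketch: the task folder names, the per-task output layout, and the placeholder paths in the usage lines are assumptions, not part of this repository. The flags and values are copied from the script above.

```python
import shlex

# Assumed task folder names; adjust them to match the folders under $GLUE_DIR.
GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]


def build_teacher_command(task, init_dir, glue_dir, out_root):
    """Return the run_glue.py argument list for finetuning one teacher."""
    out_dir = f"{out_root}/{task}"  # hypothetical layout: one output folder per task
    return [
        "python", "run_glue.py",
        "--model_name_or_path", init_dir,
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--data_dir", f"{glue_dir}/{task}/",
        "--max_seq_length", "128",
        "--per_gpu_train_batch_size", "32",
        "--per_gpu_eval_batch_size", "32",
        "--learning_rate", "3e-5",
        "--num_train_epochs", "4.0",
        "--output_dir", out_dir,
        "--evaluate_during_training",
        "--overwrite_output_dir",
        "--logging_steps", "400",
        "--logging_dir", out_dir,
        "--save_steps", "10000",
    ]


def build_all_teacher_commands(init_dir, glue_dir, out_root, tasks=GLUE_TASKS):
    """Map each task name to its command without running anything."""
    return {t: build_teacher_command(t, init_dir, glue_dir, out_root) for t in tasks}


if __name__ == "__main__":
    # Placeholder paths (assumptions); replace with your $INT_DIR, $GLUE_DIR and output root.
    commands = build_all_teacher_commands("path/to/init_model", "glue_data", "teachers")
    for task, cmd in commands.items():
        print(shlex.join(cmd))
```

The helper only prints the commands so they can be inspected first; passing each list to `subprocess.run(cmd, check=True)` would execute them one task at a time.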

Sparse Progressive Distillation

We use run_glue.py to run sparse progressive distillation. --num_prune_epochs sets the number of epochs spent on pruning. --num_train_epochs sets the total number of epochs across all phases (pruning, progressive distillation, and finetuning), so it must be at least as large as --num_prune_epochs.

```
python run_glue.py \
  --model_name_or_path PATH_TO_FINETUNED_MODEL \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 6.4e-4 \
  --save_steps 50 \
  --num_prune_epochs 30 \
  --num_train_epochs 60 \
  --sparsity 0.9 \
  --output_dir $OUT_DIR \
  --evaluate_during_training \
  --replacing_rate 0.8 \
  --overwrite_output_dir \
  --steps_for_replacing 0 \
  --scheduler_type linear
```
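
To make the relationship between the two epoch flags concrete, the sketch below works out the budget for the values used in this command. The function name is hypothetical and not part of run_glue.py. It relies only on the statement that the total includes the pruning epochs; how the remaining epochs are divided between progressive distillation and finetuning is not specified here, so they are reported as a single number.

```python
def split_epoch_budget(num_train_epochs, num_prune_epochs):
    """Split the total epoch budget into pruning epochs and the epochs left over.

    Hypothetical helper: it assumes only that the pruning epochs are counted
    inside the total, as described for --num_train_epochs.
    """
    if num_prune_epochs < 0 or num_prune_epochs > num_train_epochs:
        raise ValueError("num_prune_epochs must lie between 0 and num_train_epochs")
    return {
        "pruning": num_prune_epochs,
        # Shared by progressive distillation and finetuning (split not specified).
        "after_pruning": num_train_epochs - num_prune_epochs,
    }


budget = split_epoch_budget(num_train_epochs=60, num_prune_epochs=30)
print(budget)  # {'pruning': 30, 'after_pruning': 30}
```

With the settings above, half of the 60 epochs go to pruning and the other 30 are left for the later phases.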

To Dos

Provide our teacher model for each task.
Provide the best-performing model checkpoint for each task.
