To encode a text longer than 512 tokens (for example, 800 tokens), set max_pos to 800 during both preprocessing and training.

Last update: Nov 02, 2022

Related tags

Deep Learning SciBERTSUM

Overview

LongScientificFormer

Some code is borrowed from OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py).

Data Preparation

Option 1: download the processed data

Pre-processed data

Put all files into the raw_data directory.

Step 2. Download Stanford CoreNLP

We need Stanford CoreNLP to tokenize the data. Download it and unzip it, then add the following line to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-4.2.2 directory.
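
Before running the tokenization step, it can help to confirm that every CLASSPATH entry actually exists on disk. The following is a minimal Python sketch and is not part of the repository; the helper name `missing_classpath_entries` is hypothetical, and the sketch assumes CLASSPATH lists plain file or directory paths, as in the export line above.

```python
import os


def missing_classpath_entries(classpath):
    """Return the CLASSPATH entries that do not exist on disk."""
    # Entries are separated by os.pathsep (":" on Linux and macOS).
    # Wildcard entries such as "dir/*" are not expanded by this sketch.
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return [entry for entry in entries if not os.path.exists(entry)]


if __name__ == "__main__":
    classpath = os.environ.get("CLASSPATH", "")
    if not classpath:
        print("CLASSPATH is not set")
    for entry in missing_classpath_entries(classpath):
        print("Missing CLASSPATH entry:", entry)
```

If the script prints nothing, CLASSPATH is set and every entry resolves to an existing path.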

Step 3. Extract sections from GROBID XML files

python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log

Step 4. Extract text from TIKA XML files

python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log

Step 5. Tokenize texts from papers and slides using Stanford CoreNLP

python preprocess.py -mode tokenize  -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log

Step 6. Extract source, section, and target from tokenized files

python preprocess.py -mode clean_paper_jsons -save_path ../json_data/  -n_cpus 10 -log_file ../logs/build_json.log

Step 7. Generate BERT `.pt` files from source, sections and targets

python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data  -lower -n_cpus 40 -log_file ../logs/build_bert_files.log
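
Steps 3 to 7 must run in order, and each one depends on the output of the previous one. As a convenience, they can be chained from a small driver script. This is only a sketch, not something shipped with the repository: the names `PREPROCESS_STEPS` and `run_steps` are hypothetical, the commands are copied from the steps above, and the script is assumed to be started from the same directory in which those commands are normally run.

```python
import shlex
import subprocess

# Commands copied from Steps 3-7 above, in the order they must run.
PREPROCESS_STEPS = [
    "python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log",
    "python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log",
    "python preprocess.py -mode tokenize -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log",
    "python preprocess.py -mode clean_paper_jsons -save_path ../json_data/ -n_cpus 10 -log_file ../logs/build_json.log",
    "python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -n_cpus 40 -log_file ../logs/build_bert_files.log",
]


def run_steps(steps, first_step=3, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for number, command in enumerate(steps, start=first_step):
        print(f"Step {number}: {command}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which stops the pipeline at the first failing step.
            subprocess.run(shlex.split(command), check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands.
    run_steps(PREPROCESS_STEPS, dry_run=True)
```

Switch `dry_run` to `False` once the printed commands look right; adjust `-n_cpus` to match the machine.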

Model Training

First run: the first time you run training, use a single GPU so that the code can download the BERT model. Pass -visible_gpus -1; once the download finishes, you can kill the process and rerun the code with multiple GPUs.

Train

python train.py  -ext_dropout 0.1 -lr 2e-3  -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2  -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000

To continue training from a checkpoint

python train.py  -ext_dropout 0.1 -lr 2e-3  -train_from ../models/model_step_99000.pt -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2  -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
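
With -save_checkpoint_steps 1000, the models directory fills up quickly, and picking the most recent file for -train_from by hand is error-prone. The sketch below selects the checkpoint with the highest step number. It assumes that checkpoints follow the model_step_<N>.pt naming seen in the example path and that they live in ../models; the helper `latest_checkpoint` is hypothetical and not part of the repository.

```python
import os
import re

# Pattern inferred from the example path ../models/model_step_99000.pt.
CHECKPOINT_PATTERN = re.compile(r"^model_step_(\d+)\.pt$")


def latest_checkpoint(file_names):
    """Return the checkpoint file name with the highest step, or None."""
    best_step, best_name = -1, None
    for name in file_names:
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


if __name__ == "__main__":
    model_dir = "../models"  # assumed location, taken from the command above
    if os.path.isdir(model_dir):
        name = latest_checkpoint(os.listdir(model_dir))
        if name:
            print("-train_from", os.path.join(model_dir, name))
```

Step numbers are compared as integers, so model_step_99000.pt correctly ranks above model_step_9000.pt, which a plain string sort would get wrong.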

Test

python train.py -mode test  -test_batch_size 1 -bert_data_path ../bert_data -log_file ../logs/ext_bert_test -test_from ../models/model_step_99000.pt -model_path ../models -sep_optim true -use_interval true -visible_gpus 1,2,3 -alpha 0.95 -result_path ../results/ext

Owner

Athar Sefid
