Persian Bert For Long-Range Sequences

Overview

ParsBigBird: Persian Bert For Long-Range Sequences

BERT and ParsBERT can handle sequences of at most 512 tokens; however, many tasks such as summarization and question answering require longer inputs. In this work, we trained a BigBird model for Persian (Farsi) that processes sequences of up to 4096 tokens using sparse attention.

Figure: BigBird's attention block, from the BigBird paper.

Evaluation: 🌡️

We evaluated the model on three tasks with different sequence lengths:

| Name | Params | SnappFood (F1) | Digikala Magazine (F1) | PersianQA (F1) |
|------|--------|----------------|------------------------|----------------|
| distil-bigbird-fa-zwnj | 78M | 85.43% | 94.05% | 73.34% |
| bert-base-fa | 118M | 87.98% | 93.65% | 70.06% |
  • Despite being only the size of DistilBERT, the model performs on par with ParsBERT and is considerably better on PersianQA, which requires much longer context.
  • This evaluation used max_length=2048 (it can be raised up to 4096); see the tokenization sketch below.
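
The maximum sequence length is set at tokenization time. Below is a minimal, illustrative sketch of preparing a long document at the evaluation length used above; the long_text placeholder and the 2048 value are just examples, and max_length can be raised to 4096:

from transformers import AutoTokenizer

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

long_text = "..."  # placeholder for a long Persian document
# truncate to the evaluation length; max_length can be raised up to 4096
tokens = tokenizer(long_text, truncation=True, max_length=2048, return_tensors="pt")
print(tokens["input_ids"].shape)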

How to use

As Contextualized Word Embedding

from transformers import BigBirdModel, AutoTokenizer

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
# by default the attention type is `block_sparse` with block_size=32
model = BigBirdModel.from_pretrained(MODEL_NAME, block_size=32)
# alternatively, use full attention when the input is no longer than 512 tokens
model = BigBirdModel.from_pretrained(MODEL_NAME, attention_type="original_full")

# "😃 I hope the model turns out useful, because it took a long time to train"
text = "😃 امیدوارم مدل بدردبخوری باشه چون خیلی طول کشید تا ترین بشه"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens) # contextualized embedding
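
If a single vector per input is needed, one common option is to average the token embeddings. Mean pooling is not something this repo prescribes; the sketch below, continuing from the snippet above, is just one way to do it:

import torch

with torch.no_grad():
    output = model(**tokens)

# mask-aware mean pooling over the sequence dimension -> one vector per input
mask = tokens["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
summed = (output.last_hidden_state * mask).sum(dim=1)    # (batch, hidden)
sentence_embedding = summed / mask.sum(dim=1)            # (batch, hidden)
print(sentence_embedding.shape)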

As Fill-in-the-Blank

from transformers import pipeline

MODEL_NAME = 'SajjadAyoubi/distil-bigbird-fa-zwnj'
fill = pipeline('fill-mask', model=MODEL_NAME, tokenizer=MODEL_NAME)
results = fill('تهران پایتخت [MASK] است.')  # "Tehran is the capital of [MASK]."
print(results[0]['token_str'])
>>> 'ایران'  # "Iran"
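
To see more than the top prediction, the fill-mask pipeline also accepts a top_k argument (name as in recent 🤗 transformers releases); each result is a dict with score, token_str, and sequence. Continuing from the snippet above:

# print the five highest-scoring completions
for result in fill('تهران پایتخت [MASK] است.', top_k=5):
    print(f"{result['token_str']}: {result['score']:.3f}")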

Pretraining details: 🔭

This model was pretrained with a masked language modeling (MLM) objective on the Persian portion of the OSCAR dataset. Following the original BERT training setup, 15% of tokens were masked. The BigBird approach was first described in this paper and released in this repository. Documents longer than 4096 tokens were split into multiple documents, while documents much shorter than 4096 tokens were merged using the [SEP] token (a rough sketch of this preprocessing follows the note below). The model was warm-started from the distilbert-fa checkpoint.

  • For more details, take a look at config.json on the model card in the 🤗 Model Hub.
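
The split/merge preprocessing described above could look roughly like the sketch below. This is an illustrative reconstruction, not the exact script used for pretraining; chunk_documents, the token-id lists, and the sep_id default are all assumptions:

def chunk_documents(token_lists, chunk_size=4096, sep_id=102):
    """Split documents longer than chunk_size and merge shorter ones with a [SEP] id."""
    chunks, buffer = [], []
    for tokens in token_lists:
        # documents longer than chunk_size become multiple documents
        while len(tokens) > chunk_size:
            chunks.append(tokens[:chunk_size])
            tokens = tokens[chunk_size:]
        # start a new chunk if the current document no longer fits in the buffer
        if buffer and len(buffer) + 1 + len(tokens) > chunk_size:
            chunks.append(buffer)
            buffer = []
        buffer = buffer + [sep_id] + tokens if buffer else list(tokens)
    if buffer:
        chunks.append(buffer)
    return chunks

# e.g. chunk_documents([[1, 2, 3], list(range(5000)), [7, 8]]) -> two chunks of token ids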

Fine Tuning Recommendations: 🐤

Due to the model's memory requirements, gradient_checkpointing and gradient_accumulation should be used to maintain a reasonable batch size. Since this model is not very large, it is a good idea to first fine-tune it on your dataset with the masked LM objective (also called intermediate fine-tuning) before training on the main task. In block_sparse mode, the cost does not grow with the input length: each token attends to only about 256 tokens. For sequence lengths up to 512, original_full attention should be used instead of block_sparse.
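
A minimal sketch of these settings with the 🤗 Trainer (the classification head, hyperparameters, and train_dataset are placeholders, not values from the authors):

from transformers import (AutoTokenizer, BigBirdForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BigBirdForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="parsbigbird-finetuned",
    per_device_train_batch_size=2,   # keep the per-device batch small
    gradient_accumulation_steps=8,   # effective batch size of 16
    gradient_checkpointing=True,     # trade extra compute for lower memory use
    learning_rate=3e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
    tokenizer=tokenizer,
)
trainer.train()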

Fine Tuning Examples 👷‍♂️ 👷‍♀️

| Dataset | Fine Tuning Example |
|---------|---------------------|
| Digikala Magazine | Text Classification |

Contact us: 🤝

If you have a technical question regarding the model, pretraining, code or publication, please create an issue in the repository. This is the fastest way to reach us.

Citation: ↩️

We have not published a paper on this work. However, if you use it in your research, please cite us properly with an entry like the one below.

@misc{ParsBigBird,
  author          = {Ayoubi, Sajjad},
  title           = {ParsBigBird: Persian Bert For Long-Range Sequences},
  year            = 2021,
  publisher       = {GitHub},
  journal         = {GitHub repository},
  howpublished    = {\url{https://github.com/SajjjadAyobi/ParsBigBird}},
}