Bpe algorithm can finetune tokenizer

Last update: Feb 02, 2022

Overview

bpe_algorithm_can_finetune_tokenizer

This is an implementation for https://github.com/huggingface/transformers/issues/15153

I just added a few dozen lines of code to the py_bpe algorithm. The main addition is the function finetune_tokenizer.

Details can be seen in example.py; it is actually very simple. The official Python tokenizer library is written in Rust. I am learning it and hope to provide a Rust version of this code.

P.S. the_factor_of_new_added_token_divided_unk_number is the only parameter you need to set. I hope to find an algorithm that sets it automatically.
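To give a rough idea of what such a parameter could control, here is a minimal sketch. It is not the code from example.py: the function names (finetune_vocab_sketch, most_frequent_pair, merge_pair), the definition of an "unknown" word, and the reading of the factor as "new tokens = factor * number of unknowns" are all assumptions made for illustration only.

```python
from collections import Counter


def most_frequent_pair(segmented):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in segmented.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(segmented, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = Counter()
    for symbols, freq in segmented.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def finetune_vocab_sketch(vocab, words, factor):
    """Hypothetical helper: grow `vocab` with BPE merges learned from `words`.

    Assumption: a word counts as "unknown" when it is not a whole token in
    `vocab`, and the number of new tokens is capped at factor * unknown count.
    """
    unknown = [w for w in words if w not in vocab]
    budget = int(factor * len(unknown))
    # Start from character-level segmentation of the unknown words.
    segmented = Counter(tuple(w) for w in unknown)
    new_vocab, added = set(vocab), []
    while len(added) < budget:
        pair = most_frequent_pair(segmented)
        if pair is None:  # nothing left to merge
            break
        segmented = merge_pair(segmented, pair)
        token = pair[0] + pair[1]
        if token not in new_vocab:
            new_vocab.add(token)
            added.append(token)
    return new_vocab, added


if __name__ == "__main__":
    base_vocab = set("abcdefghijklmnopqrstuvwxyz")
    corpus = ["lower", "lower", "lowest", "low"]
    _, new_tokens = finetune_vocab_sketch(base_vocab, corpus, factor=0.5)
    print(new_tokens)
```

In this sketch a larger factor simply allows more merges to be added for the same number of unknown words, which is why the value has to be tuned by hand.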

Owner

张博

I am a Chinese coder who shares code and notes for some machine learning and math books.

GitHub Repository

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

82 Dec 19, 2022

Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

45 Nov 29, 2022

Code from the paper "High-Performance Brain-to-Text Communication via Handwriting"

305 Dec 22, 2022

Implementation of legal QA system based on SentenceKoBART

LegalQA using SentenceKoBART Implementation of legal QA system based on SentenceKoBART How to train SentenceKoBART Based on Neural Search Engine Jina

75 Dec 27, 2022

Easy to start. Use deep neural network to predict the sentiment of movie review.

Easy to start. Use deep neural network to predict the sentiment of movie review. Various methods, word2vec, tf-idf and df to generate text vectors. Various models including lstm and cov1d. Achieve f1

1 Nov 19, 2021

Transformers implementation for Fall 2021 Clinic

Installation: Download miniconda3 if not already installed. You can check by typing conda in a command prompt. Use conda to create an environment

1 Oct 28, 2021

source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

49 Dec 17, 2022

🤖 Basic Financial Chatbot with handoff ability built with Rasa

Financial Services Example Bot This is an example chatbot demonstrating how to build AI assistants for financial services and banking with Rasa. It in

4 Aug 10, 2022

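Namuwiki dataset split into sentences. Download it from Releases or through tfds-korean.

Namuwiki corpus: a Namuwiki corpus pre-split into sentences. Because the dataset is intended for use in LMs and similar models, links, images, tables and so on have been cut out. Sentence splitting was done with kss. The license is, as stated on Namuwiki, CC BY-NC-SA 2.0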

16 Apr 02, 2022

DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

440 Dec 18, 2022

Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

55 Nov 17, 2022

Calibre recipe to convert latest issue of Analyse & Kritik into an ebook

Mednlp - Medical natural language parsing and utility library

PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Code of paper: A Recurrent Vision-and-Language BERT for Navigation

Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Telegram AI chat bot written in Python using Pyrogram

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Klexikon: A German Dataset for Joint Summarization and Simplification

👄 The most accurate natural language detection library for Python, suitable for long and short text alike