KoBART model on huggingface transformers

Overview

KoBART-Transformers

  • SKT에서 공개한 KoBART를 편리하게 사용할 수 있게 transformers로 포팅하였습니다.

Install (Optional)

  • BartModelPreTrainedTokenizerFast를 이용하면 설치하실 필요 없습니다.
pip install kobart-transformers

Tokenizer

  • PreTrainedTokenizerFast를 이용하여 구현되었습니다.
  • PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")와 동일합니다.
>>> from kobart_transformers import get_kobart_tokenizer
>>> # from transformers import PreTrainedTokenizerFast

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> # kobart_tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']

Model

  • BartModel을 이용하여 구현되었습니다.
  • BartModel.from_pretrained("hyunwoongko/kobart")와 동일합니다.
>>> from kobart_transformers import get_kobart_model, get_kobart_tokenizer
>>> # from transformers import BartModel

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_model()
>>> # model = BartModel.from_pretrained("hyunwoongko/kobart")

>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<TransposeBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<TransposeBackward0>), encoder_hidden_states=None, encoder_attentions=None)

For Seq2Seq Training

  • seq2seq 학습시에는 아래와 같이 get_kobart_for_conditional_generation()을 이용합니다.
  • BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")와 동일합니다.
>>> from kobart_transformers import get_kobart_for_conditional_generation
>>> # from transformers import BartForConditionalGeneration

>>> model = get_kobart_for_conditional_generation()
>>> # model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")

Updates Notes

version 0.1

  • pad 토큰이 설정되지 않은 에러를 해결하였습니다.
from kobart import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer(["한국어", "BART 모델을", "소개합니다."], truncation=True, padding=True)
{
'input_ids': [[28324, 3, 3, 3, 3], [15085, 264, 281, 283, 24224], [15630, 20357, 3, 3, 3]], 
'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 
'attention_mask': [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
}

version 0.1.3

  • get_kobart_for_conditional_generation()__init__.py에 등록하였습니다.

version 0.1.4

  • 누락되었던 special_tokens_map.json을 추가하였습니다.
  • 이제 pip install 없이 KoBART를 이용할 수 있습니다.
  • thanks to bernardscumm

Reference

Owner
Hyunwoong Ko
be a creator not a programmer.
Hyunwoong Ko
本插件是pcrjjc插件的重置版,可以独立于后端api运行

pcrjjc2 本插件是pcrjjc重置版,不需要使用其他后端api,但是需要自行配置客户端 本项目基于AGPL v3协议开源,由于项目特殊性,禁止基于本项目的任何商业行为 配置方法 环境需求:.net framework 4.5及以上 jre8 别忘了装jre8 别忘了装jre8 别忘了装jre8

132 Dec 26, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Maksim Terpilowski 49 Dec 30, 2022
Reading Wikipedia to Answer Open-Domain Questions

DrQA This is a PyTorch implementation of the DrQA system described in the ACL 2017 paper Reading Wikipedia to Answer Open-Domain Questions. Quick Link

Facebook Research 4.3k Jan 01, 2023
A simple word search made in python

Word Search Puzzle A simple word search made in python Usage $ python3 main.py -h usage: main.py [-h] [-c] [-f FILE] Generates a word s

Magoninho 16 Mar 10, 2022
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
ReCoin - Restoring our environment and businesses in parallel

Shashank Ojha, Sabrina Button, Abdellah Ghassel, Joshua Gonzales "Reduce Reuse R

sabrina button 1 Mar 14, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Dec 31, 2022
Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Workshop: Enterprise-Scale NLP with Hugging Face & Amazon SageMaker Earlier this year we announced a strategic collaboration with Amazon to make it ea

Philipp Schmid 161 Dec 16, 2022
An evaluation toolkit for voice conversion models.

Voice-conversion-evaluation An evaluation toolkit for voice conversion models. Sample test pair Generate the metadata for evaluating models. The direc

30 Aug 29, 2022
Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

Seq2Seq Speech in JAX A JAX/Flax repository for combining a pre-trained speech encoder model (e.g. Wav2Vec2, HuBERT, WavLM) with a pre-trained text de

Sanchit Gandhi 21 Dec 14, 2022
Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

anlp21 Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley) Syllabus: http://people.ischool.berkeley.edu/~dba

David Bamman 48 Dec 06, 2022
Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

9 Jan 08, 2023
Pytorch version of BERT-whitening

BERT-whitening This is the Pytorch implementation of "Whitening Sentence Representations for Better Semantics and Faster Retrieval". BERT-whitening is

Weijie Liu 255 Dec 27, 2022
Uses Google's gTTS module to easily create robo text readin' on command.

Tool to convert text to speech, creating files for later use. TTRS uses Google's gTTS module to easily create robo text readin' on command.

0 Jun 20, 2021
a chinese segment base on crf

Genius Genius是一个开源的python中文分词组件,采用 CRF(Conditional Random Field)条件随机场算法。 Feature 支持python2.x、python3.x以及pypy2.x。 支持简单的pinyin分词 支持用户自定义break 支持用户自定义合并词

duanhongyi 237 Nov 04, 2022
PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit.

PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for t

922 Dec 31, 2022
Repositório do trabalho de introdução a NLP

Trabalho da disciplina de BI NLP Repositório do trabalho da disciplina Introdução a Processamento de Linguagem Natural da pós BI-Master da PUC-RIO. Eq

Leonardo Lins 1 Jan 18, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context This Repository contains the code on AVA of our ACM MM 2021 paper: LSTC: Boosting

Tencent YouTu Research 9 Oct 11, 2022