Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia

Last update: Aug 26, 2022

Indobenchmark Toolkit

Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia, built in collaboration with institutions such as Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Research Paper

IndoNLU has been accepted by AACL-IJCNLP 2020; the details are in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU in your work, including Indo4B, FastText-Indo4B, or IndoBERT, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

IndoNLG has been accepted by EMNLP 2021; the details are in our paper: https://arxiv.org/abs/2104.08200. If you use any component of IndoNLG in your work, including Indo4B-Plus, IndoBART, or IndoGPT, please cite the following paper:

@misc{cahyawijaya2021indonlg,
      title={IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation}, 
      author={Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Leylia Khodra and Ayu Purwarianti and Pascale Fung},
      year={2021},
      eprint={2104.08200},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pretrained language models [Link]:

IndoBERT-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-large
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-large
- Phase 1 [Link]
- Phase 2 [Link]
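
The sketch below shows how one of the eight checkpoints listed above might be loaded with the Hugging Face transformers library. The hub identifiers (of the form indobenchmark/indobert-base-p1) are an assumption, since the link targets are not reproduced here, and model_id and load_indobert are hypothetical helper names used only for illustration.

```python
# Assumed naming scheme for the Hugging Face Hub repositories; verify it
# against the links above before relying on it.
VARIANTS = ("base", "large", "lite-base", "lite-large")
PHASES = (1, 2)


def model_id(variant: str, phase: int) -> str:
    """Build an assumed hub id such as 'indobenchmark/indobert-lite-base-p2'."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    return f"indobenchmark/indobert-{variant}-p{phase}"


def load_indobert(variant: str = "base", phase: int = 1):
    """Download a tokenizer and encoder (needs transformers, torch, and network access)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModel, BertTokenizer

    name = model_id(variant, phase)
    tokenizer = BertTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    return tokenizer, model
```

Calling load_indobert("lite-base", 2) would then fetch the phase 2 IndoBERT-lite-base checkpoint, provided the assumed identifier is correct.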

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

FastText model (11.9 GB) [Link]
Vector file (3.9 GB) [Link]
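
Because the vector file is several gigabytes, it can be useful to read only the first rows. The sketch below assumes the vector file uses the standard fastText .vec text layout (a header line with the word count and dimension, then one word and its values per line); this is not confirmed by the text above. read_vec is a hypothetical helper, and the sample data is made up.

```python
import io
from typing import Dict, List, TextIO


def read_vec(handle: TextIO, limit: int = 10000) -> Dict[str, List[float]]:
    """Read up to `limit` word vectors from a fastText-style .vec text stream."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if len(vectors) >= min(limit, count):
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory stand-in for the real file (made-up numbers).
sample = io.StringIO("2 3\nsaya 0.1 0.2 0.3\nmakan 0.4 0.5 0.6\n")
vectors = read_vec(sample)
print(sorted(vectors))
```

For the real file, the same function can be given a handle opened with open(path, encoding="utf-8"), where path points to the downloaded vector file.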

We also provide smaller FastText models with a reduced vocabulary for each of the 12 downstream tasks:

FastText-Indo4B [Link]
FastText-CC-ID [Link]

IndoBART and IndoGPT Models

We provide IndoBART and IndoGPT pretrained language models [Link]:

IndoBART [Link]
IndoBART-v2 [Link]
IndoGPT2 [Link]
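
A rough sketch of how these generation checkpoints might be loaded together with the toolkit's IndoNLGTokenizer. The hub identifiers, the split into sequence-to-sequence and causal model classes, and the helper name load_indonlg are all assumptions made for illustration; check them against the links above.

```python
# Assumed hub ids and model families; not confirmed by the text above.
NLG_CHECKPOINTS = {
    "indobart": ("indobenchmark/indobart", "seq2seq"),
    "indobart-v2": ("indobenchmark/indobart-v2", "seq2seq"),
    "indogpt": ("indobenchmark/indogpt", "causal"),
}


def load_indonlg(name: str):
    """Return (tokenizer, model) for one of the assumed checkpoints above."""
    if name not in NLG_CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {name!r}")
    hub_id, family = NLG_CHECKPOINTS[name]

    # Imported lazily: needs indobenchmark-toolkit, transformers, torch, and network access.
    from indobenchmark import IndoNLGTokenizer
    from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

    tokenizer = IndoNLGTokenizer.from_pretrained(hub_id)
    model_cls = AutoModelForSeq2SeqLM if family == "seq2seq" else AutoModelForCausalLM
    return tokenizer, model_cls.from_pretrained(hub_id)
```

The Auto classes are used here so that the concrete architecture is taken from each checkpoint's own configuration rather than hard-coded.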

Releases (v0.1.4)

v0.1.4 (Jun 22, 2022)
- Fix spacing between subwords when decoding with IndoNLGTokenizer
- Remove the unused additional special tokens '[java]', '[sunda]', and '[indonesia]' from IndoNLGTokenizer (language tokens are included in special_tokens_to_ids instead)

Owner

Samuel Cahyawijaya
