IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

Overview

IndoBERTweet 🐦 🇮🇩

1. Paper

Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).

2. About

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.

3. Pretraining Data

We crawl Indonesian tweets over a 1-year period using the official Twitter API, from December 2019 to December 2020, with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain in total of 409M word tokens, two times larger than the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to public.

4. How to use

Load model and tokenizer (tested with transformers==3.5.1)

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")

Preprocessing Steps:

  • lower-case all words
  • converting user mentions and URLs into @USER and HTTPURL, respectively
  • translating emoticons into text using the emoji package.

5. Results over 7 Indonesian Twitter Datasets

Models Sentiment Emotion Hate Speech NER Average
IndoLEM SmSA EmoT HS1 HS2 Formal Informal
mBERT 76.6 84.7 67.5 85.1 75.1 85.2 83.2 79.6
malayBERT 82.0 84.1 74.2 85.0 81.9 81.9 81.3 81.5
IndoBERT (Willie, et al., 2020) 84.1 88.7 73.3 86.8 80.4 86.3 84.3 83.4
IndoBERT (Koto, et al., 2020) 84.1 87.9 71.0 86.4 79.3 88.0 86.9 83.4
IndoBERTweet (1M steps from scratch) 86.2 90.4 76.0 88.8 87.5 88.1 85.4 86.1
IndoBERT + Voc adaptation + 200k steps 86.6 92.7 79.0 88.4 84.0 87.7 86.9 86.5
Owner
IndoLEM
"Indonesian Language Evaluation Montage”, a comprehensive dataset encompassing spanning morpho-syntax, semantics, and discourse for Indonesian NLP.
IndoLEM
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
CoSENT、STS、SentenceBERT

CoSENT_Pytorch 比Sentence-BERT更有效的句向量方案

102 Dec 07, 2022
HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

HiFi DeepVariant + WhatsHap workflow Workflow steps align HiFi reads to reference with pbmm2 call small variants with DeepVariant, using two-pass meth

William Rowell 2 May 14, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
Write Alphabet, Words and Sentences with your eyes.

The-Next-Gen-AI-Eye-Writer The Eye tracking Technique has become one of the most popular techniques within the human and computer interaction era, thi

Rohan Kasabe 2 Apr 05, 2022
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
中文医疗信息处理基准CBLUE: A Chinese Biomedical LanguageUnderstanding Evaluation Benchmark

English | 中文说明 CBLUE AI (Artificial Intelligence) is playing an indispensabe role in the biomedical field, helping improve medical technology. For fur

452 Dec 30, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
Score-Based Point Cloud Denoising (ICCV'21)

Score-Based Point Cloud Denoising (ICCV'21) [Paper] https://arxiv.org/abs/2107.10981 Installation Recommended Environment The code has been tested in

Shitong Luo 79 Dec 26, 2022
This simple Python program calculates a love score based on your and your crush's full names in English

This simple Python program calculates a love score based on your and your crush's full names in English. There is no logic or reason in the calculation behind the love score. The calculation could ha

p.katekomol 1 Jan 24, 2022
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
Easy, fast, effective, and automatic g-code compression!

Getting to the meat of g-code. Easy, fast, effective, and automatic g-code compression! MeatPack nearly doubles the effective data rate of a standard

Scott Mudge 97 Nov 21, 2022
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

1 Oct 14, 2021
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022