2021 Haihua AI Challenge · Chinese Reading Comprehension · Technical Track · Third Place

Overview

Haihua Chinese Reading Comprehension Competition

Team: ATTOY; rank: third place

Competition background

https://www.biendata.xyz/competition/haihua_2021

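Writing is the most basic tool people use to record and express ideas, and an important medium for spreading information. Through text and symbols we can trace the origins of human civilization and pass on knowledge and experience; being able to read is the first step toward knowing and understanding. For artificial intelligence, one of the core problems is cognition, and the core of cognition is semantic understanding.

Machine reading comprehension (MRC) is a frontier topic in natural language processing and artificial intelligence. It matters for giving machines cognitive ability and raising their level of intelligence, and it has broad application prospects. In MRC, a machine reads a text and then answers questions about it, which reflects an AI system's ability to acquire, understand, and mine textual information. Its practical value is increasingly visible in dialogue, search, question answering, and simultaneous interpretation, and the long-term goal is to provide solutions across industries.

The 2021 Haihua AI Challenge · Chinese Reading Comprehension was co-hosted by the Zhongguancun Haihua Institute for Frontier Information Technology and the Institute for Interdisciplinary Information Sciences at Tsinghua University, with Tencent Cloud as co-organizer. The question bank contains 16,000 items, the total prize pool is 300,000 yuan, and Tencent Cloud provides exclusive compute support for the secondary-school track.

The data come from Chinese-language reading comprehension question banks for primary school and for the senior high school and college entrance exams (zhongkao and gaokao): the technical track mainly uses zhongkao and gaokao questions, and the secondary-school track mainly uses primary-school questions. Compared with English, Chinese reading comprehension involves more ambiguity and polysemy. Yet Chinese civilization has endured for thousands of years thanks to the people of each era who studied diligently and kept its heritage alive, and that is both the appeal and the challenge of this competition: letting machines read text and learn from civilization. In keeping with their original aim of cultivating talent, the organizers retained two parallel tracks, one for secondary-school students and one for technical participants.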

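Task

In the technical track, the data come from the zhongkao and gaokao Chinese reading comprehension question bank. Each item consists of a passage, at least one question, and several candidate options. Participants build a model that selects the correct option from the candidates.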
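As a rough illustration of the task format, the sketch below flattens one item into (passage, question + option) pairs, the usual input shape for a multiple-choice model. The record layout and the key names (`passage`, `questions`, `options`, `answer`) are assumptions made for this example; the actual competition files may be organized differently.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout; the real competition data may use other keys.
sample = {
    "passage": "春天来了,河边的柳树发芽了。",
    "questions": [
        {
            "question": "文中描写的是哪个季节?",
            "options": ["A. 春天", "B. 夏天", "C. 秋天", "D. 冬天"],
            "answer": "A",
        }
    ],
}


def to_choice_examples(record: Dict) -> List[Tuple[List[Tuple[str, str]], int]]:
    """Turn one passage record into (pairs, label) items, one per question.

    Each pair is (passage, question + option). A multiple-choice model scores
    every pair, and the highest-scoring option is taken as the prediction.
    """
    examples = []
    for q in record["questions"]:
        pairs = [(record["passage"], q["question"] + " " + opt) for opt in q["options"]]
        label = ord(q["answer"]) - ord("A")  # "A" -> 0, "B" -> 1, ...
        examples.append((pairs, label))
    return examples


examples = to_choice_examples(sample)
```

Each pair would then be tokenized and truncated to the configured maximum length before being passed to the encoder.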

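2021 Haihua AI Challenge · Chinese Reading Comprehension · Technical Track: third-place solution (team ATTOY)

Approach

1. Pretrained model: MacBERT-Large

2. Adversarial training

FreeLB (ICLR 2020)

3. Knowledge distillation

Born-Again Neural Networks (ICML 2018)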
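To make the distillation step more concrete, here is a minimal pure-Python sketch of a born-again style loss, in which a student with the same architecture as the teacher is trained on the teacher's output distribution together with the ground-truth label. The weighting `alpha`, the `temperature`, and the function names are assumptions for illustration; the repository's actual loss may be defined differently.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits: List[float], teacher_logits: List[float],
                 label: int, alpha: float = 0.5, temperature: float = 1.0) -> float:
    """alpha * cross-entropy(label) + (1 - alpha) * KL(teacher || student).

    alpha and temperature are hypothetical knobs, not values from the repo.
    """
    # Hard-label term: cross-entropy of the student against the true option.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence from the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce + (1 - alpha) * kl


# Four options, correct answer at index 0.
loss = distill_loss([2.0, 0.5, -1.0, 0.0], [3.0, 0.0, -2.0, 0.5], label=0)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains.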

Requirements

tqdm==4.50.2
numpy==1.19.2
pandas==1.1.3
transformers==3.5.1
torch==1.7.0+cu110
scikit_learn==0.24.2

How to run

bash bash.sh

Hyperparameters

FreeLB training configuration

'fold_num': 4, 
'seed': 42,
'model': 'hfl/chinese-macbert-large', 
'max_len': 512, 
'epochs': 12,
'train_bs': 4, 
'valid_bs': 4,
'lr': 2e-5,  
'lrSelf': 1e-4,  
'accum_iter': 8, 
'weight_decay': 1e-4, 
'adv_lr': 0.01,
'adv_norm_type': 'l2',
'adv_init_mag': 0.03,
'adv_max_norm': 1.0,
'ip': 2
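
The adversarial settings above (`adv_lr`, `adv_norm_type`, `adv_init_mag`, `adv_max_norm`) correspond to the knobs of a FreeLB-style perturbation update. The sketch below shows that update on a toy logistic model in pure Python, with a plain list standing in for the embedding tensor. It is an illustration under assumptions, not the repository's training loop: `freelb_step` and `adv_steps` are hypothetical names, and no mapping to the `ip` setting is implied.

```python
import math
import random
from typing import List, Optional, Tuple


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(a * a for a in v))


def freelb_step(w: List[float], x: List[float], y: int, adv_steps: int = 2,
                adv_lr: float = 0.01, adv_init_mag: float = 0.03,
                adv_max_norm: float = 1.0,
                rng: Optional[random.Random] = None) -> Tuple[List[float], List[float]]:
    """One FreeLB-style update for the loss log(1 + exp(-y * w.(x + delta))).

    Returns the weight gradient averaged over the ascent steps and the final
    perturbation. y must be +1 or -1.
    """
    rng = rng or random.Random(42)
    dim = len(x)
    # Random start inside a small ball around the clean input.
    delta = [rng.uniform(-1, 1) * adv_init_mag / math.sqrt(dim) for _ in range(dim)]
    grad_w = [0.0] * dim
    for _ in range(adv_steps):
        x_adv = [a + d for a, d in zip(x, delta)]
        z = sum(wi * xi for wi, xi in zip(w, x_adv))
        dz = -y / (1.0 + math.exp(y * z))  # d loss / d z
        # Accumulate the parameter gradient, averaged over the ascent steps.
        grad_w = [g + dz * xi / adv_steps for g, xi in zip(grad_w, x_adv)]
        # Gradient ascent on delta along the L2-normalized input gradient.
        g_delta = [dz * wi for wi in w]
        denom = l2_norm(g_delta) or 1.0
        delta = [d + adv_lr * g / denom for d, g in zip(delta, g_delta)]
        # Project delta back onto the L2 ball of radius adv_max_norm.
        n = l2_norm(delta)
        if n > adv_max_norm:
            delta = [d * adv_max_norm / n for d in delta]
    return grad_w, delta


grad_w, delta = freelb_step([0.5, -0.25, 1.0], [1.0, 2.0, -1.0], y=1)
```

In an actual model the perturbation would be added to the word embeddings and the gradients would come from backpropagation; the toy model only keeps the arithmetic of the update visible.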

EKD training configuration

'fold_num': 4, 
'seed': 42,
'model': 'hfl/chinese-macbert-large', 
'max_len': 256, 
'epochs': 12,
'train_bs': 4, 
'valid_bs': 4,
'lr': 2e-5,  
'lrSelf': 1e-4,  
'accum_iter': 8, 
'weight_decay': 1e-4, 
'adv_lr': 0.01,
'adv_norm_type': 'l2',
'adv_init_mag': 0.03,
'adv_max_norm': 1.0,
'ip': 2
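
The two configurations are identical except for `max_len` (512 for FreeLB, 256 for EKD). The sketch below works out two quantities they share, assuming that `accum_iter` is the number of mini-batches accumulated before each optimizer step and that `fold_num` is the number of cross-validation folds; both readings are assumptions, and `kfold_indices` is a hypothetical helper rather than code from the repository.

```python
import random
from typing import List, Tuple

# Values copied from the configurations above.
cfg = {"fold_num": 4, "seed": 42, "train_bs": 4, "accum_iter": 8}

# Assumption: gradients from accum_iter mini-batches are summed per step,
# so each optimizer update sees train_bs * accum_iter samples.
effective_batch = cfg["train_bs"] * cfg["accum_iter"]


def kfold_indices(n_samples: int, fold_num: int, seed: int) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices with a fixed seed and return (train, valid) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::fold_num] for i in range(fold_num)]
    splits = []
    for i, valid in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train, valid))
    return splits


splits = kfold_indices(10, cfg["fold_num"], cfg["seed"])
```

Under these assumptions each update uses 32 samples, and every sample appears in exactly one validation fold.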