CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

Last update: Nov 30, 2022

This is the official repository for the code and models of the paper CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. If you use our dataset, code or any parts thereof, please cite this paper:

@misc{huber-etal-2021-ccqa,
  title={CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training}, 
  author={Patrick Huber and Armen Aghajanyan and Barlas Oğuz and Dmytro Okhonko and Wen-tau Yih and Sonal Gupta and Xilun Chen},
  year={2021},
  eprint={2110.07731},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Getting Common Crawl Snapshots

The Common Crawl project provides monthly web snapshots of new and updated websites in raw HTML format. Every monthly snapshot (~50-70TB) is further split into smaller WARC (Web ARChive) files. To download a single WARC file, go to the Common Crawl website for the respective month (e.g. May 2021) and download the WARC paths file. The WARC paths file contains a newline-separated list of paths to the actual files. Pick a path and prepend s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to obtain the complete URL. Once the file is downloaded, gunzip the archive, and a single Common Crawl web archive is ready to be processed.
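
The URL construction described above can be sketched in Python. This is a minimal sketch and not part of the repository: the function name warc_urls is an illustrative assumption, and only the two prefixes come from the text above.

```python
from typing import Iterable, List

# Prefixes named in the text above; pick whichever matches your download tool.
S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def warc_urls(paths: Iterable[str], prefix: str = HTTPS_PREFIX) -> List[str]:
    """Turn the lines of a WARC paths file into complete download URLs."""
    urls = []
    for line in paths:
        path = line.strip()
        if not path:  # skip blank lines, e.g. a trailing newline
            continue
        # Avoid a double slash if a path happens to start with "/".
        urls.append(prefix + path.lstrip("/"))
    return urls
```

Passing an open, already decompressed paths file to warc_urls should yield one URL per line, which can then be handed to whatever download tool you prefer.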

Dataset Generation

Dependencies

Below are the required dependencies to run the dataset generation, curation and model evaluations.

Rust
Rust packages: clap, html-escape, indicatif, kuchiki, rayon, regex, serde, serde_json, warc (see Cargo.toml file for versions)
Python 3.7.3
Python dependencies: fasttext language identification, fasttext==0.9.2, lxml==4.3.2

Processing Common Crawl data (Rust)

Build the cargo package with cargo build from within the rust folder
Run the script with cargo run <path/to/warc/file> <path/to/output/file.mhtml>
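
When several WARC files need processing, the cargo run invocation above can be generated per file. The following is a rough sketch, not part of the repository: the helper name build_commands and the output naming scheme (WARC file name plus an .mhtml suffix) are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def build_commands(warc_files: Iterable[str], out_dir: str) -> List[List[str]]:
    """Build one `cargo run <warc> <output.mhtml>` argument list per WARC file."""
    commands = []
    for warc in warc_files:
        # Hypothetical naming scheme: reuse the WARC file name, add ".mhtml".
        out_file = Path(out_dir) / (Path(warc).name + ".mhtml")
        commands.append(["cargo", "run", str(warc), str(out_file)])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, cwd="rust", check=True). Since the command then runs from within the rust folder, absolute input and output paths are the safer choice.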

Curating the minified HTML data (Python)

To generate JSON objects for every webpage in the minified HTML, run:

python mhtml_to_json.py <path/to/fasttext/lid.176.bin> <path/to/mhtml/file> <path/to/output/file>

Aggregating datapoints to remove duplicate URL entries (Python)

As mentioned in the paper, we use the original dataset for our in-domain pre-training experiments. However, we also provide a cleaned version of the dataset, aggregating same-URL duplicates into a single object. To run the datapoint aggregation script, execute:

python json_duplicate_filter.py <path/to/json/file> <path/to/output/file>
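
To illustrate what aggregating same-URL duplicates into a single object can look like, here is a minimal sketch. It is not the logic of json_duplicate_filter.py, which is not described here: the key names "URI" and "Questions", the function name aggregate_by_url, and the merge policy (keep the first object's fields, append unseen items) are all assumptions.

```python
from typing import Dict, Iterable, List


def aggregate_by_url(pages: Iterable[dict], url_key: str = "URI",
                     items_key: str = "Questions") -> List[dict]:
    """Merge objects that share a URL, keeping the first object's other fields."""
    merged: Dict[str, dict] = {}
    for page in pages:
        url = page[url_key]
        if url not in merged:
            # Copy the item list so the caller's objects are not mutated.
            merged[url] = {**page, items_key: list(page.get(items_key, []))}
        else:
            for item in page.get(items_key, []):
                if item not in merged[url][items_key]:  # drop exact repeats
                    merged[url][items_key].append(item)
    # Dicts preserve insertion order, so URLs come out in first-seen order.
    return list(merged.values())
```

The linear membership check keeps the sketch short; for very large pages a hash-based check on a serialized form of each item would scale better.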

Converting json dataset into closed-book and passage retrieval formats (Python)

To train closed-book (sequence-to-sequence) and passage retrieval (DPR) models on the CCQA dataset, the corpus needs further processing.

Closed-book processing

To prepare the dataset for closed-book question-answering training, run:

python closed_book_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>
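
A closed-book model maps a question directly to an answer string, so the training data reduces to (question, answer) pairs. The sketch below shows that flattening under stated assumptions: the key names "Questions", "name", "Answers" and "text" are hypothetical stand-ins, and the actual output format of closed_book_processing.py may differ.

```python
from typing import Iterator, Tuple


def closed_book_pairs(page: dict) -> Iterator[Tuple[str, str]]:
    """Yield one (question, answer) pair per non-empty answer on a page."""
    for question in page.get("Questions", []):  # hypothetical key
        q_text = question.get("name", "").strip()  # hypothetical key
        if not q_text:
            continue  # a question without text cannot be a training source
        for answer in question.get("Answers", []):  # hypothetical key
            a_text = answer.get("text", "").strip()  # hypothetical key
            if a_text:
                yield q_text, a_text
```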

Passage retrieval (DPR) processing

To prepare the dataset for passage retrieval (DPR) training, run:

python passage_retrieval_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>
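
For orientation, a single retrieval training record might be assembled as sketched below. The field names follow the training-file layout used by the original DPR codebase; treating every answer as a positive passage, leaving the negatives empty, and the function name dpr_record are assumptions, and passage_retrieval_processing.py may produce something different.

```python
from typing import List


def dpr_record(question: str, answers: List[str], title: str = "") -> dict:
    """Build a DPR-style training record with each answer as a positive passage."""
    return {
        "question": question,
        "answers": list(answers),
        # Assumption: answers double as the positive retrieval passages.
        "positive_ctxs": [{"title": title, "text": a} for a in answers],
        # Assumption: no explicit negatives are provided in this sketch.
        "negative_ctxs": [],
        "hard_negative_ctxs": [],
    }
```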

CCQA In-Domain Pre-Trained Model Checkpoints

The BART and T5 checkpoints are Hugging Face Transformers models, tested with transformers version 4.8.2.

The DPR model checkpoint can be downloaded for the original DPR codebase or the DPR v2 codebase.

LICENSE

The majority of CCQA is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: crowbook-text-processing is licensed under the MPL-2.0 license.

Owner

Meta Research
