Amazon Multilingual Counterfactual Dataset (AMCD)

Last update: Sep 20, 2022

Overview

This repository contains a dataset described in the paper:

I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Reviews. James O’Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, Danushka Bollegala. EMNLP'21. (An arXiv version is also available.)

The dataset contains sentences from Amazon customer reviews (sampled from the Amazon product review dataset), annotated for counterfactual detection (CFD) as a binary classification task. Counterfactual statements describe events that did not or cannot take place. They can often be identified by the form "If p were true, then q would be true", that is, assertions whose antecedent (p) and consequent (q) are known or assumed to be false.

The key features of this dataset are:

The dataset is multilingual and contains sentences in English, German, and Japanese.
The labeling was done by professional linguists to ensure high annotation quality.
The dataset is supplemented with the annotation guidelines and definitions, which were developed by professional linguists. We also provide the clue word lists, which contain words typical of counterfactual sentences and were used for initial data filtering. The clue word lists were also compiled by professional linguists.
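
To illustrate what clue-word-based initial filtering might look like, here is a minimal Python sketch. It is not the authors' pipeline: the clue phrases, the function names `build_clue_pattern` and `filter_candidates`, and the example sentences are all assumptions made for illustration. The real clue word lists are the ones shipped with the dataset.

```python
import re

# Hypothetical English clue phrases, for illustration only.
# The actual clue word lists are provided with the dataset.
CLUE_PHRASES = ["wish", "if only", "would have", "should have"]


def build_clue_pattern(phrases):
    """Compile a case-insensitive regex matching any clue phrase as whole words."""
    # Longer phrases first, so multi-word clues are tried before shorter ones.
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def filter_candidates(sentences, pattern):
    """Keep only sentences containing at least one clue phrase."""
    return [s for s in sentences if pattern.search(s)]


sentences = [
    "I wish I would have loved this one, but I didn't.",
    "The battery lasts all day.",
    "If only it came in a larger size.",
]

pattern = build_clue_pattern(CLUE_PHRASES)
candidates = filter_candidates(sentences, pattern)
```

A filter like this only selects candidate sentences. Whether a candidate is actually counterfactual is decided by the human annotation. Word-boundary matching with `\b` also assumes whitespace-delimited text, so a language such as Japanese would need a different matching strategy.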

Please see the paper for data statistics and a detailed description of data collection and annotation.

For the dataset format, please see README.txt.

Cite

If you use this dataset in your research, please cite the paper.

License Summary

The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.
