A crowdsourced dataset of dialogues grounded in social contexts that involve the use of commonsense.

Last update: Dec 20, 2022

Overview

Commonsense-Dialogues Dataset

We present Commonsense-Dialogues, a crowdsourced dataset of ~11K dialogues grounded in social contexts that involve the use of commonsense. The social contexts were sourced from the train split of the SocialIQA dataset, a multiple-choice question-answering benchmark for social commonsense reasoning.

To collect the Commonsense-Dialogues dataset, each Turker was presented with a social context and asked to write a dialogue of 4-6 turns between two people based on the event(s) described in the context. The Turker was asked to alternate between two roles: an individual referenced in the context and a third-party friend. Two example dialogues follow:

    "1": {  # dialogue_id
        "context": "Sydney met Carson's mother for the first time last week. He liked her.",   # multiple individuals in the context: Sydney and Carson
        "speaker": "Sydney",   # role 1 = Sydney, role 2 = a third-person friend of Sydney
        "turns": [
            "I met Carson's mother last week for the first time.",
            "How was she?",
            "She turned out to be really nice. I like her.",
            "That's good to hear.",
            "It is, especially since Carson and I are getting serious.",
            "Well, at least you'll like your in-law if you guys get married."
        ]
    }

    "2": {
        "context": "Kendall had a party at Jordan's house but was found out to not have asked and just broke in.",
        "speaker": "Kendall",
        "turns": [
            "Did you hear about my party this weekend at Jordan\u2019s house?",
            "I heard it was amazing, but that you broke in.",
            "That was a misunderstanding, I had permission to be there.",
            "Who gave you permission?",
            "I talked to Jordan about it months ago before he left town to go to school, but he forgot to tell his roommates about it.",
            "Ok cool, I hope everything gets resolved."
        ]
    }
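The record layout shown above can be read programmatically. The following is a minimal sketch, not part of the repo: `label_turns` and the `"Friend"` label are hypothetical names, and the code assumes that the person named in `"speaker"` takes the first turn and that the two roles strictly alternate, which matches the two examples but is not stated as a guarantee.

```python
from typing import Dict, List, Tuple


def label_turns(record: Dict, friend_label: str = "Friend") -> List[Tuple[str, str]]:
    """Pair each turn with a speaker label.

    Assumption: the named speaker talks first and the roles alternate,
    so even-indexed turns belong to record["speaker"].
    """
    first = record["speaker"]
    return [
        (first if i % 2 == 0 else friend_label, turn)
        for i, turn in enumerate(record["turns"])
    ]


# Record copied from dialogue "1" above.
example = {
    "context": "Sydney met Carson's mother for the first time last week. He liked her.",
    "speaker": "Sydney",
    "turns": [
        "I met Carson's mother last week for the first time.",
        "How was she?",
        "She turned out to be really nice. I like her.",
        "That's good to hear.",
        "It is, especially since Carson and I are getting serious.",
        "Well, at least you'll like your in-law if you guys get married.",
    ],
}

for who, text in label_turns(example):
    print(f"{who}: {text}")
```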

The data can be found in the /data directory of this repo. train.json has ~9K dialogues; valid.json and test.json have ~1K dialogues each. Because all the contexts were sourced from the train split of SocialIQA, the two datasets overlap in content, so any multi-task training or evaluation that combines Commonsense-Dialogues with SocialIQA must be done with caution to ensure fair and accurate conclusions.
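Loading the splits might look like the sketch below. It assumes each file is a single JSON object keyed by dialogue id, as in the examples above, and that the script runs from the repo root so that `data/train.json` resolves; `load_split` is a hypothetical helper name, not something the repo provides.

```python
import json
from pathlib import Path
from typing import Dict


def load_split(path) -> Dict[str, dict]:
    """Read one split file; assumed to be a JSON object keyed by dialogue id."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


data_dir = Path("data")  # assumed location relative to the repo root
for name in ("train", "valid", "test"):
    split_path = data_dir / f"{name}.json"
    if split_path.exists():  # skip quietly when run outside the repo
        dialogues = load_split(split_path)
        print(f"{name}: {len(dialogues)} dialogues")
```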

Some statistics about the data are provided below:

| Stat | Train | Valid | Test |
|---|---|---|---|
| # of dialogues | 9058 | 1157 | 1158 |
| average # of turns in a dialogue | 5.72 | 5.72 | 5.71 |
| average # of words in a turn | 12.4 | 12.4 | 12.2 |
| # of distinct SocialIQA contexts used | 3672 | 483 | 473 |
| average # of dialogues for a SocialIQA context | 2.46 | 2.395 | 2.45 |
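Statistics of this kind can be recomputed from a loaded split roughly as follows. This is a sketch under stated assumptions: `split_stats` is a hypothetical helper, words are counted by whitespace splitting (the tokenization behind the reported numbers is not documented here, so word averages may differ slightly), and the input is assumed to be a non-empty mapping of dialogue id to record.

```python
from typing import Dict


def split_stats(dialogues: Dict[str, dict]) -> Dict[str, float]:
    """Summarize one split; expects a non-empty {dialogue_id: record} mapping."""
    all_turns = [turn for record in dialogues.values() for turn in record["turns"]]
    contexts = {record["context"] for record in dialogues.values()}
    n = len(dialogues)
    return {
        "dialogues": n,
        "avg_turns_per_dialogue": len(all_turns) / n,
        # Whitespace tokenization is an assumption, not the authors' method.
        "avg_words_per_turn": sum(len(t.split()) for t in all_turns) / len(all_turns),
        "distinct_contexts": len(contexts),
        "avg_dialogues_per_context": n / len(contexts),
    }


# Toy data for illustration only; not taken from the dataset.
toy = {
    "1": {"context": "c1", "speaker": "A", "turns": ["a b", "c"]},
    "2": {"context": "c1", "speaker": "A", "turns": ["d e f", "g"]},
    "3": {"context": "c2", "speaker": "B", "turns": ["h", "i j"]},
}
print(split_stats(toy))
```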

Security

See CONTRIBUTING for more information.

License

This repository is licensed under the CC-BY-NC 4.0 License.

Citation

If you use this dataset, please cite the following paper:

@inproceedings{zhou-etal-2021-commonsense,
    title = "Commonsense-Focused Dialogues for Response Generation: An Empirical Study",
    author = "Zhou, Pei  and
      Gopalakrishnan, Karthik  and
      Hedayatnia, Behnam  and
      Kim, Seokhwan  and
      Pujara, Jay  and
      Ren, Xiang  and
      Liu, Yang  and
      Hakkani-Tur, Dilek",
    booktitle = "Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    year = "2021",
    address = "Singapore and Online",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2109.06427"
}

Note that the paper uses newly collected dialogues as well as those that were filtered from existing datasets. This repo contains our newly collected dialogues alone.

Owner

Alexa
