Entity-Based Knowledge Conflicts in Question Answering.

Overview

Entity-Based Knowledge Conflicts in Question Answering

Run Instructions | Paper | Citation | License

This repository provides the Substitution Framework described in Section 2 of our paper Entity-Based Knowledge Conflicts in Question Answering. Given a quesion answering dataset, we derive a new dataset where the context passages have been modified to have new answers to their question. By training on the original examples and evaluating on the derived examples, we simulate a parametric-contextual knowledge conflict --- useful for understanding how model's employ sources of knowledge to arrive at a decision.

Our dataset derivation follows two steps: (1) identifying named entity answers, and (2) replacing all occurrences of the answer in the context with a substituted entity, effectively changing the answer. The answer substitutions depend on the chosen substitution policy.

Run Instructions

1. Setup

Setup requirements and download SpaCy and WikiData dependencies.

bash setup.sh

2. (Optional) Download and Process Wikidata

This optional stage reproduces wikidata/entity_info.json.gz, downloaded during Setup.

Download the Wikidata dump from October 2020 here and the Wikipedia pageviews from June 2, 2020 here.

NOTE: We don't use the newest Wikidata dump because Wikidata doesn't keep old dumps so reproducibility is an issue. If you'd like to use the newest dump, it is available here. Wikipedia pageviews, on the other hand, are kept around and can be found here. Be sure to download the *-user.bz2 file and not the *-automatic.bz2 or the *-spider.bz2 files.

To extract out Wikidata information, run the following (takes ~8 hours)

python extract_wikidata_info.py --wikidata_dump wikidata-20201026-all.json.bz2 --popularity_dump pageviews-20210602-user.bz2 --output_file entity_info.json.gz

The output file of this step is available here.

3. Load and Preprocess Dataset

PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsTrain -w wikidata/entity_info.json.gz
PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsDev -w wikidata/entity_info.json.gz

4. Generate Substitutions

PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsTrain.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsTrain
   
    .jsonl 
    
      -n 1 ...
PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsDev.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsDev
     
      .jsonl 
      
        -n 1 ...

      
     
    
   

See descriptions of the substitution policies (substitution-commands) we provide here. Inspect the argparse and substitution-specific subparsers in generate_substitutions.py to see additional arguments.

Our Substitution Functions

Here we define the the substitution functions we provide. These functions ingests a QADataset, and modifies the context passage, according to defined rules, such that there is now a new answer to the question, according to the context. Greater detail is provided in our paper.

  • Alias Substitution (sub-command: alias-substitution) --- Here we replace an answer with one of it's wikidata aliases. Since the substituted answer is always semantically equivalent, answer type preservation is naturally maintained.
  • Popularity Substitution (sub-command: popularity-substitution) --- Here we replace answers with a WikiData answer of the same type, with a specified popularity bracket (according to monthly page views).
  • Corpus Substitution (sub-command: corpus-substitution) --- Here we replace answers with other answers of the same type, sampled from the same corpus.
  • Type Swap Substitution (sub-command: type-swap-substitution) --- Here we replace answers with other answers of different type, sampled from the same corpus.

How to Add Your own Dataset / Substitution Fn / NER Models

Use your own Dataset

To add your own dataset, create your own subclass of QADataset (in src/classes/qadataset.py).

  1. Overwrite the read_original_dataset function, to read your dataset, creating a List of QAExample objects.
  2. Add your class and the url/filepath to the DATASETS variable in src/load_dataset.py.

See MRQANaturalQuetsionsDataset in src/classes/qadataset.py as an example.

Use your own Substitution Function

We define 5 different substitution functions in src/generate_substitutions.py. These are described here. Inspect their docstrings and feel free to add your own, leveraging any of the wikidata, derived answer type, or other info we populate for examples and answers. Here are the steps to create your own:

  1. Add a subparser in src/generate_substitutions.py for your new function, with any relevant parameters. See alias_sub_parser as an example.
  2. Add your own substitution function to src/substitution_fns.py, ensuring the signature arguments match those specified in the subparser. See alias_substitution_fn as an example.
  3. Add a reference to your new function to SUBSTITUTION_FNS in src/generate_substitutions.py. Ensure the dictionary key matches the subparser name.

Use your own Named Entity Recognition and/or Entity Linking Model

Our SpaCy NER model is trained and used mainly to categorize answer text into answer types. Only substitutions that preserve answer type are likely to be coherent.

The functions which need to be changed are:

  1. run_ner_linking in utils.py, which loads the NER model and populates info for each answer (see function docstring).
  2. Answer._select_answer_type() in src/classes/answer.py, which uses the NER answer type label and wikidata type labels to cateogrize the answer into a type category.

Citation

Please cite the following if you found this resource or our paper useful.

@misc{longpre2021entitybased,
      title={Entity-Based Knowledge Conflicts in Question Answering}, 
      author={Shayne Longpre and Kartik Perisetla and Anthony Chen and Nikhil Ramesh and Chris DuBois and Sameer Singh},
      year={2021},
      eprint={2109.05052},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The Knowledge Conflicts repository, and entity-based substitution framework are licensed according to the LICENSE file.

Contact Us

To contact us feel free to email the authors in the paper or create an issue in this repository.

Owner
Apple
Apple
Accelerated deep learning R&D

Accelerated deep learning R&D PyTorch framework for Deep Learning research and development. It focuses on reproducibility, rapid experimentation, and

Catalyst-Team 3.1k Jan 06, 2023
Official PyTorch implementation of RobustNet (CVPR 2021 Oral)

RobustNet (CVPR 2021 Oral): Official Project Webpage Codes and pretrained models will be released soon. This repository provides the official PyTorch

Sungha Choi 173 Dec 21, 2022
Continuous Diffusion Graph Neural Network

We present Graph Neural Diffusion (GRAND) that approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE.

Twitter Research 227 Jan 05, 2023
A practical ML pipeline for data labeling with experiment tracking using DVC.

Auto Label Pipeline A practical ML pipeline for data labeling with experiment tracking using DVC Goals: Demonstrate reproducible ML Use DVC to build a

Todd Cook 4 Mar 08, 2022
Codes of paper "Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling"

Unseen Object Amodal Instance Segmentation (UOAIS) Seunghyeok Back, Joosoon Lee, Taewon Kim, Sangjun Noh, Raeyoung Kang, Seongho Bak, Kyoobin Lee This

GIST-AILAB 92 Dec 13, 2022
A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)

ClimART - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models Official PyTorch Implementation Using deep le

21 Dec 31, 2022
Lecture materials for Cornell CS5785 Applied Machine Learning (Fall 2021)

Applied Machine Learning (Cornell CS5785, Fall 2021) This repo contains executable course notes and slides for the Applied ML course at Cornell and Co

Volodymyr Kuleshov 103 Dec 31, 2022
TensorFlow implementation of ENet, trained on the Cityscapes dataset.

segmentation TensorFlow implementation of ENet (https://arxiv.org/pdf/1606.02147.pdf) based on the official Torch implementation (https://github.com/e

Fredrik Gustafsson 248 Dec 16, 2022
This is a beginner-friendly repo to make a collection of some unique and awesome projects. Everyone in the community can benefit & get inspired by the amazing projects present over here.

Awesome-Projects-Collection Quality over Quantity :) What to do? Add some unique and amazing projects as per your favourite tech stack for the communi

Rohan Sharma 178 Jan 01, 2023
PyTorch implementation of Pointnet2/Pointnet++

Pointnet2/Pointnet++ PyTorch Project Status: Unmaintained. Due to finite time, I have no plans to update this code and I will not be responding to iss

Erik Wijmans 1.2k Dec 29, 2022
CoaT: Co-Scale Conv-Attentional Image Transformers

CoaT: Co-Scale Conv-Attentional Image Transformers Introduction This repository contains the official code and pretrained models for CoaT: Co-Scale Co

mlpc-ucsd 191 Dec 03, 2022
Repo for paper "Dynamic Placement of Rapidly Deployable Mobile Sensor Robots Using Machine Learning and Expected Value of Information"

Repo for paper "Dynamic Placement of Rapidly Deployable Mobile Sensor Robots Using Machine Learning and Expected Value of Information" Notes I probabl

Berkeley Expert System Technologies Lab 0 Jul 01, 2021
Generate pixel-style avatars with python.

face2pixel Generate pixel-style avatars with python. Run: Clone the project: git clone https://github.com/theodorecooper/face2pixel install requiremen

Theodore Cooper 2 May 11, 2022
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System Authors: Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai

Amazon Web Services - Labs 123 Dec 23, 2022
Image De-raining Using a Conditional Generative Adversarial Network

Image De-raining Using a Conditional Generative Adversarial Network [Paper Link] [Project Page] He Zhang, Vishwanath Sindagi, Vishal M. Patel In this

He Zhang 216 Dec 18, 2022
Language Models Can See: Plugging Visual Controls in Text Generation

Language Models Can See: Plugging Visual Controls in Text Generation Authors: Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lin

Yixuan Su 195 Dec 22, 2022
Creating a Linear Program Solver by Implementing the Simplex Method in Python with NumPy

Creating a Linear Program Solver by Implementing the Simplex Method in Python with NumPy Simplex Algorithm is a popular algorithm for linear programmi

Reda BELHAJ 2 Oct 12, 2022
SpineAI Bilsky Grading With Python

SpineAI-Bilsky-Grading SpineAI Paper with Code 📫 Contact Address correspondence to J.T.P.D.H. (e-mail: james_hallinan AT nuhs.edu.sg) Disclaimer This

<a href=[email protected]"> 2 Dec 16, 2021
rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learnings benchmarks.

Open-source library for reliable evaluation on reinforcement learning and machine learning benchmarks. See NeurIPS 2021 oral for details.

Google Research 529 Jan 01, 2023
Channel Pruning for Accelerating Very Deep Neural Networks (ICCV'17)

Channel Pruning for Accelerating Very Deep Neural Networks (ICCV'17)

Yihui He 1k Jan 03, 2023