Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

Overview

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

This repository contains the source code for an end-to-end open-domain question answering system. The system is made up of two components: a retriever model and a reading comprehension (question answering) model. We provide the code for these two models in addition to demo code based on Streamlit. A video of the demo can be viewed here.

Installation

Our system uses PubMedBERT, a neural language model that is pretrained on PubMed abstracts for the retriever. Download the PyTorch version of PubMedBert here. For reading comprehension, we utilize BioBERT fine-tuned on SQuAD V2 . The model can be found here.

Datasets

We provide the COVID-QA dataset under the data directory. This is used for both the retriever and reading models. The train/dev/test files for the retriever are named dense_*.txt and those for reading comprehension are named qa_*.json.

The CORD-19 dataset is available for download here. Our system requires download of both the document_parses and metadata files for complete article information. For our system we use the 2021-02-15 download but any other download can also work. This must be combined into a jsonl file where each line contains a json object with:

  • id: article PMC id
  • title: article title
  • text: article text
  • index: text's index in the corpus (also the same as line number in the jsonl file)
  • date: article date
  • journal: journal published
  • authors: author list

We split each article into multiple json entries based on paragraph text cutoff in the document_parses file. Paragraphs that are longer than 200 tokens are split futher. This can be done with splitCORD.py where

* metdata-file: the metadata downloaded for CORD
* pmc-path: path to the PMC articles downloaded for CORD
* out-path: output jsonl file

Dense Retrieval Model

Once we have our model (PubMedBERT), we can start training. More specifically during training, we use positive and negative paragraphs, positive being paragraphs that contain the answer to a question, and negative ones not. We train on the COVID-QA dataset (see the Datasets section for more information on COVID-QA). We have a unified encoder for both questions and text paragraphs that learns to encode questions and associated texts into similar vectors. Afterwards, we use the model to encode the CORD-19 corpus.

Training

scripts/train.sh can be used to train our dense retrieval model.

CUDA_VISIBLE_DEVICES=0 python ../train_retrieval.py \
    --do_train \
    --prefix strong_dpr_baseline_b150 \
    --predict_batch_size 2000 \
    --model_name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
    --train_batch_size 75 \
    --learning_rate 2e-5 \
    --fp16 \
    --train_file ../data/dense_train.txt \
    --predict_file ../data/dense_dev.txt \
    --seed 16 \
    --eval_period 300 \
    --max_c_len 300 \
    --max_q_len 30 \
    --warmup_ratio 0.1 \
    --num_train_epochs 20 \
    --dense_only \
    --output_dir /path/to/model/output \

Here are things to keep in mind:

1. The output_dir flag is where the model will be saved.
2. You can define the init_checkpoint flag to continue fine-tuning on another dataset.

The Dense retrieval model is then combined with BM25 for reranking (see paper for details).

Corpus

Next, go to scripts/encode_covid_corpus.sh for the command to encode our corpus.

CUDA_VISIBLE_DEVICES=0 python ../encode_corpus.py \
    --do_predict \
    --predict_batch_size 1000 \
    --model_name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
    --fp16 \
    --predict_file /path/to/corpus \
    --max_c_len 300 \
    --init_checkpoint /path/to/saved/model/checkpoint_best.pt \
    --save_path /path/to/encoded/corpus

We pass the corpus (CORD-19) to our trained encoder in our dense retrieval model. Corpus embeddings are indexed.

Here are things to keep in mind:

1. The predict_file flag should take in your CORD-19 dataset path. It should be a .jsonl file.
2. Look at your output_dir path when you ran train.sh. After training our model, we should now have a checkpoint in that folder. Copy the exact path onto
the init_checkpoint flag here.
3. As previously mentioned, the result of these commands is the corpus (CORD-19) embeddings become indexed. The embeddings are saved in the save_path flag argument. Create that directory path as you wish.

Evaluation

You can run scripts/eval.sh to evaluate the document retrieval model.

CUDA_VISIBLE_DEVICES=0 python ../eval_retrieval.py \
    ../data/dense_test.txt \
    /path/to/encoded/corpus \
    /path/to/saved/model/checkpoint_best.pt \
    --batch-size 1000 --model-name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext  --topk 100 --dimension 768

We evaluate retrieval on a test set from COVID-QA. We determine the percentage of questions that have retrieved paragraphs with the correct answer across different top-k settings.

We do that in the following 3 ways:

  1. exact answer matches in top-k retrievals
  2. matching articles in top-k retrievals
  3. F1 and Siamese BERT fuzzy matching

Here are things to think about:

1. The first, second, and third arguments are our COVID-QA test set, corpus indexed embeddings, and retrieval model respectively.
2. The other flag that is important is the topk one. This flag determines the quantity of retrieved CORD19 paragraphs.

Reading Comprehension

We utilize the HuggingFace's question answering scripts to train and evaluate our reading comprehension model. This can be done with scripts/qa.sh. The scripts are modified to allow for the extraction of multiple answer spans per document. We use a BioBERT model fine-tuned on SQuAD V2 as our pre-trained model.

CUDA_VISIBLE_DEVICES=0 python ../qa/run_qa.py \
  --model_name_or_path ktrapeznikov/biobert_v1.1_pubmed_squad_v2 \
  --train_file ../data/qa_train.json \
  --validation_file ../data/qa_dev.json \
  --test_file ../data/qa_test.json \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 5 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /path/to/model/output \

Demo

We combine the retrieval model and reading model for an end-to-end open-domain question answering demo with Streamlit. This can be run with scripts/demo.sh.

CUDA_VISIBLE_DEVICES=0 streamlit run ../covid_qa_demo.py -- \
  --retriever-model-name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
  --retriever-model path/to/saved/retriever_model/checkpoint_best.pt \
  --qa-model-name ktrapeznikov/biobert_v1.1_pubmed_squad_v2 \
  --qa-model /path/to/saved/qa_model \
  --index-path /path/to/encoded/corpus

Here are things to keep in mind:

1. retriever-model is the checkpoint file of your trained retriever model.
2. qa-model is the trained reading comprehension model.
3. index-path is the path to the encoded corpus embeddings.

Requirements

See requirements.txt

Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
A Numba-based two-point correlation function calculator using a grid decomposition

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Lehman Garrison 3 Aug 24, 2022
PyIOmica (pyiomica) is a Python package for omics analyses.

PyIOmica (pyiomica) This repository contains PyIOmica, a Python package that provides bioinformatics utilities for analyzing (dynamic) omics datasets.

G. Mias Lab 13 Jun 29, 2022
Feature Detection Based Template Matching

Feature Detection Based Template Matching The classification of the photos was made using the OpenCv template Matching method. Installation Use the pa

Muhammet Erem 2 Nov 18, 2021
Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

3 Nov 10, 2021
Data processing with Pandas.

Processing-data-with-python This is a simple example showing how to use Pandas to create a dataframe and the processing data with python. The jupyter

1 Jan 23, 2022
Visions provides an extensible suite of tools to support common data analysis operations

Visions And these visions of data types, they kept us up past the dawn. Visions provides an extensible suite of tools to support common data analysis

168 Dec 28, 2022
Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database

Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database, using a set of "harvesters", whose job it

Battery Intelligence Lab 20 Sep 28, 2022
The Dash Enterprise App Gallery "Oil & Gas Wells" example

This app is based on the Dash Enterprise App Gallery "Oil & Gas Wells" example. For more information and more apps see: Dash App Gallery See the Dash

Austin Caudill 1 Nov 08, 2021
Analyze the Gravitational wave data stored at LIGO/VIRGO observatories

Gravitational-Wave-Analysis This project showcases how to analyze the Gravitational wave data stored at LIGO/VIRGO observatories, using Python program

1 Jan 23, 2022
Convert monolithic Jupyter notebooks into Ploomber pipelines.

Soorgeon Join our community | Newsletter | Contact us | Blog | Website | YouTube Convert monolithic Jupyter notebooks into Ploomber pipelines. soorgeo

Ploomber 65 Dec 16, 2022
A pipeline that creates consensus sequences from a Nanopore reads. I

A pipeline that creates consensus sequences from a Nanopore reads. It clusters reads that are similar to each other and creates a consensus that is then identified using BLAST.

Ada Madejska 2 May 15, 2022
This repository contains some analysis of possible nerdle answers

Nerdle Analysis https://nerdlegame.com/ This repository contains some analysis of possible nerdle answers. Here's a quick overview: nerdle.py contains

0 Dec 16, 2022
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

32 Apr 25, 2022
An orchestration platform for the development, production, and observation of data assets.

Dagster An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data f

Dagster 6.2k Jan 08, 2023
We're Team Arson and we're using the power of predictive modeling to combat wildfires.

We're Team Arson and we're using the power of predictive modeling to combat wildfires. Arson Map Inspiration There’s been a lot of wildfires in Califo

Jerry Lee 3 Oct 17, 2021
VHub - An API that permits uploading of vulnerability datasets and return of the serialized data

VHub - An API that permits uploading of vulnerability datasets and return of the serialized data

André Rodrigues 2 Feb 14, 2022
Tools for the analysis, simulation, and presentation of Lorentz TEM data.

ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI

McMorran Lab 1 Dec 26, 2022
Toolchest provides APIs for scientific and bioinformatic data analysis.

Toolchest Python Client Toolchest provides APIs for scientific and bioinformatic data analysis. It allows you to abstract away the costliness of runni

Toolchest 11 Jun 30, 2022
NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

Blaze 3.1k Jan 05, 2023