Code for the paper "Flexible Generation of Natural Language Deductions"

Overview

Flexible Generation of Natural Language Deductions

a.k.a. ParaPattern

https://arxiv.org/abs/2104.08825

Kaj Bostrom, Lucy Zhao, Swarat Chaudhuri, and Greg Durrett

This repository contains all the code needed to replicate the experiments from the paper, and additionally provides a set of tools to put together new natural language deduction operations from scratch.

In the data/ folder, you'll find all the data used to train and evaluate our models, already preprocessed and ready to go. The one exception is the MNLI dataset, which is left out due to its size: if you want to replicate our MNLI-BART baseline, you'll need to download a copy of MNLI and run data/mnli/filter.py yourself. The data folder also contains several generic conversion scripts, which you may find useful for processing operation training examples, as well as paraphrase.py, which generates paraphrases automatically if you pass it a path to a suitable sequence-to-sequence paraphrasing model checkpoint, e.g. https://huggingface.co/tuner007/pegasus_paraphrase

In the modeling/ folder, you'll find the fine-tuning code needed to train operation models, as well as scripts to run all the evaluations described in the paper. Just make sure you're on transformers version 4.2.1, not the latest version, since several of the scripts are carefully built around bugs that have since been patched out of the library.
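Because the scripts depend on that exact release, it can help to check the installed version before launching a long run. The following is a minimal sketch, assuming transformers was installed with pip; the helper names `version_matches` and `installed_transformers_version` are ours for illustration and are not part of this repository.

```python
from importlib import metadata

REQUIRED = "4.2.1"  # the version named in this README


def version_matches(installed: str, required: str = REQUIRED) -> bool:
    """Return True only when the installed version string equals the required one."""
    return installed == required


def installed_transformers_version() -> str:
    """Return the installed transformers version, or 'not installed' if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    found = installed_transformers_version()
    if version_matches(found):
        print(f"OK: transformers {found}")
    else:
        print(f"Expected transformers {REQUIRED}, found {found}")
```

An exact string comparison is used on purpose, since a newer release does not count as compatible here.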

If you have access to multiple GPUs, you can change the --nproc_per_node argument in finetune.sh from 1 to whatever number of GPUs you want to use for training.
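If you would rather make that change from a script than by hand, a small text substitution is enough. This is only a sketch: it assumes the flag is written as `--nproc_per_node=1` or `--nproc_per_node 1`, and the launch command in the example is hypothetical rather than copied from finetune.sh.

```python
import re

# Assumption: the flag is followed by "=" or whitespace and then an integer.
_FLAG = re.compile(r"(--nproc_per_node[=\s]+)\d+")


def set_nproc_per_node(script_text: str, num_gpus: int) -> str:
    """Return script_text with every --nproc_per_node value replaced by num_gpus."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    new_text, count = _FLAG.subn(lambda m: f"{m.group(1)}{num_gpus}", script_text)
    if count == 0:
        raise ValueError("no --nproc_per_node argument found")
    return new_text


# Hypothetical launch line, for illustration only.
example = "python -m torch.distributed.launch --nproc_per_node=1 train.py"
print(set_nproc_per_node(example, 4))
```

The function raises instead of silently returning the input when the flag is missing, so a typo in the script does not go unnoticed.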

In the dep_search/ folder, you'll find tools to perform bulk dependency parsing using spaCy, as well as scripts to index the resulting stream of dependency trees and scrape them using dependency patterns. For reference, the templates used in the paper live in dep_search/templates/.

If you want to write your own templates, a good place to start is playing around with the dependency pattern DSL using dep_search.struct_query.parse_query. If you're wondering how to express a given syntactic pattern, you can start by calling dep_search.struct_query.Head.from_spacy on a spaCy token; this will construct a syntactic pattern without any slots from that token's dependency subtree. Printing patterns this way is a great way to familiarize yourself with dependency structure if you need to brush up on it (I can never remember what POS tag/arc label conventions spaCy uses, so I was printing out a lot of these trees while I was developing the templates we used in the paper).
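To get a feel for what a dependency subtree looks like before touching the pattern DSL, you can print one as an indented outline. The sketch below does not use dep_search at all: `format_subtree` is a hypothetical helper that only reads the `text`, `pos_`, `dep_`, and `children` attributes that spaCy tokens expose, and `ToyToken` is a hand-annotated stand-in so the example runs without spaCy or a model installed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    pos_: str
    dep_: str
    children: List["ToyToken"] = field(default_factory=list)


def format_subtree(token, depth: int = 0) -> List[str]:
    """Return one indented 'dep_ [pos_] text' line per token in the subtree."""
    lines = [f"{'  ' * depth}{token.dep_} [{token.pos_}] {token.text}"]
    for child in token.children:
        lines.extend(format_subtree(child, depth + 1))
    return lines


# "Dogs chase cats", annotated by hand for illustration.
root = ToyToken("chase", "VERB", "ROOT", [
    ToyToken("Dogs", "NOUN", "nsubj"),
    ToyToken("cats", "NOUN", "dobj"),
])
print("\n".join(format_subtree(root)))
```

With spaCy installed, the same function should accept a real token, for example the root of a parsed sentence, since it relies only on those four attributes.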

Unfortunately, I never got around to optimizing the syntactic search process all that well, so for large free-text corpora (roughly 100M sentences or more) it can take a day or two to do a full run of parsing and indexing using dep_search/scrape.py. I find a good way to iterate on a pattern is to start by casting a really broad net, and then narrow down your pattern on a subset of those results, so that you don't have to re-index your whole original corpus each time you make a small change to a template.
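The broad-then-narrow workflow can be illustrated with a toy example. This is a sketch of the idea only, not of how dep_search/scrape.py works: plain regular expressions over strings stand in for dependency patterns over parsed trees, and the `scrape` helper and the four-sentence corpus are made up for illustration.

```python
import re
from typing import Callable, Iterable, List


def scrape(sentences: Iterable[str], matches: Callable[[str], bool]) -> List[str]:
    """Keep the sentences accepted by the predicate (stand-in for a pattern search)."""
    return [s for s in sentences if matches(s)]


corpus = [
    "If it rains, the ground gets wet.",
    "The ground is wet because it rained.",
    "Cats sleep most of the day.",
    "If you heat ice, then it melts.",
]

# Stage 1: broad pattern, run once over the full corpus (the expensive pass).
broad = re.compile(r"\b(if|because)\b", re.IGNORECASE)
candidates = scrape(corpus, lambda s: broad.search(s) is not None)

# Stage 2: narrower pattern, iterated cheaply over the cached candidates only.
narrow = re.compile(r"^If\b.*\bthen\b")
refined = scrape(candidates, lambda s: narrow.search(s) is not None)

print(len(candidates), refined)
```

Only the second stage needs to be rerun while you tweak the narrow pattern, which is the point of caching the broad results.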

Owner
Kaj Bostrom
PhD student at UT Austin Computer Science. Studying NLP (reading comprehension/language understanding in particular)