skweak: A software toolkit for weak supervision applied to NLP tasks

Last update: Dec 28, 2022

Overview

skweak: Weak supervision for NLP

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming.

skweak (pronounced /skwi:k/) is a Python-based software toolkit that provides a concrete solution to this problem using weak supervision. skweak is built around a very simple idea: Instead of annotating texts by hand, we define a set of labelling functions to automatically label our documents, and then aggregate their results to obtain a labelled version of our corpus.

The labelling functions may take various forms, such as domain-specific heuristics (like pattern-matching rules), gazetteers (based on large dictionaries), machine learning models, or even annotations from crowd-workers. The aggregation is done using a statistical model that automatically estimates the relative accuracy (and confusions) of each labelling function by comparing their predictions with one another.

skweak can be applied to both sequence labelling and text classification, and comes with a complete API that makes it possible to create, apply and aggregate labelling functions with just a few lines of code. The toolkit is also tightly integrated with SpaCy, which makes it easy to incorporate into existing NLP pipelines. Give it a try!

Full Paper:
Pierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), "skweak: Weak Supervision Made Easy for NLP", arXiv:2104.09683.

Documentation & API: See the Wiki for details on how to use skweak.

121_file_Video.mp4

Dependencies

spacy >= 3.0.0
hmmlearn >= 0.2.4
pandas >= 0.23
numpy >= 1.18

You also need Python >= 3.6.

Install

The easiest way to install skweak is through pip:

pip install skweak

or if you want to install from the repo:

pip install --user git+https://github.com/NorskRegnesentral/skweak

The above installation only includes the core library (not the additional examples in examples).

Basic Overview

Weak supervision with skweak goes through the following steps:

Start: First, you need raw (unlabelled) data from your text domain. skweak is build on top of SpaCy, and operates with Spacy Doc objects, so you first need to convert your documents to Doc objects using SpaCy.
Step 1: Then, we need to define a range of labelling functions that will take those documents and annotate spans with labels. Those labelling functions can comes from heuristics, gazetteers, machine learning models, etc. See the for more details.
Step 2: Once the labelling functions have been applied to your corpus, you need to aggregate their results in order to obtain a single annotation layer (instead of the multiple, possibly conflicting annotations from the labelling functions). This is done in skweak using a generative model that automatically estimates the relative accuracy and possible confusions of each labelling function.
Step 3: Finally, based on those aggregated labels, we can train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to estimate any kind of machine learning model. You are free to use whichever model/framework you prefer.

Quickstart

Here is a minimal example with three labelling functions (LFs) applied on a single document:

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils

# LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
   for tok in doc[1:]:
      if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
          yield tok.i-1, tok.i+1, "MONEY"
lf1 = heuristics.FunctionAnnotator("money", money_detector)

# LF 2: detection of years with a regex
lf2= heuristics.TokenConstraintAnnotator("years", lambda tok: re.match("(19|20)\d{2}$", tok.text), "DATE")

# LF 3: a gazetteer with a few names
NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":trie})

# We create a corpus (here with a single text)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")

# apply the labelling functions
doc = lf3(lf2(lf1(doc)))

# and aggregate them
hmm = aggregation.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit_and_aggregate([doc])

# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, "hmm")

Obviously, to get the most out of skweak, you will need more than three labelling functions. And, most importantly, you will need a larger corpus including as many documents as possible from your domain, so that the model can derive good estimates of the relative accuracy of each labelling function.

Documentation

See the Wiki.

License

skweak is released under an MIT License.

The MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve the copyright and license notices (see file License). Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Citation

See our paper describing the framework:

Pierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), "skweak: Weak Supervision Made Easy for NLP", arXiv:2104.09683

@misc{lison2021skweak,
      title={skweak: Weak Supervision Made Easy for NLP}, 
      author={Pierre Lison and Jeremy Barnes and Aliaksandr Hubin},
      year={2021},
      eprint={2104.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

Label Function Analysis

First of all, thanks for open sourcing such an awesome project!

Our team has been playing around skweak for a sequential labeling task, and we were wondering if there were any plans in the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.

Snorkel for example, provides a LF Analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequential labeling problem.

Are there any plans to add such functionality down the line as a feature enhancement?
enhancement

opened by schopra8 20
Tokens with no possible state

I very often get the error of this line that there is a "problem with token X", causing HMM training to be aborted after only a couple of documents in the very first iteration.

I found out that this is due to framelogprob having all -np.inf for the token in question. So I checked what happens in self._compute_log_likelihood for the respective document and found that this document had only one labeling function firing and X[source] in this line was all False for the first token (or state?).

This means that this token/state is also all masked with -np.inf in logsum in this line.

Now, I am unsure how to fix that. This clearly does not look like the desired behavior but I suppose "testing for tokens with no possible states" is there for a reason. Can I simply replace -np.inf in self._compute_log_likelihood with -100000 ? Then, of course, the test will not fail and not abort training but there will be a token with only very improbable states. Is that ok?

Or is that the wrong approach? Should tokens without observed labels from the labeling functions rather get a default label (e.g., O)? So why is that not done here? Is it a bug? I am not sure where I should look for a bug, if there is one. Can someone with a better knowledge of the code base give some advice on this?

opened by mnschmit 10
_do_forward_pass, _do_backward_pass, _compute_posteriors not defined in skweak.aggregation

skweak/aggregation.py", line 405, in fit logprob, fwdlattice = self._do_forward_pass(framelogprob) AttributeError: 'HMM' object has no attribute '_do_forward_pass'

opened by ManuBohra 10
TypeError: unhashable type: 'list'

Upon applying config file in order to train textcat model using the following code:

!spacy init config - --lang en --pipeline ner --optimize accuracy | \ spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \ --initialize.vectors en_core_web_md --output train

I receive following error message:

[i] Saving to output directory: train [i] Using CPU

=========================== Initializing pipeline =========================== 2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found 2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found 2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. [2022-03-27 15:50:05,376] [INFO] Set up nlp object from config [2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner'] [2022-03-27 15:50:05,395] [INFO] Created vocabulary [2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md [2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object Traceback (most recent call last): File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in run_code exec(code, run_globals) File "C:\ProgramData\Anaconda3\Scripts\spacy.exe_main.py", line 7, in File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli_util.py", line 71, in setup_cli command(prog_name=COMMAND) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in call return self.main(*args, **kwargs) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main rv = self.invoke(ctx) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke return process_result(sub_ctx.command.invoke(sub_ctx)) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke return ctx.invoke(self.callback, **ctx.params) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke return callback(*args, **kwargs) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper return callback(**use_params) # type: ignore File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli train(config_path, output_path, use_gpu=use_gpu, overrides=overrides) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train nlp = init_nlp(config, use_gpu=use_gpu) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize proc.initialize(get_examples, nlp=self, **p_settings) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize validate_get_examples(get_examples, "Tok2Vec.initialize") File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in call for real_eg in examples: File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples for reference in reference_docs: File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin for doc in docs: File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_serialize.py", line 150, in get_docs doc.spans.from_bytes(self.span_groups[i]) File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_dict_proxies.py", line 54, in from_bytes group = SpanGroup(doc).from_bytes(value_bytes) File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes File "C:\ProgramData\Anaconda3\lib\site-packages\srsly_msgpack_api.py", line 27, in msgpack_loads msg = msgpack.loads(data, raw=False, use_list=use_list) File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack_init.py", line 79, in unpackb return _unpackb(packed, **kwargs) File "srsly\msgpack_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb TypeError: unhashable type: 'list'

Seems like a dependency issue. What is the reason for it? And is there a way to fix it?

Also : Is the following error message a problem ? "[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." or can it be avoided by simply applying the following?: for document in train_data: try: document.ents = document.spans["hmm"] skweak.utils.docbin_writer(train_data, "train.spacy") except Exception as e: print(e)

opened by AlineBornschein 6
TypeError when nothing is found on in a document

Hi! I'm getting an exception from fit_and_aggregate. TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'. The exception is from line 227 in aggregation.py, np.apply_along_axis(...)

This seems to happen when all of my labeling functions return empty on one of the docs so the DataFrame is empty.

opened by oholter 6
Error in MultilabelNaiveBayes

I am using Skweak Multilabel for classification and I am getting the following error message - RuntimeError: No valid state found at position 0

I aggregated LFs using CombinedAnnotator, then initialized MultilabelNaiveBayes - MultilabelNaiveBayes("skweak_preds",final_label_list) and then trained the model - skweak_model.fit(d2s)

Any help in fixing this is appreciated. Thanks!

opened by sujeethrv 5
Converting .spacy files to conll format to train other models on it.

Once I fit the aggregation model on the data, I used Skweak's function to write it as a Docbin file which will get saved as a .spacy file. How do I convert this into a normal CoNLL format file. Are there any libraries or tools that can do that ?

opened by Akshay0799 5

Gazetteer is not working with single tokens

Hello.

Can't get why gazetteer doesn't match single name 'Barack'?

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack"), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)

{'presidents': [Donald Trump]}

Any ideas?

Thanks for a remarkable lib!

opened by slavaGanzin 5

[Question] Underspecified Labels w/ out Fine-Grained Label
Context

I'm training an NER model using the HMM aggregator.

I have 2 label classes [A, B] and an under-specified label [C] which is a super-class of A and B within my ontology.

I have 3-sets of gazetteer label functions - one set for A, one set for B, and one set for C.

Issue

When training the HMM, I have tokens which are annotated by label functions for C (superclass) but are not annotated by label functions for A and B (e.g., the term "Apple" is being labeled as an ENT but is not being captured by the LFs for PER or PROD).

Currently I'm calling the HMM function as follows:

hmm = aggregation.HMM("hmm", [A, B], sequence_labelling=True) hmm.add_underspecified_label(C, [A, B]) _ = hmm.fit_and_aggregate(annotated_docs)

This triggers an error from the below aggregation code, since all probability mass is being placed on a label that was not included in the HMM (i.e., the under-specified label C). https://github.com/NorskRegnesentral/skweak/blob/0613f20b9c8be3f22553e303ec22c72dea1f206a/skweak/aggregation.py#L397-L401

Question(s)

Should I be including the under-specified label as a possible label option in the HMM?

hmm = aggregation.HMM("hmm", [A, B, C], sequence_labelling=True) hmm.add_underspecified_label(C, [A, B]) _ = hmm.fit_and_aggregate(annotated_docs)

How are underspecified labels "learned" or trained differently vs. the "specified labels" (e.g., A, B in the example)?

Thanks in advance!
opened by schopra8 5
use Flair with skweak

hello , is here anyone who tried to implement another model/framework other than spacy (ner) as a labeling function. i tried to work with flair but didnt work. can anyone help me and thanks in advance .

opened by Ihebzayen 4
Runtime error in display_entities

I am using the latest version of skweak: 0.2.17. I tried running the example (quick-start.ipynb) in the repo. When I try to execute

skweak.utils.display_entities(docs[28], "other_org_detector")

, I get this error.

opened by latchukarthick98 3
Step by step NER alternative 2

Hello,

First of all, thank you for the library.

I'm kind of new to NER, and I'd like to know how the 2nd alternative of the NER process would be done, where a more sophisticated model is created, since I didn't find it in Step by Step NER.

opened by boskis222 0
minimal example not working

When I try to run the minimal example on the home page, an error appears: AttributeError: 'BaseHMM' object has no attribute '_do_forward_log_pass'

Am I missing something from the install or is it just pip install skweak?

opened by davidbetancur8 2
Support options in displacy.render

This is enhance request for display_entities can be a bit more flexible if you includeoptions={} as part of their parameters. Ex: def display_entities(doc: Doc, layer=None, add_tooltip=False, options={}):

then fix the line below: html = spacy.displacy.render(doc2, jupyter=False, style="ent", manual=True, options=options)

That will extends the functionality of render when creating new entities.

Thanks for the great work with SKWEAK.

opened by lidiexy-palinode 0
Support for relation extraction

Right now, skweak supports two main types of NLP tasks: (token-level) sequence labelling and text classification. Both rests on the idea that labelling functions associate labels to text spans, and the role of the aggregation model is then to merge the outputs of those labelling functions such as to get unified predictions.

However, some NLP tasks cannot be easily associated to text spans. For instance, relation extraction necessitates a prediction on pairs of spans.

The question is then how to provide support for such type of tasks, for instance by implementing a RelationAnnotator that could be used to associate pairs of spans to a label.

Technically speaking, we could still encode the annotations internally as SpanGroup objects. One solution would be to only add one span of the pair in the SpanGroup, but then specify that this span is connected to a second span (SpanGroup objects allows the inclusion of JSON-serialised attributes). The method get_observation_df in the BaseAggregator class could then be extended to detect whether a span is a normal one, or is connected to a second span. If that is the case, the aggregation would then be done on pairs of spans instead of single spans.

Do get in touch if this functionality is something you need, so that we know whether we should prioritise this in our next release :-)
enhancement

opened by plison 4
Regression-based outcome

Hello, thank you for sharing this repo. Do you have plans for providing capability for a regression-based outcome? Something along the lines of fine-grained sentiment on a scale from 1-5?
enhancement

opened by dmracek 1

Releases(0.3.1)

0.3.1(Mar 25, 2022)
Brand new version of skweak, including both a number of bug fixes and some new functionalities:

skweak is now using the latest version of hmmlearn, thereby fixing a number errors due to a mismatch between method names

We now have a clearer split between aggregation models for sequence labelling and for text classification. Possible aggregators for sequence labelling are SequentialMajorityVoter and HMM (preferred), while the aggregators for non-sequential text classification are MajorityVoter and NaiveBayes.

We also introduce a brand new functionality: multi-label classification! Instead of assuming that all labels are mutually exclusive, you can now aggregate the results of labelling functions without assuming that only one label is correct. This multi-label scheme is available for both sequence labelling (see MultilabelSequentialMajorityVoter and MultilabelHMM) and text classification (see MultilabelMajorityVoter and MultilabelNaiveBayes).

By default, all labels can be simultaneously true for a given data point, but you can enforce exclusivity relations between labels through the method set_exclusive_labels. If all labels are set to be mutually exclusive, the aggregation is equivalent to a standard multi-class setup. Internally, this functionality is implemented by constructing and fitting separate aggregation models for each label.

The code for the aggregation models has also been heavily refactored, making it hopefully easier to create new aggregation models.
Source code(tar.gz)
Source code(zip)
0.2.8(Apr 19, 2021)

First official release of skweak, with support for both sequence labelling and text classification! See the documentation for details.
Source code(tar.gz)
Source code(zip)
btc.tar.gz(63.68 MB)
conll2003.spacy(4.56 MB)
conll2003.tar.gz(63.63 MB)
crunchbase.json.gz(8.55 MB)
muc6.spacy(3.37 MB)
norec.conllu.tar.gz(161.68 MB)
reuters_small.spacy(1.54 MB)
reuters_small.tar.gz(190.15 KB)
wikidata_small_tokenised.json.gz(11.36 MB)
wikidata_tokenised.json.gz(21.06 MB)

Owner

Norsk Regnesentral (Norwegian Computing Center)

Norwegian Computing Center is a private foundation performing research in statistical modeling, machine learning and information/communication technology

GitHub Repository

Mastering Transformers, published by Packt

Mastering Transformers This is the code repository for Mastering Transformers, published by Packt. Build state-of-the-art models from scratch with adv

195 Jan 01, 2023

An easy to use Natural Language Processing library and framework for predicting, training, fine-tuning, and serving up state-of-the-art NLP models.

Welcome to AdaptNLP A high level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models

407 Jan 03, 2023

DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

DeepAmandine This is an artificial intelligence based on GPT-3 that you can chat with, it is very nice and makes a lot of jokes. We wish you a good ex

3 Apr 19, 2022

[NeurIPS 2021] Code for Learning Signal-Agnostic Manifolds of Neural Fields

Learning Signal-Agnostic Manifolds of Neural Fields This is the uncleaned code for the paper Learning Signal-Agnostic Manifolds of Neural Fields. The

60 Dec 12, 2022

Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

28 Nov 21, 2022

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 Corpora 📃 Corpora Number of documents Size (GB) BNE 201,080,084 570GB Models 🤖 RoBERTa-base BNE: https://huggingface.co

203 Dec 20, 2022

Active learning for text classification in Python

Active Learning allows you to efficiently label training data in a small-data scenario.

375 Dec 28, 2022

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

The implementation of paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. CLIP4Clip is a video-text retrieval model based

456 Jan 06, 2023

Application to help find best train itinerary, uses speech to text, has a spam filter to segregate invalid inputs, NLP and Pathfinding algos.

T-IAI-901-MSC2022 - GROUP 18 Gestion de projet Notre travail a été organisé et réparti dans un Trello. https://trello.com/b/X3s2fpPJ/ia-projet Install

1 Feb 05, 2022

Spert NLP Relation Extraction API deployed with torchserve for inference

URLMask Python program for Linux users to change a URL to ANY domain. A program than can take any url and mask it to any domain name you like. E.g. ne

1 Nov 24, 2021

Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

92 Dec 27, 2022

Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.

Word2Wave is a simple method for text-controlled GAN audio generation. You can either follow the setup instructions below and use the source code and CLI provided in this repo or you can have a play

91 Dec 23, 2022

BERT score for text generation

BERTScore Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020). News: Features to appear in

1k Jan 08, 2023

Ecommerce product title recognition package

revizor This package solves task of splitting product title string into components, like type, brand, model and article (or SKU or product code or you

16 Mar 03, 2022

A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Sentiment Analysis on Yelp's Dataset Author: Roberto Sanchez, Talent Path: D1 Group Docker Deployment: Deployment of this application can be found her

0 Aug 04, 2021