🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.

Overview

pySBD: Python Sentence Boundary Disambiguation (SBD)

pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.

This project is a direct port of the Ruby gem Pragmatic Segmenter, which provides rule-based sentence boundary detection.

Highlights

The short research paper 'PySBD: Pragmatic Sentence Boundary Disambiguation' was accepted into the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020.

Research Paper:

https://arxiv.org/abs/2010.09657

Recorded Talk:

pysbd_talk

Poster:

name

Install

Python

pip install pysbd

Usage

  • Currently pySBD supports 22 languages.
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
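
  • Use pysbd as a spaCy pipeline component. The example below targets the spaCy 2.x API, in which nlp.add_pipe accepts a callable.
import spacy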
from pysbd.utils import PySBDFactory

nlp = spacy.blank('en')

# explicitly adding component to pipeline
# (recommended - makes it more readable to tell what's going on)
nlp.add_pipe(PySBDFactory(nlp))

# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)

doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]

Contributing

If you want to contribute a new feature or language support, or have found text that pySBD segments incorrectly, please read CONTRIBUTING.md to learn more, then follow these steps.

  1. Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Citation

If you use the pysbd package in your projects or research, please cite PySBD: Pragmatic Sentence Boundary Disambiguation.

@inproceedings{sadvilkar-neumann-2020-pysbd,
    title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
    author = "Sadvilkar, Nipun  and
      Neumann, Mark",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
    pages = "110--114",
    abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}

Credit

This project wouldn't be possible without the great work done by the Pragmatic Segmenter team.

Comments
  • Question marks at the end swallowed

    Looks like the example with just question marks is good now:

    >>> segmenter.segment("??")
    ['??']
    

    but the example with double question marks as a token at the end of a sentence still loses the question marks:

    >>> segmenter.segment("T stands for the vector transposition. As shown in Fig. ??")
    ['T stands for the vector transposition.', 'As shown in Fig.']
    

    looks like this is the minimal repro:

    >>> segmenter.segment("Fig. ??")
    ['Fig.']
    
    bug edge-cases 
    opened by dakinggg 11
  • Pysbd just hangs 🐛

    Describe the bug: The process hangs.

    To reproduce, segment the input text f.302205302116302416302500302513915bd:

    import pysbd

    flat = "f.302205302116302416302500302513915bd"
    print(flat)
    x = segClean = pysbd.Segmenter(language="en", clean=True, char_span=False)
    for z in x.segment(flat):
        print(z)

    Expected behavior: Return f.302205302116302416302500302513915

    Example: ['f.302205302116302416302500302513915bd']

    help wanted 
    opened by kariato 8
  • Incorrect text span start and end returned

    Looks like something weird happening in this case, note that the indices of the second text span are incorrect:

    >>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
    >>> seg.segment("1) The first item. 2) The second item.")                                                                                
    [TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)] 
    
    bug 
    opened by dakinggg 7
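
One simple way to recover character offsets like those in the report above is to search for each sentence in the original text, starting from where the previous sentence ended. The sketch below is illustrative only and is not pySBD's implementation; Span and locate_sentences are hypothetical names, and it assumes every sentence appears verbatim in the text (as it should with clean=False).

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a sentence with its character offsets."""
    sent: str
    start: int
    end: int


def locate_sentences(text: str, sentences: List[str]) -> List[Span]:
    """Find each sentence in `text`, scanning forward from the previous match."""
    spans = []
    cursor = 0
    for sent in sentences:
        start = text.find(sent, cursor)
        if start == -1:
            raise ValueError(f"sentence not found after offset {cursor}: {sent!r}")
        end = start + len(sent)
        spans.append(Span(sent, start, end))
        cursor = end  # never look behind the previous sentence
    return spans


spans = locate_sentences(
    "1) The first item. 2) The second item.",
    ["1) The first item.", "2) The second item."],
)
print(spans)
```

For the input from this report the sketch yields offsets (0, 18) and (19, 38), so the second span does not restart at 0.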
  • Performance improvement?

    I am not certain of this, but I suspect there might be room for performance improvement by using re.compile to precompile all of the needed regexes. Otherwise they will have to be recompiled regularly (once the re cache of 100 has been exceeded).

    question 
    opened by dakinggg 7
  • Slovak lang support

    We've added support for SBD in Slovak language text.

    Language specific improvements:

    • list of common Slovak abbreviations
    • list of prepositive abbreviations
    • list of number abbreviations
    • handling of roman numerals
    • handling of „ text “ quotes, which are common in Slovak
    • handling of ordinal numerals in dates, such as 17. Apríl 2020
    • modified the replacement of periods in abbreviations, so it can consistently handle common Slovak abbreviations such as Company Name s. r. o.
    • disabled processing of alphabetical lists, because of conflicts with some common abbreviations

    The code has been tested for stability on a very large corpus of web text. There has been no rigorous testing for segmentation quality, but the subjective feeling in the team is very positive.

    language 
    opened by misotrnka 6
  • Different segmentation with Spacy and when using pySBD directly

    Firstly thank you for this project - I was lucky to find it and it is really useful

    I seem to have found a case where the segmentation is behaving differently when run within the Spacy pipeline and when run using pySBD directly. I stumbled on it with my own text where a sentence after a previous sentence that was in quotes was being lumped together. I looked through the Golden Rules and found this wasn't expected and then noticed that even with the text in one of your tests it acts differently in Spacy.

    To reproduce run these two bits of code:

    import spacy
    from pysbd.utils import PySBDFactory
    nlp = spacy.blank('en')
    nlp.add_pipe(PySBDFactory(nlp))
    doc = nlp("She turned to him, \"This is great.\" She held the book out to show him.")
    for sent in doc.sents:
        print(str(sent).strip() + '\n')
    

    She turned to him, "This is great." She held the book out to show him.

    import pysbd
    text = "She turned to him, \"This is great.\" She held the book out to show him."
    seg = pysbd.Segmenter(language="en", clean=False)
    #print(seg.segment(text))
    for sent in seg.segment(text):
        print(str(sent).strip() + '\n')
    

    She turned to him, "This is great."

    She held the book out to show him.

    The second way is the desired output (based on the rules at least)

    bug help wanted 
    opened by nmstoker 6
  • destructive behaviour in edge-cases

    As of v0.3.3, pySBD shows destructive behavior in some edge-cases even when setting the option clean to False. When dealing with OCR text, pySBD removes whitespace after multiple periods.

    To reproduce

    import pysbd
    
    splitter = pysbd.Segmenter(language="fr", clean=False)
    
    text = "Maissen se chargea du reste .. Logiquement,"
    print(splitter.segment(text))
    
    text = "Maissen se chargea du reste ... Logiquement,"
    print(splitter.segment(text))
    
    text = "Maissen se chargea du reste .... Logiquement,"
    print(splitter.segment(text))
    

    Actual output Please note the missing whitespace after the final period in the example with .. and .....

    ['Maissen se chargea du reste .', '.', 'Logiquement,']
    ['Maissen se chargea du reste ... ', 'Logiquement,']
    ['Maissen se chargea du reste .', '...', 'Logiquement,']
    

    Expected output

    ['Maissen se chargea du reste .', '. ', 'Logiquement,']
    ['Maissen se chargea du reste ... ', 'Logiquement,']
    ['Maissen se chargea du reste .', '... ', 'Logiquement,']
    

    In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.

    bug edge-cases 
    opened by aflueckiger 5
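
A quick way to detect this kind of destructive behaviour is to check that the segments, joined back together, reproduce the input exactly. The helper below is a hypothetical sketch (is_lossless is not part of pySBD) and assumes the segmenter is meant to preserve every character when clean is False. The sample lists are copied from the report above.

```python
from typing import List


def is_lossless(text: str, segments: List[str]) -> bool:
    """True when the segments, concatenated in order, equal the input text."""
    return "".join(segments) == text


text = "Maissen se chargea du reste .. Logiquement,"
actual = ["Maissen se chargea du reste .", ".", "Logiquement,"]
expected = ["Maissen se chargea du reste .", ". ", "Logiquement,"]

print(is_lossless(text, actual))    # the reported output drops a space
print(is_lossless(text, expected))  # the expected output keeps it
```

The same check can be run over a whole corpus to count how many inputs lose characters during segmentation.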
  • 🐎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms

    Segmentation Tools, Libraries and Algorithms:

    • [x] Stanza
    • [x] syntok
    • [x] NLTK
    • [x] spaCy
    • [x] blingfire

    | Tool      | Accuracy | Speed (ms) |
    |-----------|----------|------------|
    | blingfire | 75.00%   | 49.91      |
    | pySBD     | 97.92%   | 2449.18    |
    | syntok    | 68.75%   | 783.73     |
    | spaCy     | 52.08%   | 473.96     |
    | stanza    | 72.92%   | 120803.37  |
    | NLTK      | 56.25%   | 342.98     |

    opened by nipunsadvilkar 5
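
The table does not say how the numbers were produced. As a rough sketch of how such a comparison could be set up, the code below scores any segmenter callable against (text, expected sentences) pairs and times one pass. evaluate, naive_segment and the two sample pairs are assumptions made for illustration; naive_segment is a deliberately simple stand-in, not one of the benchmarked tools.

```python
import re
import time
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, List[str]]


def evaluate(
    segment: Callable[[str], List[str]], examples: Sequence[Example]
) -> Tuple[float, float]:
    """Return (accuracy in percent, elapsed milliseconds) for one pass."""
    start = time.perf_counter()
    passed = sum(1 for text, expected in examples if segment(text) == expected)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return 100 * passed / len(examples), elapsed_ms


def naive_segment(text: str) -> List[str]:
    # Stand-in segmenter: split on whitespace that follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text)


# Hypothetical sample pairs; a real benchmark would use a full rule set.
EXAMPLES = [
    ("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
    (
        "My name is Jonas E. Smith. Please turn to p. 55.",
        ["My name is Jonas E. Smith.", "Please turn to p. 55."],
    ),
]

accuracy, elapsed_ms = evaluate(naive_segment, EXAMPLES)
print(f"{accuracy:.2f}% in {elapsed_ms:.2f} ms")
```

The naive splitter handles the first pair but breaks the second one at "E." and "p.", which is the kind of abbreviation case a rule-based segmenter is designed to handle.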
  • ✨ 💫 Support Multiple languages

    Languages to be supported:

    • [x] English
    • [x] Bulgarian
    • [x] Spanish
    • [x] Russian
    • [x] Arabic
    • [x] Amharic
    • [x] Marathi
    • [x] Hindi
    • [x] Armenian
    • [x] Persian
    • [x] Urdu
    • [x] Polish
    • [x] Chinese
    • [x] Dutch
    • [x] Danish
    • [x] French
    • [x] Italian
    • [x] Greek
    • [x] Burmese
    • [x] Japanese
    • [x] German
    • [x] Kazakh
    enhancement 
    opened by nipunsadvilkar 4
  • Regexp issues

    I'm getting errors because the regexp engine interprets parentheses: "unterminated subpattern" and "unbalanced parenthesis".

    I'm analysing very large amounts of text, so not sure how these were triggered.

    opened by mollerhoj 4
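
Errors like these commonly appear when a piece of input text is inserted into a regular expression without escaping. Whether that is the cause in this report is an assumption, since the issue does not include a reproduction. The sketch below uses only Python's re module; compiles and mask_period_after are hypothetical helpers, and "<PRD>" is a made-up placeholder.

```python
import re


def compiles(pattern: str) -> bool:
    """Report whether `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def mask_period_after(token: str, text: str, mask: str = "<PRD>") -> str:
    """Replace the period that directly follows `token` with `mask`."""
    # re.escape turns characters such as "(" into literals.
    pattern = re.escape(token) + r"\."
    return re.sub(pattern, lambda m: m.group()[:-1] + mask, text)


token = "fig(1"  # text containing an unbalanced parenthesis
print(compiles(token + r"\."))             # unescaped: invalid pattern
print(compiles(re.escape(token) + r"\."))  # escaped: valid pattern
print(mask_period_after(token, "see fig(1. and more"))
```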
  • Reduce some calls to re.sub

    So calls to re.compile are not a problem. The main thing slowing it down is lots of calls to re.sub in abbreviation_replacer.py. I reduced some of these calls which speeds it up by a factor of ~3-3.5x on my machine, for the specific (longish) document that I tested with. I also included the script I used to test timing. Given that you are much more familiar with the codebase, see if my changes look reasonable, but all the tests do still pass. There are probably some more ways to speed up the calls in that file.

    enhancement 
    opened by dakinggg 4
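
The pull request does not describe which calls were removed. One general technique for cutting down repeated re.sub calls is to fold many per-item substitutions into a single precompiled alternation, so the text is scanned once. The sketch below illustrates that idea only; the abbreviation list, the "<PRD>" placeholder and mask_abbreviation_periods are hypothetical and are not taken from abbreviation_replacer.py.

```python
import re

# Hypothetical sample; a real list would be much longer.
ABBREVIATIONS = ["dr", "mr", "fig", "p"]

# One alternation, longest entries first, compiled once at import time.
ABBR_PERIOD = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")\.",
    re.IGNORECASE,
)


def mask_abbreviation_periods(text: str, mask: str = "<PRD>") -> str:
    """Replace the period after any known abbreviation in a single pass."""
    return ABBR_PERIOD.sub(lambda m: m.group(1) + mask, text)


print(mask_abbreviation_periods("Dr. Smith, see Fig. 2 on p. 55."))
```

Compared with looping over the list and calling re.sub once per abbreviation, this makes one pass over the text regardless of how many abbreviations are listed.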
  • How is accuracy on OPUS-100 computed?

    Hi! Thanks for this library.

    Since there is no notion of documents in the OPUS-100 dataset it is not clear to me how accuracy is computed. I tried a naive approach using pairwise joining of sentences:

    from datasets import load_dataset
    import pysbd
    
    if __name__ == "__main__":
        sentences = [
            sample["de"].strip()
            for sample in load_dataset("opus100", "de-en", split="test")["translation"]
        ]
    
        correct = 0
        total = 0
    
        segmenter = pysbd.Segmenter(language="de")
    
        for sent1, sent2 in zip(sentences, sentences[1:]):
            out = tuple(
                s.strip() for s in segmenter.segment(sent1 + " " + sent2)
            )
    
            total += 1
    
            if out == (sent1, sent2):
                correct += 1
    
        print(f"{correct}/{total} = {correct / total}")
    

    But I get 1011/1999 = 50.6% Accuracy which is not close to the 80.95% Accuracy reported in the paper.

    Thanks for any help!

    opened by bminixhofer 1
  • Added decorator as required by latest SpaCy

    Hello!

    In using pySBD, I've noticed that the current example script no longer works with the latest version of SpaCy (3.3.0). This is the traceback I get:

    Traceback (most recent call last):
      File "/Users/lucas/Code/significant-statements-extraction/scripts/test_pysbd.py", line 27, in <module>
        nlp.add_pipe(pysbd_sentence_boundaries)
      File "/Users/lucas/miniforge3/envs/pytorch_p39/lib/python3.9/site-packages/spacy/language.py", line 773, in add_pipe
        raise ValueError(err)
    ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function pysbd_sentence_boundaries at 0x11ffa9160> (name: 'None').
    
    - If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
    
    - If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
    
    - If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
    

    This pull request adds a @Language.component decorator to make pySBD available in SpaCy again.

    opened by soldni 0
  • Arabic sentence split on the Arabic comma

    Describe the bug Arabic sentence split on the Arabic comma.

    To Reproduce Steps to reproduce the behavior:

    import pysbd
    text = "هذه تجربة، للغة العربية"
    seg = pysbd.Segmenter(language="ar", clean=True)
    print(seg.segment(text))
    

    Output: ['هذه تجربة،', 'للغة العربية']

    Expected behavior The text should not be split on the Arabic comma. Expected output: ['هذه تجربة، للغة العربية']

    Additional context I locally fixed it by modifying the file: pysbd/lang/arabic.py, deleting ، from SENTENCE_BOUNDARY_REGEX.

    opened by ymoslem 0
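
The effect of the local fix can be illustrated with two simplified boundary patterns, one that treats the Arabic comma (،) as a sentence terminator and one that does not. Both patterns and the split helper are assumptions made for this sketch; the actual SENTENCE_BOUNDARY_REGEX in pysbd/lang/arabic.py may look different.

```python
import re

text = "هذه تجربة، للغة العربية"

# Hypothetical patterns: match up to a terminator, or the remaining text.
WITH_COMMA = re.compile(r".*?[.!?؟،]|.+$")
WITHOUT_COMMA = re.compile(r".*?[.!?؟]|.+$")


def split(pattern: "re.Pattern[str]", text: str) -> list:
    """Collect the non-empty, stripped matches of a boundary pattern."""
    return [m.group().strip() for m in pattern.finditer(text) if m.group().strip()]


print(split(WITH_COMMA, text))     # two pieces, split at the comma
print(split(WITHOUT_COMMA, text))  # one piece, comma kept inside
```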
  • Does pysbd delete sentences after detection ?

    Hey there, I've been using pysbd to detect sentence boundaries in Hindi and Marathi text and then save the same data rearranged from paragraphs into one sentence per sample. Unfortunately the storage size has gone down from 22 GB to 14.5 GB after just detecting boundaries and saving them per sentence. And yes, I did turn off the clean argument.

    opened by StephennFernandes 0
  • Update pysbd_as_spacy_component.py

    Thanks for a great sentence splitting package. This is a small contribution after troubleshooting why the code was not working out of the box: spaCy v3 requires a string in the add_pipe() call, and the component needs to be declared using the Language decorator. See also https://spacy.io/usage/processing-pipelines#custom-components. Hope it helps other users.

    opened by guebeln0 0
Releases(v0.3.4)
  • v0.3.4(Feb 11, 2021)

  • v0.3.3(Oct 8, 2020)

  • v0.3.2(Sep 11, 2020)

  • v0.3.1(Aug 11, 2020)

  • v0.3.0(Aug 11, 2020)

    • ✨ 💫 Support Multiple languages - #2
    • 🐎⚡️💯 Benchmark across Segmentation Tools, Libraries and Algorithms
    • 🎨 ♻️ Update sentence char_span logic
    • ⚡️ Performance improvements - #41
    • ♻️🐛 Refactor AbbreviationReplacer
  • v0.3.0rc(Jun 9, 2020)

    • ✨ 💫 sent char_span through with spaCy & regex approach - #63
    • ♻️ Refactoring to support multiple languages
    • ✨ 💫 Initial language support for Hindi, Marathi, Chinese, Spanish
    • ✅ Updated tests - more coverage & regression tests for issues
    • 👷👷🏻‍♀️ GitHub actions for CI-CD
    • 💚☂️ Add code coverage - coverage.py, add Codecov
    • 🐛 Fix incorrect text span & vanilla pysbd vs spacy output discrepancy - #49, #53, #55, #59
    • 🐛 Fix NUMBERED_REFERENCE_REGEX for zero or one time - #58
    • 🔐 Fix security vulnerability bleach - #62
  • v0.2.3(Nov 13, 2019)

  • v0.2.2(Nov 1, 2019)

  • v0.2.1(Oct 30, 2019)

  • v0.2.0(Oct 25, 2019)

    • ✨ Add char_span parameter (optional) to get sentence & its (start, end) char offsets from original text
    • ✨ pySBD as a spaCy component example
    • 🐛 Fix double question mark swallow bug - #39
  • v0.1.5(Oct 24, 2019)

  • v0.1.4(Oct 20, 2019)

    • ✨ ✅ Handle intermittent punctuations: added special case r"[。．.！!?].*" to handle intermittent dots, exclamation marks, etc. The special cases group can be updated as per developer needs - #34
  • v0.1.3(Oct 19, 2019)

    • 🐛 Fix lists_item_replacer - #29
    • 🐛 Fix & ♻️ refactor replace_multi_period_abbreviations - #30
    • 🐛 Fix abbreviation_replacer - #31
    • ✅ Add regression tests for issues
  • v0.1.2(Oct 18, 2019)

  • v0.1.1(Oct 9, 2019)

Owner
Nipun Sadvilkar
I like to explore Jungle of Data with Python as my swiss knife with pandas, numpy, matplotlib and scikit-learn as its multi-tools 😅