Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

Overview

PyStanfordDependencies

https://travis-ci.org/dmcc/PyStanfordDependencies.svg?branch=master https://badge.fury.io/py/PyStanfordDependencies.png https://coveralls.io/repos/dmcc/PyStanfordDependencies/badge.png?branch=master

Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies.

Example usage

Start by getting a StanfordDependencies instance with StanfordDependencies.get_instance():

>>> import StanfordDependencies
>>> sd = StanfordDependencies.get_instance(backend='subprocess')

get_instance() takes several options. backend can currently be subprocess or jpype (see below). If you have an existing Stanford CoreNLP or Stanford Parser jar file, use the jar_filename parameter to point to the full path of the jar file. Otherwise, PyStanfordDependencies will download a jar file for you and store it in locally (~/.local/share/pystanforddeps). You can request a specific version with the version flag, e.g., version='3.4.1'. To convert trees, use the convert_trees() or convert_tree() method (note that by default, convert_trees() can be considerably faster if you're doing batch conversion). These return a sentence (list of Token objects) or a list of sentences (list of list of Token objects) respectively:

>>> sent = sd.convert_tree('(S1 (NP (DT some) (JJ blue) (NN moose)))')
>>> for token in sent:
...     print token
...
Token(index=1, form='some', cpos='DT', pos='DT', head=3, deprel='det')
Token(index=2, form='blue', cpos='JJ', pos='JJ', head=3, deprel='amod')
Token(index=3, form='moose', cpos='NN', pos='NN', head=0, deprel='root')

This tells you that moose is the head of the sentence and is modified by some (with a det = determiner relation) and blue (with an amod = adjective modifier relation). Fields on Token objects are readable as attributes. See docs for additional options in convert_tree() and convert_trees().

Visualization

If you have the asciitree package, you can use a prettier ASCII formatter:

>>> print sent.as_asciitree()
 moose [root]
  +-- some [det]
  +-- blue [amod]

If you have Python 2.7 or later, you can use Graphviz to render your graphs. You'll need the Python graphviz package to call as_dotgraph():

>>> dotgraph = sent.as_dotgraph()
>>> print dotgraph
digraph {
        0 [label=root]
        1 [label=some]
                3 -> 1 [label=det]
        2 [label=blue]
                3 -> 2 [label=amod]
        3 [label=moose]
                0 -> 3 [label=root]
}
>>> dotgraph.render('moose') # renders a PDF by default
'moose.pdf'
>>> dotgraph.format = 'svg'
>>> dotgraph.render('moose')
'moose.svg'

The Python xdot package provides an interactive visualization:

>>> import xdot
>>> window = xdot.DotWindow()
>>> window.set_dotcode(dotgraph.source)

Both as_asciitree() and as_dotgraph() allow customization. See the docs for additional options.

Backends

Currently PyStanfordDependencies includes two backends:

  • subprocess (works anywhere with a java binary, but more overhead so batched conversions with convert_trees() are recommended)
  • jpype (requires jpype1, faster than the subprocess backend, also includes access to the Stanford CoreNLP lemmatizer)

By default, PyStanfordDependencies will attempt to use the jpype backend. If jpype isn't available or crashes on startup, PyStanfordDependencies will fallback to subprocess with a warning.

Universal Dependencies status

PyStanfordDependencies supports most features in Universal Dependencies (see issue #10 for the most up to date status). PyStanfordDependencies output matches Universal Dependencies in terms of structure and dependency labels, but Universal POS tags and features are missing. Currently, PyStanfordDependencies will output Universal Dependencies by default (unless you're using Stanford CoreNLP 3.5.1 or earlier).

Related projects

More information

Licensed under Apache 2.0.

Written by David McClosky (homepage, code)

Bug reports and feature requests: GitHub issue tracker

Release summaries

  • 0.3.1 (2015.11.02): Better collapsed universal handling, bugfixes
  • 0.3.0 (2015.10.09): Support copy nodes, more input checking/debugging help, example convert.py program
  • 0.2.0 (2015.08.02): Universal Dependencies support (mostly), Python 3 support (fully), minor API updates
  • 0.1.7 (2015.06.13): Bugfixes for JPype, handle version mismatches in IBM Java
  • 0.1.6 (2015.02.12): Support for graphviz formatting, CoreNLP 3.5.1, better Windows portability
  • 0.1.5 (2015.01.10): Support for ASCII tree formatting
  • 0.1.4 (2015.01.07): Fix CCprocessed support
  • 0.1.3 (2015.01.03): Bugfixes, coveralls integration, refactoring
  • 0.1.2 (2015.01.02): Better CoNLL structures, test suite and Travis CI support, bugfixes
  • 0.1.1 (2014.12.15): More docs, fewer bugs
  • 0.1 (2014.12.14): Initial release
Comments
  • Sentence.from_stanford_dependencies() fails on collapsed (enhanced) dependency strings

    Sentence.from_stanford_dependencies() fails on collapsed (enhanced) dependency strings

    Below is an example where the function fails at assertion: assert len(matches) == 1 (CoNLL.py, line 209)

    Universal dependencies, enhanced nsubj(reach-3, Visitors-1) nsubj(reach-3', Visitors-1) aux(reach-3, can-2) root(ROOT-0, reach-3) conj:and(reach-3, reach-3') dobj(reach-3, it-4) advmod(reach-3, only-5) case(escort-9, under-6) amod(escort-9, strict-7) amod(escort-9, military-8) nmod:under(reach-3, escort-9) cc(reach-3, and-10) case(permission-13, with-11) amod(permission-13, prior-12) nmod:with(reach-3', permission-13) case(Pentagon-16, from-14) det(Pentagon-16, the-15) nmod:from(permission-13, Pentagon-16) case(flights-22, aboard-18) amod(flights-22, special-19) amod(flights-22, small-20) compound(flights-22, shuttle-21) nmod:aboard(reach-3, flights-22) nsubj(reach-24, flights-22) ref(flights-22, that-23) acl:relcl(flights-22, reach-24) det(base-26, the-25) dobj(reach-24, base-26) case(flight-30, by-27) det(flight-30, a-28) amod(flight-30, circuitous-29) nmod:by(reach-24, flight-30) case(States-34, from-31) det(States-34, the-32) compound(States-34, United-33) nmod:from(flight-30, States-34)

    My guess is that relations such as nsubj(reach-3', Visitors-1) are not catched by the regex. Am I missing anything? Thanks!

    opened by ccsasuke 13
  • Getting [Error 32] trying to parse tree from example

    Getting [Error 32] trying to parse tree from example

    Hello, David.

    I'm getting Windows [Error 32] error when I'm trying to parse tree from example. Here is code:

    sd = StanfordDependencies.get_instance(backend='subprocess') sent = sd.convert_tree('(S1 (NP (DT some) (JJ blue) (NN moose)))')

    Next error shows Visual Studio: [Error 32] Ïðîöåñó íå âäàëîñÿ îòðèìàòè äîñòóï äî ôàéëó,: 'c:\users\sergiy\appdata\local\temp\tmpmd8c8k' *file name differs all the time

    **I've tried to use another constructor, using jar_filename parameter - same exception

    ***I've tried to install JPypeBackend - it didn't help. It started failing when I was trying to call get_instance method.

    Maybe i'm doing something wrong, but if there is problem, pleace take a look.

    Thanks a lot)

    opened by MisterMeUA 5
  • Stanford Dependency returned for Sentence does not match.

    Stanford Dependency returned for Sentence does not match.

    Hello,

    The sample sentence I used is: "Janet had prune juice today before lunch." When I use StanfordCoreNLP in R and run it I get the result:

    (ROOT (S (NP (NNP Janet)) (VP (VBD had) (S (VP (VB prune) (NP (NN juice)) (NP-TMP (NN today)) (PP (IN before) (NP (NN lunch)))))) (. .)))

    Using pyStanfordDependencies, I get:

    (S (NP (NNP Janet)) (VP (VBD had) (VP (VBN prune) (NP (NN juice) (NN today)) (PP (IN before) (NP (NN lunch))))) (. .))

    This difference makes it difficult to apply rules to get triples from the sentence. Kindly review. Maybe I am making a mistake somewhere.

    Regards, Bonson

    opened by bonsonsm 3
  • Differences in using subprocess and jpype backends

    Differences in using subprocess and jpype backends

    Hi,

    I got different results when using two different backends with same stanford corenlp jar. It seems like the result from subprocess is identical to the one from Stanford online demo. I've also gone through the python code but still couldn't figure it out.

    I'd be appreciated if you can offer me any advice.

    opened by leonli02 3
  • AttributeError: type object 'edu.stanford.nlp.process.Morphology' has no attribute 'stemStaticSynchronized'

    AttributeError: type object 'edu.stanford.nlp.process.Morphology' has no attribute 'stemStaticSynchronized'

    import StanfordDependencies
    sd = StanfordDependencies.get_instance(backend='jpype', jar_filename='C:/project_ck/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar')
    

    Rase this error.

    Beside, how to use multiple jar file?

    opened by bifeng 2
  • CoNLL-X data format URL link not working

    CoNLL-X data format URL link not working

    @dmcc URL mentioned in class Token is no more available.

    This could be updated with: CoNLL-X shared task on Multilingual Dependency Parsing by Buchholz and Marsi(2006) http://aclweb.org/anthology/W06-2920 Section 3

    If you want, I can update.

    opened by kaushikacharya 1
  • adding close() on temp file for fixing bug #15 and #51

    adding close() on temp file for fixing bug #15 and #51

    Closing the temp file before trying to remove it. solving error code 32 "WindowsError: [Error 32] The process cannot access the file: tempfile" on bugs #15 and #51

    opened by mens2lux 1
  • Reopening issue #14

    Reopening issue #14

    Opening a new issue since I could not reopen it. Details are in the comments of issue #14 . I'm opening this one just in case you won't get notified for comments of a closed issue.

    opened by ccsasuke 1
  • Conversion of NLTK tree to PTB format

    Conversion of NLTK tree to PTB format

    The convert_tree() function is not able to form dependencies for a nltk tree and an alternate conversion from nltk to ptb doesnt work

    [via http://stackoverflow.com/a/29614388/1118542]

    opened by anirudh708 1
  • JPypeBackend initialization returns AttributeError for CoreNLP >= 3.5.0

    JPypeBackend initialization returns AttributeError for CoreNLP >= 3.5.0

    When initializing a JPypeBackend object, the puncFilter attribute is set to trees.PennTreebankLanguagePack().punctuationWordRejectFilter().accept (line 52 in JPypeBackend.py). However, for CoreNLP versions >= 3.5.0, this results in an AttributeError: 'edu.stanford.nlp.util.Filters$NegatedFilter' object has no attribute 'accept'.

    The solution is to change the line to change line 52 to self.puncFilter = trees.PennTreebankLanguagePack().punctuationWordRejectFilter().test. That breaks compatibility with CoreNLP versions < 3.5.0. I worked out a hacky version check using java.util.jar.JarInputStream(stream).getManifest(). If you like to retain compatibility with older CoreNLP versions, I could fork and send a pull request. Otherwise it is a quick fix.

    bug 
    opened by Tiepies 1
  • AttributeError: Java package 'edu' is not valid

    AttributeError: Java package 'edu' is not valid

    For some reason after the code automatically downloads the .jar file from http://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/3.5.2/stanford-corenlp-3.5.2.jar and puts it in /root/.local/share/pystanforddeps/, get an error from StanfordDependencies/JPypeBackend.py: AttributeError: Java package 'edu' is not valid Please assist. Thank you.

    opened by MaryFllh 0
  • jpype fails when using with flask

    jpype fails when using with flask

    He, I wrapped your library in a flask app and had JPype fail due to an unsafe thread issue. I had to modify the JPypeBackend.py file to attach the thread to the JVM. Changes start on line 45:

    num_thread = jpype.isThreadAttachedToJVM()
    if num_thread is not 1:
         jpype.attachThreadToJVM()
    

    JPypeBackend.py.zip Attached the modified file here

    opened by staplet3 2
  • Strange KeyError

    Strange KeyError

    I ran into an error with this tree from CoNLL-2012 dataset:

    In [1]: import StanfordDependencies
    
    In [2]: sd = StanfordDependencies.get_instance()
    
    In [3]: sd.convert_trees(['(TOP (S (CC But) (PRN (S (NP (PRP you)) (VP (VBP know)))) (NP (PRP you)) (VP (VBP look) (PP (IN at) (NP (NP (DT this) (NN guy)) (PRN (S (NP (PRP you))
       ...:  (VP (VBP know)))) (VP (VP (VBG punching) (NP (DT the) (CD one) (NN guy))) (VP (VBG grabbing) (NP (DT the) (NNP AP) (NN producer)) (PRN (S (NP (PRP you)) (VP (VBP know))
       ...: ))))))) (. /.)))'])
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-3-e204c241ff5e> in <module>()
    ----> 1 sd.convert_trees(['(TOP (S (CC But) (PRN (S (NP (PRP you)) (VP (VBP know)))) (NP (PRP you)) (VP (VBP look) (PP (IN at) (NP (NP (DT this) (NN guy)) (PRN (S (NP (PRP you)) (VP (VBP know)))) (VP (VP (VBG punching) (NP (DT the) (CD one) (NN guy))) (VP (VBG grabbing) (NP (DT the) (NNP AP) (NN producer)) (PRN (S (NP (PRP you)) (VP (VBP know))))))))) (. /.)))'])
    
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/StanfordDependencies/StanfordDependencies.py in convert_trees(self, ptb_trees, representation, universal, include_punct, include_erased, **kwargs)
        114                       include_erased=include_erased)
        115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
    --> 116                       for ptb_tree in ptb_trees)
        117
        118     @abstractmethod
    
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/StanfordDependencies/StanfordDependencies.py in <genexpr>(.0)
        114                       include_erased=include_erased)
        115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
    --> 116                       for ptb_tree in ptb_trees)
        117
        118     @abstractmethod
    
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/StanfordDependencies/JPypeBackend.py in convert_tree(self, ptb_tree, representation, include_punct, include_erased, add_lemmas, universal)
        139
        140         if representation == 'basic':
    --> 141             sentence.renumber()
        142         return sentence
        143     def stem(self, form, tag):
    
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/StanfordDependencies/CoNLL.py in renumber(self)
        109             self[:] = [token._replace(index=mapping[token.index],
        110                                       head=mapping[token.head])
    --> 111                        for token in self]
        112     def as_conll(self):
        113         """Represent this Sentence as a string in CoNLL-X format.  Note
    
    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/StanfordDependencies/CoNLL.py in <listcomp>(.0)
        109             self[:] = [token._replace(index=mapping[token.index],
        110                                       head=mapping[token.head])
    --> 111                        for token in self]
        112     def as_conll(self):
        113         """Represent this Sentence as a string in CoNLL-X format.  Note
    
    KeyError: 11
    
    opened by minhlab 0
  • Error of Jpypebackend when trying example

    Error of Jpypebackend when trying example

    Hi David,

    I'm trying the example to produce dependencies from a parsed sentence using Stanford Parser. When I use your code: sd = StanfordDependencies.get_instance(jar_filename="/home/stanford-parser/stanford-parser.jar") it pops up the error: UserWarning: Error importing JPypeBackend, falling back to SubprocessBackend. raise ValueError("Bad exit code from Stanford CoreNLP") ValueError: Bad exit code from Stanford CoreNLP

    Any information would be highly appreciated!

    Thanks! Yiru

    opened by YiruS 3
  • Support CoreNLP 3.6.0

    Support CoreNLP 3.6.0

    CoreNLP version 3.6.0 has (at least) two changes which break PyStanfordDependencies:

    • [x] stemStaticSynchronized was renamed to stemStatic
    • [ ] This stack trace shows up for all SubprocessBackend conversion tests:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
        at edu.stanford.nlp.io.IOUtils.<clinit>(IOUtils.java:42)
        at edu.stanford.nlp.trees.MemoryTreebank.processFile(MemoryTreebank.java:302)
        at edu.stanford.nlp.util.FilePathProcessor.processPath(FilePathProcessor.java:84)
        at edu.stanford.nlp.trees.MemoryTreebank.loadPath(MemoryTreebank.java:152)
        at edu.stanford.nlp.trees.Treebank.loadPath(Treebank.java:180)
        at edu.stanford.nlp.trees.Treebank.loadPath(Treebank.java:151)
        at edu.stanford.nlp.trees.Treebank.loadPath(Treebank.java:137)
        at edu.stanford.nlp.trees.GrammaticalStructure.main(GrammaticalStructure.java:1702)
    Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 8 more
    }
    

    (comes from a command line like this: java -ea -cp /path/to/stanford-corenlp-3.6.0.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -treeFile treefile -keepPunct -originalDependencies)

    @gangeli, is slf4j required to run CoreNLP 3.6.0?

    bug 
    opened by dmcc 10
  • jre has value 1.8 but 1.7 required and then CoreNLP needs 1.8+

    jre has value 1.8 but 1.7 required and then CoreNLP needs 1.8+

    edited registry to 1.7 then got

    JavaRuntimeVersionError too old must use 1.8+ for CoreNLP

    I am using the jar_filename parameter to point to the recent stanford-parser.jar

    Thanks!

    opened by ccrowner 9
  • Better Universal Dependencies support

    Better Universal Dependencies support

    This would involve at least the following:

    1. ~~Add the -originalDependencies option for both backends.~~
    2. Find a way to download the feature mapping and include it in the classpath. It's included in the giant models jar files, so we could include those, but it seems overkill to download these if we can avoid it.
    3. Populate the features field with features from universal dependencies (requires 2.)
    4. Map the POS tags to their Universal counterparts.
    enhancement 
    opened by dmcc 0
Releases(v0.3.1)
Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

Textpipe 298 Nov 21, 2022
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022
glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

Glow-Speak glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end. Installation git clone https://g

Rhasspy 8 Dec 25, 2022
PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI

data2vec-pytorch PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI (F

Aryan Shekarlaban 105 Jan 04, 2023
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
sangha, pronounced "suhng-guh", is a social networking, booking platform where students and teachers can share their practice.

Flask React Project This is the backend for the Flask React project. Getting started Clone this repository (only this branch) git clone https://github

Courtney Newcomer 17 Sep 29, 2021
Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Jan 2 Apr 20, 2022
jiant is an NLP toolkit

🚨 Update 🚨 : As of 2021/10/17, the jiant project is no longer being actively maintained. This means there will be no plans to add new models, tasks,

ML² AT CILVR 1.5k Dec 28, 2022
An open-source NLP library: fast text cleaning and preprocessing.

An open-source NLP library: fast text cleaning and preprocessing

Iaroslav 21 Mar 18, 2022
open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 02, 2022
Mycroft Core, the Mycroft Artificial Intelligence platform.

Mycroft Mycroft is a hackable open source voice assistant. Table of Contents Getting Started Running Mycroft Using Mycroft Home Device and Account Man

Mycroft 6.1k Jan 09, 2023
simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Quickly train T5 models in just 3 lines of code + ONNX support simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quic

Shivanand Roy 220 Dec 30, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
A Multi-modal Model Chinese Spell Checker Released on ACL2021.

ReaLiSe ReaLiSe is a multi-modal Chinese spell checking model. This the office code for the paper Read, Listen, and See: Leveraging Multimodal Informa

DaDa 106 Dec 29, 2022
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Pretrained Language Model This repository provides the latest pretrained language models and its related optimization techniques developed by Huawei N

HUAWEI Noah's Ark Lab 2.6k Jan 08, 2023
MHtyper is an end-to-end pipeline for recognized the Forensic microhaplotypes in Nanopore sequencing data.

MHtyper is an end-to-end pipeline for recognized the Forensic microhaplotypes in Nanopore sequencing data. It is implemented using Python.

willow 6 Jun 27, 2022
Voilà turns Jupyter notebooks into standalone web applications

Rendering of live Jupyter notebooks with interactive widgets. Introduction Voilà turns Jupyter notebooks into standalone web applications. Unlike the

Voilà Dashboards 4.5k Jan 03, 2023
Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Applied Natural Language Processing in the Enterprise This is the companion repo for Applied Natural Language Processing in the Enterprise, an O'Reill

Applied Natural Language Processing in the Enterprise 95 Jan 05, 2023
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

1.4k Dec 30, 2022