Japanese synonym library

Overview

chikkarpy

PyPi version test

chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar.

chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

単体でも同義語辞書の検索ツールとして利用できます。

利用方法 Usage

TL;DR

$ pip install chikkarpy

$ echo "閉店" | chikkarpy
閉店    クローズ,close,店仕舞い

Step 1. chikkarpyのインストール

$ pip install chikkarpy

Step 2. 使用方法

コマンドライン

$ echo "閉店" | chikkarpy
閉店    クローズ,close,店仕舞い

chikkarpyは入力された単語を見て一致する同義語のリストを返します。 同義語辞書内の曖昧性フラグが1の見出し語をトリガーにすることはできません。 出力はクエリ\t同義語リストの形式です。

$ chikkarpy search -h
usage: chikkarpy search [-h] [-d [file [file ...]]] [-ev] [-o file] [-v]
                        [file [file ...]]

Search synonyms

positional arguments:
  file                  text written in utf-8

optional arguments:
  -h, --help            show this help message and exit
  -d [file [file ...]]  synonym dictionary (default: system synonym
                        dictionary)
  -ev                   Enable verb and adjective synonyms.
  -o file               the output file
  -v, --version         print chikkarpy version

自分で用意したユーザー辞書を使いたい場合は-dで読み込むバイナリ辞書を指定できます。 (バイナリ辞書のビルドは辞書の作成を参照してください。) 複数辞書を読み込む場合は順番に注意してください。 以下の場合,user2 > user > system の順で同義語を検索して見つかった時点で検索結果を返します。

chikkarpy -d system.dic user.dic user2.dic

また、出力はデフォルトで体言のみです。 用言も出力したい場合は-evを有効にしてください。

$ echo "開放" | chikkarpy
開放	オープン,open
$ echo "開放" | chikkarpy -ev
開放	開け放す,開く,オープン,open

python ライブラリ

使用例

from chikkarpy import Chikkar
from chikkarpy.dictionarylib import Dictionary

chikkar = Chikkar()

system_dic = Dictionary("system.dic", False)
chikkar.add_dictionary(system_dic)

print(chikkar.find("閉店"))
# => ['クローズ', 'close', '店仕舞い']

print(chikkar.find("閉店", group_ids=[5])) # グループIDによる検索
# => ['クローズ', 'close', '店仕舞い']

print(chikkar.find("開放"))
# => ['オープン', 'open']

chikkar.enable_verb() # 用言の出力制御(デフォルトは体言のみ出力)
print(chikkar.find("開放"))
# => ['開け放す', '開く', 'オープン', 'open']

chikkar.add_dictionary()で複数の辞書を読み込ませる場合は順番に注意してください。 最後に読み込んだ辞書を優先して検索します。

辞書の作成 Build a dictionary

新しく辞書を追加する場合は、利用前にバイナリ形式辞書の作成が必要です。 Before using new dictionary, you need to create a binary format dictionary.

$ chikkarpy build -i synonym_dict.csv -o system.dic 
$ chikkarpy build -h
usage: chikkarpy build [-h] -i file [-o file] [-d string]

Build Synonym Dictionary

optional arguments:
  -h, --help  show this help message and exit
  -i file     dictionary file (csv)
  -o file     output file (default: synonym.dic)
  -d string   description comment to be embedded on dictionary

開発者向け

Code Format

scripts/lint.sh を実行して、コードが正しいフォーマットかを確認してください。

flake8 flake8-import-order flake8-builtins が必要です。

Test

scripts/test.sh を実行してテストしてください。

Contact

chikkarpyはWAP Tokushima Laboratory of AI and NLPによって開発されています。

開発者やユーザーの方々が質問したり議論するためのSlackワークスペースを用意しています。

You might also like...
Script to download some free japanese lessons in portuguse from NHK
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

PyJPBoatRace: Python-based Japanese boatrace tools 🚤

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

A Japanese tokenizer based on recurrent neural networks
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

This repository has a implementations of data augmentation for NLP for Japanese.

daaja This repository has a implementations of data augmentation for NLP for Japanese: EDA: Easy Data Augmentation Techniques for Boosting Performance

Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!
Comments
  • pip install does not work under SudachiPy 0.6.x environment / SudachiPy 0.6.x の環境下で pip install が通らない

    pip install does not work under SudachiPy 0.6.x environment / SudachiPy 0.6.x の環境下で pip install が通らない

    temporary solution / 暫定的な解決方法

    Install SudachiPy 0.5.4, then chikkarpy, then reinstall the latest version of SudachiPy. SudachiPy 0.5.4 をインストールしてから、chikkarpy をインストールし、その後 SudachiPy 最新版を再インストールする。

    pip install sudachipy==0.5.4 --upgrade
    pip install sudachidict_core
    pip install chikkarpy
    pip install sudachipy --upgrade
    
    opened by Nishihara-Daiki 1
  • chikkarpy has no attribute 'dictionarylib' in certain cases

    chikkarpy has no attribute 'dictionarylib' in certain cases

    case 1: raised ERROR if call chikkarpy.dictionarylib

    $ pip install chikkarpy
    $ python
    >>> import chikkarpy
    >>> chikkarpy.dictionarylib
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: module 'chikkarpy' has no attribute 'dictionarylib'
    

    case 2: pass if use from chikkarpy import dictionarylib

    $ pip install chikkarpy
    $ python
    >>> from chikkarpy import dictionarylib
    >>> dictionarylib
    <module 'chikkarpy.dictionarylib' from '/usr/local/lib/python3.7/dist-packages/chikkarpy/dictionarylib/__init__.py'>
    

    case 3: pass if call chikkarpy.dictionarylib AFTER from chikkarpy import dictionarylib

    $ pip install chikkarpy
    $ python
    >>> import chikkarpy
    >>> from chikkarpy import dictionarylib
    >>> chikkarpy.dictionarylib
    <module 'chikkarpy.dictionarylib' from '/usr/local/lib/python3.7/dist-packages/chikkarpy/dictionarylib/__init__.py'>
    
    opened by Nishihara-Daiki 0
Releases(v0.1.1)
Owner
Works Applications
Works Applications
⚖️ A Statutory Article Retrieval Dataset in French.

A Statutory Article Retrieval Dataset in French This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code

Maastricht Law & Tech Lab 19 Nov 17, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
Few-shot Natural Language Generation for Task-Oriented Dialog

Few-shot Natural Language Generation for Task-Oriented Dialog This repository contains the dataset, source code and trained model for the following pa

172 Dec 13, 2022
A desktop GUI providing an audio interface for GPT3.

Jabberwocky neil_degrasse_tyson_with_audio.mp4 Project Description This GUI provides an audio interface to GPT-3. My main goal was to provide a conven

16 Nov 27, 2022
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
Understanding the Difficulty of Training Transformers

Admin Understanding the Difficulty of Training Transformers Guided by our analyses, we propose Adaptive Model Initialization (Admin), which successful

Liyuan Liu 300 Dec 29, 2022
Hostapd-mac-tod-acl - Setup a hostapd AP with MAC ToD ACL

A brief explanation This script provides a quick way to setup a Time-of-day (Tod

2 Feb 03, 2022
Ask for weather information like a human

weather-nlp About Ask for weather information like a human. Goals Understand typical questions like: Hourly temperatures in Potsdam on 2020-09-15. Rai

5 Oct 29, 2022
Anuvada: Interpretable Models for NLP using PyTorch

Anuvada: Interpretable Models for NLP using PyTorch So, you want to know why your classifier arrived at a particular decision or why your flashy new d

EDGE 102 Oct 01, 2022
Image2pcl - Enter the metaverse with 2D image to 3D projections

Image2PCL Enter the metaverse with 2D image to 3D projections! This is an implem

Benjamin Ho 0 Feb 05, 2022
Code from the paper "High-Performance Brain-to-Text Communication via Handwriting"

Code from the paper "High-Performance Brain-to-Text Communication via Handwriting"

Francis R. Willett 305 Dec 22, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
The entmax mapping and its loss, a family of sparse softmax alternatives.

entmax This package provides a pytorch implementation of entmax and entmax losses: a sparse family of probability mappings and corresponding loss func

DeepSPIN 330 Dec 22, 2022
official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
nlp基础任务

NLP算法 说明 此算法仓库包括文本分类、序列标注、关系抽取、文本匹配、文本相似度匹配这五个主流NLP任务,涉及到22个相关的模型算法。 框架结构 文件结构 all_models ├── Base_line │   ├── __init__.py │   ├── base_data_process.

zuxinqi 23 Sep 22, 2022
Chinese Grammatical Error Diagnosis

nlp-CGED Chinese Grammatical Error Diagnosis 中文语法纠错研究 基于序列标注的方法 所需环境 Python==3.6 tensorflow==1.14.0 keras==2.3.1 bert4keras==0.10.6 笔者使用了开源的bert4keras

12 Nov 25, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
Huggingface Transformers + Adapters = ❤️

adapter-transformers A friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models adapter-transformers is an extension of

AdapterHub 1.2k Jan 09, 2023
Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Speaker-Embeddings-Correlation-Pooling This is the original implementation of the pooling method introduced in "Speaker embeddings by modeling channel

Themos Stafylakis 10 Apr 30, 2022