Persian Lexicon

This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.

GetWords.py can read these files and return words as a list of strings.
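For readers who want to load a word file without going through GetWords.py, the following is a minimal sketch. It assumes each file is UTF-8 text with one word per line, which this README does not state, and `read_words` is a hypothetical helper name, not the actual interface of GetWords.py.

```python
from pathlib import Path


def read_words(path):
    """Return the non-empty lines of a UTF-8 text file as a list of strings.

    Assumption: the lexicon files hold one word per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Under that assumption, `read_words("data/persian-words.txt")` would return the main lexicon as a list of strings.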

Cleanup details

Main Lexicon

The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
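The liberal filter described above could be sketched roughly as follows. This is an illustration, not the code used to build the lexicon: `keep_word` is a hypothetical name, and "Arabic numerals" is interpreted here as the Arabic-Indic digits U+0660 to U+0669, which is an assumption (the Western digits 0-9 are already covered by the ASCII check).

```python
# Assumption: "Arabic numerals" means the Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC_DIGITS = {chr(code) for code in range(0x0660, 0x066A)}


def keep_word(word):
    """Return True if the word has no ASCII character and no Arabic-Indic digit."""
    return not any(ch.isascii() or ch in ARABIC_INDIC_DIGITS for ch in word)


def filter_words(words):
    """Keep only the words that pass keep_word, preserving input order."""
    return [word for word in words if keep_word(word)]
```

A word is dropped as soon as a single offending character is found, so mixed tokens such as a Persian word with a Latin letter or digit attached are removed as a whole.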

Fixed length Lexicons

More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:

After applying these filters, we ended up with the following number of words per file:

2 letter words: 310 unique words
3 letter words: 2378 unique words
4 letter words: 7059 unique words
5 letter words: 10043 unique words
6 letter words: 9541 unique words
7 letter words: 7350 unique words
8 letter words: 4681 unique words
9 letter words: 2529 unique words
10 letter words: 1250 unique words
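Splitting a word list into fixed-length groups like the ones counted above could look like the sketch below. It assumes that word length is measured in Unicode code points (what Python's `len` returns), which this README does not specify, and it does not reproduce the conservative character filter. `bucket_by_length` is a hypothetical helper.

```python
from collections import defaultdict


def bucket_by_length(words, min_len=2, max_len=10):
    """Group unique words by length, keeping only lengths in [min_len, max_len].

    Assumption: length is counted in Unicode code points via len().
    """
    buckets = defaultdict(set)
    for word in words:
        if min_len <= len(word) <= max_len:
            buckets[len(word)].add(word)  # the set removes duplicates
    # Return plain sorted lists, ordered by word length.
    return {length: sorted(group) for length, group in sorted(buckets.items())}
```

The number of unique words for a given length is then `len(buckets[n])` on the returned dictionary.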

Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Owner

Saman Vaisipour
