This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

Overview

Common Voice Utils

This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project. It aims to provide a one-stop-shop for utilities and data useful in training ASR and TTS systems.

Tools

  • Phonemiser:
    • A rudimentary grapheme to phoneme (g2p) system based on either:
      • a deterministic longest-match left-to-right replacement of orthographic units; or
      • a weighted finite-state transducer
  • Validator:
    • A validation and normalisation script.
    • It checks a sentence to see if it can be converted and if possible normalises the encoding, removes punctuation and returns it
  • Alphabet:
    • The relevant alphabet of the language, appropriate for use in training ASR
  • Segmenter:
    • A deterministic sentence segmentation algorithm tuned for segmenting paragraphs from Wikipedia
  • Corpora:
    • Contains metadata for different corpora you may be interested in using with Common Voice

Installation

The easiest way is with pip:

$ pip install git+https://github.com/ftyers/commonvoice-utils.git

How to use it

Command line tool

There is also a command line tool, covo /ˈkəʊvəʊ/ which aims to expose much of the functionality through the command line. Some examples on the next lines:

Process a Wikipedia dump

Use a Wikipedia dump to get text for a language mode in the right format:

$ covo dump mtwiki-latest-pages-articles.xml.bz2 | covo segment mt | covo norm mt
x'inhi l-wikipedija
il-wikipedija hi mmexxija mill-fondazzjoni wikimedia fondazzjoni mingħajr fini ta' lukru li tospita proġetti oħra b'kontenut ħieles u multilingwi
il-malti huwa l-ilsien nazzjonali tar-repubblika ta' malta
huwa l-ilsien uffiċjali flimkien mal-ingliż kif ukoll wieħed mill-ilsna uffiċjali tal-unjoni ewropea

Query the OPUS corpus collection

Get a list of URLs for a particular language from the OPUS corpus collection:

$ covo opus mt | sort -gr
23859 documents,69.4M tokens	https://object.pouta.csc.fi/OPUS-DGT/v2019/mono/mt.txt.gz
8665 documents,25.8M tokens	https://object.pouta.csc.fi/OPUS-JRC-Acquis/v3.0/mono/mt.txt.gz
5388 documents,8.9M tokens	https://object.pouta.csc.fi/OPUS-JW300/v1b/mono/mt.txt.gz
...

Convert grapheme input to phonemes

Get the grapheme to phoneme output for some arbitrary input:

$ echo "euskal herrian euskaraz" | covo phon eu
eus̺kal erian eus̺kaɾas̻

$ echo "قايتا نىشان بەلگىلەش ئورنى ئۇيغۇرچە ۋىكىپىدىيە" | covo phon ug
qɑjtɑ nɪʃɑn bɛlɡɪlɛʃ ornɪ ujʁurtʃɛ vɪkɪpɪdɪjɛ

Export data for use in Coqui STT

Designed for use with Coqui STT, converts to 16kHz mono-channel PCM .wav files and runs the transcripts through the validation step. In addition outputs .csv files for each of the input .tsv files.

$ covo export myv cv-corpus-8.0-2022-01-19/myv/
Loading TSV file:  cv-corpus-8.0-2022-01-19/myv/test.tsv
  Importing mp3 files...
  Imported 292 samples.
  Skipped 2 samples that were longer than 10 seconds.
  Final amount of imported audio: 0:27:03 from 0:27:23.
  Saving new Coqui STT-formatted CSV file to:  cv-corpus-8.0-2022-01-19/myv/clips/test.csv
  Writing CSV file for train.py as:  cv-corpus-8.0-2022-01-19/myv/clips/test.csv

Python module

The code can also be used as a Python module, here are some examples:

Alphabet

Returns an alphabet appropriate for end-to-end speech recognition.

>>> from cvutils import Alphabet
>>> a = Alphabet('cv')
>>> a.get_alphabet()
' -абвгдежзийклмнопрстуфхцчшщыэюяёҫӑӗӳ'

Corpora

Some miscellaneous tools for working with corpora:

>>> from cvutils import Corpora
>>> c = Corpora('kpv')
>>> c.dump_url()
'https://dumps.wikimedia.org/kvwiki/latest/kvwiki-latest-pages-articles.xml.bz2'
>>> c.target_segments()
[]
>>> c = Corpora('cv')
>>> c.target_segments()
['нуль', 'пӗрре', 'иккӗ', 'виҫҫӗ', 'тӑваттӑ', 'пиллӗк', 'улттӑ', 'ҫиччӗ', 'саккӑр', 'тӑххӑр', 'ҫапла', 'ҫук']
>>> c.dump_url()
'https://dumps.wikimedia.org/cvwiki/latest/cvwiki-latest-pages-articles.xml.bz2'

Grapheme to phoneme

For a given token, return an approximate broad phonemised version of it.

>>> from cvutils import Phonemiser
>>> p = Phonemiser('ab')
>>> p.phonemise('гӏапынхъамыз')
'ʕapənqaməz'

>>> p = Phonemiser('br')
>>> p.phonemise("implijout")
'impliʒut'

Validator

For a given input sentence/utterance, the validator returns either a validated and normalised version of the string according to the validation rules, or None if the string cannot be validated.

>>> from cvutils import Validator
>>> v = Validator('ab')
>>> v.validate('Аллаҳ хаҵеи-ԥҳәыси иеилыхны, аҭыԥҳацәа роума иалихыз?')
'аллаҳ хаҵеи-ԥҳәыси иеилыхны аҭыԥҳацәа роума иалихыз'

>>> v = Validator('br')
>>> v.validate('Ha cʼhoant hocʼh eus da gendercʼhel da implijout ar servijer-mañ ?')
"ha c'hoant hoc'h eus da genderc'hel da implijout ar servijer-mañ"

Sentence segmentation

Mostly designed for use with Wikipedia, takes a paragraph and returns a list of the sentences found within it.

>> for sent in s.segment(para): ... print(sent) ... Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia. A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl. A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa.">
>>> from cvutils import Segmenter 
>>> s = Segmenter('br')
>>> para = "Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia. A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl. A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa."
>>> for sent in s.segment(para):
...     print(sent)
... 
Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia.
A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl.
A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa.

Language support

Language Autonym Code (CV) (WP) Phon Valid Alphabet Segment
Abkhaz Аԥсуа abk ab
Amharic አማርኛ amh am
Arabic اَلْعَرَبِيَّةُ ara ar ar
Assamese অসমীয়া asm as as
Azeri Azərbaycanca aze az az
Bashkort Башҡортса bak ba ba
Basaa Basaa bas bas
Belarusian Беларуская мова bel be be
Bengali বাংলা ben bn bn
Breton Brezhoneg bre br br
Bulgarian Български bul bg bg
Catalan Català cat ca ca
Czech Čeština ces cs cs
Chukchi Ԓыгъоравэтԓьэн ckt
Chuvash Чӑвашла chv cv cv
Hakha Chin Hakha Lai cnh cnh
Welsh Cymraeg cym cy cy
Dhivehi ދިވެހި div dv dv
Greek Ελληνικά ell el el
Danish Dansk dan da da
German Deutsch deu de de
English English eng en en
Esperanto Esperanto epo eo eo
Ewe Eʋegbe ewe ee ee
Spanish Español spa es es
Erzya Эрзянь кель myv myv myv
Estonian Eesti est et et
Basque Euskara eus eu eu
Persian فارسی pes fa fa
Finnish Suomi fin fi fi
French Français fra fr fr
Frisian Frysk fry fy-NL fy
Igbo Ásụ̀sụ́ Ìgbò ibo ig ig
Irish Gaeilge gle ga-IE ga
Galician Galego glg gl gl
Guaraní Avañeʼẽ gug gn gn
Hindi हिन्दी hin hi hi
Hausa Harshen Hausa hau ha ha
Upper Sorbian Hornjoserbšćina hsb hsb hsb
Hungarian Magyar nyelv hun hu hu
Armenian Հայերեն hye hy-AM hy
Interlingua Interlingua ina ia ia
Indonesian Bahasa indonesia ind id id
Icelandic Íslenska isl is is
Italian Italiano ita it it
Japanese 日本語 jpn ja ja
Georgian ქართული ენა kat ka ka
Kabyle Taqbaylit kab kab kab
Kazakh Қазақша kaz kk kk
Kikuyu Gĩgĩkũyũ kik ki ki
Kyrgyz Кыргызча kir ky ky
Kurmanji Kurdish Kurmancî kmr ku ku
Sorani Kurdish سۆرانی ckb ckb ckb
Komi-Zyrian Коми кыв kpv kv kv
Luganda Luganda lug lg lg
Lithuanian Lietuvių kalba lit lt lt
Lingala Lingála lin ln ln
Latvian Latviešu valoda lvs lv lv
Luo Dholuo luo luo
Macedonian Македонски mkd mk mk
Malayalam മലയാളം mal ml ml
Marathi मराठी mar mr mr
Mongolian Монгол хэл khk mn mn
Moksha Мокшень кяль mdf mdf mdf
Maltese Malti mlt mt mt
Dutch Nederlands nld nl nl
Chewa Chichewa nya ny ny
Norwegian Nynorsk Nynorsk nno nn-NO nn
Oriya ଓଡ଼ିଆ ori or or
Punjabi ਪੰਜਾਬੀ pan pa-IN pa
Polish Polski pol pl pl
Portuguese Português por pt pt
Kʼicheʼ Kʼicheʼ quc
Romansch (Sursilvan) Romontsch roh rm-sursilv rm
Romansch (Vallader) Rumantsch roh rm-vallader rm
Romanian Românește ron ro ro
Russian Русский rus ru ru
Kinyarwanda Kinyarwanda kin rw rw
Sakha Саха тыла sah sah sah
Santali ᱥᱟᱱᱛᱟᱲᱤ sat sat sat
Serbian Srpski srp sr sr
Slovak Slovenčina slk sk sk
Slovenian Slovenščina slv sl sl
Swahili Kiswahili swa sw
Swedish Svenska swe sv-SE sv
Tamil தமிழ் tam ta ta
Thai ภาษาไทย tha th th
Turkish Türkçe tur tr tr
Tatar Татар теле tat tt tt
Twi Twi tw tw tw
Ukrainian Українська мова ukr uk uk
Urdu اُردُو urd ur ur
Uyghur ئۇيغۇر تىلى uig ug ug
Uzbek Oʻzbekcha uzb uz uz
Vietnamese Tiếng Việt vie vi vi
Votic Vaďďa tšeeli vot vot
Wolof Wolof wol wo
Yoruba Èdè Yorùbá yor yo
Chinese (China) 中文 cmn zh-CN zh
Chinese (Hong Kong) 中文 cmn zh-HK zh
Chinese (Taiwan) 中文 cmn zh-TW zh

Frequently asked questions

Why not use [insert better system] for [insert task here] ?

There are potentially a lot of better language-specific systems for doing these tasks, but each one has a slightly different API, so if you want to support all the Common Voice languages or even a reasonable subset you have to learn and use the same number of language-specific APIs.

The idea of these utilities is to provide adequate implementations of things are are likely to be useful when working with all the languages in Common Voice. If you are working on a single language or have a specific setup or are using more data than just Common Voice, maybe this isn't for you. But if you want to just train coqui-ai/STT on Common Voice, then maybe it is :)

Why not just make the alphabet from the transcripts ?

Depending on the language in Common Voice, the transcripts can contain a lot of random punctuation, numerals, and incorrect character encodings (for example Latin ç instead of Cyrillic ҫ for Chuvash). These may look the same but will result in bigger sparsity for the model. Additionally some languages may have several encodings of the same character, such as the apostrophe. These will ideally be normalised before training.

Also, if you are working with a single language you probably have time to look through all the transcripts for the alphabetic symbols, but if you want to work with a large number of Common Voice languages at the same time it's useful to have them all in one place.

Hey aren't some of those languages not in Common Voice ?

That's right, some of the languages are either not in Common Voice (yet!) or are in Common Voice but have not been released yet. If I've been working with them I've included them anyway.

See also

  • epitran: Great grapheme to phoneme system that supports a wide range of languages.

Licence

All the code, aside from that explicitly licensed under a different licence, is licensed under the AGPL v 3.0.

Acknowledgements

Owner
Francis Tyers
Francis Tyers
多语言降噪预训练模型MBart的中文生成任务

mbart-chinese 基于mbart-large-cc25 的中文生成任务 Input source input: text + /s + lang_code target input: lang_code + text + /s Usage token_ids_mapping.jso

11 Sep 19, 2022
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
End-2-end speech synthesis with recurrent neural networks

Introduction New: Interactive demo using Google Colaboratory can be found here TTS-Cube is an end-2-end speech synthesis system that provides a full p

Tiberiu Boros 214 Dec 07, 2022
结巴中文分词

jieba “结巴”中文分词:做最好的 Python 中文分词组件 "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation

Sun Junyi 29.8k Jan 02, 2023
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 | 한국어 State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained models

Hugging Face 77.1k Dec 31, 2022
Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

2 Jan 20, 2022
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
Source code of the "Graph-Bert: Only Attention is Needed for Learning Graph Representations" paper

Graph-Bert Source code of "Graph-Bert: Only Attention is Needed for Learning Graph Representations". Please check the script.py as the entry point. We

14 Mar 25, 2022
Datasets of Automatic Keyphrase Extraction

This repository contains 20 annotated datasets of Automatic Keyphrase Extraction made available by the research community. Following are the datasets and the original papers that proposed them. If yo

LIAAD - Laboratory of Artificial Intelligence and Decision Support 163 Dec 23, 2022
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 05, 2022
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Lewi Uberg 55 Nov 17, 2022
Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

MTFAA-Net Unofficial PyTorch implementation of Baidu's MTFAA-Net: "Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speec

Shimin Zhang 87 Dec 19, 2022
A music comments dataset, containing 39,051 comments for 27,384 songs.

Music Comments Dataset A music comments dataset, containing 39,051 comments for 27,384 songs. For academic research use only. Introduction This datase

Zhang Yixiao 2 Jan 10, 2022
spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines spaCy-wrap is minimal library intended for wrapping fine-tuned transformers from t

Kenneth Enevoldsen 32 Dec 29, 2022
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions

BERTopic BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable

Maarten Grootendorst 3.6k Jan 07, 2023
Google AI 2018 BERT pytorch implementation

BERT-pytorch Pytorch implementation of Google AI's 2018 BERT, with simple annotation BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers f

Junseong Kim 5.3k Jan 07, 2023
NeoDays-based tileset for the roguelike CDDA (Cataclysm Dark Days Ahead)

NeoDaysPlus Reduced contrast, expanded, and continuously developed version of the CDDA tileset NeoDays that's being completed with new sprites for mis

0 Nov 12, 2022
Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics.

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics. Jury offers a smooth and easy-to-use interface. It uses datasets for underlying metric computa

Open Business Software Solutions 129 Jan 06, 2023