CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

Related tags

Text Data & NLPcvss
Overview

CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

License: CC BY 4.0

CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation corpus. The translation speech in CVSS is synthesized with two state-of-the-art TTS models trained on the LibriTTS corpus.

CVSS includes two versions of spoken translation for all the 21 x-en language pairs from CoVoST 2, with each version providing unique values:

  • CVSS-C: All the translation speeches are in a single canonical speaker's voice. Despite being synthetic, these speeches are of very high naturalness and cleanness, as well as having consistent speaking style. These properties ease the modelling of the target speech and enable models to produce high quality translation speech suitable for user-facing applications.

  • CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. Each translation pair has similar voices on the two sides despite of being in different languages, making this dataset suitable for building models that preserve speakers' voices when translate speech into different languages.

In together with the source speeches originated from Common Voice, they make two multilingual speech-to-speech tranlsation datasets each with about 1,900 hours of speech.

In addition to translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (e.g. on numbers, currencies, acronyms, etc.), which can be use for both model training as well as standalizing evaluation.

Please check out our paper for the detailed description of this corpus, as well as the baseline models we trained on both datasets.

Getting the data

The translation speech and the normalized translation text in CVSS can be downloaded from the links in the following table:

Source language Code CVSS-C CVSS-T
Arabic ar link link
Catalan ca link link
Welsh cy link link
German de link link
Estonian et link link
Spanish es link link
Persian fa link link
French fr link link
Indonesian id link link
Italian it link link
Japanese ja link link
Latvian lv link link
Mongolian mn link link
Dutch nl link link
Portuguese pt link link
Russian ru link link
Slovenian sl link link
Swedish sv link link
Tamil ta link link
Turkish tr link link
Chinese zh link link

Each tar.gz file in the links above includes train, dev and test directories containing audio clips as the translation speech, as well as train.tsv, dev.tsv and test.tsv files containing the normalized translation text. The normalized translation text files included in CVSS-C and CVSS-T are identical.

These translation audio clips and translation texts are to be paired with the Common Voice release version 4 (required) based on the audio file names. If you need the original translation text without the normalization, they are provided by CoVoST 2.

License

CVSS is released under the very permissive Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

Please cite this paper when referencing the CVSS corpus:

@misc{jia2022cvss,
    title={{CVSS} Corpus and Massively Multilingual Speech-to-Speech Translation},
    author={Jia, Ye and Tadmor Ramanovich, Michelle and Wang, Quan and Zen, Heiga},
    eprint={2201.03713},
    archivePrefix={arXiv},
    year={2022}
}
Owner
Google Research Datasets
Datasets released by Google Research
Google Research Datasets
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
Official PyTorch implementation of SegFormer

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Figure 1: Performance of SegFormer-B0 to SegFormer-B5. Project page

NVIDIA Research Projects 1.4k Dec 29, 2022
Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

Xiaobao Wu 8 Dec 16, 2022
TTS is a library for advanced Text-to-Speech generation.

TTS is a library for advanced Text-to-Speech generation. It's built on the latest research, was designed to achieve the best trade-off among ease-of-training, speed and quality. TTS comes with pretra

Mozilla 6.5k Jan 08, 2023
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Maksim Terpilowski 49 Dec 30, 2022
The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Speech Separation The simple project to separate mixed voice (2 clean voices) to 2 separate voices. Result Example (Clisk to hear the voices): mix ||

vuthede 31 Oct 30, 2022
Google and Stanford University released a new pre-trained model called ELECTRA

Google and Stanford University released a new pre-trained model called ELECTRA, which has a much compact model size and relatively competitive performance compared to BERT and its variants. For furth

Yiming Cui 1.2k Dec 30, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
【原神】自动演奏风物之诗琴的程序

疯物之诗琴 读取midi并自动演奏原神风物之诗琴。 可以自定义配置文件自动调整音符来适配风物之诗琴。 (原神1.4直播那天就开始做了!到现在才能放出来。。) 如何使用 在Release页面中下载打包好的程序和midi压缩包并解压。 双击运行“疯物之诗琴.exe”。 在原神中打开风物之诗琴,软件内输入

435 Jan 04, 2023
Simple text to phones converter for multiple languages

Phonemizer -- foʊnmaɪzɚ The phonemizer allows simple phonemization of words and texts in many languages. Provides both the phonemize command-line tool

CoML 762 Dec 29, 2022
Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.

LibreTranslate Try it online! | API Docs | Community Forum Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it d

3.4k Dec 27, 2022
Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani 184 Nov 27, 2022
Grover is a model for Neural Fake News -- both generation and detectio

Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks.

Rowan Zellers 856 Dec 24, 2022
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

背景 安装教程 快速上手 (一)预训练模型 (二)机器翻译 (三)文本分类 TenTrans 进阶 1. 多语言机器翻译 2. 跨语言预训练 背景 TrenTrans是一个统一的端到端的多语言多任务预训练平台,支持多种预训练方式,以及序列生成和自然语言理解任务。 安装教程 git clone git

Tencent Minority-Mandarin Translation Team 42 Dec 20, 2022
Legal text retrieval for python

legal-text-retrieval Overview This system contains 2 steps: generate training data containing negative sample found by mixture score of cosine(tfidf)

Nguyễn Minh Phương 22 Dec 06, 2022
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁

TGCLOUD 🪁 Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁 Features Easy to Deploy Heroku Supp

Mr.Acid dev 6 Oct 18, 2022
Smart discord chatbot integrated with Dialogflow

academic-NLP-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022