Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Last update: Nov 24, 2022

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.

SyntaSpeech builds on PortaSpeech (NeurIPS 2021) and adds three new features:

We propose a Syntactic Graph Builder (Sec. 3.1) and a Syntactic Graph Encoder (Sec. 3.2), which together prove to be an effective unit for extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
We introduce Multi-Length Adversarial Training (Sec. 3.3), which can replace the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the audio.
We support three datasets: LJSpeech (single-speaker English), Biaobei (single-speaker Chinese), and LibriTTS (multi-speaker English).

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for the graph neural network; dgl-cu102 supports the RTX 2080, dgl-cu113 supports the RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install the forced-alignment tools (MFA)

Run SyntaSpeech!

Follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

For LibriTTS, download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for all three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download them and unzip them into the checkpoints/ folder.

2. Training Example

You can then train SyntaSpeech on any of the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # training in LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference in LibriTTS

Audio Demos

Audio samples from the paper can be found on our demo page.

We also provide a HuggingFace demo page for LJSpeech. Try your own sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments

pinyin preprocess problem

005804 你当#1我傻啊#3？脑子#1那么大#2怎么#1塞进去#4？ ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?_n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4_?', '', '', '', '']

What is 'a_?_n_ao3'?

In the mfa_dict, entries such as ch_a1_d_ou1 and a_?_n_ao3 appear.
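To make the symptom easier to see, here is a minimal diagnostic sketch. It assumes (this is not confirmed by the repo) that each ph_gb_word entry is simply the word's phonemes joined with '_', and it flags words whose phoneme list looks inconsistent with the word itself. The helper names join_phonemes and suspect_indices are hypothetical and only for illustration.

```python
# Hypothetical diagnostic helpers; the names are illustrative, not from the repo.
PUNCT = set("?!,.;:")

# txt_struct copied from the report above.
txt_struct = [
    ['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']],
    ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']],
    ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']],
    ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']],
    ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []],
    ['?', []], ['', ['']],
]


def join_phonemes(struct):
    """Join each word's phonemes with '_' (assumed convention)."""
    return ["_".join(phs) for _, phs in struct]


def suspect_indices(struct):
    """Return indices of words whose phonemes look misaligned.

    A word is flagged when it has no phonemes at all, when a normal word
    carries a punctuation phoneme, or when a punctuation word carries none.
    """
    bad = []
    for i, (word, phs) in enumerate(struct):
        if word == "":  # sentence-boundary placeholder, skip
            continue
        word_is_punct = word in PUNCT
        has_punct_ph = any(p in PUNCT for p in phs)
        if not phs or word_is_punct != has_punct_ph:
            bad.append(i)
    return bad


print(join_phonemes(txt_struct)[5])   # a_?_n_ao3
print(suspect_indices(txt_struct))    # [5, 6, 14, 15, 16, 17]
```

Under these assumptions, 'a_?_n_ao3' is just the phoneme list that ended up attached to '啊'. The flagged indices show that '啊', '?', '塞', '进', '去' and the final '?' do not carry the phonemes one would expect for them. The check only locates the symptom; it does not identify the cause in the preprocessing code.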

opened by windowxiaoming 2

discriminator output['y_c'] never used

The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

opened by mayfool 2

A question of KL divergence calculation

In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121) as loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]. Can someone help explain why? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!
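One possible reading, sketched under assumptions that the thread does not confirm: if the latent z is drawn from the posterior q (as is usual in a VAE), then averaging log q(z) - log p(z) over those samples is already a Monte Carlo estimate of KL(q || p), because the weighting by q(z) is supplied by the sampling itself. The extra factor exp(log q) would only be needed when summing over a fixed grid of values rather than over samples from q. The toy example below uses 1-D Gaussians and only the standard library; all function names are illustrative and none come from the repo.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def kl_closed_form(mq, sq, mp, sp):
    """Exact KL(N(mq, sq^2) || N(mp, sp^2))."""
    return math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5


def kl_monte_carlo(mq, sq, mp, sp, n, seed=0):
    """Estimate KL(q || p) as the mean of log q(z) - log p(z) with z ~ q.

    There is no explicit q(z) weight: drawing z from q provides it.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mq, sq)  # sample from q, not from a fixed grid
        total += log_normal_pdf(z, mq, sq) - log_normal_pdf(z, mp, sp)
    return total / n


exact = kl_closed_form(0.5, 0.8, 0.0, 1.0)
estimate = kl_monte_carlo(0.5, 0.8, 0.0, 1.0, n=100_000)
print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

In this toy setting the two values agree closely. In the quoted expression, the multiplication by nonpadding_sqz and the division by its sum look like a masked mean over valid positions, which would play the role of the sample average here; whether the remaining division by logqx.shape[1] is a per-channel normalisation is an assumption that should be checked against the code.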

opened by JiaYK 2

mfa for multi speaker.

The code groups MFA inputs for better parallelism. For multi-speaker data, this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the resulting TextGrid is:

	item [1]:
		class = "IntervalTier"
		name = "words"
		xmin = 0.0
		xmax = 9.4444
		intervals: size = 56
			intervals [1]:
				xmin = 0
				xmax = 0.5700000000000001
				text = ""
			intervals [2]:
				xmin = 0.5700000000000001
				xmax = 0.61
				text = "eng"
			intervals [3]:
				xmin = 0.61
				xmax = 0.79
				text = "s_an1"
			intervals [4]:
				xmin = 0.79
				xmax = 0.89
				text = "eng"
			intervals [5]:
				xmin = 0.89
				xmax = 1.06
				text = "i1"
			intervals [6]:
				xmin = 1.06
				xmax = 1.24
				text = "eng"
			intervals [7]:
				xmin = 1.24
				xmax = 1.3
				text = ""
			intervals [8]:
				xmin = 1.3
				xmax = 1.36
				text = "s_an1"
			intervals [9]:
				xmin = 1.36
				xmax = 1.42
				text = ""
			intervals [10]:
				xmin = 1.42
				xmax = 1.49
				text = "eng"
			intervals [11]:
				xmin = 1.49
				xmax = 1.67
				text = "s_i4"
			intervals [12]:
				xmin = 1.67
				xmax = 1.78
				text = "eng"
			intervals [13]:
				xmin = 1.78
				xmax = 1.91
				text = ""
			intervals [14]:
				xmin = 1.91
				xmax = 1.96
				text = "er4"
			intervals [15]:
				xmin = 1.96
				xmax = 2.06
				text = "eng"
			intervals [16]:
				xmin = 2.06
				xmax = 2.19
				text = ""
			intervals [17]:
				xmin = 2.19
				xmax = 2.35
				text = "i1"
			intervals [18]:
				xmin = 2.35
				xmax = 2.53
				text = "eng"
			intervals [19]:
				xmin = 2.53
				xmax = 3.03
				text = "i1"
			intervals [20]:
				xmin = 3.03
				xmax = 3.42
				text = "eng"
			intervals [21]:
				xmin = 3.42
				xmax = 3.48
				text = "i1"
			intervals [22]:
				xmin = 3.48
				xmax = 3.6
				text = ""
			intervals [23]:
				xmin = 3.6
				xmax = 3.64
				text = "eng"
			intervals [24]:
				xmin = 3.64
				xmax = 3.86
				text = "i1"
			intervals [25]:
				xmin = 3.86
				xmax = 3.99
				text = "eng"
			intervals [26]:
				xmin = 3.99
				xmax = 4.59
				text = ""
			intervals [27]:
				xmin = 4.59
				xmax = 4.869999999999999
				text = "er4"
			intervals [28]:
				xmin = 4.869999999999999
				xmax = 4.9799999999999995
				text = "eng"
			intervals [29]:
				xmin = 4.9799999999999995
				xmax = 5.1899999999999995
				text = "s_i4"
			intervals [30]:
				xmin = 5.1899999999999995
				xmax = 5.34
				text = ""
			intervals [31]:
				xmin = 5.34
				xmax = 5.43
				text = "eng"
			intervals [32]:
				xmin = 5.43
				xmax = 5.6
				text = ""
			intervals [33]:
				xmin = 5.6
				xmax = 5.76
				text = "i1"
			intervals [34]:
				xmin = 5.76
				xmax = 6.279999999999999
				text = "eng"
			intervals [35]:
				xmin = 6.279999999999999
				xmax = 6.359999999999999
				text = "s_an1"
			intervals [36]:
				xmin = 6.359999999999999
				xmax = 6.47
				text = ""
			intervals [37]:
				xmin = 6.47
				xmax = 6.6
				text = "eng"
			intervals [38]:
				xmin = 6.6
				xmax = 6.9399999999999995
				text = "i1"
			intervals [39]:
				xmin = 6.9399999999999995
				xmax = 7.039999999999999
				text = "eng"
			intervals [40]:
				xmin = 7.039999999999999
				xmax = 7.289999999999999
				text = "s_an1"
			intervals [41]:
				xmin = 7.289999999999999
				xmax = 7.369999999999999
				text = "eng"
			intervals [42]:
				xmin = 7.369999999999999
				xmax = 7.6
				text = "s_i4"
			intervals [43]:
				xmin = 7.6
				xmax = 7.699999999999999
				text = "eng"
			intervals [44]:
				xmin = 7.699999999999999
				xmax = 7.869999999999999
				text = ""
			intervals [45]:
				xmin = 7.869999999999999
				xmax = 8.049999999999999
				text = "er4"
			intervals [46]:
				xmin = 8.049999999999999
				xmax = 8.26
				text = ""
			intervals [47]:
				xmin = 8.26
				xmax = 8.299999999999999
				text = "eng"
			intervals [48]:
				xmin = 8.299999999999999
				xmax = 8.36
				text = "s_i4"
			intervals [49]:
				xmin = 8.36
				xmax = 8.389999999999999
				text = ""
			intervals [50]:
				xmin = 8.389999999999999
				xmax = 8.42
				text = "eng"
			intervals [51]:
				xmin = 8.42
				xmax = 8.45
				text = ""
			intervals [52]:
				xmin = 8.45
				xmax = 8.59
				text = "s_an1"
			intervals [53]:
				xmin = 8.59
				xmax = 8.83
				text = ""
			intervals [54]:
				xmin = 8.83
				xmax = 9.1
				text = "eng"
			intervals [55]:
				xmin = 9.1
				xmax = 9.44
				text = "i1"
			intervals [56]:
				xmin = 9.44
				xmax = 9.4444
				text = ""
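A quick way to confirm that an alignment like the one above does not belong to the given input is to compare the non-empty interval labels with the expected words. The sketch below is a minimal regex-based reader for the long TextGrid text format shown in this report, not a full TextGrid parser. The names read_intervals and unexpected_labels are hypothetical, and SAMPLE is a shortened copy of the dump above.

```python
import re

# Matches one interval: xmin, xmax and a quoted text label, in that order.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([0-9.]+)\s+xmax\s*=\s*([0-9.]+)\s+text\s*=\s*"([^"]*)"'
)

SAMPLE = '''
item [1]:
    class = "IntervalTier"
    name = "words"
    xmin = 0.0
    xmax = 9.4444
    intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.57
            text = ""
        intervals [2]:
            xmin = 0.57
            xmax = 0.61
            text = "eng"
        intervals [3]:
            xmin = 0.61
            xmax = 0.79
            text = "s_an1"
'''

EXPECTED = ("g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 "
            "l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 "
            "sh_i1").split()


def read_intervals(textgrid_text):
    """Return (xmin, xmax, label) tuples for every interval in the text."""
    return [(float(lo), float(hi), label)
            for lo, hi, label in INTERVAL_RE.findall(textgrid_text)]


def unexpected_labels(intervals, expected_words):
    """Return non-empty labels absent from the expected words, in order."""
    expected = set(expected_words)
    found = []
    for _, _, label in intervals:
        if label and label not in expected and label not in found:
            found.append(label)
    return found


intervals = read_intervals(SAMPLE)
print(unexpected_labels(intervals, EXPECTED))  # ['eng', 's_an1']
```

Labels such as eng and s_an1 do not occur in the input sentence, which is consistent with the TextGrid having been produced for a different utterance. Whether that comes from the grouping of MFA inputs is the reporter's hypothesis and is not verified by this check.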

opened by leon2milan 2

Problem with DDP

Hello, I have been experimenting with your excellent work in this repo, but I found that DDP does not seem to take effect. Am I launching it the wrong way?

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

opened by zhazl 0

Releases(v1.0.0)

v1.0.0(May 21, 2022)

We release the pretrained SyntaSpeech models for LJSpeech, Biaobei, and LibriTTS. For the pretrained vocoders and the datasets, please refer to the links provided in README.md.
Source code(tar.gz)
Source code(zip)
biaobei_synta.zip(295.58 MB)
libritts_synta.zip(310.03 MB)
lj_synta.zip(304.98 MB)

Owner

Zhenhui YE

I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.
