Unsupervised Pre-training for Data-Efficient Text-to-Speech on Low Resource Languages
Demo audio samples (Korean)

We used the Griffin-Lim algorithm as the vocoder, so GT (Griffin-Lim) serves as the practical ground truth audio.
All models were trained on 0.5 shards (12 minutes) of paired data, which is a very small amount for training Tacotron from scratch.
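For reference, here is a minimal sketch of how a "GT (Griffin-Lim)" sample can be produced with librosa: the ground-truth waveform is converted to a mel-spectrogram and then inverted back to audio with Griffin-Lim. The file names and all parameters below (sample rate, FFT size, hop length, mel bins, iteration count) are illustrative assumptions, not the settings used in our experiments.

```python
# Sketch of the Griffin-Lim ground-truth path (parameters are assumptions).
import librosa
import soundfile as sf

sr = 22050
y, _ = librosa.load("reference.wav", sr=sr)  # "reference.wav" is a placeholder path

# Mel-spectrogram of the ground-truth waveform
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Invert the mel-spectrogram back to a waveform with Griffin-Lim
# (iterative phase estimation; no neural vocoder involved)
y_gl = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60
)

sf.write("gt_griffin_lim.wav", y_gl, sr)
```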


Main results

Text: 저는 대개 점심을 걸러요. jeoneun daege jeomsimeul geolleoyo. ("I usually skip lunch.")

Samples: GT (Griffin-Lim), T-Pho, T-VQ, T-SD, T-SD + SegAug, T-VQ + SegAug



Text: 그 건물은 천장이 높다. geu geonmureun cheonjangi nopda. ("That building has a high ceiling.")

Samples: GT (Griffin-Lim), T-Pho, T-VQ, T-SD, T-SD + SegAug, T-VQ + SegAug



Comparison to a simple upsampling pretext task

Text: 먹고 살 생각을 하니 걱정이야. meokgo sal saenggageul hani geokjeongiya. ("I'm worried about making a living.")

Samples: GT (Griffin-Lim), Naive T-SD



Text: 목이 아파요. mogi apayo. ("My throat hurts.")

Samples: GT (Griffin-Lim), Naive T-SD



Results of T-SD using different segmentation methods

Text: 사람의 앞날은 알 수가 없어요. saramui apnareun al suga eopseoyo. ("You can never know a person's future.")

Samples: GT (Griffin-Lim), Random segmentation (default), Pseudo-phoneme segmentation, Phoneme segmentation



Text: 오늘은 쉬는 날이에요. oneureun swineun narieyo. ("Today is a day off.")

Samples: GT (Griffin-Lim), Random segmentation (default), Pseudo-phoneme segmentation, Phoneme segmentation