Direct speech-to-image translation

ICT,CAS

PKU

UCAS

Bytedance

Our speech-to-image translation task. The input to the model is the speech signal alone, without text. Note that the text is shown only for readability. Our goal is to render the content of the input speech as an image.

Abstract

Speech-to-image translation without text is an interesting and useful topic because of its potential applications in human-computer interaction, art creation, computer-aided design, etc., not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In this paper, we attempt to translate speech signals into image signals without a transcription stage by leveraging advances in teacher-student learning and generative adversarial models. Specifically, a speech encoder is designed to represent the input speech signal as an embedding feature, and it is trained using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature produced by the speech encoder. Experimental results on both synthesized and real data show that our proposed method is effective in translating raw speech signals into images without an intermediate text representation. An ablation study gives more insight into our method.

Framework

Our framework for speech-to-image translation, which is composed of a speech encoder and a stacked generator. The speech encoder contains a multi-layer CNN and an RNN that encode the input time-frequency spectrogram into a 1024-dimensional embedding feature. The speech encoder is trained using teacher-student learning with the pretrained image encoder. The generator, which has 3 branches, synthesizes images at a resolution of 256x256 from the embedding feature.
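The teacher-student idea in the framework can be illustrated with a minimal sketch. It assumes that the pretrained image encoder acts as a frozen teacher and that the speech encoder (the student) is pushed towards the teacher's embedding of the paired image through a cosine-distance loss. The loss choice, the function names, and the toy 4-dimensional vectors (the real embedding has 1024 dimensions) are illustrative assumptions, not necessarily the objective used in the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def distillation_loss(student_emb: Sequence[float],
                      teacher_emb: Sequence[float]) -> float:
    """Hypothetical teacher-student loss: 1 - cos(student, teacher).

    The loss is 0 when the speech embedding points in the same direction
    as the (frozen) image embedding and grows as the two diverge.
    """
    return 1.0 - cosine_similarity(student_emb, teacher_emb)


# Toy embeddings (assumed values, 4 dimensions instead of 1024).
teacher = [0.5, -1.0, 2.0, 0.0]          # image encoder output (teacher)
aligned_student = [1.0, -2.0, 4.0, 0.0]  # same direction as the teacher
poor_student = [2.0, 1.0, 0.0, 0.0]      # orthogonal to the teacher

loss_aligned = distillation_loss(aligned_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

In this sketch only the student would receive gradient updates; the teacher embedding is treated as a fixed target, which is the usual arrangement in teacher-student learning.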

Results on synthesized data

Results on the CUB-200 and Oxford-102 datasets. Left: the input speech description. Right: the images synthesized from the speech description on the left with different noise inputs. The speech descriptions are synthesized via Baidu TTS.

Results on real data

Results on the Places-205 dataset with real speech descriptions.

Feature interpolation

Feature interpolation results on the CUB-200 and Oxford-102 datasets.
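As a rough illustration of what feature interpolation involves, the sketch below linearly blends two speech embeddings and returns the intermediate features that would each be passed to the generator. The page does not state the interpolation scheme, so linear interpolation, the function name `interpolate_embeddings`, and the toy 3-dimensional vectors are assumptions made for illustration only.

```python
from typing import List, Sequence


def interpolate_embeddings(start: Sequence[float],
                           end: Sequence[float],
                           steps: int) -> List[List[float]]:
    """Return `steps` embeddings evenly spaced on the line from start to end.

    The first result equals `start` and the last equals `end`.
    """
    if len(start) != len(end):
        raise ValueError("embeddings must have the same dimension")
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight, from 0.0 to 1.0
        path.append([(1.0 - t) * s + t * e for s, e in zip(start, end)])
    return path


# Toy embeddings (assumed values; the real features have 1024 dimensions).
emb_a = [0.0, 2.0, -1.0]
emb_b = [1.0, 0.0, 3.0]
path = interpolate_embeddings(emb_a, emb_b, steps=5)
```

Feeding each element of `path` to the generator with the noise held fixed would give a sequence of images that changes gradually from the content of the first description to that of the second.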

Supplementary material

The supplementary material, including the data, can be found here.

Paper

"Direct Speech-to-Image Translation",
Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao
arXiv
arXiv | IEEE Xplore

Data and Code

Data

We use 3 datasets in our paper. The data can be downloaded from the following table.

The data used in our paper.
Dataset	Image data	Speech caption data	Split file
CUB-200	CUB-200	synthesized by Baidu TTS with person 0, or download here	train/val split
Oxford-102	Oxford-102	synthesized by Baidu TTS with person 0, or download here	train/test split
Places-205	subset of Places-205	Places Audio Captions Dataset	train/test split

Code

The code can be found on my GitHub. If you have any questions about the code or the paper, feel free to email me: jiguo.li@vipl.ict.ac.cn

Acknowledgment

The authors would like to thank Shiqi Wang for helpful discussions.