Direct Speech-to-Image Translation

Jiguo Li Xinfeng Zhang Chuanmin Jia Jizheng Xu Li Zhang Yue Wang Siwei Ma Wen Gao


Our speech-to-image translation task. The input of the model is the speech signal alone, without text; the text is shown only for readability. Our goal is to render the content of the input speech as an image.


Speech-to-image translation without text is an interesting and useful topic due to its potential applications in human-computer interaction, art creation, computer-aided design, etc. Moreover, many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly, or how well they can be translated. In this paper, we attempt to translate speech signals into image signals without a transcription stage by leveraging advances in teacher-student learning and generative adversarial models. Specifically, a speech encoder is designed to represent the input speech signals as an embedding feature, and it is trained using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature produced by the speech encoder. Experimental results on both synthesized and real data show that our proposed method is effective in translating raw speech signals into images without an intermediate text representation. An ablation study gives more insight into our method.


Our framework for speech-to-image translation, which is composed of a speech encoder and a stacked generator. The speech encoder contains a multi-layer CNN and an RNN that encode the input time-frequency spectrogram into a 1024-dimensional embedding feature. The speech encoder is trained using teacher-student learning with the pretrained image encoder as the teacher. The generator, which has 3 branches, synthesizes images at a resolution of 256x256 from the embedding feature.
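The teacher-student training mentioned in the caption can be pictured with a minimal sketch. It assumes that the student (the speech encoder) is pushed toward the teacher's (the pretrained image encoder's) 1024-dimensional embedding of the paired image using a mean squared distance. The loss actually used in the paper is not stated on this page and may differ. The function name `distillation_loss` and the toy vectors are hypothetical.

```python
EMBED_DIM = 1024  # embedding size stated in the framework caption


def distillation_loss(student_emb, teacher_emb):
    """Mean squared distance between a speech embedding (student) and the
    embedding of its paired image (teacher).

    This is an assumed loss for illustration, not necessarily the paper's.
    """
    if len(student_emb) != len(teacher_emb):
        raise ValueError("embeddings must have the same dimension")
    squared = ((s - t) ** 2 for s, t in zip(student_emb, teacher_emb))
    return sum(squared) / len(student_emb)


# Toy vectors standing in for real encoder outputs (hypothetical values).
teacher = [1.0] * EMBED_DIM
student = [0.5] * EMBED_DIM
loss = distillation_loss(student, teacher)
print(loss)
```

Under this assumption only the speech encoder would be updated to reduce the loss, while the image encoder stays fixed as the teacher.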

Results on synthesized data

Results on the CUB-200 and Oxford-102 datasets. Left: the input speech description. Right: images synthesized from the speech description on the left with different noise vectors. The speech descriptions are synthesized via Baidu TTS.

Results on real data

Results on the Places-205 dataset with real speech descriptions.

Feature interpolation

Feature interpolation results on the CUB-200 and Oxford-102 datasets.
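As a rough illustration of feature interpolation, the sketch below blends two speech embeddings linearly and returns the intermediate embeddings, each of which would then be passed to the generator. Linear interpolation is an assumption here; the function name `interpolate_embeddings` and the toy 4-dimensional vectors are hypothetical.

```python
def interpolate_embeddings(emb_a, emb_b, num_steps):
    """Return num_steps embeddings moving linearly from emb_a to emb_b,
    including both endpoints."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    path = []
    for i in range(num_steps):
        alpha = i / (num_steps - 1)  # 0.0 at emb_a, 1.0 at emb_b
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)])
    return path


# Toy embeddings standing in for two encoded speech descriptions.
emb_a = [0.0, 0.0, 0.0, 0.0]
emb_b = [1.0, 2.0, 3.0, 4.0]
path = interpolate_embeddings(emb_a, emb_b, num_steps=5)
for emb in path:
    print(emb)
```

Feeding each embedding in `path` to the generator with the same noise vector would give a sequence of images that moves from the first description to the second.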

Supplementary material

The supplementary material can be found here.


    "Direct Speech-to-Image Translation",
Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao
arXiv, IEEE Xplore

Data and Code


We use 3 datasets in our paper. The data can be downloaded from the following table.

dataset | image data | speech caption data | split file
CUB-200 | CUB-200 | synthesized by Baidu TTS with person 0, or download here | train/val split
Oxford-102 | Oxford-102 | synthesized by Baidu TTS with person 0, or download here | train/test split
Places-205 | subset of Places-205 | Places Audio Captions Dataset | train/test split
The data used in our paper.


The code can be found on my GitHub. If you have any questions about the code or the paper, feel free to email me:


The authors would like to thank Shiqi Wang for helpful discussions.