Learning to Fool the Speaker Recognition


Jiguo Li Xinfeng Zhang Chuanmin Jia Jizheng Xu Li Zhang Yue Wang Siwei Ma Wen Gao

ICT,CAS PKU UCAS Bytedance

How to fool the speaker recognition model.

Abstract

Due to the widespread deployment of fingerprint, face, and speaker recognition systems, attacks on deep learning based biometric systems have drawn increasing attention. Previous research has mainly studied attacks on vision-based systems, such as fingerprint and face recognition. Attacks on speaker recognition, in contrast, have not yet been investigated, although speaker recognition is widely used in daily life. In this paper, we attempt to fool a state-of-the-art speaker recognition model and present the speaker recognition attacker, a lightweight model that fools a deep speaker recognition model by adding imperceptible perturbations to the raw speech waveform. We find that speaker recognition systems are also vulnerable to such attacks, and we achieve a high success rate on the non-targeted attack. We also present an effective method for optimizing the speaker recognition attacker to trade off the attack success rate against perceptual quality. Experiments on the TIMIT dataset show that we can achieve a sentence error rate of $99.2\%$ with an average SNR of $57.2\text{dB}$ and a PESQ of 4.2, at a speed roughly 20 times faster than real time.
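To make the SNR figure in the abstract concrete, here is a minimal sketch of how the signal-to-noise ratio between a clean waveform and its perturbed version is commonly computed. It assumes the standard definition (clean-signal energy over perturbation energy, in dB); this page does not state the exact formula used, and the function name `snr_db` and the toy sine signals are illustrative only.

```python
import math

def snr_db(clean, adversarial):
    """SNR in dB, treating (adversarial - clean) as the noise.

    Assumes the standard energy-ratio definition; not taken from the paper.
    """
    if len(clean) != len(adversarial):
        raise ValueError("signals must have the same length")
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((a - c) ** 2 for c, a in zip(clean, adversarial))
    if noise_energy == 0:
        return math.inf  # identical signals: no perturbation at all
    return 10.0 * math.log10(signal_energy / noise_energy)

# Toy example: a 440 Hz tone sampled at 16 kHz plus a perturbation
# whose amplitude is 1000 times smaller than the tone's.
rate = 16000
clean = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
adversarial = [c + 0.001 * math.cos(2 * math.pi * 3000 * n / rate)
               for n, c in enumerate(clean)]
print(round(snr_db(clean, adversarial), 1))
```

In this toy case the amplitude ratio of 1000 corresponds to about 60 dB, which gives a sense of how small a perturbation at the reported 57.2 dB average is.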
 

Framework

Illustration of our framework. The speaker recognition attacker takes the raw speech as input and generates perturbed speech that fools the downstream speaker recognition model even though the perturbations are imperceptible. In addition, a pretrained phoneme recognition model is used to help train the attacker network.
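As a rough illustration of how such an attacker could be trained, the sketch below combines three terms into one objective, weighted by the hyperparameters $\lambda_{phn}$, $\lambda_{norm}$ and $m$ that this page reports with its results. The exact loss is not given on this page, so everything here is an assumption: the hinge-on-L2-norm form of the perturbation penalty, the additive combination, and the names `norm_penalty` and `total_loss` are hypothetical.

```python
import math

def norm_penalty(perturbation, m):
    """Assumed form: hinge on the L2 norm of the perturbation.

    The penalty is zero while the norm stays below the margin m.
    """
    l2 = math.sqrt(sum(p * p for p in perturbation))
    return max(l2 - m, 0.0)

def total_loss(loss_spk, loss_phn, perturbation,
               lambda_phn=1.0, lambda_norm=1000.0, m=0.01):
    """Assumed objective: attack term + weighted phoneme term + weighted norm term.

    loss_spk: scalar attack loss from the speaker recognition model
    loss_phn: scalar loss from the pretrained phoneme recognition model
    Default weights are the values quoted in the results section.
    """
    return (loss_spk
            + lambda_phn * loss_phn
            + lambda_norm * norm_penalty(perturbation, m))

# A perturbation inside the margin contributes nothing to the objective.
print(total_loss(0.5, 0.2, [0.003, 0.004]))
```

Under this reading, a large $\lambda_{norm}$ pushes the attacker to keep perturbations within the margin $m$, while the phoneme term discourages changes that would alter the spoken content.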

Results on the non-targeted attack

Here we show the results with $\lambda_{phn}=1, \lambda_{norm}=1000, m=0.01$.
Original speech prediction    Adversarial example prediction    Ground truth
fcjf0                         fsrh0                             fcjf0
fcjf0                         fsrh0                             fcjf0
fcjf0                         fsrh0                             fcjf0
fdaw0                         fsrh0                             fdaw0
fdaw0                         fsrh0                             fdaw0
fdaw0                         fsrh0                             fdaw0
faem0                         msvs0                             faem0
faem0                         msvs0                             faem0
faem0                         msvs0                             faem0
fbcg1                         fsrh0                             fbcg1

Non-targeted attack results on the TIMIT dataset.
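The table above can be summarized with a short script. The sketch below counts a non-targeted attack as successful when a sample that was classified correctly before the attack is classified as any other speaker afterwards; this success criterion and the function name are assumptions for illustration, and the rows are copied from the table.

```python
from collections import Counter

# (original prediction, adversarial prediction, ground truth)
rows = [
    ("fcjf0", "fsrh0", "fcjf0"),
    ("fcjf0", "fsrh0", "fcjf0"),
    ("fcjf0", "fsrh0", "fcjf0"),
    ("fdaw0", "fsrh0", "fdaw0"),
    ("fdaw0", "fsrh0", "fdaw0"),
    ("fdaw0", "fsrh0", "fdaw0"),
    ("faem0", "msvs0", "faem0"),
    ("faem0", "msvs0", "faem0"),
    ("faem0", "msvs0", "faem0"),
    ("fbcg1", "fsrh0", "fbcg1"),
]

def non_targeted_success_rate(rows):
    """Share of initially correct samples that are misclassified after the attack."""
    attackable = [r for r in rows if r[0] == r[2]]
    if not attackable:
        return 0.0
    fooled = [r for r in attackable if r[1] != r[2]]
    return len(fooled) / len(attackable)

print(non_targeted_success_rate(rows))
# Which speakers the adversarial examples are mistaken for, most frequent first.
print(Counter(adv for _, adv, _ in rows).most_common())
```

On these ten samples every attack succeeds, and seven of the ten adversarial examples are assigned to the same speaker, fsrh0.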

Results on targeted attack

Original speech prediction    Adversarial example prediction    Attack target    Ground truth
fdaw0                         fcjf0                             fcjf0            fdaw0
fdaw0                         fcjf0                             fcjf0            fdaw0
faem0                         fcjf0                             fcjf0            faem0
fbcg1                         fcjf0                             fcjf0            fbcg1
fcjf0                         mrjh0                             mrjh0            fcjf0
faem0                         mrjh0                             mrjh0            faem0
fcjf0                         fklc0                             fklc0            fcjf0
fcjf0                         fklc0                             fklc0            fcjf0
faem0                         mfxs0                             mfxs0            faem0
faem0                         faem0                             mfxs0            faem0

Targeted attack results on the TIMIT dataset.
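For the targeted case, a natural success criterion is that the adversarial example is classified as the attack target. The sketch below applies that criterion to the rows of the table above; the criterion and the helper name are assumptions for illustration rather than the paper's stated metric.

```python
# (original prediction, adversarial prediction, attack target, ground truth)
targeted_rows = [
    ("fdaw0", "fcjf0", "fcjf0", "fdaw0"),
    ("fdaw0", "fcjf0", "fcjf0", "fdaw0"),
    ("faem0", "fcjf0", "fcjf0", "faem0"),
    ("fbcg1", "fcjf0", "fcjf0", "fbcg1"),
    ("fcjf0", "mrjh0", "mrjh0", "fcjf0"),
    ("faem0", "mrjh0", "mrjh0", "faem0"),
    ("fcjf0", "fklc0", "fklc0", "fcjf0"),
    ("fcjf0", "fklc0", "fklc0", "fcjf0"),
    ("faem0", "mfxs0", "mfxs0", "faem0"),
    ("faem0", "faem0", "mfxs0", "faem0"),
]

def targeted_failures(rows):
    """Indices of rows whose adversarial prediction misses the attack target."""
    return [i for i, (_, adv, target, _) in enumerate(rows) if adv != target]

failures = targeted_failures(targeted_rows)
success_rate = 1 - len(failures) / len(targeted_rows)
print(success_rate, failures)
```

Nine of the ten samples reach their target. In the last row the prediction stays at the true speaker faem0 instead of moving to mfxs0.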

Paper

    "Learning to Fool the Speaker Recognition",
Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao
arXiv

 

Data, Pretrained models, Code

The TIMIT dataset can be downloaded from here. The code can be found on GitHub. The pretrained models can be downloaded from here.

Acknowledgment

    The authors would like to thank Jiayi Fu, Jing Lin, and Junjie Shi for helpful discussions.