Learning to Fool the Speaker Recognition
Jiguo Li | Xinfeng Zhang | Chuanmin Jia | Jizheng Xu | Li Zhang | Yue Wang | Siwei Ma✝ | Wen Gao
ICT, CAS | ✝PKU | UCAS | Bytedance
How to fool the speaker recognition model.
Abstract
Due to the widespread deployment of fingerprint, face, and speaker recognition systems, attacks on deep learning based biometric systems have drawn increasing attention. Previous research mainly studied attacks on vision-based systems, such as fingerprint and face recognition. Attacks on speaker recognition, however, have not yet been investigated, although speaker recognition is widely used in daily life. In this paper, we attempt to fool a state-of-the-art speaker recognition model and present the speaker recognition attacker, a lightweight model that fools a deep speaker recognition model by adding imperceptible perturbations to the raw speech waveform. We find that the speaker recognition system is also vulnerable to such attacks, and we achieve a high success rate on the non-targeted attack. We also present an effective method to optimize the speaker recognition attacker so as to obtain a trade-off between the attack success rate and the perceptual quality. Experiments on the TIMIT dataset show that we can achieve a sentence error rate of $99.2\%$ with an average SNR of $57.2\text{dB}$ and a PESQ of 4.2, at a speed about 20 times faster than real time.
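The SNR figure quoted in the abstract measures how small the added perturbation is relative to the original speech. As a minimal sketch, assuming the standard definition of SNR in decibels (ten times the base-10 logarithm of signal energy over perturbation energy), it could be computed as follows. The function name `snr_db` and the synthetic sine-wave signals are illustrative assumptions, not part of the released code.

```python
import numpy as np


def snr_db(clean, perturbed):
    """Signal-to-noise ratio in dB, treating (perturbed - clean) as the noise.

    Assumes the standard definition 10 * log10(E_signal / E_noise); the
    paper's exact evaluation script may differ in details.
    """
    clean = np.asarray(clean, dtype=np.float64)
    perturbed = np.asarray(perturbed, dtype=np.float64)
    noise = perturbed - clean
    signal_energy = np.sum(clean ** 2)
    noise_energy = np.sum(noise ** 2)
    if noise_energy == 0.0:
        return float("inf")  # identical waveforms: no perturbation at all
    return 10.0 * np.log10(signal_energy / noise_energy)


# Synthetic stand-in for one second of 16 kHz speech (hypothetical data).
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)
# A perturbation 500 times smaller in amplitude than the signal.
perturbed = clean + 0.001 * np.sin(2 * np.pi * 3000.0 * t)

print(f"SNR: {snr_db(clean, perturbed):.2f} dB")
```

A higher SNR means a quieter perturbation, so an average of 57.2 dB indicates that the perturbation energy is several orders of magnitude below the speech energy.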
Framework
Illustration of our framework. Our speaker recognition attacker is applied to the raw speech input and generates speech with added perturbations. Although the perturbations are imperceptible, the perturbed speech can fool the downstream speaker recognition model. In addition, a pretrained phoneme recognition model is used to help train the attacker network.
Results on the non-targeted attack
Here we show the results with $\lambda_{phn}=1$, $\lambda_{norm}=1000$, $m=0.01$.
Prediction (original speech) | Prediction (adversarial example) | Ground truth
fcjf0 | fsrh0 | fcjf0
fcjf0 | fsrh0 | fcjf0
fcjf0 | fsrh0 | fcjf0
fdaw0 | fsrh0 | fdaw0
fdaw0 | fsrh0 | fdaw0
fdaw0 | fsrh0 | fdaw0
faem0 | msvs0 | faem0
faem0 | msvs0 | faem0
faem0 | msvs0 | faem0
fbcg1 | fsrh0 | fbcg1
Non-targeted attack results on TIMIT dataset.
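The hyperparameters $\lambda_{phn}$, $\lambda_{norm}$, and $m$ reported with these results suggest a weighted training objective for the attacker. The sketch below shows one plausible reading, under the assumption that the objective is a speaker loss plus a $\lambda_{phn}$-weighted phoneme loss plus a $\lambda_{norm}$-weighted penalty on perturbation amplitudes that exceed the margin $m$. This page does not state the formula, so the function names, the hinge form of the penalty, and the example loss values are all hypothetical; the paper gives the exact formulation.

```python
import numpy as np


def hinge_norm_penalty(perturbation, m):
    """Mean amount by which |perturbation| exceeds the margin m.

    Assumed form: samples with amplitude below m incur no penalty.
    """
    excess = np.maximum(np.abs(perturbation) - m, 0.0)
    return float(np.mean(excess))


def total_loss(spk_loss, phn_loss, perturbation,
               lambda_phn=1.0, lambda_norm=1000.0, m=0.01):
    """Weighted sum of the three assumed terms, using the reported weights."""
    return (spk_loss
            + lambda_phn * phn_loss
            + lambda_norm * hinge_norm_penalty(perturbation, m))


# Hypothetical perturbation: two samples stay inside the margin, two exceed it.
delta = np.array([0.005, -0.02, 0.0, 0.03])
penalty = hinge_norm_penalty(delta, m=0.01)
loss = total_loss(spk_loss=0.2, phn_loss=0.1, perturbation=delta)
print(penalty, loss)
```

Under this reading, a large $\lambda_{norm}$ pushes the attacker to keep the perturbation within the margin, which is consistent with the high SNR reported in the abstract.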
Results on targeted attack
Prediction (original speech) | Prediction (adversarial example) | Attack target | Ground truth
fdaw0 | fcjf0 | fcjf0 | fdaw0
fdaw0 | fcjf0 | fcjf0 | fdaw0
faem0 | fcjf0 | fcjf0 | faem0
fbcg1 | fcjf0 | fcjf0 | fbcg1
fcjf0 | mrjh0 | mrjh0 | fcjf0
faem0 | mrjh0 | mrjh0 | faem0
fcjf0 | fklc0 | fklc0 | fcjf0
fcjf0 | fklc0 | fklc0 | fcjf0
faem0 | mfxs0 | mfxs0 | faem0
faem0 | faem0 | mfxs0 | faem0
Targeted attack results on the TIMIT dataset.
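A targeted attack counts as successful when the model's prediction on the adversarial example equals the attack target. As a small illustration, the sketch below tallies this over only the ten examples listed in the table above, so the resulting rate describes these samples and not the full test set. The function name `targeted_success_rate` is an illustrative assumption.

```python
# Each row: (prediction on original, prediction on adversarial,
#            attack target, ground truth), copied from the table above.
rows = [
    ("fdaw0", "fcjf0", "fcjf0", "fdaw0"),
    ("fdaw0", "fcjf0", "fcjf0", "fdaw0"),
    ("faem0", "fcjf0", "fcjf0", "faem0"),
    ("fbcg1", "fcjf0", "fcjf0", "fbcg1"),
    ("fcjf0", "mrjh0", "mrjh0", "fcjf0"),
    ("faem0", "mrjh0", "mrjh0", "faem0"),
    ("fcjf0", "fklc0", "fklc0", "fcjf0"),
    ("fcjf0", "fklc0", "fklc0", "fcjf0"),
    ("faem0", "mfxs0", "mfxs0", "faem0"),
    ("faem0", "faem0", "mfxs0", "faem0"),
]


def targeted_success_rate(rows):
    """Fraction of rows whose adversarial prediction equals the attack target."""
    hits = sum(1 for _, adv_pred, target, _ in rows if adv_pred == target)
    return hits / len(rows)


# Rows where the adversarial prediction did not reach the target speaker.
failures = [row for row in rows if row[1] != row[2]]

print(targeted_success_rate(rows), failures)
```

In the listed samples, the single failure is the last row, where the model still predicts the true speaker faem0 instead of the target mfxs0.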
Paper
"Learning to Fool the Speaker Recognition", Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao. arXiv
Data, Pretrained models, Code
The TIMIT dataset can be downloaded from here. The code can be found on GitHub. The pretrained models can be downloaded from here.
Acknowledgment
The authors would like to thank Jiayi Fu, Jing Lin, and Junjie Shi for helpful discussions.