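Keywords:  neural network|feature distribution divergence|adversarial example detection|outlier detection
Funding: This work was supported by the National Natural Science Foundation of China (No. 62072404), the CCF-Ant Research Fund (CCF-AFSG No. RF20220003), and the Hangzhou Leading Innovation and Entrepreneurship Team program (No. TD2022011).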
Exploiting Feature Space Divergence For Adversarial Example Detection
HAN Meng,YU Weiping,ZHOU Yiyun,DU Wentao,SUN Yanbin,LIN Changting
Zhejiang University, Hangzhou 310007, China; Gentel.ai, Hangzhou 310051, China; Tuya Inc., Hangzhou 310010, China; CareerBuilder, IL 60601, USA; Guangzhou University, Guangzhou 510006, China
Neural network models have been shown to be vulnerable to adversarial examples: maliciously crafted inputs that add a slight perturbation to natural inputs and cause the model to misclassify them. Such adversarial examples threaten security-critical applications in daily life, such as autonomous driving, surveillance systems, and biometric authentication. Recent work has shown that detecting adversarial examples can be more effective than preventing them by hardening the model at training time. Moreover, because the intermediate hidden layers of a neural network capture and extract sample information during training, their representations make it easier to distinguish adversarial examples from natural ones. We therefore investigate the statistical divergence between the hidden representations of adversarial and benign inputs at different layers of a neural network. Our results show that this divergence varies across layers. By identifying the layers in which the divergence is most pronounced, together with the statistical distribution of the representations of the benign training data, we propose a framework for adversarial example detection based on feature distributions. The framework comprises a generalized detection method and a conditional detection method. The former extracts the learned representation of the training data from each hidden layer, computes statistical features from it, and then calculates an outlier score for each test sample. The latter obtains the statistical features of the corresponding training data by comparing the prediction results of the deep neural network model. The statistical features include the L2-norm distance from the origin and the correlation with the top eigenvalue of the sample covariance matrix.
Our experimental results show that both detection methods can detect adversarial examples using hidden-layer information and perform well against adversarial examples generated by different attacks, demonstrating the effectiveness of the proposed detection framework.
Key words:  neural network|feature space divergence|adversarial example detection|outlier detection
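The two statistical features named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the hidden-layer representations have already been extracted as arrays of shape (n_samples, n_dims), that "correlation with the top eigenvalue" means the absolute projection of a centered feature vector onto the eigenvector belonging to the largest eigenvalue of the sample covariance matrix, and that the two features are combined by taking the larger of their z-scores. The function names, and the choice to condition on the predicted class in the second pair of functions, are hypothetical.

```python
import numpy as np


def fit_benign_statistics(train_feats):
    """Summarise benign hidden-layer features of shape (n_samples, n_dims), n_dims >= 2."""
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    cov = np.cov(centered, rowvar=False)        # sample covariance, (n_dims, n_dims)
    _, eigvecs = np.linalg.eigh(cov)            # eigenvalues come back in ascending order
    top_vec = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    norms = np.linalg.norm(train_feats, axis=1)  # L2 distance from the origin
    proj = np.abs(centered @ top_vec)           # assumed reading of the "correlation" feature
    return {
        "mean": mean,
        "top_vec": top_vec,
        "norm_mu": norms.mean(), "norm_sd": norms.std() + 1e-12,
        "proj_mu": proj.mean(), "proj_sd": proj.std() + 1e-12,
    }


def outlier_scores(stats, feats):
    """Score each row of feats by its deviation from the benign statistics."""
    norms = np.linalg.norm(feats, axis=1)
    proj = np.abs((feats - stats["mean"]) @ stats["top_vec"])
    z_norm = np.abs(norms - stats["norm_mu"]) / stats["norm_sd"]
    z_proj = np.abs(proj - stats["proj_mu"]) / stats["proj_sd"]
    return np.maximum(z_norm, z_proj)           # assumed way of combining the two features


def fit_conditional_statistics(train_feats, train_preds):
    """Conditional variant (assumption): one set of statistics per predicted class."""
    return {int(c): fit_benign_statistics(train_feats[train_preds == c])
            for c in np.unique(train_preds)}


def conditional_outlier_scores(class_stats, feats, preds):
    """Score each sample against the benign statistics of its predicted class.

    Raises KeyError if a predicted class never occurred in the training predictions.
    """
    scores = np.empty(len(feats))
    for c in np.unique(preds):
        mask = preds == c
        scores[mask] = outlier_scores(class_stats[int(c)], feats[mask])
    return scores
```

In use, a sample would be flagged as adversarial when its score exceeds a threshold chosen on held-out benign data; the abstract does not specify the threshold or which layers are selected, so both are left open here.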