Intelligent Vulnerability Detection System Based on Abstract Syntax Tree

CHEN Zhaoxuan; ZOU Deqing; LI Zhen; JIN Hai

Cite this article:

CHEN Zhaoxuan, ZOU Deqing, LI Zhen, JIN Hai. Intelligent vulnerability detection system based on abstract syntax tree[J]. Journal of Cyber Security (信息安全学报), 2020, 5(4): 1-13

CHEN Zhaoxuan^1,2, ZOU Deqing^1,3,4, LI Zhen^1,3, JIN Hai^1,2
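(1. National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab (Key Laboratory of the Ministry of Education), Cluster and Grid Computing Lab (Hubei Provincial Key Laboratory), Hubei Engineering Research Center on Big Data Security, Wuhan 430074, China; 2. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; 3. School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; 4. Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen 518000, China)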

Abstract:

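Automatic detection of source code vulnerabilities is an important research topic. Most existing solutions are based on linear models: they rely on the textual information of source code and ignore its syntactic structure, which loses syntactic and semantic information and misses many vulnerability features. This paper proposes Astor, an intelligent vulnerability detection system based on structural representation, which uses the structural information of source code, specifically the abstract syntax tree (AST), to detect vulnerabilities. First, a dataset is constructed that is derived from source code and retains its syntactic structure information, and a depth-first traversal mechanism is proposed to obtain the syntactic representation of the AST. Finally, a neural network model learns this syntactic representation. To evaluate Astor, it is compared with several vulnerability detection systems based on structured data and on linear data. The experimental results show that Astor effectively improves vulnerability detection capability and reduces both the false negative rate and the false positive rate. The paper further concludes that structured models are better suited to data that is long and rich in information.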
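The depth-first traversal step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Python's standard `ast` module as a stand-in parser, and the helper name `dfs_node_types` and the integer vocabulary are hypothetical choices made only for illustration.

```python
import ast


def dfs_node_types(node):
    """Return AST node type names in depth-first (pre-order) order."""
    sequence = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        sequence.extend(dfs_node_types(child))
    return sequence


# Toy input (an assumption for illustration, not data from the paper).
source = "x = a + 1"
tree = ast.parse(source)

# Flatten the tree into a sequence that preserves parent-before-child order.
tokens = dfs_node_types(tree)

# Map each node type to an integer id so a neural model could consume it.
vocab = {name: i for i, name in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]

print(tokens)
print(ids)
```

The flattened sequence keeps structural ordering (a parent always precedes its subtree), which is the kind of information a purely textual token stream does not carry.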

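Keywords: vulnerability detection; structural representation; abstract syntax tree; neural network

DOI: 10.19363/J.cnki.cn10-1380/tn.2020.07.01

Received: 2019-12-06; Revised: 2020-04-20

Funding: This work was supported by the National Natural Science Foundation of China (No. U1936211), the Shenzhen Fundamental Research Program (Discipline Layout) (No. JCYJ20170413114215614), the Guangdong Provincial Science and Technology Plan Project (No. 2017B010124001), and the Guangdong Key-Area Research and Development Program (No. 2019B010139001).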

Intelligent vulnerability detection system based on abstract syntax tree

CHEN Zhaoxuan^1,2, ZOU Deqing^1,3,4, LI Zhen^1,3, JIN Hai^1,2

(1. National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Clusters and Grid Computing Lab, Big Data Security Engineering Research Center, Wuhan 430074, China; 2. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; 3. School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; 4. Institute of Huazhong University of Science and Technology, Shenzhen 518000, China)

Abstract:

Automatic detection of source code vulnerabilities is an important research topic. However, most existing solutions are based on linear models. They rely on the textual information of source code but ignore its grammatical structure, which causes the loss of syntactic and semantic information and misses many vulnerability features. In this paper, we propose Astor, an Abstract Syntax Tree (AST) based structured representation learning system that studies the structural information of source code to detect vulnerabilities. First, we present a dataset that is transformed from source code and contains information about its syntax structure. In addition, we propose a depth-first information extraction scheme to obtain the syntactic and semantic representation of the AST. In Astor, a neural network based detection system learns the representation of the AST. To evaluate Astor, we compare it with vulnerability detection systems based on structured data and on linear data. The results show that Astor achieves far fewer false negatives and false positives than the other approaches. In addition, this paper concludes that the structured model is more suitable for data with rich semantic information.
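For readers unfamiliar with the two error measures named above, the following sketch computes a false positive rate and a false negative rate from binary labels. It assumes the standard definitions, FPR = FP / (FP + TN) and FNR = FN / (FN + TP); the function name `false_rates` is hypothetical and the labels are made-up toy values, not results from the paper.

```python
def false_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty classes to avoid division by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Toy data: 1 = vulnerable, 0 = not vulnerable (illustrative only).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

fpr, fnr = false_rates(labels, predictions)
print(f"FPR={fpr:.2f}, FNR={fnr:.2f}")
```

A lower FNR means fewer vulnerable samples are missed, and a lower FPR means fewer clean samples are flagged.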

Key words: vulnerability detection; structured representation; abstract syntax tree; neural network