Code Vulnerability Detection Technology Based on Graph Contrastive Learning

XIA Pengcheng; ZENG Fanping; LIU Tong

Cite this article:

XIA Pengcheng, ZENG Fanping, LIU Tong. Code Vulnerability Detection Technology Based on Graph Contrastive Learning[J]. Journal of Cyber Security, accepted.

(School of Computer Science and Technology, University of Science and Technology of China)

Abstract:

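Code vulnerability detection is an important stage of software development. As software systems grow larger and more complex, both the number and the variety of vulnerabilities increase. Existing vulnerability detection methods usually depend heavily on labeled datasets, or require substantial manual effort when labels are unavailable. This paper proposes a new vulnerability detection method based on graph contrastive learning (CVD-GCL) that aims to free vulnerability detection from its dependence on labels, learn deep features of vulnerable code, and thereby detect vulnerabilities in source code intelligently. First, the source code is converted into four graph structures of the code. Next, a new graph embedding method is proposed to build a vulnerable-code graph dataset that contains the semantic and syntactic information of the source code as well as its structure. Then, a graph contrastive learning framework is built and used to aggregate the features in the graphs, continually strengthening the model's ability to capture vulnerability features. Finally, the trained graph neural network encoder is used for the vulnerability detection task. Existing methods focus on supervised vulnerability detection, whereas this paper is the first to propose and implement an unsupervised code vulnerability detection method. To validate the method, experiments were conducted on two well-known open-source projects that contain rich, highly complex real-world source code. The work first compares against other graph dataset construction methods to verify the efficiency of the proposed dataset, showing that, in vulnerability detection, taking edge attributes into account when vectorizing graphs yields higher detection accuracy. It then compares model performance with the best-known supervised vulnerability detection methods: with a training set of the same size, this work not only saves the cost of labeling the dataset but also reaches a detection accuracy of 69.64%, which demonstrates the effectiveness of CVD-GCL in vulnerability detection.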
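The claim that edge attributes matter when vectorizing a code graph can be illustrated with a minimal sketch. This is not the CVD-GCL embedding itself, which the abstract does not specify. The edge-type labels (`AST`, `CFG`, `DDG`), the fixed weights, and the functions `aggregate` and `readout` are all hypothetical names chosen for illustration; in a trained model the weights would be learned.

```python
from collections import defaultdict

# Hypothetical per-edge-type weights (assumption: fixed here, learned in practice).
EDGE_WEIGHTS = {"AST": 0.5, "CFG": 1.0, "DDG": 1.5}


def aggregate(node_feats, edges, edge_weights=EDGE_WEIGHTS):
    """One round of edge-type-aware mean aggregation.

    node_feats: dict mapping node id -> list of floats
    edges: list of (src, dst, edge_type) tuples
    Returns a new dict in which each node vector is its own features plus
    the mean of the type-weighted messages from its in-neighbours.
    """
    dim = len(next(iter(node_feats.values())))
    incoming = defaultdict(list)
    for src, dst, etype in edges:
        w = edge_weights[etype]
        incoming[dst].append([w * x for x in node_feats[src]])

    updated = {}
    for node, feat in node_feats.items():
        msgs = incoming.get(node, [])
        if msgs:
            mean = [sum(m[k] for m in msgs) / len(msgs) for k in range(dim)]
        else:
            mean = [0.0] * dim
        updated[node] = [f + m for f, m in zip(feat, mean)]
    return updated


def readout(node_feats):
    """Graph-level embedding: element-wise mean over all node vectors."""
    dim = len(next(iter(node_feats.values())))
    n = len(node_feats)
    return [sum(v[k] for v in node_feats.values()) / n for k in range(dim)]


# Same nodes and topology, different edge type: the embeddings differ.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0]}
emb_ast = readout(aggregate(feats, [(0, 1, "AST")]))
emb_ddg = readout(aggregate(feats, [(0, 1, "DDG")]))
```

Two graphs with identical topology but different edge types produce different embeddings here, whereas an embedding that ignored edge types would map them to the same vector.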

Keywords: vulnerability detection; code property graph; graph neural network; contrastive learning; unsupervised learning

DOI：

Submitted: 2024-01-08; Revised: 2024-03-22

Funding: National Key Research and Development Program, Ministry of Science and Technology

Code Vulnerability Detection Technology Based on Graph Contrastive Learning

XIA Pengcheng, ZENG Fanping, LIU Tong

(School of Computer Science and Technology, University of Science and Technology of China)

Abstract:

Code vulnerability detection is an important stage in the software development process. As software systems become larger and more complex, the number and types of vulnerabilities are increasing. Existing vulnerability detection methods usually rely heavily on labeled datasets, or require a lot of manual effort when labels are lacking. Our work proposes a new method for vulnerability detection based on graph contrastive learning (CVD-GCL), which is dedicated to freeing vulnerability detection technology from the limitations of labels, realizing deep learning of vulnerable code features, and thus achieving intelligent vulnerability detection of source code. First, the source code is converted into four different graph representations of the code. Second, a new graph embedding method is proposed to construct a dataset of vulnerable code graphs containing the syntax information, semantic information and code structure of the source code. Third, a graph contrastive learning framework is constructed and used to aggregate the features in the graph, continuously enhancing the model's ability to capture code vulnerability features. Finally, the trained graph neural network encoder is used to perform the vulnerability detection task. Existing methods focus on supervised vulnerability detection, while this paper proposes and implements an unsupervised code vulnerability detection method for the first time. To verify the proposed method, experiments were conducted on two well-known open-source projects, which include rich and highly complex real source code. Our work first validates the efficiency of the proposed dataset through comparative experiments with other graph dataset construction methods, showing that, when embedding graphs for vulnerability detection, considering the attributes of edges yields higher detection accuracy. We then compare against the best-known supervised vulnerability detection methods: with a training set of the same scale, our work not only saves the cost of labeling the dataset but also achieves 69.64% detection accuracy, which demonstrates the effectiveness of CVD-GCL in the field of vulnerability detection.
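The contrastive objective behind a framework of this kind can be sketched as follows. The abstract does not state which loss CVD-GCL uses, so this example assumes the widely used NT-Xent (normalized temperature-scaled cross-entropy) loss over two augmented views of each graph. The function names, the temperature value, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over a batch of paired graph embeddings.

    view_a[i] and view_b[i] are embeddings of two augmentations of graph i
    (the positive pair); every other embedding in the batch is a negative.
    No labels are needed: the pairing itself provides the training signal.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same graph
        logits = [cosine(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy embeddings: two graphs, two views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]      # each close to its counterpart
shuffled = [[0.1, 1.0], [1.0, 0.1]]    # views paired with the wrong graph

aligned_loss = nt_xent_loss(view_a, view_b)
mismatched_loss = nt_xent_loss(view_a, shuffled)
```

Minimizing this loss pulls the two views of the same graph together and pushes different graphs apart, so correctly paired views score a lower loss than mismatched ones. An encoder trained this way can then be reused for the downstream detection task.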

Key words: vulnerability detection; code property graph; graph neural networks; contrastive learning; unsupervised learning