Crash Deduplication Based on Symbolic Constraint Comparison

Song Zhenyu; Wang Jiawei; Song Chen; Song Yanni; Zhao Yuanbo; Guo Xuan

Cite this article:

Song Zhenyu, Wang Jiawei, Song Chen, Song Yanni, Zhao Yuanbo, Guo Xuan. Crash Deduplication Based on Symbolic Constraint Comparison[J]. Journal of Cyber Security (信息安全学报), accepted.

(Institute of Information Engineering, Chinese Academy of Sciences)

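Abstract:

Security analysts using fuzzing tools can find a large number of crash samples in a short time, but deduplicating them is a serious challenge. Existing methods based on crash location and call stack often suffer from over-clustering, while methods based on vulnerability root causes have high computational overhead. To address these problems, this paper proposes a crash deduplication method based on symbolic constraint comparison. The method first identifies the key bytes of each crash sample through byte-by-byte transformation and testing in order to locate the root cause of the vulnerability. It then collects symbolic path constraints to extract control-flow features. Finally, it computes a similarity matrix over the control-flow features and clusters the samples with two algorithms, spectral clustering and DBSCAN, to deduplicate crash samples efficiently. The method was tested on seven target-program datasets selected from Magma and MoonLight, containing 23 CVE vulnerabilities in 8 classes and covering a wide range of real-world vulnerability scenarios. The experimental results show that the method does not produce over-clustering and achieves high clustering accuracy, with F1 scores of 99.99%, 99.69%, and 97.16% on the Poppler, SoX, and LibTIFF datasets, respectively. In addition, the CPU time it consumes is only one-thousandth of that of methods of the same type, which makes it well suited to large-scale software projects.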
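The key-byte identification step can be illustrated with a minimal sketch. The abstract does not specify the exact transformation or the test criterion, so the code below assumes that a byte is "key" when replacing it with a different value stops the crash from reproducing. The names `find_key_bytes` and `toy_oracle` and the replacement values are hypothetical; in practice the oracle would run the target program on the mutated input.

```python
from typing import Callable, List, Sequence


def find_key_bytes(
    sample: bytes,
    crashes: Callable[[bytes], bool],
    replacements: Sequence[int] = (0x00, 0xFF),
) -> List[int]:
    """Return the offsets whose mutation stops the crash from reproducing.

    Assumption: a byte is treated as a key byte if at least one
    replacement value makes the crash disappear.
    """
    if not crashes(sample):
        raise ValueError("the original sample does not reproduce the crash")

    key_offsets = []
    for offset in range(len(sample)):
        for value in replacements:
            if value == sample[offset]:
                continue  # not a real mutation, skip it
            mutated = sample[:offset] + bytes([value]) + sample[offset + 1:]
            if not crashes(mutated):
                key_offsets.append(offset)
                break  # one crash-removing mutation is enough
    return key_offsets


def toy_oracle(data: bytes) -> bool:
    """Hypothetical stand-in for running the target program on `data`."""
    return len(data) >= 4 and data[0:2] == b"PK" and data[3] > 0x7F


key_bytes = find_key_bytes(b"PK\x01\x90tail", toy_oracle)
print(key_bytes)
```

In this toy case the two magic bytes and the oversized field at offset 3 are reported, while the trailing payload is not. Testing each byte independently costs one program run per mutation, so the sketch scales linearly with the sample length.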

Keywords: crash deduplication; vulnerability root cause; symbolic constraints; software vulnerabilities

DOI：

Received: 2024-10-22; Revised: 2025-02-27

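Funding: Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences; Beijing Key Laboratory of Network Security and Protection Technology; National Key Research and Development Program of China (Grant No. 2021YFB2910102); National Natural Science Foundation of China (Grant No. 62202465)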

Crash Deduplication Based on Symbolic Constraint Comparison

Song Zhenyu, Wang Jiawei, Song Chen, Song Yanni, Zhao Yuanbo, Guo Xuan

(Institute of Information Engineering, Chinese Academy of Sciences)

Abstract:

Security analysts using fuzzing tools can discover a large number of crash samples in a short time, but deduplicating these samples remains a serious challenge in practical security analysis. Existing methods based on crash location and call stack often suffer from over-clustering, which reduces their effectiveness, while approaches based on vulnerability root causes incur high computational cost, which makes them impractical at scale. To address these challenges, this paper presents a crash deduplication approach based on symbolic constraint comparison. The method first identifies the key bytes of each crash sample through byte-by-byte transformation and testing in order to locate the root cause of the vulnerability. It then collects symbolic path constraints to extract the control-flow features of each crash sample, which represent its execution pattern. Finally, it computes a similarity matrix over the control-flow features and clusters the samples with both spectral clustering and DBSCAN, which allows it to adapt to different vulnerability distributions. The approach was evaluated on seven target-program datasets selected from the Magma and MoonLight benchmarks, containing 23 CVE vulnerabilities in 8 classes that represent a diverse range of real-world vulnerability scenarios.
The experimental results show that the proposed method avoids the over-clustering problem and achieves high clustering accuracy across vulnerability types, with F1 scores of 99.99%, 99.69%, and 97.16% on the Poppler, SoX, and LibTIFF datasets, respectively. In terms of computational cost, the method consumes only one-thousandth of the CPU time of existing approaches of the same type, which makes it well suited to large-scale software projects.
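The final clustering step can be sketched as follows. The abstract does not state how similarity between control-flow features is measured, so this example assumes each crash is represented by a set of path-constraint strings and uses Jaccard similarity; both choices are illustrative assumptions, as are the toy constraints and the `eps` and `min_samples` values. Only DBSCAN is shown, written in plain Python over the precomputed matrix; spectral clustering is omitted.

```python
from typing import FrozenSet, List, Optional, Sequence


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two constraint sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(features: Sequence[FrozenSet[str]]) -> List[List[float]]:
    """Pairwise similarity between the feature sets of all crash samples."""
    return [[jaccard(a, b) for b in features] for a in features]


def dbscan(sim: List[List[float]], eps: float, min_samples: int) -> List[int]:
    """Minimal DBSCAN using distance = 1 - similarity; label -1 means noise."""
    n = len(sim)
    neighbours = [
        [j for j in range(n) if 1.0 - sim[i][j] <= eps] for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return [label if label is not None else -1 for label in labels]


# Hypothetical path constraints for five crash samples from two root causes.
crash_features = [
    frozenset({"len > 16", "hdr == 0x4D4D", "tag == 0x0111"}),
    frozenset({"len > 16", "hdr == 0x4D4D", "tag == 0x0111", "count > 4"}),
    frozenset({"len > 16", "hdr == 0x4D4D", "tag == 0x0111", "count > 8"}),
    frozenset({"magic == RIFF", "chunk_size < 0"}),
    frozenset({"magic == RIFF", "chunk_size < 0", "fmt == 1"}),
]
labels = dbscan(similarity_matrix(crash_features), eps=0.5, min_samples=2)
print(labels)
```

Keeping one representative sample per label then yields the deduplicated set. DBSCAN does not need the number of clusters in advance, which fits the case where the number of distinct vulnerabilities is unknown.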

Key words: crash deduplication; root cause of vulnerabilities; symbolic constraints; software vulnerabilities