Machine Learning for Fuzzing: Current Status and Challenges

Chen Hongcheng; Yan Qiucun; Xiang Lu; Meng Guozhu; Chen Xiang; Ji Shouling

Cite this article:

Chen Hongcheng, Yan Qiucun, Xiang Lu, Meng Guozhu, Chen Xiang, Ji Shouling. Machine Learning for Fuzzing: Current Status and Challenges[J]. Journal of Cyber Security, accepted.

Chen Hongcheng¹, Yan Qiucun¹, Xiang Lu¹, Meng Guozhu¹, Chen Xiang², Ji Shouling³
(1. Institute of Information Engineering, Chinese Academy of Sciences; 2. Nantong University; 3. Zhejiang University)

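Abstract:

Software vulnerabilities are one of the main sources of threats to cybersecurity, and discovering and patching them early plays an important role in keeping systems secure. Fuzzing is a dynamic software testing technique that proactively discovers vulnerabilities by executing the target under test with large numbers of semi-random inputs. Because it is conceptually simple, easy to deploy, and effective, fuzzing has been widely applied to vulnerability discovery in all kinds of software and has uncovered a large number of vulnerabilities. However, naive fuzzing still suffers from inefficient allocation of computing power, heavy reliance on expert experience for parameter setting, and the high cost of manually analyzing input formats. A fuzzing campaign produces a large amount of data, and fully mining and exploiting the knowledge in that data can make fuzzing more intelligent and reduce manual effort. In recent years, machine learning methods, represented by deep learning, have developed rapidly and achieved breakthroughs in fields such as pattern recognition and data generation. As a result, more and more researchers are trying to use machine learning to improve fuzzing. This paper systematically surveys and analyzes recent literature on applying machine learning to fuzzing, and identifies five subtasks across the stages of fuzzing that are suitable for improvement with machine learning: input model inference, mutation operator customization, seed file scheduling, mutation operator scheduling, and test case filtering. For each subtask, we introduce the solutions used in traditional fuzzing and their shortcomings, and we summarize and organize the related machine learning work in terms of its specific goals and the types of algorithms used. We analyze the popularity of the various classes of machine learning methods in fuzzing and explain the reasons behind it. We then discuss the design considerations in data acquisition, data preprocessing, model training, and model evaluation in related work, and introduce typical practices. We introduce the parsing pass rate, an important metric in generation-based fuzzing. Finally, we analyze how machine learning techniques have evolved on the same task. Based on this analysis and discussion, we outline six promising research directions for the future application of machine learning to fuzzing.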
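To make the five subtasks concrete, the following is a minimal sketch of a mutation-based fuzzing loop with comments marking where each subtask would sit. It is an illustration only and is not taken from the paper: `toy_target`, `flip_bit`, `set_byte`, and `fuzz` are hypothetical names, the target is a stand-in that "crashes" on one prefix, seeds are assumed to be non-empty, and every decision that a learned model could make is replaced here by a uniform random choice or a trivial rule.

```python
import random


def toy_target(data: bytes) -> None:
    """Hypothetical program under test: 'crashes' on inputs starting with b'FUZ'."""
    if data[:3] == b"FUZ":
        raise RuntimeError("simulated crash")


def flip_bit(data: bytearray, rng: random.Random) -> None:
    """Flip one randomly chosen bit in place."""
    i = rng.randrange(len(data))
    data[i] ^= 1 << rng.randrange(8)


def set_byte(data: bytearray, rng: random.Random) -> None:
    """Overwrite one randomly chosen byte with a random value."""
    data[rng.randrange(len(data))] = rng.randrange(256)


# Mutator customization: this operator set is hand-written here;
# the surveyed work tries to derive or tailor such operators from data.
MUTATORS = [flip_bit, set_byte]


def fuzz(seeds: list[bytes], iterations: int, rng: random.Random) -> list[bytes]:
    """Run a naive fuzzing loop and return the crashing inputs found."""
    # Input model inference: the seeds are supplied by hand here;
    # a learned input model could generate or constrain them instead.
    crashes = []
    for _ in range(iterations):
        seed = rng.choice(seeds)        # seed file scheduling (uniform here)
        mutator = rng.choice(MUTATORS)  # mutator scheduling (uniform here)
        candidate = bytearray(seed)
        mutator(candidate, rng)
        case = bytes(candidate)
        # Test case filtering: a trivial stand-in rule that skips inputs
        # identical to the seed, since they cannot show new behavior.
        if case == seed:
            continue
        try:
            toy_target(case)
        except RuntimeError:
            crashes.append(case)
    return crashes


if __name__ == "__main__":
    found = fuzz([b"FUX"], 2000, random.Random(0))
    print(f"{len(found)} crashing inputs found")
```

Each uniform `rng.choice` call and the equality filter mark a point where the naive loop spends executions blindly, which is the kind of decision the surveyed approaches try to learn from data collected during the campaign.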

Keywords: fuzzing; machine learning; deep learning

DOI: 10.19363/J.cnki.cn10-1380/tn.2024.02.21

Submitted: 2022-07-18  Revised: 2022-10-27

Funding: National Natural Science Foundation of China (No. 61902395)

Machine Learning for Fuzzing: Current Status and Challenges

Chen Hongcheng¹, Yan Qiucun¹, Xiang Lu¹, Meng Guozhu¹, Chen Xiang², Ji Shouling³

(1. Institute of Information Engineering, Chinese Academy of Sciences; 2. Nantong University; 3. Zhejiang University)

Abstract:

Software vulnerabilities are one of the main threats to cybersecurity, and discovering and patching them promptly is important for keeping systems secure. Fuzz testing is a dynamic software testing method that proactively discovers vulnerabilities by feeding large quantities of semi-random inputs to the target under test. Fuzz testing has gained popularity because it is conceptually simple, easy to deploy, and effective at vulnerability discovery. It has been applied to many categories of software and has uncovered a large number of vulnerabilities. However, naive fuzzing still suffers from inefficient allocation of computing power, reliance on expert experience for parameter setting, and labor-intensive input format inference. Fuzzing campaigns produce large amounts of data, and extracting and exploiting the knowledge in that data can make fuzzing more intelligent and reduce labor costs. Machine learning, especially deep learning, has evolved rapidly in recent years and has made major breakthroughs in areas such as pattern recognition and data generation. As a result, more and more researchers are trying to overcome roadblocks in fuzzing with machine learning. This paper systematically surveys recent advances in machine learning applications for fuzzing. From the fuzzing workflow we identify five subtasks suitable for machine learning: input model inference, mutator inference, seed file scheduling, mutator scheduling, and test case filtering. For each subtask, we introduce the solutions used in traditional fuzzing and point out their deficiencies, then categorize and summarize the related machine learning work by the algorithms employed and the goals pursued. We analyze the popularity of different categories of machine learning algorithms and explain the reasons behind it. We discuss the design choices and typical solutions for dataset acquisition, data preprocessing, model training, and model evaluation in related work. We introduce pass rate, an important evaluation metric in generation-based fuzz testing. We analyze the evolution of machine learning algorithms in certain subfields. Finally, based on our analysis, we propose six promising directions for future research on applying machine learning algorithms to fuzz testing.
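The abstract names pass rate without defining it. A minimal sketch follows, assuming the common reading that pass rate is the fraction of generated inputs that the target's parser accepts. That definition, the helper names `pass_rate` and `parses_as_json`, and the use of Python's `json` module as a stand-in parser are all assumptions for illustration and do not come from the paper.

```python
import json
from typing import Callable, Iterable


def parses_as_json(text: str) -> bool:
    """Stand-in parser check: True if `text` is syntactically valid JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def pass_rate(inputs: Iterable[str], parses: Callable[[str], bool]) -> float:
    """Fraction of inputs accepted by the parser (assumed definition)."""
    items = list(inputs)
    if not items:
        return 0.0
    return sum(1 for item in items if parses(item)) / len(items)


if __name__ == "__main__":
    # Hypothetical outputs of a generation-based fuzzer targeting a JSON parser.
    generated = ['{"a": 1}', '[1, 2', 'true', '{a}']
    print(pass_rate(generated, parses_as_json))
```

Under this reading, a low pass rate means most generated inputs are rejected at the parsing stage and never exercise the code behind it, which is why the metric matters for generation-based fuzzing.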

Keywords: fuzzing; machine learning; deep learning