Malware Homology Analysis under Diverse Compilation Environments

LIU Xinyi; PENG Guojun; LIU Side; YANG Xiuzhang; FU Jianming

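(Key Laboratory of Space Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China; School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China)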

Abstract:

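As the number of malware samples grows sharply, malware homology analysis has become increasingly important for reducing the workload of manual attribution. However, when attackers reuse malicious code, they must configure a specific compilation environment for each attack scenario. As a result, homologous binaries differ substantially at the syntactic and structural levels, which lowers the accuracy of malware homology analysis. To address this problem, this paper analyzes how the compilation environment affects binary generation and implements an accurate, unsupervised, and efficient malware homology analysis scheme. Binary lifting and re-optimization are used to unify binaries at the intermediate-representation level, which eliminates syntactic and structural changes to some extent. To address the shortcomings of the traditional CBOW model in learning the semantics of code tokens, an instruction-level contextual semantics learning scheme is proposed; to account for the low-probability event of context-independent instructions, it is combined with the SIF model to compute basic-block feature vectors. In addition, because library functions and strings in malware carry richer sensitive information, this paper proposes an algorithm for building the initial matching set of basic blocks, which further improves the accuracy of malware homology analysis on top of the K-Hop greedy matching algorithm and the linear matching algorithm. Experiments show that, for the open-source malware Mirai, the scheme achieves better overall performance in analysis accuracy and runtime overhead than existing unsupervised and pre-trained models. Moreover, for other types of malware, the homology indices output by the scheme all exceed the homology threshold preset in this paper, which further demonstrates its effectiveness.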

Keywords: malware homology; compilation environments; semantic learning

DOI：10.19363/J.cnki.cn10-1380/tn.2024.11.03

Received: 2023-02-14; Revised: 2023-10-03

Funding: This work was supported by the National Natural Science Foundation of China (No. 62172308, No. U1626107, No. 61972297, No. 62172144).

Malware Homology Analysis under Diverse Compilation Environments

LIU Xinyi,PENG Guojun,LIU Side,YANG Xiuzhang,FU Jianming

Key Laboratory of Space Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China; School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China

Abstract:

With the sharp increase in the number of malware samples, malware homology analysis has become increasingly important for reducing the workload of manual attribution. However, when attackers reuse malicious code, they must set up a specific compilation environment for each attack scenario. This diversity in compilation environments leads to significant variations in the syntax and structure of homologous binaries, which lowers the accuracy of malware homology analysis. To solve this problem, we analyze the impact of compilation environments on binary generation and implement an accurate, unsupervised, and efficient malware homology analysis scheme. We adopt binary lifting and re-optimization to unify binaries at the same intermediate representation layer, which eliminates syntactic and structural changes to a certain extent. To address the insufficiency of the traditional Continuous Bag of Words (CBOW) model in learning token semantics, we propose an instruction-level contextual semantics learning scheme. To account for the low-probability event of context-independent instructions, we combine it with the Smooth Inverse Frequency (SIF) model to calculate the feature vectors of basic blocks. In addition, because library functions and strings in malware contain richer sensitive information, we propose an algorithm for establishing the initial matching set of basic blocks, which further improves the accuracy of malware homology analysis on top of the K-Hop greedy matching algorithm and the linear matching algorithm. Experiments show that, for the open-source malware Mirai, our scheme achieves better overall performance in analysis accuracy and running cost than the existing unsupervised and pre-trained models. Moreover, for other types of malware, the homology indices output by the scheme are all higher than the homology judgment threshold we set, which further demonstrates its effectiveness.
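
As an illustration of the SIF step mentioned in the abstract, the following is a minimal sketch of how basic-block feature vectors could be computed from instruction embeddings. It is not the authors' implementation: the function name `sif_block_embeddings`, the smoothing parameter `a = 1e-3`, and the input layout (a list of blocks, each a list of instruction tokens, plus a token-to-vector dictionary such as one produced by a CBOW-style model) are all assumptions made for the example.

```python
import numpy as np
from collections import Counter


def sif_block_embeddings(blocks, vectors, a=1e-3):
    """Hypothetical helper: SIF-style embeddings for basic blocks.

    blocks  : list of basic blocks, each a list of instruction tokens
    vectors : dict mapping an instruction token to a 1-D numpy array
    a       : SIF smoothing parameter (assumed value, not from the paper)
    """
    # Estimate the unigram probability p(w) of each instruction token.
    counts = Counter(tok for block in blocks for tok in block)
    total = sum(counts.values())
    dim = len(next(iter(vectors.values())))

    emb = np.zeros((len(blocks), dim))
    for i, block in enumerate(blocks):
        known = [t for t in block if t in vectors]
        if not known:
            continue  # empty or fully unknown block keeps a zero vector
        # Frequent instructions get small weights: a / (a + p(w)).
        weights = np.array([a / (a + counts[t] / total) for t in known])
        vecs = np.stack([vectors[t] for t in known])
        emb[i] = weights @ vecs / len(known)

    # Remove the projection onto the first principal direction, which
    # mostly captures what is common to all blocks.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)
```

Two blocks can then be compared, for example, by the cosine similarity of their rows in the returned matrix. The down-weighting of frequent instructions is what lets rare but informative instructions dominate a block's vector.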

Key words: malware homology; compilation environments; semantic learning