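(Key Laboratory of Space Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China; School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China)
Keywords:  malware homology  compilation environment  semantic learning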
Malware Homology Analysis under Diverse Compilation Environments
LIU Xinyi,PENG Guojun,LIU Side,YANG Xiuzhang,FU Jianming
Key Laboratory of Space Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China;School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
As the number of malware samples rises sharply, malware homology analysis has become increasingly important for reducing the workload of manual attribution. However, attackers who reuse malicious code must set up a specific compilation environment for each attack scenario. This diversity of compilation environments causes significant variation in the syntax and structure of homologous binaries, which compromises the accuracy of malware homology analysis. To address this problem, we analyze the impact of compilation environments on binary generation and implement an accurate, unsupervised, and efficient malware homology analysis scheme. We use binary lifting and re-optimization to bring binaries to the same intermediate representation layer, which eliminates syntactic and structural differences to a certain extent. Because the traditional Continuous Bag of Words (CBOW) model is insufficient for learning token semantics, we propose an instruction-level contextual semantics learning scheme. To account for the low-probability occurrence of context-independent instructions, we use the Smooth Inverse Frequency (SIF) model to compute feature vectors for basic blocks. In addition, because library functions and strings in malware carry richer sensitive information, we propose an algorithm for establishing the initial matching set of basic blocks, which further improves the accuracy of homology analysis based on the K-Hop greedy matching algorithm and the linear matching algorithm. Experimental results demonstrate the effectiveness of our solution. When applied to the open-source malware Mirai, it achieves better overall performance in analysis accuracy and running cost than the existing unsupervised model and pre-trained model. For various other types of malware, the homology indices output by the scheme are all higher than the homology judgment threshold we set, further validating its utility in the field of malware homology analysis.
Key words:  malware homology  compilation environments  semantic learning
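
To illustrate the SIF weighting mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each normalized instruction already has an embedding vector and that instruction frequencies are estimated from a corpus of basic blocks. The function names, the toy embeddings, and the smoothing constant a = 1e-3 are hypothetical choices made for illustration. The full SIF method also subtracts the projection onto the first principal component of the resulting vectors; that step is omitted here for brevity.

```python
from collections import Counter


def unigram_frequencies(blocks):
    """Estimate p(instruction) from a corpus of basic blocks (lists of tokens)."""
    counts = Counter(ins for block in blocks for ins in block)
    total = sum(counts.values())
    return {ins: c / total for ins, c in counts.items()}


def sif_block_vector(instructions, embeddings, frequencies, a=1e-3):
    """Weighted average of instruction embeddings for one basic block.

    Each instruction is weighted by a / (a + p(instruction)), so frequent,
    low-information instructions contribute less to the block vector.
    Instructions without an embedding are skipped (an assumption of this sketch).
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    used = 0
    for ins in instructions:
        vec = embeddings.get(ins)
        if vec is None:
            continue
        weight = a / (a + frequencies.get(ins, 0.0))
        for i, x in enumerate(vec):
            total[i] += weight * x
        used += 1
    if used == 0:
        return total  # zero vector when nothing in the block is known
    return [x / used for x in total]


if __name__ == "__main__":
    # Toy corpus: "mov" is three times as frequent as "call".
    corpus = [["mov", "mov", "mov", "call"]]
    freqs = unigram_frequencies(corpus)
    toy_embeddings = {"mov": [1.0, 0.0], "call": [0.0, 1.0]}
    print(sif_block_vector(["mov", "call"], toy_embeddings, freqs))
```

In this toy setup the rarer `call` instruction receives a larger weight than `mov`, so the second component of the block vector dominates, which is the intended effect of down-weighting common instructions.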