Decompilation Using Control Structure Recovery Techniques Towards Human Habits

Cao Ying; Liang Ruigang; Zhang Runze; Xu Dandan

Cite this article:

Cao Ying, Liang Ruigang, Zhang Runze, Xu Dandan. Decompilation Using Control Structure Recovery Techniques Towards Human Habits[J]. Journal of Cyber Security, accepted.

(Institute of Information Engineering, Chinese Academy of Sciences)

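Abstract:

Decompilers are widely used for security analysis of software (binaries) whose source code is unavailable, for example in malware analysis and in vulnerability discovery and verification. Such tasks usually require reverse engineers to analyze a binary in depth, and examining all of the assembly code instruction by instruction is time-consuming and inefficient. A decompiler helps reverse engineers recover the semantics of each function in the binary so that they can quickly locate key functions or code fragments, which greatly improves the efficiency of their code analysis. However, although current decompilers have made considerable efforts to improve the control structures of their decompiled code, the readability of the high-level control statements they generate still falls well short of human-written code, and reverse engineers must still spend a great deal of time manually analyzing the control conditions and logic of the code. This paper uses the human-aligned code understanding and code generation capabilities of large language models to propose LLMReStructor, a control structure optimization technique oriented toward human programming habits. Compared with traditional decompilers, LLMReStructor can recover control structures as statements that better match human programming habits, based on the function and usage scenario of the specific code. A comparative analysis against the source code shows that the control structures recovered by LLMReStructor are the closest to the corresponding source code. In addition, a questionnaire survey evaluating the readability of different decompilers shows that the decompiled code optimized by LLMReStructor is rated most favorably by users.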

Key words: Decompilation, Large Language Model, Control structure recovery, Readability

DOI：

Submitted: 2024-01-08; Revised: 2024-04-10

Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)

Decompilation Using Control Structure Recovery Techniques Towards Human Habits

Cao Ying, Liang Ruigang, Zhang Runze, Xu Dandan (xudandan@iie.ac.cn)

(Institute of Information Engineering, Chinese Academy of Sciences)

Abstract:

Decompilers are commonly used for security analysis of software whose source code is unavailable (binaries), in tasks such as malware analysis and vulnerability discovery and verification. These tasks usually require reverse engineers to analyze a binary in depth, but examining all of the assembly code instruction by instruction is time-consuming and inefficient. Decompilers help reverse engineers recover the semantics of each function in a binary, enabling them to locate critical functions or code segments quickly and significantly improving the efficiency of their analysis. However, despite substantial efforts to improve the control structures of decompiled code, the high-level control statements generated by current decompilers remain markedly less readable than human-written code, so reverse engineers must still spend considerable time manually analyzing control conditions and logic. This paper leverages the human-aligned code understanding and code generation capabilities of large language models to propose LLMReStructor, a control structure optimization technique oriented towards human programming habits. Compared with traditional decompilers, LLMReStructor can restore control structures to statements that align more closely with human programming habits, based on the specific function and usage scenario of the code. A comparative analysis against the source code shows that the control structures restored by LLMReStructor are the closest to the corresponding source code. In addition, a questionnaire survey on the readability of code from different decompilers shows that code optimized by LLMReStructor is the most favored by users. The approach integrates large language models into the decompilation process and narrows the gap between machine-generated and human-readable code in reverse engineering.

Key words: Decompilation, Large Language Model, Control structure recovery, Readability
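
To make the readability gap described in the abstract concrete, the following is a minimal, hand-written C sketch. It is not output from LLMReStructor or from any particular decompiler, and the function names are hypothetical. The first function mimics the goto-based control flow and opaque variable names that decompiled code often exhibits; the second expresses the same logic the way a programmer would normally write it.

```c
/* Illustrative only: hand-written to resemble decompiler-style output.
 * Sums the positive elements of an array using goto-based control flow. */
int sum_positive_decompiled(const int *a1, int a2)
{
    int v2 = 0;   /* accumulator */
    int i = 0;    /* loop index */

LABEL_1:
    if (i >= a2)
        goto LABEL_4;        /* loop exit */
    if (a1[i] <= 0)
        goto LABEL_3;        /* skip non-positive values */
    v2 += a1[i];
LABEL_3:
    ++i;
    goto LABEL_1;            /* back edge of the loop */
LABEL_4:
    return v2;
}

/* The same logic written with structured control statements,
 * as a human programmer would typically express it. */
int sum_positive_structured(const int *values, int count)
{
    int total = 0;
    for (int i = 0; i < count; ++i) {
        if (values[i] > 0)
            total += values[i];
    }
    return total;
}
```

Both functions compute the same result for any input; the difference lies only in how much effort a reader needs to recognize the loop and its condition, which is the kind of gap that control structure recovery aims to close.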