Cite this article
  • Zhang Bolun, Liang Ruigang. RoBin: A Cross-Modal Binary Program Contrastive Learning Framework Based on Rotating Negative Sample Queues[J]. Journal of Cyber Security, accepted.

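RoBin: A Cross-Modal Contrastive Learning Framework for Binary Code Based on Rotating Negative Sample Queues
Zhang Bolun, Liang Ruigang
(Institute of Information Engineering, Chinese Academy of Sciences)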
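Abstract:
Binary code analysis plays an indispensable role in critical tasks such as vulnerability discovery and security analysis. Existing methods, however, have two main limitations. First, most work is confined to a single alignment scenario (aligning binary code with either source code or natural language, but not both) and lacks a unified cross-modal alignment capability, which makes complex cross-modal tasks hard to support. Second, the loss of semantic information in binary code, together with the insufficient coverage of hard negative samples in current contrastive learning methods, significantly limits cross-modal semantic modeling. To address these problems, this paper proposes RoBin, a cross-modal contrastive learning framework for binary code. Through extensive exploratory experiments, the framework first establishes that, in the "binary, source code, natural language" setting, the best model structure uses independent models to process each modality separately. On this basis, a staged training scheme is designed: masked language modeling gives the binary model a basic understanding of syntax; bimodal supervision improves the binary model's initial semantic alignment; a dynamic alignment stage brings the representation spaces of the three modality models into agreement; and a novel "rotating negative sample queue algorithm" greatly expands hard negative coverage without significantly increasing training cost. Experimental results show that RoBin achieves significant gains on typical cross-modal retrieval tasks: when retrieving from 10,000 candidates, the score for retrieving binary functions from source code rises from 39% to 76.7%, and the score for retrieving binary functions from natural language rises from 53.77% to 65.29%. These results significantly surpass comparable state-of-the-art methods, making RoBin the first dedicated framework with native support for joint three-modality alignment and offering new technical ideas and engineering practices for binary analysis under multimodal collaboration.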
Key words:  binary representation learning  contrastive learning  cross-modal retrieval
DOI:
Received: 2024-12-30  Revised: 2025-02-18
Funding:
RoBin: A Cross-Modal Binary Program Contrastive Learning Framework Based on Rotating Negative Sample Queues
Zhang Bolun, Liang Ruigang
(Institute of Information Engineering, CAS)
Abstract:
Binary code analysis is essential to vulnerability discovery and security analysis, but current methods have two main weaknesses. First, most research aligns binary code with either source code or natural language, not both, which makes it hard to support tasks that need all three modalities together. Second, binary code loses much of its semantic information, and existing contrastive learning methods cover too few hard negative samples, so cross-modal semantic modeling remains difficult. To address these issues, this paper proposes RoBin, a cross-modal contrastive learning framework for binary code. Exploratory experiments first compare model designs and find that processing binary code, source code, and natural language with independent models, while aligning their representations, works best. The framework is then trained in stages: masked language modeling teaches the binary model basic syntax; bimodal supervision provides an initial semantic alignment; a dynamic alignment stage brings the representation spaces of all three modality models into agreement; and a rotating negative sample queue expands hard negative coverage without significantly increasing training cost. The framework is also memory-efficient. Experiments show large improvements: when retrieving from 10,000 candidates, RoBin raises the success rate of retrieving binary functions from source code from 39% to 76.7%, and from natural language from 53.77% to 65.29%. These results surpass other state-of-the-art methods and make RoBin the first dedicated framework to natively align all three modalities, offering new ideas and tools for multimodal binary analysis.
Key words:  binary code representation learning  contrastive learning  cross-modal retrieval
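
The abstract names a negative sample queue but does not describe how it rotates, so the following is only a minimal sketch of the general idea of contrastive learning with queued negatives. It substitutes a plain first-in-first-out queue (in the style of momentum-contrast methods) for the paper's rotating queue, and it should not be read as the authors' algorithm. The names `NegativeQueue`, `info_nce`, and `batch_loss`, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math
from collections import deque


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class NegativeQueue:
    """Fixed-size FIFO buffer of embeddings from earlier batches.

    Hypothetical helper: a plain FIFO stand-in, not the paper's rotating queue.
    """

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first

    def enqueue(self, embeddings):
        self.items.extend(embeddings)

    def negatives(self):
        return list(self.items)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor: negative log-softmax of the positive."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[0]


def batch_loss(anchors, positives, queue):
    """Average InfoNCE over a batch, using in-batch plus queued negatives."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    queued = queue.negatives()
    total = 0.0
    for i, anchor in enumerate(anchors):
        in_batch = [p for j, p in enumerate(positives) if j != i]
        total += info_nce(anchor, positives[i], in_batch + queued)
    queue.enqueue(positives)  # this batch becomes negatives for later batches
    return total / len(anchors)


# Toy usage: "binary" embeddings as anchors, "source code" embeddings as positives.
queue = NegativeQueue(capacity=4)
binary_batch = [[1.0, 0.0], [0.0, 1.0]]
source_batch = [[0.9, 0.1], [0.1, 0.9]]
first_loss = batch_loss(binary_batch, source_batch, queue)
second_loss = batch_loss(binary_batch, source_batch, queue)  # now sees queued negatives
```

In this sketch the queue lets each anchor be compared against more negatives than a single batch holds, at the cost of storing a few extra embeddings. Because the second call sees additional negatives for the same pairs, its loss is higher than the first. How RoBin selects and rotates hard negatives is not stated in the abstract and is left out here.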