A Cybersecurity Event Extraction Method based on Machine Reading Comprehension

Huang Zhenyang; Wang Yucheng; Wang Gaosheng; Liu Peipei; Zhu Hongsong; Li Hong; Sun Limin

Cite this article:

Huang Zhenyang, Wang Yucheng, Wang Gaosheng, Liu Peipei, Zhu Hongsong, Li Hong, Sun Limin. A Cybersecurity Event Extraction Method based on Machine Reading Comprehension[J]. Journal of Cyber Security, accepted.

(Institute of Information Engineering, Chinese Academy of Sciences)

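Abstract:

The rapid development of information technology promotes economic and social prosperity, but it also brings new security risks and challenges to cyberspace. Text media on the Internet carry a large amount of information about cybersecurity events. Extracting this information from massive volumes of text provides structured data for downstream tasks such as cyber threat monitoring, reasoning about relations between security events, and predicting how security events evolve, which supports risk assessment and defense deployment and helps mitigate cybersecurity threats. Existing cybersecurity event extraction methods usually treat event extraction as a classification task. Although they have achieved satisfactory results to some extent, their overall extraction performance is limited by their reliance on large amounts of labeled data and feature engineering and by their complex model designs. To address these problems, this paper proposes a cybersecurity event extraction method based on machine reading comprehension, named CEReader. It formalizes cybersecurity event extraction as a reading comprehension problem, extracts event information through multiple rounds of question answering, and incorporates external resources for data augmentation during training. At the prediction stage, it uses an answer prediction method based on a biaffine attention mechanism, which mitigates the inaccurate predictions that arise when existing reading comprehension methods consider only token-level information during answer prediction. Experimental results show that CEReader performs well on the cybersecurity event dataset CASIE. In particular, CEReader also achieves satisfactory extraction results in few-shot settings, making it suitable for cybersecurity event extraction scenarios.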

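Key words: cybersecurity event; event extraction; information extraction; machine reading comprehension

DOI:

Submitted: 2022-03-07; Revised: 2022-03-29

Funding: National Key Research and Development Program of the Ministry of Science and Technology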
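The multi-turn question answering formulation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the role names, the question templates, and the `toy_answerer` function are all hypothetical stand-ins, and a real system would replace `toy_answerer` with a trained reading comprehension model.

```python
from typing import Callable, Dict, Optional

# Hypothetical role inventory and question templates, for illustration only.
# They are not taken from the paper or from the CASIE annotation scheme.
ROLE_TEMPLATES: Dict[str, str] = {
    "attacker": "Who carried out the {trigger}?",
    "victim": "Who was affected by the {trigger}?",
    "tool": "What tool was used in the {trigger}?",
}

# An answerer maps (question, context) to an answer span, or None if there is none.
Answerer = Callable[[str, str], Optional[str]]


def extract_event(context: str, answer: Answerer) -> Dict[str, Optional[str]]:
    """Turn 1 asks for the event trigger; later turns ask one question per role."""
    trigger = answer("Which word or phrase signals a security event?", context)
    result: Dict[str, Optional[str]] = {"trigger": trigger}
    if trigger is None:
        return result  # no event found, so no argument questions are asked
    for role, template in ROLE_TEMPLATES.items():
        # The answer from the first turn is fed into the later questions.
        result[role] = answer(template.format(trigger=trigger), context)
    return result


def toy_answerer(question: str, context: str) -> Optional[str]:
    """Stand-in for a trained model: a fixed lookup keyed on question wording."""
    lookup = {
        "signals": "ransomware attack",
        "carried out": "hacker group",
        "affected": "the hospital",
    }
    for key, span in lookup.items():
        if key in question and span in context:
            return span
    return None


context = "A hacker group launched a ransomware attack against the hospital."
event = extract_event(context, toy_answerer)
```

Under these assumptions, each later question is conditioned on the answer to the first turn, which is what distinguishes a multi-turn formulation from asking all questions independently.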

A Cybersecurity Event Extraction Method based on Machine Reading Comprehension

Huang Zhenyang, Wang Yucheng, Wang Gaosheng, Liu Peipei, Zhu Hongsong, Li Hong, Sun Limin

(Institute of Information Engineering, Chinese Academy of Sciences)

Abstract:

The rapid development of information technology not only promotes the prosperity and progress of the economy and society, but also brings new risks and challenges to cyberspace. Texts on the Internet often carry a large amount of information about cybersecurity events. Extracting structured cybersecurity event information from these texts can support downstream tasks such as network threat monitoring, cybersecurity event relationship reasoning, and cybersecurity event evolution prediction, which is conducive to risk assessment and defense deployment and thus helps alleviate network security threats. Existing cybersecurity event extraction methods usually treat event extraction as a classification problem. Although these methods have achieved promising results to a certain extent, their overall extraction performance is limited by their dependence on large amounts of labeled data and feature engineering, as well as by the complexity of their model designs. To address these issues, we propose CEReader, a method for cybersecurity event extraction based on machine reading comprehension. We formulate cybersecurity event extraction as a machine reading comprehension problem and extract event information through multiple turns of question answering. In addition, external resources are integrated for data augmentation during training. In the inference phase, a biaffine attention mechanism is adopted to predict the answer, which mitigates the inaccurate answer prediction that results when existing reading comprehension methods consider only token-level information. The experimental results on the CASIE dataset demonstrate the effectiveness of our method. Even in data-scarce scenarios, our method still performs well, which suggests that it is suitable for cybersecurity event extraction.

Key words: cybersecurity event; event extraction; information extraction; machine reading comprehension
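The abstract names a biaffine attention mechanism for answer prediction but does not give its exact form. The sketch below assumes the commonly used biaffine scoring function, score(i, j) = s_i^T U e_j + w · [s_i; e_j] + b, where s_i and e_j are start and end representations of tokens i and j. The parameters `U`, `w`, `b`, the toy vectors, and the `max_len` limit are all illustrative assumptions. The point it shows is that every (start, end) pair is scored jointly, rather than picking start and end tokens independently.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def biaffine_score(start: Vector, end: Vector, U: Matrix, w: Vector, b: float) -> float:
    """Score one candidate span: start^T U end + w . [start; end] + b."""
    bilinear = sum(
        start[p] * U[p][q] * end[q]
        for p in range(len(start))
        for q in range(len(end))
    )
    linear = sum(wi * xi for wi, xi in zip(w, start + end))
    return bilinear + linear + b


def best_span(
    starts: List[Vector], ends: List[Vector],
    U: Matrix, w: Vector, b: float, max_len: int,
) -> Tuple[int, int, float]:
    """Return (start index, end index, score) of the highest-scoring span."""
    best = (0, 0, float("-inf"))
    for i in range(len(starts)):
        # Only spans with start <= end and at most max_len tokens are considered.
        for j in range(i, min(i + max_len, len(ends))):
            score = biaffine_score(starts[i], ends[j], U, w, b)
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy 2-dimensional start and end representations for a 4-token passage (made up).
starts = [[0.1, 0.0], [1.0, 0.0], [0.0, 0.2], [0.0, 0.0]]
ends = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 0.3]]
U = [[1.0, 0.0], [0.0, 1.0]]   # identity matrix, so the bilinear term is a dot product
w = [0.0, 0.0, 0.0, 0.0]       # linear term switched off for clarity
b = 0.0

i, j, score = best_span(starts, ends, U, w, b, max_len=3)
```

With these toy values the highest-scoring span runs from token 1 to token 2. In a trained model, the start and end representations would come from an encoder and `U`, `w`, and `b` would be learned.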