A Survey on Paraphrase Identification Technology

LI Boxin; LI Peng; QI Baoyuan; WANG Bin; WANG Lihong

Cite this article:

LI Boxin, LI Peng, QI Baoyuan, WANG Bin, WANG Lihong. A Survey on Paraphrase Identification Technology[J]. Journal of Cyber Security, 2020, 5(5): 95-109

LI Boxin^1,2, LI Peng^1,2, QI Baoyuan^1,2, WANG Bin^1,2, WANG Lihong^3
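(1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; 3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China)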

Abstract:

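Network content security is drawing growing attention from all sectors. Paraphrase identification, the natural language processing task of judging whether two texts have the same meaning, can cluster views and opinions that share a meaning but differ in wording, which substantially improves the efficiency of public opinion monitoring; it can also identify harmful or sensitive information that has been rewritten, which effectively improves the recall of such information. This paper surveys research progress in paraphrase identification. We first introduce the concept, application scenarios, and research status of paraphrase identification. We then classify paraphrase identification methods, by how they compute their decision, into similarity-based methods and feature-based methods; we describe the characteristics, advantages, and disadvantages of each class in turn and detail some representative methods, with an emphasis on deep-learning-based methods. Finally, we analyze in detail the problems that paraphrase identification currently faces and discuss future development trends.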

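Keywords: network content security; online public opinion monitoring; natural language processing; paraphrase identification; deep learning; neural network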

DOI: 10.19363/J.cnki.cn10-1380/tn.2020.09.07

Received: 2018-08-16; Revised: 2019-01-25

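Funding: This work was supported by the National Key Research and Development Program of China (No. 2016YFB0801003) and the Strategic Priority Research Program (Category C) of the Chinese Academy of Sciences (No. XDC02040400).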

A Survey on Paraphrase Identification Technology

LI Boxin^1,2, LI Peng^1,2, QI Baoyuan^1,2, WANG Bin^1,2, WANG Lihong^3

(1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; 3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China)

Abstract:

Network content security has received increasing attention from all sectors. Paraphrase identification, the natural language processing task of judging whether two texts express the same meaning, is well suited to this setting. It can group views and opinions that share a meaning but differ in wording into the same category, greatly improving the efficiency of online public opinion monitoring. It can also identify rewritten sensitive information and so improve the recall of harmful sensitive information. This paper surveys research progress in paraphrase identification. First, we introduce the concept, application scenarios, and research status of paraphrase identification. Second, we classify paraphrase identification methods into two categories: similarity-based methods and feature-based methods. We then describe the characteristics, advantages, and disadvantages of each category in turn and detail some representative methods, with a focus on deep learning methods. Finally, we analyze the current problems of the field in detail and discuss its prospects.

Key words: web content security; public opinion monitoring; natural language processing; paraphrase identification; deep learning; neural network