A Survey on Paraphrase Identification Technology
LI Boxin, LI Peng, QI Baoyuan, WANG Bin, WANG Lihong
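(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China)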
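Abstract:
Network content security is receiving growing attention from all sectors. Paraphrase identification, the natural language processing task of judging whether two texts have the same meaning, can cluster views and opinions that share a meaning but differ in wording, which greatly improves the efficiency of public opinion monitoring; it can also recognize harmful or sensitive information that has been rewritten, effectively raising the recall of such information. This paper surveys current research progress in paraphrase identification. It first introduces the concept, application scenarios, and state of research. It then classifies paraphrase identification methods, by how they compute their decision, into similarity-based methods and feature-based methods, describes the characteristics, strengths, and weaknesses of each class in turn, and details representative methods, with an emphasis on deep-learning-based approaches. Finally, it analyzes in detail the open problems in paraphrase identification and discusses future trends.
Keywords:  network content security  online public opinion monitoring  natural language processing  paraphrase identification  deep learning  neural network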
DOI:10.19363/J.cnki.cn10-1380/tn.2020.09.07
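Received: 2018-08-16  Revised: 2019-01-25
Funding: This work was supported by the National Key Research and Development Program of China (No. 2016YFB0801003) and the Strategic Priority Research Program (Category C) of the Chinese Academy of Sciences (No. XDC02040400).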
A Survey on Paraphrase Identification Technology
LI Boxin,LI Peng,QI Baoyuan,WANG Bin,WANG Lihong
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
Abstract:
Network content security has received increasing attention from all sectors. Paraphrase identification, the natural language processing task of judging whether two texts convey the same meaning, is well suited to this setting. It can group views and opinions that share a meaning but differ in wording into the same category, greatly improving the efficiency of online public opinion monitoring. It can also identify harmful or sensitive information that has been rewritten, effectively improving the recall of such information. This paper reviews research progress in paraphrase identification. First, we introduce the concept, application scenarios, and research status of paraphrase identification. Second, we classify paraphrase identification methods, by how they compute their decision, into two categories: similarity-based methods and feature-based methods. We then describe the characteristics, advantages, and disadvantages of each category in turn and detail some representative methods, with particular emphasis on deep learning methods. Finally, we analyze the current problems in detail and discuss future trends in the field.
Key words:  web content security  public opinion monitoring  natural language processing  paraphrase identification  deep learning  neural network
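
As a rough illustration of the similarity-based category named in the abstract, the following minimal Python sketch scores two sentences by bag-of-words cosine similarity and labels them paraphrases when the score reaches a threshold. It is not a method taken from the survey: the function names (`tokenize`, `cosine_similarity`, `is_paraphrase`), the word-level tokenization, and the threshold value 0.7 are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative choice)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the tokens the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    # Product of the two vector norms, computed under one square root.
    norm = math.sqrt(
        sum(c * c for c in va.values()) * sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def is_paraphrase(a, b, threshold=0.7):
    """Label a pair as paraphrases when similarity reaches the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on labeled sentence pairs.
    """
    return cosine_similarity(a, b) >= threshold


if __name__ == "__main__":
    print(is_paraphrase("The movie was great", "The film was great"))
    print(is_paraphrase("The movie was great", "Stocks fell sharply today"))
```

A purely lexical score like this one cannot tell that "movie" and "film" are synonyms; it only rewards shared surface words. A feature-based method would instead feed such scores, among other signals, into a trained classifier.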