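Abstract:
Writing on the Internet tends to be casual and informal. Morphs are a distinctive feature of Internet language as a non-standard language: to evade censorship, express emotion, satirize, or entertain, people often replace relatively serious, standard, or sensitive words with relatively non-standard, non-sensitive ones, and the new word that stands in for the original is called a morph. A morph and its corresponding original word (the target entity) coexist in informal and formal text respectively, and morphs can even seep into formal text. Morphs make writing livelier and help related events and news spread more widely. However, because a morph is usually a kind of metaphor and no longer carries the literal meaning of its surface characters, online text differs greatly from formal text such as news. Identifying morphs and their target entities is therefore of great importance to downstream natural language processing. This paper first introduces the definition and characteristics of morphs and the rules by which they are generated, then summarizes the main technical progress and results in morph identification and normalization, and finally discusses future directions for the field.
Key words: social network; morph identification; morph normalization; deep learning; neural network; representation learning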
DOI:10.19363/j.cnki.cn10-1380/tn.2016.03.006 |
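Received: April 01, 2016; Revised: June 16, 2016
Funding: This work was supported by the National Key Technology R&D Program of China (No. 2012BAH46B03) and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06030200).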
|
Chinese Morphs Identification and Normalization |
SHA Ying,LIANG Qi,WANG Bin |
China Institute of Information Engineering, CAS, Beijing 100093, China
Abstract: |
Internet language is casual and informal, and entity morphs are one of its important features. Internet users often create morphs, a special kind of fake alternative name, to achieve certain goals, such as expressing strong sentiment or humor or avoiding censorship. Entity morphs and their target entities appear in informal and formal text respectively, and in some cases entity morphs even appear in formal text. Although morphs have some advantages, they are a major barrier for natural language processing (NLP), so research on morph identification and normalization is important. First, we introduce the definition and features of morphs; second, we describe the rules by which morphs are generated; third, we review current progress in morph identification and normalization. Finally, we discuss the prospects of this field.
Key words: social network; morph identification; morph normalization; deep learning; neural network; representation learning