DOI:10.19363/J.cnki.cn10-1380/tn.2023.05.06
Received: 2022-08-31; Revised: 2022-12-15
Funding: This work was supported by the Industrial Internet Innovation and Development Project (No. TC200H030) and the 2021 cooperation program between Chongqing municipal undergraduate universities and institutes affiliated with the Chinese Academy of Sciences (No. HZ2021015).
A Review of Deep Learning Based Generation Tasks Across Natural and Programming Languages
SONG Xiaoyi, ZHANG Ruoding, ZHANG Yan, ZHANG Meishan, LI Jiatong
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China;Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
Abstract:
In recent years, with the development of artificial intelligence technology, many programmers expect computers to automatically complete tasks such as writing program code and generating code comments. Generation across natural and programming languages (NL-PL) refers to exactly this kind of task: the mutual conversion between natural language and programming language, comprising natural-language-to-programming-language generation and programming-language-to-natural-language generation. Research on and applications of NL-PL generation have grown rapidly in the past few years. In particular, with the development of deep learning (DL) technology, more and more researchers have adopted DL techniques to improve the performance of NL-PL generation tasks, achieving numerous breakthroughs by optimizing program representations, improving neural network models, and designing large-scale pre-trained models. Meanwhile, large Internet companies have gradually put the research results in this field into commercial use, so the security of deployed models has attracted close attention from both academia and industry. To study NL-PL generation techniques systematically, it is necessary to organize the existing results. In this paper, we take program generation and comment generation, two typical NL-PL generation tasks, as the entry point and review the latest representative literature in this field. We abstract a generic implementation model of DL-based NL-PL generation from the existing literature and divide it into three components: program representation, language processing, and language generation. On the basis of this generic model, we further analyze existing work in terms of program code representation methods, network model structures, industrial applications, security issues arising in deployment and the current state of security research, commonly used datasets, and model performance. Finally, we summarize the open problems in this field and discuss possible future research directions.
Key words: deep learning | cross natural and programming languages (NL-PL) | program representation | model algorithm
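To make the task setting concrete, the following minimal sketch illustrates the programming-language-to-natural-language direction (comment generation) with an off-the-shelf pre-trained sequence-to-sequence model. It is only an illustration of the task, not the implementation model proposed in this survey; it assumes the Hugging Face transformers library and the publicly released CodeT5 summarization checkpoint (Salesforce/codet5-base-multi-sum).

# Illustrative sketch (assumptions: transformers is installed and the
# Salesforce/codet5-base-multi-sum checkpoint is available); not the survey's own code.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5-base-multi-sum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# PL -> NL: the source code is the input sequence; the model generates a
# natural-language summary that can serve as a code comment.
code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))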