Proceedings of the ACM on Programming Languages, Vol. 1, No. HATRA, Article 1. Publication date: December 2022.
Exploring the Verifiability of Code Generated by GitHub
Copilot
DAKOTA WONG, University of Waterloo, Canada
AUSTIN KOTHIG, University of Waterloo, Canada
PATRICK LAM, University of Waterloo, Canada
GitHub’s Copilot generates code quickly. We investigate whether it generates good code. Our approach is to
identify a set of problems, ask Copilot to generate solutions, and attempt to formally verify these solutions
with Dafny. Our formal verification is with respect to hand-crafted specifications. We have carried out this
process on 6 problems and succeeded in formally verifying 4 of the created solutions. We found evidence
which corroborates the current consensus in the literature: Copilot is a powerful tool; however, it should not be “flying the plane” by itself.
CCS Concepts: • Software and its engineering → General programming languages; • Social and professional topics → History of programming languages.
Additional Key Words and Phrases: GitHub Copilot, Program Synthesis, Software Verification.
ACM Reference Format:
Dakota Wong, Austin Kothig, and Patrick Lam. 2022. Exploring the Verifiability of Code Generated by GitHub
Copilot. Proc. ACM Program. Lang. 1, HATRA, Article 1 (December 2022), 10 pages.
1 INTRODUCTION
In aviation, commercial airplanes have two yokes (controls), which allow a primary pilot and a copilot to share responsibilities and workload. Both operators are fully trained, informed, and capable of flying the aircraft; however, only one of them needs to be controlling the aircraft at any given moment. In software development, code is traditionally written by an individual; in team environments it is typically scrutinized and assessed by a number of people, but it still originates with one developer. Like pilots, software developers are ordinarily well trained, informed, and capable of programming; however, inefficient or incorrect code segments can slip through the cracks and into code bases when a developer does not fully understand their tools, the domain, or the required program specifications.
GitHub, with the use of OpenAI’s Codex [Chen et al. 2021], has released an “AI pair programmer”
to assist in software development [GitHub 2022]. This programming partner is aptly named Copilot.
The aim of this tool is to shift the programmer’s time and attention away from boilerplate and
repetitive code patterns, and toward the critical design of the system. Copilot works by providing
implementation suggestions based on natural language descriptions of the required code. Copilot is
currently capable of generating code in eleven programming languages, including C, C++, C#, Go, Java, JavaScript, PHP, Python, Ruby, Scala, and TypeScript.¹
The community is struggling to understand what Copilot can and cannot do, and proofs are one
indicator of where Copilot might stand in terms of generating correct code. We know that Copilot
definitely generates some code, but what can we conclude about the code that it generates? Thus,
our goal is to gain some understanding of the quality of Copilot’s generated code; our approach is
¹ https://docs.github.com/en/copilot/overview-of-github-copilot/about-github-copilot
Authors’ addresses: Dakota Wong, University of Waterloo, Canada, [email protected]; Austin Kothig, University of
Waterloo, Canada, [email protected]; Patrick Lam, University of Waterloo, Canada, [email protected].
2022. 2475-1421/2022/12-ART1 $15.00. https://doi.org/
to attempt to verify selected examples of Copilot-generated code using the software verification
language Dafny [Leino 2010].
2 RELATED WORK
There has been a great deal of scrutiny of Copilot and of the code it produces. Early studies which
aim to empirically evaluate Copilot [Nguyen and Nadi 2022] have found that while this early
version of the model is not perfect, its current results are impressive. In their work, Nguyen and
Nadi tested Copilot against 33 LeetCode² questions (4 from the easy category, 17 from the medium
category, and 12 from the hard category) using four programming languages (Python, JavaScript,
Java, C), ran the generated code against the LeetCode test cases, and examined the percentage of
passing cases. Java performed best, being able to solve all test cases for 19 of the 33 problems. The
authors found that, often, the code that Copilot generated could be simplified, and that sometimes
the generated code would require methods or packages that have not been defined or declared.
Copilot generates code differently from traditional genetic programming paradigms. Sobania et al.
[2022] found that Copilot could generate correct solutions to a variety of problems, with comparable
accuracy to genetic programming. The improvement that Copilot makes over traditional approaches
is that it was trained on a 159 GB dataset of unlabelled data, whereas genetic programming requires
the dataset to be rigorously labeled. Thus, Sobania et al. claimed that the code Copilot generates is
often easier to read and understand compared to that generated by genetic programming.
It is clear from the current research that Copilot is not a magic wand solution to problems in
software engineering. With its flaws, how are developers able to benefit from this tool effectively?
Dakhel et al. [2022] compared Copilot’s proposed solutions against human programmers on a
set of fundamental problems. The authors found that Copilot was sometimes unable to understand certain limitations embedded in a problem description: while it generally grasped the fundamentals of what was being requested, it could produce errors when it missed such a constraint. They conclude
that while the generated code from Copilot is not a perfect solution, it can still be incorporated into a project and modified, speeding up early-stage prototyping.
Continuing on the question of usability of Copilot-generated code by developers, Vaithilingam
et al. [2022] performed a user study. They found that using Copilot did not improve the task
completion time or success rate, but that the developers preferred to start from the Copilot-
generated code rather than a blank slate. The researchers also pointed out that it was challenging
for developers to work with long blocks of generated code (and indeed, some developers failed
to complete on time due to difficulties debugging generated code)—attempting to verify the code
would also be a challenge, though one that might provide more understanding. One difference
between our study and theirs, though, is that we did not encounter generated solutions that were
known incorrect, and their users did.
Any introduction of small errors can potentially compound over time, creating large vulnera-
bilities in a software stack. Pearce et al. [2022] investigated code generated by Copilot for known
security vulnerabilities. The authors prompted Copilot to generate code for 89 cybersecurity scenar-
ios. They found that approximately 40% of generated code snippets were vulnerable to some form of
known attack. Also, Siddiq et al. [2022] found code smells (including security smells) in training sets
used for transformer-based code generation models for Python as well as in suggestions produced
by Copilot. If code generated by Copilot is to be used in production, it needs to be rigorously vetted for exploits (keeping in mind the caveat from Vaithilingam et al. about understanding Copilot’s generated code).
² https://leetcode.com/
3 METHODOLOGY
Our goal is to apply verification techniques to code generated by GitHub Copilot. We came up with
sample problems and tasked Copilot with generating solutions in Python. We then used Dafny to
verify that Copilot’s generated code satisfies the requirements.
More specifically, our methodology is as follows. First, we chose 6 problems to give to Copilot
(five that we chose ourselves and one from Nguyen and Nadi [2022]). We wanted to test whether
Copilot could generate a function which would solve the problems; by “solve”, we mean that the
generated code satisfies the problem’s formal specifications. Thus, we wrote formal specifications
in Dafny for each problem, consisting of requires and ensures statements.
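In Python terms, a requires/ensures pair corresponds to a pre- and postcondition on a function. As an illustration only (the `contract` decorator and the `floor_div` example are ours, not the paper's Dafny specifications), a minimal runtime-checked analogue of such a specification might look like:

```python
import functools

def contract(requires=None, ensures=None):
    """Runtime analogue of Dafny's requires/ensures clauses (illustrative only)."""
    def wrap(f):
        @functools.wraps(f)
        def checked(*args):
            if requires is not None:
                # Dafny "requires": must hold on entry.
                assert requires(*args), "precondition violated"
            result = f(*args)
            if ensures is not None:
                # Dafny "ensures": must hold on exit, may mention the result.
                assert ensures(result, *args), "postcondition violated"
            return result
        return checked
    return wrap

# Hypothetical specification: flooring division by a positive divisor,
# with the postcondition pinning down the remainder's range.
@contract(requires=lambda a, b: b > 0,
          ensures=lambda q, a, b: 0 <= a - b * q < b)
def floor_div(a: int, b: int) -> int:
    return a // b
```

Dafny checks such clauses statically at verification time; the decorator above only checks them dynamically, but it conveys what the clauses assert.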
Having defined the problems, we then asked Copilot to generate solutions in Python. To give
Copilot the best chance of generating a good solution, we gave it a full function signature, including
a descriptive function name, parameter name(s), parameter type(s), and an output type. With
this input, Copilot then generated the function implementation. Since our goal was to test the
capabilities of Copilot, we always chose the first suggested function.
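To make the prompt format concrete, a signature of the kind described might look like the following (a hypothetical illustration in the style of the paper's prompts, together with a plausible completion; it is not Copilot's verbatim output):

```python
from typing import List

# Prompt given to Copilot: descriptive function name, parameter names,
# parameter types, and an output type.
def binary_search(arr: List[int], target: int) -> int:
    """Return an index of target in the sorted list arr, or -1 if absent."""
    # A plausible first-suggestion completion:
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```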
Next, we carefully translated each Copilot implementation from Python to Dafny and added the
specifications that we had created. We took care to preserve the behaviour of the implementation.
In all cases, the implementations contained a loop of some type. Hence, the provided requires
and ensures statements were inadequate to generate a proof that verified the implementation. We
thus attempted to manually derive loop invariants so that Dafny could verify the implementation.
During the verification process, we added extra requires and ensures clauses, if they were needed to
verify the program, along with additional helper functions and methods when necessary.
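To illustrate what deriving a loop invariant involves, consider a simple summation loop (our example, not one of the study's six problems). Dafny can only prove a postcondition about the loop's result if the loop carries an invariant relating the accumulator to the prefix processed so far; rendered in Python with the invariant as a runtime assertion:

```python
from typing import List

def prefix_sum(a: List[int]) -> int:
    """Sum a list; the assertions play the role of Dafny loop invariants."""
    total = 0
    for i in range(len(a)):
        # Loop invariant: before processing a[i], total == sum of a[:i].
        assert total == sum(a[:i])
        total += a[i]
    # Postcondition (Dafny: ensures total == sum(a)) follows from the
    # invariant at loop exit, where i has reached len(a).
    assert total == sum(a)
    return total
```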
4 RESULTS
We have attempted to verify 6 Copilot implementations, one of which intersects the LeetCode
problems in [Nguyen and Nadi 2022]. Table 1 summarizes the algorithms, implementations, and
verification attempts. Incidentally, during testing, consistent with Nguyen and Nadi [2022], we
found that Copilot would periodically include packages or methods that had not been defined; it
happened to not do so for any of the implementations we describe here.
We have archived the current version of the Copilot-generated implementations and our Dafny
code at https://zenodo.org/record/7040924.
Table 1. Algorithm implementations generated by Copilot; Dafny verification attempt statistics.

Name                  Python LOC   Verification Difficulty   Verified   Specification LOC   # Invariants
Binary Search (4.1)   12           Easy                      ✓          5                   3
Two Sums (4.2)        5            Medium                    ✓          6                   7
Largest Sum           10           Medium                    ✓          2                   3
Sort Array            6            Medium                    ✓          4                   11
Prime Factors (4.3)   12           Hard                                 3                   > 23
Max Heapify (4.4)     14           Hard                                 4                   > 4
Figure 1 presents the Copilot prompts that we used to generate implementations.
We were able to create a valid proof in Dafny for 4 of the 6 solutions generated by Copilot; we
next discuss 2 of the valid proofs and our 2 failed attempts.
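For context on the harder cases, trial-division prime factorization (one of the two "Hard" entries in Table 1) commonly takes the following shape in Python. This is our reconstruction of the general pattern, not Copilot's verbatim output; its nested loops, each mutating `n`, suggest why deriving sufficient invariants for such code can be difficult.

```python
from typing import List

def prime_factors(n: int) -> List[int]:
    """Return the prime factorization of n >= 2 by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:   # strip out every copy of the factor d
            factors.append(d)
            n //= d
        d += 1
    if n > 1:               # whatever remains is itself prime
        factors.append(n)
    return factors
```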