Proceedings of the ACM on Programming Languages, Vol. 1, No. HATRA, Article 1. Publication date: December 2022.
Exploring the Verifiability of Code Generated by GitHub
Copilot
DAKOTA WONG, University of Waterloo, Canada
AUSTIN KOTHIG, University of Waterloo, Canada
PATRICK LAM, University of Waterloo, Canada
GitHub’s Copilot generates code quickly. We investigate whether it generates good code. Our approach is to
identify a set of problems, ask Copilot to generate solutions, and attempt to formally verify these solutions
with Dafny. Our formal verification is with respect to hand-crafted specifications. We have carried out this
process on 6 problems and succeeded in formally verifying 4 of the created solutions. We found evidence
which corroborates the current consensus in the literature: Copilot is a powerful tool; however, it should not be “flying the plane” by itself.
CCS Concepts: • Software and its engineering → General programming languages; • Social and professional topics → History of programming languages.
Additional Key Words and Phrases: GitHub Copilot, Program Synthesis, Software Verification.
ACM Reference Format:
Dakota Wong, Austin Kothig, and Patrick Lam. 2022. Exploring the Verifiability of Code Generated by GitHub
Copilot. Proc. ACM Program. Lang. 1, HATRA, Article 1 (December 2022), 10 pages.
1 INTRODUCTION
In aviation, commercial airplanes have two yokes (controls), which allow a primary pilot and a copilot to share responsibilities and workload. Both operators are fully trained, informed, and capable of flying the aircraft; however, only one of them needs to be controlling the aircraft at any given moment. In software development, code is traditionally written by an individual; in team environments it is typically scrutinized and assessed by a number of people, but it still originates with one developer. Like pilots, software developers are ordinarily well trained, informed, and capable of programming; however, inefficient or incorrect code segments can slip through the cracks and into code bases when a developer does not fully understand their tools, the domain, or the required program specifications.
GitHub, with the use of OpenAI’s Codex [Chen et al. 2021], has released an “AI pair programmer”
to assist in software development [GitHub 2022]. This programming partner is aptly named Copilot.
The aim of this tool is to shift the programmer’s time and attention away from boilerplate and
repetitive code patterns, and toward the critical design of the system. Copilot works by providing
implementation suggestions based on natural language descriptions of the required code. Copilot is
currently capable of generating code in eleven programming languages, including C, C++, C#, Go, Java, JavaScript, PHP, Python, Ruby, Scala, and TypeScript.¹
The community is struggling to understand what Copilot can and cannot do, and proofs are one
indicator of where Copilot might stand in terms of generating correct code. We know that Copilot
definitely generates some code, but what can we conclude about the code that it generates? Thus,
our goal is to gain some understanding of the quality of Copilot’s generated code; our approach is
¹ https://docs.github.com/en/copilot/overview-of-github-copilot/about-github-copilot
Authors’ addresses: Dakota Wong, University of Waterloo, Canada, [email protected]; Austin Kothig, University of
Waterloo, Canada, [email protected]; Patrick Lam, University of Waterloo, Canada, [email protected].
2022. 2475-1421/2022/12-ART1 $15.00. https://doi.org/
to attempt to verify selected examples of Copilot-generated code using the software verification
language Dafny [Leino 2010].
2 RELATED WORK
There has been a great deal of scrutiny of Copilot and of the code it produces. Early studies which
aim to empirically evaluate Copilot [Nguyen and Nadi 2022] have found that while this early
version of the model is not perfect, its current results are impressive. In their work, Nguyen and
Nadi tested Copilot against 33 LeetCode² questions (4 from the easy category, 17 from the medium
category, and 12 from the hard category) using four programming languages (Python, JavaScript,
Java, C), ran the generated code against the LeetCode test cases, and examined the percentage of
passing cases. Java performed best, being able to solve all test cases for 19 of the 33 problems. The
authors found that, often, the code that Copilot generated could be simplified, and that sometimes
the generated code would require methods or packages that have not been defined or declared.
Copilot generates code differently from traditional genetic programming paradigms. Sobania et al.
[2022] found that Copilot could generate correct solutions to a variety of problems, with comparable
accuracy to genetic programming. The improvement that Copilot makes over traditional approaches
is that it was trained on a 159 GB dataset of unlabelled data, whereas genetic programming requires
the dataset to be rigorously labeled. Thus, Sobania et al. claimed that the code Copilot generates is
often easier to read and understand compared to that generated by genetic programming.
It is clear from the current research that Copilot is not a magic wand solution to problems in
software engineering. With its flaws, how are developers able to benefit from this tool effectively?
Dakhel et al. [2022] compared Copilot’s proposed solutions against human programmers on a
set of fundamental problems. The authors found that Copilot was sometimes unable to understand certain limitations embedded in a problem description: while it generally grasped the fundamentals of what was being requested, it could produce errors when it missed such a constraint. They conclude
that while the generated code from Copilot is not a perfect solution, it can still be incorporated into a project and modified, speeding up early-stage prototyping.
Continuing on the question of usability of Copilot-generated code by developers, Vaithilingam
et al. [2022] performed a user study. They found that using Copilot did not improve the task
completion time or success rate, but that the developers preferred to start from the Copilot-
generated code rather than a blank slate. The researchers also pointed out that it was challenging
for developers to work with long blocks of generated code (and indeed, some developers failed
to complete on time due to difficulties debugging generated code)—attempting to verify the code
would also be a challenge, though one that might provide more understanding. One difference
between our study and theirs, though, is that we did not encounter generated solutions that were
known incorrect, and their users did.
Any introduction of small errors can potentially compound over time, creating large vulnera-
bilities in a software stack. Pearce et al. [2022] investigated code generated by Copilot for known
security vulnerabilities. The authors prompted Copilot to generate code for 89 cybersecurity scenar-
ios. They found that approximately 40% of generated code snippets were vulnerable to some form of
known attack. Also, Siddiq et al. [2022] found code smells (including security smells) in training sets
used for transformer-based code generation models for Python as well as in suggestions produced
by Copilot. If code generated by Copilot is to be used in production, it needs to be rigorously vetted for exploits (keeping in mind the caveat from Vaithilingam et al. about understanding Copilot’s generated code).
² https://leetcode.com/
3 METHODOLOGY
Our goal is to apply verification techniques to code generated by GitHub Copilot. We came up with
sample problems and tasked Copilot with generating solutions in Python. We then used Dafny to
verify that Copilot’s generated code satisfies the requirements.
More specifically, our methodology is as follows. First, we chose 6 problems to give to Copilot
(five that we chose ourselves and one from Nguyen and Nadi [2022]). We wanted to test whether
Copilot could generate a function which would solve the problems; by “solve”, we mean that the
generated code satisfies the problem’s formal specifications. Thus, we wrote formal specifications
in Dafny for each problem, consisting of requires and ensures statements.
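In Python terms, a requires/ensures pair corresponds to a pre- and postcondition on a function. As an illustration only (the `contract` decorator and the `floor_div` example are ours, not the paper's Dafny specifications), a minimal runtime-checked analogue of such a specification might look like:

```python
import functools

def contract(requires=None, ensures=None):
    """Runtime analogue of Dafny's requires/ensures clauses (illustrative only)."""
    def wrap(f):
        @functools.wraps(f)
        def checked(*args):
            if requires is not None:
                # Dafny "requires": must hold on entry.
                assert requires(*args), "precondition violated"
            result = f(*args)
            if ensures is not None:
                # Dafny "ensures": must hold on exit, may mention the result.
                assert ensures(result, *args), "postcondition violated"
            return result
        return checked
    return wrap

# Hypothetical specification: flooring division by a positive divisor,
# with the postcondition pinning down the remainder's range.
@contract(requires=lambda a, b: b > 0,
          ensures=lambda q, a, b: 0 <= a - b * q < b)
def floor_div(a: int, b: int) -> int:
    return a // b
```

Dafny checks such clauses statically at verification time; the decorator above only checks them dynamically, but it conveys what the clauses assert.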
Having defined the problems, we then asked Copilot to generate solutions in Python. To give
Copilot the best chance of generating a good solution, we gave it a full function signature, including
a descriptive function name, parameter name(s), parameter type(s), and an output type. With
this input, Copilot then generated the function implementation. Since our goal was to test the
capabilities of Copilot, we always chose the first suggested function.
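To make the prompt format concrete, a signature of the kind described might look like the following (a hypothetical illustration in the style of the paper's prompts, together with a plausible completion; it is not Copilot's verbatim output):

```python
from typing import List

# Prompt given to Copilot: descriptive function name, parameter names,
# parameter types, and an output type.
def binary_search(arr: List[int], target: int) -> int:
    """Return an index of target in the sorted list arr, or -1 if absent."""
    # A plausible first-suggestion completion:
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```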
Next, we carefully translated each Copilot implementation from Python to Dafny and added the
specifications that we had created. We took care to preserve the behaviour of the implementation.
In all cases, the implementations contained a loop of some type. Hence, the provided requires
and ensures statements were inadequate to generate a proof that verified the implementation. We
thus attempted to manually derive loop invariants so that Dafny could verify the implementation.
During the verification process, we added extra requires and ensures clauses, if they were needed to
verify the program, along with additional helper functions and methods when necessary.
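To illustrate what deriving a loop invariant involves, consider a simple summation loop (our example, not one of the study's six problems). Dafny can only prove a postcondition about the loop's result if the loop carries an invariant relating the accumulator to the prefix processed so far; rendered in Python with the invariant as a runtime assertion:

```python
from typing import List

def prefix_sum(a: List[int]) -> int:
    """Sum a list; the assertions play the role of Dafny loop invariants."""
    total = 0
    for i in range(len(a)):
        # Loop invariant: before processing a[i], total == sum of a[:i].
        assert total == sum(a[:i])
        total += a[i]
    # Postcondition (Dafny: ensures total == sum(a)) follows from the
    # invariant at loop exit, where i has reached len(a).
    assert total == sum(a)
    return total
```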
4 RESULTS
We have attempted to verify 6 Copilot implementations, one of which intersects the LeetCode
problems in [Nguyen and Nadi 2022]. Table 1 summarizes the algorithms, implementations, and
verification attempts. Incidentally, during testing, consistent with Nguyen and Nadi [2022], we
found that Copilot would periodically include packages or methods that had not been defined; it
happened to not do so for any of the implementations we describe here.
We have archived the current version of the Copilot-generated implementations and our Dafny
code at https://zenodo.org/record/7040924.
Table 1. Algorithm implementations generated by Copilot; Dafny verification attempt statistics.

Name                  Python LOC   Verification Difficulty   Verified   Specification LOC   # Invariants
Binary Search (4.1)   12           Easy                      ✓          5                   3
Two Sums (4.2)        5            Medium                    ✓          6                   7
Largest Sum           10           Medium                    ✓          2                   3
Sort Array            6            Medium                    ✓          4                   11
Prime Factors (4.3)   12           Hard                                 3                   > 23
Max Heapify (4.4)     14           Hard                                 4                   > 4
Figure 1 presents the Copilot prompts that we used to generate implementations.
We were able to create a valid proof in Dafny for 4 of the 6 solutions generated by Copilot; we
next discuss 2 of the valid proofs and our 2 failed attempts.
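For context on the harder cases, trial-division prime factorization (one of the two "Hard" entries in Table 1) commonly takes the following shape in Python. This is our reconstruction of the general pattern, not Copilot's verbatim output; its nested loops, each mutating `n`, suggest why deriving sufficient invariants for such code can be difficult.

```python
from typing import List

def prime_factors(n: int) -> List[int]:
    """Return the prime factorization of n >= 2 by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:   # strip out every copy of the factor d
            factors.append(d)
            n //= d
        d += 1
    if n > 1:               # whatever remains is itself prime
        factors.append(n)
    return factors
```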