JOURNAL OF SOFTWARE: EVOLUTION AND PROCESS
J. Softw. Evol. and Proc. 0000; 00:2–35
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smr
Identifying Unusual Commits on GitHub
Raman Goyal∗¹, Gabriel Ferreira², Christian Kästner² and James Herbsleb²
¹ Indian Institute of Information Technology, Allahabad
² Carnegie Mellon University
SUMMARY
Transparent environments and social-coding platforms such as GitHub help developers to stay abreast of changes
during the development and maintenance phase of a project. In particular, notification feeds can help developers
to learn about relevant changes in other projects. Unfortunately, transparent environments can quickly
overwhelm developers with too many notifications, such that they lose the important ones in a sea of
noise. Complementing existing prioritization and filtering strategies based on binary compatibility and code
ownership, we develop an anomaly-detection mechanism to identify unusual commits in a repository that
stand out with respect to other changes in the same repository or by the same developer. Among other things, we
detect exceptionally large commits, commits at unusual times, and commits touching rarely changed file
types given the characteristics of a particular repository or developer. We automatically flag unusual commits
on GitHub through a browser plugin. In an interactive survey with 173 active GitHub users who rated commits
in a project of their interest, we found that, though our unusual score is only a weak predictor of whether
developers want to be notified about a commit, information about unusual characteristics of a commit changes
how developers regard commits. Our anomaly-detection mechanism is a building block for scaling transparent
environments.
Copyright © 0000 John Wiley & Sons, Ltd.
Received . . .
KEY WORDS:
software ecosystems; notification feeds; information overload; transparent environments;
anomaly detection
Prepared using smrauth.cls [Version: 2012/07/12 v2.10]
1. INTRODUCTION
Collaborative development in open source, software ecosystems, and also industrial software systems
relies increasingly on decentralized decision making [17, 20, 27, 41]. Interdependent components
evolve independently and often with little explicit collaboration. Backward-incompatible changes
that break modularity and produce rippling effects on downstream components are often necessary to
avoid opportunity costs (not fixing mistakes, stifling change in the face of evolving requirements) and
common in practice [10, 14, 19, 25, 29, 35, 38–40, 43, 47]. In addition, components may change to add
new functionality that developers might want to adopt. Identifying relevant changes and reacting to
them if needed can create a significant burden on developers during maintenance [3, 4, 6, 24, 35, 42, 47].
Seeds of a solution can be found in today’s transparent environments or social-coding platforms
such as GitHub, LaunchPad, and Bitbucket. These environments provide mechanisms for notification
and exploration that help developers to stay abreast of activities across collections of projects
without central planning [11, 12]. For example, on GitHub, developers can watch projects and receive
a notification feed of activities in watched projects, such as push events or bug reports. These tools
work well at small scales, but break down for large projects where imprecise and insufficiently rich
notification mechanisms lead to information overload from notification clutter. By inspecting
publicly available events on GitHub, we found that active developers typically receive dozens of
public event notifications a day and a single active project can produce over 100 notifications per
day (and many more when including notifications of indirect dependencies). When we previously
interviewed active GitHub users, many reported drowning in change notifications, for example
stating “I stopped with the email – now I use the GitHub notifications page. And the volume is a
problem” and “I just wander through GitHub activity streams occasionally. [But] it is very much
a crap shoot to actually get useful information from the feed” [3, 11].
∗Correspondence to: Journals Production Department, John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester,
West Sussex, PO19 8SQ, UK.
A key to scaling transparent environments is to identify relevant notifications and route them to
affected or interested developers. There are many possible reasons why a change might be relevant
for a developer, including the following explored in prior work:
• Identify breaking changes: Typically most changes are backward compatible. Notifications
about the rare breaking changes are of particular importance to maintainers of affected
downstream projects. Continuous integration platforms can help to highlight changes that
break the system. In addition, Holmes and Walker designed a system that statically detects
certain incompatible interface changes in Java to filter notifications correspondingly [24].
• Identify critical fixes to vulnerabilities: Vulnerabilities patched in upstream projects are typically
an important reason to update a dependency to a newer version. The service Gemnasium
tracks dependencies among Ruby packages and notifies registered package maintainers if an
upstream dependency has a known vulnerability (CVE). In addition, several simple heuristics
and learning approaches can identify bug-fixing commits [22, 55].
• Identify relevance based on prior activity: In large code bases, developers may be interested
only in notifications about code that relates to their own activities, such as notifications
about changes in code that they have written. Padhye et al. model relevance based on simple
heuristics regarding prior modifications, code ownership, and commit messages to similarly
reduce information overload [42].
In this paper, we explore a different, complementary strategy to identify another class of relevant
notifications:
• Identify unusual changes: We identify changes that are unusual or stand out with respect to
other changes in the repository. For example, commits that are particularly large, changes to
artifacts in a programming language not commonly used in the project or by that developer, or
changes with exceptionally long commit messages might be worth noting. We developed a
programming-language-independent anomaly detection mechanism that identifies outliers with
regard to other changes in the same repository or other changes by the same developer.
We detect outliers using statistical models capturing common characteristics of commits within a
project or by a developer. Based on those models, we provide an anomaly score for each commit. The
anomaly score can be used to prioritize and filter notification feeds, in concert with other detection
approaches, such as detecting breaking changes. In addition, anomaly scores can highlight unusual
commits in the revision history to support exploration and inspection and to point out unusual
characteristics during code reviews to focus the reviewer’s attention. We implemented a prototype
of our anomaly detection mechanism and provide a frontend through a browser plugin that injects
anomaly scores, including an explanation, into the commit history on GitHub pages.
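To make the idea of a per-commit anomaly score concrete, the following is a minimal sketch for a single characteristic (commit size). It is not the statistical model used in this paper: it assumes a simple empirical percentile over past commits, and the names `outlier_score` and `commit_size_score` are hypothetical. It only illustrates how a commit can be compared both against its repository and against its author's own history.

```python
from bisect import bisect_left

def outlier_score(history, value):
    """Fraction of past observations strictly smaller than `value` (0.0 to 1.0).

    Assumption: an empirical percentile stands in for the paper's model.
    """
    if not history:
        return 0.0  # no history, so nothing to compare against
    ordered = sorted(history)
    return bisect_left(ordered, value) / len(ordered)

def commit_size_score(repo_sizes, author_sizes, size):
    """Report the larger of the repository-level and developer-level scores."""
    return max(outlier_score(repo_sizes, size),
               outlier_score(author_sizes, size))

# Hypothetical data: lines changed per past commit.
repo_sizes = [3, 5, 8, 10, 12, 15, 20, 40, 60, 100]
author_sizes = [3, 5, 8, 10]

print(outlier_score(repo_sizes, 50))                     # 0.8
print(commit_size_score(repo_sizes, author_sizes, 50))   # 1.0
```

In this example a 50-line commit is only moderately large for the repository but larger than anything its author has committed before, so the developer-level comparison dominates.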
In an evaluation, we analyze to what degree our model can predict changes developers will identify
as unusual and to what degree we can identify commits about which developers want to be notified.
We designed an online survey in which participants rated commits in a repository of their choice. In
each selected repository, we selected five random commits with different anomaly scores (stratified
sampling) and asked participants whether they judged the commit as unusual and whether they would
want to be notified. We found that our unusual score only weakly reflects our participants’ notion
of unusualness and is also only a weak predictor of whether developers want to be notified (to be
expected as we capture only a subset of characteristics of important commits), but we also found
that information about unusual characteristics of commits is actionable. When provided with
additional information about why a commit is a statistical outlier, participants often revisited their
position and identified commits as relevant for a notification.
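The stratified selection of commits can be sketched as follows. This is an illustration under assumptions not stated in this section: scores are taken to lie in [0, 1], the strata are five equal-width score bands, and `stratified_sample` is a hypothetical helper name.

```python
import random

def stratified_sample(scored_commits, strata=5, seed=None):
    """Pick one random commit from each non-empty score band.

    `scored_commits` is a list of (commit_id, score) pairs with scores
    in [0, 1]; the equal-width bands are an assumption of this sketch.
    """
    rng = random.Random(seed)
    bands = [[] for _ in range(strata)]
    for commit_id, score in scored_commits:
        # A score of exactly 1.0 is placed in the top band.
        index = min(int(score * strata), strata - 1)
        bands[index].append(commit_id)
    return [rng.choice(band) for band in bands if band]

# Hypothetical commits with evenly spread scores.
commits = [(f"c{i}", i / 100) for i in range(100)]
picked = stratified_sample(commits, seed=1)
print(len(picked))  # 5
```

Sampling per band, rather than uniformly over all commits, ensures that rare high-scoring commits are represented even though most commits in a repository score low.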
Overall, we make the following contributions: (1) We design an anomaly model based on commit
characteristics to identify unusual commits in a repository and by a developer. (2) We tailor statistical
learning methods to build such models for Git repositories. (3) We integrate anomaly scores and
explanations into the GitHub web page using an implementation based on a browser plugin. (4) We
design an experimental setup to learn about the importance of unusual commits in a repository of the
participant’s choice. (5) We evaluate our anomaly model with 173 GitHub developers, showing that
despite weak predictive power, information about statistical outliers is actionable.
2. INDICATORS FOR IMPORTANT COMMITS
There are many reasons why a commit might be considered ‘unusual’ or important. In this work,
we refer to commits as unusual if they are statistical outliers according to some criteria, such
as commits that are substantially larger or were committed at an unconventional time of day. We
intentionally used a broad and subjective term to cover a wide range of different outlier characteristics.
Our mechanism is flexible enough to incorporate additional characteristics and select and weigh
characteristics depending on developer preferences.
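One way to select and weigh characteristics according to developer preferences is a weighted mean over per-characteristic scores. The sketch below is an assumption for illustration only, not the combination rule used by our mechanism; `combined_score` and the characteristic names are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted mean of per-characteristic scores.

    Characteristics missing from `weights` get weight 0 and are ignored,
    which is how a developer would deselect them in this sketch.
    """
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        return 0.0  # nothing selected
    weighted = sum(score * weights.get(name, 0.0)
                   for name, score in scores.items())
    return weighted / total

# Hypothetical per-characteristic scores for one commit.
scores = {"size": 1.0, "hour": 0.5, "file_type": 0.0}
# This developer cares about size and time of day, not file types.
weights = {"size": 2.0, "hour": 2.0}

print(combined_score(scores, weights))  # 0.75
```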
As part of a presurvey of our evaluation, which includes demographic questions about the
participants’ experience and knowledge about the selected repository, we asked our participants
(professional and academic GitHub users, see Sec. 4 for details) two open questions:
• Some commits stand out among all commits in a repository. What characteristics make commits
stand out?
• What kind of commits do you usually pay attention to?
With both questions, we elicit commit characteristics that developers use to distinguish important
commits from unimportant ones.
Among our participants, the following indicators for important commits were very commonly
mentioned (by at least 30 developers):
• Commits that introduce new features (often associated with feature requests); for example, one
participant claimed interest in “commits that adds nice features to the project.”
• Commits that signify major development steps, usually related to merging, milestones, and
releases.
• Commits that are large in size (in terms of lines of code or files changed); for example, one
participant wrote that changes stand out if they include “extensive changes, lots of churn.”
• Commits that fix bugs or security issues.
• Commits that change code about which the developers have particular knowledge or that could
affect their current tasks (code ownership, dependencies).