IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 14, NO. 7, JULY 2018 3289
Cross-Project Transfer Representation Learning
for Vulnerable Function Discovery
Guanjun Lin, Jun Zhang, Member, IEEE, Wei Luo, Lei Pan, Member, IEEE,
Yang Xiang, Senior Member, IEEE, Olivier De Vel, and Paul Montague
Abstract—Machine learning is now widely used to detect
security vulnerabilities in the software, even before the soft-
ware is released. But its potential is often severely compro-
mised at the early stage of a software project when we face
a shortage of high-quality training data and have to rely on
overly generic hand-crafted features. This paper addresses
this cold-start problem of machine learning, by learning
rich features that generalize across similar projects. To
reach an optimal balance between feature-richness and
generalizability, we devise a data-driven method including
the following innovative ideas. First, the code semantics are
revealed through serialized abstract syntax trees (ASTs),
with tokens encoded by Continuous Bag-of-Words neural
embeddings. Next, the serialized ASTs are fed to a sequential
deep learning classifier (Bi-LSTM) to obtain a representation
indicative of software vulnerability. Finally, the neural
representation obtained from existing software projects is
transferred to the new project to enable early vulnerability
detection even with a small set of training labels.
To validate this vulnerability detection approach, we manu-
ally labeled 457 vulnerable functions and collected 30 000+
nonvulnerable functions from six open-source projects.
The empirical results confirmed that the trained model is
capable of generating representations that are indicative
of program vulnerability and is adaptable across multi-
ple projects. Compared with the traditional code metrics,
our transfer-learned representations are more effective for
predicting vulnerable functions, both within a project and
across multiple projects.
Index Terms—Abstract syntax tree, cross-project,
representation learning, transfer learning, vulnerability
discovery.
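The first step named in the abstract, turning a function's AST into a token sequence that a sequential model can consume, can be sketched as follows. This is only an illustrative sketch under stated assumptions: Python's built-in `ast` module stands in for whatever parser suits the target language, the depth-first pre-order traversal is an assumed serialization order, and the helper names `serialize_ast`, `build_vocab`, and `encode` are hypothetical. The resulting integer ids would then be mapped to embeddings and fed to a Bi-LSTM, which is not shown here.

```python
import ast
from typing import Dict, Iterable, List


def serialize_ast(source: str) -> List[str]:
    """Flatten the AST of `source` into tokens (assumed depth-first, pre-order)."""
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        # Emit the node type first, then any identifier it carries.
        tokens.append(type(node).__name__)
        if isinstance(node, ast.Name):
            tokens.append(node.id)
        elif isinstance(node, ast.FunctionDef):
            tokens.append(node.name)
        # Recurse into children in field order.
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every token; ids 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in seq][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    src = "def f(a):\n    return a + 1\n"
    toks = serialize_ast(src)
    vocab = build_vocab([toks])
    print(toks)
    print(encode(toks, vocab, max_len=16))
```

Padding to a fixed length is one common way to batch variable-length sequences for a recurrent classifier; the value of `max_len` here is arbitrary.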
Manuscript received March 21, 2018; accepted March 26, 2018. Date
of publication April 2, 2018; date of current version July 2, 2018. Paper
no. TII-18-0714. (Corresponding author: Jun Zhang.)
G. Lin, W. Luo, and L. Pan are with the School of Information Technology,
Deakin University, Geelong, VIC 3216, Australia (e-mail: lingu@deakin.edu.au;
wei.luo@deakin.edu.au; l.pan@deakin.edu.au).
J. Zhang is with the School of Software and Electrical Engineering,
Swinburne University of Technology, Melbourne, VIC 3122, Australia
(e-mail: junzhang@swin.edu.au).
Y. Xiang is with the Digital Research & Innovation Capability Platform,
Swinburne University of Technology, Melbourne, VIC 3122, Australia
(e-mail: yxiang@swin.edu.au).
O. De Vel and P. Montague are with the Defence Science &
Technology Group, Department of Defence, Maribyrnong, VIC 3032,
Australia (e-mail: Olivier.DeVel@dst.defence.gov.au;
Paul.Montague@dst.defence.gov.au).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TII.2018.2821768
I. INTRODUCTION
VULNERABILITIES in software critically undermine the
security of computer systems and threaten the IT infras-
tructure of many government sectors and organizations. For
instance, the recently disclosed “Heartbleed” and “Shellshock”
vulnerabilities, and a vulnerability in the server message block
(SMB) protocol exploited by the WannaCry ransomware have
affected a wide range of systems and millions of users worldwide.
According to [4] and [26], exploitable vulnerabilities in software
are one of the major causes of security incidents and breaches.
Once a vulnerability is exploited by
attackers, companies and organizations may suffer from sig-
nificant financial loss as well as irreparable damage to their
reputation [22].
The early detection of vulnerabilities in applications is vi-
tal for implementing cost-effective attack-mitigation solutions.
From the perspective of code execution, techniques for iden-
tifying vulnerabilities can be categorized into static, dynamic,
and hybrid approaches. Static techniques, such as rule-based
analysis [6], code similarity detection (i.e., code clone detection)
[8], [9], and symbolic execution [2], mainly rely on the analysis
of source code, but often struggle to reveal bugs and vulnerabilities
that occur only at runtime. Dynamic analysis, which includes
fuzz testing [23] and taint analysis [17], focuses on detecting
vulnerabilities manifested during program execution, but in
general has low code coverage. Hybrid approaches, which combine
static and dynamic analysis techniques, aim to overcome the
aforementioned weaknesses. However, all of these approaches
rely on a limited set of known syntactic or behavioral patterns
of vulnerabilities, and this limitation makes it difficult to
detect previously unseen vulnerabilities.
Data-driven vulnerability discovery using machine learn-
ing (ML) provides a new opportunity for intelligent, effec-
tive, and efficient vulnerability detection. The existing ML-
based approaches primarily operate on source code, which
offers better human readability. Researchers have applied
source-code-based features, such as imports (i.e., header files),
function calls [16], software complexity metrics, and code
changes [22], as indicators for identifying potentially vulner-
able files or code fragments. Moreover, features and informa-
tion obtained from version control systems, such as developer
activities [12] and code commits [20], were also adopted for predicting
vulnerabilities. Most recently, two studies, VUDDY [9] and
VulPecker [10], focused on detecting vulnerable
1551-3203 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.