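This paper discusses how to discover vulnerable functions early in a software project, particularly when high-quality training data is scarce and a model must rely on overly generic hand-crafted features. It addresses this cold-start problem of machine learning by learning rich features through cross-project transfer representation learning, so that the features generalize across similar projects. The authors propose a data-driven method built on the following ideas:

1. Code semantics are revealed through serialized abstract syntax trees (ASTs), with tokens encoded by Continuous Bag-of-Words neural embeddings.
2. The serialized ASTs are fed to a sequential deep learning classifier (a bidirectional long short-term memory network, Bi-LSTM) to obtain representations that help identify software vulnerabilities.
3. The neural representations obtained from existing software projects are transferred to a new project, enabling early vulnerability detection even when only a few training labels are available.

To validate the approach, the authors manually labeled 457 vulnerable functions from six open-source projects and collected more than 30,000 non-vulnerable functions. The experimental results confirm that the trained model generates representations that are indicative of program vulnerability and that it can be adapted across multiple projects. Compared with traditional code metrics, the transfer-learned representations are more effective at predicting vulnerable functions, both within a project and across multiple projects.

Key concepts covered in the paper:

- Abstract syntax tree (AST): a tree-shaped data structure that represents the syntactic structure of a program.
- Continuous Bag-of-Words (CBOW) model: a neural word-embedding method, commonly used in natural language processing, that maps words to points in a vector space.
- Bi-LSTM (bidirectional long short-term memory network): a type of recurrent neural network (RNN) that can learn long-term dependencies in sequential data and is well suited to recognizing patterns in sequences.
- Cross-project transfer learning: a machine learning technique that uses knowledge (models, data, and so on) from one project to improve performance on another.
- Vulnerability representation learning: learning the features and patterns of software vulnerabilities with machine learning in order to help discover potential security flaws.
- Security vulnerability detection: the process of identifying and assessing potential security vulnerabilities in software with various methods and tools.

By learning features that are common to different projects, the proposed method overcomes the shortage of data behind the cold-start problem. This improves a model's ability to detect security vulnerabilities at an early stage of development, so that a project can be assessed and hardened early and at lower cost.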

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 14, NO. 7, JULY 2018 3289

Cross-Project Transfer Representation Learning

for Vulnerable Function Discovery

Guanjun Lin, Jun Zhang, Member, IEEE, Wei Luo, Lei Pan, Member, IEEE,

Yang Xiang, Senior Member, IEEE, Olivier De Vel, and Paul Montague

Abstract—Machine learning is now widely used to detect

security vulnerabilities in the software, even before the soft-

ware is released. But its potential is often severely compro-

mised at the early stage of a software project when we face

a shortage of high-quality training data and have to rely on

overly generic hand-crafted features. This paper addresses

this cold-start problem of machine learning, by learning

rich features that generalize across similar projects. To

reach an optimal balance between feature-richness and

generalizability, we devise a data-driven method including

the following innovative ideas. First, the code semantics are

revealed through serialized abstract syntax trees (ASTs),

with tokens encoded by Continuous Bag-of-Words neural

embeddings. Next, the serialized ASTs are fed to a sequen-

tial deep learning classiﬁer (Bi-LSTM) to obtain a represen-

tation indicative of software vulnerability. Finally, the neural

representation obtained from existing software projects is

then transferred to the new project to enable early vulner-

ability detection even with a small set of training labels.

To validate this vulnerability detection approach, we manu-

ally labeled 457 vulnerable functions and collected 30 000+

nonvulnerable functions from six open-source projects.

The empirical results conﬁrmed that the trained model is

capable of generating representations that are indicative

of program vulnerability and is adaptable across multi-

ple projects. Compared with the traditional code metrics,

our transfer-learned representations are more effective for

predicting vulnerable functions, both within a project and

across multiple projects.

Index Terms—Abstract syntax tree, cross-project,

representation learning, transfer learning, vulnerability

discovery.

Manuscript received March 21, 2018; accepted March 26, 2018. Date

of publication April 2, 2018; date of current version July 2, 2018. Paper

no. TII-18-0714. (Corresponding author: Jun Zhang.)

G. Lin, W. Luo, and L. Pan are with the School of Information Technology,

Deakin University, Geelong, VIC 3216, Australia (e-mail: lingu@deakin.edu.au;

wei.luo@deakin.edu.au; l.pan@deakin.edu.au).

J. Zhang is with the School of Software and Electrical Engineering,

Swinburne University of Technology, Melbourne, VIC 3122, Australia

(e-mail: junzhang@swin.edu.au).

Y. Xiang is with the Digital Research & Innovation Capability Platform,

Swinburne University of Technology, Melbourne, VIC 3122, Australia

(e-mail: yxiang@swin.edu.au).

O. De Vel and P. Montague are with the Defence Science &

Technology Group, Department of Defence, Maribyrnong, VIC 3032,

Australia (e-mail: Olivier.DeVel@dst.defence.gov.au;

Paul.Montague@dst.defence.gov.au).

Color versions of one or more of the ﬁgures in this paper are available

online at http://ieeexplore.ieee.org.

Digital Object Identiﬁer 10.1109/TII.2018.2821768

I. INTRODUCTION

VULNERABILITIES in software critically undermine the

security of computer systems and threaten the IT infras-

tructure of many government sectors and organizations. For

instance, the recently disclosed “Heartbleed” and “Shellshock”

vulnerabilities, and a vulnerability in the server message block

(SMB) protocol exploited by the WannaCry ransomware have

affected a wide range of systems and millions of users world-

wide. According to [4] and [26], one of the major causes of se-

curity incidents and breaches can be attributed to the exploitable

vulnerabilities in software. Once a vulnerability is exploited by

attackers, companies and organizations may suffer from sig-

niﬁcant ﬁnancial loss as well as irreparable damage to their

reputation [22].

The early detection of vulnerabilities in applications is vi-

tal for implementing cost-effective attack-mitigation solutions.

From the perspective of code execution, techniques for iden-

tifying vulnerabilities can be categorized into static, dynamic,

and hybrid approaches. Static techniques, such as rule-based

analysis [6], code similarity detection (i.e., code clone detection)

[8], [9], and symbolic execution [2], mainly rely on the analysis

of source code, but often struggle to reveal bugs and

vulnerabilities occurring at runtime. Dynamic analysis includes

fuzzing test [23] and taint analysis [17], and focuses on detect-

ing vulnerabilities manifested during program execution, but in

general, has low code coverage. The hybrid approaches combining

static and dynamic analysis techniques aim to overcome the

aforementioned weaknesses. However, all of these approaches

rely on a limited set of known syntactic or behavioral patterns

of vulnerabilities, and such deﬁciency raises the challenge of

detecting the previously unseen vulnerabilities.

Data-driven vulnerability discovery using machine learn-

ing (ML) provides a new opportunity for intelligent, effec-

tive, and efﬁcient vulnerability detection. The existing ML-

based approaches primarily operate on source code, which

offers better human readability. Researchers have applied

source-code based features, such as imports (i.e., header files),

function calls [16], software complexity metrics, and code

changes [22], as indicators for identifying potentially vulner-

able ﬁles or code fragments. Moreover, features and informa-

tion obtained from version control systems, such as developer

activities [12] and code commits [20], were also adopted for pre-

dicting vulnerabilities. Most recently, two studies: 1) VUDDY

[9]; and 2) VulPecker [10], focused on detecting vulnerable

Fig. 1. Proposed framework for vulnerability discovery. It contains three stages: The ﬁrst stage is to pretrain a bidirectional long short-term memory

(Bi-LSTM) network using source code projects; the second stage is to feed the trained network with the target project to obtain representations as

features; the last stage is to train an ML classiﬁer with the learned features.

functions and code fragments based on code clone/similarity

analysis; nevertheless, both approaches incur a high false-negative

rate.

However, most of the existing ML-based approaches focus

on software component- or ﬁle-level vulnerability detection,

which rely on the manual effort and expertise of the code au-

ditor to inspect the code base to accurately pinpoint the exact

location of the vulnerabilities. Because of the relative scarcity

of vulnerabilities, there is insufﬁcient historical vulnerability

data for training and validating a statistical model, especially

on inactive open-source projects. In this paper, we aim to ex-

plore a ﬁne-grained vulnerability detection approach targeting

multiple software projects. To overcome the challenges, we pro-

pose a framework that solves this problem in three stages (see

Fig. 1). First, we create vulnerability ground truth data at the

function-level. Second, we extract features from the abstract

syntax trees (ASTs) of each function. Speciﬁcally, we use a

parser to obtain ASTs in a serialized form by using depth-

ﬁrst traversal (DFT). Then, we convert the serialized ASTs to

equal-length sequences while preserving the structural and se-

mantic features. To further reﬁne these features, we apply a long

short-term memory (LSTM) [7] recurrent neural network with

Word2vec [13] embeddings for learning a higher level of repre-

sentations. We hypothesize that the algorithm has the capacity of

automatically extracting deep vulnerable programming features

that contain richer information than the shallow features driven

by domain knowledge. We also hypothesize that the learned

low-level representations are transferable and are independent

of software projects using the same programming language.

Last, given a target project with insufﬁcient labeled data, we ap-

ply the same feature-extraction process and feed the data to the

pretrained network for learning a subset of representations. Sub-

sequently, the learned representations are used to train a classi-

ﬁer for vulnerability prediction. The empirical study shows that

the features extracted using our method are signiﬁcantly more

effective than software code metrics (CMs) in detecting vulner-

abilities. Despite the small number of instances labeled in the

projects, our algorithm is capable of effectively utilizing avail-

able data from other projects for pretraining a basic network,

which can then be used for extracting deep AST representa-

tions for the projects with insufﬁcient data. Empirical results

demonstrate the effectiveness of learned representations which

contribute to better detection accuracy than using traditional

CMs.
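The feature-extraction step described above (serialize each function's AST by depth-first traversal, then convert it to an equal-length sequence) can be sketched as follows. This is only an illustration under stated assumptions: the paper targets source code parsed by its own parser, whereas this sketch uses Python's built-in `ast` module as a stand-in, and the helper names `serialize_ast` and `to_fixed_length`, the choice of which node attributes to emit, and the padding value 0 are all hypothetical.

```python
import ast
from typing import Dict, List


def serialize_ast(source: str) -> List[str]:
    """Serialize an AST by depth-first (pre-order) traversal into a token list."""
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        # The node type records structure; names add a little semantics.
        tokens.append(type(node).__name__)
        if isinstance(node, ast.FunctionDef):
            tokens.append(node.name)
        elif isinstance(node, ast.arg):
            tokens.append(node.arg)
        elif isinstance(node, ast.Name):
            tokens.append(node.id)
        for child in ast.iter_child_nodes(node):
            # Skip Load/Store context markers (an assumption made here to
            # keep the sequence focused on structure and identifiers).
            if isinstance(child, ast.expr_context):
                continue
            visit(child)

    visit(ast.parse(source))
    return tokens


def to_fixed_length(tokens: List[str], vocab: Dict[str, int],
                    length: int, pad_id: int = 0) -> List[int]:
    """Map tokens to integer ids, then truncate or right-pad to `length`."""
    # Ids start at 1 so that pad_id (0) never collides with a real token.
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))


# Toy function standing in for one function of a software project.
example = "def copy_item(buf, n):\n    return buf[n]\n"
vocab: Dict[str, int] = {}
tokens = serialize_ast(example)
print(tokens)
print(to_fixed_length(tokens, vocab, length=20))
```

The integer sequences produced this way are the kind of input that an embedding layer followed by a recurrent network would consume; the embedding and Bi-LSTM stages themselves are not shown here.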

In summary, our contributions are three-fold.

1) We propose a framework for function-level vulnerability

discovery, which offers a ﬁne-grained detection capabil-

ity, facilitating a quick location of vulnerabilities.

2) We develop an approach to extract the sequential fea-

tures of ASTs that capture the structural and semantic

information of functions. Such information reﬂects the

vulnerable programming patterns.

3) We construct a Bi-LSTM network for effectively extract-

ing deep AST representations, which supports the trans-

ferability across software projects. The empirical studies

show that the deep AST representations provide the pre-

cise identiﬁcation of vulnerable functions (80% precision

is achieved when retrieving the 10 most probable vulner-

able functions).
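The figure quoted in the third contribution (precision when retrieving the 10 most probable vulnerable functions) is a top-k precision. A minimal sketch of that metric is given below, assuming each function has a predicted vulnerability score and a binary ground-truth label; the function name `precision_at_k` and the toy scores and labels are hypothetical and are not data from the paper.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of truly vulnerable functions among the k highest-scored ones."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    # Rank functions by predicted probability of being vulnerable, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    # If fewer than k functions exist, divide by the number actually retrieved.
    return sum(label for _, label in top) / len(top)


# Toy ranking: 1 marks a vulnerable function, 0 a non-vulnerable one.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
print(precision_at_k(toy_scores, toy_labels, k=3))
```

A top-k metric suits this setting because vulnerable functions are rare, so what matters to a code auditor is how many of the first few flagged functions are worth inspecting.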

The rest of this paper is organized as follows: Section II

presents how features are extracted from ASTs derived from

source code functions. Section III describes how to leverage

LSTM for obtaining the sequential patterns in ASTs for vul-

nerability detection. Then, Section IV evaluates the detection

performance of the proposed approach using two sets of exper-

iments for the evaluation of the effectiveness of our deep AST

representations and transfer-representation learning. Section VI

concludes this paper.

II. FUNCTION LEVEL AST ENGINEERING

We believe that software vulnerabilities are often reﬂected

in the syntactical structure of source code, particularly at the

function-level. To capture such features and code properties, we

follow the early work of Yamaguchi et al. [25]. The authors

assumed that vulnerable programming patterns are associated

with many vulnerabilities, and these patterns can be revealed

by analyzing the program’s ASTs. An AST is a syntactical

structure of source code (for instance, a function), depicting the

relationships among the components of the code in a hierarchical

tree view, and faithfully representing the function-level control

ﬂow [see Fig. 2(a) and (b)]. Compared with the control ﬂow

graphs (CFGs), ASTs provide a natural program representation

at the function level and preserve more information about the source

code, while CFGs usually do not include variable declarations.

Therefore, in this paper, we choose ASTs for extracting the

latent programming patterns. To achieve this, an AST needs

to be serialized for converting to a vector while preserving its
