1545-5971 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2019.2954088, IEEE
Transactions on Dependable and Secure Computing
Software Vulnerability Discovery via Learning
Multi-domain Knowledge Bases
Guanjun Lin, Jun Zhang*, Senior Member, IEEE, Wei Luo, Lei Pan, Member, IEEE,
Olivier De Vel, Paul Montague, and Yang Xiang, Senior Member, IEEE
Abstract—Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery
application driven by off-the-shelf machine learning tools often performs poorly due to the shortage of high-quality training data. The
scarcity of vulnerability data is almost always a problem for any developing software project during its early stages, which is referred
to as the cold-start problem. This paper proposes a framework that utilizes transferable knowledge from pre-existing data sources. In
order to improve the detection performance, multiple vulnerability-relevant data sources were selected to form a broader base for
learning transferable knowledge. The selected vulnerability-relevant data sources are cross-domain, including historical vulnerability
data from different software projects and data from the Software Assurance Reference Dataset (SARD) consisting of synthetic
vulnerability examples and proof-of-concept test cases. To extract the information applicable in vulnerability detection from the
cross-domain data sets, we designed a deep-learning-based framework with Long Short-Term Memory (LSTM) cells. Our framework
combines the heterogeneous data sources to learn unified representations of the patterns of vulnerable source code. Empirical
studies showed that the unified representations generated by the proposed deep learning networks are feasible and effective, and are
transferable for real-world vulnerability detection. Our experiments demonstrated that by leveraging two heterogeneous data sources,
the performance of our vulnerability detection outperformed the static vulnerability discovery tool Flawfinder. The findings of this paper
may stimulate further research in ML-based vulnerability detection using heterogeneous data sources.
Index Terms—Vulnerability detection, representation learning, deep learning.
1 INTRODUCTION
Many cybersecurity incidents and data breaches are caused
by exploitable vulnerabilities in software [21, 41]. Discovering and detecting software vulnerabilities has long been an important research direction. Automated techniques such as rule-based analysis [9, 47], symbolic execution [4], and fuzz testing [42] have been proposed to aid the search for vulnerabilities. However, these techniques are inefficient when applied to a large code base in practice [48]. To improve efficiency, machine learning (ML) techniques have been applied to automate the detection of software vulnerabilities and to accelerate the code inspection process.
ML algorithms are capable of learning latent patterns
indicative of vulnerable/defective code, potentially outper-
forming the rules derived from experience, with a signifi-
cantly improved level of generalization. Nevertheless, the
application of traditional ML techniques to vulnerability
detection still requires human experts to define features,
which largely relies on human experience, level of expertise
and depth of domain knowledge [18]. With deep learning,
code fragments can be used directly for learning, without the need for manual feature extraction, thus relieving experts of the time-consuming and possibly error-prone feature engineering tasks. Recent studies have utilized neural networks for the automated learning of semantic features [45] and high-level representations [18, 20] that could indicate potential vulnerabilities in Java and C/C++ source code.

Jun Zhang is the corresponding author. Guanjun Lin, Jun Zhang, and Yang Xiang are with the School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia (e-mail: {glin, junzhang, yxiang}@swin.edu.au). Wei Luo and Lei Pan are with the School of Information Technology, Deakin University, Geelong, VIC 3216, Australia (e-mail: {wei.luo, l.pan}@deakin.edu.au). Olivier De Vel and Paul Montague are with the Defence Science & Technology Group (DSTG), Department of Defence, Australia (e-mail: {Olivier.DeVel, Paul.Montague}@dst.defence.gov.au).
However, existing ML-based vulnerability/defect detection approaches, such as [18, 45, 53], have been built on the assumption that sufficient labeled training data are available from homogeneous sources. Unfortunately, this assumption is not always valid. There are no known publicly available software vulnerability repositories that provide real-world vulnerability data as code-and-label pairs [18, 20]. The relative scarcity of real-world vulnerability data exacerbates the shortcomings of existing ML-based solutions on real-world software projects, especially for projects with only a few historical instances of detected vulnerabilities. Because manually collecting software vulnerabilities is expensive, the lack of training data makes supervised ML-based vulnerability detection approaches difficult to apply. Hence, in practice, manual effort is still required for the vulnerability discovery task [44].
To enhance the automation of software vulnerability
discovery, we propose a deep learning based framework
with the capability of leveraging multiple heterogeneous
vulnerability-relevant data sources for effectively and auto-
matically learning latent vulnerable programming patterns.
On the one hand, the automated learning of vulnerable
programming patterns can relieve human experts of the
tedious and error-prone feature engineering tasks. On the
other hand, the learned latent representations of the vul-
nerable patterns, learned from the combination of heterogeneous vulnerability-relevant data sources, can serve as features that compensate for the shortage of labeled data. A vulnerability-relevant data source is generally one that is publicly available and that includes quasi-real-world vulnerability samples, such as those in the Software Assurance Reference Dataset (SARD) project
[30]. The vulnerability samples that we used from the SARD
project are mostly synthetic test cases, presented as proof-of-concept vulnerabilities and patches from which a human programmer can learn. We assume that deep learning algorithms
can derive “basic patterns” from the artificially constructed
vulnerabilities. The other vulnerability data source includes
a limited number of historical vulnerability data instances
collected from some popular open-source software projects.
By combining the two cross-domain data sources, we design
the algorithms to collectively extract the useful information
not only from the real-world vulnerability data but also
from the synthetic data sets for improving the vulnerability
detection performance.
To utilize the heterogeneous data sources, the proposed
framework consists of two independent deep learning net-
works. Each network is trained independently using one
of the data sources. Based on the findings of our previous
work [20], a Long Short-Term Memory (LSTM) [13] network
trained on the historical vulnerability data source can be used as a feature extractor, generating features that carry the vulnerability-related information learned from that data source. By training on these generated features, a classifier can maintain its performance despite the lack of labeled data. In this paper, we explore the transfer representation learning capability of a neural network further by learning vulnerable patterns from two vulnerability-
relevant data sources. We hypothesize that the source for
learning latent vulnerable code patterns should not be lim-
ited to the historical vulnerability data source containing
real-world software projects. A vulnerability-related data
source (i.e., the SARD project) containing the artificial vul-
nerability samples should also be used as a vulnerability
knowledge base. By using two independent neural net-
works to learn the vulnerability-relevant knowledge from
two data sources, respectively, we can combine the learned
knowledge to compensate for the shortage of labeled data and
to enhance the vulnerability detection capability.
First, we train two networks using the aforementioned vulnerability-relevant data sources, and use both trained networks as feature extractors: given a project with limited labeled data, we feed the data to each trained network to derive vulnerability knowledge representations as features. Second, we concatenate the representations learned by the two networks into a combined feature vector and train a random forest classifier on it. Finally, the trained classifier can be used for detecting vulnerabilities (see Fig. 2). Even for a given project without any labeled data, we can still use one of the trained networks as the classifier for vulnerability detection (see Fig. 1). To ensure reproducibility, we have released our code and data on GitHub (https://github.com/DanielLin1986/RepresentationsLearningFromMulti_domain). In summary, the contributions of this paper are three-fold:
• We propose a deep learning framework that utilizes heterogeneous vulnerability-relevant data sources, based on two independent deep representation learning networks capable of extracting useful features for vulnerable code detection.
• We validate the design of our framework through experiments and demonstrate that using neural networks as feature extractors, with a separate classifier trained on the extracted features, improves the vulnerability detection performance — a maximum improvement of 60% in precision and 24% in recall was observed.
• Our empirical studies found that the aggregated representations learned by the two independent networks lead to better detection performance than using either single network — a maximum improvement of 5% in precision and 4% in recall was observed. The proposed framework also outperformed Flawfinder [47] and our previous work [20], implying that it can be further extended to cater to multiple data sources.
The rest of this paper is organized as follows: Section
2 provides an overview of the proposed approach and
how the data sources are collected. Section 3 presents the implementation of high-level representation learning from the cross-domain data sources. Our experiments and
resulting evaluations are presented in Section 4, followed by
a discussion of the limitations of our approach in Section 5.
Section 6 lists some related studies, and Section 7 concludes
this paper.
2 APPROACH OVERVIEW AND DATA COLLECTION
This section provides an overview of our proposed frame-
work for software vulnerability detection by describing
a workflow of how the code feature representations are
learned for vulnerability detection. Following this, we intro-
duce the code data sources and the data collection process.
2.1 Problem Formulation
The proposed method takes a list of functions from a program as input and outputs a function ranking list based on the likelihood of the input functions being vulnerable. Let F = {a_1, a_2, a_3, ..., a_n} be all C source code functions (both existing and to-be-developed) in the given software project. We aim to find a function-level vulnerability detector D : F → [0, 1], where "1" stands for definitely vulnerable and "0" stands for definitely non-vulnerable, such that D(a_i) measures the probability of function a_i containing vulnerable code. Often it suffices to treat D(a_i) as a vulnerability score, so that we can investigate a small number of the top-risk functions.
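For illustration, the ranking use of D can be sketched in a few lines of Python; `score` here stands in for any trained detector and is a hypothetical name, not part of the paper's implementation:

```python
# A minimal sketch of using D(a_i) as a vulnerability score to rank
# functions; `score` stands in for any trained detector (hypothetical name).
def rank_functions(functions, score):
    """Return the functions sorted by score(f), most suspicious first."""
    return sorted(functions, key=score, reverse=True)

# Usage: audit only the k functions with the highest risk.
# top_k = rank_functions(all_functions, detector_score)[:20]
```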
Fig. 1: Scenario 1: In this scenario, a target software project has no labeled data. We train a Bi-LSTM network using the real-world historical vulnerability data from other software projects (the source projects) and feed the target project's code directly to the trained network for classification.
2.2 Workflow
The proposed framework handles different scenarios of the
vulnerability detection process. In the first scenario (Sce-
nario 1), we hypothesize that a target software project has
no labeled vulnerability data (see Fig. 1). Through transfer learning, the relevant knowledge our network learns on one task can be applied to a different but related task. In this scenario, we train a neural network on the historical vulnerability data, which consist of real-world vulnerabilities. We hypothesize that the vulnerable functions of the source projects contain project-independent vulnerable patterns shared across different software projects, and that these patterns allow the trained network to be used directly for vulnerability detection on a target project.
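As a concrete illustration of Scenario 1, the following Python/Keras sketch trains a Bi-LSTM on tokenized functions from the source projects and applies it directly to the target project. The layer sizes, vocabulary limit, and data names are assumptions for illustration, not the paper's exact configuration:

```python
# A hedged sketch of Scenario 1 (assumed sizes and hyperparameters).
from tensorflow.keras import layers, models

def build_bilstm(vocab_size=10000, embed_dim=128, lstm_units=64):
    """Bi-LSTM classifier over token sequences, scoring in [0, 1]."""
    return models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.LSTM(lstm_units)),
        layers.Dense(1, activation="sigmoid"),  # vulnerability score
    ])

model = build_bilstm()
model.compile(optimizer="adam", loss="binary_crossentropy")
# x_src, y_src: tokenized functions and labels from the source projects.
# model.fit(x_src, y_src, epochs=10)
# scores = model.predict(x_tgt)  # direct scoring of the target project
```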
In the second scenario (Scenario 2), we hypothesize that
the target software project has some labeled data available,
but the amount of labeled data is insufficient to train a statis-
tically robust classifier. Hence, we exploit the representation
learning capability of deep learning algorithms to learn from
other vulnerability-relevant data sources, which remedies
the shortage of the labeled vulnerability data of the target
project.
We divide Scenario 2 into three stages, as depicted in
Fig. 2. In the first stage, we train two independent deep
learning networks, one for each data source. For the de-
tails of the data sources, please refer to Section 2.3. We
hypothesize that the trained networks, through their learned parameters or weights, capture broader vulnerability "knowledge" from both vulnerability-relevant data sources. That is, the trained deep networks have learned the hidden patterns in their respective data sources, and the patterns learned by these networks should contain more generic vulnerability-relevant information than the patterns obtained from an isolated network trained with a single data
source. In addition, we found that an alternative solution
which uses one single network to learn from both the
vulnerability-relevant data sources resulted in suboptimal
performance.
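The first stage can be sketched as follows, reusing the `build_bilstm` constructor from the Scenario 1 sketch above; the data-set names are hypothetical:

```python
# Stage 1 sketch: two independent networks, one per data source.
net_sard = build_bilstm()  # to be trained on the SARD synthetic samples
net_real = build_bilstm()  # to be trained on real-world vulnerability data
for net in (net_sard, net_real):
    net.compile(optimizer="adam", loss="binary_crossentropy")
# net_sard.fit(x_sard, y_sard, epochs=10)
# net_real.fit(x_real, y_real, epochs=10)
```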
The second stage of the scenario obtains the learned
patterns or representations from the trained networks. This
stage, namely the feature representation learning stage,
uses the available labeled data from the target software
project and feeds them to the two trained deep networks
to obtain two groups of representations, respectively. Then,
we combine the representations by concatenating them to
form an aggregated feature set. For example, we feed a
sample to a network trained by one of the vulnerability-
relevant data sources. The generated representation by the
network is a vector v_1 = [r_1, r_2, r_3]. We then feed the same sample to the other network, trained on the other vulnerability-relevant data source, and obtain the representation denoted as the vector v_2 = [r_4, r_5, r_6]. The combined feature vector is derived by concatenating v_1 and v_2, that is, v_concat = [r_1, r_2, r_3, r_4, r_5, r_6].
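This aggregation step amounts to a simple concatenation, as in the sketch below; `extract_sard` and `extract_real` stand for the two trained networks used as feature extractors (hypothetical names):

```python
# Stage 2 sketch: concatenate the two networks' representations.
import numpy as np

def combined_representation(sample, extract_sard, extract_real):
    v1 = extract_sard(sample)        # e.g. [r1, r2, r3]
    v2 = extract_real(sample)        # e.g. [r4, r5, r6]
    return np.concatenate([v1, v2])  # [r1, r2, r3, r4, r5, r6]
```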
In the final stage, we use the remaining data from the target project as the test set and feed them to each trained network. The representations obtained from the labeled data are used to train a random forest classifier, and the representations of the test set are fed to the trained classifier to obtain the performance results.
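A minimal scikit-learn sketch of this final stage follows; the hyperparameters and variable names are assumptions, and the cost-sensitive weighting described next is omitted here for brevity:

```python
# Stage 3 sketch: random forest over the aggregated representations.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
# clf.fit(train_repr, train_labels)            # labeled target-project data
# scores = clf.predict_proba(test_repr)[:, 1]  # test-set vulnerability scores
```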
To address the data imbalance between vulnerable and
non-vulnerable code samples in the real-world projects,
we apply cost-sensitive learning by incorporating differ-
ent weights of vulnerable and non-vulnerable classes into
the objective functions (a.k.a. loss functions) of classi-
fiers used in our experiments. In this paper, we calculate the weight of each class as class_weight = total_samples / (n_classes × one_class_samples), where n_classes is the number of classes and one_class_samples is the number of samples in the class.
This equation is based on a heuristic proposed by King and
Zeng [17]. That is, the vulnerable class will have a larger
weight to penalize the misclassification cost of the vulnera-
ble class more than that of the non-vulnerable class. This
setup enables classifiers to overcome the data imbalance
challenge during the training phase.
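The weight heuristic can be computed directly from the label counts, as in this sketch:

```python
# Class-weight heuristic: total_samples / (n_classes * one_class_samples).
from collections import Counter

def class_weights(labels):
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# e.g. 10 vulnerable vs. 90 non-vulnerable samples:
# class_weights([1]*10 + [0]*90) -> {1: 5.0, 0: 0.556}
# so misclassifying the minority (vulnerable) class is penalized more heavily.
```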
2.3 Data Collection
To overcome the shortage of labeled real-world vulnerability
data, we introduce the synthetic vulnerability samples from
the SARD project together with real-world vulnerability
data sets in our experiments.
2.3.1 The synthetic vulnerability sources from the SARD project
The synthetic vulnerability samples collected from the
SARD project were mainly artificially constructed test cases, built either to simulate known vulnerable source code settings or to provide proof-of-concept code demonstrations.
In this paper, we only used the C/C++ test cases for our
experiments. We developed a crawler to download all of
the relevant files. Each downloaded sample is a source
code file containing at least one function. According to the
SARD naming convention, the vulnerable functions are named with phrases such as “bad” or “badSink”, while the non-vulnerable ones have names containing words like “good” or “goodSink”. Therefore, we
extracted the functions from the source code files and la-
beled them as either vulnerable or non-vulnerable according
to the SARD naming convention. We hypothesize that the
synthetic vulnerable samples contain the proof-of-concept
code describing the “basic vulnerability patterns”, and that
Fig. 2: Scenario 2: In this scenario there are some labeled data available for the target software project. This scenario
consists of three stages: the first stage trains two deep learning networks using two data sources; in the second stage, we
feed each trained network with the labeled data to obtain two groups of feature representations, and then combine both
groups of feature representations to train a random forest classifier; in the third stage, we obtain feature representations by
feeding each trained network with the test set (i.e., the unlabeled data) from the target project, and finally use the resulting
representations as inputs to the trained random forest classifier to obtain the classification result.
these patterns are discoverable by deep learning algorithms,
specifically by the bidirectional LSTM.
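The labeling step implied by the naming convention can be sketched as below; the keyword checks are assumptions based on the examples given above, not SARD's full convention:

```python
# Label SARD functions by the naming convention described above.
def label_from_name(func_name):
    name = func_name.lower()
    if "bad" in name:    # e.g. "bad", "badSink" -> vulnerable
        return 1
    if "good" in name:   # e.g. "good", "goodSink" -> non-vulnerable
        return 0
    return None          # name gives no label; skip this function

assert label_from_name("CWE121_badSink") == 1
assert label_from_name("goodG2B") == 0
```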
2.3.2 The Real-world vulnerability data
We chose to use the real-world vulnerability data source
collected by Lin et al. [20] because the granularity of this
data source is set at the function level. In this paper, we
augmented the data source by adding the vulnerabilities
disclosed until April 1, 2018. The data source contains
the vulnerable and non-vulnerable functions from six
open-source projects, including FFmpeg, LibTIFF, LibPNG,
Pidgin, VLC media player, and Asterisk. The vulnerabil-
ity labels were obtained from the National Vulnerability
Database (NVD) [29] and from the Common Vulnerability
and Exposures (CVE) [26] websites. These function-level vulnerability data allow us to build classifiers for function-level vulnerability detection, thus providing a finer-grained detection capability than can be achieved at the file or component level. Before matching the labels with the
source code, we downloaded the corresponding versions of
each project’s source code from GitHub. Subsequently, each
vulnerable function in the software project was manually
located and labeled according to the information provided
by NVD and CVE websites. Lin et al. [20] discarded the
vulnerabilities that spanned across multiple functions or
multiple files (e.g., inter-procedural vulnerabilities). Exclud-
ing the identified vulnerable functions and the discarded
vulnerabilities, they treated the remaining functions as the
non-vulnerable ones (see Table 1). Using a function extraction tool, they were able to extract approximately 90%
of non-vulnerable functions. We hypothesize that the vul-
nerable functions in real-world software projects written in
the same programming language share generic patterns of
the vulnerabilities that are project-agnostic and discoverable
by a Bi-LSTM network.
3 UNIFIED REPRESENTATION LEARNING
Although both data sources contain source code functions,
the samples from the two sources vary in types and com-
plexity. This section describes how deep learning networks
are applied to handle the data sources of different types
(e.g., synthetic and real-world vulnerability data), and different processing methods (e.g., ASTs and source code), for learning unified high-level representations with respect to the vulnerabilities of interest. To process the heterogeneous data sources, we feed them to different networks and use one of the hidden layers' output as the learned high-level representations.

TABLE 1: The data sources used in the experiments (# of functions used/collected).

Data source                     | Dataset/collection    | Vulnerable | Non-vulnerable
--------------------------------|-----------------------|------------|---------------
Synthetic samples/test cases    | C source code samples |     83,710 |         52,290
from the SARD project           |                       |            |
Real-world open-source projects | FFmpeg                |        213 |          5,701
                                | LibTIFF               |         96 |            731
                                | LibPNG                |         43 |            577
                                | Pidgin                |         29 |          8,050
                                | VLC media player      |         42 |          3,636
                                | Asterisk              |         56 |         14,648
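Concretely, a trained network can be cut at a hidden layer to serve as a feature extractor, as in this Keras sketch; the layer sizes and the layer name are assumptions:

```python
# Sketch: reuse a hidden layer's output as the learned representation.
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1000,), dtype="int32")  # assumed sequence length
x = layers.Embedding(10000, 128)(inputs)
x = layers.Bidirectional(layers.LSTM(64))(x)
hidden = layers.Dense(32, activation="relu", name="repr")(x)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
classifier = models.Model(inputs, outputs)

# After training `classifier`, a representation network shares its weights:
extractor = models.Model(inputs, classifier.get_layer("repr").output)
# features = extractor.predict(x_target)  # high-level representations
```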
3.1 Raw Representations
Before feeding the data sources to the respective networks,
the data need to be in a format compatible with deep
neural networks. At this stage, we refer to the data as “raw representations”. The data samples from the SARD project and the real-world vulnerability data sets are source code
samples written in the C/C++ languages. Many previous
studies, such as [49] and [36], have applied text mining and
natural language processing (NLP) techniques for source
code-level defect and vulnerability detection. The under-
pinning assumption is that source code is logical, structural
and semantically meaningful. The code can be treated as
a “special” language understood by machines (compilers)
and communicated by developers, resembling a natural lan-
guage. This assumption has been formalized by Allamanis
et al. [2] as the “naturalness hypothesis”. In this paper, we
agree that there is a strong resemblance between source code files/functions and the “paragraphs”/“sentences” found in natural language texts.
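Under this hypothesis, a function body can be turned into the kind of token sequence a text-style model consumes; the regex tokenizer below is an illustrative assumption, not the paper's actual lexer:

```python
# A minimal C-token splitter: identifiers, numbers, and punctuation.
import re

def tokenize(source):
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", source)

print(tokenize("if (len > MAX) strcpy(buf, src);"))
# ['if', '(', 'len', '>', 'MAX', ')', 'strcpy', '(', 'buf', ',', 'src', ')', ';']
```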