372 XUELI HU, SHUAI YI, ZHENXING WANG, LIANCHENG ZHANG
form its tasks, it must request the relevant services after being executed. If system
calls are monitored and the system call traces of the entire lifecycle are acquired,
then the system calls can be analyzed and the behaviors examined, and it can thus be
determined whether the code is malicious. References [1–5] selected system calls as
features for malware detection, and other related work extracts different features
for malware detection [6–8]. However, [3] considers only the frequency of each single
system call, whereas the dependencies between system calls and their order are not
considered. Reference [4] considered only part of the system call events during
feature selection, so the detection is not comprehensive.
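The limitation of frequency-only features can be illustrated with a minimal sketch.
The two traces and the helper names below are hypothetical and are not taken from
the cited works; they only show that two traces with identical per-call counts can
still differ in the order of their calls.

```python
from collections import Counter

def call_frequencies(trace):
    """Count how often each individual system call occurs (order is discarded)."""
    return Counter(trace)

def call_bigrams(trace):
    """Count adjacent pairs of system calls, which preserves local ordering."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical traces: same calls, different order.
trace_a = ["open", "read", "connect", "sendto", "close"]
trace_b = ["connect", "sendto", "open", "read", "close"]

same_frequencies = call_frequencies(trace_a) == call_frequencies(trace_b)
same_bigrams = call_bigrams(trace_a) == call_bigrams(trace_b)
print(same_frequencies, same_bigrams)
```

A detector that sees only the frequency vectors cannot tell these two traces apart,
whereas the pair counts separate them.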
Because malware develops rapidly, malware sample sets are not updated in a timely
manner and their samples become obsolete, which easily leads to unbalanced sample
sets in the training phase. Malware detection is a typical unbalanced sample set
classification problem. Unbalanced data means that some classes have very few
instances in the dataset (i.e., minority classes), while other classes have many
instances (i.e., majority classes). According to Gary M. Weiss [9], unbalanced data
sets can cause a series of problems, such as data sparseness, data fragmentation, and
inductive bias. These problems may reduce the performance of traditional
classification methods. To mitigate the effect of unbalanced data sets and improve
system efficiency, [10] used sparse matrices extracted by local singular value
decomposition to reduce the system load. However, singular value decomposition cannot
reduce the impact of unbalanced feature selection on the detection results.
To address the above problems, this paper proposes a detection method. The
contributions of this paper are listed below:
1. An approach based on two anti-debugging techniques, the App self-attachment
and the dynamic process additional state, is proposed. The method tracks all the
system call sequences of the App process and obtains the system call traces of the
entire lifecycle.
2. Most existing Android malware detection techniques are based on the occurrence
frequency of each individual system call, whereas the dependencies between multiple
system calls are neglected. The proposed approach compensates for this shortcoming
by extracting implicit features from the N-Gram data sets. Experimental results
demonstrate that these features are highly effective in malware detection.
3. The TF-RF category relevance feature weighting method is adopted. Through
feature weighting and feature selection based on inter-class correlation, the
approach can select as many beneficial minority class features as possible while
retaining the majority class features.
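The N-Gram features and TF-RF weighting named in contributions 2 and 3 can be
sketched as follows. This is an illustration under stated assumptions, not the
implementation used in this paper: it assumes the common formulation of relevance
frequency from text categorization, rf = log2(2 + a / max(1, c)), where a and c are
the numbers of malicious and benign traces containing the N-Gram; the function names
and the sample traces are hypothetical.

```python
import math
from collections import Counter

def ngrams(trace, n):
    """Return all contiguous n-grams of a system call trace as tuples."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def rf_weights(malicious_traces, benign_traces, n=2):
    """Relevance frequency per n-gram (assumed form): log2(2 + a / max(1, c))."""
    a = Counter(g for t in malicious_traces for g in set(ngrams(t, n)))
    c = Counter(g for t in benign_traces for g in set(ngrams(t, n)))
    return {g: math.log2(2 + a[g] / max(1, c[g])) for g in set(a) | set(c)}

def tf_rf_vector(trace, rf, n=2):
    """Weight each n-gram of one trace by term frequency times relevance frequency."""
    tf = Counter(ngrams(trace, n))
    # An n-gram unseen in training has a = c = 0, i.e. rf = log2(2) = 1.
    return {g: count * rf.get(g, 1.0) for g, count in tf.items()}

# Hypothetical traces for illustration only.
malicious = [["open", "read", "sendto"], ["open", "read", "sendto", "close"]]
benign = [["open", "read", "close"], ["open", "read", "write", "close"]]

rf = rf_weights(malicious, benign, n=2)
vector = tf_rf_vector(["open", "read", "sendto"], rf, n=2)
```

In this sketch the pair ("read", "sendto"), which occurs only in the malicious
traces, receives a higher weight than ("open", "read"), which is equally common in
both classes, so sequences that characterize the malicious class are emphasized
without discarding those shared with the benign class.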
2. System architecture
Because malware uses certain system call sequences to execute its malicious
behavior, and these sequences rarely appear in normal code, they can be used to
extract malware features. To achieve malware detection, the TF-RF algorithm was
used for feature extraction