没有合适的资源?快使用搜索试试~ 我知道了~
2021 - A Survey of Binary Code Similarity.pdf
需积分: 0 6 下载量 179 浏览量
2022-01-02
22:03:06
上传
评论 1
收藏 708KB PDF 举报
温馨提示
试读
38页
2021 - A Survey of Binary Code Similarity.pdf
资源推荐
资源详情
资源评论
51
A Survey of Binary Code Similarity
IRFAN UL HAQ, National University of Computer and Emerging Sciences, Pakistan
JUAN CABALLERO, IMDEA Software Institute, Spain
Binary code similarityapproaches compare two or more pieces of binary code to identify their similarities
and dierences. The ability to compare binary code enables many real-world applications on scenarios where
source code may not be available such as patch analysis, bug search, and malware detection and analysis. Over
the past 22 years numerous binary code similarity approaches have been proposed, but the research area has
not yet been systematically analyzed. This article presents the rst survey of binary code similarity. It analyzes
70 binary code similarity approaches, which are systematized on four aspects: (1) the applications they enable,
(2) their approach characteristics, (3) how the approaches are implemented, and (4) the benchmarks and
methodologies used to evaluate them. In addition, the survey discusses the scope and origins of the area, its
evolution over the past two decades, and the challenges that lie ahead.
CCS Concepts: • Security and privacy → Software reverse engineering;
Additional Key Words and Phrases: Binary code similarity, code ding, code search, cross-architecture,
executable
ACM Reference format:
Irfan Ul Haq and Juan Caballero. 2021. A Survey of Binary Code Similarity. ACM Comput. Surv. 54, 3, Article
51 (April 2021), 38 pages.
https://doi.org/10.1145/3446371
1 INTRODUCTION
Binary code similarity approaches compare two or more pieces of binary code, e.g., basic blocks,
functions, or whole programs, to identify their similarities and dierences. Comparing binary code
is fundamental in scenarios where the program source code is not available, which happens with
commercial-of-the-shelf (COTS) programs, legacy programs, and malware. Binary code similarity
has a wide list of real-world applications such as bug search [17, 25–28, 39, 42, 43, 48, 56, 77, 92,
93, 109, 125], malware clustering [52, 53, 66], malware detection [12, 15, 70], malware lineage [58,
75, 84], patch generation [5], patch analysis [36, 44, 47, 54, 56, 64, 127], porting information across
program versions [36, 44, 118], and software theft detection [79].
This research was partially supported by the Regional Government of Madrid through the BLOQUES-CM (S2018/TCS-4339)
grant, and by the Spanish Government through the SCUM (RTI2018-102043-B-I00) grant. This research is also supported
by the EIE Funds of the European Union. All opinions, ndings and conclusions, or recommendations expressed herein are
those of the authors and do not necessarily reect the views of the sponsors.
Authors’ addresses: I. Ul Haq, National University of Computer and Emerging Sciences, Chiniot-Faisalabad Cam-
pus, Pakistan; email: irfanul.haq@nu.edu.pk; J. Caballero, IMDEA Software Institute, Madrid, Spain; email: juan.
caballero@imdea.org.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and
the full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be
honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specic permission and/or a fee. Request permissions from permissions@acm.org.
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
0360-0300/2021/04-ART51 $15.00
https://doi.org/10.1145/3446371
ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.
51:2 I. Ul Haq and J. Caballero
Identifying binary code similarity is challenging because much program semantics are lost due
to the compilation process including function names, variable names, source comments, and data
structure denitions. Additionally, even when the program source code does not change, the bi-
nary code may change if the source is recompiled, due to secondary changes introduced by the
compilation process. For example, the resulting binary code can signicantly change when using
dierent compilers, changing compiler optimizations, and selecting dierent target operating sys-
tems and CPU architectures. Furthermore, obfuscation transformations can be applied on both the
source code and the generated binary code, hiding the original code.
Given its applications and challenges, over the past 22 years numerous binary code similarity
approaches have been proposed. However, as far as we know, there does not exist a systematic
survey of this research area. Previous surveys deal with binary code obfuscation techniques in
packer tools [101], binary code type inference [14], and dynamic malware analysis techniques [37].
Those topics are related because binary code similarity may need to tackle obfuscation, binary code
type inference may leverage similar binary analysis platforms, and malware is often a target of
binary code similarity approaches. But binary code similarity is well separated from those topics,
as shown by previous surveys having no overlap with this article on the set of papers analyzed.
Other surveys have explored similarity detection on any binary data, i.e., not specic to code,
such as hashing for similarity search [115] and similarity metrics on numerical and binary feature
vectors [16, 107]. In contrast, this survey focuses on approaches that compare binary code, i.e.,
that disassemble the executable byte stream.
This article presents the rst survey of binary code similarity. It rst identies 70 binary code
similarity approaches through a systematic selection process that examines over 100 papers pub-
lished in research venues from dierent computer science areas such as computer security, soft-
ware engineering, programming languages, and machine learning. Then, it systematizes four as-
pects of those 70 approaches: (1) the applications they enable, (2) their approach characteristics,
(3) how the approaches have been implemented, and (4) the benchmarks and methodologies used
to evaluate them. In addition, it discusses the scope and origin of binary code similarity, its evolu-
tion over the past two decades, and the challenges that lie ahead.
Binary code similarity approaches widely vary in their approach, implementation, and evalua-
tion. This survey systematizes each of those aspects and summarizes the results in easy-to-access
tables that compare the 70 approaches across multiple dimensions, allowing beginners and experts
to quickly understand their similarities and dierences. For example, the approach systematization
includes, among others, the number of input pieces of binary code being compared (e.g., one-to-
one, one-to-many, many-to-many); the granularity of the pieces of binary code analyzed (e.g.,
basic blocks, functions, programs); whether the comparison happens at the syntactical represen-
tation, the graph structure, or the code semantics; the type of analysis used (e.g., static, dynamic,
symbolic); and the techniques used for scalability (e.g., hashing, embedding, indexing). The imple-
mentation systematization includes the binary analysis platforms used to build the approach, the
programming languages used to code it, the supported architectures for the input pieces of binary
code being compared, and whether the approach is publicly released. The evaluation systematiza-
tion covers the datasets on which the approaches are evaluated and the evaluation methodology
including the type of evaluation (e.g., accuracy, comparison with prior works, performance) and
how the robustness of the approach is evaluated in the face of common code transformations such
as changes in compiler and compilation options, dierent architectures, and obfuscation.
Beyond the systematization, this survey also discusses how binary code similarity has evolved
from binary code ding to binary code search and how the focus has moved from a single ar-
chitecture to cross-architecture approaches. It shows that it is a vibrant eld of research as many
new approaches are still being proposed. It discusses technical challenges that remain open but
ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.
A Survey of Binary Code Similarity 51:3
Fig. 1. The extended compilation process. Doed boxes are optional code transformations typically used
for obfuscation. For a given source code, changing any of the gray boxes may produce a dierent binary
program.
concludes that the future of the area is bright with important application scenarios under way
such as those related to binary code search engines and the Internet-of-Things.
Article structure. This article is organized as follows. Section 2 provides an overview of binary
code similarity. Section 3 details the scope of the survey and the paper selection process. Section 4
summarizes applications of binary code similarity and Section 5 the evolution of the eld over
the last two decades. Section 6 systematizes the characteristics of the 70 binary code similarity
approaches, Section 7 their implementation, and Section 8 their evaluation. Finally, we discuss
future research directions in Section 9 and conclude in Section 10.
2OVERVIEW
In this section, we rst provide background on the compilation process (Section 2.1). Then, we
present an overview of the binary code similarity problem (Section 2.2).
2.1 Compilation Process
Binary code refers to the machine code that is produced by the compilation process and that can
be run directly by a CPU. The standard compilation process takes as input the source code les
of a program. It compiles them with a chosen compiler and optimization level and for a specic
platform (dened by the architecture, word size, and OS) producing object les. Those object les
are then linked into a binary program, either a stand-alone executable or a library.
Binary code similarity approaches typically deal with an extended compilation process, illustrated
in Figure 1, which adds two optional steps to the standard compilation process: source code and
binary code transformations. Both types of transformations are typically semantics preserving
(i.e., do not change the program functionality) and are most commonly used for obfuscation (i.e.,
to hamper reverse-engineering of the distributed binary programs). Source code transformations
happen before compilation. Thus, their input and output are both source code. They can be applied
regardless of the target platform but may be specic to the programming language used to write
the program. On the other hand, binary code transformations happen after compilation. Thus,
their input and output are binary code. They are independent of the programming language used
but may be specic to the target platform.
Obfuscation is a fundamental step in malware but can also be applied to benign programs, e.g., to
protect their intellectual property. There exist o-the-shelf obfuscation tools that use source code
transformations (e.g., Tigress [21]), as well as binary code transformations (e.g., packers [112]).
Packing is a binary code transformation widely used by malware. Once a new version of a malware
family is ready, the malware authors pack the resulting executable to hide its functionality and
thus bypass detection by commercial malware detectors. The packing process takes as input an
executable and produces another executable with the same functionality, but with the original code
hidden (e.g., encrypted as data and unpacked at runtime). The packing process is typically applied
ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.
51:4 I. Ul Haq and J. Caballero
many times to the same input executable, creating polymorphic variants of exactly the same source
code, which look dierent to malware detectors. Nowadays, the majority of malware is packed and
malware often uses custom packers for which o-the-shelf unpackers are not available [112].
A main challenge in binary code similarity is that the compilation process can produce dierent
binary code representations for the same source code. An author can modify any of the gray
boxes in Figure 1 to produce a dierent, but semantically equivalent, binary program from the
same source code. Some of these modications may be due to the standard compilation process.
For example, to improve program eciency, an author may vary the compiler’s optimization
level or change the compiler altogether. Both changes will transform the produced binary code,
despite the source code remaining unchanged. An author may also change the target platform
to obtain a version of the program suitable for a dierent architecture. In this case, the produced
binary code may radically dier if the new target architecture uses a dierent instruction set. An
author may also deliberately apply obfuscation transformations to produce polymorphic variants
of the same source code. The produced variants will typically have the same functionality dened
by the original source code. A desirable goal for binary code similarity approaches is that they
are able to identify the similarity of binary code that corresponds to the same source code having
undergone dierent transformations. The robustness of a binary code similarity approach captures
the compilation and obfuscation transformations that it can handle, i.e., the transformations
despite which it can still detect similarity.
2.2 Binary Code Similarity Overview
Binary code similarity approaches compare pieces of binary code. The three main characteristics of
binary code similarity approaches are (1) the type of the comparison (identical, similar, equivalent),
(2) the granularity of the pieces of binary code being compared (e.g., instructions, basic blocks,
functions), and (3) the number of input pieces being compared (one-to-one, one-to-many, many-
to-many). We detail these three characteristics next. For simplicity, we describe the comparison
type and comparison granularity for two inputs and then generalize to multiple inputs.
Comparison type. Two (or more) pieces of binary code are identical if they have the same syn-
tax, i.e., the same representation. The binary code can be represented in dierent ways such as a
hexadecimal string of raw bytes, a sequence of disassembled instructions, or a control-ow graph.
Determining if several pieces of binary code are identical is a Boolean decision (either they are
identical or not) that is easy to check: simply apply a cryptographic hash (e.g., SHA256) to the
contents of each piece. If the hash is the same, the pieces are identical. However, such a straight-
forward approach fails to detect similarity in many cases. For example, compiling the same pro-
gram source code twice, using dierent compilation parameters (i.e., compiler, compiler version,
optimization level, target platform), produces functionally similar binary code, but with dierent
hash. Even when all compilation parameters are the same, the produced executables will have dif-
ferent le hash. This happens because the executable may include metadata that diers in both
compilations such as the compilation date, which is automatically computed and included into the
header of the generated executable.
Two pieces of binary code are equivalent if they have the same semantics, i.e., if they oer exactly
the same functionality. Equivalence does not care about the syntax of the binary code. Clearly, two
identical pieces of binary code will have the same semantics, but dierent pieces of binary code
may as well. For example, mov %eax,$0 and xor %eax,%eax are semantically equivalent x86 in-
structions because both set the value of register EAX to zero. Similarly, the same source code com-
piled for two dierent target architectures should produce equivalent executables whose syntax
may be completely dierent if the architectures use dierent instruction sets. Note, however, that
ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.
A Survey of Binary Code Similarity 51:5
Fig. 2. Similarity at a finer granularity can be used to infer a dierent type of similarity at a coarser
granularity.
semantic equivalence goes beyond binary code compiled from the same source code. For example,
two dierent implementations of the same encryption algorithm should be semantically equiva-
lent. Proving that two arbitrary programs are functionally equivalent is an undecidable problem
that reduces to solving the halting problem [72]. In practice, determining binary code equivalence
is a very expensive process that can only be performed for small pieces of binary code.
Two pieces of binary code can be considered similar if their syntax, structure, or semantics
are similar. Syntactic similarity compares the code representation. For example, clone detection
approaches consider that a target piece of binary code is a clone of some source binary code if their
syntax is similar. Structural similarity compares graph representations of binary code (e.g., control
ow graphs, callgraphs). It sits between syntactic and semantic similarity. The intuition is that the
control ow of the binary code captures to some extent its semantics, e.g., the decisions taken
on the data. Furthermore, the graph representation captures multiple syntactic representations of
the same functionality. However, it is possible to modify the graph structure without aecting
the semantics, e.g., by inlining functions. Semantic similarity compares the code functionality. A
simple approach to semantic similarity compares the interaction between the program and its
environment through operating system (OS) APIs or system calls. But two programs with similar
system or API calls can still implement signicantly dierent functionality, e.g., by processing
dierently the output of the system calls, so more ne-grained semantic similarity approaches
focus on a syntax-independent comparison of the code.
Generally speaking, the more robust a binary code similarity approach is, i.e., the more transfor-
mations it can capture, the more expensive it is. Syntactic similarity approaches are the cheapest to
compute, but the least robust. They are sensitive to simple changes in the binary code, e.g., regis-
ter reallocation, instruction reordering, replacing instructions with semantically equivalent ones.
Structural similarity sits in the middle. It is robust against multiple syntactical transformations but
sensitive to transformations that change code structure such as code inlining or removal of unused
function parameters. Semantic similarity is robust against semantics-preserving transformations,
despite changes to the code syntax and structure, but it is very expensive to compute for large
pieces of binary code.
Comparison granularity. Binary code similarity approaches can be applied at dierent granu-
larities. Common granularities are instructions, basic blocks, functions, and whole programs. To
perform a comparison at a coarser granularity, some approaches use a dierent comparison at a
ner granularity and then combine the ner granularity results. For example, to compare whether
two programs are similar, an approach could determine the fraction of identical functions between
both programs. Thus, we dierentiate between the input granularity, i.e., the granularity of the in-
put pieces of binary code that the approach compares, and the approach granularities, i.e., the
granularities of the dierent comparisons in the approach.
Applying a specic comparison at a ner granularity may restrict the type of comparison
that can be performed at a coarser granularity, as illustrated in Figure 2. The gure shows that
ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.
剩余37页未读,继续阅读
资源评论
dream_uping
- 粉丝: 4w+
- 资源: 374
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功