【免费】2021-ASurveyofBinaryCodeSimilarity.pdf资源-CSDN文库

2021-ASurvey

需积分: 0 179 浏览量 2022-01-02 22:03:06 上传评论 1 收藏 708KB PDF 举报

资源推荐

资源详情

资源评论

A Survey of Binary Code Similarity

IRFAN UL HAQ, National University of Computer and Emerging Sciences, Pakistan

JUAN CABALLERO, IMDEA Software Institute, Spain

Binary code similarityapproaches compare two or more pieces of binary code to identify their similarities

and dierences. The ability to compare binary code enables many real-world applications on scenarios where

source code may not be available such as patch analysis, bug search, and malware detection and analysis. Over

the past 22 years numerous binary code similarity approaches have been proposed, but the research area has

not yet been systematically analyzed. This article presents the rst survey of binary code similarity. It analyzes

70 binary code similarity approaches, which are systematized on four aspects: (1) the applications they enable,

(2) their approach characteristics, (3) how the approaches are implemented, and (4) the benchmarks and

methodologies used to evaluate them. In addition, the survey discusses the scope and origins of the area, its

evolution over the past two decades, and the challenges that lie ahead.

CCS Concepts: • Security and privacy → Software reverse engineering;

Additional Key Words and Phrases: Binary code similarity, code ding, code search, cross-architecture,

executable

ACM Reference format:

Irfan Ul Haq and Juan Caballero. 2021. A Survey of Binary Code Similarity. ACM Comput. Surv. 54, 3, Article

51 (April 2021), 38 pages.

https://doi.org/10.1145/3446371

1 INTRODUCTION

Binary code similarity approaches compare two or more pieces of binary code, e.g., basic blocks,

functions, or whole programs, to identify their similarities and dierences. Comparing binary code

is fundamental in scenarios where the program source code is not available, which happens with

commercial-of-the-shelf (COTS) programs, legacy programs, and malware. Binary code similarity

has a wide list of real-world applications such as bug search [17, 25–28, 39, 42, 43, 48, 56, 77, 92,

93, 109, 125], malware clustering [52, 53, 66], malware detection [12, 15, 70], malware lineage [58,

75, 84], patch generation [5], patch analysis [36, 44, 47, 54, 56, 64, 127], porting information across

program versions [36, 44, 118], and software theft detection [79].

This research was partially supported by the Regional Government of Madrid through the BLOQUES-CM (S2018/TCS-4339)

grant, and by the Spanish Government through the SCUM (RTI2018-102043-B-I00) grant. This research is also supported

by the EIE Funds of the European Union. All opinions, ndings and conclusions, or recommendations expressed herein are

those of the authors and do not necessarily reect the views of the sponsors.

Authors’ addresses: I. Ul Haq, National University of Computer and Emerging Sciences, Chiniot-Faisalabad Cam-

pus, Pakistan; email: irfanul.haq@nu.edu.pk; J. Caballero, IMDEA Software Institute, Madrid, Spain; email: juan.

caballero@imdea.org.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee

provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and

the full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be

honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists,

requires prior specic permission and/or a fee. Request permissions from permissions@acm.org.

0360-0300/2021/04-ART51 $15.00

https://doi.org/10.1145/3446371

ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.

51:2 I. Ul Haq and J. Caballero

Identifying binary code similarity is challenging because much program semantics are lost due

to the compilation process including function names, variable names, source comments, and data

structure denitions. Additionally, even when the program source code does not change, the bi-

nary code may change if the source is recompiled, due to secondary changes introduced by the

compilation process. For example, the resulting binary code can signicantly change when using

dierent compilers, changing compiler optimizations, and selecting dierent target operating sys-

tems and CPU architectures. Furthermore, obfuscation transformations can be applied on both the

source code and the generated binary code, hiding the original code.

Given its applications and challenges, over the past 22 years numerous binary code similarity

approaches have been proposed. However, as far as we know, there does not exist a systematic

survey of this research area. Previous surveys deal with binary code obfuscation techniques in

packer tools [101], binary code type inference [14], and dynamic malware analysis techniques [37].

Those topics are related because binary code similarity may need to tackle obfuscation, binary code

type inference may leverage similar binary analysis platforms, and malware is often a target of

binary code similarity approaches. But binary code similarity is well separated from those topics,

as shown by previous surveys having no overlap with this article on the set of papers analyzed.

Other surveys have explored similarity detection on any binary data, i.e., not specic to code,

such as hashing for similarity search [115] and similarity metrics on numerical and binary feature

vectors [16, 107]. In contrast, this survey focuses on approaches that compare binary code, i.e.,

that disassemble the executable byte stream.

This article presents the rst survey of binary code similarity. It rst identies 70 binary code

similarity approaches through a systematic selection process that examines over 100 papers pub-

lished in research venues from dierent computer science areas such as computer security, soft-

ware engineering, programming languages, and machine learning. Then, it systematizes four as-

pects of those 70 approaches: (1) the applications they enable, (2) their approach characteristics,

(3) how the approaches have been implemented, and (4) the benchmarks and methodologies used

to evaluate them. In addition, it discusses the scope and origin of binary code similarity, its evolu-

tion over the past two decades, and the challenges that lie ahead.

Binary code similarity approaches widely vary in their approach, implementation, and evalua-

tion. This survey systematizes each of those aspects and summarizes the results in easy-to-access

tables that compare the 70 approaches across multiple dimensions, allowing beginners and experts

to quickly understand their similarities and dierences. For example, the approach systematization

includes, among others, the number of input pieces of binary code being compared (e.g., one-to-

one, one-to-many, many-to-many); the granularity of the pieces of binary code analyzed (e.g.,

basic blocks, functions, programs); whether the comparison happens at the syntactical represen-

tation, the graph structure, or the code semantics; the type of analysis used (e.g., static, dynamic,

symbolic); and the techniques used for scalability (e.g., hashing, embedding, indexing). The imple-

mentation systematization includes the binary analysis platforms used to build the approach, the

programming languages used to code it, the supported architectures for the input pieces of binary

code being compared, and whether the approach is publicly released. The evaluation systematiza-

tion covers the datasets on which the approaches are evaluated and the evaluation methodology

including the type of evaluation (e.g., accuracy, comparison with prior works, performance) and

how the robustness of the approach is evaluated in the face of common code transformations such

as changes in compiler and compilation options, dierent architectures, and obfuscation.

Beyond the systematization, this survey also discusses how binary code similarity has evolved

from binary code ding to binary code search and how the focus has moved from a single ar-

chitecture to cross-architecture approaches. It shows that it is a vibrant eld of research as many

new approaches are still being proposed. It discusses technical challenges that remain open but

ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.

A Survey of Binary Code Similarity 51:3

Fig. 1. The extended compilation process. Doed boxes are optional code transformations typically used

for obfuscation. For a given source code, changing any of the gray boxes may produce a dierent binary

program.

concludes that the future of the area is bright with important application scenarios under way

such as those related to binary code search engines and the Internet-of-Things.

Article structure. This article is organized as follows. Section 2 provides an overview of binary

code similarity. Section 3 details the scope of the survey and the paper selection process. Section 4

summarizes applications of binary code similarity and Section 5 the evolution of the eld over

the last two decades. Section 6 systematizes the characteristics of the 70 binary code similarity

approaches, Section 7 their implementation, and Section 8 their evaluation. Finally, we discuss

future research directions in Section 9 and conclude in Section 10.

2OVERVIEW

In this section, we rst provide background on the compilation process (Section 2.1). Then, we

present an overview of the binary code similarity problem (Section 2.2).

2.1 Compilation Process

Binary code refers to the machine code that is produced by the compilation process and that can

be run directly by a CPU. The standard compilation process takes as input the source code les

of a program. It compiles them with a chosen compiler and optimization level and for a specic

platform (dened by the architecture, word size, and OS) producing object les. Those object les

are then linked into a binary program, either a stand-alone executable or a library.

Binary code similarity approaches typically deal with an extended compilation process, illustrated

in Figure 1, which adds two optional steps to the standard compilation process: source code and

binary code transformations. Both types of transformations are typically semantics preserving

(i.e., do not change the program functionality) and are most commonly used for obfuscation (i.e.,

to hamper reverse-engineering of the distributed binary programs). Source code transformations

happen before compilation. Thus, their input and output are both source code. They can be applied

regardless of the target platform but may be specic to the programming language used to write

the program. On the other hand, binary code transformations happen after compilation. Thus,

their input and output are binary code. They are independent of the programming language used

but may be specic to the target platform.

Obfuscation is a fundamental step in malware but can also be applied to benign programs, e.g., to

protect their intellectual property. There exist o-the-shelf obfuscation tools that use source code

transformations (e.g., Tigress [21]), as well as binary code transformations (e.g., packers [112]).

Packing is a binary code transformation widely used by malware. Once a new version of a malware

family is ready, the malware authors pack the resulting executable to hide its functionality and

thus bypass detection by commercial malware detectors. The packing process takes as input an

executable and produces another executable with the same functionality, but with the original code

hidden (e.g., encrypted as data and unpacked at runtime). The packing process is typically applied

ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.

51:4 I. Ul Haq and J. Caballero

many times to the same input executable, creating polymorphic variants of exactly the same source

code, which look dierent to malware detectors. Nowadays, the majority of malware is packed and

malware often uses custom packers for which o-the-shelf unpackers are not available [112].

A main challenge in binary code similarity is that the compilation process can produce dierent

binary code representations for the same source code. An author can modify any of the gray

boxes in Figure 1 to produce a dierent, but semantically equivalent, binary program from the

same source code. Some of these modications may be due to the standard compilation process.

For example, to improve program eciency, an author may vary the compiler’s optimization

level or change the compiler altogether. Both changes will transform the produced binary code,

despite the source code remaining unchanged. An author may also change the target platform

to obtain a version of the program suitable for a dierent architecture. In this case, the produced

binary code may radically dier if the new target architecture uses a dierent instruction set. An

author may also deliberately apply obfuscation transformations to produce polymorphic variants

of the same source code. The produced variants will typically have the same functionality dened

by the original source code. A desirable goal for binary code similarity approaches is that they

are able to identify the similarity of binary code that corresponds to the same source code having

undergone dierent transformations. The robustness of a binary code similarity approach captures

the compilation and obfuscation transformations that it can handle, i.e., the transformations

despite which it can still detect similarity.

2.2 Binary Code Similarity Overview

Binary code similarity approaches compare pieces of binary code. The three main characteristics of

binary code similarity approaches are (1) the type of the comparison (identical, similar, equivalent),

(2) the granularity of the pieces of binary code being compared (e.g., instructions, basic blocks,

functions), and (3) the number of input pieces being compared (one-to-one, one-to-many, many-

to-many). We detail these three characteristics next. For simplicity, we describe the comparison

type and comparison granularity for two inputs and then generalize to multiple inputs.

Comparison type. Two (or more) pieces of binary code are identical if they have the same syn-

tax, i.e., the same representation. The binary code can be represented in dierent ways such as a

hexadecimal string of raw bytes, a sequence of disassembled instructions, or a control-ow graph.

Determining if several pieces of binary code are identical is a Boolean decision (either they are

identical or not) that is easy to check: simply apply a cryptographic hash (e.g., SHA256) to the

contents of each piece. If the hash is the same, the pieces are identical. However, such a straight-

forward approach fails to detect similarity in many cases. For example, compiling the same pro-

gram source code twice, using dierent compilation parameters (i.e., compiler, compiler version,

optimization level, target platform), produces functionally similar binary code, but with dierent

hash. Even when all compilation parameters are the same, the produced executables will have dif-

ferent le hash. This happens because the executable may include metadata that diers in both

compilations such as the compilation date, which is automatically computed and included into the

header of the generated executable.

Two pieces of binary code are equivalent if they have the same semantics, i.e., if they oer exactly

the same functionality. Equivalence does not care about the syntax of the binary code. Clearly, two

identical pieces of binary code will have the same semantics, but dierent pieces of binary code

may as well. For example, mov %eax,$0 and xor %eax,%eax are semantically equivalent x86 in-

structions because both set the value of register EAX to zero. Similarly, the same source code com-

piled for two dierent target architectures should produce equivalent executables whose syntax

may be completely dierent if the architectures use dierent instruction sets. Note, however, that

ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.

A Survey of Binary Code Similarity 51:5

Fig. 2. Similarity at a finer granularity can be used to infer a dierent type of similarity at a coarser

granularity.

semantic equivalence goes beyond binary code compiled from the same source code. For example,

two dierent implementations of the same encryption algorithm should be semantically equiva-

lent. Proving that two arbitrary programs are functionally equivalent is an undecidable problem

that reduces to solving the halting problem [72]. In practice, determining binary code equivalence

is a very expensive process that can only be performed for small pieces of binary code.

Two pieces of binary code can be considered similar if their syntax, structure, or semantics

are similar. Syntactic similarity compares the code representation. For example, clone detection

approaches consider that a target piece of binary code is a clone of some source binary code if their

syntax is similar. Structural similarity compares graph representations of binary code (e.g., control

ow graphs, callgraphs). It sits between syntactic and semantic similarity. The intuition is that the

control ow of the binary code captures to some extent its semantics, e.g., the decisions taken

on the data. Furthermore, the graph representation captures multiple syntactic representations of

the same functionality. However, it is possible to modify the graph structure without aecting

the semantics, e.g., by inlining functions. Semantic similarity compares the code functionality. A

simple approach to semantic similarity compares the interaction between the program and its

environment through operating system (OS) APIs or system calls. But two programs with similar

system or API calls can still implement signicantly dierent functionality, e.g., by processing

dierently the output of the system calls, so more ne-grained semantic similarity approaches

focus on a syntax-independent comparison of the code.

Generally speaking, the more robust a binary code similarity approach is, i.e., the more transfor-

mations it can capture, the more expensive it is. Syntactic similarity approaches are the cheapest to

compute, but the least robust. They are sensitive to simple changes in the binary code, e.g., regis-

ter reallocation, instruction reordering, replacing instructions with semantically equivalent ones.

Structural similarity sits in the middle. It is robust against multiple syntactical transformations but

sensitive to transformations that change code structure such as code inlining or removal of unused

function parameters. Semantic similarity is robust against semantics-preserving transformations,

despite changes to the code syntax and structure, but it is very expensive to compute for large

pieces of binary code.

Comparison granularity. Binary code similarity approaches can be applied at dierent granu-

larities. Common granularities are instructions, basic blocks, functions, and whole programs. To

perform a comparison at a coarser granularity, some approaches use a dierent comparison at a

ner granularity and then combine the ner granularity results. For example, to compare whether

two programs are similar, an approach could determine the fraction of identical functions between

both programs. Thus, we dierentiate between the input granularity, i.e., the granularity of the in-

put pieces of binary code that the approach compares, and the approach granularities, i.e., the

granularities of the dierent comparisons in the approach.

Applying a specic comparison at a ner granularity may restrict the type of comparison

that can be performed at a coarser granularity, as illustrated in Figure 2. The gure shows that

ACM Computing Surveys, Vol. 54, No. 3, Article 51. Publication date: April 2021.

剩余37页未读，继续阅读

评论收藏

内容反馈

dream_uping

粉丝: 4w+
资源: 374

2021 - A Survey of Binary Code Similarity.pdf

最新资源

2021 - A Survey of Binary Code Similarity.pdf

2021_Semantic-aware Binary Code Representation with BERT.docx

2021_Semantic-aware Binary Code Representation with BERT PPT.pdf

ICC IWUIC 2021 - International Wildland-Urban Interface Code.pdf

ICC IECC 2021 - 2021 International Energy Conservation Code.pdf

discovRE__Efficient_Cross-Architecture_Identification_of_Bugs_in_Binary_Code.pdf

amoeba-mysql-binary-2.2.0.tar.gz

sqlite-netFx46-binary-bundle-x64-2015-1.0.113.0.zip

sqlite-netFx40-static-binary-x64-2010-1.0.112.0.zip

sqlite-netFx40-binary-Win32_2010-1.0.94.0.zip

ICC IFGC 2021 International Fuel Gas Code.pdf

ICC IPMC 2021 International Property Maintenance Code.pdf

BS 7582-2021 safe cabinets — Code of practice.pdf

ICC IGCC-2021 International Green Construction Code.pdf

PostgreSQL-11.10-osr507-Binary.tar.gz

hyperledger-fabric-ca-linux-amd64-1.4.0.tar.gz

hyperledger-fabric-linux-amd64-1.4.0.tar.gz

sqlite-netFx40-binary-x64-2010-1.0.106.0

sqlite-netFx46-binary-x64-2015-1.0.113.0.zip

ICC 2021 - 2021 International Plumbing Code.pdf

ICC IEBC 2021 - International Existing Building Code.pdf

【船级社】 BV NI672 Guidelines for Remote Survey 2021-06.pdf

ICC IZC 2021 -International Zoning Code.pdf

二值神经网络综述（Binary Neural Networks: A Survey）【北航】.pdf

sqlite-netFx40-binary-x64-2010-1.0.94.0.zip

sqlite-netFx46-binary-x64-2015-1.0.106.0

sqlite-netFx20-binary-x64-2005-1.0.112.0.zip

sqlite-netFx451-binary-x64-2013-1.0.106.0

jdk-8u202-linux-arm64-vfp-hflt.tar.gz

最新资源