Logram: Efficient Log Parsing Using n-Gram Dictionaries
Abstract—Software systems usually record important runtime information in their logs. Logs
help practitioners understand system runtime behaviors and diagnose field failures. As logs
are usually very large in size, automated log analysis is needed to assist practitioners in their
software operation and maintenance efforts. Typically, the first step of automated log analysis
is log parsing, i.e., converting unstructured raw logs into structured data. However, log parsing
is challenging, because logs are produced by static templates in the source code (i.e., logging
statements) yet the templates are usually inaccessible when parsing logs. Prior work proposed
automated log parsing approaches that have achieved high accuracy. However, as the volume
of logs grows rapidly in the era of cloud computing, efficiency becomes a major concern in log
parsing. In this work, we propose an automated log parsing approach, Logram, which leverages
n-gram dictionaries to achieve efficient log parsing. We evaluated Logram on 16 public log
datasets and compared Logram with five state-of-the-art log parsing approaches. We found
that Logram achieves a higher parsing accuracy than the best existing approaches (i.e., at least
10% higher, on average) and also outperforms these approaches in efficiency (i.e., 1.8 to 5.1
times faster than the second-fastest approaches in terms of end-to-end parsing time).
Furthermore, we deployed Logram on Spark and we found that Logram scales out efficiently
with the number of Spark nodes (e.g., with near-linear scalability for some logs) without
sacrificing parsing accuracy. In addition, we demonstrated that Logram can support effective
online parsing of logs, achieving similar parsing results and efficiency to the offline mode.
Index Terms—Log parsing, Log analysis, N-gram
1 INTRODUCTION
Modern software systems usually record valuable runtime information (e.g., important
events and variable values) in logs. Logs play an important role for practitioners to
understand the runtime behaviors of software systems and to diagnose system failures [1],
[2]. However, since logs are often very large in size (e.g., tens or hundreds of gigabytes)
[3], [4], prior research has proposed automated approaches to analyze logs. These
automated approaches help practitioners with various software maintenance and operation
activities, such as anomaly detection [5], [6], [7], [8], [9], failure diagnosis [10], [11],
performance diagnosis and improvement [12], [13], and system comprehension [10], [14].
Recently, the fast-emerging AIOps (Artificial Intelligence for IT Operations) solutions also
depend heavily on automated analysis of operation logs [15], [16], [17], [18], [19].
Logs are generated by logging statements in the source code. As shown in Figure 1, a
logging statement is composed of log level (i.e., info), static text (i.e., “Found block” and
“locally”), and dynamic variables (i.e., “$blockId”). During system runtime, the logging
statement generates raw log messages, each of which is a line of unstructured text that
contains the static text and the values of the dynamic variables (e.g., “rdd_42_20”) that are
specified in the logging statement. The log message also contains information such as the
timestamp (e.g., “17/06/09 20:11:11”) of when the event happened. In other words, logging
statements define the templates for the log messages that are generated at runtime.
Automated log analysis usually has difficulties analyzing and processing the unstructured
logs due to their dynamic nature [5], [10]. Instead, a log parsing step is needed to convert
the unstructured logs into a structured format before the analysis. The goal of log parsing
is to extract the static template, dynamic variables, and the header information (i.e.,
timestamp, log level, and logger name) from a raw log message to a structured format.
Such structured information is then used as input for automated log analysis. He et al. [20]
found that the results of log parsing are critical to the success of log analysis tasks.
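The header extraction described above can be sketched with a small regular expression over the Spark-style log line from Figure 1. This is a minimal illustration, not the paper's parser; the pattern and field names are assumptions for this one log format.

```python
import re

# Hypothetical raw log line in the Spark format shown in Figure 1.
raw = "17/06/09 20:11:11 INFO storage.BlockManager: Found block rdd_42_20 locally"

# A minimal header pattern: timestamp, log level, logger name, then free-text content.
HEADER = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+): "
    r"(?P<content>.*)$"
)

def parse_header(line):
    """Split a raw log message into header fields and unstructured content."""
    m = HEADER.match(line)
    return m.groupdict() if m else None

parsed = parse_header(raw)
```

The `content` field is what a log parser then splits into the static template and dynamic variables.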
In practice, practitioners usually write ad hoc log parsing scripts that depend heavily on
specially-designed regular expressions [21], [22], [23]. As modern software systems
usually contain large numbers of log templates which are constantly evolving [24], [25],
[26], practitioners need to invest significant effort to develop and maintain
such regular expressions. In order to ease the pain of developing and maintaining ad hoc
log parsing scripts, prior work proposed various approaches for automated log parsing [21].
For example, Drain [22] uses a fixed-depth tree to parse logs. Each layer of the tree defines
a rule for grouping log messages (e.g., log message length, preceding tokens, and token
similarity). At the end, log messages with the same templates are clustered into the same
groups. Zhu et al. [21] proposed a benchmark and thoroughly compared prior approaches
for automated log parsing.
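The length-then-similarity grouping idea behind Drain can be sketched as below. This is a simplified illustration of the clustering rules described above, not Drain's actual fixed-depth-tree implementation; the threshold and function names are assumptions.

```python
from collections import defaultdict

def token_similarity(a, b):
    """Fraction of positions where two equal-length token lists agree."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)

def group_messages(messages, threshold=0.5):
    """Group log messages first by token count, then by positional token similarity."""
    groups_by_len = defaultdict(list)  # token count -> list of groups
    for msg in messages:
        tokens = msg.split()
        placed = False
        for group in groups_by_len[len(tokens)]:
            # Compare against the group's first member as its representative.
            if token_similarity(group[0], tokens) >= threshold:
                group.append(tokens)
                placed = True
                break
        if not placed:
            groups_by_len[len(tokens)].append([tokens])
    return [g for groups in groups_by_len.values() for g in groups]

groups = group_messages([
    "Found block rdd_42_20 locally",
    "Found block rdd_42_24 locally",
    "Connection reset by peer",
])
```

Messages produced by the same template end up in the same group, from which the shared template can be derived.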
Despite the existence of prior log parsers, as the size of logs grows rapidly [1], [2], [27]
and the need for low-latency log analysis increases [19], [28], efficiency becomes an
important concern for log parsing. In this work, we propose Logram, an automated log
parsing approach that leverages n-gram dictionaries to achieve efficient log parsing. In
short, Logram uses dictionaries to store the frequencies of n-grams in logs and leverages the
n-gram dictionaries to extract the static templates and dynamic variables in logs. Our
intuition is that frequent n-grams are more likely to represent the static templates while rare
n-grams are more likely to be dynamic variables. The n-gram dictionaries can be
constructed and queried efficiently, i.e., with a complexity of O(n) and O(1), respectively.
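The core intuition can be sketched as follows: build a frequency dictionary of n-grams in one pass, then classify each token by the frequency of the n-grams that contain it. This is an illustrative sketch of the idea, not the paper's full algorithm; the choice of 2-grams and the frequency threshold are assumptions.

```python
from collections import Counter

def build_dictionary(messages, n=2):
    """Build an n-gram frequency dictionary in a single pass over the tokens."""
    counts = Counter()
    for msg in messages:
        tokens = msg.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def classify_tokens(message, counts, n=2, threshold=2):
    """Label each token static or dynamic via constant-time dictionary lookups."""
    tokens = message.split()
    labels = []
    for i in range(len(tokens)):
        # All n-grams that contain token i.
        grams = [tuple(tokens[j:j + n])
                 for j in range(max(0, i - n + 1), min(i + 1, len(tokens) - n + 1))]
        freq = max((counts[g] for g in grams), default=0)
        labels.append("static" if freq >= threshold else "dynamic")
    return list(zip(tokens, labels))

msgs = [
    "Found block rdd_42_20 locally",
    "Found block rdd_42_24 locally",
    "Found block rdd_42_11 locally",
]
dictionary = build_dictionary(msgs)
labeled = dict(classify_tokens(msgs[0], dictionary))
```

Here the frequent 2-gram ("Found", "block") marks those tokens as static text, while the block IDs only appear in rare 2-grams and are flagged as dynamic variables.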
We evaluated Logram on 16 log datasets [21] and compared Logram with five state-of-the-
art log parsing approaches. We found that Logram achieves a higher accuracy compared
with the best existing approaches (i.e., at least 10% higher on average), and that Logram
outperforms these best existing approaches in efficiency, achieving a parsing speed that is
1.8 to 5.1 times faster than the second-fastest approaches. Furthermore, as the n-gram
dictionaries can be constructed in parallel and aggregated efficiently, we demonstrated that
Logram can achieve high scalability when deployed on a multi-core environment (e.g., a
Spark cluster), without sacrificing any parsing accuracy. Finally, we demonstrated that
Logram can support effective online parsing, i.e., by updating the n-gram dictionaries
continuously when new logs are added in a streaming manner.
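The reason the dictionary parallelizes and supports online updates can be sketched simply: per-partition n-gram counters merge by addition, so partitioning order does not matter, and streaming logs just keep updating the merged counter. The partition setup below is illustrative, not the paper's Spark deployment.

```python
from collections import Counter

def count_ngrams(messages, n=2):
    """Count n-grams over the tokens of a batch of log messages."""
    counts = Counter()
    for msg in messages:
        tokens = msg.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Two "partitions" of a log, processed independently (e.g., on two Spark executors).
part_a = count_ngrams(["Found block rdd_42_20 locally"])
part_b = count_ngrams(["Found block rdd_42_24 locally"])

# Aggregation is plain counter addition, so it is associative and order-independent.
merged = part_a + part_b

# Online mode: keep updating the merged dictionary as new messages stream in.
merged.update(count_ngrams(["Found block rdd_42_11 locally"]))
```

After the streamed update, the merged dictionary reflects all three messages exactly as if they had been processed in one batch.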