Translation of the paper "Spell: Online Streaming Parsing of Large Unstructured System Logs"

Spell: Online Streaming Parsing of Large Unstructured System Logs

Abstract—System event logs have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs into structured data, and then apply data mining and machine learning techniques and/or build workflow models from the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. As a result, a streaming method to parse unstructured logs is needed. We propose an online streaming method, Spell, which uses a longest common subsequence based approach to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in a streaming fashion. An enhancement to find more accurate message types is also proposed. We also propose and evaluate a method to automatically discover semantic meanings for parameter fields identified by Spell. We compare Spell against state-of-the-art methods for extracting patterns from system event logs on large real data. The results demonstrate that Spell outperforms other log parsing alternatives in both efficiency and effectiveness.

Index Terms—Log parsing, log data, system logs
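The abstract says that Spell parses logs with a longest common subsequence (LCS) based approach and maintains the discovered message types in a streaming fashion, but this section gives no algorithmic details. As a rough illustration of the general idea only, the following minimal Python sketch processes one log line at a time. The names `lcs`, `merge`, and `StreamingParser`, the `*` placeholder, whitespace tokenization, and the "at least half of the tokens must match" threshold are all assumptions made for this example, and the sample log lines are invented.

```python
from typing import List

WILDCARD = "*"  # assumed placeholder for a parameter position


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j] and a[i] != WILDCARD:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover the subsequence itself.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j] and a[i] != WILDCARD:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def merge(template: List[str], common: List[str]) -> List[str]:
    """Keep the template tokens found in `common`; collapse the rest to '*'.

    Simplification: tokens that appear only in the new line are not marked.
    """
    out, k = [], 0
    for tok in template:
        if k < len(common) and tok == common[k]:
            out.append(tok)
            k += 1
        elif not out or out[-1] != WILDCARD:
            out.append(WILDCARD)
    return out


class StreamingParser:
    """Maintains a growing list of message-type templates."""

    def __init__(self) -> None:
        self.templates: List[List[str]] = []

    def add(self, line: str) -> int:
        """Consume one log line and return the index of its message type."""
        tokens = line.split()
        best_idx, best = -1, []
        for idx, tpl in enumerate(self.templates):
            common = lcs(tpl, tokens)
            if len(common) > len(best):
                best_idx, best = idx, common
        # Assumed threshold: at least half of the new line's tokens match.
        if best_idx >= 0 and 2 * len(best) >= len(tokens):
            self.templates[best_idx] = merge(self.templates[best_idx], best)
            return best_idx
        self.templates.append(tokens)
        return len(self.templates) - 1


parser = StreamingParser()
first = parser.add("Temperature 41C exceeds warning threshold")
second = parser.add("Temperature 43C exceeds warning threshold")
third = parser.add("Command has completed successfully")
```

In this sketch the first two lines end up in the same message type, whose template keeps the shared tokens and replaces the differing one with `*`, while the third line starts a new message type. Comparing each incoming line against every stored template is the simplest possible lookup and is used here only for brevity.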

1 INTRODUCTION

The increasing complexity of modern computer systems has become a significant limiting factor in deploying and managing them. Being alerted to a problem and mitigating it right away has become a fundamental requirement in many computer systems. As a result, automatically detecting anomalies as they happen, in an online fashion, is an appealing solution. Data-driven methods based on machine learning and data mining techniques are heavily employed to understand complex system behaviors, for example, exploring machine data for automatic pattern discovery and anomaly detection. System logs, as a universal data source that contains important information such as execution paths and program running status, are valuable assets for these data-driven system analytics, yielding insights that are useful to enhance system health, stability, and usability.

The effectiveness of system log mining has been validated by recent literature. Logs can be used to detect execution anomalies [1], [2], [3], monitor network failures [4], or even find software bugs [5]. Researchers have also used system logs to discover and diagnose performance problems [6]. Logs contain intrinsic underlying information that can help to understand system behaviors [7].

To alleviate the pain of diving into massive unstructured log data, the first and foremost step in most prior work is to automatically parse the unstructured system logs into structured data [1], [2], [3], [5]. There has been substantial study on how to achieve this, for example, using regular expressions [8], leveraging the source code [5], or parsing purely based on system log characteristics using data mining approaches such as clustering and iterative partitioning [1], [9], [10], [11]. Nevertheless, none of the previous methods achieves truly online parsing in a streaming fashion, with two exceptions: the approach that uses regular expressions, which requires domain-specific expert knowledge [8] and hence does not work for general-purpose system log parsing, and the approach that leverages the source code [12], which is often unavailable. Some work claimed "online" processing, but required extensive offline processing first [13], or used regular expressions to remove certain fields [14], and only then matched log entries against the data structures and patterns previously identified.

Furthermore, previous methods that are tuned for a specific type of system log may perform poorly on a new format or type of system log. For example, OpenStack is a very popular open source cloud infrastructure. Its logs contain various formats that are not present in previous system logs, such as JSON. OpenStack log analysis is important for automatic problem solving, but current work still parses message types manually as the first step [15]. Our method is designed as a general-purpose streaming log parsing method; hence, it is system-, type-, and format-agnostic. In our evaluation, we collected OpenStack raw log messages as our test data and obtained ground truth message types by parsing them from the OpenStack source code; the results show that our method works substantially better than all previous methods.

There is also an increasing demand to properly manage and store system logs [16]. Thus log management systems (LMS) are in great demand and have become widely deployed in recent years (e.g., ELK by Elastic.co). A typical architecture of an LMS is shown in Fig. 1. On each node, a log shipper forwards log entries to a centralized server, which often contains a log parser, a log indexer, a storage engine, and a user interface. In such systems the default log parser only parses simple schema information such as timestamp and hostname. The
