【免费】论文《高效恢复缺失的事件》翻译资源-CSDN文库

流程挖掘

事件日志

需积分: 0 115 浏览量 2024-01-24 21:23:31 上传评论收藏 1.26MB DOCX 举报

资源推荐

资源详情

资源评论

Efficient Recovery of Missing Events

有效恢复缺失的事件

Abstract—For various entering and transmission issues raised by human or system,

missing events often occur in event data, which record execution logs of business

processes. Without recovering the missing events, applications such as provenance

analysis or complex event processing built upon event data are not reliable. Following

the minimum change discipline in improving data quality, it is also rational to find a

recovery that minimally differs from the original data. Existing recovery approaches

fall short of efficiency owing to enumerating and searching over all of the possible

sequences of events. In this paper, we study the efficient techniques for recovering

missing events. According to our theoretical results, the recovery problem appears to

be NP-hard. Nevertheless, advanced indexing, pruning techniques are developed to

further improve the recovery efficiency. The experimental results demonstrate that our

minimum recovery approach achieves high accuracy, and significantly outperforms the

state-of-the-art technique for up to five orders of magnitudes improvement in time

performance.

Index Terms—Data repairing, event data processing, petri net

摘要：对于人工或系统提出的各种输入和传输问题，事件数据中经常会出现丢失

事件，这些事件数据记录了业务流程的执行日志。如果不恢复丢失的事件，诸如

起源分析或基于事件数据的复杂事件处理等应用程序就不可靠。遵循改善数据质

量的最小变化原则，找到与原始数据最小差异的恢复也是合理的。由于枚举和搜

索所有可能的事件序列，现有的恢复方法缺乏效率。在这篇论文中，我们研究了

恢复丢失事件的有效技术。根据我们的理论结果，恢复问题似乎是 np 困难的。

然而，先进的索引剪枝技术进一步提高了恢复效率。实验结果表明，我们的最小

恢复方法具有较高的精度，在时间性能上显著优于最先进的技术，提高了 5 个数

量级。

关键词：数据修复，事件数据处理，petri 网

1INTRODUCTION

Business processes continuously generate huge volume of event data, ranging

from traditional enterprise office automation systems or scientific workflows [2], [8] to

recent Web services and online transactions [17]. In event data management,

provenance analysis [20] identifies the sequence of steps leading to a data item, and

complex event processing [5] detects interesting event patterns from data. While

querying and mining on event data are widely investigated, the quality of event data

itself draws less attention.

According to our survey of real event data recorded by a train manufacturer, at

least 47.66 percent events are missed in a total of 4, 470 event sequences. The missing

events occur for various reasons, such as (1) forgot to submit when manually recording

event logs, (2) suffered from system failures, or (3) mess after collecting the events

from heterogeneous execution environment. In this survey, the most typical missing

events are routing events (41.43 percent of the 47.66 percent missing events), which

pass or distribute tasks (e.g., by the manager) to one or multiple staffs for subsequent

processing. Since no changes have been made on products by such routing events, the

system did not collect these events (belonging to the aforesaid type (3) missing events).

Without recovering such missing routing events, the prerequisites of an event might be

absent, e.g., having no idea about which manager the task is passed from. The

provenance of a product item, i.e., the sequence of steps used to produce the item, is

unlikely to obtain.

Moreover, mining patterns over event data is important, since event patterns (with

distinct appearance frequencies) can be utilized as discriminative features in matching

and identifying heterogeneous events [28]. The basic idea is to enrich the features of

singleton events (less discriminative) by more complex event patterns (more

discriminative) in event matching (see more details in [27], [28]). As presented in [15],

interesting event patterns (a.k.a. episodes) that occur frequently could be discovered

from event sequences. For instance, a pattern

SEQ

(𝐵, 𝐴𝑁𝐷

(

𝐶,𝐷

)

,𝐸)

indicates that after

B, two events C and D are executed in parallel, followed by E. It could be discovered

with high frequency from a set of sequences <ABCDEH>, <ABDCEH>,

<ABCDEG>, ..., where the pattern appears. Again, without recovering the missing

events, e.g., C from the same sources owing to network interrupt, the event patterns

could no longer be discovered.

In general, the task of recovering missing events could hardly be performed

without any prior knowledge. Fortunately, most business events do not occur randomly.

Instead, event data often follow certain business rules or constraints, such as process

specifications [5]. We focus on recovering missing events in the light of process

specifications.

Example 1. Consider a real process specification in Fig. 1a for producing an

engineering drawing in a train manufacturer. Each square (namely transition) denotes

a task in the process specification, e.g., transition

𝐴

represents a task of drafting. All

the arrows attached to a transition denote that the corresponding flows should be

executed in parallel. For example, both the dimension checking (task C) and the

tolerance checking (task

𝐷

) should be conducted after line type proofing (task B) in

the drawing. Moreover, the process can carry on evaluating the drawing (task

𝐸

) only

if both C and D are executed. Circles in the figure, namely places always appearing

between transitions, could express choice semantics. That is, when multiple choices are

associated, only one of the flows going out a place can be executed. For instance, place

b6 leads to either revising the drawing (task F), archiving it (task G) or discarding it

(task H) after evaluation (E).

An execution of the process generates a sequence of events, where each event

corresponds to a task in the process specification. We say that a sequence conforms to

the specification if it successfully executes from the source place

𝑏

𝑠𝑡𝑎𝑟𝑡

to the sink

place bend following the flow constraints in the specification. For example, the first

sequence <ABCDEG> in Fig. 1b denotes a complete execution of engineering drawing

including steps drafting, line type proofing, dimension checking, tolerance checking,

evaluating, archiving from bstart to bend.

Owing to various data quality issues, event logs are often incomplete. For instance,

the second sequence <ABCEG> has an event D missed during the collection of event

logs from the database for dimension checking. Without recovering the missing event

D, it is unlikely to find the provenance step of E. Moreover, if such data transmission

problems occur frequently in the dimension checking database, an incorrect event

pattern without dimension checking step in engineering drawing will be mined.

The minimum change principle is widely considered in data repairing [3], [18],

following the intuition that systems or human always try to minimize their mistakes in

practice, i.e., to minimally miss events in our scenarios. For instance, to recover the

third sequence <ABCDG> in Fig. 1 in Example 1, a minimum recovery <ABCDEG>

could be more likely referring to the aforesaid intuition of minimum (missing) mistakes,

rather than <ABCDEFBCDEG>, <ABCDEFBCDEFBCDEG>, ... Indeed, as discussed

in Example 4, the chance that all events in a loop (e.g., FBCDE in the aforesaid recovery)

are missing is low in practice. Consequently, the accuracy of the former minimum

recovery could be generally higher than the latter excessive recoveries. Thereby, the

minimum recovery is also considered in the previous study of missing events [13].

Challenges: The hardness of recovering the missing events originates from the

complexity of the recovery problem. Efficiently obtaining an optimal recovery is not

trivial.

We illustrate an example where the sequences cannot be trivially recovered.

Consider the process specification presented in Fig. 2, with explicit termination node

bend [23]. Each ei corresponds to an agent of processing some job, which can either

skip a job (denoted by si) or accept it (by ai and execute ei). Suppose that a sequence

<c1c2c3b1> is given. Simply filling the prerequisites of observed events from the

process specification, e.g., <s1s2s3s4> for b1, may not be able to form a valid recovery.

On the other hand, a large number of possible recoveries for the sequence could be

enumerated, e.g., <a1e1a2c1c2e2c3s3s4b1>, <a1e1a2c1

c2e2s3s4c3b1>; ... ;<s1a2e2s3s4c1c2c3b1>; .. . Efficiently finding the optimal

recovery (with the minimum change) is highly non-trivial.

Efficiently computing the recovery of missing events is essential and challenging

with the following considerations: (1) The number of transitions could be as large as

118 in real process/workflow specifications [7], and 1,000 in event sequences according

to our observation in Section 6. The number of possible paths (even w.r.t. specifications

without loops) is exponential to the number of transitions [6], [16], and thus may hardly

be regarded as a constant. Indeed, infinite sequences of events could be generated when

loops exist in process specifications. (2) Real-time business process monitoring [10],

e.g., detection of shoplifting, or large/ suspicious financial transactions [5], requires

recovering the missing events in an online manner. It further motivates us to develop

more efficient pruning and heuristics for missing event recovery.

Our main contributions in this paper are summarized as:

We propose a linear time backtracking algorithm for the recovery of a simple case,

where all the events are in parallel execution without any choices.

We reveal the NP-hardness of finding the minimum recovery of missing events in

general settings (with choices). To the best of our knowledge, this is the first study on

analyzing the hardness of the missing event recovery problem.

We present a branching framework for recovery in general cases. A branching

index with advanced pruning techniques are developed to speed-up recovery.

We devise efficient heuristic that performs directly on process specifications

without generating branching index. By precomputing possible executions (paths),

recovery efficiency is significantly improved. Several typical cases are studied, on

which the heuristic shows the best performance.

Finally, we report the extensive experimental evaluation on real data.

1 引言

业务流程持续生成大量的事件数据，从传统的企业办公自动化系统或科学的

工作流程[2]、[8]到最近的 Web 服务和在线交易[17]。在事件数据管理中，起源

分析[20]标识出通向数据项的步骤序列，而复杂事件处理[5]从数据中检测出有趣

的事件模式。虽然对事件数据的查询和挖掘得到了广泛的研究，但对事件数据本

身的质量关注较少。

根据我们对一家火车制造商记录的真实事件数据的调查，在总共 4470 个事

件序列中，至少有 47.66%的事件被遗漏了。丢失事件的发生有多种原因，例如(1)

手工记录事件日志时忘记提交，(2)系统故障，或(3)从异构执行环境收集事件后

出现混乱。在这个调查中，最典型的缺失事件是路由事件(47.66%的事件遗漏中

缺失事件占比为 41.43%)，它将任务(例如，由经理)传递或分发给一个或多个员

工进行后续处理。由于此类路由事件对产品没有任何改变，因此系统不收集这些

剩余66页未读，继续阅读

评论收藏

内容反馈

ProgrammerMonkey

粉丝: 43
资源: 37

论文《高效恢复缺失的事件》翻译

YOLO系列论文翻译

论文研究-基于网络内存的内存数据库高效恢复技术.pdf

从明星代言事件看中国广告业诚信缺失问题-论文.zip

SCI翻译论文的软件

【论文翻译】Fast R-CNN论文原文与中文翻译.rar

YOLO系列YOLOv1论文超详细解读（翻译 ＋学习笔记）.pdf

java基于蚁群算法路由选择可视化动态模拟(论文+开题报告+翻译+任务书+外文翻译)

opentld 基础论文翻译

JAVA基于蚁群算法路由选择可视化动态模拟(论文+开题报告+翻译+任务书+外文翻译).rar

毕业论文 外文翻译

电力系统毕业论文中英文外文文献翻译.zip毕业设计论文范文类参考资料下载

java办公自动化系统(源代码+论文+外文翻译)下载.zip

基于暗通道先验的单幅图像去雾论文翻译

vb中介管理系统（源代码+论文+开题报告+外文翻译+答辩ppt）.rar

毕业论文 中英文翻译

英文文献 翻译 中文 从全球博士学位论文全文数据库翻译过来的几篇英文文献

英语论文寒暄语对翻译的影响

基于暗通道先验的单幅图像去雾（何凯明）论文翻译

Google-GFS,Bigtable,Mapreduce三大论文英文原版+中文翻译

相关实用应用程序（Windows可用）

免费可用的ChatGPT网页版.zip

ChatGPT使用总结：150个ChatGPT提示词模板（完整版）

chromedriver-win64.zip

全国计算机二级WPSoffice精选350道选择题题库（含答案）.pdf

农村公交与异构无人机协同配送优化

李飞飞自传 我看见的世界 The World I see

哈尔滨工业大学-ChatGPT调研报告-2023.3.6-94页.pdf

4个亲测好用的ChatGPT4渠道

基于小波与卷积神经网络的多尺度时间序列分类.zip

学术海报模板+论文科研+研究生

最新资源

YOLO系列YOLOv1论文超详细解读（翻译＋学习笔记）.pdf

毕业论文外文翻译

毕业论文中英文翻译

英文文献翻译中文从全球博士学位论文全文数据库翻译过来的几篇英文文献

李飞飞自传我看见的世界 The World I see