没有合适的资源?快使用搜索试试~ 我知道了~
资源推荐
资源详情
资源评论
Heuristic Recovery of Missing Events in Process Logs
Abstract—Event logs are of paramount significance for process mining and complex
event processing. Yet, the quality of event logs remains a serious problem. Missing
events of logs are usually caused by omitting manual recording, system failures, and
hybrid storage of executions of different processes. It has been proved that the problem
of minimum recovery based on a priori process specification is NP-hard. State-of-the-
art approach is still lacking in efficiency because of the large search space. To address
this issue, in this paper, we leverage the technique of process decomposition and present
heuristics to efficiently prune the unqualified sub-processes that fail to generate the
minimum recovery. We employ real-world processes and their incomplete sequences
to evaluate our heuristic approach. The experimental results demonstrate that our
approach achieves high accuracy as the state-of-the-art approach does, but it is more
efficient.
Keywords : missing events; heuristic recovery; Petri nets; process decomposition;
trace replaying
流程日志中丢失事件的启发式恢复
摘要:事件日志对于流程挖掘和复杂事件处理具有重要意义。然而,事件日志的质量仍然
是一个严重的问题。日志丢失事件通常是由于忽略了手动记录、系统故障和不同流程的执
行的混合存储而导致的。证明了基于先验流程规范的最小回收问题是 NP-hard 问题。由于
搜索空间大,最先进的方法仍然缺乏效率。为了解决这个问题,在这篇论文中,我们利用
了流程分解的技术,并提出了启发式的方法来有效地剪枝不合格的(即不能产生)最小恢
复的子流程。我们使用真实世界的流程及其不完全序列来评估我们的启发式方法。实验结
果表明,该方法与现有的方法相比具有较高的精度,但效率更高。
关键词:缺失事件;启发式恢复;Petri 网;流程分解;轨迹重演
1. INTRODUCTION
Contemporarily, business processes of enterprises are event-driven [1], [2]. For one
thing, the executions of business processes are triggered and piloted by events. For
another, business processes continuously generate large volume of event data when
they are in execution. Accordingly, event logs play an increasingly crucial role in
today’s companies. However, the quality of event logs still remains a big problem,
which seriously affects the analysis results of many business intelligence functionalities,
e.g., complex event processing [3], provenance analysis [4], and process mining [5], to
name a few.
Missing events are the major reasons for the low-quality event logs [6], [7]. In
practice, there are several scenarios which could cause event missing in process logs,
such as omitting manual recording, temporary system failures, and hybrid storage of
executions of different processes [6], [7], [8], [9]. We are not surprised that it is hard to
recover the missing events without any predefined knowledge. In this paper, we study
the problem of automatic recovery of missing events in the light of a priori process
specification. Without losing generality, we employ a special kind of Petri nets [10],
that is, workflow nets [11], to represent process specifications.
For an incomplete event sequence, there can be multiple recoveries because of
parallel routings (i.e., concurrencies) and alternative routings (i.e., choices) in the
process specification. If the specification involves iterative routings (i.e., loops), the
number of recoveries is even infinite. Following the minimum change principle in
improving data quality [12], [13], we focus on finding the optimal recovery which is
minimally different from the original incomplete sequence. This is a rational choice,
because the probability of missing more events is much smaller than that of missing
less events. Unfortunately, it is an NP-hard problem to find the minimum recovery of
an incomplete sequence in a predefined process specification [7]. On the other hand,
the process system continuously produces event sequences in real-time. Hence, the
generated sequences can be regarded as a kind of big data. For these reasons, it is of
great importance to efficiently recover the incomplete sequences.
In order to fulfill the efficient recovery, the state-of-the-art approach (i.e., the
branching approach [7]) first maps a process specification modeled as a Petri net into a
homomorphic specification (called branching net) which does not contain OR-joins (i.e.,
merges of different alternative braches), and then utilizes advanced indexing and pruning
techniques to reduce the search space. In the spite the fact that the branching approach is
much more efficient than the exhaustive search, there is still much room for the
improvement.
In this paper, we present a more efficient approach for the minimum recovery.
Considering the fact that a process specification may be composed by several
independent subprocesses, the minimum recovery of an incomplete sequence can only
be in one of the sub-processes. With this in mind, we propose heuristics to identify the
sub-process that can generate the minimum recovery, while other sub-processes that
fail to generate the minimum recovery are pruned without further investigation. In this
way, the search space is significantly reduced. According to the qualified sub-process,
we propose a linear-time algorithm of trace replaying to find the minimum recovery.
This algorithm is promising in practice. For one thing, it avoids enumerating all possible
event interleavings to find the minimum recovery. For another, it further decreases the
time cost by combining the conformance checking [5] and subsequent sequence
recovery into one single step.
To evaluate our approach, we perform experiments by applying our approach and
the state-of-the art approach (i.e., the branching approach [7]) to a data set which
consists of 36 real-world process specifications and their incomplete event sequences.
The experimental results demonstrate that our approach achieves high accuracy for the
recovery, and is more efficient than the branching approach.
In a nutshell, the key contribution of our work is threefold:
We present a linear time algorithm which is based on trace replaying to
generate the minimum recovery of an incomplete event sequence from a
qualified sub-process with no choices.
We present a heuristics recovery approach for process specifications with
choices and loops. More specifically, we first decompose the process into
several subprocesses, and then present heuristics to prune unqualified sub-
processes that fails to generate the minimum recovery, thus significantly
accelerating the recovery process.
We evaluate our approach by employing a real-world data set to compare our
approach with the state-of-the-art approach. The experimental results show
the effectiveness and efficiency of our approach.
The remainder of this paper is organized as follows. Section II introduces the
background of the research problem. Section III proposes our sequence recovery
approach. Section IV shows our experimental results. Section V reviews related
work. Section VI concludes the paper and envisions the future work.
1 引言
同时,企业的业务流程是事件驱动的[1]、[2]。首先,业务流程的执行是由
事件触发和引导的。另一方面,业务流程在执行时不断地生成大量事件数据。
因此,事件日志在当今的公司中扮演着越来越重要的角色。然而,事件日志的
质量仍然是一个大问题,这严重影响了许多业务智能功能的分析结果,例如复
杂事件处理[3]、来源分析[4]和流程挖掘[5]等。
事件缺失是导致事件日志质量低下的主要原因[6]、[7]。在实践中,有几种
情况可能会导致流程日志中的事件丢失,例如忽略手动记录,临时系统故障以
及不同进程执行的混合存储[6],[7],[8],[9]。我们并不感到惊讶,这是很难恢
复丢失的事件没有任何预定义的知识。在本文中,我们依据一个先验的流程规
范研究自动恢复丢失的事件问题。在不失一般性的情况下,我们采用一种特殊
的 Petri 网[10],即工作流网[11]来表示流程规范。
对于不完整的事件序列,由于流程规范中存在并行路由(即,并发)和替
代路由(即,选择),会存在多种修复序列。如果规范涉及迭代路由(即,循
环),恢复的数量甚至是无限的。遵循改善数据质量的最小变化原则[12],[13],
我们专注于寻找与原始不完整序列差异最小的最佳恢复。这是一个理性的选择,
因为丢失更多事件的概率远小于错过更少事件的概率。不幸的是, 从预定义流
程规范中找到不完整序列的最小恢复是 NP-hard 问题[7]。另一方面,流程系统
连续地实时产生事件序列。因此,生成的序列可以被视为一种大数据。因此,
有效地恢复不完整的序列是非常重要的。
为了实现有效的恢复,最先进的方法(即分支方法[7])首先将流程规范成的
Petri 网映射为不包含 OR-joins(即不同可选分支的合并)的同态规范(称为分支网),
然后利用先进的索引和剪枝技术来减少搜索空间。尽管分支搜索法比穷举搜索
法高效得多,但仍有很大的改进空间。
在本文中,我们提出了一种更有效的最小恢复方法。考虑到流程规范可能
由几个独立的子流程组成,不完整序列的最小恢复只能在其中一个子流程中。
考虑到这一点,我们提出了识别能够生成最小恢复的子流程的启发式方法,而
其他不能生成最小恢复的子流程则在没有进一步研究的情况下被剪枝。这样,
搜索空间就大大减少了。根据限定子流程,提出了一种轨迹重演的线性时间算
法来求最小恢复。该算法具有较好的应用前景。首先,它避免列举所有可能的
事件交错来寻找最小恢复。另一方面,通过将合规检查[5]和后续的序列恢复合
并到一个步骤中,进一步降低了时间成本。
为了评估我们的方法,我们通过将我们的方法和最新的方法(即分支方法[7])
应用于一个数据集合来执行实验,该数据集由 36 个真实世界的流程规范及其不
完整的事件序列组成。实验结果表明,该方法具有较高的恢复精度,且比分支
方法更有效。
简而言之,我们工作的主要贡献有三个方面:
我们提出了一种基于轨迹重演的线性时间算法,从无选择的限定子流
程中产生不完整事件序列的最小恢复。
我们提出了一种启发式恢复方法,用于带有选择和循环的流程规范。
更具体地说,我们首先将流程分解为几个子流程,然后提出启发式方
法来删除未生成最小恢复的不合格子流程,从而显著加快恢复流程。
我们通过使用真实世界的数据集合来比较我们的方法和最先进的方法
来评估我们的方法。实验结果表明了该方法的有效性和效率。
这篇论文的其余部分组织如下。第二部分介绍了研究问题的背景。第三部
分提出了序列恢复方法。第四部分是我们的实验结果。第五节审查有关工作。
第六部分对论文进行了总结,并对今后的工作进行了展望。
II. RELATED WORK
Although data recovery is intensively studied in the field of data mining [14],
recovering process logs is a novel research topic. It is notable that log recovery is highly
relevant to conformance checking of business processes, so we first review related work
on conformance checking.
As a kind of process mining technique, conformance checking focuses on
calculating the consistency degree (ranging from 0 to 1.0) between a process and its
event log [15], [16], [17], [18]. It has two major applications. First, it is used to
determine to what extent an event sequence conforms to a predefined process. Second,
剩余32页未读,继续阅读
资源评论
ProgrammerMonkey
- 粉丝: 43
- 资源: 37
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功