没有合适的资源?快使用搜索试试~ 我知道了~
资源推荐
资源详情
资源评论
Efficient Recovery of Missing Events
有效恢复缺失的事件
Abstract—For various entering and transmission issues raised by human or system,
missing events often occur in event data, which record execution logs of business
processes. Without recovering the missing events, applications such as provenance
analysis or complex event processing built upon event data are not reliable. Following
the minimum change discipline in improving data quality, it is also rational to find a
recovery that minimally differs from the original data. Existing recovery approaches
fall short of efficiency owing to enumerating and searching over all of the possible
sequences of events. In this paper, we study the efficient techniques for recovering
missing events. According to our theoretical results, the recovery problem appears to
be NP-hard. Nevertheless, advanced indexing, pruning techniques are developed to
further improve the recovery efficiency. The experimental results demonstrate that our
minimum recovery approach achieves high accuracy, and significantly outperforms the
state-of-the-art technique for up to five orders of magnitudes improvement in time
performance.
Index Terms—Data repairing, event data processing, petri net
摘要:对于人工或系统提出的各种输入和传输问题,事件数据中经常会出现丢失
事件,这些事件数据记录了业务流程的执行日志。如果不恢复丢失的事件,诸如
起源分析或基于事件数据的复杂事件处理等应用程序就不可靠。遵循改善数据质
量的最小变化原则,找到与原始数据最小差异的恢复也是合理的。由于枚举和搜
索所有可能的事件序列,现有的恢复方法缺乏效率。在这篇论文中,我们研究了
恢复丢失事件的有效技术。根据我们的理论结果,恢复问题似乎是 np 困难的。
然而,先进的索引剪枝技术进一步提高了恢复效率。实验结果表明,我们的最小
恢复方法具有较高的精度,在时间性能上显著优于最先进的技术,提高了 5 个数
量级。
关键词:数据修复,事件数据处理,petri 网
1INTRODUCTION
Business processes continuously generate huge volume of event data, ranging
from traditional enterprise office automation systems or scientific workflows [2], [8] to
recent Web services and online transactions [17]. In event data management,
provenance analysis [20] identifies the sequence of steps leading to a data item, and
complex event processing [5] detects interesting event patterns from data. While
querying and mining on event data are widely investigated, the quality of event data
itself draws less attention.
According to our survey of real event data recorded by a train manufacturer, at
least 47.66 percent events are missed in a total of 4, 470 event sequences. The missing
events occur for various reasons, such as (1) forgot to submit when manually recording
event logs, (2) suffered from system failures, or (3) mess after collecting the events
from heterogeneous execution environment. In this survey, the most typical missing
events are routing events (41.43 percent of the 47.66 percent missing events), which
pass or distribute tasks (e.g., by the manager) to one or multiple staffs for subsequent
processing. Since no changes have been made on products by such routing events, the
system did not collect these events (belonging to the aforesaid type (3) missing events).
Without recovering such missing routing events, the prerequisites of an event might be
absent, e.g., having no idea about which manager the task is passed from. The
provenance of a product item, i.e., the sequence of steps used to produce the item, is
unlikely to obtain.
Moreover, mining patterns over event data is important, since event patterns (with
distinct appearance frequencies) can be utilized as discriminative features in matching
and identifying heterogeneous events [28]. The basic idea is to enrich the features of
singleton events (less discriminative) by more complex event patterns (more
discriminative) in event matching (see more details in [27], [28]). As presented in [15],
interesting event patterns (a.k.a. episodes) that occur frequently could be discovered
from event sequences. For instance, a pattern
SEQ
(𝐵, 𝐴𝑁𝐷
(
𝐶,𝐷
)
,𝐸)
indicates that after
B, two events C and D are executed in parallel, followed by E. It could be discovered
with high frequency from a set of sequences <ABCDEH>, <ABDCEH>,
<ABCDEG>, ..., where the pattern appears. Again, without recovering the missing
events, e.g., C from the same sources owing to network interrupt, the event patterns
could no longer be discovered.
In general, the task of recovering missing events could hardly be performed
without any prior knowledge. Fortunately, most business events do not occur randomly.
Instead, event data often follow certain business rules or constraints, such as process
specifications [5]. We focus on recovering missing events in the light of process
specifications.
Example 1. Consider a real process specification in Fig. 1a for producing an
engineering drawing in a train manufacturer. Each square (namely transition) denotes
a task in the process specification, e.g., transition
𝐴
represents a task of drafting. All
the arrows attached to a transition denote that the corresponding flows should be
executed in parallel. For example, both the dimension checking (task C) and the
tolerance checking (task
𝐷
) should be conducted after line type proofing (task B) in
the drawing. Moreover, the process can carry on evaluating the drawing (task
𝐸
) only
if both C and D are executed. Circles in the figure, namely places always appearing
between transitions, could express choice semantics. That is, when multiple choices are
associated, only one of the flows going out a place can be executed. For instance, place
b6 leads to either revising the drawing (task F), archiving it (task G) or discarding it
(task H) after evaluation (E).
An execution of the process generates a sequence of events, where each event
corresponds to a task in the process specification. We say that a sequence conforms to
the specification if it successfully executes from the source place
𝑏
𝑠𝑡𝑎𝑟𝑡
to the sink
place bend following the flow constraints in the specification. For example, the first
sequence <ABCDEG> in Fig. 1b denotes a complete execution of engineering drawing
including steps drafting, line type proofing, dimension checking, tolerance checking,
evaluating, archiving from bstart to bend.
Owing to various data quality issues, event logs are often incomplete. For instance,
the second sequence <ABCEG> has an event D missed during the collection of event
logs from the database for dimension checking. Without recovering the missing event
D, it is unlikely to find the provenance step of E. Moreover, if such data transmission
problems occur frequently in the dimension checking database, an incorrect event
pattern without dimension checking step in engineering drawing will be mined.
The minimum change principle is widely considered in data repairing [3], [18],
following the intuition that systems or human always try to minimize their mistakes in
practice, i.e., to minimally miss events in our scenarios. For instance, to recover the
third sequence <ABCDG> in Fig. 1 in Example 1, a minimum recovery <ABCDEG>
could be more likely referring to the aforesaid intuition of minimum (missing) mistakes,
rather than <ABCDEFBCDEG>, <ABCDEFBCDEFBCDEG>, ... Indeed, as discussed
in Example 4, the chance that all events in a loop (e.g., FBCDE in the aforesaid recovery)
are missing is low in practice. Consequently, the accuracy of the former minimum
recovery could be generally higher than the latter excessive recoveries. Thereby, the
minimum recovery is also considered in the previous study of missing events [13].
Challenges: The hardness of recovering the missing events originates from the
complexity of the recovery problem. Efficiently obtaining an optimal recovery is not
trivial.
We illustrate an example where the sequences cannot be trivially recovered.
Consider the process specification presented in Fig. 2, with explicit termination node
bend [23]. Each ei corresponds to an agent of processing some job, which can either
skip a job (denoted by si) or accept it (by ai and execute ei). Suppose that a sequence
<c1c2c3b1> is given. Simply filling the prerequisites of observed events from the
process specification, e.g., <s1s2s3s4> for b1, may not be able to form a valid recovery.
On the other hand, a large number of possible recoveries for the sequence could be
enumerated, e.g., <a1e1a2c1c2e2c3s3s4b1>, <a1e1a2c1
c2e2s3s4c3b1>; ... ;<s1a2e2s3s4c1c2c3b1>; .. . Efficiently finding the optimal
recovery (with the minimum change) is highly non-trivial.
Efficiently computing the recovery of missing events is essential and challenging
with the following considerations: (1) The number of transitions could be as large as
118 in real process/workflow specifications [7], and 1,000 in event sequences according
to our observation in Section 6. The number of possible paths (even w.r.t. specifications
without loops) is exponential to the number of transitions [6], [16], and thus may hardly
be regarded as a constant. Indeed, infinite sequences of events could be generated when
loops exist in process specifications. (2) Real-time business process monitoring [10],
e.g., detection of shoplifting, or large/ suspicious financial transactions [5], requires
recovering the missing events in an online manner. It further motivates us to develop
more efficient pruning and heuristics for missing event recovery.
Our main contributions in this paper are summarized as:
We propose a linear time backtracking algorithm for the recovery of a simple case,
where all the events are in parallel execution without any choices.
We reveal the NP-hardness of finding the minimum recovery of missing events in
general settings (with choices). To the best of our knowledge, this is the first study on
analyzing the hardness of the missing event recovery problem.
We present a branching framework for recovery in general cases. A branching
index with advanced pruning techniques are developed to speed-up recovery.
We devise efficient heuristic that performs directly on process specifications
without generating branching index. By precomputing possible executions (paths),
recovery efficiency is significantly improved. Several typical cases are studied, on
which the heuristic shows the best performance.
Finally, we report the extensive experimental evaluation on real data.
1 引言
业务流程持续生成大量的事件数据,从传统的企业办公自动化系统或科学的
工作流程[2]、[8]到最近的 Web 服务和在线交易[17]。在事件数据管理中,起源
分析[20]标识出通向数据项的步骤序列,而复杂事件处理[5]从数据中检测出有趣
的事件模式。虽然对事件数据的查询和挖掘得到了广泛的研究,但对事件数据本
身的质量关注较少。
根据我们对一家火车制造商记录的真实事件数据的调查,在总共 4470 个事
件序列中,至少有 47.66%的事件被遗漏了。丢失事件的发生有多种原因,例如(1)
手工记录事件日志时忘记提交,(2)系统故障,或(3)从异构执行环境收集事件后
出现混乱。在这个调查中,最典型的缺失事件是路由事件(47.66%的事件遗漏中
缺失事件占比为 41.43%),它将任务(例如,由经理)传递或分发给一个或多个员
工进行后续处理。由于此类路由事件对产品没有任何改变,因此系统不收集这些
剩余66页未读,继续阅读
资源评论
ProgrammerMonkey
- 粉丝: 43
- 资源: 37
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功