Outlier Detection Techniques for Process Mining Applications
Abstract. Classical outlier detection approaches may fit process mining applications poorly, since in
these settings anomalies emerge not only as deviations from the sequences of events most often registered
in the log, but also as deviations from the behavior prescribed by some (possibly unknown) process
model. The paper addresses these issues with an approach for singling out anomalous evolutions
within a set of process traces, which takes into account both the statistical properties of the log and the
constraints associated with the process model. The approach combines the discovery of frequent
execution patterns with a cluster-based anomaly detection procedure; notably, this procedure is suited
to categorical data and is hence of interest in its own right, given that outlier detection has mainly
been studied on numerical domains in the literature. All the algorithms presented in the paper have been
implemented and integrated into a system prototype that has been thoroughly tested to assess its
scalability and effectiveness.
1 Introduction
Considerable effort has recently been spent in the scientific community and in industry on exploiting
data mining techniques for the analysis of process logs [12], and on extracting high-quality knowledge
about the actual behavior of business processes (see, e.g., [6,3]). In a typical process mining scenario,
a set of traces (registering the sequencing of activities performed along several enactments) is given and
the aim is to derive a model explaining all the episodes recorded in them. Eventually, the “mined” model
is used to (re)design a detailed process schema capable of supporting forthcoming enactments. As an
example, the event log (over activities a, b, ..., o) shown on the right side of Figure 1 might be given as
input, and the goal would be to derive a model like the one shown on the left side, representing a
simplified process schema in an intuitive notation where precedence relationships are depicted as
directed arrows between activities (e.g., b must be executed after a and concurrently with c).
In the paper, this peculiar aspect of process mining is investigated and the problem of singling out
exceptional individuals (usually referred to as outliers in the literature) from a set of traces is addressed.
Outlier detection has already found important applications in bioinformatics [1], fraud detection [5],
and intrusion detection [9], to cite just a few. When adapting these approaches to process mining
applications, however, novel challenges come into play:
(C1) On the one hand, looking only at the sequencing of the events may be misleading in some cases.
Indeed, real processes usually allow for a high degree of concurrency, and tend to produce many traces
that differ only in the ordering of parallel tasks. Consequently, merely applying existing
outlier detection approaches for sequential data to process logs may yield many false positives, as a
notable fraction of task sequences might have very low frequency in the log. As an example, in Figure
1, each of the traces in {s1, ..., s5} rarely occurs in the log, but none of them should be classified as
anomalous. Indeed, they correspond to different interleavings of the same enactment, which occurs in
10 of 40 traces.
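The effect described in (C1) can be illustrated with a minimal sketch. The log below is a hypothetical toy example, not the log of Figure 1, and grouping traces by their set of activities is only a crude stand-in for the structural patterns the paper relies on. The names `sequence_freq`, `enactment_freq`, and the 10% threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy log (NOT the log of Figure 1): after "a", the tasks
# "b", "c", "d" run in parallel, then "e" closes the enactment.
parallel_variants = ["a" + "".join(p) + "e" for p in permutations("bcd")]

# Each of the 6 interleavings is observed twice; a purely sequential
# path "afe" is observed 28 times, for 40 traces overall.
log = parallel_variants * 2 + ["afe"] * 28

# View 1: frequency of each exact event sequence.
sequence_freq = Counter(log)

# View 2: abstract from ordering by grouping on the set of activities.
enactment_freq = Counter(frozenset(trace) for trace in log)

THRESHOLD = 0.1  # assumed cut-off: below 10% of the log counts as "rare"

rare_by_sequence = [t for t, c in sequence_freq.items()
                    if c / len(log) < THRESHOLD]
rare_by_enactment = [s for s, c in enactment_freq.items()
                     if c / len(log) < THRESHOLD]

print(len(rare_by_sequence), "sequences look rare")
print(len(rare_by_enactment), "enactments look rare")
```

In this toy setting every interleaving of the parallel branch falls below the threshold when sequences are counted individually, whereas the enactment they share covers 12 of 40 traces and is not rare at all.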
(C2) On the other hand, considering only compliance with an ideal schema may lead to false negatives,
as some traces might well be supported by a model, yet represent a behavior that deviates from that
observed in the majority of the traces. As an example, in Figure 1, traces s6 and s7 correspond to the
same behavior where all the activities have been executed. Even though this behavior is admitted by the
process model on the left, it is anomalous since it only characterizes 3 of 40 traces.
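A minimal sketch of the situation in (C2), under stated assumptions: the precedence constraints and the log below are hypothetical (they do not reproduce the model or the traces of Figure 1), and the compliance check is a deliberately simple one that only verifies pairwise orderings. The names `PRECEDENCES`, `is_compliant`, and the 10% threshold are illustrative.

```python
from collections import Counter

# Hypothetical precedence constraints (x, y): whenever both activities
# occur in a trace, x must occur before y.
PRECEDENCES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

def is_compliant(trace):
    """Check a trace against the pairwise precedence constraints."""
    pos = {act: i for i, act in enumerate(trace)}
    return all(pos[x] < pos[y]
               for x, y in PRECEDENCES
               if x in pos and y in pos)

# Hypothetical log of 40 traces: most enactments run only one of b or c,
# while 3 traces execute every activity.
log = ["abd"] * 20 + ["acd"] * 17 + ["abcd"] * 2 + ["acbd"] * 1

# Frequency of each behavior, abstracting from the ordering of activities.
behaviour_freq = Counter(frozenset(t) for t in log)

THRESHOLD = 0.1  # assumed cut-off for "rare" behavior

# Traces that the model admits but whose behavior is rare in the log.
compliant_but_rare = sorted({
    t for t in log
    if is_compliant(t) and behaviour_freq[frozenset(t)] / len(log) < THRESHOLD
})

print(compliant_but_rare)
```

A detector based purely on compliance would accept every trace in this log, so the rare behavior is only exposed once frequencies are taken into account as well.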
Addressing (C1) and (C2) is complicated by the fact that the process model underlying a given set of
traces is generally unknown and must be inferred from the data itself. E.g., in our running example, a
preliminary question is how we can recognize the abnormality of s9, ..., s14 without any a priori
knowledge about the model for the given process.
Addressing this question and subsequently (C1) and (C2) is precisely the aim of the paper, where an
outlier detection technique tailored for process mining applications is discussed. In a nutshell, rather
than extracting a model that accurately describes all possible execution paths for the process (and
hence the anomalies as well), the idea is to capture the “normal” behavior of the process by simpler
(partial) models consisting of frequent structural patterns. More precisely, outliers are found by a
two-step approach:
• First, we mine the patterns of executions that are likely to characterize the behavior of a given
log. In fact, we specialize earlier frequent pattern mining approaches to the context of process
logs, by (i) defining a notion of pattern which effectively characterizes concurrent processes by
accounting for typical routing constructs, and by (ii) presenting an algorithm for their
identification.
• Second, we use an outlier detection approach which is cluster-based, i.e., it computes a
clustering for the log (where the similarity measure roughly accounts for how many patterns
jointly characterize the execution of the traces) and finds outliers as those individuals that
hardly belong to any of the computed clusters or that belong to clusters whose size is
definitely smaller than the average cluster size.
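The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithms of the paper: "eventually-follows" activity pairs stand in for the structural patterns, Jaccard similarity with a greedy leader clustering stands in for the clustering procedure, and all names, thresholds, and the toy log are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pairs(trace):
    """Ordered pairs (x, y) with x occurring before y in the trace.
    A crude stand-in for the structural patterns of step one."""
    return set(combinations(trace, 2))

def frequent_patterns(log, min_support):
    """Step 1: keep the pairs occurring in at least min_support of the traces."""
    counts = Counter(p for trace in log for p in pairs(trace))
    return {p for p, c in counts.items() if c / len(log) >= min_support}

def jaccard(s, t):
    """Share of patterns that two traces have in common."""
    return len(s & t) / len(s | t) if s | t else 0.0

def leader_clustering(features, min_sim):
    """Greedy clustering: a trace joins the first cluster whose leader
    is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (leader_features, member_indices)
    for i, feats in enumerate(features):
        for leader, members in clusters:
            if jaccard(leader, feats) >= min_sim:
                members.append(i)
                break
        else:
            clusters.append((feats, [i]))
    return [members for _, members in clusters]

def find_outliers(log, min_support=0.2, min_sim=0.5, size_ratio=0.5):
    """Step 2: flag traces in clusters much smaller than the average cluster."""
    patterns = frequent_patterns(log, min_support)
    features = [pairs(t) & patterns for t in log]
    clusters = leader_clustering(features, min_sim)
    avg_size = sum(len(c) for c in clusters) / len(clusters)
    return sorted({log[i] for c in clusters
                   if len(c) < size_ratio * avg_size for i in c})

# Hypothetical toy log: "abcd" and "acbd" interleave the parallel tasks b and c,
# "aefd" is an alternative branch, "adcb" is a single deviating trace.
log = ["abcd"] * 10 + ["acbd"] * 8 + ["aefd"] * 9 + ["adcb"]
print(find_outliers(log))
```

In this sketch the two interleavings share most of their frequent pairs and end up in the same cluster, while the deviating trace is left in a singleton cluster and is reported as an outlier. The thresholds are arbitrary choices and would need tuning on a real log.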