Algorithms for anomaly detection of traces in logs of process aware information systems

Abstract

This paper discusses four algorithms for detecting anomalies in logs of process aware systems. One of the algorithms only marks as potential anomalies traces that are infrequent in the log. The other three algorithms (threshold, iterative, and sampling) are based on mining a process model from the log, or a subset of it. The algorithms were evaluated on a set of 1500 artificial logs, with different profiles on the number of anomalous traces and the number of times each anomalous trace was present in the log. The sampling algorithm proved to be the most effective solution. We also applied the algorithm to a real log, and compared the resulting detected anomalous traces with the ones detected by a different procedure that relies on manual choices.

1. Introduction and motivation

A process aware information system (PAIS) is “a software system that manages and executes operational processes involving people, applications, and/or information sources on the basis of process models” [17]. In this paper we are interested in systems in which the execution of said processes is not predefined. Such systems fall within the ad hoc and loosely framed systems described in [17]. The central aspect of such loosely framed systems is that the people executing the activities are in control of how the case will proceed next, in order to achieve the goals for the case. Such loosely framed systems include flexible workflow systems, case handling systems, and scientific workflows.

These systems allow for more control by the people executing the process, and are a possible solution to what has been called the inflexibility of workflow systems [25,33]. But such flexibility may come at a cost. Systems that are not under the control of a pre-specified process model may be subject to fraud and errors. Detecting such cases of fraud, exceptions, and errors, which we will call anomalies, is the goal of this research.

From the point of view of this research, the execution of a case or an instance of a process is a sequence of activities that were executed on behalf of that case. Thus the case “the firing of John Jacob Jingleheimer Schmidt” is an instance of a process of “firing”, and for Mr. Schmidt's case the following activities were executed: “inform Mr. Schmidt”, “calculate balance due”, “explain severance benefits”, and so on. In this paper, activities are considered atomic and their duration is not important, thus the set of activities executed can be seen as a sequence. Furthermore, we will not attribute meaningful names to the activities, but refer to them using single-letter names. Thus, Mr. Schmidt's firing case is seen as the sequence of activities abcbd, for example. Such sequences of single-letter activities are called traces. The set (or better, the multiset) of traces from which one is trying to identify the anomalies is called a log. Each trace can appear many times in the log (hence the multiset), and each appearance of a particular trace in the log is called a trace-instance.
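
The terminology of traces, logs, and trace-instances can be made concrete with a minimal Python sketch. It assumes a log is stored as a `collections.Counter` that maps each trace (a string of single-letter activities) to its multiplicity. The helper name `build_log` and every sample trace other than abcbd are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def build_log(trace_instances):
    """Collapse a list of trace-instances into a multiset of traces."""
    return Counter(trace_instances)

# Hypothetical trace-instances: each string is one executed case.
instances = ["abcd", "abcd", "abcbd", "abcd", "acbd"]
log = build_log(instances)

distinct_traces = len(log)                 # number of different traces
trace_instance_count = sum(log.values())   # number of trace-instances
```

Here the trace abcd has multiplicity 3, so the log holds three distinct traces but five trace-instances.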

This research presents results in detecting anomalies in logs of execution of PAIS, where the anomaly is detected solely based on the sequence and choices of activities that took place in that anomalous execution. Thus, using the example above, one would detect that Mr. Schmidt's firing was anomalous because the particular sequence abcbd of activities was too different from the sequences of activities for all or most of the other firing cases. For example, it may be the case that the activity “terminate Mr. Schmidt's system access” was performed much later than usual, which could indicate either that the system administrator was not properly trained regarding the security policies, or that there was collusion to allow Mr. Schmidt access to data he should no longer access.

Of course, the anomalous nature of a case may be derived from the values involved in some of the activities (for example, Mr. Schmidt's health benefits remain active for 300 months after his firing), or from the people who executed some of the activities (for example, the system access termination activity was executed by a senior vice president), or from the time to perform an activity or the whole process being greater or less than normal (for example, the calculation of balance due was faster than normal). We call these examples data, organizational, and time anomalies, to match the other four aspects of process models [36]. This research is restricted to control flow anomalies.

1.1. Toward a definition of anomalous trace

Anomalous traces, once discovered, must be analyzed to find out whether they are indeed examples of incorrect executions or are acceptable executions, and if they are found to be incorrect executions, the reasons for and consequences of these executions must be further investigated. Thus, the algorithms discussed in this paper must be used as a first automated step toward a more comprehensive security auditing practice for flexible or loosely framed PAIS. This places some constraints on the algorithms. If we focus on a fraud perspective, that is, that the anomalous traces are a possible indication of fraud, then missing any of the potentially fraudulent executions has serious consequences. Thus, the algorithms to detect the anomalous traces must have a very low false negative rate. The false negative cases are traces that the algorithm wrongly labeled as negative (or “normal”). Such false negative cases will not be forwarded to the specialists who would determine that the trace was indeed a fraud and take the appropriate measures. On the other hand, given that this human analysis of whether an anomalous trace is indeed a fraud is a costly process, one would also prefer algorithms with a low false positive rate; that is, the number of cases mistakenly flagged as anomalous when in fact they are not should also be kept low. But a low false positive rate is less important than a low false negative rate.
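
The two error rates discussed above can be illustrated with a short Python sketch. It assumes the usual definitions, which this excerpt does not spell out as formulas: the false negative rate is the share of truly anomalous traces that were not flagged, and the false positive rate is the share of truly normal traces that were flagged. The function name `error_rates` and the sample labels are hypothetical.

```python
def error_rates(is_anomalous, flagged):
    """Return (false_negative_rate, false_positive_rate).

    is_anomalous: ground-truth booleans, one per trace
    flagged:      detector output booleans, in the same order
    """
    pairs = list(zip(is_anomalous, flagged))
    fn = sum(1 for truth, flag in pairs if truth and not flag)
    fp = sum(1 for truth, flag in pairs if not truth and flag)
    positives = sum(1 for truth, _ in pairs if truth)
    negatives = len(pairs) - positives
    # Guard against division by zero when one class is empty.
    fnr = fn / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fnr, fpr

# Hypothetical labels: 2 anomalous traces, 4 normal traces.
truth = [True, True, False, False, False, False]
flagged = [True, False, True, False, False, False]
fnr, fpr = error_rates(truth, flagged)
```

In this made-up example one of the two anomalous traces is missed and one of the four normal traces is flagged, so under the fraud perspective the missed trace is the more costly mistake.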

If the anomalous traces are interpreted as errors, either erroneous execution or erroneous logging of the processes, then the imbalance of costs between a false negative and a false positive is less severe. A false negative will not generate the loss of revenue that an undetected fraud usually incurs. Therefore, under this perspective a more even balance between false negative and false positive rates should be the aim. In this paper we will also explore this alternative.

Finally, let us address the issue of what is an anomalous execution of a process. Chandola et al. [10], in an important survey on anomaly detection, observe that there is no formal definition of anomaly, only intuitions that guide the development of different algorithms and techniques. For example, one may have the intuition that “normal” data falls “together” (in some appropriate distance metric) and that anomalies are “spread apart”. This intuition based on distance leads to the development of many algorithms based on nearest neighbor [10, Section 5]. If, on the other hand, one has the intuition that anomalies are data points that have low probability of occurring (given the appropriate generative model for the “normal” data), this intuition leads to the development of the family of techniques described in [10, Section 7] as statistical detection models.

The same applies to our research: we have no formal definition of an anomalous trace, but we have some intuitions that guided the development of the algorithms discussed herein. They are:

• the set of executions can be partitioned into a set of normal and a set of anomalous executions,
• each anomalous execution is “infrequent” among the set of all executions, although the whole set of anomalous executions may not be that infrequent,
• the process models that “explain” the executions in the normal set “make sense”,
• the process models that could explain both the normal executions and some of the anomalous ones “make less sense”.

The terms “infrequent”, “explain”, and “make sense” need to be further refined if one wants to transform these intuitions into one or more algorithms. Nevertheless, these intuitions can be formalized in a more precise notation, leaving the uncertainties confined to a few constants and relations.

• given a set $\mathcal{A}$ of activity names,
• a trace $t$ is defined as $t \in \mathcal{A}^{*}$,
• a log $L$ is defined as a multiset of traces $L = \{t_i^{n_i}\}$, where $n_i$ is the multiplicity of the trace $t_i$ in the log,
• the size of a log $L$ is the number of trace-instances in it, that is $|L| = \sum_{t_i^{n_i} \in L} n_i$,
• the frequency of a trace $t_i$ in the log $L$ is defined as $\mathrm{freq}_L(t_i) = n_i / |L|$,
• there exists a constant $k$ that represents the term “infrequent”,
• there exists a relation “explain” between a process model $M$ and a log $L$, denoted by $M \vdash L$,
• there exists a partial order “make more sense” between models, denoted as $M_1 \succ M_2$, which indicates that $M_1$ “makes more sense” than $M_2$.

Now, our intuitions regarding anomalies can be pseudo-formalized as follows. Given a log $L$:

• $L$ can be partitioned into two multisets $A$ (anomalous) and $N$ (normal) such that $A \cup N = L$ and $A \cap N = \emptyset$,
• $\forall t_i \in A$, $\mathrm{freq}_L(t_i) < k$,
• let $M_N$ be the maximum under the partial order $\succ$ of $\{M \mid M \vdash N\}$,
• and let $M_L$ be the maximum under the partial order $\succ$ of $\{M \mid M \vdash L\}$,
• then $M_N \succ M_L$.
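
The frequency condition in this pseudo-formalization can be sketched in Python. The sketch assumes a log is a `collections.Counter` from trace to multiplicity, and that a trace counts as “infrequent” when its frequency is strictly below a constant `k`. The function names, the strict comparison, and the value `k=0.2` are assumptions for illustration only. The “explain” relation and the “make more sense” order require a process mining step and are not modeled here.

```python
from collections import Counter

def trace_frequency(log, trace):
    """Frequency of `trace`: its multiplicity divided by the log size."""
    size = sum(log.values())
    return log[trace] / size if size else 0.0

def split_by_frequency(log, k):
    """Partition the log into (candidates, rest) using threshold k.

    Traces whose frequency is strictly below k become candidate
    anomalies; all other traces go to the remaining multiset.
    """
    size = sum(log.values())
    candidates, rest = Counter(), Counter()
    for trace, n in log.items():
        target = candidates if n / size < k else rest
        target[trace] = n
    return candidates, rest

# Hypothetical log with 10 trace-instances over 3 distinct traces.
log = Counter({"abcd": 6, "acbd": 3, "abcbd": 1})
candidates, rest = split_by_frequency(log, k=0.2)
```

With these made-up numbers only abcbd (frequency 0.1) falls below the threshold. The two resulting multisets are disjoint and together give back the original log, matching the partition condition.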

1.2. Naive detection approach

Before we discuss these terms (“infrequent”, “explain”, and “make sense”), note that the intuitions above lend themselves to a first algorithm, which we call the naive algorithm. The naive algorithm resolves
