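Algorithms for anomaly detection of traces in logs of process aware information systems
This paper discusses four algorithms for detecting anomalies in the logs of process aware systems. The algorithms can be applied to detect anomalous traces in the logs of process aware information systems, in order to improve data quality and process reliability.
I. Introduction to process aware information systems
A process aware information system (PAIS) is a software system that manages and executes operational processes involving people, applications, and/or information sources on the basis of process models. Such systems can be ad hoc or loosely framed, for example flexible workflow systems, case handling systems, and scientific workflows. In these systems, the people executing the activities are in control of deciding how the case proceeds next, in order to achieve the goals of the case.
II. Trace anomaly detection algorithms
Detecting anomalous traces in the logs of process aware information systems helps to improve data quality and process reliability. The paper discusses four algorithms for this task:
1. Infrequent traces algorithm: marks traces that are infrequent in the log as potential anomalies.
2. Threshold algorithm: based on mining a process model from the log, or a subset of it, and detects anomalous traces by applying a threshold.
3. Iterative algorithm: based on mining a process model from the log, or a subset of it, and detects anomalous traces iteratively.
4. Sampling algorithm: based on mining a process model from the log, or a subset of it, and detects anomalous traces by sampling.
III. Evaluation
The algorithms were evaluated on a set of 1500 artificial logs, with different profiles on the number of anomalous traces and the number of times each anomalous trace was present in the log. The sampling algorithm proved to be the most effective solution.
IV. Practical application
The algorithm was also applied to a real log, and the detected anomalous traces were compared with the ones detected by a different procedure that relies on manual choices.
V. Conclusion
The algorithms can detect anomalous traces in the logs of process aware information systems, and thereby improve data quality and process reliability. The four algorithms discussed can be applied in different scenarios to meet different needs.
VI. Future work
Future work is to keep studying and improving algorithms for anomaly detection of traces in logs of process aware information systems, to meet changing needs and challenges, and to apply the algorithms to more domains.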
Algorithms for anomaly detection of traces in logs of process aware information systems
Abstract
This paper discusses four algorithms for detecting anomalies in logs of process aware systems.
One of the algorithms only marks as potential anomalies traces that are infrequent in the log. The
other three algorithms: threshold, iterative and sampling are based on mining a process model from
the log, or a subset of it. The algorithms were evaluated on a set of 1500 artificial logs, with different
profiles on the number of anomalous traces and the number of times each anomalous trace was
present in the log. The sampling algorithm proved to be the most effective solution. We also applied
the algorithm to a real log, and compared the resulting detected anomalous traces with the ones
detected by a different procedure that relies on manual choices.
1. Introduction and motivation
Process aware information systems (PAISs) are ‘‘a software system that manages and executes
operational processes involving people, applications, and/or information sources on the basis of
process models’’ [17]. In this paper we are interested in systems in which the execution of said
processes is not defined beforehand. Such systems fall within the ad hoc and loosely framed
systems described in [17]. The central aspect of such loosely framed systems is that the people who
are executing the activities are in control and decide how the case will proceed next, in order to
achieve the goals for the case. Such loosely framed systems include flexible workflow systems, case
handling systems, and scientific workflows.
These systems allow for more control by the people executing the process, and are a possible
solution to what has been called the inflexibility of workflow systems [25,33]. But such flexibility
may come with a cost. Systems that are not under control of a pre-specified process model may be
subject to frauds and errors. Detecting such cases of frauds, exceptions and errors, which we will
call anomalies, is the goal of this research.
From the point of view of this research, the execution of a case or an instance of a process is a
sequence of activities that were executed on the behalf of that case. Thus the case ‘‘the firing of
John Jacob Jingleheimer Schmidt’’ is an instance of a process of ‘‘firing’’, and for Mr. Schmidt’s case
the following activities were executed: ‘‘inform Mr. Schmidt’’, ‘‘calculate balance due’’, ‘‘explain
severance benefits’’ and so on. In this paper, activities are considered atomic and their duration is
not important, thus the set of activities executed can be seen as a sequence. Furthermore, we will not
attribute meaningful names to the activities, but refer to them using single letter names. Thus, Mr.
Schmidt’s firing case is seen as the sequence of activities abcbd, for example. Such sequences of
single letter activities are called traces. The set (or better, the multiset) of traces from which one is
trying to identify the anomalies is called a log. Each trace can appear many times in the log, which is
why the log is a multiset, and each occurrence of a particular trace in the log is called a trace-instance.
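As a minimal sketch of this terminology, a log can be held as a multiset that maps each distinct trace to the number of its trace-instances. The sample traces and the helper name `build_log` are hypothetical and serve only as an illustration; the paper does not prescribe a data structure.

```python
from collections import Counter

def build_log(trace_instances):
    """Collapse a list of trace-instances into a multiset: trace -> multiplicity."""
    return Counter(trace_instances)

# Hypothetical trace-instances; each letter stands for one atomic activity.
instances = ["abcd", "abcd", "abcbd", "abcd", "acbd"]
log = build_log(instances)

num_traces = len(log)              # number of distinct traces
num_instances = sum(log.values())  # number of trace-instances in the log
```

Here the trace abcd has three trace-instances, while abcbd and acbd have one each.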
This research presents results in detecting anomalies in logs of execution of PAIS, where the
anomaly is detected solely based on the sequence and choices of activities that took place in that
anomalous execution. Thus, using the example above, one would detect that Mr. Schmidt’s firing was
anomalous because the particular sequence abcbd of activities was too different from the sequences
of activities for all or most of the other firing cases. For example, it may be the case that the activity
‘‘terminate Mr. Schmidt system access’’ was performed much later than usual, which could indicate
either that the system administrator was not properly trained regarding the security policies, or that
there was collusion to allow Mr. Schmidt access to data he should no longer access.
Of course, the anomalous nature of a case may be derived from the values involved in some of the
activities (for example, Mr. Schmidt’s health benefits remain active for 300 months after his firing),
or from the people who executed some of the activities (for example, the system access
termination activity was executed by a senior vice president), or from the time to perform an
activity or the whole process being greater or less than normal (for example, the calculation of balance
due was faster than normal). We call these examples data, organizational, and time anomalies, to
match the other four aspects of process models [36]. This research is restricted to control flow
anomalies.
1.1. Toward a definition of anomalous trace
Anomalous traces, once discovered, must be analyzed to find out if indeed they are examples of
incorrect executions or if they are acceptable executions, and if they are found to be incorrect
executions, the reasons for and consequences of these executions must be further investigated. Thus,
the algorithms discussed in this paper must be used as a first automated step toward a more
comprehensive security auditing practice for flexible or loosely framed PAIS. This places some
constraints on the algorithms. If we focus on a fraud perspective, that is, that the anomalous traces
are a possible indication of fraud, then missing any of the potentially fraudulent executions has
serious consequences. Thus, the algorithms to detect the anomalous traces must have a very low
false negative rate. The false negative cases are traces that the algorithm flagged as negative (or
‘‘normal’’) when that attribution was wrong. Such false negative cases will not be forwarded to the
specialists who would determine whether the trace was indeed a fraud and take the appropriate measures.
On the other hand, given that this human analysis of whether an anomalous trace is indeed a fraud
is a costly process, one would also prefer algorithms with a low false positive rate: the number of
cases that are mistakenly flagged as anomalous when in fact they are not should also
be kept low. But a low false positive rate is less important than a low false negative rate.
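The two error rates can be illustrated with a short sketch. It assumes that ground-truth labels are available and counts distinct traces rather than trace-instances; both choices, as well as the function name `error_rates` and the sample data, are assumptions made for illustration and are not taken from the paper.

```python
def error_rates(all_traces, truly_anomalous, flagged):
    """Return (false_negative_rate, false_positive_rate) over distinct traces."""
    all_traces = set(all_traces)
    truly_anomalous = set(truly_anomalous)
    flagged = set(flagged)
    truly_normal = all_traces - truly_anomalous

    false_negatives = truly_anomalous - flagged  # anomalies the detector missed
    false_positives = flagged & truly_normal     # normal traces flagged by mistake

    fnr = len(false_negatives) / len(truly_anomalous) if truly_anomalous else 0.0
    fpr = len(false_positives) / len(truly_normal) if truly_normal else 0.0
    return fnr, fpr

# Hypothetical labels for five distinct traces.
all_traces = {"abcd", "acbd", "abd", "abcbd", "adcb"}
truly_anomalous = {"abcbd", "adcb"}
flagged = {"abcbd", "abd"}

fnr, fpr = error_rates(all_traces, truly_anomalous, flagged)
```

In this example one of the two anomalous traces is missed and one of the three normal traces is flagged. Under the fraud perspective the missed trace is the more serious error, because it never reaches a human specialist.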
If the anomalous traces are interpreted as errors, either erroneous execution or erroneous logging of
the processes, then the imbalance of costs between a false negative and a false positive is less severe.
A false negative will not generate the loss of revenue that an undetected fraud usually incurs.
Therefore, under this perspective, a more even balance between false negative and false positive rates
should be the aim. In this paper we will also explore this alternative.
Finally, let us address the issue of what is an anomalous execution of a process. Chandola et al.
[10], in an important survey on anomaly detection, argue that there is no formal definition of
anomaly, only intuitions that guide the development of different algorithms and techniques. For
example, one may have the intuition that ‘‘normal’’ data falls ‘‘together’’ (in some appropriate
distance metric) and that anomalies are ‘‘spread apart’’. This intuition based on distance leads to the
development of many algorithms based on nearest neighbor [10, Section 5]. If on the other hand,
one has the intuition that anomalies are data points that have low probability of occurring (given the
appropriate generative model for the ‘‘normal’’ data), this intuition leads to the development of
the family of techniques described in [10, Section 7] as statistical detection models.
The same applies to our research: we have no formal definition of an anomalous trace, but we have
some intuitions that guided the development of the algorithms discussed herein. They are:
• the set of executions can be partitioned into a set of normal and a set of anomalous executions,
• each of the anomalous executions is ‘‘infrequent’’ among the set of all executions, although
the whole set of anomalous executions may not be that infrequent,
• the process models that ‘‘explain’’ the executions in the normal set ‘‘make sense’’,
• the process models that could explain both the normal executions and some of the
anomalous ones ‘‘make less sense’’.
The terms ‘‘infrequent’’, ‘‘explain’’ and ‘‘make sense’’ need to be further refined if one wants to
transform these intuitions into one or more algorithms. Nevertheless, these intuitions can be
formalized in a more precise notation, leaving the uncertainties confined to a few constants
and relations.
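• given a set of activity names $A$,
• a trace $t$ is defined as a finite sequence of activity names, $t \in A^*$,
• a log $L$ is defined as a multiset of traces $L = \{t_1^{n_1}, t_2^{n_2}, \ldots, t_k^{n_k}\}$, where $n_i$ is the multiplicity of the trace $t_i$ in the log,
• the size of a log is the number of trace-instances in it, that is, $|L| = \sum_{i=1}^{k} n_i$,
• the frequency of a trace $t_i$ in the log $L$ is defined as $\mathrm{freq}(t_i, L) = n_i / |L|$,
• there exists a constant $\rho$ that represents the term ‘‘infrequent’’,
• there exists a relation ‘‘explain’’ between a process model $M$ and a log $L$, denoted by $M \vdash L$,
• there exists a partial order ‘‘make more sense’’ between models, denoted as $M_1 \succ M_2$, which indicates that $M_1$ ‘‘makes more sense’’ than $M_2$.
Now, our intuitions regarding anomalies can be pseudo-formalized as follows. Given a log $L$:
• $L$ can be partitioned into two multisets $A_L$ (anomalous) and $N_L$ (normal) such that $A_L \cup N_L = L$ and $A_L \cap N_L = \emptyset$,
• $\forall t \in A_L$, $\mathrm{freq}(t, L) < \rho$,
• let $M_N$ be the maximum under the partial order of $\{M \mid M \vdash N_L\}$,
• and let $M_{N+}$ be the maximum under the partial order of $\{M \mid M \vdash N_L \cup X\}$ for some non-empty $X \subseteq A_L$,
• then $M_N \succ M_{N+}$.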
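The frequency part of this formalization can be sketched in a few lines. The sketch covers only the frequency condition and says nothing about the ‘‘explain’’ relation or the ‘‘make more sense’’ order, which need a process mining step. The threshold name `rho`, its value, the function names, and the sample log are assumptions made for illustration.

```python
from collections import Counter

def frequency(trace, log):
    """Multiplicity of a trace divided by the number of trace-instances in the log."""
    return log[trace] / sum(log.values())

def split_by_frequency(log, rho):
    """Split a log into (candidate anomalous, remaining) multisets by trace frequency."""
    candidates = Counter({t: n for t, n in log.items() if frequency(t, log) < rho})
    remaining = log - candidates  # multiset difference drops the candidate traces
    return candidates, remaining

# Hypothetical log with 100 trace-instances over four distinct traces.
log = Counter({"abcd": 90, "acbd": 8, "abcbd": 1, "adcb": 1})
candidates, remaining = split_by_frequency(log, rho=0.05)
```

With `rho = 0.05`, the traces abcbd and adcb (frequency 0.01 each) become candidate anomalies, while abcd and acbd stay in the remaining multiset. Infrequency alone is only a necessary condition here: an infrequent trace may still be a legitimate execution.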
1.2. Naive detection approach
Before discussing these terms (‘‘infrequent’’, ‘‘explain’’, and ‘‘make sense’’), the intuitions above
lend themselves to a first algorithm, which we call the naive algorithm. The naive algorithm resolves