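Summary: This paper presents two mathematical models that use communication records to identify the participants in a conspiracy inside a company: a clustering model based on a vector space and a model based on an electric network. The vector space model encodes the interaction frequencies between employees and their associations with topics, and uses k-means clustering to separate the known guilty and known innocent employees. The electric network model builds a graph whose nodes represent employees and topics and whose edge weights represent communication frequency; known guilty nodes are held at voltage 1, known innocent nodes at voltage 0, and the suspicion level of each unknown node is obtained by solving the circuit. The experimental results indicate that both models can identify potential suspects. Intended audience: data scientists, algorithm engineers, and information security researchers. Use cases and goals: internal corporate investigations, legal case support, and other security-related tasks that require extracting key participants and patterns from complex communication data, with the goal of quickly locating suspicious behavior in large volumes of data. Additional notes: beyond showing effectiveness under the given conditions, the paper verifies the models' robustness and resistance to interference through leave-one-out testing, and discusses in detail the handling of repeated names and directions for future improvement.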
Crime Ring Analysis with Electric Networks
Team 17160
February 13, 2012
Abstract
In this report, we identify the participants in a corporate conspiracy plan using records
of messages sent and received by employees of the company. We detail the motivation,
mathematical details, robustness testing, strengths, and weaknesses of the two models
we co-developed and used, and present and interpret their results. Both models use the
senders and receivers of the messages, as well as their topic labels, but not timestamps
or full-text content, which are not available to us.
Our first model is a vector space model which maps each of 83 nodes to a vector
whose entries represent the number of communications between a node and each other
node, as well as the number of communications involving a node and containing a
specified topic (the vectors are of length 83 + 15 = 98). Within this model, we use
the Euclidean norm as a distance measure, and show that k-means clustering into
two clusters meaningfully separates known guilty and non-guilty employees when both
node-node and node-topic interactions are considered, as well as when only node-topic
interactions are considered, in both the case at hand and the earlier example
case. We also use cluster analysis to determine that, among all cases of repeated
names, Beth is the most likely to be represented by two nodes. The clustering model
does not assume known guilt or innocence a priori and does not strongly suggest any
additional conspirators. However, its ability to separate known guilty and innocent
employees lends credibility to the use of linear spaces and equally weighted
node-topic connections in our second, main model.
Our main model was an electrical circuit in which each node and topic is represented
by a vertex in the circuit graph, and people and topics are connected via a conductance
proportional to the number of interactions between them (these are the same as the
corresponding entries in the vector space model). In this model, the people and topics
given to be guilty were set to a reference voltage of 1, while those given to be innocent
were grounded to 0. The voltages of unknown nodes could then be determined by
solving the circuit in a DC setting. We chose this model because of its simplicity (to
prevent overfitting), expressiveness, flexibility to accommodate new types of data and
new problem types, and because the metrics it uses are supported by the clustering
model.
We tested the robustness of our model by checking for leave-one-out discrepancies
– that is, we experimentally ran the model once for each known conspirator, leaving
Control Number: 17160 Page 2 of 23
that conspirator out of the set of vertexes held at 1V, and analyzed the consistency of
these results with our main results. Our model held up to this analysis, showing that
our results are not overly sensitive to the given known conspirators, and do not imply
guilt by direct association to any individual. We likewise performed leave-one-out tests
for each suspicious topic, and again the model held up. We ran both the model and its
validation for the case of Chris being innocent and for the case of Chris
being guilty (and topic 1 being suspicious).
We present our results and recommendations for who should be further investigated.
Contents
1 Problem Statement
2 Approach Philosophy
3 Data Interpretation and Assumptions
4 Clustering Model
4.1 Model Description
4.2 Figuring Out Duplicate Names
4.3 Validating Parameters
5 Electric Network Model
5.1 Introduction and Motivation
5.2 Bipartite Graph Circuit Setup
5.2.1 Structure and Definitions
5.2.2 Node to Topic Conductivity
5.2.3 Node to Node Conductivity
5.2.4 Topic to Topic Conductivity
5.3 Applicability to Generic Agent-Ranking Problems
5.4 Strengths and Weaknesses of the Model
6 Results of Model
6.1 Immediate Results
6.1.1 Involvement of Managers
6.1.2 Effects of Additional Known Conspirators or Topics
6.2 Robustness of Results
6.2.1 Leave-One-Out Testing
6.2.2 Merging Repeated Names
6.3 Determining Potential Cutoff
7 Future work
7.1 Semantic Network Analysis
7.2 Text Analysis
8 Recommendation to the DA
A Complete Listing of Results
1 Problem Statement
In this problem, we aim to uncover the participants in a corporate conspiracy plan using
records of messages sent and received by employees of the company. We are not given access
to the full-text content of the messages – however, the messages are pre-labeled using 15
general conversation topics, the descriptions of which are given. Using this information, as
well as a given set of known conspirators and known non-conspirators, we aim to identify a
set of likely conspirators among the remaining employees. Of particular interest are three of
the company’s senior members: Gretchen, Jerome, and Delores.
2 Approach Philosophy
The development of our model was guided by a few basic philosophies, which arise from the
legal nature of the problem and the nature and scope of the example given.
First, we wanted to avoid overfitting our data. Specifically, we determined that the general
process of:
• Coming up with a generic model
• Fitting its parameters to best explain the given example
• Running the model with those parameters on the case at hand
would not be an effective approach. For example, a model with three weight parameters,
tested at 11 values each (say 0 to 1 in increments of 0.1) would result in a search space of
1331 models. Among 10 people in the training example, there are 1024 possible combinations
of guilty and innocent people. With no other assumptions, we would expect
$\frac{1331}{1024} > 1$ set of three parameters to happen to match the results of the
given example, regardless of whether or not those parameters result in a good general model.
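The counting argument above can be checked in a few lines. This is only an illustrative sketch: the variable names are ours, and the final ratio rests on the simplifying assumption that each parameter setting is equally likely to reproduce any one of the possible guilty/innocent labelings.

```python
# Size of the parameter grid versus the number of possible labelings.
n_params = 3            # weight parameters in the hypothetical model
values_per_param = 11   # 0.0, 0.1, ..., 1.0
n_people = 10           # people in the training example

n_models = values_per_param ** n_params   # 11^3 candidate parameter settings
n_labelings = 2 ** n_people               # 2^10 guilty/innocent assignments

# Assumption: each setting reproduces a given labeling with probability
# 1 / n_labelings, so this is the expected number of chance matches.
expected_chance_matches = n_models / n_labelings

print(n_models, n_labelings, expected_chance_matches)
```

A ratio above 1 means that at least one parameter setting would be expected to fit the example by chance alone, which is the overfitting risk described above.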
Instead, we chose to develop two simple concurrent models that support each other. Our
main model is an electric network model that assumes the guilt of those given to be guilty
and flags topics given to be suspicious. To support this model and give credibility to the
metrics, choices of fixed parameters, and data assumptions it uses, we developed an auxiliary
clustering model. Both of these models are discussed in detail in sections 4 and 5.
Second, we wanted to avoid models that implied any level of presumption of guilt in any
intermediate term (e.g. conditional probability terms in an intermediate iteration of a
PageRank-like algorithm). Because the problem statement is ambiguous about the legal proceedings
themselves, and about exactly why the data is abridged the way it is, we wanted to avoid the
chance of using any intermediate assumptions (even in an iterative calculation) that could
later undermine the admissibility of subsequent evidence collected on the basis of our
recommendations (for example, a violation of probable cause).
Third, we understand that any algorithmic approach based on communication data
risks implying some degree of guilt by association. In order to protect the credibility of our
work, we used leave-one-out testing as a measure of the robustness of our results. That is,
for each person known to be involved in the conspiracy, we ran the model without assuming
them guilty, to ensure that none of our results depends too heavily on associations with any
particular individual being assumed guilty.
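A minimal sketch of this leave-one-out procedure on a toy network is given below. Everything in it is illustrative: the five-node network, the edge format, and the function names are hypothetical, and the voltages are found by simple iterative relaxation (each unknown node is repeatedly set to the conductance-weighted average of its neighbors), which is one standard way to solve such a circuit and not necessarily the solver used for our results.

```python
def solve_voltages(edges, fixed, n_iter=500):
    """Relax a resistor network for a fixed number of sweeps.

    edges: list of (u, v, conductance) tuples (hypothetical toy format)
    fixed: dict of node -> held voltage (1.0 = known guilty, 0.0 = known innocent)
    """
    neighbors = {}
    for u, v, c in edges:
        neighbors.setdefault(u, []).append((v, c))
        neighbors.setdefault(v, []).append((u, c))
    volts = {n: fixed.get(n, 0.5) for n in neighbors}
    for _ in range(n_iter):
        for n, adj in neighbors.items():
            if n in fixed:
                continue
            # Kirchhoff's current law: an unknown node sits at the
            # conductance-weighted average of its neighbors' voltages.
            total = sum(c for _, c in adj)
            volts[n] = sum(c * volts[m] for m, c in adj) / total
    return volts


def leave_one_out(edges, guilty, innocent):
    """Re-solve the circuit once per known conspirator, releasing that one."""
    results = {}
    for left_out in guilty:
        fixed = {g: 1.0 for g in guilty if g != left_out}
        fixed.update({p: 0.0 for p in innocent})
        results[left_out] = solve_voltages(edges, fixed)
    return results


# Toy network: A and B are known guilty, C is known innocent, X and Y unknown.
edges = [("A", "X", 2.0), ("B", "X", 1.0), ("X", "Y", 1.0), ("Y", "C", 2.0)]
guilty, innocent = ["A", "B"], ["C"]

fixed_all = {g: 1.0 for g in guilty}
fixed_all.update({p: 0.0 for p in innocent})
base = solve_voltages(edges, fixed_all)
loo = leave_one_out(edges, guilty, innocent)

print(base)
for name, volts in loo.items():
    print("without", name, volts)
```

In this toy case the ordering of the unknown nodes (X above Y) is the same in every run, which is the kind of consistency the leave-one-out test looks for.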
3 Data Interpretation and Assumptions
Our dataset consisted of 400 message headers (i.e. sender-receiver pairs) sent between 83
nodes, representing employees within the company. 7 (or 8, if counting Chris) employees
were known to be tied to the conspiracy. The messages were additionally pre-categorized
into 15 topics, 3 of which were known to be related to the conspiracy plan. Some messages
contained multiple topics – 36 messages contained (exactly) 2 categories, and 11 contained 3
categories. Most of the 15 topics were associated with between 20 and 40 unique messages.
We were not given the actual message contents.
318 out of the 400 messages were sent between a unique pair of employees, so there were
very few instances of a person sending more than one message to the same individual. This
unfortunately made it difficult to compare the relative frequencies of contact between
different pairs of nodes without overfitting the data (we worked around this by
considering node-to-topic connections in our clustering and electric network models).
There were two main ambiguities within our data. First, there were 5 employee names that
corresponded to 2 nodes each. These names include Gretchen and Jerome, the names of
two senior managers in the company. In each of these cases, it was unclear whether or not
the two nodes were different communication nodes (e.g. different cell phones) that belonged
to the same employee, or whether the two nodes represented different employees with the
same name. We used cluster analysis to predict which such pairs of nodes to regard as single
individuals.
The second main ambiguity was a single message (line 215 in Topics.xls) that contained
three topics, one of which was marked 18, an undefined topic number and a clear error in
the data. We did not find any conclusive clustering evidence in favor of assigning this topic
to any of the 13 possible third topics, and threw the value out (i.e. we only associated
the message with its two valid topics).
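As a rough illustration of how message headers of this kind can be turned into the count vectors used by both models (node-node counts followed by node-topic counts), consider the sketch below. The input layout, the 0-based indices, and the function name are assumptions made for the example; we also assume that a message counts once for both its sender and its receiver, and that topic numbers outside the defined range are simply dropped.

```python
def build_vectors(messages, n_nodes, n_topics):
    """Map each node to [node-node counts..., node-topic counts...].

    messages: list of (sender, receiver, topics) tuples with 0-based node
    and topic indices (a hypothetical layout, not the original file format).
    """
    vectors = [[0] * (n_nodes + n_topics) for _ in range(n_nodes)]
    for sender, receiver, topics in messages:
        # Count the communication for both endpoints (undirected assumption).
        vectors[sender][receiver] += 1
        vectors[receiver][sender] += 1
        for t in topics:
            if not 0 <= t < n_topics:
                continue  # discard undefined topic numbers
            vectors[sender][n_nodes + t] += 1
            vectors[receiver][n_nodes + t] += 1
    return vectors


# Three nodes, three topics; the last message carries an undefined topic.
toy_messages = [(0, 1, [0]), (1, 2, [0, 2]), (0, 2, [17])]
toy_vectors = build_vectors(toy_messages, n_nodes=3, n_topics=3)
for row in toy_vectors:
    print(row)
```

With 83 nodes and 15 topics the same construction yields vectors of length 98.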
4 Clustering Model
In this section we set up a model that splits workers into groups based on their interactions
with others and with topics. Rather than use this as our main model for ranking the employees
in terms of guilt, we use it to answer some preliminary questions, such as whether or not
duplicate names refer to the same person and exactly how to use the provided message data
in our main model.
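The abstract names k-means with the Euclidean norm and two clusters as the grouping method. A minimal sketch of that idea follows; it is a plain Lloyd iteration, and the deterministic seeding (the first vector and the vector farthest from it), the function name, and the toy vectors are all assumptions made for the example.

```python
import math


def kmeans_two(vectors, n_iter=100):
    """Split vectors into two clusters with Lloyd's algorithm (Euclidean norm)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Assumed seeding: the first vector and the vector farthest from it.
    first = list(vectors[0])
    farthest = list(max(vectors, key=lambda v: dist(v, first)))
    centers = [first, farthest]

    labels = None
    for _ in range(n_iter):
        new_labels = [
            0 if dist(v, centers[0]) <= dist(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                # Move the center to the mean of its members.
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy count vectors: the first two and the last two resemble each other.
toy = [[5, 0, 1], [4, 1, 0], [0, 5, 4], [1, 4, 5]]
toy_labels = kmeans_two(toy)
print(toy_labels)
```

Cluster labels carry no meaning on their own; the check of interest is whether known guilty and known innocent employees end up in different clusters.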