Crime Ring Analysis with Electric Networks

Team 17160

February 13, 2012

Abstract

In this report, we identify the participants in a corporate conspiracy plan using records of messages sent and received by employees of the company. We detail the motivation, mathematical details, robustness testing, strengths, and weaknesses of the two models we co-developed and used, and present and interpret their results. Both models use the senders and receivers of the messages, as well as their topic labels, but not timestamps or full-text content, which are not available to us.

Our first model is a vector space model which maps each of the 83 nodes to a vector whose entries represent the number of communications between that node and each other node, as well as the number of communications involving that node and containing a specified topic (the vectors are of length 83 + 15 = 98). Within this model, we use the Euclidean norm as a distance measure, and show that k-means clustering into two clusters meaningfully separates known guilty and non-guilty employees when both node-node and node-topic interactions are considered, as well as when only node-topic interactions are considered, in both the case at hand and the previous example case. We also use cluster analysis to determine that, among all cases of repeated names, Beth is the most likely to be represented by two nodes. The clustering model does not assume known guilt or innocence a priori and does not strongly suggest any additional conspirators; however, its ability to separate known guilty and innocent employees lends credibility to the use of linear spaces and equally weighted node-topic connections in our second, main model.

Our main model is an electrical circuit in which each node and topic is represented by a vertex in the circuit graph, and people and topics are connected via a conductance proportional to the number of interactions between them (these are the same as the corresponding entries in the vector space model). In this model, the people and topics given to be guilty are set to a reference voltage of 1, while those given to be innocent are grounded to 0. The voltages of unknown nodes can then be determined by solving the circuit in a DC setting. We chose this model because of its simplicity (to prevent overfitting), expressiveness, flexibility to accommodate new types of data and new problem types, and because the metrics it uses are supported by the clustering model.

We tested the robustness of our model by checking for leave-one-out discrepancies

– that is, we experimentally ran the model once for each known conspirator, leaving

Control Number: 17160 Page 4 of 23

1 Problem Statement

In this problem, we aim to uncover the participants in a corporate conspiracy plan using records of messages sent and received by employees of the company. We are not given access to the full-text content of the messages – however, the messages are pre-labeled using 15 general conversation topics, the descriptions of which are given. Using this information, as well as a given set of known conspirators and known non-conspirators, we aim to identify a set of likely conspirators among the remaining employees. Of particular interest are three of the company’s senior members: Gretchen, Jerome, and Delores.

2 Approach Philosophy

The development of our model was guided by a few basic philosophies, which arise from the legal nature of the problem and the nature and scope of the example given.

First, we wanted to avoid overfitting our data. Specifically, we determined that the general process of:

• Coming up with a generic model

• Fitting its parameters to best explain the given example

• Running the model with those parameters on the case at hand

would not be an effective approach. For example, a model with three weight parameters, each tested at 11 values (say 0 to 1 in increments of 0.1), would result in a search space of $11^3 = 1331$ models. Among 10 people in the training example, there are $2^{10} = 1024$ possible combinations of guilty and innocent people. With no other assumptions, we would expect $\frac{1331}{1024} > 1$ set of three parameters to happen to match the results of the given example, regardless of whether or not those parameters result in a good general model.
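
The counting argument above can be checked with a short Python sketch. The grid values and variable names are illustrative assumptions, and the "expected matches" figure relies on the idealization that each parameter set produces one of the possible labelings uniformly at random.

```python
from itertools import product

# Three weight parameters, each tried at 0.0, 0.1, ..., 1.0 (11 values each).
grid = [i / 10 for i in range(11)]
n_models = sum(1 for _ in product(grid, repeat=3))  # 11 ** 3

# Each of the 10 people in the example is either guilty or innocent.
n_labelings = 2 ** 10

# Idealization (an assumption): every parameter set yields a uniformly random
# labeling, so the expected number of sets matching one fixed labeling is:
expected_matches = n_models / n_labelings

print(n_models, n_labelings, expected_matches > 1)
```

Because the ratio exceeds 1, at least one parameter set would be expected to reproduce the example labeling by chance alone under this idealization.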

Instead, we chose to develop two simple concurrent models that support each other. Our main model is an electric network model that assumes the guilt of those given to be guilty and flags topics given to be suspicious. To support this model and give credibility to the metrics, choices of fixed parameters, and data assumptions it uses, we developed an auxiliary clustering model. Both of these models are discussed in detail in Sections 4 and 5.
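
To make the electric network idea concrete, the following is a minimal sketch of solving a small resistive circuit in which some vertices are held at fixed voltages. The toy graph, the vertex names, the conductance values, and the function `solve_voltages` are all hypothetical and chosen for illustration; this is not the implementation used in the report. The sketch assumes every unknown vertex is connected, directly or indirectly, to at least one fixed vertex (otherwise the linear system is singular).

```python
def solve_voltages(conductance, fixed):
    """Solve Kirchhoff's current law for the vertices not listed in `fixed`.

    conductance: {(u, v): c} for undirected edges with conductance c > 0.
    fixed:       {vertex: voltage} for vertices held at a known voltage.
    """
    vertices = sorted({x for edge in conductance for x in edge})
    unknown = [v for v in vertices if v not in fixed]
    index = {v: i for i, v in enumerate(unknown)}
    n = len(unknown)

    # At each unknown vertex i: sum_j c_ij * (V_i - V_j) = 0.
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for (u, v), c in conductance.items():
        for p, q in ((u, v), (v, u)):
            if p in index:
                i = index[p]
                A[i][i] += c
                if q in index:
                    A[i][index[q]] -= c
                else:
                    b[i] += c * fixed[q]

    # Direct solve: Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]

    return {**fixed, **{v: x[index[v]] for v in unknown}}


# Toy circuit: one vertex held at 1 (given guilty), one grounded at 0
# (given innocent), and two unknown vertices X and Y.
conductance = {
    ("guilty_A", "X"): 2.0,
    ("X", "innocent_B"): 1.0,
    ("X", "Y"): 1.0,
    ("Y", "innocent_B"): 1.0,
}
fixed = {"guilty_A": 1.0, "innocent_B": 0.0}
voltages = solve_voltages(conductance, fixed)
print(voltages["X"], voltages["Y"])
```

In this toy circuit X has the stronger tie to the vertex held at 1, so it settles at a higher voltage than Y; a direct linear solve is used here so that no intermediate iterate has to be interpreted.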

Second, we wanted to avoid models that implied any level of presumption of guilt in any intermediate term (e.g. conditional probability terms in an intermediate iteration of a PageRank-like algorithm). Because the problem statement is ambiguous about the legal proceedings themselves, and about exactly why the data is abridged the way it is, we wanted to avoid the chance of using any intermediate assumptions (even in an iterative calculation) that could later undermine the admissibility of subsequent evidence collected on the basis of our recommendations (for example, a violation of probable cause).

Third, we understand that any algorithmic approach based on communication data risks implying some degree of guilt by association. In order to protect the credibility of our work, we used leave-one-out testing as a measure of the robustness of our results – that is, for each person known to be involved in the conspiracy, we ran the model without assuming them guilty, to ensure that our results are not too dependent on associations with any particular individual being assumed guilty.
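
The leave-one-out procedure can be sketched as a loop that withholds one known conspirator's label at a time and reruns the model. In the sketch below, `leave_one_out` is a hypothetical helper, and `toy_score` is a deliberately simple placeholder (the fraction of a person's contacts assumed guilty), not the model used in this report; the contact graph is invented for illustration.

```python
def leave_one_out(score_fn, known_guilty):
    """Run score_fn once per known conspirator, withholding that person's label.

    Returns {left_out_person: scores}, where scores maps each person to the
    value the model assigns when left_out_person is NOT assumed guilty.
    """
    results = {}
    for person in sorted(known_guilty):
        assumed = set(known_guilty) - {person}
        results[person] = score_fn(assumed)
    return results


# Placeholder data and model (assumptions, for illustration only).
contacts = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def toy_score(assumed_guilty):
    # Assumed-guilty people score 1; everyone else scores the fraction of
    # their contacts who are assumed guilty.
    return {
        p: 1.0 if p in assumed_guilty
        else sum(q in assumed_guilty for q in nbrs) / len(nbrs)
        for p, nbrs in contacts.items()
    }


runs = leave_one_out(toy_score, {"A", "B"})
for left_out, scores in runs.items():
    # How suspicious does the withheld person look without their own label?
    print(left_out, scores[left_out])
```

Comparing each run's scores with those from the full set of labels shows how much any one assumed-guilty individual drives the results.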

3 Data Interpretation and Assumptions

Our dataset consisted of 400 message headers (i.e. sender-receiver pairs) sent between 83 nodes, representing employees within the company. 7 (or 8, if counting Chris) employees were known to be tied to the conspiracy. The messages were additionally pre-categorized into 15 topics, 3 of which were known to be related to the conspiracy plan. Some messages contained multiple topics – 36 messages contained (exactly) 2 categories, and 11 contained 3 categories. Most of the 15 topics were associated with between 20 and 40 unique messages. We were not given the actual message contents.

318 out of the 400 messages were sent between a unique pair of employees, which unfortunately makes it difficult to compare the relative frequencies of contact between different pairs of individuals, as there were very few instances of a person sending more than one message to another individual. This made it difficult to meaningfully consider the communication frequencies between pairs of nodes without overfitting the data (we worked around this by considering node-to-topic connections in our clustering and electric network models).
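
As a rough illustration of how message headers could be turned into node-node and node-topic counts, consider the sketch below. The record layout `(sender, receiver, topics)`, the function `build_vectors`, the toy messages, and the choice to treat a message as an undirected interaction credited to both sender and receiver are all assumptions made for this example, not details taken from the dataset.

```python
from collections import Counter


def build_vectors(messages, n_nodes, n_topics):
    """Return one count vector of length n_nodes + n_topics per node.

    The first n_nodes entries count messages exchanged with each other node;
    the last n_topics entries count messages involving the node per topic.
    """
    vectors = [[0] * (n_nodes + n_topics) for _ in range(n_nodes)]
    for sender, receiver, topics in messages:
        vectors[sender][receiver] += 1
        vectors[receiver][sender] += 1
        for t in topics:
            vectors[sender][n_nodes + t] += 1
            vectors[receiver][n_nodes + t] += 1
    return vectors


# Toy data: 3 nodes, 2 topics, 4 messages (one message carries two topics).
messages = [(0, 1, [0]), (0, 1, [1]), (1, 2, [0, 1]), (2, 0, [1])]
vectors = build_vectors(messages, n_nodes=3, n_topics=2)

# Count how many unordered pairs exchanged exactly one message.
pair_counts = Counter(frozenset((s, r)) for s, r, _ in messages)
pairs_with_one_message = sum(1 for c in pair_counts.values() if c == 1)

print(vectors)
print(pairs_with_one_message)
```

With the real data (83 nodes, 15 topics) the same layout would give vectors of length 98, and the sparsity of repeated pairs would show up as mostly 0 or 1 in the node-node entries.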

There were two main ambiguities within our data. First, there were 5 employee names that corresponded to 2 nodes each. These names include Gretchen and Jerome, the names of two senior managers in the company. In each of these cases, it was unclear whether the two nodes were different communication nodes (e.g. different cell phones) that belonged to the same employee, or whether the two nodes represented different employees with the same name. We used cluster analysis to predict which such pairs of nodes to regard as single individuals.

The second main ambiguity was a single message (line 215 in Topics.xls) that contained three topics, one of which was marked 18, an undefined topic number and a clear error in the data. We did not find any conclusive clustering evidence in favor of classifying this topic among any of the 13 possible third-topics, and threw the value out (i.e. we only associated the message with its two valid topics).

4 Clustering Model

In this section we set up a model that splits workers into groups based on interaction with others and with topics. Rather than use this as our main model for ranking the employees in terms of guilt, we use it to answer some preliminary questions, such as whether or not duplicate names refer to the same person and exactly how to use the provided message data in our main model.
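
For readers unfamiliar with the technique, the following is a minimal sketch of k-means with two clusters under the Euclidean distance, written in plain Python. The function `kmeans_two`, the deterministic initialization (first point plus the point farthest from it), and the toy points are assumptions made for illustration; the details of the clustering actually performed are not reproduced here.

```python
import math


def kmeans_two(points, iterations=100):
    """Lloyd's algorithm with k = 2 and Euclidean distance.

    Initialization (an assumption of this sketch): the first point and the
    point farthest from it serve as the two starting centers.
    """
    c0 = list(points[0])
    c1 = list(max(points, key=lambda p: math.dist(p, c0)))
    centers = [c0, c1]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min((0, 1), key=lambda k: math.dist(p, centers[k])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy vectors standing in for per-employee count vectors.
points = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels, centers = kmeans_two(points)
print(labels, centers)
```

Applied to per-employee count vectors, the two resulting groups could then be compared against the known guilty and innocent employees, since the clustering itself uses no labels.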
