Team # 12218
Problem Chosen: C
Social network analysis (SNA) has attracted growing attention over the last twenty
years because it provides insight into the structure of many kinds of networks. In this
paper, we use SNA and related techniques to analyze crime data and identify a potential
criminal gang.
After first investigating the hidden features of a conspiratorial group, we take a person’s
closeness to the criminal gang as the main criterion for deciding whether a staff member should
be regarded as a conspirator. The cooperation distance metric (CD-metric), combined with an
analysis of the cooperation factor, forms the basis of our model. With these definitions,
potential conspirators are those at the top of the list sorted by ascending CD-metric. Those
with the highest CD-metric values are the farthest from the conspiracy; in other words, they
are the most likely to be innocent. A criminal group of 12 members is identified.
Before identifying the leading criminals, we need to quantify each person’s leadership
ability. Centrality analysis offers a good way to determine the focal point(s) of a network.
We amend the centrality measures so that they can be applied to a directed graph. Furthermore,
we combine the centrality measures (degree, closeness and betweenness) into a fusion value,
from which we form a ranking of leadership ability. The leaders of the criminal group emerge
when this ranking is compared with the priority list derived earlier.
To refine the network model, we draw on semantic network analysis and text analysis. To
explore the intrinsic character of each topic more deeply, we reapply centrality analysis
after constructing a semantic web. Based on the text analysis approach, we build a vector
space model and compare its outputs with those obtained above; the results are robust.
The model’s application in other areas, such as the biomedical domain, is discussed at
the end of this paper; it performs well when given sufficient data. With its well-organized
structure and wide range of applications, this methodology is promising.
Keywords: Social Network Analysis, Centrality Measures, Semantic Web, Text Analysis
Team # 12218 Page 1 of 18
Contents
1 Introduction
5 Model Optimizing
5.1 Semantic Network Analysis
5.1.1 Construction of Semantic Network
5.1.2 Topic Selection
5.2 Text Analysis
5.2.1 Vector Space Model
5.2.4 Corroboration of Model Reasonableness
6 Applications
7 Conclusion
1 Introduction
Crime covers a wide range of activities, from traffic violations to organized
fraud in the financial sector. The major challenge facing the authorities is how
to analyze the tremendous amount of crime data accurately and efficiently [1].
With great advances in information technology, financial crime occurs more
often than ever. According to a recent German report, identified fraud cases
related to card payments rose by 345% from 2007 to 2009 [2]. A common
analytical approach to this issue in recent years is “Social Network Analysis”
(SNA), which can be viewed as a mathematical innovation combined with
sociology and criminology. This method can expose the hidden structural
patterns in criminal networks [3]; after several steps of data processing, these
patterns can direct anti-crime departments toward the necessary surveillance
or interrogation.
According to Mena’s book, SNA is a data mining technique that reveals
the structure and content of a body of information by representing it as a set
of interconnected, linked objects or entities [4].
To further present the application of the SNA approach, we organize the
paper as follows:
i. In section 2, we first propose a way to construct a model for identifying
potential conspirators with the help of SNA;
ii. In section 3, after being verified on a small network, our model is applied
to a larger one, and we rank all the people by their likelihood of being
involved in the conspiracy;
iii. In section 4, based on centrality analysis theory, we explore a way to
determine all the criminal leaders;
iv. In section 5, drawing on two powerful techniques, semantic network
analysis and text analysis, we amend our model by considering the content
and context of the message traffic;
v. Finally, we discuss the model’s further development when provided with
an enormous amount of data and consider future applications, such as in
the biomedical area.
2.2 Definitions
Before proceeding to the real problem, we first make some definitions:
The multi-graph $G = (V, E)$ is assumed to be a social network, where
$V$ denotes the set of nodes and $E$ denotes the set of edges. A person in the
network is denoted by a node in the graph, while edges stand for messages
on certain topics passing between persons. Thus we can separate these persons
into several cliques according to the topics of communication.
Let $G' = (V', E')$ be a supergraph corresponding to multi-graph $G$, where
$V'$ is the set of supernodes and $E'$ is the set of superedges [5]. Each
supernode $P_i'$ represents topic $i$ with its associated people in multi-graph $G$,
and a superedge exists between two supernodes when they share common nodes.
The weight of a superedge is the number of common nodes.
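The supergraph construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input is taken to be a mapping from topic number to the set of people who discussed it, and the function name `build_supergraph` and the sample names are hypothetical, not taken from the data set.

```python
from itertools import combinations


def build_supergraph(topic_members):
    """Return superedges as {(topic_a, topic_b): weight}.

    topic_members maps a topic id to the set of people associated with it
    (one supernode per topic). A superedge exists only when two supernodes
    share at least one person; its weight is the number of shared people.
    """
    superedges = {}
    for a, b in combinations(sorted(topic_members), 2):
        common = topic_members[a] & topic_members[b]
        if common:
            superedges[(a, b)] = len(common)
    return superedges


# Hypothetical toy input: three topics and the people linked to each.
topic_members = {
    1: {"Ann", "Bo", "Cy"},
    2: {"Bo", "Cy", "Di"},
    3: {"Ed"},
}
superedges = build_supergraph(topic_members)
```

In this toy input only topics 1 and 2 share people, so a single superedge of weight 2 is produced and topic 3 stays isolated.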
\[
CF(S', V_i', V_j') = \frac{2 \times N_c(P_i', P_j')}{N_t(P_i', P_j')}, \quad CF \in [0, 1] \tag{1}
\]
Where: $N_c$ denotes the number of common nodes between $P_i'$ and $P_j'$; $N_t$ denotes
the total number of nodes contained in $P_i'$ and $P_j'$.
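A small numerical sketch of equation (1) follows. It assumes that $N_t$ counts the members of both supernodes with shared members counted twice, i.e. $N_t = |P_i'| + |P_j'|$, since that reading is what keeps CF inside $[0, 1]$ and lets it reach 1 for identical groups; the function name and the sample sets are hypothetical.

```python
def cooperation_factor(p_i, p_j):
    """Cooperation factor of two supernodes given as sets of people.

    Assumption: N_t = |p_i| + |p_j| (shared members counted in both),
    so the result is 1.0 for identical groups and 0.0 for disjoint ones.
    """
    n_common = len(p_i & p_j)
    n_total = len(p_i) + len(p_j)
    if n_total == 0:
        return 0.0  # two empty supernodes: treat as no cooperation
    return 2 * n_common / n_total


# Hypothetical supernodes sharing two of their three members each.
cf_example = cooperation_factor({"Ann", "Bo", "Cy"}, {"Bo", "Cy", "Di"})
```

Here two shared members out of six listed memberships give CF = 4/6, roughly 0.67.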
Clearly, not all conspirators can contact each other directly, but for safety
they must remain very close to one another. To exploit this collective feature,
we define a Cooperation Distance (CD) [5]:
Definition 2.3.2 (Cooperation Distance Metric) In the supergraph $G' =
(V', E')$, let a list of persons be denoted by a query $Q = \{q_1, q_2, q_3, \ldots, q_{|Q|}\}$. Let
$v^*$ be a particular node in the multi-graph $G$, and $V$ be the set of nodes in the
multi-graph $G$. Then the Cooperation Distance of $v^*$ is defined as:
\[
CD(S', v^*, V) = \frac{\sum_{i=1}^{|Q|} \mathrm{ShortestPath}(S', v^*, q_i)}{|Q|} \tag{2}
\]
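As a rough illustration of equation (2), the sketch below averages shortest-path lengths from one person to every member of the query. It is a simplification, not the paper's implementation: it assumes an unweighted, undirected graph stored as an adjacency dictionary and measures distance in hops, whereas the model measures paths over the weighted supergraph. All names and the toy graph are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def cooperation_distance(adjacency, node, query):
    """Mean shortest-path length from node to the people in query.

    Unreachable query members contribute infinity, which pushes the
    node to the bottom of an ascending ranking.
    """
    dist = hop_distances(adjacency, node)
    return sum(dist.get(q, float("inf")) for q in query) / len(query)


# Hypothetical chain a - b - c - d, with a and b as the known query members.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
query = ["a", "b"]
ranking = sorted(adjacency, key=lambda v: cooperation_distance(adjacency, v, query))
```

Sorting in ascending order puts the people closest to the query first, which matches the way the priority list is read in this paper.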
The top three in the list are the staff members that our analysis rates as most
likely to be involved in the conspiracy, and two of them are already known to
be guilty. Moreover, Anne and Jaye are both identified as the “farthest away”
from the conspiracy. Unfortunately, Carol may still be misjudged as a
conspirator, and Bob may be overlooked as well.
We note that the 11 persons at the top of the list have the same CD-metric
value, which indicates that they can be regarded as having the same
paths between any other two nodes in the network) passing through it:
\[
C_B(i) = \frac{g_{jk}(i)}{g_{jk}} \tag{3}
\]
Where: $g_{jk}$ represents the number of shortest paths between two
nodes $j$ and $k$, and $g_{jk}(i)$ represents the number of those shortest paths
running through node $i$;
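The ratio in equation (3) can be computed as in the sketch below. It sums the ratio over all pairs of other nodes, as in Freeman's definition [6], and assumes an unweighted, undirected graph in which every node appears as a key of the adjacency dictionary. The directed-graph amendment used in this paper is not reproduced, and the function names and toy graphs are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adjacency, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                sigma[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                # Every shortest path to node extends to this neighbour.
                sigma[neighbour] += sigma[node]
    return dist, sigma


def betweenness(adjacency):
    """Sum of g_jk(i) / g_jk over all unordered pairs (j, k) not containing i."""
    info = {v: bfs_counts(adjacency, v) for v in adjacency}
    scores = {v: 0.0 for v in adjacency}
    for j, k in combinations(adjacency, 2):
        dist_j, sigma_j = info[j]
        dist_k, sigma_k = info[k]
        if k not in dist_j:
            continue  # j and k lie in different components
        for i in adjacency:
            if i in (j, k) or i not in dist_j or i not in dist_k:
                continue
            if dist_j[i] + dist_k[i] == dist_j[k]:
                # i lies on a shortest j-k path: count the paths through it.
                scores[i] += sigma_j[i] * sigma_k[i] / sigma_j[k]
    return scores


# Hypothetical star graph: every shortest path between leaves passes through "c".
star = {"c": ["a", "b", "d"], "a": ["c"], "b": ["c"], "d": ["c"]}
star_scores = betweenness(star)
```

In the star graph the centre collects one unit from each of the three leaf pairs, while the leaves score zero, which is the gatekeeper-like pattern that betweenness is meant to detect.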
Closeness Closeness is the inverse of the sum of geodesic distances between the
particular node and every other node in the network:
\[
C_C(i) = \left[ \sum_{j}^{N} d(i, j) \right]^{-1} \tag{4}
\]
Where: $d(i, j)$ denotes the binary shortest distance between nodes $i$ and $j$
in the network;
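A minimal sketch of equation (4), read as the inverse of the summed distances (the $-1$ exponent), is given below. It assumes an unweighted, undirected graph with every node present as a key, and it returns 0 for a node that cannot reach the whole network, which is one possible convention rather than the paper's. The names and the toy graph are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closeness(adjacency, node):
    """Inverse of the sum of shortest distances from node to all other nodes."""
    dist = hop_distances(adjacency, node)
    if len(dist) < len(adjacency):
        return 0.0  # assumption: a node that cannot reach everyone scores 0
    total = sum(d for other, d in dist.items() if other != node)
    return 1.0 / total if total > 0 else 0.0


# Hypothetical chain a - b - c: the middle node is closest to everyone.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
closeness_b = closeness(path, "b")
closeness_a = closeness(path, "a")
```

With this reading a larger value means a more central node, so closeness points in the same direction as degree and betweenness when the measures are later combined.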
Degree The original concept [6] of a node’s degree is its number of links:
\[
k_i = C_D(i) = \sum_{j}^{N} x_{ij} \tag{5}
\]
Where: $i$ denotes the focal node, $j$ denotes the other nodes in the network;
$N$ is the number of nodes and $x_{ij} \in \{0, 1\}$ are the entries of the adjacency
matrix.
However, some researchers proposed taking the weight of each edge in a
weighted network into account [8–10] by introducing a new measure
called strength:
\[
s_i = C_D^{W}(i) = \sum_{j}^{N} w_{ij} \tag{6}
\]
Where: $i$ denotes the focal node, $j$ denotes the other nodes in the network;
$N$ is the number of nodes and $w_{ij}$ are the entries of the weighted adjacency
matrix, whose value is positive when $i$ is connected to $j$ (so that it reflects
the weight) and 0 otherwise.
Now we use a method that combines degree and strength by introducing
a tuning parameter $\alpha$ [11]:
\[
C_D^{W\alpha}(i) = k_i \times \left( \frac{s_i}{k_i} \right)^{\alpha} = k_i^{(1-\alpha)} \times s_i^{\alpha} \tag{7}
\]
Where: $\alpha$ is a positive value that can be set according to the requirements
of the analysis. When the parameter is between 0 and 1, a high degree is
treated as favorable for a given strength; when it exceeds 1, a low degree is
preferable. In our crime network analysis, we choose 0.5 to accomplish our goal.
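Equations (5) to (7) can be checked on a single node with the sketch below. It assumes the network is stored as a nested dictionary of tie weights (for example, message counts); the function name `degree_strength_centrality` and the sample weights are hypothetical.

```python
def degree_strength_centrality(weights, node, alpha=0.5):
    """Combined measure k_i^(1 - alpha) * s_i^alpha for one node.

    weights[node] maps each neighbour to a tie weight; only positive
    weights count as ties. alpha = 0 gives the plain degree k_i and
    alpha = 1 gives the plain strength s_i.
    """
    ties = [w for w in weights[node].values() if w > 0]
    degree = len(ties)      # k_i, equation (5)
    strength = sum(ties)    # s_i, equation (6)
    if degree == 0:
        return 0.0          # isolated node
    return degree ** (1 - alpha) * strength ** alpha


# Hypothetical node "u" with two ties carrying 3 and 5 messages.
weights = {"u": {"v": 3, "w": 5}, "v": {"u": 3}, "w": {"u": 5}}
combined_u = degree_strength_centrality(weights, "u", alpha=0.5)
```

For this node $k = 2$ and $s = 8$, so with $\alpha = 0.5$ the measure is $\sqrt{2 \times 8} = 4$, lying between the degree and the strength.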
The leadership of an individual conspirator can be “revealed” by one measure
or by a combination of two or three, although special cases such as gatekeepers
and receptionists may still rank highly [12]. According to existing research,
this method produces fairly accurate results and is easy to apply widely [13].
(i) Degree does not consider the overall structure of the network and will
inevitably lead to some misjudgments [14, 15];
(ii) The major challenge for closeness is its inability to deal with
disconnected components in networks [11];
(iii) Betweenness may function improperly in particular cases, such as a
node that has no shortest paths running through it.
With so many unpredictable situations to consider, we simply place an
equal weight on all three measures and take the resulting fusion value. We
choose the 40 staff members with the highest fusion values from our analysis;
the results are in Table 5. They are deemed to be the ones with the greatest
impact on the whole network, in other words, the leaders in the company.
Comparing this list with the priority list we worked out, we can infer that Alex
is a leader in the criminal gang.
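The equal-weight fusion can be sketched as follows. The text does not say how the three measures are brought onto a common scale before they are averaged, so the min-max scaling used here is an assumption, as are the function names and the toy scores, which are not values from the data set.

```python
def min_max(values):
    """Rescale a {node: score} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {node: 0.0 for node in values}  # no spread: nothing to rank
    return {node: (v - lo) / (hi - lo) for node, v in values.items()}


def fusion_scores(*measures):
    """Equal-weight average of several rescaled centrality measures."""
    scaled = [min_max(m) for m in measures]
    return {node: sum(m[node] for m in scaled) / len(scaled) for node in scaled[0]}


def top_n(scores, n):
    """The n nodes with the highest fusion value, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical centrality scores for three people (higher = more central).
degree = {"u": 4, "v": 2, "w": 0}
closeness = {"u": 0.5, "v": 0.5, "w": 0.25}
betweenness = {"u": 3.0, "v": 0.0, "w": 0.0}
fused = fusion_scores(degree, closeness, betweenness)
leaders = top_n(fused, 2)
```

Calling `top_n(fused, 40)` on the full network would give the candidate leader list, which can then be intersected with the priority list of suspected conspirators.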
5 Model Optimizing
To make our model function better, we need to revisit the weight placed on
each topic. In the previous work, we simply gave the significant topics an equal
random initial value, while the others were determined by calculation based on
the quantified correlations among them. However, this quantification cannot
provide a sufficiently precise result. To overcome this barrier, we turn to more
powerful techniques: Semantic Network Analysis (SemNa) and Text Analysis.
Where: $tf_{ik}$ is the number of times feature item $T_k$ appears in text $D_i$,
while $df_k$ is the number of texts that include feature item $T_k$. The higher the
value of $df_k$, the weaker the ability of $T_k$ to differentiate between kinds of
text. $N$ denotes the total number of texts, and $idf_k = \log_2(N/df_k)$ is the
inverse document frequency, for which a higher value indicates a stronger ability
to differentiate between kinds of text.
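The quantities defined above can be computed as in the sketch below. The weighting formula itself is not reproduced in this section, so the sketch assumes the common product form $w_{ik} = tf_{ik} \times idf_k$ with $idf_k = \log_2(N/df_k)$; splitting texts on whitespace, the function name and the sample messages are likewise illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(texts):
    """Return one {term: weight} vector per text, with weight = tf * log2(N / df)."""
    n_texts = len(texts)
    term_counts = [Counter(text.lower().split()) for text in texts]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each text counts once per term
    return [
        {term: tf * math.log2(n_texts / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


# Hypothetical message texts (not taken from the data set).
texts = ["budget budget meeting", "meeting lunch", "lunch plan", "plan review"]
vectors = tf_idf(texts)
```

A term that occurs in every text receives weight 0, which reflects the remark that a high $df_k$ means a weak ability to tell texts apart.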
amazing that the members of the criminal group are unchanged, which in turn
corroborates the reasonableness of the network model.
6 Applications
Our method for analyzing a network with a complex database comprises
social network analysis, centrality analysis, semantic network analysis and
text analysis. Each of them can easily be extended to other areas using the same
intrinsic mechanism, and some recent research even explains why these methods
have such a wide range of applications to social questions [22].
Text mining has become a prevalent automated method for exploiting the
tremendous amount of knowledge in the biomedical literature. The most common
approaches are rule-based or knowledge-based approaches and statistical or
machine-learning-based approaches [23]. The former mainly focus on analyzing
the structure of messages using various kinds of knowledge, while the latter
emphasize classifying full sentences. Several real applications of the semantic
web have already been surveyed in various biomedical areas involving knowledge
integration and exploration [24].
With the useful information extracted from the messages exchanged within
a group, we can turn to semantic network analysis and text analysis for deeper
insight. The theory of social network analysis provides a solid base for
constructing a semantic web in which the content and context of each message
receive much more attention. Within a social group, individuals differ in their
social experience and their relations with others, and these differences
eventually lead to a division of social work, bringing various networks into
function. Centrality provides a good measure of the structural importance of a
node in the whole network [25]. This quantification of importance can help us
search for the focal point(s). Therefore, combining these methodologies into
one whole yields an efficient analysis. A massive amount of data can enrich
our sample collection, and a good artificial intelligence system can enhance the
ability of our semantic network analysis to categorize information, which
strengthens the reasonableness of the whole model.
7 Conclusion
Formulating a well-organized crime-busting scenario is not an easy task.
To accomplish our goal, we first discussed the features a criminal gang could
have. Based on the principle of secrecy in a conspiracy, we defined a
cooperation distance (CD metric) to decide whether a person should be regarded
as involved in the conspiracy, because we considered one’s closeness to the
criminal group a fairly conclusive argument.
Centrality measures provided a good way to detect the leaders in the
criminal network. We proposed some modifications in order to apply the theory
to a directed graph. Comparing with the given information, we were pleased to
find that the true managers score highly in the leadership ranking.
In refining the model, we drew on the techniques of semantic network
analysis and text analysis. Visualization may be the most fundamental
difference between traditional analysis and present-day network research.
After making a development of the Social Network Analysis, we success-
References
[1] H. Chen, W. Chung, J.J. Xu, G. Wang, Y. Qin, and M. Chau. Crime data
mining: a general framework and some examples. Computer, 37(4):50–56,
2004.
[2] B. Wiesbaden. Polizeiliche kriminalstatistik. Bundesrepublik Deutschland.
Berichtsjahr, 2007.
[3] M.K. Sparrow. The application of network analysis to criminal intelligence:
An assessment of the prospects. Social networks, 13(3):251–274, 1991.
[4] J. Mena. Investigative data mining for security and criminal detection.
Butterworth-Heinemann, 2003.
[5] A.M. Fard and M. Ester. Collaborative mining in multiple social networks
data for criminal group discovery. In Computational Science and
Engineering, 2009. CSE’09. International Conference on, volume 4, pages
582–587. IEEE, 2009.
[6] L.C. Freeman. Centrality in social networks conceptual clarification. Social
networks, 1(3):215–239, 1979.
[7] J. Xu and H. Chen. Criminal network analysis and visualization.
Communications of the ACM, 48(6):100–107, 2005.
[8] A. Barrat, M. Barthelemy, R. Pastor-Satorras, and A. Vespignani. The
architecture of complex weighted networks. Proceedings of the National
Academy of Sciences of the United States of America, 101(11):3747, 2004.
[21] Shi Zhiwei, Liu Tao, and Wu Gongyi. An effective and efficient algorithm for
text categorization (in Chinese). Computer Engineering and Applications,
41(29):180–183, 2005.
[22] A. Sih, S.F. Hanser, and K.A. McHugh. Social network theory: new insights
and issues for behavioral ecologists. Behavioral Ecology and Sociobiology,
63(7):975–988, 2009.
[23] K.B. Cohen and L. Hunter. Getting started in text mining. PLoS compu-
tational biology, 4(1):e20, 2008.
[24] H. Chen, L. Ding, Z. Wu, T. Yu, L. Dhanapalan, and J.Y. Chen.
Semantic web for integrated network analysis in biomedicine. Briefings in
bioinformatics, 10(2):177–192, 2009.
[25] K. Chan and J. Liebowitz. The synergy of social network analysis and
knowledge mapping: a case study. International journal of management
and decision making, 7(1):19–35, 2006.