
Predictive Modeling of Global Terrorist Attacks

Using Machine Learning


Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Science & Engineering
of
Maulana Abul Kalam Azad University of Technology
(Formerly known as West Bengal University of Technology)

Project Report

Submitted by

Name of Students University Roll No.

Sukanya Ghosh 11600115055


Mahek Pipalia 11600115021
Aditya Agarwal 11600115001
Md Sarik 11600115022
Nilesh Jain 11600115025

Under the supervision of

Moumita Goswami
Asst. Professor, CSE

Department of Computer Science & Engineering,


MCKV Institute of Engineering
243, G.T. Road(N)
Liluah, Howrah – 711204
Department of Computer Science & Engineering
MCKV Institute of Engineering
243, G. T. Road (N),
Liluah, Howrah-711204

CERTIFICATE OF RECOMMENDATION

I hereby recommend that the thesis prepared under my supervision by Sukanya Ghosh, Mahek
Pipalia, Aditya Agarwal, Md Sarik & Nilesh Jain entitled Predictive Modeling of Global
Terrorist Attacks Using Machine Learning be accepted in partial fulfillment of the requirements for
the degree of Bachelor of Technology in Computer Science & Engineering Department.

-----------------------------------------------------------
Dr. S. S. Thakur
Associate Professor & Head of the Department,
Computer Science & Engineering Department,
MCKV Institute of Engineering, Howrah

--------------------------------------
Moumita Goswami
Project Guide
Asst. Professor, CSE
MCKV Institute of Engineering, Howrah
MCKV Institute of Engineering
243, G. T. Road (N), Liluah
Howrah-711204
Affiliated to
Maulana Abul Kalam Azad University of Technology
(Formerly known as West Bengal University of Technology)

CERTIFICATE
This is to certify that the project entitled
Predictive Modeling of Global Terrorist Attacks Using Machine Learning
and submitted by

Name of students University Roll No.

Sukanya Ghosh 11600115055


Mahek Pipalia 11600115021
Aditya Agarwal 11600115001
Md Sarik 11600115022
Nilesh Jain 11600115025

has been carried out under my guidance, following the rules and regulations for the degree of
Bachelor of Technology in Computer Science & Engineering of Maulana Abul Kalam Azad
University of Technology (formerly West Bengal University of Technology).

(Signature of the project guide)


Moumita Goswami,
Assistant Professor,
CSE.
MCKV Institute of Engineering
243, G. T. Road (N), Liluah
Howrah-711204
Affiliated to
Maulana Abul Kalam Azad University of Technology
(Formerly known as West Bengal University of Technology)

CERTIFICATE OF APPROVAL
(B.Tech Degree in Computer Science & Engineering)

This project report is hereby approved as a creditable study of an engineering subject
carried out and presented in a manner satisfactory to warrant its acceptance as a
prerequisite to the degree for which it has been submitted. It is to be understood that by
this approval, the undersigned do not necessarily endorse or approve any statement
made, opinion expressed or conclusion drawn therein, but approve the project report
only for the purpose for which it has been submitted.

COMMITTEE ON FINAL EXAMINATION FOR
EVALUATION OF PROJECT REPORT:
1.
2.
3.
4.
5.
ACKNOWLEDGEMENT

I take this opportunity to express my profound gratitude and deep regards to my faculty
MOUMITA GOSWAMI for her exemplary guidance, monitoring and constant
encouragement throughout the course of this project. The blessing, help and guidance
given by her from time to time shall carry me a long way in the journey of life on which I am
about to embark.
I am obliged to my project team members for the valuable information provided by them
in their respective fields. I am grateful for their cooperation during the period of my
assignment.

SUKANYA GHOSH
MAHEK PIPALIA
ADITYA AGARWAL
MD SARIK
NILESH JAIN
TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Literature Review
   3.1. Theoretical Analysis
   3.2. Case Study
4. What is Machine Learning?
   4.1. Supervised & Unsupervised ML Algorithms
   4.2. Some Common ML Algorithms
      4.2.1. Decision Trees
      4.2.2. Naive Bayes Classification
      4.2.3. Random Forest
      4.2.4. Linear Regression
      4.2.5. Logistic Regression
      4.2.6. KNN Classification
      4.2.7. Support Vector Machine (SVM)
      4.2.8. Ensemble Methods
      4.2.9. Clustering Algorithms
      4.2.10. Principal Component Analysis
      4.2.11. Singular Value Decomposition
      4.2.12. Independent Component Analysis
5. Data Description
6. Key Uncertainties and Problems in the Data
7. Objectives of the Project
8. Approach
9. Data Gathering and Preprocessing
   9.1. Dealing with Null Values
   9.2. Filling Null Values
10. Insights and Trends
11. Dimensionality Reduction: Feature Selection and Importance
12. Splitting the Dataset into Training, Validation and Test Sets
13. Applying Different Models to the Dataset
   13.1. Statistical Tests between the Top 3 Models
   13.2. Improving the Model
   13.3. Final Modeling and Evaluation
      13.3.1. Confusion Matrix
      13.3.2. Receiver Operating Characteristic (ROC)
      13.3.3. Confusion Matrices of the Applied Models
      13.3.4. ROC Curves of the Applied Models
14. Results and Discussions
15. Conclusion
16. Future Scope of Work
17. References

1. ABSTRACT

One of the severe problems mankind currently faces is terrorist attacks. These attacks involve
pre-planning, complexity and collaboration, due to which counter-terrorism has gained priority in
every country. Counter-terrorism comprises anti-terrorism acts and countermeasures, including the
techniques and tactics used to prevent future attacks. A terrorist attack is among the least predictable
of events; what makes terrorist attacks so frightening is precisely this unpredictability: we never know
how, when or where they will strike. Very few methodologies exist for forecasting attacks, owing to
the lack of real-time data, which is confidential. In this project we propose a way to predict future
attacks, the weapons used and the targets on which attacks might happen, using a class of powerful
machine learning algorithms.

2. INTRODUCTION
Terrorism is an evolving phenomenon: it was happening, is happening, and will happen in the future.
Developing a model that predicts future terrorist attacks would therefore be useful, alerting military
and security personnel with information about what type of attack might happen, in which location,
and with what probability, so that there is some chance of preventing an upcoming attack. This can
reduce the effects of an attack, both security threats, such as loss of victims' lives, and stability threats,
such as long- and short-term economic instability of the country attacked, destruction of infrastructure,
and so on.

Since security of life is a high priority that every government must provide to each of its citizens,
crime activities must be reduced. There is a need to narrow the gap between terrorists and
counter-terrorism by predicting future terrorist attacks accurately, so that impacts such as deaths,
injuries and psychological trauma can be reduced to some extent, and effects on economic stability
can be controlled.

This project produces a model for identifying and preventing future terrorist attacks based on the
information available. Much research has been done on analyzing terrorism incident data around the
world, which can help retrieve patterns or important information that contributes to choosing
appropriate actions to prevent similar types of attacks; however, few experiments have been done on
detecting and preventing future attacks, as this field has emerged only recently. Some of them are:

(1) The Hawkes process, applied to predict terrorist attacks in Northern Ireland, considering 5,000
explosions between 1970 and 1998. The process was used to analyze when and where the Irish
Republican Army (IRA) launched attacks, how British security forces responded, and how effective
those responses were.

(2) Social Network Analysis (SNA), whose goal was to degrade the lethality of a terror network,
i.e. to study what happens when a terrorist is removed from the network, by
(i) quantifying terror network lethality,
(ii) predicting the successor of a removed terrorist, and
(iii) identifying whom to remove.
Some of the activities studied were fake accounts, social media fraud, and malware distribution.

(3) The Terrorist Group Prediction Model (TGPM), which aims to predict the group involved in, or
responsible for, a specific attack, using concepts from crime prediction and group detection models.
It uses clustering and association techniques, which showed a fair degree of accuracy.

(4) A Dynamic Bayesian Network used to predict the likelihood of future attacks, acting as an initial
step in predicting terrorist behavior at critical transport infrastructure facilities.

(5) A feed-forward back-propagation neural network, used in counter-terrorism to predict whether a
person is a terrorist or not. The dataset was collected from a game called Cutting Corners, and the
results showed a success rate between 60% and 68% for correctly identifying terrorist behavior.

(6) Characterizing a terrorist using pattern classification and SNA, which predicts whether a person is
a terrorist or not and achieved 86% accuracy.

(7) Prediction of unsolved attacks using group prediction algorithms, to detect the terrorist group
(such as ISIS) involved in an attack. It considers the solved and unsolved terrorist events of Istanbul,
Turkey between 2003 and 2005, and is based on data mining techniques such as clustering.

3. LITERATURE REVIEW
3.1.Theoretical Analysis

Existing literature has applied network analysis on the theoretical relationships between terrorist
organizations and individual terrorists. Everton (2009), Clauset et al. (2008), and Moon et al. (2007)
offer three distinct approaches to the theoretical study of terrorist network dynamics. Everton (2009) criticizes
the significant application of social network analysis to the study of terrorist networks by arguing that
the focus on identifying and targeting key players within the network is misplaced. He seeks to
understand the specific hierarchical structure within a terrorist organization (centralized leadership,
decentralized, etc.) to help security forces target specific actors in a centralized network to disrupt its
effectiveness. He identifies leadership as highly connected nodes within the network as well as an
understanding of the cosmopolitanism of the network via measurements of the local clustering
coefficients and average path lengths. His explanation is limited to individual organizations, but we
are interested in some of his measures as applied to the global network. Clauset et al. (2008)
further supplemented our understanding of the underlying network structure of terror cells by
building a hierarchical structure of a terrorist network via a maximum likelihood model to infer
connections between nodes. They compared this model to a true terrorist network and found the two
networks similar along metrics like average clustering coefficient and SCC size. Moon et al. (2007)
built a meta-structure of tasks and the agents assigned to complete them in order to infer the players
who would have had contact in order to carry out the attack. The paper then applies social
network analysis techniques by calculating betweenness centrality and total degree centrality
on nodes (labeled as particular figures in the organization) for each of the models they built
to identify key players within each theoretical organizational structure. The paper’s primary
strength is in its innovative approach to understanding why edges between actors within the terrorist
network exist. However, like Everton, its techniques are limited to studying single organizations,
while our paper takes a global approach.

3.2.Case Study
In addition to theoretical analyses that provide foundations to depict the underlying structure
of terrorism networks, researchers have conducted case studies that provide insights into individual
terrorist organizations. Belli et al. provided results from a case study exploring the social network of
internal members of the "Hammound Enterprise", which was involved in trade diversion in order to
finance terrorist organizations in Michigan. They found that in these groups, members are highly
interconnected, making the organization more efficient but also more vulnerable to detection. With
key player analysis, three ringleaders were detected along with a few secondary leaders, most of
whom were Islamist extremists, which shows the property of an idea-centric organization. These
analyses are representative of many of the case studies we have found. Krebs (2002) [5] is
widely referenced in the literature on terrorist network analysis. Following the September 11
attacks, the author used publicly available information about the 19 hijackers to construct a network
of weak and strong ties based on the nature of their relationships with one another. By computing the
degree distribution, betweenness, and closeness of the nodes, the paper depicts a sketch of the covert
network behind the scenes and allows him to identify the clear leader among the hijackers. From the
analysis of the clustering coefficient (0.4) and the average path length among the nodes, the author found
that this covert network "trades efficiency for secrecy" by remaining quite small and operating with
little outside assistance. Krebs’ paper is one of the foundational documents in applying network
analysis to terrorist organizations post-9/11. Because his research was conducted after the fact on a
well-publicized case, he is able to delve into the nature of connections among actors in considerable
depth, which lends credence to his findings about this specific group. However, his work does not
lend itself to application on the larger terrorist landscape, because in most cases, information
about the trust and familial connections among actors in a covert network is difficult to come by
and verify. Thus, the insights gained from particular case studies can inform some of our approach to
larger networks (i.e. identifying key players based on centrality, looking for hubs in a sparse network,
etc.), but their approach is limited in scope and application to future study. Instead of exploring the
internal structures of each group, a global picture of the interaction between different groups may
provide crucial linkage that connects pieces of small cluster networks. For instance, if we can establish
a concrete theory of the collaboration between ISIS and extremist groups within the U.S., we can
potentially trace many of the sources of support for terrorist attacks in the U.S. Another benefit
of studying the global group network of terrorism is that it helps us understand the global shift in the
distribution of terrorism and predict future moves. Though understanding individual organizations has
importance, especially in the context of extremely destructive groups like al Qaeda or ISIS, we feel
that the existing literature insufficiently accounts for the connections between terrorist organizations.
Understanding those relationships can be crucial in inferring and predicting the future of terrorist
organizations and behaviors, as we understand some organizations as behaving similarly or emulating
central organizations to determine their next actions.

4. WHAT IS MACHINE LEARNING?


Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to learn
for themselves.

The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to learn automatically, without
human intervention or assistance, and adjust their actions accordingly.

So what exactly is “machine learning” anyway?

There are some basic common threads, however, and the overarching theme is best summed up by
this oft-quoted statement made by Arthur Samuel way back in 1959:

“Machine Learning is the field of study that gives computers the ability to learn without being
explicitly programmed.”

And more recently, in 1997, Tom Mitchell gave a "well-posed" definition that has proven more useful
to engineering types:

“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with experience
E.” -- Tom Mitchell, Carnegie Mellon University

ML solves problems that cannot be solved by numerical means alone.

4.1. Supervised & Unsupervised Machine Learning Algorithms

Machine learning algorithms are often categorized as supervised or unsupervised.

Supervised machine learning algorithms can apply what has been learned in the past to new data
using labeled examples to predict future events. Starting from the analysis of a known training dataset,
the learning algorithm produces an inferred function to make predictions about the output values. The
system is able to provide targets for any new input after sufficient training. The learning algorithm
can also compare its output with the correct, intended output and find errors in order to modify the
model accordingly.
In contrast, unsupervised machine learning algorithms are used when the information used to train
is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to
describe a hidden structure from unlabeled data. The system doesn’t figure out the right output, but it
explores the data and can draw inferences from datasets to describe hidden structures from unlabeled
data.

There are two other categories apart from supervised and unsupervised:

Semi-supervised machine learning algorithms fall somewhere in between supervised and


unsupervised learning, since they use both labeled and unlabeled data for training – typically a small
amount of labeled data and a large amount of unlabeled data. The systems that use this method are
able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when
acquiring labeled data requires skilled and relevant resources in order to train on it or learn from it,
whereas acquiring unlabeled data generally doesn't require additional resources.

Reinforcement learning is a method that interacts with its environment by producing actions and
discovering errors or rewards. Trial-and-error search and delayed
reward are the most relevant characteristics of reinforcement learning. This method allows machines
and software agents to automatically determine the ideal behavior within a specific context in order
to maximize its performance. Simple reward feedback is required for the agent to learn which action
is best; this is known as the reinforcement signal.
Machine learning enables analysis of massive quantities of data. While it generally delivers faster,
more accurate results in order to identify profitable opportunities or dangerous risks, it may also
require additional time and resources to train it properly. Combining machine learning with AI and
cognitive technologies can make it even more effective in processing large volumes of information.

4.2. Some Common Machine Learning Algorithms


Here is the list of commonly used machine learning algorithms. These algorithms can be
applied to almost any data problem:

SUPERVISED LEARNING ALGORITHMS

4.2.1. Decision Trees:


A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their
possible consequences, including chance-event outcomes, resource costs, and utility.

[Figure: an example decision tree]

From a business decision point of view, a decision tree is the minimum number of yes/no questions
that one has to ask, to assess the probability of making a correct decision, most of the time. As a
method, it allows you to approach the problem in a structured and systematic way to arrive at a logical
conclusion.
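
As an illustrative sketch (not the report's own code), a decision tree classifier can be fitted in a few lines with scikit-learn:

# Minimal illustrative sketch: fitting a decision tree with scikit-learn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # a shallow tree keeps the yes/no questions readable
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted classes for the first five samples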

4.2.2. Naive Bayes Classification:


Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes'
theorem with strong (naive) independence assumptions between the features. The underlying equation is

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the class prior probability,
and P(B) is the predictor prior probability.


Some real-world examples are:


 To mark an email as spam or not spam
 To classify a news article as technology, politics, or sports
 To check whether a piece of text expresses positive or negative emotion
 Face recognition software
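
A minimal sketch of the spam-marking example above, assuming a multinomial naive Bayes over toy word counts (illustrative data, not the report's code):

# Illustrative sketch: naive Bayes spam classification on toy data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

vec = CountVectorizer()
X = vec.fit_transform(texts)          # word-count features
clf = MultinomialNB().fit(X, labels)  # Bayes' theorem with the naive independence assumption
print(clf.predict(vec.transform(["free money offer"])))  # likely [1] (spam)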

4.2.3. Random Forest:


A random forest algorithm was designed to address some of the limitations of decision trees. A
random forest comprises many decision trees, each a graph of decisions representing a course of
action or statistical probability; the individual trees are of the kind known as Classification and
Regression Tree (CART) models.
To classify an object based on its attributes, each tree gives a classification, which is said to "vote" for
that class. The forest then chooses the classification with the greatest number of votes. For regression,
it takes the average of the outputs of the different trees.

Random Forest works in the following way:

1. Assume the number of cases is N. A sample of these N cases is taken as the training set for each tree.
2. Let M be the number of input variables; a number m is selected such that m < M. At each node, the
best split over m randomly chosen variables is used to split the node. The value of m is held constant
as the trees are grown.
3. Each tree is grown as large as possible.
4. New data is predicted by aggregating the predictions of the n trees (majority vote for
classification, average for regression).
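
A hedged sketch of this procedure using scikit-learn's RandomForestClassifier on synthetic data (not the GTD):

# Illustrative sketch: random forest with m = sqrt(M) features tried at each split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100,     # the number of trees, n
                            max_features="sqrt",  # m < M variables considered per split
                            random_state=0)
rf.fit(X, y)  # each tree votes; the forest aggregates the votes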

4.2.4. Linear Regression


Initially developed in statistics to study the relationship between input and output numerical variables,
it was adopted by the machine learning community to make predictions based on the linear regression
equation.
The mathematical representation of linear regression is a linear equation that combines a specific set
of input data (x) to predict the output value (y) for that set of input values. The linear equation assigns
a factor to each input value; these factors are called the coefficients and are represented by the Greek
letter beta (β).
The equation below represents a linear regression model with two input variables, x1 and x2; y
represents the output of the model, and β0, β1 and β2 are the coefficients of the linear equation.

y = β0 + β1·x1 + β2·x2
When there is only one input variable, the linear equation represents a straight line. For simplicity,
consider β2 to be equal to zero, which would imply that the variable x2 will not influence the output of
the linear regression model. In this case, the linear regression will represent a straight line and its
equation is shown below.
y = β0 + β1x1
[Figure: graph of the linear regression line]

Linear regression can be used to find the general price trend of a stock over a period of time. This helps
us understand if the price movement is positive or negative.
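
For illustration, the two-variable equation above can be recovered from synthetic data (a sketch, with made-up coefficients):

# Illustrative sketch: fitting y = b0 + b1*x1 + b2*x2 on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # approximately 2.0 and [3.0, -1.0]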

4.2.5. Logistic regression


In logistic regression, the aim is to produce a discrete value, either 1 or 0. This helps us find a
definite answer for our scenario.
The logistic regression model computes a weighted sum of the input variables, similar to linear
regression, but it runs the result through a special non-linear function, the logistic or sigmoid
function, to produce the output y.
The sigmoid/logistic function is given by the following equation:

y = 1 / (1 + e^(−x))

In general, regressions can be used in real-world applications such as:


 Credit Scoring
 Measuring the success rates of marketing campaigns
 Predicting the revenues of a certain product
 Is there going to be an earthquake on a particular day?
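
A minimal sketch of the weighted sum plus sigmoid described above, with hypothetical weights:

# Illustrative sketch: logistic regression output = sigmoid(weighted sum of inputs)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])  # hypothetical learned weights
b = 0.3                    # hypothetical bias term
x = np.array([0.4, 0.1])   # one input example

p = sigmoid(w @ x + b)     # a probability in (0, 1)
print(int(p >= 0.5))       # thresholding at 0.5 gives the discrete 0/1 answer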

4.2.6. KNN Classification


KNN is short for the K-nearest neighbor method, in which the user specifies the value of K. Unlike
previous algorithms, this one trains on the entire dataset.
The goal of KNN is to predict an outcome for a new data instance. The algorithm trains the machine
to check the entire dataset to find the k-nearest instances to this new data instance or to find the k-
number of instances that are most similar to the new instance.

The prediction, or output, is one of two things:


 The mode or most frequent class, in a classification problem
 The mean of the outcomes, in a regression problem

This algorithm usually employs methods to determine proximity such as Euclidean distance and
Hamming distance.

The pros of KNN are its simplicity and ease of use. Though it can require a lot of memory to store
large datasets, it only calculates (learns) at the moment a prediction is needed.

Consider the task of classifying a green circle into class 1 or class 2. With KNN based on the
1-nearest neighbour, the green circle is classified into class 1. Now increase the number of nearest
neighbours to 3. Suppose that among the three nearest neighbours there are two class 2 objects and
one class 1 object; KNN then classifies the green circle into class 2, as that class forms the majority.
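
The green-circle example can be mimicked with scikit-learn's KNeighborsClassifier (an illustrative sketch with made-up points):

# Illustrative sketch: k-nearest neighbour classification with k = 3
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]  # toy points
y = [1, 1, 1, 2, 2, 2]                                # class 1 and class 2

knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance by default
knn.fit(X, y)                              # "training" simply stores the dataset
print(knn.predict([[0.5, 0.5]]))           # majority of the 3 nearest -> class 1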

4.2.7. Support Vector Machine (SVM)


Support Vector Machine was initially used for data analysis. Initially, a set of training examples is fed
into the SVM algorithm, belonging to one or the other category. The algorithm then builds a model
that starts assigning new data to one of the categories that it has learned in the training phase.
In the SVM algorithm, a hyperplane is created which serves as a demarcation between the categories.
When the SVM algorithm processes a new data point, the point is classified into one of the classes
depending on the side of the hyperplane on which it appears.

When related to trading, an SVM algorithm can be built which categorises equity data into
favourable buy, sell or neutral classes and then classifies new test data according to the learned rules.
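
A hedged sketch of a linear SVM separating two toy categories with a hyperplane:

# Illustrative sketch: linear SVM hyperplane between two categories
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear").fit(X, y)      # learns the separating hyperplane
print(svm.predict([[1.5, 1.5], [9, 8]]))  # the side of the hyperplane decides the class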

4.2.8. Ensemble Methods:


Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data
points by taking a weighted vote of their predictions. The original ensemble method is Bayesian
averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting.


So how do ensemble methods work and why are they superior to individual models?
 They average out biases: if you average a bunch of democratic-leaning polls and republican-leaning
polls together, you will get an average that isn't leaning either way.
 They reduce the variance: The aggregate opinion of a bunch of models is less noisy than the single
opinion of one of the models. In finance, this is called diversification — a mixed portfolio of many
stocks will be much less variable than just one of the stocks alone. This is why your models will be
better with more data points rather than fewer.
 They are unlikely to over-fit: If you have individual models that didn’t over-fit, and you are
combining the predictions from each model in a simple way (average, weighted average, logistic
regression), then there’s no room for over-fitting.

UNSUPERVISED LEARNING ALGORITHMS

4.2.9. Clustering Algorithms:


Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more
similar to each other than to those in other groups.


Every clustering algorithm is different; here are a few of the main families:

 Centroid-based algorithms
 Connectivity-based algorithms
 Density-based algorithms
 Probabilistic algorithms
 Dimensionality-reduction-based approaches
 Neural networks / deep learning

4.2.10. Principal Component Analysis:


PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations
of possibly correlated variables into a set of values of linearly uncorrelated variables called principal
components.


Some of the applications of PCA include compression, simplifying data for easier learning,
visualization. Notice that domain knowledge is very important while choosing whether to go forward
with PCA or not. It is not suitable in cases where data is noisy.

4.2.11. Singular Value Decomposition:


In linear algebra, SVD is a factorization of a real or complex matrix. For a given m × n matrix M, there
exists a decomposition such that M = UΣV^T, where U and V are unitary matrices and Σ is a diagonal
matrix of singular values.


PCA is actually a simple application of SVD. In computer vision, the first face recognition algorithms
used PCA and SVD in order to represent faces as a linear combination of “eigenfaces”, do
dimensionality reduction, and then match faces to identities via simple methods; although modern
methods are much more sophisticated, many still depend on similar techniques.

4.2.12. Independent Component Analysis:


ICA is a statistical technique for revealing hidden factors that underlie sets of random variables,
measurements, or signals. ICA defines a generative model for the observed multivariate data, which is
typically given as a large database of samples. In the model, the data variables are assumed to be linear
mixtures of some unknown latent variables, and the mixing system is also unknown. The latent
variables are assumed to be non-Gaussian and mutually independent, and they are called the
independent components of the observed data.


ICA is related to PCA, but it is a much more powerful technique that is capable of finding the
underlying factors of sources when these classic methods fail completely. Its applications include
digital images, document databases, economic indicators and psychometric measurements.

5. DATA DESCRIPTION
Information on more than 180,000 Terrorist Attacks
The Global Terrorism Database (GTD) is an open-source database including information on terrorist
attacks around the world from 1970 through 2017. The GTD includes systematic data on domestic as
well as international terrorist incidents that have occurred during this time period and now includes
more than 180,000 attacks. The database is maintained by researchers at the National Consortium for
the Study of Terrorism and Responses to Terrorism (START), headquartered at the University
of Maryland.

Content

Geography: worldwide
Time period: 1970–2017
Unit of analysis: attack
Variables: >100 variables on location, tactics, perpetrators, targets, and outcomes
Sources: unclassified media articles

(Note: Please interpret changes over time with caution. Global patterns are driven by diverse trends
in particular regions, and data collection is influenced by fluctuations in access to media coverage
over both time and place.)

Some of the variables explained:

eventid : A 12-digit Event ID system. First 8 numbers – date recorded “yyyymmdd”. Last 4
numbers – sequential case number for the given day (0001, 0002 etc).
iyear : This field contains the year in which the incident occurred.
imonth : This field contains the number of the month in which the incident occurred.
iday : This field contains the numeric day of the month on which the incident occurred.
latitude : The latitude of the city in which the event occurred.
longitude : The longitude of the city in which the event occurred.
extended : Whether the duration of the incident extended beyond 24 hours (1) or was less than 24 hours (0).
multiple : The attack is part of a multiple incident (1) - Not part of a multiple incident (0).
suicide : The incident was a suicide attack (1) -OR- no indication of a suicide attack (0).
attacktype : Assassination(1), Hijacking(2), Kidnapping(3), Barricade Incident(4),
Bombing/Explosion(5), Armed Assault(6), Unarmed Assault(7),
Facility/Infrastructure Attack(8), Unknown(9)
targtype :22 categories ranging from Business(1), Government(general)(2), Police(3),...
Utilities(21), Violent Political Parties(22)
natlty : Nationality of target/victim
individual : Whether the attack was carried out by an individual or several individuals not known to
be affiliated with a group or organization (1), or by perpetrators affiliated with a group or
organization (0).
claimed : A group or person claimed responsibility for the attack (1) - No claim of responsibility
was made(0).
weaptype : 13 categories ranging from Biological (1), Chemical (2), Radiological (3), ... Other (12),
Unknown (13)
nkill : Total number of fatalities, including all victims and attackers who died.
nkillter : Limited to perpetrator fatalities only.
nwound : Total number of injured victims and attackers.
nwoundte : Total number of injured perpetrators.
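
As a sketch of how the eventid encoding above can be unpacked with pandas (the file name is a placeholder for the GTD export):

# Illustrative sketch: loading the GTD and splitting eventid into date and case number
import pandas as pd

df = pd.read_csv("globalterrorismdb.csv", encoding="ISO-8859-1", low_memory=False)  # placeholder file name

eid = df["eventid"].astype(str)
df["recorded_date"] = eid.str[:8]            # first 8 digits: yyyymmdd
df["case_number"] = eid.str[8:].astype(int)  # last 4 digits: sequential case for that day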

[Screenshot omitted: full list of dataset attributes]

6. KEY UNCERTAINTIES & PROBLEMS IN THE DATA


One key conclusion that should be drawn from this analysis is just how difficult it is to rely on a given
set of data. Another is that analyses which ignore the gross levels of uncertainty, and the conflicts in
the information provided by the different sets of open-source data now available, are likely to have
little legitimacy and to be more misleading than useful.

The key problems involved include:

 No agreed definition of terrorism, or how to define or measure any key metric.


 Reliance on media sources or unstated sources; the cancellation of NCTC and all official
U.S. public reporting on trends and data, with only the EU and Europol providing detailed
and credible official estimates.
 Radical differences in level of reporting by region, and a lack of credible data in Central and East
Asia.
 No reporting on state terrorism, and ignoring large scale killing of civilians in countries like
Syria.
 Failure to report ranges in many key areas of large-scale uncertainty –particularly in terms of
how terrorist incidents are defined and counted, and data on targets, perpetrators, and
casualties.
 Failure to clearly distinguish between insurgency and terrorism —a key problem in every state
where there is some form of active civil conflict.
 Labelling of asymmetric threats and enemies as "terrorist" for political purposes, regardless of
the real character of the fighting or actions involved and methods of attack.
 Failure to distinguish ethnicity, sect, tribe, and other key data driving the patterns of terrorism.
 Constant changes in method of analysis and reporting, and unclear historical comparability of
data shown.
 Lack of clear handling of hate crimes in collecting terrorism data.
 Focus on ideology and religion rather than the full range of causes of terrorism.

Different definitions and listings of perpetrators, and very different counts and characterizations
of perpetrator actions, compounded by the lack of clear definitions of terrorist versus insurgent
actions; and a lack of clear methods for reporting attacks and incidents where perpetrators cannot be
identified.
Serious limitations to the search and graphing functions of the given databases, e.g. the inability to
search for perpetrators in each country or region, or to get totals or a range for casualties.
The reader should also be aware that the START database and the statistical annex to the
State Department Country Reports on Terrorism do provide a full range of caveats about the
definitions used and the uncertainties in the data, and that the START data offer three levels of
confidence. These caveats are not reported in detail here, in order to limit the length of this report.

7. OBJECTIVES OF THE PROJECT:


It has been a month since the massive Easter bombings in Sri Lanka took the lives of 300 innocent
civilians, many of them noted to be international tourists who had made their way to the island to enjoy
its peace, food and sandy beaches. As with any other terrorist attack, there is always the question of
"could it have been prevented?". Over the following week, we learned that the Sri Lankan government
had access to intelligence provided by the Indian intelligence agencies predicting a massive terrorist
strike, the report even including the name of the leader, Zahran Hashim. The series of suicide bombings
in churches across Sri Lanka came in the wake of the deadly shootings in mosques across New Zealand,
which clearly suggests a relationship between the attacks. The above instance is just one incident where
two attacks within a time frame are linked together and could potentially have been prevented.

The idea behind this project is to find relations between various terrorist attacks, while also providing
deep insight into all previous attacks, notorious organizations and some of the countries most affected
by these deadly attacks.

The main objective is to predict the success of terrorist attacks with as high an accuracy as possible.

Also following are some of the vital questions that we are trying to answer using our visualizations:

1. How has terrorism evolved over the years? - Visualized through animations
2. What regions of the world are the most affected? - Facilitates drawing a comparison between 2
regions, for example North America and South America.
3. What are the most notorious organizations on a global scale and what countries have these
organizations terrorized?
4. What are the worst 40 attacks in the world, which organization claims responsibility and during
what year did the attacks occur?
5. What is the most common type of attacks and weapon of choice?
6. What is the nationality of the target?
7. Which country has the highest number of casualties?
8. Visualization of the terrorist attacks that remain unclaimed over the years

The above questions are just a few of those outlined in the report; the visualizations themselves
provide much more information as you interact with them, using various filters that modify the results.

8. APPROACH
First, the raw data will be tidied and prepared for analysis. Then, detailed exploratory data analysis
and time series analysis will be conducted to identify the most important factors that motivate
terrorism and to reveal insights and trends.

[Figure: basic approach to a machine learning problem]

The steps involved are –

INITIAL ANALYSIS

 Univariate Analysis –
- Detailed data dictionary / list all variables and understand what each variable means and represents
- The source of each variable (directly measured, or calculated based on other variables; does the
variable have a time domain associated with it?)
- Decide on the dependent (target) variable
- Assign the correct data types and appropriate column names
- For numeric attributes, check the five-number summary (min, 1st quartile, median, 3rd quartile, max)
- For categorical attributes, check and decide on the levels and their frequencies
- Check and deal with data inconsistencies, missing values, errors, duplicates, outliers (boxplots),
numeric signs, upper and lower cases, spaces or special characters in strings
- Check the distributions of the variables: normal distribution?
- Low variance filter
- Check the imbalance in the dependent variable
- Check the time variable
- Univariate visualizations

 Bivariate Analysis –
- Pairwise relations
- Pairwise visualizations, like scatter plots
- Correlation analysis (Spearman or Pearson)

 Multivariate Analysis –
- Relations between more than 2 variables
- Statistical tools such as one-way ANOVA (or rank-based tests) to compare the means

EXPLORATORY DATA ANALYSIS


 Normalizing / scaling
 Sub-setting the data
 Decision rules, association rules, n-grams
 Clustering, such as k-means
 Hypothesis testing
 Correlation analysis
 NLP analysis of some text attributes using word clouds
 Time series analysis

DIMENSIONALITY REDUCTION
 Remove attributes with too many missing values
 Remove attributes with zero or very low variance
 For pairs of highly correlated attributes, remove one of the two – prefer the one with more missing
values or lower variance
 Feature selection (decide on the importance of each attribute using statistical measures like
information gain or the Gini index; forward selection and backward elimination)

EXPERIMENTAL DESIGN
 Randomizing and splitting the data into training, validation and test sets
 Treatment for imbalance (under-sampling the majority class and over-sampling the minority class)
 Cross validation, such as 10-fold

MODELING
 Building the Classification Models
 Train the model
 Validate the model

EVALUATION
 Classification metrics (confusion matrix, ROC (receiver operating characteristic), accuracy,
recall, precision)
 Compare performance between different models (contingency tables, multivariate analysis of
variance)

IMPROVING THE MODEL


 Deal with suspiciously high accuracy (fixation variables, overfitting)
 Deal with very low performance (may need new variables or more observations)
 Iterate to improve the model performance

INFERENCES
 Discussions
 Threats to validity (internal, external, construct) and proposed solutions that may mitigate these
threats

9. DATA GATHERING AND PREPROCESSING


The first step of this project is data gathering and preparation. The data is gathered from the
well-known GTD data source, which reports terrorism incidents across the globe from 1970 to 2017.
The collected dataset provides information on every event with respect to the date and locality of the
event, the armaments used, the number of fatalities and, most importantly, the responsible group.
A few preprocessing steps are followed in order to increase quality attributes such as the efficiency,
specificity, sensitivity and, most importantly, the accuracy of the classification and prediction process.

Following are the steps which have been followed in data preparation process:

Data cleaning: This is a core step of data preprocessing, in which inconsistent and noisy data are
removed or reduced. Missing values in the dataset are also treated here: records with missing values
are identified and removed from the data sets.

Feature Selection: To increase the probability of accurate results, it is necessary to include only
relevant data and to exclude redundant attributes. Feature selection is performed to select only the
relevant attributes of the dataset. Only seven attributes are selected for this step, where one attribute
acts as the class label and the rest are regular attributes.

Data Conversion: Data conversion is the process of converting data from one form to another
required form. The GTD data is converted from its categorical to a numerical version to improve
the results.

Select only records that meet the terrorism attack criteria:

 Criterion 1: The act must be aimed at attaining a political, economic, religious, or social goal.
 Criterion 2: There must be evidence of an intention to coerce, intimidate, or convey some other
message to a larger audience (or audiences) than the immediate victims.
 Criterion 3: The action must be outside the context of legitimate warfare activities.
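
Assuming the GTD's binary crit1/crit2/crit3 fields encode these three criteria, the filtering step can be sketched as:

# Illustrative sketch: keep only records meeting all three terrorism criteria
# (assumes the GTD's crit1, crit2, crit3 columns encode the criteria as 1/0)
df = df[(df["crit1"] == 1) & (df["crit2"] == 1) & (df["crit3"] == 1)]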

9.1. DEALING WITH NULL VALUES:


Removing attributes with more than 75% null values:
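
A minimal pandas sketch of this step:

# Illustrative sketch: drop attributes where more than 75% of the values are null
null_fraction = df.isnull().mean()  # fraction of nulls per column
df = df.drop(columns=null_fraction[null_fraction > 0.75].index)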

9.2. FILLING NULL VALUES
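
One plausible way to fill the remaining null values (a sketch assuming a pandas DataFrame df; the median for numeric attributes and a sentinel for categorical ones):

# Illustrative sketch: fill remaining nulls (median for numeric, sentinel for categorical)
import pandas as pd

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("Unknown")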



10. INSIGHTS AND TRENDS


Correlation graph showing correlation levels between attributes:

Yearly distribution of the number of attacks: a sudden jump after 2007, with an increasing trend.

Regional distribution of the number of attacks: the top two regions are the Middle East and South
Asia.

Country distribution of the number of attacks: the top three countries are Iraq, Pakistan and Afghanistan.

City distribution of the number of attacks: the top two cities are Baghdad and Karachi.



Line graph of the numbers wounded and killed over the years:

Line graph of casualties (wounded + killed) over the years:



Attack type distribution:


Bombing and armed assault are the most common attack types.

Weapon type distribution:


Explosives and firearms are the most common weapon types used by terrorist groups.

Distribution of the number of attacks by terrorist group: the top two groups are the Taliban and ISIL.

Multi-line graph showing the number of terrorist attacks per region over the years:

Multi-line graph showing the number of terrorist attacks per attack type over the years:

Multi-line graph showing the number of terrorist attacks per weapon type over the years:

Line graph showing the number of casualties over the years:

Looking at the above graph, there are some interesting peaks, for example the 2001 peak due to the
9/11 attack; the tables below show some statistics for the incidents behind those peaks.

Distribution of the number of terrorist attacks and the number of casualties per region: the Middle
East and North Africa are the top regions.

Distribution of the number of terrorist attacks and the number of successful attacks per region:

Distribution of the number of terrorist attacks by target nationality:



Distribution of the number of terrorist attacks by target:

Civilians are the most targeted.

Distribution of the number of terrorist attacks and the number of successful attacks per target type:

Distribution of the number of terrorist attacks and the number of casualties per target type:

Distribution of the number of terrorist attacks per weapon type:



The plot shows, for each terrorist group, the countries in which it is active:

The plot shows the most popular weapon types used by each terrorist group:

The plot shows the most popular attack types of each terrorist group:

11. DIMENSIONALITY REDUCTION: FEATURE SELECTION AND IMPORTANCE

Dimensionality reduction based on feature selection:

The number of features is reduced to 22; these selected features will be used in our models.
Target variable – "success".
We will now apply different models to predict the success of terrorist attacks based on these
selected features.
The models will be evaluated and compared, and the model with the best accuracy will be selected
for prediction.
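
A hedged sketch of ranking features by random forest importance and keeping the top 22 (assumes the categorical attributes have already been numerically encoded, as in the data conversion step):

# Illustrative sketch: rank features by random forest importance and keep the top 22
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=["success"])  # "success" is the target variable
y = df["success"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
selected = importances.sort_values(ascending=False).head(22).index
X = X[selected]  # keep only the 22 most important features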

12. SPLITTING THE DATASET INTO TRAINING, VALIDATION AND TEST SETS
Training Dataset: The sample of data used to fit the model.
The actual dataset that we use to train the model (weights and biases in the case of Neural Network).
The model sees and learns from this data.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the
training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on
the validation dataset is incorporated into the model configuration.
The validation set is used to evaluate a given model, but this is for frequent evaluation. We as machine
learning engineers use this data to fine-tune the model hyperparameters. Hence the model
occasionally sees this data, but never does it “Learn” from this. We use the validation set results and
update higher level hyperparameters. So the validation set in a way affects a model, but indirectly.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the
training dataset.
The Test dataset provides the gold standard used to evaluate the model. It is only used once a model is
completely trained (using the train and validation sets). The test set is generally what is used to evaluate
competing models.

Often the validation set is used as the test set, but this is not good practice. The test set is generally
well curated. It contains carefully sampled data spanning the various classes that the model would
face when used in the real world.

Splitting the dataset and checking the shapes of the training, validation and test sets:
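
A sketch of this split with scikit-learn, assuming the X/y arrays from the feature selection step (an illustrative 60/20/20 split):

# Illustrative sketch: split into ~60% train, ~20% validation, ~20% test
from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)  # hold out the test set first
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0,
    stratify=y_trainval)                               # 0.25 of the remaining 80% = 20% validation

print(X_train.shape, X_val.shape, X_test.shape)        # check the shapes of the three sets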



13. APPLYING DIFFERENT MODELS TO THE DATASET


Here we have used k-fold cross validation to rank the models' scores.
In k-fold cross validation, the data is divided into k subsets; the model is trained on k−1 subsets and
the remaining one is held out for testing. This process is repeated k times, so that each of the k subsets
is used once as the test/validation set while the other k−1 subsets together form the training set. We
then average the model's performance across the folds and finalize our model. After that, we test it
against the test set.

The value of k that we have taken is k = 10.
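
A minimal sketch of scoring one candidate model with 10-fold cross validation (illustrative; the same call can be repeated for each model being ranked):

# Illustrative sketch: 10-fold cross validation for one candidate model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_train, y_train, cv=10)  # k = 10 folds
print(scores.mean(), scores.std())                 # accuracy averaged over the folds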

Here we see that Random Forest, Decision Tree and Gradient Boosting give the best accuracy
scores on the training set, so we will try to improve the scores of these top three algorithms.

13.1. STATISTICAL TESTS BETWEEN TOP 3 MODELS:

The null hypothesis H0 was rejected: μ1 ≠ μ2 ≠ μ3.

There are statistically significant differences between the groups.
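
One way to run such a test is a one-way ANOVA over the per-fold scores of the three models, as mentioned in the Approach section (a sketch with placeholder numbers; substitute the actual cross-validation scores):

# Illustrative sketch: one-way ANOVA over per-fold accuracy scores of the top 3 models
from scipy import stats

rf_scores = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.93, 0.92]  # placeholder values
dt_scores = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.87, 0.88, 0.90, 0.89]
gb_scores = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.89, 0.90, 0.92, 0.91]

f_stat, p_value = stats.f_oneway(rf_scores, dt_scores, gb_scores)
print(f_stat, p_value)  # p < 0.05 -> reject H0 that the mean scores are equal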



13.2. IMPROVING THE MODEL


Model evaluation, calibration and hyperparameter tuning:
GridSearchCV and validation curves are used to tune the model hyperparameters.

Hyperparameter Tuning
Hyperparameters are hugely important in getting good performance from models. To understand
this process, we first need to understand the difference between a model parameter and a model
hyperparameter.
Model parameters are internal to the model; their values can be estimated from the data, and we are
often trying to estimate them as well as possible. Hyperparameters, by contrast, are external to the
model and cannot be learned directly from the regular training process. These parameters express
"higher-level" properties of the model, such as its complexity or how fast it should learn, and are
'fixed' before you even train and test your model on the data.
The process for finding the right hyperparameters is still somewhat of a dark art, and it currently
involves either random search or grid search across Cartesian products of sets of hyperparameters.

Grid Search – GridSearchCV takes a dictionary of all the hyperparameter values that we want to
test, feeds every combination through the algorithm, and reports back which combination had the
highest accuracy.
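
A sketch of grid search for a random forest (hypothetical grid values; assumes the training arrays from the split above):

# Illustrative sketch: GridSearchCV over a small hyperparameter grid
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {  # hypothetical grid values
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)  # the combination with the highest accuracy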

13.3. FINAL MODELING AND EVALUATION

13.3.1. Confusion Matrix:


A confusion matrix is a summary of prediction results on a classification problem. The number of
correct and incorrect predictions are summarized with count values and broken down by each class.
This is the key to the confusion matrix. The confusion matrix shows the ways in which your
classification model is confused when it makes predictions. It gives us insight not only into the errors
being made by a classifier but more importantly the types of errors that are being made.

Here,
 Class 1: Positive
 Class 2: Negative

Definition of the Terms:


 Positive (P) : Observation is positive (for example: is an apple).
 Negative (N) : Observation is not positive (for example: is not an apple).
 True Positive (TP) : Observation is positive, and is predicted to be positive.
 False Negative (FN) : Observation is positive, but is predicted negative.
 True Negative (TN) : Observation is negative, and is predicted to be negative.
 False Positive (FP) : Observation is negative, but is predicted positive.

Classification Rate/Accuracy:
The classification rate, or accuracy, is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99%
accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.
Recall:
Recall can be defined as the ratio of the total number of correctly classified positive examples
divided by the total number of positive examples. High recall indicates that the class is correctly
recognized (a small number of FN).
Recall is given by the relation:

Recall = TP / (TP + FN)
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by
the total number of predicted positive examples. High precision indicates that an example labeled as
positive is indeed positive (a small number of FP). Precision is given by the relation:

Precision = TP / (TP + FP)

High recall, low precision: This means that most of the positive examples are correctly recognized
(low FN) but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those
we predict as positive are indeed positive (low FP).

F-measure: Since we have two measures (precision and recall), it helps to have a measurement that
represents both of them. We calculate the F-measure, which uses the harmonic mean in place of the
arithmetic mean, as it punishes extreme values more:

F = 2 · (Precision · Recall) / (Precision + Recall)

The F-measure will always be nearer to the smaller of precision and recall.
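
The four counts and the metrics above can be computed directly with scikit-learn (a sketch on toy labels):

# Illustrative sketch: confusion-matrix counts and derived metrics on toy labels
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             recall_score, precision_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # TP=3, FP=1, FN=1, TN=3
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall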

13.3.2. Receiver Operating Characteristic (ROC) curve metric:


An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:

 True Positive Rate


 False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.

[Figure: TPR vs. FPR at different classification thresholds]
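
A sketch of plotting TPR against FPR across thresholds from predicted probabilities (toy scores):

# Illustrative sketch: ROC curve and AUC from toy predicted probabilities
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]  # model's probability for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve

plt.plot(fpr, tpr)                        # the ROC curve itself
plt.plot([0, 1], [0, 1], linestyle="--")  # chance diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()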



13.3.3. CONFUSION MATRICES OF THE APPLIED MODELS



13.3.4. ROC CURVE OF THE APPLIED MODELS



14. RESULTS AND DISCUSSIONS

After conducting detailed exploratory data analysis and time series analysis, the following insights
and trends were revealed:
 The two major attack types are Bombing/Explosion and Armed Assault, which indicates the
misuse of science and technology against humanity.
 India, Pakistan and Afghanistan have seen thousands of terrorist acts, which is a worrying
factor.
 Top 5 Indian cities that have seen the most terrorist acts:
 Srinagar
 Imphal
 New Delhi
 Amritsar
 Sopore

 Top 5 cities from Iraq that have seen the most terrorist acts:
 Baghdad
 Mosul
 Kirkuk
 Baqubah
 Fallujah

 Top 5 cities from Afghanistan that have seen the most terrorist acts:
 Kabul
 Kandahar
 Jalalabad
 Lashkar Gah
 Ghazni

 Top 5 cities from Pakistan that have seen the most terrorist acts:
 Karachi
 Peshawar
 Quetta
 Lahore
 Jamrud

 Top terrorist groups:


 Taliban
 Islamic State of Iraq and the Levant (ISIL)
 Shining Path (SL)
 Farabundo Marti National Liberation Front (FMLN)
 Al-Shabaab
 New People's Army (NPA)
 Irish Republican Army (IRA)
 Revolutionary Armed Forces of Colombia (FARC)
 Boko Haram
 Kurdistan Workers' Party (PKK)
 Basque Fatherland and Freedom (ETA)
 Communist Party of India Maoist (CPI-Maoist)
 Maoists
 Liberation Tigers of Tamil Eelam (LTTE)
 National Liberation Army of Colombia (ELN)
 Tehrik-i-Taliban Pakistan (TTP)
 Palestinians
 Houthi extremists (Ansar Allah)
 Al-Qaida in the Arabian Peninsula (AQAP)
 Nicaraguan Democratic Force (FDN)
 Manuel Rodriguez Patriotic Front (FPMR)
 Sikh Extremists
 Corsican National Liberation Front (FLNC)
 Al-Qaida in Iraq

 Analysis of the top 5 active terrorist organizations and their presence showed that:
 The Taliban has waged a war against Afghanistan, and the number of attacks has
increased in the last few years.
 Peru suffered the most at the hands of the Shining Path during the 1980s and 1990s.
 Islamic State of Iraq and the Levant (ISIL): the origin of this terror group is well
known, but ISIL has started waging war against the neighboring European countries too.
Belgium and Russia each saw 2 incidents in 2016, and France saw 9 attacks in 2015.
 Farabundo Marti National Liberation Front (FMLN) gave El Salvador a very hard time
between the 1980s and 1990s.
 Al-Shabaab is the most recent of these organizations and constantly targets Somalia.
 Boko Haram has constantly targeted Nigeria over the last few years.

 The top 4 terrorist groups below either claim their acts directly or have recruited
educated young minds to post their acts on websites and blogs, using whichever medium
spreads the news faster:

1) Taliban
2) Islamic State of Iraq and the Levant (ISIL)
3) Al-Shabaab
4) Tehrik-i-Taliban Pakistan (TTP)

 Maoist forces have been increasingly active after 2014, which is exactly when India saw
a change of central government after a decade.

ISIL, Boko Haram and the Taliban are the deadliest organizations the world has ever witnessed.
The increasing number of suicide attacks from 2013 to 2017 is a worrying factor, and it is
likely to increase further.

Why are these organizations the deadliest?

If these terrorist organizations have managed to brainwash people in great numbers into believing
that their sacrifice will earn them a place in Heaven, then imagine how, in the coming years, they
may succeed in building a large army ready to lay down its life for their cause.

FINAL SCORES ON TEST DATASET

We have applied 7 different models to the GTD dataset to predict the success of terrorist
attacks around the world.
The prediction was made using these 22 features:

'iyear', 'imonth', 'iday', 'country_txt',
'region_txt', 'provstate', 'city', 'latitude',
'longitude', 'attacktype1_txt', 'weaptype1_txt', 'targtype1_txt',
'target1', 'nperps', 'nperpcap', 'nkillter',
'claimed', 'gname', 'nkill', 'nwound',
'natlty1_txt', 'property'
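
A minimal sketch of the feature selection and train/test split, assuming the preprocessed GTD
has been loaded into a pandas DataFrame df and that the GTD's binary 'success' column is the
prediction target (the 80/20 split ratio and random seed are illustrative choices, not values
fixed by this report):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    features = ['iyear', 'imonth', 'iday', 'country_txt',
                'region_txt', 'provstate', 'city', 'latitude',
                'longitude', 'attacktype1_txt', 'weaptype1_txt', 'targtype1_txt',
                'target1', 'nperps', 'nperpcap', 'nkillter',
                'claimed', 'gname', 'nkill', 'nwound',
                'natlty1_txt', 'property']

    X = df[features]             # df: preprocessed GTD DataFrame (assumed)
    y = df['success']            # 1 = attack succeeded, 0 = it failed (assumed target)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)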

The Random Forest Classifier, Decision Tree Classifier and Gradient Tree Boosting algorithms gave
the best results on the dataset:

MODEL                        Accuracy   Precision   Recall   F1-Score

Random Forest Classifier     0.934      0.93        0.93     0.93
Decision Tree Classifier     0.904      0.90        0.90     0.90
Gradient Tree Boosting       0.86       0.86        0.86     0.86
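
The comparison above can be reproduced along the following lines (a sketch; it assumes the
X_train/X_test/y_train/y_test split from the previous section with all categorical features
already encoded to numeric values, and uses default hyperparameters rather than the tuned
ones, so the scores will differ somewhat):

    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, classification_report

    models = {
        "Random Forest Classifier": RandomForestClassifier(random_state=42),
        "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
        "Gradient Tree Boosting":   GradientBoostingClassifier(random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(name, "accuracy:", accuracy_score(y_test, y_pred))
        print(classification_report(y_test, y_pred))  # per-class precision/recall/F1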

The confusion matrix is used as an evaluation tool.

 Random Forest was the best classifier, with > 90% precision, recall and F1-score.
 The second-best model was the Decision Tree, with almost 90% precision, recall and F1-score.
 The third was Gradient Tree Boosting, with about 86% precision, recall and F1-score.

The ROC curve is also used for evaluation.

 For the Random Forest, the area under the curve is about 97%.
 For the second model, the Decision Tree, the area under the curve is about 92%.

Therefore the Random Forest classification model is our finally accepted model for predicting the
success of terrorist attacks, as it provides the maximum scores for precision, accuracy and recall.

15. CONCLUSION
The report shows key trends largely in graphic and metric form. It does not attempt to provide the
supporting narrative that is critical to fully understanding these trends, nor to list the many
qualifications made by the sources used regarding the limits of their models and data. These are areas
where the reader must consult the original sources directly, along with a wide range of narrative
material and other sources, to fully understand the trends that are displayed.
Even so, the report is necessarily complex. It does show that there is value in looking at global
trends, but it makes clear that many key trends are largely regional and must be examined on a
regional basis. It also provides key country-by-country breakouts to show that the driving factors
shaping the nature of terrorism in any given case are usually national. International networks
certainly play a key role, as do factors like religion and culture, but the forms terrorism takes
normally differ sharply even between neighboring countries.
The report also must be detailed to highlight the differences and uncertainties in much of the data.
There often are sharp differences in the most basic summary data, even between two highly respected
sources like START and IHS Jane's. These differences do not reflect failures in the analytic efforts of
the sources shown. They reflect differences that are inevitable in their need to rely on open source
material, the lack of any clear definition of terrorism, the problems in measuring and displaying
uncertainty, and the need to guess and extrapolate where key data are missing.
Also, by giving insight into the factors that alter the likelihood of a successful terrorist action,
we can help governments identify areas of vulnerability for existing security resources and value
the benefits of new security resources. By providing insight into the predicted efficacy of terrorism
over a given period of time, governments can mitigate the substantial indirect costs of terrorism by
reserving financial resources to compensate for indirect losses following an attack.
In short, we aim to help governments stem the loss of life and property through insight into the
factors that determine terrorism's success. And, should terrorism occur, we aim to give governments
the tools to predict the budgets they will need to minimize the impact felt by society.

The World | VS | Terrorism



16. FUTURE SCOPE OF WORK


Further enhancement can be done by making this system more accurate through the inclusion of
many more variables. The dataset available from day-to-day processing may become outdated, so it is
necessary to have updated data for effective prediction. To this extent, an incremental approach is
necessary to make the system learn from past as well as present data and be capable of handling both.
The data can be kept fresh by conducting different surveys.

For future research, there is a plan to further combine the classification algorithms with
genetic algorithms and neural networks to improve the performance of the classifiers, or to build
hybrids of different classifiers. Another direction for advanced research is to hybridize SVM with
one of the heuristic algorithms and evaluate its prediction performance.

We have identified the following major areas for improvement:


 Run more tuning iterations. 30 iterations is enough to get a sense of the value of running
hyperparameter optimization, but running more iterations may lead us to a better set of
hyperparameters that improves performance.
 Don't drop high-cardinality categorical fields. We saw in our feature importance printout that
region-based locations were quite important features, suggesting that the country and
nationality features may be helpful to performance. It would make sense to implement some
sort of re-binning or truncating process by which we could keep 20-50 of the 100+ levels of
these variables in our dataset, rather than dropping them. An easy way to go about this would
be to keep only the top K levels by total number of incidents in the training data (see the
first code sketch after this list).
 Extract information from text fields. We dropped over 20 different fields containing text that
could be useful and predictive. We should instead either use keyword matching to extract
some information from these fields, or take a fuller NLP approach and encode this text (as
TF-IDF features or similar) for use in ML modelling.
 Tune more hyperparameters. We barely scratched the surface of the hyperparameters that
could be tuned in our GBM. For example, regularization, ensemble size and learning rate
can all be altered to try to find a better configuration.
 Stack and ensemble. By using cross-validation to get pseudo-out-of-sample predictions, we
can train models to learn from other models' predictions in a process called "stacking." This
would allow us to combine the predictions of a number of well-performing GBM
configurations, or to combine our tree-based GBM with another model such as a multilayer
perceptron, under a simpler model like Logistic Regression. Using a simpler model like
Logistic Regression can also have a positive effect on calibration (see the second code
sketch after this list).
 Learn a confidence measure. When labelling the "unknown" data, it would be extremely
helpful to get a sense of how confident the model is in its prediction at each datapoint,
beyond the confidence implied by its predicted probability. We can train a second model, on
the outputs of our main model and on some of the key features, that predicts how likely the
main model is to be correct on a particular sample. This second model could then be used to
estimate our confidence in our predictions on the unknown data.
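
As a sketch of the re-binning idea for one high-cardinality field (country_txt is used for
illustration; the value of K and the "Other" label are arbitrary choices, and X_train/X_test
are the split assumed earlier):

    K = 30  # somewhere in the 20-50 range suggested above

    # Rank levels by frequency in the training data only, to avoid leakage.
    top_levels = X_train["country_txt"].value_counts().nlargest(K).index

    X_train["country_txt"] = X_train["country_txt"].where(
        X_train["country_txt"].isin(top_levels), other="Other")
    X_test["country_txt"] = X_test["country_txt"].where(
        X_test["country_txt"].isin(top_levels), other="Other")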
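And a sketch of the stacking idea using scikit-learn's StackingClassifier (available from
scikit-learn 0.22; the base models and their settings here are illustrative, not the tuned
configurations):

    from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    # cv=5 gives the meta-model pseudo-out-of-sample predictions from each
    # base model; a simple Logistic Regression combines them, which also
    # tends to help calibration.
    stack = StackingClassifier(
        estimators=[
            ("gbm", GradientBoostingClassifier(random_state=42)),
            ("mlp", MLPClassifier(max_iter=500, random_state=42)),
        ],
        final_estimator=LogisticRegression(),
        cv=5,
    )
    stack.fit(X_train, y_train)
    print("stacked accuracy:", stack.score(X_test, y_test))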

17. REFERENCES

 Zhi-Hua Zhou, Ensemble Learning, National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210093, China.

 Chaman Verma, Sarika Malhotra, Sharmila and Vineeta Verma (Sardar Vallabhbhai Patel
University of Agriculture and Technology), Predictive Modeling of Terrorist Attacks Using
Machine Learning, International Journal of Pure and Applied Mathematics, Vol. 119, No. 15,
2018, pp. 49-61.

 Giorgio Valentini and Francesco Masulli, Ensembles of Learning Machines, INFM, Istituto
Nazionale per la Fisica della Materia, 16146 Genova, Italy; DISI, Università di Genova, 16146
Genova, Italy.

 Thomas G. Dietterich, Ensemble Methods in Machine Learning, Oregon State University,
Corvallis, Oregon, USA.

 Ibrahim Toure and Aryya Gangopadhyay, Analyzing Real Time Terrorism Data.

 Malathi and Dr. S. Santhosh Baboo, Evolving Data Mining Algorithms on the Prevailing Crime
Trend: An Intelligent Crime Prediction Model, International Journal of Scientific &
Engineering Research, Vol. 2, June 2011.

 Ana Swanson, The eerie math that could predict terrorist attacks, Wonkblog, March 1, 2016.

 V.S. Subrahmanian, Terrorist Social Network Analysis: Past, Present, and Future, UMIACS,
University of Maryland.

 Manoj K. Jha, M.ASCE, Dynamic Bayesian Network for Predicting the Likelihood of a
Terrorist Attack at Critical Transportation Infrastructure Facilities, Journal of
Infrastructure Systems, ASCE, March 2009.

 Ghada M. Tolan and Omar S. Soliman, An Experimental Study of Classification Algorithms for
Terrorism Prediction, International Journal of Knowledge Engineering, Vol. 1, No. 2,
September 2015.

 Bauer, E. and Kohavi, R., An empirical comparison of voting classification algorithms:
Bagging, boosting and variants, Machine Learning, 36(1/2), 1999, pp. 105-139.

 Varun Teja Gundabathula and V. Vaidhehi, An Efficient Modelling of Terrorist Groups in India
using Machine Learning Algorithms, Indian Journal of Science and Technology, Vol. 11(15),
DOI: 10.17485/ijst/2018/v11i15/121766, April 2018.

 DeRosa, M., Data Mining and Data Analysis for Counterterrorism, CSIS Report, Center for
Strategic and International Studies, May 2004, pp. 1-32.

 Sormani, R., Criticality assessment of terrorism related events at different time scales,
Journal of Ambient Intelligence and Humanized Computing, Feb 2017, 8(1), pp. 9-27.

 Zulkepli, F.S., Ibrahim, R. and Saeed, F., Data pre-processing techniques for research
performance analysis, Recent Developments in Intelligent Computing, Communication and
Devices, Aug 2017, pp. 157-162.
