Final Report on Predictive Modeling of Global Terrorist Attacks Using Machine Learning
Project Report
Submitted by
Moumita Goswami
Asst. Professor, CSE
CERTIFICATE OF RECOMMENDATION
I hereby recommend that the thesis prepared under my supervision by Sukanya Ghosh, Mahek
Pipalia, Aditya Agarwal, Md Sarik & Nilesh Jain entitled Predictive Modeling of Global
Terrorist Attacks Using Machine Learning be accepted in partial fulfillment of the requirements for
the degree of Bachelor of Technology in Computer Science & Engineering Department.
-----------------------------------------------------------
Dr. S. S. Thakur
Associate Professor & Head of the Department,
Computer Science & Engineering Department,
MCKV Institute of Engineering, Howrah

--------------------------------------
Project guide
Moumita Goswami
Asst. Professor, CSE
MCKV Institute of Engineering
243, G. T. Road (N), Liluah
Howrah-711204
Affiliated to
Maulana Abul Kalam Azad University of Technology
(Formerly known as West Bengal University of Technology)
CERTIFICATE
This is to certify that the project entitled
Predictive Modeling of Global Terrorist Attacks Using Machine Learning
and submitted by Sukanya Ghosh, Mahek Pipalia, Aditya Agarwal, Md Sarik & Nilesh Jain
has been carried out under my guidance, following the rules and regulations of the degree of
Bachelor of Technology in Computer Science & Engineering of Maulana Abul Kalam Azad
University of Technology (Formerly West Bengal University of Technology).
CERTIFICATE OF APPROVAL
(B.Tech Degree in Computer Science & Engineering)
COMMITTEE ON FINAL EXAMINATION FOR EVALUATION OF PROJECT REPORT

1.
2.
3.
4.
5.
ACKNOWLEDGEMENT
I take this opportunity to express my profound gratitude and deep regards to my faculty
MOUMITA GOSWAMI for her exemplary guidance, monitoring and constant
encouragement throughout the course of this project. The blessing, help and guidance
given by her from time to time shall carry me a long way in the journey of life on which I am
about to embark.
I am obliged to my project team members for the valuable information provided by them
in their respective fields. I am grateful for their cooperation during the period of my
assignment.
SUKANYA GHOSH
MAHEK PIPALIA
ADITYA AGARWAL
MD SARIK
NILESH JAIN
TABLE OF CONTENTS
1. Abstract
2. Introduction
3. Literature Review
   3.1. Theoretical Analysis
   3.2. Case Study
4. What is Machine Learning?
   4.1. Supervised & Unsupervised ML Algorithms
   4.2. Some Common ML Algorithms
      4.2.1. Decision Trees
      4.2.2. Naive Bayes Classification
      4.2.3. Random Forest
      4.2.4. Linear Regression
      4.2.5. Logistic Regression
      4.2.6. KNN Classification
      4.2.7. Support Vector Machine (SVM)
      4.2.8. Ensemble Methods
      4.2.9. Clustering Algorithms
      4.2.10. Principal Component Analysis
      4.2.11. Singular Value Decomposition
      4.2.12. Independent Component Analysis
5. Data Description
6. Key Uncertainties and Problems in the Data
7. Objectives of the Project
8. Approach
9. Data Gathering and Preprocessing
   9.1. Dealing with Null Values
   9.2. Filling Null Values
10. Insights and Trends
11. Dimensionality Reduction: Feature Selection and Importance
12. Splitting the Dataset into Training, Validation and Test Sets
1. ABSTRACT
One of the severe problems mankind currently faces is terrorist attacks. These attacks involve
pre-planning, complexity and collaboration, because of which counter-terrorism has gained priority
in every country. Counter-terrorism comprises the techniques, tactics and counter-measures used to
prevent future attacks. A terrorist attack is an event that is least predictable; the reason terrorist
attacks are so frightening is their unpredictability: we never know how, when or where they will
strike. Very few methodologies exist for forecasting attacks, owing to the lack of real-time data,
which is confidential. In this project we propose a way to predict future attacks, the weapons used
and the targets on which attacks might happen, using a class of powerful machine learning
algorithms.
2. INTRODUCTION
Terrorism is an evolving phenomenon: it has happened in the past, is happening now, and will
happen in the future. Developing a model that predicts future terrorist attacks would therefore be
useful, as it could alert military personnel by providing information about what type of attack might
happen, in which location, and with what probability. This gives some chance of preventing an
upcoming attack and reducing its effects: security threats, such as the lives of victims, and stability
threats, such as long- and short-term economic instability of the country attacked, destruction of
infrastructure, and so on.

Since security of life is a high priority that every government must provide to each citizen of a
country, crime activities must be reduced. There is a need to narrow the gap between terrorists and
counter-terrorism by predicting future terrorist attacks accurately, so that their impact, such as
deaths, injuries, and psychological trauma, can be reduced to some extent, and their effect on
economic stability can be controlled.

This project produces a model for the identification and prevention of future terrorist attacks based
on the information available. Much research has been done on analyzing terrorism incident data
around the world, which can help in retrieving patterns or important information that contributes to
choosing appropriate actions to prevent similar types of attack; however, few experiments have been
done on detecting and preventing future attacks, as this field has emerged only recently.
Some of them are:
(1) The Hawkes process, applied to predict terrorist attacks in Northern Ireland, considering 5,000
explosions between 1970 and 1998. The process was used to analyze when and where the Irish
Republican Army (IRA) launched attacks, how British security forces responded, and how effective
those responses were.
(2) Social Network Analysis (SNA), whose goal was to degrade the lethality of a terror network,
i.e. to study what happens when a terrorist is removed from the network, by:
(i) quantifying terror network lethality,
(ii) predicting the successor of a removed terrorist, and
(iii) identifying whom to remove. Some of the activities carried out include fake
accounts, social media fraud, and malware distribution.
(3) The Terrorist Group Prediction Model (TGPM), which aims to predict the group involved in, or
responsible for, a specific attack using concepts from crime prediction and group detection models.
It uses clustering and association techniques and showed a fair degree of accuracy.
(4) A Dynamic Bayesian Network used to predict the likelihood of future attacks, which acts as an
initial step in predicting terrorist behavior at critical transport infrastructure facilities.
(5) A feed-forward back-propagation neural network used in counter-terrorism to predict whether a
person is a terrorist or not. The data set was collected from a game called Cutting Corners, and
results showed a success rate between 60% and 68% for correctly identifying terrorist behavior.
(6) Characterizing a terrorist using pattern classification and SNA, which predicts whether a person
is a terrorist or not and achieved 86% accuracy.
(7) Prediction of unsolved attacks using group prediction algorithms, to detect the terrorist group
(e.g. ISIS) involved in attacks, considering solved and unsolved terrorist events in Istanbul, Turkey,
between 2003 and 2005, based on data mining techniques such as clustering.
3. LITERATURE REVIEW
3.1. Theoretical Analysis
Existing literature has applied network analysis on the theoretical relationships between terrorist
organizations and individual terrorists. Everton (2009), Clauset et al. (2008), and Moon et al. (2007) offer three
distinct approaches to the theoretical study of terrorist network dynamics. Everton (2009) criticizes
the significant application of social network analysis to the study of terrorist networks by arguing that
the focus on identifying and targeting key players within the network is misplaced. He seeks to
understand the specific hierarchical structure within a terrorist organization (centralized leadership,
decentralized, etc.) to help security forces target specific actors in a centralized network to disrupt its
effectiveness. He identifies leadership as highly connected nodes within the network as well as an
understanding of the cosmopolitanism of the network via measurements of the local clustering
coefficients and average path lengths. His explanation is limited to individual organizations, but we
are interested in some of his measures as applied to the global network. Clauset et al. (2008)
further supplemented our understanding of the underlying network structure of terror cells by
building a hierarchical structure of a terrorist network via a maximum likelihood model to infer
connections between nodes. They compared this model to a true terrorist network and found the two
networks similar along metrics like average clustering coefficient and SCC size. Moon et al. (2007)
built a meta-structure of tasks and the agents assigned to complete them in order to infer the players
who would have had contact in order to carry out the attack. The paper then applies social
network analysis techniques by calculating betweenness centrality and total degree centrality
on nodes (labeled as particular figures in the organization) for each of the models they built
to identify key players within each theoretical organizational structure. The paper’s primary
strength is in its innovative approach to understanding why edges between actors within the terrorist
network exist. However, like Everton, its techniques are limited to studying single organizations,
while our paper takes a global approach.
3.2. Case Study
In addition to theoretical analyses that provide foundations to depict the underlying structure
of terrorism networks, researchers have conducted case studies that provide insights into individual
terrorist organizations. Belli et al. provided results from a case study that explores the social network
of internal members of the "Hammoud Enterprise", which was involved in trade diversion in order
to finance terrorist organizations in Michigan. They found that in these groups, members are highly
interconnected, making the organization more efficient but also more vulnerable to detection. With
key player analysis, three ringleaders were detected along with a few secondary leaders, most of
whom are Islamist extremists, which shows the property of an idea-centric organization. These
analyses are representative of many of the case studies we have found. Krebs 2002 [5] is
widely referenced in literature about terrorist network analysis. Following the September 11
attacks, the author used publicly available information about the 19 hijackers to construct a network
of weak and strong ties based on the nature of their relationships with one another. By computing the
degree distribution, betweenness, and closeness of the nodes, this paper depicts a sketch of the covert
network behind the scenes, and allows him to identify the clear leader among the hijackers. Also from
the analysis of the clustering coefficient (0.4) and the average path length among the nodes, the authors found
out that this covert network "trades efficiency for secrecy" by remaining quite small and operating with
little outside assistance. Krebs’ paper is one of the foundational documents in applying network
analysis to terrorist organizations post-9/11. Because his research was conducted after the fact on a
well-publicized case, he is able to delve into the nature of connections among actors in considerable
depth, which lends credence to his findings about this specific group. However, his work does not
lend itself to application on the larger terrorist landscape, because in most cases, information
about the trust and familial connections among actors in a covert network is difficult to come by
and verify. Thus, the insights gained from particular case studies can inform some of our approach to
larger networks (i.e. identifying key players based on centrality, looking for hubs in a sparse network,
etc.), but their approach is limited in scope and application to future study. Instead of exploring the
internal structures of each group, a global picture of the interaction between different groups may
provide crucial linkage that connects pieces of small cluster networks. For instance, if we can establish
a concrete theory of the collaboration between ISIS and extremist groups within the U.S., we can
potentially identify many of the sources of support for terrorist attacks in the U.S. Another benefit
of studying the global group network of terrorism is that it helps understand the global shift of
terrorism distribution and predict future moves. Though understanding individual organizations has
importance, especially in the context of extremely destructive groups like al Qaeda or ISIS, we feel
that the existing literature insufficiently accounts for the connections between terrorist organizations.
Understanding those relationships can be crucial in inferring and predicting the future of terrorist
organizations and behaviors, as we understand some organizations as behaving similarly or emulating
central organizations to determine their next actions.
4. WHAT IS MACHINE LEARNING?
The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to learn automatically without
human intervention or assistance and adjust actions accordingly.
There are some basic common threads, however, and the overarching theme is best summed up by
this oft-quoted statement made by Arthur Samuel way back in 1959:
“Machine Learning is the field of study that gives computers the ability to learn without being
explicitly programmed.”
And more recently, in 1997, Tom Mitchell gave a “well-posed” definition that has proven more useful to engineering types:
“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with experience
E.” -- Tom Mitchell, Carnegie Mellon University
4.1. Supervised & Unsupervised ML Algorithms

Supervised machine learning algorithms can apply what has been learned in the past to new data
using labeled examples to predict future events. Starting from the analysis of a known training dataset,
the learning algorithm produces an inferred function to make predictions about the output values. The
system is able to provide targets for any new input after sufficient training. The learning algorithm
can also compare its output with the correct, intended output and find errors in order to modify the
model accordingly.
In contrast, unsupervised machine learning algorithms are used when the information used to train
is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to
describe a hidden structure from unlabeled data. The system doesn’t figure out the right output, but it
explores the data and can draw inferences from datasets to describe hidden structures from unlabeled
data.
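As a minimal sketch of the contrast above (scikit-learn and the synthetic toy data below are illustrative assumptions, not part of the original text), the same points can be classified with labels and clustered without them:

```python
# Supervised vs. unsupervised learning on the same toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available: supervised setting

clf = LogisticRegression().fit(X, y)           # learns a mapping X -> y
pred_supervised = clf.predict([[1.1], [5.1]])  # targets for new inputs

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels
clusters = km.labels_  # hidden structure inferred from the data alone
```

The classifier can assign labels to new inputs because it saw labeled examples; K-means only groups the points, without knowing what the groups mean.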
Reinforcement learning is a method in which an agent interacts with its environment by
producing actions and discovering errors or rewards. Trial-and-error search and delayed
reward are the most relevant characteristics of reinforcement learning. This method allows machines
and software agents to automatically determine the ideal behavior within a specific context in order
to maximize its performance. Simple reward feedback is required for the agent to learn which action
is best; this is known as the reinforcement signal.
Machine learning enables analysis of massive quantities of data. While it generally delivers faster,
more accurate results in order to identify profitable opportunities or dangerous risks, it may also
require additional time and resources to train it properly. Combining machine learning with AI and
cognitive technologies can make it even more effective in processing large volumes of information.
Decision Tree
From a business decision point of view, a decision tree is the minimum number of yes/no questions
that one has to ask, to assess the probability of making a correct decision, most of the time. As a
method, it allows you to approach the problem in a structured and systematic way to arrive at a logical
conclusion.
Random Forest

The random forest algorithm builds an ensemble of decision trees as follows:

1. Assume the number of cases is N. A sample of these N cases is taken as the training set.
2. Consider M to be the number of input variables; a number m is selected such that m < M. At
each node, the best split on the m randomly selected variables is used to split the node. The
value of m is held constant as the trees are grown.
3. Each tree is grown as large as possible.
4. By aggregating the predictions of the n trees (i.e., majority vote for classification, averaging
for regression), predict the new data.
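The steps above can be sketched with scikit-learn's RandomForestClassifier (the synthetic dataset and parameter choices here are illustrative assumptions):

```python
# Random forest: bootstrap samples of the N cases, m < M features per split,
# and majority-vote aggregation across the trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                       # N = 200 cases, M = 5 input variables
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # label depends on two features

forest = RandomForestClassifier(
    n_estimators=100,     # n trees, each grown on a bootstrap sample
    max_features="sqrt",  # m = sqrt(M) variables tried at each split, m < M
    random_state=0,
).fit(X, y)

# Prediction aggregates the majority vote of the individual trees.
pred = forest.predict(X[:5])
```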
Linear Regression

Linear regression models the output as a weighted sum of the input values; the weight attached to
each input is called a coefficient and is represented by the Greek letter Beta (β).
The equation mentioned below represents a linear regression model with two sets of input values,
x1 and x2. y represents the output of the model, β0, β1 and β2 are the coefficients of the linear equation.
y = β0 + β1x1 + β2x2
When there is only one input variable, the linear equation represents a straight line. For simplicity,
consider β2 to be equal to zero, which would imply that the variable x2 will not influence the output of
the linear regression model. In this case, the linear regression will represent a straight line and its
equation is shown below.
y = β0 + β1x1
A graph of the linear regression equation model is as shown below
Linear regression can be used to find the general price trend of a stock over a period of time. This helps
us understand if the price movement is positive or negative.
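As a small illustration of the equation above, the coefficients β0 and β1 can be recovered by least squares with NumPy (the data is synthetic and noise-free, an assumption for clarity):

```python
# Recover beta0 and beta1 of y = beta0 + beta1 * x1 by least squares.
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x1                            # true beta0 = 2, beta1 = 3

A = np.column_stack([np.ones_like(x1), x1])   # design matrix [1, x1]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
beta0, beta1 = beta
```

With noise-free data the estimated coefficients match the true β0 = 2 and β1 = 3 essentially exactly.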
The logistic regression model computes a weighted sum of the input variables similar to the linear
regression, but it runs the result through a special non-linear function, the logistic function or sigmoid
function to produce the output y.
The sigmoid/logistic function is given by the following equation.
y = 1 / (1 + e^(-x))
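The sigmoid function can be written directly in NumPy:

```python
# The logistic (sigmoid) function: y = 1 / (1 + e^(-x)).
# It squashes any real input into the interval (0, 1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
```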
This algorithm usually employs methods to determine proximity such as Euclidean distance and
Hamming distance.
The pros of KNN are its simplicity and ease of use. Though it can require a lot of memory to store
large datasets, it only calculates (learns) at the moment a prediction is needed.
Let’s consider the task of classifying a green circle into class 1 or class 2. Consider the case of KNN
based on the 1-nearest neighbour: KNN will classify the green circle into class 1. Now let’s
increase the number of nearest neighbours to 3, i.e., the 3-nearest neighbours. As you can see in the
figure, there are two class 2 objects and one class 1 object inside the circle, so KNN will classify the
green circle into class 2, as that forms the majority.
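The effect of changing k can be reproduced with scikit-learn (the point coordinates below are made up to mimic the figure, and are an assumption):

```python
# k-NN: the same query point can change class as k grows.
from sklearn.neighbors import KNeighborsClassifier

X = [[0.0, 0.0],              # class 1 point, closest to the query
     [3.0, 0.0], [0.0, 3.0],  # class 2 points, slightly farther away
     [9.0, 9.0]]              # another class 1 point, far away
y = [1, 2, 2, 1]

query = [[0.5, 0.5]]
k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(query)[0]
k3 = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(query)[0]
```

With k = 1 the single nearest neighbour decides the class; with k = 3 the two class 2 neighbours outvote it.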
When related to trading, an SVM algorithm can be built which categorises equity data into
buy, sell or neutral classes and then classifies the test data according to those rules.
So how do ensemble methods work and why are they superior to individual models?
They average out biases: if you average a bunch of Democratic-leaning polls and Republican-leaning
polls together, you will get a result that isn’t leaning either way.
They reduce the variance: The aggregate opinion of a bunch of models is less noisy than the single
opinion of one of the models. In finance, this is called diversification — a mixed portfolio of many
stocks will be much less variable than just one of the stocks alone. This is why your models will be
better with more data points rather than fewer.
They are unlikely to over-fit: If you have individual models that didn’t over-fit, and you are
combining the predictions from each model in a simple way (average, weighted average, logistic
regression), then there’s no room for over-fitting.
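The variance-reduction argument can be checked numerically: averaging 25 equally noisy "models" shrinks the variance of the estimate by roughly a factor of 25 (the synthetic Gaussian noise is an illustrative assumption):

```python
# Variance reduction by averaging: each "model" is a noisy estimator of
# the same quantity; their mean has much lower variance than any one model.
import numpy as np

rng = np.random.RandomState(0)
true_value = 1.0
models = rng.normal(loc=true_value, scale=1.0, size=(1000, 25))  # 25 noisy models

single_var = models[:, 0].var()           # variance of one model's estimates
ensemble_var = models.mean(axis=1).var()  # variance of the 25-model average
```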
Clustering Algorithms

Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are
more similar to each other than to objects in other groups.

Principal Component Analysis
Some of the applications of PCA include compression, simplifying data for easier learning,
visualization. Notice that domain knowledge is very important while choosing whether to go forward
with PCA or not. It is not suitable in cases where data is noisy.
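A minimal PCA sketch with scikit-learn (the rank-2 synthetic data is an assumption for illustration): 3-D points that actually lie in a plane compress to 2 components with essentially no loss:

```python
# PCA: project 3-D points that lie in a 2-D plane down to 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 2) @ np.array([[1.0, 0.5, 0.0],
                                 [0.0, 1.0, 0.5]])  # rank-2 3-D data

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)                  # compressed representation
explained = pca.explained_variance_ratio_.sum()
```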
Singular Value Decomposition

PCA is actually a simple application of SVD. In computer vision, the first face recognition algorithms
used PCA and SVD in order to represent faces as a linear combination of “eigenfaces”, do
dimensionality reduction, and then match faces to identities via simple methods; although modern
methods are much more sophisticated, many still depend on similar techniques.
Independent Component Analysis

ICA is related to PCA, but it is a much more powerful technique that is capable of finding the
underlying factors of sources when these classic methods fail completely. Its applications include
digital images, document databases, economic indicators and psychometric measurements.
5. DATA DESCRIPTION
Information on more than 180,000 Terrorist Attacks
The Global Terrorism Database (GTD) is an open-source database including information on terrorist
attacks around the world from 1970 through 2017. The GTD includes systematic data on domestic as
well as international terrorist incidents that have occurred during this time period and now includes
more than 180,000 attacks. The database is maintained by researchers at the National Consortium for
the Study of Terrorism and Responses to Terrorism (START), headquartered at the University
of Maryland.
Content
Geography: Worldwide
Time period: 1970-
2017, Unit of analysis:
Attack
Variables: >100 variables on location, tactics, perpetrators, targets, and outcomes
(Note: Please interpret changes over time with caution. Global patterns are driven by diverse trends
in particular regions, and data collection is influenced by fluctuations in access to media coverage
over both time and place.)
eventid : A 12-digit Event ID system. First 8 numbers – date recorded “yyyymmdd”. Last 4
numbers – sequential case number for the given day (0001, 0002 etc).
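The eventid layout described above can be split apart with a few lines of Python (the helper name parse_eventid is ours, not part of the GTD):

```python
# Split a GTD-style 12-digit eventid into its date part (first 8 digits,
# yyyymmdd) and the sequential case number for that day (last 4 digits).
from datetime import date

def parse_eventid(eventid):
    s = str(eventid)
    d = date(int(s[0:4]), int(s[4:6]), int(s[6:8]))  # recorded date
    seq = int(s[8:12])                               # case number that day
    return d, seq

attack_date, case_no = parse_eventid(197001020002)
```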
iyear : This field contains the year in which the incident occurred.
imonth : This field contains the number of the month in which the incident occurred.
iday : This field contains the numeric day of the month on which the incident occurred.
latitude : The latitude of the city in which the event occurred.
longitude : The longitude of the city in which the event occurred.
extended : The duration of an incident extended more than 24 hrs (1) - Less than 24 hrs (0).
multiple : The attack is part of a multiple incident (1) - Not part of a multiple incident (0).
suicide : The incident was a suicide attack (1) -OR- no indication of a suicide attack (0).
attacktype : Assassination(1), Hijacking(2), Kidnapping(3), Barricade Incident(4),
Bombing/Explosion(5), Armed Assault(6), Unarmed Assault(7),
Facility/Infrastructure Attack(8), Unknown(9)
targtype : 22 categories ranging from Business(1), Government(general)(2), Police(3), ...
Utilities(21), Violent Political Parties(22)
natlty : Nationality of target/victim
individual : Whether the attack was carried out by an individual or several individuals known to
be affiliated with a group or organization(1) or not affiliated with a group or
organization(0)
claimed : A group or person claimed responsibility for the attack (1) - No claim of responsibility
was made(0).
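A loading sketch with pandas, keeping only the fields described above. The CSV filename and encoding are assumptions (adjust them to the actual GTD file); the two-row DataFrame is a stand-in so the snippet runs without the data:

```python
# Load the GTD, restricted to the fields documented above.
import pandas as pd

fields = ["eventid", "iyear", "imonth", "iday", "latitude", "longitude",
          "extended", "multiple", "suicide", "attacktype", "targtype",
          "natlty", "individual", "claimed"]

# Hypothetical real load (filename/encoding assumed):
# df = pd.read_csv("globalterrorismdb.csv", encoding="ISO-8859-1",
#                  usecols=fields, low_memory=False)

# Illustrative two-row stand-in with the same schema:
df = pd.DataFrame(
    [[197001020002, 1970, 1, 2, 19.37, -99.08, 0, 0, 0, 6, 7, 21, 0, 1],
     [201701010001, 2017, 1, 1, 33.30, 44.37, 0, 1, 1, 5, 2, 95, 0, 0]],
    columns=fields)
```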
ATTRIBUTES
6. KEY UNCERTAINTIES AND PROBLEMS IN THE DATA
- Different definitions and listings of perpetrators, and very different counts and characterizations
of perpetrator actions, compounded by the lack of clear definitions of terrorist versus insurgent
actions.
- Lack of clear methods for reporting attacks and incidents where perpetrators cannot be identified.
- Serious limitations to the search and graphing functions of the given databases, e.g. the inability
to search for perpetrators in each country or region, or to get totals or a range for casualties.
The reader should also be aware that the START database and the statistical annex to the
State Department Country Reports on Terrorism do provide a full range of caveats about the
definitions used and the uncertainties in the data, and that the START data offer three levels of
confidence. These caveats are not reported in detail here in order to limit the length of this report.
7. OBJECTIVES OF THE PROJECT
The idea behind this project is to find relations between various terrorist attacks while also
providing deep insight into all previous attacks, notorious organizations, and some of the countries
that are affected the most by these deadly attacks.
The main objective is to predict the success of terrorist attacks with as high an accuracy as possible.
Also following are some of the vital questions that we are trying to answer using our visualizations:
1. How has terrorism evolved over the years? - Visualized through animations
2. What regions of the world are the most affected? - Facilitates drawing a comparison between 2
regions, for example North America and South America.
3. What are the most notorious organizations on a global scale and what countries have these
organizations terrorized?
4. What are the worst 40 attacks in the world, which organization claims responsibility and during
what year did the attacks occur?
5. What is the most common type of attacks and weapon of choice?
6. What is the nationality of the target?
7. Which country has the highest number of casualties?
8. Visualization of the terrorist attacks that remain unclaimed over the years
The above questions are just a few of those outlined in the report; the visualizations themselves
provide much more information as you interact with them, using various filters that modify the results.
19
8. APPROACH
First the raw data will be tidied and prepared for analysis. Then detailed exploratory data analysis
and time series analysis will be conducted to identify the most important factors that motivate
terrorism and to reveal insights and trends.
INITIAL ANALYSIS
Univariate Analysis –
- Detailed data dictionary: list all variables and understand what each variable means and represents
- Identify the source of each variable (directly measured, calculated from other variables, or with an
associated time domain)
- Decide on the dependent (target) variable
- Assign the correct data types and appropriate column names
- For numeric attributes, check the 5-number summary (min, 1st quartile, median, 3rd quartile, max)
- For categorical attributes, check and decide on the levels and their frequencies
- Check and deal with data inconsistencies, missing values, errors, duplicates, outliers (boxplots),
numeric signs, upper and lower cases, and spaces or special characters in strings
- Check distributions of the variables: Normal distribution?
- Low variance filter
- Check the imbalance in the dependent variable
- Check time variable
- Univariate visualizations
Bivariate Analysis –
- Pairwise relations
- Pairwise visualizations such as scatter plots
- Correlation analysis (Spearman or Pearson)
Multivariate Analysis –
- Relations between more than 2 variables
- Statistical tools such as one-way ANOVA or rank-based tests to compare the means.
20
DIMENSIONALITY REDUCTION
- Remove attributes with too many missing values
- Remove attributes with zero or very low variance
- For pairs of highly correlated attributes, remove one, preferring the attribute with more missing
values or lower variance
- Feature selection: decide on the importance of each attribute using statistical measures such as
information gain or the Gini index (forward selection and backward elimination)
EXPERIMENTAL DESIGN
- Randomize and split the data into training, validation and test sets
- Treat class imbalance (under-sampling the majority class or over-sampling the minority class)
- Cross-validation, e.g. 10-fold
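The splitting and cross-validation steps can be sketched with scikit-learn (synthetic data; under-/over-sampling for class imbalance is typically done with a separate library such as imbalanced-learn and is omitted here):

```python
# Randomized, stratified split plus 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.3).astype(int)   # imbalanced: roughly 70% class 1

# stratify=y keeps the class ratio the same in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X_train, y_train, cv=10)  # 10-fold CV
```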
MODELING
Building the Classification Models
Train the model
Validate the model
EVALUATION
- Classification metrics: confusion matrix, ROC (receiver operating characteristic), accuracy,
recall, precision
- Compare performance between different models (contingency tables, multivariate analysis of
variance)
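The classification metrics listed above, computed on a toy prediction with scikit-learn (the labels are made up for illustration):

```python
# Confusion matrix, accuracy, precision and recall on a toy prediction.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
```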
INFERENCES
- Discussion
- Threats to validity (internal, external, construct) and proposed solutions that may mitigate these
threats
9. DATA GATHERING AND PREPROCESSING
Following are the steps which have been followed in data preparation process:
Data cleaning: This is a core step of data preprocessing, in which inconsistent and noisy data are
removed or reduced. Missing values in the data set are also treated here: records with missing
values are identified and removed from the data set.
Feature Selection: To increase the probability of accurate results it is necessary to include
only relevant data and to exclude redundant attributes. Feature selection is performed to select
only the relevant attributes of the data set. Only seven attributes are selected for the process, where
one attribute acts as the class label and the rest are regular attributes.
Data Conversion: Data conversion is the process of converting data from one form to another
required form. The GTD data is converted from categorical to numerical form to improve the
results.
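The categorical-to-numerical step can be sketched with pandas (the category values here are illustrative, not the full GTD code lists):

```python
# Convert a categorical column to integer codes with pandas.
import pandas as pd

df = pd.DataFrame({"attacktype": ["Bombing/Explosion", "Armed Assault",
                                  "Bombing/Explosion", "Hijacking"]})

# factorize assigns a distinct integer to each category value,
# in order of first appearance.
codes, categories = pd.factorize(df["attacktype"])
df["attacktype_code"] = codes
```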
The GTD applies three inclusion criteria for an incident to qualify as a terrorist attack:

Criterion 1: The act must be aimed at attaining a political, economic, religious, or social goal.
Criterion 2: There must be evidence of an intention to coerce, intimidate, or convey some other
message to a larger audience (or audiences) than the immediate victims.
Criterion 3: The action must be outside the context of legitimate warfare activities.
10. INSIGHTS AND TRENDS
Number of attacks, yearly distribution: a sudden jump after 2007, with an increasing trend.

Number of attacks, region distribution: the top 2 regions are the Middle East and South Asia.
Number of attacks, country distribution: the top 3 countries are Iraq, Pakistan and Afghanistan.

Number of attacks, terrorist group distribution: the top 2 groups are the Taliban and ISIL.
Multi-Line graph showing number of terrorism attacks per region over years:
Multi-line graph showing the number of terrorist attacks per attack type over the years:

Looking at the above graph there are some interesting peaks, for example the 2001 peak due to the
9/11 attack; the tables below show statistics for the incidents behind those peaks.

Distribution of the number of terrorist attacks and the number of casualties per region: the Middle
East and North Africa are the top regions.
Distribution of the number of terrorist attacks and the number of successful attacks per target type:
The plot shows the countries in which each terrorist group is active:

The plot shows the most popular weapon types used by each terrorist group:
The plot shows the most popular attack types of each terrorist group:
11. DIMENSIONALITY REDUCTION: FEATURE SELECTION AND IMPORTANCE
The number of features has been reduced to 22; these selected features will be used in our models.
Target variable – “success”
Now we will apply different models to predict the success of terrorist attacks based on the selected
features. The models will be evaluated and compared, and the model with the best accuracy will be
selected for prediction.
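The model-comparison loop can be sketched as follows (the synthetic data and the shortlist of classifiers are illustrative assumptions):

```python
# Train several classifiers and keep the one with the best validation accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(400, 6)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # stand-in for "success"

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_val, y_val)
          for name, m in models.items()}
best = max(scores, key=scores.get)  # model with the highest accuracy
```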
12. SPLITTING THE DATASET INTO TRAINING, VALIDATION AND TEST SETS
Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the
training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on
the validation dataset is incorporated into the model configuration.
The validation set is used to evaluate a given model frequently during development. As machine learning engineers, we use this data to fine-tune the model hyperparameters. Hence the model occasionally sees this data, but it never “learns” from it. We use the validation set results to update higher-level hyperparameters, so the validation set affects the model, but only indirectly.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the
training dataset.
The Test dataset provides the gold standard used to evaluate the model. It is only used once a model is
completely trained (using the train and validation sets). The test set is generally what is used to evaluate
competing models.
The validation set is often used as the test set, but this is not good practice. The test set is generally well curated: it contains carefully sampled data spanning the various classes that the model will face when used in the real world.
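The three-way split described above can be sketched with two successive train_test_split calls, a common scikit-learn pattern; the arrays here are placeholders, not the GTD data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)  # placeholder feature matrix
y = np.arange(100) % 2              # placeholder binary target

# First split off 40% as a temporary pool, then halve that pool into
# validation and test sets: 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```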
Here we see that Random Forest, Decision Tree and Gradient Boosting give the best accuracy scores on the training set, so we will try to improve the scores of these top three algorithms.
Hyperparameter Tuning
Hyperparameters are hugely important in getting good performance with models. In order to understand
this process, we first need to understand the difference between a model parameter and a model
hyperparameter.
Model parameters are internal to the model; their values can be estimated from the data, and we are often trying to estimate them as well as possible. Hyperparameters, in contrast, are external to the model and cannot be learned directly through the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. Hyperparameters are model-specific properties that are “fixed” before you even train and test your model on data.
The process for finding the right hyperparameters is still somewhat of a dark art, and it currently
involves either random search or grid search across Cartesian products of sets of hyperparameters.
Grid Search - GridSearch takes a dictionary of all the different hyperparameter values that we want to test, feeds every combination through the algorithm, and reports back which combination had the highest accuracy.
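A minimal sketch of grid search using scikit-learn's GridSearchCV, as described above; the parameter grid and the synthetic data are illustrative, not the values actually used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the encoded GTD features.
X, y = make_classification(n_samples=500, n_features=22, random_state=0)

# Dictionary of candidate hyperparameter values (illustrative).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
}
# Every combination in the grid is cross-validated; the best is reported.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```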
Here,
Class 1: Positive
Class 2: Negative
Classification Rate/Accuracy:
The classification rate, or accuracy, is given by the relation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.
Recall:
Recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of positive examples. High recall indicates the class is correctly recognized (a small number of FN).
Recall is given by the relation:
Recall = TP / (TP + FN)
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples. High precision indicates that an example labeled as positive is indeed positive (a small number of FP). Precision is given by the relation:
Precision = TP / (TP + FP)
High recall, low precision: This means that most of the positive examples are correctly recognized
(low FN) but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP).
F-measure: Since we have two measures (precision and recall), it helps to have a measurement that represents both of them. We calculate an F-measure, which uses the harmonic mean in place of the arithmetic mean, as it punishes extreme values more:
F-measure = 2 * (Precision * Recall) / (Precision + Recall)
The F-measure will always be nearer to the smaller value of precision or recall.
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
The False Positive Rate (FPR) is defined as:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
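The metrics defined in this section can be computed directly with scikit-learn; the small label vectors below are made-up examples for illustration only.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7,   # predicted probabilities,
           0.6, 0.3, 0.95, 0.05]           # used for the ROC curve

# Unpack the 2x2 confusion matrix into its four cells.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of P and R
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # area under the ROC curve
```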
After conducting detailed exploratory data analysis and time series analysis, the following insights and trends were revealed:
The 2 major attack types are Bombing/Explosion and Armed Assault, which indicates the misuse of science and technology against humanity.
India, Pakistan and Afghanistan have seen thousands of terrorist acts, which is a worrying factor.
Top 5 Indian cities that have seen the most terrorist acts:
Srinagar
Imphal
New Delhi
Amritsar
Sopore
Top 5 cities in Iraq that have seen the most terrorist acts:
Baghdad
Mosul
Kirkuk
Baqubah
Fallujah
Top 5 cities in Afghanistan that have seen the most terrorist acts:
Kabul
Kandahar
Jalalabad
Lashkar Gah
Ghazni
Top 5 cities in Pakistan that have seen the most terrorist acts:
Karachi
Peshawar
Quetta
Lahore
Jamrud
Analysis of the top 5 active terrorist organizations and their presence showed that:
The Taliban has waged a war against Afghanistan, and its number of attacks has increased in the last few years.
Peru suffered the most at the hands of the Shining Path during the 1980s and 1990s.
Islamic State of Iraq and the Levant (ISIL): we all know the origin of this terror group, but ISIL has started waging war against neighboring European countries too. Belgium and Russia saw 2 incidents in 2016, and France saw 9 attacks in 2015.
The Farabundo Marti National Liberation Front (FMLN) gave El Salvador a tough time between the 1980s and 1990s.
Al-Shabaab is the latest terrorist organization and constantly targets Somalia.
Boko Haram has constantly targeted Nigeria over the last few years.
The top 4 terrorist groups below either claim their acts personally or recruit educated young minds to post about their acts on website blogs, whichever medium spreads the news faster:
1) Taliban
2) Islamic State of Iraq and the Levant (ISIL)
3) Al-Shabaab
4) Tehrik-i-Taliban Pakistan (TTP)
Maoist forces have been active since 2014, which is exactly when India saw a change of hands at the central government after a decade.
ISIL, Boko Haram and the Taliban are among the deadliest organizations the world has ever witnessed.
The increasing number of suicide attacks from 2013 to 2017 is a worrying factor, and it is likely to increase further.
If these terrorist organizations have managed to brainwash people in great numbers into believing that their sacrifice will earn them a place in Heaven, then in the coming years they may succeed in building a large army ready to lay down its life for their cause.
We have applied 7 different models to the GTD dataset to predict the success of terrorist attacks around the world.
The prediction was done using these 22 features:
'iyear', 'imonth', 'iday', 'country_txt',
'region_txt', 'provstate', 'city', 'latitude',
'longitude', 'attacktype1_txt', 'weaptype1_txt', 'targtype1_txt',
'target1', 'nperps', 'nperpcap', 'nkillter',
'claimed', 'gname', 'nkill', 'nwound',
'natlty1_txt', 'property'
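A sketch of how the 22 features above and the “success” target might be extracted from the GTD dataframe. The label-encoding of text columns is an illustrative choice, and build_xy is a hypothetical helper name, not code from the report.

```python
import pandas as pd

# The 22 GTD columns selected for modeling.
FEATURES = ['iyear', 'imonth', 'iday', 'country_txt',
            'region_txt', 'provstate', 'city', 'latitude',
            'longitude', 'attacktype1_txt', 'weaptype1_txt', 'targtype1_txt',
            'target1', 'nperps', 'nperpcap', 'nkillter',
            'claimed', 'gname', 'nkill', 'nwound',
            'natlty1_txt', 'property']

def build_xy(df: pd.DataFrame):
    """Return the encoded feature matrix X and the 'success' target y."""
    X = df[FEATURES].copy()
    # Label-encode text columns so tree-based models can consume them.
    for col in X.select_dtypes(include="object"):
        X[col] = X[col].astype("category").cat.codes
    return X, df["success"]
```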
The Random Forest classifier, the Decision Tree classifier and the Gradient Tree Boosting algorithm gave the best results on the dataset:
Random Forest was the best classifier, with > 90% precision, recall and F1-score.
The second-best model was the Decision Tree, with almost 90% precision, recall and F1-score.
The third was Gradient Tree Boosting, with about 86% precision, recall and F1-score.
Therefore the Random Forest classification model is our final accepted model for predicting the success of terrorist attacks, as it provides the maximum scores for precision, accuracy and recall.
15. CONCLUSION
The report shows key trends largely in graphic and metric form. It does not attempt to provide the
supporting narrative that is critical to fully understanding these trends, nor to list all the many
qualifications made by the sources used regarding the limits of their models and data. These are areas
where the reader must consult the original sources directly —along with a wide range of narrative
material and other sources —to fully understand the trends that are displayed.
Even so, the report is necessarily complex. The report does show that there is value in looking at global trends, but makes it clear that many key trends are largely regional and must be examined on a regional basis. It also provides key country-by-country breakouts to show that the driving factors shaping the nature of terrorism in any given case are usually national. International networks certainly play a key role, as do factors like religion and culture, but the forms terrorism takes normally differ sharply even between neighboring countries.
The report also must be detailed to highlight the differences and uncertainties in much of the data.
There often are sharp differences in the most basic summary data, even between two highly respected
sources like START and IHS Jane's. These differences do not reflect failures in the analytic efforts of
the sources shown. They reflect differences that are inevitable in their need to rely on open source
material, the lack of any clear definition of terrorism, the problems in measuring and displaying
uncertainty, and the need to guess and extrapolate where key data are missing.
Also, by giving insight into the factors that alter the likelihood of a successful terrorist action, we can help governments identify areas of vulnerability for existing security resources and assess the benefits of new security resources. By providing insight into the predicted efficacy of terrorism over a given period of time, governments can mitigate the substantial indirect costs of terrorism by reserving financial resources to compensate for indirect losses following an attack.
In short, we aim to help governments stem the loss of life and property through insight into the factors that determine terrorism's success. And, should terrorism occur, we aim to give governments the tools to predict the budgets they will need to minimize the impact felt by society.
For future research, there is a plan to combine the classification algorithms with genetic algorithms and neural networks to improve the performance of the classifiers, or to build hybrids of different classifiers. Another direction for advanced research is to hybridize SVM with one of the heuristic algorithms and evaluate its prediction performance.
17. REFERENCES
Zhi-Hua Zhou, Ensemble Learning, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China.
Chaman Verma, Sarika Malhotra, Sharmila and Vineeta Verma (Sardar Vallabhbhai Patel University of Agriculture and Technology), Predictive Modeling of Terrorist Attacks Using Machine Learning, International Journal of Pure and Applied Mathematics, Vol. 119, No. 15, 2018, pp. 49-61.
Giorgio Valentini and Francesco Masulli, Ensembles of Learning Machines, INFM (Istituto Nazionale per la Fisica della Materia) and DISI, Università di Genova, 16146 Genova, Italy.
Ana Swanson, The eerie math that could predict terrorist attacks, Wonkblog, March 1, 2016.
V.S. Subrahmanian, Terrorist Social Network Analysis: Past, Present, and Future, UMIACS, University of Maryland.
Manoj K. Jha, Dynamic Bayesian Network for Predicting the Likelihood of a Terrorist Attack at Critical Transportation Infrastructure Facilities, Journal of Infrastructure Systems, ASCE, March 2009.
Ghada M. Tolan and Omar S. Soliman, An Experimental Study of Classification Algorithms for Terrorism Prediction, International Journal of Knowledge Engineering, Vol. 1, No. 2, September 2015.
Bauer, E. and Kohavi, R. (1999), An empirical comparison of voting classification algorithms: bagging, boosting and variants, Machine Learning, 36(1/2), 105-139.
Varun Teja Gundabathula and V. Vaidhehi, An Efficient Modelling of Terrorist Groups in India using Machine Learning Algorithms, Indian Journal of Science and Technology, Vol. 11(15), DOI: 10.17485/ijst/2018/v11i15/121766, April 2018.
DeRosa, M., CSIS Report: Data Mining and Data Analysis for Counterterrorism, Center for Strategic and International Studies, May 2004, pp. 1-32.
Sormani, R., Criticality assessment of terrorism related events at different time scales, Journal of Ambient Intelligence and Humanized Computing, Feb 2017, 8(1), 9-27.
Zulkepli, F.S., Ibrahim, R. and Saeed, F., Data pre-processing techniques for research performance analysis, Recent Developments in Intelligent Computing, Communication and Devices, Aug 2017, pp. 157-62.