PAM - Complete
PREDICTIVE ANALYTICS
MODELLING
By
Pulkit Dwivedi
Assistant Professor (Chandigarh University)
Course Objective
The students will be able to illustrate the interaction of multi-faceted
fields like data mining
• P. Simon, "Too Big to Ignore: The Business Case for Big Data", Wiley India, 2013.
OTHER LINKS
• https://developer.ibm.com/predictiveanalytics/videos/category/tutorials/
• https://www.ibm.com/developerworks/library/ba-predictive-analytics1/index.html
Data Deluge!
Structure of the Data
What is Data Mining?
How should mined information be?
Tasks in Data Mining
Anomaly Detection
Association Rule Mining
Clustering
Classification
Regression
Difference between Data Mining and ML
While data mining is simply looking for patterns that already
exist in the data, machine learning goes beyond what’s
happened in the past to predict future outcomes based on the
pre-existing data.
Different tools work with varying types of data mining, depending on the
algorithms they employ. Thus, data analysts must be sure to choose the
correct tools.
Data mining techniques are not infallible, so there’s always the risk that the
information isn’t entirely accurate. This obstacle is especially relevant if
there’s a lack of diversity in the dataset.
Companies can potentially sell the customer data they have gleaned to other
businesses and organizations, raising privacy concerns.
Data mining requires large databases, making the process hard to manage.
Applications of Data Mining
Must-have Skills You Need for Data Mining
COMPUTER SCIENCE SKILLS
1. Project Experience
1. Collect initial data: Acquire the necessary data and (if necessary) load
it into your analysis tool.
2. Describe data: Examine the data and document its surface properties
like data format, number of records, or field identities.
CRISP-DM Phase - 2
3. Explore data: Dig deeper into the data. Query it, visualize it, and
identify relationships among the data.
1. Select data: Determine which data sets will be used and document
reasons for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall
victim to garbage-in, garbage-out. A common practice during this task
is to correct, impute, or remove erroneous values.
CRISP-DM Phase - 3
4. Construct data: Derive new attributes that will be helpful. For
example, derive someone’s body mass index from height and weight
fields.
5. Integrate data: Create new data sets by combining data from multiple
sources.
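For teams working outside Modeler, a minimal pandas sketch of the Clean, Construct,
and Integrate tasks above; the file and column names here are hypothetical:

```python
# A rough illustration of CRISP-DM data preparation tasks in pandas.
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical source file

# Clean data: treat impossible heights as missing, then impute with the median.
df["height_cm"] = df["height_cm"].where(df["height_cm"] > 0)
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

# Construct data: derive body mass index from the height and weight fields.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Integrate data: combine data from multiple sources, keyed on a common ID.
labs = pd.read_csv("lab_results.csv")  # hypothetical second source
df = df.merge(labs, on="patient_id", how="left")

print(df.head())
```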
Here you’ll likely build and assess various models based on several
different modeling techniques.
A model is not particularly useful unless the customer can access its
results. The complexity of this phase varies widely.
https://www.ibm.com/account/reg/in-en/signup?formid=urx-19947
IBM SPSS Modeler
IBM® SPSS® Modeler is an analytical platform that enables
organizations and researchers to uncover patterns in data and
build predictive models to address key business outcomes.
[Figure: The SPSS Modeler user interface, showing the Menus, Toolbar, Manager, Stream Canvas, Nodes, Palettes, and Project Window]
Market Share of Analytics products
CRISP-DM in IBM SPSS Modeler
IBM® SPSS® Modeler incorporates the CRISP-DM methodology in two
ways to provide unique support for effective data mining.
The CRISP-DM project tool helps you organize project streams, output,
and annotations according to the phases of a typical data mining
project. You can produce reports at any time during the project based
on the notes for streams and CRISP-DM phases.
Once nodes have been placed on the stream canvas, they can be
linked together to form a stream.
SPSS Modeler GUI: Palettes
Nodes (operations on the data) are contained in palettes.
The palettes are located at the bottom of the Modeler user interface.
Each palette contains a group of related nodes that are available for
you to add to the data stream.
For example, the Sources palette contains nodes that you can use to
read data into Modeler, and the Graphs palette contains nodes that
you can use to explore your data visually.
The icons that are shown depend on the active, selected palette.
SPSS Modeler GUI: Palettes
The Favorites palette contains commonly used nodes. You can
customize which nodes appear in this palette, as well as their order—
for that matter, you can customize any palette.
Sources nodes are used to access data.
Record Ops nodes manipulate rows (cases).
Field Ops nodes manipulate columns (variables).
Graphs nodes are used for data visualization.
Modeling nodes contain dozens of data mining algorithms.
Output nodes present data to the screen.
Export nodes write data to a data file.
IBM SPSS Statistics nodes can be used in conjunction with IBM SPSS
Statistics.
SPSS Modeler GUI: Menus
In the upper left-hand section of the Modeler user interface, there are eight
menus. The menus control a variety of options within Modeler, as follows:
File allows users to create, open, and save Modeler streams and projects.
Edit allows users to perform editing operations, for example, copying/pasting
objects and editing individual nodes.
Insert allows users to insert a particular node as an alternative to dragging a node
from a palette.
View allows users to toggle between hiding and displaying items (for example,
the toolbar or the Project window).
Tools allows users to manipulate the environment in which Modeler works and
provides facilities for working with scripts.
SuperNode allows users to create, edit, and save a condensed stream.
Window allows users to close related windows (for example, all open output
windows), or switch between open windows.
Help allows users to access help on a variety of topics or to view a tutorial.
SPSS Modeler GUI: Toolbar
The icons on the toolbar represent commonly used options that can also be accessed
via the menus; however, the toolbar lets users enable these options through an
easy-to-use, one-click alternative. The options on the toolbar include:
The Streams tab opens, renames, saves, and deletes streams created
in a session.
The Outputs tab stores Modeler output, such as graphs and tables.
You can save output objects directly from this manager.
The Models tab contains the results of the models created in Modeler.
These models can be browsed directly from the Models tab or on the
stream displayed on the canvas.
SPSS Modeler GUI: Manager tabs
SPSS Modeler GUI: Project Window
In the lower right-hand corner of the Modeler user interface, there is
the project window. This window offers two ways to organize your
data mining work:
The CRISP-DM tab, which organizes your work according to the phases of a
typical data mining project
The Classes tab, which organizes your work in Modeler by the type of
objects created
Building streams
As was mentioned previously, Modeler allows users to mine data
visually on the stream canvas.
This means that you will not be writing code for your data mining
projects; instead you will be placing nodes on the stream canvas.
SPSS Modeler: Record ID Column
Building a stream
When two or more nodes have been placed on the stream canvas,
they need to be connected to produce a stream. This can be thought
of as representing the flow of data through the nodes.
Connecting nodes allows you to bring data into Modeler, explore the
data, manipulate the data (to either clean it up or create additional
fields), build a model, evaluate the model, and ultimately score the
data.
SPSS Modeler: Building a stream
SPSS Modeler: Data Audit Node
The Data Audit node provides a comprehensive first look at the
data you bring into IBM® SPSS® Modeler, presented in an easy-
to-read matrix that can be sorted and used to generate full-size
graphs and a variety of data preparation nodes.
Based upon the data source, options on the Data tab allow you
to override the specified storage type for fields as they are
imported (or created).
Unit of Analysis
One of the most important ideas in a research project is the unit
of analysis.
The unit of analysis is the major entity that you are analyzing in
your study.
For instance, any of the following could be a unit of analysis in a
study:
individuals
groups
artifacts (books, photos, newspapers)
geographical units (town, census tract, state)
social interactions (dyadic relations, divorces, arrests)
Why is it called the ‘unit of analysis’ and not something else
(like the unit of sampling)?
Why Is It Called the Unit of Analysis?
Because it is the analysis you do in your study that determines what the unit
is.
If you are comparing two classes on classroom climate, your unit of analysis
is the group, in this case the classroom, because you only have a classroom
climate score for the class as a whole and not for each individual student.
For different analyses in the same study you may have different units of
analysis.
Why Is It Called the Unit of Analysis?
If you decide to base an analysis on student scores, the individual is
the unit.
But if you decide to compare average classroom performance, the unit of
analysis is the group: even though you had data at the student level, you
use aggregates in the analysis.
ASSIGNMENT
Encoding Categorical Data
• Categorical data is data that takes a limited number of categories. In the
dataset below, there are two categorical variables: Country and Purchased.
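As an illustration, a minimal pandas sketch of encoding the two categorical
variables named above; the data values themselves are hypothetical:

```python
# Encode categorical columns so they can be used by modelling algorithms.
import pandas as pd

df = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain"],   # hypothetical rows
    "Age":       [44, 27, 30, 38],
    "Salary":    [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

# One-hot encode the Country column (one 0/1 column per category).
df = pd.get_dummies(df, columns=["Country"])

# Label-encode the binary target Purchased as 0/1.
df["Purchased"] = df["Purchased"].map({"No": 0, "Yes": 1})

print(df)
```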
The Age and Salary columns are not on the same scale: the Salary values
dominate the Age values and can distort the result. To remove this issue,
we need to perform feature scaling for machine learning.
Feature Scaling
• There are two ways to perform feature scaling in machine
learning:
• Standardization
• Normalization
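A minimal scikit-learn sketch of both approaches, applied to an Age/Salary
example like the one above (the numbers are hypothetical):

```python
# Feature scaling: standardization vs. normalization (min-max scaling).
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({"Age": [44, 27, 30, 38],
                   "Salary": [72000, 48000, 54000, 61000]})

# Standardization: x' = (x - mean) / standard deviation.
standardized = StandardScaler().fit_transform(df)

# Normalization (min-max): x' = (x - min) / (max - min), giving a [0, 1] range.
normalized = MinMaxScaler().fit_transform(df)

print(standardized)
print(normalized)
```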
Standardization
Standardization rescales each value as x' = (x - mean) / S.D. Worked example
for x = {6, 2, 3, 1}, using the population standard deviation (dividing by n = 4):

  x     x - mean    (x - mean)^2       x'
  6         3            9          1.603567
  2        -1            1         -0.534522
  3         0            0          0
  1        -2            4         -1.069045

SUM of x = 12, MEAN of x = 3
SUM of (x - mean)^2 = 14, MEAN of (x - mean)^2 = 3.5
S.D. = sqrt(3.5) = 1.870829
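The same calculation can be reproduced with NumPy; np.std uses the population
standard deviation by default, matching the table above:

```python
import numpy as np

x = np.array([6, 2, 3, 1])

mean = x.mean()            # 3.0
sd = x.std()               # sqrt(14 / 4) = 1.870829...
x_std = (x - mean) / sd

print(mean, sd)
print(x_std)               # approx. [ 1.6036, -0.5345, 0.0, -1.0690 ]
```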
Introduction to Modeling
Modeling Algorithm Types
Most Common Algorithms
Naïve Bayes Classifier Algorithm (Supervised Learning -
Classification)
Linear Regression (Supervised Learning/Regression)
Logistic Regression (Supervised Learning/Regression)
Decision Trees (Supervised Learning – Classification/Regression)
Random Forests (Supervised Learning – Classification/Regression)
K-Nearest Neighbours (Supervised Learning)
K-Means Clustering Algorithm (Unsupervised Learning - Clustering)
Support Vector Machine Algorithm (Supervised Learning - Classification)
Artificial Neural Networks (Supervised/Unsupervised/Reinforcement Learning)
Supervised Learning
The machine is taught by example.
The operator provides the learning algorithm with a known dataset that
includes desired inputs and outputs, and the algorithm must find a method
to determine how to arrive at those outputs from the given inputs.
While the operator knows the correct answers to the problem, the
algorithm identifies patterns in data, learns from observations and
makes predictions.
The algorithm makes predictions and is corrected by the operator – and
this process continues until the algorithm achieves a high level of
accuracy/performance.
Under the umbrella of supervised learning fall:
Classification
Regression
Forecasting
1. Classification: the ML program draws conclusions from observed values
and determines to which category new observations belong.
For example, when filtering emails as ‘spam’ or ‘not spam’, the program
must look at existing observational data and filter the emails
accordingly.
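As a sketch of the spam example, here is a small Naive Bayes text classifier
in scikit-learn (Naive Bayes is one of the algorithms listed above; the tiny
email dataset is invented purely for illustration):

```python
# Classify emails as spam / not spam from a handful of labelled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "free offer limited time", "project status report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()            # turn email text into word counts
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)                      # learn from the labelled examples

new_email = vectorizer.transform(["claim your free prize"])
print(model.predict(new_email))           # should print ['spam'] for this toy data
```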
Training data
Training data are the sub-dataset which we use to train a model.
Algorithms study the hidden patterns and insights which are hidden inside
these observations and learn from them.
The model will be trained over and over again using the data in the training
set and will continue to learn the features of this data.
Test data
In Machine learning Test data is the sub-dataset that we use to evaluate the
performance of a model built using a training dataset.
Although both the train and test data are extracted from the same dataset,
the test dataset should not contain any data from the training dataset.
Validation Data
Validation data are a sub-dataset separated from the training data, and it’s used to
validate the model during the training process.
During training, validation data infuses new data into the model that it hasn’t
evaluated before.
Validation data provides the first test against unseen data, allowing data scientists
to evaluate how well the model makes predictions based on the new data.
Not all data scientists use validation data, but it can provide some helpful
information to optimize hyperparameters, which influence how the model
assesses data.
There is some semantic ambiguity between validation data and testing data. Some
organizations call testing datasets “validation datasets.” Ultimately, if there are
three datasets to tune and check ML algorithms, validation data typically helps
tune the algorithm and testing data provides the final assessment.
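A minimal scikit-learn sketch of carving out the three subsets; the 60/20/20
split and the use of the iris dataset are assumptions for illustration:

```python
# Split a dataset into training, validation, and test subsets.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# First hold out 20% as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then carve a validation set out of the remainder (0.25 * 0.80 = 0.20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 90 / 30 / 30 rows for iris
```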
Test, Train and Validation Data
Decision Tree Classification
A Decision Tree is a supervised machine learning algorithm. It is used for
both classification and regression tasks.
A decision tree consists of a root node, children (internal) nodes, and leaf
nodes.
In a decision tree there are two types of nodes: the Decision Node and the
Leaf Node.
Decision Tree Classification
Decision nodes are used to make any decision and have multiple
branches.
Leaf nodes are the output of those decisions and do not contain
any further branches.
Below are the two reasons for using the Decision tree:
For the next node, the algorithm again compares the attribute value with
the other sub-nodes and moves further.
It continues this process until it reaches a leaf node of the tree.
Algorithm
Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best
attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you
cannot further classify the nodes; the final node is called a leaf node.
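A minimal scikit-learn sketch of these steps; the iris dataset stands in for
the complete dataset S, and criterion="entropy" corresponds to the
information-gain measure discussed below:

```python
# Train and inspect a small decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)              # recursively splits on the best attributes

print(tree.score(X_test, y_test))       # accuracy on unseen data
print(export_text(tree))                # the learned split rules, node by node
```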
Example
Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or Not.
So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM).
The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding
labels.
The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offer and Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to
select the best attribute for the root node and for the sub-nodes. To solve
such problems, there is a technique called the Attribute Selection Measure
(ASM). With this measure, we can easily select the best attribute for the
nodes of the tree. There are two popular techniques for ASM, which are:
Information Gain
Gini Index
Information Gain
Information gain is the measurement of changes in entropy after
the segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about a
class.
According to the value of information gain, we split the node and
build the decision tree.
A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest
information gain is split first. It can be calculated using the below
formula:
Information Gain = Entropy(S) - [(Weighted Avg.) x Entropy(each feature)]
Entropy
Entropy can be defined as a measure of the impurity (randomness) of a
sub-split. For a two-class split it always lies between 0 and 1, and it
can be calculated as:
Entropy(S) = -P(yes) x log2 P(yes) - P(no) x log2 P(no)
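A small Python sketch of both formulas; the toy labels and the split are
hypothetical, purely to show the calculation:

```python
# Compute entropy and information gain for a toy split.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """Entropy(S) minus the weighted average entropy of the subsets."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]   # 4 yes, 4 no
# Split on a hypothetical attribute into two subsets:
left, right = ["yes", "yes", "yes", "no"], ["no", "no", "yes", "no"]

print(round(entropy(parent), 3))                       # 1.0 (perfectly mixed)
print(round(information_gain(parent, [left, right]), 3))
```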
Confusion Matrix
A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of
test data for which the true values are known.
• True negatives (TN): We predicted no, and they don't have the
disease.
• Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
• Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
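A minimal scikit-learn sketch that builds the confusion matrix for the actual
and predicted lists above, treating ‘dog’ as the positive class:

```python
# Confusion matrix for the dog/cat example.
from sklearn.metrics import confusion_matrix

actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
          'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

# Rows are the actual classes, columns the predicted classes ('dog' first).
print(confusion_matrix(actual, predicted, labels=['dog', 'cat']))
# Expected:
# [[11  2]   TP = 11 dogs correctly predicted, FN = 2 dogs predicted as cats
#  [ 1  6]]  FP = 1 cat predicted as dog,      TN = 6 cats correctly predicted
```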
Another Example…
True Positive (TP) = 6
You predicted positive and it’s true. You predicted that an animal is a cat and it
actually is.
a. Accuracy
b. Precision
c. Recall
d. F1-Score
Accuracy
Accuracy simply measures how often the classifier makes the
correct prediction. It’s the ratio between the number of correct
predictions and the total number of predictions.
Precision
Precision is a measure of the correctness achieved in positive
predictions. In simple words, it tells us how many of all the
predictions labelled positive are actually positive:
Precision = TP / (TP + FP).
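Using the confusion-matrix counts that the dog/cat example above yields
(with dog as the positive class: TP = 11, TN = 6, FP = 1, FN = 2), all four
metrics can be computed directly from their formulas; recall and the
F1-score are included here for completeness:

```python
# Accuracy, precision, recall, and F1-score from confusion-matrix counts.
TP, TN, FP, FN = 11, 6, 1, 2

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all predictions
precision = TP / (TP + FP)                   # of predicted positives, how many are correct
recall    = TP / (TP + FN)                   # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)       # 0.85, ~0.917, ~0.846, ~0.880
```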
The cluster center is created in such a way that the distance between the
data points within a cluster is minimum compared with their distance to
the other cluster centroids.
Partitioning Clustering
Density-Based Clustering
The density-based clustering method connects the highly-dense
areas into clusters, and the arbitrarily shaped distributions are
formed as long as the dense region can be connected.
The dense areas in data space are divided from each other by
sparser areas.
• Since the clustering did not change at all during the last
iteration, we’re done.
Working of K-means Clustering: Step 6
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Choosing the Appropriate Number of
Clusters
Two methods that are commonly used to evaluate the
appropriate number of clusters:
The Squared Error for each point is the square of the distance of the
point from its representation i.e. its predicted cluster center.
The WCSS score is the sum of these Squared Errors for all the points.
The silhouette coefficient
The silhouette coefficient is another quality measure of clustering, and it
applies to any clustering, not just k-means. The silhouette coefficient of
observation i is calculated as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
a(i) is the mean distance from i to the other points in its own cluster and
b(i) is the mean distance from i to the points in the nearest other cluster.
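A minimal scikit-learn sketch of both methods on synthetic data; make_blobs
and the range of k values tried here are assumptions for illustration:

```python
# Compare candidate values of k using WCSS (elbow) and the silhouette score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = km.inertia_                      # within-cluster sum of squared errors
    sil = silhouette_score(X, km.labels_)   # average silhouette coefficient
    print(f"k={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")

# Pick k at the 'elbow' of the WCSS curve, or at the highest silhouette score.
```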
4. Scaling is required
2. Find the closest (most similar) pair of clusters and make them into one
cluster, we now have N-1 clusters. This can be done in various ways to
identify similar and dissimilar measures.
3. Find the two closest clusters and make them to one cluster. We now have N-
2 clusters. This can be done using agglomerative clustering linkage
techniques.
4. Repeat steps 2 and 3 until all observations are clustered into one single
cluster of size N.
Agglomerative Hierarchical Clustering
Clustering algorithms use various distance or dissimilarity measures to
develop different clusters. A lower (closer) distance indicates that the
observations are similar and would get grouped in a single cluster; in
other words, the lower the distance, the higher the similarity.
Here’s one way to calculate similarity – Take the distance between the
centroids of these clusters. The points having the least distance are
referred to as similar points and we can merge them. We can refer to
this as a distance-based algorithm as well (since we are calculating the
distances between the clusters).
√((10 - 7)²) = √9 = 3
Similarly, we can calculate all the distances and fill the proximity
matrix.
Steps to Perform Hierarchical Clustering
Step 1: First, we assign all the points to an individual cluster:
Here, we have taken the maximum of the two marks (7, 10) to
replace the marks for this cluster. Instead of the maximum, we
can also take the minimum value or the average values as
well. Now, we will again calculate the proximity matrix for
these clusters:
• Step 3: We will repeat step 2 until only a
single cluster is left.
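A minimal SciPy sketch of agglomerative clustering on a small one-dimensional
"marks" example like the one above (the actual values are hypothetical);
complete linkage matches the "take the maximum" idea used for the proximity
matrix:

```python
# Agglomerative hierarchical clustering with complete (maximum) linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

marks = np.array([[7], [10], [28], [20], [35]])   # hypothetical marks

# 'complete' linkage merges clusters by their maximum pairwise distance.
Z = linkage(marks, method='complete')

# Cut the tree so that only 2 clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) can be used to visualise the merges.
```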
• This is because failing to make a choice just means you are using the
default option for a procedure, which most of the time is not optimal.
• There are two problems associated with missing data, and these affect
the quantity and quality of the data:
• Selecting Define blanks chooses Null and White space (remember, Empty String is a
subset of White space, so it is also selected), and in this way these types of missing
data are specified. To specify a predefined code, or a blank value, you can add each
individual value to a separate cell in the Missing values area, or you can enter a range
of numeric values if they are consecutive.
• Select cases
• Sort cases
• Identify and remove duplicate cases
• Reclassify categorical values
Cleaning and Selecting Data
• Having finished the initial data understanding phase,
we are ready to move onto the data preparation phase.
For example, you might want to build a model that only includes
people that have certain characteristics (for example, customers
who have purchased something within the last six months).
• You can sort data on more than one field. In addition, each field
can be sorted in ascending or descending order
Identifying and removing duplicate
cases
• Datasets may contain duplicate records that often
must be removed before data mining can begin. For
example, the same individual may appear multiple
times in a dataset with different addresses.
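For comparison, a rough pandas analogue of these record operations; the file
and column names are hypothetical:

```python
# Select cases, sort cases, and remove duplicate cases.
import pandas as pd

df = pd.read_csv("customers.csv")

# Select cases: keep only customers who purchased within the last six months.
recent = df[df["months_since_purchase"] <= 6]

# Sort cases: on more than one field, mixing ascending and descending order.
recent = recent.sort_values(by=["region", "total_spend"],
                            ascending=[True, False])

# Identify and remove duplicate cases: keep one record per customer ID.
deduped = recent.drop_duplicates(subset="customer_id", keep="first")

print(len(df), len(recent), len(deduped))
```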
• The reclassified values can replace the original values for a field,
although a safer approach is to create a new field, thus retaining the
original field as well:
1. Place a Reclassify node from the Field Ops palette onto the
canvas.
• This can be the most creative and challenging aspect of a data mining project.
• For example, you might have survey data, but this data might need to be summed for
more information, such as a total score on the survey, or the average score on the
survey, and so on. In other words, it is important to create additional fields.
Drop-down list of Derive Node
Modeling Nodes