PAM - Unit1 PDF
https://www.ibm.com/developerworks/library/ba-predictive-analytics1/index.html
Data Deluge!
Structure of the Data
What is Data Mining?
How should mined information be?
Tasks in Data Mining
Anomaly Detection
Association Rule Mining
Clustering
Classification
Regression
Difference b/w Data Mining and ML
https://www.ibm.com/account/reg/in-en/signup?formid=urx-19947
IBM SPSS Modeler
IBM® SPSS® Modeler is an analytical platform that enables
organizations and researchers to uncover patterns in data
and build predictive models to address key business
outcomes.
In addition to a suite of predictive algorithms, SPSS Modeler also contains an extensive array of analytical routines, including data segmentation procedures, association analysis, anomaly detection, feature selection and time series forecasting.
IBM SPSS Modeler
These analytical capabilities, coupled with Modeler's rich functionality in the areas of data integration and preparation, enable users to build entire end-to-end applications, from the reading of raw data files to the deployment of predictions and recommendations back to the business.
As such, IBM® SPSS® Modeler is widely regarded as one of the most mature and powerful applications of its kind.
IBM SPSS Modeler GUI
[Annotated screenshot of the Modeler interface: Menu, Toolbar, Manager, Stream Canvas, Palettes, Nodes, Project Window.]
Market Share of Analytics products
CRISP-DM in IBM SPSS Modeler
IBM® SPSS® Modeler incorporates the CRISP-DM methodology in
two ways to provide unique support for effective data mining.
The CRISP-DM project tool helps you organize project streams,
output, and annotations according to the phases of a typical data
mining project. You can produce reports at any time during the
project based on the notes for streams and CRISP-DM phases.
Help for CRISP-DM guides you through the process of conducting a data mining project. The help system includes task lists for each step as well as examples of how CRISP-DM works in the real world. You can access it by choosing CRISP-DM Help from the Help menu in the main window.
SPSS Modeler GUI: Stream Canvas
The stream canvas is the main work area in Modeler.
It is located in the center of the Modeler user interface.
The stream canvas can be thought of as a surface on which to
place icons or nodes.
These nodes represent operations to be carried out on the data.
Once nodes have been placed on the stream canvas, they can
be linked together to form a stream.
SPSS Modeler GUI: Palettes
Nodes (operations on the data) are contained in palettes.
The palettes are located at the bottom of the Modeler user
interface.
Each palette contains a group of related nodes that are available
for you to add to the data stream.
For example, the Sources palette contains nodes that you can use
to read data into Modeler, and the Graphs palette contains nodes
that you can use to explore your data visually.
The icons that are shown depend on the active, selected palette.
SPSS Modeler GUI: Palettes
[Screenshot: the node palettes at the bottom of the Modeler window.]
SPSS Modeler GUI: Palettes
The Favorites palette contains commonly used nodes. You can
customize which nodes appear in this palette, as well as their order
— for that matter, you can customize any palette.
Sources nodes are used to access data.
Record Ops nodes manipulate rows (cases).
Field Ops nodes manipulate columns (variables).
Graphs nodes are used for data visualization.
Modeling nodes contain dozens of data mining algorithms.
Output nodes present data to the screen.
Export nodes write data to a data file.
IBM SPSS Statistics nodes can be used in conjunction with IBM
SPSS Statistics.
SPSS Modeler GUI: Menus
In the upper left-hand section of the Modeler user interface, there are
eight menus. The menus control a variety of options within Modeler, as
follows:
File allows users to create, open, and save Modeler streams and projects.
Edit allows users to perform editing operations, for example, copying/pasting
objects and editing individual nodes.
Insert allows users to insert a particular node as an alternative to dragging a
node from a palette.
View allows users to toggle between hiding and displaying items (for
example, the toolbar or the Project window).
Tools allows users to manipulate the environment in which Modeler works and
provides facilities for working with scripts.
SuperNode allows users to create, edit, and save a condensed stream.
Window allows users to close related windows (for example, all open
output windows), or switch between open windows.
Help allows users to access help on a variety of topics or to view a tutorial.
SPSS Modeler GUI: Toolbar
The icons on the toolbar represent commonly used options that can also be accessed via the menus; the toolbar, however, gives users an easy-to-use, one-click alternative for enabling these options. The toolbar options are shown in the screenshot below.
[Screenshot: the toolbar, with the Manager tabs labelled.]
SPSS Modeler GUI: Project Window
[Screenshot: the Project window.]
Building streams
As was mentioned previously, Modeler allows users to mine
data visually on the stream canvas.
This means that you will not be writing code for your data
mining projects; instead you will be placing nodes on the stream
canvas.
Remember that nodes represent operations to be carried out on
the data. So once nodes have been placed on the stream canvas,
they need to be linked together to form a stream.
A stream represents the flow of data going through a number
of operations (nodes).
SPSS Modeler: Adding Data Source Node
Read the data
SPSS Modeler: Preview your data
SPSS Modeler: Data Source
SPSS Modeler: Filter Data
Record ID column
SPSS Modeler: Record ID Column
Record ID column
SPSS Modeler: Data Audit Node
Building a stream
When two or more nodes have been placed on the stream canvas,
they need to be connected to produce a stream. This can be
thought of as representing the flow of data through the nodes.
Connecting nodes allows you to bring data into Modeler, explore
the data, manipulate the data (to either clean it up or create
additional fields), build a model, evaluate the model, and ultimately
score the data.
SPSS Modeler: Building a stream
SPSS Modeler: Data Audit Node
ASSIGNMENT
Encoding Categorical Data
Categorical data is data that falls into a limited number of categories; in the dataset below, for example, there are two categorical variables, Country and Purchased.
Machine learning models work entirely on mathematics and numbers, so if the dataset contains a categorical variable it can create trouble while building the model. It is therefore necessary to encode these categorical variables into numbers, as sketched below.
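A minimal sketch of the two common encodings using pandas and scikit-learn; the column names Country and Purchased follow the example above, while the file name data.csv is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the example dataset (file name assumed for illustration)
df = pd.read_csv("data.csv")   # columns: Country, Age, Salary, Purchased

# Label encoding: map each category of the target to an integer (No -> 0, Yes -> 1)
df["Purchased"] = LabelEncoder().fit_transform(df["Purchased"])

# One-hot encoding: create one 0/1 dummy column per country
df = pd.get_dummies(df, columns=["Country"])
print(df.head())
```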
Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put the variables on the same scale so that no variable dominates the others.
The age and salary columns are not on the same scale: the salary values dominate the age values and would produce an incorrect result. To remove this issue, we perform feature scaling; a short sketch follows the list of methods below.
Feature Scaling
There are two ways to perform feature scaling in machine learning:
Standardization
Normalization
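Both methods are available in scikit-learn; a minimal sketch, with the age and salary values assumed purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data with columns on very different scales (values assumed for illustration)
df = pd.DataFrame({"age": [25, 32, 47, 51], "salary": [48000, 54000, 61000, 90000]})

# Standardization: rescale each column to mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(df)

# Normalization (min-max): rescale every value into the range [0, 1]
normalized = MinMaxScaler().fit_transform(df)

print(standardized)
print(normalized)
```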
Standardization
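For reference (the original slide shows these only as an image), the formulas commonly used are:
Standardization: x' = (x − mean) / standard deviation
Normalization (min-max): x' = (x − min) / (max − min)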
Dummy Variable Trap
[Figure: a categorical variable is label encoded and then one-hot encoded into dummy (0/1) columns.]
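The trap itself arises because the full set of dummy columns is perfectly collinear (for each row they always sum to 1), which breaks models such as linear regression. The usual remedy, sketched below with pandas (the Country values are assumed for illustration), is to drop one dummy column:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# drop_first=True removes one dummy column, avoiding the dummy variable trap
dummies = pd.get_dummies(df, columns=["Country"], drop_first=True)
print(dummies)
```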
Introduction to Modeling
Modeling Algorithm Types
Most Common Algorithms
Naïve Bayes Classifier Algorithm (Supervised
Learning - Classification)
Linear Regression (Supervised Learning/Regression)
Logistic Regression (Supervised Learning/Regression)
Decision Trees (Supervised Learning – Classification/Regression)
Random Forests (Supervised Learning –
Classification/Regression)
K-Nearest Neighbours (Supervised Learning)
K Means Clustering Algorithm (Unsupervised
Learning - Clustering)
Support Vector Machine Algorithm (Supervised
Learning - Classification)
Artificial Neural Networks (Reinforcement Learning)
Supervised Learning
The machine is taught by example.
The operator provides the learning algorithm with a known dataset that includes desired inputs and outputs, and the algorithm must find a method to determine how to arrive at those outputs from the inputs.
While the operator knows the correct answers to the problem,
the algorithm identifies patterns in data, learns from
observations and makes predictions.
The algorithm makes predictions and is corrected by the operator –
and this process continues until the algorithm achieves a high level of
accuracy/performance.
Under the umbrella of supervised learning fall:
Classification
Regression
Forecasting
1. Classification: the ML program draws a conclusion from observed values and determines to which category new observations belong. For example, when filtering emails as ‘spam’ or ‘not spam’, the program must look at existing observational data and filter the emails accordingly.
2. Regression: the ML program must estimate and understand the relationships among variables.
3. Forecasting: the process of making predictions about the future based on past and present data; it is commonly used to analyze trends.
Classification Example: Object Recognition
Classification Example: Credit Scoring
During training, validation data infuses new data into the model that it hasn’t
evaluated before.
Validation data provides the first test against unseen data, allowing data scientists
to evaluate how well the model makes predictions based on the new data.
Not all data scientists use validation data, but it can provide some helpful
information to optimize hyperparameters, which influence how the model
assesses data.
There is some semantic ambiguity between validation data and testing data.
Some organizations call testing datasets “validation datasets.” Ultimately, if there
are three datasets to tune and check ML algorithms, validation data typically
helps tune the algorithm and testing data provides the final assessment.
Test, Train and Validation Data
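A minimal sketch of a three-way split with scikit-learn; the 70/15/15 proportions and the Iris dataset are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data, then split that portion half-and-half
# into validation and test sets (roughly 70/15/15 overall).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```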
Decision Tree Classification
A Decision Tree is a supervised machine learning algorithm. It is used for both classification and regression tasks.
The decision tree is like a tree with nodes. The branches depend on a number of factors: the algorithm splits the data into branches until it reaches a threshold value.
A decision tree consists of a root node, child nodes, and leaf nodes.
In a decision tree there are two types of nodes: the Decision Node and the Leaf Node.
Decision Tree Classification
Decision nodes are used to make decisions and have multiple branches.
Leaf nodes are the outputs of those decisions and do not contain any further branches.
To build a tree, we use the CART algorithm, which stands for Classification And Regression Tree.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees. A small scikit-learn sketch is shown below.
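A minimal sketch of a CART-style tree with scikit-learn; the Iris dataset and the parameter values are assumptions for illustration, not part of the original slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=0)

# criterion="entropy" splits on information gain; "gini" is the default CART criterion
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Print the learned Yes/No questions and report accuracy on held-out data
print(export_text(tree, feature_names=iris.feature_names))
print("test accuracy:", tree.score(X_test, y_test))
```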
Why use Decision Tree?
There are various algorithms in Machine learning, so choosing
the best algorithm for the given dataset and problem is the
main point to remember while creating a machine learning
model.
Below are the two reasons for using the Decision tree:
1. Decision Trees usually mimic human thinking while making a decision, so they are easy to understand.
2. The logic behind the decision tree can be easily understood
because it shows a tree-like structure.
Decision Tree Terminologies
How does the Decision Tree Algorithm Work?
In a decision tree, for predicting the class of a given dataset, the algorithm starts from the root node of the tree.
This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison,
follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
It continues the process until it reaches the leaf node of the
tree.
Algorithm
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes.
Example
Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM).
The root node splits further into the next decision node
(distance from the office) and one leaf node based on the
corresponding labels.
The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:
Information Gain
Gini Index
Information Gain
Information gain is the measurement of changes in entropy
after the segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about
a class.
According to the value of information gain, we split the node
and build the decision tree.
A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest
information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) − [ (Weighted Avg) × Entropy(each feature) ]
Entropy
Entropy can be defined as a measure of the impurity (randomness) of a sub-split; an entropy of 0 means the split is completely pure. For a two-class problem, entropy always lies between 0 and 1. The entropy of any split can be calculated by the formula below.
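In the usual two-class notation: Entropy(S) = −p(yes)·log2 p(yes) − p(no)·log2 p(no). A small worked sketch in Python, using the classic 9-yes/5-no example counts (the counts are assumptions for illustration):

```python
from math import log2

def entropy(pos: int, neg: int) -> float:
    """Entropy of a node containing `pos` positive and `neg` negative cases."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

# Parent node: 9 "yes", 5 "no". Split into children of 8 and 6 cases (counts assumed).
parent = entropy(9, 5)                                     # ~0.940
weighted_children = (8/14) * entropy(6, 2) + (6/14) * entropy(3, 3)
information_gain = parent - weighted_children              # ~0.048
print(round(parent, 3), round(information_gain, 3))
```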
Confusion Matrix
a. Accuracy
b. Precision
c. Recall
d. F1-Score
Accuracy
Accuracy simply measures how often the classifier makes
the correct prediction. It’s the ratio between the number of
correct predictions and the total number of predictions.
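In confusion matrix terms, Accuracy = (TP + TN) / (TP + TN + FP + FN). A minimal sketch computing the four metrics listed above with scikit-learn; the label vectors are assumptions for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (assumed)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (assumed)

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```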
Precision
Random Forest
Important Hyperparameters