
APEX INSTITUTE OF TECHNOLOGY (AIT-IBM CSE)
CHANDIGARH UNIVERSITY, MOHALI
PREDICTIVE ANALYTICS MODELLING
By Pulkit Dwivedi, Assistant Professor (Chandigarh University)
Course Objective
The students will be able to illustrate the interaction of multi-faceted fields like data mining.
The students will be able to understand the role of statistics and mathematics in the development of predictive analytics.
The students shall understand and apply the concepts of different models.
The students shall understand various aspects of the IBM SPSS Modeler interface.
The students shall be able to familiarize themselves with various data clustering and dimension reduction techniques.
Books
E. Siegel, "Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die", John Wiley & Sons, Inc., 2013.
P. Simon, "Too Big to Ignore: The Business Case for Big Data", Wiley India, 2013.
J. W. Foreman, "Data Smart: Using Data Science to Transform Information into Insight", Addison-Wesley.
OTHER LINKS
https://developer.ibm.com/predictiveanalytics/videos/category/tutorials/
https://www.ibm.com/developerworks/library/ba-predictive-analytics1/index.html
Data Deluge!
Structure of the Data
What is Data Mining?
How should mined information be?
Tasks in Data Mining
Anomaly Detection
Association Rule Mining
Clustering
Classification
Regression
Difference b/w Data Mining and ML
While data mining is simply looking for patterns that already exist in the data, machine learning goes beyond what's happened in the past to predict future outcomes based on the pre-existing data.
In data mining, the 'rules' or patterns are unknown at the start of the process. Whereas, with machine learning, the machine is usually given some rules or variables to understand the data and learn.
Difference b/w Data Mining and ML
Data mining is a more manual process that relies on human intervention and decision making. But, with machine learning, once the initial rules are in place, the process of extracting information, 'learning' and refining is automatic, and takes place without human intervention. In other words, the machine becomes more intelligent by itself.
Data mining is used on an existing dataset (like a data
warehouse) to find patterns. Machine learning, on the
other hand, is trained on a ‘training’ data set, which
teaches the computer how to make sense of data, and
then to make predictions about new data sets.
Are There Any Drawbacks
to Data Mining?
Drawbacks of Data Mining
Many data analytics tools are complex and challenging to use. Data
scientists need the right training to use the tools effectively.
Different tools work with varying types of data mining, depending on
the algorithms they employ. Thus, data analysts must be sure to choose
the correct tools.
Data mining techniques are not infallible, so there’s always the risk that
the information isn’t entirely accurate. This obstacle is especially relevant if
there’s a lack of diversity in the dataset.
Companies can potentially sell the customer data they have gleaned to
other businesses and organizations, raising privacy concerns.
Data mining requires large databases, making the process hard to manage.
Applications of Data Mining
Must-have Skills You
Need for Data Mining
Skills You Need for Data Mining
COMPUTER SCIENCE SKILLS
1. Programming/statistics languages: R, Python, C++, Java, Matlab, SQL, SAS
2. Big data processing frameworks: Hadoop, Storm, Samza, Spark, Flink
3. Operating system: Linux
4. Database knowledge: Relational databases (SQL or Oracle) & non-relational databases (MongoDB, Cassandra, Dynamo, CouchDB)
Skills You Need for Data Mining
STATISTICS AND ALGORITHM SKILLS
1. Basic statistics knowledge: probability, probability distributions, correlation, regression, linear algebra, stochastic processes
2. Data structures include arrays, linked lists, stacks, queues, trees, hash tables, sets, etc., and common algorithms include sorting, searching, dynamic programming, recursion, etc.
3. Machine learning/deep learning algorithms
Skills You Need for Data Mining
OTHER REQUIRED SKILLS
1. Project experience
2. Communication & presentation skills
Data Mining
Process/Lifecycle/Strategy
CRISP-DM Data Mining Strategy
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:
1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?
CRISP-DM Phase - 1
BUSINESS UNDERSTANDING
The Business Understanding phase focuses on understanding the objectives and requirements of the project.
1. Determine business objectives: You should first "thoroughly understand, from a business perspective, what the customer really wants to accomplish" and then define business success criteria.
2. Assess situation: Determine resource availability and project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.
CRISP-DM Phase - 1
3. Determine data mining goals: In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.
While many teams hurry through this phase, establishing a strong business understanding is like building the foundation of a house – absolutely essential.
Any good project starts
with a deep understanding
of the customer’s needs.
Data mining projects are no
exception and CRISP-DM recognizes
this.
CRISP-DM Phase – 2
DATA UNDERSTANDING
Adding to the foundation of Business Understanding, this phase drives the focus to identify, collect, and analyze the data sets that can help you accomplish the project goals.
1. Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
2. Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.
CRISP-DM Phase - 2
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any quality issues.
CRISP-DM Phase - 3
DATA PREPARATION
This phase, which is often referred to as “data
munging”, prepares the final data set(s) for
modeling.
1. Select data: Determine which data sets will be used and
document reasons for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll
likely fall victim to garbage-in, garbage-out. A common practice
during this task is to correct, impute, or remove erroneous values.
CRISP-DM Phase - 3
3. Construct data: Derive new attributes that will be helpful. For example, derive someone's body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple sources.
5. Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
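A rough pandas sketch of the construct and format tasks above (the column names height_m, weight_kg and income are hypothetical, not from the course data):

import pandas as pd

# Hypothetical customer table with height/weight and a numeric field stored as text
df = pd.DataFrame({
    "height_m": [1.70, 1.82, 1.65],
    "weight_kg": [68, 90, 55],
    "income": ["42,000", "55,500", "31,250"],   # numbers stored as strings
})

# Construct data: derive a new attribute (body mass index) from existing fields
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Format data: convert string values that store numbers into numeric values
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""))

print(df.dtypes)
print(df)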
A common rule of thumb
is that 80% of the project
is data preparation.
CRISP-DM Phase - 4
MODELING
Here you’ll likely build and assess various models
based on several different modeling techniques.
1. Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
2. Generate test design: Depending on your modeling approach, you might need to split the data into training, test, and validation sets.
CRISP-DM Phase - 4
3. Build model: As glamorous as this might sound, this might just be executing a few lines of code like reg = LinearRegression().fit(X, y).
4. Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.
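A minimal scikit-learn sketch of the build and assess tasks, using synthetic data as a stand-in for a prepared modeling table (the variable names and the choice of mean absolute error are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Toy data standing in for the prepared modeling table
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Generate test design: hold out part of the data for assessment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build model: often just a few lines of code, as the slide notes
reg = LinearRegression().fit(X_train, y_train)

# Assess model: compare predictions on the held-out data against the success criteria
print("MAE on test data:", mean_absolute_error(y_test, reg.predict(X_test)))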
In practice teams should continue iterating
until they find a “good enough” model,
proceed through the CRISP-DM lifecycle, then
further improve the model in future iterations.
What is widely regarded as
data science’s most exciting
work is also often the
shortest phase of the project
CRISP-DM Phase - 5
EVALUATION
Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business objectives and what to do next.
1. Evaluate results: Do the models meet the business
success criteria? Which one(s) should we approve for the
business?
CRISP-DM Phase - 5
2. Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed.
3. Determine next steps: Based on the previous three tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.
CRISP-DM Phase - 6
DEPLOYMENT
A model is not particularly useful unless the
customer can access its results. The complexity of
this phase varies widely.
1. Plan deployment: Develop and document a plan for
deploying the model.
2. Plan monitoring and maintenance: Develop a thorough
monitoring and maintenance plan to avoid issues during the
operational phase (or post-project phase) of a model.
CRISP-DM Phase - 6
3. Produce final report: The project team documents a summary of the project, which might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.
Your organization’s work might not end there. As
a project framework, CRISP-DM does not outline
what to do after the project (also known as
“operations”). But if the model is going to
production, be sure you maintain the model in
production. Constant monitoring and occasional
model tuning is often required.
Depending on the
requirements, the
deployment phase can be
as simple as generating a
report or as complex as
implementing a
repeatable data mining
process across the
enterprise.
IBM SPSS Modeler
https://www.ibm.com/account/reg/in-en/signup?formid=urx-19947
IBM SPSS Modeler
IBM® SPSS® Modeler is an analytical platform that enables
organizations and researchers to uncover patterns in data
and build predictive models to address key business
outcomes.
Moreover, aside from a suite of predictive algorithms, SPSS
Modeler also contains an extensive array of analytical routines
that include data segmentation procedures, association
analysis, anomaly detection, feature selection and time series
forecasting.
IBM SPSS Modeler
These analytical capabilities, coupled with Modeler's rich functionality in the areas of data integration and preparation, enable users to build entire end-to-end applications, from the reading of raw data files to the deployment of predictions and recommendations back to the business.
As such, IBM® SPSS® Modeler is widely regarded as one of the most mature and powerful applications of its kind.
IBM SPSS Modeler GUI
[Figure: the Modeler interface, showing the menu, toolbar, stream canvas, palettes and nodes, manager tabs, and project window.]
Market Share of Analytics products
CRISP-DM in IBM SPSS Modeler
IBM® SPSS® Modeler incorporates the CRISP-DM methodology in
two ways to provide unique support for effective data mining.
The CRISP-DM project tool helps you organize project streams,
output, and annotations according to the phases of a typical data
mining project. You can produce reports at any time during the
project based on the notes for streams and CRISP-DM phases.
Help for CRISP-DM guides you through the process of conducting a data mining project. The help system includes task lists for each step as well as examples of how CRISP-DM works in the real world. You can access CRISP-DM Help by choosing CRISP-DM Help from the main window Help menu.
SPSS Modeler GUI: Stream Canvas
The stream canvas is the main work area in Modeler.
It is located in the center of the Modeler user interface.
The stream canvas can be thought of as a surface on which to
place icons or nodes.
These nodes represent operations to be carried out on the data.
Once nodes have been placed on the stream canvas, they can
be linked together to form a stream.
SPSS Modeler GUI: Palettes
Nodes (operations on the data) are contained in palettes.
The palettes are located at the bottom of the Modeler user
interface.
Each palette contains a group of related nodes that are available
for you to add to the data stream.
For example, the Sources palette contains nodes that you can use
to read data into Modeler, and the Graphs palette contains nodes
that you can use to explore your data visually.
The icons that are shown depend on the active, selected palette.
SPSS Modeler GUI: Palettes
The Favorites palette contains commonly used nodes. You can
customize which nodes appear in this palette, as well as their order
— for that matter, you can customize any palette.
Sources nodes are used to access data.
Record Ops nodes manipulate rows (cases).
Field Ops nodes manipulate columns (variables).
Graphs nodes are used for data visualization.
Modeling nodes contain dozens of data mining algorithms.
Output nodes present data to the screen.
Export nodes write data to a data file.
IBM SPSS Statistics nodes can be used in conjunction with IBM
SPSS Statistics.
SPSS Modeler GUI: Menus
In the upper left-hand section of the Modeler user interface, there are
eight menus. The menus control a variety of options within Modeler, as
follows:
File allows users to create, open, and save Modeler streams and projects.
Edit allows users to perform editing operations, for example, copying/pasting
objects and editing individual nodes.
Insert allows users to insert a particular node as an alternative to dragging a
node from a palette.
View allows users to toggle between hiding and displaying items (for
example, the toolbar or the Project window).
Tools allows users to manipulate the environment in which Modeler works and
provides facilities for working with scripts.
SuperNode allows users to create, edit, and save a condensed stream.
Window allows users to close related windows (for example, all open
output windows), or switch between open windows.
Help allows users to access help on a variety of topics or to view a tutorial.
SPSS Modeler GUI: Toolbar
The icons on the toolbar represent commonly used options that can also be accessed via the menus; however, the toolbar allows users to enable these options via an easy-to-use, one-click alternative. The options on the toolbar include:
Creating a new stream
Opening an existing stream
Saving the current stream
Printing the current stream
Moving a selection to the clipboard
Copying a selection to the clipboard
Pasting a selection to the clipboard
Undoing the last action
Redoing the last action
Searching for nodes in the current stream
Editing the properties of the current stream
Previewing the running of a stream
Running the current stream
Running a selection
Encapsulating selected nodes into a supernode
Zooming in on a supernode
Showing/hiding stream markup
Inserting a new comment
Opening IBM SPSS Modeler Advantage
SPSS Modeler GUI: Manager tabs
In the upper right-hand corner of the Modeler user interface, there
are three types of manager tabs. Each tab (Streams, Outputs, and
Models) is used to view and manage the corresponding type of
object, as follows:
The Streams tab opens, renames, saves, and deletes streams
created in a session.
The Outputs tab stores Modeler output, such as graphs and
tables. You can save output objects directly from this manager.
The Models tab contains the results of the models created in
Modeler. These models can be browsed directly from the Models tab
or on the stream displayed on the canvas.
SPSS Modeler GUI: Manager tabs
[Figure: the manager tabs in the upper right-hand corner of the Modeler window.]
SPSS Modeler GUI: Project Window
In the lower right-hand corner of the Modeler user interface, there is the project window. This window offers two ways to organize your data mining work:
The CRISP-DM tab, which organizes streams, output, and annotations according to the phases of the CRISP-DM process model.
The Classes tab, which organizes your work in Modeler by the type of objects created.
SPSS Modeler GUI: Project Window
[Figure: the project window in the lower right-hand corner of the Modeler window.]
Building streams
As was mentioned previously, Modeler allows users to mine
data visually on the stream canvas.
This means that you will not be writing code for your data
mining projects; instead you will be placing nodes on the stream
canvas.
Remember that nodes represent operations to be carried out on
the data. So once nodes have been placed on the stream canvas,
they need to be linked together to form a stream.
A stream represents the flow of data going through a number
of operations (nodes).
SPSS Modeler: Adding Data Source Node
[Figure: reading the data with a source node.]
SPSS Modeler: Preview your data
SPSS Modeler: Data Source
SPSS Modeler: Filter Data
You can choose which features/variables you want to consider.
SPSS Modeler: Check features type
SPSS Modeler: Input/target variable
SPSS Modeler: Record ID Column
[Figure: specifying the Record ID column.]
SPSS Modeler: Data Audit Node
Building a stream
When two or more nodes have been placed on the stream canvas,
they need to be connected to produce a stream. This can be
thought of as representing the flow of data through the nodes.
Connecting nodes allows you to bring data into Modeler, explore
the data, manipulate the data (to either clean it up or create
additional fields), build a model, evaluate the model, and ultimately
score the data.
SPSS Modeler: Building a stream
SPSS Modeler: Data Audit Node
The Data Audit node provides a comprehensive first look at the data you bring into IBM® SPSS® Modeler, presented in an easy-to-read matrix that can be sorted and used to generate full-size graphs and a variety of data preparation nodes.
The Audit tab displays a report that provides summary statistics, histograms, and distribution graphs that may be useful in gaining a preliminary understanding of the data. The report also displays the storage icon before the field name.
The Quality tab in the audit report displays information about outliers, extremes, and missing values, and offers tools for handling these values.
SPSS Modeler: Graph Node
SPSS Modeler: Checking Quality
Collecting your data: Data
Structure, Data type etc.
Data Structure
SPSS Modeler source nodes allow you to read data from:
Industry-standard comma-delimited (CSV) text files, including ASCII files.
XML files (wish TM1 did this!)
Statistics files
SAS
Excel
Databases (DB2™, Oracle™, SQL Server™, and a variety of other databases) supported via ODBC
Other OLAP cubes, including Cognos TM1 cubes and views.
Data Structure
SPSS Modeler requires a "rectangular" data structure – records (rows of the data table) and fields (columns of the data table) – that is handled in the Data tab.
Based upon the data source, options on the Data tab allow you to override the specified storage type for fields as they are imported (or created).
Unit of Analysis
One of the most important ideas in a research project is the
unit of analysis.
The unit of analysis is the major entity that you are analyzing
in your study.
For instance, any of the following could be a unit of analysis in
a study:
individuals
groups
artifacts (books, photos, newspapers)
geographical units (town, census tract, state)
social interactions (dyadic relations, divorces, arrests)
Why is it called the ‘unit of
analysis’ and not something else
(like, the unit of sampling)?
Why it is called Unit of Analysis?
Because it is the analysis you do in your study that determines what the
unit is.
For instance, if you are comparing the children in two classrooms on
achievement test scores, the unit is the individual child because you have
a score for each child.
On the other hand, if you are comparing the two classes on classroom
climate, your unit of analysis is the group, in this case the classroom,
because you only have a classroom climate score for the class as a whole
and not for each individual student.
For different analyses in the same study you may have different units
of analysis.
Why it is called Unit of Analysis?
If you decide to base an analysis on student scores, the individual
is the unit.
But you might decide to compare average classroom performance. In
this case, since the data that goes into the analysis is the average
itself (and not the individuals’ scores) the unit of analysis is actually
the group.
Even though you had data at the student level, you use aggregates
in the analysis.
In many areas of research these hierarchies of analysis units have
become particularly important and have spawned a whole area of
statistical analysis sometimes referred to as hierarchical
modeling.
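A small pandas sketch of how the unit of analysis changes when student-level scores are aggregated to classroom averages (the classroom/student/achievement columns and values are hypothetical):

import pandas as pd

# Hypothetical student-level scores (one row per student)
scores = pd.DataFrame({
    "classroom": ["A", "A", "A", "B", "B", "B"],
    "student":   [1, 2, 3, 4, 5, 6],
    "achievement": [72, 85, 90, 60, 75, 68],
})

# Student-level analysis: the unit of analysis is the individual child
print(scores[["student", "achievement"]])

# Classroom-level analysis: aggregate to one row per class,
# so the unit of analysis becomes the group (the classroom)
class_means = scores.groupby("classroom", as_index=False)["achievement"].mean()
print(class_means)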
Field Storage
Use the Field column to view and select fields in the current dataset.
Available Storage Types
String: Contains non-numeric/alpha-numeric data
Integer: Contains integer values
Real: Values are numbers that may include a decimal (fractional) component
Date: Date values specified in a standard format such as year,
month, and day (for example, 2007-09-26).
Time: Time measured as a duration. For example, a service call
lasting 1 hour, 26 minutes, and 38 seconds might be represented as
01:26:38, depending on the current time format as specified in the
Stream Properties dialog box.
Timestamp: Values that include both a date and time
component, for example 2007–09–26 09:04:00, again depending on
the current date and time formats in the Stream Properties dialog
box.
List: Introduced in SPSS Modeler version 17, a List storage
field contains multiple values for a single record.
List storage type icons
List measurement level icons
Restructuring Data
A key concern for data analysts engaged in predictive analytics
projects is defining and creating the ‘unit of analysis’. Simply,
this refers to what a row of data needs to represent in order
for the analysis to make sense.
If, for example, the analytical goal of the project is to predict
which customers are likely to redeem a voucher, then the
predictive model will probably require data where each row is
a different customer not a different transaction.
If, on the other hand, the analytical goal is to identify
fraudulent attempts to purchase tickets for an event, then the
model probably requires sample data where each row is a
transaction.
Restructuring Data: The Distinct Node
Duplications in data files are a regular headache for most businesses, and they can occur for a lot of reasons.
Often, customers may register themselves more than once; contact details may be updated, causing an extra row of data to be added rather than overwriting the existing one; merging information from different departments or organisations may create duplicates because there isn't an exact match.
In any case, Modeler helps us to identify duplicate cases and resolve these issues with the Distinct node.
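Modeler handles this with the Distinct node; as a rough analogue in pandas (not Modeler itself), duplicates can be dropped like this, where the customer_id and email columns are hypothetical:

import pandas as pd

# Hypothetical customer registrations with an accidental duplicate
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

# Keep the first occurrence of each customer_id and drop the rest
deduped = customers.drop_duplicates(subset="customer_id", keep="first")
print(deduped)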
Data Cleaning
Making the data consistent across the values
Replace special characters; for example, remove the $ and comma signs in the Sales/Income/Profit column, i.e., convert $10,000 to 10000.
Making the format of the date column consistent with the format of the
tool used for data analysis
Check for null or missing values, also check for the negative values
Smoothing of the noise present in the data by identifying and
treating for outliers
The data cleaning steps vary and depend on the nature of the data.
For instance, text data consisting of, say, reviews, or tweets would
have to be cleaned to make the cases of the words the same, remove
punctuation marks, any special characters, remove common words,
and differentiate words based on the parts of speech.
NOTE: THE ABOVE STEPS ARE NOT
COMPREHENSIVE
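A small illustrative sketch of the first cleaning step, assuming a pandas DataFrame with a hypothetical Income column:

import pandas as pd

sales = pd.DataFrame({"Income": ["$10,000", "$7,250", "$12,300"]})

# Strip the currency symbol and thousands separator, then convert to numbers
sales["Income"] = (
    sales["Income"].str.replace("$", "", regex=False)
                   .str.replace(",", "", regex=False)
                   .astype(int)
)
print(sales)   # $10,000 becomes 10000, and so on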
Data Cleaning: Handling Null/Missing Values
The null values in the dataset are imputed using the mean/median or mode, based on the type of data that is missing:
Numerical Data: If a numerical value is missing, then replace that value with the mean or median.
It is preferred to impute using the median value, as the average (mean) is influenced by the outliers and skewness present in the data and is pulled in their direction.
Categorical Data: When categorical data is missing, replace it with the value which occurs most often, i.e. the mode.
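A minimal pandas sketch of these two imputation rules, assuming hypothetical age (numerical) and city (categorical) columns:

import pandas as pd

df = pd.DataFrame({
    "age":  [25, 30, None, 45, 38],                          # numerical column with a missing value
    "city": ["Delhi", None, "Mumbai", "Delhi", "Delhi"],     # categorical column with a missing value
})

# Numerical data: impute with the median (less sensitive to outliers/skew than the mean)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical data: impute with the mode (the most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)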
Data Cleaning: Handling Null/Missing Values
Now, if a column has, let's say, 50% of its values missing, then do we replace all of those missing values with the respective median or mode value?
Actually, we don't. We delete that particular column in that case. We don't impute it, because that column would then be biased towards the median/mode value and would naturally have the most influence on the dependent variable.
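A small sketch of that rule, assuming a 50% threshold and hypothetical column names:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, np.nan, 6],
    "feature_b": [np.nan, np.nan, np.nan, 4.0, np.nan, np.nan],  # mostly missing
})

# Drop any column where more than 50% of the values are missing
threshold = 0.5
keep = df.columns[df.isna().mean() <= threshold]
df = df[keep]
print(df.columns.tolist())   # feature_b is dropped rather than imputed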
Data Cleaning: Outlier Detection
Outliers are values that look different from the other values in the data.
To check for the presence of outliers, we can plot a box plot.
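A brief sketch of a box plot plus the common 1.5 × IQR rule of thumb (the sample values are made up):

import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series([12, 14, 15, 15, 16, 18, 19, 20, 95])  # 95 looks like an outlier

# Visual check: a box plot flags points beyond the whiskers
values.plot(kind="box")
plt.show()

# A common numeric rule of thumb: flag points outside 1.5 * IQR
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)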
Outlier Detection Techniques
ASSIGNMENT
Encoding Categorical Data
Categorical data is data which has some categories; for example, in the dataset below there are two categorical variables, Country and Purchased.
Since machine learning models work entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
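A small pandas sketch of encoding, assuming the Country and Purchased columns mentioned above (the values shown are made up): label encoding maps the target to 0/1, and one-hot encoding expands Country into dummy columns.

import pandas as pd

data = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain"],
    "Age":       [44, 27, 30, 38],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Label encoding for the target: map the two categories to 0/1
data["Purchased"] = data["Purchased"].map({"No": 0, "Yes": 1})

# One-hot encoding for the Country feature: one 0/1 column per category
encoded = pd.get_dummies(data, columns=["Country"])
print(encoded)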
Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the other variables.
The age and salary column values are not on the same scale; the salary values dominate the age values, and this will produce an incorrect result. So to remove this issue, we need to perform feature scaling for machine learning.
Feature Scaling
There are two ways to perform feature scaling in machine learning:
Standardization
Normalization
Standardization
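A minimal sketch of both techniques using scikit-learn, with the usual formulas noted in the comments (the age/salary values are made up):

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

X = np.array([[25, 20000],
              [32, 45000],
              [47, 90000]], dtype=float)   # age and salary on very different scales

# Standardization: z = (x - mean) / std  -> roughly zero mean, unit variance
print(StandardScaler().fit_transform(X))

# Normalization (min-max): x' = (x - min) / (max - min)  -> values in [0, 1]
print(MinMaxScaler().fit_transform(X))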
Dummy Variable Trap
[Figure: a categorical variable is label encoded and then one-hot encoded.]
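A small pandas sketch of the trap and the usual fix, assuming a hypothetical Country column: keeping every dummy column makes the dummies perfectly collinear, so one column is dropped.

import pandas as pd

data = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# Plain one-hot encoding creates one column per category; the columns always
# sum to 1 and are therefore perfectly collinear - the dummy variable trap
print(pd.get_dummies(data, columns=["Country"]))

# Dropping one dummy column removes the redundancy and avoids the trap
print(pd.get_dummies(data, columns=["Country"], drop_first=True))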
Introduction to Modeling
Modeling Algorithm Types
Most Common Algorithms
Naïve Bayes Classifier Algorithm (Supervised
Learning - Classification)
Linear Regression (Supervised Learning/Regression)
Logistic Regression (Supervised Learning/Regression)
Decision Trees (Supervised Learning – Classification/Regression)
Random Forests (Supervised Learning –
Classification/Regression)
K-Nearest Neighbours (Supervised Learning)
K Means Clustering Algorithm (Unsupervised
Learning - Clustering)
Support Vector Machine Algorithm (Supervised
Learning - Classification)
Artificial Neural Networks (Reinforcement Learning)
Supervised Learning
Machine is taught by example.
The operator provides the learning algorithm with a known dataset that includes desired inputs and outputs, and the algorithm must find a method to determine how to arrive at those outputs from the inputs.
While the operator knows the correct answers to the problem,
the algorithm identifies patterns in data, learns from
observations and makes predictions.
The algorithm makes predictions and is corrected by the operator –
and this process continues until the algorithm achieves a high level of
accuracy/performance.
Under the umbrella of supervised learning fall:
Classification
Regression
Forecasting
1. Classification: The ML program draws a conclusion from observed values and determines to what category new observations belong. For example, when filtering emails as 'spam' or 'not spam', the program must look at existing observational data and filter the emails accordingly.
2. Regression: The ML program must estimate and understand the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables, making it particularly useful for prediction and forecasting.
3. Forecasting: Forecasting is the process of making predictions about the future based on past and present data, and is commonly used to analyze trends.
Classification Example: Object Recognition
Classification Example: Credit Scoring
Differentiating between low-risk and high-risk customers from their income and savings.
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
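A toy sketch of this discriminant as code; the threshold values for θ1 and θ2 are arbitrary assumptions, not values from the slides:

# Hypothetical thresholds standing in for theta1 and theta2
THETA1 = 30000   # income threshold
THETA2 = 5000    # savings threshold

def credit_risk(income: float, savings: float) -> str:
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45000, 8000))   # low-risk
print(credit_risk(45000, 2000))   # high-risk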
Unsupervised learning
Here, the algorithm studies data to identify patterns.
There is no answer key or human operator to provide instruction.
Instead, the machine determines the correlations and relationships
by analyzing available data.
In an unsupervised learning process, the learning algorithm is left
to interpret large data sets and address that data accordingly.
The algorithm tries to organize that data in some way to describe
its structure.
This might mean grouping the data into clusters or arranging it in a
way that looks more organized.
As it assesses more data, its ability to make decisions on that
data gradually improves and becomes more refined.
Under the umbrella of unsupervised learning, fall:
Clustering, Dimensionality Reduction
1. Clustering: Clustering involves grouping sets of
similar data (based on defined criteria). It’s useful
for segmenting data into several groups and
performing analysis on each data set to find
patterns.
2. Dimension reduction: Dimension reduction
reduces the number of variables being considered
to find the exact information required.
Clustering Example: Crime prediction
Reinforcement learning
Reinforcement Learning is a feedback-based learning technique
in which an agent learns to behave in an environment by
performing the actions and seeing the results of actions.
For each good action, the agent gets positive feedback, and
for each bad action, the agent gets negative feedback or
penalty.
In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
Since there is no labeled data, the agent is bound to learn from its experience only.
RL solves a specific type of problem where decision
making is sequential, and the goal is long-term, such as
game-playing, robotics, etc.
Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond.
The agent interacts with the environment by performing some
actions, and based on those actions, the state of the agent gets
changed, and it also receives a reward or penalty as feedback.
Agent(): An entity that can perceive/explore the environment and
act upon it.
Environment(): The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
Action(): Actions are the moves taken by an agent within the
environment.
State(): State is a situation returned by the environment after each
action taken by the agent.
Reward(): A feedback returned to the agent from the environment
to evaluate the action of the agent.
Semi-supervised learning
Semi-supervised learning is similar to supervised learning,
but instead uses both labelled and un-labelled data.
Labelled data is essentially information that has meaningful tags so that the algorithm can understand the data, whilst un-labelled data lacks that information.
By using this combination, machine learning algorithms can
learn to label un-labelled data.
Machine Learning Pipeline
Train, Test and Validation Data
To build and evaluate the performance of a machine learning model, we
usually break our dataset into two distinct datasets. These two datasets are
the training data and test data.
Training data
Training data is the sub-dataset which we use to train a model.
Algorithms study the hidden patterns and insights which are hidden inside these observations and learn from them.
The model will be trained over and over again using the data in the training set and will continue to learn the features of this data.
Test data
In machine learning, test data is the sub-dataset that we use to evaluate the performance of a model built using the training dataset.
Although we extract both train and test data from the same dataset, the test dataset should not contain any of the training data.
Validation Data
Validation data is a sub-dataset separated from the training data, and it's used to validate the model during the training process.
During training, validation data infuses new data into the model that it hasn't evaluated before.
Validation data provides the first test against unseen data, allowing data scientists to evaluate how well the model makes predictions based on the new data.
Not all data scientists use validation data, but it can provide some helpful information to optimize hyperparameters, which influence how the model assesses data.
There is some semantic ambiguity between validation data and testing data. Some organizations call testing datasets "validation datasets." Ultimately, if there are three datasets to tune and check ML algorithms, validation data typically helps tune the algorithm and testing data provides the final assessment.
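A minimal scikit-learn sketch of carving out train, validation, and test sets (the 60/20/20 proportions are an illustrative assumption):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(-1, 1)
y = (X.ravel() > 50).astype(int)

# First split off the test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 / 20 / 20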
Test, Train and Validation Data
Decision Tree Classification
A Decision Tree is a supervised machine learning algorithm. It is used for both classification and regression problems.
The decision tree is like a tree with nodes.
The branches depend on a number of factors. It splits data into branches like these till it achieves a threshold value.
A decision tree consists of the root node, children nodes, and leaf nodes.
In a decision tree, there are two kinds of nodes, which are the Decision Node and the Leaf Node.
Decision Tree Classification
Decision nodes are used to make any decision and have multiple branches.
Leaf nodes are the output of those decisions and do not contain any further branches.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
Why use Decision Tree?
There are various algorithms in Machine learning, so choosing
the best algorithm for the given dataset and problem is the
main point to remember while creating a machine learning
model.
Below are the two reasons for using the Decision tree:
1. Decision Trees usually mimic human thinking ability while
making a decision, so it is easy to understand.
2. The logic behind the decision tree can be easily understood
because it shows a tree-like structure.
Decision Tree Terminologies
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree.
This algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
It continues the process until it reaches a leaf node of the tree.
Algorithm
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attribute.
Step-4: Generate the decision tree node which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final nodes are called leaf nodes.
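A minimal scikit-learn sketch of a CART-style tree (the Iris data and the depth limit are illustrative assumptions; criterion="entropy" would use information gain instead of the Gini index):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# CART-style tree: at each node the best attribute/split is chosen by an
# attribute selection measure (here the Gini index)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))          # the learned splits, from root node to leaf nodes
print(tree.predict(X[:5]))        # follow the branches to classify new records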
Example
Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not.
So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM).
The root node splits further into the next decision node
(distance from the office) and one leaf node based on the
corresponding labels.
The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offers and Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
Information Gain
Gini Index
Information Gain
Information gain is the measurement of changes in entropy
after the segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about
a class.
According to the value of information gain, we split the node
and build the decision tree.
A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest
information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy
Entropy can be defined as a measure of the impurity of the sub-split. For a binary split, entropy always lies between 0 and 1. The entropy of any split can be calculated by the standard formula Entropy(S) = −Σ p_i log2(p_i), where p_i is the proportion of class i in S.
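As a small sketch, assuming the standard textbook definitions rather than code from the course, entropy and information gain can be computed like this:

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_label_groups):
    """Information Gain = Entropy(S) - sum(weight_of_child * Entropy(child))."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# A perfectly pure split earns the full parent entropy as gain
parent = ["yes"] * 5 + ["no"] * 5
print(entropy(parent))                                      # 1.0
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))  # 1.0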
Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
Consider binary classification:
There are two possible predicted classes: "yes" and "no".
If we were predicting the presence of a disease, for example,
"yes" would mean they have the disease, and "no" would
mean they don't have the disease.
The classifier made a total of 165 predictions (e.g., 165
patients were being tested for the presence of that disease).
Out of those 165 cases, the classifier predicted "yes" 110
times, and "no" 55 times.
In reality, 105 patients in the sample have the disease, and
60 patients do not.
Basic Terminology
True positives (TP): These are cases in which we
predicted yes (they have the disease), and they do have the
disease.
True negatives (TN): We predicted no, and they don't
have the disease.
False positives (FP): We predicted yes, but they
don't actually have the disease. (Also known as a "Type I
error.")
False negatives (FN): We predicted no, but they
actually do have the disease. (Also known as a "Type II error.")
Another Example…
We have a total of 20 cats and dogs and our model predicts whether it is a cat or not.
Actual values = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
Predicted values = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
Another Example…
True Positive (TP) = 6
You predicted positive and it's true. You predicted that an animal is a cat and it actually is.
True Negative (TN) = 11
You predicted negative and it's true. You predicted that an animal is not a cat and it actually is not (it's a dog).
False Positive (Type 1 Error) (FP) = 2
You predicted positive and it's false. You predicted that an animal is a cat but it actually is not (it's a dog).
False Negative (Type 2 Error) (FN) = 1
You predicted negative and it's false. You predicted that an animal is not a cat but it actually is.
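A short sketch that reproduces these counts with scikit-learn, using the actual and predicted lists above and treating "cat" as the positive class:

from sklearn.metrics import confusion_matrix

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

# Rows are actual classes, columns are predicted classes, with "cat" listed first
tp, fn, fp, tn = confusion_matrix(actual, predicted, labels=['cat', 'dog']).ravel()
print(tp, fn, fp, tn)   # 6 1 2 11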
Other Evaluation Metrics
a. Accuracy
b. Precision
c. Recall
d. F1-Score
Accuracy
Accuracy simply measures how often the classifier makes
the correct prediction. It’s the ratio between the number of
correct predictions and the total number of predictions.
Precision
It is a measure of the correctness achieved in positive prediction. In simple words, it tells us how many predictions are actually positive out of all the total positive predictions.
Precision is defined as the ratio of the total number of correctly classified positive classes divided by the total number of predicted positive classes. Or, out of all the predicted positive classes, how much we predicted correctly. Precision should be high (ideally 1).
Precision
"Precision is a useful metric in cases where False Positives are a higher concern than False Negatives."
Ex 1: In spam detection we need to focus on precision. Suppose a mail is not spam but the model predicts it as spam: FP (False Positive). We always try to reduce FP.
Ex 2: Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
Recall
It is a measure of the actual observations which are predicted correctly, i.e. how many observations of the positive class are actually predicted as positive.
It is also known as Sensitivity.
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
Recall is defined as the ratio of the total number of correctly classified positive classes divided by the total number of positive classes. Or, out of all the positive classes, how much we have predicted correctly. Recall should be high (ideally 1).
Recall
"Recall is a useful metric in cases where False Negatives trump False Positives."
Ex 1: Suppose we are predicting whether a person has cancer. The person is suffering from cancer but the model predicts that they are not: FN (False Negative).
Ex 2: Recall is important in medical cases where it doesn't matter whether we raise a false alarm, but the actual positive cases should not go undetected!
Recall would be a better metric because we don't want to accidentally discharge an infected person and let them mix with the healthy population, thereby spreading a contagious virus. Now you can understand why accuracy was a bad metric for our model.
F-measure / F1-Score
The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall.
The F1 score maintains a balance between the precision and recall of your classifier. If your precision is low, the F1 is low, and if the recall is low, again your F1 score is low.
There will be cases where there is no clear distinction between whether Precision is more important or Recall. We combine them!
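A short scikit-learn sketch computing all four metrics on the cat/dog example above, treating "cat" as the positive class (the values in the comments follow from TP=6, TN=11, FP=2, FN=1):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

print("Accuracy :", accuracy_score(actual, predicted))                    # (TP+TN)/total = 17/20
print("Precision:", precision_score(actual, predicted, pos_label='cat'))  # TP/(TP+FP) = 6/8
print("Recall   :", recall_score(actual, predicted, pos_label='cat'))     # TP/(TP+FN) = 6/7
print("F1-score :", f1_score(actual, predicted, pos_label='cat'))         # harmonic mean of the two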
Other Evaluation Metrics
Is it necessary to check for recall (or precision) if you already have high accuracy?
We cannot rely on a single value of accuracy in classification when the classes are imbalanced.
For example, we have a dataset of 100 patients in which 5 have diabetes and 95 are healthy. If our model only predicts the majority class, i.e. that all 100 people are healthy, we still get a classification accuracy of 95%.
When to use Accuracy /
Precision / Recall / F1-Score?
Accuracy is used when the True Positives and
True Negatives are more important. Accuracy is a
better metric for Balanced Data.
Whenever False Positive is much more important use
Precision.
Whenever False Negative is much more important use
Recall.
F1-Score is used when the False
Negatives and False Positives are
important. F1-Score is a better metric for
Imbalanced Data.
Random Forest
A random forest is a machine learning technique that's used to solve regression and classification problems.
It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.
Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
Random Forest
The greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
ASSUMPTION OF RF: The predictions from each tree must have very low correlations.
How does the Random Forest algorithm work?
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
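A minimal scikit-learn sketch of a random forest classifier (the Iris data, the 100-tree setting and the train/test split are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# N decision trees, each trained on a bootstrap sample and a random subset of features;
# the final class is decided by majority vote across the trees
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))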
Example: Suppose there is a dataset that contains
multiple fruit images.
So, this dataset is given to the Random forest classifier.
The dataset is divided into subsets and given to each
decision tree.
During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest
classifier predicts the final decision.
Ensemble simply means combining multiple models.
Thus a collection of models is used to make predictions rather
than an individual model.
Ensemble uses two types of methods: Bagging and Boosting
1. Bagging – It creates a different training subset from
sample training data with replacement & the final output is
based on majority voting. For example, Random Forest.
2. Boosting – It combines weak learners into strong
learners by creating sequential models such that the final
model has the highest accuracy. For example, ADA BOOST,
XG BOOST
Does random forest work on the bagging principle or the boosting principle?
Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest.
Bagging chooses a random sample from the data set.
Hence each model is generated from the samples (Bootstrap
Samples) provided by the Original Data with replacement known as
row sampling.
This step of row sampling with replacement is called bootstrap.
Now each model is trained independently which generates results.
The final output is based on majority voting after combining the
results of all models.
This step which involves combining all the results and generating
output based on majority voting is known as aggregation.
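A small scikit-learn sketch of bagging; the breast-cancer dataset and the 50-estimator setting are illustrative assumptions, and BaggingClassifier's default base learner is a decision tree:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Bootstrap aggregation: each base tree sees a row sample drawn with replacement,
# and the ensemble aggregates the individual votes into the final prediction
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0).fit(X_train, y_train)
print("Test accuracy:", bag.score(X_test, y_test))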
Important Features of Random Forest
1. Diversity – Not all attributes/variables/features are considered while making an individual tree; each tree is different.
2. Immune to the curse of dimensionality – Since each tree does not consider all the features, the feature space is reduced.
3. Parallelization – Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.
4. Stability – Stability arises because the result is based on majority voting/averaging.
Difference Between Decision Tree & Random Forest
Important Hyperparameters
Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.
The following hyperparameters increase the predictive power:
1. n_estimators – the number of trees the algorithm builds before averaging the predictions.
2. max_features – the maximum number of features random forest considers when splitting a node.
3. min_samples_leaf – the minimum number of samples required to be at a leaf node.
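A small sketch showing where these hyperparameters appear in scikit-learn's RandomForestClassifier (the specific values chosen are arbitrary assumptions):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,       # number of trees built before averaging/voting
    max_features="sqrt",    # max number of features considered when splitting a node
    min_samples_leaf=2,     # minimum number of samples required at a leaf node
    random_state=0,
)
print(rf)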