
APEX INSTITUTE OF TECHNOLOGY.


AIT-IBM CSE
CHANDIGARH UNIVERSITY, MOHALI

PREDICTIVE ANALYTICS
MODELLING
By
Pulkit Dwivedi
Assistant Professor (Chandigarh University)
Course Objective
 The students will be able to illustrate the interaction of multi-faceted fields like data mining.

 The students will be able to understand the role of statistics and mathematics in the development of predictive analytics.

 The students shall understand and apply the concepts of different models.

 The students shall understand various aspects of the IBM SPSS Modeler interface.

 The students shall be able to familiarize themselves with various data clustering and dimension reduction techniques.
Books
• E. Siegel, "Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die", John Wiley & Sons, Inc., 2013.

• P. Simon, "Too Big to Ignore: The Business Case for Big Data", Wiley India, 2013.

• J. W. Foreman, "Data Smart: Using Data Science to Transform Information into Insight", Addison-Wesley.

OTHER LINKS
• https://developer.ibm.com/predictiveanalytics/videos/category/tutorials/

• https://www.ibm.com/developerworks/library/ba-predictive-analytics1/index.html
Data Deluge!
Structure of the Data
What is Data Mining?
How should mined information be?
Tasks in Data Mining
Anomaly Detection
Association Rule Mining
Clustering
Classification
Regression
Difference b/w Data Mining and ML
 While data mining is simply looking for patterns that already exist in the data, machine learning goes beyond what has happened in the past to predict future outcomes based on the pre-existing data.

 In data mining, the 'rules' or patterns are unknown at the start of the process, whereas with machine learning the machine is usually given some rules or variables with which to understand the data and learn.
Difference b/w Data Mining and ML
 Data mining is a more manual process that relies on human intervention and decision making. But with machine learning, once the initial rules are in place, the process of extracting information, 'learning', and refining is automatic, and takes place without human intervention. In other words, the machine becomes more intelligent by itself.

 Data mining is used on an existing dataset (like a data warehouse) to find patterns. Machine learning, on the other hand, is trained on a 'training' data set, which teaches the computer how to make sense of data, and then to make predictions about new data sets.
Are There Any Drawbacks to
Data Mining?
Drawbacks of Data Mining
 Many data analytics tools are complex and challenging to use. Data scientists
need the right training to use the tools effectively.

 Different tools work with varying types of data mining, depending on the
algorithms they employ. Thus, data analysts must be sure to choose the
correct tools.

 Data mining techniques are not infallible, so there’s always the risk that the
information isn’t entirely accurate. This obstacle is especially relevant if
there’s a lack of diversity in the dataset.

 Companies can potentially sell the customer data they have gleaned to other
businesses and organizations, raising privacy concerns.

 Data mining requires large databases, making the process hard to manage.
Applications of Data Mining
Applications of Data Mining
Must-have Skills You Need for
Data Mining
Skills You Need for Data Mining
 COMPUTER SCIENCE SKILLS

1. Programming/statistics languages: R, Python, C++, Java, Matlab, SQL, SAS

2. Big data processing frameworks: Hadoop, Storm, Samza, Spark, Flink

3. Operating System: Linux

4. Database knowledge: Relational Databases (SQL Server or Oracle) & Non-Relational Databases (MongoDB, Cassandra, Dynamo, CouchDB)
Skills You Need for Data Mining
 STATISTICS AND ALGORITHM SKILLS

1. Basic statistics knowledge: Probability, Probability Distributions, Correlation, Regression, Linear Algebra, Stochastic Processes

2. Data structures: arrays, linked lists, stacks, queues, trees, hash tables, sets, etc.; common algorithms: sorting, searching, dynamic programming, recursion, etc.

3. Machine Learning/Deep Learning algorithms

Skills You Need for Data Mining
 OTHER REQUIRED SKILLS

1. Project Experience

2. Communication & Presentation Skills


Data Mining
Process/Lifecycle/Strategy
CRISP DM Data Mining Strategy
 The CRoss Industry Standard Process for Data Mining (CRISP-
DM) is a process model that serves as the base for a data science
process. It has six sequential phases:

1. Business understanding – What does the business need?


2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?
CRISP-DM Phase - 1
 BUSINESS UNDERSTANDING

The Business Understanding phase focuses on understanding the objectives and requirements of the project.

1. Determine business objectives: You should first "thoroughly understand, from a business perspective, what the customer really wants to accomplish" and then define business success criteria.

2. Assess situation: Determine resource availability and project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.
CRISP-DM Phase - 1
3. Determine data mining goals: In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.

4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.

• While many teams hurry through this phase, establishing a strong business understanding is like building the foundation of a house – absolutely essential.
Any good project starts with a
deep understanding of the
customer’s needs. Data mining
projects are no exception and
CRISP-DM recognizes this.
CRISP-DM Phase – 2
 DATA UNDERSTANDING

Adding to the foundation of Business Understanding, this phase drives the focus to identify, collect, and analyze the data sets that can help you accomplish the project goals.

1. Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.

2. Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.
CRISP-DM Phase - 2
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.

4. Verify data quality: How clean/dirty is the data? Document any quality issues.
CRISP-DM Phase - 3
 DATA PREPARATION

This phase, which is often referred to as "data munging", prepares the final data set(s) for modeling.

1. Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.

2. Clean data: Often this is the lengthiest task. Without it, you'll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.
CRISP-DM Phase - 3
3. Construct data: Derive new attributes that will be helpful. For example, derive someone's body mass index from height and weight fields.

4. Integrate data: Create new data sets by combining data from multiple sources.

5. Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
A common rule of thumb is that
80% of the project is data
preparation.
CRISP-DM Phase - 4
 MODELING

Here you'll likely build and assess various models based on several different modeling techniques.

1. Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).

2. Generate test design: Depending on your modeling approach, you might need to split the data into training, test, and validation sets.
CRISP-DM Phase - 4
3. Build model: As glamorous as this might sound, this might just be executing a few lines of code like "reg = LinearRegression().fit(X, y)".

4. Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.

• In practice, teams should continue iterating until they find a "good enough" model, proceed through the CRISP-DM lifecycle, then further improve the model in future iterations.
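
A minimal sketch of these two tasks (test design, build, assess) in Python with scikit-learn; the data file and the "target" column name are placeholders, not part of CRISP-DM itself:

# Minimal sketch of the Modeling phase: generate a test design, build a model, assess it.
# Assumes a pandas DataFrame with only numeric columns and a target column named "target" (hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("customers.csv")            # placeholder data file
X, y = df.drop(columns=["target"]), df["target"]

# Test design: hold out 20% of the rows for assessment.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)                          # build model
print("R^2 on held-out data:", r2_score(y_test, reg.predict(X_test)))   # assess model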
What is widely regarded as data
science’s most exciting work is
also often the shortest phase of
the project
CRISP-DM Phase - 5
 EVALUATION

Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business need and what to do next.

1. Evaluate results: Do the models meet the business success criteria? Which one(s) should we approve for the business?
CRISP-DM Phase - 5
2. Review process: Review the work accomplished. Was anything
overlooked? Were all steps properly executed? Summarize
findings and correct anything if needed.

3. Determine next steps: Based on the previous three tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.
CRISP-DM Phase - 6
 DEPLOYMENT

A model is not particularly useful unless the customer can access its results. The complexity of this phase varies widely.

1. Plan deployment: Develop and document a plan for deploying the model.

2. Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model.
CRISP-DM Phase - 6
3. Produce final report: The project team documents a summary of the
project which might include a final presentation of data mining
results.

4. Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.

• Your organization's work might not end there. As a project framework, CRISP-DM does not outline what to do after the project (also known as "operations"). But if the model is going into production, be sure to maintain the model in production. Constant monitoring and occasional model tuning are often required.
Depending on the requirements,
the deployment phase can be as
simple as generating a report or
as complex as implementing a
repeatable data mining process
across the enterprise.
IBM SPSS Modeler
IBM SPSS Modeler

https://www.ibm.com/account/reg/in-en/signup?formid=urx-19947
IBM SPSS Modeler
 IBM® SPSS® Modeler is an analytical platform that enables
organizations and researchers to uncover patterns in data and
build predictive models to address key business outcomes.

 Moreover, aside from a suite of predictive algorithms, SPSS Modeler also contains an extensive array of analytical routines that include data segmentation procedures, association analysis, anomaly detection, feature selection and time series forecasting.
IBM SPSS Modeler
 These analytical capabilities, coupled with Modeler's rich functionality in the areas of data integration and preparation, enable users to build entire end-to-end applications, from the reading of raw data files to the deployment of predictions and recommendations back to the business.

 As such, IBM® SPSS® Modeler is widely regarded as one of the most mature and powerful applications of its kind.
IBM SPSS Modeler GUI

[Screenshot of the Modeler interface: Menus and Toolbar at the top, Manager tabs at the upper right, the Stream Canvas in the center, the Palettes/Nodes at the bottom, and the Project Window at the lower right.]
Market Share of Analytics products
CRISP-DM in IBM SPSS Modeler
CRISP-DM in IBM SPSS Modeler
 IBM® SPSS® Modeler incorporates the CRISP-DM methodology in two
ways to provide unique support for effective data mining.

 The CRISP-DM project tool helps you organize project streams, output,
and annotations according to the phases of a typical data mining
project. You can produce reports at any time during the project based
on the notes for streams and CRISP-DM phases.

 Help for CRISP-DM guides you through the process of conducting a data mining project. The help system includes task lists for each step as well as examples of how CRISP-DM works in the real world. You can access CRISP-DM Help by choosing CRISP-DM Help from the Help menu of the main window.
SPSS Modeler GUI: Stream Canvas
 The stream canvas is the main work area in Modeler.

 It is located in the center of the Modeler user interface.

 The stream canvas can be thought of as a surface on which to place


icons or nodes.

 These nodes represent operations to be carried out on the data.

 Once nodes have been placed on the stream canvas, they can be
linked together to form a stream.
SPSS Modeler GUI: Palettes
 Nodes (operations on the data) are contained in palettes.

 The palettes are located at the bottom of the Modeler user interface.

 Each palette contains a group of related nodes that are available for
you to add to the data stream.

 For example, the Sources palette contains nodes that you can use to
read data into Modeler, and the Graphs palette contains nodes that
you can use to explore your data visually.

 The icons that are shown depend on the active, selected palette.
SPSS Modeler GUI: Palettes

Palettes
SPSS Modeler GUI: Palettes
 The Favorites palette contains commonly used nodes. You can
customize which nodes appear in this palette, as well as their order—
for that matter, you can customize any palette.
 Sources nodes are used to access data.
 Record Ops nodes manipulate rows (cases).
 Field Ops nodes manipulate columns (variables).
 Graphs nodes are used for data visualization.
 Modeling nodes contain dozens of data mining algorithms.
 Output nodes present data to the screen.
 Export nodes write data to a data file.
 IBM SPSS Statistics nodes can be used in conjunction with IBM SPSS
Statistics.
SPSS Modeler GUI: Menus
 In the upper left-hand section of the Modeler user interface, there are eight
menus. The menus control a variety of options within Modeler, as follows:
 File allows users to create, open, and save Modeler streams and projects.
 Edit allows users to perform editing operations, for example, copying/pasting
objects and editing individual nodes.
 Insert allows users to insert a particular node as an alternative to dragging a node
from a palette.
 View allows users to toggle between hiding and displaying items (for example,
the toolbar or the Project window).
 Tools allows users to manipulate the environment in which Modeler works and
provides facilities for working with scripts.
 SuperNode allows users to create, edit, and save a condensed stream.
 Window allows users to close related windows (for example, all open output
windows), or switch between open windows.
 Help allows users to access help on a variety of topics or to view a tutorial.
SPSS Modeler GUI: Toolbar
 The icons on the toolbar represent commonly used options that can also be accessed via the menus; however, the toolbar allows users to enable these options via an easy-to-use, one-click alternative. The options on the toolbar include:

 Creating a new stream
 Opening an existing stream
 Saving the current stream
 Printing the current stream
 Moving a selection to the clipboard
 Copying a selection to the clipboard
 Pasting a selection to the clipboard
 Undoing the last action
 Redoing the last action
 Searching for nodes in the current stream
 Editing the properties of the current stream
 Previewing the running of a stream
 Running the current stream
 Running a selection
 Encapsulating selected nodes into a supernode
 Zooming in on a supernode
 Showing/hiding stream markup
 Inserting a new comment
 Opening IBM SPSS Modeler Advantage
SPSS Modeler GUI: Manager tabs
 In the upper right-hand corner of the Modeler user interface, there
are three types of manager tabs. Each tab (Streams, Outputs, and
Models) is used to view and manage the corresponding type of object,
as follows:

 The Streams tab opens, renames, saves, and deletes streams created
in a session.

 The Outputs tab stores Modeler output, such as graphs and tables.
You can save output objects directly from this manager.

 The Models tab contains the results of the models created in Modeler.
These models can be browsed directly from the Models tab or on the
stream displayed on the canvas.
SPSS Modeler GUI: Manager tabs

Manager
Tabs
SPSS Modeler GUI: Project Window
 In the lower right-hand corner of the Modeler user interface, there is
the project window. This window offers two ways to organize your
data mining work, including:

 The CRISP-DM tab, which organizes streams, output, and annotations according to the phases of the CRISP-DM process model.

 The Classes tab, which organizes your work in Modeler by the type of objects created.
SPSS Modeler GUI: Project Window

Project
Window
Building streams
 As was mentioned previously, Modeler allows users to mine data
visually on the stream canvas.

 This means that you will not be writing code for your data mining
projects; instead you will be placing nodes on the stream canvas.

 Remember that nodes represent operations to be carried out on the data. So once nodes have been placed on the stream canvas, they need to be linked together to form a stream.

 A stream represents the flow of data going through a number of operations (nodes).
SPSS Modeler: Adding Data Source Node

Read the data


SPSS Modeler: Preview your data
SPSS Modeler: Preview your data
SPSS Modeler: Data Source
SPSS Modeler: Filter Data

You can choose which features/variables you want to consider.
SPSS Modeler: Filter Data
SPSS Modeler: Check features type
SPSS Modeler: Check features type
SPSS Modeler: Input/target variable
SPSS Modeler: Input/target variable
SPSS Modeler: Record Id Column

Record ID column
SPSS Modeler: Record Id Column

Record ID column
SPSS Modeler: Data Audit Node
Building a stream
 When two or more nodes have been placed on the stream canvas,
they need to be connected to produce a stream. This can be thought
of as representing the flow of data through the nodes.

 Connecting nodes allows you to bring data into Modeler, explore the
data, manipulate the data (to either clean it up or create additional
fields), build a model, evaluate the model, and ultimately score the
data.
SPSS Modeler: Building a stream
SPSS Modeler: Building a stream
SPSS Modeler: Data Audit Node
 The Data Audit node provides a comprehensive first look at the data you bring into IBM® SPSS® Modeler, presented in an easy-to-read matrix that can be sorted and used to generate full-size graphs and a variety of data preparation nodes.

 The Audit tab displays a report that provides summary statistics, histograms, and distribution graphs that may be useful in gaining a preliminary understanding of the data. The report also displays the storage icon before the field name.

 The Quality tab in the audit report displays information about outliers, extremes, and missing values, and offers tools for handling these values.
SPSS Modeler: Graph Node
SPSS Modeler: Checking Quality
Collecting your data: Data
Structure, Data type etc.
Data Structure
 SPSS Modeler source nodes allow you to read data from:
 Industry-standard comma-delimited (CSV) text files, including ASCII files
 XML files (wish TM1 did this!)
 Statistics files
 SAS
 Excel
 Databases (DB2™, Oracle™, SQL Server™, and a variety of other databases) supported via ODBC
 Other OLAP cubes, including Cognos TM1 cubes and views
Data Structure
 SPSS Modeler requires a "rectangular" data structure – records (rows of the data table) and fields (columns of the data table) – which is handled in the Data tab.

 Based upon the data source, options on the Data tab allow you to override the specified storage type for fields as they are imported (or created).
Unit of Analysis
 One of the most important ideas in a research project is the unit
of analysis.
 The unit of analysis is the major entity that you are analyzing in
your study.
 For instance, any of the following could be a unit of analysis in a
study:
 individuals
 groups
 artifacts (books, photos, newspapers)
 geographical units (town, census tract, state)
 social interactions (dyadic relations, divorces, arrests)
Why is it called the ‘unit of
analysis’ and not something else
(like, the unit of sampling)?
Why it is called Unit of Analysis?
 Because it is the analysis you do in your study that determines what the unit is.

 For instance, if you are comparing the children in two classrooms on achievement test scores, the unit is the individual child because you have a score for each child.

 On the other hand, if you are comparing the two classes on classroom climate, your unit of analysis is the group, in this case the classroom, because you only have a classroom climate score for the class as a whole and not for each individual student.

 For different analyses in the same study you may have different units of analysis.
Why it is called Unit of Analysis?
 If you decide to base an analysis on student scores, the individual is the unit.

 But you might decide to compare average classroom performance. In this case, since the data that goes into the analysis is the average itself (and not the individuals' scores), the unit of analysis is actually the group.

 Even though you had data at the student level, you use aggregates in the analysis.

 In many areas of research these hierarchies of analysis units have become particularly important and have spawned a whole area of statistical analysis sometimes referred to as hierarchical modeling.
Field Storage
 Use the Field column to view and select fields in the current
dataset.
Available Storage Types
 String: Contain non-numeric/alpha-numeric data
 Integer: Contain integer values
 Real: Values are numbers that may include decimals (i.e. they are not restricted to whole numbers)
 Date: Date values specified in a standard format such as year, month,
and day (for example, 2007-09-26).
 Time: Time measured as a duration. For example, a service call lasting
1 hour, 26 minutes, and 38 seconds might be represented as 01:26:38,
depending on the current time format as specified in the Stream
Properties dialog box.
 Timestamp: Values that include both a date and time component, for
example 2007–09–26 09:04:00, again depending on the current date
and time formats in the Stream Properties dialog box.
 List: Introduced in SPSS Modeler version 17, a List storage field
contains multiple values for a single record.
List storage type icons
List measurement level icons
Restructuring Data
 A key concern for data analysts engaged in predictive analytics projects is defining and creating the 'unit of analysis'. Simply put, this refers to what a row of data needs to represent in order for the analysis to make sense.

 If, for example, the analytical goal of the project is to predict which customers are likely to redeem a voucher, then the predictive model will probably require data where each row is a different customer, not a different transaction.

 If, on the other hand, the analytical goal is to identify fraudulent attempts to purchase tickets for an event, then the model probably requires sample data where each row is a transaction.
Restructuring Data: The Distinct Node
 Duplications in data files are a regular headache for most businesses, and they can occur for a lot of reasons.

 Often, customers may register themselves more than once; contact details may be updated, causing an extra row of data to be added rather than overwriting the existing one; or merging information from different departments or organisations may create duplicates because there isn't an exact match.

 In any case, Modeler helps us to identify duplicate cases and resolve these issues with the Distinct node.
Restructuring Data: The Distinct Node
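
Outside Modeler, the same de-duplication idea can be sketched in Python with pandas; the file name and key column below are hypothetical, and this is only an analogue of the Distinct node, not its implementation:

# Hypothetical sketch: keep one row per customer, analogous to Modeler's Distinct node.
import pandas as pd

customers = pd.read_csv("customers.csv")       # placeholder file
deduped = customers.drop_duplicates(
    subset=["customer_id"],                    # hypothetical key field(s)
    keep="first",                              # keep the first occurrence of each key
)
print(len(customers) - len(deduped), "duplicate rows removed")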
Data Cleaning
 Making the data consistent across the values
• Replace special characters. For example, replace $ and comma signs in a Sales/Income/Profit column, i.e. turning $10,000 into 10000.
• Make the format of the date column consistent with the format of the tool used for data analysis.
 Check for null or missing values; also check for negative values.
 Smooth the noise present in the data by identifying and treating outliers.
 The data cleaning steps vary and depend on the nature of the data. For instance, text data consisting of, say, reviews or tweets would have to be cleaned to make the cases of the words the same, remove punctuation marks and any special characters, remove common words, and differentiate words based on the parts of speech.
 NOTE: THE ABOVE STEPS ARE NOT COMPREHENSIVE
Data Cleaning: Handling Null/Missing
Values
 The null values in the dataset are imputed using the mean/median or mode, based on the type of data that is missing:

 Numerical data: If a numerical value is missing, replace that value with the mean or median.

 It is preferred to impute using the median value, as the average (mean) is influenced by the outliers and skewness present in the data and is pulled in their direction.

 Categorical data: When categorical data is missing, replace it with the value which occurs most often, i.e. the mode.
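
A minimal pandas sketch of this rule (the column names are hypothetical):

# Hypothetical sketch: median imputation for a numeric column, mode imputation for a categorical one.
import pandas as pd

df = pd.read_csv("customers.csv")                      # placeholder file
df["age"] = df["age"].fillna(df["age"].median())       # numerical: median is robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: most frequent value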
Data Cleaning: Handling Null/Missing
Values
 Now, if a column has, let’s say, 50% of its
values missing, then do we replace all of those
missing values with the respective median or
mode value?

 Actually, we don't. We delete that particular column in that case. We don't impute it, because then that column would be biased towards the median/mode value and would naturally have the most influence on the dependent variable.
Data Cleaning: Outlier Detection
 Outliers are values that look different from the other values in the data.

 To check for the presence of outliers, we can plot a box plot.
Outlier Detection Techniques

 ASSIGNMENT
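
As a starting point for that assignment, here is a minimal sketch of one common technique, the IQR (box-plot) rule, in Python; the file and column name are hypothetical:

# Hypothetical sketch: flag outliers with the 1.5 * IQR rule used by box plots.
import pandas as pd

df = pd.read_csv("customers.csv")                 # placeholder file
q1, q3 = df["salary"].quantile([0.25, 0.75])      # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]
print(len(outliers), "potential outliers")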
Encoding Categorical Data
• Categorical data is data which has some categories; for example, in the dataset below there are two categorical variables, Country and Purchased.

• Since a machine learning model works entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
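
A minimal sketch of encoding these two columns with pandas (the example values are hypothetical):

# Hypothetical sketch: one-hot encode the Country column and map Purchased to 0/1.
import pandas as pd

df = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})
encoded = pd.get_dummies(df, columns=["Country"])          # one 0/1 column per country
encoded["Purchased"] = encoded["Purchased"].map({"No": 0, "Yes": 1})
print(encoded)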
Feature Scaling
 Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates another.

 The age and salary columns are not on the same scale; the salary values dominate the age values, and this can produce an incorrect result. To remove this issue, we need to perform feature scaling.
Feature Scaling
• There are two ways to perform feature scaling in machine
learning:
• Standardization

Normalization
Standardization

Example: standardize x = {6, 2, 3, 1} using x' = (x - mean) / S.D.

x    x - mean    (x - mean)^2    x'
6        3            9           1.603567
2       -1            1          -0.534522
3        0            0           0
1       -2            4          -1.069045

SUM:   12            14
MEAN:   3             3.5
S.D. = sqrt(3.5) ≈ 1.870829
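
A short sketch reproducing this worked example in Python (using the population standard deviation, as in the table):

# Reproduce the standardization table: x' = (x - mean) / population standard deviation.
import numpy as np

x = np.array([6, 2, 3, 1], dtype=float)
mean = x.mean()          # 3.0
sd = x.std()             # population S.D. = sqrt(3.5) ~= 1.8708
x_std = (x - mean) / sd
print(x_std)             # [ 1.6036 -0.5345  0.     -1.069 ]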
Introduction to Modeling
Modeling Algorithm Types
Most Common Algorithms
 Naïve Bayes Classifier Algorithm (Supervised Learning -
Classification)
 Linear Regression (Supervised Learning/Regression)
 Logistic Regression (Supervised Learning/Regression)
 Decision Trees (Supervised Learning – Classification/Regression)
 Random Forests (Supervised Learning – Classification/Regression)
 K- Nearest Neighbours (Supervised Learning)
 K Means Clustering Algorithm (Unsupervised Learning -
Clustering)
 Support Vector Machine Algorithm (Supervised Learning -
Classification)
 Artificial Neural Networks (Reinforcement Learning)
Supervised Learning
 Machine is taught by example.
 The operator provides the learning algorithm with a known dataset that
includes desired inputs and outputs, and the algorithm must find a
method to determine how to arrive at those inputs and outputs.
 While the operator knows the correct answers to the problem, the
algorithm identifies patterns in data, learns from observations and
makes predictions.
 The algorithm makes predictions and is corrected by the operator – and
this process continues until the algorithm achieves a high level of
accuracy/performance.
 Under the umbrella of supervised learning fall:
 Classification
 Regression
 Forecasting
1. Classification: The ML program draws a conclusion from observed values and determines to which category new observations belong. For example, when filtering emails as 'spam' or 'not spam', the program must look at existing observational data and filter the emails accordingly.

2. Regression: The ML program must estimate and understand the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables – making it particularly useful for prediction and forecasting.

3. Forecasting: Forecasting is the process of making predictions about the future based on past and present data, and is commonly used to analyze trends.
Classification Example: Object Recognition
Classification Example: Credit Scoring

Differentiating between low-risk and high-risk customers from their income and savings.

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
Unsupervised learning
 Here, the algorithm studies data to identify patterns.
 There is no answer key or human operator to provide instruction.
 Instead, the machine determines the correlations and relationships by
analyzing available data.
 In an unsupervised learning process, the learning algorithm is left to
interpret large data sets and address that data accordingly.
 The algorithm tries to organize that data in some way to describe its
structure.
 This might mean grouping the data into clusters or arranging it in a way
that looks more organized.
 As it assesses more data, its ability to make decisions on that data
gradually improves and becomes more refined.
 Under the umbrella of unsupervised learning, fall: Clustering,
Dimensionality Reduction
1. Clustering: Clustering involves grouping sets of similar data (based on defined criteria). It's useful for segmenting data into several groups and performing analysis on each data set to find patterns.

2. Dimension reduction: Dimension reduction reduces the number of variables being considered to find the exact information required.
Clustering Example: Crime prediction
Reinforcement learning
 Reinforcement Learning is a feedback-based learning technique in
which an agent learns to behave in an environment by performing
the actions and seeing the results of actions.
 For each good action, the agent gets positive feedback, and for
each bad action, the agent gets negative feedback or penalty.
 In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
 Since there is no labeled data, the agent is bound to learn by its experience only.
 RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game-playing, robotics, etc.
 Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
• Agent(): An entity that can perceive/explore the environment and act upon it.

• Environment(): The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.

• Action(): Actions are the moves taken by an agent within the environment.

• State(): The situation returned by the environment after each action taken by the agent.

• Reward(): Feedback returned to the agent from the environment to evaluate the action of the agent.
Semi-supervised learning
 Semi-supervised learning is similar to supervised learning, but
instead uses both labelled and un-labelled data.

 Labelled data is essentially information that has meaningful tags


so that the algorithm can understand the data, whilst un-labelled
data lacks that information.

 By using this combination, machine learning algorithms can learn


to label un-labelled data.
Machine Learning Pipeline
Train, Test and Validation Data
To build and evaluate the performance of a machine learning model, we usually
break our dataset into two distinct datasets. These two datasets are the training
data and test data.

Training data
 Training data is the sub-dataset which we use to train a model.
 Algorithms study the hidden patterns and insights inside these observations and learn from them.
 The model will be trained over and over again using the data in the training set and will continue to learn the features of this data.

Test data
 In machine learning, test data is the sub-dataset that we use to evaluate the performance of a model built using the training dataset.
 Although we extract both train and test data from the same dataset, the test dataset should not contain any data from the training dataset.
Validation Data
 Validation data are a sub-dataset separated from the training data, and it’s used to
validate the model during the training process.

 During training, validation data infuses new data into the model that it hasn’t
evaluated before.

 Validation data provides the first test against unseen data, allowing data scientists
to evaluate how well the model makes predictions based on the new data.

 Not all data scientists use validation data, but it can provide some helpful
information to optimize hyperparameters, which influence how the model
assesses data.

 There is some semantic ambiguity between validation data and testing data. Some
organizations call testing datasets “validation datasets.” Ultimately, if there are
three datasets to tune and check ML algorithms, validation data typically helps
tune the algorithm and testing data provides the final assessment.
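
A minimal sketch of carving one dataset into these three subsets with scikit-learn; the 60/20/20 proportions are just an illustrative choice:

# Hypothetical sketch: split a dataset into 60% train, 20% validation, 20% test.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 90 30 30 for the 150-row iris data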
Test, Train and Validation Data
Decision Tree Classification
Decision Tree Classification
 A Decision Tree is a supervised machine learning algorithm. It is used for both classification and regression problems.

 The decision tree is like a tree with nodes.

 The branches depend on a number of factors. It splits data into branches like these till it achieves a threshold value.

 A decision tree consists of the root node, children nodes, and leaf nodes.

 In a decision tree, there are two kinds of nodes, the Decision Node and the Leaf Node.
Decision Tree Classification
 Decision nodes are used to make any decision and have multiple branches.

 Leaf nodes are the output of those decisions and do not contain any further branches.

 In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.

 A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
Why use Decision Tree?
 There are various algorithms in Machine learning, so choosing the
best algorithm for the given dataset and problem is the main
point to remember while creating a machine learning model.

 Below are the two reasons for using the Decision tree:

1. Decision Trees usually mimic human thinking ability while making


a decision, so it is easy to understand.

2. The logic behind the decision tree can be easily understood


because it shows a tree-like structure.
Decision Tree Terminologies
 Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two
or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree
cannot be segregated further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted
branches from the tree.
 Parent/Child node: The root node of the tree is called the parent
node, and other nodes are called the child nodes.
How does the Decision Tree algorithm
Work?
 In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree.

 The algorithm compares the value of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

 For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.

 It continues the process until it reaches the leaf node of the tree.
Algorithm
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

 Step-3: Divide S into subsets that contain possible values for the best attribute.

 Step-4: Generate the decision tree node which contains the best attribute.

 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
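
In practice, libraries handle this recursion for you. A minimal sketch with scikit-learn's CART-based classifier, using a built-in dataset purely for illustration:

# Minimal sketch: fit a CART decision tree and check its accuracy on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)  # entropy = information gain
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))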
Example
 Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or Not.

 So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM).

 The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding
labels.

 The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offers and Declined offer).
Attribute Selection Measures
 While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems, there is a technique called an Attribute Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:

 Information Gain

 Gini Index
Information Gain
 Information gain is the measurement of changes in entropy after
the segmentation of a dataset based on an attribute.
 It calculates how much information a feature provides us about a
class.
 According to the value of information gain, we split the node and
build the decision tree.
 A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest
information gain is split first. It can be calculated using the below
formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy
 Entropy can be defined as a measure of the impurity of a split; for a binary split it always lies between 0 and 1. The entropy of any split can be calculated as Entropy(S) = −p(yes) log2 p(yes) − p(no) log2 p(no), where p(yes) and p(no) are the proportions of the two classes in S.
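
A small sketch that computes entropy and information gain for a hypothetical binary split (9 'yes' / 5 'no' examples overall, split by one feature into groups of 8 and 6), just to make the formulas concrete:

# Hypothetical example: entropy and information gain for a binary split (9 yes / 5 no overall).
from math import log2

def entropy(yes, no):
    total = yes + no
    parts = [yes / total, no / total]
    return -sum(p * log2(p) for p in parts if p > 0)

parent = entropy(9, 5)                           # ~0.940
child_a, child_b = entropy(6, 2), entropy(3, 3)  # the feature splits S into 8 and 6 examples
weighted = (8 / 14) * child_a + (6 / 14) * child_b
print("Information gain:", parent - weighted)    # ~0.048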
Confusion Matrix
 A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of
test data for which the true values are known.

 Consider binary classification:


• There are two possible predicted classes: "yes" and "no".
• If we were predicting the presence of a disease, for example,
"yes" would mean they have the disease, and "no" would mean
they don't have the disease.
• The classifier made a total of 165 predictions (e.g., 165 patients
were being tested for the presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times,
and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60
patients do not.
Basic Terminology
• True positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.

• True negatives (TN): We predicted no, and they don't have the disease.

• False positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")

• False negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
Another Example…
• We have a total of 20 cats and dogs and our model predicts
whether it is a cat or not.

• Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’,
‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’,
‘dog’, ‘cat’]

Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’,
‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’,
‘dog’, ‘dog’, ‘cat’]
Another Example…
True Positive (TP) = 6
You predicted positive and it's true. You predicted that the animal is a cat and it actually is.

True Negative (TN) = 11
You predicted negative and it's true. You predicted that the animal is not a cat and it actually is not (it's a dog).

False Positive (Type 1 Error) (FP) = 2
You predicted positive and it's false. You predicted that the animal is a cat but it actually is not (it's a dog).

False Negative (Type 2 Error) (FN) = 1
You predicted negative and it's false. You predicted that the animal is not a cat but it actually is.
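
These counts can be reproduced with scikit-learn, treating 'cat' as the positive class:

# Reproduce the cat/dog example: confusion matrix with 'cat' as the positive class.
from sklearn.metrics import confusion_matrix

actual    = ['dog','cat','dog','cat','dog','dog','cat','dog','cat','dog',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']
predicted = ['dog','dog','dog','cat','dog','dog','cat','cat','cat','cat',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']

tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=['dog', 'cat']).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=6, TN=11, FP=2, FN=1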
Other Evaluation Metrics

a. Accuracy

b. Precision

c. Recall

d. F1-Score
Accuracy
 Accuracy simply measures how often the classifier makes the
correct prediction. It’s the ratio between the number of correct
predictions and the total number of predictions.
Precision
 It is a measure of the correctness achieved in positive prediction. In simple words, it tells us how many predictions are actually positive out of all the predicted positives.

 Precision is defined as the ratio of the total number of correctly classified positive classes divided by the total number of predicted positive classes: Precision = TP / (TP + FP).

 Or, out of all the predicted positive classes, how much we predicted correctly. Precision should be high (ideally 1).
Precision
• "Precision is a useful metric in cases where False Positives are a higher concern than False Negatives."

• Ex 1: In spam detection we need to focus on precision. Suppose a mail is not spam but the model predicts it as spam: that is a FP (False Positive). We always try to reduce FP.

• Ex 2: Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
Recall
 It is a measure of the actual observations which are predicted correctly, i.e. how many observations of the positive class are actually predicted as positive.

 It is also known as Sensitivity.

 Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.

 Recall is defined as the ratio of the total number of correctly classified positive classes divided by the total number of positive classes: Recall = TP / (TP + FN). Or, out of all the positive classes, how much we have predicted correctly. Recall should be high (ideally 1).
Recall
 "Recall is a useful metric in cases where False Negatives trump False Positives."

• Ex 1: Suppose we predict whether a person has cancer or not. The person is suffering from cancer but the model predicts that they are not: that is a FN.

• Ex 2: Recall is important in medical cases where it doesn't matter whether we raise a false alarm, but the actual positive cases should not go undetected!

• Recall would be a better metric here because we don't want to accidentally discharge an infected person and let them mix with the healthy population, thereby spreading a contagious virus. Now you can understand why accuracy was a bad metric for our model.
F-measure / F1-Score
 The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).

 The F1 score maintains a balance between the precision and recall of your classifier. If your precision is low, the F1 is low, and if the recall is low, again your F1 score is low.

 There will be cases where there is no clear distinction between whether Precision is more important or Recall. So we combine them!
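
Continuing the cat/dog example (TP=6, TN=11, FP=2, FN=1), the four metrics can be computed directly:

# Accuracy, precision, recall, and F1 from the cat/dog confusion matrix counts.
tp, tn, fp, fn = 6, 11, 2, 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 0.85
precision = tp / (tp + fp)                                  # 0.75
recall    = tp / (tp + fn)                                  # ~0.857
f1        = 2 * precision * recall / (precision + recall)   # ~0.80
print(accuracy, precision, recall, f1)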
Is it necessary to check for recall (or) precision if you already have a high accuracy?
 We cannot rely on a single value of accuracy in classification when the classes are imbalanced.

 For example, we have a dataset of 100 patients in which 5 have diabetes and 95 are healthy. If our model only predicts the majority class, i.e. that all 100 people are healthy, we still get a classification accuracy of 95%.
When to use Accuracy / Precision /
Recall / F1-Score?
• Accuracy is used when the True Positives and True Negatives are more important. Accuracy is a better metric for Balanced Data.

• Whenever False Positives are much more important, use Precision.

• Whenever False Negatives are much more important, use Recall.

• F1-Score is used when both the False Negatives and False Positives are important. F1-Score is a better metric for Imbalanced Data.
Random Forest
Random Forest

• A random forest is a machine learning technique that's used to solve regression and classification problems.

• It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.

• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, it predicts the final output.
Random Forest

• A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

• ASSUMPTION OF RF: The predictions from each tree must have very low correlations.
How does Random Forest algorithm work?
• Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
• The working process can be explained in the steps and diagram below:
• Step-1: Select random K data points from the training set.
• Step-2: Build the decision trees associated with the selected data points (subsets).
• Step-3: Choose the number N of decision trees that you want to build.
• Step-4: Repeat Steps 1 & 2.
• Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority vote.
• Example: Suppose there is a dataset that contains multiple
fruit images.
• So, this dataset is given to the Random forest classifier.
• The dataset is divided into subsets and given to each decision
tree.
• During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier
predicts the final decision.
• Ensemble simply means combining multiple models.
• Thus a collection of models is used to make predictions rather than an individual model.
• Ensemble uses two types of methods: Bagging and Boosting.

• 1. Bagging – It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. For example, Random Forest.

• 2. Boosting – It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.
Does random forest work on the Bagging principle or the Boosting principle?
Bagging
• Bagging, also known as Bootstrap Aggregation is the ensemble
technique used by random forest.
• Bagging chooses a random sample from the data set.
• Hence each model is generated from the samples (Bootstrap
Samples) provided by the Original Data with replacement known
as row sampling.
• This step of row sampling with replacement is called bootstrap.
• Now each model is trained independently which generates results.
• The final output is based on majority voting after combining the
results of all models.
• This step which involves combining all the results and generating
output based on majority voting is known as aggregation.
Important Features of Random Forest
• 1. Diversity – Not all attributes/variables/features are considered while making an individual tree; each tree is different.

• 2. Immune to the curse of dimensionality – Since each tree does not consider all the features, the feature space is reduced.

• 3. Parallelization – Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.

• 4. Stability – Stability arises because the result is based on majority voting/averaging.
Difference Between Decision Tree & Random
Forest
Important Hyperparameters

• Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.

The following hyperparameters increase the predictive power:

1. n_estimators – the number of trees the algorithm builds before averaging the predictions.
2. max_features – the maximum number of features random forest considers when splitting a node.
3. min_samples_leaf – the minimum number of samples required to be at a leaf node.
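
A minimal scikit-learn sketch using these hyperparameters (the specific values and the built-in dataset are illustrative, not recommendations):

# Minimal sketch: a bagging-based random forest with the hyperparameters named above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=2,    # minimum samples required at a leaf
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))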
Un-supervised Learning
Un-Supervised Learning

 Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets.

 These algorithms discover hidden patterns or data groupings without the need for human intervention.

 Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.

 Techniques: Clustering, PCA, Association, GANs, Autoencoders
Why use Un-Supervised Learning?

 Unsupervised learning is helpful for finding useful insights from the data.

 Unsupervised learning is much more similar to how a human learns to think from their own experiences, which makes it closer to real AI.

 Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.

 In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.
Clustering
Clustering
 In machine learning, we often group examples as a first step to understand a subject (data set) in a machine learning system. Grouping unlabeled examples is called clustering.

 As the examples are unlabeled, clustering relies on unsupervised machine learning. If the examples are labeled, then clustering becomes classification.

 Once all the examples are grouped, a human can optionally supply meaning to each cluster.
Properties of Clusters
 All the data points in a cluster should be similar to each other.

 The data points from different clusters should be as different as


possible.
Types of Clustering Methods
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
 It is a type of clustering that divides the data into non-hierarchical
groups.

 It is also known as the centroid-based method.

 The most common example of partitioning clustering is the K-Means


Clustering algorithm.

 In this type, the dataset is divided into a set of k groups, where K is


used to define the number of pre-defined groups.

 The cluster center is created in such a way that the distance between
the data points of one cluster is minimum as compared to another
cluster centroid.
Partitioning Clustering
Density-Based Clustering
 The density-based clustering method connects the highly-dense
areas into clusters, and the arbitrarily shaped distributions are
formed as long as the dense region can be connected.

 This algorithm does it by identifying different clusters in the


dataset and connects the areas of high densities into clusters.

 The dense areas in data space are divided from each other by
sparser areas.

 These algorithms can face difficulty in clustering the data points


if the dataset has varying densities and high dimensions.
Density-Based Clustering
Distribution Model-Based Clustering
 In the distribution model-based clustering method, the data is
divided based on the probability of how a dataset belongs to a
particular distribution.

 The grouping is done by assuming some distributions


commonly Gaussian Distribution.

 The example of this type is the Expectation-Maximization


Clustering algorithm that uses Gaussian Mixture Models (GMM).
Distribution Model-Based Clustering
Hierarchical Clustering
 Hierarchical clustering can be used as an alternative for the
partitioned clustering as there is no requirement of pre-
specifying the number of clusters to be created.

 In this technique, the dataset is divided into clusters to create a


tree-like structure, which is also called a dendrogram.

 The observations or any number of clusters can be selected by


cutting the tree at the correct level.

 The most common example of this method is the Agglomerative


Hierarchical algorithm.
Hierarchical Clustering
Fuzzy Clustering
 Fuzzy clustering is a type of soft method in which a data object
may belong to more than one group or cluster.

 Fuzzy C-means algorithm is the example of this type of


clustering; it is sometimes also known as the Fuzzy k-means
algorithm.
K-means Clustering
K- means Clustering
 K-means clustering is a simple and elegant approach for
partitioning a data set into K distinct, non-overlapping clusters.

 To perform K-means clustering, we must first specify the desired


number of clusters K; then the K-means algorithm will assign
each observation to exactly one of the K clusters
K- means Clustering
 The K-means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares.

 The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion.
Working of K- means Clustering

 Consider below data points:


Working of K- means Clustering : Step 1

• Step 1. Determine the value “K”, the value “K” represents


the number of clusters.

• In this case, we’ll select K=3. That is to say, we want to identify


3 clusters.
Working of K- means Clustering : Step 2

• Step 2. Randomly select 3 distinct centroids (new data points as cluster initialization).

• For example, in attempt 1: "K" is equal to 3, so there are 3 centroids, which serve as the cluster initialization.
Working of K- means Clustering : Step 3
• Step 3. Measure the distance (euclidean distance) between
each point and the centroid

• for example, measure the distance between first point and


the centroid.
Working of K- means Clustering : Step 4

• Step 4. Assign each point to the nearest cluster.

• For example, the first point is assigned to the cluster whose centroid is closest to it.
Working of K-means Clustering: Step 4

• Apply the same treatment to the other unlabeled points, until we get this:

Working of K-means Clustering: Step 5

 Step 5. Calculate the mean of each cluster as the new centroid.

 Update each centroid with the mean of its cluster.


Working of K-means Clustering: Step 6

 Step 6. Repeat steps 3–5 with the new cluster centers.


Working of K-means Clustering: Step 6

• Repeat until a stopping condition is met:

• Convergence (no further changes in the assignments), or

• The maximum number of iterations is reached.

• Since the clustering did not change at all during the last iteration, we're done.
Working of K-means Clustering: Step 6

• Has this process been completed? Not yet.

• Remember, the K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion.

 So, how do we assess the result of this clustering? Let's continue.
Working of K-means Clustering: Step 7
 Step 7. Calculate the variance of each cluster.

 Since K-means clustering can't "see" the best clustering, its only option is to keep track of these clusters and their total variance, and then do the whole thing over again with different starting points.
Working of K-means Clustering: Step 8
• Step 8. Repeat steps 2–7 until you get the lowest sum of variance.

• For example — attempt 2, with different random centroids.
Working of K-means Clustering: Step 8
Working of K-means Clustering: Step 8

• Repeat until we get the lowest sum of variance, and pick those clusters as our result.
Working of K-means Clustering
1. We are specifying K = 3 here. The black points are the centroids for the three clusters represented by the red, yellow, and blue points.
2. The distance from each example x to each centroid c is computed using some distance metric.
3. Each example is assigned to its closest centroid. This process is repeated iteratively until the assignment of data points no longer changes after the centroids are recomputed.
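The loop described above can also be written out directly. Below is a compact NumPy sketch of the same procedure — random initialization, Euclidean assignment, centroid update, repeat until the assignments stop changing. The toy data and K = 3 are assumptions, and the sketch omits refinements such as empty-cluster handling or multiple restarts:

```python
import numpy as np

def kmeans(X, k=3, max_iter=100, seed=0):
    """Plain NumPy K-means following steps 1-6 above (illustrative sketch,
    no empty-cluster handling and no multiple restarts)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: Euclidean distance to each centroid, assign each point to the nearest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # Step 6: assignments unchanged -> converged
        labels = new_labels
        # Step 5: move each centroid to the mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.random.default_rng(1).normal(size=(100, 2))   # toy 2-D data (assumption)
labels, centroids = kmeans(X, k=3)
```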
Visualizing K-Means Clustering

https://www.naftaliharris.com/blog/visualizing-k-means-
clustering/
Choosing the Appropriate Number of
Clusters
 Two methods that are commonly used to evaluate the
appropriate number of clusters:

 The elbow method


 The silhouette coefficient

 These are often used as complementary evaluation techniques


rather than one being preferred over the other.
Elbow Method
 Calculate the Within-Cluster Sum of Squared Errors (WCSS) for different values of k, and choose the k at which the decrease in WCSS first starts to diminish — the "elbow" of the curve.

 WCSS = Σi (Xi − Yi)², where Yi is the centroid of the cluster to which observation Xi is assigned.

 Within-Cluster Sum of Squared Errors sounds a bit complex. Let's break it down:

 The Squared Error for each point is the square of the distance of the point from its representation, i.e. its predicted cluster center.

 The WCSS score is the sum of these Squared Errors over all the points.
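A sketch of the elbow method using scikit-learn, where the WCSS for each k is available as the fitted model's inertia_ attribute (the toy data X is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(150, 2))   # toy data (assumption)

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # WCSS for this k (inertia_ = within-cluster sum of squares)

plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```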
The silhouette coefficient
 The silhouette coefficient is another quality measure of clustering – and it applies to any clustering, not just k-means. The silhouette coefficient of observation i is calculated as

 si = (b − a) / max(a, b)

 where a is the average distance from observation i to all other observations in the same cluster, and b is the minimum, over all other clusters, of the average distance from observation i to the observations in that cluster.

 The silhouette coefficient of a clustering result is the average of si over all observations i.

 This metric ranges from +1, representing the best clustering, to −1, representing the worst clustering.
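A corresponding sketch using sklearn.metrics.silhouette_score, which averages si over all observations (the toy data X is again an assumption; the silhouette is undefined for a single cluster):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(150, 2))    # toy data (assumption)

for k in range(2, 11):                                 # silhouette is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette coefficient = {silhouette_score(X, labels):.3f}")
```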
The silhouette coefficient
 Plotting SC against K, we see the highest coefficient of 0.63 with 3 clusters and the second highest of 0.60 with 2 clusters. For higher numbers of clusters, SC drops sharply and stays low. This is because further fragmenting a given cluster makes a and b closer to each other.
Comparison between Elbow & Silhouette Coefficient
 While WCSS is comparable for the same data across different values of k, it is not comparable across clustering solutions on different datasets, and hence has no absolute threshold.

 The silhouette coefficient, on the other hand, has a fixed range and can therefore be used as an overall metric for comparing the quality of clustering, irrespective of the data or the number of clusters.
Advantages of K-means
1. Easy to understand and implement

2. The K clusters help us label the data

3. Distance calculation is simple

4. Helps to eliminate subjectivity from the analysis

Disadvantages of K-means
1. Computationally intensive

2. Deciding K can be challenging

3. Choosing the correct distance metric can be a challenge

4. Feature scaling is required

5. Sensitive to the starting positions of the initial centroids

6. Susceptible to the curse of dimensionality


Hierarchical Clustering
Hierarchical Clustering
 Hierarchical Clustering creates clusters in a hierarchical, tree-like structure (also called a Dendrogram).

 A subset of similar data is created in a tree-like structure in which the root node corresponds to the entire dataset, and branches are created from the root node to form several clusters.

 Hierarchical Clustering is of two types:
 Divisive Hierarchical Clustering
 Agglomerative Hierarchical Clustering
Hierarchical Clustering
 Divisive Hierarchical Clustering is also termed a top-down clustering approach. In this technique, the entire dataset is first assigned to a single cluster. The cluster is then split repeatedly until there is one cluster for each observation.

 Agglomerative Hierarchical Clustering is popularly known as the bottom-up approach, wherein each observation starts as its own cluster. Pairs of clusters are combined until all clusters are merged into one big cluster that contains all the data.

 The two algorithms are exactly the opposite of each other.
Agglomerative Hierarchical Clustering
 For a set of N observations to be clustered:

1. Start by assigning each observation to its own single-point cluster, so that if we have N observations, we have N clusters, each containing just one observation.

2. Find the closest (most similar) pair of clusters and merge them into one cluster; we now have N-1 clusters. Similarity and dissimilarity can be measured in various ways (see the distance measures below).

3. Again find the two closest clusters and merge them into one; we now have N-2 clusters. Which clusters are considered closest is determined by the agglomerative clustering linkage technique used.

4. Repeat steps 2 and 3 until all observations are clustered into one single cluster of size N.
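These steps are essentially what SciPy's agglomerative routines carry out. A short sketch, assuming toy data and Ward linkage purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data (assumption)

# Steps 1-4: every observation starts as its own cluster and the two closest
# clusters are merged repeatedly; 'ward' linkage is one possible choice.
Z = linkage(X, method="ward")

# Cut the resulting tree to obtain, say, 3 clusters (the number 3 is an assumption).
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```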
Agglomerative Hierarchical Clustering
 Clustering algorithms use various distance or dissimilarity measures to develop different clusters. A lower (closer) distance indicates that the observations are similar and will be grouped in a single cluster; the higher the similarity, the more alike the observations are.

 Step 2 can be done using various measures of similarity and dissimilarity, namely:
 Euclidean Distance
 Manhattan Distance
 Minkowski Distance
 Jaccard Similarity Coefficient
 Cosine Similarity
 Gower's Similarity Coefficient
Euclidean Distance
 The Euclidean distance is the most widely used distance measure when the variables are continuous (either interval or ratio scale).

 The Euclidean distance between two points is the length of the segment connecting them. It is the most intuitive way of representing the distance between two points.

 The Pythagorean theorem can be used to calculate this distance, as shown in the figure below. For points (x1, y1) and (x2, y2) in 2-dimensional space:

 d = √((x2 − x1)² + (y2 − y1)²)
Euclidean Distance
Manhattan Distance
 Euclidean distance may not be suitable for measuring the distance between different locations. If we wanted to measure the distance between two retail stores in a city, Manhattan distance would be more suitable than Euclidean distance.

 It is the distance between two points on a grid based on a strictly horizontal and vertical path. The Manhattan distance is the simple sum of the horizontal and vertical components: d = |x2 − x1| + |y2 − y1|.

 In a nutshell, Manhattan distance is the distance you would cover if you could only travel along the coordinate axes.
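A quick sketch of both measures using scipy.spatial.distance; the two points are made-up coordinates:

```python
from scipy.spatial.distance import euclidean, cityblock

p1, p2 = (1.0, 2.0), (4.0, 6.0)    # hypothetical points
print(euclidean(p1, p2))           # sqrt((4-1)^2 + (6-2)^2) = 5.0
print(cityblock(p1, p2))           # |4-1| + |6-2| = 7 (Manhattan / city-block distance)
```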
Jaccard Similarity Coefficient/Jaccard
Index
 The Jaccard Similarity Coefficient can be used when your data or variables are qualitative in nature, in particular when the variables are represented in binary form such as (0, 1) or (Yes, No).
Jaccard Similarity Coefficient/Jaccard
Index
• Note that we need to transform the data into binary form before applying the Jaccard index. Let's say Store 1 and Store 2 sell the items below, and each item is considered an element.
• We can observe that bread, jam, coke and cake are sold by both stores; hence, 1 is assigned for both stores.

The Jaccard index is J(A, B) = |A ∩ B| / |A ∪ B|; its value ranges from 0 to 1, and the higher the Jaccard index, the higher the similarity.
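A small sketch of the Jaccard index for the two-store example using Python sets; the item lists are placeholders, and only the overlap of bread, jam, coke and cake is taken from the slide:

```python
# Hypothetical item sets; only the overlap (bread, jam, coke, cake) follows the slide.
store1 = {"bread", "jam", "coke", "cake", "milk", "butter"}
store2 = {"bread", "jam", "coke", "cake", "tea"}

jaccard = len(store1 & store2) / len(store1 | store2)   # |A ∩ B| / |A ∪ B|
print(round(jaccard, 2))                                # 4 shared items out of 7 distinct -> 0.57
```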
Steps to perform Hierarchical Clustering
 We merge the most similar points or clusters in hierarchical clustering
– we know this. Now the question is – how do we decide which points
are similar and which are not? It’s one of the most important
questions in clustering!

 Here’s one way to calculate similarity – Take the distance between the
centroids of these clusters. The points having the least distance are
referred to as similar points and we can merge them. We can refer to
this as a distance-based algorithm as well (since we are calculating the
distances between the clusters).

 In hierarchical clustering, we have a concept called a proximity matrix.


This stores the distances between each point. Let’s take an example to
understand this matrix as well as the steps to perform hierarchical
clustering.
Hierarchical Clustering Example
 Suppose a teacher wants to divide her students into different
groups. She has the marks scored by each student in an
assignment and based on these marks, she wants to segment
them into groups.
 There’s no fixed target here as to how many groups to have.
Since the teacher does not know what type of students should
be assigned to which group, it cannot be solved as a supervised
learning problem.
 So, we will try to apply hierarchical clustering here and segment
the students into different groups.
 Let’s take a sample of 5 students:
Creating Proximity Matrix
 First, we will create a proximity matrix which will tell us the
distance between each of these points. Since we are calculating
the distance of each point from each of the other points, we will
get a square matrix of shape n X n (where n is the number of
observations).

 Let’s make the 5 x 5 proximity matrix for our example:


Creating Proximity Matrix
 The diagonal elements of this matrix will always be 0 as the
distance of a point with itself is always 0.

 We will use the Euclidean distance formula to calculate the rest


of the distances. So, let’s say we want to calculate the distance
between point 1 and 2:

√((10 − 7)²) = √9 = 3

 Similarly, we can calculate all the distances and fill the proximity
matrix.
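As a sketch, the proximity matrix can be computed in one step with SciPy. The marks below are hypothetical, except that students 1 and 2 score 10 and 7 as in the example:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical marks; only the values 10 and 7 (students 1 and 2) come from the text.
marks = np.array([[10], [7], [28], [15], [30]])

proximity = squareform(pdist(marks, metric="euclidean"))
print(proximity)     # 5 x 5 matrix, zeros on the diagonal, distance between points 1 and 2 = 3
```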
Steps to Perform Hierarchical Clustering
 Step 1: First, we assign all the points to an individual cluster:

 Different colors here represent different clusters. You can see


that we have 5 different clusters for the 5 points in our data.
Steps to Perform Hierarchical Clustering
 Step 2: Next, we will look at the smallest distance in the
proximity matrix and merge the points with the smallest
distance. We then update the proximity matrix:

 Here, the smallest distance is 3 and hence we will merge point 1


and 2:
 Let’s look at the updated clusters and accordingly update the
proximity matrix:

 Here, we have taken the maximum of the two marks (7, 10) to represent this cluster. Instead of the maximum, we could also take the minimum or the average value. Now, we will again calculate the proximity matrix for these clusters:
• Step 3: We will repeat step 2 until only a
single cluster is left.

• So, we will first look at the minimum


distance in the proximity matrix and then
merge the closest pair of clusters. We will
get the merged clusters as shown below
after repeating these steps:

• We started with 5 clusters and finally have


a single cluster.

• This is how agglomerative hierarchical


clustering works.
How should we Choose the Number of
Clusters in Hierarchical Clustering?

 To get the number of clusters for hierarchical clustering, we


make use of an awesome concept called a Dendrogram.

 A dendrogram is a tree-like diagram that records the sequences


of merges or splits.

 Let's get back to our teacher-student example. Whenever we merge two clusters, a dendrogram will record the distance between these clusters and represent it in graph form. Let's see what a dendrogram looks like:
How should we Choose the Number of
Clusters in Hierarchical Clustering?
How should we Choose the Number of
Clusters in Hierarchical Clustering?

 We have the samples of the dataset on the x-axis and the


distance on the y-axis. Whenever two clusters are merged, we
will join them in this dendrogram and the height of the join will
be the distance between these points. Let’s build the
dendrogram for our example:
How should we Choose the Number of
Clusters in Hierarchical Clustering?
 We started by merging sample 1 and 2 and the distance
between these two samples was 3 (refer to the first proximity
matrix in the previous section). Let’s plot this in the dendrogram:
How should we Choose the Number of
Clusters in Hierarchical Clustering?
 Here, we can see that we have merged sample 1 and 2. The
vertical line represents the distance between these samples.
Similarly, we plot all the steps where we merged the clusters
and finally, we get a dendrogram like this:
How should we Choose the Number of
Clusters in Hierarchical Clustering?
• We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between those clusters.

• Now, we can set a threshold distance and draw a horizontal line


(Generally, we try to set the threshold in such a way that it cuts
the tallest vertical line). Let’s set this threshold as 12 and draw a
horizontal line:
How should we Choose the Number of
Clusters in Hierarchical Clustering?

 The number of clusters will be the number of vertical lines


which are being intersected by the line drawn using the
threshold.

 In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will contain samples (1, 2, 4) and the other will contain samples (3, 5). Pretty straightforward, right?

 This is how we can decide the number of clusters using a


dendrogram in Hierarchical Clustering.
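A sketch of drawing the dendrogram and cutting it at the threshold of 12 with SciPy, using the same hypothetical marks as in the earlier sketch; complete linkage is chosen here because it roughly mirrors the slide's "take the maximum" rule:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

marks = np.array([[10], [7], [28], [15], [30]])   # hypothetical marks, as in the earlier sketch

# 'complete' linkage roughly mirrors the slide's "take the maximum" rule.
Z = linkage(marks, method="complete")
dendrogram(Z, labels=[1, 2, 3, 4, 5])
plt.axhline(y=12, color="red", linestyle="--")    # the threshold line at 12
plt.ylabel("distance")
plt.show()

print(fcluster(Z, t=12, criterion="distance"))    # cutting at 12 -> two clusters: (1, 2, 4) and (3, 5)
```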
Handling Missing Data
• Missing data is different from other topics in that you cannot simply choose to ignore it.

• This is because failing to make a choice just means you are using the default option for a procedure, which most of the time is not optimal.

• In fact, it is important to remember that every model deals with


missing data in a certain way, and some modeling techniques handle
missing data better than others.

• There are two problems associated with missing data, and these affect
the quantity and quality of the data:

– Missing data reduces sample size (quantity)


– Responders may be different from nonresponders (quality—there could be biased
results)
Ways to handle Missing Data
• There are three ways to address missing data:
– Remove fields
– Remove cases
– Impute missing values

• In some situations, it may be necessary to remove


cases instead of fields. For example, you may be
developing a predictive model to predict customers'
purchasing behavior and you simply do not have
enough information concerning new customers. The
easiest way to remove cases would be to use a Select
node
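The course handles missing data with Modeler nodes; purely as a rough pandas analogue (not the Modeler workflow), the three options look roughly like this, with a made-up DataFrame and a "Not applicable" code treated as missing:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "region": ["North", "Not applicable", None],
                   "spend": [120.0, 80.0, np.nan]})          # made-up data

df = df.replace("Not applicable", np.nan)      # treat the predefined code as missing
drop_field = df.drop(columns=["region"])       # option 1: remove the field
drop_cases = df.dropna()                       # option 2: remove incomplete cases
imputed = df.fillna({"age": df["age"].mean(),
                     "spend": df["spend"].median()})          # option 3: impute missing values
```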
Defining missing values in Type
node
• To define blank values:

• 1. Edit the Var.File node.


• 2. Click on the Types tab.
• 3. Click on the Missing cell for the field Region.
• 4. Select Specify in the Missing column.
• 5. Click Define blanks.

• Selecting Define blanks chooses Null and White space (remember, Empty String is a
subset of White space, so it is also selected), and in this way these types of missing
data are specified. To specify a predefined code, or a blank value, you can add each
individual value to a separate cell in the Missing values area, or you can enter a range
of numeric values if they are consecutive.

• 6. Type "Not applicable" in the first Missing values cell.


• 7. Hit Enter:
• We have now specified that "Not applicable" is a code
for missing data for the field Region.
• 8. Click OK.
• In our dataset, we will only define one field as having
missing data.
• 9. Click on the Clear Values button.
• 10. Click on the Read Values button:
• The asterisk indicates that missing values have been
defined for the field Region. Now Not applicable is no
longer considered a valid value for the field Region, but
it will still be shown in graphs and other output.
However, models will now treat the category Not
applicable as a missing value.

• 11. Click OK.


Imputing missing values with the
Data Audit node

• Already discussed in LAB


Cleaning and Selecting Data

• You will learn how to:

• Select cases
• Sort cases
• Identify and remove duplicate cases
• Reclassify categorical values
Cleaning and Selecting Data
• Having finished the initial data understanding phase,
we are ready to move onto the data preparation phase.

• Data preparation is the most time-consuming aspect of


data mining.

• Every data mining project will require different types of


data preparation.
Selecting cases
 Often during a data mining project, you will need to select a
subset of records.

 For example, you might want to build a model that only includes
people that have certain characteristics (for example, customers
who have purchased something within the last six months).

 The Select node is used when you want to select or discard a


subset of records based on a specific condition.
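In Modeler this is done with a Select node and a CLEM expression; as a rough pandas analogue (the six-month rule comes from the example, while the field names and data are invented):

```python
import pandas as pd

# Made-up data; 'months_since_purchase' is an invented field name.
customers = pd.DataFrame({"id": [1, 2, 3, 4],
                          "months_since_purchase": [2, 9, 5, 14]})

recent = customers[customers["months_since_purchase"] <= 6]    # "select" the matching records
discarded = customers[customers["months_since_purchase"] > 6]  # or discard them instead
print(recent)
```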
Expression Builder in Select Node
• Expressions are built in the large textbox. Operations (addition, subtraction, and so
on) can be pasted into the expression textbox by clicking the corresponding buttons.
• The Function list contains functions available in Modeler:
Sorting cases
• At times it is useful to see if any values look unusual. The Sort
node arranges cases into ascending or descending order based
on the values of one or more fields. It also helps in the
optimization of other nodes so that they perform more
efficiently:
1. Place a Sort node from the Record Ops palette onto the
canvas.
2. Connect the Select node to the Sort node.
3. Edit the Sort node.

• You can sort data on more than one field. In addition, each field
can be sorted in ascending or descending order
Identifying and removing duplicate
cases
• Datasets may contain duplicate records that often
must be removed before data mining can begin. For
example, the same individual may appear multiple
times in a dataset with different addresses.

• The Distinct node finds or removes duplicate records in


a dataset. The Distinct node, located in the Record Ops
palette, checks for duplicate records and identifies the
cases that appear more than once in a file so they can
be reviewed and/or removed.
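As a rough pandas analogue of the Sort and Distinct nodes described above (column names and data are invented):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [3, 1, 2, 1],
                   "spend": [50, 80, 30, 80]})                 # made-up data

sorted_df = df.sort_values(by=["customer_id", "spend"],
                           ascending=[True, False])            # Sort node: multiple fields, mixed order
deduped = sorted_df.drop_duplicates(subset=["customer_id"],
                                    keep="first")              # Distinct node: one record per customer
```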
Reclassifying categorical values
• The Reclassify node allows you to reclassify or recode the data values
for categorical fields. For example, let's say that customers reported
satisfaction ratings on a ten-point scale. However, after inspecting the
distribution of this field, you realized that if the ratings were
reclassified into a three-point (negative, neutral, or positive) scale,
that would be more useful for prediction. This is exactly the function
the Reclassify node performs.

• The reclassified values can replace the original values for a field,
although a safer approach is to create a new field, thus retaining the
original field as well:

1. Place a Reclassify node from the Field Ops palette onto the
canvas.

2. Edit the Reclassify node.


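As a rough pandas analogue of what the Reclassify node does in this example — collapsing a ten-point satisfaction rating into a three-point scale while keeping the original field — with assumed cut points:

```python
import pandas as pd

df = pd.DataFrame({"satisfaction": [1, 4, 5, 8, 10]})   # made-up ratings on a ten-point scale

# Create a new field rather than overwriting the original; the 1-3 / 4-7 / 8-10 bands are assumed.
df["satisfaction_3pt"] = pd.cut(df["satisfaction"],
                                bins=[0, 3, 7, 10],
                                labels=["negative", "neutral", "positive"])
```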
Deriving New Fields
• A very important aspect of every data mining project is to extract as much
information as possible. Every project will begin with some data, but it is the
responsibility of the data miner to gather additional information from what is already
known.

• This can be the most creative and challenging aspect of a data mining project.

• For example, you might have survey data, but this data might need to be summed for
more information, such as a total score on the survey, or the average score on the
survey, and so on. In other words, it is important to create additional fields.
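As a rough pandas analogue of the Derive node for the survey example, creating total-score and average-score fields (the column names are invented):

```python
import pandas as pd

survey = pd.DataFrame({"q1": [4, 2, 5], "q2": [3, 3, 4], "q3": [5, 1, 4]})   # made-up survey items

survey["total_score"] = survey[["q1", "q2", "q3"]].sum(axis=1)    # derived field: total score
survey["avg_score"] = survey[["q1", "q2", "q3"]].mean(axis=1)     # derived field: average score
```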
Drop-down list of Derive Node
Modeling Nodes