
UNIT IV

Object Segmentation
UNIT IV Syllabus
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised Learning, Tree Building – Regression,
Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees etc. Time Series Methods: Arima, Measures of
Forecast Accuracy, STL approach, Extract features from generated model as Height, Average Energy etc and Analyze for
prediction
Object Segmentation

In this scenario, you are the Marketing Director of a large retail bank. You want to customize your communications using data modeling. The bank wants to offer a new financial product to its customers. Your project consists of launching a direct marketing campaign aimed at promoting this product. In order to customize the marketing messages from the bank and improve communication with the various customers and prospects for this new product, the senior management of the bank asks you to build a segmentation model of the customers of this product. Using Modeler - Segmentation/Clustering, you can rapidly develop a descriptive model with the least possible cost. This model shows the characteristic profiles of the customers interested in your new product, and thus responds to your business issue and fulfills your objectives.

The Intuitive Method
This method consists of using your knowledge of the various profiles exhibited by your customers. Thanks to the domain-specific knowledge that you have of your customers, you determine the criteria of the segmentation model intuitively, and build the clusters yourself. The main disadvantage of this method is that the number of information items available for each customer will invariably grow with time. The more data your database accumulates, the harder it is for you to manually create clusters that take all data into consideration and to develop a response to your business issue. Furthermore, as the increasing volume of information requires you to build segmentation models with increasing frequency, the time required to build these segmentation models becomes increasingly more significant. Finally, management may want you to rationalize your methods, and to perform your segmentation using a method not based purely on your intuition. Defending your segmentation method based on intuition may be difficult.

Your Objective
Consider the following case. Using Regression/Classification, you have contacted the prospects most likely to be interested in your new financial product, and identified the ideal number of prospects to contact out of the entire database, while meeting the deadlines and staying within the budget you were allowed. To improve the rate of return of your campaign, senior management asks you to:
● Build a segmentation model of your customers,
● Analyze the characteristics of the identified clusters,
● Define customized communications for each cluster.
The segmentation model in particular should allow you to distinguish customer clusters by
virtue of their propensity to purchase the new high-end savings product proposed by your
firm. You will optimize your understanding of your customers.
6.2.2 Classical Statistical Method

On the basis of the information that you have, a data mining expert could build a segmentation model. In other words, you could ask a statistical expert to create a mathematical model that would allow you to build clusters based on the profiles of your customers. To implement this method, the statistician must:
● Perform a detailed analysis of your database.
● Prepare your database down to the smallest detail, specifically, encoding the variables as a function of their type (nominal, ordinal or continuous) in preparation for segmentation. The encoding strategy used will determine the type of segmentation model obtained. At this step, the statistician may unconsciously bias the results.
● Test different types of algorithms (K-means, both ascending and descending hierarchical segmentation models) and select the one best suited to your business issue.
● Evaluate the relevance of the clusters obtained, in particular, the response to your domain-specific business issue.
After a few weeks, the statistical expert will be able to provide a certain number of clusters, or homogeneous groups, to which each of the individuals of your database are assigned. This method presents significant constraints. You must:
● Ensure that your statistical expert, who is usually from an external department, is available for the scheduled period,
● Ensure that the modeling costs will fit into your budget,
● Spend time explaining your domain-specific business issue to the statistician,
● Spend time understanding the results that are provided,
● Ask a programmer to write a program to determine the cluster associated with any new individual added to your database.
In addition, this method is not systematic. Two statisticians performing this segmentation on the same dataset could obtain different results.

6.2.3 Automated Analytics Method

Segmentation/Clustering allows you to build a segmentation model of your customers in a few minutes, taking into consideration the interest expressed by your customers in your new product. Segmentation/Clustering automatically detects interactions between the variables to build homogeneous sub-sets, or clusters. Each cluster is homogeneous with respect to the entire set of variables, and in particular with respect to the target variable, that is, for example, "responded positively to my test". You will discover the characteristics of different clusters, such as those clusters with an excellent response rate and those with a poor response rate. In addition, if your customer database contains customer expenditures on your other products, you will also obtain information on product sale synergies, by cluster. Using Segmentation/Clustering, you have access to all the analytical features needed to define the type of message to be sent to the cluster for each customer. You have homogeneous clusters that will allow you to respond to your business issue. Of particular importance, this segmentation is systematic: the results obtained do not represent a particular point of view of your data, and it is robust or consistent. Two people performing this segmentation using the method would obtain the same results.
Decision Tree

 “Decision Tree” is a supervised machine learning algorithm which can be used for both classification and regression challenges.
 It is one of the oldest but most widely used practical methods for supervised learning.
 Tree models where the target variable can take a discrete set of values are called classification trees.
 Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees (CART).
Definition:

A decision tree is a flowchart-like structure in which the branches represent conditions and the leaf nodes represent the class labels.
Tree Building – Regression, Classification

Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a
target variable based on several input variables. Each interior node corresponds to one of
the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a
value of the target variable given the values of the input variables represented by the path from the root to the leaf.

Decision trees used in data mining are of two main types:

 Classification tree analysis is when the predicted outcome is the class to which the data belongs.
 Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a
patient’s length of stay in a hospital).

In general, decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised learning.
• A decision tree is a structure that includes a root node, branches, and leaf nodes.

• Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label.

• The topmost node in the tree is the root node.

• The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not.

• Each internal node represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −


•It does not require any domain knowledge.
•It is easy to comprehend.
•The learning and classification steps of a decision tree are simple and fast.
“Decision Tree follows divide and conquer, using if-then-else conditions. It decides on information gain at each node.”
Question in Mind

It is simple! Measure the impurity of a node against its leaf nodes.
None of the leaf nodes is 100% “Yes Heart Disease” or 100% “No Heart Disease” – they are impure.
There are multiple ways to measure these impurities, but the most popular one is Gini.

Courtesy: StatQuest with Josh Starmer
Pros
• Easy to use, build and understand.
• Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
• Produce rules that are easy to interpret and implement.
• Do not require the assumptions of statistical models.
• Can work without extensive handling of missing data.
• A decision tree does not require normalization of data.
• A decision tree does not require scaling of data either.

Cons
• A small change in the data can cause a large change in the structure of the decision tree, causing instability.
• For a decision tree, calculations can sometimes become far more complex compared to other algorithms.
• A decision tree often involves a higher time to train the model.
• Decision tree training is relatively expensive, as the complexity and time taken are greater.
• Sometimes the decision tree algorithm is inadequate for regression and predicting continuous values accurately.

TIME FOR DECISION TREE IMPLEMENTATION
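The slides stop at "time for implementation", so here is a minimal, hedged sketch of what such an implementation could look like in Python with scikit-learn. The Iris dataset, the 70/30 split and max_depth=3 are illustrative assumptions, not part of the original material.

```python
# A minimal sketch: training a decision tree classifier with scikit-learn
# on the bundled Iris dataset (dataset and hyperparameters are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# criterion="gini" uses the Gini impurity discussed in these slides
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
# Print the learned if/then/else rules of the tree
print(export_text(clf, feature_names=load_iris().feature_names))
```

The printed rules make the "divide and conquer with if-then-else conditions" idea visible: each line is one test on an attribute along a root-to-leaf path.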


CHAID:

CHAID stands for CHI-squared Automatic Interaction Detector. It is one of the oldest decision tree algorithms: it was proposed in 1980 by Gordon V. Kass (CART followed in 1984). Here, chi-square is a metric used to find the significance of a feature: the higher the value, the higher the statistical significance. Similar to the others, CHAID builds decision trees for classification problems. This means that it expects data sets having a categorical target variable.

Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a quantitative variable. They called the method AID, for Automatic Interaction Detection. The algorithm performs stepwise splitting. It begins with a single cluster of cases and searches a candidate set of predictor variables for a way to split this cluster into two clusters. Each predictor is tested for splitting as follows: sort all the n cases on the predictor and examine all n-1 ways to split the cluster in two. For each possible split, compute the within-cluster sum of squares about the mean of the cluster on the dependent variable. Choose the best of the n-1 splits to represent the predictor's contribution. Now do this for every other predictor. For the actual split, choose the predictor and its cut point which yields the smallest overall within-cluster sum of squares. Categorical predictors require a different approach. Since categories are unordered, all possible splits between categories must be considered. For deciding on one split of k categories into two groups, this means that 2^(k-1) - 1 possible splits must be considered. Once a split is found, its suitability is measured on the same within-cluster sum of squares as for a quantitative predictor. Morgan and Sonquist called their algorithm AID because it naturally incorporates interaction among predictors. Interaction is not correlation. It has to do instead with conditional discrepancies. In the analysis of variance, interaction means that a trend within one level of a variable is not parallel to a trend within another level of the same variable. In the ANOVA model, interaction is represented by cross-products between predictors. In the tree model, it is represented by branches from the same nodes which have different splitting predictors further down the tree.
CART:
CART stands for Classification And Regression Tree. The CART algorithm was introduced in Breiman et al. (1984). A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The CART growing method attempts to maximize within-node homogeneity. The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity. For example, a terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is "pure." For categorical (nominal, ordinal) dependent variables the common measure of impurity is Gini, which is based on squared probabilities of membership for each category. Splits are found that maximize the homogeneity of child nodes with respect to the value of the dependent variable.

Impurity Measure:
GINI Index
Used by the CART (Classification And Regression Tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability f_i of each item being chosen times the probability (1 - f_i) of a mistake in categorizing that item, i.e. Gini = Σ f_i (1 - f_i) = 1 - Σ f_i². It reaches its minimum (zero) when all cases in the node fall into a single target category.
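As a small illustration of this definition, the following hedged Python sketch computes Gini impurity for the class labels that fall into a single node. The gini_impurity helper and the toy label lists are assumptions made for this example, not part of the original material.

```python
# Illustrative sketch: Gini impurity of the labels that land in one node.
from collections import Counter

def gini_impurity(labels):
    """Gini = sum_i f_i * (1 - f_i) = 1 - sum_i f_i^2, where f_i is the
    fraction of items in the node that belong to class i."""
    n = len(labels)
    if n == 0:
        return 0.0
    fractions = [count / n for count in Counter(labels).values()]
    return 1.0 - sum(f * f for f in fractions)

print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure node
print(gini_impurity(["yes", "yes", "no", "no"]))    # 0.5 -> maximally impure for two classes
```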
Overfitting, Pruning and Complexity

If a decision tree is fully grown, it may lose some generalization capability. This is a phenomenon
known as overfitting.

Over-fitting is the phenomenon in which the learning system fits the given training data so tightly that it becomes inaccurate in predicting the outcomes of untrained data. In decision trees, over-fitting occurs when the tree is designed so as to perfectly fit all samples in the training data set. It thus ends up with branches governed by strict rules on sparse data, which affects the accuracy when predicting samples that are not part of the training set.

One of the methods used to address over-fitting in decision trees is called pruning, which is done after the initial training is complete. In pruning, you trim off the branches of the tree, i.e., remove the decision nodes starting from the leaf nodes, such that the overall accuracy is not disturbed. This is done by segregating the actual training set into two sets: a training data set D and a validation data set V. Prepare the decision tree using the segregated training data set D, then continue trimming the tree to optimize the accuracy on the validation data set V.
The Role of Pruning in Decision Trees
Pruning is one of the techniques that is used to overcome our problem of overfitting. Pruning, in its literal sense, is a practice which involves the selective removal of certain parts of a tree (or plant), such as branches, buds, or roots, to improve the tree's structure and promote healthy growth. This is exactly what pruning does to our decision trees as well. It makes the tree versatile so that it can adapt if we feed any new kind of data to it, thereby fixing the problem of overfitting. It reduces the size of a decision tree, which might slightly increase your training error but drastically decrease your testing error, hence making it more adaptable. Minimal Cost-Complexity Pruning is one of the types of pruning of decision trees (see the sketch below).

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
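As an illustration of post-pruning and cost complexity, here is a hedged scikit-learn sketch that applies Minimal Cost-Complexity Pruning via the ccp_alpha parameter and keeps the pruned tree that scores best on a validation set V. The dataset and the split are illustrative assumptions.

```python
# Hedged sketch of post-pruning with minimal cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate alpha values from the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)      # accuracy on the validation set V
    if score >= best_score:
        best_alpha, best_score = alpha, score

print("Chosen alpha:", best_alpha, "validation accuracy:", best_score)
```

Larger alpha values penalize trees with more leaves, so the loop trades a small increase in training error for better validation accuracy, exactly the trade-off described above.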
Random Forest (Bootstrap Aggregation/Bagging)

Random forest is an ensemble learning algorithm that can be used for classification and regression.

It operates by constructing a multitude of decision trees at training time and outputting the class (classification) or mean prediction (regression) of the individual trees.

Random forests are built using decision trees, which are easy to build but have limitations in predictive learning: they have problems in terms of classification accuracy. How do we overcome this? By using a random forest.

Random forest combines the simplicity of the decision tree with improved predictive accuracy.
Bagging: Bootstrap sampling (with replacement) + Aggregation of the trees' predictions.
• How to Build a Random Forest?
• How to Use it?
• How to Evaluate it?
Building a Random Forest
Step 1: Construct a bootstrapped dataset (randomly select sample(s), with replacement, from the original dataset D, with sample size s <= D).

Step 2: Create a decision tree using the bootstrapped dataset (use a random subset of variables (columns) at each step).

Step 3: Repeat steps 1 and 2 (a variety of trees is built, which makes the random forest more effective than individual classifiers). A sketch of steps 1 and 2 follows below.
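To make Steps 1 and 2 concrete, here is a hedged NumPy sketch of drawing a bootstrapped dataset and choosing a random feature subset. The helper names, the toy arrays and the sqrt(n) feature count are assumptions made for illustration.

```python
# Illustrative sketch of bootstrapping rows and sampling feature columns.
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_sample(X, y):
    """Sample len(X) rows with replacement; rows never drawn are out-of-bag."""
    idx = rng.integers(0, len(X), size=len(X))
    oob_rows = np.setdiff1d(np.arange(len(X)), idx)
    return X[idx], y[idx], oob_rows

def random_feature_subset(n_features):
    """Random subset of columns to consider at a split (sqrt(n) is a common default)."""
    k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)

X = rng.normal(size=(10, 4))
y = rng.integers(0, 2, size=10)
X_boot, y_boot, oob_rows = bootstrap_sample(X, y)
print("out-of-bag row indices:", oob_rows)
print("features considered at a split:", random_feature_subset(X.shape[1]))
```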
How to Use it?

Step 1: Get an unlabeled sample.

Step 2: Run it on your variety of trees. Keep track of the predicted class label of each tree, and choose the majority vote of the class labels.
How to Evaluate it?

Out of Bag Sample – samples which were not used in the bootstrapped dataset are called the out of bag dataset. It can be used as a testing dataset: the out of bag dataset can be run through the trees that were built without it, and we check whether its class label prediction is correct or not.

Now we can measure the proportion of out of bag samples correctly classified and the proportion of out of bag samples incorrectly classified, which is called the out of bag error. Once it is measured, you can go back and optimize the tree construction.
Pros
• Same as DT
• Ensemble learning, improved accuracy
• Solves both classification as well as regression problems
• Handles missing values well
• Robust to outliers

Cons
• Complexity of constructing multiple trees
• Longer training period
• Complex memory challenges

TIME FOR RANDOM FOREST IMPLEMENTATION
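Again the slides defer to an implementation session; a minimal hedged sketch using scikit-learn's RandomForestClassifier, including the out-of-bag evaluation described above, could look as follows. The dataset and hyperparameter values are illustrative assumptions.

```python
# Minimal sketch of a random forest with out-of-bag evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of variables at each split
    oob_score=True,        # evaluate each tree on its out-of-bag samples
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)    # 1 - out-of-bag error
print("Prediction:", rf.predict(X[:1]))  # majority vote across the trees
```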


TIME-SERIES METHODS
A time series is a sequence of observations over a certain period. A univariate time series consists of the values taken by a single variable at periodic time instances over a period, and a multivariate time series consists of the values taken by multiple variables at the same periodic time instances over a period. The simplest example of a time series that all of us come across on a day to day basis is the change in temperature throughout the day or week or month or year.

The analysis of temporal data is capable of giving us useful insights on how a variable changes over time, or how it depends on the change in the values of other variable(s). This relationship of a variable to its previous values and/or other variables can be analyzed for time series forecasting and has numerous applications in artificial intelligence.

https://www.youtube.com/watch?v=wGUV_XqchbE

https://www.youtube.com/watch?v=GUq_tO2BjaU
Arima, Measures of Forecast Accuracy
ARIMA modeling
https://datascienceplus.com/time-series-analysis-using-arima-model-in-r/

Time series data are data points collected over a period of time as a sequence at regular time gaps. Time series data analysis means analyzing the available data to find out the pattern or trend in the data in order to predict some future values, which will, in turn, help make more effective and optimized business decisions.

Time series analysis can be classified as: univariate and multivariate.
Techniques used for time series analysis: ARIMA models.

ARIMA is the abbreviation for AutoRegressive Integrated Moving Average. Auto Regressive (AR) terms refer to the lags of the differenced series, Moving Average (MA) terms refer to the lags of errors, and I is the number of differences used to make the time series stationary.

Assumptions of the ARIMA model
1. Data should be stationary – by stationary it means that the properties of the series don't depend on the time when it is captured. A white noise series and a series with cyclic behavior can also be considered as stationary series. (A sketch of a stationarity check follows below.)
2. Data should be univariate – ARIMA works on a single variable. Auto-regression is all about regression with the past values.

Steps to be followed for ARIMA modeling:
1. Exploratory analysis
2. Fit the model
3. Diagnostic measures
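To illustrate the stationarity assumption, the following hedged Python sketch checks a simulated series with the Augmented Dickey-Fuller test from statsmodels and then differences it once. The simulated random walk and the choice of test are assumptions for illustration; the slides do not prescribe a specific test.

```python
# Hedged sketch: checking stationarity with the ADF test, then differencing.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # a random walk: non-stationary

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF p-value (levels): {p_value:.3f}")          # large p -> cannot reject non-stationarity

# Differencing once (the "I" in ARIMA) usually makes such a series stationary
adf_stat_d, p_value_d, *_ = adfuller(series.diff().dropna())
print(f"ADF p-value (1st difference): {p_value_d:.3f}")
```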
Exploratory analysis
•1. Autocorrelation analysis to examine serial
dependence: Used to estimate which value in the past
has a correlation with the current value. Provides the
p,d,q estimate for ARIMA models.
•2. Spectral analysis to examine cyclic behavior: Carried
out to describe how variation in a time series may be
accounted for by cyclic components. Also referred to as a
Frequency Domain analysis. Using this, periodic
components in a noisy environment can be separated
out.
•3. Trend estimation and decomposition: Used for seasonal adjustment. It seeks to construct, from an observed time series, a number of component series (that could be used to reconstruct the original series), where each of these has a certain characteristic. We get 4 components (a decomposition sketch follows below):
•Observed – the actual data plot
•Trend – the overall upward or downward movement of the data points
•Seasonal – any monthly/yearly pattern of the data points
•Random – the unexplainable part of the data

Before performing any EDA on the data, we need to understand the three components of time series data:
•Trend: A long-term increase or decrease in the data is referred to as a trend. It is not necessarily linear. It is the underlying pattern in the data over time.
•Seasonal: When a series is influenced by seasonal factors, i.e. the quarter of the year, the month, or the days of a week, seasonality exists in the series. It is always of a fixed and known period, e.g. a sudden rise in sales during Christmas.
•Cyclic: When data exhibit rises and falls that are not of a fixed period, we call it a cyclic pattern.
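To make the four components concrete, here is a hedged Python sketch using statsmodels' seasonal_decompose on a simulated monthly series. The series, its frequency and the additive model are assumptions for illustration.

```python
# Illustrative sketch: extracting observed/trend/seasonal/residual components.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (np.linspace(10, 30, 48)                        # trend
          + 5 * np.sin(2 * np.pi * np.arange(48) / 12)   # yearly seasonality
          + np.random.default_rng(0).normal(0, 1, 48))   # random component
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # overall movement of the data points
print(result.seasonal.head(12))         # the repeating monthly pattern
print(result.resid.dropna().head())     # the unexplainable part of the data
```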
Fit the model
Once the data is ready and satisfies all the assumptions of modeling, to determine the order of the model to be fitted to the data, we need three variables: p, d, and q, which are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.

Diagnostic measures
Check residual autocorrelations:
•For overall significance, and for significance at individual lags
•Box-Pierce test, or Ljung-Box test
•Look at the acf and pacf of model residuals

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

pacf() at lag k is the partial autocorrelation function, which describes the correlation between all data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregression (AR) coefficients (the order p) in an ARIMA model. A combined sketch of the three steps follows below.
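Putting the three steps together, here is a hedged Python sketch. The linked article works in R with acf()/pacf(); statsmodels plays the equivalent role here, and the simulated series and the order (1, 1, 1) are illustrative assumptions rather than recommended settings.

```python
# Hedged sketch of the workflow: exploratory ACF/PACF, fit, then diagnostics.
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
series = pd.Series(np.cumsum(rng.normal(size=300)))    # simulated series

# 1. Exploratory analysis: ACF/PACF of the differenced series suggest q and p
plot_acf(series.diff().dropna(), lags=20)
plot_pacf(series.diff().dropna(), lags=20)

# 2. Fit the model with a chosen (p, d, q)
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.summary())

# 3. Diagnostic measures: Ljung-Box test on the residuals
print(acorr_ljungbox(model.resid, lags=[10]))
```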
Measure of Forecast Accuracy:
Forecast Accuracy can be defined as the deviation of the Forecast or Prediction from the actual results.
Error = Actual demand – Forecast

We measure Forecast Accuracy by 2 methods:

1. Mean Forecast Error (MFE)

For n time periods where we have actual demand and forecast values:
MFE = (1/n) × Σ (Actual_t – Forecast_t)
Ideal value = 0;
MFE > 0: the model tends to under-forecast
MFE < 0: the model tends to over-forecast

2. Mean Absolute Deviation (MAD)

For n time periods where we have actual demand and forecast values:
MAD = (1/n) × Σ |Actual_t – Forecast_t|
While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.

Uses of Forecast error:


 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models
 Identify forecast models that need adjustment
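As a small worked example of the two measures defined above, the following hedged Python sketch computes MFE and MAD for a toy set of actual and forecast values. The numbers are made up for illustration.

```python
# Illustrative computation of MFE and MAD for a short demand series.
import numpy as np

actual   = np.array([100, 120, 110, 130, 125])
forecast = np.array([ 98, 125, 105, 128, 130])

errors = actual - forecast          # Error = Actual demand - Forecast
mfe = errors.mean()                 # Mean Forecast Error (bias)
mad = np.abs(errors).mean()         # Mean Absolute Deviation (size of errors)

print(f"MFE = {mfe:+.2f}")  # > 0 -> under-forecast on average, < 0 -> over-forecast
print(f"MAD = {mad:.2f}")
```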
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage and especially in data
warehousing that:

 Extracts data from homogeneous or heterogeneous data sources


 Transforms the data for storing it in proper format or structure for querying and analysis purpose
 Loads it into the final target (database, more specifically, operational data store, data mart, or data
warehouse)

Usually all three phases execute in parallel. Since the data extraction takes time, while the data is being pulled another transformation process executes, processing the already received data and preparing it for loading; as soon as there is some data ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases.

ETL systems commonly integrate data from multiple applications (systems), typically developed and
supported by different vendors or hosted on separate computer hardware. The disparate systems
containing the original data are frequently managed and operated by different employees. For example, a
cost accounting system may combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
 Anatella
 Alteryx
 CampaignRunner
 ESF Database Migration Toolkit
 Informatica PowerCenter
 Talend
 IBM InfoSphere DataStage
 Ab Initio
 Oracle Data Integrator (ODI)
 Oracle Warehouse Builder (OWB)
 Microsoft SQL Server Integration Services (SSIS)
 Tomahawk Business Integrator by Novasoft Technologies.
 Pentaho Data Integration (or Kettle), an open-source data integration framework
 Stambia
 Diyotta DI-SUITE for Modern Data Integration
 FlyData
 Rhino ETL
There are various steps involved in ETL. They are as below in detail:
Extract
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
 Update notification - if the source system is able to provide a notification that a record has been changed and describe the
change, this is the easiest way to get the data.
 Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way one can
get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to be able to
identify changes. Full extract handles deletions as well.
 When using Incremental or Full extracts, the extract frequency is extremely important; particularly for full extracts, the data volumes can be in the tens of gigabytes.

Clean
The cleaning step is one of the most important as it ensures the quality of the data in the data warehouse. Cleaning should perform
basic data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to
standard Male/Female/Unknown)
 Convert null values into a standardized Not Available/Not Provided value. A pandas sketch of these unification rules follows below.
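As an illustration of these unification rules, here is a hedged pandas sketch. The column names, category spellings and sample values are assumptions for this example, not taken from any real warehouse schema.

```python
# Illustrative sketch: unify identifiers and standardize null values with pandas.
import pandas as pd

raw = pd.DataFrame({
    "sex":  ["M", "F", None, "Man", "Woman"],
    "city": ["Pune", None, "Delhi", None, "Chennai"],
})

sex_map = {"M": "Male", "Man": "Male", "F": "Female", "Woman": "Female"}

clean = raw.copy()
clean["sex"] = clean["sex"].map(sex_map).fillna("Unknown")   # unify identifiers
clean["city"] = clean["city"].fillna("Not Available")        # standardize nulls

print(clean)
```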
FEATURE SELECTION

1. Demo
2. https://www.youtube.com/watch?v=VEBax2WMbEA
3.Assignment
