Unit IV
Object Segmentation
UNIT IV Syllabus
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised Learning, Tree Building – Regression,
Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees etc. Time Series Methods: Arima, Measures of
Forecast Accuracy, STL approach, Extract features from generated model as Height, Average Energy etc and Analyze for
prediction
Object Segmentation
In this scenario, you are the Marketing Director of a large retail bank. You want to customize your communications using data modeling. The bank wants to offer a new financial product to its customers. Your project consists of launching a direct marketing campaign aimed at promoting this product. In order to customize the marketing messages from the bank and improve communication with the various customers and prospects for this new product, the senior management of the bank asks you to build a segmentation model of the customers of this product. Using Modeler - Segmentation/Clustering, you can rapidly develop a descriptive model at the least possible cost. This model shows the characteristic profiles of the customers interested in your new product, and thus responds to your business issue and fulfills your objectives.

Your Objective
Consider the following case. Using Regression/Classification, you have contacted the prospects most likely to be interested in your new financial product, and identified the ideal number of prospects to contact out of the entire database, meeting the deadlines and staying within the budget you were allowed. To improve the rate of return of your campaign, senior management asks you to:
● Build a segmentation model of your customers,
● Analyze the characteristics of the identified clusters,
● Define customized communications for each cluster.
The segmentation model in particular should allow you to distinguish customer clusters by virtue of their propensity to purchase the new high-end savings product proposed by your firm. You will optimize your understanding of your customers.

6.2.1 Intuitive Method
This method consists of using your knowledge of the various profiles exhibited by your customers. Thanks to the domain-specific knowledge that you have of your customers, you determine the criteria of the segmentation model intuitively and build the clusters yourself. The main disadvantage of this method is that the number of information items available for each customer will invariably grow with time. The more data your database accumulates, the harder it is for you to manually create clusters that take all the data into consideration and to develop a response to your business issue. Furthermore, as the increasing volume of information requires you to build segmentation models with increasing frequency, the time required to build these models becomes increasingly significant. Finally, management may want you to rationalize your methods and to perform your segmentation using a method not based purely on intuition; defending a segmentation method based on intuition alone may be difficult.
6.2.2 Classical Statistical Method
On the basis of the information that you have, a data mining expert could build a segmentation model. In other words, you could ask a statistical expert to create a mathematical model that would allow you to build clusters based on the profiles of your customers. To implement this method, the statistician must:
● Perform a detailed analysis of your database.
● Prepare your database down to the smallest detail, specifically, encoding the variables as a function of their type (nominal, ordinal or continuous) in preparation for segmentation. The encoding strategy used will determine the type of segmentation model obtained. At this step, the statistician may unconsciously bias the results.
● Test different types of algorithms (K-means, both ascending and descending hierarchical segmentation models) and select the one best suited to your business issue.
● Evaluate the relevance of the clusters obtained, in particular, the response to your domain-specific business issue.
After a few weeks, the statistical expert will be able to provide a certain number of clusters, or homogeneous groups, to which each of the individuals in your database is assigned.
This method presents significant constraints. You must:
● Ensure that your statistical expert, who is usually from an external department, is available for the scheduled period,
● Ensure that the modeling costs will fit into your budget,
● Spend time explaining your domain-specific business issue to the statistician,
● Spend time understanding the results that are provided,
● Ask a programmer to write a program to determine the cluster associated with any new individual added to your database.
In addition, this method is not systematic: two statisticians performing this segmentation on the same dataset could obtain different results.

6.2.3 Automated Analytics Method
Segmentation/Clustering allows you to build a segmentation model of your customers in a few minutes, taking into consideration the interest expressed by your customers in your new product. Segmentation/Clustering automatically detects interactions between the variables to build homogeneous sub-sets, or clusters. Each cluster is homogeneous with respect to the entire set of variables, and in particular with respect to the target variable, that is, for example, "responded positively to my test". You will discover the characteristics of the different clusters, such as those with an excellent response rate and those with a poor response rate. In addition, if your customer database contains customer expenditures on your other products, you will also obtain information on product sale synergies, by cluster. Using Segmentation/Clustering, you have access to all the analytical features needed to define the type of message to be sent to each cluster. You have homogeneous clusters that will allow you to respond to your business issue. Of particular importance, this segmentation is systematic: the results obtained do not represent a particular point of view of your data, and they are robust and consistent. Two people performing this segmentation using the method would obtain the same results.
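To make the automated approach concrete, here is a minimal sketch of customer segmentation with K-means clustering in Python (scikit-learn). This is an illustration, not the Modeler tool itself; the file name, column names, and the choice of 5 clusters are assumptions:

    # Sketch: K-means segmentation of customer profiles (hypothetical data).
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    customers = pd.read_csv("customers.csv")             # hypothetical file
    features = customers[["age", "income", "balance"]]   # hypothetical columns

    # Scale features so no single variable dominates the distance metric.
    scaled = StandardScaler().fit_transform(features)

    # Build 5 clusters; the right number is a business/statistical choice.
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)
    customers["cluster"] = kmeans.labels_

    # Profile each cluster to define a customized communication per segment.
    print(customers.groupby("cluster")[["age", "income", "balance"]].mean())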
Decision Tree
A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems.
It is one of the oldest and most widely used practical methods for supervised learning.
Tree models where the target variable takes a discrete set of values are called classification trees.
Decision trees where the target variable takes continuous values (typically real numbers) are called regression trees. CART covers both.
Definition:
A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label.
Tree Building – Regression, Classification
Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.
Classification tree analysis is when the predicted outcome is the class to which the data belongs.
Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
In general, decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. They are among the most powerful algorithms in the supervised learning category.
• A decision tree is a structure that includes a root node,
branches, and leaf nodes.
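As a minimal sketch (assuming scikit-learn and its bundled iris data, used purely for illustration), both tree types look like this:

    # Sketch: classification and regression trees with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X, y = load_iris(return_X_y=True)

    # Classification tree: the target is a discrete class label.
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(clf.predict(X[:5]))

    # Regression tree: the target is a continuous value (here, petal width
    # predicted from the other three measurements, just for illustration).
    reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[:, :3], X[:, 3])
    print(reg.predict(X[:5, :3]))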
CHAID:
CHAID stands for CHi-squared Automatic Interaction Detector. It is the oldest decision tree algorithm, proposed in 1980 by Gordon V. Kass (CART followed in 1984). Here, chi-square is the metric used to find the significance of a feature: the higher the value, the higher the statistical significance. Like the other algorithms, CHAID builds decision trees for classification problems, meaning it expects data sets with a categorical target variable.

Before CHAID, Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a quantitative variable. They called the method AID, for Automatic Interaction Detection. The algorithm performs stepwise splitting. It begins with a single cluster of cases and searches a candidate set of predictor variables for a way to split this cluster into two clusters. Each predictor is tested for splitting as follows: sort all the n cases on the predictor and examine all n-1 ways to split the cluster in two. For each possible split, compute the within-cluster sum of squares about the mean of the cluster on the dependent variable. Choose the best of the n-1 splits to represent the predictor's contribution. Now do this for every other predictor. For the actual split, choose the predictor and its cut point which yield the smallest overall within-cluster sum of squares. Categorical predictors require a different approach: since categories are unordered, all possible splits between categories must be considered. For deciding on one split of k categories into two groups, this means that 2^(k-1) - 1 possible splits must be considered. Once a split is found, its suitability is measured on the same within-cluster sum of squares as for a quantitative predictor. Morgan and Sonquist called their algorithm AID because it naturally incorporates interaction among predictors. Interaction is not correlation; it has to do instead with conditional discrepancies. In the analysis of variance, interaction means that a trend within one level of a variable is not parallel to a trend within another level of the same variable. In the ANOVA model, interaction is represented by cross-products between predictors. In the tree model, it is represented by branches from the same node which have different splitting predictors further down the tree.
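The AID split search described above can be sketched in a few lines of Python; this is an illustrative re-implementation of the stated procedure, not Morgan and Sonquist's original program:

    # Sketch: AID-style best split of one quantitative predictor.
    import numpy as np

    def best_split(x, y):
        # Sort cases on the predictor, try all n-1 splits, and return the
        # cut point minimizing the total within-cluster sum of squares of y.
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        best_cut, best_wss = None, np.inf
        for i in range(1, len(ys)):                 # the n-1 candidate splits
            left, right = ys[:i], ys[i:]
            wss = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if wss < best_wss:
                best_cut, best_wss = (xs[i - 1] + xs[i]) / 2, wss
        return best_cut, best_wss

    x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
    y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
    print(best_split(x, y))    # cut near 6.5 separates the two level groups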
CART:
CART stands for Classification And Regression Trees. The CART algorithm was introduced in Breiman et al. (1984). A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The CART growing method attempts to maximize within-node homogeneity. The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity. For example, a terminal node in which all cases have the same value of the dependent variable is a homogeneous node that requires no further splitting because it is "pure". For categorical (nominal, ordinal) dependent variables the common measure of impurity is Gini, which is based on the squared probabilities of membership for each category. Splits are found that maximize the homogeneity of the child nodes with respect to the value of the dependent variable.
Impurity Measure:
GINI Index
Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly
chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of
labels in the subset. Gini impurity can be computed by summing the probability fi of each item being chosen times
the probability 1-fi of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node
fall into a single target category.
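In symbols, for class proportions f_i in the node, Gini = Σ f_i (1 - f_i) = 1 - Σ f_i². A minimal sketch:

    # Sketch: Gini impurity of a node from its class proportions.
    def gini(proportions):
        return 1.0 - sum(f * f for f in proportions)

    print(gini([1.0]))         # 0.0 -> pure node, no further splitting needed
    print(gini([0.5, 0.5]))    # 0.5 -> maximally impure two-class node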
Overfitting, Pruning and Complexity
If a decision tree is fully grown, it may lose some generalization capability. This phenomenon is known as overfitting. Pruning counters it by cutting back branches that fit noise in the training data, trading a small increase in training error for lower model complexity and better accuracy on unseen data.
Random forest is an ensemble learning algorithm that can be used for classification and regression. It operates by constructing a multitude of decision trees at training time and outputting the class (classification) or mean prediction (regression) of the individual trees.
Random forests are built from decision trees, which are easy to construct but, individually, tend to overfit and have limited classification accuracy. Random forest overcomes this by combining many trees: it keeps the simplicity of the decision tree while improving predictive accuracy.
[Figure: random forest workflow - bootstrap samples drawn from the data (with replacement), one tree trained per sample, predictions combined by aggregation.]
• How to Build a Random Forest?
• How to Use it?
• How to Evaluate it?
Building a Random Forest
Step 1: Construct a bootstrapped dataset (randomly select samples, with replacement, from the original dataset D; the sample size s <= |D|). A sketch follows below.
Step 2: Build a decision tree on the bootstrapped dataset, considering only a random subset of the features at each split.
Step 3: Repeat steps 1 and 2 many times. The resulting variety of trees is what makes a random forest more effective than individual classifiers.
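A minimal sketch of Step 1's bootstrap sampling (plain NumPy, illustrative only); note how the samples never drawn form the out-of-bag set discussed next:

    # Sketch: drawing one bootstrapped dataset (sampling with replacement).
    import numpy as np

    rng = np.random.default_rng(0)
    D = np.arange(10)                           # stand-in for the original dataset
    idx = rng.integers(0, len(D), size=len(D))  # indices drawn with replacement
    bootstrap = D[idx]
    out_of_bag = np.setdiff1d(D, bootstrap)     # samples never drawn: the OOB set
    print(bootstrap, out_of_bag)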
How to Use it?
Out-of-bag samples: the samples that were not used in a tree's bootstrapped dataset are called the out-of-bag dataset. They can serve as a testing dataset: each out-of-bag sample is run through the trees that were built without it, and the predicted class label is checked against the true one (correct or not).
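A hedged end-to-end sketch with scikit-learn (iris data used purely for illustration); oob_score=True asks the forest to evaluate itself on exactly these out-of-bag samples:

    # Sketch: random forest with out-of-bag (OOB) evaluation.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(
        n_estimators=200,    # number of bootstrapped trees
        oob_score=True,      # score each sample on trees that never saw it
        random_state=0,
    ).fit(X, y)

    print("OOB accuracy:", forest.oob_score_)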
https://www.youtube.com/watch?v=wGUV_XqchbE
https://www.youtube.com/watch?v=GUq_tO2BjaU
ARIMA, Measures of Forecast Accuracy
ARIMA modeling
https://datascienceplus.com/time-series-analysis-using-arima-model-in-r/
Time series data are data points collected over a period of time as a sequence at successive time intervals. Time series analysis means analyzing the available data to find the pattern or trend in the data in order to predict future values, which in turn helps make business decisions more effective and optimized. Time series analysis can be classified as univariate or multivariate. Techniques used for time series analysis include ARIMA models.

ARIMA is the abbreviation for AutoRegressive Integrated Moving Average. AutoRegressive (AR) terms refer to the lags of the differenced series, Moving Average (MA) terms refer to the lags of the errors, and I (Integrated) is the number of differences used to make the time series stationary.

Assumptions of the ARIMA model:
1. Data should be stationary - stationarity means that the properties of the series do not depend on the time at which it is captured. A white noise series, and a series with cyclic behavior (but no trend or seasonality), can also be considered stationary.
2. Data should be univariate - ARIMA works on a single variable. Auto-regression is regression of the series on its own past values.

Steps to be followed for ARIMA modeling:
1. Exploratory analysis
2. Fit the model
3. Diagnostic measures
Exploratory analysis
Before performing any EDA on the data, we need to understand the three components of time series data:
•Trend: a long-term increase or decrease in the data. It is not necessarily linear; it is the underlying pattern in the data over time.
•Seasonal: when a series is influenced by seasonal factors (the quarter of the year, the month, or the day of the week), seasonality exists in the series. It is always of a fixed and known period, e.g. a sudden rise in sales during Christmas.
•Cyclic: when data exhibit rises and falls that are not of a fixed period, we call it a cyclic pattern, e.g. business cycles, whose length varies.

The exploratory analysis itself involves:
•1. Autocorrelation analysis to examine serial dependence: used to estimate which values in the past have a correlation with the current value. Provides the p, d, q estimates for ARIMA models.
•2. Spectral analysis to examine cyclic behavior: carried out to describe how variation in a time series may be accounted for by cyclic components. Also referred to as frequency-domain analysis. Using this, periodic components in a noisy environment can be separated out.
•3. Trend estimation and decomposition: used for seasonal adjustment. It seeks to construct, from an observed time series, a number of component series (that could be used to reconstruct the original series), each with a certain characteristic. We get four components:
•Observed - the actual data plot
•Trend - the overall upward or downward movement of the data points
•Seasonal - any monthly/yearly pattern of the data points
•Random - the unexplainable part of the data
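The STL approach named in the syllabus (Seasonal-Trend decomposition using Loess) produces exactly these components. A sketch with statsmodels, where the monthly series below is synthetic and purely illustrative:

    # Sketch: STL (Seasonal-Trend decomposition using Loess) with statsmodels.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import STL

    # Synthetic monthly series: trend + yearly seasonality + noise.
    idx = pd.date_range("2015-01-01", periods=96, freq="MS")
    rng = np.random.default_rng(0)
    data = (np.linspace(100, 160, 96)                       # trend
            + 10 * np.sin(2 * np.pi * np.arange(96) / 12)   # seasonal
            + rng.normal(scale=2, size=96))                 # random
    series = pd.Series(data, index=idx)

    result = STL(series, period=12).fit()
    # result.trend, result.seasonal, result.resid are the component series;
    # trend + seasonal + resid reconstructs the observed data.
    print(result.trend.head())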
Fit the model
Once the data is ready and satisfies all the assumptions of modeling, to determine the order of the model to be fitted to the data we need three variables: p, d, and q. These are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model, respectively.

Diagnostic measures
Check the residual autocorrelations:
•for overall significance, and for significance at individual lags
•Box-Pierce test, or Ljung-Box test
•look at the ACF and PACF of the model residuals
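A sketch of the fit-and-diagnose loop with statsmodels; the series is synthetic and the order (1, 1, 1) is an assumed (p, d, q), not a recommendation:

    # Sketch: fit an ARIMA(p, d, q) model and check residual autocorrelation.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.stats.diagnostic import acorr_ljungbox

    # Synthetic random-walk series; in practice, load your own data.
    rng = np.random.default_rng(0)
    series = pd.Series(np.cumsum(rng.normal(size=120)))

    model = ARIMA(series, order=(1, 1, 1)).fit()   # assumed order (p, d, q)
    print(model.summary())

    # Ljung-Box test on the residuals: large p-values suggest no leftover
    # autocorrelation, i.e. the model captured the serial dependence.
    print(acorr_ljungbox(model.resid, lags=[10]))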
Mean Forecast Error (MFE): for n time periods where we have actual demand A_t and forecast values F_t,
MFE = (1/n) Σ (A_t - F_t), summed over t = 1..n.
Ideal value = 0;
MFE > 0: the model tends to under-forecast;
MFE < 0: the model tends to over-forecast.
Mean Absolute Deviation (MAD): for the same n time periods,
MAD = (1/n) Σ |A_t - F_t|, summed over t = 1..n.
While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.
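A minimal sketch computing both measures from actual and forecast values (the numbers are illustrative):

    # Sketch: Mean Forecast Error (bias) and Mean Absolute Deviation (size).
    import numpy as np

    actual   = np.array([100, 110, 120, 130])    # illustrative demand
    forecast = np.array([ 98, 112, 115, 127])    # illustrative forecasts

    errors = actual - forecast
    print("MFE:", errors.mean())           # > 0 here: model under-forecasts
    print("MAD:", np.abs(errors).mean())   # average size of the errors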
ETL stands for Extract, Transform, Load. Usually all three phases execute in parallel: since data extraction takes time, a transformation process runs while the data is being pulled, processing the already-received data and preparing it for loading; and as soon as some data is ready to be loaded into the target, the loading kicks off without waiting for the previous phases to complete.
ETL systems commonly integrate data from multiple applications (systems), typically developed and
supported by different vendors or hosted on separate computer hardware. The disparate systems
containing the original data are frequently managed and operated by different employees. For example, a
cost accounting system may combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
● Anatella
● Alteryx
● CampaignRunner
● ESF Database Migration Toolkit
● Informatica PowerCenter
● Talend
● IBM InfoSphere DataStage
● Ab Initio
● Oracle Data Integrator (ODI)
● Oracle Warehouse Builder (OWB)
● Microsoft SQL Server Integration Services (SSIS)
● Tomahawk Business Integrator by Novasoft Technologies
● Pentaho Data Integration (or Kettle), an open-source data integration framework
● Stambia
● Diyotta DI-SUITE for Modern Data Integration
● FlyData
● Rhino ETL
There are various steps involved in ETL; they are described in detail below.
Extract
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.
There are several ways to perform the extract:
Update notification - if the source system is able to provide a notification that a record has been changed and describe the change, this is the easiest way to get the data.
Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify these changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly. A sketch of this approach follows below.
Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way to get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes. A full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can run to tens of gigabytes.
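A sketch of the incremental approach, using a "last modified" watermark; the table, columns, and SQLite connection are assumptions for illustration:

    # Sketch: incremental extract driven by a modification-time watermark.
    import sqlite3

    def incremental_extract(conn, last_watermark):
        # Pull only the rows modified since the previous run; the table and
        # column names here are hypothetical.
        rows = conn.execute(
            "SELECT id, amount, modified_at FROM source_orders "
            "WHERE modified_at > ? ORDER BY modified_at",
            (last_watermark,),
        ).fetchall()
        # Advance the watermark to the newest row seen, for the next run.
        new_watermark = rows[-1][2] if rows else last_watermark
        return rows, new_watermark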
Clean
The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
Making identifiers unique (e.g. the sex categories M/F/null and Man/Woman/Not Available are translated to the standard Male/Female/Unknown)
Converting null values into a standardized Not Available/Not Provided value
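A sketch of such unification rules with pandas; the column name and category spellings are assumptions:

    # Sketch: basic data-unification rules from the Clean step.
    import pandas as pd

    df = pd.DataFrame({"sex": ["M", "Woman", None, "Male", "F"]})  # hypothetical

    # Translate the various encodings to one standard vocabulary.
    sex_map = {"M": "Male", "Male": "Male", "Man": "Male",
               "F": "Female", "Woman": "Female", "Female": "Female"}
    df["sex"] = df["sex"].map(sex_map)

    # Convert remaining nulls into a standardized value.
    df["sex"] = df["sex"].fillna("Unknown")
    print(df)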
FEATURE SELECTION
1. Demo
2. https://www.youtube.com/watch?v=VEBax2WMbEA
3. Assignment