
UE21CS342AA2 - Unit-1 Part - 2

This document discusses data preprocessing techniques for data integration and reduction. It describes how data integration involves merging data from multiple sources which can help reduce redundancies and inconsistencies. Challenges in data integration include schema and object matching. Data reduction techniques like dimensionality reduction and numerosity reduction can obtain a smaller representation of the dataset while maintaining analytical quality. Dimensionality reduction techniques include principal component analysis and attribute subset selection. Wavelet transforms are also discussed as a dimensionality reduction method.


DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 5 : Data Preprocessing – Data Integration and Reduction

Department of Computer Science and Engineering


DATA ANALYTICS
Data Integration
• Data analysis often requires data integration – the merging of data from multiple data stores into a coherent store.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting dataset. This can improve the accuracy and speed of the subsequent data analysis process.
• The semantic heterogeneity and structure of data pose great challenges in data integration.
• How can we match schema and objects from different sources?
• Schema Integration!
• Example : How can a data analyst be sure that the attribute customer_id in table A and customer_number in table B refer to the same attribute?
DATA ANALYTICS
Data Integration
• Entity identification problem : identify real-world entities from multiple data sources. Example : Bill Clinton = William Clinton.
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different.
• Possible reasons : different representations, different scales, example – metric vs British units.
• During integration, special attention must be paid to the structure of the data. This is to ensure that any attribute functional dependencies and referential constraints in the source system match those in the target system. For example, in one system a discount may be applied to the entire order, whereas in another system it is applied to each individual line item. If this difference goes undetected, items in the target system may be improperly discounted.
DATA ANALYTICS
Redundancy in Data Integration
Redundant data often occur during the integration of multiple databases.
• Object identification : the same attribute or object may have different names in different databases, which causes redundancy.
• Derivable data : an attribute may be redundant if it can be derived from another attribute or set of attributes. For example, annual revenue can be derived from monthly revenue.
• Some redundancies can be detected by correlation analysis.
DATA ANALYTICS
χ² (chi-square) test

The χ² (chi-square) test checks the independence of two variables in a contingency table.
• Null hypothesis : the two variables are independent.
• Alternate hypothesis : the two variables are not independent.

$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$

• Expected stands for what we would expect if the null hypothesis were true.
• The larger the value of χ², the more likely the variables are correlated.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
• Can be used for categorical variables where entries are numbers (counts) and not percentages or fractions (10% of 100 needs to be entered as 10).
• Correlation does not imply causation.
 The number of hospitals and the number of car thefts in a city may appear to be correlated. Both are causally linked to a third variable : population.
DATA ANALYTICS
Covariance analysis : An Example

https://mathcs.clarku.edu/~djoyce/ma217/covar.pdf
DATA ANALYTICS
Correlation (viewed as a linear relationship)
• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize data objects A and B, and then take their dot product.
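The standardize-and-dot-product recipe can be checked in a few lines of R (a sketch reusing the stock vectors from the covariance example later in this lecture):

# Minimal sketch: correlation as a dot product of standardized vectors
A <- c(2, 3, 5, 4, 6)
B <- c(5, 8, 10, 11, 14)

zA <- (A - mean(A)) / sd(A)                 # standardize A
zB <- (B - mean(B)) / sd(B)                 # standardize B

r_manual <- sum(zA * zB) / (length(A) - 1)  # dot product, scaled by n - 1
r_manual
cor(A, B)                                   # matches R's built-in Pearson r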
DATA ANALYTICS
Covariance analysis (Numeric Data)
• Positive covariance : if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance : if Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence : if A and B are independent, Cov(A,B) = 0, but the converse is not true : some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (example, the data follow multivariate normal distributions) does Cov(A,B) = 0 imply independence.
DATA ANALYTICS
Covariance analysis : An Example
Suppose two stocks A and B have the following values in one week : (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

Question : if the stocks are affected by the same industry trends, will their prices rise or fall together?

E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
Cov(A,B) = E(A·B) − E(A)·E(B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4×9.6 = 42.4 − 38.4 = 4
Since Cov(A,B) > 0, the two stock prices tend to rise together.
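The same arithmetic in R; note that R's built-in cov() is the sample covariance (it divides by n − 1), so it returns 5 rather than the population value 4 computed above:

# Minimal sketch: population covariance for the stock example
A <- c(2, 3, 5, 4, 6)
B <- c(5, 8, 10, 11, 14)

mean(A * B) - mean(A) * mean(B)   # E[AB] - E[A]E[B] = 4 (divides by n)
cov(A, B)                         # sample covariance (divides by n - 1) = 5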
DATA ANALYTICS
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should be detected at the tuple level (for example, where there are two or more identical tuples for a unique data entry case).
• The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.
DATA ANALYTICS
Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different sources may differ.
• This may be due to differences in representation, scaling or encoding.
• For instance, a weight attribute may be stored in metric units in one system and British imperial units in another.
DATA ANALYTICS
Data Reduction
• Data reduction techniques are applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data.
• Analysis on the reduced dataset should be more efficient, yet produce the same or almost the same analytical results.
• Why do we need data reduction? A database or a data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
DATA ANALYTICS
Data Reduction Strategies
• Dimensionality reduction – the process of removing unimportant attributes
• Wavelet transforms
• Principal Component Analysis (PCA)
• Attribute subset selection
• Numerosity reduction – replaces the original data volume by alternative, smaller forms of data representation
• Regression and log-linear models
• Histograms, clustering and sampling
• Data cube aggregation
• Data compression – transformations are applied on the data to obtain a reduced or compressed representation of the original data
DATA ANALYTICS
Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse.
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful.
• The number of possible combinations of subspaces grows exponentially.
• Dimensionality reduction
• Avoids the curse of dimensionality.
• Helps to eliminate irrelevant attributes and reduce noise.
• Reduces the time and space required for data analytics.
DATA ANALYTICS
Mapping data to a new space
• Fourier transform
• Wavelet transform

[Figure : two sine waves, two sine waves + noise, and their frequency-domain views]
DATA ANALYTICS
Wavelet Transforms
What is a Wavelet?
A wavelet is a wave-like oscillation that is localized in time. Wavelets have two basic properties: scale and location. Scale (or dilation) defines how “stretched” or “squished” a wavelet is. This property is related to frequency as defined for waves. Location defines where the wavelet is positioned in time (or space).
DATA ANALYTICS
Wavelet transformation
• Discrete wavelet transform (DWT) is used for linear signal processing and multi-resolution analysis.
• It decomposes a signal into different frequency sub-bands. It is applicable to n-dimensional signals.
• Data is transformed to preserve the relative distance between objects at different resolutions.
• Compressed approximation : it stores only a small fraction of the strongest of the wavelet coefficients.
• It is insensitive to noise and input order.

[Figures : an example of DWT; wavelet families]
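To make the sub-band idea concrete, here is a one-level Haar DWT written directly in base R (a sketch; the toy signal is an assumption, and real work would use a wavelet package):

# Minimal sketch: one level of a Haar DWT, no packages required
x <- c(2, 2, 0, 2, 3, 5, 4, 4)                 # toy signal, length a power of 2

pairs  <- matrix(x, nrow = 2)                  # adjacent pairs as columns
approx <- (pairs[1, ] + pairs[2, ]) / sqrt(2)  # low-frequency sub-band
detail <- (pairs[1, ] - pairs[2, ]) / sqrt(2)  # high-frequency sub-band

approx; detail
# A compressed approximation keeps only the strongest coefficients
# and sets the rest to zero before inverting the transform.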
DATA ANALYTICS
Wavelet Transforms
Why wavelet transforms?
• A major disadvantage of the Fourier transform is that it captures global frequency information, meaning frequencies that persist over an entire signal. An alternative approach is the wavelet transform, which decomposes a function into a set of wavelets.
• The wavelet transform can extract local spectral and temporal information simultaneously.
• There is a variety of wavelets to choose from, as shown here.
DATA ANALYTICS
Wavelet transformation - Working
The basic idea is to compute “how much” of a wavelet is in a signal for a particular scale and location. For those familiar with convolutions, that is exactly what this is: a signal is convolved with a set of wavelets at a variety of scales.
In other words, we pick a wavelet of a particular scale. Then, we slide this wavelet across the entire signal, i.e. vary its location, where at each time step we multiply the wavelet and the signal. The product of this multiplication gives us a coefficient for that wavelet scale at that location.
DATA ANALYTICS
Wavelet transformation - Working
Like the Fourier transform, the wavelet transform deconstructs the original signal waveform into a series of basis waveforms, which in this case are called wavelets. However, unlike the simple sinusoidal waves of Fourier analysis, the wavelet shapes are complex and, at first sight, apparently arbitrary – they look like random squiggles (although in fact they fulfil rigorous mathematical requirements). One important feature that all wavelets share is that they are bounded, i.e. they decline to zero amplitude at some distance on either side of the centre, in obvious contrast to the sine/cosine waves used in Fourier analysis, which go on forever. This is the underlying key to the time localisation of the DWT.
There is a whole series of different types of “mother” wavelets (Daubechies, Coiflet, Symmlet etc.) available, and each type occurs in a range of sizes. A particular episode of wavelet analysis uses only one mother wavelet.
DATA ANALYTICS
Wavelet transformation - Working
After transformation of a raw data signal using a particular mother wavelet, you end up with basis waveforms consisting of a series of daughter wavelets. The daughter wavelets are all compressed or expanded versions of the mother wavelet (they have different scales or frequencies), and each daughter wavelet extends across a different part of the original signal (they have different locations).
The important point is that each daughter wavelet is associated with a corresponding coefficient that specifies how much the daughter wavelet at that scale contributes to the raw signal at that location. It is these coefficients that contain the information relating to the original input signal, since the daughter wavelets derived from a particular mother wavelet are completely fixed and independent of the input signal.
DATA ANALYTICS
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei, The Morgan Kaufmann Series in Data Management Systems, 3rd Edition, Chapters 3.3 – 3.4
• https://www.st-andrews.ac.uk/~wjh/dataview/tutorials/dwt.html
• https://towardsdatascience.com/the-wavelet-transform-e9cfa85d7b34
• https://www.cs.unm.edu/~mueen/Teaching/CS_521/Lectures/Lecture2.pdf
• https://medium.com/analytics-vidhya/understanding-principle-component-analysis-pca-step-by-step-e7a4bb4031d9
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 6 : Data and Dimensionality Reduction contd.

Department of Computer Science and Engineering
DATA ANALYTICS
Principal Component Analysis (PCA)
What is PCA?
Assume a survey has 50 questions in all. The following three are among them:
1. I feel comfortable around people
2. I easily make friends
3. I like going out
These queries could appear different. There is a catch, though: generally speaking, they aren't. They all gauge how extroverted you are.
Therefore, combining them is logical, right? That's where linear algebra and dimensionality reduction methods come in!
We want to lessen the complexity of the problem by minimizing the number of variables, since we have far too many variables that aren't all that different. That is the main idea behind dimensionality reduction.
DATA ANALYTICS
Principal Component Analysis (PCA)
• Finds a projection that captures the largest amount of variation in the data.
• The original data is projected onto a much smaller space, resulting in dimensionality reduction.
• We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.

[Figure : data points in the (x1, x2) plane with the principal eigenvector e]
DATA ANALYTICS
PCA - Steps
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
Step 4: Sort the eigenvalues and their corresponding eigenvectors.
Step 5: Pick the top k eigenvalues and form a matrix of their eigenvectors.
Step 6: Transform the original matrix.

Let's go step by step.


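The six steps can be reproduced in a few lines of R (a sketch using the built-in iris measurements as stand-in data; the FactoMineR/factoextra packages referenced later wrap the same computation):

# Minimal sketch of the six PCA steps (iris used as stand-in data)
X <- as.matrix(iris[, 1:4])         # four numeric features

Z   <- scale(X)                     # Step 1: standardize
C   <- cov(Z)                       # Step 2: covariance matrix
eig <- eigen(C)                     # Step 3: eigenvalues and eigenvectors
                                    # Step 4: eigen() returns them already
                                    #         sorted by decreasing eigenvalue
k <- 2
W <- eig$vectors[, 1:k]             # Step 5: matrix of the top-k eigenvectors
scores <- Z %*% W                   # Step 6: transform the original matrix

prcomp(X, scale. = TRUE)$x[, 1:k]   # built-in PCA agrees up to sign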
DATA ANALYTICS
PCA - Steps

[Four slides of worked-example figures walking through Steps 1–6, ending with the final transformed matrix]
DATA ANALYTICS
PCA using R (FactoMineR, factoextra)
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
DATA ANALYTICS
PCA using R (FactoMineR, factoextra)

[Three slides of figures showing the PCA-in-R workflow and its output, from the tutorial linked above]
DATA ANALYTICS
Attribute Subset Selection
• Attribute subset selection reduces the dataset size by removing irrelevant or redundant attributes.
• Redundant attributes : their information is contained in, or can be derived from, other attributes. Example : the MRP of a product and the corresponding sales tax paid.
• Irrelevant attributes : they contain no information that is useful for the data analysis task at hand. Example : SRN is irrelevant for predicting students' GPA.
• The goal of attribute subset selection is to find the minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
DATA ANALYTICS
Heuristic Search in Attribute Subset Selection
• For n attributes, there are 2^n possible subsets. An exhaustive search is infeasible.
• The best and the worst attributes are determined using tests of statistical significance, which assume that the attributes are independent of each other.
• Stepwise forward selection : the procedure starts with an empty set as the reduced set. In each iteration, the best of the remaining original attributes is selected and added to the reduced set.
• Stepwise backward elimination : the procedure starts with the full set of attributes. In each iteration, the worst attribute remaining in the set is removed.
• Combination of forward and backward selection : in each iteration, the procedure selects the best attribute and removes the worst from among the remaining attributes. (See the sketch below.)
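In R, stepwise search is commonly driven by AIC via step() rather than the per-attribute significance tests described above; a sketch on the built-in mtcars data:

# Minimal sketch: stepwise attribute selection with step() (AIC-based)
full  <- lm(mpg ~ ., data = mtcars)    # model with all attributes
empty <- lm(mpg ~ 1, data = mtcars)    # model with no attributes

fwd  <- step(empty, scope = formula(full), direction = "forward")
bwd  <- step(full, direction = "backward")
both <- step(empty, scope = formula(full), direction = "both")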
DATA ANALYTICS
Heuristic Search in Attribute Subset Selection

[Figure : greedy forward-selection, backward-elimination and decision-tree-induction search illustrated]
DATA ANALYTICS
Attribute Creation – Feature Generation
• Creating new features that can capture important information in the dataset more effectively than the original ones.
• For example, an attribute area can be added based on the attributes height and width. By combining attributes, accuracy can be improved and missing information about the relationships between the attributes can be discovered.
• Three general methodologies
▪ Attribute extraction – domain specific.
▪ Mapping data to a new space – Fourier transforms, wavelet transforms and manifold approaches.
▪ Attribute construction
o Combining features
DATA ANALYTICS
Numerosity Reduction
• Reduce data volumes by choosing alternative, smaller forms of data representation.
• Parametric methods
▪ Assume the data fits some model. Estimate the model parameters and store only the parameters, discarding the data (except possible outliers).
▪ Examples : linear regression, multiple regression, log-linear models
• Non-parametric methods
▪ Do not assume models.
▪ Examples : histograms, clustering, sampling
DATA ANALYTICS
Regression Analysis
• A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable and of one or more independent variables.
• The parameters are estimated so as to give a best fit of the data. The best fit is often evaluated using the least squares method.
• Regression analysis is used for prediction (including forecasting of time-series data), inference, hypothesis testing and modeling of causal relationships.

[Figure : data points (x1, y1) with the fitted line y = x + 1]
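A sketch of parametric numerosity reduction with least squares in R: the fitted coefficients (two numbers) stand in for the raw points (the synthetic data below is an assumption):

# Minimal sketch: store regression parameters instead of the raw data
set.seed(1)
x <- 1:100
y <- 2.5 * x + rnorm(100, sd = 5)          # synthetic (x, y) data

fit <- lm(y ~ x)                           # least-squares fit
coef(fit)                                  # 2 parameters replace 100 points

y_hat <- predict(fit, data.frame(x = x))   # approximate reconstruction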
DATA ANALYTICS
Log-linear model
• A log-linear model is a mathematical model that takes the form of a function whose logarithm equals a linear combination of the parameters of the model.
• It approximates discrete multidimensional probability distributions for a set of discretized attributes based on a smaller subset of dimensional combinations.
• It is useful for dimensionality reduction and data smoothing.
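Base R's loglin() fits such models; a sketch on the built-in HairEyeColor table, approximating the three-way distribution from its two-way margins:

# Minimal sketch: log-linear model fitted from lower-order margins
tab <- HairEyeColor                 # built-in 4 x 4 x 2 contingency table

# All two-way interactions, no three-way term
fit <- loglin(tab, margin = list(c(1, 2), c(1, 3), c(2, 3)), fit = TRUE)
fit$lrt                             # likelihood-ratio goodness-of-fit statistic
fit$fit[1:2, 1:2, 1]                # fitted (approximated) cell counts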
DATA ANALYTICS
Histogram Analysis
• A histogram for an attribute partitions its data distribution into disjoint subsets, referred to as bins or buckets.
• If each bucket represents only a single attribute value, it is called a singleton bucket.
• Partitioning rules :
▪ Equal-width : equal bucket range
▪ Equal-frequency : equal depth
• The following data are a list of AllElectronics prices for commonly sold items in $. The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Draw a histogram using singleton buckets and an equal-width bin of $10.
DATA ANALYTICS
Histogram Analysis

[Figures : the price histogram using singleton buckets, and using a bucket width of $10]
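Both histograms can be drawn in R (a sketch; the vector below is the AllElectronics price list from the previous slide):

# Minimal sketch: singleton buckets vs equal-width ($10) buckets
prices <- c(1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
            15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
            20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
            25, 25, 25, 25, 25, 28, 28, 30, 30, 30)

table(prices)                                # singleton buckets (one per value)
hist(prices, breaks = seq(0, 30, by = 10))   # equal-width buckets of $10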
DATA ANALYTICS
Clustering
• Records are partitioned into clusters such that records within a cluster are similar to one another and dissimilar to records in other clusters.
• A cluster representation of the data is used to replace the actual data.
• The cluster representation can be the centroid and the diameter (intra-cluster distance) of the cluster.
• It is very effective if the data is clustered, but not if it is smeared.
• Hierarchical clustering can also be used, with the clusters stored in multidimensional index tree structures.

(Figure : intra-class compactness vs inter-class separability, from https://www.researchgate.net/figure/The-concept-of-intra-class-compactness-and-inter-class-separability-in-a-two-dimension_fig3_325095062)
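A sketch of clustering-based reduction in R, replacing each record by its k-means centroid (iris is stand-in data):

# Minimal sketch: represent records by their cluster centroids
set.seed(7)
X  <- scale(iris[, 1:4])                  # numeric attributes only
km <- kmeans(X, centers = 3, nstart = 10)

km$centers                                # compact stand-in for the data
approx_X <- km$centers[km$cluster, ]      # each record -> its centroid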
DATA ANALYTICS
Sampling
• Sampling can be used as a data reduction technique, as it allows a large dataset to be represented by a much smaller random subset.
• An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample.
• In the context of data reduction, sampling is most commonly used to estimate the answer to an aggregate query.
• Using the central limit theorem (recall from Statistics for Data Science!!), it is possible to determine a sufficient sample size for estimating a given function within a specified degree of error.
• Important to remember :
▪ Choose a representative subset of the data.
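A sketch of simple random sampling in R and of using the sample to estimate an aggregate query (the population here is synthetic):

# Minimal sketch: simple random sampling for an aggregate query
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)    # synthetic population

srswor <- sample(population, 1000)                 # without replacement
srswr  <- sample(population, 1000, replace = TRUE) # with replacement

mean(srswor)       # sample estimate of the aggregate
mean(population)   # the answer it approximates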
DATA ANALYTICS
Types of Sampling methods

https://towardsdatascience.com/8-types-of-sampling-techniques-b21adcdd2124
DATA ANALYTICS
Data Cube Aggregation
• Imagine you have to perform an analysis on yearly sales at the Dunder Mifflin Paper Company.
• The data you receive has sales-per-quarter metrics from the year 2008 to 2010.
• Since you care about annual sales, the data can be aggregated so that the resulting data summarizes the annual sales rather than the quarterly sales.
• The resulting dataset is smaller in volume, without a loss of the information necessary to the task at hand!
• Usually, data cubes are used to store multidimensional aggregated information.
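A sketch of the same roll-up in R (the quarterly figures below are made up):

# Minimal sketch: roll quarterly sales up to annual sales
sales <- data.frame(
  year    = rep(2008:2010, each = 4),
  quarter = rep(1:4, times = 3),
  amount  = c(224, 408, 350, 586, 300, 520, 480, 610, 410, 390, 500, 660)
)

aggregate(amount ~ year, data = sales, FUN = sum)   # one row per year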
DATA ANALYTICS
Data Compression
• Transformations are applied so as to obtain a reduced or compressed representation of the original data.
• If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
• There are several lossless algorithms for string compression; however, they allow only limited data manipulation.
DATA ANALYTICS
Test your understanding!
• What is the first step in data integration?
Understanding the metadata

• Is PCA lossy or lossless?
Lossy

• Amongst PCA and attribute subset selection, which data reduction method has more interpretability?
Attribute subset selection
DATA ANALYTICS
Test your understanding!
• For an electrocardiography (ECG) wave, which is the better transform: Fourier or wavelet?
Wavelet
Because ECG signals have short intervals of characteristic oscillation, and Fourier transforms can only capture frequencies that persist over an entire signal, which is not suitable here.

• ----------------- is a nonzero vector that stays parallel after matrix multiplication.
Eigenvector
DATA ANALYTICS
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei, The Morgan Kaufmann Series in Data Management Systems, 3rd Edition, Chapters 3.3 – 3.4
• https://www.st-andrews.ac.uk/~wjh/dataview/tutorials/dwt.html
• https://towardsdatascience.com/the-wavelet-transform-e9cfa85d7b34
• https://www.cs.unm.edu/~mueen/Teaching/CS_521/Lectures/Lecture2.pdf
• https://medium.com/analytics-vidhya/understanding-principle-component-analysis-pca-step-by-step-e7a4bb4031d9
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 7 : Data Preprocessing – Transformations and Discretization

Department of Computer Science and Engineering


DATA ANALYTICS
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.

Methods
• Smoothing
• Attribute construction
• Aggregation
• Normalization
• Discretization
• Concept hierarchy generation
DATA ANALYTICS
Smoothing
Removal of noise in data.
Techniques:
• Binning
• Regression
• Clustering
• Simple average – time-series data
• Weighted average – time-series data
• Exponential smoothing – time-series data
• Gaussian filter – images
DATA ANALYTICS
Normalization
• Min-max normalization : performs a linear transformation, mapping values from [minA, maxA] to [new_minA, new_maxA]. A value v is transformed by

$v' = \frac{v - \min_A}{\max_A - \min_A}(new\_max_A - new\_min_A) + new\_min_A$

• Example : let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. The mapping for $73,600 is

$v' = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• This preserves the relationships among the original data values.
DATA ANALYTICS
Normalization
• Z-score normalization : the values for an attribute A are normalized based on the mean and standard deviation of A. A value v is normalized by

$v' = \frac{v - \mu_A}{\sigma_A}$

where μA is the mean and σA the standard deviation of A.

• Example : let μA = 54,000 and σA = 16,000. Then the z-score of 73,600 is

$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

• This method of normalization is useful when the actual minimum or maximum value of A is unknown, or when there are outliers in A.
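Both normalizations in R, reproducing the worked numbers above:

# Minimal sketch: min-max and z-score normalization of the income example
v <- 73600

(v - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0   # min-max -> 0.716
(v - 54000) / 16000                                 # z-score -> 1.225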
DATA ANALYTICS
Discretization
• Divides the range of a continuous attribute into intervals.
• Interval labels are then used to replace the actual data values.
• Data size can be reduced by discretization.
• If the discretization process uses class information, it is called supervised discretization; otherwise it is called unsupervised discretization.
• Top-down discretization : the process starts by finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.
• Bottom-up discretization : also called merging; starts by considering all the continuous values as potential split points, and removes some by merging neighbourhood values to form intervals.
DATA ANALYTICS
Data discretization methods
• Binning :
▪ Top-down split, unsupervised
• Histogram analysis :
▪ Top-down split, unsupervised
• Clustering analysis :
▪ Unsupervised, top-down split or bottom-up merge
• Decision-tree analysis :
▪ Supervised, top-down split
• Correlation analysis :
▪ Bottom-up merge
Note : all these methods can be applied recursively.
DATA ANALYTICS
Binning
• Equal-width (distance) partitioning
▪ Divides the range into N intervals of equal size.
▪ The width of each interval is w = (Maximum – Minimum)/N.
▪ Is susceptible to outliers and skewed data.
• Equal-depth (frequency) partitioning
▪ Divides the range into N intervals, each containing approximately the same number of samples.
▪ Ensures good data scaling, but managing categorical attributes can get tricky.
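Both partitioning rules with base R functions (a sketch reusing the price data from the smoothing example below):

# Minimal sketch: equal-width vs equal-depth partitioning
x <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)

# Equal-width: N = 3 intervals, width (max - min)/N = 10
table(cut(x, breaks = 3))

# Equal-depth: cut at quantiles so bins hold roughly equal counts
table(cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
          include.lowest = TRUE))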
DATA ANALYTICS
Binning - Example

https://www.saedsayad.com/unsupervised_binning.htm
DATA ANALYTICS
Data Smoothing with Binning
Sorted data for price (in $) : 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-depth (frequency) bins:
Bin-1 : 4, 8, 9, 15
Bin-2 : 21, 21, 24, 25
Bin-3 : 26, 28, 29, 34
DATA ANALYTICS
Data Smoothing with Binning
• Smoothing by bin means :
Bin-1 : 9, 9, 9, 9
Bin-2 : 23, 23, 23, 23
Bin-3 : 29, 29, 29, 29

• Smoothing by bin boundaries :
Bin-1 : 4, 4, 4, 15
Bin-2 : 21, 21, 25, 25
Bin-3 : 26, 26, 26, 34
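The same smoothing in R (a sketch; note the slide rounds the bin means 22.75 and 29.25 to 23 and 29):

# Minimal sketch: equal-depth bins, then smoothing by means and boundaries
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)
bins  <- split(price, rep(1:3, each = 4))        # 3 equal-frequency bins

# Smoothing by bin means
lapply(bins, function(b) rep(round(mean(b)), length(b)))

# Smoothing by bin boundaries: snap each value to the nearer of min/max
lapply(bins, function(b) ifelse(b - min(b) <= max(b) - b, min(b), max(b)))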
DATA ANALYTICS
Binning vs Clustering (Unsupervised Discretization)

[Figure : the same data partitioned by binning vs by clustering]
DATA ANALYTICS
Discretization by Classification
• Supervised : class labels are used in determining the split point.
• It is a top-down discretization where recursive splitting is applied.
• Example : decision-tree analysis.
• Entropy is used to determine the split point.
(https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8)
DATA ANALYTICS
Discretization by Correlation Analysis
• Supervised : class labels are used in determining the split point.
• Example : Chi-merge, a χ²-based discretization. It is a bottom-up merge.
• Initially, each distinct value of the attribute is considered to be one interval.
• χ² tests are performed for every pair of adjacent intervals.
• Adjacent intervals with the least χ² values are merged, as a low χ² indicates similar class distributions.
• This merging process proceeds recursively until a predefined threshold for χ² is met.
DATA ANALYTICS
Discretization by Correlation Analysis - Example

[Three slides of worked Chi-merge example figures; source: https://www.slideserve.com/forrest-]
DATA ANALYTICS
Concept Hierarchy Generation
• A concept hierarchy organizes concepts (attribute values) hierarchically by representing a series of mappings from a set of low-level concepts to high-level, generalized concepts.
• It facilitates drilling and rolling in data warehouses to view data at multiple granularities.
• Method : recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as kids, teenagers, adults, senior citizens).
• Concept hierarchies can be explicitly specified by domain experts and/or generated automatically.
DATA ANALYTICS
Concept Hierarchy Generation for Numeric Data
The discretization methods discussed so far can be used to build concept hierarchies for numeric data.

[Figure : an example concept hierarchy for a numeric attribute]
DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
1) Specification of a partial ordering of attributes explicitly at the schema level
• A user or an expert defines a concept hierarchy by specifying a partial or a total ordering of attributes at the schema level.
• For example, suppose a relational database contains the attributes street, city, state and country. The location dimension of the data warehouse may contain the same attributes.
• A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < state < country.
DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
2) Specification of a portion of a hierarchy by explicit data grouping
• A portion of the concept hierarchy is manually defined.
• In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration.
• However, we can easily specify explicit groupings for a small portion of intermediate-level data.
• For example, after specifying that state and country form a hierarchy at the schema level, a user can add some intermediate levels manually, such as “{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada”.
DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
3) Specification of only a partial set of attributes
• At times, a user may have only a vague idea about what should be included in the hierarchy.
• The user may have included only a small subset of the relevant attributes in the hierarchy specification.
• For example, instead of including all the hierarchically relevant attributes for location, the user might have specified only street and city.
• To handle this, embed data semantics into the database schema, so that one attribute will trigger a whole group of semantically linked attributes to be added to the hierarchy. For example, when city is added, street, state and country would be dragged in as well.
DATA ANALYTICS
Concept Hierarchy Generation for Nominal Data
4) Automatic generation of hierarchies by analysis of distinct values per attribute
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the dataset.
• The attribute with the most distinct values is placed at the lowest level of the hierarchy.
Note : this method is not foolproof. For example, a time dimension in a database might contain 20 distinct years, 12 distinct months and 7 distinct days of the week. However, this doesn't suggest that the time hierarchy should be “year < month < days_of_the_week”, with days_of_the_week at the top.
DATA ANALYTICS
Test your understanding!
• Which method of normalization must one choose if they are dealing with a lot of outliers and don't know the range of their data?
Solution
Z-score normalization

• Which split point is preferred for discretization using decision trees?
Solution
The split point which results in the least entropy.

• Which normalization method strictly works within the range of the input data?
Solution
Min-max normalization
DATA ANALYTICS
Test your understanding!
• Consider a set of unsorted data for price in dollars:
8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34

1) Smooth the data by equal-frequency bins
2) On the results of part (1), apply smoothing by bin means

Solution (using three bins of four values, as in the earlier example):
1) Bin-1 : 8, 9, 15, 16 ; Bin-2 : 21, 21, 24, 26 ; Bin-3 : 27, 30, 30, 34
2) Bin-1 : 12, 12, 12, 12 ; Bin-2 : 23, 23, 23, 23 ; Bin-3 : 30.25, 30.25, 30.25, 30.25
DATA ANALYTICS
References
• Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei, The Morgan Kaufmann Series in Data Management Systems, 3rd Edition, Chapter 3.5
• https://t4tutorials.com/binning-methods-for-data-smoothing-in-data-mining/
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 8 : Analysis of Variance - 1

Department of Computer Science and Engineering


DATA ANALYTICS
ANOVA
• Analysis of Variance (ANOVA) is a statistical technique that is used to check if the means of two or more groups are significantly different from each other.
• ANOVA checks the impact of one or more factors by comparing the means of different samples.
• For example, assume there are 3 classes. For a given exam, you want to find out whether the marks of the students (dependent variable) differ across the classes (the factor).
(https://www.geeksforgeeks.org/one-way-anova/)
DATA ANALYTICS
ANOVA
To understand whether the factor (different levels of the factor) has any statistical significance on the population parameter, we compare two models.

1) Means Model :
• It is given by $Y_{ij} = \mu + \varepsilon_{ij}$
• Yij is the value of the outcome variable for the jth observation at the ith factor level, μ is the overall mean value of all observations, and εij is the error term, assumed to follow a normal distribution with mean 0 and standard deviation σ.
• The model defined in the above equation is often called the reduced model, in which the mean μ is common for all levels of the factor.
DATA ANALYTICS
ANOVA
2) Factor Effect Model :
• It is given by $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
• μ is the overall mean and τi is the effect of factor level i (the factor effect), i.e. the difference between the overall mean and the factor level mean.
• A non-zero τi implies that the factor has an influence on the value of the outcome variable Yij.
• Our objective is to verify whether the variation due to the factors is different from the variation due to randomness.
• The model defined in the above equation is called the full model.
DATA ANALYTICS
Multiple t-tests for comparing several means
• Consider a retail store that would like to study the impact of different levels of price discount (the factor) on sales (the outcome variable). Let's say they are analyzing discount levels of 0%, 10% and 20%.
• If we had only 2 levels of discount, we could have used a t-test directly to check whether a statistically significant relationship exists between price discount and average sales quantity.
• One option is to use 3 different two-sample t-tests:
 Test A : between 0% and 10%
 Test B : between 0% and 20%
 Test C : between 10% and 20%
DATA ANALYTICS
Multiple t-tests for comparing several means
• Let,
P(A) = P(Retain H0 in test A | H0 in test A is true)
P(B) = P(Retain H0 in test B | H0 in test B is true)
P(C) = P(Retain H0 in test C | H0 in test C is true)
• Note : P(A) = P(B) = P(C) = 1 – α = 1 – 0.05 = 0.95
• The conditional probability of simultaneously retaining all 3 null hypotheses when they are true is P(A ∩ B ∩ C) = 0.95³ = 0.8574.
• Now consider the following null hypothesis:
H0: μ0 = μ10 = μ20
If we retain the null hypothesis based on the three individual t-tests, then the significance (Type I error) is not the α value of 0.05 but 1 − 0.8574 = 0.1426.
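The arithmetic in R:

# Minimal sketch: familywise Type I error of three independent t-tests
alpha <- 0.05
(1 - alpha)^3        # 0.857375: P(retain all three true nulls)
1 - (1 - alpha)^3    # 0.142625: the actual Type I error, not 0.05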
DATA ANALYTICS
The need for ANOVA
• For the case discussed, if we retain the null hypothesis based on 3 individual tests, then the Type I error is 1 – 0.8574 = 0.1426.
• When more than 2 groups are involved, checking the population parameter values simultaneously using t-tests is inappropriate, since the Type I and Type II errors will be estimated incorrectly.
• For this reason, we use analysis of variance (ANOVA) whenever we need to compare 3 or more groups for population parameter values simultaneously.
DATA ANALYTICS
One-way ANOVA
One-way ANOVA is appropriate under the following conditions:
1) We would like to study the impact of a single treatment (also known as a factor) at different levels (thus forming different groups) on a continuous response variable (or outcome variable). For the example discussed, the variable 'price discount' is the treatment (or factor), and 0%, 10% and 20% price discounts are the different levels (3 levels in this case); different levels of discount are likely to have a varying impact on the sales of the product, where sales is the response variable.
DATA ANALYTICS
One-way ANOVA
2) In each group, the population response variable follows a normal distribution, and the sample subjects are chosen using random sampling.
3) The population variances for the different groups are assumed to be the same. That is, the variability in the response variable values within different groups is the same.
Although conditions 2 and 3 are necessary for one-way ANOVA, the model is robust, and minor violations of the assumptions may not result in an incorrect decision about the null hypothesis.
DATA ANALYTICS
Setting up an ANOVA
• Assume that we would like to study the impact of a factor (such as discount) with k levels on a continuous variable (such as sales quantity).
• Then the null and alternative hypotheses for one-way ANOVA are given by
H0: μ1 = μ2 = μ3 = … = μk
HA: Not all μ values are equal
• Note that the alternative hypothesis, 'not all μ values are equal', implies that some of them could still be equal.
• The null hypothesis is equivalent to stating that the factor effects τ1, …, τk defined in the equation $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$ are all zero.
DATA ANALYTICS
Setting up an ANOVA
The hypothesis test can be visualized as follows. Large separation between the group means indicates that the factor levels have an impact on the outcome. If there is little separation, the factor levels do not have a statistically significant impact.

(https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/)
DATA ANALYTICS
Setting up an ANOVA
• We are interested in analyzing a single factor effect with k levels; thus we will have k groups.
Let
k = Number of groups (or samples)
ni = Number of observations in group i (i = 1, 2, …, k)
n = Total number of observations $\left(= \sum_{i=1}^{k} n_i\right)$
Yij = Observation j in group i

• Mean of group i : $\mu_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$

• Overall mean : $\mu = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$
DATA ANALYTICS
Setting up an ANOVA
To arrive at the test statistic, we calculate the following measures, which are the variations within groups and between groups.
• Sum of Squares of Total Variation (SST)
Total variation is the sum of squared deviations of all values of the response variable (Yij) from the overall mean (μ) and is given by

$SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \mu)^2$

The degrees of freedom for SST is (n − 1), since only the value of μ is estimated from the n observations and thus only one degree of freedom is lost. The Mean Square Total (MST) variation is given by

$MST = \frac{SST}{n-1}$
DATA ANALYTICS
Setting up an ANOVA
• Sum of Squares of Between-Group Variation (SSB)
The sum of squares of between-group variation is the sum of squared deviations of the group means (μi) from the overall mean (μ) of the data and is given by

$SSB = \sum_{i=1}^{k} n_i (\mu_i - \mu)^2$

The degrees of freedom is (k − 1): since the overall mean μ is estimated from the data, one degree of freedom is lost. The mean square between variation (MSB) is given by

$MSB = \frac{SSB}{k-1}$
DATA ANALYTICS
Setting up an ANOVA
• Sum of Squares of Within-Group Variation (SSW)
The sum of squares of within-group variation is the sum of squared deviations of observations (Yij) from their group mean (μi) and is given by

$SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \mu_i)^2$

The degrees of freedom for SSW is (n − k); here k degrees of freedom are lost because we estimate the k group means (μi). The mean square of variation within the groups (MSW) is

$MSW = \frac{SSW}{n-k}$
DATA ANALYTICS
Setting up an ANOVA

[Figure : within-group and between-group variation, from https://www.datanovia.com/en/lessons/]
DATA ANALYTICS
Setting up an ANOVA
We can prove algebraically that

$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \mu)^2 = \sum_{i=1}^{k} n_i(\mu_i - \mu)^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \mu_i)^2$

That is,
SST = SSB + SSW
DATA ANALYTICS
Cochran's Theorem
According to Cochran's theorem (Kutner et al., 2013, page 70):
'If Y1, Y2, …, Yn are drawn from a normal distribution with mean μ and standard deviation σ, and the sum of squares of total variation is decomposed into k sums of squares (SSr) with degrees of freedom dfr, then the (SSr/σ²) are independent χ² variables with dfr degrees of freedom if

$\sum_{r=1}^{k} df_r = n - 1$'

Note that, in the equation SST = SSB + SSW, the SST is decomposed into two sums of squares (SSB and SSW), and thus SSB/σ² and SSW/σ² are independent chi-square variables.
DATA ANALYTICS
The F-test
• If the null hypothesis is true, then there will be no difference in the mean values, which will result in no difference between MSB and MSW.
• Alternatively, if the means are different, then MSB will be larger than MSW.
• That is, the ratio MSB/MSW will be close to 1 if there is no difference between the mean values, and will be larger than 1 if the means are different.
DATA ANALYTICS
The F-test
• Following Cochran's theorem (Kirk, 1995), MSB/MSW is a ratio of two independent chi-square variates, each divided by its degrees of freedom, and therefore follows an F-distribution. Thus the statistic for testing the null hypothesis is

$F = \frac{SSB/(k-1)}{SSW/(n-k)} = \frac{MSB}{MSW}$

• Note that the test is a one-tailed test (right-tailed), since we are interested in finding whether the variation between groups is greater than the variation within the groups.
• It is important to note that rejecting the null hypothesis will not tell us exactly which means differ from each other; it will only indicate that there is a difference in at least one of the group means. We may have to conduct two-sample t-tests to find out which mean values are different.
DATA ANALYTICS
Example (an Experimental Study)
Ms Rachael Khanna, the brand manager of ENZO detergent powder at the 'one stop' retail store, was interested in understanding whether price discounts have any impact on the sales quantity of ENZO. To test whether the price discounts had any impact, price discounts of 0% (no discount), 10% and 20% were given on randomly selected days. The quantity (in kilograms) of ENZO sold in a day under the different discount levels is shown below. Conduct a one-way ANOVA to check whether the discount had any significant impact on the sales quantity at α = 0.05.

[Table : daily sales quantity (kg) under each discount level]
DATA ANALYTICS
Solution
• In this case, the number of groups k = 3; n1 = n2 = n3 = 30; μ1 = 32, μ2 = 38.77, μ3 = 46.4; and μ = 39.05.

• The sum of squares of between-group variation (SSB) is given by

$SSB = \sum_{i=1}^{k} n_i(\mu_i - \mu)^2 = 30 \times [(32 - 39.05)^2 + (38.77 - 39.05)^2 + (46.4 - 39.05)^2] = 3114.156$

• Therefore

$MSB = \frac{SSB}{k-1} = \frac{3114.156}{2} = 1557.078$
DATA ANALYTICS
Solution
• The sum of squares of within-group variation is given by

$SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \mu_i)^2 = \sum_{j=1}^{30}(Y_{1j} - 32)^2 + \sum_{j=1}^{30}(Y_{2j} - 38.77)^2 + \sum_{j=1}^{30}(Y_{3j} - 46.4)^2 = 2056.567$

• Therefore

$MSW = \frac{SSW}{n-k} = \frac{2056.567}{90-3} = 23.64$

• The F-statistic value is

$F_{2,87} = \frac{MSB}{MSW} = \frac{1557.078}{23.64} = 65.86$
DATA ANALYTICS
Solution
• The critical F-value with degrees of freedom (2, 87) for α = 0.05 is 3.101.
• The p-value for F2,87 = 65.86 is 3.82 × 10⁻¹⁸.
• Since the calculated F-statistic is much higher than the critical F-value, we reject the null hypothesis and conclude that the mean sales quantities under the different discounts are not all equal.

[Figure : Excel ANOVA output for this data]
DATA ANALYTICS
Test your understanding!
• What would happen if, instead of using ANOVA to compare 7 groups, you performed multiple t-tests?
a) Making multiple comparisons with a t-test increases the probability of making a Type-1 error.
b) Sir Ronald Fisher would be turning over in his grave; he put all that work into developing ANOVA and you used multiple t-tests 
c) Nothing, apart from the fact that making multiple comparisons with a t-test requires more computation than ANOVA
d) Nothing, both are the same.
Solution
a) Making multiple comparisons with a t-test increases the probability of making a Type-1 error.

• What kind of a hypothesis test is used for one-way ANOVA?
A one-tailed (right-tailed) F-test.
DATA ANALYTICS
Test your understanding!
• For an experiment with a single factor of k levels with n observations, the degrees of freedom for the sum of squares of variation within the group is?
a) n-1
b) k-1
c) n-k+1
d) n-k
Solution
d) n-k
DATA ANALYTICS
Test your understanding!
• An investigator used ANOVA to compare 4 groups of students on numerical ability based on a class test. After analysis of the raw scores, the following results were obtained. Calculate the value of the F-statistic.

[Table : ANOVA summary of the raw scores]

a) 3.4
b) 3.52
c) 3.88
d) 3.97
Solution:
[Worked solution shown as a figure]
DATA ANALYTICS
Quick Glance - Points to remember
• Why ANOVA, and which issue of multiple t-tests does it address?
• Means model
• Factor effect model
• Setting up one-way ANOVA:
o Appropriate conditions where one-way ANOVA is applicable
o Understanding all the variables and subscripts used
o Deriving SST, SSB and SSW (and the corresponding MST, MSB and MSW based on degrees of freedom)
o Cochran's theorem
o The F-statistic for ANOVA
o Finally, when to retain and when to reject the null hypothesis
DATA ANALYTICS
References
• Business Analytics by U. Dinesh Kumar – Wiley, 2nd Edition, 2022, Chapters 7.1 – 7.3.3
• https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/
• https://www.analyticsvidhya.com/blog/2020/06/introduction-anova-statistics-data-science-covid-python/
• https://www.geeksforgeeks.org/one-way-anova/