UE21CS342AA2 - Unit-1 Part - 2
UE20CS312
UNIT-1
Lecture 5 : Data Preprocessing
- Data Integration and
Reduction
• Expected stands for what we would expect if the null hypothesis were true.
• The larger the value of $\chi^2$, the more likely it is that the variables are correlated.
• The cells that contribute the most to the $\chi^2$ value are those whose actual count differs from the expected count.
• Can be used for categorical variables where the entries are numbers (counts) and not percentages or fractions (10% of 100 needs to be entered as 10).
• Correlation does not imply causation.
The number of hospitals and the number of car thefts in a city may appear to be correlated. Both are causally linked to a third variable: population.
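To make the expected counts and the per-cell contributions concrete, here is a minimal sketch of the $\chi^2$ computation for a contingency table. The function name `chi_square` and the 2 x 2 counts are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table of raw counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under the null hypothesis of independence:
    # (row total * column total) / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Each cell contributes (observed - expected)^2 / expected
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected


# Hypothetical counts (not percentages) for two categorical variables
observed = [[30, 10],
            [20, 40]]
chi2, expected = chi_square(observed)
print(expected)
print(round(chi2, 4))
```

The entries must be raw counts: entering percentages or fractions would change the totals and therefore the statistic.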
DATA ANALYTICS
Covariance analysis: An Example
https://mathcs.clarku.edu/~djoyce/ma217/covar.pdf
Correlation (viewed as a linear relationship)
• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize data objects A and B and then take their dot product (divided by n − 1 when the sample standard deviation is used for standardizing).
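The "standardize, then take the dot product" recipe can be sketched as follows. This sketch assumes standardization with the sample standard deviation, in which case the dot product has to be divided by n − 1 to land in [−1, 1]; the helper names and the small vectors are illustrative only.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


def correlation(a, b):
    """Pearson correlation as a scaled dot product of standardized vectors."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / (len(a) - 1)


a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # perfectly increasing with a
c = [5, 4, 3, 2, 1]    # perfectly decreasing with a
print(correlation(a, b), correlation(a, c))
```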
Covariance analysis (Numeric Data)
• Curse of dimensionality
• When dimensionality increases, data becomes
increasingly sparse.
• Density and distance between points, which are critical
to clustering and outlier analysis, become less
meaningful.
• The possible combinations of subspaces will grow
exponentially.
• Dimensionality reduction
• Avoids the curse of dimensionality.
• Helps to eliminate irrelevant attributes and reduce noise.
• Reduces time and space required for data analytics.
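A rough way to see the sparsity argument: if every dimension is cut into a fixed number of bins, the number of cells grows exponentially with the number of dimensions, so a fixed-size dataset can occupy only a vanishing fraction of them. The function name and the numbers below (1000 points, 10 bins per dimension) are assumptions made for this sketch.

```python
def max_occupied_fraction(n_points, bins_per_dim, n_dims):
    """Upper bound on the fraction of grid cells that can contain a point."""
    cells = bins_per_dim ** n_dims  # grows exponentially with n_dims
    return min(1.0, n_points / cells)


for d in (1, 2, 3, 6, 10):
    print(d, 10 ** d, max_occupied_fraction(1000, 10, d))
```

With 1000 points, every cell can still be occupied in 3 dimensions, but at most one cell in a thousand can be occupied in 6 dimensions.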
Mapping data to a new space
• Fourier transform
• Wavelet transform
What is a Wavelet?
A wavelet is a wave-like oscillation that is localized in
time. Wavelets have two basic properties: scale and location.
Scale (or dilation) defines how “stretched” or “squished” a
wavelet is; this property is related to frequency as defined
for waves. Location defines where the wavelet is positioned
in time (or space).
Wavelet transformation
• Discrete wavelet transform (DWT) is used for linear signal processing and multi-resolution analysis.
• It decomposes a signal into different frequency sub-bands. It is applicable to n-dimensional signals.
• Data is transformed to preserve relative distance between objects at different resolutions.
• Compressed approximation: it stores only a small fraction of the strongest of the wavelet coefficients.
• It is insensitive to noise, input
[Figure: An example of DWT]
[Figure: Wavelet families]
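The decomposition and the "keep only the strongest coefficients" idea can be sketched with the Haar wavelet, the simplest wavelet family. This is an illustrative sketch, not the method used in the lecture's figure: the function names are hypothetical, the signal length is assumed to be a power of two, and the sample signal is made up.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT (even-length input)."""
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal):
    """Full decomposition: [coarsest average, coarse details, ..., fine details]."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        coeffs = detail + coeffs  # finer details stay at the end
    return approx + coeffs


def haar_idwt(coeffs):
    """Invert haar_dwt by rebuilding one resolution level at a time."""
    s = math.sqrt(2)
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt += [(a + d) / s, (a - d) / s]
        approx = rebuilt
    return approx


def keep_strongest(coeffs, m):
    """Compressed approximation: zero all but the m largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:m])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 0, 2, 3, 5, 4, 4]          # illustrative values
coeffs = haar_dwt(signal)
restored = haar_idwt(coeffs)                # exact reconstruction
compressed = keep_strongest(coeffs, 4)      # store only 4 of 8 coefficients
approx_signal = haar_idwt(compressed)       # lossy reconstruction
print([round(c, 3) for c in coeffs])
print([round(x, 3) for x in approx_signal])
```

Because the transform is orthonormal, dropping the smallest coefficients discards the least signal energy, which is why a small fraction of coefficients can still give a usable approximation.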
Wavelet Transforms
What is PCA?
Assume there are 50 questions in a survey.
The following three are among them:
1. I feel comfortable around people
2. I easily make friends
3. I like going out
These questions may look different at first, but generally speaking
they are not: they all gauge how extroverted you are.
It therefore makes sense to combine them. That is where linear
algebra and dimensionality reduction methods come in.
We want to lessen the complexity of the problem by reducing the
number of variables, since we have far too many variables that
aren't all that different. That is the main idea behind dimensionality
reduction.
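As a rough illustration of combining the three extroversion questions into one variable, the sketch below finds the first principal component of a tiny made-up dataset. It follows the usual PCA recipe (center the data, form the covariance matrix, take its leading eigenvector, project), but uses power iteration instead of a full eigendecomposition to stay dependency-free. The respondent scores and all function names are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, explained variance ratio, scores) of the first PC."""
    n, d = len(rows), len(rows[0])
    # Step 1: center each variable on its mean
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Step 2: sample covariance matrix
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Step 3: leading eigenvector via power iteration
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        v = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Rayleigh quotient gives the corresponding eigenvalue
    eigenvalue = sum(w[a] * sum(cov[a][b] * w[b] for b in range(d))
                     for a in range(d))
    explained = eigenvalue / sum(cov[a][a] for a in range(d))
    # Step 4: project every respondent onto the component
    scores = [sum(c[j] * w[j] for j in range(d)) for c in centered]
    return w, explained, scores


# Hypothetical answers (1-5 scale) to the three questions, one row per respondent
answers = [
    [5, 5, 4],
    [4, 4, 5],
    [2, 1, 2],
    [1, 2, 1],
    [3, 3, 3],
    [4, 5, 4],
    [2, 2, 1],
]
loadings, explained, scores = first_principal_component(answers)
print([round(x, 3) for x in loadings], round(explained, 3))
```

In this toy data all three loadings have the same sign and the single component carries most of the variance, so one "extroversion" score can stand in for the three questions.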
Principal Component Analysis (PCA)
PCA - Steps
And Finally!
PCA using R (FactoMineR, factoextra)
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
Attribute Subset Selection
https://towardsdatascience.com/8-types-of-sampling-techniques-b21adcdd2124
Data Cube Aggregation
• Imagine you have to perform an analysis on yearly sales at Dunder Mifflin Paper Company.
• The data you receive has sales per quarter from the year 2008 to 2010.
• Since you care about annual metrics, the data can be aggregated so that the resulting data summarizes the annual sales rather than quarterly sales.
• The resulting dataset is smaller in volume without a loss of information necessary to the task at hand!
• Usually data cubes are used to store multidimensional aggregated information.
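The quarterly-to-annual roll-up can be sketched in a few lines. The sales figures below are invented purely for illustration, and `aggregate_by_year` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical quarterly sales keyed by (year, quarter)
quarterly_sales = {
    (2008, "Q1"): 224, (2008, "Q2"): 408, (2008, "Q3"): 350, (2008, "Q4"): 586,
    (2009, "Q1"): 250, (2009, "Q2"): 420, (2009, "Q3"): 380, (2009, "Q4"): 610,
    (2010, "Q1"): 300, (2010, "Q2"): 450, (2010, "Q3"): 400, (2010, "Q4"): 650,
}


def aggregate_by_year(sales):
    """Roll quarterly figures up to one total per year."""
    annual = defaultdict(int)
    for (year, _quarter), amount in sales.items():
        annual[year] += amount
    return dict(annual)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)  # 3 rows instead of 12
```

The aggregated table is a quarter of the original size, yet it still answers every question about annual sales.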
Data Compression
Methods
• Smoothing
• Attribute construction
• Aggregation
• Normalization
• Discretization
• Concept hierarchy generation
Smoothing
Techniques:
• Binning
• Regression
• Clustering
• Simple average – Time series data
• Weighted average – Time series data
• Exponential smoothing – Time series data
• Gaussian filter – Image
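The three time-series techniques in the list can be sketched as below. The window size, the weights, the smoothing constant `alpha` and the series itself are assumptions made for this example; the lecture does not specify them.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values at each position."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def weighted_moving_average(series, weights):
    """Weighted mean of the last len(weights) values; last weight = newest value."""
    w = len(weights)
    total = sum(weights)
    return [sum(x * wt for x, wt in zip(series[i - w + 1:i + 1], weights)) / total
            for i in range(w - 1, len(series))]


def exponential_smoothing(series, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_(t-1), starting from s_0 = x_0."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


series = [10, 12, 14, 13, 15, 17]   # illustrative time series
sma = simple_moving_average(series, 3)
wma = weighted_moving_average(series, [1, 2, 3])
ses = exponential_smoothing(series, 0.5)
print(sma, wma, ses)
```

A larger window or a smaller `alpha` smooths more aggressively but reacts more slowly to genuine changes in the series.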
Normalization
https://www.saedsayad.com/unsupervised_binning.htm
Data Smoothing with Binning
https://www.slideserve.com/forrest-
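One common variant of smoothing with binning is smoothing by bin means: sort the values, split them into equal-frequency bins, and replace each value with the mean of its bin. The sketch below assumes that variant; the function name, the bin size and the price values are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning followed by replacing values with the bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative sorted data
smoothed_prices = smooth_by_bin_means(prices, 3)
print(smoothed_prices)
```

Other variants replace each value with the bin median or with the nearest bin boundary instead of the mean.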
Discretization by Correlation Analysis - Example
https://www.slideserve.com/forrest-
Concept Hierarchy Generation
1) Means Model:
• It is given by $Y_{ij} = \mu + \varepsilon_{ij}$
• $Y_{ij}$ is the value of the outcome variable of the jth observation for the ith factor level, $\mu$ is the overall mean value of all observations, and $\varepsilon_{ij}$ is the error, assumed to follow a normal distribution with mean 0 and standard deviation $\sigma$.
• The model defined in the above equation is often called the reduced model, in which the mean is common for all levels of the factor.
ANOVA
2) Factor Effect Model:
• It is given by $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
• $\mu$ is the overall mean and $\tau_i$ is the effect of factor level i (or factor effect), that is, the difference between the overall mean and the factor level mean.
• A non-zero $\tau_i$ implies that the factor has an influence on the value of the outcome variable $Y_{ij}$.
• Our objective is to verify whether the variation due to factors is different from the variation due to randomness.
• The model defined in the above equation is called the full model.
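To make the full model concrete, the sketch below estimates the overall mean and the factor effects from grouped data, taking $\tau_i = \mu_i - \mu$ as implied by $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$. The sales numbers and the name `factor_effects` are hypothetical.

```python
def factor_effects(groups):
    """Estimate the overall mean and tau_i = (group mean) - (overall mean)."""
    all_values = [y for values in groups.values() for y in values]
    overall_mean = sum(all_values) / len(all_values)
    effects = {
        level: sum(values) / len(values) - overall_mean
        for level, values in groups.items()
    }
    return overall_mean, effects


# Hypothetical sales quantities observed at three discount levels
sales = {
    "0%": [20, 22, 24],
    "10%": [25, 27, 29],
    "20%": [33, 35, 37],
}
overall_mean, effects = factor_effects(sales)
print(overall_mean, effects)
```

If every estimated effect were close to zero, the reduced (means) model would describe the data just as well as the full model.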
Multiple t-tests for comparing several means
• Consider a retail store that would like to study the
impact of different levels of price discount (factors) on
sales (the outcome variable). Let's say they are analyzing
discount levels of 0%, 10% and 20%.
• If we had only 2 levels of discount, we could have used a
t-test directly to check whether a statistically significant
relationship exists between price discount and average
sales quantity.
• One option is to use 3 different two-sample t-tests:
Test A: between 0% and 10%
Test B: between 0% and 20%
Test C: between 10% and 20%
• Let
P(A) = P(Retain $H_0$ in test A | $H_0$ in test A is true)
P(B) = P(Retain $H_0$ in test B | $H_0$ in test B is true)
P(C) = P(Retain $H_0$ in test C | $H_0$ in test C is true)
• Note: P(A) = P(B) = P(C) = $1 - \alpha$ = 1 − 0.05 = 0.95
• The conditional probability of simultaneously retaining all 3
null hypotheses when they are true (assuming the tests are independent) is $P(A \cap B \cap C) = 0.95^3 = 0.8573$.
• Now consider the following null hypothesis:
$H_0: \mu_0 = \mu_{10} = \mu_{20}$
If we retain the null hypothesis based on the three individual
t-tests, then the significance level or Type I error is not the $\alpha$-value but 1 − 0.8573 = 0.1427.
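The inflation of the Type I error can be checked numerically. This sketch assumes the individual tests are independent, which is the simplification behind the $0.95^3$ calculation; the function names are illustrative.

```python
import math


def pairwise_test_count(k):
    """Number of two-sample t-tests needed to compare k group means pairwise."""
    return math.comb(k, 2)


def familywise_error(alpha, n_tests):
    """P(at least one false rejection) across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


n_tests = pairwise_test_count(3)          # 0% vs 10%, 0% vs 20%, 10% vs 20%
print(n_tests, round(familywise_error(0.05, n_tests), 4))

# The problem grows quickly with the number of factor levels
for k in (3, 4, 5, 10):
    m = pairwise_test_count(k)
    print(k, m, round(familywise_error(0.05, m), 4))
```

This growth is the motivation for a single test, such as ANOVA, that compares all means at once at the intended significance level.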
The need for ANOVA
https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/
Setting up an ANOVA
• We are interested in analyzing a single factor effect with k levels, thus we will have k groups.
Let
k = Number of groups (or samples)
$n_i$ = Number of observations in group i (i = 1, 2, …, k)
n = Total number of observations ($= \sum_{i=1}^{k} n_i$)
$Y_{ij}$ = Observation j in group i
• $\mu_i$ = Mean of group i $= \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$
• $\mu$ = Overall mean $= \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij}$
$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2$
The degrees of freedom for SSW is (n − k). Here k degrees of freedom are lost because we estimate k group means ($\mu_i$). The mean square of variation within the groups is
$MSW = \frac{SSW}{n - k}$
https://www.datanovia.com/en/lessons/
$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu)^2 = \sum_{i=1}^{k} n_i (\mu_i - \mu)^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2$
That is,
SST = SSB + SSW
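The identity SST = SSB + SSW can be verified numerically. The sketch below uses the standard one-way ANOVA definitions (total, between-group and within-group sums of squares) on a small made-up dataset; the function name and the data are assumptions for illustration.

```python
def anova_sums_of_squares(groups):
    """Return SST, SSB, SSW and the mean squares MSB, MSW for one-way ANOVA."""
    all_values = [y for g in groups for y in g]
    n, k = len(all_values), len(groups)
    mu = sum(all_values) / n                       # overall mean
    group_means = [sum(g) / len(g) for g in groups]
    # Total variation around the overall mean
    sst = sum((y - mu) ** 2 for y in all_values)
    # Variation of group means around the overall mean, weighted by group size
    ssb = sum(len(g) * (m - mu) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    return {
        "SST": sst, "SSB": ssb, "SSW": ssw,
        "MSB": ssb / (k - 1), "MSW": ssw / (n - k),
    }


# Hypothetical observations for k = 3 groups
groups = [[20, 22, 24], [25, 27, 29], [33, 35, 37]]
result = anova_sums_of_squares(groups)
print(result)
```

The degrees of freedom add up in the same way: (n − 1) = (k − 1) + (n − k).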
Cochran's Theorem
According to Cochran's theorem (Kutner et al., 2013, page 70):
‘If $Y_1, Y_2, \ldots, Y_n$ are drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$, and the sum of squares of total variation is decomposed into k sums of squares ($SS_r$) with degrees of freedom $df_r$, then the ($SS_r/\sigma^2$) are independent $\chi^2$ variables with $df_r$ degrees of freedom if
$\sum_{r=1}^{k} df_r = n - 1$’
Note that, in the equation SST = SSB + SSW, the SST is decomposed into two sums of squares (SSB and SSW) and thus $SSB/\sigma^2$ and $SSW/\sigma^2$ are independent chi-square variables.
The F-test
• Therefore
$MSB = \frac{SSB}{k - 1} = \frac{3114.156}{2} = 1557.078$
Solution
• Therefore
$MSW = \frac{SSW}{n - k} = \frac{2056.567}{90 - 3} = 23.64$
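Using the sums of squares quoted above (SSB = 3114.156, SSW = 2056.567, n = 90, k = 3), the F-statistic is the ratio MSB / MSW. The sketch below only reproduces that arithmetic; the function name is illustrative, and the critical value for comparison would still have to be read from an F table with (k − 1, n − k) degrees of freedom.

```python
def f_statistic(ssb, ssw, n, k):
    """Return (MSB, MSW, F) for one-way ANOVA from the sums of squares."""
    msb = ssb / (k - 1)   # mean square between groups
    msw = ssw / (n - k)   # mean square within groups
    return msb, msw, msb / msw


msb, msw, f_value = f_statistic(3114.156, 2056.567, n=90, k=3)
print(round(msb, 3), round(msw, 3), round(f_value, 2))
```

A large F means the variation between group means is large relative to the variation within groups, which is evidence against the null hypothesis of equal means.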
a) 3.4
b) 3.52
c) 3.88
d) 3.97
Solution:
Quick Glance - Points to remember
• Why ANOVA, and what issue with multiple t-tests does it
address?
• Means model
• Factor effect model
• Setting up one-way ANOVA:
o Appropriate conditions where one-way ANOVA is
applicable
o Understanding all the variables and subscripts used
o Deriving SST, SSB and SSW (and the corresponding MST, MSB and
MSW based on degrees of freedom)
o Cochran's theorem
o F-statistic for ANOVA
o Finally, when to retain and when to reject the null hypothesis
DATA ANALYTICS
References