Unit 2 Data Analytics Asha Karegowda May2021
Unit 2.
Topics covered:
Data Visualization Methods for Two and Higher Dimensional Data
(Scatter plot, radar plot, parallel plot)
Correlation Analysis
(Karl Pearson correlation coefficient,
Kendall’s tau rank correlation coefficient or simply Kendall’s tau,
Spearman’s rank correlation coefficient or Spearman’s rho)
Partial and multiple correlations
Simple linear regression
Outlier Detection for Single and Multidimensional Data
Reference - Michael R. Berthold, Christian Borgelt, Frank Höppner,
Guide to Intelligent Data Analysis, Springer series,
and various web sites
a) Two dimensional plot : Scatter Plot: Scatter plots refer to displays where two
attributes are plotted against each other. The two axes of the coordinate system
represent the two considered attributes, and each instance in the data set is represented
by a point, a circle, or any other symbol.
It should be noted that the scatter plots—like all other visualization techniques—
are very useful tools to discover simple structures and patterns or peculiar deviations like
outliers in a data set. But there is no guarantee that a scatter plot or any visualization
technique will automatically show all or even any interesting or deviating pattern in the
data set. A scatter plot with no outliers does not mean that there are no outliers in the data
set. It only means that there are no outliers with respect to the combination of the
attributes displayed in the scatter plot.
[Figure: scatter plot of y against X for the data below]
X     y
5     15
3     6
2     10
2     15
3     6
5     15
[Figure: scatter plot of y against X for the slightly perturbed data below]
X     y
5     15.3
3.2   6
2.4   10
2     15
3     6
5     15
What is the jitter effect? Some points are plotted at exactly the same position,
because two different objects have the same measured values. To avoid the impression
of seeing fewer objects than there actually are, one can add jitter (i.e. a small random
value) to the scatter plot. Instead of plotting the symbols exactly at the coordinates
specified by the values in the data set, we add a small random value to each original
value in the data table. The jitter should be small enough that a point originally lying
left of or below another point remains left of or below the other point after the jitter
is added.
Jitter is essential when categorical attributes are used for the coordinate axes of a
scatter plot, since categorical attributes have only a limited number of possible
values, so that plotting of objects at exactly the same position occurs very often
when no jitter is added.
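The idea of jitter can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function name add_jitter, the amplitude 0.2 and the seeds are assumptions, and the data are the six (X, y) pairs listed above.

```python
import random

def add_jitter(values, amplitude, seed=None):
    """Return a copy of values with a small uniform random offset added.

    amplitude is an assumed parameter: each offset lies in
    [-amplitude, +amplitude], so it should be chosen smaller than the
    gap between neighbouring distinct values.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-amplitude, amplitude) for v in values]

# The six (X, y) pairs from the notes; (5, 15) and (3, 6) occur twice.
xs = [5, 3, 2, 2, 3, 5]
ys = [15, 6, 10, 15, 6, 15]

jittered_x = add_jitter(xs, 0.2, seed=1)
jittered_y = add_jitter(ys, 0.2, seed=2)

# Without jitter only 4 distinct positions are visible; with jitter all 6 are.
distinct_before = len(set(zip(xs, ys)))
distinct_after = len(set(zip(jittered_x, jittered_y)))
```

Here the amplitude 0.2 is well below the smallest gap between distinct values (1 on the X axis), so the left/right and below/above order of distinct points is preserved.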
b) Parallel Coordinates :
Parallel coordinates is a common way of visualizing high-dimensional geometry and
analyzing multivariate data. Parallel coordinates draw the coordinate axes parallel to each
other, so that there is no limitation for the number of axes to be displayed. For a data
object, a polyline is drawn connecting the values of the data object for the attributes on
the corresponding axes.
A parallel coordinate plot maps each row in the data table as a line, or profile. Each
attribute of a row is represented by a point on the line. This makes parallel coordinate
plots similar in appearance to line charts, but the way data is translated into a plot is
substantially different. For larger data sets, it becomes more or less impossible to track
the lines that correspond to a data object in parallel coordinates plots or even to discover
general structures. It can be helpful to generate separate parallel coordinate plots for
different subsets of the data.
Consider, for example, a data table where a laboratory has measured the amount of
various carbohydrates contained in various fruit and vegetables.
http://stn.spotfire.com/spotfire_client_help/para/para_what_is_a_parallel_coordinate_plot.htm
For each food type, it is now possible to plot a profile of how the carbohydrates are
distributed. The technicians in the laboratory can now see which food types are similar to
each other in carbohydrate distribution, by comparing the profiles to each other. This is
where the parallel coordinate plot is really useful, to compare profiles in order to find
similarities.
The values in a parallel coordinate plot are always normalized. This means that for each
point along the X-axis, the lowest value in the corresponding column is set to 0% and the
highest value in that column is set to 100% along the Y-axis. The scale of the various
columns is totally separate, so do not compare the height of the curve in one column to
the height of the curve in another column.
Min-max normalization is done as follows:
For a given attribute A, a value V is normalized to V’ as
V’ = (V - Vmin) / (Vmax - Vmin)
where Vmin and Vmax are the minimum and maximum values of the attribute A to which V belongs.
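A minimal Python sketch of this normalization is given below. The function name min_max_normalize is an assumption, and returning 0.0 for a column whose values are all equal is a choice made here for illustration (the formula itself is undefined in that case). The sample column uses the Section A scores from the radar chart example in these notes.

```python
def min_max_normalize(values):
    """Scale a column of numbers to the range [0, 1] using
    V' = (V - Vmin) / (Vmax - Vmin)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumption: a constant column is mapped to 0.0 everywhere,
        # because the formula would otherwise divide by zero.
        return [0.0 for _ in values]
    return [(v - v_min) / (v_max - v_min) for v in values]

section_a = [50, 40, 100, 50, 40]
normalized = min_max_normalize(section_a)
# The smallest score (40) maps to 0, the largest (100) maps to 1.
```

Multiplying the result by 100 gives the 0% to 100% scale used on the axes of a parallel coordinate plot.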
c) Radar Plot :
Radar plots are only suited for smaller data sets. For such smaller data sets, it is
sometimes better not to draw all data objects in the same system of coordinate axes but
to draw each data object separately, which is then called a star plot.
Purpose: A radar chart helps to answer questions such as: Which observations are most
similar, i.e., are there clusters of observations? Are there outliers? What is the status
of different variables before and after improvement activities? Which of the many
variables are the most important to observe closely?
Example: Plot a radar chart for the given data of scores in 5 subjects for Section A and
Section B students. The maximum score in each subject is 100.
Subject   Section A   Section B
Sub1      50          90
Sub2      40          80
Sub3      100         40
Sub4      50          80
Sub5      40          100
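The geometry behind a radar chart can be sketched as follows. Each subject gets its own axis, the axes are spread evenly around a circle, and each score is placed on its axis at a distance proportional to score / max score. The function name radar_vertices and the layout (first axis pointing up, axes taken clockwise) are assumptions made for illustration; a plotting library would then draw the returned polygon.

```python
import math

def radar_vertices(scores, max_score=100):
    """Return the (x, y) corner points of the radar polygon for one data object.

    Assumed layout: the first axis points straight up and the remaining
    axes follow clockwise at equal angles. The first point is repeated
    at the end so that the polygon is closed.
    """
    n = len(scores)
    points = []
    for k, score in enumerate(scores):
        angle = math.pi / 2 - 2 * math.pi * k / n
        radius = score / max_score
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    points.append(points[0])
    return points

section_a = [50, 40, 100, 50, 40]
section_b = [90, 80, 40, 80, 100]
polygon_a = radar_vertices(section_a)
polygon_b = radar_vertices(section_b)
```

Drawing polygon_a and polygon_b on the same axes gives the radar chart; drawing each one on its own axes gives two star plots.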
The final marks, with a maximum score of 100 in each case, are shown in the table below.
Subject   School A (max 100)   School B (max 100)   School C (max 100)
Sub1      80                   75                   47
Sub2      60                   95                   54
Sub3      20                   60                   74
Sub4      40                   55                   80
Sub5      90                   40                   34
(1) Scatter Diagram: A scatter diagram is drawn to visualize the relationship between two
variables. The values of one variable (usually the independent one) are plotted on the
X-axis while the values of the other variable are plotted on the Y-axis. On the graph,
dots are plotted to represent the different pairs of data. When dots are plotted to
represent all the pairs, we get a scatter diagram. The way the dots scatter gives an
indication of the kind of relationship which exists between the two variables. While
drawing a scatter diagram, it is not necessary to start the axes at the zero values of the
X and Y variables; the minimum values of the variables considered may be taken as the
starting point instead.
When there is a positive correlation between the variables, the dots on the scatter diagram
run from left hand bottom to the right hand upper corner. In case of perfect positive
correlation all the dots will lie on a straight line.
When a negative correlation exists between the variables, dots on the scatter diagram run
from the upper left hand corner to the bottom right hand corner. In case of perfect
negative correlation, all the dots lie on a straight line.
In the graph method, the individual values of the two variables are plotted on graph
paper against the observation number, so that two curves are obtained: one for the X
variable and another for the Y variable.
Interpreting Graph: The graph is interpreted as follows:
(i) If both the curves run parallel or nearly parallel, or move in the same direction,
there is a positive correlation,
(ii) On the other hand, if the curves move in opposite directions, there is a
negative correlation.
The Pearson correlation coefficient r always lies in the range −1 ≤ r ≤ 1.
The larger the absolute value of the Pearson correlation coefficient, the stronger the linear
relationship between the two attributes. For |r | = 1 the values of X and Y lie exactly on a
line. Positive (negative) correlation indicates a line with positive (negative) slope.
Pearson’s correlation coefficient measures linear correlation. If there is a functional
dependency between two attributes, but the function is nonlinear (even if it is monotone),
Pearson’s correlation coefficient will not be −1 or 1. It can even be far away from these
values, depending on how much the function describing the functional relationship
deviates from a line.
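The behaviour described above can be checked with a short Python sketch. It uses the standard deviation-from-the-mean form of Pearson's coefficient, r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² Σ(y − ȳ)²); the function name pearson and the sample data are assumptions made for illustration, and the code assumes that neither attribute is constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 11))
linear = [2 * x + 1 for x in xs]     # exact linear dependency
cubic = [x ** 3 for x in xs]         # monotone but nonlinear dependency
falling = [-3 * x for x in xs]       # linear with negative slope

r_linear = pearson(xs, linear)
r_cubic = pearson(xs, cubic)
r_falling = pearson(xs, falling)
```

r_linear and r_falling reach the limits +1 and −1 (up to rounding), whereas r_cubic stays below 1 although y is completely determined by x.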
Problem: Compute Pearson’s correlation coefficient for the given data. Comment on the
result.
Note (alternate method): Pearson’s correlation coefficient can also be computed
using the following formula.
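Spearman’s rho is defined as
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} \left( r(x_i) - r(y_i) \right)^2}{n (n^2 - 1)}$$
where r(xi) is the rank of value xi when we sort the list (x1, . . . , xn) in increasing order.
r(yi) is defined analogously. When the rankings of the x- and y-values are exactly in the
same order, Spearman’s rho will yield the value 1. If they are in reverse order, we will
obtain the value −1. The above equation is also represented as
$$\rho = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)}$$
where di = r(xi) − r(yi) is the difference between the ranks of corresponding values.
For the example data (marks in Maths and Physics), n = 9 and ∑d² = 12, so
ρ = 1 – (6*12)/(9(81-1)) = 1 – 72/720 = 1 - 0.1 = 0.9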
The Spearman rank correlation for this set of data is 0.9, which shows a high positive
correlation between the marks scored by the students in Maths and Physics.
The Spearman correlation coefficient, rs, can take values from +1 to -1. An rs of +1
indicates a perfect association of ranks, an rs of zero indicates no association between
ranks, and an rs of -1 indicates a perfect negative association of ranks. The closer rs is
to zero, the weaker the association between the ranks.
Note: In some cases it becomes necessary to give two or more items an identical rank. In
such cases, it is customary to give each item an average rank. Therefore, if two items are
tied for the 4th and 5th rank, each item is ranked 4.5. In other words, where two or more
items are to be ranked equal, the rank assigned for the purpose of calculating the
coefficient of correlation is the average of the ranks which these items would have got
had they differed slightly from each other. When equal ranks are assigned to some items,
the rank correlation formula is also adjusted.
Note that in example 1 the ranks were not given; instead, the Physics and Maths marks were
given, so they had to be converted to ranks before computing the Spearman rank
correlation. In example 2, the three judges rank the contestants directly, hence there is
no need to convert to ranks.
Contestant   1   2   3   4   5   6   7   8   9   10
Judge1       1   6   5   10  3   2   4   9   7   8
Judge2       3   5   8   4   7   10  2   1   6   9
Judge3       6   4   9   8   1   2   3   10  5   7
Solution: Let R1, R2 and R3 denote the ranks given by the first, second and third judges
respectively, and let ρij be the rank correlation coefficient between the ranks given by
the ith and jth judges, i ≠ j = 1, 2, 3.
Let dij = Ri - Rj be the difference between the ranks given to an individual by the ith
and jth judges.
n = 10
∑d12² = 200, so ρ12 = 1 – (6*200)/(10(100-1)) = 1 – 1200/990 = -0.212
∑d13² = 60, so ρ13 = 1 – (6*60)/(10(100-1)) = 1 – 360/990 = 0.636
∑d23² = 214, so ρ23 = 1 – (6*214)/(10(100-1)) = 1 – 1284/990 = -0.297
Comment: Since ρ13 is maximum, the pair of first and third judges has the nearest
approach to common tastes in beauty.
Since ρ12 and ρ23 are negative, the pairs of judges (1,2) and (2,3) have opposite
(divergent) tastes in beauty.
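The three pairwise coefficients for the judges can be checked with a short Python sketch. It applies ρ = 1 − 6∑d²/(n(n² − 1)) directly to the given ranks, so it assumes there are no tied ranks; the names spearman_rho, judges and rho are chosen here for illustration.

```python
from itertools import combinations

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho from two lists of ranks without ties."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

judges = {
    1: [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    2: [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    3: [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# One coefficient per pair of judges: (1, 2), (1, 3) and (2, 3).
rho = {
    (i, j): spearman_rho(judges[i], judges[j])
    for i, j in combinations(judges, 2)
}
closest_pair = max(rho, key=rho.get)
```

closest_pair identifies the two judges whose rankings agree most, which is the pair (1, 3) for this data.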
Example 3 (Case when two or more values of a given attribute are repeated, i.e. tie case
or repeated ranks)
If there is a tie, i.e. if two or more individuals are placed together in a
classification with respect to an attribute, or if in the case of variable data more than
one item has the same value in either or both of the series, then Spearman’s formula for
calculating the rank correlation coefficient breaks down, since in this case the ranks of
the variables X and Y do not take the values from 1 to n.
Common ranks are assigned to the repeated items. These common ranks are the
arithmetic mean of the ranks which these items would have got if they were different
from each other, and the next item gets the rank following the ranks used in computing
the common rank. For example, if rank 4 is repeated two times, the common rank to be
assigned to each item is (4+5)/2 = 4.5. The next item will be assigned rank 6. If a value
is repeated three times starting at rank 7, then the common rank to be assigned to each of
these values is the average (7+8+9)/3 = 8. The next rank to be assigned will be 10.
If a large proportion of the ranks are tied, then we need to apply an adjustment: for
each value that is repeated m times, a correction factor m(m² - 1)/12 is added to ∑d².
Problem Statement:
A psychologist wanted to compare two methods A and B of teaching. He selected a
random sample of 22 students. He grouped them into 11 pairs so that the students in
a pair have approximately equal scores on an intelligence test. In each pair one
student was taught by method A and the other by method B, and both were examined after
the course. The marks obtained by them are tabulated below:
A 24 29 19 14 30 19 27 30 20 28 11
B 37 35 16 26 23 27 19 20 16 11 21
Find the rank correlation coefficient.
Solution: Let X denote the scores of students taught by method A and Y denote the
scores of students taught by method B. Ranks are assigned in decreasing order of score
(rank 1 = highest score).
For X, the value 30 is repeated twice, hence the rank of 30 is the average of rank 1 and
rank 2, i.e. (1+2)/2 = 1.5. The rank 1.5 is assigned to both values 30 of X. Since ranks 1
and 2 have been used, the next highest value of X, 29, is assigned rank 3.
Similarly, the value 19 is repeated twice in X, hence the average rank (8+9)/2 = 8.5 is
assigned to each value 19, and the next value 14 is assigned rank 10.
Similarly, the value 16 in Y is repeated twice, hence the average rank (9+10)/2 = 9.5 is
assigned to each value 16, and the next value 11 in Y is assigned rank 11.
X     Y     Rank(X)   Rank(Y)   d = Rank(X) – Rank(Y)   d²
24    37    6         1         5                       25
29    35    3         2         1                       1
19    16    8.5       9.5       -1                      1
14    26    10        4         6                       36
30    23    1.5       5         -3.5                    12.25
19    27    8.5       3         5.5                     30.25
27    19    5         8         -3                      9.00
30    20    1.5       7         -5.5                    30.25
20    16    7         9.5       -2.5                    6.25
28    11    4         11        -7                      49.00
11    21    11        6         5                       25.00
                                                        ∑d² = 225.0
19 is repeated 2 times hence correction factor for 19 = 2((2*2)-1)/12 = 0.5
30 is repeated 2 times hence correction factor for 30 = 2((2*2)-1)/12 = 0.5
16 is repeated 2 times hence correction factor for 16 = 2((2*2)-1)/12 = 0.5
We need to add the correction factors of 16, 19 and 30 to ∑d²:
ρ = 1 – 6(225 + 0.5 + 0.5 + 0.5)/(11(121-1)) = 1 – 1359/1320 = 1 – 1.0295 = -0.0295
Comment: there exists a very weak negative correlation between the marks obtained under
the two teaching methods A and B.
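The tie handling used in this solved problem can be sketched in Python as follows. The function names average_ranks, tie_correction and spearman_tied are assumptions made for illustration. Ranks are assigned in decreasing order (rank 1 = largest value), tied values receive the mean of the ranks they occupy, and a correction factor m(m² − 1)/12 is added to ∑d² for every value repeated m times.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their ranks."""
    first_position = {}
    for position, value in enumerate(sorted(values, reverse=True), start=1):
        first_position.setdefault(value, position)
    counts = Counter(values)
    # A value repeated m times occupies ranks p, p+1, ..., p+m-1,
    # whose mean is p + (m - 1) / 2.
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over all values repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_tied(xs, ys):
    """Spearman's rho with the tie correction added to the sum of d^2."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

method_a = [24, 29, 19, 14, 30, 19, 27, 30, 20, 28, 11]
method_b = [37, 35, 16, 26, 23, 27, 19, 20, 16, 11, 21]
rho_teaching = spearman_tied(method_a, method_b)
```

For the teaching data this reproduces ∑d² = 225 and a total correction of 1.5, so rho_teaching agrees with the hand calculation 1 − 1359/1320.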
Problem statement:
From the following data calculate the rank correlation coefficient after making
adjustment for tied ranks.
X 48 33 40 9 16 16 65 24 16 57
Y 13 13 24 6 15 4 20 9 6 19
Solution :
X     Y     Rank(X)   Rank(Y)   d = Rank(X) – Rank(Y)   d²
48    13    3         5.5       -2.5                    6.25
33    13    5         5.5       -0.5                    0.25
40    24    4         1         3                       9
9     6     10        8.5       1.5                     2.25
16    15    8         4         4                       16
16    4     8         10        -2                      4
65    20    1         2         -1                      1
24    9     6         7         -1                      1
16    6     8         8.5       -0.5                    0.25
57    19    2         3         -1                      1
                                                        ∑d² = 41
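Value 16 in X is repeated 3 times, hence take the average (7+8+9)/3 = 8 and assign 8 to
each value 16 in X.
Correction factor for 16 = 3 ((3*3) - 1)/12 = 2 (note 16 is repeated three times)
Value 13 in Y is repeated 2 times, hence take the average (5+6)/2 = 5.5 and assign 5.5 to
each value 13 in Y.
Correction factor for 13 = 2 ((2*2) - 1)/12 = 0.5 (note 13 is repeated two times)
Value 6 in Y is repeated 2 times, hence take the average (8+9)/2 = 8.5 and assign 8.5 to
each value 6 in Y.
Correction factor for 6 = 2 ((2*2) - 1)/12 = 0.5 (note 6 is repeated two times)
ρ = 1 – 6(41 + 2 + 0.5 + 0.5)/(10(100-1)) = 1 – 264/990 = 1 – 0.2667 = 0.7333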
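Kendall’s tau rank correlation coefficient is computed as
$$\tau = \frac{C - D}{C + D}$$
where C and D denote the numbers of concordant and discordant pairs, respectively:
C = |{ (i,j) | xi < xj and yi < yj }| ,
D = |{ (i,j) | xi < xj and yi >= yj }| .
For a pair of objects ordered by the first variable (xi < xj), the pair is concordant
when the rank of the second variable is also greater for the second object (yi < yj).
The pair is discordant when the rank of the second variable for the second object is
equal to or less than that for the first object (yi >= yj).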
Rank correlation coefficients like Spearman’s rho and Kendall’s tau depend only on the
order (ranks) of the values and are therefore more robust against extreme outliers than
Pearson’s correlation coefficient. For categorical attributes, these correlation coefficients
are not applicable.
Example: Compute Kendall’s tau coefficient for the rankings provided by the master and
student 1 for 7 paintings. Interpret the result.
Rank1 (x)   Rank2 (y)
6           4
5           7
7           5
4           2
1           1
2           3
3           6
Solution: Sort Rank1 in increasing order and arrange Rank2 accordingly, as shown below.
Then, for each row, count the number of concordant (C) and discordant (D) pairs it forms
with the rows below it.
Rank1   Rank2   C        D
1       1       6        0
2       3       4        1
3       6       1        3
4       2       3        0
5       7       0        2
6       4       1        0
7       5
Sum             C = 15   D = 6
To calculate Kendall’s tau, count up the total number of C’s and D’s: C = 15, D = 6.
τ = (C-D) /(C+D) = (15 − 6)/(15 + 6) = 9/21 = 0.42857
Rank1 and Rank2 are moderately positively correlated.
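The counting procedure can be sketched in Python as follows. The function name kendall_tau is an assumption; the code follows the definition used in these notes (a pair with yi >= yj counts as discordant) and assumes that the first ranking contains no ties.

```python
def kendall_tau(x_ranks, y_ranks):
    """Return (C, D, tau) with tau = (C - D) / (C + D).

    The pairs are sorted by the first ranking; every pair of rows is then
    compared on the second ranking only.
    """
    ys = [y for _, y in sorted(zip(x_ranks, y_ranks))]
    concordant = discordant = 0
    for i in range(len(ys)):
        for j in range(i + 1, len(ys)):
            if ys[i] < ys[j]:
                concordant += 1
            else:
                discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

# Rankings of the 7 paintings by the master (x) and student 1 (y).
rank1 = [6, 5, 7, 4, 1, 2, 3]
rank2 = [4, 7, 5, 2, 1, 3, 6]
c, d, tau = kendall_tau(rank1, rank2)
```

For the paintings data this gives C = 15 and D = 6, matching the table above.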
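Preference      1st   2nd   3rd   4th   5th   6th
Student 1 (x)   B     E     A     F     D     C
Student 2 (y)   F     C     A     B     D     E
Solution:
Ranks are not given directly. For Student 1, B is the first preference, E is the second
preference and so on. For Student 2, the first preference is F, the second is C and so on.
Arrange the data based on the preferences of Student 1:
University   Student 1 (x)   Student 2 (y)   Concordant C   Discordant D
B            1               4               2              3
E            2               6               0              4
A            3               3               1              2
F            4               1               2              0
D            5               5               0              1
C            6               2
Sum                                          ∑C = 5         ∑D = 10
τ = (C-D)/(C+D) = (5 − 10)/(5 + 10) = −5/15 = −0.333
The preferences of the two students are negatively correlated.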
Outlier Detection
An outlier is simply a value or data object that is far away or very different from all or
most of the other data. Outliers can correspond to erroneous data coming from wrong
measurements or typing mistakes when data are entered manually. Erroneous data
should be corrected, or, if this is not possible, they should be excluded from the data set
before further analysis steps are carried out. If one replaces the largest value in a sample
by an extremely large value or even infinity, the median will not change. But the mean
value will tend to infinity, even though only one value in the sample goes to infinity.
For numerical attributes, outlier detection is more difficult. We have already classified
certain data points in a box plot as outliers. For asymmetric distributions, box plots tend
to contain more outliers. Heavy-tailed distributions tend to show more outliers in a box
plot, whereas for a sample from a uniform distribution, we would expect no outliers at
all, no matter how large the sample size is. The standard assumption
for outlier tests for continuous attributes is that the underlying distribution is a normal
distribution. Grubbs' test is a test for outliers for normal distributions taking the sample
size into account.
Grubbs' test is defined for the hypotheses:
H0: There are no outliers in the data set
Ha: There is at least one outlier in the data set
The test statistic is
$$G = \frac{\max_{i} |x_i - \bar{x}|}{s}$$
with x̄ and s denoting the sample mean and standard deviation, respectively. The Grubbs
test statistic is the largest absolute deviation from the sample mean in units of the sample
standard deviation. This is the two-sided version of the test.
-------------------------------------
Example 1:
Apply Grubbs' test to check whether the value 6.8 in the given data set is an outlier.
Data: 3.3, 6.8, 2.9, 3.8, 3.1. Given: Gtable value for 5 observations = 1.672
Solution:
H0: there are no outliers (Grubbs' null hypothesis)
H1: an outlier is present
Mean = 3.98, std dev = 1.6
Gcal = |value to be tested – mean| / std dev = |6.8-3.98|/1.6 = 1.763
If Gcal > Gtable, then the null hypothesis H0 must be rejected.
Since 1.763 > 1.672, H0 is rejected: the value 6.8 is an outlier and needs to be removed.
After removal of 6.8, mean = 3.275 and std dev = 0.39. We can observe that the standard
deviation is reduced, since the outlier is no longer part of the data.
Example 2. Check whether the value 5 is an outlier in the given data set using Grubbs' test.
Data: 5, 10, 9.5, 9.8, 9.9. Given: Gtable value for 5 observations = 1.672
Solution: Value to be tested = 5, mean = 8.84, std dev = 2.155
Gcal = |value to be tested – mean| / std dev = |5-8.84|/2.155 = 1.78
Since 1.78 > 1.672, the value 5 is an outlier and must be removed.
------------------------------------------------------------------
Note: sample Grubbs table
Number of observations (n)   G value for 95% confidence
4                            1.463
5                            1.672
6                            1.822
7                            1.938
8                            2.032
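The two worked examples can be reproduced with a short Python sketch. The critical values are the five entries of the sample Grubbs table above; the names G_TABLE_95, grubbs_statistic and is_outlier are assumptions made for illustration. The sample standard deviation (divisor n − 1) is used, as in the worked examples.

```python
import statistics

# Critical G values for 95% confidence, copied from the sample table.
G_TABLE_95 = {4: 1.463, 5: 1.672, 6: 1.822, 7: 1.938, 8: 2.032}

def grubbs_statistic(data, suspect=None):
    """|suspect - mean| / s, or the largest such value if no suspect is given."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    if suspect is None:
        return max(abs(x - mean) for x in data) / s
    return abs(suspect - mean) / s

def is_outlier(data, suspect):
    """Reject H0 (no outliers) when Gcal exceeds the tabulated value."""
    return grubbs_statistic(data, suspect) > G_TABLE_95[len(data)]

example_1 = [3.3, 6.8, 2.9, 3.8, 3.1]
example_2 = [5, 10, 9.5, 9.8, 9.9]
g1 = grubbs_statistic(example_1, 6.8)
g2 = grubbs_statistic(example_2, 5)
```

Because the code does not round the standard deviation to 1.6, g1 comes out slightly below the hand-calculated 1.763, but it still exceeds 1.672, so the conclusion is unchanged.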
Partial correlation measures the strength of a relationship between two variables, while controlling for
the effect of one or more other variables. (Note: In partial correlation, there is no distinction between
independent and dependent variables.)
Simple correlation does not take the effect of the other variables into account, as they are completely
ignored. In the case of partial correlation, the effect of the controlled variable(s) is held constant.
Examples:
(i) Studying the relationship between fertilizer and crop yield, keeping the weather conditions
constant.
(ii) Effect of milk intake on body weight, keeping age constant.
(iii) Studying the relationship between anxiety level and academic achievement, while controlling for
intelligence.
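As an illustration, the standard first-order partial correlation between variables 1 and 2, controlling for variable 3, is r12.3 = (r12 − r13·r23) / sqrt((1 − r13²)(1 − r23²)). This formula is not stated in the notes above and is added here as a sketch only; the function name partial_correlation and the three simple correlation values are hypothetical.

```python
import math

def partial_correlation(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 given variable 3.

    r12, r13 and r23 are the simple (Pearson) correlations between the
    respective pairs of variables; r13 and r23 must not be +1 or -1.
    """
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical values: 1 = fertilizer, 2 = crop yield, 3 = weather.
r_fert_yield = 0.8
r_fert_weather = 0.6
r_yield_weather = 0.5
r_partial = partial_correlation(r_fert_yield, r_fert_weather, r_yield_weather)
```

When the controlled variable is uncorrelated with both other variables (r13 = r23 = 0), the partial correlation equals the simple correlation r12.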
The slope parameter β1 is of particular interest since it indicates how the expected value of
the dependent variable depends upon the explanatory variable x, as shown in the figure.
Problem: For the given data find the line of regression of Y on X. Also find the value of
Y given X = 100.
Data :
X 91 97 108 121 67 154 54 73 111 57
Y 71 75 69 97 70 91 39 61 80 47
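One way to work this problem is the ordinary least-squares fit, sketched below in Python. The notes do not show the solution method, so the formulas β1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β0 = ȳ − β1·x̄ are stated here as the standard least-squares estimates; the names fit_line, beta0 and beta1 are chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares regression line of Y on X: y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1

xs = [91, 97, 108, 121, 67, 154, 54, 73, 111, 57]
ys = [71, 75, 69, 97, 70, 91, 39, 61, 80, 47]

beta0, beta1 = fit_line(xs, ys)
y_at_100 = beta0 + beta1 * 100   # predicted Y for X = 100
```

For this data the means are x̄ = 93.3 and ȳ = 70, the slope is about 0.495, and the predicted value of Y at X = 100 is about 73.3.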