Unit 2 Data Analytics Asha Karegowda May2021

Karegowda DataAnalytics Unit1 Part 1 Notes

VII Sem Data Analytics (OE41) Unit 2 notes II Unit

Unit 2.
Topics covered:
Data Visualization Methods for Two and Higher Dimensional Data
(Scatter plot, radar plot, parallel plot)
Correlation Analysis
• Karl Pearson correlation coefficient
• Kendall’s tau rank correlation coefficient (or simply Kendall’s tau)
• Spearman’s rank correlation coefficient (or Spearman’s rho)
• Partial and multiple correlations
• Simple linear regression
Outlier Detection for Single and Multidimensional Data
Reference: Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn,
Guide to Intelligent Data Analysis, Springer,
and various web sites

a) Two dimensional plot : Scatter Plot: Scatter plots refer to displays where two
attributes are plotted against each other. The two axes of the coordinate system
represent the two considered attributes, and each instance in the data set is represented
by a point, a circle, or any other symbol.
It should be noted that the scatter plots—like all other visualization techniques—
are very useful tools to discover simple structures and patterns or peculiar deviations like
outliers in a data set. But there is no guarantee that a scatter plot or any visualization
technique will automatically show all or even any interesting or deviating pattern in the
data set. A scatter plot with no outliers does not mean that there are no outliers in the data
set. It only means that there are no outliers with respect to the combination of the
attributes displayed in the scatter plot.

[Figure: scatter plot of y against X, before applying the jitter effect]

X     y
5     15
3     6
2     10
2     15
3     6
5     15

Dr. Asha Gowda Karegowda, Associate Professor, Dept. of MCA, SIT

[Figure: scatter plot of y against X, after applying the jitter effect]

X     y
5     15.3
3.2   6
2.4   10
2     15
3     6
5     15

What is the jitter effect? Some points are plotted at exactly the same position because
two different objects have the same measured values. To avoid the impression of seeing
fewer objects than there actually are, one can add jitter (i.e., a small random value) to
the scatter plot. Instead of plotting the symbols exactly at the coordinates specified by
the values in the data set, we add a small random value to each original value in the data
table. The jitter should be kept small enough that a point originally lying to the left of
or below another point remains to the left of or below that point after the jitter is added.
Jitter is essential when categorical attributes are used for the coordinate axes of a
scatter plot: categorical attributes have only a limited number of possible values, so
without jitter many objects are plotted at exactly the same position.
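
The jitter idea can be sketched in a few lines of Python. This is only an illustration: the function name add_jitter, the noise range (scale = 0.2) and the fixed seed are assumptions made for the example, not values prescribed by the notes.

```python
import random


def add_jitter(points, scale=0.2, seed=0):
    """Return a copy of points with a small uniform random offset on each coordinate.

    scale and seed are illustrative choices, not values taken from the notes.
    """
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
        for x, y in points
    ]


# (X, y) pairs from the "before jitter" table; two of the pairs occur twice.
points = [(5, 15), (3, 6), (2, 10), (2, 15), (3, 6), (5, 15)]
jittered = add_jitter(points)

print("distinct positions before jitter:", len(set(points)))
print("distinct positions after jitter: ", len(set(jittered)))
```

Six objects occupy only four distinct positions before jittering; after jittering, each object gets its own position, so none is hidden behind another.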

Example: Selling Price of a House Scatter plot


A real estate agent wanted to know to what extent the selling price of a house is
related to its size. The scatter plot suggests that there is in fact a relationship: the
greater the house size, the greater the selling price (a positive relationship).
Different patterns of scatter plot

4.3.2 Visualization Methods for Higher-Dimensional Data :


Principal Component Analysis, Parallel Coordinates, Radar and Star Plots

a) Parallel Coordinates :
Parallel coordinates is a common way of visualizing high-dimensional geometry and
analyzing multivariate data. Parallel coordinates draw the coordinate axes parallel to each
other, so that there is no limitation for the number of axes to be displayed. For a data
object, a polyline is drawn connecting the values of the data object for the attributes on
the corresponding axes.
A parallel coordinate plot maps each row in the data table as a line, or profile. Each
attribute of a row is represented by a point on the line. This makes parallel coordinate
plots similar in appearance to line charts, but the way data is translated into a plot is
substantially different. For larger data sets, it becomes more or less impossible to track
the lines that correspond to a data object in parallel coordinates plots or even to discover
general structures. It can be helpful to generate separate parallel coordinate plots for
different subsets of the data.
Consider, for example, a data table where a laboratory has measured the amount of
various carbohydrates contained in various fruit and vegetables.
http://stn.spotfire.com/spotfire_client_help/para/para_what_is_a_parallel_coordinate_plot.htm

For each food type, it is now possible to plot a profile of how the carbohydrates are
distributed. The technicians in the laboratory can now see which food types are similar to
each other in carbohydrate distribution, by comparing the profiles to each other. This is
where the parallel coordinate plot is really useful, to compare profiles in order to find
similarities.

The values in a parallel coordinate plot are always normalized. This means that for each
point along the X-axis, the lowest value in the corresponding column is set to 0% and the
highest value in that column is set to 100% along the Y-axis. The scale of the various
columns is totally separate, so do not compare the height of the curve in one column to
the height of the curve in another column.
Min-max normalization is done as follows.
For a given attribute A, a value V is normalized to V’ as
V’ = (V - Vmin) / (Vmax - Vmin)
where Vmin and Vmax are the minimum and maximum values of the attribute A to which V belongs.
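
A minimal Python sketch of min-max normalization is given below, applied to the glucose values used in the worked example of these notes. The function name min_max_normalize and the handling of a constant column (all values mapped to 0.0) are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: mapping everything to 0.0 is a convention assumed here.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Glucose values of the six food types from the worked example.
glucose = {
    "Apple": 2.10,
    "Bananas": 4.40,
    "Corn": 0.60,
    "Cucumber": 0.70,
    "Lettuce": 1.30,
    "Tomatoes": 1.30,
}
normalized = dict(zip(glucose, min_max_normalize(list(glucose.values()))))

for food, value in normalized.items():
    print(f"{food:9s} {value:6.1%}")
```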

Normalized data using the min-max normalization method

New value = (old value - min) / (max - min)

Glucose (min = 0.60, max = 4.40):
Apple glucose value = 2.10
New value = (2.10 - 0.60) / (4.40 - 0.60) = 0.3947 ≈ 39.5%
Banana glucose value = 4.40
New value = (4.40 - 0.60) / (4.40 - 0.60) = 1.0 = 100%
Corn glucose value = 0.60
New value = (0.60 - 0.60) / (4.40 - 0.60) = 0.0 = 0%
Cucumber glucose value = 0.70
New value = (0.70 - 0.60) / (4.40 - 0.60) = 0.0263 = 2.6%
Lettuce glucose value = 1.30
New value = (1.30 - 0.60) / (4.40 - 0.60) = 0.184 = 18.4%
Tomatoes glucose value = 1.30
New value = (1.30 - 0.60) / (4.40 - 0.60) = 0.184 = 18.4%

Other attributes for Apple:
Apple fructose value = 4.50
New value = (4.50 - 0.20) / (4.50 - 0.20) = 1.0 = 100%
Apple maltose value = 0.0
New value = (0.0 - 0.0) / (0.3 - 0.0) = 0.0 = 0%
Apple saccharose value = 1.30
New value = (1.30 - 0.0) / (6.40 - 0.0) = 0.203 = 20.3%

Food        Glucose   Fructose   Maltose   Saccharose
Apple       39.5%     100%       0%        20.3%
Bananas     100%      46.5%      0%        100%
Corn        0%        0%         100%      35.93%
Cucumber    2.6%      11.62%     0%        0%
Lettuce     18.4%     16.27%     0%        0%
Tomatoes    18.4%     41.86%     0%        0%

Students    Physics   Chemistry   Maths
College1    70        120         100
College2    50        80          30
College3    30        80          20
College4    90        40          150

Normalized data:

Students    Physics   Chemistry   Maths
College1    66.67%    100%        61.54%
College2    33.33%    50%         7.69%
College3    0%        50%         0%
College4    100%      0%          100%
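
The column-wise normalization behind a parallel coordinate plot can be checked with a short Python sketch. Each column is scaled separately, so a row becomes a profile of values between 0 and 1; for instance, College2's Physics mark of 50 maps to (50 - 30)/(90 - 30) = 33.33%. The function name normalize_columns is an assumption made for this example.

```python
def normalize_columns(rows):
    """Min-max normalize each column of a list of equal-length numeric rows."""
    scaled_columns = []
    for column in zip(*rows):  # iterate over columns
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant column is mapped to 0.0 (a convention assumed here).
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in column])
    # Transpose back so that each entry is the profile of one row.
    return [list(profile) for profile in zip(*scaled_columns)]


subjects = ["Physics", "Chemistry", "Maths"]
marks = {
    "College1": [70, 120, 100],
    "College2": [50, 80, 30],
    "College3": [30, 80, 20],
    "College4": [90, 40, 150],
}
profiles = dict(zip(marks, normalize_columns(list(marks.values()))))

for college, profile in profiles.items():
    print(college, [f"{100 * p:.2f}%" for p in profile])
```

Each printed profile is the sequence of heights at which that college's polyline would cross the Physics, Chemistry and Maths axes.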

b) Radar and Star Plots


A radar chart is also known as a spider chart or star plot because it looks like a spider’s
web or a star. It is a graphical method of displaying multivariate data in the form of
a two-dimensional chart of three or more quantitative and qualitative variables
represented on axes starting from the same point, the origin. The chart consists of a
sequence of equi-angular spokes, called radii, with each spoke representing one of the
variables. The length of a spoke is proportional to the magnitude of the variable for the
data point relative to the maximum magnitude of the variable across all data points. A
line is drawn connecting the data values on the spokes, which gives the plot its star-like
appearance and is the origin of one of its popular names. Other names for the radar chart
are web chart, star chart, cobweb chart, irregular polygon, polar chart, and Kiviat diagram.

Radar plots are only suited for smaller data sets. For such smaller data sets, it is
sometimes better not to draw all data objects in the system of coordinate axes but to draw
each data object separately, which is then called a star plot.
Purpose: Radar chart tries to answer the questions like- which observations are most
similar, i.e., are there clusters of observations? And are there outliers? What is the status
of different variables before and after improvement activities? Which variables are more
important to observe intensively among many variables? and many other questions.

Example: Plot a radar chart for the given scores in 5 subjects of Section A and
Section B students. The maximum score in each subject is 100.

        Section A   Section B
Sub1    50          90
Sub2    40          80
Sub3    100         40
Sub4    50          80
Sub5    40          100
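
To make the geometry of the radar chart concrete, the following Python sketch computes the polygon vertices for each section: one spoke per subject at equal angles, with the distance from the origin proportional to score / maximum score. The function name radar_vertices and the choice of drawing the first spoke at 12 o'clock and proceeding clockwise are assumptions made for illustration; plotting tools may use a different orientation.

```python
import math


def radar_vertices(values, max_value):
    """Return the (x, y) polygon vertices of one data object on a radar chart.

    Spokes are equally spaced; the radius along each spoke is value / max_value,
    so the outer ring of the chart has radius 1.
    """
    n = len(values)
    vertices = []
    for k, v in enumerate(values):
        # Assumed orientation: first spoke at 12 o'clock, then clockwise.
        angle = math.pi / 2 - 2 * math.pi * k / n
        r = v / max_value
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices


section_a = [50, 40, 100, 50, 40]   # Sub1..Sub5
section_b = [90, 80, 40, 80, 100]

for name, scores in (("Section A", section_a), ("Section B", section_b)):
    rounded = [(round(x, 2), round(y, 2)) for x, y in radar_vertices(scores, 100)]
    print(name, rounded)
```

Connecting the vertices of a section in order (and closing the polygon) gives that section's outline on the chart.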

Plot a radar chart for the given data.

        School A     School B     School C
        (Max 100)    (Max 200)    (Max 150)
Sub1    80           150          70
Sub2    60           190          80
Sub3    20           120          110
Sub4    40           110          120
Sub5    90           80           50

Note: The maximum score differs for each school. Hence, before plotting for comparison, we
need to bring the scores to a common maximum. Let us take the maximum marks as 100 for each
school. In that case School A marks remain unchanged.
School B marks are changed to (marks scored / 200) * 100.
For example, the score 110 in subject 4 for School B is reduced to (110/200) * 100 = 55.
School C marks are changed to (marks scored / 150) * 100.
For example, the score 120 in subject 4 for School C is reduced to (120/150) * 100 = 80.

The final marks, with a maximum score of 100 for each school, are shown in the table below
(rounded to one decimal place).

        School A (Max 100)   School B (Max 100)   School C (Max 100)
Sub1    80                   75                   46.7
Sub2    60                   95                   53.3
Sub3    20                   60                   73.3
Sub4    40                   55                   80
Sub5    90                   40                   33.3
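
The rescaling to a common maximum can be reproduced with a small Python sketch. The function name rescale_to_common_max is an assumption for illustration; the arithmetic is simply (marks scored / old maximum) * 100.

```python
def rescale_to_common_max(scores, old_max, new_max=100):
    """Rescale scores measured out of old_max so that they are out of new_max."""
    return [s / old_max * new_max for s in scores]


# Scores for Sub1..Sub5 together with the maximum marks of each school.
schools = {
    "School A": ([80, 60, 20, 40, 90], 100),
    "School B": ([150, 190, 120, 110, 80], 200),
    "School C": ([70, 80, 110, 120, 50], 150),
}
rescaled = {
    name: rescale_to_common_max(scores, old_max)
    for name, (scores, old_max) in schools.items()
}

for name, scores in rescaled.items():
    print(name, [round(s, 1) for s in scores])
```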

Methods of Studying Correlation


There are different methods which help us to find out whether the variables are related or
not.
1. Scatter Diagram Method
2. Graphic Method.
3. Karl Pearson’s Coefficient of correlation.
4. Spearman Rank Method.
5. Kendall Tau Rank Method
6. Partial correlation
7. Multiple correlation
8. Simple linear regression (predicting one variable using another)

(1) Scatter Diagram: A scatter diagram is drawn to visualize the relationship between two
variables. The values of the more important variable are plotted on the X-axis, while the
values of the other variable are plotted on the Y-axis. On the graph, dots are plotted to
represent the different pairs of data. When dots are plotted to represent all the pairs, we
get a scatter diagram. The way the dots scatter gives an indication of the kind of
relationship that exists between the two variables. While drawing a scatter diagram, it is
not necessary to place the origin at the zero values of the X and Y variables; the minimum
values of the variables considered may be taken instead.

When there is a positive correlation between the variables, the dots on the scatter diagram
run from left hand bottom to the right hand upper corner. In case of perfect positive
correlation all the dots will lie on a straight line.

When a negative correlation exists between the variables, dots on the scatter diagram run
from the upper left hand corner to the bottom right hand corner. In case of perfect
negative correlation, all the dots lie on a straight line.

(2) Graphic Method.

In this method the individual values of the two variables are plotted on graph paper, so
that two curves are obtained, one for the X variable and another for the Y variable.
Interpreting the graph: The graph is interpreted as follows:
(i) If both curves run parallel or nearly parallel, or move in the same direction, there
is a positive correlation.
(ii) On the other hand, if the curves move in opposite directions, there is a
negative correlation.

3) Pearson’s correlation coefficient

The (sample) Pearson’s correlation coefficient is a measure of the linear relationship
between two numerical attributes X and Y and is defined as

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y and n is the number of
observations. The coefficient always satisfies -1 ≤ r ≤ 1.
The larger the absolute value of the Pearson correlation coefficient, the stronger the linear
relationship between the two attributes. For |r| = 1 the values of X and Y lie exactly on a
line. Positive (negative) correlation indicates a line with positive (negative) slope.
Pearson’s correlation coefficient measures linear correlation only. Even if there is a
functional dependency between two attributes, Pearson’s correlation coefficient will not be
-1 or 1 when the function is monotone but nonlinear. It can even be far away from these
values, depending on how much the function describing the functional relationship
deviates from a line.
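
The following Python sketch computes Pearson's correlation coefficient from deviations about the means and illustrates the last point: an exactly linear relationship gives r = 1, while a monotone but nonlinear relationship gives a smaller value. The data (x from 1 to 6, y = 2x + 1 and y = x cubed) are made up for illustration and are not taken from the notes.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5, 6]              # illustrative data
linear = [2 * x + 1 for x in xs]     # exact linear dependency
cubic = [x ** 3 for x in xs]         # monotone but nonlinear dependency

r_linear = pearson_r(xs, linear)
r_cubic = pearson_r(xs, cubic)
print("r for y = 2x + 1:", round(r_linear, 4))
print("r for y = x^3:   ", round(r_cubic, 4))
```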
Problem: Compute Pearson’s correlation coefficient for the given data. Comment on the
result

Note: Alternate method The Pearson’s correlation coefficient can also be computed
using the following formulae.
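
A commonly used computational form, algebraically equivalent to the definition in terms of deviations from the means, is sketched below. Whether this is exactly the form intended at this point of the notes is an assumption.

$$ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} $$

Here n is the number of (x, y) pairs and all sums run over i = 1, ..., n. This form needs only the column totals of x, y, xy, x² and y², which makes it convenient for hand calculation in a table.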

A Pearson’s correlation coefficient of 0.998 indicates a high positive correlation between x and y.

4) Rank correlation coefficients

There are many problems in business and industry where it is not possible to measure the
variable under consideration quantitatively, or where the statistical series is composed of
items that cannot be measured exactly. For instance, two judges may be able to rank six
different brands of cookies in terms of taste, whereas it may be difficult to give them a
numerical grade for taste. In such problems, Spearman’s coefficient of rank correlation is
used.
Rank correlation ignores the exact numerical values of the attributes and considers only the
ordering of the values. Rank correlation coefficients are intended to measure monotonic
correlations between attributes, where the monotonic function does not have to be linear.
Spearman’s rank correlation coefficient (Spearman’s rho) is defined as

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} \left(r(x_i) - r(y_i)\right)^2}{n(n^2 - 1)} $$

where r(xi) is the rank of value xi when we sort the list (x1, . . . , xn) in increasing order.
r(yi) is defined analogously. When the rankings of the x- and y-values are exactly in the
same order, Spearman’s rho will yield the value 1. If they are in reverse order, we will
obtain the value -1. The above equation is also written as

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where di = r(xi) - r(yi), i.e., the difference of the ranks for X and Y.


Note : When we are given the actual data and not the ranks, it becomes necessary for us
to assign the ranks. Ranks can be assigned by taking either the highest value as one ( or
the lowest value as one). But if we start by taking the highest value or the lowest value
we must follow the same order for both the variables to assign ranks.

Example 1. Compute Spearman Rank correlations ( with no ranks given)


The scores for nine students in physics and math are as follows:
Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
Mathematics:30, 33, 45, 23, 8, 49, 12, 4, 31
Compute the student’s ranks in the two subjects and compute the Spearman rank
correlation.
Step 1: Find the ranks for each individual subject. Assign rank 1 to the highest score, 2
to the next highest, and so on.
Step 2: Compute d as the difference between the ranks. For example, the first student’s
physics rank is 3 and math rank is 5, so the difference is d = 3 - 5 = -2. Then square the d
values.

Step 3: Sum (add up) all of the d-squared values.

Physics   Maths   Rank(x)   Rank(y)   d = rank(x) - rank(y)   d²
35        30      3         5         -2                      4
23        33      5         3         2                       4
47        45      1         2         -1                      1
17        23      6         6         0                       0
10        8       7         8         -1                      1
43        49      2         1         1                       1
9         12      8         7         1                       1
6         4       9         9         0                       0
28        31      4         4         0                       0
                                                      Σ d² = 12

Step 4: Insert the values into the formula. These ranks are not tied, so the formula is used
without any correction for ties:

n = 9
ρ = 1 - 6 Σd² / (n(n² - 1)) = 1 - (6*12)/(9(81-1)) = 1 - 72/720 = 1 - 0.1 = 0.9

The Spearman rank correlation for this set of data is 0.9, which shows a high positive
correlation between the marks scored by the students in Maths and Physics.

The Spearman correlation coefficient, rs, can take values from +1 to -1. An rs of +1
indicates a perfect association of ranks, an rs of zero indicates no association between
ranks, and an rs of -1 indicates a perfect negative association of ranks. The closer rs is to
zero, the weaker the association between the ranks.
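
Example 1 can be reproduced with a short Python sketch. The helper names ranks_descending and spearman_rho are assumptions made for illustration, and the ranking helper assumes that no value is repeated, which holds for these marks.

```python
def ranks_descending(values):
    """Rank 1 for the highest value; assumes all values are distinct (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(xs)
    rank_x, rank_y = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths = [30, 33, 45, 23, 8, 49, 12, 4, 31]

print("physics ranks:", ranks_descending(physics))
print("maths ranks:  ", ranks_descending(maths))
print("Spearman rho: ", round(spearman_rho(physics, maths), 4))
```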

Note: In some cases it becomes necessary to give two or more items an identical rank. In
such cases, it is customary to give each item an average rank. For example, if two items
are tied for the 4th and 5th ranks, each item is ranked 4.5. In other words, where two or
more items are to be ranked equal, the rank assigned for the purpose of calculating the
coefficient of correlation is the average of the ranks these items would have received had
they differed slightly from each other. When equal ranks are assigned to some items, the
rank correlation formula is also adjusted.

Example 2 Compute Spearman Rank correlations (Ranks are given)


Ten competitors in a beauty contest are ranked by 3 judges in the following order.
Use the rank correlation coefficient to determine which pair of judges has the
nearest approach to common tastes in beauty.

Note in example1, the ranks were not given, instead, physics and maths marks were
given, we need to convert it to rank since we need to compute the Spearman Rank
Correlation. In example 2, the three judges are directly ranking the contestants, hence no
need to convert to rank.

Judge1 1 6 5 10 3 2 4 9 7 8
Judge2 3 5 8 4 7 10 2 1 6 9
Judge3 6 4 9 8 1 2 3 10 5 7

Solution: Let R1, R2 and R3 denote the ranks given by the first, second and third judges
respectively, and let ρij be the rank correlation coefficient between the ranks given by the
ith and jth judges, i ≠ j = 1, 2, 3.
Let dij = Ri - Rj be the difference of the ranks given to an individual by the ith and jth judges.

R1    R2    R3    d12 = R1-R2   d13 = R1-R3   d23 = R2-R3   d12²   d13²   d23²
1     3     6     -2            -5            -3            4      25     9
6     5     4     1             2             1             1      4      1
5     8     9     -3            -4            -1            9      16     1
10    4     8     6             2             -4            36     4      16
3     7     1     -4            2             6             16     4      36
2     10    2     -8            0             8             64     0      64
4     2     3     2             1             -1            4      1      1
9     1     10    8             -1            -9            64     1      81
7     6     5     1             2             1             1      4      1
8     9     7     -1            1             2             1      1      4
                                                    Σ       200    60     214

n = 10

ρ12 = 1 - 6 Σd12² / (n(n² - 1)) = 1 - (6*200)/(10*99) = -0.2121

ρ13 = 1 - 6 Σd13² / (n(n² - 1)) = 1 - (6*60)/(10*99) = 0.6364

ρ23 = 1 - 6 Σd23² / (n(n² - 1)) = 1 - (6*214)/(10*99) = -0.2970

Comment: Since ρ13 is the maximum, the pair of first and third judges has the nearest
approach to common tastes in beauty.

Since ρ12 and ρ23 are negative, the pairs of judges (1,2) and (2,3) have opposite
(divergent) tastes in beauty.
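
The comparison of the three judges can be checked with the Python sketch below, which applies Spearman's formula directly to the given ranks for every pair of judges. The function name spearman_from_ranks is an assumption made for illustration.

```python
from itertools import combinations


def spearman_from_ranks(r1, r2):
    """Spearman's rho for two untied rankings of the same n items."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Ranks given to the ten competitors by each judge.
judges = {
    1: [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    2: [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    3: [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

rho = {
    (i, j): spearman_from_ranks(judges[i], judges[j])
    for i, j in combinations(judges, 2)
}
for pair, value in rho.items():
    print("judges", pair, "rho =", round(value, 4))

closest = max(rho, key=rho.get)
print("pair with the nearest approach to common tastes:", closest)
```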

Example 3 (Tie case: two or more values of a given attribute are repeated, i.e., repeated
ranks)
If there is a tie, i.e., if two or more individuals are placed together in a classification
with respect to an attribute, or if in variable data more than one item has the same value
in either or both series, then Spearman’s formula for calculating the rank correlation
coefficient breaks down, since in this case the ranks of X and Y do not take the values
from 1 to n.
Common ranks are assigned to the repeated items. The common rank is the arithmetic mean of
the ranks these items would have received had they been different from each other, and the
next item gets the rank following the ranks used in computing the common rank. For example,
if rank 4 is repeated two times, the common rank assigned to each item is (4+5)/2 = 4.5,
and the next item is assigned rank 6. If an item is repeated thrice at rank 7, the common
rank assigned to each of these values is the average (7+8+9)/3 = 8, and the next rank to be
assigned is 10.
If a large proportion of the ranks are tied, we need to apply an adjustment, or correction
factor, to the equation

$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

The correction factor

$$ \frac{m(m^2 - 1)}{12} $$

is added to Σd², where m is the number of times an item is repeated. This correction factor
is to be added for each repeated value in both series.
For example, if the value 7 is repeated three times, then cf(7) = 3(3*3 - 1)/12 = 3(9 - 1)/12 = 2
is added to Σd².

Problem Statement:
A psychologist wanted to compare two methods A and B of teaching. He selected a
random sample of 22 students. He grouped them into 11 pairs so that the students in
a pair have approximately equal scores on an intelligence test. In each pair one
student was taught by method A and the other by method B, and both were examined after
the course. The marks obtained by them are tabulated below:

A 24 29 19 14 30 19 27 30 20 28 11
B 37 35 16 26 23 27 19 20 16 11 21
Find the rank correlation coefficient.

Solution: Let X denote the scores of students taught by method A and Y denote the
scores of students taught by method B.
For X, the value 30 is repeated twice, hence the rank of 30 is the average of rank 1 and
rank 2, i.e., (1+2)/2 = 1.5. The rank 1.5 is assigned to both values 30 of X. Since ranks 1
and 2 have been used, the next highest value of X, 29, is assigned rank 3.
Similarly, the value 19 is repeated twice, hence the average rank (8+9)/2 = 8.5 is assigned
to each value 19 in column X, and the next value 14 is assigned rank 10.
Similarly, the value 16 in Y is repeated twice, hence the average (9+10)/2 = 9.5 is assigned
to each value 16, and the next value 11 in Y is assigned rank 11.
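
The assignment of common (average) ranks can be sketched in Python as follows. The function name average_ranks is an assumption made for illustration; rank 1 goes to the highest value, as in the solution.

```python
def average_ranks(values):
    """Rank 1 = highest value; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # first rank position held by v
        count = ordered.count(v)               # number of times v is repeated
        rank_of[v] = first + (count - 1) / 2   # mean of first .. first + count - 1
    return [rank_of[v] for v in values]


x = [24, 29, 19, 14, 30, 19, 27, 30, 20, 28, 11]   # method A
y = [37, 35, 16, 26, 23, 27, 19, 20, 16, 11, 21]   # method B

print("Rank(X):", average_ranks(x))
print("Rank(Y):", average_ranks(y))
```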

X     Y     Rank(X)   Rank(Y)   d = Rank(X) - Rank(Y)   d²
24    37    6         1         5                       25
29    35    3         2         1                       1
19    16    8.5       9.5       -1                      1
14    26    10        4         6                       36
30    23    1.5       5         -3.5                    12.25
19    27    8.5       3         5.5                     30.25
27    19    5         8         -3                      9.0
30    20    1.5       7         -5.5                    30.25
20    16    7         9.5       -2.5                    6.25
28    11    4         11        -7                      49.00
11    21    11        6         5                       25.00
                                                  Σ d² = 225.0

Note: 19 is repeated 2 times in X, hence the correction factor for 19 = m(m² - 1)/12
= 2((2*2)-1)/12 = 0.5
30 is repeated 2 times in X, hence the correction factor for 30 = 2((2*2)-1)/12 = 0.5
16 is repeated 2 times in Y, hence the correction factor for 16 = 2((2*2)-1)/12 = 0.5
We need to add the correction factors of 16, 19 and 30 to Σd²:

ρ = 1 - 6(Σd² + 0.5 + 0.5 + 0.5) / (n(n² - 1)) = 1 - 6(225 + 1.5)/(11(121 - 1)) = 1 - 1359/1320 = -0.0295

Comment: there exists a very weak negative correlation between the marks obtained under the
two teaching methods (X and Y).

Problem statement:
From the following data calculate the rank correlation coefficient after making
adjustment for tied ranks.
X 48 33 40 9 16 16 65 24 16 57
Y 13 13 24 6 15 4 20 9 6 19

Solution:
X     Y     Rank(X)   Rank(Y)   d = Rank(X) - Rank(Y)   d²
48    13    3         5.5       -2.5                    6.25
33    13    5         5.5       -0.5                    0.25
40    24    4         1         3                       9
9     6     10        8.5       1.5                     2.25
16    15    8         4         4                       16
16    4     8         10        -2                      4
65    20    1         2         -1                      1
24    9     6         7         -1                      1
16    6     8         8.5       -0.5                    0.25
57    19    2         3         -1                      1
                                                  Σ d² = 41

The value 16 in X is repeated 3 times, hence take the average (7+8+9)/3 = 8. Rank 8 is
assigned to each value 16.
Correction factor for 16 = 3((3*3) - 1)/12 = 2 (note 16 is repeated three times)
The value 13 in Y is repeated 2 times, hence take the average (5+6)/2 = 5.5 and assign 5.5
to each value 13 in Y.
Correction factor for 13 = 2((2*2) - 1)/12 = 0.5 (note 13 is repeated two times)
The value 6 in Y is repeated 2 times, hence take the average (8+9)/2 = 8.5 and assign 8.5
to each value 6 in Y.
Correction factor for 6 = 2((2*2) - 1)/12 = 0.5 (note 6 is repeated two times)

ρ = 1 - 6(Σd² + 2 + 0.5 + 0.5) / (n(n² - 1)) = 1 - 6(41 + 3)/(10(100 - 1)) = 1 - 264/990 = 0.7333

Comment: there exists a strong positive correlation between X and Y.
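
The tie-corrected calculation can be sketched in Python as shown below: average ranks are assigned to repeated values, and a correction factor m(m² - 1)/12 is added to Σd² for every repeated value in either series. The function names are assumptions made for illustration; the sketch follows the correction-factor approach of these notes rather than any particular library routine.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 = highest value; tied values get the mean of the positions they occupy."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def spearman_rho_with_ties(xs, ys):
    """Spearman's rho with m(m^2 - 1)/12 added to sum(d^2) for each repeated value."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = sum(
        m * (m * m - 1) / 12
        for series in (xs, ys)
        for m in Counter(series).values()
        if m > 1
    )
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))


# Teaching-methods problem (methods A and B).
a = [24, 29, 19, 14, 30, 19, 27, 30, 20, 28, 11]
b = [37, 35, 16, 26, 23, 27, 19, 20, 16, 11, 21]
rho_teaching = spearman_rho_with_ties(a, b)

# Second problem (X and Y with tied ranks).
x = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]
rho_xy = spearman_rho_with_ties(x, y)

print("teaching methods:", round(rho_teaching, 4))
print("X and Y:         ", round(rho_xy, 4))
```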

c) Kendall’s tau rank correlation coefficient: Spearman’s rho is based on the difference
between the ranks of the X and Y attributes. Kendall’s tau is based on comparing the order
of pairs of values.
Assuming that xi < xj, the two pairs (xi, xj) and (yi, yj) are called concordant if yi < yj,
i.e., when the two pairs are in the same order.
They are called discordant when they are in reverse order, i.e., when yi > yj. Kendall’s tau
is computed as

$$ \tau = \frac{C - D}{C + D} $$

where C and D denote the numbers of concordant and discordant pairs, respectively:
C = |{ (i,j) | xi < xj and yi < yj }| ,
D = |{ (i,j) | xi < xj and yi > yj }| .

When the objects are sorted by the first variable, a later object forms a concordant pair
with the current object if its rank on the second variable is greater, and a discordant
pair if its rank on the second variable is smaller.

Rank correlation coefficients like Spearman’s rho and Kendall’s tau depend only on the
order (ranks) of the values and are therefore more robust against extreme outliers than
Pearson’s correlation coefficient. For categorical attributes, these correlation coefficients
are not applicable.

Example 1: Compute Kendall’s tau coefficient for the rankings given by a master and by
student 1 to 7 paintings. Interpret the result.

Rank1 (x)   Rank2 (y)
6           4
5           7
7           5
4           2
1           1
2           3
3           6

Solution: Sort by Rank1 and arrange Rank2 accordingly, as shown below. For each row, C is the
number of rows below it with a larger Rank2 value (concordant pairs) and D is the number of
rows below it with a smaller Rank2 value (discordant pairs).

Rank1   Rank2   C    D
1       1       6    0
2       3       4    1
3       6       1    3
4       2       3    0
5       7       0    2
6       4       1    0
7       5       -    -
        Sum     15   6

To calculate Kendall’s tau, count up the total number of C’s and D’s: C = 15, D = 6.
τ = (C - D)/(C + D) = (15 - 6)/(15 + 6) = 9/21 = 0.42857
Rank1 and Rank2 are fairly positively correlated.
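
The counting of concordant and discordant pairs can be sketched in Python by looking at every pair of paintings once. The function name kendall_tau is an assumption made for illustration; tied pairs, if any, are counted as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Return (C, D, tau) with tau = (C - D) / (C + D)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:        # both rankings order the pair the same way
            concordant += 1
        elif direction < 0:      # the rankings order the pair in opposite ways
            discordant += 1
        # direction == 0 means a tie; it is counted in neither group
    return concordant, discordant, (concordant - discordant) / (concordant + discordant)


master = [6, 5, 7, 4, 1, 2, 3]     # Rank1 (x)
student = [4, 7, 5, 2, 1, 3, 6]    # Rank2 (y)

c, d, tau = kendall_tau(master, student)
print("C =", c, "D =", d, "tau =", round(tau, 5))
```

Sorting by Rank1 first, as done in the table, is only a convenience for hand counting; the pairwise comparison gives the same totals without sorting.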

Example 2: Two students are considering applying to the same six universities
(A, B, C, D, E, F) for a master’s degree. Their orders of preference are as follows. Comment
on whether the two students agree in their university preferences.

Student 1 (x)   B E A F D C
Student 2 (y)   F C A B D E

Solution:
Ranks are not given directly. For Student 1, B is the first preference, E is the second
preference, and so on. For Student 2, the first preference is F, the second is C, and so on.
Arrange the data by the preference of Student 1:

University   Student 1 (x)   Student 2 (y)   Concordant C   Discordant D
B            1               4               2              3
E            2               6               0              4
A            3               3               1              2
F            4               1               2              0
D            5               5               0              1
C            6               2               -              -
                                             Σ 5            Σ 10

τ = (C - D)/(C + D) = (5 - 10)/(5 + 10) = -5/15 = -0.333

COMMENT: τ is negative, so there is disagreement between the students regarding their
preferences of universities.
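
The conversion of preference orders into ranks, followed by the count of concordant and discordant pairs, can be sketched in Python as follows. The helper name preference_ranks is an assumption made for illustration.

```python
def preference_ranks(order):
    """Map each item to its 1-based position in a preference list."""
    return {item: position for position, item in enumerate(order, start=1)}


student1 = "BEAFDC"   # Student 1: B first, then E, A, F, D, C
student2 = "FCABDE"   # Student 2: F first, then C, A, B, D, E

rank1 = preference_ranks(student1)
rank2 = preference_ranks(student2)

# Arrange the universities by Student 1's preference and read off Student 2's ranks.
universities = sorted(rank1, key=rank1.get)
y = [rank2[u] for u in universities]

n = len(y)
concordant = sum(1 for i in range(n) for j in range(i + 1, n) if y[j] > y[i])
discordant = sum(1 for i in range(n) for j in range(i + 1, n) if y[j] < y[i])
tau = (concordant - discordant) / (concordant + discordant)

print("Student 2 ranks in Student 1's order:", y)
print("C =", concordant, "D =", discordant, "tau =", round(tau, 3))
```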

Examples on Kendall’s tau with ties (repeated values)

Outlier Detection
An outlier is simply a value or data object that is far away or very different from all or
most of the other data. Outliers can correspond to erroneous data coming from wrong
measurements or typing mistakes when data are entered manually. Erroneous data
should be corrected, or, if this is not possible, they should be excluded from the data set
before further analysis steps are carried out. If one replaces the largest value in a sample
by an extremely large value or even infinity, the median will not change. But the mean
value will tend to infinity, even though only one value in the sample goes to infinity.
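
The different sensitivity of the mean and the median to a single extreme value can be seen in a short Python sketch. The sample is the one used in the Grubbs' test example of these notes; replacing its largest value by one million is an arbitrary illustrative choice.

```python
import statistics

sample = [3.3, 6.8, 2.9, 3.8, 3.1]

# Replace the largest value by an extremely large one (illustrative choice).
corrupted = sorted(sample)
corrupted[-1] = 1_000_000.0

print("median before:", statistics.median(sample))
print("median after: ", statistics.median(corrupted))
print("mean before:  ", round(statistics.mean(sample), 2))
print("mean after:   ", round(statistics.mean(corrupted), 2))
```

The median stays the same because the middle value of the sorted sample is untouched, while the mean is dragged towards the extreme value.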

Outlier Detection for Single Attributes


For a categorical attribute, one can consider the finite set of its values. An outlier is a
value that occurs with a frequency much lower than the frequencies of all other values.
However, in some cases, this might actually be the target of our analysis. Suppose we want
to set up an automatic quality control system and train a classifier that classifies the
produced parts as correct or faulty based on measurements of the parts. We will
probably have so many correct parts in comparison with the faulty ones that we would
consider the faulty parts as outliers. However, removing these “outliers” from the data set
would make it impossible to achieve our original goal of deriving a classifier from the
data set that can identify the parts with failures.

For numerical attributes, outlier detection is more difficult. We have already classified
certain data points in a box plot as outliers. For asymmetric distributions, box plots tend
to contain more outliers. Heavy-tailed distributions tend to show more outliers in a box
plot, whereas for a sample from a uniform distribution we would expect no outliers at
all, no matter how large the sample size is. The standard assumption
for outlier tests for continuous attributes is that the underlying distribution is a normal
distribution. Grubbs’ test is a test for outliers for normal distributions that takes the
sample size into account.
Grubbs’ test is defined for the hypotheses:
H0: There are no outliers in the data set
Ha: There is at least one outlier in the data set

The Grubbs’ test statistic is defined as:

$$ G = \frac{\max_{i=1,\dots,n} |x_i - \bar{x}|}{s} $$

with $\bar{x}$ and s denoting the sample mean and standard deviation, respectively. The
Grubbs’ test statistic is the largest absolute deviation from the sample mean in units of the
sample standard deviation. This is the two-sided version of the test.
-------------------------------------

Example 1:
Apply Grubbs' test to decide whether the value 6.8 in the given data set is an outlier.
Data: 3.3, 6.8, 2.9, 3.8, 3.1. Given: Gtable value for 5 observations = 1.672
Solution:
H0: there are no outliers in the data set
H1: an outlier is present
Mean = 3.98, std dev = 1.61 (sample standard deviation)
Gcal = |value to be tested - mean| / std dev = |6.8 - 3.98| / 1.61 = 1.75
If Gcal > Gtable, then the null hypothesis must be rejected.

Since 1.75 > 1.672, the value 6.8 is an outlier and needs to be removed.
After removal of 6.8, mean = 3.275 and std dev = 0.39. We can observe that the standard
deviation is reduced once the outlier is removed.

Example 2: Use Grubbs' test to decide whether the value 5 is an outlier in the given data set.
Data: 5, 10, 9.5, 9.8, 9.9. Given: Gtable value for 5 observations = 1.672
Solution: Value to be tested = 5, mean = 8.84, std dev = 2.155
Gcal = |value to be tested - mean| / std dev = |5 - 8.84| / 2.155 = 1.782
Since 1.782 > 1.672, the value 5 is an outlier and must be removed.
------------------------------------------------------------------
Note: sample Grubbs table

Number of observations (n)    G value for 95% confidence
4                             1.463
5                             1.672
6                             1.822
7                             1.938
8                             2.032
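
The two worked examples can be re-checked with a short script. This is a minimal sketch: the function names grubbs_statistic and is_outlier are made up for illustration, the critical values are copied from the sample Grubbs table in these notes, and the sample standard deviation (divisor n - 1) is used without rounding. Because of the unrounded standard deviation, G for the first example comes out near 1.75, slightly different from a hand calculation that rounds the standard deviation to 1.6.

```python
import statistics

# Critical values copied from the sample Grubbs table in the notes (95% confidence).
GRUBBS_TABLE_95 = {4: 1.463, 5: 1.672, 6: 1.822, 7: 1.938, 8: 2.032}


def grubbs_statistic(data, value=None):
    """Largest absolute deviation from the mean in units of the sample std dev.

    If value is given, the deviation of that particular value is used instead.
    """
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    if value is None:
        return max(abs(x - mean) for x in data) / s
    return abs(value - mean) / s


def is_outlier(data, value):
    """Return (Gcal, decision), where decision is True when Gcal > Gtable."""
    g_table = GRUBBS_TABLE_95[len(data)]
    g_cal = grubbs_statistic(data, value)
    return g_cal, g_cal > g_table


g1, out1 = is_outlier([3.3, 6.8, 2.9, 3.8, 3.1], 6.8)  # Example 1
g2, out2 = is_outlier([5, 10, 9.5, 9.8, 9.9], 5)       # Example 2
print(round(g1, 3), out1)
print(round(g2, 3), out2)
```

Both values exceed the tabulated 1.672 for n = 5, so both tested values are flagged as outliers, in agreement with the hand calculations.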

Outlier Detection for Multidimensional Data


Outlier detection in multidimensional data is usually not based on specific assumptions
about the distribution of the data and is not carried out in the sense of statistical tests.
Visualization techniques provide a simple method for outlier detection in
multidimensional data. Scatter plots can be used when only two attributes are considered.
One can also use dimension-reduction methods like PCA or multidimensional scaling in
order to identify outliers in the corresponding plots. There are many approaches for
finding outliers in multidimensional data that cluster the data and define as outliers
those data objects that cannot be assigned reasonably to any cluster. There are also
distance-based, density-based, and projection-based methods for outlier detection.
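
One simple distance-based idea can be sketched as follows: score each point by the distance to its k-th nearest neighbour and treat the points with the largest scores as outlier candidates. This is only one of many possible distance-based methods; the function name knn_distance_scores, the choice k = 2, and the sample points are assumptions for illustration and do not come from the notes.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical two-dimensional data: four points close together and one far away.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)

# Index of the point with the largest score, i.e. the strongest outlier candidate.
most_isolated = max(range(len(points)), key=scores.__getitem__)
print(points[most_isolated], round(scores[most_isolated], 3))
```

How large a score must be before a point is called an outlier is a judgment call that depends on the data set; the sketch only ranks the points.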


Partial Correlation

Partial correlation measures the strength of a relationship between two variables, while controlling for
the effect of one or more other variables. (Note: In partial correlation, there is no distinction between
independent and dependent variables.)

Simple correlation does not account for the effect of the other variables, as they are completely ignored.
In partial correlation, by contrast, the effect of the other (control) variables is held constant.

Examples of partial correlations:

• Studying the relationship between fertilizer and crop yield, keeping the weather conditions
constant.
• Effect of milk intake on body weight, keeping age constant.
• Studying the relationship between anxiety level and academic achievement, while controlling for
intelligence.
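
For a single control variable z, the partial correlation of x and y can be computed from the three pairwise Pearson coefficients with the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The sketch below applies it to made-up numbers; the variable names (age, milk, weight) echo the milk-intake example above, but the data are hypothetical and were constructed so that x and y are related only through z.

```python
import math


def pearson(x, y):
    """Karl Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: both milk intake and body weight grow with age.
age = [1, 2, 3, 4, 5, 6, 7, 8]      # control variable z
milk = [2, 1, 2, 5, 6, 5, 6, 9]     # x
weight = [2, 3, 2, 3, 4, 5, 8, 9]   # y

r_simple = pearson(milk, weight)
r_partial = partial_corr(milk, weight, age)
print(round(r_simple, 2), round(r_partial, 2))
```

In this constructed data set the simple correlation between milk and weight is high, but it vanishes once age is held constant, which is the situation partial correlation is meant to reveal.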


Simple Linear Regression


Simple linear regression is a statistical method that allows us to summarize and study the relationship
between two continuous (quantitative) variables:
• One variable, denoted x, is regarded as the predictor or independent variable.
• The other variable, denoted y, is regarded as the response or dependent variable.
In contrast, multiple linear regression, which we study later in this course, gets its adjective "multiple"
because it concerns the study of two or more predictor variables.
Some examples of statistical relationships might include:
• Height and weight: as height increases, you'd expect the weight to increase, but not perfectly.
• Alcohol consumed and blood alcohol content: as alcohol consumption increases, you'd expect
one's blood alcohol content to increase, but not perfectly.
• Vital lung capacity and pack-years of smoking: as the amount of smoking increases (as
quantified by the number of pack-years of smoking), you'd expect lung function (as quantified by
vital lung capacity) to decrease, but not perfectly.
• Driving speed and gas mileage: as driving speed increases, you'd expect gas mileage to
decrease, but not perfectly.
• Hours studied and marks obtained.
For given X and Y values, the equation of the best-fitting line for linear regression is given by

$$y = b_0 + b_1 x$$

where y is the dependent variable, x is the independent variable, b0 is the y-intercept, and b1 is the
slope of the regression line (the slope tells how much y changes for every unit increase in x).


The slope parameter b1 (also written β1) is of particular interest since it indicates how the expected
value of the dependent variable y depends upon the explanatory variable x, as shown in the figure.
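
The usual least-squares estimates of the two coefficients, written with deviations from the means, are given below. This is the standard textbook form, stated here as a reminder; $\bar{x}$ and $\bar{y}$ denote the sample means of x and y.

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

With dx = x - mean(x) and dy = y - mean(y), the slope is simply the sum of the products dx * dy divided by the sum of the squares (dx)^2.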

Problem: For the given data, find the line of regression of Y on X. Also find the value of
Y given X = 100.
Data:
X    91   97  108  121   67  124   51   73  111   57
Y    71   75   69   97   70   91   39   61   80   47

Solution: Mean(x) = 900/10 = 90, Mean(y) = 700/10 = 70

X      Y      dx = x - mean(x)    dy = y - mean(y)    dx * dy    (dx)^2
91     71       1                   1                     1          1
97     75       7                   5                    35         49
108    69      18                  -1                   -18        324
121    97      31                  27                   837        961
67     70     -23                   0                     0        529
124    91      34                  21                   714       1156
51     39     -39                 -31                  1209       1521
73     61     -17                  -9                   153        289
111    80      21                  10                   210        441
57     47     -33                 -23                   759       1089
Sum                                                    3900       6360

b1 = sum(dx * dy) / sum((dx)^2) = 3900/6360 = 0.6132 (slope)


b0 = mean(y) - b1 * mean(x) = 70 - (0.6132) * 90 = 14.81 (intercept)

Regression line of Y on X is Y = 14.81 + 0.6132 X

Given X = 100, Y = 14.81 + 0.6132 * (100) = 76.13
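
The hand calculation can be cross-checked with a few lines of code. This is a minimal sketch with a made-up helper name, fit_line; the X values used are the ones consistent with mean(x) = 90 and the tabulated deviations, and no rounding is applied until the results are printed.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 of the regression line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


X = [91, 97, 108, 121, 67, 124, 51, 73, 111, 57]
Y = [71, 75, 69, 97, 70, 91, 39, 61, 80, 47]

b0, b1 = fit_line(X, Y)
y_hat = b0 + b1 * 100  # predicted Y for X = 100
print(round(b1, 4), round(b0, 2), round(y_hat, 2))
```

The slope agrees with 3900/6360, and the intercept and the prediction for X = 100 follow from it without intermediate rounding.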
