lec24
Welcome to the next lecture of the course Descriptive Statistics with R Software. You may recall that, in the last two lectures, we discussed the concept of moments: the raw moments, central moments and absolute moments, and we also learned how to compute them in the R software. Now I am going to introduce the concepts of skewness and kurtosis, which are again two features of a frequency curve, and our objective is to understand, firstly, what these features are; secondly, how to quantify them; and thirdly, how to compute them in the R software. When we try to quantify them, you will see that we need the definition of the moments, and in particular the central moments, and that was the reason I explained those concepts earlier. So now, let us start this discussion.
First we try to understand, what is skewness? The dictionary or literal meaning of skewness is lack of symmetry. What does this mean? The symmetry here is the symmetry of the frequency curve or the frequency density; you have seen how we computed the frequency table and from there drew the frequency density curve. So, when we say there is a lack of symmetry, then what is symmetry? Symmetry here is like this, so this is the basic meaning of symmetry. Now, I am saying that this symmetry is lacking; when the symmetry is lacking, what will happen? In the ideal situation, suppose I say this is my symmetric curve; when the symmetry is disturbed, this curve will look like this, or it will look like this. Now, what is the interpretation of these curves? Suppose I take an example where we are counting the number of persons passing through a place where many, many offices are located. We know what the phenomenon is: usually the offices will start at nine o'clock or ten o'clock in the morning. The traffic at that point will be very low early in the morning, say around 7 a.m. or 8 a.m. Then the traffic will start increasing, and it will keep increasing up to say 10 o'clock, 10:30 or 11 o'clock in the morning, and after that the traffic will decrease.
So, if I want to show this phenomenon through a curve, the curve will look like this: suppose this is the time axis, with a point here at 7 a.m., a point here at 10 a.m., and further points up to, say, 3 p.m., and on the vertical axis is the number of persons passing through that point. From 7 a.m. this frequency, the number of persons, is very small; it starts increasing and keeps increasing up to say 10 o'clock, and after that everybody is inside the office, fewer people are still arriving, and finally, by the time we reach 3 p.m., this number has decreased. On the other hand, the opposite happens in the third case. Suppose I mark these points as 12 p.m., 1 p.m., and so on up to 6 p.m., 7 p.m. and 9 p.m., and here we again record the same data, the number of persons, which is denoting the frequency. Now what will happen? If the office hours are, say, from 9 a.m. to 5 p.m. or 9 a.m. to 6 p.m., then in that place where many, many offices are located, people will be working inside the offices, and in the evening, when the office hours close, they will leave. So from, say, 12 o'clock or 1 o'clock, the number of persons passing through that point will be very small. This number will start increasing from, say, 4 p.m. or 5 p.m., it will be maximum somewhere between 5 p.m. and 6 p.m., and once everybody has left the office, the number of persons passing through that point will sharply decrease; by 7 p.m. or 8 p.m. the number will be very, very small. So how do we denote this phenomenon through a curve? This type of phenomenon can be expressed by this curve: initially, at 12 o'clock, the number is very small, then around 5 p.m. or 6 p.m. it reaches its peak, and after about 7 p.m. it decreases. In both these cases you can see what is happening: more data is concentrated on the left-hand side in the first figure, and more data is concentrated on the right-hand side in the last figure.
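To visualise the two traffic patterns just described, here is a small illustrative R sketch; the curves and the time scale are placeholders of my own, not the lecture's data, chosen only to reproduce the two shapes (data piled up on the left for the morning pattern, and on the right for the evening pattern):

    # Illustrative sketch (not lecture data): two skewed frequency curves like the
    # morning and evening traffic counts described above.
    x <- seq(0, 10, length.out = 200)

    # Morning pattern: counts rise quickly, peak early, then tail off (right-skewed).
    morning <- dgamma(x, shape = 2, rate = 1)

    # Evening pattern: the mirror image, counts concentrated towards later hours (left-skewed).
    evening <- rev(morning)

    plot(x, morning, type = "l", col = "blue",
         xlab = "time of day (arbitrary scale)", ylab = "relative frequency")
    lines(x, evening, col = "red")
    legend("top", legend = c("morning traffic", "evening traffic"),
           col = c("blue", "red"), lty = 1)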
These are the areas, shown in red colour, where more data is concentrated. Now, if you take the third figure, you can see that in this case the curve is symmetric around the mid value. If you break the curve into two parts at this point, fold it, and place one half over the other, the two halves will match. So what am I saying? Suppose the curve is like this: if I break it in the middle and fold it, the curve looks the same. This is what we mean by symmetry, and this feature is missing in the first and last curves. If you break the first curve at this point, or the last curve at this point where I am moving my pen, then they will not be symmetric. So the objective here is to study this departure from symmetry on the basis of a given set of data: I would like to know, on the basis of the given data, whether the data is more concentrated on the left-hand side or on the right-hand side of the frequency curve. This feature is called 'skewness', and in order to quantify it, we have a coefficient of skewness. So I can say that skewness gives us an idea about the shape of the curve obtained from the frequency distribution, or the frequency curve of the data, and it indicates the nature and concentration of the observations towards higher or lower values of the variable.
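The coefficient of skewness is built from the central moments. Assuming the usual moment-based definitions, beta 1 = mu 3 squared divided by mu 2 cubed and gamma 1 = mu 3 divided by mu 2 to the power 3/2, a minimal R sketch that computes the signed version gamma 1 directly from the data could look like this (the helper name my_skewness is my own, not from the lecture):

    # Moment-based coefficient of skewness, assuming the standard definition
    # gamma1 = mu3 / mu2^(3/2), where mu_r is the r-th central moment.
    my_skewness <- function(x) {
      x   <- x[!is.na(x)]            # drop missing values
      mu2 <- mean((x - mean(x))^2)   # second central moment
      mu3 <- mean((x - mean(x))^3)   # third central moment
      mu3 / mu2^(3/2)                # > 0: right-skewed, < 0: left-skewed, about 0: symmetric
    }

    my_skewness(rexp(1000))   # clearly positive for a right-skewed sample
    my_skewness(rnorm(1000))  # close to zero for a symmetric sample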
Now, the probability density function of the normal distribution is given by f(x) = 1/(sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2)). This probability density function is controlled by two parameters, mu and sigma square, so we denote the distribution as N(mu, sigma square), where mu indicates the mean and sigma square indicates the variance. If I draw this curve, it will look like this: this value here is indicating the mean, and the spread around the mean is governed by sigma square. This curve is symmetric around the mean. Now, if we compute the coefficient of skewness and the kurtosis in the case of the normal distribution, the coefficient of skewness comes out to be zero and the excess kurtosis also comes out to be zero. That was the reason we were trying to draw conclusions based on the coefficient of skewness being zero, positive or negative, and similarly this is what we are going to do in the case of kurtosis. So the curve of the normal distribution has zero excess kurtosis, and we can compare the peaks of other curves with respect to the normal curve. This is what we are doing in this picture.
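Before comparing the peaks, it may help to draw the normal curve itself; a minimal R sketch (mu = 0 and sigma = 1 are arbitrary choices of mine, used only to show the symmetric bell shape around the mean) is:

    # Standard normal density N(mu = 0, sigma^2 = 1): symmetric around the mean.
    x <- seq(-4, 4, length.out = 200)
    plot(x, dnorm(x, mean = 0, sd = 1), type = "l",
         xlab = "x", ylab = "density")
    abline(v = 0, lty = 2)  # the mean; folding the curve here gives two identical halves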
Here, if you look at the curve in the middle, curve number two, this is the curve of the normal distribution, and we are comparing the flatness or peakedness of the other curves with respect to curve number two. You can see that curve number three has a higher peak than curve number two, and similarly curve number one has a smaller peak than curve number two. So what do we do here? All those curves which have a higher peak than the normal curve are called 'leptokurtic'; the peakedness of the normal distribution is called 'mesokurtic'; and those curves which have a lower peak than the normal curve are called 'platykurtic'. This is how we characterise more or less peakedness with respect to the peakedness of the normal distribution, the mesokurtic curve.
Now the question is how to quantify it. We have a coefficient of kurtosis, and there are different types of coefficients of kurtosis, but here we first consider the Karl Pearson's coefficient of kurtosis, which is denoted by beta 2; that is the standard notation. Similar to the coefficient of skewness, you can see that this beta 2 also depends on mu 2 and mu 4. What are mu 2 and mu 4? Mu 2 is the second central moment and mu 4 is the fourth central moment, and the coefficient of kurtosis is defined as the fourth central moment divided by the square of the second central moment, that is, beta 2 = mu 4 / mu 2 squared. The value of beta 2 for a normal distribution comes out to be 3. So what do we do? We define another measure, which is beta 2 minus 3, and we denote it by gamma 2. The advantage of gamma 2 is that just by looking at its value we get an idea of the magnitude, and whether it is greater than 0, smaller than 0 or equal to 0 gives us an idea about the nature of the hump. That is why we have two coefficients of kurtosis, and in the R software this gamma 2 is produced in the outcome.
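Putting these two definitions together, a minimal R sketch (the helper name my_kurtosis is mine, not from the lecture) that computes beta 2 = mu 4 / mu 2 squared and gamma 2 = beta 2 minus 3 directly from the central moments could be:

    # Karl Pearson's coefficient of kurtosis beta2 = mu4 / mu2^2 and the
    # excess kurtosis gamma2 = beta2 - 3 (zero for a normal distribution).
    my_kurtosis <- function(x) {
      x   <- x[!is.na(x)]
      mu2 <- mean((x - mean(x))^2)   # second central moment
      mu4 <- mean((x - mean(x))^4)   # fourth central moment
      beta2 <- mu4 / mu2^2
      c(beta2 = beta2, gamma2 = beta2 - 3)
    }

    my_kurtosis(rnorm(10000))  # beta2 close to 3, gamma2 close to 0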
So now you can see the same thing here: for a normal distribution, beta 2 is equal to 3 and gamma 2 is equal to 0. If beta 2 is greater than 3, or gamma 2 is greater than 0, we say the curve is leptokurtic; if beta 2 is equal to 3, or equivalently gamma 2 is equal to 0, we say the frequency distribution or the frequency curve is mesokurtic; and if beta 2 is smaller than 3, or gamma 2 is smaller than 0, we say the distribution is platykurtic. You can see, in the same figure that we had drawn, which curve is leptokurtic, which is mesokurtic and which is platykurtic. So this is about the coefficient of kurtosis.
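To see the three cases on simulated data, one can reuse the my_kurtosis helper sketched above and compare the excess kurtosis of samples from a sharply peaked, a normal and a flat distribution; the particular distributions below are my own illustrative choices:

    my_kurtosis(rt(10000, df = 5))   # gamma2 > 0 : leptokurtic (sharper peak, heavier tails than normal)
    my_kurtosis(rnorm(10000))        # gamma2 about 0 : mesokurtic
    my_kurtosis(runif(10000))        # gamma2 < 0 : platykurtic (flatter than normal)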
Some properties, which I am not going to prove here: the coefficient of kurtosis beta 2 will always be greater than or equal to one, beta 2 will always be greater than the coefficient of skewness beta 1, and beta 2 will always be greater than or equal to beta 1 plus 1. These are some properties just for your information; I am not going to use them here.
Now I will take an example and show you how to measure skewness and kurtosis in the data. I am going to use the same example in which we had collected the timings of twenty participants in a race, and this data has been stored inside the variable time. After this, we simply have to use the command skewness and give the data vector inside its argument, and this gives the value 0.05759762. This indicates that the skewness is greater than zero, so the frequency curve in this case is positively skewed. Similarly, when you operate the kurtosis command on the time vector, it gives the value 1.7, which is greater than zero, so this indicates that the curve is leptokurtic; leptokurtic means that the hump of this curve is higher than the hump of the normal distribution.
Now I will show you this in the R software. You can see here, I have here the data on time; first I need to load the package with library(moments), and I have already installed this package on my computer. Now I compute the skewness of time, which comes out like this, and then the kurtosis of time. Right. You can see this is the same thing which we have just obtained, and this is the screenshot of the same operation that I just did.
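For reference, the commands demonstrated above can be collected in a short script. Here library(moments), skewness() and kurtosis() are the functions used in the lecture, the vector time is the lecture's race-timing data (not reproduced here), and the numbers in the comments are simply the values quoted in the lecture; note that whether a kurtosis function reports beta 2 or the excess gamma 2 depends on the package's convention, so the result should be compared against 3 or against 0 accordingly.

    # Skewness and kurtosis with the 'moments' package, as demonstrated in the lecture.
    # 'time' is the race-timing vector from the earlier lectures (not reproduced here).
    library(moments)   # install.packages("moments") if the package is not yet installed

    skewness(time)     # lecture reports 0.05759762 (> 0: positively skewed)
    kurtosis(time)     # lecture reports 1.7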
The next slide is the screenshot of the same operation that we have done. Now I will stop here in this lecture. You can see that we have discussed the coefficients of skewness and kurtosis, which give you two more pieces of information about your frequency curve, besides central tendency and variation. Now you know how to find the behaviour of the frequency curve with respect to central tendency, variation, lack of symmetry and peakedness. Just by looking at the data you were not getting all these things, but now you know how to quantify them and what the graphics will look like. If you plot the frequency curves of the data which I have taken, say time or time.na, you can check whether the features of the curve match the information given by the coefficients of skewness and kurtosis, and you will see that they match. So this is the advantage of these tools of descriptive statistics: instead of looking at a huge data set, you simply look at these values, graphically as well as quantitatively, and they give you very reliable information; but you should know how to use this information and how to interpret the data. Up to now, from the beginning, I have used the tools when we have data on only one variable, so we have discussed the univariate tools of descriptive statistics. From the next lecture I will take up the case when we have more than one variable, and in particular I will consider two variables. When we have data on two variables, they also have some hidden properties and characteristics, so how do we quantify them, and how do we obtain the information graphically? These are the topics which I will take up from the next lecture. So, you practice these tools, try to understand them and enjoy them. I will see you in the next lecture. Till then, good bye.