Descriptive Stats With R Software Book
Introduction to R software
Welcome to the course on descriptive statistics with R software. This is our first lecture. I believe that all the candidates attending this course have a basic knowledge of R software; in case you do not, there is another MOOC course on Introduction to R Software, and I would request you to please go through those lectures. But in order to have a quick review of the mathematical tools, the statistical tools and the commands which we are going to use in this course, I will try to demonstrate them here in the next couple of lectures. So, the next couple of lectures will be devoted to the introduction to R software and the tools which we are going to use in the statistical part.
So, the first question is: what is this R software, why is it so important for us, and what is the role of R software in mathematics, statistics and the other basic sciences? Right. We all know that the use of software is desirable and, moreover, an essential part of any statistical or mathematical analysis; rather, I would say that any analysis is incomplete without the use of mathematics or statistics. So you definitely need to learn how to use software in all the basic sciences, whether it is physics, chemistry or anything else. Right. Now I would like to concentrate here basically on statistical software. Once we look at statistical software, there are different types available in the literature. For example, there is SPSS, there is SAS, there is Minitab, there is Stata, there is Matlab, and similarly, on the same lines, there is R.
Now the question comes: what is the difference between them, and how did R come into the picture? R is developed by a team called the R Development Core Team, and the software is freely available on the website www.r-project.org; it is free software. This is the biggest advantage of this software: most other software is paid software, and sometimes it is difficult for a common person to afford it. So R gives us a solution, and it gives us a very strong edge over the other software in that it is free; and it is not only free software, it also supports many free packages. Free packages means that in R there is a base package which includes most of the common features in use, but there are some specialized tasks which can also be completed using the contributed packages.
So, that is why it is very, very popular. R is an environment for data manipulation, statistical computing, graphical display and data analysis, and one can think of other things also. It gives us an effective way to handle data, and storage of the outcomes is also possible. The storage can be done one by one, or it can be done in a loop or in other, automated ways: you run the program, and the outcome of the analysis is stored in some file which can be exported to different types of software. Right. Simple as well as complicated computations are possible, and simulations, like Monte Carlo simulations in statistics, are also possible. If you are involved in research, it is very difficult to survive without doing simulation; giving the mathematical aspect or the statistical aspect alone is not sufficient, you also need to do Monte Carlo simulation to demonstrate the utility and application of your tools using data. So, this R software also works for that. Now, most software produces two types of outcomes: graphics, as well as numerical values. The numerical values, as I already told you, can be stored in a file; similarly, the graphics can also be stored. There are two options: the graphics can be displayed on screen,
as well as stored, and hard copies are also possible. And in R there is a programming language. It is not only that there are some built-in functions that can be used; you can also create your own programs. This is the main advantage: you can use the built-in functions as well as create your own functions using the programming language. That is why R has an environment with its own programming language; it is different from other languages, although very similar to them, with its own syntax and commands. And this R software basically provides a statistical computing environment. There are many, many things in statistics, and also in mathematics, which are directly supported using the built-in functions or the contributed packages, and the environment which R provides is very suitable for statistical and mathematical calculations, computations, and getting the outcomes.
And as I said, the biggest advantage of R software is that it is free software and it is open source. What does this mean? It means it is not like a black box: if you really want to see what is happening inside a program, you can just open it and get an idea of how R is computing something. As I said earlier, there are some packages which are built in, already embedded in the R software, and there are some contributed packages. Contributed packages means that somebody doing their research develops something new; they can write a program, and after scrutiny and verification by the R Development Core Team, this program can be uploaded on the R website, and then anybody can download it and use it in their own research work. Right. So, in R it is also possible to contribute our own packages, and this gives us a very strong advantage over the other software; this type of facility is not really available in many other software packages. Right. And all the commands we use can be saved, and their output can be reused, and so on. This R software is available for all sorts of operating systems such as
Refer slide time :( 08:19)
Windows, Unix, Linux and Macintosh, and whatever graphics are produced can also be stored in different types of formats. The most common formats for graphics are the PostScript file and the PDF format, although they can also be stored in JPEG format; they can be copied directly from the software and pasted into software like MS Office.
Now, after this brief introduction and the advantages of R, I will try to show you how you can install R on your computer. In order to install the software, you have to go to this website. Once you go there, here is a sort of screenshot of the webpage that you will get. There are links here: you can use this link, or you can also come through this CRAN mirror, and you can simply click here. After that the software will be downloaded, and once you download it you simply have to click on it; it will ask you various options, you simply click through them, and the software will be installed on your computer. So, now I will come to the desktop and show you how things are going to happen. You can see here there is an R icon; I simply double click on it, and you will get this type of window here. Right. Just in order to make it clearer, I will try to increase the font size so that you can see clearly whatever I am doing. Right. So, you will get this type of thing, and the screen that you are seeing here is called the ‘console’. By pressing the two keys Ctrl + L together, you clear the screen, and we will use the console for executing different types of commands. So, now let me come back to our slides and I will try to show you what is there.
So, if you see this screenshot, this is the same thing which I just showed you in the R software; this is called the ‘console’, and you have seen that there is a sign here, like a greater-than sign. This is the place where you write all the R commands. You type all the commands there, and this place is called the command line. So, that will be our common terminology: come to the console, or type the command on the command line, and so on.
So, you need to be familiar with this. What do we do? Whenever we want to execute any command, I go to the command line, type it there and press ENTER, and that will execute the command. When you are trying to work with R software, there are two options. The first option is to use the R software directly: you type your commands, you can store your commands, and you see the outcome directly on the console. The second option is that there are some free software available, for example RStudio and Tinn-R, and similarly other software also; they act as an interface between R and us. Using RStudio or Tinn-R or similar software will help you more in executing the commands of R and getting the outcome, but my objective here is not to teach you the R software itself, so I will be working only on the console, and I will leave it up to you which software you want to use. Okay? This RStudio software is available at the website www.rstudio.com, and similarly the Tinn-R software is available at its own site. So, if you wish, you can just download it and install it on your computer.
Refer slide time :( 12:58)
Now I will quickly go over the commands and the things which we need in order to work in statistics using the R software. As I told you, in R there is a base package. The base package contains some essential libraries which are common among users, and these libraries are required to do statistical work. Some of the libraries are part of the base package, and some have to be downloaded from the website; these libraries are needed to execute particular types of tasks. Right. So, first I will demonstrate how you install a package and how you bring it to a platform where it is available for use on any data set. Okay. The first command which I will illustrate is install.packages; install.packages is a function which is used to download the libraries, and this is the syntax for using it. Suppose I want to download a package ggplot2, which is used for graphics. Then I have to type install.packages and, inside the brackets, within double quotes, write ggplot2, which is the name of the package
which I want to install. Similarly, in case I want to install the package graphics, this can be done by typing install.packages and, inside the brackets within double quotes, writing graphics. And if you want to install another package, say cluster, you simply type install.packages and, inside the brackets within double quotes, write cluster; once you press enter, the R software will guide you through the installation. You simply have to choose a mirror, click through, and the package will be installed. I am not demonstrating it here; I would like to leave it up to you to practice. After installing a package, suppose I want to use it.
Then I need to use another command, library, and the syntax is very, very simple: it is simply library, and inside the brackets you write the name of the package which you would like to use. For example, if I want to use the package cluster, then I load the library as library and, inside the brackets, write cluster. Similarly, if I want to use the package ggplot2, then I write the command library and, inside the brackets, ggplot2; and if you want to use the package graphics, again just write library with graphics inside the brackets. So, this is how you install a package, and this is how you load the library before you start using it. There are some libraries which
come as a part of the base package in R. For example, there is one very popular library, MASS, in capital letters. MASS stands for Modern Applied Statistics with S-Plus; this is a book written by Venables and Ripley, and MASS corresponds to the first letters: M from Modern, A from Applied, S from Statistics and S from S-Plus. S-Plus is a software which is very, very similar to R. So, in case you want to use the data sets or the functions from that book, you have to use the package MASS by writing the command library and, inside the brackets, capital M, capital A, capital S, capital S. For that, you need not download anything, because this package is already available with the base installation. Similarly, there are different types of libraries for doing specialized jobs. For example, if you want to use generalized additive models, then there is another library, mgcv, and in order to use generalized additive models you use the command library with mgcv. So, this is how you can install a package and use the library. And in case you want to take some help, or want to know the contents of a library, what is contained in it:
suppose I want to know the contents of the library cluster. Right. I simply have to type library and, inside the brackets, help = followed by the name of the library, which here is cluster. Once you do this, it will give you different types of information: what the package is, what the version is, what the date is when it was incorporated, what the priority is, who wrote it, and much more such information will be available.
And here, for example, you can see a screenshot of the same thing. If I simply copy this command and run it, you can see the details about the package cluster: there is a lot of information, complete information about the package. This is what I meant when I said that R is not a black box; you have the complete details. Right. And if you want to know the programming of a package, you can also look into that. So, let me come back, and here you can see a screenshot of this.
And after this, yes, this is very important: some final words about using the R software. Whenever you start new programming, you will always be using some variables, given names like x, y, z, A, B, C and so on. It is possible that today you write a program in which you use the variable names x and y, and after some time you write another program using the same names x and y. What will happen? As soon as you define the new variables x and y, the earlier x and y will be erased; or it is also possible that one of your friends comes to the same computer and defines x and y in a different way, so the x and y you had defined earlier are erased. So it is always a better option to remove the names before you leave, or before you start a new program. In order to remove a variable, the command is rm, and inside the brackets you write the name of the variable. For example, if I have three variables, x, y and z, then I simply write rm and, inside the brackets, x, y, z, as shown here. Once you do this, it will remove the variables x, y and z, and then you can define and use those names afresh.
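A minimal console sketch of this clean-up (the variable names x, y, z are just illustrations):

> x <- 1; y <- 2; z <- 3
> rm(x, y, z)    # removes all three variables from the workspace
> x              # now gives an error: object 'x' not found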
Refer slide time :( 21:36)
This is your choice. Once you are done and you want to close the session, you want to quit R, the command is q, and you have to write a pair of parentheses: once you write q with an opening and closing bracket, q(), you will come out of the R program. So, now I will stop here, and I would request you to please have a quick look at the basics of R and the commands; in case you have done this earlier, that will help you, and I will see you in the next lecture.
Lecture 02
Welcome to the next lecture on Descriptive Statistics with R software. You may recall that in the earlier lecture we had a quick review of the R software and the related information. Now, in this lecture and in the next couple of lectures, I am going to give you an idea about the basic operations, the basic mathematical operations, in R software. These are the operations which we are going to use again and again in the forthcoming lectures. So, this will be a quick introduction to the minimum required mathematical operators and how to use them in R, in this lecture and the next couple of lectures.
So you can see here, as soon as you start the R software, on the console you get a sign like the greater-than (>) sign. I explained this in the earlier lecture: this is the prompt sign on the console, after which you write the commands for their execution. You type the command after the prompt sign and press enter, and the command will be executed. One thing I would like to make clear here: when we are trying to assign a value to a variable in mathematics, we always write, suppose I want to assign the value 2 to a variable x, x = 2. The same thing is followed
in R also. Suppose I have a variable x and I want to assign a value to it; then I would write, say, x = 2. But before that, let me explain one thing more. When R started, the original assignment operator was not the equality sign but the less-than (<) and hyphen (-) signs written together. So in the older versions of R, to assign a value, I had to write less-than and hyphen: for example, to assign the value 40 to a variable x, I need to write x <- 40, like this. But in the recent versions of R they have also allowed the equality sign, so I can write either x = 40 or x <- 40; both are acceptable in the recent versions of R. You just have to keep that in mind. So I will show you how these things happen and what it means when I say that I am assigning a value, x = 40.
So, suppose I copy this command and paste it on the R console. You can see, now I type x and, in turn, it gives me x = 40. Similarly, if I write x = 40 using the equality sign, then also x gives me 40. Or I can take another example, just to discriminate between the two: if I write x = 20, now you see, x becomes 20.
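A small sketch of both assignment styles on the console:

> x <- 40    # the classic assignment operator
> x
[1] 40
> x = 20     # the equality sign also works in recent versions
> x
[1] 20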
Once you are assigning a value to a variable, there are two options: you can assign a numeric value to the variable, and you can also assign a variable to a variable. For example, suppose I assign x = 40, and now I say I will multiply this x by 3 and store the outcome in another variable, y. Then I can also write y = x*3. The star (*) operator is the multiplication sign, which I will explain in a couple of slides as well; similarly, the minus (-) sign is the difference operator. So if I want to multiply x by 3 and store the result in a variable y, I write y = x*3. And similarly, in case I want to find the difference between x and y, that is x-y, and store the value in another variable z, then I write z = x-y, as here. Right, I can show you on the R console how these things happen. Suppose I take x = 20, and now I say y = 3 multiplied by x, that is, y = 3*x. If I look, the value of x here is 20, and the value of y comes out to be 20 into 3, which is 60. And suppose I try to find the value of x-y and store it in another variable, z. So if I write z = x-y and look at the value of z, it comes out to be minus 40, that is, 20 - 60. So you can see here how we are going to assign or store a value inside a variable.
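Here is a compact sketch of this whole exchange on the console:

> x <- 20
> y <- 3*x
> y
[1] 60
> z <- x - y
> z
[1] -40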
Now, if you look here, there is another symbol, the hash (#). This hash is used to indicate that the given text on the command line is only a comment; it is not to be executed. Once you write the hash (#) followed by whatever you want, any command or any syntax, nothing is going to be executed: it will be taken as a comment, and all mathematical operators after it will be ignored.
Let us consider, say, x equal to 20 and y equal to 40; once again I define z = x-y, which gives me the value -20, that is, 20 - 40. Now I define another variable, say m, and I put the hash (#) sign in front: I type # m=x-y, and you will see there is no value and no error message, because nothing has been executed; it is only kept on the R console as a comment, and a comment cannot be used in computations. Right. The use of this comment sign is this: when we write a longer program, we explain the features and the names of the variables inside the program, so that after some time, when we have forgotten about it, by looking at these comments I can recall what was the meaning of x, what was the meaning of y, or what was the meaning of z.
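A quick sketch of the comment sign in action:

> x <- 20; y <- 40
> z <- x - y
> z
[1] -20
> # m = x - y   everything after the hash is ignored, nothing is executed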
Now, the next aspect. A very important point in using R is this: there is a difference in the use of small and capital letters. For example, if I use small x = 40 and capital X = 40, then these two are different variables. Right? I can show you this with an example on the R console itself.
For example, suppose I use the command x = 40; you can see the value stored in x is 40. But now, if I say capital X = 70, then once I type capital X, it gives me 70, and if I recall the value of small x, it is 40. I can even find the value of capital X minus small x, which comes out to be 70 minus 40, that is, 30. So you can see here that small and capital letters denote different variables.
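A minimal case-sensitivity sketch:

> x <- 40
> X <- 70
> X - x
[1] 30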
This is the screenshot of the same thing which I have done here; I have used different values, but it is the same idea. Right. Okay. Now, another thing: in statistics you will always be dealing with data, that is why you are doing a computation. So in statistics, when using R, whenever we want to input the data, we always have to write the letter c.
Suppose I want to enter the values 1, 2, 3, 4, 5. Then I need to input the data in R by writing c, small c, not capital C, and then 1 comma 2 comma 3 comma 4 comma 5: all the values are separated by the comma (,) operator. In case you do not use the c command here, it will give you different types of trouble. For example, I can show it to you on the R console.
Now, let me type x = 1, 2, 3. You will see it gives me an error. But if I use x = c(1, 2, 3), then it gives me x as 1 2 3. Right. Okay. So, always remember to use the c command when entering data.
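A small sketch of data entry with c:

> x <- c(1, 2, 3)
> x
[1] 1 2 3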
Now I come to the basic operations. In case you want to do addition, the operator is the plus (+) sign, the usual mathematical sign. For example, first I will show you the outcome, and then I will show you on the R console also. Suppose I want to add 5 and 3; then I simply give 5 and 3 with the operator plus (+) between them. So I type 5+3, and this gives me the value 8; this is the screenshot. Similarly, if you want to multiply 5 and 3, you write 5 multiplied by 3, and the multiplication operator is the star (*). So I simply write 5*3, and this gives me the value 15. And
this is the screenshot. Similarly, if you want to subtract two values, the mathematical operator is the same subtraction sign (-), the hyphen; if I want to subtract 3 from 5, I write 5-3, just as in usual mathematics, and this gives me the value 2, and this is the screenshot. In case I want to use division, the division operator is the forward slash (/); if I want to divide 5 by 2, I simply write 5/2, and the answer comes out to be 2.5, and this is the screenshot of the operator. And similarly, in case I want to write 3 square, I now have two options: I can write 3 hat (^) 2, or I can write 3 double star (**) 2. You can see I have used two commands here, and both indicate the same value, 3 square. So whenever you want to write a mathematical expression involving a power, you have two options: one is to use the hat (^) sign, and the other is to use the double star (**) sign. 3 square in both cases gives 9, and these are the two operations which I will show you.
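Here is a compact console sketch of all these operators at once:

> 5 + 3
[1] 8
> 5 * 3
[1] 15
> 5 - 3
[1] 2
> 5 / 2
[1] 2.5
> 3^2
[1] 9
> 3**2
[1] 9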
And similarly, when you are giving the power, this power can be an integer or a fractional value. For example, in case I want to find the value of the square root of 3: the square root of 3 is simply 3 to the power 1/2, which is 3 raised to the power 0.5. So I can write this value as 3^0.5 or 3**0.5, which I have done here, and you can see that in both cases the value comes out to be about 1.73; these are the screenshots. And similarly, in case you want to find the value of 1 over the square root of 3, this is 3 to the power minus 1/2, that is, 3 raised to the power -0.5, and it can be written either with the hat (^) sign or with the double star (**) sign; I have written it here, and you can see the screenshot. And, in case you have more than one operator at the same time, suppose you are using the plus (+) sign, the minus (-) sign, the multiplication (*) sign and the division (/) sign together, then the same BODMAS rule of precedence that we have in mathematics applies here also.
Refer Slide Time :( 14:26)
So, for example, now I will show you all these things on the R console, so that you have some confidence. I take x equal to 5 and y equal to 10. In case I want to add them, I can write x+y, which gives 15; or, in case I write directly 5+10, it will also give the value 15. Similarly, in case I want to subtract, say 10-5, this gives 5, and if I write y minus x, this will also be 5. In case I want to multiply 10 and 5, I write 10*5, and this gives the value 50; if I multiply x and y, which take the same values, this will again give the value 50. Now, in case I want to divide 10 by 5, I write 10/5, which gives the value 2, and similarly, if I write y divided by x, this will again give me 2.
Now let me clear the screen by pressing Ctrl + L. Suppose I want to find the square root of 2; then I type 2^0.5, and this gives me 1.414214. Similarly, if I want the value of 1 over the square root of 2, I type 2^-0.5, and this gives me the value of 1 over the square root of 2. And if you want to find the value of 2 cube, there are two options: I can write 2^3, which gives the value 8, and similarly, if I write 2**3, this also gives the value 8. Right. And similarly, in case I have several operators together, like 9+4-5^8*3/7, then
the value is again computed using the BODMAS rule. These are very simple operators but, before we go into the statistical part, you definitely need to practice them, so that you are more conversant.
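As a small sketch of operator precedence (the ^ binds before * and /, which bind before + and -):

> 9 + 4 - 5^8 * 3 / 7
[1] -167397.7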
So, I would request you: please practice them. And I will see you in the next lecture. Till then,
Goodbye.
Lecture 03
Welcome to the next lecture on descriptive statistics with R software. You may recall that in the earlier lecture we learnt how to use the R software as a calculator, and we learned different types of mathematical operators: addition, subtraction, multiplication and division. But all of those operated only on single numbers. Now, in this lecture, we are going to consider vectors of numbers, which we call data vectors, and we would like to learn how these mathematical operators, addition, subtraction, division, power and so on, work with a data vector. You will have two cases here: the first is when we are working with a data vector and a number, and the second is when we are working with a data vector and another data vector. So let us start.
So now we are simply going to look into this aspect of the R software: how to handle data vectors with the different mathematical operators, addition, subtraction, multiplication, division and so on. Let me take different types of examples, and through those examples I will explain how these operations work.
So let me take a data vector consisting of four values, 3, 4, 5 and 6; as I told you earlier, all these values have to be entered with the c operator. So I write c and, inside the brackets, these four values, 3, 4, 5, 6, and then I write hat and 2. What does this mean? Suppose I want to find the values of 3 square, 4 square, 5 square and 6 square. Notice that the power on each and every number is the same, 2. So this 2 is here, and this 3 is here, this 4 is here, this 5 is here and this 6 is here. Once I write the data vector c(3, 4, 5, 6) followed by hat 2, it gives me the outcome 3 square, 4 square, 5 square, 6 square, which have the values 9, 16, 25 and 36 respectively. So what you have observed is that once you use the power operator with a vector, the power is applied to each and every number inside the data vector.
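On the console this looks like the following sketch:

> c(3, 4, 5, 6)^2
[1]  9 16 25 36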
I will show you on the R console. Suppose I run c(3, 4, 5, 6)^2: this gives 9 16 25 36. Similarly, if you want the cubes, you can run c(3, 4, 5, 6)^3, and this gives the values of 3 cube, 4 cube, 5 cube and 6 cube; and instead of hat you can also use the double star operator, which gives the same thing, as you can see here. So this is what you have to keep in mind: once you do the operation with a vector, the power gets distributed. Now I take another option, where I use two vectors: the data is inside a vector, and the power, which I had earlier taken as a scalar, is now itself a vector. So see what is happening. Here we essentially want to find the values of 3 square, 4 cube, 5 square and 6 cube. You can see that the powers are 2, 3 and then once again 2 and 3, right. So these powers are written here as a data vector, c(2, 3), and once they are operated on the data vector c(3, 4, 5, 6), the 2 and 3 operate pairwise: the 2 comes to 3 and makes it 3 square, the 3 comes to 4 and makes it 4 cube, once again the 2 comes to 5 and makes it 5 square, and the 3 comes to 6 and makes it
6 cube. So this gives the values 9 64 25 216, which are 3 square, 4 cube, 5 square and 6 cube. What you need to learn here is that whenever we give a data vector and operate a power operator on it, the powers are distributed over the data vector in the same sequence in which they are given, right. So let me show you on the R console how it happens: I take the vector c(3, 4, 5, 6) and operate the power operator c(2, 3) on it, and you can see the result coming out like this, right.
Now let me take another example on similar lines, where I want to compute the values 1 square, 2 cube, 3 to the power 4, 4 square, 5 cube and 6 to the power 4. We need to observe that the powers are 2, 3, 4 and then once again 2, 3 and 4; so these powers can be given inside a data vector, c(2, 3, 4), and the values 1, 2, 3, 4, 5 and 6 can be given as another vector, like this one. Once they are operated, this 2 comes to the first place, this 3 comes to the second place and this 4 comes to the third place; then the three powers come around again, 2 to the fourth place, 3 to the fifth and 4 to the sixth. So I can write 1 square, 2 cube, 3 to the power 4 and so on, and this is the value obtained: 1 8 81 16 125 1296. You can practice this yourself. So what you have to learn here is that whenever we are finding the power of a vector, where the powers themselves are given in the form of a data vector, the powers move over the data vector exactly in the same order.
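A console sketch of this recycling of powers:

> c(1, 2, 3, 4, 5, 6)^c(2, 3, 4)
[1]    1    8   81   16  125 1296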
Now the next question: what happens if the number of data values is not a multiple of the number of powers to be operated? In the example above there was no problem: there are six values, 1 2 3 4 5 6, and the powers 2 3 4 are recycled over them exactly twice.
So, for example, let me take another example to show you the outcome; after that I will show it on the R console. Suppose I take the data vector c(2, 3, 4, 5), using the operator c, and another data vector containing the values 3, 4, 5. What really happens once I use the power operator? The 3 comes to the first place and makes 2 cube, the 4 comes next and makes 3 to the power 4, the 5 comes next and makes 4 to the power 5, and then the 3 starts again and makes 5 cube; but after that, there is no place left for the 4 and the 5 to go. So what are we going to compute? We compute 2 cube, 3 to the power 4, 4 to the power 5 and then 5 cube, but there is no room for the remaining powers 4 and 5. In this case R will compute the values on the basis of whatever data is available, but for the remaining values it gives a warning message, clearly saying that the longer object length is not a multiple of the shorter object length. So whenever you get a warning in R, it is a warning in the literal sense: it will not harm you, but you have to be careful. The second kind of message is an error message, which means a mistake has been made and the program will not run,
but with a warning the program will run; it just gives you a message that you have to be careful while executing it. So let me show you on the R console first. Suppose I take the same data vector, c(3, 4, 5, 6), and raise it to the powers c(2, 3, 4). You can see that the lengths are not multiples of each other, and it gives these values: the 9 is 3 square, the 64 is 4 cube, the 625 is 5 to the power 4 (5 into 5 is 25, and 25 into 25 is 625), and the last value, 36, is the square of 6. But after that it tells you that the longer object length is not a multiple of the shorter object length, and that is the warning message. So you have to be careful, ok.
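A sketch of the warning case on the console:

> c(3, 4, 5, 6)^c(2, 3, 4)
[1]   9  64 625  36
Warning message:
In c(3, 4, 5, 6)^c(2, 3, 4) :
  longer object length is not a multiple of shorter object length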
So that was the power operator. Similarly, let me take the other operators: multiplication, addition, division. First, multiplication. Again I take two examples, one with the multiplication of a data vector by a scalar and then by another vector. You can see I have taken the data vector c(2, 3, 4, 5), and this is to be multiplied by 6. Once you multiply a scalar value 6 with a data vector, the scalar is multiplied with each and every value of the data vector. For example, suppose I want to find the values 2 into 6, 3 into 6, 4 into 6 and 5 into 6; in this case the 6 is common, so this 6 appears here, the data vector 2, 3, 4 and 5 goes inside the c operator, and the outcome is 2 into 6, that is 12, 3 into 6 is 18, 4 into 6 is 24, and 5 into 6 is 30, right; and this is the screenshot. Later on, I will show you this on the R console as well. Now I consider two data vectors. In this example, one data vector consists of the values 2, 3, 4, 5 and the other data vector is -2, -3, -4 and 6, and the two data vectors are multiplied with each other. Essentially we are finding the values of 2 into -2, 3 into -3, 4 into -4 and 5 into 6. So the values 2, 3, 4 and 5 are combined in one data vector, c(2, 3, 4, 5), and the values -2, -3, -4 and 6 are combined in another data vector; when the multiplication operator comes into the picture, the first value of one data vector is multiplied by the first value of the other, the second value by the second value, the third value, 4, by -4, which is the third value of the second data vector, and 5 by 6 in the fourth position. So you can see there is an element-wise multiplication: the first position is multiplied with the first position, the second position with the second
position, and so on; and this is the screenshot of the same operation. Here I can show you another example, where the number of data points in the first vector is a multiple of the number of data points in the second vector, but they are not equal: there are four values in the first data vector, 2, 3, 4, 5, and two values, 6 and 7, in the second data vector. Essentially we are finding the values of 2 into 6, 3 into 7, 4 into 6 and 5 into 7. You can see that the two values 6 and 7 are repeated: they come here as the second data vector, the values 2, 3, 4 and 5 come as the first data vector, and the multiplication sign is written with the R operator star. In this case the 6 is multiplied with 2 and the 7 with 3, and once this process is complete, the 6 is again multiplied with 4 and the 7 with 5. So the operation proceeds in this particular way, and this is what you have to keep in mind, right.
Similarly, I would like to show you what happens in case the number of data points in the second vector is not a multiple of the number of data points in the first vector: we get a warning message.
Here I take an example with the four values 2, 3, 4, 5 in the first data vector and the values 6, 7, 8 in the second data vector. Once they are multiplied, using the same rule, the 6 gets multiplied with the 2, the 7 with the 3 and the 8 with the 4. Then the process is repeated, and the 6 is multiplied with the 5; but after this, there is no place for the 7 and the 8 to multiply into. So in this case we are simply computing the values of 2 into 6, 3 into 7, 4 into 8 and 5 into 6, and after this there should have been two more values to be multiplied by 7 and 8, but they are not present. That is why we get the warning message that the longer object length is not a multiple of the shorter object length. So now let us do this operation on the R console; here is the screenshot of what we are going to do, but I would like to show you on the R console itself.
Now, if you see here, I take the data vector c(2, 3, 4, 5) and multiply it by 7. You can see that the result is 2 into 7, 3 into 7, 4 into 7 and 5 into 7, which is 14 21 28 35 respectively. Similarly, in case I multiply this data vector c(2, 3, 4, 5) with another data vector, 5 6 7 8, then I get this. What is happening? The 10 is coming from 2 into 5, the 18 from 3 into 6, the 28 from 4 into 7 and the 40 from 5 into 8. And similarly, if I make it so that the first data vector has four values whereas the second data vector has only two values, again I have a clean outcome without any warning: the 10 is because of 2 into 5, the 18 because of 3 into 6, the 20 because of 4 into 5, and the 30 because of 5 into 6.
Now, in case I add one more number to the second data vector, so that instead of 5 6 it becomes 5 6 7, and multiply it by the data vector 2 3 4 5, you can see I get a warning message. Why? Because the 10 is the outcome of the first value, 2, multiplied by the first value, 5; the 18 is the multiplication of 3 with 6 (please look at the highlighted part); the 28 is the outcome of 4 multiplied by 7; but after this, the 5 is multiplied by the 5 again, which is 25, and then there are no numbers left to multiply by 6 and 7. So that is why it gives me a warning message.
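A console sketch of these multiplication cases:

> c(2, 3, 4, 5) * 7
[1] 14 21 28 35
> c(2, 3, 4, 5) * c(5, 6)
[1] 10 18 20 30
> c(2, 3, 4, 5) * c(5, 6, 7)
[1] 10 18 28 25
Warning message:
In c(2, 3, 4, 5) * c(5, 6, 7) :
  longer object length is not a multiple of shorter object length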
Now the same thing with addition. Once again I take a vector of four values, 2 3 4 5, and I add a scalar, 20. You can see that this 20 is added to each value: I get 2 + 20, which is 22, then 3 + 20 is 23, 4 + 20 is 24 and 5 + 20 is 25. Essentially I wanted to add the number 20 to the values 2, 3, 4 and 5; that is why the 20 is written here as the common value, the values 2, 3, 4 and 5 are written as the data vector c(2, 3, 4, 5), and this is the output of the operation,
and similarly, if you take a second vector whose length is not a multiple, we again have a similar outcome. If I take a vector of four values, 2 3 4 5, and another vector of three values, 6, 7 and 8, the numbers of values in the first and second vectors are not the same. In this case the 6 is added to 2, the 7 to 3 and the 8 to 4, which gives the outcomes 8, 10 and 12; then the 5 and the 6 are added, giving 11, but there are no numbers left to add to 7 and 8. Essentially I am getting the values 2 + 6, 3 + 7, 4 + 8 and 5 + 6, right. The values 6, 7 and 8 are combined in one vector, c(6, 7, 8), which is written here, and after the 6 is recycled for the last value, the 7 and the 8 are left alone; ideally there should have been two more numbers in the data vector c(2, 3, 4, 5) for them, and had there been, I would not have got the warning. So this is the same style, the same operation, that we get here, and this is a screenshot of the operation which you have just learned. But before going into further details, let me show you how these things happen, right.
So let me first clear the screen. I take the data vector 2, 3, 4 and 5 and add the scalar number 20; you can see that every number, 2 3 4 5, is added to 20, and that is the outcome we get. Similarly, if I take the same data vector c(2, 3, 4, 5) and add 6, 7, 8 and 9, you can see the outcome: the first 8 is coming because of 2 plus 6, the 10 because of 3 plus 7, the 12 because of 4 plus 8, and the 14 because of adding 5 and 9 together, right. And similarly, in the same operation, if I remove one number, so that the lengths of the vectors are no longer multiples, then I get a similar outcome with a warning message: the 8 is because of 2 plus 6, the 10 because of 3 plus 7, the 12 because of 4 plus 8, and the 11 because of 5 plus 6. The same thing continues here. So what I will do now is simply take some examples on similar lines for subtraction and division, and that will give you a clear idea of how they work.
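A console sketch of the addition cases:

> c(2, 3, 4, 5) + 20
[1] 22 23 24 25
> c(2, 3, 4, 5) + c(6, 7, 8, 9)
[1]  8 10 12 14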
So let me come to the R console and take the same example, where the data vector consists of the four values 2 3 4 5. I subtract the value 1 from 2 3 4 5, which means 1 is subtracted from each and every value in the data vector; the answer we expect is 2 minus 1, 3 minus 1, 4 minus 1 and 5 minus 1, and when I press enter I get exactly that outcome, 1 2 3 4. Similarly, if I subtract a vector whose length divides the length of the first, say the second vector is c(2, 3), then I expect the outcome to be 2 minus 2 (please look at the highlighted part), 3 minus 3, then again 4 minus 2 and finally 5 minus 3, and the answer comes out to be 0 0 2 2. Similarly, if I subtract a vector of the same length, say c(2, 3, 4, 5) minus c(2, 3, 4, 5), the outcome is 2 minus 2, 3 minus 3, 4 minus 4 and 5 minus 5, and the answer comes out to be 0 0 0 0. And just for illustration, to make you more comfortable, suppose I take some other values, say 7, 8, 9 and 10. This means I am computing 2 minus 7, 3 minus 8, 4 minus 9 and 5 minus 10, and you can see the answer comes out to be -5 -5 -5 -5: the first -5 comes from 2 minus 7, the second -5 from 3 minus 8, and the third -5 from 4
minus 9, and the last -5 comes from 5 minus 10. Similarly, if you take some other values, say c(2, 3, 4, 5) - c(8, 2, 1, 3), this value will be -6 1 3 2. And if you take values where the lengths are not multiples, say c(2, 3, 4, 5) - c(8, 2, 1, 3, 7), the value will come from 2 minus 8, 3 minus 2, 4 minus 1 and 5 minus 3, which gives -6 1 3 2, and then the last value is the recycled 2 minus 7, which is -5; but after this, the two vectors are not of multiple lengths, so that is why I am getting a warning message here. So you can see that the same type of operation works.
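A console sketch of subtraction with a scalar and with a vector:

> c(2, 3, 4, 5) - 1
[1] 1 2 3 4
> c(2, 3, 4, 5) - c(7, 8, 9, 10)
[1] -5 -5 -5 -5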
Similarly, let me show you how the division operator works. I take a data vector of four values, 5, 10, 15 and 20, and I divide each and every value by 5; that is, a data vector divided by the scalar value 5. You can see the first 1 is coming because 5 is divided by 5, the 2 because 10 is divided by 5, the 3 because 15 is divided by 5, and the 4 because 20 is divided by 5. Similarly, if I take the same data vector but now divide it by another data vector of two values, say 5 and 10, then in
this case, you can see the lengths of the two data vectors are multiples. What happens once I press enter? I get the outcome 1 1 3 2. The first 1 is coming because the 5 is divided by the first value in the second data vector, 5, and 5 divided by 5 is 1. The second 1 (the highlighted one) is the outcome of 10 divided by 10. The third value, 3, is the outcome of 15 divided by the first value, 5, again: 15 divided by 5 is 3. And the last value, 2, is the outcome of 20 divided by 10. So you can see that the division operator works exactly on the same lines as the power operator, the multiplication operator, the addition operator and so on. And similarly, if I take two vectors of lengths which are not multiples, for example dividing c(5, 10, 15, 20) by c(5, 10, 3), the first data vector has four values and the second has only three. In this case we expect a warning message, and that is what happens: the first three values, 5, 10 and 15, are divided by 5, 10 and 3, so the first value is 5 divided by 5, which is 1, the second value is 10 divided by 10, which is 1, and the third value is 15 divided by 3, which is 5. After this, the fourth value, 20, is divided by the 5 again, giving 4; but after that there are no values, because the two vectors are not of multiple lengths, so we get a warning message.
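A console sketch of division, including the warning case:

> c(5, 10, 15, 20) / 5
[1] 1 2 3 4
> c(5, 10, 15, 20) / c(5, 10)
[1] 1 1 3 2
> c(5, 10, 15, 20) / c(5, 10, 3)
[1] 1 1 5 4
Warning message:
In c(5, 10, 15, 20)/c(5, 10, 3) :
  longer object length is not a multiple of shorter object length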
So you can see that these operations on data vectors may have different types of outcomes in comparison to other software, so you have to be careful here; but I can assure you that these types of operations are going to be very, very useful when we start the statistics part. In the next lecture I will take up some other aspects of computing values using R software; till then, please practice these operators, take more examples, create more examples yourself, and then verify whether what you are getting really matches the mathematical operation which you can do by your own hand, manually. We will see you in the next lecture; till then, goodbye.
Lecture - 04
Welcome to the next lecture on descriptive statistics with R software. You may recall that in the earlier lectures we discussed how R is used for different types of computations with scalars as well as data vectors. In this lecture we will continue with the same topic, and I will show you that in R there are two types of computations: one using simple operations, which you have to define yourself, and besides those there are some built-in functions which can be used directly on the data or the data vector to obtain the required outcome. In this lecture I will also show you how to handle the R software when some value or values in our data vector are missing, not available for whatever reason. So, let us start our lecture. What I will do is take some examples, and through those examples I will illustrate how to use the built-in functions. There are many, many functions available, so it is not practically possible for me to cover all the built-in functions, but I will take a sufficient number of examples so that you are comfortable in using all the other functions.
So, let me take the first example: suppose I want to find the maximum among some values. I am taking here a data vector which contains four values, 4.2, 3.7, -2.3 and 4, and I want to know which of these four values is the maximum. Obviously, because there are only four values, you can see that 4.2 is the maximum value, and once you enter the command, it gives you this value, 4.2; and if you do it on the R console, here is the screenshot. Well, I will also show you later how to do this, and one thing you have to observe here: earlier I told you that whenever you give data, you have to give it in the form of a data vector using the c command. But there are some built-in functions which can be used without the c command, and I am giving you an example of this with the max function. You can see that in this case, and in the second case, I am finding the maximum among the same set of values, but in the second case I have used the c command: I am combining the data using the c command for the four values 4.2, 3.7, -2.3 and 4, and here also I get the same outcome, and here is the screenshot. But I would like to give you one piece of advice: it is difficult to keep track
of which built-in functions are used with the c command and which are used without it, so I will say the simple rule of thumb is: always use the c command. Whenever I have to give a data vector, without creating any confusion or any problem, I will simply give the data vector using the c command. Right. Okay. Similarly, in case you want to find the minimum, I am again trying the same set of values, but here the command is min, whereas in the case of the maximum it was max. Right. So now, if I write min and, inside the brackets, give all the data values, and press enter on the R console, then I get the value -2.3; and you can see yourself that out of 4.2, 3.7, -2.3 and 4, this -2.3 is the minimum value. Okay. So minimum and maximum are functions where you need not write the entire program yourself; somebody has already done the programming to find
out the minimum and maximum values, and those programs have been named min and max; you simply need to use them. But, as I told you, R is not a black box, so in case you really want to know what logic and what programming have been used, it is possible to look into the steps used in finding these values. Okay. Here is the same operation, and again I am showing you that in one case I am not using the c command, and in the other I am: in the second case the data is combined using the c command, whereas in the first case it is not, but in both cases the outcome is the same. So, again, I advise you to always use the c command to combine the data vector, and here is the output, the screenshot. Now I will show you how these things happen on the R console. Right.
So, let me go here: first I try to copy this command and I come to the
R console and I try to paste it here. So, you can see here, this will give you 4.2. In case I want to write down here the c command, you can see here that now the same data vector has been combined using the c command, and if you see, the answer is going to be the same. And similarly, in case you try to take here another example, suppose among 60, 20, say 56, 87, 97, 35 and so on, you can see here the maximum value is 97. Right. Okay. Now, similarly, in case you try to find out the minimum, I can use here the command m i n over the same data set, 4.2, 3.7, -2.3 and 4, and you can see here, now the minimum value is coming out to be -2.3, and similarly, in case I try to use here the c command, even then you will get the same outcome. Right. And similarly, if I try to take another example and find the minimum of the data vector consisting of 56, 23, 98, 65, 74, 34 and so on, it comes out to be 23. So, you see here, I am trying to take
a very small data set, where you can verify yourself whether it is giving you the minimum value or the maximum value; but these built-in functions are very useful when you are trying to deal with a huge data set, where there may be five thousand values, ten thousand values, twenty thousand values or so on. There you cannot find out these values manually, so these functions help you there. Now, if you try to understand, I have taken here two examples, m i n, the minimum, and m a x, the maximum, and I have illustrated how you can operate them over a data vector. Now, similar to minimum and maximum, there are some other built-in functions which are available in R. There is a long list, but here, in this slide, I am trying to give you an overview of those built-in functions and how to use them; the process and the procedure are exactly the same as what you have learnt in the case of minimum and maximum.
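As a minimal sketch of these two built-in functions, using the same small data sets as above (the values in the comments are what R returns):

max(4.2, 3.7, -2.3, 4)          # 4.2, works even without the c command
max(c(4.2, 3.7, -2.3, 4))       # 4.2, same outcome with the c command
min(c(4.2, 3.7, -2.3, 4))       # -2.3
min(c(56, 23, 98, 65, 74, 34))  # 23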
For example, in case if you are interested in finding out an absolute value. Right. Then you simply have to write here a b s and, inside the brackets, you simply have to give the data
vector; even a single value or a vector of values, both are acceptable. Similarly, in case if you want to find out the square root of any value or any data vector, the function is s q r t; similarly, in case if you want to round a value, the function is r o u n d; if you want the floor, the function is f l o o r; if you want the ceiling, then the function is c e i l i n g; if you want to find out the sum of numbers or, say, of the values in a data vector, then the function here is s u m, and for the product, the function is p r o d. Similarly, if you want to take different types of logarithms, those functions are there, functions are there for the exponential, the trigonometric functions sine, cos, tan, etc. are there, and hyperbolic functions such as sinh, cosh, tanh, etc. are also there, and there is a long list. But now, I will take here some examples and show you how to use these functions. For example, in case if you want to find out the square root of any value, suppose I want to
find out the square root of 4. Right. So, in this case I simply have to write s q r t and give, inside
the brackets, the single value 4, and if you try to see it here, this value will come out to be 2. Now, I will try to illustrate that if you want to find out the square root of a data vector, then exactly in the same way as addition, subtraction, multiplication and division are operated over each and every element in the data vector, this square root operator will also be executed over all the elements of the data vector. For example, in case I take here a data vector c(4, 9, 16, 25), and suppose I try to find out the square root of this data vector, then this will be operated like this: square root of four, square root of nine, square root of sixteen, square root of twenty-five, and the answer will come out to be two, three, four and five. So, you can see here, in this case, I am trying to do the same thing and the outcome is coming out to be two, three, four and five.
Now, similarly, you can try another example yourself: try to find out the square root of nine, and then the square root of another data vector containing the values 9, 25, 36, 49. I am just leaving it for your practice. Okay.
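For a quick sketch of some of the built-in functions just listed, with small illustrative values of my own choosing:

sqrt(4)                # 2
sqrt(c(4, 9, 16, 25))  # 2 3 4 5, applied element by element
round(3.567, 2)        # 3.57, rounded to two decimal places
floor(2.7)             # 2
ceiling(2.3)           # 3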
Now, another important aspect: sum and product. There is a built-in function here, s u m, and this function finds the sum of all the values inside the data vector. For example, in case if I want to find out the value of 2 plus 3 plus 4 plus 5, then I can give these values in the form of a data vector consisting of four values, 2, 3, 4, 5, and then I simply have to write s u m of this data vector, and this will give us the value of 2 plus 3 plus 4 plus 5, that is, the summation of all the values inside the data vector, and this value will come out to be 14, and this is the screenshot here. Similarly, in case I want to find out the product of, say, 2, 3, 4, 5, that is, 2 into 3 into 4 into 5, then I can combine this data in the form of a data vector c(2, 3, 4, 5), and then if I use here the built-in function p r o d, then
this will give us the multiplication of all the values inside the data vector. Right. So, this p r o d function is used to find out the product of all the values inside the data vector. For example, here I try to use this command over the R console and this value comes out to be 120: 2 threes are 6, 6 fours are 24, 24 fives are 120. Right. And this is the screenshot of the same operation. Now, I come to the R console so that I can show you these operations. Right. So I will try to show you here the square root, sum and product operations.
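A minimal sketch of that console session, using the same data as on the slides:

sqrt(9)              # 3
sum(c(2, 3, 4, 5))   # 14
prod(c(2, 3, 4, 5))  # 120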
For example, if I try to find out here the square root of nine, you can see here this comes out to be three, and if I try to find out the square root of our data vector, say 4, 9, 16, 25, this will give us the values of the square root of 4, the square root of 9, the square root of 16 and the square root of 25, which are 2, 3, 4 and 5. Okay. Now, let me take here some negative value; let us see what happens with -4, and here 9, 16 and 25. So, what is this here, N a N? This is something new for us, so I will try to show you the use of this N a
N, etc., all these things after a couple of slides in the same lecture. So, this is trying to show you, in very simple words, that there is some issue, some problem, and it is showing that it is not possible to compute it. Right. So, that is where you have to be careful: for this minus 4 it is giving us something like a not-available type of value, and for all the other values it is computing the square roots as usual.
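A small sketch of the situation just described; the negative value produces NaN while the rest are computed as usual:

sqrt(c(-4, 9, 16, 25))
# [1] NaN   3   4   5
# Warning message: NaNs produced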
So now, let me give you an idea of the sum function. If I try to find out here the sum of, say, five values, 5, 6, 7, 8, 9, combined in a data vector, this comes out to be 35, and this sum operator is also valid for negative values; for example, the sum of -3, -4, -5 and -6 will come out to be -18, so there is no issue, and in case I try to take here two data vectors, that also can be done; that we already have seen in the earlier
lecture. And similarly, I will show you here now the product function: say, I want to find the product of c(2, 3, 4, 5). Right. So, this comes out to be one hundred and twenty: 2 threes are 6, 6 fours are 24, 24 fives are 120. So, it is giving us the product of all the values present inside the data vector. Now, if I take here one value to be negative, then let me see whether this operates or not. You can see here that the answer is coming out to be minus 120 because these sum and product functions are valid mathematical functions, and they are operative over positive as well as negative values. Right. And similarly, if you want to see the use of the absolute value function, I can show you here: the absolute value of minus nine is actually nine, so you can see here what is happening, and similarly, if I take here a data vector consisting of -9, -6, 7 and 8, you can see here that there are two negative values and two positive values in this data vector. So now, once you operate the absolute value function, minus 9 will become 9, minus 6 will become 6, and 7 and 8, they already are positive values, so their absolute values remain the same. So you can see here that these operations are not difficult to do in the R software.
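A minimal sketch of the absolute value function on the same values:

abs(-9)               # 9
abs(c(-9, -6, 7, 8))  # 9 6 7 8, applied element by element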
And they will help us further in solving many complicated functions very easily. Besides this, for a quick revision, I would also like to show you that whenever we are trying to operate on a data vector, it is also possible to assign the outcome to a new variable. For example, in case I take here a data vector of two, three, four, five, and suppose I want to obtain the square of all the values, say 2 square, 3 square, 4 square, 5 square, then I can simply denote it here by x hat 2 (x^2), and this value can be stored into a new variable, say y. Right. And once you try to see the value of y, this will come out to be 4, 9, 16, 25, which is 2 square, 3 square, 4 square and 5 square, and this is here the screenshot. So this is a pretty simple operation, and it will help us in assigning the outcome of an operation into a new variable. Right. For example, in statistics we will see at many, many places that we need exactly such operations.
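As a small sketch of this assignment, using the same four values:

x <- c(2, 3, 4, 5)
y <- x^2   # squares each element of x
y          # 4 9 16 25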
So basically, suppose I have some data vector here consisting of some values x1, x2, ..., xn, and suppose I want to find out the sum of squares of these values; that means, you first square the values and then find out the sum. So now, using these built-in functions, you can see here what I can do. What do I have to do here? First, I need to define here all the values inside a data vector, so I am defining here a data vector c(2, 3, 4, 5). Right. And now, this is asking me to find out the square, so what can I do here? I need to find out the square, so I can write down here x hat 2 (x^2), or I can even write down here x multiplied by x (x*x), and this is going to give me the values two square, three square, four square and five square, and now I need to find out the value of the sum of two square, three square, four square, five square. So now, I can use here the built-in function sum(x^2) or, see here, sum(x*x), and this is going to give me the value of 2 square plus 3 square plus 4 square plus 5 square. So, you can see here that in case I want to find out the value of this function here, z, which is the sum of squares of different values,
then I can write it down simply as sum(x^2), and this value can be stored in a variable, say z, and the sum of 2 square plus 3 square plus 4 square plus 5 square comes out to be 54, and this is the outcome here. So let me show it to you over the R console.
So, you try to see here: I define here a vector with 2, 3, 4 and 5, and I try to find out here x square. Right. And suppose I try to find out here z equal to the sum of, say, x square; this comes out to be, if you try to see the value of z, 54. Right. And similarly, in case you want to find out the product of 2 square, 3 square, 4 square, 5 square, I can write down here, instead of sum, the product of x square, and this will come out to be 14400. So, what is this value? This is the product of 2 square into 3 square into 4 square into 5 square, or simply the product of 2 square, three square, four square, five square. Right. So, these types of operations are possible.
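A minimal sketch of this console session with the same data vector:

x <- c(2, 3, 4, 5)
z <- sum(x^2)  # 4 + 9 + 16 + 25
z              # 54
prod(x^2)      # 4 * 9 * 16 * 25 = 14400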
The computation of the following function is very important; it is a very popular function in statistics.
Suppose I want to find out the value of this function here,

$z = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$.

What is here $\bar{x}$? $\bar{x}$ here is the arithmetic mean; it is simply $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. So, what I am trying to show you here is that if I have got the values, I can find out their arithmetic mean, and for that there is a built-in function, which is called here mean (m e a n). So now, what I want here is, first, that from each and every value $x_i$ the arithmetic mean should be subtracted, then this difference has to be squared, and then I want to find out the sum of all those squared values. Now, this value can be further simplified to $\sum_{i=1}^{n} x_i^2 - n\bar{x}^2$: if you try to open the square, the summation becomes

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x} \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2n\bar{x}^2$,

so this quantity comes out to be the same as $\sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, which I
have written here. So now, in order to compute this value, what can I do? I already have computed the first part; this I already had written as sum of x hat 2, sum(x^2). And what about the n and the $\bar{x}$ values? What is n here? n is the number of elements in the data vector, and for that there is a built-in command in R which is called length; if I write here length of x, this is going to give me the length of the data vector, or the total number of data points in the data vector x. So, instead of n, I can use here the command length(x), and for $\bar{x}$ I already have a built-in command, mean; I can write down here mean of x, and then its square. Well, don't worry, all these commands like mean, length, etc. we are going to discuss in the further slides, but here I wanted to give you an example to show you how these built-in functions are going to be used.
So now, in case you try to do it on the R console: suppose I take here the data set, say 2, 3, 4, 5, and this is the same expression; I have written here sum of x square, which corresponds to the first term, length of x, which corresponds here to n, and mean of x, squared, which corresponds to $\bar{x}^2$, and then I store the value of this function into a new variable, z. So, once you try to execute it, first I show you here what the length of x is, which is coming out to be 4, and then what the value of z is; this comes out to be 5, and this is the screenshot of the same operation. Right. Okay.
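A small sketch of this computation, with the values R should return shown as comments:

x <- c(2, 3, 4, 5)
z <- sum(x^2) - length(x) * mean(x)^2  # 54 - 4 * 3.5^2
length(x)  # 4
z          # 5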
And similarly, if I take another example on the same lines: suppose I take here two data vectors, say x1 and x2, consisting of the four values 2, 3, 4, 5 and 6, 7, 8, 9, and suppose I want to find out the sum of the products. So, this is something like this: 2 into 6, plus 3 into 7, plus 4 into 8, plus 5 into 9; I need to find out the value of this thing. Now, I can do it very easily using these built-in operators. First, I need to find out the multiplication, so for that I can simply use here the operator x1*x2; that means I am multiplying the corresponding components of the two vectors x1 and x2, and whatever is the outcome, that I am going to sum. So, this value will come out to be 110. Right. So, before I go further, I will show you on the R console.
So, suppose I take here the same data vector that we have taken earlier, 2, 3, 4, 5. First, if you try to see here the length of x, there are 4 elements, so it should come out to be 4, and similarly, if you want to find out the mean of x, the mean of x will come out to be 3.5, that is, 2 plus 3 plus 4 plus 5 divided by 4, which is equal to 3.5. Now, in case you want to execute the same command here, sum of x square minus length of x into mean of x square, you can see here this function is coming out like this, and the value is coming out to be 5. And similarly, if I take here another data vector, say 6, 7, 8 and 9, and if I try to find out the sum of the products of the x and y data vectors, this is coming out to be 110. What is this value? This first multiplies the corresponding elements of the two data vectors x and y and then finds the sum of the products. So, with this illustration you can see here very clearly that these built-in functions will help us in many types of operations.
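A minimal sketch of the sum of products, using the same two data vectors:

x1 <- c(2, 3, 4, 5)
x2 <- c(6, 7, 8, 9)
sum(x1 * x2)  # 2*6 + 3*7 + 4*8 + 5*9 = 110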
Now, after this, I would like to address another small topic which is very, very useful. Suppose someone is asked to collect, say, five thousand data values, and after the values are collected, they are manually entered on a computer; there are various possibilities that not all the values are really present in the data vector and some data is missing.
So, in case a value is missing in any data vector, then how to handle it? This type of situation may occur, for example, when someone is asked to collect the data from, say, five houses, and suppose he goes to the third house and the house is locked. So he will try to indicate that this data is not available, and he will use some symbol. In R, in case the data is missing, there are some standard symbols which are used, and there are some functions and commands which help us in modifying the statistical tools and the mathematical tools to handle the missing data.
So, before I go on to discuss the handling of missing data, first let me inform you of a few small things. There are two words, TRUE and FALSE, which are written in capital letters: capital T, capital R, capital U, capital E, and similarly all capital letters in FALSE. These two are the logical values, and they are used to compare different expressions. We know that in mathematics, there
are two types of operators: the mathematical operators and the logical operators. For example, if I say five is more than three, is this true or is this false? This is true. But if I say five is smaller than three, then it is false. So, here I just want to know the answer in terms of true and false; I am not interested in how much larger or how much smaller it is. So, these are logical comparisons, and these capital words TRUE and FALSE are the logical values; they are reserved words, which means, as soon as you write capital T, capital R, capital U, capital E, R will automatically assume that you are using the logical value TRUE. So, you cannot use it to define a new variable, or any variable from your side; that will not be acceptable. Also, in place of the entire word TRUE or FALSE, you can use just the first letter, capital T or capital F, to denote the words TRUE and FALSE respectively. So, T can be written in place of TRUE, and F can be written in place of FALSE. Right. And remember one thing: these TRUE and FALSE have to be written only in capital letters. In case you are writing them in small letters, or even if a single letter in these two words is small, it will not remain a logical value, and R will not consider it as a logical value. So, the small true and the small false are not possible, and they are not the same as TRUE and FALSE.
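As a small sketch of these logical values on the R console, with comparisons of my own choosing:

5 > 3  # TRUE
5 < 3  # FALSE
T      # shorthand for TRUE
F      # shorthand for FALSE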
As soon as we get any data, I try to input the data. The first option is this: I can input the data in the form of a data vector using the command c. Now, I would like to know, is there any value which is missing in the data? So, the first question is how to know whether any data value is missing inside the data vector. In order to know this, there is a built-in command, what we call is dot n a (is.na), and inside the brackets you have to give the data vector in which you want to find whether any value is missing. So, for example, I take here a data vector consisting of four values, 10, 20, 30, 40; here you can see all the values are present and no value is missing. So I execute the command is.na and, inside the brackets, x, and you will get here an outcome like this: FALSE, FALSE, FALSE, FALSE. That means this value 10 is not missing; so the statement '10 is missing', or the question 'is 10 missing?', gets the answer FALSE. Similarly, the next FALSE corresponds to the 20; so with this command I am asking, is 20 missing, and the answer is FALSE, no, it is not missing, it is present. Now, both 30 and 40 are also available, so this command is.na is giving a FALSE for them as well. So this is the screenshot of the same operation; I will show you over the R console also.
And now let me take here another example. Here I am replacing the value 20 from the earlier data vector by N A. Right. You can see here I am writing capital N and capital A; this is also a reserved word, which I will discuss after a couple of slides, and it is used in R to indicate that the value is not available. Right. So now, in this case, whosoever is entering the data has to be told that in case the data is missing, he or she has to write NA in place of the missing data. So now, the data vector comes out to be consisting of four values: 10, NA, 30 and 40. Now, when I operate the command is.na, this is giving me the outcome FALSE, TRUE, FALSE, FALSE. What does this mean? This FALSE corresponds to the 10; I am asking here, is 10 missing, and the answer is, no, it is available, hence the statement is FALSE. So, it is giving me the value FALSE. Now, I come to the second value, NA, and I ask is.na of this value; yes, this value is missing, and hence the answer is TRUE; my statement is true. And similarly, for
30 and 40, these two values are available, so this is saying that these values are not missing. And after this, suppose there is more than one missing value; even then there is no problem at all. For example, in the same data vector, if I make two values, 20 and 30, missing in this new x, then once I operate the command is.na, it is giving me FALSE for those values which are available and it is giving me the answer TRUE for the values which are not available. So by this operation I can always find out whether values are missing or not. As long as I am getting all FALSE, that means all the values are available, and if I am getting even a single TRUE, that means a value is missing in the data vector. And now, I will show you what happens in case a value is not available, it is missing in the data vector, and I try to operate some built-in function.
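A minimal sketch of is.na() on a complete and an incomplete data vector:

x <- c(10, 20, 30, 40)
is.na(x)  # FALSE FALSE FALSE FALSE: no value is missing
x <- c(10, NA, 30, 40)
is.na(x)  # FALSE TRUE FALSE FALSE: the second value is missing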
So, let me come to the R console so that I can show you whether this is really happening or not. I take here, for example, the data vector c(20, 30, 40, 50). Right. So, this is my x here, and I do is.na of x, and it is giving me FALSE, FALSE, FALSE, FALSE; that means all the data values are available. Now, in the same data set, I replace the second value by NA, and you can see here my x becomes 20, NA, 40, 50, and if I repeat the same command is.na, then for the missing value it is giving me TRUE. Right. And similarly, if I make more than one value missing and operate the same command, I get FALSE for those values which are available and TRUE for those values which are absent, which are missing; so for the available values, like the 20 and the 40, I am getting the outcome FALSE, and TRUE for the missing ones.
So now, let me come back to our slides and let me show you here what is going to happen when some value is missing. Suppose I consider the data set in which one value is missing: 10, 20, NA and 40. Now, suppose I want to find out the mean, mean of x. The sample mean is defined as the sum of all the values, x1 plus x2 plus ... plus xn, divided by the total number of observations, n. Right. So now, in case you try to find out the mean of 10, 20, NA and 40, this will become 10 plus 20 plus NA plus 40 divided by the number of elements in the data vector, which is here 4, and you will see that it is not really mathematically possible to add the value NA to any numerical values. So, this answer will come out to be NA. Right. Whereas, in case you use this option, n a dot r m equal to TRUE (na.rm = TRUE); well, this option I am going to explain later on, whenever we deal with the statistical functions, but here I want to give you an idea of how you are going to modify the same command when there is a missing value in the data vector. Right. So, in this case, suppose I know that there is one value which is missing in the data set; I have to modify my command, mean of x. So, I can write down here mean of
x, and I am writing here n a dot r m, which means that the NA values have to be removed, and this option can be TRUE or FALSE; here this option is TRUE. By writing TRUE, or simply T, all the NA values will be removed, and the arithmetic mean will be calculated on the basis of the available numerical values. So, in case I write down this command here, then this arithmetic mean will be found as 10 plus 20 plus 40 divided by 3, not 4. Right. And then I will get here the value 23.33 and so on. So, this is how we work when some values are missing.
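A small sketch of the modified command, with the same data:

x <- c(10, 20, NA, 40)
mean(x)                # NA, since NA cannot be added to numbers
mean(x, na.rm = TRUE)  # 23.33333, i.e. (10 + 20 + 40) / 3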
So, let me take here the data x to be 10, 20, NA and 40. Right. You can see here, this is my data, and if I write down here mean of x, this is giving me NA, and if I find out here mean of x with n a dot r m equal to TRUE, then I get here the value 23.33. So, here again, just for the sake of illustration, I will show you that instead of using the entire word T R U E, I can also use capital T, and the answer is coming out to be the same; but on the other hand, in case I try to
write it in small letters, say t r u e, this will give me an error, and even if I try this with only n a dot r m equal to a small t, that is, not capital T, this is again going to give me an error.
But, before concluding the lecture, I would like to inform you that in R, in some places, you will be getting an outcome like N U L L, in capital letters. This is the outcome which is returned by some of the functions. Remember one thing: NA and NULL are not the same thing, and even the capital N A and the small n a are not the same thing; capital N, capital A is a reserved word, and it is case sensitive. The difference between NA and NULL is the following: NA is a placeholder. Placeholder means, yes, inside the class a student has been assigned a seat, but the student is absent today; it does not mean that the student does not exist. So in this case the student is missing, and that is going to be indicated by NA, not available, at that place. But in case I am using the word N U L L, this NULL stands for something which never existed. So that is the difference between the use of NA and NULL.
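To sketch the difference, here is a small illustrative example of my own:

x <- c(10, NA, 30)
length(x)  # 3: NA holds a place in the vector
y <- c(10, NULL, 30)
length(y)  # 2: NULL stores nothing, so the element never existed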
We have to be careful while we try to use them. So, here I would like to stop, and I would request that you please try to attempt your assignment, try to take some exercises from the books, or even create your own exercises: just take a data set, do small manipulations, and verify them against your manual calculations, whether you are getting the same thing which you used to get manually. Try to replace some values in that data vector by NA and see what really happens if one value is missing, if two values are missing, and even if no value is missing, and how you are going to obtain and interpret the values of the logical constants TRUE and FALSE. Right. So, you practice, and we will see you in the next lecture.
Lecture 05
Welcome to the lecture on the course Descriptive Statistics with R Software. In the last couple of lectures, we have understood how R is going to be useful for doing various types of mathematical operations, and we have also understood how missing data values can be handled inside the R software. Now, in this lecture, I will give you an idea of how to handle matrices in the R software. So, what is a matrix?
If you try to see it from the mathematical point of view, a matrix is simply a rectangular array which has got, say, n rows and p columns, and this will be denoted as a matrix of order n cross p. For example, I can always write a matrix, in the standard notation, as

$x = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$.

Right. So, this is here a matrix, and what we are going to assume is that all these entries in the matrix are some numerical values, some real numbers. Right. And in case I want to denote a particular element, for example this $x_{12}$, then it is going to be denoted as x, 1 comma 2; that means this is the element in the first row and the second column. So, a question arises here: what is the difference between a data vector and a vector in terms of matrix theory? A data vector is simply a data vector, but in matrix theory we have a number of different commands and a different structure. So, first you have to decide whether the data has been inserted in the form of a data vector or in the form of vectors and matrices; the operations on the two are going to be different, and that is what we are going to see in the further lectures.
So, the first question comes: how to create a matrix? In order to create a matrix, we have a command here, m a t r i x, matrix, and inside the brackets, you can see here, I am writing several things: nrow, ncol, data. This nrow is giving the information of how many rows I want, so this gives us the information on the number of rows; similarly, ncol is giving us the information on the total number of columns in the matrix. And how does the data have to be given? This has to be given using the c command, as a data vector, and that data is then arranged inside the matrix. So, here, if you try to see, I am using nrow equal to 4, which means the number of rows is 4, and the number of columns is 2, so there are going to be 4 into 2, that is, 8 values in the data vector. So, I am writing those values here: 1 2 3 4 5 6 7 8. Now the data is going to be arranged in four rows and two columns. You can see here there are 4 rows, 1 2 3 4, and there are two columns, 1 and 2, and the data goes like this: 1 2 3 4 down the first column, and then 5 6 7 8 down the second. So, yes, there can be a question why the data goes in column wise and why it cannot be row wise. That I will address, but here, at this moment, I would request you to please observe how the matrix has been created. I have simply given the number of rows, the number of
columns and the data, and based on that, a matrix of order 4 by 2 has been created. Right. And, for your remembrance:
The parameter nrow defines the number of rows in the matrix, the parameter ncol defines the number of columns in the matrix, and the parameter data defines what data has to be given inside the matrix. Right.
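A minimal sketch of this matrix creation, with the outcome shown as comments:

x <- matrix(nrow = 4, ncol = 2, data = c(1, 2, 3, 4, 5, 6, 7, 8))
x
#      [,1] [,2]
# [1,]    1    5
# [2,]    2    6
# [3,]    3    7
# [4,]    4    8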
And usually, in case you are not giving any option on whether the data has to be entered row wise or column wise, the default is column wise, as we have seen in the earlier slide.
So, now I am going to consider here the same matrix x which I have just created, and suppose my query is this: I want to access a particular element. So, how to obtain a particular element? Suppose I want to obtain this element 7. Now, what is the address of 7? The address of 7 is this: it is located in the third row and the second column. So I will write down here the name of the matrix, small x,
and inside square brackets I will write down the address. So, in this case, this address is going to be x bracket 3 comma 2, and once I type x[3, 2] on the R console, I will get here the value 7, which is the same value as here. So before I go further, I will show you how the things happen on the R console. I take the same matrix here, and you can see here, this is the matrix x. Right. And suppose I want to obtain a particular element, say 3 comma 2: I am getting here 7, so this 7 corresponds to this element. Similarly, if you
want to find out, say, 2 comma 2, this is 6; but if you try to find out here 2 comma 7, you can see here that you get an error, because there is no such element in the matrix.
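A small sketch of accessing single elements of this matrix:

x[3, 2]  # 7, the element in the third row and second column
x[2, 2]  # 6
x[2, 7]  # Error in x[2, 7] : subscript out of bounds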
Now I address here the second issue: in case you want to enter the data row wise, then what option do you have to give? You simply have to give an option here, or add a parameter, byrow equal to TRUE. So, you can see here, all the other parts of the syntax are the same as what we used in the earlier slide, but I have used here one new thing, byrow equal to TRUE, and in this case you can see that the data 1 2 3 4 5 6 7 8 is now entered row wise.
(Refer Slide Time: 07:39)
So, this is how we are going to do it, and this is the screenshot; I will show you on the R console also.
(Refer Slide Time: 07:49)
And similarly, in case the data has to be entered column wise, then I simply have to add the parameter byrow equal to FALSE, which means I don't want the data to be entered in the row wise mode. So, obviously, once this statement is FALSE, the opposite is true, and the data has to be entered column wise. So, in case I execute these things over the R console, I can show you here: I use here byrow equal to FALSE (let me choose my font size here so that you can see it clearly), and you can see here that this is byrow equal to FALSE. Right. And now, in case I see what x is, this is the same thing as before; but now, in case I do the same thing and make it TRUE, or I can use capital T also here, now you can see what the outcome is: the data in the first case and the data in the second case
are arranged in different ways, first column wise, and then, in the second case, row wise. Right.
So, this is the screenshot of the same operation, for your information only.
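A minimal sketch of the two modes of entering the data:

matrix(nrow = 4, ncol = 2, data = 1:8, byrow = TRUE)   # row wise: rows 1 2, 3 4, 5 6, 7 8
matrix(nrow = 4, ncol = 2, data = 1:8, byrow = FALSE)  # column wise, the default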
(Refer Slide Time: 09:11)
And similarly, if I want to find out the transpose of a matrix. What is the transpose of a matrix? When we interchange the rows and the columns, then it is called the transpose. For example, say I have here a
matrix x equal to, row wise, 1 5, 2 6, 3 7, 4 8; then this matrix can be given on the R console using the command matrix, the same command that we have used earlier, and its outcome will look like this. Now, in case I want to find out the transpose, the command to find out the transpose is t(x); t means transpose, and inside the brackets you have to give the name of the matrix whose transpose you want to find out.
The original matrix had the data 1 2 3 4 in the first column and then 5 6 7 8 in the second; the number of rows was four and the number of columns was two. But once we take the transpose, the number of rows is two and the number of columns is four, and the data is now 1 2 3 4 along the first row and then 5 6 7 8 along the second. So, this t(x) is the command to find out the transpose of a matrix in R, and now I will show you on the R console also. For example, I
take here the same matrix that we had taken earlier. So, you can see here, this is your x matrix, and if I find out the transpose as t of x, now it has changed: the first row becomes the first column, the second row becomes the second column, the third row becomes the third column, and the fourth row becomes the fourth column. So, this is how we can find out the transpose of a matrix.
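A small sketch of the transpose of the matrix x created above:

t(x)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    8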
Now, next I address how we are going to use the operations of matrix multiplication, addition and subtraction in the matrix setup. So, we know that if I have a matrix like $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and I multiply it by a scalar 5, then this operation is done on each and every element: 1 into 5, 2 into 5, 3 into 5 and 4 into 5. And now, in case I make here the multiplication of this matrix with, say, another matrix $\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, then this is given by: 1 into 5 plus 2 into 7 and 1 into 6 plus 2 into 8 in the first row, and 3 into 5 plus 4 into 7 and 3 into 6 plus 4 into 8 in the second row; the products are taken and then added.
So, this is how the multiplication is done in mathematics, and this is what is taught to all of us. So, now, in this case, I take here the same matrix, in which the data values are 1, 2, ..., 8.
And now, in case I try to multiply this matrix by 5, the operator that you have to use here is star (*); this is the same operator that was used for multiplication earlier. So, remember: when you are trying to multiply a matrix by a scalar, then the operator is only star; when you are trying to multiply a matrix by a matrix, then I will have a different operator. So, in this case, if your x matrix is like this and you try to obtain 5 into x, then you can see here that this element is multiplied by 5, this element is multiplied by 5, and each and every element is multiplied by 5, so you are getting the outcome 1 into 5, 3 into 5, 5 into 5, 7 into 5, 2 into 5, 4 into 5, 6 into 5 and 8 into 5, that last one being 40.
And here is the screenshot of the same thing; I will show it to you here on the R console also.
Now, you can see here, this is your x, and now you try to make 5 into x, and you can see here that every element has been multiplied by 5. And now I am going to consider the multiplication of a matrix by a matrix. We already have created the matrix x and we already have created the transpose of x, which is already there, so I would like to utilize that. Right. So, now I am multiplying the transpose of the matrix x with the matrix x, and now you can see here the operator: this is a matrix and this is a matrix, so when you are trying to multiply a matrix by a matrix of suitable order, then you have to use the operator percentage, multiplication star, percentage (%*%). And remember one thing: all those rules for matrix multiplication from mathematics have to be satisfied here; for example, if you have here two matrices A and B, they can be multiplied only if their orders are like this:
if A is of order m cross n, then B has to be of order n cross p. So, these two orders have to match, otherwise this won't be valid. This is how we do it here, and I will show you on the R console also. And if I show you here, you can see here that t(x) %*% x will come out to be like this.
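A minimal sketch of scalar and matrix multiplication, defining the same 4 by 2 matrix explicitly so the numbers can be checked:

x <- matrix(nrow = 4, ncol = 2, data = 1:8)
5 * x       # every element of x multiplied by 5
t(x) %*% x  # a 2x4 matrix times a 4x2 matrix, giving a 2x2 matrix
#      [,1] [,2]
# [1,]   30   70
# [2,]   70  174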
Now I will pick here some more examples to make you understand. Suppose I take here two other matrices, which are just matrices of order two by two, and whose data values are 1 2 3 4, with the data entered by putting the parameter byrow equal to TRUE. Right. So, in case you insert that data, the outcome will come out like this: there are 2 rows, 1 and 2, and 2 columns, 1 and 2, and the data here is entered by row, 1 2 and then 3 4. And similarly, I take another data set, 11 12 13 14, and on similar lines I create another matrix of order 2 by 2, and I call it z; this matrix will look like this, with the data 11 12 13 14. So, now I have here two matrices of order 2 by 2, and I will show you how to multiply them. Right. So, you can see here there are two matrices, y and z, and if I multiply, y percentage star percentage z (y %*% z), you get the result like this. Right. And here is the screenshot of the same operation for your understanding.
And this is again the same screenshot, which has been obtained over here. Right.
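A small sketch of this example:

y <- matrix(nrow = 2, ncol = 2, data = c(1, 2, 3, 4), byrow = TRUE)
z <- matrix(nrow = 2, ncol = 2, data = c(11, 12, 13, 14), byrow = TRUE)
y %*% z
#      [,1] [,2]
# [1,]   37   40
# [2,]   85   92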
And now, after multiplication, I will address addition and subtraction. Addition and subtraction are quite simple, simple in the sense that you have to use the same operators that you had used earlier: for addition you simply have to use the plus operator, and for subtraction you have to use the minus operator. But again, I would repeat that all the rules of matrix operations have to be satisfied before you try to do any matrix operation. And it is pretty common, once you are handling a complicated structure where you are dealing with various matrices, that many times the orders of the matrices do not match, and this gives you an error, and then you have to actually see what is really happening. So, just be careful. So, when I am adding here two matrices A and B, I will assume that they have got the same order, say m cross n, and when I am subtracting here two matrices, I will again assume that they have got the same order, that means the same number of rows and the same number of columns. Now, I consider here the same matrix x which I had created earlier, and you can see here that there was another matrix which I had created, five into x. So, now I have here two matrices: x, which is of order 4 by 2, and then another matrix, 5 into x, which is also of order 4 by 2. So, I can add them together; now I try to add x
and 5x, and you will see here what will really happen: all the corresponding elements of x and 5x will be added. And similarly, in case I do here subtraction, then the corresponding elements of the two matrices will be subtracted. So, using the plus operator I can do addition, and using the minus operator I can do subtraction, which is pretty straightforward. Right. So, I can show you here on the R console also. Now, let me take here one more example, with the same matrices y and z that we have created earlier, to show you the addition and subtraction operations.
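A minimal sketch of matrix addition and subtraction with the matrices y and z from above:

y + z  # corresponding elements added: 12 14 / 16 18
y - z  # corresponding elements subtracted: -10 -10 / -10 -10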
So, if you try to see here, earlier I had created these two matrices, y and z, of order 2 by 2 (the y was of order 2 by 2, the z was of order 2 by 2), using the data sets 1 2 3 4 and 11 12 13 14. Now, in case I make here the addition, you can see here, or rather I can show you: this 1 and 11 will be added, 2 and 12 will be added, 3 and 13 will be added, and 4 and 14 will be added, and the same thing will happen with the subtraction also. Right. So, these are the operations if I use here the addition operator, and if I try
to use here the subtraction operator, two matrices of the same order have been added and subtracted. Right. So, before I go further, let me show it on the R console also. So, you can see here, okay, first let me go through x: you can see here, this was the matrix x, and now you want to do x plus 5 into x, and this is something like this; and if you want to see what 5 into x is, it is like this; so, you can see here that x and 5x are added together. Right. And similarly, if you try to see here, this is your x, and your 5x is like this, and now I make here the subtraction, 5x minus x. You can see here that the corresponding elements have been subtracted, and if you make it x minus 5 into x, then again all the values will occur with a negative sign. And similarly, if you try to recall, I had created this y matrix and z matrix, so I can make here y plus z; you can see here, it is like this, I mean, the corresponding elements are added; and if I do the subtraction, y minus z, then the corresponding elements are subtracted. So, you can see here that it is not really a difficult operation. Now, I would like to address
here another issue: once you are dealing with vectors and matrices, sometimes you need to access a particular part of the matrix; that can be a particular row, that can be a particular column, or that can be a particular submatrix. Then how to do it? So, I will show you here. Suppose I take here a matrix, which is the same matrix which I have created earlier, and suppose I want to access the third row. My matrix has been given, say, by the name x, with the values one, two, three, four, five, six, seven, eight, and I want to call here the third row. In that case, I simply have to write down the name of the matrix, and then I have to write down the address; you can see here, this is the address of the row, and this is the address of the column, which is actually here blank. So, as long as you give the column address as blank, this will indicate that the entire row is needed. And it is not difficult to remember, because if you try to see, this 3 comma blank inside the square brackets is the same address which is given over here in the matrix; so that is not even difficult to remember. Sometimes people do get confused about whether to put the blank in the row or in the column; so, don't worry about these things: you simply look at the matrix, try to look at the row or the column that you want to access, and simply give the same address as given in the matrix. And similarly, in case I want to access the second column: the second column here is like this, you can see here it consists of the values 2 4 6 8, so again the address of the column is given here as a blank sign, comma, and then 2, inside the square brackets. So, I can write down here the same address: the matrix name and then, inside the square brackets, the row address, which is here left blank, and the column address. And this value will come out to be 2 4 6 8. Yes, here you have to be a little bit careful: I am calling the second column, so ideally this should come out as a column, 2 4 6 8 written vertically, but this doesn't happen in R; whether you call a row or a column, the outcome will look similar. Whether you have called a row or a column can be seen only by looking at the command, whether you have written x, inside the square brackets, blank space comma 2, or x, inside the square brackets, 3 comma blank space. Right. And I will show you this on the R console also, but before that, let me show you something more: suppose I want to recall or access a submatrix of a matrix. A submatrix of a matrix means a particular section of the matrix. Suppose, from the same matrix, I want to recall this part only, and this part has to be left out, meaning I don't want to call it;
so this is the submatrix which I want to call. This is pretty simple; always remember one thing: whatever you want to recall, just try to give the correct address. Right. So I can use here x, and now I have to give the address, the rows and columns which I want to choose. I am choosing here the rows 1, 2 and 3, so I can give it here as 1 colon 3, and what about the columns? I am choosing here 2 columns, the first column and the second column, so I am using here 1 colon 2, and this is my address, and as soon as I write it down and press enter on the console, I will get here the required submatrix. So, you can compare: this is the same submatrix, and this is the screenshot. So, I will show you these things on the R console here.
So, I will take here x, the same matrix that we had considered earlier. Now, suppose I want to recall the second row: you can see here, I am simply not typing anything after the comma, and I will get here the second row, 3 and 4, you can see here; and similarly, if I want to find out the fourth row, this is here 7 and 8; and similarly, if I want to find out the first column, then I have to leave the row address blank, not type anything, then comma 1, and this will give me the first column, 1 3 5 and 7. But again, you can see here that the structures of these two outputs are the same, so you will not be able to decide, just by looking at them, whether you have recalled a particular row or a particular column; but by looking at these addresses, or the structure of these addresses, I can always find out whether I have recalled a column or a row. Right. Similarly, in case I want to find a submatrix: suppose I want to find a submatrix consisting of the first 3 rows, 1 2 3, and two columns, the first and second column; then you can see here, I am getting here the same thing. And similarly, in case I want to find only the first two rows and the first two columns, then I can give the row address as row numbers 1 and 2 and the column address as column numbers 1 and 2, and you will get here the corresponding submatrix; from here you can see, this is the submatrix which you have obtained. Suppose I want to find another submatrix, which consists of the third row, the fourth row, and the first and second columns; then I can write down here x, inside the square brackets, 3 colon 4, and then the columns, 1 colon 2. And so, you can see here that you are getting the 5 6 7 8, and this is the same submatrix which you have obtained here. Right. So, I have tried my best here to consider only those commands related to matrix theory which are going to be useful for us; but besides those things,
most of the matrix operations are possible in R, and built-in functions are available. For example, if you want to find out the inverse of a matrix, there is a command solve, s o l v e. But this list is very, very long, so I would leave it up to you: whenever you want to use a particular operation related to matrix theory, please try to consult a book or the R software help menu and try to see how that matrix operation can be done.
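As a minimal sketch of solve(), with a small invertible matrix of my own choosing:

m <- matrix(nrow = 2, ncol = 2, data = c(2, 1, 1, 2))
solve(m)        # the inverse of m
solve(m) %*% m  # recovers the 2x2 identity matrix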
And I would like to stop here, and I would request that you please practice more, so that you get more conversant with these things, and from the next lecture we will start with the statistics part. So, enjoy the course, practice it, and I will see you in the next lecture. Till then, goodbye.
Lecture 06
Introduction to Descriptive Statistics
Welcome to the lecture on the course Descriptive Statistics with R Software. Now, you may kindly recall that in all the earlier lectures, we discussed and got an idea of how the R software is going to help us in different types of computations. Now, from this lecture, we are going to start the discussion on the topics of statistics. But here I would like to tell you one thing, or I would like to clarify one thing: the topics in descriptive statistics, whatever I am going to consider here, are pretty elementary, and I will try to go into as much depth as possible under the given time frame, but my idea is not really to teach you statistics here. My idea is that most of the topics, you will see, you already know, and my objective is to make you comfortable, so that in case you want to use the R software for the computation of those topics, you should be comfortable, you should be confident. Once you are confident in handling the basic topics, I am sure, and I am confident, that there should not be any problem in handling the advanced topics in statistics. Besides this, people are using these statistical tools very often, but sometimes they don't know why they are using them; sometimes they don't know what the interpretation of the different quantities is. So, that will be my other objective here: whatever tools of descriptive statistics I am going to handle, I will try to discuss their concept, their implications and their computation using the R software. Right. So, in this lecture I am definitely not going to use the R software for the computation, but I will try to give you an overview of what descriptive statistics is, how it helps, and what the ways are in which it can give us different types of information.
So, I will be considering only the basic aspects in this lecture, and possibly in the next lecture also. So, one of the basic, fundamental questions comes here: what really is statistics? I would say, simply, that statistics is a science which takes out the information contained inside the data and converts it into a form which is useful for making a decision. This decision can be at your office level, at the level of policy formulation, for forecasting at the country level, or anything else. So, what are we going to do here? We are trying to collect the data, we are trying to analyze the data, and based on that, we will try to make some statistical inferences, and those inferences are drawn from the numerical facts which we are going to call data. So, data is a very important thing in statistics, and it gives you some information.
This is essentially the starting point for knowledge discovery on the basis of data. Yes, the discovery of knowledge can be done in different ways, by looking at a picture, by reading on some subject, and similarly, statistics also gives us a tool to discover the knowledge that is contained inside the data. So, data is essentially a very important source of information; data contains much information inside, but the problem is the following: if I know something, then I can speak and I can inform you, but data cannot speak, data cannot listen, data cannot understand the language that we speak. For example, if I am speaking the Hindi or English language, possibly you can understand it, but if I try to speak in a language which you don't understand, then I will not be able to transfer the knowledge. So, similarly, data has its own language. For example, you have seen that if somebody has some problem in speaking or hearing, then there is a sign language, and that language can be understood only by those people who understand it. So, similarly, data also has a language, which is based on different types of symbols, notations and interpretations. So, our objective here is that we want to know the tools by which we can draw out the information contained inside the data.
(Refer Slide Time: 05:53)
So, essentially, I can say here that statistics is the language of data, and it provides a scientific way to extract and retrieve the information hidden inside the data. And remember one thing: statistics cannot do any miracle. Sometimes, you might have heard, people try to make different types of jokes, different types of comments on statistics, that statistics lies, or something like this. So, I would like to inform you that statistics never lies. Right. Statistics is simply based on the data, and it is our capability how much information we can retrieve from inside the data; for example, if somebody is speaking to us in a sign language, then it depends on our capability how correctly we can interpret it. So remember one thing: statistics also cannot change the process or the phenomenon; whatever process is happening, that will happen, and as a statistician, I am not allowed to alter or change the process. I simply have to collect the data on the basis of the process which is happening,
which is continuing and on that basis I have to take a call, I have to take a decision, that which of the tool
is most appropriate tool in this situation, to draw or to extract the information hidden inside the data. Right.
Ok, and definitely the inferences we are going to draw on the basis of a statistical tool are going to be used for different purposes; for example, one of the basic purposes is forecasting. Statistical tools provide forecasting, but remember one thing: this is not like an astrologer's parrot, where the parrot chooses a chit and one reads from it what one's future is. Statistics does forecasting on the basis of scientific principles, the principles of statistics. Now, when you come to this aspect, that you have to choose a tool for a data set to extract the correct information, then in this process I can divide the entire thing into two parts.
One part is this: suppose my data is here, and there are two options, that the data is correct, and the second option, that the data may be wrong. What do you really mean by data being correct and data being wrong? Suppose I want to know the average height of the students in a class; then obviously I have to collect the data on the height. But suppose I am trying to collect the data on the weight, and based on that I am trying to infer about the height; then it is not appropriate. Right. So, in that sense I am trying to say that the data has to be chosen correctly, matching the objective of the study. The second aspect is the choice of tool. There are many, many statistical tools which are available, and we are going to study them in the further lectures. But the main thing is this: one has to choose the correct tool to diagnose the problem and to solve the problem. Just as, if you go to a medicine shop or to a doctor, there are thousands of medicines, what does the doctor do? The doctor tries to decide which medicine is most suitable for the given problem. Similarly, in statistics also, we have many, many tools, and we need to make a decision about which tool is going to be appropriate to draw the correct statistical inference for a given problem. So, the
choice of a statistical tool can be correct or wrong. Now there are four options. The first option: suppose I choose the wrong statistical tool for a wrong set of data; then the decision is not going to be correct. The next option: I choose the wrong statistical tool and the data is correct; even in this case, using the wrong tool on correct data, I will get an incorrect decision. Similarly, if I take wrong data and the correct statistical tool, then using the correct statistical tool on wrong data will also give us an incorrect decision. The last option, which is the correct one, is that I use the correct statistical tool on the correct data. So, this is the only option: unless and until you choose the correct statistical tool and you collect your data correctly, you will not be able to get the correct statistical inference out of it. The rule is very simple: garbage in, garbage out. Right. And one thing I would like to mention is that sometimes
people come to us and ask us to do the statistics very quickly, for example: I have to submit my thesis tomorrow, I have this data, please try to help me. Well, it is not so simple at that stage, because you are the one who has understood the entire process, and as a statistician, I have not been told about your process. So first you need to explain to me the entire process, then I will try to understand the data generating process, and only then can I do something. It is also possible that the type of tool which you need may not always be available, but needs to be developed. So, statistics always needs some time to
understand the phenomenon. Now, another popular question I am asked: when I come to the aspect of descriptive statistics, there are two types of tools, one is graphical tools and another is analytical tools. There are different types of graphical tools, such as two-dimensional and three-dimensional plots, scatter diagram, pie diagram, histogram, bar plot, stem and leaf plot, box plot, etc.; there is a long list. In this case, people do ask me which of the graphs is more suitable, or they also feel that if they use a larger number of graphs, then their analysis is going to be better. I would say this is only a myth; you simply have to choose the correct graph and you have to use the correct number of graphics. So, the appropriate choice of graph
and the appropriate number of graphics is what is going to help you in getting the correct information. As for the analytical tools, I can say there are different aspects on which we try to analyze the data: for example, I would try to find out what is the central tendency of the data, what is the variation in the data, what is the structure of the data, and what type of relationships exist inside the data. So, for example, when I come to the aspect of measures of central tendency, then we have different tools: mean, median, mode, geometric mean, harmonic mean, quantiles, etc. And similarly, when we are trying to understand the variability in the data, then we have different types of tools: variance, standard deviation, standard error, mean deviation, absolute deviation, range, etc. So, you can see here that there are two aspects: one is the central tendency of the data and another is the spread or the variation in the data. Now, in case you want to study the central tendency of the data, then out of this list you have to choose the
appropriate tool, and similarly, in case you want to study the nature of the variation in the data, then you have to choose the appropriate tool. And similarly, in case you want to find out what the structure of the data is, then you have to choose a proper tool for symmetry, and then there are the concepts of skewness and kurtosis; these concepts are going to give us more information about the structure of the data. And then, another aspect can be, if I have data on more than two aspects, something like height and weight, or height, weight and age, then there may exist some relationship in the data, or there may exist some coherent structure inside the data. In order to study those aspects we have the tools of correlation coefficient, rank correlation, multiple correlation, partial correlation coefficient, correlation ratio, intraclass correlation coefficient, linear regression, nonlinear regression, etc. So, there is a long list.
So, when we talk about descriptive statistics, descriptive statistics is not a single tool, but a collection of an appropriate number of tools, which may include graphical tools as well as analytical tools, and the choice of analytical tool also depends on what exactly you want to study. Many times people come to us and ask: sir, can you please do some statistical analysis on my data? In that case I would always request them: please let me know what you really want to know from this set of data. Based on that, I am going to take an appropriate decision, and I am going to decide which statistical tool can give an answer to your query, and then I will try to use it. So, another question crops up here: which of the tools is a better option, the graphical tool or the analytical tool? My suggestion is that
please use both of them, because if you try to see, descriptive statistics is the starting point for any analysis. What do you have in your hand? You simply have a set of data; data are some numerical values. So, you can always imagine that in front of you there are 20 values, there are a hundred values, there are 2,000 values, or there can be two million or two billion values. And all those values are sitting silently, and you are the one who is going to start the knowledge discovery on the basis of the given set of data. Right. So, I would say, don't make a rule, but depending on the condition, try to use both types of tools. Later on in these lectures, I will show you how the graphical tools and the analytical tools can be used, under what type of condition, and how they can be computed on the basis of the R software. Now, what is the difference between the use of a graphical tool and an analytical tool? Graphical tools provide us a visualization; this will give us first-hand information. And what about analytical tools? They will give us the information in quantitative form. So, graphics will give us information, but we have to look at it and then draw a proper statistical inference, while an analytical tool will give us a number, which we have to interpret to make a correct statistical inference. So, I would say usually, or in most of the cases, the graphical tools and analytical tools both work together, because the process is the same, the data is the same, and the data is never telling you: please use only the graphical tool, or please use only the analytical tool. It is only you who is going to take a call whether the graphical tools have to be used or the analytical tools have to be used. So, please try to make an appropriate decision keeping in mind the objective of your study and the type of information which is contained inside the data. And why does statistics come into the picture? Statistics comes into the picture
because variation always exists in every process. What do you mean by variation? For example, suppose you take a plot and you sow, say, one hundred grams of seeds, and after a month you get a crop, and suppose you get one kilogram of seeds. Now, in case you try to repeat the same thing, using the same plot or plots of the same size, and put in the same quantity of seeds, do you think that in all the plots you are always going to get exactly one kg of yield? This is practically very difficult; there will be some difference. One plot may give you one kg, another plot may give you 1.1 kg, and another plot might give you 900 grams, and so on. So, variation always exists in every process, and in statistics, our basic objective is that we want to understand the process of this variation, we want to control this variation, and we want to draw a statistical inference out of the data with minimum variability. So this is one of the basic objectives. So, in statistics, or in descriptive statistics, what do we really want to do? We have a set of data; now we are going to use a statistical tool on that data, which may consist of analytical tools as well as graphical tools. Now, I will be getting some information from the
graphical tool and some information from the analytical tool, and now this is my responsibility, that I have to combine the information coming from both aspects together and I have to convert it into a piece of information which is useful, which is interpretable, or that can be conveyed to the experimenter, who might not have any knowledge about statistics. Right.
So, I can say that using the information gained by the tools of descriptive statistics, and combining them together to reach a meaningful conclusion, to depict the information hidden inside the data, is the
objective of any statistical analysis, and proper interpretation of those outcomes is very important. All these outcomes, all these inferences, are made only on the basis of data. So the next question comes: where is this data coming from? There are two types of processes: one is a deterministic process and another is a non-deterministic process. A deterministic process means you know the outcome in advance, while non-deterministic processes are those where you really do not know the outcome in advance. In statistics, whenever we are trying to understand the data generating process, the data generating process is always random, or non-deterministic, and that is why the role of statistics comes into the picture; once there is no random variation, things become purely mathematical.
So, a simple question arises here: why should we collect the data? Data is collected with different types of objectives. First, to verify theoretical findings; for example, suppose I say that in children the
height increases as the weight increases, or the weight of the child increases as the height increases. Suppose that is my theoretical finding, and I want to verify whether this is really happening in real life or not; then I need to collect the data and verify this finding. Secondly, I may have some objective and I really want to know the outcome of a process; so I have to collect the data which is being generated from that process and then use a statistical tool to get the correct statistical outcome. And remember one thing: the inference we are going to draw is based just on the collected data; you cannot argue that some statistical inference is coming from some other source beyond the data. So, particularly when we are talking of the tools of descriptive statistics, we try to report the information which is coming only from the given set of data. And yes, the information which is coming from this data, we try to convert it into the form of statistical inferences, which are further used in the development of statistical models, which are used for policy decisions, classification, forecasting and so on.
Now, in case you want to carry out a statistical experiment, what are the steps involved? Right. The first step is to identify the objective of the statistical analysis, which is missing in most of the cases, in my experience. Right. People simply try to collect the data, and after collection of the data, many times they try to decide what type of statistical inferences they can draw from this data. Well, that is not bad, but at least my suggestion is that before you collect the data, please try to decide the objective of your study and try to ask why you are collecting the data, and based on that you have to take further steps. Okay. How to get the data? The data can be obtained from a laboratory experiment, from a survey, from some primary sources, or from some secondary sources; this is called 'primary data' or 'secondary data'. But whatever the data is, I am not bothered about that; my objective is this: I have to use the correct statistical tool, and I believe that by using the correct statistical tool, I can get a correct statistical inference;
that is my belief, and with this objective I am moving forward. So, the next question comes: what is an observation? The unit on which we try to measure the data is called an 'observation'. What does this mean? Suppose I want to measure height; then first I have to decide, height of children or height of elders. Suppose I decide that I need to measure the height of children between the ages of, say, 5 years and 7 years. So, what will I do? I will try to collect some children whose ages are between five years and seven years, and then I will try to record the heights of those children. The heights of those children will be some numerical values, and each of them will be called an 'observation'. Similarly, if I want to find out the number of persons, the number of cars, or the monthly expenditure on, say, food in a family, then these are also my observations, which are trying to cater to some objective of the statistical analysis. The next definition which I would like to discuss here: what is called a population? The collection of all the units is called a 'population'. For example, in the earlier example, when I wanted the data on the height of children between five and seven years, you were trying to collect the data only on some of the children. But do you think they are the only children? No, there are many, many children in that city, in that locality, in that country. What are you trying to do? You are simply trying to choose some of the children, and then you are trying to record the data. So, the collection of all the units, which can be over a locality, a city, or a country, depending on the objective, is called the 'population of children' whose ages are between five and seven years.
Similarly, suppose I want to find out the average age of all the female students in class ten in a school, on the basis of a sample. Then my population is going to consist of all the female students in class ten in that school. But if I want to study the average age of all the female students in that city, then all the female students in that city who are studying in class ten will constitute my population.
And similarly, in another example, if my objective is to know how many female employees have salaries higher than the male employees in a given company, on the basis of a sample, then my population will consist of all the female employees in the company, and from there I will try to draw a sample.
Now, the next question is: what is a sample? A sample is simply a subset of the population. A basic question comes: why do we use a sample? Well, that is the main objective and the main advantage of using statistics. We are always interested in finding a statistical conclusion for the entire population, maybe of a country, maybe of a city, or maybe of a village, and a large number of people would have to work to collect the data on the whole population; that is very difficult. So, the advantage of statistics is this: statistics says that instead of collecting the data on the entire population, if one can collect the data on a small fraction, which we call a sample, then on the basis of the sample data, statistics can help in getting reliable statistical inferences which are going to be valid for the entire population. That is why the collection of the sample is very important in statistics. So, whenever we are trying to collect the data in a sample, we believe that whatever characteristics, whatever features are present in the population, they are also present in the sample. For example, you have seen that if you go to a market and you want to buy some wheat, and there is a bag of a hundred kilograms of wheat, usually you will not open the entire bag, but you will take a small sample, maybe consisting of 20 grains, 40 grains, or a hundred grains. You
simply try to look at those grains, and based on the quality of the grains, you try to make an inference for the entire bag, which is of a hundred kg. So now, this is my sample; the sample possibly consists of grains of wheat, maybe 20 grams, 30 grams, or 40 grams, and whatever we are going to conclude based on that is going to be valid for the entire bag. Not even just the entire bag, but I would say the entire wheat available in that shop. So that is why the collection of data in the sample is very important, and we believe that the data has been collected in such a way that the sample is representative. This will be our basic assumption, and it goes without saying that in all statistical analyses the sample is assumed to be representative. What is a representative sample? It means that all the characteristics which are present in the population are also present in the sample. For example, in case the quality of the wheat is not so good, suppose there are ten grains which are infected by some insects; then we assume that in the population also a similar proportion will hold. In case ten percent of the seeds in my sample, in my hand, are not good, I believe that 10% of the wheat in the entire bag is also not good.
So the basic foundation is that if my sample is good, if the sample is representative, then my statistical inferences are also going to be good. And there are various ways in statistics which help us in choosing a representative or correct sample; we have different types of sampling schemes, like simple random sampling, stratified sampling, cluster sampling, systematic sampling, multiphase sampling, multistage sampling, etc., which help and guide us in how to choose a correct, good and representative sample. But definitely this is not the objective of this course, and I am not going to discuss the different sampling procedures for collecting good data.
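Just to give a feel in R, even though sampling theory stays outside this course, the base R function sample() illustrates the idea of a simple random sample; the population of 100 labelled units below is purely hypothetical.

    # a hypothetical population of 100 labelled units
    population <- 1:100
    # draw a simple random sample of 5 units, without replacement
    sample(population, size = 5)   # the output varies from run to run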
So, now in this lecture I have given you a brief background which may not look very mathematical, but believe me, it is very important for us to understand what we are going to do under what type of conditions; only then will I be able to take the correct decision, and I have recorded this lecture with this objective only. I will continue with some more basic definitions in the next lecture. Please try to understand this lecture and try to create a foundation in your mind for understanding the tools of descriptive statistics, and I will see you in the next lecture.
LECTURE-07
You may recall that in the last lecture we had a small discussion on different aspects of descriptive statistics.
Now from this lecture I will be moving towards more mathematics; well, in this lecture there is only a very small amount of mathematics, but my idea in this lecture is to make you understand the different types of terminologies and how they are represented. Once we understand what the nomenclature is and how things are being represented, that will help us in better understanding the further lectures.
Now let us start our discussion with the first topic: what is a variable? Whenever we are trying to conduct any statistical analysis, before that there is a collection of data, and even before that there is always an objective, and the objective is based on the research problem, or in simple words, on what we really want to know. Once I decide this question, what we really want to know, based on that I try to collect the data on the relevant variable from a population; I try to collect the data on the aspect which I want to know, and this aspect, in simple language, is a variable. So I can briefly say that once a research question is fixed and the population of interest is identified, then we try to collect the data on something. Data on what? We try to collect the data on a statistical variable. What does this mean? That whatever my objective is, based on that I will try to collect the data on a relevant quantity, which is going to be my variable.
Before I go further, I must tell you that there is a strong mathematical definition of these variables, the random variables which we are going to discuss in the further slides, but here my objective is not to go to that mathematical level; my modest objective is to show a beginner how these things are to be understood. For example, I will be dealing with the definitions of continuous random variables and discrete random variables, and if you come to the area of measure theory in statistics, there is a hardcore mathematical definition of these concepts, but definitely my idea here is to give you a flavor, to make you understand what these things are and how they are going to be used in the collection of the data, so please keep this in mind.
So I can say that whatever information we are interested in is captured inside a variable;
now, in statistics there is a convention that these random variables are always represented by capital letters, and in our case I will usually denote the random variables by X, Y, Z, etc. When I type them, they are usually typed in a mathematical mode, which is a sort of italics mode. Right. So this is what you have to keep in mind whenever you try to write a variable.
The number of variables can be one, or they can be more than one. Whenever we are dealing with one variable, the statistical analysis is usually called univariate statistical analysis or univariate analysis, and whenever we are dealing with more than one random variable at the same time, we call it multivariate analysis or multivariate statistical analysis.
Now, what is the role of a variable? The observations are collected on the variable. Now I'll take some examples to make you understand how the variables are defined and how the observations are collected on them. Suppose I want to know, in some college, how many male students, how many female students and how many transgender students there are; in that case my variable will be gender, the gender of the student, and this I will denote by capital X, typed in italics, and this variable will take three possible values: the student can be male, the student can be female, or the student can be transgender.
Similarly, in case I want to study some country in Asia, then I can also define my variable to be country in Asia, and I will denote this by capital X.
Now this capital X can take different values: the countries in Asia can be India, Bangladesh, China, Thailand, Bhutan and so on. So now you can see I have done two things here: I have defined the variable and I have also given an idea of the possible values it can take.
Similarly, if I take any other example, suppose I consider any odd number; I can denote the variable by X, and X is going to denote any odd number. What are the different possible values X can take in this case? The odd values can be 1, 3, 5, 7, 9 and so on. So, through this example I have given you one more aspect: the number of values which a variable can take can be finite, but it need not be. This is what you have to keep in mind when we are going into the further lectures. And in this example, I have defined the random variable as X, which you always have to keep in mind.
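As a small R aside, the first few values of such a variable can be generated with the seq() function; the upper limit 21 below is arbitrary, since the possible values go on without end.

    # the first few odd numbers; the variable itself has infinitely many values
    X_values <- seq(from = 1, to = 21, by = 2)
    X_values   # 1 3 5 7 9 11 13 15 17 19 21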
Now, what will be our next step? Once I have defined the variable, then I will try to draw a representative sample, or simply a sample, from the population, and on my sampling units I will try to record the data. For example, in case I want to record the ages of children in the age group of 5 to 7, then suppose I choose, say, 10 children whose ages are between 5 and 7 years from the population of that city, that country, or that state, whatever you want, and now I will record the ages of those children. The values of the age will always be denoted by the corresponding small letter; so if I have denoted the random variable by X, then the values which X takes are denoted by small x, and the values of the ages of those children will be denoted by small x.
And suppose I want to find out the average height of the students in a school; then my objective will be to collect the data on the heights of some students of that school. So I will take capital X to be the height of a student, and whatever the values of the heights are, these I am going to denote by small x. So I can now say X is going to denote the height, small x is going to denote the values of the height, and now I start recording the data.
Suppose I take two students, student number 1 and student number 2, and I measure the height of the first student and then the height of the second student. Suppose I find that the height of the first student is 150 centimeters and the height of the second student is 160 centimeters.
Now, how to denote this 150 and 160? That is the question which I am now going to address. So height equal to 150 centimeters and height equal to 160 centimeters are the two values of the variable X.
And values of X are denoted by small x. So I can say that the first value, which is 150 centimeters, is the first value of the random variable X, so I can denote it by x1 and write x1 = 150 centimeters. Similarly, the value of the height of the second student is the second value, so I can write it as x2, and write x2 = 160 centimeters. You will see in statistics a very common sentence: let x1, x2, …, xn be a sample from some population. So once I write 'let x1, x2, …, xn be a sample or a random sample', what does this mean? This simply means that x1, x2, …, xn are some numerical values, nothing more than that. And they are the numerical values of what? They are the numerical values of the data recorded on the variable X. So if I say X is height, and suppose I have collected 20 students and recorded their heights, then the values of those heights are going to be denoted by x1, x2, …, x20. This is the simple interpretation of this notation, which is going to be used in all the further lectures.
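In R, such a sample is simply a data vector, and indexing plays the role of the subscripts; a small sketch with hypothetical height values:

    # a hypothetical sample of heights in centimeters
    x <- c(150, 160, 155, 148, 162)
    x[1]        # the first observed value x1: 150
    x[2]        # the second observed value x2: 160
    length(x)   # n, the number of observations: 5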
Now, the next aspect which I am going to address about the variables is that there are two types of variables: one is the quantitative variable and the other is the qualitative variable, and under the quantitative variables we have two types: one is called discrete variables and the other is called continuous variables. Once again I would like to reiterate here that there is a strong mathematical definition of quantitative and qualitative variables in statistics, but definitely that is not my objective here and I am not going into that detail. My simple objective is to make you understand, for a given situation when you are trying to collect the data, whether the corresponding variable will be a discrete variable, a continuous variable, a qualitative variable or a quantitative variable. Once you are able to judge this, then you will see that the tools, for example for quantitative variables and qualitative variables, are different; the statistical tools for discrete data and for continuous data are also different; so that will help you in choosing the correct and appropriate statistical tool, okay.
So first I address the aspect of quantitative variables. In very, very simple words, I can say, without going into the mathematical details, that quantitative variables represent measurable quantities; measurable quantities means that the numerical values of X can be obtained, and once these numerical values are obtained, they can be ordered in a logical or natural way. Possibly this is one of the simplest definitions I can give you of a quantitative variable.
(Refer Slide Time: 15:25)
Let me take some examples to make you understand. Suppose I want to buy a shirt, and I go to a shop; the shopkeeper will ask me the size of my shirt, which can be 38, 39, 40, 41, 42, 43, 44, 45, 46. Right. So if I take the size of the shirt, this can be 39, 40, 41, 42 and so on. What does this mean? 39 means there are some dimensions, some size of the shirt, and if I ask what the difference is between a size 39 shirt and a size 42 shirt, we understand that the shirt with number 42 is going to be larger than the shirt with number 39. So you can see here that 39 and 42 are representing some numerical values and they can be ordered; that means I can always say that size 42 is larger than size 39.
And similarly, if I take the example of a cost, say the per kg price of some vegetable, this price can be 30 rupees a kilo, 35 rupees a kilo, 40
rupees a kilo, 45 rupees a kilo. What does this mean? Once I say you have to give 35 rupees for one kilo of the vegetable, you know what you have to give, and in case I say that the price of the vegetable in one shop is 35 rupees per kg and the price of the same vegetable in another shop is 40 rupees a kg, then you can always make a conclusion, by putting them into order, that the price of the vegetable in the second shop, whose rate is 40 rupees a kg, is higher than the rate in the earlier shop, where the price was 35 rupees a kg. And similarly, if I say the price of the vegetable is 50 rupees, then you will say that it is even higher.
And similarly, if I want to count the number of colleges in a city, this number can be 2, 5, 10, 8, 12, 15, 20, whatever it is, and once again these values have some interpretation: if I say I have two cities, one city has 10 colleges and another city has 20 colleges, then I can always order them and say that the second city has more colleges.
Similarly, if I measure the heights of children, say 1.2 meters, 1.23 meters, 1.32 meters and so on, then you know that 1.2 meters and 1.23 meters have some interpretation and you can visualize these things. Suppose there are two children, and suppose you record that the height of the first child is 1.2 meters and the height of the second child is 1.3 meters; then you can always infer that the second child is taller than the first child. So this is what we really mean by quantitative variables: I have a variable like price, height, number, etc., and I can obtain numerical values on it, and those numerical values have some interpretation; interpretation means they can be ordered and they have some meaning in their numerical value. Right.
Now I will address the aspect of qualitative variables. Qualitative variables also represent characteristics on which we can record values, so up to this point it is the same as for quantitative variables. Then what is the difference? The difference lies here: the values of the variable, which are denoted as small x, cannot be ordered in a logical and natural way.
Once again, I would say this is possibly one of the simplest definitions to understand for a common person not having a statistical or a strong mathematical background. What does this actually mean, that the values cannot be ordered in a logical and natural way? Let me take some examples and explain. Suppose I want to collect the data on the names of cities in this country, India. Okay.
So I will define my variable as X, and this variable will take different values, for example Kanpur, Mumbai, Kolkata, Delhi and so on. These values can be represented as x1 for the first value the variable can take, x2 for the second value, x3 for the third value, Kolkata, and so on. The variable is very well understood, but how to order these values, how to put them in a natural way? Well, as soon as I say this, you may try to associate some number with each city, but here I am not associating any numerical aspect with this variable; I am calling the cities only by their names. In case I associate, say, the number of persons staying in a city, then it will become a quantitative variable, or if I associate the area under that city, then this will also become a quantitative variable.
Similarly, if I take another example: say I want to record the colours of hair. They can take different values, say black, which I can denote by X1, white by X2, and brown by X3, denoting the first, second and third values which the variable can take.
Yes, I can very well understand that the colours of the hair are black, brown, white or something else, but how to quantify them? Unless and until I associate a degree or some measure in some scientific way, the colours will remain only colours and there is no way that I can order them; for example, I cannot say that white is better than grey or grey is better than brown and so on. Here I am recording only the colour, so that is why the colour of the hair is a qualitative variable.
Similarly, if I take another variable, the taste of food, which can be sweet, salty or neutral and so on, I can denote the first value that the variable takes, X1, to be sweet, the second value, X2, to be salty, and the third value, X3, to be neutral. These values, sweet, salty or neutral, are only qualitative; they are not quantitative. I cannot say that sweet takes 20 or salty takes 30. So this is the idea of a qualitative variable.
And similarly, many times in an examination or in a competition we try to judge the performance of the candidates by marking it good, excellent or bad. So in that case the performance can be a variable, denoted by X, and good is the first value which the variable takes, excellent is the second value, and bad is the third value. These are only qualitative things: I can understand them, I can observe them, but I cannot quantify them, unless and until I say, for example, that if a student is in the excellent category, then he is better than the students in the categories of good and bad, or similarly that a student in the good category will be considered better than a student in the bad category. So unless and until I make these types of rules, which again try to introduce a quantitative ordering, up to that point the variable will remain only a qualitative variable. Right.
But in statistics now we have a problem: statistics works only with quantitative data, with some numbers. So in case I have got a qualitative variable, unless and until I associate a number with it, I cannot operate my statistical tools. And here, at this point, I will inform you that the statistical tools for qualitative variables and quantitative variables are different in most of the cases, so you have to be very, very careful, when you are trying to use a tool, whether you are applying it on a qualitative variable or a quantitative variable. So now I will take an example to show you how we handle a qualitative variable by associating a number with it. Suppose I consider the variable X as taste; now taste is taking 3 possible values, X1 denoting sweet, X2 denoting salty, and X3 denoting neutral. Right. What I will do is associate a number with these three indications, sweet, salty and neutral.
Suppose I decide that I will assign the number 1 to sweet, the number 2 to salty, and the number 3 to neutral. But remember one thing: once I assign these numbers 1, 2 and 3, they are only indicating the category. If I am assigning 2 to salty and 1 to sweet, this does not mean that salty is 2 times sweet; that would be a wrong interpretation. The numbers are only labels for the categories.
Now, after this, I will address the discrete variables. In some situations, the variable on which we want to record the data can take only a finite, or countable, number of values, and in a very simple or informal way I can say that the values are counted. For example, in case I want to find the number of children in different families, this number can be 1 child, 2, 3 and so on; it cannot be that a family has 1.2 children or 2.4 children; these values will not exist, or they will not have any interpretation. So in this case I am simply trying to count the number of children.
Similarly, if I try to find out the number of branches of a school in a city, this number can be 1, 2, 3, 4, 5, 6, 7, but this number cannot be 2.5 or 5.5 or 6.7; in this case the values are being counted. So all those variables where we record the data only by counting can be categorized as discrete variables for all practical purposes.
So now, in case you try to contrast this with the definition based on counting, there is another option: a variable can also take values in fractions, say 2.1, 2.2, 2.3 and so on. Those variables which can take an infinite number of values, over a continuous range, are called continuous variables. So basically there are two categories, discrete variables and continuous variables;
in the discrete case the values are counted, and in the continuous case the number of values a variable can take is infinite, and in simple words, informally, I can say that the values are measured and not counted; that is very important to note, they are being measured. It all depends on how we are going to measure, with which instrument; that is a separate aspect.
Now let me take an example. Suppose I want to measure the length of a road; the length of a road can be 1.5 kilometers, it can be 1.52 kilometers, or it can be 1.521 kilometers and so on. In this case you can continue as long as you want, depending on the instrument and the length; so here I am measuring the value, and this value can take an infinite number of values. This type of data is usually collected under the heading of continuous variables.
Now I address another aspect, called grouped data. Suppose you have got a large number of values; then it is possible to group those values into certain categories or groups, and what will happen is that the exact value of each observation will be lost, and the value will be identified only by its category. Let me take an example: suppose I measure heights, and these heights are 1.5 meters, 1.7 meters, 2.2 meters, 2.5 meters, 3.3 meters, 3.6 meters.
Now I can make three groups: group 1, where the heights are between 1 and 2 meters; group 2, where the heights are between 2 and 3 meters; and the last group, in which the heights are between 3 and 4 meters. Now there are two values, 1.5 and 1.7 meters, which are lying between 1 and 2 meters, so for group 1 I can write down 2 values.
Similarly, 2.2 and 2.5 are two values lying between 2 and 3, so I can write down 2 values, and similarly 3.3 and 3.6 are lying between 3 and 4, so I can again write down 2 values. Now I will have only these counts, and the original values will be hidden: looking at, say, the count 2 for a group, I cannot find out whether a value was 2.1 meters, 2.2 meters, 2.3 meters or something else; I do not know, so these values become simply unknown to me.
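In R, this grouping can be reproduced with the cut() and table() functions; a small sketch using the six heights from the example above:

    # the heights from the example above, in meters
    heights <- c(1.5, 1.7, 2.2, 2.5, 3.3, 3.6)
    # cut() assigns each value to an interval; table() counts per interval
    groups <- cut(heights, breaks = c(1, 2, 3, 4))
    table(groups)
    #  (1,2] (2,3] (3,4]
    #      2     2     2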
So whenever we are dealing with grouped data, we have to keep in mind that the values are grouped together, the individual values are lost, and we will be working only with the groups.
Now I would like to briefly address another aspect, called primary data and secondary data. What is the difference between them? Suppose I have an objective to study, and based on that either I go to the field myself, or I ask some of my investigators to go to the field, and the data is collected directly and I work on it; this type of data is called primary data.
The second option is this: I can go to some offices, like the municipalities of the cities, or different agencies like the National Sample Survey Organization, who collect the data from time to time on different aspects; I can request them to give me the data and I work on that data. So this data has been collected by somebody else, either a person or an agency, and we are using the data from that source; this type of data is called secondary data.
So I can say very briefly that the data which is originally collected by an investigator for the first time, with an objective to study some statistical query or statistical investigation, will be called primary data. And the data which has already been collected by some person or some agency for some objective, for some statistical query or statistical investigation, and which we are borrowing or collecting from that agency and then using, is called secondary data.
Well, the definition of primary and secondary data is very relative: data which is primary for someone may be secondary for another person. I am not going into that detail, but this was just for your information.
Now the next question comes: how does this data come into the picture? How is primary data obtained, and how is secondary data obtained? Very briefly, I can give you different ways in which these types of data are collected. In order to collect primary data, one of the important sources is direct personal investigation: the investigator goes directly to the respondent, asks the questions, and records the answers directly.
The second option is indirect oral investigation: the investigator will go and ask different types of questions, and based on that he or she will try to make a judgment about what the correct information is.
The third popular option is a questionnaire, which can be sent through postal mail, email, or e-forms; nowadays Google Forms are very popular, and there are some online services, websites which try to help us in conducting a survey, so they also give us primary data.
And sometimes we send our surveyors to the field, and we don't allow them to ask anything; we give them a questionnaire and ask them to hand it to the concerned person, who will fill it in, and the surveyor will collect it and give it back to us. And then there are many other ways also, but this is how we try to collect the primary data.
For the secondary data, there are some published sources; for example, there are reports and data sets available from the offices of a country responsible for sample surveys. In India we have the National Sample Survey Organization and the Central Statistical Organization, and at the world level we have the United Nations, with different wings which collect the data from time to time, publish it, and we can use that data.
The second option is data collected by some survey agencies; we can use that. And the third thing is, for example, there are some public offices where we record our data, for example municipalities: whenever there is a birth of a child or the death of a person, we go to the municipality and report it there, and they keep the data, so this type of data is also available in those municipalities and we can simply take it from there.
So now, in this lecture, I have given you the background under which we will be working in further lectures. From the further lectures, I will take one topic at a time, give you the basic idea, not going much into theory, and explain with different examples what the different concepts are, and I will show you how to obtain those things using the R software. So, I will be going into the different tools of descriptive analysis from the next lecture. Please try to review this lecture, try to revise it and understand the concepts, let them settle down inside your brain, and we will meet in the next lecture.
Lecture 08
Welcome to the next lecture in the course Descriptive Statistics with R Software. You may recall that in the earlier lectures we discussed different aspects related to statistics, and we have understood how we are going to start whenever we want to do any statistical analysis, and how we are going to control the process of obtaining the observations and the different types of associated variables, discrete, continuous, etc. So, now I assume that we have collected a sample, and as I said earlier, I will always be assuming that my sample is representative, which means that all the salient features which are present in the population are also present in my sample. So now we are at a place where we have collected the observations, and now we want to move further.
So, first of all, whenever the data comes to our hands, as I told you, there are two options to start with: one is graphical tools and another is analytical tools. First of all, I will take up the first analytical tool, which will give us an idea of how we are going to combine the data, how to present it, and how it can give us some of the information that is contained inside the data. And based on that, we will take further decisions about what types of graphical tools, and tools for analysis of the data, can be used. So, we start this lecture, and first we are going to address why, once the data comes, it needs to be classified. You can always imagine that whenever you conduct an experiment, you will get the data. Now, I am also assuming that the data is collected on the relevant variable which you want to study; for example, if you want to study the height, then the data is on the height, and if you want to study the weight, the data is on the weight. So, now you have collected the data; this data can be 20 values, 200 values, or many more. In case all the values are just in front of you, can you really get an idea of what information is hidden inside them? It is very difficult, because, as I said, data cannot speak; data cannot come out of your computer or out of your experiment to tell you: okay, I have this piece of information. It is only you who has to use an appropriate tool to get it out, to take it out. So, the first thing we do is try to rearrange the data into some required format.
And for that, we would like to classify the data into different groups, and from different aspects. For example, I can make groups of those observations which are similar: all the units which look similar to each other can be put into one group, and the units which are dissimilar to them can be put into other groups. Then, based on that, we can extract different types of information through those groups. So this is what we are going to study now. The classification of the data is a very simple thing: it is a simple process of arranging the data into groups or classes according to resemblance.
(Refer Slide Time: 4:27)
Now, what are the functions of this classification; why do we make it? The biggest advantage is that it condenses the data. You can imagine that different numbers are written continuously on a wall; if there are thousands, ten thousand, a million numbers, you can't get any information out of that. So you need to condense the data; one of the important objectives of classification is that we would like to condense the data in a way from which we can draw some relevant information. And whenever you are conducting a statistical experiment, generally your objective is to compare something; for example, if there is a new medicine which claims that it can control the body temperature for, say, 12 hours, then you would like to compare it with the earlier existing medicine, to see whether this improvement is happening or not. So this condensation of the data, this classification of the data, has to be done in such a way that it can help us in comparing different types of things, different aspects, different types of
quantities, different types of natures. And in many, many situations, usually we are interested in studying relationships; for example, modeling, statistical modeling, is a very popular word, which is nothing but a sort of relationship: we want to find the relationship between input and output variables. Such models cannot be obtained in a single shot; the models are obtained on the basis of statistical data, and descriptive statistics is the starting point: from there we try to gather the information in the data in small pieces, and then we try to combine them together to get a model. So, we would like to condense the data in such a way that it helps us in studying different types of relationships. And the data has to be condensed, or grouped, in such a way that it is compatible with our statistical tools also; this is something we always have to keep in mind. Usually, what do people do? First they collect the data, and then they try to choose the statistical tool. What I always suggest is that you first try to fix your objective, then try to see what type of statistical tool can be used to answer that objective, and whatever the requirements of that tool are, try to collect your data according to that; this will help us. So this compatibility is another important function of classification.
So now, moving further, let me introduce a basic definition: absolute and relative frequencies. One thing I would like to make clear here: in order to teach this course, I have two options; the first is to present the theory, formulas, etc., and then take an example. But I would rather prefer, in most situations, to start with an example and then develop the theory, so that you can make a one-to-one correspondence between the theory and the definitions; that will help you in applying and choosing the tools in the R software. Okay. So, now let me take a simple example. Suppose there are ten persons who participated in a test and their results were declared in two categories: either they passed or they failed. The candidate who passed has been assigned the letter capital P, and the candidate who failed has been assigned the letter capital F. So, you can see here the data of the 10 persons, who either passed or failed, and their outcomes are recorded as P, F, P, F, F, P, P, F, P, P. Well, I am taking a very small data set so that you can follow with your own eyes whatever mathematical manipulations I am doing. But you can always imagine that this data can be
very, very large: there can be ten thousand candidates, there can be 1 million candidates, there can be 10 million candidates. So, how to combine this data, how to condense this data? We are going to use the concept of absolute frequencies and relative frequencies to condense the data, and later on we will put them in some proper format, for example in the form of a table, to get clearer information. So, there are two categories here: one is pass and another is fail. I can represent these categories in general as a1 and a2; the a1 category will represent the candidates who passed, and the a2 category will represent the candidates who failed. So, I have introduced here the word category. You can see that a category contains all the observations which are similar to each other; for example, the category of candidates who passed contains all the candidates who passed, and the category fail contains all the candidates who failed the test. Right. These are called 'categories'. So, I can see here that there are some number of
candidates who passed and some number of candidates who failed. So let me count. First, how many candidates passed: 1, 2, 3, 4, 5, 6; so there are 6 candidates who passed. What about fail? 1, 2, 3, 4; so there are 4 candidates who failed. Now, the number of candidates who passed and the number of candidates who failed are denoted by n1 and n2: the pass category a1 has n1 candidates, and the fail category a2 has n2 candidates. These numbers n1 and n2 simply represent the number of candidates in category a1 and in category a2, or simply the number of candidates who passed and the number who failed. So n1 and n2 represent the number of units present in the categories a1 and a2, and the number of observations in a category is called the absolute frequency of that category.
Now, one drawback of absolute frequency is this: if I tell you that there are 100 candidates who passed and 300 candidates who failed, you still do not see how many candidates appeared in total. In order to incorporate this feature, we have the concept of relative frequency. For example, there are 6 candidates who passed and 4 candidates who failed; I can also say that there are 6 candidates out of 10 who passed and 4 candidates out of 10 who failed. So, I can define the relative frequency of the class a1 to be the number of observations in that class divided by the total number of observations available, where n1 is the number in a1. This is denoted as f1, and f1 is going to be n1 divided by n1 plus n2, that is, f1 = n1/(n1 + n2). This is called the 'relative frequency'. In this case, this number is 6 upon 10, because we
have observed 10 daasets, so that is 0.6 or this can be called as ’60%’. So, I can say that there are
60% candidate, who passed. And similarly, the relative frequency of the second class a2, that is
denoted by, f 2. And the definition of f 2, is the similar to the definition of f 1, that is the total
number of elements, in the category a2, total number of elements in category a1, divided by total
number. So, that is going to be the number of candidates, who failed that is 4/10 which is here 0.4
or I can say 40%. So, this will give us an information, about the number of candidates, who passed
or failed with respect to the total number of candidates, who appeared in the examination. And this
can also give us the number in terms of percentage. So this is the basic idea, of the absolute and
relative frequencies. Now, next question comes, how to compute this absolute and relative
frequencies in the R software, in order to compute this absolute frequency, first we need to define
our data vector, the data vector as I said, in the earlier lecture, the data vector will consist all the
numerical values and they are combined using the c operator. Right.
So, any data vector is going to be in the format c(x1, x2, ..., xn), meaning that there are n values; for example, in the earlier case there are 10 observations and n is going to be 10, and so on. After this, the command to obtain the absolute frequencies is table, t a b l e, all in small letters. When I write this command table with the data vector inside the brackets, it will create the absolute frequencies of the data given inside the argument, that is, the data vector. And suppose you want to find the relative frequency. Try to understand the relationship between frequency and relative frequency: frequency means absolute frequency, and the relative frequency is the absolute frequency divided by the total frequency. Total frequency means the total number of observations. So, once I write all the observations in a data vector, whatever the length of the data vector is, that is your total frequency. This total frequency can be computed by the command length: l e n g t h, and inside the brackets you give the data vector, say x. So, in case you want to find the relative frequency, I simply use the same command table with the data vector x inside the argument and divide all the values by the length of x. Remember one thing: the length of x is a scalar value. But, if you recall, when we divide a data vector by a scalar, we learned that the division happens on each and every element of the data vector. So, this will give us the absolute frequency divided by the total number of observations, and that is the relative frequency.
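As a minimal sketch of these two commands, assuming a small hypothetical data vector x:

    x <- c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1)   # hypothetical data vector, n = 10
    table(x)                                # absolute frequencies of each value
    table(x) / length(x)                    # relative frequencies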
So, now let me take an example and show it on the R software also. I will take the same example, but now I am doing one thing more. Earlier, the ten candidates were categorized in two categories, as pass or fail. Now, I will assign them an indicator value, because, as we discussed in the earlier lecture, unless and until you assign a numerical value to the data, you cannot operate any statistical tool on it. So now I have two values, one is pass and the other is fail. What I do is represent pass by the number 1 and fail by the number 2. Once you do that, every P in the data is replaced by 1 and every F is replaced by 2. So now this is our data, and I need to type this data into a data vector before I can expose it to the R software. You can see here that I have created a data vector like this: each P becomes a 1 and each F becomes a 2, position by position. Based on that, I have created my data vector named result. You can see the outcome, and here is the screenshot of the same result. I will show it on the R console also, but let me first explain it; then it will be more convenient to show it on the R software.
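Based on the coding just described, the data vector and its frequency table would be typed as follows (the vector name result is from the lecture):

    result <- c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1)   # 1 = pass, 2 = fail
    table(result)
    # 1 2
    # 6 4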
Now, I will simply use my table command: I write t a b l e and, inside the argument, the data vector name, that is result. As soon as I write table(result), it gives us this type of output. First, let us understand the meaning of this result. You can see here that there are four numbers: 1, 2, 6 and 4. What are the 1 and the 2 indicating? They indicate the categories: category one and category two. And what are the 6 and 4 indicating? The 6 is the number of elements in category one, and the 4 is the number of elements in category two. These are nothing but your absolute frequencies: 6 is the absolute frequency of category one, that is class 1, and 4 is the absolute frequency of class 2.
Similarly, in case you want to obtain the same result in terms of relative frequencies, what do we have to do? I write the same command table with the data vector result inside the argument, and I divide it by the length of the data vector. Once you do that, you get an outcome like this. What is it showing you? You can see there are 4 values: 1, 2, .6 and .4. The 1 and 2, as earlier, show you the categories: category one and category two. The .6 is actually 6/10; the 6 here is n1 and the 10 is n1 plus n2, where n1 equals 6 and n2 equals 4. Similarly, the 0.4 is 4 upon 10, which is n2 divided by n1 plus n2. So, 0.6 and 0.4 are the relative frequencies of categories 1 and 2 respectively. And this is the screenshot that we are going to get when we execute this on the R console. So, let me now come to the R console and show you how it happens.
First, I create the data vector result, as you can see here. Now, I find the absolute frequencies by using the command table: t a b l e, and inside the argument I give the data vector name result. You get the same output that I discussed: the 1 indicates category 1, the 2 indicates category 2, the 6 indicates the number of elements in category 1 and the 4 indicates the number of elements in category 2. But this gives you the result in terms of absolute frequencies. In case you want the result in terms of relative frequencies, first I will show you the value of the length of result. You can see it is 10, and you can count 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10; there are 10 values in the data vector result. So now, if I write table(result) divided by length(result), the outcome comes out like this: the 1 represents category 1 and the .6 denotes the relative frequency of class 1, while the 2 denotes category 2, or class 2, and the 0.4 denotes the relative frequency of class 2. So, this is how you can obtain the absolute and relative frequencies. These absolute and relative frequencies will be even more prominent when you have a qualitative variable. Now, to put whatever we have done here in the right words: I had a set of data on 10 candidates in terms of two categories, a1 and a2, pass and fail, and we found the number of persons in the pass and fail categories. In other words, I have rearranged the entire data set into two groups.
This arrangement of ungrouped data in the form of groups is called the 'Frequency Distribution of Data'. This is standard terminology: when we say, please create a frequency distribution, it means you need to group the data and then condense it. Condense means, as you can see, that the ten data values have been condensed into only two categories, based on only two values, six and four, where six is the frequency of class one and four is the frequency of class two. So what do we do? Whatever the data is, it is condensed into different groups, and for that I create the groups; for example, the groups a1 and a2 did not come from the sky, you are the one who created these groups as pass and fail. So, I divide the entire data into different groups, and these groups are called 'Classes'. The meaning of a class is simply a group, and for a given set of data, we always try to create a suitable number of groups. Well, I am saying here a suitable number of groups; there is no very hard and fast rule to decide how many groups there should be. For that you have to use your common sense and some basic information about the experiment, to decide how many groups can really help you in extracting the information which is contained inside the data. Obviously, if the number of groups is too small or too large, then possibly it will be more difficult to handle the data. So, we need some suitable number of groups; we will see more about this in the coming lectures and slides. Now, for every group you have to define the boundaries. For example, in this case where we have only pass or fail, there are no mathematical boundaries; the data are categorized only by the two categories pass and fail. But suppose you record the age or height of some number of candidates; then the ages can be 5 years, 7 years, 9 years, 12 years, 20 years, 18 years, 21 years, 30 years and so on. These ages can be defined in some groups, like 5 years to 10 years and 10 years to 15 years, or 0 to 10 years, 10 to 20 years, 20 years to 30 years and so on. In this case, they will represent our classes. So, whenever we define a class, there are two values: one is the lower boundary of the class and the other is the upper boundary of the class,
and these are called the 'Lower Limit' and the 'Upper Limit'. The difference between these two limits, the lower limit of the interval and the upper limit of the interval, is called the 'Width' of the class, or the 'Class Interval'. When you define a class, there are two values, a lower value and an upper value, and this forms a sort of interval. Whatever value lies in the middle of this interval is called the 'Mid Value'. The role of this mid value is that, when we group the data, the data will be scattered over the entire interval, but we assume that the entire set of data is concentrated at the mid value. To see what really happens, suppose I take two intervals with end points at ages five years, ten years and fifteen years. Suppose I collect the data and an age comes out to be seven years; it falls in the first interval. If it comes out to be eight years, it falls there too; nine years, again in the first interval, and so on. There will be many, many observations in this interval. Similarly, all those ages lying between 10 years and 15 years, say 11 years, 12 years and so on, will be lying in the second interval. So all these values lie at different locations inside their interval, but we assume that, once they are grouped, all of them lie at the mid point of the interval. The mid point of the first interval is simply five years plus ten years divided by two, which equals seven years and six months. Similarly, the mid value of the second interval is ten years plus 15 years divided by 2, which comes out to be 12 years and six months. So this is what we assume: all the values in an interval are represented by its mid value.
And whatever number of observations lies inside a particular class is simply called the 'Absolute Frequency', or in simple words we also call it the 'Frequency'. And when we divide this absolute frequency by the total number of observations, we obtain the relative frequency of that class.
Refer slide time: (32:51)
Now, there is another aspect, which is the cumulative frequency. What is the cumulative frequency? As the word suggests, cumulative means you are accumulating, you are adding more and more. The cumulative frequency is also defined for a particular class, and it is defined for all the classes separately. The cumulative frequency corresponding to any variate value is the number of observations less than or equal to that value. When the cumulative frequency corresponds to a class, it is the total number of observations less than or equal to the upper limit of that class.
Let me take an example and illustrate all these things; then it will become clearer and easier to understand. In this example, there are twenty participants who participated in a race, and the time taken to complete the race is recorded in seconds. This 32 means that the first participant took thirty-two seconds. This 35 means the second participant took thirty-five seconds. This 45 means the third participant took forty-five seconds, and so on. Now, what you have to look at here is very important, and I would also request you to please concentrate on this example: how I am creating the class intervals and what steps are involved. In this lecture, I am going to explain the example in detail, and in the next lecture I will implement the same example in R software. And when you implement it, it is important for you to do the same steps in the R software which you are doing here. Okay. So, looking at this data, first we see what the minimum value and the maximum value in this data are. I can see that 32 is the minimum value and 84 is the maximum value. So the minimum value is 32 seconds and the maximum value is 84 seconds.
Now, looking at these two values, minimum and maximum, I have to define the width of the class interval, and this width is going to decide the number of intervals also. By looking at this data, suppose I suitably choose the width of the interval to be 10 seconds. Then I create different classes like this: class one, a1, consists of 31 to 40 seconds; that means this interval a1 will contain all the values of the time which lie between 31 seconds and 40 seconds. Similarly, the next class is a2, which will contain all the values between 41 seconds and 50 seconds. And similarly I have the classes a3, a4, a5 and a6. So I have created six classes, and you can notice that these six classes can contain all of the data. That is another point while creating these groups: the groups and their limits have to be defined in such a way that the entire data can be accommodated among the groups. In each group you can see the limits: for a1, 31 is the lower limit and 40 is the upper limit, and similarly, in the last class, 81 is the lower limit and 90 is the upper limit. So now I have created the groups in which the entire data can be summarized.
Now, I will present it in a suitable tabular form, so that it can be understood easily. You can see here that I have created a table, and the first column is the class interval; in this column I give the same class intervals which I had denoted as a1, a2, a3, a4, a5 and a6. In the next column, I find the midpoint. The midpoint of class a1 is 35.5, which comes from (31 + 40)/2, and similarly I have found the midpoints of the other classes. We are now going to assume that whatever data is spread over the interval 31 to 40 is concentrated at 35.5; in other words, all the data in this interval is treated as having the single value 35.5. And as I said earlier, and as we discussed in the earlier lecture, once you group the data, the information on the individual values is lost; the only information available is which category or class a data value belongs to. After this comes the third column, in which I count the absolute frequency, or frequency. For example, take the interval 31 to 40. If you look at the data, how many values lie between 31 and 40? This is the first value, here is the second value, then the third value, the fourth value, and you can see here the fifth value. Similarly, if you count how many values lie between 41 and 50: there is the value forty-five, then a forty-two, and then forty-two once again, so there are only three values lying between 41 and 50; you can see I have circled them. In the same way, I have manually counted the number of observations in each category and written them here; for example, this 5 indicates that there are five values in the interval 61 to 70, and so on. The total number of observations, as you can see, is 20, and we denote this total by n, which here equals 20. Now, when I divide the absolute frequency by the total number of observations, I get the relative frequency in the next column. You can see the 5 coming here, this 3 coming here, this 3 coming here, and the total number of observations; once I divide, I get the values of the relative frequencies. The advantage of the relative frequency is that all the relative frequencies always lie between 0 and 1, so they can easily be converted into percentages. Now, in the last column, I find the cumulative frequency, and here I would like to explain once again how these cumulative frequencies are found. I can rewrite the classes here: a1, a2, a3, a4, a5 and a6. The total number of observations in class a1 is 5, and a1 has the limits 31 to 40. So I can also say that there are only 5 observations in the entire data set, which I have circled, whose values are smaller than or equal to 40. Now let me come to the second class: a2 goes from 41 to 50. The total number of observations whose values are smaller than or equal to 50 is the sum of the absolute frequencies of group 1 and group 2, of class 1 and class 2, that is, 5 plus 3, which is 8. Similarly, if you look at a3, which has the limits 51 to 60, its cumulative frequency is the sum of all the absolute frequencies up to the third class, which is 5 plus 3 plus 3, that is 11. Similarly, for a4 the cumulative frequency is the total of all the absolute frequencies from class 1 to class 4; for class 5 it is the sum of all the absolute frequencies up to class 5; and the sixth and last class gives the sum of all the absolute frequencies up to class 6, which is obviously the total number of observations. So this is what we mean by cumulative frequency: by looking at the value of the cumulative frequency, I can always find how many values are smaller than or equal to a given value. Now, the same thing can be stated in general: instead of
6 class intervals, I have k class intervals in general. There are n observations in total, and these observations are divided into k class intervals, which contain n1, n2, ..., nk observations respectively. Obviously, if you want to find the relative frequency of the j-th class, it is the number of observations in the j-th class divided by the total number of observations, fj = nj/n, where j goes from 1, 2, up to k. Now all this information can be combined together in this format: the class intervals, their absolute frequencies and their relative frequencies; if required, I can also add the information on the cumulative frequencies. The table we have drawn here is called the 'Frequency Table', or the 'Frequency Distribution'. Why do we call it a distribution? Because we are trying to see how the values of a variable are distributed. So, in this lecture, I have taken an example and, based on that example, I have tried to give you the different definitions and concepts and how the things are implemented. But whatever I have done was done manually. In the next lecture, I will continue with the same example, but I will show you how the same things can be implemented in the R software.
So, you practice it, try to learn it, try to understand it, and we will meet in the next lecture.
Lecture 09
Frequency Distribution and Cumulative Distribution Function
Welcome to the next lecture on the course Descriptive Statistics with R software. You may recall that in the earlier lecture we started a discussion on the frequency distribution. We understood the concept of absolute frequencies and relative frequencies, and we put them in a table, which we called the 'Frequency Distribution'. When we construct the frequency distribution, the data may come from a discrete or a continuous variable. We had completed our discussion on the discrete variable, and we started the discussion on how to construct the frequency distribution, or the frequency table based on absolute and relative frequencies, for continuous data. In the earlier lecture we took an example and saw how, manually, you create the groups, compute the frequencies based on them, and then construct the frequency distribution or frequency table. We will continue along the same lines, and today we would like to see how whatever we did manually (manually means we made all the calculations by hand) can be implemented in the R software, using the same steps, the same concepts and the same methodologies. Let us see how the things are going to work and, essentially, how you are going to obtain a frequency distribution table using the R software.
You may recall that in the earlier lecture I had taken an example like this one: I had recorded the time taken by 20 participants in a race, and I had created a data vector, time, which consists of 20 values. After this, our objective was to create a frequency distribution. You may recall what our first step was: we tried to find the minimum and maximum values in this data set. We had identified that 32 is the minimum value and 84 is the maximum value, and using these minimum and maximum values, we need to decide how many class intervals we would like to have. We had discussed that all this data has to be grouped into some suitable number of class intervals, or in simple words, classes. So, first we need to decide how many class intervals we can make.
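The full list of 20 values is not reproduced in this transcript. So that the code below is runnable, here is a hypothetical stand-in vector: the first five values (32, 35, 45, 83, 74), the minimum 32, the maximum 84 and the class counts 5, 3, 3, 5, 2, 2 are from the lecture, while the remaining values are made up to be consistent with them:

    # Stand-in for the lecture's data vector of 20 race times (in seconds)
    time <- c(32, 35, 45, 83, 74, 34, 37, 39, 42, 48,
              51, 55, 58, 61, 63, 65, 67, 69, 76, 84)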
So, in this example we had decided to make the class intervals of width 10 seconds, and if you have a look at this table, this is the table we had constructed: first we constructed the class intervals, based on them we found the frequencies, from those the relative frequencies, and finally the cumulative frequencies. The same thing we are now going to do in the R software.
Refer slide time: (04:00)
So, the first step in the construction of a frequency distribution is to find the range of the data values. Statistically, the range of the data values is defined as the maximum value minus the minimum value; once you have the range, you get an idea of how it has to be partitioned into class intervals. In order to find the range, we have a command in R, r a n g e. To use it, I give the name of the command, range, and inside the brackets, the arguments, I give the data vector. If I do so, the outcome will be the minimum and maximum values of the data contained in this data vector. That is the first step. What will be the second step? Looking at the value of the range, I have to decide the number of class intervals and the width of each interval, and then I need to partition this range into different segments, which we will call classes.
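With the stand-in time vector above, this first step would look like:

    range(time)
    # [1] 32 84     (the minimum and maximum of the data)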
So, you can see here that I have applied the function range to the time data, and it gives us 32 as the minimum value and 84 as the maximum value. Once you obtain the range, the next step is to divide this range into a suitable number of classes. We had decided in the earlier lecture that we are going to have classes of width 10 seconds, so we had the classes 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80 and 81 to 90. The range from 32 to 84 is thus extended to 30 to 90, and this range will be partitioned into the different classes. So now I have another task: how to create these intervals. When we divide a range into different segments of equal length, we can use the sequence command; we are simply creating a sequence of values.
I create the sequence using the command seq. I would like to mention here that it is not really possible to give you the details of all the R commands here. For that you need to learn the basics of the R software first, and then you will be able to use them with the statistical functions. In case you want to do so, you can go to the slides of my other course, Introduction to R Software; its slides and videos are available on the NPTEL website. So, without explaining the full use of this operator s e q, I will simply use it: the first value inside the argument gives the starting point and the second value gives the end point, which means I need to create a sequence from 30 to 90, and the argument by = 10 provides the width of the interval, that is, the points at which the sequence has to be broken. So, once I start with 30, the sequence will be broken at 30, 40, 50, 60, 70, 80 and 90. Right. I store the outcome of this command, which breaks the range at intervals of 10 units into seven break points, in a new variable called breaks; I am going to use this breaks variable later on, in the construction of the frequency table. Once I execute it on the R console, I get this value, and this is the screenshot.
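As a short sketch of this step:

    breaks <- seq(30, 90, by = 10)   # break points 30 40 50 60 70 80 90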
Now, once I have this sequence from 30 to 90 at intervals of 10 units, I need to convert this numeric vector into a factor. Once again, you need to know what factors are in the R software, and for that I will again request you to have a look at the lectures on Introduction to R Software. To achieve this, we have a command cut, and it is used in the following way. I use the command c u t, then I specify the data vector on which I want to operate the function cut, and then I define the breaks. breaks is going to be a numeric vector of two or more unique cut points, or a single number greater than or equal to two, which gives the number of intervals into which the data vector has to be cut. So this breaks argument controls the number of partitions of the data vector. Then there is an argument in small letters, r i g h t, which here is set equal to FALSE. You can see this is a logical value; as we discussed, TRUE and FALSE in capital letters are the logical values. Setting right = FALSE means that the intervals are to be closed on the left and open on the right, and that is the setting we want, so we set it to FALSE. In case you want it to be TRUE, then instead of FALSE you give TRUE. So, let us try to execute this command,
and then we will see what happens and what the interpretation is. Once I complete this, I will take you to the R console and show you what is really happening. Okay. So, I take the same data vector time, and I use the breaks that we have already generated; remember, we generated the breaks as 30, 40, 50, 60, 70, 80 and 90. So I am telling this function: please use the data of the time vector and create the partitions using the values in breaks, with right equal to FALSE, which means all the intervals are going to be closed on the left-hand side and open on the right-hand side. In case you want the intervals to be closed on the right-hand side instead, you need to use TRUE, but I have not used it here. Whatever the outcome of this is, I store it in a new variable, time.cut; the name simply indicates that the time vector has been operated on with the command cut.
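A sketch of this step, continuing with the stand-in vector:

    # Intervals closed on the left and open on the right: [30,40), [40,50), ...
    time.cut <- cut(time, breaks, right = FALSE)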
The outcome of this time.cut will look like this, and here is the screenshot. You can see here the values [30,40), [30,40), [40,50), ..., [80,90), and there are some levels listed as well. What are these values indicating? You see, in any software it is very important that you understand what the software is doing, because unless and until you understand it, you will not be able to execute the correct command on the given outcome. So,
what is the meaning of this outcome? You will recall that the data in your time vector was 32, 35, 45, 83, 74 and so on, and the outcome of the variable time.cut is [30,40), [30,40), [40,50) and so on. I have simply copied and pasted these two data vectors here, and you can see that there are exactly 20 values in each. So what is really happening? These are the individual values in the time vector, and the entries of time.cut are the intervals in which these values lie. For example, take the first value, 32: the first entry of time.cut is [30,40), indicating that the value 32 lies in the interval 30 to 40. Similarly, the second entry indicates where the second value of the time vector lies: 35 lies between 30 and 40. If you move forward, the third entry indicates the interval in which the third value, 45, is lying, and so on. So there are 20 values in the time vector and 20 entries in the vector time.cut, and every entry gives the interval in which the corresponding value lies. This is how the interpretation of time.cut goes. Right. Now what do we have to do? We have got this data, and now I have to create a frequency table. We learned in the earlier lecture that in order to do so, we have the function table; using the table command, we constructed the frequency table in the case of discrete data, and in the case of qualitative data that was converted into numerical values representing the variable's values. Okay. So I will use the same command here, but now it is going to be applied to the new variable time.cut, because now we have the intervals as well as the time data, and I need to create the frequency distribution using this data vector. As we discussed earlier, we obtain the absolute frequencies of this data vector using the table function. The usage is: you type table, all in small letters, and inside the argument you write the variable containing the data. Once you execute table(variable), it will create the absolute frequencies corresponding to the data contained in the argument; that is, it will give you the values of the absolute frequencies for the different intervals.
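With the stand-in vector, this step and its output (the counts 5, 3, 3, 5, 2, 2 are as stated in the lecture) would look like:

    table(time.cut)
    # [30,40) [40,50) [50,60) [60,70) [70,80) [80,90)
    #       5       3       3       5       2       2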
So, you can see here, once I write table and, inside the brackets, time.cut, I get this type of outcome. What is it telling you? It says that there are five values in the interval 30 to 40; so 5 is essentially the absolute frequency of class a1, which is 30 to 40. Similarly, this 3 indicates that there are 3 values in the interval 40 to 50, which is our class a2; there are 3 values between 50 and 60, 5 values lying in the interval 60 to 70, 2 values lying in the interval 70 to 80, and 2 values lying in the interval 80 to 90. You can notice one thing here: the intervals are contiguous, because each interval is closed on one side and open on the other, and that is the way we create the frequency distribution for continuous data. In this case all the values happen to be integers, so you have to make sure that if a value such as 40 falls on a boundary, you know where it is counted, in the earlier interval or in the next one. So you have to be careful. Okay. And now, this is the screenshot. But you can see one thing: usually in the textbooks, whenever we write a frequency table, it is not written horizontally but vertically, like this, with the class intervals a1, a2 and so on in one column and the frequencies f1, f2 and so on in another. But in this outcome, as you can see, the frequency table is coming out horizontally.
So, in order to make it vertical, we have a command cbind; this function is used to print the frequency distribution in column format. Up to now, the intervals and their respective frequencies were coming in rows; now I want to make them column-wise. I have to do nothing more than take the same outcome we obtained and operate the cbind function on it. So you can see here, I use cbind and, inside the argument, I simply write table and, inside its argument, time.cut; this is the same command I used before, where I am circling. Whatever the outcome of that command is, it is used here with the function cbind, and you can see that you get a vertical table. And this is your frequency distribution. It is the same thing: there are five values in the interval 30 to 40, three values in the interval 40 to 50, and so on. Here are the interval limits, a1, a2 and so on, and here are the frequencies f1, f2 and so on. So this is the same table that we had obtained earlier; there is a small difference between the limits 31 to 40 and 41 to 50 used there and the limits here, but that can be adjusted by using appropriate intervals, so it is not going to make much difference. Here, all our data values were integers; in case the data contain fractions, the interval boundaries have to be chosen with corresponding care.
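A sketch of this step:

    cbind(table(time.cut))   # print the frequency table vertically
    #         [,1]
    # [30,40)    5
    # [40,50)    3
    # ...      ...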
But now, before going further, let me show you these things on the R console. Let me start with the data: you can see here, your data is like this, in the vector time.
Refer slide time: (22:42)
I come to the R console and type time = c( ), with the data inside the brackets. You can see here, this is my data, on which I would like to create my frequency distribution. The first step is to find the range of this data, so I execute range(time), and this comes out to be 32 and 84. This gives us the idea that we can have intervals of 10 units, starting from 30 and going up to 90.
Refer slide time: (23:26)
So, what do I have to do after this? I need to first create the sequence, at intervals of 10.
Refer slide time: (23:41)
So, I create the sequence in the variable breaks, and if you look here, breaks comes out to be the same outcome as before: the values 30, 40, 50, 60, 70, 80 and 90.
Now, after this, what do we have to do?
I simply have to use the command cut on the data. The time vector is here, breaks is here, and I use them to get the values of time.cut; you can see that time.cut comes out like this. Right. And after this, what do I have to do? I simply have to operate the table function on it, as you can see on this slide. So, if I use the table function, table(time.cut), you can see this is the same outcome as before: this is the frequency table that we obtained. But you can see that this frequency distribution table is horizontal, and I want to make it vertical. So I use the command cbind, and I type the same command whose output I want to convert into a vertical table, table(time.cut). You can see that I am getting it here: this is the same frequency distribution that you used to obtain by manual calculation. Right.
Refer slide time: (24:03)
Now there is another issue. In this case, you have obtained the frequency distribution with respect to absolute frequencies. Suppose that, alternatively, you want to find this frequency distribution with respect to relative frequencies; how to do it? As we discussed in the earlier lecture, there is a very close connection between absolute frequencies and relative frequencies: the relative frequencies are obtained simply by dividing the absolute frequencies by the total number of observations, for which we used the command length. So, in order to find the frequency distribution with respect to the relative frequencies for the same data set, I simply have to divide by the length of the data vector at the appropriate place.
Yes, so in the last lecture we obtained the frequency distribution using absolute frequencies, and then I divided by the length of the data vector, which gave us the relative frequencies. I am going to use the same concept here once again. In order to obtain the frequency distribution with absolute frequencies, I have the function table, with the variable whose frequency distribution I want inside the argument; now I divide it by the length of that variable. The length of the variable is simply the number of observations present in the data vector, and once you do this, you get the relative frequencies. So, if you look here, I have obtained the frequency distribution with respect to relative frequencies: I have divided table(time.cut) by the length of time.cut, and I am now getting the values 0.25, 0.15, 0.15 and so on. What are these values? The outcome in the earlier case was 5, 3, 3, 5, 2, 2. If I copy those values, 5, 3, 3, 5, 2, 2, and divide each of them by 20, because there are 20 observations, then the 0.25 in this outcome is nothing but 5 divided by 20, and so on; here is the screenshot. Once again, if you want to put it in vertical columns, in the vertical format, you simply have to use the same command cbind, but now cbind is operated on the expression which was used to obtain the frequency distribution with respect to the relative frequencies; this is the same outcome in vertical columns, and here is the screenshot. So, I would like to show you the same outcome on the R console also.
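A sketch of this step, with the output implied by the counts above:

    table(time.cut) / length(time.cut)
    # [30,40) [40,50) [50,60) [60,70) [70,80) [80,90)
    #    0.25    0.15    0.15    0.25    0.10    0.10
    cbind(table(time.cut) / length(time.cut))   # the same table, in a vertical column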
You can see here. Right. We had obtained the data in time.cut, which was this, and now we are going to obtain the frequency distribution: this will be table(time.cut) divided by length(time.cut). You can see you get the same outcome, and if you want to make it vertical, you simply operate the command cbind on it, write this, and you can see that this is once again the frequency distribution, where the frequencies are in terms of relative frequencies. Right. So this is how we create our frequency distributions. Now after this comes the next thing, the last column,
Refer slide time: (30:26)
which is the cumulative frequency column that we would like to compute. As we discussed, the cumulative frequency gives us an idea, up to a certain point, of how many values are less than or equal to that particular value. In order to compute the cumulative frequencies, we can use the function cumsum, which abbreviates cumulative sum. To use it, we simply apply the function cumsum to the variable for which we want to create the cumulative totals, but the variable has to be operated on with the table function first, because once we have data in some variable, it first needs to be converted into a frequency table, and then, based on that frequency table, the cumulative sums of the frequencies can be obtained. That is why the complete command will be the cumsum function applied to table(variable). Right. If you do this, it will produce the cumulative frequencies.
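A sketch, with the output implied by the counts above:

    cumsum(table(time.cut))
    # [30,40) [40,50) [50,60) [60,70) [70,80) [80,90)
    #       5       8      11      16      18      20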
So, let me show it on the data set that we are considering. If you look at the frequency table that was obtained for the absolute frequencies, I had six intervals and six absolute frequencies, namely 5, 3, 3, 5, 2 and 2. Now, if I obtain the cumulative frequencies as in the earlier slide, this is how we find them: the first cumulative frequency is the first frequency itself, the second cumulative frequency is the sum of the first and second frequencies, the third cumulative frequency is the sum of the first, second and third frequencies, the fourth cumulative sum is the sum of the first four frequencies, the fifth cumulative sum is the sum of the first five frequencies, and the last one, the sixth cumulative frequency, is the sum of all six frequencies, which equals the total number of observations. You can see that you are getting the values 5, 8, 11, 16, 18, 20. Once you obtain the outcome of cumsum you can verify it: you get the same outcome, 5, 8, 11, 16, 18, 20. So this is how the cumulative frequencies are obtained. Now, in case you want to find these cumulative frequencies with respect to relative frequencies, or you want to represent the outcome in a vertical way, you use the same commands: to turn the horizontal outcome into a vertical outcome, just use the cbind command, and to present it with respect to the relative frequencies, just divide inside the command by the length of the data vector. Let me show you: on this slide, I am simply expressing the outcome of the previous slide in vertical columns by using the function cbind, and here is the screenshot.
Similarly, in case I want to produce the same cumulative frequencies with respect to the relative frequencies, I simply have to use the same command, table applied to the variable, divide it by the length of the variable, and then use the cumsum command on this new expression. Using the function cumsum in this way will produce the cumulative relative frequencies of the data contained in the variable. So now I have operated cumsum on the earlier data, table(time.cut), but now it is divided by length(time.cut). This gives me the cumulative sums of the relative frequencies: this is the first relative frequency, this is the sum of the first and second relative frequencies, and so on; the other entries follow in the same way. And if I want to make it vertical, I simply have to use the cbind function on it. Right. So, now I will show you this on the R console.
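A sketch of this last step; the cumulative values follow from the relative frequencies above:

    cumsum(table(time.cut) / length(time.cut))
    # [30,40) [40,50) [50,60) [60,70) [70,80) [80,90)
    #    0.25    0.40    0.55    0.80    0.90    1.00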
You can see here that I have already obtained the data in time.cut. Right. Now I find the cumsum of the table of time.cut, and you can see here, this is the outcome. In case you want to make it vertical, you simply have to use the cbind function, like this. And in case you want to obtain these cumulative sums with respect to the relative frequencies, I can show you: I simply have to write cumsum of the table of time.cut divided by the length of time.cut, the same variable in which I stored the data, and you can see the outcome with respect to the relative frequencies. But again this outcome is in the horizontal direction, and if I want to convert it into the vertical direction, I have to use the cbind function on the same expression; executing it, you can see I am getting the same data, which was horizontal, now coming in a vertical column.
So now we come to the end of this lecture. I have given you the basic idea of how to create frequency distributions using R, and I would like to emphasize one thing more. You should not assume that this lecture, or this course, is trying to teach you pure statistics. My objective is that many people are using these things, and I want to give them a basic idea of the use and interpretation, and to show them how to do the same things in the R software. So my motive is very simple. This lecture definitely cannot substitute for reading the books, for reading the chapters on frequency distributions from a proper book. So I would request you: please have a look at the chapters on frequency distributions and data tabulation in any good book, study the concepts and learn them; then this lecture will help you in brushing up those concepts and will teach you how to use them in the R software. So, take different examples from the books and from the assignments, practice them, and see how the interpretations are made. We will see you in the next lecture. Till then, good bye.
Lecture 10
Welcome to the next lecture on the course Descriptive Statistics with R software. What have we done up to now? We have considered the aspect of frequency distributions in the last couple of lectures, and before that we had two introductory lectures, where we learned that once we get the data in our hands, there are two types of tools that can be applied: one is graphical tools and the other is analytical tools. In the last two lectures, where we did the frequency distribution, that was the first step when you would like to arrange your data so that the data is ready to be exposed to the graphical and analytical tools. So now, in this lecture and in the next couple of lectures, I will first target the graphical tools, and after that I will continue with the analytical tools. The first question is: why should we use graphical tools? We know that graphics are very easy to understand, and that is why we take the help of graphics to extract and understand the information which is contained inside the data. We are going to use the information extracted from graphical tools as well as analytical tools together, so first let me explain why we need these graphical tools.
So, suppose I want to convey that a person is happy or sad. Well, you can explain in several sentences what happiness looks like, or what the face of a sad person looks like, and so on. But in case you use the smileys, what do you see? The mood of a person is very easily conveyed by these three smileys: just by looking at the structure, anybody can say very quickly that this one is indicating happiness; anybody can look at the blue one and say that the face is sad; and anybody can look at the green face in the middle and say very easily that it reflects that the person is okay. Similar is the information that is expressed in graphical mode. As we have discussed, in statistics we have only a sample of data, and the size of the data can be very small or very, very large. Each and every data point contains some information, but we want to have the information in some combined way, and that is why our first target was the frequency distribution. Now, once again, we are trying to combine the information in some graphical way, and we would like to condense the information in the form of a graphic, so that we can get some idea of the information contained inside the data.
Refer slide time: (03:56)
So there are various types of advantages. Graphics can convey the hidden information very compactly and very quickly, in a way which is very easy for a common person to understand. Just as with the expression of a smiley face, by looking at the behaviour of a curve one can easily understand it; that is the advantage. Once again, I would say that whenever you conduct a statistical analysis, there are various types of graphics that can be used, but sometimes there is a myth that unless you use a large number of graphics, the analysis is not good. Rather, I have heard people saying that the goodness of a statistical report depends on the number of graphics: the higher the number of graphics, the better the report. Well, this is wrong. If somebody has some health problem, it does not mean that the doctor who gives more medicine is the better doctor; the doctor is good if he gives the appropriate medicine in the appropriate quantity. The same is the message in statistics also: we have to use appropriate graphics, and an appropriate number of graphics. Only the use of appropriate graphics, in the correct number, will give us the correct information in a fruitful way. So in statistics there are various types of graphical tools that can be used.
Refer slide time: (05:35)
There are two-dimensional plots, three-dimensional plots, scatter diagrams, pie diagrams, histograms, bar plots, stem-and-leaf plots, box plots, and many, many more. Particularly with the advent of software, these graphics have become very popular, because they are very easy to create and can be created in a very short time. Just like all other software, R also has the capability to create graphics; not only to create them, but it also gives you the option to save the graphics in different formats, such as PostScript, JPEG, PDF and so on. So in R there is a wide variety of graphics available.
For example, the same plots which you have learnt earlier: bar plot, pie chart, box plot, grouped box plot, scatter plot, coplots, histogram, normal QQ plot, and there is a long list of all sorts of two-dimensional, three-dimensional and coloured plots. There are many, many possibilities, and it simply depends on your capability how many graphics you can learn. Well, I am going to explain here some selected types of graphics. My idea is not to teach you the graphics; my idea is to show you how one can create graphics in the R software and what the different options available are, and then to give you several examples. I believe that after that you will be confident enough to learn how to create any graphic yourself.
Well, so let me start here with one of the very basic graphs; this is called the bar diagram. The bar diagram is essentially used to visualize the relative or absolute frequencies of the values that are observed for a variable. The bar diagram consists of one bar for each category, and one very important characteristic of the bar diagram is the height of the bar: the height of each bar is simply proportional to the frequency or to the relative frequency, right; it is determined by the absolute frequency or the relative frequency of the respective class, and this height is shown on the y-axis. Whenever we create a bar like this, the bar has two things: one is the width of the bar and the other is the length or height of the bar. So what we have to keep in mind is that when we consider the bar diagram, the width is not important; the width of the bar is immaterial and can be chosen arbitrarily. Only the length is important, and this is going to represent the frequency, say the absolute frequency or the relative frequency. But one thing I would like to emphasize here: most of the time you will see that, whenever we create a bar diagram, the widths of the bars are taken to be the same, just because the graph should look nicer, and so that someone who does not know the theory of bar diagrams does not get confused as to why the widths are so different; that is the only reason, right. So now suppose we have the frequency distribution of discrete data, or of qualitative data that has been converted into numerical values through some proxy variables, as in the last lecture, where tastes such as sweet were coded by numbers.
Now we assume that we have this type of frequency distribution, where the classes are A1, A2, ..., Ak-1, Ak, so there are altogether k classes, and f1, f2, ..., fk-1, fk are the frequencies, that is, the absolute frequencies. They simply represent that f1 values belong to class A1, f2 values belong to class A2 and so on, and once they are divided by the total number of observations, which is denoted by n, the third column gives the values of the relative frequencies: f1/n is the relative frequency of class A1, f2/n is the relative frequency of class A2 and so on. So now suppose I want to create a bar diagram. The basic philosophy, or the basic fundamentals, are as follows: first I need to create the x- and y-axes, and on the x-axis I need to create the bars. For example, class A1 is going to be denoted by some bar like this, class A2 can be denoted by another bar, and so on up to Ak. The widths of these bars can be the same or they can be different, right, but it is always advised to have equal widths so that the bars look better. Now, if you look at the height of the first bar on the y-axis, it is going to represent the frequency, the absolute frequency, of class A1, that is, the point f1; if I consider class A2, then its height lies at the point f2, and similarly the height of the last bar lies at fk. So you can see that the height of each bar is proportional to its frequency. Now, instead of the absolute frequency, one can also use the relative frequency, in which case the heights represent the relative frequencies. So I have two options: the height of the bar can be f1, the absolute frequency, or f1/n, the relative frequency. The advantage of using the relative frequency is that the relative frequencies always lie between 0 and 1, so it becomes easier to compare the heights of the bars.
So if you see a diagram like this one, a small bar and a taller bar, by looking at it I can always say that the first has a lower frequency and the second has a higher frequency. Suppose these bars indicate the number of shirts sold in a shop on a given day; I have two shops and the sales are represented by the heights of the bars, so by looking at the heights I can very easily conclude which of the shops is selling more shirts, right. Now the question is how to create such a bar diagram, or bar plot.
In R software, we have a command barplot, and this barplot helps us in the construction of bar diagrams. When we construct a graph, there are many, many parameters, and you would like to handle those parameters so that you get the outcome in the required format, one which is more suitable to understand. For example, if you create a graphic, there will be an x-axis and a y-axis, so you would like to put some desired labels on the x-axis and y-axis saying what they represent; you need to control the widths of the bars; different bars represent different things, so you would like to add information inside the graph indicating which bar stands for what; you would like to give different colours to the bars; and there are many more things
that you can do. So when we use this command barplot, we have two options; in simple language I will call them the simple bar diagram and the bar diagram with more options. If you simply want to create a bar diagram, then you use the command barplot and, inside the arguments, you simply give the data in a variable, here called height, and that will work. So if I simply type barplot with a data vector, this will give us a simple bar diagram. But suppose you want to modify it, to improve it so that it looks better; in that case the detailed form of the barplot command is like this, and you can see its arguments: the first value is nothing but the height, which is the data; the second is the width, which controls the width of the bars; similarly there are space, names.arg, legend.text, beside = FALSE, horiz, and so on; you can see it is a long list. So the next question is, how will you learn all these things? I would suggest that the simpler option is to take the help on the command barplot, because it is practically impossible to keep all the commands in your mind at all times. The best option is this: R is free software that will always be available to you; look into the help of barplot, and there you will see that the interpretation of each and every parameter that appears inside the arguments is very well explained, so whenever you need it, simply read that part and execute it. To make you comfortable and to make you understand, I will take up some of the options and add them one by one.
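As a quick sketch of the two styles; the abbreviated signature in the comment follows R's own help page for barplot:

freq <- c(7, 3)   # some illustrative frequencies
barplot(freq)     # simple bar diagram: just pass the data

# Detailed form with many optional parameters (abbreviated):
# barplot(height, width = 1, space = NULL, names.arg = NULL,
#         legend.text = NULL, beside = FALSE, horiz = FALSE, ...)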
But before that, in case you really want to take the help on barplot, how do you do it? You simply have to use the command help and, inside the arguments, within double quotes, write barplot. This syntax is not only for barplot; it works for all the graphics, indeed for all commands, and it is one way to obtain detailed help. Now I will show you on the R console how this happens and how you get all this information, but before that, please have a look at this slide and the next slide; you will get the same information there.
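In code, either of these standard forms opens the documentation:

help("barplot")   # detailed help on the barplot command
?barplot          # shorthand for the same help page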
And in the next slide you have all the details, such as what is represented by height: either a vector or a matrix of values describing the bars, and so on. What is the width? An optional vector of bar widths. Space is the amount of space left before each bar; names.arg is a vector of names to be plotted below each bar or group of bars, and so on; there is a long list. So what you need to do is simply read this help page, and it will be your best teacher for understanding what is really happening, and you can build on that.
So you will see here, now I try to copy and paste this command on the R console. You can see that as soon as I press enter, it goes to the internet and a website containing the help on barplot is opened. This is one possible way; in order to use this type of help you need to have an internet connection, or else you can also go to the help menu inside the R software and find the detailed information there. But here, as soon as I press enter, you can see what is happening.
You can see that this internet site has opened, and actually this site, or this command, has taken you directly to the R server, where the latest help, whatever is documented, is available to you. So you can see the advantage of R: you are always working with the up-to-date documentation. And if I scroll down, you can see that this part is the same part that was shown on the slide.
And if you move further, you can see that there are different arguments, like height, and the help gives you all this information, something like this; if you scroll more, there is a detailed interpretation of width, a detailed interpretation of space, and so on; you can see it is a long list. So now it depends on your capability how much you want to learn and how beautiful or how informative a graphic you want to create, right. I have simply copied and pasted this here just to give you an idea. Now, one very important aspect of barplot: whenever you want to construct a bar plot, please decide what you are constructing it on. On the individual data, or on the categories? The answer is that we want to create a bar plot on the categories. For example, if we have data which has been categorized into two categories, one and two, and suppose you have got a hundred data values, would you like to plot all hundred values directly? No;
Refer Slide Time: (23:02)
you would first translate those hundred values into two categories, category 1 and category 2, and based on that you will have their frequencies, the absolute frequencies, and actually we would like to plot the frequencies. So first you need to input your data in the form of frequencies, and the frequencies can be obtained by using the command table. That is why, before plotting, we first build the frequency table.
So if I want to create the bar plot using this command barplot, first I need to transform the data into a frequency table using the command table, and then I create the bar plot. Similarly, if you don't want to use the absolute frequencies and want to use the relative frequencies instead, then the same command has to be modified: in this case I will operate the command barplot over the table of x divided by the length of x. We have learnt that once we use this command, it gives us a frequency distribution in terms of the relative frequencies, so now I will create a bar plot on the data given by table(x) divided by length(x).
So why not consider a simple example and see how these things happen. Let me take a very simple example: I have data on ten persons, and we have recorded whether each person is a graduate or a non-graduate; a graduate is denoted by G and a non-graduate by N. So we have this data, G, N, G and so on. As we discussed earlier, this data cannot be exposed to a statistical tool directly; we need to convert it into numbers. So what have we done? We give a graduate the value 1 and a non-graduate the value 2. The first person is a graduate, so I give the value 1; the second person is a non-graduate, so I give the value 2; the third person is a graduate, value 1; and so on. So now I have the data 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, on which I would like to create my bar plot. I store this data using the c command in a variable quali, which I use as a short form of qualification. You can see I have entered this data, it looks like this, and this is the screenshot. Well, I will also show you on the R console, so let us first go to the R console.
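As a sketch, the data entry for this example looks like this:

quali <- c(1, 2, 1, 2, 1, 1, 1, 2, 1, 1)   # 1 = graduate, 2 = non-graduate
quali                                       # print the data vector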
Refer Slide Time: (26:22)
So you can see here, this is the data. Now please observe what is really happening. I have been told to use the command barplot on this data to get the bar plot. Now, if you observe, what are you getting here? Do you think this is exactly what you wanted? I have copied and pasted this graph over here; if you look at this graphic, you will realize that, no, this is not matching what we wanted, because it is plotting the raw data values 1, 2 and so on; the bar heights take only the two values 1 and 2.
So if you look at the data, the first two people are 1 and 2, and the plot shows the first person as 1 and the second person as 2, and then there are ten such bars, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. What is this? We didn't want this. Well, this is a very common mistake that we make while creating bar plots, or pie charts, which I will show you later on: we directly expose the original data set. What we have to do instead is first create the frequency distribution of the variable quali and then plot that, and this is what I am doing now. You can see, first I create the frequency table using the table command on the data quali; this comes out like this, the categories 1 and 2 with frequencies 7 and 3, simply indicating that there are 7 persons who are graduates and 3 persons who are non-graduates. Then I use the command barplot on table(quali), and now I am getting this bar chart: you can see this is bar number one and this is bar number two. Bar number 1 represents the graduates and bar number 2 represents the non-graduates, and looking at the heights, this height indicates that there are seven persons, the absolute frequency is seven, and similarly this one is three, so there are three persons. So by looking at this bar diagram I can very
easily conclude that there are seven persons who are graduates and three persons who are non-graduates. And what is indicated on the y-axis? Nothing but the frequency, the absolute frequency. The widths of these two bars, this width here and this width here, can be arbitrary, but anyway, as I said, it is nice to have bars of equal width for an easy-to-understand graphic, right. So this is how we create this plot. Now I will show you on the R console, but before that, let me also show you how to create this bar diagram using the relative frequencies. First I create the frequency distribution with the relative frequencies: you can see that once I execute the command table(quali) divided by length(quali), I get this type of frequency distribution, where this indicates the graduate category and this the non-graduate category, and this 0.7 is actually 7 upon 10, and 0.3 is 3 upon 10, indicating the relative frequencies of the two classes. When I execute the barplot command over this data, I get this type of graph; again, this is my class A1, say the graduates, and this is my class A2, the non-graduates. This graph is exactly the same as the earlier one, but there is a difference on the y-axis: now the values are 0.0, 0.1, 0.2 and so on, and the height of the first bar indicates 0.7, which is the same 0.7 as in the table, and the height of the second bar, A2, is 0.3, again the same value as in the table. So this is how the relative-frequency bar diagram is obtained.
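Putting the two variants together, a minimal sketch:

table(quali)                          # absolute frequencies: 7 and 3
barplot(table(quali))                 # heights are absolute frequencies
table(quali) / length(quali)          # relative frequencies: 0.7 and 0.3
barplot(table(quali) / length(quali)) # heights are relative frequencies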
But now let me show you on the R console how you are going to do it. I make a bar plot of table(quali); you can see here the graphic, and on the y-axis, where my cursor is, the values 0, 1, 2, 3, 4, 5, 6, 7 indicate the frequencies. If you use the same command divided by the length, the graph remains the same, but you can see that on the y-axis the frequencies have changed and are now relative frequencies. So the height of each bar now shows the proportion of its category.
So this is how we create the bar diagram. Now let me take one more example to help you understand better. I have collected data on, say, a hundred customers. These hundred customers visit a shop where there are three salespersons, indicated by the numbers 1, 2 and 3: salesman number one, salesman number two and salesman number three. When a customer enters the shop, he is attended by a particular salesperson, and which salesperson attended which customer is recorded for the first hundred customers entering the shop. So this data consists of the numbers 1, 2 and 3 only: 1, 1, 2, 1, 2, 3 and so on. Now you can see that this small data set, consisting of a hundred values, is just numbers, numbers and numbers; they are not giving you any fruitful information, so my first attempt will be to make some suitable graphic to extract this information. I will first attempt to create a bar diagram, and I store all this data in a data vector called 'salesper', which is a short form of 'sales person', right.
And now, after this, if you make the same mistake, that is, if you run barplot directly on salesper, you will not get the correct information, right. So first let me show you these things on the console, and then we will move on.
So I have created this data vector, and you can see here this is my data, right. Now if I create a bar plot of this directly, you get a graph like this one; definitely you don't want this, and yes, I accept that I have deliberately made a mistake here, because the correct option is this: I don't have to use the data directly, but I have to first create the frequency table. So I first create the frequency table, and you can see that, looking at the hundred data values, you are not getting much information, but by looking at these three values you are getting the information that there are 28 customers attended by salesperson number one, 43 customers attended by salesperson number two, and 29 customers attended by salesperson number three, right.
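The hundred recorded values themselves are on the slide; as a stand-in that reproduces the same frequencies (28, 43, 29), though not the original order, one could use:

salesper <- c(rep(1, 28), rep(2, 43), rep(3, 29))   # illustrative stand-in data
table(salesper)                                      # gives 28, 43, 29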
Now I create a bar plot over this table. I write the command barplot and inside the brackets I give the data, and you can see the bar diagram this produces; and similarly, if you want to create this bar plot with respect to the relative frequencies, then I need to divide by the length of the salesper data vector, and you can see that the y-axis, which earlier showed 10, 20, 30, 40 and so on, now shows the relative frequencies. The same thing I have copied and pasted here for your understanding, and then I will do something more with it: you can see that this is the same bar plot that we have just obtained, and similarly you can have this bar plot with respect to the relative frequencies.
And now I will add some features to this plot. Suppose I want to give it a title; this command is the same one we used earlier, but now, in order to add a title to the graph, I am using the parameter main equal to, and inside the double quotes I write the title I want: Customers attended by sales person. If you execute it, you get the same graph, but now you can observe that you are getting this title. So the moral of the story is that if you want a title, you use the parameter main.
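As a sketch, with the salesper data from above:

barplot(table(salesper), main = "Customers attended by sales person")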
And I will show it to you on the R console also, and you can see here that now we have the title on the graph.
Now let me make some more changes in this graph. Suppose I want to add some legends and axis titles to this bar chart. First look at this bar diagram: I have added a legend to bar number one, SP1, that is, salesperson one; a legend to bar number two, SP2, salesperson two; a legend to bar number three, SP3; and I have added a title on the x-axis and a title on the y-axis. So now we need to see how these things can be done. Well, I will tell you the very simple way: go to the help menu, see which parameters can do the job, and the help will also explain how the values have to be given; sometimes they are numbers, sometimes they go inside double quotes. Please look at the help menu and use it. For example, the barplot(table(...)) and main parts of this command are from my earlier plot; now I want to add the three names SP1, SP2 and SP3, and the help tells me to give the names of the bars, whatever I want, inside double quotes. So I am giving the three names SP1, SP2, SP3 inside double quotes, separated by commas, under the parameter names.arg; this is the format. When I want to give a title on the x-axis, I have to use the parameter xlab equal to, and inside the double quotes I give whatever I want to write; for example, I have given here Sales person, and in brackets SP, and this entire thing is printed on the x-axis, while SP1, SP2, SP3 appear below the bars. Similarly, if I want to give a title on the y-axis, the parameter is ylab: I write ylab equal to, and whatever title I want on the y-axis I write inside the double quotes; you can see I have written Number of customers.
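Collecting these options, a sketch of the full call (the label texts are read off the slide):

barplot(table(salesper),
        main = "Customers attended by sales person",
        names.arg = c("SP1", "SP2", "SP3"),
        xlab = "Sales person (SP)",
        ylab = "Number of customers")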
Refer Slide Time: (40:32)
So I can execute this on the R console also, and you can see the outcome you are getting. At first there is no title and no legends here; now I execute it, and you can see that the labels SP1, SP2, SP3 and the titles on the x- and y-axes have been added.
Now suppose I want to add some colours, and I want to make these bars different shades, different colours, right. In order to add the colours I will have to use one more parameter, col, and the rest of the command is simply the same as on the last slide. To add the colours I give col equal to, and then three colours, red, green and orange; these colours are given inside double quotes, and remember one thing: red, green and orange are reserved words. Reserved words means that once you write red, green, orange, R will also understand them as red, green and orange, and there is a list of the different colour names which R understands. So you simply have to give these colours inside the brackets using the c command, and the same order will be followed in your bar diagram: for example, red is in the first place, so the first bar comes out red; the second place is green, so the second bar is green; and the third place is orange, for the third bar.
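And the coloured version, as a sketch:

barplot(table(salesper),
        main = "Customers attended by sales person",
        names.arg = c("SP1", "SP2", "SP3"),
        xlab = "Sales person (SP)",
        ylab = "Number of customers",
        col = c("red", "green", "orange"))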
So let us do the same thing on the R console and see what we obtain. You can see here, this is the same graphic, but as soon as I execute it, these colours are added, right. Similarly, you can add many more things to the graphic to make it more informative and more useful; there is a long list, but it is definitely not possible for me to take up all the issues in this lecture. I believe, however, that I have given you sufficient background to understand that it is not difficult at all to add different types of parameters, and not difficult at all to create nice graphics, using the R software, which is completely free. So you are creating all these graphics just for free. The only thing is, yes, you need to spend some time learning it, but if you spend some time learning and thereby save a lot of money by not buying expensive software to create all these graphics, I think that is not a bad bargain. So please have a look, take different data sets and create more graphics, and next time I will take up more graphics and explain all these features. So practice, enjoy it, and I will see you in the next lecture. Till then, goodbye.
DESCRIPTIVE STATISTICS WITH R SOFTWARE
LECTURE 12
GRAPHICS AND PLOTS – SUBDIVIDED BAR PLOTS, PIE AND 3D PIE DIAGRAMS
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the last lecture we considered some graphics, in particular the construction of the bar diagram. Now we will continue with the topic of graphics, and we will learn some other types of graphs in this lecture. We are essentially going to discuss the subdivided bar diagram, pie diagrams and three-dimensional pie diagrams, okay.
So let me start our discussion with the first topic, the subdivided bar diagram, which is also called the component bar diagram. What does this mean and what does it show? You have seen that when we created the bar diagram, we had bars of this type, each bar indicating a class A1, A2 and so on, but those bars indicate only one value at a time. Now suppose there is a situation where the value inside a bar is itself subdivided and depends upon some other values. Then what we do is create the bar and subdivide it: this part is the component of the first aspect, this part, for example, is the component of the second aspect, and this part here is the third aspect. So now you can see that inside the classes A1 and A2 you are also able to compare different things. For example, I can compare the contribution of a given part: I can see that in class A1 the contribution of the green diagonal lines is less. Similarly, the orange lines can be compared by their heights like this, and if you want to compare the third category, you can compare the red lines. So what happens is that the subdivided bar diagram divides the total magnitude of a variable into different parts, into various parts.
Let me take an example and show you how these things work and how we are going to do it in the R software. Suppose I take data on three shops, shop number one, shop number two and shop number three, and the data records the number of customers visiting, say, between 10:00 and 11:00 a.m., on four consecutive days, which I denote by day 1, day 2, day 3, day 4. So this is a sort of two-way table in which the rows indicate the shops and the columns indicate the days, and the interpretation is as follows. If I take the value 2 here, this means that two customers visited shop one on day one, whereas 20 customers visited shop two and 30 customers visited shop three on the first day; and similarly, if you take any other number, say this 15, it indicates that 15 customers visited shop two on day three, and so on. So you can see that there are two aspects, one is the shop and another is the day, and these two values together determine the number of customers visiting the shops between 10:00 and 11:00 a.m., right. So now, how do we plot a subdivided bar diagram? Note the following: if you make a simple bar diagram, it is not very convenient or informative, because it will give you information either on the shops or on the days, whereas these data values depend on both aspects, the shop and the day. This is the advantage of using the subdivided bar diagram: I can present both aspects at once.
So what I would like is the following. Suppose I create three bars: the first bar indicates shop number one, the second bar shop number two and the third bar shop number three. So on the x-axis I denote the shops, and on the y-axis I will denote the days. For example, day one is represented by this part here, this part here, and this part here; similarly, day two might be indicated by these orange lines, day three like this, and day four by this dotted area. So you can see that this height indicates day 1, and this orange height, here, here and here, indicates day 2, and so on. Looking at this type of graphic, you can see in a single glance how many people visited a particular shop on a particular day, and this is called the subdivided bar diagram. Now we want to construct it, but before you use the command to plot this subdivided bar diagram, you have to think about how you are going to input the data in your R command. Why? If you remember, for the bar diagram you input the data using the c command, just as a simple data vector, but in this case it is not a data vector; the data is given in two dimensions.
So I can use the matrix theory aspect and the matrix command to input my data. What I do first is create the data matrix. You can see that in this matrix there are one, two, three, four, that is, four rows, and three columns. You may recall that we have already discussed how to provide data inside a matrix, so I use the same command here, and I store the outcome in a variable, say cust, which indicates the customers. So I use the command matrix; as per the rules of this command, I provide the number of rows by the parameter nrow equal to 4 and the number of columns by the parameter ncol equal to 3, and then I give the data which I want to put inside the matrix. This data is given row-wise, so I give the option byrow equal to TRUE, and the data is given in the format 2, 20, 30, then 26, then 53, then 40 and so on. So you can see I have given this data, and once I look at the outcome of this command, I get a matrix of order 4 by 3, where the first column denotes shop number 1, the second column shop number 2 and the third column shop number 3, and the rows denote the days. So the data in this matrix is the same as in the table.
Once you have entered the data, you have to use the command barplot, and inside the arguments you give the name of the variable containing the data in matrix format; this command will create a subdivided or component bar diagram where the columns of the matrix denote the bars of the diagram. So each bar denotes a column, and within the bar there are sections, and these sections denote the frequencies in cumulative form. What does this mean? For example, if you look at the data matrix, the first column is 2, 26, 42, 30; these are my frequencies, and in the plot they will be shown in cumulative format. How will it look? I will plot it and then explain, right. So remember one thing: in the subdivided bar diagram, the heights marked on the bars are essentially cumulative frequencies, and if you want to recover the individual frequencies by looking at the subdivided bar diagram, it is pretty simple: subtract two consecutive cumulative frequencies, and the difference gives the value for that particular class. Suppose I take the cumulative frequency of the first two classes and subtract from it the cumulative frequency of class one; whatever the difference is, that is the frequency of class 2, okay.
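As a sketch of the construction: the first column (2, 26, 42, 30) and the values 20, 30, 26, 53, 40 and 15 are read out in the lecture, while the entries marked below are illustrative placeholders for values not read out:

# rows = days 1..4, columns = shops 1..3 (byrow = TRUE fills row-wise)
cust <- matrix(nrow = 4, ncol = 3, byrow = TRUE,
               data = c( 2, 20, 30,   # day 1: shops 1, 2, 3
                        26, 53, 40,   # day 2
                        42, 15, 45,   # day 3 (45 is a placeholder)
                        30, 25, 35))  # day 4 (25 and 35 are placeholders)
barplot(cust)   # one stacked (subdivided) bar per column, i.e. per shop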
So now, just for your understanding: here is the data, right, and when I execute the command barplot(cust), where cust was the name of the variable in which I have given the data in matrix format, I get this type of subdivided bar diagram, or component bar diagram. First let us try to understand what this is showing us. You can see there are four sections here: one is black, the second is dark gray, then a lighter gray, and then an even lighter gray. These are four different shades used inside each bar to divide it into four components. What is the bar here? The bar is like this, and what are its components? The first component is here, the second component here, the third here, and this is the fourth component. So, as the name suggests, the bar of the diagram is subdivided. Now, what is happening on the x-axis? The first bar denotes shop one, the second bar shop two and the third bar shop three. Well, the basic barplot command will not give you all this information, but in the further slides I will show you how you can insert legends on the x-axis and y-axis, how you can add titles, and how you can give different colours to the bars; in this slide I am simply explaining the interpretation of a bar and its components. Now, on the y-axis the values are 0, 50, 100, 150 and so on; these are values of the cumulative frequency. How? Let me explain it by taking the first bar, of shop one. You can see here, where I am pointing, a very small bar of black colour; if you move from bottom to top along the y-axis, the height of this first component is actually 2.
What is this 2? This 2 is actually this value in the table, and now, whatever is the boundary between the dark gray and the light gray components, where I am making a cross, this boundary is the cumulative sum of the first two classes. What are the first two classes? If you look at the data table, in the first column, for shop one, the first frequency is 2, the second frequency is 26, the third frequency is 42 and the fourth frequency is 30. So this border line indicates the cumulative frequency of two classes, the first and the second: the frequency of the first class is 2 and the frequency of the second class is 26, so their sum is 28, and this value here is actually 28. Similarly, if you come to the next partition, where I am making a small circle so that you can see it on the screen, what is this point? This point again represents a cumulative frequency. Of what? Of the first, second and third classes: the frequency of the first class is 2, of the second class 26 and of the third class 42, so the value at this circle is 2 + 26 + 42, which is 70. And if you come to the last border, where I am making a square, this point is the sum of all the frequencies: the first class has frequency 2, the second 26, the third 42 and the fourth 30, and their sum is 100, which is what is denoted here as 100. The same story goes for shop two and shop three; similarly you can create the partitions and the component bar diagram for them. Now, what is the advantage of creating this type of bar diagram? Let us have a look at it. If I compare the peaks of these bars, or the heights of particular components, what are they indicating? The
height of the first bar, for shop number one, is the smallest; the height of bar 2, which indicates shop 2, is more than that of shop 1; and the third bar has the greatest height. That indicates the total number of customers visiting shop one, shop two and shop three, so one can very clearly see from this graphic that the number of customers visiting shop number three is the highest, and for that you don't need to look into the data. Now, if you want to find out on a particular day which shop had more customers, what do you have to do? You simply have to compare the components corresponding to that day across the bars. For example, if you want to see which shop was visited most by the customers on day four, look at the height of this component in bar number three and compare it with the height of the corresponding part in the second bar; you can see that this component is smaller than that one. So I can say very clearly, by looking at the last component of these three bars, that the number of customers who visited on day four was highest in shop number three, then in shop number two, and lowest in shop number one, because that height is the smallest. Similarly, if you want to see what happens on day two, then by comparing the dark gray part in the three bars you can compare the heights of the components, and whichever height is greater, you can say very clearly that the number of customers going to that shop is the highest. Now let me first show you this graphic on the R console.
So first let me copy this data here. You can see I have created the data matrix like this, and after that my command was barplot with the name of the variable, which in this case is cust. So I write barplot, c u s t, and you can see this is the same graph which we just obtained, right. Coming back: my objective now is to add some colours, add some legends on the x- and y-axes, and add some labels, so how do we get this done? You see, adding
colours will definitely make the components more informative; they will be easier to visualize. The choice of colours depends on you, and in the R software there is a particular name for every colour. Well, I am using simple colours here, like red, green, orange and brown, whose colour names have the usual spellings, but if you want any particular colour, please look into the help of the R software and decide what colour you want and what the correct spelling of its name is. So I will write down the command and explain what is really happening. You can see, first, barplot(cust); that is the same command that gives the subdivided bar diagram. Now I want to add these labels; please look at the diagram: shop one, shop two and shop three, and I would like to indicate that the x-axis represents the shops. How to do it? In order to add these names, there is a parameter in the barplot command which is names dot arg, n a m e s dot a r g, and then you give the names of the bars, which you put inside double quotes separated by commas. So I have three bars here and I want to give them the names shop 1, shop 2 and shop 3, so I enclose each in double quotes, separate them with commas, and combine all these values into a vector using the c command. Now, in order to put a label on the x-axis, for example here I am using Shops, I use the parameter xlab; xlab gives the label on the x-axis, and again I put the name inside double quotes. These names are user defined, and they completely depend on you. Similarly, on the y-axis I want to give the name Days, and this is given by ylab, right, with Days inside the double quotes. And now you can see that, in each bar, I have given the first component as red, the second as green, the third as orange and the fourth as brown, so I need
to give these colours in the same sequence in which I want them to appear from bottom to top. So I make a data vector of the colours red, green, orange and brown; each colour name is written inside double quotes, they are separated by commas, they are combined into a vector using the command c, and the name of the parameter to which I give this vector is col, which is the short form of colour. Once you do this and execute the command, you will get the outcome shown.
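A sketch of the decorated call, assuming the cust matrix defined above:

barplot(cust,
        names.arg = c("Shop 1", "Shop 2", "Shop 3"),
        xlab = "Shops",
        ylab = "Days",
        col = c("red", "green", "orange", "brown"))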
So I can show you on the R console how these things happen. On the R console I copy and paste this command and execute it, and you see that you are getting the same outcome which I showed you, right. So this is the red colour, this is green, this is orange, this is brown; on the x-axis we have the label Shops, the different bars have the names Shop 1, Shop 2, Shop 3, and on the y-axis we have the label Days, and so on. Now let me come back to the slides.
Now, in this slide, or in this graphic, if you want to make interpretations, you can also do that. For example, just by comparing the heights of the brown components, you can compare how many people visited each shop on day four and which shop had more customers, and simply by comparing the heights of the orange components, you can once again compare which shop was visited more by the customers; the height of a component is simply proportional to the number of customers visiting the shop, right, and on the y-axis, as I told you, you have the values of the cumulative frequencies, right. So this is how you can create the subdivided or component bar diagram, and yes, there are many other options available here; if you want to explore them more, I would ask you to look into the help menu, okay. You can also see that I have shown you different aspects, such as adding labels and adding colours, so you can see that this graphic is almost the same as what you used to obtain from expensive paid software. The same thing can be obtained in the R software without any cost, and it is not that difficult. The only thing is, yes, you need to study the commands, but that is also not difficult; the help menu is always there. You simply have to look into the help menu and then just type the commands, okay.
Now after this I come to another chart, which is the pie diagram. Pie diagrams are also used to visualize absolute and relative frequencies. What happens in the pie diagram is that a circle is created, and the circle is divided into different segments; each segment denotes a particular category, like category 1, category 2, category 3, category 4, and the size of each segment, say this one here for category 2, or the size of this one for category 3, depends on the relative frequency, and the size of the segment is controlled by its angle. Well, I can use red colour here to make it clearer: this is the angle which determines the size of category 1, and similarly this is the angle which determines the size of category 3, and this size is determined by the angle relative frequency multiplied by 360 degrees. For example, a category with relative frequency 0.7 gets an angle of 0.7 × 360 = 252 degrees. So whatever relative frequency you have obtained, just multiply it by 360, and whatever angle you get, create that angle here, and that will
give you the segment of the pie diagram; this type of diagram is called a pie diagram. Now, this pie diagram can be created in two dimensions or in three dimensions. For example, here I have made it as a two-dimensional plot, but the same plot can also be made like this, something like this, more beautiful, and so on; here you can see it gets a height. So I will discuss both the two-dimensional and the three-dimensional pie diagrams here.
In order to construct the pie diagram in R, the command is pie, and inside the arguments you have to give the data. This data is given by a vector called x; I will more often be using the symbol x to denote the data vector. After that there is a long list of arguments, or parameters, which can be used to give labels, control the size, control the colours and so on, right. In our case I have chosen some popular aspects: the first is x, which gives the data vector; the second parameter, which I will show you, is labels, which gives a description to the slices; the third parameter is radius, which indicates the radius of the circle of the pie chart; another parameter is main, which indicates the title of the chart; col, short for colours, indicates the colours of the slices that we can choose; and the last option which I will show you is clockwise, which indicates whether the slices are drawn clockwise or anti-clockwise, and for that you use logical values, writing TRUE or FALSE in capital letters. If you want more detail, I request you to go to your R and look into the help.
For example, if you want the help on pie, simply give the command help and, inside the double quotes, pie; you will come to the website of the R software where all the details are given, but for this you need an internet connection, right. You can see there are many, many options; definitely I am not going into all those details, but I will continue with these things.
So now I will show you, or explain, this with an example. Suppose 10 persons are asked whether they are graduates or non-graduates, and the data is recorded as G for graduate and N for non-graduate. In order to convert it into numerical values, we use the number 1 to denote a graduate and the number 2 to denote a non-graduate person. So the data on the third person, which is G, can be written as 1, and the data on the fourth person, who is a non-graduate, can be denoted as 2. So we now have this data vector, and we want to create a pie chart for it, okay. I collect this data using the c command in a variable named quali, which is a short form of qualification. This is the data which I have stored, and this is the screenshot. Now, if I want to create the pie chart, you can see that there are two categories, 1 and 2, indicating the graduates and the non-graduates.
So if I want to create a pie diagram, I would simply use pie and then quali, and as soon as you do it you get a graph like this one. But now my question is: do you want this? Think about it. If you look at this graphic, this pie diagram, it is showing 1, 2, 3, 4 and so on, up to ten categories, but just now you indicated that there are only two categories, 1 and 2, for graduate and non-graduate. Then what is happening? You may recall that in the earlier lectures, while creating the bar plot, I explained this aspect: whenever we plot a bar plot or a pie chart, we are essentially plotting the frequencies. So whatever the data is, it has to be converted first into frequencies, and then plotted.
So first I use the table command and convert this data into a frequency table. You can see there are two categories, 1 and 2, indicating that there are seven persons in category 1 and three persons in category 2, and then I make the pie diagram. You can see that this now gives us the pie chart, or pie diagram, that we wanted: the white slice indicates that there are seven persons and the blue slice indicates that there are three persons. By looking at the angles, you can see that this segment is much larger than that one, which gives us a clear idea that the number of graduates is higher than the number of non-graduates. This is the screenshot, but I would like to show you first on the R console how it happens.
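In code, a minimal sketch of the wrong and the right way:

pie(quali)          # wrong: treats the ten raw values as ten slices
pie(table(quali))   # right: two slices with frequencies 7 and 3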
So first I create the data; you can see this is my data on qualification, quali. Then I make the frequency table of this data quali, which is like this, and then I use the pie command over table(quali), and you can see you are getting the same pie diagram. Now I come to the next aspect of this pie diagram: making it more beautiful by adding colours, labels and so on. You can see this is the same pie diagram, but here I have
added a title, and I have added the labels graduate and non-graduate, and I have used different colours, red and blue. So how to do it? Now I have to use the different options. If I want to give the labels graduate and non-graduate, I give them using the parameter labels, l a b e l s, equal to graduate inside double quotes and the second label non-graduate inside double quotes, separated by a comma, and both labels are combined using the c command. The title, Persons with qualification, is given by the parameter main; main is used to indicate the title, and whatever title I want, I give it inside double quotes, you can see here. After this I give a vector of colours, as I did earlier: the colours red and blue, again written inside double quotes and separated by a comma, combined with the c command and given to the parameter col. If you do this, you can see that you get the desired result. So I will show you on the R console.
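As a sketch:

pie(table(quali),
    labels = c("Graduate", "Non graduate"),
    main = "Persons with qualification",
    col = c("red", "blue"))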
So you can see, you are getting the same graphics here. Now I will take a quick example to show you what happens when we have a larger amount of data. Here, you can see, I am taking a simple example where there are 100 customers visiting a shop, and they are attended by three salespersons, whom we call 1, 2 and 3, and it is recorded which customer was attended by which salesperson: the first customer was attended by salesperson 1, the second customer by salesperson 2, and so on, right.
So this is the data, and I collect all of it in the data vector salesper, indicating the salespersons, and then I create a frequency table. You can see there are three categories, indicating salespersons 1, 2 and 3. By looking at the raw data you may not have an idea of the number of ones, twos and threes, but by looking at this frequency table I can say very clearly that the first salesperson attended 28, the second 43 and the third 29 of the hundred customers, and if I create a pie diagram, this is what it looks like. This slice indicates category 1, this indicates category 2 and this indicates category 3, and similarly, if you want to make it more beautiful or more informative by adding labels, I simply have to use the same parameters, labels, main and colours, and define what colours I want, what heading I want, and what labels I want to give. For example, I am giving here SP1, SP2 and SP3 for salespersons 1, 2 and 3, right.
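A sketch, reusing the stand-in salesper vector from earlier; the heading and colour choice here are illustrative:

pie(table(salesper))                            # plain pie with three slices
pie(table(salesper),
    labels = c("SP1", "SP2", "SP3"),
    main = "Customers attended by sales person",
    col = c("red", "green", "orange"))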
So let me quickly show you what happens. I use this data, store it, then make the table of salesper, which is this, and then I create the pie diagram of it; you can see, this is the same pie diagram that you obtained. Now I would like to add the colours and so on; I can do it here, and you can see I get the same outcome which I just showed you.
Now I would like to stop this lecture here. We have learned how to create the pie diagram, which is essentially a two-dimensional pie diagram. In the next lecture I will continue and show you how to create the three-dimensional pie diagram. So now I request you, please take some data from the books, create such diagrams on the R console and experiment with them, and my suggestion is: please don't restrict yourself only to the parameters that I have used in showing this; I am doing so because of the limitation of time, but please go through the help menu, read the different interpretations of the different parameters, and try to use them in the R software in these diagrams. That will give you more confidence, and you will become much better at producing more beautiful and more informative graphics. So practice, and we will see you in the next lecture. Till then, goodbye.
DESCRIPTIVE STATISTICS WITH R SOFTWARE
LECTURE 13
GRAPHICS AND PLOTS – 3D PIE DIAGRAMS AND HISTOGRAMS
Welcome to the next lecture on the course Descriptive Statistics with R Software.
You may recall that we had a discussion on different types of graphics in the last lecture. In this lecture I am going to address two more types of graphics: one is the three-dimensional pie diagram and the other is the histogram. The pie diagram and the three-dimensional pie diagram are more or less similar; the only difference is in their look. The construction, the structure and the interpretation are the same as in the case of the pie diagram.
So let us start our discussion with the three-dimensional pie diagram. As in the case of the pie diagram, there are different slices, and those slices represent the absolute or relative frequencies; similarly, in the case of three-dimensional pie charts, or three-dimensional pie diagrams, the slices also represent the absolute or relative frequencies. The difference between a pie diagram and a three-dimensional pie diagram is that in a pie diagram there is a flat slice, but in a three-dimensional pie diagram there is a circular slab, and this slab is partitioned into different segments or slices; every segment or slice represents a category of the frequency distribution, and the size of each segment depends on the relative frequency, exactly as it happens in the pie diagram. Here also the size of each segment is determined by its angle, and the angle is determined by the same formula as in the case of the pie diagram, that is, relative frequency into 360 degrees. So a pie diagram is a circular diagram which is partitioned into different segments, while the three-dimensional pie diagram is a sort of circle having a third dimension, its height, and in the same way as in the pie diagram, we create the slices, and the size of each slice is again determined by the relative frequency.
So let us now first understand how to create a three-dimensional pie diagram in R software. In order to construct it, we have a command pie3D: pie in small letters, then the number 3 and a capital D, and then the data vector, exactly as in the case of the pie diagram.
Then there are certain parameters which provide different types of options, such as labels and other things. A difference between the pie diagram and the 3-dimensional pie diagram is that the construction of the pie diagram is part of the base package of R, that is, it is inbuilt, but in order to construct the 3-dimensional pie diagram we need to install a package. So first we need to install this library using the command install.packages, and inside the arguments, inside double quotes, we have to write plotrix, p l o t r i x, and then load it. In case you execute these two commands on the R console, you will get the library plotrix on your computer; I have already installed this package on my computer.
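Just as a minimal sketch, the two commands described above look like this on the console (run the installation only once):

    install.packages("plotrix")   # download and install the package (one time)
    library(plotrix)              # load it in the current session to use pie3D()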
Now I will take some examples to show you how to create the 3-dimensional pie diagram. Once again I am continuing with the same example that I considered in the case of the pie diagram: we have data on 10 persons whose educational qualification was recorded in 2 categories, graduate and non-graduate, where 1 indicates graduate and 2 indicates non-graduate, and we have stored this data in a variable named quali. So now we have here a data vector quali consisting of the two numbers 1 and 2.
Obviously, as we have discussed earlier, whenever you want to create a pie diagram or a 3-dimensional pie diagram, you need to input the data in the form of frequencies. So what do I do first? I create the frequency table from the data quali using the command table(quali); you can see I had already done this in the last lecture, so I am simply reproducing a screenshot here. After this, you simply have to use the command pie3D — remember one thing, the D here is a capital letter — on this table. Once you do so, you will get an outcome like this one: you can see that the third dimension has been added by this height here. And in case you want to make it more informative
by adding the names of the slices, such as non-graduate and graduate, or a title for the graphic, such as persons with qualification, or by changing the colours, you can use commands similar to those we used in the case of the pie diagram. For example, you can see pie3D(table(quali)); this is the same command that we used earlier.
Now, in order to name the two categories graduate and non-graduate, I am using here the parameter labels, l a b e l s: labels equal to whatever names we want to give, inside double quotes, with the two values combined into a vector using the c command.
Similarly, if you want to give the title, it is given by the parameter main, so I have to write main equal to, and inside the double quotes, the required wording.
Then you can see here one slice is in red colour and another slice is in blue colour; so once again I use the similar command, col equal to red and blue — the R names for the two colours — inside double quotes, separated by a comma and combined with the c operator, and this specifies the colours of the slices. And once you do it, you will get this outcome.
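Putting these pieces together, a minimal sketch of the command could look like the following; note that the 10 individual values of quali are my own stand-ins, since the full vector is not reproduced here:

    library(plotrix)
    # hypothetical data: 1 = graduate, 2 = non-graduate
    quali <- c(1, 2, 2, 1, 1, 2, 1, 1, 1, 2)
    pie3D(table(quali),
          labels = c("graduate", "non-graduate"),
          main   = "persons with qualification",
          col    = c("red", "blue"))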
Now I would like to show it to you on the R console. So first I load the library, and you can see there is no error; the library has been loaded. Now I define here the data quali, and if I want to make the 3-dimensional pie diagram, I first have to create the frequency table of quali, and once I do it, I get the same pie diagram.
Similarly, in case you want to add more information, I can execute the same command over here: I copy and paste the same command, and you can see that you are getting the same graph that you had obtained, right.
Now I will show you another feature in the same 3-dimensional plot. In order to make it more informative I can separate the slices, so that the graphic will look like this: you can see here that the red and blue parts are separated.
In order to make this type of graph, I can use here one parameter, explode. You can see in this command, which is the same as the earlier one, that I am now adding a new parameter explode, e x p l o d e, all in small letters, equal to 0.2. Actually this 0.2 is the factor that decides how much separation you want; in this case it is the space between the two slices or slabs, so this factor 0.2 determines it. I will show you on the R console so that you are more comfortable, and then I will take one more example and show you all those things very quickly. So you can see here I have now used the parameter explode, and suppose you want to change it: suppose instead of 0.2 I make the explode value 0.8, you can see what happens — the separation becomes larger.
So by increasing the value of the parameter explode, you can increase the separation between the slices.
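For instance, a sketch of the exploded version, reusing the quali table from the earlier sketch, could be:

    # explode pushes the slices apart; 0.2 gives a small gap, 0.8 a large one
    pie3D(table(quali),
          labels  = c("graduate", "non-graduate"),
          col     = c("red", "blue"),
          explode = 0.2)   # try explode = 0.8 for a wider separation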
Now I will take one more example to make you comfortable. Once again I use the same data set which I had used earlier in the pie chart: that was the data on 100 customers who were attended by 3 sales persons, 1, 2 and 3, and this data was stored in a variable salesper. Now I create the frequency table using the table command,
and now you can see here there are three categories, 1, 2 and 3. If I make the simple 3-dimensional pie diagram using the command pie3D, I simply have to use the same command and change the name of the variable, which is now salesper. You can see here this is the standard 3-dimensional pie diagram using the default values, and the labels 1, 2 and 3 are indicating the 3 classes, that is, sales person 1, sales person 2 and sales person 3.
Similarly, if you want to give a title or names to these slabs, you can do it here using the same parameters labels, main and col, but now I have 3 categories, so I am using green, red and blue, 3 colours, exactly on the same lines, and you will get this type of outcome.
And in case you want to use the parameter explode — for example here I am using explode equal to 0.3 — then you will get three separated slices, as you can see here.
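A sketch of this three-category call might look like this; since the 100 customer records are not listed here, a short hypothetical stand-in vector is used, and the title is my own wording:

    library(plotrix)
    salesper <- c(1, 2, 3, 2, 1, 3, 3, 2, 1, 2)   # hypothetical subset of the data
    pie3D(table(salesper),
          labels  = c("1", "2", "3"),
          main    = "customers per sales person",  # assumed title
          col     = c("green", "red", "blue"),
          explode = 0.3)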
So I will show you on the R console also, so that you are more comfortable. First I copy this data, and then I make pie3D on the sales person data, but I need to give it in the form of a table, so I take the table of it, and right, you can see here this is the same graphic that I showed you.
And similarly, if you want to make it clearer by adding the titles, colours and legends, you can use the same command here, and I can show you that the outcome comes out like this — the same output that I just showed you.
And in case you want to use the explode parameter, just add it as one of the arguments and you will see that the slices are now separated.
Once again, in case you want to make the separation bigger, you simply have to increase the value of explode; suppose I make it 0.8, you can see that the separation becomes larger. So it essentially depends on the choice of the experimenter what exactly he or she wants.
Now, after this pie diagram, let me introduce the histogram.
A histogram is a graphic, but it is used for continuous data. You can recall we had discussed the aspects of discrete data, continuous data and so on. A histogram does the same thing as a bar diagram or a pie chart, but the difference is this: bar diagrams and pie diagrams are essentially for categorical variables, where the values are indicated by some numbers representing the categories, whereas the histogram is for continuous data. The histogram also first categorizes the data into different groups and then plots a bar for each category, but in this case the data is always continuous.
Now there is a difference between the bar plot and the histogram. You may recall that in the case of the bar diagram I had told you that the height of the bar is simply proportional to the frequency or relative frequency; the width of the bar is immaterial, so we don't bother about it — it has no interpretation.
In the case of a histogram, the frequency is essentially represented by the area of the bar, and the area of the bar is given by the height of the bar multiplied by the width of the bar. So in this case the bars in the histogram have to be controlled with respect to both height and width. You will notice that in most cases the width of the bars is kept the same in a histogram, and the reason for this is just to make it simple to understand: if you have 2 bars which you have to compare by their areas, versus 2 bars which you have to compare only with respect to their heights because the width is the same, which is more convenient? Obviously the height of the bar is easier to compare than the area of the bar, okay.
Now let us try to see how you are going to create the histogram based on the frequency distribution. You may recall that we had some continuous data, and we had discussed that this data is divided into different classes, and those classes have lower limits and upper limits. The size of the interval provides the width, and every class has some frequency or relative frequency, which is the number of values belonging to that particular class.
So now, to understand the construction of a histogram, what do we really do? Say the classes have limits: for example, this value will be a0, this will be a1, and so on.
Now we have the data x1, x2, ..., xn — suppose n values are there. I look at where x1 falls: here is class interval 1 and here is class interval 2; suppose x1 lies somewhere here on the X axis inside the first bar. Then I take the second value, which lies over here, the third value which lies over here, the fourth value, the fifth value, and so on; some f1 number of values will be lying inside bar 1, and similarly f2 number of values inside bar 2.
One thing we do is to assume that all these values, which are spread around the mid value, are concentrated at the mid value. The mid value is determined by (a0 + a1)/2 for class 1, and for the second interval, a1 to a2, the mid value will be (a1 + a2)/2. So I assume here that all the values of a class are concentrated at its mid value.
So I see here that the frequency of the class interval 1, a0 to a1, is f1; assuming that all the values are at one place, the height of this bar becomes f1, and similarly the height of the next one becomes f2, and I am assuming that the widths of both the intervals are the same.
Now obviously, if you create a histogram where one bar is very thin and another is very wide, it is not so convenient to compare the two areas; that is why it is emphasized that for all practical purposes the widths of the bars are kept the same.
And instead of the frequencies, I can also use here the relative frequencies f1/n, f2/n and so on; it depends on the need and requirement, what we really want to do.
Now, in R software the histograms are constructed by the command hist, h i s t, and inside the arguments we have to give the data vector x. You will see that in this case you don't need to create the table: the function hist will automatically create the frequency table and then it will create the bars, so this is different from the case of the bar diagram or the pie diagram.
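For example, a minimal sketch with a small hypothetical data vector:

    # hist() bins the raw values itself, so the data vector is passed directly
    x <- c(151, 162, 155, 171, 158, 166, 149, 160, 154, 168)  # hypothetical values
    hist(x)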
Now in a histogram I have two options: the histogram can be created using the absolute frequencies or using the relative frequencies. In case I want to use the absolute frequencies, there is no issue; the command hist takes care of it, and that is the default choice. But in case you want to create the histogram using the relative frequencies, you have to add one more parameter, f r e q equal to FALSE in capital letters, that means frequency equal to FALSE. As soon as you set freq to FALSE, the function hist will automatically use the relative frequencies for the heights of the bars.
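Continuing the sketch above, the relative-frequency version differs only in the freq argument:

    hist(x, freq = FALSE)   # bar heights now show relative frequencies, not counts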
Now, in case you want to improve your histogram, as we did in the case of the bar chart, there are some more parameters which can be given inside the arguments. Obviously x here determines the data vector, the numerical values for which the histogram is desired.
In case you want to give a title to the chart, this is controlled by main, m a i n; in case you want to change the colours of the bars, you have to use the parameter col; in case you want to add a description on the x axis, you have to use xlab; and in case you want to control the limits on the x axis, you have to use the parameter xlim.
Similarly, if you want to control the limits on the y axis, you can use ylim. There are more options, but I would suggest that you please look into the help using the command help("hist"); that will give you all the details.
(Refer Slide Time: 25:40)
Now let me take an example to show you the construction of the histogram. In this example we have the heights of 50 persons recorded in centimeters. Now look at these values: at first glance, are you getting any information about what is contained inside the data? It is very difficult, and that is the advantage of using the histogram — it tries to reveal the information contained inside this set of data. So I have stored all this data in a variable height using the command c,
and after this, if I use the command hist on the variable height, h e i g h t, we get this type of plot. You can see here it gives us the intervals: 120 to 125, then 130, 135, 140 and so on. If I ask which values are contained inside this first bar, it is all those values which are less than 125, and I can look at the height of the bar. Since the widths of all the bars are the same, by looking at this value I can say that there are 5 values in this class.
Similarly, if I look at this interval, the frequency here is 2, so I can say there are two values lying between 125 and 130.
Similarly, if I look at this interval, which runs from 155 to 160, the height of the bar is 7, indicating that there are 7 values between 155 and 160. This is also the same height as the next bar, so these two bars have the same frequency: the number of persons with heights from 155 to 160 centimeters and from 160 to 165 centimeters is the same. This type of information is revealed by this type of graphic.
Now, in case you want to improve the look by adding colours, adding legends or controlling the limits, you can use the parameters; I will show how to use them here, but I would also request you to have a look at the help and then explore. So, for example, here I am giving the title of the chart as heights of persons, I have changed the colour of the bars, on the X axis I am giving the label heights, and on the Y axis I am giving the label number of persons.
So, in order to get a graph like this one, I simply have to add the parameters inside the hist command. I am using here main equal to heights of persons, inside double quotes.
Similarly, this green colour is controlled by the parameter col, so I am giving col equal to green inside double quotes; and the label heights on the X axis is controlled by xlab, so I am giving the name heights inside double quotes.
And similarly, the label on the Y axis is controlled by ylab, which once again I am giving inside double quotes. If you add some more parameters over here, you can make it more informative, depending on your choice and your wish.
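A sketch of the decorated call described above; the 50 original height values are not reproduced here, so any numeric vector height will do:

    hist(height,
         main = "heights of persons",     # chart title
         col  = "green",                  # bar colour
         xlab = "heights",                # x-axis label
         ylab = "number of persons")      # y-axis label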
So now I stop here in this lecture, and once again I would request you: please choose some data sets from the books which are continuous and try to create histograms. Similarly, practice the 3-dimensional pie diagram and try different types of parameters with different values; for example, I have shown you what happens when we use explode equal to 0.2 and 0.8 and how much separation results, and that will give you a better idea of how the graphics can be controlled.
Similarly, in the histogram there are some other parameters which we have not used here, but I would request you to have a look at the help menu, see how they are used, and try to experiment with them. So keep on practicing and we will see you in the next lecture; till then, good bye.
Lecture – 13
Welcome to the lecture on the course Descriptive Statistics with R Software. You may recall that in the last lecture we had a discussion on the construction of the histogram. In this lecture we will continue with the same idea and discuss the kernel density plot, and after that I will take up the stem-and-leaf plot.
Now let me start our discussion. If you recall, what do you do in the construction of a histogram? First you have a frequency distribution in which you have class intervals, and then you plot the bars, where the lower and upper limits of the bars indicate the lower and upper limits of the class intervals. Ideally you assume that the observations contained inside a bar are concentrated at the midpoint of the class interval. Now, what happens if your data becomes quite large? If the number of data points becomes large, then obviously you will have to create more bars; in very simple words, if you want to make your histogram more precise, you need to make more bars. So what will really happen, that we try to first understand, and then based on that we will see how to represent the data. Actually, this can be done through the concept of the kernel density plot.
So, first you try to see what happens. In the case of a histogram, the continuous data is artificially categorized into different class intervals. The choice and width of the class intervals is very crucial in the construction of the histogram. For example, you may recall that for a given data set the histogram may look like this, or an alternative is that if you make the width of the bars smaller, it may look like this, and so on. So if you want to make your histogram more precise, or if the number of data points becomes very, very large, you need to create more bars. And what you ideally assume is that whatever data is contained inside a bar is concentrated at the midpoint of the interval. Right.
So what can we do? We can join these midpoints, like this. And similarly, if we do it with straight lines, it will look like this, and these lines can be extended so that they vanish on the x-axis. But now, as the number of bars increases, another option is to join the midpoints of the bars by a smooth curve — you see, I put my pen here and draw a small curve like this — and this curve is drawn in such a way that it passes through most of the midpoints of the class intervals. This curve is called a frequency density plot, or in simple words a density plot. Now, this is the concept; the next issue is, when we really want to implement it for computation, how do we get it done? In order to construct such plots, we take the help of kernel functions, and based on them we construct the density plots.
These density plots are like smoothed histograms. Smoothed histograms means we create the histogram and then join the midpoints of the bars by a smooth curve; this is really what we mean. Now, this smoothness depends on the number of bars: if you have a large number of bars, the curve is going to be smoother, and as the number of bars becomes larger and larger and tends towards infinity, the curve becomes a perfectly smooth curve. This is the basic idea. When we plot it, this smoothness is controlled by a parameter which is called the 'Bandwidth'. Density plots help us in visualizing the distribution of the data over a continuous interval or a time period. And if I want to explain the kernel density plot in very simple words, I would say that it is only a variation of the histogram, constructed using the concept of kernels. Right. So a density plot is simply a variation of the histogram that uses kernel smoothing to smooth out the noise in the plot.
Whenever we create such a kernel density plot, the peaks of the density plot display where the values are concentrated over the interval — that corresponds to the frequency.
The advantage of kernel density plots over the histogram is that the shape of the density plot is not affected by the number of bins — number of bins means number of bars — whereas the shape of the histogram is determined by the width of the bins, the number of bins and so on. That is why density plots play a better role than the histogram when you have large data. These density plots are constructed using the concept of the kernel density estimate. So let us first understand what this is.
What are these kernel density functions? A kernel density plot is essentially defined by a kernel density estimate. How is this defined? You can see here, I am writing a mathematical function: here x is the variable on which we collect the data, and the estimate is given by

\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right), \qquad h > 0,

where h is some positive value. Here small n denotes the sample size and small h indicates the bandwidth; this is the parameter that controls the smoothness of the kernel density estimate. And this K here is called the kernel function.
This kernel function is not arbitrarily defined; it has some properties which have to be satisfied for a function to be treated as a kernel function, and these properties are similar to the properties of a probability density function.
Well, for those who do not have a background in statistics, I can tell them that in statistics we have the probability density function of a continuous random variable; these are functions which have certain properties and which help in determining the probabilities of events. For example, you may have heard of the normal density function, the gamma density function, the t distribution, the chi-square distribution, the F distribution and so on. I am not going into that detail here, but I just wanted to give you some idea. In the case of a kernel function, if I choose different types of K, that is, different kernel functions, we will get different estimates and consequently different types of plots. Briefly, the kernel density plot is constructed from the values of the kernel function that we obtain on the basis of the given set of data: we have data, we estimate the values of the kernel density function, and then we plot them. So obviously, if you change the form of the kernel function, your estimate may also change. But those kernel choices are made in such a way that there is not much difference among different kernels, and the aim is to provide a function which presents the true frequency distribution of the data more efficiently. Okay. So, just to give you an idea of what I am going to use here and what choices are available in R:
We are going to discuss three choices of kernel function: one is the rectangular kernel, the second is the Epanechnikov kernel, and the third is the normal distribution kernel, also called the Gaussian kernel. Right. What is the rectangular kernel? It is defined as the function

K(x) = \begin{cases} \frac{1}{2} & \text{if } -1 \le x \le 1 \\ 0 & \text{elsewhere.} \end{cases}

Similarly, the Epanechnikov kernel takes the value

K(x) = \begin{cases} \frac{3}{4}\,(1 - x^2) & \text{if } |x| \le 1 \\ 0 & \text{elsewhere.} \end{cases}

Similarly, the Gaussian kernel is based on the normal distribution. We know that the probability density function of the normal distribution with parameters \mu and \sigma^2 is given by

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right), \qquad -\infty < x < \infty.
Actually, when we construct such kernel density plots in R software, if you don't give any choice, this Gaussian kernel is the default choice. But I will show you how the density plot looks when we change the type of kernel. So let me take the same example that I had taken in the case of the histogram; I will construct the kernel density plot in the R software and compare it with the histogram also. So now let us consider the data: the heights of 50 persons recorded in centimeters, and we would like to create a density plot for this data. This data has been stored in a variable height, like this.
After this, in case you want to plot the kernel density plot, the command in R software is plot(density()). So I write plot, and inside its argument density, and then inside the arguments of density I give the data. This is the command for plotting the density; remember, this is plot, inside the argument you have to write density, and then, once again inside the argument, you have to give the data. Here the data is given by the variable height, and if you execute it you will get this type of graph. You can see here the number of observations, 50; the bandwidth here is chosen for the Gaussian kernel, which is the default kernel when we are not specifying anything; and you can see this curve, a smooth curve like this. This is called a 'Kernel Density Plot'. And these values on the axis correspond to something like the class intervals in the case of the histogram. This type of smooth curve helps us in getting an idea about the distribution of the numerical values, and these curves are actually more useful when you have a large number of data points; then they give you much better information than the histogram. Okay.
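As a minimal sketch, the command reads:

    # kernel density plot of the height data with the default Gaussian kernel
    plot(density(height))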
Now, if I plot the histogram and the density plot side by side, it will help you in comparing the two. You can see here I have plotted the histogram of the same data, and here is the Gaussian kernel plot. Now see how things are happening here. One thing you have to keep in mind is that the histogram starts from 120, whereas the kernel density plot starts at zero. What is really happening is that the width of the bars is made smaller and smaller, as controlled by the choice of the kernel, and all these frequencies have been joined together to give this type of curve. So you can see the similarity between the two graphs; for example, the maximum frequency is at the same place. But the density plot is giving us more information. Before going further, let me plot this data on the R console and show you, so that you are more confident.
So now let me copy this data on the R console; you can see here this is the data on the height. Now I use the plot(density()) command, and if I execute it on the R console, you can see you get this type of kernel density plot, based on the choice of the Gaussian kernel. Right. And now I will show you one more aspect.
You can adjust these kernel density plots using a parameter, adjust. This adjust parameter takes a numerical value, and the structure of the curve is controlled by it. For the sake of understanding and explanation, I am taking two possible values, adjust equal to 0.5 and adjust equal to 1, and then I am plotting the same data using the same command. You can see here the structure of this plot and the structure of this plot: the data sets are the same in both situations, and the same Gaussian kernel density function has been used to construct these densities, but their structure is different. The first one shows more variation, and the second one is smoother. You can see that in the first case the value of the bandwidth chosen is 2.698, and in the second case the chosen bandwidth is 5.397, corresponding to adjust equal to 1. Right. Now a question comes: what should be the proper value of adjust? For that I will say: try to create a histogram, look into your experimental data, try certain values of adjust and plot your curve, and whatever curve represents the situation in a more authentic way, more honestly, choose that value of adjust.
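A sketch of this experiment with the settings used in the lecture:

    plot(density(height, adjust = 0.5))  # wigglier curve, smaller bandwidth
    plot(density(height, adjust = 1))    # the default
    plot(density(height, adjust = 2))    # smoother curve, larger bandwidth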
Similarly, you can increase the value of adjust; for example, in the next slide I am creating the same density plot using adjust equal to 1 and 2. You will see that as I increase the value of the adjust parameter, the smoothness increases. For example, this first graph is the same data with adjust equal to 1, and the second graph is with adjust equal to 2. You can see that with adjust equal to 2 the curve is more like a Gaussian or normal curve, because the bandwidth has increased from 5.397 to 10.79, nearly double. Right. So this is how it works.
And now, in this slide, I am simply giving you all three graphics together, with adjust equal to 0.5, 1 and 2, so that you can make a better comparison. You can see that as the value of adjust increases, the degree of smoothness increases. So this is how you can play with the data. First let me show you this execution on the R console.
(Refer Slide Time: 20:43)
So you can see here, first I use the density plot with adjust equal to 0.5, then I take the value 1, and then I take the value 2; you can see it is becoming more normal. And just for the sake of illustration, if I take the value to be 10, you can see this is now even closer to the normal curve, and so on. Right. Okay. Now let me show you what happens
in case I use different kernel functions. You can see here, on the upper left-hand side, I am creating a density plot using the Gaussian kernel, which is the default kernel, and for that I am now giving the choice explicitly: kernel, k e r n e l, equal to, inside the single quotes, gaussian, g a u s s i a n. Similarly, I am taking here another kernel, the rectangular kernel. If you compare both these graphs, you can see the overall shape is almost the same, but the structures are different. So now you are the one who has to decide which of the kernels gives a better representation of the data. And you can notice that the bandwidth in both cases is the same, 5.397. Right. If you want to change the bandwidth and the choice of kernel together, then use the parameter adjust as well as the kernel choice.
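For example, a sketch of the kernel choices mentioned in this lecture:

    plot(density(height, kernel = "gaussian"))      # the default
    plot(density(height, kernel = "rectangular"))
    plot(density(height, kernel = "epanechnikov"))
    plot(density(height, kernel = "triangular"))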
Similarly, you can choose other kernels: in this first graphic I am using the Epanechnikov kernel, and in this one I am using a triangular kernel, and so on. You can see the structures are more or less similar, but not exactly the same. Once again, I will show all these graphics on the R console to make you more confident, and then I will move to another graph. So you can see here, first I make the Gaussian kernel density plot, then the density plot based on the rectangular kernel, which was reproduced in the slides; similarly, if I take the Epanechnikov kernel it comes out like this, and if I take a triangular kernel the curve changes again, you can see here. Right. So the kernel density plot helps you in getting a smooth density plot, and these plots are more or less similar to histograms.
Now we come to the next topic, that is, stem-and-leaf plots. These are once again another way of representing the data. In this case the absolute frequencies in different classes are represented in a different way; for example, we represented the absolute frequency using bar diagrams in the case of discrete data and the histogram in the case of continuous data. Right. In this case we have a data set, and the graphic is going to be a sort of combination of graph and numbers, graph and text; that is why this is also called a 'Textual Graph' — it uses text as well as graph. The stem-and-leaf plot shows the absolute frequencies in different classes, like a frequency table or a histogram; it presents the same information, the only differences being that it is based on a quantitative variable and that it is a textual graph. The presentation in this graphic is based on the data arranged according to their most significant numeric digits. This type of graphic is actually more suitable for a small data set, so I would advise you, in the case of large data sets, to go for the histogram.
A stem-and-leaf plot is actually a sort of tabular presentation where each data value is split into two parts: one is called the 'stem' and another is called the 'leaf'. Usually in a stem-and-leaf plot, the stem represents the first digit or digits of the data, and the leaf usually represents the last digit of the data. For example, if I have a data value 56, then 56 is split into two parts, five and six: this five is called the stem and this six is called the leaf. Similarly, if I say that I have a number whose stem is two and leaf is eight, this means that the number is 28. So this is how we interpret stem and leaf.
In order to create the stem-and-leaf plot, what do we have to do? First we separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is the final digit. The stem may have as many digits as needed, but each leaf contains only a single digit — this is the key point. Then we write down the values of the stems in a vertical column, with the smallest value on the top, and draw a vertical line; after this we write the values of the leaves in each row, to the right of the stem, again in increasing order.
So first let me give you the R command, and then I will show you how the stem-and-leaf plot looks, which will make you understand better. In R we have a command stem, s t e m; this command produces the stem-and-leaf plot of the values in a data vector x, so we write stem and, inside the arguments, x. Then the length of the stem — how far it has to go — is controlled by a parameter which we call 'scale'. If I choose two values of scale, scale equal to 1 and scale equal to 2, the interpretation will be that when I choose the value 2, this gives me a stem-and-leaf plot which is roughly twice as long as with the default value, which is scale equal to 1. So, when we want to construct the stem-and-leaf plot, the R command is stem; inside the arguments I give the data and then the scale value, which is a number that controls the length of the plot. Similarly, there is another parameter, width, which controls the width of the plot, and its default value is 80.
Let me take an example and try to understand it. Suppose there are 15 lots of an item, and the numbers of defective items found in those lots are as follows: 46 defective pieces, 24 defective pieces, 53 defective pieces and so on. Yes, this is a small data set, consisting of only 15 values, and we would like to construct the stem-and-leaf plot for it. So I put all this data inside a variable defective, which looks like this, and then I will execute the R command over here.
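A sketch of the example follows; the 15 values below are reconstructed from the stems and leaves read out later in this lecture, so their order is my own guess:

    defective <- c(46, 24, 53, 18, 34, 35, 38, 44, 48,
                   49, 54, 56, 24, 25, 28)   # defectives in the 15 lots
    stem(defective)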
As soon as I say stem(defective), the outcome will look like this, which I will also show you on the R console. You can see here that this part is the stem and this part is the leaf, and this is the vertical line which I was discussing; you can imagine the vertical line as a partition, so that stem and leaf are separated. So the first question is: what is it trying to convey? The interpretation of this graphic is easier to understand if I add the scale parameter, so first I will show you here
the screenshot of the outcome; first I will give you the explanation and then demonstrate it.
Now, if you look here, first I am discussing the role of the parameter scale. You can see I have created two stem plots, number one and number two. In number one I am using scale 2, and in number two I am using the scale value 1 that I used
earlier. You can see there is a difference: here there are only one, two, three, four rows, whereas here there are one, two, three, four, five, six, seven rows. Then, if you come to the leaf part corresponding to stem two, you can see there are four values, 4 4 5 8, whereas in the case of scale equal to two there is only one value in that row, and a three is not present here, a five is not present here, and similarly a one is not present over here. So when I increase the value of the scale from 1 to 2, this gives me a more detailed stem-and-leaf plot. Now let us see, in figure number one, what the interpretation is. First look at the data — I will change the colour of my pen so that you can see it clearly — in the data set there is a value one eight, 18. Now look at this row: it indicates one as the stem and eight as the leaf, and this is the same value which is given here. Similarly, look at the second row, 2 | 4: what is it indicating? It also indicates a value inside our data, and if you look at the data, the value 24 is indicated here. These are the cases
where we have a single leaf digit only. Now I take the third row. This indicates three as the stem, and the first leaf digit gives 34; the next leaf is five, so with stem three and leaf five the value is 35; the third leaf is eight, so with stem three and leaf eight the number is 38. You can see that these numbers are present in your data: 34 is present here, 35 is present here, and similarly 38 is present here in this data. Similarly, if you interpret the fourth line, it represents four values, 44, 46, 48 and 49, and you can see that these values are also present in the data: 44, 46, 48 and 49. And similarly for the other values: this is 53, 54 and 56, and these are again given in the data set, 53, 54 and 56. So you can see that the stem-and-leaf plot represents a sort of histogram in terms of the frequencies of the classes.
(Refer Slide Time: 35:29)
Let me show you here the screenshot of the two plots, with scale equal to 1 and scale equal to 2.
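As a sketch, the two calls compared in the screenshot are:

    stem(defective, scale = 1)   # the default length
    stem(defective, scale = 2)   # roughly twice as long, finer stems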
So, in order to compare, I have made the histogram, which looks like this; I have rotated it so that it becomes comparable with the stem-and-leaf plot. You can see here in this data set the row 1 | 8, and this bar here indicates the frequency: the number of values between 10 and 20 is only 1. The same thing is happening here: we read 1 | 8 as 18, and the next value is 24, so the class intervals are 10 to 20, then 20 to 30 and so on; 30 comes because of this stem 3. So you can see that the number of values in the first interval is only one, this 18, and this is indicated on the y axis by 1.
In the second case also there is only one value, which is 24, so again the second bar indicates a frequency of one. Similarly, if I take the third one, its interval is 30 to 40, and there are three values, 34, 35 and 38, so the frequency f3 becomes 3, and you can see that this interval is represented by this bar, where the frequency is 3. And similarly in the fourth case, you can observe that the frequency is 4, f4 is 4, and this value is indicated over here; you can see that this frequency is 4.
So now you can see that the stem-and-leaf plot and the histogram are more or less comparable; the only difference is that the stem-and-leaf plot also gives an idea about the individual values, whereas inside the histogram the individual values are lost. I am not going to compare the advantages and disadvantages of these two graphics, but I will say: depending on the need, create a graphic according to your need. Now I will show you this stem plot on the R console. First let me copy this data, so I can close this thing, and you can see here this is the defective data. Right. Then I create the stem plot, and you can see this is the same stem plot that was presented earlier. Then I play with the scale part: this is what you obtained earlier, with scale equal to 1, the default, so you can see that these two are the same thing.
But if you add scale equal to 2, you get this. So I will now stop here, and I will also close our discussion on different types of graphics in one dimension. There are some graphics which are available for two dimensions, when we have two variables, but I will discuss those when we deal with data on two variables; at this moment I am taking up all the issues when we have data on only one variable. Now, I am not saying that these are the only possible graphics available — there are many more, and day by day new types of graphics are coming into the picture because of the use of software. But I am sure that this type of background will surely help you, and it will take away your fear that it is difficult to create graphics in R software in comparison to software where you simply have to do some clicks. What you have to do is just practice, and if you practice, I am sure you will be very, very comfortable in making more beautiful and more interactive graphics. So practice it, and we will see you in the next lecture with a new topic. Till then, good bye.
Lecture 14
Central Tendency of Data – Arithmetic Mean
Welcome to the next lecture on the course Descriptive Statistics with R Software. Up to now, in all the earlier lectures, we learned how to create different types of graphics, which were part of the graphical tools of descriptive statistics. From now onwards, we are going to handle the analytical tools, in one dimension, that is, when we have only one variable. When we handle more than one variable, I will again introduce graphical tools and analytical tools for two-dimensional data. So, the first step after we get the data is that we would like to get some quantified information that is hidden inside the data. As we have discussed, the data is very silent: data cannot speak, data cannot tell you, well, I have this value, I have this information. Graphical tools give you a visual impression, from which you have to use your knowledge, your common sense, your statistical knowledge and your information from the experiment, and you need to combine them to get a clear-cut conclusion. Now we would like to quantify that information. When we talk of the information contained inside the data, there can be enormous information contained in it, but the question is how we should take it out. We discussed that we would try to extract information on different aspects, such as central tendency, variation, symmetry etc. So we are going to start with a new topic in which I am going to discuss the central tendency of the data, and then I will discuss different types of tools which are used to extract information on the central tendency of the data. In this lecture I am going to discuss the arithmetic mean.
Now, whenever we conduct an experiment, there will be several aspects, and we collect data on those aspects. So finally the data set may contain several variables, and every variable may have many observations. Our basic objective is that we want to know the information contained inside the data, which is not possible just by staring at the raw values, so we are trying to develop tools which can help us in digging out the information from the observations. Now, the question is: what would we like to have? Suppose I have here a hundred data points, and every observation tells me something; the alternative is that instead of having a hundred pieces of information, I have one summary piece of information, which may provide more information to a common person and be more useful. So here we are looking forward to understanding some summary measures which can give us the information hidden inside the data.
Now let me take a simple example to explain my view. Suppose I want to decide what clothing to take on a trip in the month of May, and I have two choices: Lucknow, in Uttar Pradesh, which is quite hot during May, the summer season, and Srinagar, in Jammu and Kashmir, which remains cool during May. Now, we have collected the data on the day temperature of last year on, say, five days, and this data comes out to be 35, 37, 36, 40 and 38 degrees centigrade for Lucknow, and 20, 18, 17, 22 and 23 degrees centigrade for Srinagar. What information can I get from this data? The data can be large — I have taken only 5 values here for the sake of understanding, but there may be hundreds, thousands, even millions of values. From this data I would like to know, for example, what type of clothing I should take there: in case it is cold, I would take some woollen clothing, and if it is hot, I would take some simple cotton clothing. Right. So how do I get this information? By looking at these values — yes, they are telling me that here the temperature is quite high and here the temperature is usually low — but we would like to have the information in a summary form. Now, what we observe is that it is the human tendency to compile
the information in terms of averages. For example, in a class some students might have got 45%, somebody has got 55%, somebody has got 65%, and there will be more marks; but I am more interested in the average performance of the class. If I say here that the average marks in the subject in a class are 60%, then it makes more sense. Similarly, if I go to buy a tablet of medicine and the shopkeeper or the doctor tells me that this tablet can control the body temperature and bring the temperature down for six hours, what does this mean? The doctor is telling the average value, and we understand it very easily: these six hours cannot always be exactly six hours — it may be five hours, it may be seven hours, it may be 5.5 hours or 6.5 hours — but this data has been collected, and the doctor has found a sort of arithmetic mean or average, and he is convinced that if this medicine is given to a person having a fever, then this tablet can control the body temperature for, on an average, six hours. This is what we mean. So, in statistics, this concept
is referred to as the average or the central tendency of the data. Central tendency of the data means, for example, that if I have data which I plot here like this, then I try to see what is the point around which the data is concentrated; for example, here you can see this is giving us a central value around which the data is concentrated. In statistics we have several measures of central tendency, for example the arithmetic mean, geometric mean, harmonic mean, median, quantiles, mode, etc.
We are going to discuss these measures one by one, and I will also show you how to compute them in the R software. So first let me explain the arithmetic mean for ungrouped data. Ungrouped data means we have a variable X and we have collected the data on X as x1, x2, and so on up to xn. For example, if X is height, then x1 is the height of the first person, say 152 centimetres, x2 is the height of the second person, say 165 centimetres, and so on; these x's are in small letters. Okay. Now, the arithmetic mean of these observations is defined as

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,

which means I first have to sum x1 + x2 + ... + xn and then divide this sum by the number of observations; this is the meaning of this symbol. In order to compute it in R, the command is mean, and inside the arguments I give the data vector.
So let us try an example. Well, in case you want to know more about this mean — for example, I will be discussing here another aspect, how to handle missing values, but there can be a trimmed mean also and there are some more parameters for this command — I would request you to go to the help of mean and look into the different parameters. Now, coming back to my example: this is the same example that we considered earlier, in which 20 participants participated in a race and their times in seconds were recorded, like 32 seconds, 35 seconds and so on, and this data has been stored inside a variable time.
Now, in case I want to find the mean of this variable, I simply have to type mean and, inside the arguments, the variable name time, and I get the value 56. You can see here the screenshot. So I will show you how it works on the R console: let me store the data over here — you can see here this is time — and when I find the mean of time, the variable in which the data is contained, this comes out to be 56.
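A sketch of this computation; the lecture lists only the first two times (32 and 35 seconds), so the remaining 18 values below are hypothetical stand-ins chosen so that the mean works out to 56 as in the lecture:

    time <- c(32, 35, 45, 50, 55, 58, 60, 62, 65, 64,
              48, 52, 56, 59, 61, 63, 66, 57, 70, 62)
    mean(time)   # arithmetic mean; 56 for this stand-in data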
Now I will address one more aspect. If you remember, when we started our course, in some initial lecture I had given you an idea that there can be many situations where some data might be missing, and this missing data is represented as NA, capital N and capital A, which is a reserved word. So in case the data is missing, how would you compute the mean and other quantities? The way I am going to explain how to compute the mean when data is missing — the same concept will be used in all other cases, whether you want to find the variance, the standard deviation or the median; the same concept and the same option will be used. So here I will explain it in detail, and after that I will go through it quickly. When some data is missing, the mean command is still used to find the arithmetic mean of the data in the vector x, but another parameter is added: na.rm = TRUE, with TRUE in capital letters, T R U E. This tells R: please compute the mean after removing the NA, that is, the missing values.
Let me take the same example; you can see here that I have replaced the first two values by NA — I have underlined them so that you can easily see it — so in this case my data vector will contain the first two values as NA and NA. In order to store these values I am using a different name: the name time, which I used earlier, followed by .na. This .na is not a reserved word; it is simply to indicate that this is the same data on the time variable that we used earlier, now with NAs. Right.
So, let me show you. If you find the mean of the time.na vector alone, where the data is missing, this will not give you any numerical value; it will simply give you NA. Why? Because R is trying to compute the sum NA plus NA plus the remaining numerical values, divided by 20, and this sum cannot be evaluated, so the result is NA. Whereas, if you add the option na.rm = TRUE, what is R doing? It finds the sum of all the values after removing the NAs, divided by the number of available observations, which now becomes 20 minus 2: there are 20 observations and 2 of them are missing, so the divisor is 18. And now you can see that this gives the value 58.5 whereas, if you recall, the mean of time was 56. The value has changed because it has been computed after removing the missing values. Now, if I instead set na.rm to FALSE, then you will see that this is
giving me the outcome NA again, because na.rm = FALSE is the default. When you write mean of time.na without any option, na.rm is always taken as FALSE by default. So, whenever we find the value of the mean, R assumes by default that all the values are available. In case they are not, you need to inform the R software that there are some values which are not available, and ask it to compute the mean after removing those missing values. Okay.
Now, this is here the screenshot, I will try to show you on the R console.
But before that, let me show you that in this case, which I just discussed, the mean of time, which contains 20 values, is computed as the sum of all 20 values divided by 20, which comes out to be 56, whereas when you use the data with missing values, the mean is based on only 18 values and comes out to be 58.5. Now let me come to the R console and enter the
data; you can see that time.na contains these values. Now, if I find the mean of time.na, it comes out to be NA, but if I add one more parameter, na.rm = TRUE, then you can see the result is 58.5, the same outcome that we discussed in the slides. Okay.
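Putting the missing-value handling together, a minimal sketch (time is the illustrative vector defined above):

time.na <- time
time.na[1:2] <- NA            # make the first two values missing
mean(time.na)                 # NA: na.rm defaults to FALSE
mean(time.na, na.rm = TRUE)   # 58.5, the mean of the 18 available values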
So, let me come back to our slides and take up another aspect. Up to now we have discussed the arithmetic mean for ungrouped data; now I am going to discuss how to compute the arithmetic mean in the case of grouped data. You remember that in the case of grouped data, first you need to construct a frequency table. So, you may recall that while constructing the frequency table, we had constructed class intervals: the class intervals were built from the given set of data, dividing it into a suitable number of intervals of suitable widths. I am denoting these intervals as e1 to e2, e2 to e3, and so on. The first value e1, and e2 in the second interval, are called the 'lower limits', and similarly e2 in the first interval and e3 in the second interval are called the 'upper limits' of the intervals, and so on. In this way I have created k such classes, and
then I find the midpoints of these intervals. The midpoint of the first interval I denote by m1, which is simply (e1 + e2)/2; similarly, the midpoint of the second interval is denoted by m2, which is its lower limit plus upper limit divided by 2, and so on, and in this way I find the mid values of all the intervals. Based on the given data set I find the absolute frequencies: there are n1 values in the first interval, n2 values in the second interval, and so on, up to nk values in the kth interval. We also know that the sum of n1, n2, ..., nk is denoted by n, the total frequency, and the relative frequencies are obtained as f1 = n1/n, f2 = n2/n, and so on up to fk = nk/n. Obviously, if you sum all the relative frequencies, this comes out to be the sum of all the ni's divided by n; the sum of all the ni is n, so this becomes n/n, which is equal to 1, as written here. Right.
Now, I will define the arithmetic mean for this grouped data. It is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{K} n_i m_i$, where $m_i$ is the midpoint of the $i$th interval. If you simplify it using the relative frequencies $f_i = n_i/n$, an alternative form is $\bar{x} = \sum_{i=1}^{K} f_i m_i$. Based on this, there is another version of this type of mean in the case of grouped data, which is called the 'Weighted Arithmetic Mean', defined as $\bar{x}_w = \frac{\sum_{i=1}^{K} w_i m_i}{\sum_{i=1}^{K} w_i}$, where the $w_i$ are the weights assigned to the values, right. So, this is a more generalized version of the mean.
Now, suppose you want to find the arithmetic mean of a grouped data set in R. We know this is simply the sum, with i going from 1 to K, of fi times mi. If you remember, when we were discussing different types of mathematical operations with the R software, we saw that this type of quantity can be obtained as the
sum of the products of the elements of two data vectors, f and m. Right. Here f is the data vector of frequencies f1, f2, ..., fk and m is the data vector of mid values m1, m2, ..., mk. In order to do this, R already has a built-in function, called weighted.mean, spelled w e i g h t e d dot m e a n, and inside the arguments I have to give the two vectors m and f, where m contains all the midpoints and f contains all the frequencies. Obviously, when you want to compute the arithmetic mean of grouped data, the first objective is to find the frequencies, and you may recall that for this we used the command table; the outcome of table has two components, the first being the intervals and the second the frequencies. So, we need to extract the frequencies from the outcome of the table command. Now, you have to be watchful here: in this function weighted.mean, I use the symbol f to indicate the absolute frequencies, whereas in the formula on the slide I used f to indicate the relative frequencies. So, in this example, in order to compute the weighted mean, I will be using the identifier f to indicate the data vector of absolute frequencies and not the data vector of relative frequencies.
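As a sketch of the call, using the class midpoints and absolute frequencies of the race-time frequency table discussed below (classes of width 10 between 30 and 90):

m <- c(35, 45, 55, 65, 75, 85)   # class midpoints
f <- c(5, 3, 3, 5, 2, 2)         # absolute class frequencies (sum = 20)
weighted.mean(m, f)              # grouped-data arithmetic mean: 56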
So, let me take an example and show you how things work. Once again, this is the same example that I discussed earlier: there are 20 participants who took part in a race and their times were recorded in seconds, and this data has been stored in the variable time.
Now, we had converted this data earlier into the form of a frequency table, and you can see that I have created the class intervals like this: 31 to 40, 41 to 50, 51 to 60 and so on. There are altogether six class intervals, so K here is equal to 6. Then I have found the midpoints, for example 31 plus 40 divided by 2 for the first class, and so on; these are the values of m1, m2, up to m6. The absolute frequencies have been obtained as 5 for the first class, 3 for the second class, 3 for the third class and so on; these are the values of f1, f2, f3, up to f6, and their sum is the total frequency n. Right?
So now, just to give you a brief recall of how we had found the frequency distribution, I have taken some slides from the earlier lecture. You may recall that first we had defined a sequence from 30 to 90 at an interval of 10, by using the command seq, and we had stored this data inside a variable breaks. So, breaks contained the values 30, 40, 50, 60, 70, 80, 90, and then, using the data in time and this vector breaks, and declaring the
right hand side of each interval to be open, we had used the R command cut, c u t, to convert the data into factors, and this outcome was stored in time.cut.
And based on this time.cut data, we had found the frequency table of the data in the time vector by using table with the argument time.cut, and this was the frequency distribution that we had obtained. Right? Now, what do we have to do? We need to find the weighted arithmetic mean, that is, the arithmetic mean of this grouped data. This can be done as follows: the first step is to extract the frequencies from the frequency table. So, this is the frequency table and we want to extract only the data vector 5, 3, 3, 5, 2, 2, because these are the values of f1, f2, f3, f4, f5 and f6. In order to do this, we operate a command,
as.numeric, on this frequency table data. That is a s dot n u m e r i c, all in small letters, and inside the argument I have to give the data vector, which is the outcome of the frequency table. In this case, the frequency table is given by table with the argument time.cut, so I operate as.numeric on table(time.cut), and this gives me this outcome. You can see that this data is the same data we obtained before: this 5 is the same as this 5, this 3 is the same as this 3, and so on for each of the six frequencies.
So now, once we have obtained this vector f, which is the vector of frequencies, we can similarly define the vector m of the midpoints, and then I simply use the command weighted.mean with m and f, and this will come out to
be 56. Right. Okay. So, let me now operate this on the R console to show you.
Now, we have already entered the data on time, which is here. First I need to create the frequency table, so I will simply copy and paste the commands that we used earlier: this defines breaks, then I execute the command to get time.cut, and then I find the frequency table using time.cut, and you can see this is the same data set. Now I extract the data from this table using the command as.numeric; you can see this data is the same, the line which I am highlighting here and the line which I am highlighting there are the same. Right. Now I need to define the vector of midpoints, which is m, and once you use the function weighted.mean, this weighted mean comes out to be 56. So, this is how you can obtain the weighted mean in the case of grouped data.
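Collecting the whole procedure into one sketch (time is the illustrative vector from above; cut is used here with its default right-closed intervals, which reproduces the frequencies quoted in the lecture):

breaks <- seq(30, 90, by = 10)     # class boundaries 30, 40, ..., 90
time.cut <- cut(time, breaks)      # group the raw times into classes
f <- as.numeric(table(time.cut))   # absolute class frequencies: 5 3 3 5 2 2
m <- seq(35, 85, by = 10)          # class midpoints
weighted.mean(m, f)                # arithmetic mean of the grouped data: 56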
Refer slide time :( 30:16)
And now here is the screenshot of the same operation that we have just done. I would like to stop here in this lecture. If you look at what we have done: we have simply learned the concept of the arithmetic mean and how to compute it with the R software, both for grouped data and for ungrouped data. That is pretty simple, but it is more important to learn what other parameters can be used with the command mean; those can be looked up through the help menu. So, I would request all of you to take a small data set, say only a few numbers, compute the arithmetic mean by hand, and then compute it using the R software. Once you see that both results match, it will give you more confidence that yes, the software is doing the same thing that we wanted to do. So, you practice and we will see you in the next lecture. Till then, goodbye.
Lecture – 15
Central Tendency of Data - Median
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we discussed the idea of central tendency, and we planned to discuss several measures of the central tendency of data. In the last lecture we explained the concept of the arithmetic mean. Now, in this lecture, I am going to consider the aspect of partitioning values, and under that topic I will consider the median. Okay. So, let us begin.
Suppose I draw the frequency distribution, with the x values (say the class intervals) on the x-axis and the frequencies on the y-axis, and suppose we get a frequency curve like this one. You can see that the entire frequency is covered under this curve. Now, we would like to know how these values are partitioned. For example, suppose I want to divide the distribution into four equal parts: I can mark the first, second, third and fourth parts. From here to here is an area containing nearly 25% of the total frequency; similarly, from here to here is an area covering another 25% of the frequency, and the remaining two areas each cover 25% of the total frequency as well. These values are called partitioning values, and since they divide the total frequency into four equal parts, I can call them quartiles: the first value is called the 'First Quartile', the second the 'Second Quartile', and so on. Essentially, we are dividing the entire frequency into four parts. Similarly, if you want to divide it into more parts, say ten parts, then for this frequency curve I can mark partitions one, two, three, up to ten, and I can partition it in other ways also, for example into a hundred equal parts: 1, 2, 3, and so on up to a hundred such partitions. It is also not necessary that the partitions be of the same length; they can be of different lengths. So, by looking at the partitioning values, we can get an idea of how the frequency is distributed over the entire range of the frequency distribution, and this will also give us an idea of how the
frequency is concentrated in different regions of the frequency curve. We will take up all these partitioning values one by one, try to understand them, and see how to compute them with the R software. Put simply, the frequency distribution is partitioned to get an idea about the concentration of the values over the entire frequency distribution, and as I said there are several such measures: the median, quartiles, deciles and percentiles.
Now, going with that idea, suppose I plot the frequency curve like this, and suppose I want to divide the entire frequency into two equal parts, such that 50% of the frequency is on the left side of this red vertical line and 50% of the
frequency is on the right-hand side of it. The value corresponding to this line is called the 'Median'. So, the median is the value which divides the observations into two equal parts, such that at least 50% of the values are greater than or equal to the median and at least 50% of the values are less than or equal to the median. The median is thus a measure which divides the total frequency into two parts. If I say that the median of my frequency distribution is, say, 20, then I can say that 50% of the values are less than 20 and 50% of the values are more than 20. Okay. Now, if you compare the median with the arithmetic mean: in all those situations where we have extreme observations, that is, some observations taking very, very high values, the median is preferred. Why? Suppose I take two values, 2 and 4; their arithmetic mean is (2 + 4)/2 = 3. But if I take 2, 4 and 100, then the arithmetic mean becomes (2 + 4 + 100)/3 = 106/3, which is approximately 35.3. You can see there is a huge difference between 3 and 35.3, and this difference arises because the newly added value, 100, is very different from 2 and 4. So, the median is a better measure in such situations.
Now, I will give the definition and show how to compute the median in two cases: one for ungrouped data and the other for grouped data. First, let us understand the median and its computation when the data are ungrouped. Say we have observations x1, x2, ..., xn, so there are n values and they are ungrouped; you can also call them values of some discrete variable. Now, what do I do? I order the observations, and I denote the ordered values by x with the subscript in brackets: x(1) for the first ordered value, x(2) for the second, and so on. What does this mean? The value x(1) is the smallest, the minimum value among x1, x2, ..., xn, and x(n) is the highest or maximum value among x1, x2, ..., xn. For example, suppose I take four observations: x1 = 20, x2 = 10, x3 = 60 and x4 = 5. I first find the minimum value among 20, 10, 60 and 5, which is 5. So, the first
ordered value, which I denote as x(1), is 5. After that, I again find the minimum value among the remaining values, which are 20, 10 and 60; this gives the second ordered value, x(2) = 10. Similarly, x(3) is the minimum among the remaining values 20 and 60, which is 20, and obviously the largest value, x(4), is 60. So, you can see the difference between the raw observations and the ordered observations, and the relationship between them: the first ordered value, 5, is the same as the fourth unordered value; the second ordered value is the same as the second unordered value; the third ordered value, 20, is the same as x1, the first unordered value; and the fourth ordered value, 60, is the same as the third unordered value. So, you can see how the ordered and
unordered values are interrelated. Okay. So now, the first step in finding the median of ungrouped data is to order it. Once you have ordered it, there are two situations: the number of observations can be odd or even. If the number of observations is odd, then the median is the ((n + 1)/2)th ordered value. If the number of observations is even, then the median is the average of the (n/2)th ordered observation and the ((n/2) + 1)th ordered observation. So, this gives you the definition of the median: you simply have to pick the appropriate ordered value according to this rule, and that will give
you the median. Now we consider the median for grouped data. We know that whenever we have grouped data, or data on a continuous variable, the first step is to create the frequency table, and in the frequency table we will have classes.
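Before moving on, a quick sketch of the ungrouped rule with the four values used above; R's median command implements exactly this odd/even rule:

x <- c(20, 10, 60, 5)
sort(x)     # ordered values: 5 10 20 60
median(x)   # n = 4 is even: average of the 2nd and 3rd ordered values, 15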
Refer slide time :( 12:58)
So, we start our discussion by assuming that we have the data and, from the data, we have created the frequency table. This frequency table has K classes, denoted A1, A2, ..., AK. The entire frequency is distributed among the K classes, and we assume that the absolute frequency of the ith class is ni; that is, there are ni observations in the ith class Ai. From this absolute frequency I can compute the relative frequency, which we denote fi = ni/n, where n is the total frequency. One thing I would like to mention, and I would like to draw your attention to it: please note the definition and symbol of fi. In the case of the median I use fi to denote the relative frequency, but later on, when we consider other types of measures, there is a possibility that I may define fi to be the absolute frequency, so be watchful. Now, after this, what do we have to do? I have the classes A1, A2, ..., AK, and we know by
definition that the median is the value where the total frequency is divided into two equal parts, so there will be a class where that half of the frequency lies. I will find the class where half of the frequency is lying, and denote this class by Am; so Am is the interval, or class, which includes the median. I can define this median class as the mth class for which $\sum_{i=1}^{m-1} f_i < \frac{1}{2}$ while $\sum_{i=1}^{m} f_i \ge \frac{1}{2}$. The expression for finding the median in the case of grouped data is then $\bar{x}_{med} = e_{m-1} + \frac{\frac{1}{2} - \sum_{i=1}^{m-1} f_i}{f_m}\, d_m$. Here, the quantity $e_{m-1}$ denotes the lower limit of the median class, and the quantity $d_m$ denotes the width of the median class, where width means upper limit minus lower limit. So, this is the class width, and then there is the relative
frequency fm, which is the relative frequency of the median class. Based on this, we compute the median of the given grouped data, and we denote it by x̄ med, med being short for median. Now, let us take an example and see how to get it done. Here, I would like to inform you that when we compute the median with the R software, then, at least to the best of my knowledge, there is only one command available inside the base package to compute the median; R has no separate commands for computing the median of grouped and ungrouped data. So, through this example I will show you how the different values, such as the relative frequency of the median class, are chosen, and then I will show you how the median is computed with the R software. But I will not be able to show you how to specifically compute the median for grouped data. Well, one can write a small program or a small function to compute such a thing, but at least I will not be handling it here.
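As a hedged sketch of such a small function (this is not part of base R; grouped.median is a hypothetical helper implementing the formula above, with e the vector of class boundaries and n the vector of absolute class frequencies):

grouped.median <- function(e, n) {
  f <- n / sum(n)                       # relative frequencies f_i
  cf <- cumsum(f)                       # cumulative relative frequencies
  m <- which(cf >= 0.5)[1]              # index of the median class
  below <- if (m > 1) cf[m - 1] else 0  # sum of f_i below the median class
  d <- e[m + 1] - e[m]                  # width d_m of the median class
  e[m] + (0.5 - below) / f[m] * d       # e_{m-1} + ((1/2 - sum)/f_m) d_m
}

# Boundaries and the first three frequencies follow the example below; the
# last two frequencies are illustrative, chosen so that the total is 31
grouped.median(c(15, 20, 25, 30, 35, 40), c(0, 12, 18, 1, 0))   # about 25.97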
So, let us consider this example, in which data is collected on the time taken by a customer in a shop inside a shopping mall, and this time is recorded on different days of the month. Assuming there are 31 days in the month, the data has been collected day by day: for example, on day one the customer takes 30 minutes, on day two the customer takes 31 minutes, and so on. Now, I will find the median of this data, first treating it as ungrouped data; then I will group it and once again find the median. Okay.
So, here, when I consider this data as ungrouped data, the number of observations is 31, so n = 31 and (n + 1)/2 = 16. Now, we have ordered the data; well,
I am not showing the ordering here, but you can do it, and then I find the ((n + 1)/2)th ordered value, which is the 16th value in the ordered data. This value comes out to be 26, so 26 minutes is the median time. Now, suppose I convert the same data into an even number of observations: I can drop the last observation and consider only 30 observations, so the number of observations becomes 30, and then n/2 = 15 and (n/2) + 1 = 16. According to the definition, the median is the mean of the (n/2)th ordered value and the ((n/2) + 1)th ordered value, so essentially it is the arithmetic mean of the fifteenth and sixteenth ordered values. From the data we find that these two ordered values are 26 and 27, so the median comes out to be 26.5. This is how we compute the median in the case of ungrouped data.
Similarly, if I consider this data as grouped data, then we create the frequency table. You can see I have already created it: these are my class intervals of width five units, like 20 to 25, 25 to 30 and so on. In the second column I have found the absolute frequencies, and in the third column I have computed the relative frequencies of all the classes. Now you can see the advantage of working with relative frequencies: the total relative frequency over the five classes, with i going from one to five, is always equal to one. So, I simply need to find the class m for which the sum of f1 up to fm−1 is smaller than 0.5 while the sum of f1 up to fm is greater than or equal to 0.5. Once I do this, I observe that it is the third class: the sum of the relative frequencies of class 1 and class 2, that is f1 + f2 (the sum up to m − 1 = 2), comes out to be 12 upon 31, since f1 is 0 and f2 is 12 upon 31, and this is smaller than 1/2; while the sum of f1, f2 and f3 comes out to be 30 upon 31. Where is this 30 coming from? The absolute frequency of class one is zero, that of class two is 12 and that of class three is 18, so this is essentially 12 plus 18, which equals 30, and 30 upon 31 is greater than half. So, I can say my median class is the third class, and m = 3.
Now I find the lower limit of the median class, which is 25; the relative frequency of the median class, which is 18 upon 31; and the width of the median class interval, which is 30 minus 25, equal to 5. Once I substitute all these values into the expression and simplify, I get 25.97. So, you can see there is not much difference in the values of the median whether we compute it from the grouped data or the ungrouped data. Right. For the ungrouped data, you may recall, the value came out to be 26, and for the grouped data it comes out to be 25.97. If your data is proper, then practically there is no difference whichever formula you use, and possibly this is the reason why R has not implemented a separate grouped-data command. And 25.97 is also close to 26, so for all
Refer slide time :( 24:18)
practical purposes they agree. Now I come to the R software aspect. In R, to compute the median, the command is median, m e d i a n, where x inside the arguments is the data vector. The median command also has several options, so I would once again request you to look into the help menu and see what different parameters can be given inside the arguments. But here I would certainly like to address how you would compute the median when some data is missing and is represented as NA. In that case, use the same command median, give the data vector, and use the option na.rm = TRUE. This will give you the value of the median.
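A minimal sketch; apart from the few values quoted in the lecture, the 31 daily times below are illustrative, chosen so that the median is 26 as in the lecture:

minutes <- c(30, 31, 26, 27, 25, 28, 29, 26, 24, 27, 26,
             25, 28, 26, 27, 29, 26, 25, 24, 26, 27, 28,
             26, 25, 27, 26, 28, 26, 25, 27, 26)
median(minutes)                    # 26, the 16th ordered value

minutes.na <- minutes
minutes.na[1:2] <- NA              # first two days missing
median(minutes.na, na.rm = TRUE)   # 26, from the 29 available values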
So, I now collect all the data on the minutes inside the data vector minutes, and then I simply find the median of minutes, which comes out to be 26. You can see this matches the value we obtained earlier, and this is the screenshot. I would also like to show you, with the same data, how to handle missing values. Inside the same data set I make the first two values unavailable and replace them by NA, and in this case I store the data inside a new data vector, minutes.na. I would like to mention one thing here: in R there is an option to name variables using the dot, or full stop, sign. That is why I am writing minutes.na; it is not a built-in function or a rule, it is simply denoting, just for the sake of convenience, that this is the same minutes data but now with missing values. Using this data set I find the median with the same command, median of minutes.na, and with na.rm = TRUE. You can see the screenshot; this value comes out to be, again, 26. So,
before I do something more, let me show you how to compute these things on the R console. So, let me enter the data: you can see these are the values of minutes, and then I find the median of minutes, which comes out like this. Similarly, to consider the missing values, I once again store the data inside a new vector, minutes.na. You can see this is the data minutes.na, and I try to find the median of this data. Right. And if you look, I have not used the option na.rm = TRUE, so I get NA; that is a very common mistake. So now let me use na.rm = TRUE, and you can see the value once again comes out to be 26. Now, I would like to stop here. I have given you a detailed overview of the median: the concept, how to compute it, and how to compute it in R. Please practice it: take some data and try to
calculate the median manually, and then do the same thing with the software and see what the difference is. Usually I expect that, unless the data is extremely heterogeneous, this difference will be very, very small, and for all practical purposes the values of the median that you compute from the R command, whether for the grouped data or the ungrouped data, will not differ much, so you can accept them. So, you practice and I will see you in the next lecture. Till then, goodbye.
Lecture - 16
Quantiles
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we started a discussion on partitioning values: we discussed the concept of the median, learned how to compute it manually, and then how to do it with the R software. Now I would like to extend this concept further. We have seen that the median is the value in a frequency distribution which divides the total frequency into two equal parts. So now the question is: why only two equal parts? There can be more; they can be four, they can be ten, they can be a hundred, and the partitions can be of equal or unequal widths. Let us try to understand this concept in this lecture. These partitioning values are generally called quantiles: quantiles are nothing but the values which divide the total frequency into a chosen number of parts.
So, if I plot the frequency distribution like this one, we have understood that dividing it into two parts gives the median. If I now want to divide it into four equal parts, this can be the first part, then the second, third and fourth parts; I can denote the first partition point by q1, the second by q2, the third by q3 and the fourth by q4. Similarly, I can divide it into more equal parts, say ten equal parts, marking 1, 2, 3, up to 10; this gives the first partition, the second partition, and so on, and I can increase the number of partitions further. These partitions can be of equal or unequal widths. Right. In general, these partitioning values are called quantiles. Just as the median partitions the total frequency into two equal parts, the quantiles partition the total frequency into other proportions, and these proportions are decided by us.
For example, take the 25 percent quantile. If you remember the definition of the median, the median was the value which split the data into two parts such that at least 50 percent of the values were lower than or equal to it and at least 50 percent were greater than or equal to it. Similarly, the 25 percent quantile splits the data into two parts such that at least 25 percent of the values are less than or equal to that quantile and at least 75 percent of the values are greater than or equal to it. If I plot it, it looks like this: this is the quantile Q, this is the 25 percent part, and on the right-hand side is the 75 percent part. Similarly, the 50 percent quantile splits the data into two parts such that at least 50% of the values are less than or equal to it and at least 50% of the values are greater than or equal to it. Once again, if I plot the frequency distribution, the value of the quantile divides the total frequency into two parts such that this part is 50% of the total frequency and the part beyond the vertical line is also 50% of the total frequency. This value partitioning the total frequency into two parts is called a quantile; this particular one is the 50 percent quantile, and you may notice that the 50 percent quantile is nothing but the median. This is the
basic idea of the quantiles. Now, just as we have defined the 25 percent and 50 percent quantiles, I can similarly choose any value, say 3 percent, 5 percent, 11 percent, 18 percent, 90 percent and so on, whatever we want. So, in general, I can extend this to a general definition of quantiles in terms of a value α. The definition of the α × 100 percent quantile requires choosing the value of α between 0 and 1: if I take α to be 0.20, this becomes the 20 percent quantile; if I take α to be 0.30, this becomes the 30 percent quantile, and so on. The α × 100 percent quantile is the value which divides the data into two proportions, one consisting of α × 100 percent of the values and the other containing (1 − α) × 100 percent, in such a way that at least α × 100 percent of the values are smaller than or equal to the value of that quantile and at least (1 − α) × 100 percent of the values are greater than or equal to that quantile. In general, if I plot it
here: this region consists of α × 100 percent of the area, and the region with the dots on the right-hand side is (1 − α) × 100 percent of the area, these being the values of the frequencies. This is a graph with the x-axis here and the y-axis here, and the value marked on the x-axis is the quantile, or more specifically the α × 100 percent quantile. Now, I can choose different values of α: α can be a single value, the values of α can form a sequence at equal spacing, or the values of α can form a sequence with unequal spacing. First, let us understand the basic definition of this quantile, and in order to understand it I suggest you recall the discussion in the earlier lecture, where we discussed the concept of ordered and unordered data and, based on that, defined the median for ungrouped data.
Here also, let the data be x1, x2, ..., xn, and yes, this is ungrouped data. Right. This data has been ordered, so x(1) denotes the minimum value among x1, x2, ..., xn, and x(n) denotes the largest or maximum value in the data; similarly, x(2) denotes the second ordered value, that is the second smallest value. Now, if you recall the definition of the median for the two cases, when the number of observations was odd or even, that definition is now extended using the quantity nα. The definition of the quantile is the following: first decide whether nα is an integer or not. If nα is not an integer, then choose k as the smallest integer greater than nα; corresponding to this value of k, the kth ordered value gives the α × 100 percent quantile. If nα is an integer, then the quantile is the arithmetic mean of two values: the (nα)th ordered value and the (nα + 1)th ordered value. Both these values can be obtained from the data; simply take their arithmetic mean and that gives the value of the α × 100 percent quantile. After understanding this definition, I can see that just by choosing different values of α, I can divide the total frequency into different numbers of groups. For example, suppose I
choose α to be 0.25, 0.5, 0.75 and, obviously, the last value of α will be 1, because α always lies between 0 and 1. What I am essentially doing, if I draw the frequency distribution like this, is dividing the entire frequency into four equal parts: α = 0.25 corresponds to the first 25 percent, the second section is another 25%, the third section another 25% and the final section the last 25%. These are the four partitions: the first, the second, the third and the fourth, and corresponding to them I can mark on the x-axis the values q1, q2, q3 and, for the final value, q4. These values are called quartiles. The quartiles are the particular values of the quantiles obtained when the entire frequency distribution is divided into four equal parts, and they are denoted by q1, q2, q3 and q4. So, q1 denotes the first quartile, which has 25% of the
observations. And q2 is the second quartile, which has 50% of the observations. What does this mean? If you look at my frequency distribution: this is the first quartile and this is the second quartile. The value q2 takes care of all the data up to this point, so the two areas of 25% and 25% are added, corresponding to 50 percent of the data. That is why I say that the second quartile has 50% of the observations, and it is the same as the median. Similarly, the third quartile is denoted by q3 and has 75% of the observations, and it goes without saying that the fourth quartile, q4, takes care of all one hundred percent of the observations. So, the rule is very simple: I divide the entire frequency into four equal parts, and those partition values are called quartiles; they are particular values of quantiles. Okay. Similarly, you may want to divide the total frequency into ten equal parts: here we divided it into four equal parts, and now I change it to ten equal parts.
In this case the values are called deciles. Deciles are the values which divide the entire frequency distribution into ten equal parts, and we denote them D1, D2, ..., D10. It looks like this: if I have this frequency distribution, I divide it into ten parts, and the partition values on the x-axis are D1, D2, D3 and so on up to D10. Every partition takes care of ten percent of the frequency. So, the first partition takes care of only 10% of the frequency, and I define D1, the first decile, as the value which has 10% of the observations below it. Coming to the second value, D2, it takes care of 10% plus 10% of the frequency, so the second decile is the value of the quantile which has 20% of the observations, or 20% of the total frequency. Similarly, the third decile takes care of 30% of the values, the fourth of 40%, and the fifth of 50% of the observations, which is the same as the median; likewise, the ninth decile takes care of 90% of the observations. So, when we divide the total frequency into ten equal parts, we call these quantiles deciles. Similarly, I can divide the total frequency into a hundred equal parts: suppose this is my frequency
graph, and I make one, two, three, four, up to a hundred partitions. A hundred partitions means every partition takes care of one percent of the total frequency, and the values are denoted P1, P2, with the last one being P100. So, the percentiles are the values of the quantiles which divide the given data into 100 equal parts, and they are denoted P1, P2, ..., P100. The first percentile, P1, takes care of 1 percent of the total frequency, or 1 percent of the observations; similarly, the second percentile, P2, takes care of 2 percent of the observations. The 50th percentile, P50, takes care of 50 percent of the observations and is the same as the median, and similarly the 90th percentile, P90, takes care of 90 percent of the total frequency. You may have heard that some examinations have a condition that candidates should have obtained marks lying in the top 20 percentile or top 30 percentile. What does this mean? What is the top 20 percentile?
The maximum value of the percentile can be a hundred, so lying in the top 20 percentile means lying between the 80th percentile and the 100th percentile. What they are saying is this: suppose a large number of students have appeared in the examination, the frequency curve of the marks obtained has been prepared, and the frequencies have been divided into a hundred equal parts. The first value denotes the first percentile P1, the second denotes P2, somewhere here is P80, the 80th percentile, and finally here is P100, the last percentile. The condition is that any candidate whose marks lie in this top region, that is, marks greater than the 80th percentile, is eligible to appear. So this is the basic idea of how percentiles are used.
Now, after this, let me explain how to compute these different types of quantiles in the R software. The basic function is quantile, q u a n t i l e, and inside the arguments we can give different things, but it is compulsory to give a data vector, which I denote x. After this we have several options; I am going to illustrate the use of two of them, probs and na.rm, and I will also discuss type. The first argument gives the data vector for which the quantiles are needed. The option p r o b s denotes a vector of probabilities between 0 and 1, so this is essentially the value of α, and it depends on which quantiles you want to obtain; by controlling the values in probs we can generate different types of quantiles, such as quartiles, deciles, percentiles or anything else. The third option is na.rm: as we know, if there are missing values in the data and I set na.rm = TRUE, then the quantiles are computed on the basis of the data that is available inside the vector x. After this there is another parameter, type, which takes a value between 1 and 9, that is 1, 2, 3, 4, 5, 6, 7, 8 or 9, and it informs the quantile command which of the nine available algorithms for computing the quantiles is to be used. What does this mean? Well, once we have the data, the data has to be ordered using some algorithm, then the algorithm has to partition the values and choose the correct value of the quantile. This entire process is based on certain algorithms, and different people have given different algorithms to compute the quantiles. It is possible that when we compute the quantiles based on different algorithms, their values differ a little; it is essentially the choice of the experimenter which algorithm to use. R gives you the facility to choose any of the algorithms available inside the R software to compute the quantiles. Okay. So, for example,
Refer slide time :( 22:59)
if I set type = 1, then R uses the algorithm based on the concept of the inverse of the empirical distribution function; if you choose type = 2, this is similar to type 1, but with averaging at the points of discontinuity; and if you choose type = 3, this is based on the nearest even order statistic. I am definitely not going to discuss these algorithms here; my simple objective is to show how to use them on a given set of data. So, now I will take an example and show you how you can compute the quantiles, in particular the quartiles.
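A minimal sketch of the calls that follow; height stands in for the lecture's vector of 50 heights in centimetres, with illustrative values, so the printed quantiles will differ from the lecture's:

set.seed(1)
height <- sample(120:166, 50, replace = TRUE)    # 50 illustrative heights (cm)
quantile(height)                                 # default output: the quartiles
quantile(height, probs = seq(0, 1, by = 0.25))   # the same, with explicit probs
quantile(height, type = 3)                       # a different quantile algorithm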
So, now I consider the example I had also considered earlier, where there is data on the heights of 50 persons, recorded in centimetres, and this data has been stored inside a data vector called height. Now I compute
different types of quantiles for this data. When I write quantile and then height, the height data vector is given inside the argument brackets, and this is the outcome. You can see there are two rows. The first row shows 0 percent, 25 percent, 50 percent, 75 percent and 100 percent, and just below it are the values 121.0, 137.5, 146.5, 158.0 and 166.0. The first value is the quantile at zero percent; the second is the value at twenty-five percent; the third, 146.5, is the fifty percent quantile; the fourth, 158.0, is the seventy-five percent quantile; and the last value, 166.0, is the hundred percent quantile. Essentially, the first of these is the first quartile Q1, the next is the second quartile Q2, the third is the third quartile Q3, and the last is Q4, the fourth quartile. You can see that Q2 is also the value of the median, and it is the same as P50 or D5. Right? So, what you have to observe is that when you use the command quantile, the default output is the quartiles. Now, in case
you want to find other things, like deciles and percentiles, you simply have to control the parameter probs. Before I go to deciles and percentiles, let me show that if I choose probs to be a sequence from zero to one at an interval of 0.25, then the command seq creates a data vector 0.00, 0.25, 0.50, 0.75, 1.00. I am now asking R to compute the quantiles of the data in the height vector using these probabilities. You can see the outcome looks like this, and you have to understand how things correspond: this 0% is the same as the 0.00, the 25% is the same as 0.25, the second value in the probs vector; 50% is the same as 0.50, the third value in the probs vector; 75% is the same as 0.75, the fourth value; and 100% is the last value, 1.00, in the probs vector. These are the values of the quantiles, and essentially they are the quartiles. You can compare these values with the values obtained directly by using the quantile command with its defaults.
And this is the screenshot, which I will now show you on the R console. Let me first copy the data onto the R console: you can see this is the height data, and I simply write quantile of height. Right. This gives me these values, and if I use probs with the sequence from 0 to 1 at an interval of 0.25, you can see these values are the same values as before. Right, okay.
Now, after this, let me show you how to find the other types of quantiles, like percentiles and deciles. First, let us see how to compute deciles. You have understood that it is very simple: the command is the same, and you simply have to change the data inside the probs vector. The deciles correspond to controlling the value of α to be 0.1, 0.2, up to 0.9, so I simply have to generate a sequence starting from zero and ending at one at an interval of 0.1. So, using
the command seq, I generate such a sequence, and its values come out like this, as you can see. Now I use the same command as earlier, having simply replaced the earlier value 0.25 by 0.10, and you can see I get the ten values of the deciles. For example, as explained earlier, the zero percent value corresponds to α = 0 and the 100 percent value corresponds to α = 1, and in between are the values of D1, D2 and so on. And here is the screenshot, which I will also show you on the console.
Refer slide time :( 30:37)
Now we compute the percentiles P1, P2, ..., P100. It is pretty simple: I have to generate a sequence from 0 to 1 at an interval of 0.01. I generate this sequence, and you can see these are the values obtained: 0.00, 0.01, 0.02, up to 1. Now, using the earlier command, I generate the quantiles, having simply replaced the interval in probs by 0.01, and you can see that the values of all hundred
percentiles have been obtained: for example, this is the value of the first percentile, this of the second percentile, this of the third percentile, and similarly, at the end, this is the value of the 99th percentile and this of the 100th percentile. So, if an examination requires that candidates must have marks above the 80th percentile, this is controlled by the 80% entry, and P80 here is 159; this means the candidate needs marks of more than 159. Before I go further, let me show these things on the R console. I compute the deciles, and you can see the ten decile values; similarly, if I change only the last value of the seq command, this gives me the hundred values which are the percentiles, and they have been obtained here.
Now I will continue with the same example and show you how to compute the quantiles when data is missing. I have replaced the first two data values by NA and stored this data in height.na, as we did earlier, and now I repeat
the same computations for this height.na. You can see that as soon as I give only the quantile function without any specification of na.rm, this gives me an error. Right. So, I add na.rm = TRUE in the quantile command, and this gives me the output: the first quartile, second quartile, third quartile and fourth quartile. Right. Similarly, if I want to find the deciles in the same data, I use the same command with probs equal to the sequence from 0 to 1 at an interval of 0.1, and this gives me the deciles that we obtained before; this is the screenshot of the same operation.
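A minimal sketch of this missing-value handling, with height as defined in the earlier sketch:

height.na <- height
height.na[1:2] <- NA                  # first two heights missing
# quantile(height.na)                 # would stop with an error
quantile(height.na, na.rm = TRUE)     # quartiles of the available data
quantile(height.na, na.rm = TRUE,
         probs = seq(0, 1, by = 0.1)) # deciles of the available data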
And now I will show this on the R console to make you confident. I store the data as height.na; you can see this is my height.na, and if I try to find the quantiles of height.na, this gives me an error. So, I have to add na.rm = TRUE, and this gives me the values of the quartiles. Similarly, if I want to find the deciles, I add the argument probs equal to a sequence starting from 0 to 1 at an interval of 0.1, which is like this, and you can see here the deciles.
And similarly, if you want to compute percentiles, you simply have to change the interval in the sequence to 0.01.
Refer slide time :( 35:13)
Now I would like to show you how to compute some other percentiles which are not at equally spaced points. Suppose I want to compute the 14th percentile; so I will give it the value 0.14. Suppose I want to compute the 23rd percentile; so I will give it the value 0.23. Suppose I want to compute the 79th percentile; so I will give it the value 0.79. Okay. And now, if I write down the quantile of the data heights with probs given as above, you can see here that you are getting the 14th, 23rd and 79th percentiles. So, this way you can actually compute different types of percentiles.
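As a one-line sketch, again with a hypothetical height vector:

    quantile(height, probs = c(0.14, 0.23, 0.79))   # 14th, 23rd and 79th percentiles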
And if you learn statistics, there is a topic called testing of hypothesis, where we define the level of significance; the definition of the level of significance is nothing but computing a particular percentile. So, this concept is going to be very, very useful for you when you go for further courses in statistics. So, I would like to stop here, and I would once again request you to look into these concepts and try to practice them on the R software, and we will see you in the next lecture again. See you and goodbye.
Lecture 17
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we had a discussion on computing the quantiles, which are the partitioning values, and they help us in determining the central tendency of the data. So we will continue on the same topic of the different other tools to measure the central tendency of the data, and in this lecture we are going to address three more tools: mode, geometric mean and harmonic mean.
So let us try to start our discussion with mode. So what is a mode? You might have
encountered in your day-to-day life that if you go to a clothing shop, then whenever you want a shirt of your size, it is usually available. How does this happen? Suppose you are the shopkeeper and you want to open a clothing shop, and suppose you want to sell shirts. Some sizes of shirts are not that much in demand, so you want to know which is the most popular size of shirt that you should keep in your shop in more quantity. So in this case, what you would like to do is take a sample of data: you will ask people what their shirt size is, and then whatever size of shirt has a higher frequency, you would try to keep more of it, and the size of shirt which has a smaller frequency, you would try to keep in a smaller number. So this is basically done with the help of the mode.
So suppose a fruit juice shop owner wants to know which fruit juice is preferred more, or, as I said, a clothing shop owner wants to know which size of shirt or trouser is more in demand, or highest in demand, and so on. So in such cases the concept of mode is used. So the mode of n observations x1, x2, ..., xn is the value which occurs with the highest frequency. So essentially the mode is the value which occurs most frequently in the set of observations. How is this word frequently coming into the picture? That is coming because of the frequency distribution, or the frequency of the values. So the definition of the mode is interconnected with the frequency distribution. An
advantage of mode is that mode is not at all affected by the extreme observations. For example,
if I say I have a data set here 1, 1, 1, and say here 3, 4, and 6 so you can see here that number 1 is
occurring here three times, 3 is occurring one time, 4 is occurring one time and 6 is occurring
one time. So the value occurring the maximum number of times here is 1, and the mode here is going to be 1. Now, in case I try to add a value 500, the mode is not going to change, because that value is also appearing only one time. So that is the basic idea: the mode is not at all affected by the extreme observations. And in case you try to plot the frequency curve, suppose I have these two types of frequency curves, one like this and another like this. So you can see that here there is only one mode, but here, although the mode is only one, there are still two peaks, so we call such a distribution bimodal, and in the first case it is called unimodal. So all the distributions having only one mode are called unimodal, and all those distributions which have two peaks are called bimodal.
So now we try to define the mode for two cases: when we have grouped data and when we have ungrouped data. Here I would like to inform you that in the base package of the R software, there is no direct command to find out the mode. Well, there is a command mode, m-o-d-e, but be careful: that mode is not used to find the mode that we are trying to find as a measure of central tendency. That command is used to describe the storage behavior of the data, that means whether the data is numeric or not, something like this. So please be careful. Although I will try to show you here that by writing some special functions, or say some special commands, we can find out the mode, you cannot use the function or command m-o-d-e for it.
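For example, as a quick sketch of this caution:

    x <- c(1, 1, 1, 3, 4, 6)
    mode(x)   # returns "numeric": the storage mode of x, not the statistical mode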
So first we try to discuss the mode for ungrouped data. For ungrouped data, or for discrete variables, the mode is very simple. The mode of a variable is the value of the variable which has got the highest frequency, and obviously this is true in the case of a unimodal distribution. So what you have to do here: you will have, say, the value x1 with frequency f1, x2 with frequency f2, and so on up to xn with frequency fn, and you simply have to first choose the maximum frequency. Suppose this maximum value occurs as, say, fm; then, corresponding to fm, whatever is the value xm, that is going to be the mode. So obviously this is true when we have a unimodal distribution.
So in order to find the mode in the R software, we will go like this. I am simply going to write here two steps, and they simply copy the same thing that I had just told you: you try to create a frequency distribution, try to choose the maximum value of the frequency and, corresponding to the maximum value of the frequency, try to choose the value of x; that is going to give you the value of your mode. So when we try to compute the mode in the R software, I am giving you here two steps, step one and step two. Step one is very simple: whatever is your data, try to create a table of that data vector, or that can be a matrix also. How will that be useful? I will try to show you. So whatever is your data, either in the form of a vector or a matrix, try to convert it into a table. So I will try to store my data inside a variable named data, d a t a, and whatever is the outcome of the table of this data, I am going to store, say, as modetab.
So in order to use this, I will use here the command table that we had used earlier, and inside the arguments I will write as.vector, a s dot v e c t o r; that means you consider the data as a vector, and this data which has to be considered as a vector is given inside the variable named data. So I will show you later on, but here I can tell you that in the outcome of this command, the first row will be a sorted list of all the unique values in the data vector data. Now after this you have to operate one more command, which I am writing here. This is names, and inside the arguments you have to write the data that you obtained in the first step, which is called modetab; then, inside the square brackets (remember, these are square brackets), you have to write modetab, the data you obtained in step one, followed by a double equality sign. What is this double equality sign? This is the sign of the logical operator for equality. For example, we have the less than sign, the greater than sign, and the equality sign; the equality sign means equal to, but the logical equal sign is denoted by two equality signs. And then I am trying to find the maximum value of the data which is inside modetab.
Here I would just like to inform you that I have used the command names on modetab; this function is generally taught when you learn about R's logical operators and how to extract the names from R data and so on, but here I am not going into those details. I would simply request you to please try to use this command. So now I simply take an example and try to show you how these things happen.
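Putting the two steps together as a sketch (the ordering of the values inside the vector is my own; the lecture's vector contains the same values):

    # Step 1: frequency table; first row = sorted unique values, second row = counts
    data <- c(2, 2, 3, 4, 5, 6, 10, 10, 10, 10)
    modetab <- table(as.vector(data))

    # Step 2: extract the value(s) whose frequency equals the maximum frequency
    names(modetab)[modetab == max(modetab)]   # "10"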
So I am simply taking the data, which is here like this, inside the data vector, and from there I am creating the table of this data using the same command, and here is the outcome. You can see here that in the data vector the value 10 appears four times; in the first row of the table are the sorted values of all the unique values in data. You can see here that the unique values are 2, then 3, then 4, then 5, then 6, and the value 10, which
is repeated four times. And now, the second row counts how many times each value occurs. For example, if you see here, 2 occurs two times, one, two, so this is 2. Three occurs only one time, so this is 1. Four occurs only one time, so this is 1. Five occurs one time, so this is 1. Then 6 occurs one time, so this is 1. And 10 occurs one, two, three, four times, so this is 4 under the value 10. So what do I essentially need to do? In order to find the mode, I simply have to extract the value from the first row corresponding to the maximum frequency. The maximum frequency is 4, this one here, and in the second step I have to find the value corresponding to this maximum frequency, which is 10. So the command in the second step extracts this value 10 from the first row, and
this gives me here 10. So this is how you can compute the mode, but definitely I would like to mention that this is not the only way out; you can define it in your own way, using your own logic, also. And this is here the screenshot, and I would request you to please try to execute it on your data also. For example, I can show it on the R console, but unless and until you do it with your own hands on the R console, you may not really understand it. So I have got our data, you can see here, and with this data I am trying to get the value modetab; you can see here this is modetab, and after this I am using the command in step two, which gives me the mode.
Now I will try to show you another aspect. As I told you, this command which I just introduced can be used on a data vector or a data matrix. So suppose I have data in the form of a matrix, and we have learned how to write the data inside a matrix. So there is a three by three matrix, and the data values are given inside this data vector: the values are 1, 2, 2, double 3, 4, 5, double 6, so the matrix will look like this. When I say I would like to find the mode of this matrix, essentially I am saying that I need the mode of the first column, second column and third column. So in the first column you can see the value 2 occurs two times, so here the mode should be equal to 2. In the second column the value 3 occurs two times, so here the mode should be equal to 3. And in the third column, the value 6 occurs two times, so here the mode should be equal to 6.
So this is what I mean when I repeat the same command on this data matrix. I do the same thing, simply copying and pasting, so you can see here I am using the same command, but now it is giving me this type of value; and then I use the command in the second step and it gives me the outcome 2, 3, 6. So you can see here this 2, 3, 6 is the same as what you had obtained manually: 2, 3 and 6.
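The same two steps on the matrix, as a sketch. Note that table(as.vector(...)) pools all entries of the matrix; here the three values that tie for the highest frequency happen to coincide with the column-wise modes:

    data <- matrix(c(1, 2, 2, 3, 3, 4, 5, 6, 6), nrow = 3)  # columns: (1,2,2), (3,3,4), (5,6,6)
    modetab <- table(as.vector(data))
    names(modetab)[modetab == max(modetab)]   # "2" "3" "6"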
So my idea was simply to inform you that this data can be a data vector or a data matrix, and here the same commands work.
And now we come to the aspect of how to compute the mode for grouped data, or data on a continuous variable. In the case of a continuous variable, the mode of the data is the value of the variable with the highest frequency density corresponding to the ideal distribution. What is an ideal distribution? It is the one which would be obtained if the total frequency were increased indefinitely, that is, made very, very large, and if at the same time the widths of the class intervals were decreased indefinitely. Now you may recall that we had a discussion on the histogram and the frequency curve. We had seen that the frequency curve, or the frequency density, or density plots, are more useful when you have a large amount of data; in that case the bins of the histogram are reduced, made as small as possible, and the number of data points is made as large as possible. So this is the same thing that this definition is trying to say. So under that
setting, in case you are trying to make a frequency curve like this, the bins are going to be very, very small, and then you will arrive at this type of frequency curve. So this is the highest value of the frequency, around which you will get the value of the mode.
Now, in order to compute the mode for this grouped data, the first step is to create a frequency table, and in this frequency table we just need three things: one is the class intervals, second is the midpoints of the class intervals, and third is the absolute frequency. One thing you have to notice here is that in this case I am using the symbol f to denote the absolute frequency, while earlier, in some of the lectures, I used the symbol f to denote the relative frequency; I am just trying to keep the standard symbols so that you don't face any problem.
So here, if you see, I have created different class intervals. Now I simply find the midpoints of these class intervals, say m1, m2, ..., mk, and they are obtained simply as the lower limit plus the upper limit divided by 2. Corresponding to the first class I have frequency f1, corresponding to the second class I have frequency f2, and corresponding to the kth class I have frequency fk. Now, what do we have to do? We simply have to find the maximum value among these frequencies, and whatever is the maximum value, say fm, I have to identify where this fm is lying, and corresponding to it I have to find the value of m; based on that, I will compute the value of the mode. The class where this maximum frequency occurs is called the modal class. In order to compute the mode for the grouped data, the expression is given like this.
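A hedged reconstruction of this expression, assuming the common textbook form with the notation defined just below (the slide's exact variant may differ):

$$\bar{x}_M = e_l + d_l \cdot \frac{f_0 - f_{-1}}{2 f_0 - f_{-1} - f_{+1}}$$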
Well, this is based on a certain computation, a certain derivation, but I am not going into that detail. Here, the value el is the lower limit of the modal class, dl is the class width, f0 is the frequency of the modal class, f with the subscript minus 1 denotes the frequency of the class just below the modal class, and f1 indicates the frequency of the class just after the modal class. So, for example, if I have the modal class here, then the class just before it has frequency f-1 and the class just after it has frequency f1. So now, based on that, we will try to compute the mode. Now we consider an example, and I will
try to show you how you are going to compute the mode based on this data. So this is again the same example that we considered earlier: the time taken by a customer to arrive at a shop inside a mall on different days is recorded, and there are 31 days, so there are 31 values here on the number of minutes that the customer takes.
Now we try to prepare the frequency table. So you can see here I have made five classes, five class intervals, and I have computed their midpoints; for example, in this case the midpoint is computed as 15 plus 20 divided by 2, which is equal to 17.5, and the frequency here is zero. Similarly, the other midpoints have been calculated and their corresponding frequencies have been obtained. Now, in this case, out of these frequencies, f3 equal to 18 is the maximum frequency which occurs. So the modal class is going to be the class corresponding to which we have the maximum frequency; here, 25 to 30 is going to be the modal class. So I take l equal to 3, that is, the interval 25 to 30 is the modal class. Now, based on that, you can see the frequency of the modal class, what we had denoted as f0, is 18; the frequency of the class just before the modal class, which is f-1 as per our notation, is 12; and the frequency of the class just after the modal class, which is
denoted as f1, is equal to 1. So, substituting these values, we will try to find the mode. You can see here that el, the lower limit of the modal class, is 25; the width of the class is 5; the modal class frequency f0 is 18; the frequency of the class just before the modal class is 12; and the frequency of the class just after the modal class is 1. If you substitute all these values, you get here the value 31.52. Now I would like to address one thing first: inside R, there is no built-in function to compute the mode of grouped data as we have just obtained through the formula. Well, you can write a small function or a small program to compute it, but definitely I am not going to pursue that idea here. And now I will address two more tools, the geometric mean and the harmonic mean, which
are also two different measures of the central tendency of the data. So first I take up the geometric mean. The geometric mean is useful in calculating the average value of ratios, or rates of interest in, say, the banking and finance sector, and so on; and the geometric mean is not really applicable when the observations are zero or negative.
This is going to be clear from the definition of the geometric mean. One condition in the geometric mean is that all the observations should be positive. So let x1, x2, ..., xn be the n observations which are all greater than 0. Now, these observations can be on a discrete variable, giving an ungrouped data set, or they can be data on a continuous variable, giving grouped data. When these observations are ungrouped, the geometric mean is defined as $\bar{x}_G = (x_1 x_2 \cdots x_n)^{1/n}$; what we are doing here is simply taking all the observations x1, x2, ..., xn, multiplying them, and then finding the nth root by taking the power to be 1 upon n. And similarly, in case you have
grouped data in which every xi has frequency, say, fi, then the geometric mean is defined by $\bar{x}_G = \left( x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n} \right)^{1/N}$. So this is the product of the xi raised to the power fi, and the power outside is 1 upon capital N, where N is the sum of all the frequencies. So now, in case
you want to find the geometric mean in the R software, once again there is no direct command, but writing down such a command is very simple. If you remember, in the initial lectures we had discussed different types of built-in functions, and using those built-in functions we can create the command for the geometric mean. So, if you see here, I have created this command here.
So suppose I denote the data vector as x; then what are we going to do? In the case of ungrouped data, first I find the product of all the observations, and then I take the power 1 upon n. So this product can be computed by prod(x), where x contains all the data, and for the power I am using the hat symbol; n here is the number of observations in x, which can be determined by length(x). So the power can be written as hat and, inside the bracket, 1 upon length(x).
So this is pretty simple. Similarly, in case you have grouped data, where the data vector has values x1, x2, ..., xn with frequencies f1, f2, ..., fn such that x1 has frequency f1, x2 has frequency f2, and xn has frequency fn, I can write the data and the frequencies inside vectors: the data is written inside the data vector x and all the frequencies are written inside the vector f. Finding these frequencies is not difficult, just by creating a frequency table and extracting the frequencies as we have done earlier. So now, the product x1 raised to the power f1, multiplied by x2 raised to the power f2, and so on up to xn raised to the power fn, can be written as prod(x^f). And then there is the power 1 upon N, where N is going to be the sum
of all the fi, which can be determined by the function sum(f). So, writing it like this, you will get the value of the geometric mean in this case. You can see here that just by using the built-in functions in R you can find the geometric mean.
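Wrapped up as small functions, this is a minimal sketch (the function names geomean and geomean.grouped are my own):

    geomean <- function(x) prod(x)^(1/length(x))             # ungrouped data
    geomean.grouped <- function(x, f) prod(x^f)^(1/sum(f))   # values x, frequencies f

For long vectors, exp(mean(log(x))) computes the same ungrouped value while avoiding overflow in the product.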
So now I take an example, the same example in which I have considered the data on minutes; this data is contained here like this. If you want to find the geometric mean considering this data as discrete data, you simply have to use this expression, and you can see here the value of the geometric mean; this is the screenshot. Considering this data as grouped data, we can find the frequency table that we have already discussed, and based on that I have this data here, the same data.
And now, first I need to extract the frequencies. How can it be done? You may recall that in the earlier lecture, in the case of the median, I had shown you how to extract the frequencies from grouped data. So I am not going to explain it here again, but I will simply be using the same commands that I used in the case of the median; if you have forgotten, I would request you to please go to the lecture on the median and see how commands like breaks and cut were used to extract the data. So now, using the earlier explained method, I construct a data vector breaks, which is a sequence from 15 to 40 at an interval of 5, because our class intervals start from 15, go up to 40, and are of width 5. This comes out to be 15, 20, 25, 30, 35, 40. Then I have to operate the command cut on the given data vector minutes, using these breaks; the right ends of the intervals are going to be open, so this is right = FALSE. Once you do it, I will get this type
of data. So you can see here these are the class intervals that we have obtained. Now I have to count the frequencies. So, in order to find the frequencies, I simply operate the command table with the data minutes.cut inside the arguments, and it will give me the frequency table: the first row indicates the class intervals and the second row denotes the frequencies.
Now, how to extract this frequency vector? For that we use the command as.numeric, and I store this value in f; so as.numeric on the data provided by table(minutes.cut) comes out like this. So now I have obtained the vector f, and next I have to collect all the midpoints and create a vector x. So I collect all the midpoints 17.5, 22.5, and so on, and now I have two vectors, x and f, and based on them I can use the function which we have just developed to compute the geometric mean. And here, if you see, I have given the screenshot, but I would request that you please simply copy these commands, paste them into your R console, and check whether you are getting the same outcome or not. And this is the same outcome which you will be getting here.
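The whole pipeline, as a sketch; the minutes vector below is a hypothetical stand-in, chosen only to reproduce the class frequencies described above (0, 12, 18, 1, 0), since the lecture's 31 recorded values are on the slides:

    minutes <- c(rep(22, 12), rep(27, 18), 31)   # hypothetical stand-in data
    breaks <- seq(15, 40, by = 5)
    minutes.cut <- cut(minutes, breaks, right = FALSE)
    f <- as.numeric(table(minutes.cut))          # frequencies: 0 12 18 1 0
    x <- seq(17.5, 37.5, by = 5)                 # class midpoints
    prod(x^f)^(1/sum(f))                         # grouped geometric mean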
Now, after this, I come to the last topic on the measures of central tendency, that is, the harmonic mean. The harmonic mean is also defined for grouped data as well as ungrouped data. For discrete data, the harmonic mean is defined as
$$\bar{x}_H = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}}.$$
So doesn't it look like I am finding the mean of the inverses of the observations and then once again taking the inverse? The same idea applies for the continuous data, the grouped data, and the expression for finding the harmonic mean there is analogous; but here you have to note that xi has frequency fi, the same terminology that we have used in the case of the geometric mean.
So now, in case you want to compute the harmonic mean in the R software, once again I would like to inform you that there is no built-in command inside the base package of R, but writing down a small command to compute the harmonic mean is not difficult. Just by using the built-in functions and by looking at the structure of how the mean has been defined, one can easily do it.
So, if you see, the command which I am writing here is for the quantity
$$\frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}}.$$
You can see here that if I denote 1 upon xi as, say, yi, then this quantity becomes 1 upon (1 upon n times the summation of the yi), which is nothing but 1 upon the mean of y. So this can be written simply as 1 upon the mean of 1 upon the x vector in the R software. That is the same thing which I am writing here: if x is a data vector, then the harmonic mean in the case of discrete data is computed as 1/mean(1/x).
And similarly, in case you have continuous data in the grouped format, where x1, x2, ..., xn are the values occurring with the frequencies f1, f2, ..., fn respectively, then the harmonic mean can once again be built from f upon x, because, if you see, you simply compute the quantities fi upon xi; so fi upon xi is written as f divided by x in the R notation, and then I combine these values and take the inverse.
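As a hedged sketch of both commands (the function names are my own): for ungrouped data this is exactly 1/mean(1/x) as above; for grouped data with absolute frequencies, the definitional form N divided by the sum of the fi/xi is sum(f)/sum(f/x) in R. Note that a literal 1/mean(f/x) divides by the number of classes instead of the total frequency N, so the two agree only when every frequency equals 1:

    harmmean <- function(x) 1/mean(1/x)                 # ungrouped data
    harmmean.grouped <- function(x, f) sum(f)/sum(f/x)  # values x, frequencies f

    harmmean(c(2, 4, 4))                 # 3
    harmmean.grouped(c(2, 4), c(1, 2))   # 3, the same data in grouped form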
So now, if I take the same example of the minutes that we considered earlier, this is the data which I have stored in the variable minutes, and if you simply execute the command to compute the harmonic mean that we just discussed on this data, you will get this value over here.
This is not difficult at all. Similarly, in case you want to consider this data as continuous data, then, just as we had obtained the frequencies in the geometric mean case, you have to repeat the same steps up to that point, and finally we get the frequencies as here. So now you have got the f vector and the x vector, where x contains the midpoints. This is what you have to keep in mind: in this case the xi's are the midpoints. If you execute the command for the grouped harmonic mean on f and x, then you will get this value here. And these things are not very difficult to obtain. You can see here, this is the screenshot.
Now I would like to stop here in this lecture, and I would also like to close the topic of measures of central tendency. We have discussed in this chapter different types of tools: arithmetic mean, geometric mean, harmonic mean, median and mode, for grouped data and for ungrouped data, and as far as possible, wherever available, I have explained how to compute them in the R software. So once again I would request that you take different types of data sets, and on the same data sets compute each and everything: quantiles, mean, median, mode, harmonic mean, geometric mean. Try to convert the same data into grouped as well as ungrouped data, try to operate the things, and try to see how much difference you are getting; and try to think why this difference is coming, which you will get from the theory of statistics. So, practice it and we will see you in the next lecture. Till then, goodbye.
Lecture 18
Welcome to the lecture on the course Descriptive Statistics with R Software. Now, you may recall that when we started the topics of descriptive statistics, we considered several aspects. One aspect was the central tendency of the data, which we have discussed in the last couple of lectures. Now, we will aim to discuss the topic of variation in data. So, now the first question comes: what is this variation? Why is it important? How is it useful? What type of information is it going to give us, and what are the different quantitative measures of such variation? So, in this lecture, we will try to develop the concept, need and requirement of having the measures of variation. And, we will discuss three possible measures in this lecture: range, interquartile range and quartile deviation.
So, let us start our discussion. You have seen that whenever we have the data, we simply want to dig out the information contained inside the data. And, as we have discussed, the data itself cannot tell you what properties it has. So, in the last couple of lectures, we have discussed the measures of central tendency.
And we have seen that those measures of central tendency give us an idea about the location where most of the data is concentrated. What does this mean? It means, suppose I have this data, which I am trying to plot here through a graphical measure, and suppose this is my x-axis and y-axis. I can see here that this data is concentrated somewhere here. So, this is giving us the information about where all this data is concentrated. But there is another thing; you can see here that there is a deviation between the center point and the individual points. Some points are close to the center and some points are away from the center. So, suppose I have these two types of data sets, and suppose their scales on the x and y axes are the same, so there is no issue.
Now, one data set is here like this and another data set is like this. So, you can see here, in this case, most of the data is again concentrated over here. But these deviations, that means the differences between each point and the center point where the mean is located, are changing. And, you can see here that in the first figure this region is of this type, and in the other figure this region is of a bigger shape. So, what I can see here is that there can be two different data sets which may have the same mean; for example, here in this case the mean is here and there the mean is here. So, I am assuming that this point on the x-axis is, suppose, mu here and mu there, which is denoting the mean. But the spread of the values around the mean is different. And, similarly, I can take any other point instead of the mean also.
So, now what we can see from these two figures is that there can be two different data sets which have got the same arithmetic mean, but they may have different concentrations around the mean. Now, the question is this: from the graphics I can show you that there are two different data sets in which the individual observations are scattered around the central point in different ways. Graphically I can view it, but now the question is, how to quantify it? For example, I can take a simple example to explain to you what type of information is conveyed by the measures of variation. Suppose there are three cities and we have measured the temperature, the weather,
And, you can see here, those temperatures in degrees centigrade are recorded in this table. So, now please have a look at the data given inside this table. Here I am taking three cities, city 1, city 2 and city 3, and in the rows we have different days: days one, two, three, four, five, six. So, if you observe city number one, you can see here the temperature is zero on day one, the temperature on day two is zero, the temperature on day three is zero, and the same temperature continues for all six days. So, in this case, if you find the mean of these observations, the mean of these temperatures, it will come out to be zero. So, what we can see here is that the arithmetic mean of the temperatures in city 1 is zero.
Now, similarly, if you observe city number 2: the first three values, that is, the temperatures on the first three days, are -15, and the temperature on days four, five and six is +15. Once again, if you find the average, it will come out to be the sum of -15, -15, -15, +15, +15, +15 divided by 6, and this will again come out to be zero. So, in this case also the x bar is coming out to be zero. So, now you can see here there are two cities, city 1 and city 2, in which the arithmetic mean is coming out to be the same, zero and zero. But if you look into the data, see here; do you think that the data in city 1, which is all zeros, and the data in city 2 are the same? The answer is no. They are different, but their means are the same.
Now, observe the data in city number 3. You can see here, on day one city three has temperature 11, on day two 9, on day three 10, on day four 8, on day five 12, and on day six 10. And now, if you find the arithmetic mean of these temperatures, eleven
plus nine, plus ten, plus eight, plus twelve, plus ten, divided by six, this will come out to be ten. So, the arithmetic mean of the temperature in city 3 is coming out to be 10. So, now you can see here, in this case the temperatures are a little bit different compared to city one and city two. So, you can see, in these three cities I have artificially taken three different types of data sets, and I am finding their arithmetic means. What you have to notice is that the arithmetic means in city 1 and city 2 are the same, but their data values are quite different.
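Since the six daily values for each city are fully spelled out above, a quick check in R looks like this:

    city1 <- c(0, 0, 0, 0, 0, 0)
    city2 <- c(-15, -15, -15, 15, 15, 15)
    city3 <- c(11, 9, 10, 8, 12, 10)
    mean(city1)   # 0
    mean(city2)   # 0
    mean(city3)   # 10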
Now, can I say that, since the mean temperatures in city 1 and city 2 are the same, zero, the pattern of the temperature in both the cities is the same? No, because city two has a peculiar characteristic: it has temperatures at two extremes, -15 degrees centigrade and +15 degrees centigrade, whereas in city one the temperature is actually constant, always zero. Whereas in city two, yes, there is some variation in the data. Now, if you look into the data of city 3, this also has some variation in the data. But now, if you look at the temperature patterns of these three cities, what do you think? Can I say that the information provided by the city three temperatures is more reliable? No.
Let us see whether this statement is right or wrong. So, first let me try to plot this data on a simple graph, and let us see what type of information I am going to get. Well, it's a very simple plot, and I will show you later on how to plot this type of data in the R software. But if you see here, in city number 1 the temperature is constant at zero: all these dots denote the temperature on day one, day two, day three, day four, day five and day six, and it is zero. Similarly, in city two you can see there are two values, -15 and +15; there are three observations on days one, two and three which are the same, -15, and there are three temperatures on day four, day five and day six which are +15.
But can you see the pattern in figure one and figure two? You can see here that the average value will be somewhere at about zero in both the cases. But if you see such a diagram for city 3, on day one it shows that the temperature is 11, on day two it is 9, on day three it is 10, and so on. So, if you look at this pattern, don't you see that the points here are scattered in different places? For example, you can see here figure number one, figure number two and figure number three, and the mean value is somewhere here. So, now our objective is very simple: we have understood that the mean value is giving us some information and the variation in the values is giving us a different type of information. So, from graphics I can see it, but I would also like to quantify it.
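A hedged sketch of such a day-by-day plot, reusing the city vectors from above; the graphical details are my own choice, not the slide's exact code:

    plot(1:6, city1, ylim = c(-20, 20), xlab = "day", ylab = "temperature")
    points(1:6, city2, pch = 2)   # triangles for city 2
    points(1:6, city3, pch = 3)   # plus signs for city 3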
So, now the next question is how to get it done. I can say here that the location measures, location meaning the central tendency, or, in simple language, the measures of central tendency, are not enough to describe the behavior of the data. There is another aspect, the concentration of the data, or the dispersion of the data around any particular value. This is another characteristic of the data that we would like to study. And, now the question is, how to capture this variation? In order to capture this variation, various statistical measures of variation or dispersion have been developed. Now, we are going to see what these measures are, how to use them, what their interpretation is, and how to implement these tools in the R software. These are the objectives in this lecture.
Okay. So, now let me show you a simple graph. Here you can see I have made two types of dots in this picture: one set is in green color, like this, and another is in red color, concentrated somewhere around this. So, these are two data sets and I have simply plotted the scatter plot. You can see here, in both the cases, the arithmetic mean is going to be somewhere here. But you can see that the data set in green color is closer to the central value, which is here, whereas the data in red color is more scattered from the center of the data. So, I can say that in the case of the green dots the variation is only up to this point, whereas in the case of the red dots the variation is going up to this point. So, this orange color pen and this blue color pen are trying to give us the idea of the scatteredness of the data. So, I would now try to devise some tools which can measure this scatteredness or this concentration: which data is more concentrated around the central mean and which data is more scattered around the mean. The mean is the general consideration; otherwise, I can measure it around any particular value also.
Now, I am going to address one simple issue before going further. Sometimes people ask me: if they have got two different data sets which have the same variation, is it possible that they also have the same mean? So, I am just creating two different data sets here by means of hypothetical graphics to show you the answer. The answer is this: by looking at the variation of the data, you cannot really comment on the central tendency of the data, and vice versa. For example, in the last slide we have seen that there are two data sets which have got the same mean but different variation. Now, I will show you that I have got two data sets which have got the same variation but different means. Okay.
Now, if you see here, I have taken two data sets, one denoted by red dots and another by green dots. Well, I have prepared it by hand, so essentially I have tried to make them very similar. You can see here, the mean of data set one is somewhere here, and the mean of data set two is somewhere here; and here is our x-axis and here is our y-axis. So, you can see that the variability in both the cases is the same, or nearly the same, meaning I have tried to make it as close as possible graphically. But they have got different means: the mean of the first data set is here, call it mean 1, and data set two has mean 2. So, you can see here, even if two data sets have got the same variability, they can have different means.
Usually, you will see that the spread, or scatteredness, or concentration, can be measured around any particular point. But we will see that measuring this concentration or spread around the mean value, and particularly the arithmetic mean, is more preferable. There is actually a statistical reason: when we compute measures like the variance and the standard deviation around the arithmetic mean, they have certain statistical advantages. Definitely, this is not really the platform, or the course, where I can explain the advantages of using the arithmetic mean. But, as I go further in the lectures, I will try to show you what is the most preferable location with respect to the given tool, around which the variability should be measured.
So, now we have understood that different measures of variation, or different measures of dispersion, help in measuring the spread and scatteredness of the data around any point, and preferably around the arithmetic mean value. Now, there are different types of measures available in statistics. There is a long list, but I will take up some of them in the given time, those which I can show you easily in the R software. So, there are different measures like the range, interquartile range, quartile deviation, absolute mean deviation, variance and standard deviation, and there are some graphical tools also; it is a long list. I will consider these measures in this lecture and the forthcoming lectures, and I will show you how to compute them in the R software also.
So, now what I will do is take these measures one by one, and I will have several questions to answer depending on the tool. The first thing will be: what is the definition of the measure? The second thing will be: how to compute it in the R software? The third thing will be: how to interpret it? In some of the cases I will also show you, in case some missing values are present in the data set, how to handle them, how to compute the tool in the presence of missing observations. And lastly, wherever possible, I will show you the tools for grouped and ungrouped data. So, these things will continue during the entire topic of
So, now let us take the first topic here, which is the range. First I will assume, and this will be valid for all the other lectures, that I have a variable x on which we are collecting n observations, and I am denoting them by small x1, small x2, and so on up to small xn. For example, if I say x is height, then I am collecting data on the heights of, say, n persons; the height of the first person is denoted by small x1, the height of the second person by small x2, and the height of the nth person by small xn. So, these small x1, x2, ..., xn are going to be some numerical values. Right. So, I will now say in simple words that we have a set of observations x1, x2, ..., xn, which is our data set, and our objective is how to define the tools and how to compute them using this data. The range is defined as the difference between the maximum and minimum values of the data. So, it is pretty simple: just find the maximum value out of the given data x1, x2, ..., xn, find the minimum value among x1, x2, ..., xn, and find the difference between the two. This will give us the value of the range.
So, this is pretty simple, actually. Now, the question is this: once you get the range, how do you interpret it? The rule is pretty simple: the data set having a higher value of the range has more variability. So, I can say one thing: if you have got more than one data set, and you want to measure the variability in terms of the range, then what do we have to do? Just find the ranges of all the data sets and compare them; whichever range comes out smallest, the corresponding data set will be thought to have the smallest variability. So, I can say now here that the data set which has the lower value of the range is preferable. And, in case we have two data sets and their ranges are represented by, say, range one and range two, then, if range one is greater than range two, we say that the data set of range one has more variability.
One thing I would like to make clear: we are going to discuss different types of tools, range, interquartile range, quartile deviation, absolute mean deviation, variance, standard deviation and so on. So, whenever we are trying to measure the variability, we are making such a decision only with respect to that measure. Now, suppose you have two data sets, and you find the range of one data set and, say, the standard deviation of the other data set. If you try to compare the range of the first data set and the variance or the standard deviation of the second data set, that may not be appropriate. So, my advice is that whenever you want to compare the variability, try to use the same tool.
Now, I am coming to the aspect of how to compute the range in the R software. I will denote by x the data vector, in which all the data values are contained, like c(x1, x2, ..., xn). Now, as we have defined it here, the range is the maximum value minus the minimum value. So, what I can do is use the built-in commands in the R software to find the maximum value and the minimum value, which we discussed in the earlier lectures: the maximum value of x1, x2, ..., xn is computed by max(x), max with the data vector inside the argument, and the minimum value of x is computed by the command min with the data inside the argument; then you find the difference between the two. So, max(x) minus min(x) gives the range.
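As a one-line sketch, with a hypothetical data vector:

    x <- c(68, 72, 60, 59, 81, 75)   # hypothetical data
    max(x) - min(x)                  # range: 81 - 59 = 22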
Now, suppose it happens that x has some missing values, and they are denoted by capital N, capital A. Suppose I store this data into another data vector, say xna; so xna is another data vector which has got some missing values. In case you want to compute the range of such a data vector where some values are missing, you simply have to use the same commands max and min, but inside the argument you have to give the data in this format: xna, the data vector with the missing values, together with the option na.rm equal to logical TRUE, that is, capital T, capital R, capital U, capital E. So, what will happen? Once you operate the max command on this data vector, it will first remove the missing values and then compute the maximum value. And similarly, when we use the min command on this data vector with na.rm = TRUE specified, the operator will remove the missing values, denoted by capital N, capital A, from the vector xna, and after that it will compute the minimum value. So, this is how you have to compute the range in case some data is missing.
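Continuing the sketch with missing values:

    xna <- c(NA, 72, 60, NA, 81, 75)                  # hypothetical data with NAs
    max(xna) - min(xna)                               # NA
    max(xna, na.rm = TRUE) - min(xna, na.rm = TRUE)   # 81 - 60 = 21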
One thing about which I would like to caution you here: as you have seen in the R software, we have function names whose value is easily understood from the name, like mean, which means the arithmetic mean of the data vector, or median, which gives the median of the data vector. Similarly, when you use the command range, r a n g e, it appears as if this is going to give the value of the range, that is, the maximum value minus the minimum value, but that does not happen. The range command will give you two values: one is the minimum value of the data inside the data vector and the other is the maximum value, so be careful.
Refer Slide Time: (29:04)
So, here I would like to make a note of caution: if you use the command r a n g e, it will return a vector containing the minimum and maximum values of the given argument.
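For example, as a quick sketch with the hypothetical x from above:

    range(x)        # returns c(min, max), here 59 81
    diff(range(x))  # one way to get the single range value, 22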
17
463
So, just be careful; and if you recall, the same thing happened in the case of mode also, m o d e, which gives some other information although by name it appears as if it is going to give the value corresponding to the maximum frequency. Similar is the case with range, so you need to be careful. Now, after this, I will take an example and show you how to compute the range on a given set of data. So, I will take an example which I
will now describe: I have observed the time taken by 20 participants in a race, given in seconds over here like this, and this data is recorded inside a variable time. So, this is my data vector, and now I simply execute the R command max(time) minus min(time), and I can see here that this is giving me the value 52. You can also verify it from the given set of data; for example, you can see here this is the maximum value and this is the minimum value, so if you subtract, 84 minus 32, you get the value 52. Just to show you what the outcome of the command range will be, you can see here that it gives two different values: this, 32, is the minimum value of time, and this, 84, is the maximum value in the data vector.
So, before I show you on the R console, let me give you one more example and then I will come back. Here, in this slide, you can see the screenshot of the R console, and now what I am trying to show you, in the same example, is how you are going to handle it if some data values are missing. So, in the same example where we have recorded the time taken by the 20 participants,
I have made the first two values NA, which means they are not available, and all the other values are the same. So, now I record this data inside a new variable, time.na; the name time.na is simply indicating that some data is not available inside the time vector. Right. And there is no rule that you always have to use the point or dot in the name; that is your choice. So, I have stored this data, and now I try to find the maximum value of time.na minus the minimum value of time.na. But now, if you see, this will give me an output of NA. Why? Because I have not used the option na.rm = TRUE. So, I correct myself and find the maximum and minimum on the data set time.na using na.rm = TRUE. As soon as we give na.rm = TRUE, the maximum command will understand that there are some missing values in time.na which have to be removed before computing the maximum, and the same
thing will happen here: the minimum command will understand that the NA values, the missing values, have to be removed first, and the value of the range then comes out to be 49. Right.
So, now I will show you this on the R console, and you can see here, this is the data.
So, let me first copy this data set, say here time, and I put it here as time. So, I can see this is my data time. Now, suppose by mistake you operate the command range; you can see here this will come out to be 32, 84, the same outcome that we had received. Right. But if you find the maximum of time minus the minimum of time, you can see here you are getting this value, which is the value of the range. And similarly, if you have to deal with missing data, then how to operate it? So, I create here another data vector, time.na; you can see here I have created this data vector time.na in which the first two values are missing. Suppose I try to find the range of time.na: this will give me NA, so there is something wrong. So I use the option na.rm = TRUE, and now it will give me the minimum and maximum values. But my objective is not to find the minimum and maximum values; I want to find the maximum and minimum values and then their difference, so I will write
max(time.na, na.rm = TRUE) minus min(time.na, na.rm = TRUE). Now, if you see here, you will get the value 49. But in case you by mistake do not give the option na.rm = TRUE, then you can see the outcome is NA.
Now, after this, I come to another topic, which is the interquartile range. Just as the range measures the difference between the maximum and minimum values, we have another measure, called the interquartile range, which simply measures the difference between the third and first quartiles. Now, you may recall, what was a quartile?
If you recall, we had discussed that if we have a frequency distribution like this one, then this frequency distribution is divided into four equal parts: the first 25 percent of the frequency is covered up to the first quartile, denoted as Q1; the next 25 percent of the frequency is contained between Q1 and Q2, so essentially Q2 covers the total fifty percent of the frequency, and this is the median; and similarly we have Q3 and finally Q4. So, now what I do here is take Q1, Q2 and Q3, where Q2 is the median, and consider this area. You can see here that this area consists of 25 percent of the total frequency, and this area consists of another 25 percent of the frequency. So, the area between Q1 and Q2 is 25 percent of the total frequency and the area between Q2 and Q3 is 25 percent of the total frequency. Altogether, if you add them together, then this entire area, which I am denoting here by dots, takes care of 50 percent of the total frequency. So, the interquartile range is defined as the difference between the 75th and 25th percentiles, or equivalently between the third and first quartiles; this is denoted as IQR = Q3 - Q1. Right. And, as I have shown you in this figure, this IQR, or interquartile range, covers the center of the distribution and contains 50 percent of the observations.
So, now, how to do the decision making? Once again, the rule is the same: the data set having the higher value of the interquartile range has more variability; that will be the interpretation. Obviously, we would always like to have a data set which has got the smaller variability, so the data set with the lower value of the interquartile range is preferable. Suppose I have got two data sets and their interquartile ranges are computed as IR1 and IR2; if IR1 is greater than IR2, then we say that the data set with IR1 is more variable, or has more variability, than the data set with IR2. So, this is our interpretation. Now, with these two examples, you can see that the range and the interquartile range are both trying to measure the same aspect of the data, that is, the variation, but they are doing it in different ways. Right. Now, how to compute it in the R software? This is pretty simple. There are two ways: you can write your own program, or just use the command to compute the quartiles and find their difference, or there is a built-in command.
Refer Slide Time: (39:36)
So, if I say that my data vector here is x, consisting of the observations (x1, x2, ..., xn) as we had assumed, then the interquartile range is computed by the command IQR, and inside the argument you have to give the data vector. In case the data vector has some missing values and is denoted by xna, then the command will be modified to IQR(xna, na.rm = TRUE).
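As a small sketch, with a hypothetical numeric vector in place of the slide's data:

    x <- c(32, 35, 45, 48, 50, 55, 61, 64, 70, 78)    # hypothetical data vector
    IQR(x)                                            # interquartile range Q3 - Q1

    xna <- c(NA, NA, 45, 48, 50, 55, 61, 64, 70, 78)  # same vector with missing values
    # IQR(xna) would stop with an error: missing values are not allowed
    IQR(xna, na.rm = TRUE)                            # IQR after removing the NAs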
Now, with this, I would like to introduce one more measure, called the quartile deviation. The quartile deviation and the interquartile range are very closely related to each other, and after this I will show you how to compute it in the R software. Okay.
So, the quartile deviation is another measure of the variability in the data, and it is defined as half the difference between the 75th and 25th percentiles, or half the difference between the third and first quartiles. So, this is essentially half the value of the interquartile range; half of the interquartile range is called the quartile deviation. It is not really difficult to give the definition of the quartile deviation: we simply take the difference Q3 - Q1, which is nothing but the interquartile range, and divide it by 2, which I am doing here. So this is nothing but half of the interquartile range, and the decision making in this case is the same as in the case of the interquartile range: the data set having a higher value of quartile deviation has more variability.
Now, in case you want to compute the quartile deviation in the R software, it is pretty simple. You have already learned how to compute the interquartile range, so you simply have to write the same command and divide it by 2, and if the data vector has some missing values, then again just write the same command defined for the interquartile range in the case of missing data and divide it by 2. This command will give you the value of the quartile deviation. Now I will take an example to show you how to compute these things, and then I will show you on the R console also. So, again, I am going to take the same example,
where I have stored the data on the time of 20 participants. Now, if I simply operate the command IQR with the argument time, this will give me the value of the interquartile range, and this value comes out to be 27. Similarly, if I want to find the quartile deviation, I simply have to write IQR(time), the same command which I have used here, and divide it by 2. So, this value comes out to be 13.5.
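A hedged sketch of these two commands (again with a hypothetical stand-in for the 20 recorded times):

    time <- c(32, 34, 35, 37, 39, 41, 45, 48, 52, 55,
              58, 61, 63, 65, 67, 69, 72, 77, 81, 88)  # hypothetical data

    IQR(time)       # interquartile range
    IQR(time) / 2   # quartile deviation: half of the interquartile range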
And, this is here the screenshot of the operation.
And now I will first show you how to handle the missing values. Once again, I take the same example as earlier: in the time data, the first two values have been replaced by NA, so they represent the missing values, and these values have been given inside the data vector time.na.
Now, if you simply try to run IQR(time.na), this is obviously going to give you an error because there are missing values. So what you have to do, in case you want to compute the interquartile range, is give the command IQR with the data containing the missing values, time.na, and the argument na.rm = TRUE. This tells R that when computing the IQR of time.na, the missing values first have to be removed. This value comes out to be 25.25, and if you try to find the quartile
deviation, just use the same command here and divide it by 2. This will give you the value of the quartile deviation.
So, I will now show you this on the R console. You can see that the data has already been entered here as time, so I write IQR(time). This comes out to be 27, and if you want to find the quartile deviation, you just divide it by 2, and this comes out as shown. You can also see what happens if I try to use small letters, iqr: this gives me an error. So, what you have to keep in mind is that the IQR command is case sensitive; the lowercase iqr and the uppercase IQR are not the same.
Refer Slide Time: (44:48)
And, similarly, I take the data with missing values, time.na. This is my data vector. Now, if I want to find the IQR of this time.na and I try to do it without giving the argument na.rm = TRUE, you can see that this gives me an error. So, I add the argument na.rm = TRUE, and you get this value. And if you want to find the IQR divided by two, that is, the quartile deviation, you simply have to divide the interquartile range by two, and this gives me the value 12.625. Okay.
So, in this lecture I have given you the concept of variation in data, and I have introduced two measures which are based on certain values: for example, the range is based on two values, the minimum and the maximum, and the quartile deviation and interquartile range are both based on two values, the first quartile and the third quartile. So, at this moment, I would request you to please go through the lecture, take some examples and practice them in the R software. Try different experiments, obtain the values, and try to see what type of information the values you have obtained for the range and interquartile range give about what is contained inside the data. So, you practice it, and I will see you in the next lecture. Till then, goodbye.
Course Title
Descriptive Statistics with R Software
Lecture - 19
Variation in Data - Absolute Deviation and Absolute Mean Deviation
Prof. Shalabh
Department of Mathematics and Statistics
IIT Kanpur
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture, we started our discussion on the topic of how to measure the variation in the data, and in that lecture, we had considered three possible tools: range, quartile deviation and interquartile range. The measure of range was dependent only on two values, the minimum and the maximum, and the quartile deviation as well as the interquartile range were dependent on two values, the first quartile and the third quartile. So the common feature is that these measures are based only on two values at a time, either the minimum and maximum or the first and third quartiles; they take care of the entire dataset only indirectly, through the minimum, maximum, or quartiles.
But now there is another idea for measuring the variation: why not measure the deviation of individual data points from a central value and then combine all these deviations together, combining all the information?
So now we are going to start a discussion on different types of tools to measure the variation which are based on the individual deviations of the data from a central value, or from any other value. In this lecture, we are going to discuss two topics: first the absolute deviation and then the absolute mean deviation. We will try to understand the concepts and I will show you how to compute them in the R software.
Now we are going to discuss another aspect: what are the different measures which are based on the deviations? What does this mean? If you remember, in the earlier lecture I had made this type of data plot, right, and if you try to see here, the mean value is somewhere; I am assuming somewhere here. This here is the mean value, and I am marking these differences. So these are my data points, and these blue coloured lines indicate the difference between the central value, say the mean, or it can be another value also, and the individual data points.
So now the concept we are going to discuss here is that tools like the range, interquartile range, and quartile deviation were based on specific values and partitioning values. Specific values means the minimum value and the maximum value, whereas the partitioning values were the first and third quartiles.
Now we would like to have a measure which can measure the deviation of every observation around any given value. So in this figure, instead of taking the mean, I can also take any other value, say A.
So now the first question is how to measure the deviation. First, I am initiating the discussion by measuring the deviation around any value A, and later on I will choose an appropriate value of A. You can see in this graph: if this is the data point and this here is the mean value, or the known value A, then these are my deviations.
So this here is x1, this is x2, and this is x3. Essentially, this difference measures the deviation between x1 and A, which I denote by d1. Similarly, this difference measures the deviation between x2 and A, which I denote as d2, and similarly, the difference between A and x3 is denoted by d3. So, in general, I will say that di = xi - A is going to measure the deviation.
Now, if xi is greater than A, then in all such cases the deviation will be positive; if the value of xi is smaller than A, then all such deviations will be negative; and if xi is exactly equal to A, then such deviations are going to be zero. So this di can take three possible kinds of values: zero, less than zero, and greater than zero.
Now there is another issue. In case I have got the observations x1, x2, ..., xn, then corresponding to every observation I have got a deviation d1, d2, ..., dn. Now d1, d2, ..., dn are individual values. So suppose you have got 20 observations; then you will have 20 values of deviation. Some are positive, some are negative, some are zero. It will be very difficult to look at the individual deviations and get a summarized value. As we had discussed in the case of measures of central tendency, the basic tendency is that we would always like to have a summarized measure. That means all the information which is contained in d1, d2, ..., dn has to be somehow combined into a single quantity.
(Refer Slide Time 07:15)
So now one option is that once I have the values of d1, d2, ..., dn, I can find the average of these deviations. So I try to compute $\frac{1}{n}\sum_{i=1}^{n} d_i$, but this average may be very, very close to 0. Why? Because it is possible that some of the deviations are positive and some are negative. So there is a possibility that when we take the average of positive and negative values, the mean may come out to be 0 or very, very close to 0. Once this value $\frac{1}{n}\sum_{i=1}^{n} d_i$ comes out to be zero, it possibly indicates as if there is no variation in the data, or, if this value is pretty small, it indicates that there is very small variation in the data.
So let me make a figure here. Suppose my central value is somewhere here and my observations are something like this: one and two are here and one and two are there. In this case, if this here is A and these are x1, x2, x3, x4, then the deviations d1, d2 will have the opposite sign to d3, d4. So now, if I try to find the value of (d1 + d2 + d3 + d4) divided by 4, where suppose two are positive and two are negative values, then their average may be close to 0 or exactly 0. So this may be misleading, because it may not give us the correct information.
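A quick numerical sketch of this cancellation, with four hypothetical observations placed symmetrically around A = 10:

    A <- 10
    x <- c(6, 8, 12, 14)  # hypothetical observations, symmetric around A

    d <- x - A            # deviations: -4 -2  2  4
    mean(d)               # 0: the positive and negative deviations cancel
    mean(abs(d))          # 3: the magnitudes reveal the actual scatter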
So, obviously, this example gives us a clear idea that we need a measure in which these signs are not considered. Why? Because I am interested only in the scatteredness of these green circles around this red point; I am not interested in their individual values. So we need not consider the signs of these deviations di, but we need to consider only their magnitudes.
So now, after looking at this example, we have understood that I need a summary measure which can express the variation in the data based on the deviations, and these deviations have to be considered only in their magnitude. Obviously, in the data, some deviations are going to be positive and some are going to be negative. As long as we have positive deviations, there is no problem, but I need to convert the negative deviations into positive ones. That means I need to convert the sign of the negative deviations.
Now the next question is how to convert the sign. In mathematics we have two options: either we consider the absolute value, or we square the values. Based on these two approaches, we have two types of measures: one is the absolute deviation and the other is the variance. In this lecture, I am going to consider the concept of absolute deviations, and in the next lecture I will consider the concept of variance. In both cases the idea is the same: consider the magnitude of the deviations, either their absolute values or their squared values.
So in this lecture I am going to concentrate on absolute values. I have the observations x1, x2, ..., xn, and I have taken the deviations from the value A. After this, I will take the absolute value of all these deviations, and now I need to combine all this information.
Now the question is how to combine it. We will consider that now, but before going into the discussion on how to combine such information, first let me clarify the symbols and notation I am going to use in this lecture and the next one. You see, I have two types of variables: discrete and continuous.
For a discrete variable, we simply use the observations as such, but in the case of a continuous variable, we group them: we convert the data into a frequency table, extract the mid-values of the class intervals and the corresponding frequencies, and construct our statistical measures from them, the same thing we had done in the case of the arithmetic mean.
For ungrouped data, the arithmetic mean was defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and in the case of continuous or grouped data, it was defined as $\bar{x} = \frac{\sum_{i=1}^{K} f_i x_i}{\sum_{i=1}^{K} f_i}$, where in the second case the xi's denote the mid-values of the class intervals. So I will now introduce these measures for grouped data as well as for ungrouped data.
So, first, let me explain that in case our variable is discrete, or the data is ungrouped, then we will denote the set of observations as x1, x2, ..., xn. There is no issue here; we will use these values directly.
Whereas in case I have grouped data, suppose my variable is X and I have got the observations, which I have tabulated in K class intervals in the form of a frequency table like this. As you can see here, I have placed the observations in the first class interval, second class interval, up to the Kth class interval, and after that I have found the midpoints of these class intervals. So the midpoint of the class interval e1 to e2 is (e1 + e2) / 2, the class interval e2 to e3 has its midpoint at (e2 + e3) / 2, and so on.
So I am denoting the midpoints of the intervals by x1, x2, ..., xK, which is different from the earlier case, where x1, x2, ..., xn denoted the values obtained directly from the data. Corresponding to these intervals, we have the absolute frequencies, which I denote by f1, f2, ..., fK, and the sum of f1, f2, ..., fK is denoted by n, which is again different from the case of ungrouped data.
(Refer Slide Time 16:01)
So this is what you have to keep in mind: when I consider the case of ungrouped data, then x1, x2, ..., xn denote the values which I observe directly, and when I consider grouped data, then x1, x2, ..., xK denote the mid-values of the class intervals, together with the corresponding frequencies in those class intervals.
Similarly, the symbol small n denotes the number of observations in the case of ungrouped data, whereas small n denotes the sum of all the frequencies in the case of grouped data.
So now with these notations I will start the discussion on the first measure, that is absolute
deviation. Right.
So, first, concentrate on this slide, this one here. You can see that I have obtained n such deviations in absolute value. Now, if I find the mean of all such deviations, this will give me a sort of summary measure of absolute deviation.
And the definition of the absolute deviation is that if you have got data x1, x2, ..., xn, then the absolute deviation with respect to a value A is defined as follows.
Now I have two cases: for ungrouped or discrete data, and for grouped or continuous data. In the case of discrete data, we simply take the arithmetic mean of the absolute deviations, and in the case of continuous data, we again find the mean of the absolute deviations using the concept of the arithmetic mean for a frequency table.
So now you can see here: this is how we define the absolute deviation, the sum of the absolute deviations divided by n, which is the arithmetic mean of the absolute deviations, and in the case of continuous data this is again the arithmetic mean of the absolute deviations, computed using the frequencies.
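In symbols, reconstructing the slide's formulas from the notation above, the absolute deviation around a known value A is

$$AD(A) = \frac{1}{n}\sum_{i=1}^{n} |x_i - A| \ \text{(ungrouped data)}, \qquad AD(A) = \frac{1}{n}\sum_{i=1}^{K} f_i\,|x_i - A|, \quad n = \sum_{i=1}^{K} f_i \ \text{(grouped data)}.$$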
(Refer Slide Time 18:27)
Now the next question comes: how to choose this A? Well, in this definition, I am assuming that A is known to us, but if I choose different values of A, then the value of the absolute deviation changes. By using a little algebra, we find that the absolute deviation takes its smallest value when A is chosen to be the median of the values. So we replace A by the median of x1, x2, ..., xn, and this gives us the absolute mean deviation.
So if I take the observations x1, x2, ..., xn, then the absolute deviation is minimized when it is measured around the median, that is, when A is the median of the data x1, x2, ..., xn.
Remember one thing: we computed the median in the cases of grouped data and ungrouped data by different expressions. So whenever we want to find the value of the absolute mean deviation, we need to compute the value of the median suitably. Suitably means: if you have discrete data, order the observations and take the middle value, depending on whether the number of observations is odd or even, as we discussed earlier.
And similarly, if you have grouped data, then please convert it into a frequency table and use the expression for computing the median from that table; remember that there we computed the median in a different way, not by using the direct functions.
So in case I have discrete data, I simply replace A by the median of x1, x2, ..., xn, which is denoted by $\bar{x}_{med}$, and then I take the arithmetic mean of those absolute values; this is called the absolute mean deviation in the case of ungrouped data. Similarly, if you have grouped data on a continuous variable, then, again, I find the value of the absolute deviation by replacing A by $\bar{x}_{med}$, but remember that this median is suitably computed. Right. And also note that the definitions of n in the cases of grouped and ungrouped data are different. Right.
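Written out, with the median computed by the appropriate grouped or ungrouped procedure, the absolute mean deviation is

$$AMD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}_{med}| \ \text{(ungrouped data)}, \qquad AMD = \frac{1}{n}\sum_{i=1}^{K} f_i\,|x_i - \bar{x}_{med}| \ \text{(grouped data)}.$$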
Now, what are these absolute deviation and absolute mean deviation doing? Whether it is the absolute deviation or the absolute mean deviation, they present the information on the variability of the data in a comprehensive way. The absolute mean deviation measures the spread and scatteredness of the data, preferably around the median value.
And in case you want to know how to make a decision when you have got more than one data set and you want to know which data set has more variability, how to do it? Again, we have the same rule as we had discussed in the case of the interquartile range.
So I will say that the data set having the higher value of absolute mean deviation, or even of absolute deviation, is said to have more variability, and the data set with the lower value of absolute mean deviation or absolute deviation is preferable. Suppose we have two data sets and we compute their absolute mean deviations, say AMD1 and AMD2. Then, if the value of AMD1 is greater than the value of AMD2, the data in the first data set is said to have more variability than the data in the second.
One thing you have to keep in mind: there are two interpretations, more variability or less concentration.
What am I trying to say? More variability means that the data is more scattered around the median value; less concentration means that the data is less concentrated around the median value. Usually, people state their inferences using either the term variability or the term concentration, so you have to be careful: more concentration means less variability. Okay.
Now the question is how to compute it in the R software. I have two cases: first for ungrouped data and second for grouped data. In the case of ungrouped data, I denote the data vector by x, so this is x = c(x1, x2, ..., xn), with the values separated by commas.
Once again, I would like to point out that in R there is no built-in function to compute the absolute deviation or the absolute mean deviation, but by using the built-in commands inside the base package of R, it is not difficult at all to write down the command for the absolute deviation yourself.
(Refer Slide Time 25:24)
If you try to see what I am doing: I have the deviations xi - A, and I want the absolute value of each xi - A. For that we have the function abs, written as abs(x - A), and then I find the sum of these values and divide by the number of observations.
So I have used this concept here: first, I write down the absolute values of all the deviations of the data vector from a given point A, and then I find the mean of all such absolute values. So by writing mean(abs(x - A)) we find the value of the absolute deviation around any given value A. Here A is assumed to be known, and if I want to replace A by the median of x, then I know how to compute the median: the command is median with the data vector x as the argument. So I find the absolute values of the deviations between x and median(x) by using the command abs, and then I find their arithmetic mean.
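As a short sketch (hypothetical data vector; the same pattern works for any numeric vector):

    x <- c(32, 41, 45, 49, 52, 56, 60, 64, 70, 78)  # hypothetical data
    A <- 10

    mean(abs(x - A))          # absolute deviation around the known value A
    mean(abs(x - median(x)))  # absolute mean deviation around the median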
Now, in case I have ungrouped data with missing values, suppose I have the data vector x, and whatever values are missing in this vector are denoted by NA, and those values are collected inside the data vector xna.
You see, I have done nothing new here. If you have understood how to handle missing values, I have written the same command as earlier, but I have added the argument na.rm = TRUE. This tells R that when computing the absolute values of xna - A with the command abs, the missing values first have to be removed, and then the command mean is operated on whatever is the outcome. And similarly, in case you want to find the absolute mean deviation, the idea is the same.
So in the same command, I simply replace A by the median, but remember one thing: the median also has to be computed using the argument na.rm = TRUE. First, na.rm is used by the command median, and then na.rm is used by the command mean. So you simply have to be a little careful while writing this command.
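A minimal sketch for the missing-data case, again with a hypothetical vector xna:

    xna <- c(NA, NA, 45, 49, 52, 56, 60, 64, 70, 78)  # hypothetical data with NAs
    A <- 10

    # Absolute deviation around A, ignoring the missing values
    mean(abs(xna - A), na.rm = TRUE)

    # Absolute mean deviation: na.rm = TRUE is needed twice,
    # once inside median() and once inside mean()
    mean(abs(xna - median(xna, na.rm = TRUE)), na.rm = TRUE)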
Now, similarly, when we have grouped data, I will assume that my data vector x denotes the midpoints of the class intervals, and this data has the frequency vector f. So x1, x2, ..., xK are the midpoints of my class intervals, with frequencies f1, f2, ..., fK. Right.
Now, in this case, the absolute deviation is simply
$$\frac{1}{n}\sum_{i=1}^{K} f_i\,|x_i - A| .$$
So I compute the deviations xi - A with the function abs(x - A), and then I have to multiply by the corresponding fi, for which I can use the operator *. Then I have to find the sum, and it has to be divided by n, where n is the sum of the fi; so I can divide by sum(f). Right. And the same thing has been written here in this command.
So once again you can see that there is no built-in function, but if you are careful, you can easily write such commands. And similarly, if you want to compute the absolute mean deviation, what do I have to do? I simply have to replace A by the median.
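A sketch of the grouped case, previewing the midpoints, frequencies and separately computed median from the example discussed below:

    x <- c(35, 45, 55, 65, 75, 85)      # midpoints of the six class intervals
    f <- c(5, 3, 3, 5, 2, 2)            # absolute frequencies, n = sum(f) = 20
    A <- 10

    sum(f * abs(x - A)) / sum(f)        # absolute deviation around A: 46

    xmedian <- 56.66                    # median computed separately from the table
    sum(f * abs(x - xmedian)) / sum(f)  # absolute mean deviation: about 14.17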
So, just in order to make it clear that here we are not going to use the command median directly, but need to compute the value of the median separately, I am using a black colour font and denoting it as xmedian, so that you remember that this value has to be computed separately, using the functions and commands that we have used in the case of grouped data.
So now let me take an example, and I will take the same example as before. First I will treat the data set as an ungrouped data set, and then I will convert it into a grouped data set. So, first, consider the case of the ungrouped data set. You have the same example that we used earlier: we have the data of 20 participants, and this data has been recorded inside a variable, say time; this is the data vector here.
In order to compute the absolute deviation, I choose a value of A; just for the sake of understanding, I take A to be 10. Now, once I operate the command we have just noted, mean(abs(time - A)), this gives me the value 46. And if I find the median of time, you can see that this value comes out to be 46.5, because this is a discrete variable; you have ungrouped data, so you can use the median command directly. Now, using this command directly inside the outer command, I compute the absolute values of the deviations of time from
the median, and then I find the mean. This gives me the value of the absolute mean deviation around the median, and this value comes out to be 14.5.
As we had discussed, the absolute mean deviation is minimized when it is measured around the median; you can see that when I chose the value A equal to 10, the value was coming out to be 46, but now this value comes out to be 14.5.
Okay. Before I go to the R console, let me show you the screenshot of the same operation that we are going to do, but before that, let me consider this data as grouped data and show you how to compute the median in this case also.
So now, considering the same dataset as grouped data, I have divided the entire data into six class intervals and I have found their mid-values xi, given in the second column. You can see that this value is (31 + 40) divided by 2, so this is 71 by 2, which is 35.5, and similarly the other values, and here are the frequencies. So 5 is the frequency of class 1; f2, the frequency of the second class, is 3; and then f3, f4, f5 and f6.
Now, what is our objective? We simply want to find the frequency vector and the median.
How to find the frequency vector and how to find the median, I am not going to explain in detail here, because I have already discussed it in more detail when computing the median, and I have also used it in earlier lectures. But just for your understanding, and so that you can recall, I will give you the broad steps, which I have simply taken from the slides that I had used earlier.
So my first objective is to obtain the frequency vector. We had defined the class boundaries in a vector denoted by breaks, and this breaks vector has an outcome like 30, 40, 50, 60, 70, 80, and so on.
(Refer Slide Time 36:04)
And then, after this, we wanted to obtain the frequencies. In order to obtain the frequencies, first we need to classify the time data into the class intervals using the widths defined by the breaks. For that we used the command cut: the cut command was operated over the data vector time using the breaks, and we used the option right = FALSE so that the intervals are open on the right-hand side. I stored these values in the time.cut vector, and these values were obtained here like this.
And here, if you recall, the last row, which is indicated with Levels, indicates the class intervals. After this, we used the command table to tabulate the values of time.cut, and the frequency table was obtained like this. The second row over here, 5, 3, 3, 5, 2, 2, gives us the frequency values.
We then obtained the frequency vector from this table using the command as.numeric, with the name of the frequency table inside the argument, and this gives the outcome 5 3 3 5 2 2, which is the same as given in this vector. So this is our frequency vector.
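Putting these steps together as a sketch; the time vector is a hypothetical stand-in, and the breaks are assumed to run up to 90 so that all six classes are covered:

    # Hypothetical stand-in for the 20 recorded times
    time <- c(32, 34, 35, 37, 39, 41, 45, 48, 52, 55,
              58, 61, 63, 65, 67, 69, 72, 77, 81, 88)

    breaks <- seq(30, 90, by = 10)                # class boundaries 30, 40, ..., 90
    time.cut <- cut(time, breaks, right = FALSE)  # intervals [30,40), [40,50), ...
    table(time.cut)                               # frequency table of the classes
    f <- as.numeric(table(time.cut))              # frequency vector: 5 3 3 5 2 2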
Now we need to find the vector of midpoints. How to obtain it? We had defined the vector x to be the midpoints of the class intervals. How are they obtained? You see, one thing you have to be very careful about here: when we are compiling the data through the R software using the command table, then I have to use the
midpoints of the class intervals and the frequencies as they are provided by the R software. Right.
So here, if you look at the earlier slide, we had obtained the class intervals like this one. The midpoint of this interval is (30 + 40) divided by 2, which is equal to 35; the midpoint here is 45; and so on. This is different from what we obtained earlier manually.
You may recall that this value was 35.5 and so on, but now we are working in the software and this will not make much difference. So this is now my vector x. After this, I have to use the expression for finding the median. If you remember, we used the notation m, em, fm, dm, and we had an expression to find the median, which we had discussed earlier.
So now, using this em, which denotes the lower limit of the median class; fm, which gives the frequency of the median class, which is here; the sum of the fi's, which is the sum of the frequencies of the first two classes, classes one and two; and dm, the width of the median class, which is 10; using these values I get the value 56.66.
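In symbols, this is the usual grouped-data median formula; plugging in the values just named (median class 50 to 60, so em = 50, fm = 3, the frequencies of the first two classes summing to 8, dm = 10, and n = 20):

$$\bar{x}_{med} = e_m + \frac{\frac{n}{2} - \sum_{i=1}^{m-1} f_i}{f_m}\, d_m = 50 + \frac{10 - 8}{3}\times 10 \approx 56.67 .$$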
So what you have to notice here is that in this case we have to find the median separately. Now I have the data vector of frequencies f, the data vector x of midpoints, the value of the median 56.66, and the value of the constant A, which I have chosen myself. So now, using the same commands for finding
the absolute deviation around the value A and the absolute mean deviation around the median, where the median has been defined separately, we get these values here.
And in case you try to compare the results that we have obtained for the grouped and ungrouped data, you can see that there is not much difference, right. This is for the ungrouped data and this for the grouped data. The values of the absolute deviation around A equal to 10 come out to be the same, and similarly the values obtained for the absolute mean deviation in the cases of grouped and ungrouped data are not much different: 14.5 versus 14.16, about 14.2, so there is not much of a difference.
Now I will show you these things on the R console. So we come to our example, where we first find the absolute deviation.
So I can show you that here I have the data time, and I take A to be 10. If I find the absolute deviation, it comes out to be 46, and if I replace A by the median of time, it comes out to be 14.5; you can verify that these are the same values.
Now I come to the grouped data case, where we have the frequency vector like this, which we have copied here, the vector of the midpoints, which I write here like this, and the value of xmedian, which has been obtained separately and is like this. So you can see f coming out here, x like this, and xmedian like this.
Now let me just clear the screen so that you can see clearly. Using these values, I compute the absolute deviation and the absolute mean deviation. You can see that the first comes out to be 46, and similarly, if I compute the absolute mean deviation, it comes out like this. So you can see that it is not very difficult to compute.
Now I will quickly show you the same example for the case when you have missing values in the data. I take the same data set, assuming the first two values are replaced by NA, and I will show you how to compute these values.
This data has been stored in the variable time.na, and you can see that I choose A = 10, and then I simply use the command for computing the absolute deviation when data is missing; then I find the median, and then I compute the absolute mean deviation using the command that I have just discussed, and these values come out like this.
I will show you on the R console also; if you remember, the time.na data is already there. Now, if I use the same command here, I get the absolute deviation, and if I find the absolute mean deviation, then it is going to be this.
You can see that these are the same values as in the screenshot of what we have just done. Right.
So now I would like to stop this lecture. You have seen how we formulated a measure of the variation in the data based on absolute values. Notice that there is no built-in function for it; using the built-in functions we learned earlier, we have defined the function or command to compute the absolute deviation around any arbitrary value, or around the median, ourselves.
You also need to remember that the computation procedures for grouped and ungrouped data are different in the R software, so you have to be careful while doing it. I would request you to please take some more examples and practice, and we will see you in the next lecture. Till then, goodbye.
Lecture – 20
Variation in Data – Mean Squared Error, Variance and Standard Deviation
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we considered the aspect of measuring the variation by using the absolute deviation and absolute mean deviation, and, if you recall, we had developed those tools on the concept of deviations of the observations around any arbitrary point, or around some central tendency value such as the mean or the median. We also discussed that whenever we want to develop these types of tools, we have to convert the deviations into positive values; that means we need to consider only the magnitude of the deviations, and for that we have two options: we can consider the absolute values of the deviations, or we can consider the squared values of those deviations. So now, in this lecture, we are going to discuss the second aspect: how to build a measure of variation by considering the magnitude of the deviations through squaring them. In this lecture, we are going to discuss the concepts of mean squared error, variance and standard deviation, and we will try to see how to implement them in the R software. Before we start, let us fix our notation once again; although I did this in the earlier lecture, here I will just do it quickly.
So, we will be considering two types of variables here. One is a discrete variable, on which we will have ungrouped data; in this case, the variable is denoted by the capital letter X, and on this variable we obtain n observations, denoted x1, x2, ..., xn.
Similarly, when we consider a continuous variable with grouped data, this means we have a continuous variable X on which we have obtained observations, and those observations have been tabulated in K classes, or K class intervals, and the entire tabulation is presented in the form of a frequency table. For example, here you can see the frequency table in which all the observations have been converted
into groups, and these groups are the class intervals; they are denoted here by e1 to e2, e2 to e3 and so on. So we have K class intervals, or K groups, and the midpoint of each interval is denoted by x1, x2 and so on. So x1 denotes the midpoint of the first class interval, x2 denotes the midpoint of the second class interval, and so on. In the case of grouped data, x1, x2, ..., xK denote the midpoints and not the values of the observations, as happens in the case of ungrouped data, and the frequencies of these intervals are denoted by f1, f2, ..., fK. So f1 denotes the frequency of the first class interval, f2 the frequency of the second class interval, and so on, and the sum of all these frequencies is denoted by n. So these are our basic notations for grouped and ungrouped data. As soon as I say that we are going to define the measures on grouped data, I will be using these notations, and as soon as I say that I am going to develop the tool for ungrouped data, I will be using the earlier symbols and notations. Right! Now I come to the aspect of developing a tool called the mean squared error, and I will follow the lecture almost on the same lines as in the earlier lecture. You may recall that first I defined the absolute deviations around any arbitrary value A, then I developed the measure, and then I replaced A by a measure of central tendency: we defined the absolute mean deviation by replacing A by the median, because the median is the value around which the absolute deviations are minimum. Now, similarly, on the same lines, instead of absolute deviations, I will be considering squared deviations. You may recall
that in the case of absolute deviations, I used the quantity |xi - A|. Now I will consider the squared values of xi - A, and I will write the squares of these deviations as $(x_i - A)^2$. These are the squared deviations around any arbitrary point A. Now I obtain this quantity for each and every observation: (x1 - A) squared, (x2 - A) squared, and up to (xn - A) squared, and then I take the arithmetic mean of all these values. Once I do this, then in the case of ungrouped data on a discrete variable, this quantity is
$$s^2(A) = \frac{1}{n}\sum_{i=1}^{n} (x_i - A)^2 ,$$
which is called the mean squared error with respect to A. And similarly, whenever we have a continuous variable or grouped data, then the mean squared error with respect to any arbitrary value A is defined as
$$s^2(A) = \frac{1}{n}\sum_{i=1}^{K} f_i\,(x_i - A)^2 .$$
You can see that now the summation goes over the number of classes, the small n here is the sum of all the frequencies, and this quantity is a sort of weighted mean, where the weights are given by the frequencies
fi. So, this is how we define the mean squared error in the cases of grouped and ungrouped data.
Now, one can choose any value of A, but it can be shown mathematically that the mean squared error assumes its minimum value when A is chosen to be the arithmetic mean; in simple words, the mean squared error takes the minimum value when the deviations are measured around the arithmetic mean. So what will I do? I will replace A by the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the mean of the observations,
and then I define the deviations $(x_1-\bar{x})^2$, $(x_2-\bar{x})^2$, and up to $(x_n-\bar{x})^2$, and I take their average, the simple arithmetic mean. In the case of ungrouped data on a discrete variable, this quantity, $s^2(\bar{x})$, which in general we denote by $s^2$, is called the variance. If you try to simplify this expression, you can write
$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 + \bar{x}^2 - 2x_i\bar{x}\right) = \frac{1}{n}\sum_{i=1}^{n}x_i^2 + \bar{x}^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i .$$
Since $\frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}$, the last term becomes twice $\bar{x}^2$, so the quantity becomes
$$s^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2 .$$
So the alternative expression for the variance is given by this, which is the same thing; you can actually use either of these expressions to compute the variance.
And similarly, when we have grouped data on a continuous variable, the variance is defined as
$$s^2 = \frac{1}{n}\sum_{i=1}^{K} f_i\,(x_i-\bar{x})^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{K} f_i x_i, \qquad n = \sum_{i=1}^{K} f_i ,$$
where K is the number of classes. If you want to simplify this expression as we did in the earlier case, it comes out to be
$$s^2 = \frac{1}{n}\sum_{i=1}^{K} f_i\left(x_i^2 + \bar{x}^2 - 2 x_i\bar{x}\right) = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 + \bar{x}^2\,\frac{\sum_{i=1}^{K} f_i}{n} - 2\bar{x}\,\frac{\sum_{i=1}^{K} f_i x_i}{n} .$$
The middle factor $\sum_{i} f_i / n$ equals 1, and the last sum equals $\bar{x}$, so this quantity becomes
$$s^2 = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 - \bar{x}^2 ,$$
which is the same quantity given over here. So either of these expressions can be used to compute the variance.
Now, after giving the definition of variance, I would like to address one more aspect. You have seen that in the definition of variance, I take the average of n values: it is defined as $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$. In statistics, there is another expression for the variance that is quite popular: in the case of ungrouped or discrete data it is $\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$. You can see that this expression is like the earlier one; the only difference is that instead of n, the divisor is n - 1. This is what you have to keep in mind, and similarly, in the case of continuous or grouped data, the divisor that was earlier n now becomes n - 1. Now an obvious question comes: why am I using this expression? Actually, the properties of this expression with divisor n and with divisor n - 1 are different. If you have the idea of an unbiased estimator, in the context of statistical inference, then when I use the divisor n - 1, this form of the variance is an unbiased estimator of the population variance, whereas when I use the divisor n, then $\frac{1}{n}\sum_{i}(x_i-\bar{x})^2$ is not an unbiased estimator of the population variance. That is why many software packages use the definition with divisor n - 1; for example, the R software uses this definition. For that reason, I would like to advise you that whenever you are using any software, please look into the manuals of the software and see what it is doing. Well, in case the data is very large, the two values of the variance may not differ much, but if the data is small, the values computed using the divisor n or n - 1 may differ. So you should know what you are obtaining; it should not happen that you are assuming the divisor is n while the divisor actually used inside the software is n - 1.
Refer Slide Time:(16:15)
Now, after the variance, I come to another concept, called the standard deviation. You have possibly come across two terms: one is standard deviation and the second is standard error. In general, people do not differentiate between these two names, but here I will try to explain to you the difference between the two. To start with, I will use the common terminology and speak of the standard deviation. Right! So, when I say that s square denotes my variance, it is actually the sample variance; sample variance means the variance calculated on the basis of a given sample of data. Right! So essentially we are computing the sample variance, but we always call it, without loss of generality, the variance. When I take the positive square root of s square, this is called the sample standard deviation. Once again, you can see that I am writing the word sample inside brackets just to indicate that the common language is simply
the standard deviation, but here we are actually computing the standard deviation on the basis of a given sample of data. Right! Now, what does it mean to say that the sample variance, or the variance, or the standard deviation, has been computed on the basis of a given sample of data? In statistics, what happens is that you usually collect a sample of data, and on the basis of that sample your objective is to estimate the population counterpart. You may recall that in the beginning of the lectures we discussed the concepts of population and sample. Suppose I would like to count the people who eat more rice than wheat in a country like India, which is a very big country. If I want to find the average number of such people, what do I have to do? I have to count such persons all over the country, which is very difficult unless you execute a census. So we take a sample of data: I choose a small number of observations, and based on them I find the mean, and that is called a sample mean. Similarly, if I find the variance of the data that I have collected in the sample, that is called the sample variance, but there will always be a population counterpart, the population mean or population variance. The population mean means the arithmetic mean over the entire population, and similarly the population variance means the variance over all the units in the population. When we take the positive square root of the population variance, this is called the standard deviation. But the problem is that we do not have the entire population in our hands, so we always work on the basis of a sample of data, and that is why, in common language, people do not differentiate much between the two terms, standard error and standard deviation. But once you are doing a course in
statistics, as a student you must know the difference, and that is my idea in explaining this concept here.
So now I will denote by sigma square the population variance, and the positive square root of sigma square will be called the standard deviation, or actually the population standard deviation. As I said, using sigma to denote the population standard deviation and sigma square to denote the population variance is the more popular notation among practitioners.
Now the next question comes: what is the advantage of having the standard deviation, or standard error, in place of the variance? Suppose I am collecting data on the heights of some children, measured in metres. The arithmetic mean will then be in metres, but the variance will be in metres squared. If there are two data sets whose variances are, say, 16 and 36, it is more convenient to compare the two values if they are in the same units as the original data. So we take the positive square roots of 16 and 36, which are 4 and 6, respectively. Now, once I say that I have a data set in which the heights of the children were measured in metres, with an arithmetic mean of, say, 1.2 metres and a standard deviation of, say, 0.5 metres, it is easier to understand; and similarly, if I say that I have two data sets whose standard deviations are 4 and 6, it is easier to see that the standard deviation of the second data set is higher than that of the first. That is really the only reason. So the standard deviation or standard error has the advantage that it has the same units as the original data, which makes it easy to compare. For example, if I have a variable whose observations, denoted by small x, have been obtained
in the unit of metres, then the variance s square will be in metres squared, which is not so convenient to interpret. On the other hand, if I have obtained the observations x in metres, then the standard deviation s will also be in metres, which is more convenient to use and to interpret. That is the reason why people prefer the tool of standard deviation or standard error.
Now the question comes: what do the variance and standard deviation actually measure? The variance, or equivalently the standard deviation, measures how much the observations vary, or how the data is concentrated around the arithmetic mean. For example, suppose I take one data set like this and another data set like this. You can see that in both cases the mean is somewhere here, but the deviations in the case of the red dots and in the case of the green dots are different: the deviations in the case of the red dots are larger. So if I compute the variances of the two data sets, say variance 1 and variance 2, on the basis of the given data, I will find that variance 1 is smaller than variance 2. So whenever I have variances of, say, 4 and 10, this obviously indicates that the data of the first data set is more concentrated around the mean value, and the data with variance 10, which I am denoting by the red dots, is more scattered around the mean value, like this. So this is how we interpret it.
So, obviously, when we want to make a decision on the basis of given values of the variance, a lower value of the variance, or equivalently of the standard deviation or standard error, indicates that the data is highly concentrated, or less scattered, around the arithmetic mean, whereas a higher value of the variance, or equivalently of the standard deviation or standard error,
indicates that the data is less concentrated, or highly scattered, around the mean. So this is the interpretation,
and, on the other hand, if I have a data set with a higher value of the variance, or of the standard error or standard deviation, then I can simply say that the data set has more variability. In statistics, we usually prefer a data set which has a lower value of the variance, or a lower value of the standard deviation or standard error. So in case I have two data sets and we compute their variances, say var1 and var2, then if var1 is greater than var2, we say that the data of the first data set has more variability, or less concentration, around the mean, than the data of the second.
Now, there is a very simple rule: we would always like to have data with lower variance, and if you remember, in the initial slides I discussed that one of the basic objectives of a statistical tool is that we would always like to have data in which the variability is less. Okay.
Now, if you compare the variance and the absolute mean deviation: you know that when there are some outliers or extreme observations in the sample, the median is less affected than the arithmetic mean. In such cases, when the data has very high variability, or extreme observations or outliers, using the absolute mean deviation is a better option, and it is preferred over the variance or standard deviation. The variance, on the other hand, has its own advantages: if you are working with statistical tools, the variance has some nice statistical properties. From the statistics point of view, from the algebra point of view, and from the statistical analysis point of view, it is easier to operate on the variance mathematically and algebraically than on an absolute function like the absolute mean deviation.
Refer Slide Time:(29:10)
Next, we try to understand the difference between the standard deviation and the standard error. To understand this, we have the concept of a statistic. You see, the spelling of the subject statistics is s t a t i s t i c s; here we are not using the last s, and this is called a statistic. A statistic is a function of random variables. So if you have random variables X1, X2, ..., Xn, then any function of X1, X2, ..., Xn is called a statistic. For example, if you have random variables X1, X2, ..., Xn and you find their arithmetic mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, then this X bar is itself a statistic.
Now, the concept is very simple. Whenever you find the standard deviation of a statistic, the outcome is again going to be a function of only the random variables, and this standard deviation is called the standard error. So whenever we find the standard deviation of a statistic, we call it the standard error.
Refer Slide Time:(30:44)
What does this mean, actually? Ideally, what happens is that the standard deviation is a function of some unknown parameters. For example, suppose mu represents the population mean, the mean of all the units in the population, which is very, very large; practically, it is very difficult to find the mean of the entire population, so usually it is unknown. Then, ideally, the standard deviation is defined as the positive square root of the variance of all the values of x, denoted as
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2},$$
where mu is actually the population mean of all the values xi. But since this mu is not known, this is unknown, you cannot compute it; this value cannot be computed. Why?
18
544
Refer Slide Time:(32:02)
So, then in that case the $\sigma^2$ cannot be found. One option is that we can replace the value of $\mu$ by its sample counterpart. When I say $\mu$, this is the population mean, which is equal to $\frac{1}{N}\sum_{i=1}^{N} x_i$, where $N$ is the total number of units in the population, that is, the population size. So, I try to replace it by the sample value: for the sample, I take $\frac{1}{n}\sum_{i=1}^{n} x_i$ over the $n$ observations, and I denote this arithmetic mean by $\bar{x}$. So, $\mu$ is unknown to us, but the sample mean $\bar{x}$ is known to us. So, what can I do? I can replace $\mu$ by the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and then in that case the standard error, which is the positive square root of the variance, becomes $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, and this quantity is called the standard error.
So, always remember, a standard error will always be a function of the observed values. In simple language, I can say that the standard error will always refer to a standard deviation which can always be computed on the basis of a given sample of data. You have got the data, you are asked to compute the standard deviation, and you can compute it. More specifically, in place of the population variance, this now becomes $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ for the case when we have ungrouped data, and $s^2 = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2$ in case we have grouped data. So, this is basically the definition of the sample variance, and in common language we usually do not call it the sample variance again and again, but we
simply say: find out the variance of the given set of data. Now, after this, I will come to the aspect of how we are going to compute the variance or standard deviation in the R software, Right?
So, I will take the first case here, the case of ungrouped data. In this case the data vector is going to be denoted by a small x, and the R command for computing the variance is var, and inside the argument you have to give the data vector. But remember one thing: this command var(x) gives the variance with divisor $n-1$, that is, $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. So, in case you want to obtain the quantity $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, what I can do is multiply by the factor $\frac{n-1}{n}$, since $\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. So, in case you are very particular about getting the divisor $n$ in the variance, then in that case I would like to suggest that you multiply the
variance of x by the factor $\frac{n-1}{n}$. How? For example, you can see here, now I am writing in red color: this quantity will give you the variance of x, and this is the factor by which, if you multiply the variance, you will get the variance with divisor $n$, where $n$ is the length of the vector x, that is, the number of observations present in the data set.
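As a quick sketch in R (the data vector here is hypothetical, just to illustrate the two divisors):

    x <- c(4, 8, 6, 5, 9, 7)     # hypothetical data vector
    var(x)                       # variance with divisor n - 1
    n <- length(x)
    var(x) * (n - 1) / n         # variance with divisor n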
Now, we are going to consider the case when we have grouped data. In the case of grouped data, you know that there is no built-in command in the base package of R. So, I need to compute the mean value of the given data set separately, along with the midpoints and the frequencies from the frequency table. In this case we compute the mean separately, say as xmean, and the expression for the variance was $\frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2$. So, now this $\bar{x}$ becomes xmean, and the quantity (x - xmean)^2 has to be multiplied by the frequency vector f; then all the elements of this vector are summed and divided by n, which is the sum of all the elements in the frequency vector f. So, this is how the expression to find the variance in the case of grouped data has been obtained, as sketched below.
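A minimal sketch of this computation, assuming hypothetical midpoints and frequencies:

    x <- c(10, 20, 30, 40)           # class midpoints (hypothetical)
    f <- c(3, 7, 6, 4)               # absolute frequencies (hypothetical)
    n <- sum(f)                      # total number of observations
    xmean <- sum(f * x) / n          # arithmetic mean for grouped data
    sum(f * (x - xmean)^2) / n       # grouped variance with divisor n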
And in case you have some missing data: in the case of ungrouped data, if the data vector x has some missing values recorded as NA, then we denote this data vector as xna, and in that case the command remains the same, but I have to give the option na.rm = TRUE, Right.
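For instance, with a small hypothetical vector containing missing values:

    xna <- c(NA, NA, 4, 8, 6, 5)     # hypothetical data with missing values
    var(xna, na.rm = TRUE)           # variance after dropping NA values (divisor n - 1)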
And similarly, if you want to compute the standard deviation, that is very simple: simply find the square root of the variance that you obtained earlier. So, if I take the square root, using the function sqrt, of the variance of x that we had obtained earlier, this is going to give you the standard deviation or the standard error. But in this case, notice that the divisor is going to be $n-1$; in case you want the divisor to be $n$, then simply take the square root of the earlier expression that we had obtained, Right. So, finding the standard deviation is simply equivalent to finding the square root of the variance. And similarly, in the case of grouped data also, whatever expression you obtained before for computing the variance, just take its square root, and this is how you can obtain the standard deviation there as well.
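Putting these together, a brief sketch (base R also provides sd(x), which likewise uses divisor n - 1):

    sqrt(var(x))                     # standard deviation, divisor n - 1
    sqrt(var(x) * (n - 1) / n)       # standard deviation, divisor n
    sqrt(var(xna, na.rm = TRUE))     # standard deviation, ignoring missing values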
Refer Slide Time: (39:23)
Now, I take the same example, or the same data set, that I considered earlier, where we have the data of 20 participants who participated in a race, and the time taken to complete the race was recorded; this data is stored in the data vector time. Now I try to find the variance of this data vector: I use the command var with the name of the data vector, time, inside the argument, and I get the value 283.3684. Here you can see the divisor is $n-1$; that you always have to keep in mind, and I will show you how to find the variance with divisor $n$ in the next slide. Similarly, if you want to find the standard deviation, you simply have to take the square root of the variance. So, on whatever variance I have obtained here, I operate the function sqrt, which finds the square root, and I get the standard deviation.
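In R, the two steps look like this (the 20 recorded times are those entered in the lecture; they are not repeated here):

    # time <- c(...)   # the 20 race-completion times
    var(time)          # 283.3684, variance with divisor n - 1
    sqrt(var(time))    # the corresponding standard deviation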
Refer Slide Time: (40:30)
Similarly, if you want to have the variance or standard deviation with the divisor $n$, then in this case we have learned that we need to multiply the variance of x by the factor $\frac{n-1}{n}$, where $n$ is the length of the data vector. So, I simply multiply the variance that we obtained earlier by $\frac{n-1}{n}$, and I get the value of the variance with divisor $n$; and similarly, if I take the square root of this value, I get the standard deviation with divisor $n$.
So, you can see that it is not difficult to obtain the variance or standard deviation of a given set of data, and here is the screenshot which I will show you. Next, we try to find the variance in the case of grouped data. We consider the same example and convert it into grouped data: we find the frequency table, and from there the frequency vector, the vector of midpoints, and the arithmetic mean. We have already discussed how to find the frequency vector and the vector of midpoints, and how to create the frequency table, so I will not discuss it here; I will only very briefly give you the background, so that you can look into the earlier lectures to see how to get it done.
So, you see, the given data has been classified into 6 class intervals; these are the midpoints, and these are the absolute frequencies that we have already obtained. Now we need to find the frequency vector and the mean from the given data.
So, we had used the breaks with the cut command, and on the outcome of this cut command we had created a frequency table using the table command; after that we had applied as.numeric to the output of the earlier command, which has given us the frequencies like this, and we had created the vector of midpoints from the given output like this, Right. So now we have obtained the x and f vectors. Having obtained x and f, I need to find the mean of x, which in this case is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{K} f_i x_i$, Right. So you can see here, I define xmean as the sum of $f_i x_i$ divided by n, which is the sum of f, as in the sketch below.
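A sketch of that pipeline; the class boundaries below are illustrative stand-ins for the breaks used in the earlier lecture:

    breaks <- seq(30, 90, by = 10)                   # hypothetical class boundaries
    tab <- table(cut(time, breaks))                  # frequency table of the classes
    f <- as.numeric(tab)                             # frequency vector
    x <- breaks[-length(breaks)] + diff(breaks)/2    # midpoints of the classes
    xmean <- sum(f * x) / sum(f)                     # arithmetic mean for grouped data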
Refer Slide Time: (43:12)
Now I use the command, or syntax, that we had defined earlier on the given set of data, and this gives me the value 269; and if I find the standard deviation of this quantity, it gives me the value of the standard deviation in this case.
And, this is the screenshot of whatever we have obtained here.
And similarly, at the same time, I will show you how to handle the situation when you have some missing data, how to find the variance and standard deviation. In the earlier example, I take the first two values to be missing, and I denote the data inside a new vector time.na where the two values are missing. If I want to find the variance, it is simply the variance of time.na with the option na.rm = TRUE, and this gives me the value of the variance as 250.2647; this has the divisor $n-1$, and in case you want to convert it to the divisor $n$, you know how to get it done. And if I take the square root of this value, it gives me the standard deviation when we have missing data.
And this is the screenshot. Now I will come to the R console and show you how you are going to obtain these values, Okay. So I first take the time data: you can see here, I am copying the data on time, I put it in the R console, and I find the variance of time, and you see I get this value. If I want to find the standard deviation, I simply have to find its square root, and you can see here this is the value. These values have been obtained with the divisor $n-1$; if you want to have the divisor $n$ in the variance, then you just use the same command as before and you get the outcome, and if you want the standard deviation with the divisor $n$, then you simply have to find the square root of the earlier expression, and this gives you the value 16.40732. This is the same screenshot which I have shown you here, Right. And similarly, if you want to find it in the case of grouped data, then first I need to create the data vectors x and f, which I have already done.
So, you can see here, I clear the slides: x here is my midpoints, f is here like this, and if I find the mean in the case of grouped data, it comes out to be 56. Now if I find the variance in this case, by the same command that we discussed, this is 269, and if you want to find the standard deviation, you simply have to find its square root, which comes out like this. And similarly, if you want to see how things work in the case of missing values in the data, I simply need to use this command: you can see that this data time.na I have already copied, and if you find the variance with na.rm = TRUE, this gives me this value; if you want to find the standard deviation in this case, it is simply the square root of the value that we have just obtained, and it comes out like this. And if you try to see, this is the same screenshot that I have shown you here.
So now, I would like to stop this lecture. We have already discussed the concept of variance, which is one of the very popular tools to quantify the variability in the data. Please try to understand it, please try to grasp it, and try to see how this measure has been developed. Please take some data sets and try to compute the variance with divisor $n$ and with divisor $n-1$, try to compute the mean squared errors around any arbitrary value, and get comfortable computing the variance, standard deviation, and standard error using the R software. So, you practice, and we will see you in the next lecture. Till then, good bye.
Lecture – 21
Welcome to the next lecture on the course descriptive statistics with R software. Now, you may
recall what we have done up to now: we have considered two aspects of data, one is the central tendency of the data and another is the variation in the data, and both these aspects are very important parts of the information contained inside the data. Now, I am coming to another aspect: suppose we have a data set and we want to know the variation in the data set in a way that also depends on the measure of central tendency. What does this mean? Up to now we have taken the aspects of central tendency and measure of variation separately, one by one. Now, I would like to have a measure which can give me the information contained inside the data based on the arithmetic mean and the variance together. This will help us in getting an idea about the variability in the data in various types of situations where using either the mean or the variance alone may not really be advisable, and may not really give us the correct information. So, in this lecture we are first going to discuss a statistical tool to measure the variation, called the coefficient of variation, and after this I will consider one quantitative measure and another graphical measure, based on the R software, to have combined information on various aspects.
So, let us start our discussion with the coefficient of variation. The coefficient of variation measures the variability of a data set without reference to the scale or units of the data. What does this mean? Suppose I have two different data sets, one measured in, say, centimeters and another measured in, say, meters. In case you simply try to compare the means and variances of the two data sets, it might be a little difficult. Similarly, suppose you want to compare house rents in India and house rents in the US: the house rents in India are given in Indian rupees, and they are with respect to the salaries that we get here. Similarly, if you go to the US, the house rents are going to be in terms of US dollars, and they are also with respect to the salary structure in the US, and the salary structures in the US and India are very different. So, sometimes you have heard that people simply multiply the dollar amount by the exchange rate and say, Okay, I am earning this much, or I am spending this much, or I am paying this much house rent. So, how to handle these types of situations? That can be done using the concept of coefficient of variation, Right.
So, this coefficient of variation will be useful to us when we try to compare results from two different surveys, or from two different tests, obtained on different scales. For example, suppose I have two data sets, and we find the arithmetic means and standard errors of the two data sets. The sample mean of the first data set is, say, $\bar{x}_1$, and $\bar{x}_2$ is the arithmetic mean of the second data set. Similarly, I find the standard errors $s_1$ and $s_2$: the standard error $s_1$ corresponds to the first data set, and the standard error $s_2$ corresponds to the second data set. Now, there are two aspects, mean and standard error, or central tendency and variation. How to compare the two data sets? That is the question we are going to entertain here.
Now, the definition of the coefficient of variation. As we had discussed in the earlier lecture in the case of variance, the variance can be for the entire population, which is usually unknown, or for the sample, which is called the sample variance and which is computed. Similarly, in the case of the coefficient of variation, we have two versions, one for the population and one for the sample. But here, when we are discussing the tools of descriptive statistics, we want to compute everything on the basis of a given sample of data. So, I am going to discuss here the sample version of the coefficient of variation; please keep this in mind. Okay, so, once I have the data $x_1, x_2, \ldots, x_n$, which can be either grouped or ungrouped, the coefficient of variation is defined as the standard deviation upon the mean. If you remember, we had denoted the sample variance by $s^2$; the standard deviation, or in simple language the standard error, whatever you want to call it (I am treating both as having a similar meaning, without any loss of generality), is denoted by $s$, and the sample mean we have already denoted by $\bar{x}$. So, this coefficient of variation, briefly denoted as CV, is defined as $CV = \dfrac{s}{\bar{x}}$. Now, the standard deviation will always be positive, so this coefficient of variation is properly defined only when the mean is positive, that is, $\bar{x} > 0$. This definition of CV is based on the sample. If you want to understand the population counterpart: if $\sigma^2$ is the population variance and $\mu$ is the population mean, then the CV of the population is defined as $\sigma / \mu$. But since we do not know the values of $\sigma$ or $\mu$, we replace $\sigma$ by $s$ and $\mu$ by $\bar{x}$. This gives us the sample-based definition of the coefficient of variation, which can be computed using the data $x_1, x_2, \ldots, x_n$, Right.
Now, the next question is how to take a decision, because the coefficient of variation is also a measure of the variation in the data. So, if I have two data sets, how are we going to compare them? Suppose I take two data sets in which the arithmetic mean of data set 1 is greater than the arithmetic mean of data set 2, and suppose the standard deviation of the first data set is smaller than the standard deviation of data set 2. So, what is happening? In one data set the mean is larger but the standard deviation is smaller, and in the other data set just the opposite is happening. In that case, which of the data sets do you choose? That cannot be answered directly by using the mean or the standard deviation. So, in these situations the coefficient of variation helps us, and we simply find the coefficient of variation of both data sets. The higher value of the coefficient of variation indicates that the variability is higher; so the data set with the higher CV is said to be more variable than the other.
So, that is again similar to the interpretation of variance: higher the value of variance, more the variability; and higher the value of CV, more the variability. Just to explain in more detail, let me take a simple example, where two experimenters have collected data on the heights of the same group of children: one experimenter has taken the observations in meters and the other has taken the observations in centimeters, and they have found the average height and standard deviation of the two data sets. You can see here, I have tabulated the information for the first and the second experimenter. The first experimenter has found the average height to be 1.50 meters and the standard deviation to be 0.3 meters; similarly, the second experimenter has found the average height to be 150 centimeters and the standard deviation to be 30 centimeters. Now, the usual tendency is to compare the standard deviations. You can see here that the value of the standard deviation is 30 in the second case, whereas it is 0.3 in the first case. Obviously, at first look it appears that 30 is much, much greater than 0.3, and this would indicate that the variability in the second data set is very high. But this conclusion is actually wrong, because both sets of measurements have been taken on different scales and have the same values: the means, say $\bar{x}_1$ and $\bar{x}_2$, have the values 1.50 meters and 150 centimeters, which are the same, and similarly the standard deviations are 0.3 meters and 30 centimeters, which are also the same.
So, in this case how to report it, how to identify it, how to know it? That is the question. In this case the coefficient of variation comes to our rescue. If I find the value of the coefficient of variation in both cases, then in the first case the coefficient of variation comes out to be the standard deviation divided by the mean, which is 0.3 divided by 1.5, and this comes out to be 0.2. And in the second case also, the coefficient of variation comes out to be 30 upon 150, which is equal to 0.2. So, you can see that both values are the same, and this indicates that both data sets have the same variability. And this was not possible to see by comparing the standard deviations alone.
So, the coefficient of variation helps us in comparing data sets on two different scales. The advantage the coefficient of variation has is that it is dimensionless, and this helps us in comparing the variation in two data sets. For example, in India, take the example of house rents in a metro city and in a village: we know that the rents in a metro city in India are very high, whereas the rents in a village are very low. And similarly, take another example: the rents of houses in Mumbai will be in Indian rupees, and the rents of houses in London will be in pounds. Now, how to compare when the data has been obtained in the same unit, as in the first case, where we compare the rents in an Indian metro city and a village, and how to compare when the data has been obtained in different units, as in the case of the rents of houses in Mumbai and London? Again, in both cases the coefficient of variation helps us.
Refer Slide Time: (14:57)
Now, the question is how to use this concept of coefficient of variation in making a decision. The data set having the higher value of the coefficient of variation is considered to have more variability; so, definitely, a lower value of the coefficient of variation is preferable. For example, in case I have two data sets and suppose we have computed their coefficients of variation as CV1 and CV2: if CV1 is greater than CV2, then we say that the data in the first set has more variability, or less concentration around the mean value, than the data in the second set. And similarly, in the opposite case, if CV1 is smaller than CV2, that means the first data set has smaller variability than the second, Right.
Now, the next aspect is how to compute it in the R software. As such, there is no built-in command inside the R software to compute the coefficient of variation. But computing the coefficient of variation is very simple and straightforward: it is only a function of the standard deviation and the arithmetic mean, and we have already learnt how to compute the standard deviation and the arithmetic mean. So, just by using the same commands, we can always compute the coefficient of variation. So, if we have a data vector x, the coefficient of variation is computed like this. What is this? The coefficient of variation is simply the standard deviation upon the mean. The standard deviation is nothing but the square root of the variance, and the mean is given by the function mean(x). So, this is what I am doing: first I find the square root of the variance, which gives me the standard deviation, and I divide it by mean(x), and this gives me the value of the coefficient of variation.
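As a one-line sketch in R:

    cv <- sqrt(var(x)) / mean(x)     # coefficient of variation of the data vector x
    cv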
Now, if I ask you how you would compute the coefficient of variation in case the data is missing, it is very simple. How? We have already learnt how to compute the standard deviation when the data is missing, and we have also learnt how to compute the arithmetic mean when the data is missing. So, you simply have to use the same syntax, the same functions, to write the expression for computing the coefficient of variation. If you recall what we had done earlier: suppose my data vector x has some missing values, which are denoted by NA, and I denote this data vector as xna. Then I find the variance of this xna by giving the argument na.rm = TRUE inside the command; this helps us in finding the variance when the data vector has missing values, and then we take the square root, which gives me the standard deviation in the presence of missing data. And similarly, the mean function on the data vector xna with the argument na.rm = TRUE gives me the value of the mean when the data is missing. By using these commands together, we can always find the value of the coefficient of variation. And, using the same approach, you can also write down the syntax and commands for computing the coefficient of variation in the case of grouped data.
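A minimal sketch, reusing the vector xna from before:

    cv.na <- sqrt(var(xna, na.rm = TRUE)) / mean(xna, na.rm = TRUE)
    cv.na                            # coefficient of variation, ignoring missing values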
Now, I will take a small example to show you how to compute it. Suppose I have collected the data on 20 participants, the time taken in a race, and this data has been recorded inside the variable time. This is the same example that I have used earlier. Now, in case I want to find the coefficient of variation of time, the CV of time is simply the standard deviation of time divided by the mean of time. So, I use this syntax, and it gives me the value of the coefficient of variation.
And suppose the data is missing: in that case, I have the same data in which I have replaced the first two values by NA, and this data has been stored inside a new data vector time.na. I use the same commands to compute the standard deviation in the presence of missing values and the mean in the presence of missing values, and I take the ratio, which gives me the coefficient of variation. This value comes out to be 0.27.
And you can see here the screenshot of the same thing that I have done. Now, I take the command, and this data has already been copied, which is the same data set named time. If I try to find the coefficient of variation, it gives me this value, Right. And similarly, in case you want to find it in the presence of missing values: I have already stored the data time.na, you can see here, and if I use the same command on this data set, you get the same value, and the same value has been reported here in this slide.
Now, I have completed the different types of measures of variation that we had aimed for. Next, I am trying to address another aspect: when we get the data, the data has all sorts of features, and we would like to have all these features, like measures of central tendency, partitioning values, variation, and so on. Up to now, we have taken all these aspects one by one: how to compute the maximum among the data values, the minimum, the range, quartiles, and so on. In the R software there is a command by which you can compute all these values, like the minimum value, maximum value, and different quartiles, in a single shot.
So, I am now going to discuss this command, which is the summary command. In R, it gives comprehensive information on the different quartiles, the mean, and the minimum and maximum values of the data set. If my data vector is denoted by x, then we use the command summary, spelled s u m m a r y, with x inside the argument. If you use this, then the outcome of this command gives us the minimum value, maximum value, arithmetic mean, first quartile, second quartile (this is the median), and the third quartile of the data set.
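A quick sketch of the call:

    summary(x)    # Min., 1st Qu., Median, Mean, 3rd Qu., Max.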
Now, let us take an example to understand it. I take the same example that I took earlier, where we have the data on 20 participants, the time taken in a race, and the data has been stored in the variable time. When I execute summary(time), we get an outcome like this one. You can see there are six values: first, second, third, fourth, fifth, and sixth. Now let us understand what they are giving us. The first value is the minimum of the values contained in the time data vector, which is 32; you can see here, this is the 32. The second value is the value of the first quartile; remember, this is the quartile, not a general quantile, Right. You know how to compute the first, second, or third quartile, but here you need not compute it separately, because it is given inside the same outcome. The third one is the median, which is the second quartile. The fourth value is the mean, that is, the arithmetic mean. Similarly, the fifth value is the value of the third quartile, and the last one, the sixth, is the maximum value of the observations, which here is 84;
you can see here, this is the value. So, by using this summary command you can get all this common information in a very comprehensive way, and that is its advantage. And here you can see the screenshot; I will show you on the R console.
If you see here, the data on time is already stored there, and if I write summary(time), we get these values, Okay. But let us now come back to our slides, and we try to discuss another aspect. You can see that this summary command gives you different types of information, minimum, maximum, quartiles, mean, in a comprehensive way. When we started the discussion on descriptive statistics, I had told you that there are two types of tools: quantitative tools and graphical tools. So, the next aspect is this: can whatever information has been provided by the summary command be represented in a graphical way? And what will be the advantage of this? Suppose you have two data sets, or even more than two data sets, and you want to compare all the characteristics at the same time: you can use the summary command on all the data vectors, and you can also create the graphics. The graphics will give you a visual comparison of the information contained inside the different data sets, and in order to do so, there is a graphic which is called a box plot. So, now I will discuss what a box plot is and how to construct it in the R software.
So, this box plot is actually a graph which summarizes the distribution of a variable by using different types of information, like the median, quartiles, minimum, and maximum. Remember, this will not give you the value of the mean. The box plot looks like this: you can see in the graph there are two lines, one at the bottom, please try to observe the pen, and another at the top, in green color. The bottom line is giving the minimum value of the data set, and the upper line is giving the maximum value. Now, if you find the difference between the minimum and maximum values, what will you get? You will get the value of the range. Similarly, if you look at these three lines which I am indicating here, first, second, and third, you can see they form a sort of box, and the lower edge of the box, which is here, is giving you the information on the first quartile. Now, I will change the color of the pen, so you can see it clearly.
Similarly, the upper edge, which is here, is giving us the information on the third quartile. So, if you find the difference between the two, don't you think this will give you an idea of the quartile deviation and, in some sense, the information on the interquartile range? These two measures we had discussed as measures of variation. Finally, if you look at the line in the middle of the box, this line gives you the information on the median, which is the second quartile Q2. So, you can see that inside this box several measures are combined: just by looking at this difference and this difference you can compare the variation, and by looking at the middle value you can compare the central tendency.
So, let us first see the application through the software. Inside the R software there is a command boxplot, and inside the argument you have to give the data vector for which you want to create the box plot. There are several further arguments, that is, several options available for plotting the box plot; I would request you to go to the help menu and see what the different arguments are and what their uses are in creating the box plot, for example for adding legends, titles, and so on.
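A minimal sketch of the call, with a couple of common optional arguments (see ?boxplot for the full list; the title text is just illustrative):

    boxplot(x)                                           # basic box plot of the data vector x
    boxplot(x, main = "Race times", col = "lightblue")   # optional title and fill color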
So, let me take the same example which I considered earlier, the same data set on time, and I have created the box plot of time. You can see this upper line gives me the maximum value of time, which is 84, somewhere here; and similarly, the bottom line gives me the minimum value of this data set, which is 32, somewhere here. Now, these three values, the bottom edge, the middle line, and the upper edge, give me the information on the first quartile, the second quartile in the middle, and the third quartile at the upper edge. You can see these values are somewhere here, so you can compare them with the values that you just obtained with the summary command, and by looking at the difference of these two edges here, and these two lines here, you can get an idea of the variation in the data in terms of the range and the quartile deviation, or the interquartile range. Now you can see why this plot is called a box plot: all the information is contained inside this box. Now, what is the use of this box plot, how is it going to help us?
Refer Slide Time: (32:38)
So, now, first please try to notice what the first two values in the data set are.
Now, what I do: in the same data set, I change the first two values to 320 and 350, very high values, and I give this data a new name, time1; then I create the box plot of time1. You can see that this box plot is very different from what you obtained in the earlier case. Not only this, it is also showing that there are two observations which are possibly extreme observations. So, this graph is giving you information: when you analyze the data, please take a look at the values which are somewhere between 300 and 350. These values are unusual because they are very far away from the remaining data; all our other data lies between 50 and 100, whereas these two values lie between 300 and 350. But my objective was to compare, so I artificially put the two plots side by side.
So, I have simply copied and pasted the two graphs manually, and you can see that the first graph is for the data set time and the second graph is for the data set time1. But you can see we are still not very comfortable. Why? Because the ranges on the y-axis of the two plots are different. So they are not really comparable; well, they are comparable, but you need to put in harder work for the comparison. So, we would like to make a graph in such a way that there is only one boundary on the x and y axes, and both plots are inside the same boundary, so that they are comparable. In order to do so, what do we have to do? This is what I am going to show you now.
We have a graphic which is called a grouped box plot. This graphic combines different data sets and creates their box plots together, using the data combined in the format of what is called a data frame. What is this data frame, and why is it needed? Suppose we have three data vectors, which I am indicating by x, y, and z. What we want is the following: we want a graphic which is enclosed inside one rectangle, where there are three box plots indicating the box plots for the three data vectors x, y, and z. In order to create this combined box plot, first we need to combine the data, and to combine the data we have the concept of a data frame: we use the command data.frame, and inside the arguments we give the names of the data vectors separated by commas, and this gives us a combined data set in the framework of a data frame, as sketched below.
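A short sketch, assuming x, y, and z are numeric vectors of equal length (data.frame requires equal lengths):

    df <- data.frame(x, y, z)    # combine the three vectors column-wise
    boxplot(df)                  # one box per column, drawn on common axes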
Now, at this stage the question crops up: what is a data frame? Well, a data frame is a method to combine different types of data sets in R. It is not really possible for me to explain this concept in detail here, but these concepts have been explained in the lectures of the course Introduction to R Software; if you wish, you can have a look at those lectures, or you can look into the help menus of the R software to get an idea about the concept of a data frame, Right. In this lecture I will be using this command only to combine the data sets, so I am not going into detail; I have given you the command and how to use it. If you want some advanced features, advanced knowledge about this concept, I would request you to look into the books and other resources.
So, now my objective is very simple: I will create a box plot for the combined data set. What I do here is the following: I take the same data sets time and time1, I combine them into a data frame using the command data.frame, with time and time1 inside the argument separated by a comma, and I store this combined data in a new object.
After this, I use the command boxplot and give the combined data, which was obtained through data.frame, inside the argument. Now you can see that both graphics have been combined together, Right. This is the box plot for time, and this is the box plot for time1, and here you can see the indication of the presence of two extreme observations, Right. So, in this case we have combined the two box plots, but they are not really informative, because the ranges of the two box plots are very different; this one here is indicating that there are two extreme observations, so it gives a different picture from what we wanted. The ideal thing is that, after looking at this picture, you first remove the extreme observations and create the box plot again; you will see that this gives you more information, and I would like to illustrate the same.
So, this is the screenshot of the creation of the data frame of the time and time1 data.
Now, I will take another example to show you the utility of box plots. Suppose the marks of ten students in two different examinations are obtained, and we would like to compare the marks using the concept of box plots. The marks in the first examination are stored inside the data vector named marks1, and the marks of those ten students in the second examination are stored in the data vector marks2. What I want is the following: I want a graphic like this one, where there are two box plots, one indicating marks1 and another indicating marks2. In order to do so, first I create a data frame using the command data.frame, and inside the arguments I give the data vectors which I would like to combine, and I store this combined data in a new object.
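A sketch of the whole step; the marks below are hypothetical placeholders, not the values from the slide:

    marks1 <- c(45, 52, 60, 38, 71, 55, 49, 66, 58, 62)   # hypothetical marks, exam 1
    marks2 <- c(50, 64, 72, 41, 80, 59, 57, 75, 69, 70)   # hypothetical marks, exam 2
    boxplot(data.frame(marks1, marks2))                   # two box plots on common axes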
Now, this is the screenshot of this operation, and I create the box plot of the combined data which was obtained in the framework of a data frame. Now you can see a very nice and clear picture. By looking at these two green lines, you can easily compare which data set, or which group of students, has the lower minimum marks. Similarly, if you look at the pink lines which I am highlighting here, by comparing these two you can get an idea about the maximum marks obtained by the students. Similarly, if you look at the middle lines, these give you an idea of the medians, and by looking at these values you can see that the median in the case of marks2 is higher than the median in the case of marks1.
Similarly, if you compare this orange-colored part, it gives you an idea of the first quartiles; and if you compare the third quartiles, highlighted by these violet lines, you can see that the third quartile Q3 in the case of marks2 has a higher value than in the case of marks1. And you can see the range in marks1 and the range in marks2: very clearly, the range of marks2 is higher than the range of the data in marks1. Similarly, you can also compare the quartile deviations and interquartile ranges. So, this is how we can obtain it; now I will show you the construction of the box plots and the grouped box plots in the R software, Right.
So, you can see here we already have the data on time. If I create the box plot with the command boxplot(time), you can see it comes out like this.
Now consider time1, where we have increased the values of the first two observations. In this data set you can see that there are two extreme values, and if you create the box plot of the same, it comes out like this, which we had reproduced in the slides, Right.
And in case you combine the data, say datatime = data.frame(time, time1), you will get the combined data on time and time1 in the data frame mode, and you can see here this is the data which I have obtained. Now, I simply create the box plot of the same data set. So, you can see, if I create the box plot, it comes out like this, which we had shown in the slides.
And similarly, I take the data on marks: the data on marks1 is like this, and the data on marks2 is like this. You can see here, this is the data on marks1 and marks2, Right. And I would like to combine this data into a data frame. So, I use the data.frame command, and you can see this data has been combined. Now I create a box plot of the same thing, the two box plots of the marks data; you can see it looks like this, so you can compare them and draw some fruitful conclusions, Okay.
So, now I will stop here in this lecture, and with this I also complete the discussion on the topic of different measures of variation. We have discussed different types of measures of variation, and every measure will give you different information and a different numerical value. Your experience in dealing with data sets and using these measures will give you more insight into how to interpret them, how to say whether the variability is low or high; this is always a relative term. But remember one thing: from the statistics point of view, if the data has very high variability, then most of the usual statistical tools will not really work well. They will give you some information, but that information may be misleading. So, it is very important to use the appropriate tool to bring out the information from the data regarding its inherent variability. Different samples taken by different people from the same population may have different variation, and if you use different types of tools, ideally all the tools should give you the same information, although they will have different numerical values. So, I would request you to please look into the books, try to understand the concept of variation in the data, and try to look at the different drawbacks and advantages of all these tools; all the tools cannot be applied in all situations. And, more importantly, learn how to compute them in the R software. So, I would request you to take some data sets and try to employ all the tools you have learnt up to now, different measures of central tendency and different measures of variation, and try to see how these values provide you different pieces of information. So, you practice, and we will see you in the next lecture with a new topic.
Lecture - 22
From this lecture, we are going to start a new topic, and this is about moments. What are these moments, and why do we need them? You may recall that, up to now, in the earlier lectures, we have considered two aspects of the data information: first is the central tendency of the data, and second is the variation of the data, and we have developed different types of tools, like the arithmetic mean, standard deviation, variance, and so on, to quantify that information. Similar to this, there are some other aspects, like the symmetry of the frequency curve, or how the values are concentrated in the entire frequency distribution. Similarly, there is another aspect: what is the hump of the frequency curve? So, in order to study all these things, and some other important characteristics, we use moments. Essentially, you will see in this lecture that moments are some special types of mathematical functions, and using these moments we can quantify different types of information which is contained inside the data or the frequency table. So, let us first try to understand: what are these moments?
So, moments are essentially used to describe different types of characteristics of a frequency distribution, like symmetry, peakedness, etc. Now, let us try to understand how these things have been
developed. When I wanted to study the central tendency, we defined the arithmetic mean: we had observations $x_1, x_2, \ldots, x_n$, we just added them, and we divided by the number of observations, $n$. This gave us information about the central tendency of the data, that is, where the mean or average value of the data set lies in the frequency distribution or frequency curve, Right! Similarly, when we studied the variation in the data, we defined a quantity like the variance or the absolute deviation; I will show you both. In the case of the variance, we had the data $x_1, x_2, \ldots, x_n$, and we took the differences of these observations from their arithmetic mean, which we call deviations. So, we took the deviations of each observation around the arithmetic mean, obtaining $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$; after this, all the deviations were squared, and then we simply took the arithmetic mean of these squared deviations. And similarly, when we computed the absolute mean deviation, we considered the observations $x_1, x_2, \ldots, x_n$, we found the median of these observations, we took the deviations from the median, $x_1 - \bar{x}_{Md}, x_2 - \bar{x}_{Md}, \ldots, x_n - \bar{x}_{Md}$, we took the absolute values of these deviations, and then we just took their arithmetic mean. Now you
can see what we are trying to do. In the case of the arithmetic mean, we are finding $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; in the case of the variance, we are finding $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$; and in the case of the absolute deviation, we are finding the arithmetic mean of the absolute deviations of the $x_i$ from the median. So, now looking at these three examples, you can see what we are
trying to do: we are essentially considering the deviations $x_1 - A$, $x_2 - A$, $\ldots$, $x_n - A$ of the observations from some arbitrary value $A$, and then applying a suitable power function to these deviations. For example, if you look over here, please try to observe my pen, here you are using the square. I can make it more general: instead of using the square, I can replace it by some general power, say $r$. So, I consider the deviations $x_1 - A, x_2 - A, \ldots, x_n - A$ of the observations around any arbitrary value $A$, where $A$ is any value we have to choose. Now I take the $r$th power of these expressions, and then
I find their average, which I can express as $\frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r$. Now you can see that this function has several advantages: for example, if you take $r = 1$ and $A = 0$, you get nothing but the arithmetic mean $\bar{x}$; if you take $r = 2$ and $A = \bar{x}$, then you get the variance of $x$. Similarly, by choosing other values, we can define other functions which represent different characteristics of the frequency distribution.
So, in general, I can call this function the $r$th moment. This is what I have defined for observed data $x_1, x_2, \ldots, x_n$, that is, ungrouped data; if you want to define the same quantity for grouped data, it can simply be expressed as $\frac{1}{n}\sum_{i=1}^{K} f_i (x_i - A)^r$, and so on. So, this is the basic idea behind defining a
function which is called a moment. Now, what is the advantage of this? You will see that when we take specific values of $r$ and $A$, we get different types of characteristics. For example, when I choose $r = 1$ and $A = 0$, I get the arithmetic mean, which gives us the information on the central tendency of the data. Similarly, if I take $r = 2$ and $A$ equal to the sample mean, then this quantity gives us the information about the variation in the data. Similarly, there are other properties of the frequency curve, like its symmetry or its hump; how do we quantify that information? On similar lines, just as these two choices give us the information on central tendency and variation, we can use the concept of moments to quantify the information about the symmetry of the frequency curve or the hump of the frequency curve. But in order to do so, first we need to understand the concept of a moment and the basic definitions in the grouped and ungrouped data cases; besides the idea which I have given you here, there are some other aspects. So, in this lecture I am going to consider the raw moments and the central moments, and in the next lecture I will show you how to compute them in the R software. So, in this lecture please try to understand the basic concepts; they are going to help you in the forthcoming lectures.
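Anticipating that computation, here is a minimal sketch of the general idea (the function name is only illustrative):

    # r-th sample moment of a data vector x about an arbitrary point A
    moment.r <- function(x, A = 0, r = 1) mean((x - A)^r)
    moment.r(x, A = 0, r = 1)          # first raw moment = arithmetic mean
    moment.r(x, A = mean(x), r = 2)    # variance with divisor n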
Before going into further details, let me specify my notation for grouped and ungrouped data. I am going to consider two cases. Case one: the data is ungrouped, the variable $X$ is discrete, and we have obtained $n$ observations on $X$, which are denoted as small $x_1, x_2, \ldots, x_n$.
Refer Slide Time: (12:28)
Similarly, the second case is when we have grouped data: we have a variable, capital $X$, which is continuous in nature; the data is obtained on $X$ and has been tabulated in $K$ class intervals in a frequency table like this. The first column of the frequency table gives us the class intervals, $e_1$ to $e_2$, $e_2$ to $e_3$, and so on; these are the $K$ class intervals. The second column gives us the midpoints of these class intervals, denoted $x_1, x_2, \ldots, x_K$; so $x_1 = (e_1 + e_2)/2$, and so on. Here $x_1, x_2, \ldots, x_K$ are the midpoints. Similarly, in the third column we have the frequencies $f_1, f_2, \ldots, f_K$, denoting that the first interval $e_1$ to $e_2$, whose midpoint is $x_1$, has frequency $f_1$; the interval $e_2$ to $e_3$ has frequency $f_2$; and so on. The sum of all these frequencies is denoted by $n$. So, I can say that the midpoints $x_1, x_2, \ldots, x_K$ have the frequencies $f_1, f_2, \ldots, f_K$, respectively.
Now, after this, I first define the moments about any arbitrary point $A$. Let me define the general $r$th moment as we have just discussed: the $r$th moment of a variable $x$ about any arbitrary point $A$, based on the observations $x_1, x_2, \ldots, x_n$, is defined as follows. For the case of ungrouped, discrete data, it is simply the arithmetic mean of the $r$th powers of the deviations $x_i - A$, that is, $\mu_r' = \frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r$. This is denoted as $\mu_r'$; the symbol $\mu$ is read as 'mu', and the dash is called 'prime'. So, this is the basic definition of the $r$th moment for ungrouped data around any arbitrary point $A$. Similarly, if you have grouped data on a continuous variable, the same can be defined analogously: now I find the weighted mean of the $r$th powers of the deviations, where the weights are given by the frequencies, that is, $\mu_r' = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - A)^r$, Right! This is also denoted as $\mu_r'$. Although I agree that the $\mu_r'$ for the grouped and ungrouped data should have different notation, we use the same notation here, because at a time we are going to handle either the discrete case or the continuous case. So, this is how we define the $r$th moment of a continuous variable $x$ around any arbitrary point $A$, and here $n$ is the sum of all the frequencies. So, you can see, this is the same definition, extended to the grouped case.
Now, I give you the idea of what are called raw moments. You will see that most of these are simply particular cases of what we have already obtained. As you saw in the case of the arithmetic mean, we take the mean of the $x_i$; similarly, instead of taking $x_i$, I take the $r$th power of $x_i$, and this gives us the value of the $r$th moment, called the $r$th raw moment. If you look at the earlier case and substitute $A = 0$, you get the same thing. So, the $r$th sample moment around the origin, $A = 0$, is called the raw moment, and it is defined as $\mu_r' = \frac{1}{n}\sum_{i=1}^{n} x_i^r$ in case the data is discrete. Similarly, in case you have grouped data on a continuous variable, the definition for the discrete case can be extended to the continuous case, where you simply find the weighted average $\mu_r' = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^r$, and here $n$ is the sum of all the frequencies. So, this is how we define the $r$th
raw moment in this of continuous data, one thing you would like to notice here, that in the first
line of the definition I am using here a word sample, you can see my here pen, why I'm trying to
write down here the sample, and that is inside the bracket, what does this mean? Now, you can
see here, I will try to explain this concept in this color, so you can notice it, suppose I try to take
the first case, where I have defined the rth moment for the ungrouped data, you can see here, that
I am trying to take here the sum over the observed data, what does this mean, that I have taken a
sample of data and I am trying to find out the average over the sample values, but if you try to
see, there are two counter parts one is the sample and another is the population. What is the
sample, this is only a small fraction from the population. So, as we have computed this quantity
on the basis of a sample, the similar quantity can be estimated for the entire population units, and
essentially in statistics, we are interested in knowing the population value, but since this is
unknown to us, so we try to take a sample and we try to work on the basis of sample. So, as we
have defined the moments, or say rth moment or raw moment or other types of moment that we
are going to define, we are going to define on the basis of sample, but surely there exists a value
616
in the population. So, the counterpart of the sample is the population, and when we try to
compute the same moment, on the basis of all the units in the population, that would be called as
the population moment, or the moment based on the population values. So, that will be defined
here, for example, if I try to define this r , for the population values, and suppose if I say that
there are suppose, capital N values in the population, then in this case this is going to be i goes
1 N
from
N
x
i 1
r
i . So, N is here the total number of units in the population. Now, there are two
counter parts, one for the population and, say another for the sample but obviously our interest
here, is to compute the values on the basis of sample. So, in this lecture and in the previous
lecture, I will try to define these values, and you have to just be careful that whatever we are
doing here, they are on the basis of sample. So, in practice we always do not call it as a sample
moment, but we simply call it moment, but the interpretation is that, that the moments have been
computed on the basis of given sample of data. So, this is what you always have to keep in mind,
Now, let me consider particular cases of this raw moment. I simply choose two values, r equal to 1 and r equal to 2, and then,
in the case of ungrouped data, you can see that as soon as I put r equal to 1, I get

$$\mu_1' = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

and if you try to identify what this is, this is nothing but your arithmetic mean; this $\mu_1'$ is called the first raw moment. Similarly, if you put r equal to 2, I get

$$\mu_2' = \frac{1}{n}\sum_{i=1}^{n} x_i^2,$$

and this quantity $\mu_2'$ is called the second raw moment. If you try to recall where this quantity occurred, you may remember that when you wrote the expression for the variance of x, it was written as

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$$

So, you can see that the first term here is nothing but your second raw moment, and the quantity $\bar{x}$ is simply the first raw moment. This gives you an idea of why these raw moments are needed: they help us in defining the variance of x. Right! So, in this case, I can rewrite the variance as $\mu_2' - \mu_1'^2$. Similarly, if you take the grouped data, then when I substitute r equal to 1, I get

$$\mu_1' = \frac{1}{n}\sum_{i=1}^{K} f_i x_i,$$

which is once again the arithmetic mean, and when I put r equal to 2, I get

$$\mu_2' = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^2,$$

which has the same interpretation: for the variance of x we had defined

$$\frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{K} f_i x_i\right)^2,$$

which is nothing but $\mu_2' - \mu_1'^2$. So, you can see by this example that the interpretation and the utility of the moments in the grouped and ungrouped cases are similar. Right! What you have to observe is that if I choose r equal to 0, then $\mu_0'$ becomes 1, and this is true for both the cases, for ungrouped data and for grouped data.
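As a quick illustration of these identities, here is a small R sketch; the vector x below is purely illustrative and not the data from the lectures:

x <- c(4, 7, 9, 12, 15, 21)   # illustrative data
mu1 <- mean(x)                # first raw moment = arithmetic mean
mu2 <- mean(x^2)              # second raw moment
mu2 - mu1^2                   # variance via the raw moments
mean((x - mean(x))^2)         # the same variance, computed directly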
Now, after discussing the raw moments, let me define what are called central moments. We had discussed the rth moment around any arbitrary point A,

$$\mu_r' = \frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r,$$

and in this case, if you simply choose A to be the sample mean $\bar{x}$, then whatever moment results is called a central moment. So, the moments of a variable X about the arithmetic mean of the data are called central moments. Now, if I want to define the rth central moment based on the sample data $x_1, x_2, \ldots, x_n$, then in the case of ungrouped data I simply have to replace A by $\bar{x}$. So, this quantity is the arithmetic mean of the rth power of the deviations from $\bar{x}$, and it is denoted by $\mu_r$; there is no prime in this case. That is the standard notation: the raw moments are denoted by $\mu_r'$, and the central moments are denoted by $\mu_r$. So, the rth central moment in the case of ungrouped data is given by

$$\mu_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r,$$

and a similar definition can be extended to the case of grouped data, where the rth power of the deviations is multiplied by the frequency $f_i$ and then the arithmetic mean of these quantities is taken:

$$\mu_r = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^r,$$

where n is the sum of all the frequencies. What you have to notice is that in this case you compute the arithmetic mean by the expression $\bar{x} = \frac{1}{n}\sum_{i=1}^{K} f_i x_i$, whereas in the case of discrete data you compute $\bar{x}$ simply as $\frac{1}{n}\sum_{i=1}^{n} x_i$. So, this quantity is called the rth central moment of the data. Now,
let me choose r equal to 1 and r equal to 2, and see what happens to the first and second central moments. First, I take the case of ungrouped data. If I substitute r equal to 1, you can see what I get:

$$\mu_1 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}) = \bar{x} - \bar{x} = 0.$$

Now, I can say that the first central moment, in the case of discrete data, will always be equal to 0, and in the next slide I will show you that the same is true in the case of continuous data also. So, in general, the first central moment is always 0. Similarly, when I substitute r equal to 2, what I get is

$$\mu_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

Now, can you identify what this is? This is nothing but the sample variance, which we defined in the earlier lectures. So, I can say that the second central moment, denoted by $\mu_2$, represents the variance, or the sample variance. One thing where I would like to have your attention is the following: the second central moment represents the variance, and the first central moment is always 0, whereas the first raw moment denotes the arithmetic mean. So, when you are interpreting these moments, you have to be careful while making an interpretation for the arithmetic mean. The arithmetic mean is represented by the first raw moment, whereas the first central moment simply denotes the average of the deviations around the mean, which is always 0.
Now, at this step, I would also like to show you that when I write down the expression

$$\mu_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2,$$

then the first quantity on the right-hand side is nothing but $\mu_2'$, and the second quantity, $\bar{x}^2$, is $\mu_1'^2$. So, $\mu_2$ becomes $\mu_2' - \mu_1'^2$. So, you can see that this is a relationship between the central and raw moments. What am I trying to explain here? I am trying to show you that there exists a relationship between the central and raw moments. I have shown you how to express the second central moment as a function of raw moments, taking the example of r equal to 2. Similarly, if you take r equal to 3, r equal to 4 and so on, then you can obtain similar relationships.
One more important aspect which arises here is the following. Usually, you will see that in statistics we consider the 1st, 2nd, 3rd and 4th moments, whether the raw moments or the central moments. Well, one can very easily compute the higher-order moments, say the 5th, 6th and so on, but, up to now, we have interpretations only for the first four moments. For example, the first raw moment quantifies the information on the central tendency of the data. The second central moment gives us the information on variability. Similarly, I will show you in the forthcoming lectures that the third moment gives us information on the symmetry of the frequency curve, the property called skewness, and the fourth moment gives us information about the hump of the frequency curve, the peakedness of the frequency curve, and that property is called kurtosis. But what is indicated by the fifth moment, sixth moment and so on is still an open question. So, that is the reason that usually we are interested in finding the moments only up to order four. There also exists a clear-cut relationship expressing the rth central moment as a function of the raw moments, but I am not discussing the general case here; I request you to please have a look in the books, where it is clearly given. But here I will certainly show you the relationship of the first four central moments with the raw moments.
Now, if I take r equal to 1 and r equal to 2 in the case of continuous (grouped) data, then we have similar interpretations. In the case of the first central moment, this is going to be

$$\mu_1 = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x}) = \bar{x} - \bar{x} = 0.$$

So, once again, I have shown you that the first central moment in the case of continuous data is always 0. Similarly, if you substitute r equal to 2, then this quantity is nothing but the sample variance, and in this case also $\mu_2$ can be represented as $\mu_2' - \mu_1'^2$. So, this is the same outcome that we obtained in the case of ungrouped data, and you can notice that this relationship does not change between the grouped and ungrouped cases. Now, if I choose r equal to 0, then I will always get $\mu_0 = 1$, whether we have ungrouped data or grouped data.
Now, after this, I will show you the relationship between the central moments and the raw moments. You have observed that the zeroth central moment and the zeroth raw moment both take the value 1, and that the first central moment always takes the value 0. These are two points where you have to be careful. I have already shown you how to express the second central moment $\mu_2$ as a function of the first and second raw moments, $\mu_2 = \mu_2' - \mu_1'^2$. Using a similar approach, I can also express the third central moment as a function of $\mu_1'$, $\mu_2'$ and $\mu_3'$,

$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3,$$

and the fourth central moment as a function of $\mu_1'$, $\mu_2'$, $\mu_3'$ and $\mu_4'$,

$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4.$$

You may recall that $\mu_3'$ is nothing but $\frac{1}{n}\sum_{i=1}^{n} x_i^3$, and $\mu_4' = \frac{1}{n}\sum_{i=1}^{n} x_i^4$ is the fourth raw moment. So, you can see here the relationships between the central moments and the raw moments. Well, I am giving you only four relationships, as I said, but using the binomial expansion you can obtain a general relationship between the rth central moment and the first r raw moments, and that is available in all the standard statistics books, so I will not derive the general form here. As a quick numerical check of the relations above, a small sketch follows below.
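This is a minimal sketch, with an illustrative vector x that is not from the lecture data:

x <- c(4, 7, 9, 12, 15, 21)           # illustrative data
m1 <- mean(x);  m2 <- mean(x^2)       # first and second raw moments
m3 <- mean(x^3); m4 <- mean(x^4)      # third and fourth raw moments
mean((x - m1)^3)                      # third central moment, direct
m3 - 3*m2*m1 + 2*m1^3                 # the same via raw moments
mean((x - m1)^4)                      # fourth central moment, direct
m4 - 4*m3*m1 + 6*m2*m1^2 - 3*m1^4     # the same via raw moments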
But I would like to stop here. I have given you the basic concepts of moments. Well, this lecture was purely theoretical, but we also know that any development is based on a theoretical construct. Unless and until you understand the basic fundamentals, you will not be able to understand what the software is trying to do, and these are very important concepts in statistics. So, that was my objective, to give you an exposure to these concepts. Now, in the next lecture, I will consider the concept of absolute moments, and I will show you how these moments can be computed with the R software, and after that I will introduce the concepts of skewness and kurtosis. So, please take a break, revise the lecture, try to read from the books, and I will see you in the next lecture. Till then, good bye.
Lecture – 23
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the last lecture we started a discussion on the concept of moments, and we discussed raw moments and central moments. Now, in this lecture I will introduce another type of moments, which are called absolute moments, and I will show you how the raw moments, central moments and absolute moments are computed in the R software. But before we come to the concept of absolute moments, let me first introduce one small topic.
So, you may recall that whenever we have continuous data, or say grouped data, what do we do? We group the data into class intervals, and the frequency of each group indicates how many values are present in that interval.
Now, if you try to see what we are doing: we have an interval, say $e_1$ to $e_2$. Right, and we assume that the frequency of this interval is concentrated at the midpoint $x_1$; if you remember, $x_1$ was the midpoint of the interval. So, on the y axis, this value shows the frequency, say $f_1$, of the first class. But now, try to see what is happening. There were $f_1$ values in the class interval $e_1$ to $e_2$, and these values were scattered at different locations inside this interval. But when you group all this information, you are assuming that these values are concentrated at $x_1$, and the number of such values is $f_1$. So, in some sense, what are we doing? We are grouping the observations. But when we group the observations, the information contained in the individual observations is lost. What does it mean that the information is lost? Suppose I have two values, one is 5 and another is 10, and suppose the mid value of the interval is 6. So, I am assuming that 5 becomes 6, and the value 10 also becomes 6, and after this, if you observe the two values of 6, you cannot differentiate which 6 is representing the value 5 and which 6 is representing the value 10. In general, we have lost the information about the individual values of $x_i$; whatever the values were, we simply assume that they are concentrated at the middle value $x_i$. So, you can see that when we group the observations, some error is introduced, and now, obviously, when you compute the moments on the basis of grouped data, this error is reflected in the values of the moments. When these moments are not representing the true values, consequently they will give us a wrong value of the mean, a wrong value of the variance, and wrong values of other quantities which are based on moments. So, this is very important for us: whenever we have grouped data, we should apply a correction to the values of the moments, so that the modified or corrected values are used, which in turn will give us the correct information. Okay.
So, in this direction, Professor Sheppard worked, and he provided some expressions, which are based only on the moments and the class interval width, and he explained how these changes can be made so that the moments reflect the values without the grouping error; in simple words, Professor Sheppard suggested how this grouping error can be treated. So, let us start the discussion in this direction.
So, we assume in grouped data that the frequencies are concentrated at the middle point of the class interval, and this assumption may not always hold true, and so the so-called "grouping error" arises.
Now, how do we improve these values and take care of the grouping error? This effect can be corrected in calculating the moments by using the information on the width of the class interval. So, this is pretty simple. Let us assume that small c denotes the width of the class interval. Then Professor W. F. Sheppard proved that if the frequency distribution is continuous and the frequency tapers off to zero in both directions, that is, on the left-hand side and the right-hand side, then the grouping effect can be corrected as follows,
and he provided the values of the raw moments and central moments after applying the corrections. In the case of raw moments, Sheppard's corrections are applied as follows; on the left-hand side I indicate the corrected values of the moments (the same holds below for the central moments), and on the right-hand side the moments based on the given data without any correction:

$$\mu_1'(\text{corrected}) = \mu_1',$$
$$\mu_2'(\text{corrected}) = \mu_2' - \frac{c^2}{12},$$
$$\mu_3'(\text{corrected}) = \mu_3' - \frac{c^2}{4}\mu_1',$$
$$\mu_4'(\text{corrected}) = \mu_4' - \frac{c^2}{2}\mu_2' + \frac{7}{240}c^4.$$

So, you can see that the first raw moment remains the same; there is no error in this case, so the first raw moment and the so-called first corrected raw moment are the same. But in the case of the second raw moment, the corrected value is a function of the second raw moment $\mu_2'$, adjusted by the quantity $c^2/12$. So, what are we doing? In order to take care of the grouping error, we simply subtract the quantity $c^2/12$ from the raw moment, where c is the width of the class interval, and we get a new value of the second raw moment in which the grouping error has been taken care of. Similarly, the third corrected raw moment is $\mu_3' - \frac{c^2}{4}\mu_1'$, and the fourth raw moment after taking care of the grouping error is obtained by $\mu_4' - \frac{c^2}{2}\mu_2' + \frac{7}{240}c^4$, where once again c is the width of the class interval. Similarly, in the case of central moments also, we can modify the second, third and fourth central moments; because the first central moment is always 0, there is no correction needed there.
So, the second central moment after incorporating the correction for the grouping error becomes

$$\mu_2(\text{corrected}) = \mu_2 - \frac{c^2}{12}.$$

There is no change in the third central moment; the third central moment is not affected by the grouping effect, so the original value of $\mu_3$ and the corrected value of $\mu_3$ remain the same. Similarly, the value of the fourth central moment after taking care of the grouping effect is given by

$$\mu_4(\text{corrected}) = \mu_4 - \frac{c^2}{2}\mu_2 + \frac{7}{240}c^4.$$

So, basically, these are the parts which have to be taken care of during the computation, and here also you can see that once you know how to compute $\mu_r'$, the raw moment, and $\mu_r$, the central moment, then at least for the first four central moments you can simply write a simple syntax in the R software.
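A minimal sketch of such a syntax, implementing the central-moment corrections above, is the following; the function name sheppard and its inputs are illustrative assumptions, not part of any package (mu2, mu3, mu4 are the uncorrected central moments of the grouped data, and width is the class interval width c):

sheppard <- function(mu2, mu3, mu4, width) {
  list(mu2.corr = mu2 - width^2/12,                       # second central moment
       mu3.corr = mu3,                                    # third is unaffected
       mu4.corr = mu4 - (width^2/2)*mu2 + (7/240)*width^4) # fourth central moment
}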
So, whatever is the expression for computing a particular moment, that value has to be adjusted just by adding or subtracting a few terms, as proposed by Professor Sheppard. So, the implementation of Sheppard's correction in the R software is not difficult at all. Now, after this, I will come to the aspect of absolute moments. So, how does this absolute moment come into the picture?
You have seen that when we introduced the idea of absolute deviation, what did we do? We had observations $x_1, x_2, \ldots, x_n$. We chose an arbitrary value A, subtracted it from every observation, took the absolute values of these deviations, and then found their arithmetic mean. So, this was simply $\frac{1}{n}\sum_{i=1}^{n}|x_i - A|$. Now, suppose I consider the rth power of this, that is, I add an exponent r. What will happen? If I take r equal to 1 and A equal to the mean, this simply becomes the absolute deviation around the mean, and if I take r equal to 2, this becomes $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, which is the same as the variance of x. So, this gives us the idea of why we define the rth absolute moment, and the quantity so defined is called the rth absolute moment about A. The rth absolute moment about the arithmetic mean, based on the sample observations $x_1, x_2, \ldots, x_n$, is defined in the case of ungrouped data as

$$\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|^r.$$

So, this is simply the rth power of the absolute deviations, and after that we find the arithmetic mean of all such quantities. This is called the rth absolute moment about the arithmetic mean, and similarly, in the case of grouped data, the rth absolute moment about the arithmetic mean is defined as

$$\frac{1}{n}\sum_{i=1}^{K} f_i |x_i - \bar{x}|^r.$$

So, you can see that this follows the same philosophy, the same line of development as before; the only difference is that we have to adjust it for the case of grouped data.
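For instance, a minimal sketch of this definition in base R, with an illustrative vector x and order r of my own choosing:

x <- c(5, 8, 12, 15, 20)     # illustrative data
r <- 3
mean(abs(x - mean(x))^r)     # rth absolute moment about the arithmetic mean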
Now, after this, I come to the aspect of how to compute these in the R software. When we want to compute the values of the moments on the basis of a given sample of data in R, this functionality is not available in the base package of R; in order to compute the moments, and, as I will show you later, when we want to measure the departure from symmetry and the peakedness of the frequency curve and compute the coefficients of skewness and kurtosis, we need a special package, and we need to install it.
So, in order to compute the moments, we first need to install a package which is called moments, and then we load it as a library, and then we operate. So, when we want to compute moments, the first step is to install the package moments: you simply write install.packages and, inside the argument within double quotes, "moments", and the package will be installed. After this, you need to load the package with library(moments); in fact, whenever you need to compute moments, you need to load this library. After this, all the sample moments are computed by the command all.moments, and this function has several arguments: x, order.max, central, absolute and na.rm. Just by controlling the parameters inside the arguments, you can generate different types of moments: raw moments, central moments and absolute moments. So, what is happening in the R software is that we have only one command, all.moments, and just by giving different choices of TRUE and FALSE inside the arguments we obtain the command for raw moments, central moments or absolute moments. So, now you have to understand which choice of the parameters, or of the values inside the arguments, gives you which type of moments.
So, first let me explain the meaning of these arguments. The first value here is x; this denotes the data vector. Then there is another parameter, order.max, which is written here as 2; this gives us the information on the number of moments to be computed, and 2 is the default value. This is explained in the next slide, but I will explain it here also: in case you want to compute three moments, four moments, five moments, six moments, you simply have to choose the appropriate value here. The next argument is central; central indicates the central moments, and the default value taken inside the all.moments command is FALSE, that is, the logical FALSE, but in case you want the central moments, you just have to use the logical TRUE, so in place of FALSE you simply type TRUE, and it will give you the central moments. Similarly, the next option is absolute; absolute gives you the values of the absolute moments, and the default value taken in the command all.moments is FALSE, but in case you want to compute the absolute moments, you simply have to replace this FALSE by the logical TRUE. The last argument is known to you: this is na.rm = FALSE, which helps us in the case of missing values; if you want to compute the moments when missing values are present, you simply have to change this logical FALSE to the logical TRUE. So, this is how, just by handling the different arguments with different logical TRUE and FALSE values, you can obtain the different types of moments.
For example, in this slide I am explaining all these things in detail so that you can have a look: this is about order.max, this is about central, this is about absolute, and this is about na.rm, the same things which I have just explained. Right.
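To summarize the calls discussed above, here is a small sketch; the values in the time vector shown are hypothetical, since the actual race data from the earlier lectures is not reproduced here:

install.packages("moments")   # run once
library(moments)
time <- c(32, 45, 50, 56, 58, 60, 62, 70, 75, 80)   # hypothetical data
all.moments(time, order.max = 4)                    # raw moments (the default)
all.moments(time, order.max = 4, central = TRUE)    # central moments
all.moments(time, order.max = 4, absolute = TRUE)   # absolute moments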
Now, after this, I will take an example, and I will show you how these things are computed. I am taking again the same example that we have used a couple of times in the earlier lectures: we have the data of 20 participants in a race, and this data has been stored in a variable called time. Now I would like to compute different types of moments for this data. So, as I said, first we need to use the command install.packages to install the package moments, and then I have to load this package using the command library(moments). Right.
Now, after this, I will show you how to compute raw moments, central moments and absolute moments. First, I take the raw moments, and suppose I want to compute the first two raw moments. So, I have to control it by order.max = 2: I use the command all.moments with the data vector time, and I give order.max = 2. In fact, even if you don't give this option, you will get the same outcome, because 2 is the default, but my objective is to show you how the things are being controlled. So, once you execute it, you will get this type of outcome. Now, the next question is: what is the interpretation of these values in the outcome? The first value here is 1.0, which is denoting the value of $\mu_0'$, that is, the value of the raw moment at r equal to 0. Similarly, the second value, 56.0, is denoting the value of $\mu_1'$, that is, the value of the raw moment at r equal to 1; this is the first raw moment. And similarly, the last value, 3405.2, is the value of $\mu_2'$, that is, the raw moment at r equal to 2, denoting the second raw moment $\frac{1}{n}\sum_{i=1}^{n} x_i^2$. Right. Now, in case I repeat the same command just by changing order.max to 4, what will happen? You can see that in the earlier case, the maximum value of r up to which the moments are computed is r equal to 2, and this is the same value given by order.max = 2. So now, suppose I want to compute the first four raw moments: in this case I simply have to give the value 4, and when I execute this command with order.max = 4, I get this outcome. You can see that the first value is the value of $\mu_0'$, the second value, 56, is the value of $\mu_1'$, the third value, 3405.2, is the value of $\mu_2'$, the fourth value is the value of $\mu_3'$, and the last value, 15080073.2, is the value of $\mu_4'$. So, you can see that the maximum order here, four, is indicated by order.max, and this gives you the first four moments. Now, before going further, let me show you this on the R console.
So, first I create the data vector, which is here like this. I have already installed this package on my computer, but you need to install the package, and I am simply loading it; so now the package moments is loaded. Now, I use the same command, all.moments, on the data vector time, and I want to compute the moments up to order 2, that is, $\mu_0'$, $\mu_1'$ and $\mu_2'$; this comes out like this. If I compute, say, the first four raw moments, $\mu_0'$, $\mu_1'$, $\mu_2'$, $\mu_3'$ and $\mu_4'$, this gives the output like this. Suppose you want to compute eight moments: you simply have to give order.max = 8, and you will get the outcome.
(Video End Time: 27:02)
A very important point which you have to notice here is that in the command all.moments, if you are not giving any option like central, absolute or na.rm, the default outcome is the raw moments. This is what you always have to keep in mind: the outcomes we have obtained here are the default outcomes of the command all.moments, which are simply the raw moments. Right.
Now, I will show you how to compute the central moments. Again, I will repeat the same thing: first I will compute the central moments up to order 2, and then up to order 4. The command remains the same, all.moments on the time data vector with order.max = 2; what I have to do is simply add one more argument, central = TRUE. The default is central = FALSE, so I add central = TRUE inside the arguments, separated by a comma. The first value of the outcome indicates the value of $\mu_0$, and it was already shown that it will always take the value 1. Similarly, if you come to the second value, which is 0.0, this indicates the value of $\mu_1$, which was $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$, and this always takes the value 0. Similarly, the last value, 69.2, indicates the value of $\mu_2$, which is nothing but $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, and if you see, this is nothing but the variance of x, where x here is actually the data in the time vector. Now, similarly, if you repeat the command and wish to compute the moments up to order 4, you need to make only one change, order.max = 4, with the same command, and you can see the outcome here: the first value denotes $\mu_0$, the second value denotes $\mu_1$, the third value denotes $\mu_2$, the fourth value denotes $\mu_3$, that is, the value of the third central moment, and the last value denotes the fourth central moment. So, you can see that it is not really difficult.
Now, I will show you the computation of these values in the R software on the time data. You can see, I run the command like this, and I get the moments up to order 2, that is, $\mu_0$ and the first two central moments. Similarly, if I make it 4, I get the first four values, and similarly, if you want, say, the first 8 moments, you simply have to make order.max = 8, and here is the outcome. So, it is not really difficult to compute these values. Now, let us come back to our slides and finally consider the absolute moments.
Now we have understood that, just as we computed the central moments, we can compute the absolute moments just by controlling the argument values. So, in this case, if I want to compute the absolute moments up to the second order, that is, r equal to 0, 1 and 2, my command remains the same as earlier, all.moments(time, order.max = 2), and now I add one more option, absolute = TRUE; the default value is absolute = FALSE, but now I need absolute moments, so I give absolute = TRUE, and this gives me these values. As in the earlier examples, the first value gives us the absolute moment at r equal to 0, which is always equal to 1; the second value gives us the first absolute moment, which is $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$; and the third value gives us the second absolute moment, $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|^2$, which is the variance. Now, similarly, if you want to compute the moments up to order 4, then I just have to make order.max = 4, and the same
command I use here, and this gives me this outcome. Obviously, the first value is the absolute moment at r equal to 0, the second value is the first absolute moment, the third value is the second absolute moment, the fourth value is the third absolute moment, and the last value is the fourth absolute moment. So, this is how you can compute these absolute moments, and I will show you this on the R console.
So, you can see here the time data, and then I compute the absolute moments with order.max = 2, which gives me the first three values; now I choose the first four moments, and these values are given like this. Similarly, if you want to compute any higher-order values, say up to nine moments, the outcome comes out like this. Remember one thing: this counting starts from zero, so when you take order.max equal to some value, you get one more value than that in the output.
(Video End Time: 34:11)
This is the screenshot of what we have just done.
After this, I will show you the use of the last option, for when we have some missing values. I will take the same data set, in which the first two values have been removed and substituted with NA, and this data vector has been stored in a new variable, time.na.
Now, after this, in case I want to compute the raw moments, I simply use the same command, all.moments, with the data vector, and suppose I want to compute the first four moments, so I give order.max = 4, and then I have to say na.rm = TRUE; if you don't write it, the default value is FALSE. Once you execute it, you will get the same kind of outcome. These are the values computed in the same way as in the earlier example; the only thing is that now the missing values have been removed, and then the raw moments have been computed. Similarly, if you want to compute the first four central moments, you have to use the same command as earlier, with the data given by time.na and with na.rm = TRUE, and this will give you the outcome; you can see in the same way that the first value is the value of $\mu_0$, the second the value of $\mu_1$, the third the value of $\mu_2$, the fourth the value of $\mu_3$, and the last the value of $\mu_4$.
(Refer Slide Time: 35:55)
And similarly, if you want to compute the absolute moments in this case, then again you use the same command for the absolute moments and just add na.rm = TRUE, and you get the value of the absolute moment for r equal to 0, the value of the absolute moment for r equal to 1, that is, the first absolute moment, then the second absolute moment for r equal to 2, the absolute moment for r equal to 3, and finally the last value, the fourth absolute moment.
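Putting the missing-value case together, a minimal sketch; the values in time.na are hypothetical, only the variable name follows the lecture:

library(moments)
time.na <- c(NA, NA, 50, 56, 58, 60, 62, 70, 75, 80)               # hypothetical
all.moments(time.na, order.max = 4, na.rm = TRUE)                  # raw
all.moments(time.na, order.max = 4, central = TRUE, na.rm = TRUE)  # central
all.moments(time.na, order.max = 4, absolute = TRUE, na.rm = TRUE) # absolute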
Now, I will show you how to get this done in the R software. First, I create the data vector; you can see here, this is my data. Now, if I try to compute the raw moments, suppose I first show you what happens if I don't use the option na.rm: you will see that it gives you NA NA NA NA, and only the first value comes out to be 1, because that value always remains 1 whatever r is. Now, if I add na.rm = TRUE, then upon executing this command you will get the same outcome that we obtained earlier. Similarly, if you want to compute the central moments here, just use this command and you will see the outcome, and similarly, if you want to compute the absolute moments, simply use the corresponding command.
And similarly, if you want to compute higher-order moments in the case of missing values, simply control the value of order.max. So, now I have given you the basic concepts of moments, and I have explained how to compute them on the basis of a given set of data. Now it's your turn: try to take some datasets, practice, and observe how, just by choosing the logical TRUE and logical FALSE for the different arguments, you can generate raw moments, central moments and absolute moments. In the next lecture, I will show you the use of the third and fourth order moments by considering the concepts of skewness and kurtosis. So, you practice, and I will see you in the next lecture. Till then, good bye.
Lecture - 24
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the last two lectures we discussed the concept of moments: the raw moments, central moments and absolute moments. We also learned how to compute them in the R software. Now, I am going to introduce the concepts of skewness and kurtosis, which are two further features of a frequency curve, and our objective is to understand, firstly, what these features are; secondly, how to quantify them; and thirdly, how to compute them in the R software. When we try to quantify them, you will see that we will need the definitions of the moments, and in particular the central moments, and that was the reason that I explained those concepts earlier. So now, let us start this discussion.
First, let us try to understand what skewness is. The dictionary or literal meaning of skewness is lack of symmetry. What does this mean? The symmetry here is the symmetry of the frequency curve or the frequency density; you have seen how we computed the frequency table, and from there we drew the frequency density curve. So, when we say there is a lack of symmetry, then what is symmetry? In an ideal situation, a symmetric curve has the data distributed symmetrically around the mean, and the curve will look like this; when the symmetry is lacking, the curve will look like this or like this. Now, what is the interpretation of these curves? Suppose I take an example where we are counting the number of persons passing through a place where many offices are located. We know what the phenomenon is like: usually the offices will start at nine or ten o'clock in the morning. The traffic at that point will be very low around 7 a.m. or 8 a.m., and then the traffic will start increasing, and it will keep increasing up to, say, 10 o'clock, 10:30 or 11 o'clock in the morning.
So, in case I want to show this phenomenon through a curve, the curve will look like this. Suppose this is the time axis, with 7 a.m. here, 10 a.m. here, and going up to, say, 3 p.m., and here is the number of persons passing through that point. So, I can say that from 7 a.m. this frequency, the number of persons, is very small; it starts increasing and keeps increasing up to, say, 10 o'clock, and after that, once everybody has come into the office, fewer people are coming, and finally, by 3 p.m., this number will have decreased. Now, on the other hand, the opposite happens in the
third case. Suppose I mark these points as 12 p.m., 1 p.m., up to, say, 6 p.m., 7 p.m. and 9 p.m., and here we record the same data, the number of persons, which denotes the frequency. Now, what will happen? Offices run, say, from 9 a.m. to 5 p.m. or 6 p.m., so in that place where many offices are located, people will be working inside the offices during the day, and in the evening, when the office hours close, they will leave. So, from, say, 12 o'clock or 1 o'clock, the number of persons passing through that point will be very small, and this number will start increasing from, say, 4 p.m. or 5 p.m., and it will be maximum, say, between 5 p.m. and 6 p.m., and once everybody has left the office, the number of persons passing through that point will sharply decrease, and by 7 p.m. or 8 p.m. the number will be very small. So, how do we denote this phenomenon through a curve? This type of phenomenon can be expressed by this curve: initially, at 12 o'clock, the number is very small; then, around 5 p.m. or 6 p.m., the number is at its maximum, and then it decreases after 7 p.m. or so. In both these cases, what is happening? You can see that more data is concentrated on the left-hand side in the first figure and more data is concentrated on the right-hand side in the last figure.
So, these are the areas in red color where more data is concentrated. Now, if you take the third figure, you can see that the curve is symmetric around the mid value. If you break the curve into two parts at this point and fold it, then it will look symmetrical. So, what am I trying to say? Suppose the curve is like this, and if I break it in the middle and fold it, then the two halves will look the same. This is what we mean by symmetry, and this feature is missing in the first and last curves: if you break the first curve at this point and the last curve at this point, where I am moving my pen, then they will not be symmetric. So, the objective here is how to study this departure from symmetry on the basis of a given set of data. I would like to know, on the basis of a given set of data, whether the data is concentrated more on the left-hand side or more on the right-hand side of the frequency curve. This feature is called 'Skewness', and in order to quantify it, we have a coefficient of skewness. So, I can say that skewness gives us an idea about the shape of the curve, which is obtained from the frequency distribution or the frequency curve of the data, and it indicates the nature and concentration of the observations.
You can see here, I have plotted three figures; call them figure 1, figure 2 and figure 3. I will call figure 3 a bell-shaped curve. Why is this called bell-shaped? Have you ever seen a bell? The bell is like this, with a ring at the top, right. So, you can see the structure of this curve: it is symmetric, and that is why the shape in figure 3 is called a 'Bell Shaped Curve'. So, for the bell-shaped curve, I will say this is a symmetric curve. Now, when the symmetry is lacking, the frequency curve will look as in figure 1 or figure 2: the frequency curve is stretched more towards the right or towards the left, indicating that more values are concentrated in this region in figure 1 and more values are concentrated in this region in figure 2. In these cases, we say that the curves are skewed. So, a frequency distribution is said to be skewed if the frequency curve of the distribution is not a bell-shaped curve and is stretched more on one side than on the other. Now, how do we identify it? Because now we have two types of lack of symmetry, one in figure 1 and one in figure 2, we give each a name.
So, let me relabel the figures: this is figure 1, this is figure 2 and this is figure 3, from the last slide. In figure 1, you can see that the curve is more stretched on the right-hand side; when the curve is more stretched on the right-hand side, it is called a 'Positively Skewed Curve'. Similarly, if the curve is more stretched on the left-hand side, it is called a 'Negatively Skewed Curve'. And in the case of a symmetric curve, we say that it has zero skewness. When we want to discuss the property of skewness, we state whether the frequency curve is positively skewed, negatively skewed or symmetric, and this is how we express the finding from the data. But there is definitely one more thing: suppose I take two curves, like this and like this, and both curves are positively skewed. The next question is that although both curves are positively skewed, their structures are different; one curve lacks symmetry more than the other. Just saying 'less' or 'more' will not help us; we need to quantify it. So, our next objective is how to quantify this lack of symmetry, and in order to do this, we have the concept of the coefficient of skewness.
The definition of the coefficient of skewness depends on the second and third central moments of the data. You may recall that we denoted the second central moment by $\mu_2$ and the third central moment by $\mu_3$. Now, the coefficient of skewness is denoted by $\beta_1$, and it is defined as the square of the third central moment divided by the cube of the second central moment:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}.$$

This is called the 'Coefficient of Skewness'. There is another coefficient of skewness, which is defined as the signed square root of this and is denoted by $\gamma_1$:

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}.$$

Now, what is the difference between the two measures of the coefficient of skewness, $\beta_1$ and $\gamma_1$? You can see that $\mu_3$ can be positive or negative, but $\mu_3^2$ will always be positive, and similarly $\mu_2^3$ will always be positive, so $\beta_1$ will always be positive. So, $\beta_1$ gives us information on the magnitude of the skewness, the magnitude of the lack of symmetry, but it cannot inform us whether the skewness is positive or negative. Whereas, when we use the next coefficient, $\gamma_1$, it also gives us information about the sign: $\gamma_1$ carries the sign of $\mu_3$, so it gives us the information on the magnitude as well as the sign, positive or negative, indicating positive or negative skewness. So, this is the basic difference between the two measures $\beta_1$ and $\gamma_1$, and you will see that the R software provides the value of $\gamma_1$. One thing you have to notice here, and I can explain it on this slide itself: I have defined $\beta_1$ and $\gamma_1$ here in terms of the population moments.
In practice, we have a data set, say $x_1, x_2, \ldots, x_n$, and in this case, what do we do? We compute the values of $\beta_1$ and $\gamma_1$ on the basis of the data $x_1, x_2, \ldots, x_n$, using what are called the sample moments. So, we compute the sample central moments of order 2 and 3, and we replace $\mu_2$ and $\mu_3$ by their sample counterparts. In this case, I denote the result with a subscript s, meaning 'based on the sample values':

$$\beta_{1s} = \frac{\mu_{3s}^2}{\mu_{2s}^3}, \qquad \gamma_{1s} = \frac{\mu_{3s}}{\mu_{2s}^{3/2}},$$

where $\mu_{2s}$ and $\mu_{3s}$ are the second and third sample central moments. So, it is the same thing: I have simply computed the second and third central moments from the sample and substituted them into the definitions. Again, $\gamma_{1s}$ gives us information on the magnitude and the sign, whereas $\beta_{1s}$ gives us information only on the magnitude.
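As a small sketch of these sample coefficients in R (the vector x is illustrative; skewness() from the moments package computes the sample $\gamma_1$):

library(moments)
x <- c(5, 7, 8, 9, 12, 20, 35)   # illustrative data
m2 <- mean((x - mean(x))^2)      # second sample central moment
m3 <- mean((x - mean(x))^3)      # third sample central moment
m3 / m2^(3/2)                    # gamma_1s computed by hand
skewness(x)                      # the same value from the package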
Now, what is the interpretation? The interpretation goes like this, and it is the same for $\gamma_1$ and $\gamma_{1s}$. The first case: if $\gamma_1$ is equal to 0, the distribution is symmetric; if $\gamma_1$ is greater than 0, that is, positive, the distribution is said to be positively skewed; and if $\gamma_1$ is negative, the distribution is negatively skewed. The same continues for $\gamma_{1s}$: if $\gamma_{1s}$ is 0, this means symmetry; if $\gamma_{1s}$ is positive, the distribution is positively skewed; and if $\gamma_{1s}$ is negative, the distribution is negatively skewed. So, you can see that, having the coefficient of skewness, it is not difficult to know the feature of the frequency curve with respect to symmetry: I can see whether my distribution is symmetric or not, and if not symmetric, whether it is positively or negatively skewed. Now a simple question comes: what happens to the mean, median and mode in these different types of distributions, when the distribution is symmetric, positively skewed or negatively skewed? Let me give you brief information.
(Refer Slide Time: 18:49)
When we have a positively skewed distribution, $\gamma_1$ will be greater than 0. Now, if you recall, the mode will be here, at the value with the maximum frequency; the median is the value which divides the entire area under the curve into two equal parts, so it will be somewhere here; and the mean will be somewhere here. So, in the case of a positively skewed distribution, the mode will have the smallest value, followed by the median and then the mean: the mode will be smaller than the median, and the median will be smaller than the mean. The opposite happens when we have a negatively skewed distribution: in this case, the mode, corresponding to the maximum frequency, will be somewhere here, the median will be somewhere here, and the mean will be somewhere here, so the mode will be greater than the median, and the median will be greater than the mean. Well, in the case of a symmetric distribution, the values of the mean, median and mode are all the same, and in this case $\gamma_1$ is equal to 0 and the mean, median and mode coincide.
There are some more coefficients of skewness which have been given in the literature, so I will briefly give you an idea. One coefficient of skewness is based on the mean and the mode, given by the mean minus the mode divided by the standard deviation,

$$\frac{\bar{x} - x_{mode}}{\sigma_x},$$

where $\sigma_x$ is the standard deviation; this is essentially the value of s that we used in the earlier lectures, but I am using $\sigma_x$ here because that is the standard notation in many books. And this is approximately the same as the quantity based on the mean and the median,

$$\frac{3(\bar{x} - x_{median})}{\sigma_x},$$

because you may recall that we have the relationship $\bar{x} - x_{mode} \approx 3(\bar{x} - x_{median})$ under certain conditions. These two measures lie between $-3$ and $+3$, and we say that if these coefficients are greater than 0, the curve is positively skewed; if they are negative, the curve is negatively skewed; and if the coefficient is 0, the curve is symmetric.
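A minimal sketch of the median-based version in base R, with an illustrative vector x; note that the divisor-n standard deviation is computed by hand to match the lecture's earlier definition:

x <- c(5, 7, 8, 9, 12, 20, 35)          # illustrative data
sigma.x <- sqrt(mean((x - mean(x))^2))  # standard deviation with divisor n
3 * (mean(x) - median(x)) / sigma.x     # lies between -3 and +3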
Similarly, we have two more measures, based on the definitions of quartiles and percentiles. The coefficient of skewness based on quartiles is given by

$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)},$$

and the coefficient based on percentiles is defined analogously, using percentiles in place of the quartiles. Both of these coefficients lie between $-1$ and $+1$, and they have the same interpretation as earlier: a positive value indicates a positively skewed curve, a negative value indicates a negatively skewed curve, and a zero value indicates a symmetric curve.
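The quartile-based coefficient is easy to compute in base R; a minimal sketch with an illustrative vector x:

x <- c(5, 7, 8, 9, 12, 20, 35)          # illustrative data
q <- quantile(x, c(0.25, 0.50, 0.75))   # Q1, Q2, Q3
((q[3] - q[2]) - (q[2] - q[1])) / ((q[3] - q[2]) + (q[2] - q[1]))  # in [-1, 1]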
Now, after this, I come to the next aspect, which is called 'Kurtosis'. Please observe this picture: I have drawn three curves; let me call them curve 1, curve 2 and curve 3. I request you to observe the hump of each curve, which I am marking here in yellow color. By looking at these three curves, can we say what the feature related to the peak of the curve is? The peak of curve 3 is the highest, followed by the peak of curve 2, and then the peak of curve 1. So, the question is how to describe this property of the peaks and how to quantify it. This property of kurtosis describes the peakedness or flatness of a frequency curve; flatness means how flat the curve is at the peak. From these curves you can see that one of the curves has more peak and the other curves have less, but how do we compare,
what is more and what is less? What we do is measure the peakedness with respect to the peakedness of the normal distribution. What is the normal distribution? In statistics, we have a probability density function which we call the normal distribution, or sometimes the 'Gaussian Distribution'. So, let me first give you an idea about this distribution.
So, the probability density function of the normal distribution is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

and it is controlled by two parameters, $\mu$ and $\sigma^2$. We denote this distribution by $N(\mu, \sigma^2)$, where $\mu$ indicates the mean and $\sigma^2$ indicates the variance. If I draw this curve, it looks like this: this value here indicates the mean, and the spread around the mean is governed by $\sigma^2$, and the curve is symmetric around the mean. If we compute the coefficients of skewness and kurtosis for the normal distribution, the coefficient of skewness $\gamma_1$ comes out to be zero, and the coefficient of kurtosis $\gamma_2$ also comes out to be zero. This was the reason that we concluded on the basis of the coefficient of skewness being zero, positive or negative, and similarly we are going to do with kurtosis. So, the curve of the normal distribution has zero kurtosis in this sense, and now we can compare the peaks of other curves with respect to the normal curve. And this is what we are doing in this picture:
here, if you see, the curve in the middle, curve number two, is the curve of the normal distribution, and we compare the flatness or peakedness of the others with respect to it. Now you can see: curve number three has got more peak than curve number two, and similarly, curve number one has got a smaller peak than curve number two. So, what do we do? All those curves which have got higher peaks than the normal curve are called 'Leptokurtic'; the peakedness of the normal distribution itself is called 'Mesokurtic'; and the peakedness of those curves which have got a lower peak than that of the normal curve is called 'Platykurtic'. This is how we characterize less or more peakedness with respect to the peakedness of the normal distribution.
So, now I can explain to you that the shape of the hump, which is the middle part of the curve or the frequency distribution of the normal distribution, has been accepted as the standard one. And this property of kurtosis examines the hump or the flatness of the given frequency curve or distribution with respect to the hump or flatness of the normal distribution, which we take as the reference. Those curves which have got a hump like the normal distribution are called 'mesokurtic' curves; curves with greater peakedness, or say less flatness, than that of the normal distribution curve are called 'leptokurtic' curves; and those curves which have got less peakedness, or say greater flatness, than that of the normal distribution are called 'platykurtic' curves.
(Refer Slide Time: 28:46)
Now the question is how to quantify it. So, we have a coefficient of kurtosis, and there are different types of coefficients of kurtosis, but here we are first going to consider the Karl Pearson's coefficient of kurtosis, which is denoted by β₂; that is the standard notation. Similar to the coefficient of skewness, you can see here that this also depends on μ₂ and μ₄. What are these μ₂ and μ₄? μ₂ is the second central moment and μ₄ is the 4th central moment. And the coefficient of kurtosis is defined as the 4th central moment divided by the square of the second central moment, that is, β₂ = μ₄/μ₂². The value of β₂ for a normal distribution comes out to be 3. So, what do we do? We define another measure, which is β₂ − 3, and we denote it by γ₂. Now the advantage of γ₂ is that just by looking at its value, whether it is greater than 0, smaller than 0, or equal to 0, we get an idea about the magnitude and the nature of the hump. So, that is why we have two coefficients of kurtosis, and in R software both of them can be computed.
(Refer Slide Time: 31:27)
So, now you can see here the same thing. For a normal distribution, β₂ is equal to 3 and γ₂ is equal to 0. If β₂ is greater than 3, or γ₂ is greater than 0, then we say that the curve is leptokurtic; if β₂ is equal to 3, or equivalently γ₂ is equal to 0, then we say the frequency distribution or the frequency curve is mesokurtic; and if β₂ is smaller than 3, or γ₂ is smaller than 0, then we say that the distribution is platykurtic. So, you can see here, in the same figure that we had drawn, for the leptokurtic curve we have this, for mesokurtic we have this, and for platykurtic we have this. So, this is about the coefficient of kurtosis.
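As a quick aside, these two coefficients are easy to compute directly in base R. The following is a minimal sketch, assuming a hypothetical data vector x; it is not from the lecture itself:

    # Minimal sketch (base R only): Pearson's coefficient of kurtosis
    # beta2 = m4 / m2^2 and the excess version gamma2 = beta2 - 3.
    x  <- c(32, 35, 45, 83, 74, 55, 68, 38, 35, 55)  # hypothetical data
    m2 <- mean((x - mean(x))^2)   # second sample central moment
    m4 <- mean((x - mean(x))^4)   # fourth sample central moment
    beta2  <- m4 / m2^2           # > 3 leptokurtic, = 3 mesokurtic, < 3 platykurtic
    gamma2 <- beta2 - 3           # compared against 0 instead of 3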
Some properties, which I am not going to prove here: the coefficient of kurtosis β₂ will always be greater than or equal to one, and the coefficient of kurtosis β₂ will always be greater than or equal to the coefficient of skewness β₁ plus one, that is, β₂ ≥ β₁ + 1. These are some properties just for your information; I am not going to use them here.
But I would try to define the sample-based coefficient of kurtosis, because in β₂ and γ₂ I simply used the population values. So now, I do not need to do anything new: I need the two central moments μ₂ and μ₄, so I will compute the sample-based moments, i.e., the values of m₂ and m₄, on the basis of the given set of data, and I will simply substitute them in the coefficient of kurtosis. I denote this coefficient of kurtosis by β₂s, where s means sample, and similarly the coefficient γ₂ is now transformed to γ₂s, which is the same thing, γ₂s = β₂s − 3. They have the same interpretation as in the case of β₂ and γ₂: for a
leptokurtic distribution, β₂s will be greater than 3 and γ₂s will be greater than 0; β₂s will be equal to 3, or γ₂s will be 0, in the case of mesokurtic; and β₂s will be smaller than 3, or γ₂s will be smaller than 0, in the case of platykurtic.
So now, we have a fair idea of how to measure two types of characteristics in a frequency curve: one is the symmetry and the other is the peakedness, and they are quantified by the coefficients of skewness and kurtosis. So now, the next question is how to compute them in the R software. As we have seen in the case of computing the moments, we need a package; for the coefficients of skewness and kurtosis we need the information on the moments m₂, m₃ and m₄. So, we first need to install the
package moments, and then based on that we can compute the coefficient of skewness and the coefficient of kurtosis. So, we try to understand it here. In order to compute the coefficients of skewness and kurtosis in R, first you need to install the package moments by the command install.packages with "moments" inside the argument, and then you have to load this package moments. Right. Now, the sample-based coefficient of skewness will be computed by the command skewness, s k e w n e s s, with the data vector x inside the argument. And yeah, if you want to use na.rm, which deals with the missing values: writing FALSE means there are no missing values, and if there are missing values, we will simply write na.rm = TRUE. Similarly, the sample-based coefficient of kurtosis, which we have defined as β₂s, will be computed by the command kurtosis, k u r t o s i s, all in small letters, with the same thing inside the argument: the data vector x, and if you do not have any missing value, then use na.rm = FALSE, and if there are missing values, use na.rm = TRUE.
So, you can see here, this is not a difficult thing. And yeah, if you have missing values and you have stored those missing values inside the data vector xna, then the command becomes skewness(xna, na.rm = TRUE) for computing the coefficient of skewness, and the coefficient of kurtosis in this case will be given by kurtosis(xna, na.rm = TRUE). Right.
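Put together, the commands described above look like this; here x and xna stand for hypothetical data vectors (xna containing missing values), so treat this as a sketch:

    install.packages("moments")   # install the package once
    library(moments)              # load it in every session

    skewness(x, na.rm = FALSE)    # sample coefficient of skewness
    kurtosis(x, na.rm = FALSE)    # sample coefficient of kurtosis

    skewness(xna, na.rm = TRUE)   # same commands when NA values are present
    kurtosis(xna, na.rm = TRUE)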
Now, I will take an example and show you how to measure the skewness and kurtosis in the data. I am going to use the same example in which we had collected the timings of twenty participants in a race, and this data has been stored inside a variable time. Now, after this, we simply have to use the command skewness, s k e w n e s s, and inside the argument give the data vector, and this gives us the value 0.05759762. So, this is indicating that the
skewness is greater than zero. So, the frequency curve in this case is positively skewed. And similarly, when you try to operate the kurtosis command on the time vector, it gives us the value 1.7, which is smaller than 3, so we say the distribution is platykurtic. So you can see here, I have here the data on time; first I need to load the package with library(moments), and I already have installed this package on my computer. So now, I find the skewness of time, which comes out to be like this, and the kurtosis of time. Right. You can see here, this is the same thing which you have just obtained, and this is the screenshot of the same operation.
Now, I will take one more example, where I will show you how to compute the coefficients of skewness and kurtosis when some data are missing. I will take the same example, in which I have just removed the first two observations and replaced them by NA; NA means the data is missing. This data is stored in the data vector time.na, and now I will use the command skewness on time.na with the option na.rm = TRUE; that means, please remove the NA values and then compute the skewness. And similarly, for the kurtosis I will use the same command kurtosis on time.na with the option na.rm = TRUE, and this will give me the value of the coefficient of kurtosis when the two values are missing. So now, you can see here, the coefficient of skewness comes out to be negative, less than zero. So that means the frequency curve based on the remaining observations is now negatively skewed. This is indicating that when the first two observations are deleted, the nature of the skewness changes. Similarly, for the kurtosis, this value is 1.81, which is smaller than three, so we say that the distribution
is platykurtic. So, this is indicating that, even after removing the first two observations, the nature of the curve remains the same. And now, I will show you this on the R software also. I already have stored this data time.na on my computer; it is here like this. Now I want to compute the skewness; this comes out to be the same as what we have just observed. And if I want to compute the coefficient of kurtosis, it is here like this.
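For reference, the session just described would look roughly as follows; the actual race timings are on the slides, so the numbers here are shortened placeholders:

    # Placeholder version of the race-timing data: the real vector has 20
    # values; here the first two observations are replaced by NA.
    time.na <- c(NA, NA, 35, 45, 83, 74, 55, 68, 38, 35)
    library(moments)
    skewness(time.na, na.rm = TRUE)   # negative for the real data
    kurtosis(time.na, na.rm = TRUE)   # 1.81 for the real data: platykurtic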
And then, the next slide is the screenshot of the same operation that we have done. Now, I would stop in this lecture. You can see that we have discussed the coefficients of skewness and kurtosis, which give you two more pieces of information about your frequency curve, besides central tendency and variation. Now, you know how to find the behavior of the frequency curve with respect to central tendency, variation, lack of symmetry and peakedness. Just by looking at the data, you were not getting all these things, but now you know how to quantify them and how the graphics in this case will look; that is, if you plot the frequency curves of the data which I have taken, say time or time.na, you can check whether the features of the curve match the information given by the coefficients of skewness and kurtosis, and you will see that they match. So this is the advantage of these tools of descriptive statistics: instead of looking at huge data sets, you simply look at these values graphically as well as quantitatively, and they will give you very reliable information. But you should know how to use this information and how to interpret the data. Up to now, from the beginning, I have used the tools when we have data on only one variable; so, we have discussed the univariate tools of descriptive statistics. From the next lecture, I will take up the case when we have more than one variable, and in particular, I will consider two variables. When we have data on two variables, they also have some hidden properties and characteristics; how to quantify them and how to obtain the information graphically are the topics which I will be taking up from the next lecture. So, you practice these tools, try to understand them and enjoy them. And I will see you in the next lecture.
Lecture – 25
Association of Variables - Univariate and Bivariate Scatter Plots
Welcome to the course on Descriptive Statistics with R Software. You may recall that in the earlier lectures up to now, we have considered the descriptive tools which are used for a univariate setup. Univariate setup means there was only one variable, and we tried to obtain the measures of central tendency, measures of variation, etc. on only one variable. Now what will happen when we have more than one variable?
For example, there can be two variables which are interrelated. So, the question comes: how to know whether the two variables are interrelated or not, and in case they are interrelated, how to measure their degree of association. So now, from this lecture we are going to study the descriptive tools which are used for more than one variable. There will be two types of tools: one is graphical tools and the other is analytical, or say quantitative, tools. So, in this lecture we are going to study univariate and bivariate scatter plots, which are graphical procedures.
So, now let me start the discussion here and first take a few examples. We know from our experience that if a student studies for more hours, usually he will get more marks in the examination. So now, if I take these two as variables, the number of
hours of study and the marks obtained in the examination, then from experience we know that both
these variables are associated and they are related.
But you can think: if you have got a data set on two variables, how would you know, on the basis of the given values, whether the two variables are related or not? And in case they are related, how to show it graphically and how to quantify the degree of association? So, the first example which I have just taken is that the number of hours of study affects the marks obtained in an examination. Similarly, we also know that when the weather temperature increases, for example during summer, then we use more electrical appliances like a cooler or air conditioner and so on; so the electricity, or say power, consumption increases.
So, I can say that the two variables, power consumption and weather temperature, are related, and their tendency is that as the temperature increases, the power consumption also increases. Similarly, in another example, we know that the weights of infants and small children increase as the heights of those children increase, under normal circumstances. So, now, my question is this: for this type of information we have used our experience and based on that we are trying to conclude these things. But how to do it statistically, how to do it mathematically, and moreover, how to show it graphically and how to quantify it?
So, now I will be considering the association between two variables. I will consider two variables and first show you what different types of plots are available.
So, now the setting is that I have got observations on two variables and both the variables are assumed to be related to each other. The first question is how to know whether the variables are really related; and if they are related, how to know the degree of relationship between the two variables. There are various graphical procedures, like two dimensional plots, three dimensional plots and so on. And there are some quantitative procedures also, like correlation coefficients, contingency tables, chi-square statistics, linear regression, non-linear regression and so on. So, we will try to study these tools one by one.
So, now first let me describe the setup: what are the variables and how the observations have been obtained; and then we are interested in creating the graphs. I simply assume here that there are two variables, denoted by capital X and capital Y, and small n pairs of observations are available on these two variables. These observations are denoted as (x 1, y 1), which occur together, (x 2, y 2), which occur together, and lastly (x n, y n), which occur together. What does this mean? Suppose I take the X variable as, say, the height of children and I take Y to be, say, the weight of children, and now I try to collect the observations on these two variables like this. Suppose I take a child and I try to measure his height, and suppose this height comes out to be 100 centimeters.
So, for child number one, the height is coming out to be 100 centimeters; I denote it as x 1 = 100 centimeters. Then I try to find out the weight of the same child, child number 1, and suppose this weight comes out to be 20 kilograms. So, I denote this first value as y 1 = 20. So now, this (x 1, y 1), which is here (100, 20), is my first paired observation. And similarly, if I take a second child, say child number 2, and if I measure the height of this child, suppose it comes out to be 90 centimeters. So, this is going to denote the height of the second child; height is denoted by capital X, so I can denote the height of the second child by x 2. And similarly, I try to find out the weight of this child. The weight is given by Y and its value is given by small y, so I write down y 2, which indicates the weight of the second child, and suppose we observe that this weight is 17 kilograms. So now, this (x 2, y 2), which is equal to (90, 17), is my second pair of observations, and so on; we can collect more observations, and these observations will be given as (100, 20), (90, 17) and so on. So, this indicates that this is the value of (x 1, y 1) and this is the value of (x 2, y 2). So, as soon as I write that there are n pairs of observations, that means these are some numerical values which we have obtained by experimenting or by observing the data in some phenomenon.
Now, after doing this, my first question is that I would like to know whether there is any relationship between the two variables or not, and I would like to judge that on the basis of the given set of data. In order to do this, we have a plot which is called a scatter plot, and in a scatter plot what we do is plot the paired observations in a single graph. How? For example, I have here two variables, X
and here Y. So, I will take the value of x 1 and y 1, and I will plot this point over here. Similarly, I take the value of x 2 and y 2, and I plot it here. So, this point is the point (x 1, y 1) and this point here is (x 2, y 2); this is called a scatter plot. Now, this scatter plot can reveal different natures and trends of possible relationships. For example, we can broadly divide the nature of the relationship into linear or non-linear. So, here you can see that I have just plotted a scatter plot here and here. You can see that these circles indicate the values of some observations; all these circles are essentially plotting the values of the x i's and y i's.
But if you try to look at the pattern in this graph, what you see is that there is a sort of trend; the trend is that as the value of X increases, the value of Y also increases, and this is happening, more so, in a linear fashion; there is a sort of linear relationship. So, I can say here that by looking at the scatter diagram I can conclude that the relationship in this case is linear; that means, there exists a linear relationship between X and Y. Similarly, if you look into the second figure, you can see that once again these points are plotted here, but these points are not actually linear; they are not showing that there is really a linear trend, but the trend is something like this. So, this is indicating that the relationship between X and Y is non-linear, right.
After this, once we have decided that the relationship is linear or non-linear, then how to see whether the strength of the relationship is more or less? We are going to consider here only the linear relationship; similar types of conclusions will also hold for the non-linear relationship, but I will not consider it here in this lecture. So, if you try to see here, in these two graphics, figure number 1 and figure number 3 on the left hand side: in figure number 1, the points are concentrated inside this band; and in figure number 3, the points are concentrated in this band. Now, if you observe the width of this band and compare it with the band width of figure number 1, you can see that in the case of figure number 3, the observations are scattered more than in the case of figure number 1, but in both the cases you can see that the trend is almost linear, which I am denoting with the red colour. In both the cases you can see here that the trend is nearly linear.
But what is happening with this trend? In figure number 1 you can see that the points are very close to the line; all the scatter points are lying very close to the line in red colour. In the case of figure 3, the points are lying quite far away from the line; with the orange line I am trying to denote the deviations. And when I compare these deviations of the observations from the trend line, the red colour line, in figure number 3 and figure number 1, I can say that the strength of the linear relationship in figure number 1 is more. Why? Because in figure number 1 the points are lying closer to the line, and in figure number 3 the points are quite far away from the line in comparison to figure number 1. Now, there is another thing which you have to observe in figure number 1 and figure number 3; try to observe my line in purple colour: you can see that as the values of X increase, the values of Y also increase. And the same thing is happening in figure number 3: as the values of X increase, the values of Y also increase. You can see here in figure number 1, this is my X and this is my Y, and now if I take another X here, this is my other Y, right.
So, this is indicating that the relationship is increasing, or we call it positive. This is what I mean by what I have written in the title, strong positive linear relationship; that means the relationship is linear and it is strongly positive in comparison to the relationship in figure number 3, where I am saying that the relationship is positive and linear, but it is moderately positive. Similarly, if you observe figure number 2 and figure number 4, you can see that in this case, as the values of X increase, the values of Y decrease; this is happening in figure number 2 and the same thing is happening in figure number 4.
So, in this case I can say that there is a decreasing relationship between X and Y in both the cases, figure 2 and figure 4. Now, if you create here a line, the so-called trend line, you can see it is like this, and in figure number 4 it is like this. Now, analyze the deviations of the individual observations from this line. If you first observe figure number 4, these points are lying quite far away from the trend line in comparison to figure number 2, because in figure number 2, if you observe, these points are very, very close to the line in comparison to figure number 4. So, I can say now that in figure number 2 the relationship between X and Y is quite strong, and since it is a decreasing relationship, we call it a negative linear relationship, because the relationship is linear; and similarly, in the case of figure 4, I will call it a moderate negative linear relationship. And similarly, in case you try to plot X and Y and you get no clear relationship, as is happening, for example, in figure number 5: you cannot say whether there is an increasing trend or a decreasing trend, or where the trend is. So, in this case, by looking at the scatter plot, I can see that there is no clear relationship, and we do not even know whether it is linear or non-linear, or whether it is positive or negative. So, now, in this lecture, we are going to study the aspect of linear relationships. So, we will assume that whatever things
we are going to do in those cases, the relationship between X and Y is linear, and there will be two aspects: graphical and quantitative, right.
So, now what I will do is take some examples and show you the commands in the R software; I will also show you how to execute them and how to understand the outcome. First I am going to discuss the plot command. One thing which I would like to make clear here is that this plot command can be used to create the scatter diagram in the univariate as well as the bivariate setup. I had not covered it when I covered the univariate graphics because I knew that I was going to cover the topic of plot here; so, why not cover it together, right. So, in case you have only one variable, the univariate case, then the data on that variable is stored in a data vector x, and the R command is plot, p l o t, and inside the argument you have to write the data vector x, ok.
(Refer Slide Time: 19:49)
Now, I take a simple example and show you how the graph will look. I have once again taken the same example which I considered in an earlier slide: we have collected the data on the heights of 50 persons, which have been recorded here, and this data has been stored inside a variable whose name has been given as height, like this. Now, I would like to plot this data. So, I give the command plot and, inside the argument, height; and you will see that you get this type of outcome on the R console. My objective is not
to show you here the figure, but to show you how to interpret it. You can see here, on the x axis, this is giving me the index. What is the index? For example, if you see in this data set, this is my first observation and this is my second observation. So, this index gives the order in which the data has been given: the first observation will have index 1, the second observation will have index 2, and the third observation, 130, will have index equal to 3. And in this plot, what have they done? They take index number one and plot whatever the value of the data is, for example here the height; then they take index number two and plot the data wherever it lies. So, on the y axis we have the height. This is a scatter diagram of only one variable. From here you can actually get information on, say, central tendency or dispersion: whether the data is more scattered or more concentrated around a particular value. So, all the types of measures of central tendency and dispersion can be viewed from this type of graph, ok.
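As a sketch, with a shortened hypothetical height vector (the lecture uses 50 values), the command is simply:

    # Univariate (index) plot: each height is plotted against its
    # position 1, 2, 3, ... in the data vector.
    height <- c(166, 172, 130, 155, 161)  # hypothetical subset of the heights
    plot(height)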
Now, I come to bivariate plots. A bivariate plot means there are two variables, and these are the plots which give us the first visual information about the nature and degree of relationship between the two variables: whether the two variables are related or not, and if they are related, whether their relationship is linear or non-linear. So, all this type of information can be obtained from these bivariate plots. In a bivariate plot, we take two variables: X, which will be plotted on the
X axis, and Y, whose values will be plotted on the Y axis; and then the values x 1, x 2, and so on up to x n, and similarly on the y axis all the values y 1, y 2 up to y n, will be plotted like this, and they will show the trend and degree of relationship. So, we will take some examples and see what the commands are and how they have to be executed.
So, now, in order to create a bivariate plot, suppose we have two variables, which I am denoting by small x and small y, in which the data on these two variables has been stored. Now, in case I want to make a plot with the values of the x variable on the x axis and the values of the y variable on the y axis, this can be executed by the command plot, but now inside the arguments I have to give the variables x and y separated by a comma. So, this is how we have to do it. Now, there is another option available in this command, which is type. Just by giving various values to type, we can create different types of graphics. For example, if I set type equal to p, then it will give me only the points, which is the default, and if I set type equal to l, this is going to give me the lines. You can see here that I have made this p and this l bold, which indicates the meaning, so you can easily remember it. And if you want the points and lines both to be present, then I have to use type equal to b, which comes from both. And similarly, if you use type equal to c, then I will get only the lines part of the "b" plot, that is, the lines alone without the points.
And similarly, if I use the option o, then I will get an over-plotted graph, meaning the points and lines are both over-plotted inside the graph. Similarly, if I take type equal to s, then I will get a graph which looks like stair steps. And finally, if I choose type equal to h, then this will give me lines which look as if I have created a histogram of this data; these are high-density vertical lines.
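In command form, the type options just described are used like this, on hypothetical data vectors x and y:

    plot(x, y, type = "p")  # points only (the default)
    plot(x, y, type = "l")  # lines only
    plot(x, y, type = "b")  # both points and lines
    plot(x, y, type = "c")  # the lines part of "b" alone
    plot(x, y, type = "o")  # points and lines over-plotted
    plot(x, y, type = "s")  # stair-step plot
    plot(x, y, type = "h")  # histogram-like high-density vertical lines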
So, now let me take an example and show you how these different types of graphics look, but before that, there are some more options available in the plot command. For example, in case you want to give a title, then you have to use main inside the argument, and you just give whatever you want as the title of the plot. Similarly, if you want to give a subtitle to the plot, then you have to use the option sub, and if you want to give a title on the x axis, then the option here is xlab. And similarly, for the y axis title the option is ylab, and if you want to maintain the aspect ratio, that is, how the graph will look, whether it should be more stretched in the x direction or in the y direction, then this is given by the option asp. If you give a numerical value to it, this will maintain the aspect ratio. And definitely, if you want more information, I suggest you look into the help on the plot command, right.
(Refer Slide Time: 26:40)
Now, I will take a simple example to demonstrate how to use this command and how the graphs will look. As we have discussed, we know that the number of hours of study of a student affects the marks obtained in the examination. So, the number of hours of study and the marks obtained in the examination are related, but this is from experience. We have collected data on 20 students: how many hours every week they studied and, finally, how many marks they obtained in the examination out of 500. The marks and the number of hours of these 20 students have been recorded as follows: for example, if a student studied 23 hours per week, then he got 337 marks; if a student studied 25 hours every week, then he got 316 marks, and so on. So, the first row is denoting the marks and the second row is denoting the number of hours per week which a student studied, and this data for 20 students has been obtained here.
(Refer Slide Time: 27:53)
So, now, what I do here first is compile all the data into data vectors. I create a data vector marks, in which I store all the observations given in the first row, and in the second case I have taken a variable hours, in which I have stored the data given in the second row. So I have created here two data vectors, marks and hours, and I will be using this example again and again. So now, I have explained to you the genesis of this data set.
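In code, the two data vectors are created as below; only the first two pairs (337 marks, 23 hours) and (316 marks, 25 hours) are quoted in the lecture, so the remaining values are placeholders:

    # Marks out of 500 and weekly hours of study for 20 students;
    # all values after the first two in each vector are hypothetical.
    marks <- c(337, 316, 327, 340, 374, 330, 352, 353, 370, 380,
               384, 398, 413, 428, 430, 438, 439, 479, 460, 450)
    hours <- c( 23,  25,  24,  26,  29,  25,  27,  28,  29,  30,
                30,  31,  32,  33,  34,  35,  36,  39,  38,  37)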
Now, I simply make a plot of these two data vectors, hours and marks. I will be using the plot command for bivariate plots, which was plot(x, y). So, I say plot and, inside the arguments, I give the two data vectors hours and marks, separated by a comma, and this gives me a graphic like this one. You can see that these are the data points, and by looking at these data points you can see that there is going to be a sort of linear trend; trend means that most of the points are going to lie nearly on a straight line. Now, whether these points lie close to the line or away from the line will give some idea about the extent of the relationship between the marks obtained in the examination and the number of hours studied, right.
Now, in case I want to change this type, I am going to use the type l; if you remember, this l was used as the type to create the graph with lines. So, you can see that once I do it, I get a graph like this one, in which all the data points are interconnected by lines. And if you wish, you can compare it with the earlier curve: if you take this point, this, this, this and so on and connect them by lines, that will give you a line plot.
(Refer Slide Time: 30:33)
Similarly, if you choose the option type equal to b, b means both lines and points. You can see that this is the combination of the first two graphs that we obtained: all those points are here, and they are connected by these lines like this, this, this and so on. So, this is how you can obtain this type of plot. Similarly, if I use the option type equal to o, o means over-plotted. What do you mean by over-plotted? In the earlier graph, if there were two points, they were simply joined like this; but now the line passes through the points like this one. You can see these are the points, joined from point to point, with the line and the dots both over-plotted. So, this type of graph will again give us a different type of information.
(Refer Slide Time: 31:49)
Now, in case I use the type h: we have discussed that h gives a sort of histogram, or say high-density vertical lines. You can see here we have this data, and the data points have been joined to the x axis like this. So, this is another type of plot which can be obtained by using the type h.
And similarly, if you want to create a stair-steps type of plot, then use type equal to s. If you see the plot which we have obtained here, these are my data points: you can first see what my data points were in the first curve. This point is here, and the same points are here and here, and now they have been joined by a stair type of plot. You can see these points are going like steps; you have seen that when we climb to the roof, there are stairs like this one. So, that is why this plot is called a stair-step type plot.
And now, in case you want to make it more informative, suppose you want to add here the title of the plot and titles on the x and y axes. For this you have to use different options, and if you remember, we discussed these things in more detail when we discussed the graphics in the case of univariate data. So, exactly in the same way, suppose I want to give here a title like Marks obtained versus number of hours per week. We have discussed that this can be given by the option main; so, I write main = the title inside double quotes. And similarly, if you want to give a title on the x axis, suppose I want to give Number of weekly hours, then in order to do this we have the option xlab; so, I give whatever title I want on the x axis by writing xlab = the title inside double quotes, and this gives me an outcome like this.
Similarly, in case you want to have a title on the y axis, suppose I want to give Marks obtained, then for that we have the option ylab; so, I write ylab = the title which I want inside double quotes, and the outcome is given over here. And similarly, there are some other options, for which I would say please look into the help menu. So now, I will show you how these graphics using the plot command are constructed inside the R software.
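Putting the title options together, the command described on these slides looks like this (a sketch using the marks and hours vectors from before):

    plot(hours, marks,
         main = "Marks obtained versus number of hours per week",
         xlab = "Number of weekly hours",  # title on the x axis
         ylab = "Marks obtained")          # title on the y axis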
So, let us come to our software part. You can see here that I already have stored the data h o u r s, hours, and the marks; this is the same data that we obtained in the example, you can verify it. Now I am making here a plot between marks and hours, and as soon as you enter it, you get this type of graphic, right. You can see here, this is the same graphic that we obtained. Now, in case you simply change the order of the variables: at this moment marks are coming on the x axis and hours are coming on the y axis. Suppose I want to interchange them: in place of marks I will give hours, and in place of hours I will give marks. So, I type plot(hours, marks), and now as soon as I enter, you can see that the graphic has changed, like this.
(Refer Slide Time: 36:02)
So, now I will continue with the same command; I will clear the screen, and I will show you the different options which you can use with the type option. The first option was type equal to l, by which I get the lines. You can see that as soon as I press enter, the graphic changes; you can see here this is now l. And suppose you want to have lines and points both: then I have to give type equal to b, and you can see here you get a different graphic. Similarly, if you want to use the over-plotted option, choosing type equal to o, small o, then as soon as I press enter, I get this curve; this is the same graphic which we have seen inside the slides. And similarly, if you want to have histogram-like, or say high-density, lines, then using type equal to h we get this type of graphic, as you can see here. And finally, if you want stair-type graphics, then I have to use type equal to s, and as soon as I do so, I get a graphic like this one. So, you have now seen that creating graphics is not difficult at all. Now, suppose you want to add some more features: there are some default features, such as the names of the variables coming on the x and y axes, but suppose you want to add a title on the x axis, a title on the y axis and a main title; then you have to use the same options. So, see here on my slides what I had shown you.
(Refer Slide Time: 38:02)
Now, I will copy this command and paste it on the R console. You can see here it is coming like this; and as soon as I press enter, the graphic changes: now you have the titles on the x and y axes as well as the main title. And similarly, in case you want to change the colours, this is also possible using the options; I have shown you the basic operation, and you can make the graphic as beautiful and as informative as you want.
So, now I would stop this lecture. In this lecture, I have given you an idea of the plot command. I am not saying at all which command, which graphic or which type is the best. It depends only on you people, or the experimenter. The experimenter has to decide, or you have to decide, what type of information you want from the graphic and which graphic is more suitable to provide that information in the correct way. And this also comes with practice and experience. So, at this moment the objective is to learn how to create graphics and how to make them more informative and more interactive using the R commands. So, please take some data set, experiment on it, give different types of options, create the plots and see what type of information they provide; try out the different types.
And you practice, and we will see you in the next lecture; till then, good bye.
Lecture – 26
Welcome to the lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture, we started a discussion on the association of variables. We had considered how to construct bivariate plots using the command plot, and we had used different types of options to create more interactive plots. Now, when we are trying to use the plot option to create a scatter plot, we would like to have two types of information: one is the direction, or the trend, and the second is the magnitude, which is decided on the basis of the strength of the relationship.
So now, the question is: by looking at the scatter diagram, how would you know whether the strength is more or less? In order to do so, we had earlier created a line manually, but now that can be done on the basis of the software also. So now, we are going to consider plots where we create the scatter plot and also add a smooth line, and that line will give us a sort of fit; by comparing the observations with that fit we can judge whether the strength is less or more, whether the relationship is weak or strong and so on. So, in this lecture we are going to consider the scatter smooth plots.
So, now we assume that there are two variables, those variables are related, and we have obtained n paired observations, say (x 1, y 1), (x 2, y 2), up to (x n, y n). Now, the objective is to create a scatter plot, and inside the scatter plot we want to have a line which is called a fitted line; why this is called a fitted line will become clear to you after some lectures. When I do so, this type of graphic will provide us the information on the trend or relationship between the variables.
And in order to construct such a graphic in R software, we have a command scatter dot smooth, s c a t t e r dot s m o o t h. This command produces a scatter plot and also adds a smooth curve to the scatter plot.
So, now, how to do it and what are the details? This function, the command scatter dot smooth, is based on the concept of loess, l o e s s, which is the locally weighted scatter plot smoothing method. It is used for local polynomial regression fitting, and in this case it fits a polynomial surface determined by one or more numerical predictors, using local fitting.
Definitely, I am not going to discuss the details of loess here, but we are simply going to use it. I mention it because, later on, the details of some options are written in terms of l o e s s, loess, and I do not want you to get confused at a later stage. Well, if you want more details about this scatter dot smooth function, please look into the help of this command and you will get more details, right.
So, now, the more detailed form of scatter dot smooth, which gives you a scatter plot and a smooth curve, is the following; you can see here the command is the same, scatter dot smooth. I am giving here the data x and y; y here is actually NULL because we can make the plot with only one variable, but if you want to make a bivariate plot, you can use both the x and y data. Then there are different options: span controls the smoothness of this loess fit; degree is the degree of the local polynomial which is used for fitting; and then it asks for family. This family can be symmetric or gaussian; these are different methods for fitting the data. In case we are using gaussian, this indicates that the fitting has been done on
the basis of the least squares method; we will consider the least squares method at a later stage. And in case the option symmetric is used, then this indicates that a re-descending M estimator is being used to fit the data.
And similarly, if you want to give the labels on the x axis or the y axis, there are the options xlab and ylab. And if you want to give the limits on the y axis, this is given by ylim, and this is a range; if you remember, range gives you two values, minimum and maximum, which we had discussed. After that, if you are handling missing data, then you have to use the option na.rm = TRUE or FALSE. There are some more options available, but I would request that you please look into the help menu and try to understand how to use them.
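A sketch of the fuller command with the options discussed above, using illustrative values on the marks and hours vectors:

    scatter.smooth(hours, marks,
                   span   = 2/3,           # smoothness of the loess fit
                   degree = 1,             # degree of the local polynomial
                   family = "gaussian",    # least-squares local fitting
                   xlab   = "Number of weekly hours",
                   ylab   = "Marks obtained")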
Now, I will take an example and show you how to plot such graphics. I am going to consider the same example which I discussed in the last lecture, where we obtained the data on the marks obtained by the students out of 500 and the number of hours they studied in a
week; and this data was obtained for 20 students. This data is given like this: the first row here is the marks of the students out of 500, so these are the marks the students obtained, and the second row is giving the information on the number of hours they studied in a week, like this here, ok. And this data I already have stored in two variables, which I am calling marks and hours, exactly in the same way as I did in the last lecture.
Now, after this I would like to create a scatter plot with a smooth curve using this data.
And for that I use the command scatter dot smooth, and inside the argument I give the names of the variables, hours and marks; and if you see here, this is the graphic that we obtain, right. You can see that these points are the same points which were occurring in the scatter plot that we had constructed in the last lecture. But now there is a line which has been added, and this line helps us in knowing how much the individual observations deviate from it. If these deviations are small, or if these points are lying very close to the line, I can say that the strength is quite high. Suppose you had got the same data with this line, but the points were lying here and there and so on; this may happen, and in this case I would say that the strength is weak, or the degree of association is low. So, this is how we do it, and now I will show you on the R console how to operate it. I already have entered the data on hours and marks, as you can see here.
(Refer Slide Time: 08:43)
And I make here a scatter smooth plot by this command, and you can see that we are getting the same curve which we have shown here, right, ok. So, now let us come back to our slides. Now if you see, I am taking a very small data set here: I am considering just 10 values, and these 10 values indicate the weights of 10 bags of grains. This data, which is
recorded here for 10 bags in, say, kilograms, is stored inside the variable weight, right. I am plotting the scatter smooth graph for this data set, and it is plotted here; the graph looks like this. You can see these are the points, and this is indicating that possibly the relationship is not actually linear, but a sort of non-linear relationship. Please keep this figure in mind, because I will give you some more information later on using the same data set, ok. And if you want to plot it on the R console, I can copy this data here, right, it comes out like this, and I simply make a scatter smooth plot of it, which comes out like this. One thing you have to notice is that in this example I have used only one variable: you can see in this graphic that the value on the x axis is only an index. So, what we learn here is that the scatter smooth command can also be used for getting the curve for univariate data.
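A hedged sketch of this univariate use, with placeholder bag weights:

    # Ten hypothetical bag weights in kilograms; with a single variable,
    # scatter.smooth plots the values against their index.
    weight <- c(102, 98, 105, 110, 99, 101, 97, 104, 108, 100)
    scatter.smooth(weight)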
(Refer Slide Time: 11:10)
And now I will take an example where there are more data points, and I will show you how this curve looks in the univariate case, with a single variable. I have collected the heights of 50 persons, like this; it is the same example that we considered earlier, and this data has been stored inside a variable height. Based on that, I will create a scatter smooth plot of this data.
So, you will see here, this looks like this, and it is indicating that, well, this could approximately be a sort of linear relationship, but it suggests that the relationship here is curvilinear. On the x axis this is only the index of the observation, and on the y axis these are the values of the heights. The reason why I am taking this graphic is that I want to create another graphic on the same data set and compare both of them together. But if you want to see the same graphic on the R console, you can see here that I am storing the data in the data vector height, and I create here a scatter smooth graph of height; it will come out like this.
(Refer Slide Time: 12:35)
So, this is the same one which we have shown here on this slide. Now, there are some more options available for this scatter smooth curve, which will give you more interactive plots, and we already have discussed them. So, you can just look into the help.
(Refer Slide Time: 12:59)
Now, I would like to discuss another type of smooth scatter plot. There is another command in R software which produces a smooth scatter plot, but this plot is a little bit different from what we obtained earlier: this is essentially a smoothed and colored density plot, and it is obtained through a two dimensional kernel density estimate. You may recall that when we were discussing the graphics in the univariate case, we had created the frequency density curve, or say density plots, using kernel estimates. There we defined the kernel functions; those kernel functions had some nice statistical properties, similar to the properties of a probability density function. That kernel function was for one variable, so it was a univariate kernel function. Similarly, when we have more than one variable, in statistics it is possible to handle them through probability density functions which are functions of more than one variable. Suppose you have two variables x and y, or three variables x, y and z; then it is possible to define the joint probability density function of x and y, or the joint probability density function of x, y and z. And similarly, the kernel functions can also be defined in a multivariate setup. So, in this case, this plot is going to use the concept of kernel density estimates in two variables, and that is why this is called a two dimensional kernel density estimate. Well, we are not going into the details of what those
kernel density estimates in two dimensions are, but definitely you should know what we are going to use.
And definitely, if you want to know more details: we already have understood that one of the biggest advantages of R software is that it is not a black box. You can go to the site of R software, look into the help, and there will be all the details of how this scatter plot is constructed, ok. So, let us now come back to our slides. In order to create this type of smooth scatter plot, the command is smoothScatter, but try to see the difference: one letter, the S of Scatter, is a capital letter, and that you have to keep in mind. The spelling is s m o o t h in small letters, then S in a capital letter, then c a t t e r in small letters, and inside the argument you give the data.
And similarly, if you want more options, you can see they are given here, but definitely I am not going to discuss all of them. I have given them on the slides, which are with you, and you can have a look; if you use them, you will get more informative and better graphics.
(Refer Slide Time: 16:18)
So, just have a look at the help on this smoothScatter and you will get all this information, right.
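With the same placeholder weight vector as before, the command is simply:

    # Note the capital S: smoothScatter comes with base R's graphics package.
    smoothScatter(weight)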
So, now I will first take the same example which I just took in the case of univariate data: the data on weight, where we had collected the weights of 10 bags of
grains in kilograms, right; and this data has been given inside the data vector weight. Earlier we had obtained the smooth scatter plot using the command scatter dot smooth, and that curve was like this. Now, I will use this new command smoothScatter on the same data set, and we will obtain the new plot.
So, you can see here, when I use the command smoothScatter for the data on weight, I get a graphic like this one. What are these points and what is this showing? If you recall, the earlier plot that was obtained was like this. Now, if you put both plots side by side, you can see the similarity. This is a point here, this is a point here, and this is a point on the earlier plot; let me call these points number 1, 2 and 3; they were created as dots. Now, the same points have been created with this new command smoothScatter, and they are here: point number 1, point number 2 and point number 3. So, this point is here, this point is here and this point is here, right. And similarly, if you try to see here, now I will use a different color, so that
you can observe the movement of my pen: these two points are here. And similarly, this point is here, and these points 1, 2 and 3 are here. So, you can see that both these plots give us similar information, but their structures are different. And now it depends on the experimenter, or the statistician, or you people: you have to decide which graph, under the given circumstances, gives you the better picture. One situation where this smoothScatter type of plot will be more useful is when you are obtaining the observations and you are not 100 percent confident about the values, whether a value is 20 or 20.1 or, say, 19.8; in that case, the graphics obtained using the smoothScatter command will be more useful, because they also try to show you the uncertainty involved in a point. When I say the value is 20, the margin of error should be as small as possible, and if the value is exactly 20, there is no error; whether the value could instead be 19, 19.5, 19.8, or 20.2, 20.5, 21 or 22 is indicated by the decreasing tint of the
color. Now, if you see this graphic, suppose you observe this black colored one: you can see that the values in the center are darkest. Similarly, if you take any other point here, the middle part has the darkest color. And as we move from the center, if you watch my pen in red color, we are moving towards the outside; in this point number 1, this is the center part, and as we move away from the center, you can see that the color becomes lighter. So, this is what is indicated by this type of curve or graphic: as the color becomes lighter, it shows the level of uncertainty. If you are confident that your data value is correct, that is indicated by a stronger color; but if you are not confident about the data, then that variation, or that uncertainty, is indicated by a lower tint of the same color or similar colors.
So, this is how you have to decide which of the graphics you would like to use. And now, if I take the earlier example, where we have collected the heights of 50 persons,
then if I create the smoothScatter plot, it will look like this; you can see that because the number of data points is quite large, this is more concentrated, right. And if you compare it with the earlier scatter plot, you can see here, earlier we had obtained this; and now, executing the new command smoothScatter, we have obtained this plot. So, you can see here, this point and this point are similar; and similarly, on the right hand side corner, these two points and these two points are similar. So, you can see that both these graphics give similar types of information, but in a different way. So, now, before I go further, let me plot these things on the R console. So, first let me copy here the data.
(Refer Slide Time: 23:27)
So, this data is given here in the data vector, and now I copy the same command smoothScatter onto the R console. You will see that, as soon as I execute it, it gives me this type of information. Up to now, you could see that I have taken two examples where I considered univariate data, and I have shown you how the corresponding plots look.
Now, I will take bivariate data and show you how the information can be retrieved and how it appears in the smooth scatter plot. You may recall that we had considered an example where we collected data on the marks obtained by students and the number of hours they studied every week. This data was stored in the variables marks and hours, and based on that, I will make the smoothScatter plot.
And if you execute it, you will see that you get this type of plot. This indicates that there is a sort of linear trend, and it also gives you a sense of what the trend line can be. You simply have to visualize the difference between the green line and the darkest colors of the data, which are here in the center. This gives us information about how things are happening. So, I will show you how this is done on the R console. You can see that I have already stored the data on hours and marks, because we have just used it.
(Refer Slide Time: 25:16)
And now I give the smoothScatter command here and you get a plot like this one.
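As a minimal sketch (the twenty actual values are on the slides; the short vectors here are only illustrative):

marks <- c(60, 65, 70, 80, 85)  # illustrative marks
hours <- c(20, 22, 25, 28, 30)  # illustrative weekly study hours
smoothScatter(marks, hours)     # bivariate smoothed scatter plot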
So, now I would like to stop this lecture here. You have seen that we discussed how to construct smooth scatter plots. We have discussed two types of plots, and for each type of graphic there are different commands, although I am not discussing them because they work exactly on the same lines as before. My request is that you please experiment with them. You can take even the same data set and see how you can add or change the axis labels and the colors of the dots, how you can give different titles, and how you can incorporate more information using the different options available with the command. So, practice it, and we will meet in the next lecture with some more topics.
Descriptive Statistics with R Software
Prof. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology, Kanpur
Lecture- 27
Association of Variables - Quantile – Quantile and Three Dimensional Plots
Welcome to the next lecture on the course Descriptive Statistics with R Software. You may
recall that in the earlier lectures we started our discussion on the association of two variables and discussed several types of two dimensional plots. Those plots were trying to give an idea about the direction and the degree of association between two variables.
Now, I am going to consider two dimensional plots again, but ones that are used in different contexts. So, in
this lecture I am going to talk about Quantile Quantile Plot and after that I will give you a brief
introduction and brief description about the Three Dimensional Plots which are available in R
software.
Now, there is a situation where two samples have been obtained. We want to know, on the basis of the given data, whether the two samples have been drawn from the same population or from different populations. These types of situations arise in several statistical tools, for example in testing of hypothesis. We conduct one-sample tests, two-sample tests and so on, and in those tests there is a requirement that the sample comes from a particular type of population, characterized by some probability density function.
For example, you must have seen in books the popular sentence: let $x_1, x_2, \ldots, x_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^2$. So, definitely in this case we are saying that there is a population which is very big and practically unknown to us, and this population is characterized by the normal density function.
Now, I am drawing a small sample, say 20, 30 or 100 observations, and I would like to know whether the sample is coming from a normal population or not. This type of information is needed because tools like the t test, the z test and the other tests used in testing of hypothesis are constructed assuming that the population is characterized by a normal distribution.
So, unless and until this assumption is verified, the further statistical inferences will be questionable. One question I would like to address here is how to judge whether a particular given sample is coming from a normal population. And if I extend this concept, then suppose two persons have brought two samples to me and I would like to know whether these samples come from the same population or from different populations.
In order to make such a comparison, one option is to compare the quantiles of the samples and then conclude whether the two samples are coming from the same population or not. If they are coming from the same population, I would expect the quantiles of both samples to be nearly the same. Similarly, when I want to test whether a sample is coming from a normal population, I compare the quantiles computed from the sample with the quantiles of the normal probability density function; if they match, I can say yes, my sample is coming from a normal population with a certain mean and a certain variance. These types of plots are called quantile quantile plots. So first, let us try to understand the concept and interpretation, and then I will show you how to use them in the R software.
So, in the case of quantile quantile plots, what do we try to do? We consider two variables, find their quantiles, and when the quantiles of the two variables are plotted against each other in a two dimensional plot, we get the quantile quantile plot. For example, if I have two variables X and Y, and based on them I have two samples, say $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$, these samples may be of the same size or of different sizes.
Then I plot the quantiles of x on the x axis and the quantiles of y on the y axis, and I see how they match; if they match, I say yes, they are coming from the same population, otherwise not. It is something like this: the 25 percent quantile of x against the 25 percent quantile of y, the 40 percent quantile of x against the 40 percent quantile of y, the 70 percent quantile of x against the 70 percent quantile of y; if they match, then I can join these points and that is going to be a straight line.
So, these types of graphics provide us a summary of whether the distributions of the two variables are similar or not with respect to location, and we plot the quantiles of the variables against each other.
In order to plot such quantiles in the R software, we have the command qqplot, and inside the arguments we give the data vectors. So, if I have two data vectors x and y, the command qqplot(x, y) is going to give us a QQ plot of the two data sets. Similarly, there is another command, qqnorm. This produces a normal quantile quantile plot of the values in the data; in this case they are compared with the quantiles of the normal distribution. So, here the command is qqnorm and inside the argument you have to give the data vector x.
Similarly, there is an option to add a line inside this normal QQ plot; the line is based on the theoretical quantiles, which by default are normal. The quantiles through which the line passes can be controlled by the probs argument. You may recall that we had used the probs argument to define the probabilities, that is, to indicate for which probabilities the quantiles have to be computed; you may look into the lecture on quantiles where we used it. So, the command here is qqline, and inside the arguments we give the data vector. This will plot a line inside the QQ plot.
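To fix the syntax, here is a minimal sketch with simulated data (the rnorm samples are only for illustration):

x <- rnorm(100)              # first sample
y <- rnorm(150, mean = 0.5)  # second sample, possibly different location
qqplot(x, y)                 # quantiles of x against quantiles of y
qqnorm(x)                    # sample quantiles against normal quantiles
qqline(x)                    # adds the reference line to the qqnorm plot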
So, the first question is how to make interpretations from these quantile quantile plots. I will take different possible situations and, based on them, show you how we draw a conclusion or a statistical inference. Suppose I plot the quantiles of the data on x and the data on y in a two dimensional plot; these dots show the data. In this case, you can see that all the data lies on a straight line, and this line is essentially drawn at an angle of 45 degrees from the x axis.
So, when all the quantile points lie on this 45 degree line, it indicates that the two samples have similar distributions. In practice it is not always possible to get such a 100 percent clear straight line, but the plot will look like this. If I fit a trend line here, it will look like this, and you can see that the points lie nearly on the straight line. So, in this case we can say that yes, the two samples are coming from two populations which have similar distributions.
Similarly, suppose we get a quantile plot like this one, where all the data points lie below the straight line and no point lies in this region. In this case, I can conclude that the y quantiles are lower than the x quantiles, which is interpreted as the y values having a tendency to be lower than the x values. This obviously indicates that the distributions from which the samples on x and y have been drawn are not similar. In practice, if you get data like this and the trend line passes through like this, you can see that most of the points lie in the region below the line and very few points lie above it. So, in general, I can say that the y values tend to be lower than the x values and hence the distributions are not the same.
Similarly, the opposite can also hold: all the data points lie above the line and no data point lies on the lower side. This indicates that the x quantiles are lower than the y quantiles, with the interpretation that the x values have a tendency to be lower than the values of y. Hence the two samples from x and y are not coming from the same distribution.
Similarly, if you get a QQ plot like this, where there are some data points on the lower side of the line, then suddenly a break, and after that a few points towards the end lying above the line, this indicates that there is a break point up to which the y quantiles are lower than the x quantiles, and after that point the y quantiles are higher than the x quantiles. You can see here the region where the break point occurs: before it the quantiles lie below the line, and in the upper part they lie above the line. So, in this case also we can interpret that the two samples come from two different populations, and those populations are not the same.
Now, so far I have taken two samples, one from x and one from y. If instead I take one set of quantiles to be the theoretical quantiles from the normal distribution, then I can compare the quantiles of a data set with the quantiles of a normal distribution.
So, let me show you through an example; I will make these quantile quantile plots, or QQ plots as they are popularly called, using this data set. This is the same example that I have used earlier a couple of times: the heights of 50 persons, recorded in centimeters, stored inside the data vector height.
(Refer Slide Time: 16:14)
And now, after this, I first prepare a qqnorm plot of this data set. You can see that these points are lying here; the dots indicate the quantiles, and the overall shape looks something like this. You can see it is approximately linear; there are some points here and here which go beyond the line, but most of the points lie on the straight line. So, I can conclude that the quantiles computed from the given sample and indicated on the y axis, and the theoretical quantiles of the normal population computed from the normal PDF, are nearly matching. One can safely assume that this data is coming from a normal population.
Now I add a line using the command qqline(height), and you can see that this line has been added to the same quantile quantile plot. That helps us compare how much each point deviates from the line. You can see the deviation is small here, and at the start and towards the end the deviation is larger, something like this. This helps us conclude whether the sample is coming from a normal population or not.
Now, in another example, I take the same data set which I had used earlier on two variables. In this data set, 20 students have given their data on the marks obtained and the number of hours they studied every week; the first row gives the marks and the second row gives the number of hours studied per week, and this data is contained in the two data vectors marks and hours.
(Refer Slide Time: 18:48)
When I make a qqplot, the command will be qqplot with the two data vectors marks and hours inside the arguments. This qqplot helps us here: I have two data sets, one the marks and the other the number of hours. Suppose the marks come from population number 1 and the hours come from population number 2. I want to see whether these two populations are the same or not, whether population 1 and population 2 have similar or different characteristics. In this case also, you can see that a line can pass through these points at an angle of 45 degrees. One can conclude that most of the points lie on or near the line, so I can say that the samples are coming from similar populations.
Here you can see the deviations that we need to look at; I am creating this trend line, and you can see how the points lie and whether the trend is linear or something else. So, looking at this data set, we can get this idea. Now, before going further, let me show you these operations on the R software.
(Refer Slide Time: 20:30)
So, I first create the data vector height; you can see this is the data contained in height, and now I plot qqnorm(height). As soon as I press enter, you can see I get the same plot which I showed you on the slides. And if I use qqline, meaning I would like to add a trend line to this QQ plot, you can see that now we have the data, the QQ plot and this line; try to follow the cursor.
(Refer Slide Time: 21:11)
Next, I make a qqplot with marks and hours. I have already stored this data; marks and hours are like this, and if you make a qqplot between marks and hours, you get a QQ plot like this one. Also, please have a look and see that if I change the order of the variables, say qqplot(hours, marks) instead of qqplot(marks, hours), the orientation will simply change, but the information and the conclusion we draw are going to be the same. Now, let us come to the next topic.
Now, I am going to discuss briefly how to create three dimensional plots. In the R software we have a facility to create several types of three dimensional plots; it is not possible for me to give the details of all of them, but I will show you how to create these plots and how to get started, and I will give you examples of how the different types of 3D graphics are made.
So, one question you may ask: in what type of situation are these three dimensional plots useful? Whenever we are dealing with multivariate data and we want to study the interdependence of the variables on each other, we try to make such plots. For example, taking the example in my slides: for children, height, weight and age all change with time. As the age increases, height increases and weight also increases; as the height increases, the weight also increases, and so on. Now, how do we explore this type of interdependence? For that we will create a scatter plot.
Now I am going to take some examples and show you the commands with them. The first plot I consider is the three dimensional scatter plot, created by the command scatterplot3d: s c a t t e r p l o t, then the numeral 3 and the letter d, all in small letters. Inside the arguments I give the data vectors for which I need to create the plot, and this command will plot a three dimensional point cloud of the data x, y and z. But for this I need a special package; it is not included in the base package of R, and the package needed here is scatterplot3d.
So, first you need to install the package using the command install.packages, typing "scatterplot3d" within double quotes inside the arguments, and after installing it you need to load it using the command library(scatterplot3d). You already know how to do this; otherwise you can simply use this command on the R console and install it.
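On the console, the two steps look like this:

install.packages("scatterplot3d")  # one-time installation from CRAN
library(scatterplot3d)             # load the package in each session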
Now, I take the example just discussed: we have data on 5 persons for their height, weight and age, and we would like to create a three dimensional plot for this data set. I am taking only 5 data values because I want to show you how the picture will look, so that you can see inside the picture how these values appear. Person number 1 has height 100 centimeters, weight 30 kilograms and age 10 years, and so on for the rest of the data set.
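A minimal sketch of the data entry and the plot; only person number 1's record is given in the lecture, so the remaining values here are made up for illustration:

height <- c(100, 110, 120, 130, 140)  # in centimeters
weight <- c(30, 34, 38, 42, 46)       # in kilograms
age    <- c(10, 11, 12, 13, 14)       # in years
library(scatterplot3d)
scatterplot3d(height, weight, age)    # 3D point cloud of the five persons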
(Refer Slide Time: 25:34)
Now, I have copied this data set into three vectors: height, weight and age. Before that, I installed the package scatterplot3d and loaded it on the R console. Now, if you use the command scatterplot3d(height, weight, age), you get this type of picture. You can see the dots indicating the values 1, 2, 3, 4 and 5, and you can see a sort of cube or cuboid; the graph gives us information about how the points lie inside that cuboid, right.
By looking at such graphs you get an idea of how things are happening. It is also possible to create surfaces, called surface plots, which give you an idea of how the variation in the data is happening, or how the data behaves, by looking at these observations.
In this type of plot there are various options by which you can draw more meaningful inferences. For example, suppose I want to change the direction; direction means that on one axis we have age, on another axis height, and on another weight. Now I change the direction of the cuboid, taking weight on this side, height on the x axis and age on the other side. I use the same command scatterplot3d with height, weight and age, but now I give the option angle = 120. By giving the angle option, I can control how much the cube or cuboid is turned or rotated.
So, the earlier picture is now rotated to an angle of 120 degrees, and you can see these are my points; now one can see a sort of curve, whereas earlier it was showing as if there were a straight line. By making different cuboids with changing angles, you can get an idea of what is really happening.
Similarly, if I want to change the color of the points, I can use the option color =. For example, here I have used "red" inside double quotes, and you can see the color of these points now comes out red. This option can also be combined with the angle option, as the sketch below shows. Before I go further, let me show you this on the R console.
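Combined, the call looks like this (with the height, weight and age vectors defined above):

scatterplot3d(height, weight, age, angle = 120, color = "red")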
So, I collect the data here: this is my height, this is my data on weight and this is the data on age. Now I need to first load the library; I have already installed this package on my computer, but you can do it yourself, right. And now, if I use this command to create the scatter plot of height, weight and age, you can see I get this type of picture.
And if I add, say, angle = 120, the picture changes; I can show you both things together. Suppose now I make the angle not 120 but, say, 150. You can see these points change; the direction is now changed. Similarly, if you want to add color, color and angle can be combined: color = "red". You can see this now gives me red points, and if I make the color blue, now the points are blue.
By looking at these types of patterns, you can get more information. One good exercise is to write a program in which the angle changes continuously, say 1 degree, 2 degrees and so on. Then you will have a picture which is continuously rotating, and you can get a three dimensional view, which is possible in R just by writing a small loop; a sketch follows below.
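A minimal sketch of such a rotating view, assuming the height, weight and age vectors above:

for (a in seq(0, 360, by = 2)) {
  scatterplot3d(height, weight, age, angle = a)  # redraw at the new angle
  Sys.sleep(0.05)                                # short pause between frames
}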
(Refer Slide Time: 31:11)
Similar to this three dimensional plot, we have some other types of graphics which I am not going to discuss here, but I am just informing you about them. One is the contour plot, which gives you a plot with contour lines; we have the dot chart; we have image plots, which produce a picture with colors as the third dimension; and we have the mosaic plot, which gives multi dimensional diagrams, particularly in the case of categorical variables or contingency tables, which we are going to use later on. And there is another, the perspective plot, obtained by the command persp, in which case you get surfaces over the x-y plane.
So, I will simply take an example. Although I am not going to discuss it in detail, I will show you how the perspective plot is created, how it looks and what its advantage is. For example, I take x as a sequence between minus 10 and 10 with 30 observations, and y the same as x. Then I create a function which computes r as the square root of x squared plus y squared and returns 10 times sin(r) divided by r. Then I obtain the z matrix as the outer product of x and y under this function; I use a logical condition to replace the undefined value at the origin, and I give the other parameters. Then I create the plot with persp(x, y, z), with the further parameters theta = 35, phi = 30, expand = 0.5 and col = "lightblue", and the graph will look like this.
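Putting the steps together; this mirrors the classic demonstration from R's persp documentation, and the parameter name phi for the second viewing angle is inferred from the slide:

x <- seq(-10, 10, length = 30)
y <- x
f <- function(x, y) { r <- sqrt(x^2 + y^2); 10 * sin(r) / r }
z <- outer(x, y, f)  # z values over the x-y grid
z[is.na(z)] <- 1     # replace the undefined 0/0 value at the origin
persp(x, y, z, theta = 35, phi = 30, expand = 0.5, col = "lightblue")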
(Refer Slide Time: 32:55)
And similarly, if you add some more options here, say the tick type, shading and so on, you can obtain a different perspective plot. I will simply show it on the R console, so that you are confident that these things are possible. I will copy all these commands at the same time.
I will remove this, and you can see that I have simply copied these commands here; now I am going to plot this surface. When I execute this command, I get this type of plot. Similarly, if I use the second command, as soon as I execute it the color changes, there are shades, and it is more informative.
Now, I would like to stop here with the graphical tools for studying the association between two or more variables. In the given time frame I have taken some representative topics and some important types of plots, but this does not mean these are the only plots. There are many others, and you have seen that if you want to make your graphics more informative, the simple rule is to take help from the R software about the syntax and commands, see what type of information they can give you, use those options inside the arguments, and control your graphic the way you want.
Now you can see that with these one dimensional, two dimensional and multi dimensional graphics, you can produce very good graphics of the kind people try to create with expensive software; the only thing is that here you have to study a little and you have to understand. But the advantage is that you can control each and every parameter and characteristic of the graph, whereas with built-in packages you do not have many options. So, if you spend some time and try to learn more, you will become successful in creating good graphics, which will give you lots of the hidden information contained inside the data.
In the next lecture I will develop some tools so that we can get such information in a quantitative way. So, practice with these graphics: take some examples, create graphics, experiment with them, try different combinations of the parameter values and see what you get. I will see you in the next lecture; till then, good bye.
Descriptive Statistics with R Software
Prof. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology, Kanpur
Lecture – 28
Association of Variables – Correlation Coefficient
Welcome to the lecture on the course Descriptive Statistics with R Software. You may
recall that in the earlier lectures we started a discussion on the aspect of association
between two variables and more than two variables and we have two types of tools-
graphical tools and analytical tools. In the last couple of lectures, we have discussed the
graphical tools to study the association between or among the variables. Now the next question is how to quantify that association. For example, we have seen in graphics that an association can be strong or weak; but how do we quantify this, that is, what value represents a strong association and what value represents a weak one?
So, we will try to do this in the next couple of lectures. In this lecture we are going to discuss the correlation coefficient, and I will give you only the concept and theory of the correlation coefficient: what the formula is, what the structure is, how you use it, and what the interpretations are in different situations. In the next lecture I will show you how to compute it in the software and how to interpret the graphics and the numerical values together. So, let us start our discussion.
So, we have already understood what is called the association between or among variables. Now I will take several examples to explain how you would differentiate the association between continuous variables, between discrete variables, and so on. As an example, we all know that as the number of hours of study increases, students usually obtain more marks. The number of hours can be expressed in hours, minutes and so on, which is essentially time, and marks can also be measured on a continuous scale: they can be 70 or 70.50.
So, in this case you can see that there is a relationship between the time spent on studies and the marks obtained in the examination, and we would like to see whether this association is increasing or decreasing, and what the strength of the relationship is. Similarly, another example: we know that during summer, when the weather temperature is high, the consumption of electricity increases; people are using air coolers, air conditioners and so on.
So, one can feel that as the weather temperature increases, the consumption of electricity also increases. But we would like to verify this on the basis of a given sample of data: we would try to quantify the degree of association, to say whether the association is strong, moderate or weak, and to see whether the trend is increasing or decreasing. Similarly, another example: we know that for small babies and children, as their age increases, their weight also increases up to a certain age; after a certain age the height and weight both stabilize.
So, I can say that as the age of those babies increases, their height and also their weight increase. There is an association between the variables, and in these cases the variables are continuous.
(Refer Slide Time: 05:26)
So, we have noticed that if the number of hours of study increases, the marks obtained by the students also increase; the number of hours of study affects the marks obtained in an examination. Similarly, the power or electricity consumption increases when the weather temperature increases, and the weight of infants and small children increases as their height increases, under normal circumstances. In these cases you can observe that the two variables are continuous in nature. So, the question is how to quantify this association.
Similarly, let me take another example, in which I consider variables which are discrete; their values are obtained only at particular points. For example, suppose I want to know whether, in a college, male students prefer mathematics more than female students do, or whether male students prefer biology over mathematics. In this case, what we do is simply count the number of male and female students who prefer each subject.
So, here the numbers are observations on a discrete variable. Why is it called discrete? Because the number of students can only be an integer: there can be 5 students or 6 students, but there cannot be 5.50 students. In this case we would like to see the nature of the association between gender and subject.
Similarly, if some vaccine or medicine is given to some patients, we would try to see how many patients are affected. If a significant number of patients are affected by the medicine, one can conclude that yes, the medicine and the outcome in the patients are associated, and we would like to see the nature of the association in this case of discrete variables.
So, whenever we have a discrete or counting variable, we would like to know, for example, whether male students prefer mathematics more than female students do; or, in another example, whether the vaccine given to the diseased persons was effective or not. These observations are based on counts of the two variables, and in this case the variables are discrete in nature and their values are obtained only as whole numbers.
(Refer Slide Time: 08:13)
Similarly, there can be a third situation. For example, in a viva or a fashion show, the candidate or the model appears before the interviewers or judges. In the case of a fashion show, the model comes on the stage and a group of people judge the performance and give marks. What do we expect? If a model is good, then all the judges will give high scores, and if the model is bad, then all the judges will give low scores. But in real life it is possible that whenever a model comes, a certain number of judges give higher scores and a certain number give lower scores.
Obviously, we would like to see the correlation, or the nature of the association, between the ranks given by two judges. How do we obtain the ranks? The marks given by the judges are converted into ranks, and finally we are interested in the nature of association of those ranks, not of their original values. In this case we have the concept of the rank correlation coefficient. So, we are going to consider different types of situations.
So, in the case of ranked observations, there can be a situation, for example, where two judges give ranks to a fashion model; or another example where a person has cooked food and two persons give ranks to the food they prefer, or their scores are converted to ranks. In these cases, the observations are obtained as ranks on two variables, the two judges or two persons being those variables. So, we have different types of situations, and they are described by the nature and behavior of the variables.
My objective now is to consider each type of variable, one at a time, and to show you the different measures, how to interpret them and how to compute them. In this lecture I first consider the case where the variables are continuous, for which we have the concept of the correlation coefficient; after that, when we go to ranked data, we will discuss the rank correlation coefficient.
And when we have counting variables, we will discuss the different coefficients for that case, like the contingency coefficient, the chi-square coefficient and so on. So, now we start our discussion where we have two continuous variables and we want to study the association between them.
What is the meaning of association? The association can be linear or non-linear. How do you know whether the relationship is linear or not? We have learnt that we can plot the data in a scatter diagram. If the data looks like this, we say that there is a linear trend; and if the trend is like this, as you can see here, we say that the trend is not linear, right.
Our basic framework is that we have two variables, capital X and capital Y, measured on a continuous scale, and both variables are linearly related. You have to keep in mind that we are going to talk about relationships which are linear in nature. If the relationship is linear, it can be expressed mathematically by the equation of a line, Y = a + bX, where a and b are some unknown constants. The standard form of the equation of a line is usually presented as y = mx + c.
Here a represents c, the intercept term, and b represents m, the slope of the line. So, you can see that Y and X are related and will have some degree of association. How do we study this degree of association? For that we have a tool called correlation. Correlation is a statistical tool to study the linear relationship between two variables, right.
So, I can say that two variables are said to be correlated if a change in one variable results in a corresponding change in the other variable. What does this mean? For example, take the example of marks versus time spent in studies.
We know that when students increase their study time, the marks obtained in the examination will also change. In this case the change in one variable causes a change in the other variable. Similarly, in the case of the height and weight of small children, when the height increases, usually the weight also increases, and when the weight changes, the height also changes. So, a change in weight goes along with a change in height.
This is what we are trying to say, and in these cases we can say that the two variables are correlated. Where does the word come from? It is "co-related". When we say that two variables are correlated, we are saying that a change in the value of one variable goes with a change in the value of the other.
This change can be positive or negative. In simple words, if the values of X increase and the values of Y also increase, it is a positive relationship; and if the opposite happens, that is, the values of X increase but the values of Y decrease, it is a negative relationship.
Based on this, we have the definitions of positive and negative correlation. If two variables deviate in the same direction, that is, an increase (or equivalently a decrease) in one variable results in a corresponding increase (or decrease) in the other variable, then the correlation is said to be positive, or the variables are said to be positively correlated.
What happens in this case? Suppose I make a scatter plot: if the Y value is here and I increase the value of X, then Y will also move to somewhere here; if I increase it more, it will be here, like this. So, we will have a graph like this, in which the trend line goes like this, and I can say that the observations on X and Y are positively correlated; the nature of the correlation is positive. So, in this first case, if the value of X increases the value of Y also increases; the opposite case is when the values of X increase but the values of Y decrease; and the next situation is when the values of X increase and something happens in the Y values, but the nature of the change is not clear.
The next case is when two variables deviate in opposite directions. What does this mean? As one variable increases, the other variable decreases, or vice versa; in this case the correlation is said to be negative and the variables are said to be negatively correlated. What happens here? Suppose I plot the data: I have a value of x, say $x_1$, and here is $y_1$, like this; then I increase the value of x to $x_2$, and the value $y_2$ is here, lower than $y_1$; similarly, if I take another value $x_3$, its value comes over here at $y_3$.
So, in this case you can see that as the values of x increase, the values of y decrease, and there is a negative trend, right. Similarly, if one variable changes while the other variable remains constant on average, or there is very small or no change in the other variable, then the variables are said to be independent, or to have no correlation.
For example, I take the value $x_1$ and suppose its value comes out to be $y_1$ somewhere here; then I take $x_2$ and its value comes out to be $y_2$ here; then $x_3$ with its value over here; and so on, we have some more values.
So, there is no clear-cut trend in the data. This indicates that when we change the value of x, there is practically no change in the value of y; in this case the variables are said to be independent of each other, they do not affect each other, and we say that they have no correlation, or zero correlation.
Now, let us represent these situations. Suppose we take observations on two variables x and y and make a scatter plot. In figure 1, the observations are like this, and a trend line can be fitted passing through them like this; when the values of x increase, the values of y also increase, so we say the relationship is positive and there is a positive correlation. Similarly, in figure 2, as the values of x increase, the values of y decrease; the observations go like this, and the trend line can be fitted like this.
In this case we would say that x and y are negatively correlated. And similarly, if the values of x are increasing but there is no change or no trend in the y values, or we do not know how the y values are going to behave and they remain on average constant when we change the value of x, then it is very difficult to say anything; in this case we say that x and y are independent of each other, and the correlation between x and y is zero, or x and y have no correlation.
Similarly, take two situations with a positive relationship, say figure A and figure B; the trend lines will look like this. In figure A, you can see that the points lie closer to the line than in figure B (I am assuming that the scales of both figures are the same, otherwise there can be confusion). When the points lie close to the line, we say there is a strong relationship, and here the relationship is positive; so we say there is a strong positive relationship between X and Y.
Similarly, in figure B there is a positive relationship, but it is not as strong as in figure A; we say there is a positive correlation, but it is moderate. Now we have understood the concept of correlation. Next we need to define a quantity which can measure it, and for this we have the definition of the correlation coefficient, which is based on the concepts of variance and covariance. What variance is, we have already discussed; but what is covariance?
You can recall that in the case of variance, we were measuring the variability of the observations around the mean value, say the arithmetic mean. Now, if there are two variables which affect each other, which are interrelated, then when the value of one variable changes, the value of the other variable also changes. So, there is a sort of co-variation between the two. Similar to the concept of variance, we have the concept of covariance: as variance measures the variability of a variable, covariance measures the co-variability of two variables.
Now we address the question of how to quantitatively measure this degree of linear relationship. For that we use the correlation coefficient, which is based on the concepts of covariance and variance; so first we address what covariance is.
(Refer Slide Time: 24:07)
As we have discussed, covariance is very similar to the concept of variance. When there is only one variable, the variation exists and is measured by variance; when there are two (or more) variables, their individual variations exist, that is, the two variables each have their own variance. Besides their own variances, they will also have co-variation, provided they affect each other; if they are independent, there is no co-variation.
So, the question now is how to quantify and measure it. Before going further, you may recall that we had defined the variance of a variable x as $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$. What we do here is write this as a sum over $i$ from 1 to $n$ of $(x_i-\bar{x})(x_i-\bar{x})$, and then replace one factor $(x_i-\bar{x})$ by $(y_i-\bar{y})$. This gives a quantity that measures the co-variation between x and y.
We assume that there are two variables, represented as X and Y, and we assume that these variables are related or correlated. We have obtained n pairs of observations on these two variables, expressed as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
These are numerical values, and we have already understood, while discussing the graphical techniques, how to obtain such observations. Now, the covariance between the variables x and y based on the sample observations is defined as

$$\text{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}).$$

You can see that $(x_i-\bar{x})$ are the deviations in the $x_i$'s and $(y_i-\bar{y})$ are the deviations in the $y_i$'s; we take the cross product of the deviations in x and the deviations in y, and find the average of those cross products. Here $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$ and $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$ are the sample means of x and y.
I have given you here the definition of covariance in the case of ungrouped data. Similarly, for grouped data, a similar definition can be given:

$$\text{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{k} f_i\,(x_i-\bar{x})(y_i-\bar{y}).$$
But you have to remember that here the symbols and notation have a different interpretation: now the $x_i$ and $y_i$ indicate the mid values of the class intervals, the data $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$ having been grouped into k groups with frequencies $f_i$, and so on. Anyway, I will consider here only the ungrouped case.
Now, the next question is how to compute this covariance from a given set of data in the R software. If x and y are the two data vectors, then the command to compute the covariance is cov, all in small letters, and inside the arguments we give the data vectors: cov(x, y). But you have to remember one thing: this command cov in R gives the value of the covariance with divisor n minus 1.
So, if you want the covariance between x and y with divisor n instead, you need to multiply the R result by $(n-1)/n$; that is, $\frac{n-1}{n}\,\text{cov}(x, y)$, where cov(x, y) is the R command, gives $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$.
This is the same story we discussed in the case of variance: the variance was defined in two ways, with divisor n and with divisor n minus 1, and with divisor n minus 1 it is an unbiased estimator of the population variance. The same story continues here: when we take the divisor to be n minus 1, the result is an unbiased estimator of the population covariance. Anyway, I am not going into the details of estimation and statistical inference.
But this is for your information, so that if you really want to compute a particular type of covariance, with divisor n or n minus 1, you at least know how to do it, and you also know what R is giving you; a minimal sketch follows below.
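With two illustrative vectors:

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
n <- length(x)
cov(x, y)                # R's covariance, divisor n - 1
cov(x, y) * (n - 1) / n  # rescaled to the divisor-n definition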
Anyway, I will not work a full example on the R console here; I will show that in the next lecture, when I compute the correlation coefficient. Now I come to the definition of the correlation coefficient.
So, as I said, in order to define the coefficient of correlation we need the concepts of covariance and variance. We now know that the covariance is $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$; the variance of x is $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$ and the variance of y is $\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2$, and we take the square root of their product. So, the expression for the correlation coefficient is

$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$

If you simplify further, the numerator becomes $\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$, the variance term for x becomes $\sum_{i=1}^{n} x_i^2 - n\bar{x}^2$ and the variance term for y becomes $\sum_{i=1}^{n} y_i^2 - n\bar{y}^2$, so

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}.$$

You may recall this from when we discussed the concept of variability while discussing variance. Similarly, let us see how the expression $\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$, which is involved in the definition of the covariance, arises.
We can write

$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + n\bar{x}\bar{y}.$$

Now observe that $\sum_{i=1}^{n} x_i = n\bar{x}$ and $\sum_{i=1}^{n} y_i = n\bar{y}$, so the middle two terms each become $-n\bar{x}\bar{y}$:

$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y},$$

which is the same quantity that we obtained before, right.
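One can also verify the equivalence numerically; a minimal sketch with illustrative vectors:

x <- c(2, 4, 6, 8); y <- c(1, 3, 2, 5); n <- length(x)
sum((x - mean(x)) * (y - mean(y)))  # deviation form
sum(x * y) - n * mean(x) * mean(y)  # simplified form; the same value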
(Refer Slide Time: 34:48)
Having understood the basic definition and mathematical form of the coefficient of correlation, let us try to understand what it does and what the different interpretations are. Essentially, r measures the degree of linear relationship; "linear" is the very important term here. You always have to keep in mind that the correlation coefficient can be used only to measure the degree of a linear relationship. In my experience, many people use the correlation coefficient blindly, even trying to use it to measure the degree of a non-linear relationship, which is actually wrong.
If you look at the mathematical form of the correlation coefficient, it is only a mathematical formula: whenever you give it some values of x and y, it will return some numerical value. But in the non-linear case the interpretation of that value will be wrong, and it will not indicate the information contained inside the data.
So, my humble request to all of you is: please use the correlation coefficient only to measure the degree of linear relationship. To do so, first use the scatter plot and see whether the relationship is linear or not; it can be increasing or decreasing, but the trend has to be linear, and only then should one use the correlation coefficient. A minimal sketch of this workflow follows below.
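Although the R computation comes in the next lecture, the workflow in sketch form, with illustrative vectors, is:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
plot(x, y)  # inspect the trend first: is it roughly linear?
cor(x, y)   # Pearson correlation, meaningful only for a linear trend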
This correlation coefficient is also called the Bravais Pearson correlation coefficient, and also the product moment correlation coefficient. Why is it called the Bravais Pearson correlation coefficient? Professor Karl Pearson presented the first rigorous mathematical treatment of correlation, and he acknowledged Professor Auguste Bravais, because Professor Bravais had made an initial contribution by giving a mathematical formula for the correlation coefficient. That is the reason it is sometimes called the Bravais Pearson correlation coefficient, and it is also called the product moment correlation coefficient.
Why is it called the product moment correlation coefficient? You might recall that we had learned the definition of the r-th central moment, given as $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^r$. This is the arithmetic mean of the r-th power of the deviations in the values of x, and it is valid when we have one variable. Now, with two variables, I can do the following: I consider the deviations of x and the deviations of y, take the r-th power of the deviations of the $x_i$'s and the s-th power of the deviations of the $y_i$'s, and find the arithmetic mean of the product of these deviations. This is denoted $\mu_{rs}$ and is called the (r, s)-th product moment:

$$\mu_{rs} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^r (y_i-\bar{y})^s.$$

That is why it is called the product moment correlation coefficient: if you substitute r equal to 1 and s equal to 1, this gives $\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$, which is the covariance between x and y.
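A hypothetical helper (the name mu_rs is mine, not a standard R function) makes the definition concrete:

mu_rs <- function(x, y, r, s) mean((x - mean(x))^r * (y - mean(y))^s)
x <- c(2, 4, 6, 8); y <- c(1, 3, 2, 5)
# with r = 1 and s = 1 this is the covariance with divisor n:
mu_rs(x, y, 1, 1)  # equals cov(x, y) * (length(x) - 1) / length(x)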
(Refer Slide Time: 39:12)
Now we discuss the magnitude and sign of the correlation coefficient and their interpretation. The correlation coefficient lies between minus 1 and plus 1 (I am not giving the mathematical proof here), with the values minus 1 and plus 1 inclusive, so $-1 \le r \le 1$. What is the interpretation? You can see minus 1 here, plus 1 here and zero here; these are the limits of r.
What happens if r is negative, lying on this side, and what happens if r is positive, lying on this side? When we compute the value of r from a given set of data and it comes out positive, this indicates that there is a positive association between x and y, and hence x and y are positively correlated. Similarly, if the value of r comes out negative, this indicates a negative association between the two variables, and hence x and y are negatively correlated.
Similarly, if r equals zero, that is, if you compute the correlation coefficient and it comes out to be zero (zero is a theoretical value, but even a value very close to zero counts), this indicates that there is no association between X and Y, and hence X and Y are uncorrelated.
(Refer Slide Time: 40:54)
So, we have seen that the value of r has two components: the sign and the magnitude. The sign of the correlation coefficient indicates the nature of the association, that is, whether the relationship has an increasing or a decreasing trend. A positive sign of r indicates a positive correlation: as the values of one variable increase, the values of the other variable also increase; and similarly, if the values of one variable decrease, the values of the other also decrease.
So, a positive r gives us the information that the relationship is positive, and the degree of
linear relationship is the magnitude of r. Similarly, if we consider a negative sign
of r, then the negative sign indicates a negative correlation: as the
values of one of the variables increase, the values of the other variable decrease, so the
relationship is opposite; and similarly, if the values of one variable decrease, then the
values of the other variable increase.
(Refer Slide Time: 42:33)
And what about the magnitude of r? The magnitude of r indicates the degree of linear
relationship. We have seen that r lies between minus 1 and plus 1, so there are
two extremes, minus 1 and plus 1, and the value zero in the middle. When we say r
is equal to 1, this indicates a perfect linear relationship; what this means is that if I
plot the scatter plot between x and y, then all the points are lying exactly on the same
straight line.
So, if we try to make here a line on this graphic, it will look like this. There is a 100
percent perfect relationship and all the values are lying exactly on the line. Similarly, if I
say r equal to 0, this will indicate that there is no correlation, there is zero correlation, and
in case I take any other value of r between 0 and 1 in magnitude, that will indicate the degree of
linear relationship: the higher the magnitude of r, the higher the degree of linear
relationship, and this relationship can be positive or negative.
So, when r is equal to plus 1, this will indicate a perfect linear and increasing
relationship between x and y, and when we say r equal to minus 1, then this will indicate
a perfect linear and decreasing relationship between X and Y. For example, if you take
here x and here y, then in the case of a decreasing relationship the plot will look like
this, and if you try to make a trend line in this case, it will be a perfect straight line.
So, this is the case of a decreasing relationship, and the one above was the case of an
increasing relationship, right.
(Refer Slide Time: 44:32)
And now I will simply try to show you these things graphically, what this actually
means. For example, now what you have to do, you simply have to keep your
concentration on my pen, right. First you try to look over here at this figure; the
horizontal axis is indicating the values of X and the vertical axis the values of Y
inside all these figures.
So, we can see here in this picture that as the values of X are increasing, the values
of Y are decreasing, like this. And in case you try to create here a trend line,
this trend line will be something like this, and you can see here that all the points are not
lying exactly on the same line.
So, in this case the sign of the correlation coefficient will be negative, and definitely this
magnitude is not 1, but most of the points are lying very close to the line, so the correlation
coefficient can be close to minus 0.90. And similarly, in the next figure, in this one, you can see
here that when we increase the value of X, the values of Y are decreasing, and
in this case if you try to create the trend line, it will be like this. So, now in case you
try to compare this picture, most of the points are lying close to the line, but these
points are not as close as in figure number 1 here.
So, that is why if you try to see the value of r here, this is minus 0.50: the negative sign
is indicating that the relationship is decreasing, and the magnitude here is 0.50; well, it is
not exactly 0.50, but it is very close to half, and this is lower than the value of 0.90 as in
the earlier case. Now in the third case, you can see here that there is no clear relationship:
as the values of X increase, the values of Y change without any pattern, and there is no
relationship in this case. So, this is the case of zero correlation or no correlation, and this
is represented here by r equal to 0.00. So, the first two cases in the above panel are
indicating a negative relationship.
Now, we try to consider the lower panel, where we have the increasing trend; you
can see here in all the panels there is an increasing trend in the data. So, if you try to see
here in the first picture, as the values of X are increasing, the values of Y are
increasing. And if you try to make here a trend line, you can see here it is like this, but
definitely the points are not so close to the line, so the value of r is plus 0.50 here.
So, the sign here is positive, indicating the increasing relationship, and 0.5 is the
magnitude of r, which is indicating that well, there is a linear relationship, but obviously
all the points are not lying exactly on the line. And similarly, if I try to increase the value
of the correlation coefficient, something like here r is equal to 0.90, the sign is positive in
this picture and you can see here that as the values of X are increasing, the values of Y
are also increasing, and the trend line here will be like this. If you try to compare the first
two pictures over here, this picture and this picture, you can see here that in this region
the points are lying closer to the line in comparison to the points in the first picture,
like this.
So, that is why this difference is indicated by the magnitude of the correlation coefficient,
which goes from 0.50 to 0.90. Finally, in the last picture here you can see that as the values
of X are increasing, the values of Y are also increasing, and all the points are lying exactly
on the same line. So, this is the case of a perfect increasing linear relationship, and this is
indicated by the value of r equal to plus 1.
So, this is how we try to get the information about the magnitude and direction of the
relationship by looking at the scatter diagram. There are six diagrams I have
represented here, and they will give you a fairly good idea of how these things are
done. Now I will try to address a very important and very interesting
observation.
(Refer Slide Time: 49:40)
You see, whenever the value of r is close to zero, or say r equal to zero, this may indicate
two things; there can be two types of interpretation: either the variables are independent,
or the relationship is non-linear, because the correlation coefficient is only
indicating the degree of relationship when it is linear.
So, now what happens if the relationship between X and Y is non-linear? In this case the
degree of linear relationship computed by the correlation coefficient may be low, and so
the value of r is close to zero, and this will be indicating as if the variables are
independent, but that is not correct, because there exists a non-linear relationship. So, in
this case r is close to zero even if the variables are clearly not independent. For example,
if there is a trend like this one, you can see here that the relationship is very, very
clear, there is a sort of sine curve, but the correlation coefficient in this case will give you
a value close to zero.
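Here is a tiny demonstration of this pitfall in R; I am using a parabola in place of the sine curve on the slide, and the data values are my own:

# a clearly dependent but non-linear relationship gives r near zero
x <- -10:10
y <- x^2     # y is completely determined by x, yet not linearly
cor(x, y)    # essentially 0 (up to rounding), despite the strong dependence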
So, be careful with these types of interpretation and remember that when X and Y are
independent, then the correlation coefficient between X and Y will be equal to 0, but
the converse is not true. So, this is a very important point for you to keep
in mind: what is the meaning of r equal to zero?
(Refer Slide Time: 51:24)
Similarly, one very nice property that the correlation coefficient has is that this quantity is
independent of the units of measurement of X and Y. So, what is the advantage? Suppose
one person measures the height in meters and weight in kilograms and finds the
correlation coefficient, say r1. Now there is another person who measures the height and
weight of the same set of people, but he measures the height in centimeters and weight in
grams, and he finds the correlation coefficient, say r2; then in this case both r1 and r2
are going to be the same, they will have the identical value, and that is a very nice
advantage of using the correlation coefficient.
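A quick check of this property in R, with illustrative height and weight values of my own:

# correlation is unaffected by the units of measurement
height_m  <- c(1.52, 1.60, 1.68, 1.75, 1.83)   # metres
weight_kg <- c(48, 55, 62, 70, 79)             # kilograms
r1 <- cor(height_m, weight_kg)
r2 <- cor(height_m * 100, weight_kg * 1000)    # centimetres and grams
c(r1, r2)   # identical values (up to floating point)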
Now I will stop here; that was a pretty long lecture, and my objective
was to give you the information and the development of the concept of the correlation
coefficient. In the next lecture I will show you how to compute this in the R software and
how to interpret it. In the meantime, please try to read from other books and try to
develop the concept of the correlation coefficient in more depth, and I will see you in the
next lecture; till then, good bye.
Descriptive Statistics with R Software
Prof. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology, Kanpur
Lecture- 29
Association of Variables - Correlation Coefficient using R Software
Welcome to the next lecture on the course Descriptive Statistics with R software. You
may recall that in the earlier lecture, we started a discussion on the concept of association
between two continuous variables and we learned about the correlation coefficient. And
we have also understood how to interpret the values of the correlation coefficient
with respect to the magnitude and direction of the association.
Now, in this lecture I am going to demonstrate how you compute the
value of the correlation coefficient using the R software, how you
implement it and how you use it when you get a data set. So, first, just a quick
review of what we had done earlier.
So, you may recall that we had discussed the concept of covariance between two
variables X and Y, for which we had obtained the n pairs of observations (x1, y1),
(x2, y2), ..., (xn, yn); based on that we have computed the covariance between x and
y like this, and this was for the ungrouped data; similar definitions follow for the
grouped data also.
And we also had discussed that if I have two data vectors in the R software, which I denote
as x and y, then the command cov gives the covariance between x and y. So, you write cov
and inside the argument you write the data vectors, and this will give us the value of the covariance.
But remember one thing: this covariance command in the R software is
going to give you the value of the covariance in which the divisor is n minus 1,
whereas we had defined the covariance as

$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}),$

which has the divisor n.
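So, in case you want the divisor-n value, you can rescale the output of cov(); a minimal sketch with illustrative data vectors of my own:

# R's cov() uses the divisor n - 1; rescale to match the divisor-n definition
x <- c(1, 2, 3, 4); y <- c(2, 4, 5, 9)   # illustrative data
n <- length(x)
cov(x, y)                 # divisor n - 1
cov(x, y) * (n - 1) / n   # divisor n, matching the definition above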
So, after this we had defined the correlation coefficient; this was defined as r, and
finally we had obtained the expressions like this, a simplified version of this
correlation coefficient, and after that we had understood how to interpret the values
and the sign.
Now, the next question is how to compute this correlation coefficient inside the R
software. So, if I have the same setup, that is, two data vectors x and y, then
the correlation coefficient between x and y is computed by the command cor, and inside
the argument we have to write the data vectors. So, cor with arguments x and y will
compute the correlation coefficient between the data vectors x and y.
And when you try to look into the help of the c o r function, then there are several
options, and I am going to detail here some important things which we are actually going
to use. You will see here the function c o r, and inside the argument I am writing several
options.
So, now we try to understand them one by one. This x and y, we know, denote the data
vectors. Now, there is here another option, use, and I have written here inside the double
quotes "everything".
(Refer Slide Time: 03:57)
Now, what does this "everything" mean, and what is the use of this option u s e?
Actually this use is an optional character argument in the c o r function which gives a
method for computing the covariance or correlation in the presence of missing values. If
you remember, earlier we had used the option na.rm; this option has a similar
utility.
So, in this case we have several options to give here: say "everything" when you want to
use all the observations, or "all.obs", "complete.obs", "na.or.complete" or
"pairwise.complete.obs" and so on, because there are different situations in which one
would like to compute this coefficient of correlation.
So, we are simply going to use here the option "everything", where I want to compute
the correlation based on all the data, right; the remaining details you can look into the help
menu. After this there is another option, which now I am denoting in blue
color so that you can see it clearly: this is method, and inside the c() command in the
argument I am writing here three options: "pearson", "kendall" and "spearman".
Actually there are several types of correlation coefficients. Up to now we have
studied r, and if you remember I had told you that this r is also called the Karl
Pearson coefficient of correlation, and this is how this "pearson" is coming here. Another
correlation coefficient is the rank correlation coefficient, which we will discuss
in the next lecture, and this rank correlation coefficient is also called Spearman's
correlation coefficient or Spearman's rank correlation coefficient.
So, this option here, which now I am highlighting in red color, "spearman" inside the double
quotes, is used when we want to compute the rank correlation coefficient.
And similarly there is another form of the correlation coefficient, due to Kendall, which
is given by the option "kendall" and which we are not using at the moment.
So, essentially we are going to use here the option "pearson", for the correlation coefficient
that we defined earlier, which is actually computed as

$r = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}.$
And this method option is described in the next slide: method gives a
character string indicating which correlation coefficient (or covariance) is to be
computed, say "pearson", "kendall" or "spearman"; this can be abbreviated inside
the brackets, and the default will be "pearson".
Now, I try to illustrate this first with an example. If you try to see here, I have taken
two data vectors, 1, 2, 3, 4 and 1, 2, 3, 4, the same values, and I am trying to compute the
covariance between them. Once I do it, this value comes out to be 1.66 and so
on, and here you can see the slide. And now what I do in the next example, which I am
denoting here in purple color: I take the same data set c(1, 2, 3, 4), but
in the next data set I change the signs in the second data vector.
So, now if you try to see what happens here, the value of the covariance comes out to be
-1.66 and so on. So, you can see here that these two values, which I am now highlighting
in red color, this and this, have the same magnitude, but only the sign is
opposite, and here is the screenshot. What is the meaning of this?
Now, if you try to look into this slide with the definition of the correlation coefficient, we
had understood that r can take a positive value and r can take a negative value, but if you
try to see the denominator, each of its factors will always be greater than zero. So, the
covariance between x and y is the only quantity which can be greater than zero or smaller
than zero.
So, the direction of the correlation coefficient is determined by the covariance; I can
say that the sign of the correlation coefficient is determined by the covariance. And this is
what I am trying to show you here: if the data points in the first case and in
the second case have got opposite signs, then this is reflected by this negative
sign here, right.
And if you try to plot these data points, how will they look? For example, now I
try to find out the correlation coefficient, by which I will also show you the direction.
So, when I find the correlation coefficient between the two data vectors 1, 2, 3,
4 and 1, 2, 3, 4, which are the same, then this correlation coefficient comes out to be one,
as you can see.
And you see, this is also indicated in the scatter plot which is given here: these are
the points 1, 2, 3, 4 and they are lying exactly on the straight line. So, this is a case where
we have exact positive linear dependence, and this is the screenshot of the same
operation; I will show it to you on the R software also.
And similarly, if I take the data vectors with exactly opposite signs, say the first data
vector with all positive values and the second data vector minus 1, minus 2, minus 3,
minus 4 having the opposite signs, and if I find the correlation coefficient
between the two, this comes out to be minus 1.
And you can see here this is indicated in the scatter diagram: here are the four
points and this relationship is decreasing, and this is the case of exact negative linear
dependence between the two variables, and here is the screenshot of both the
things.
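The console commands behind these slides can be collected in one place; this is just a transcription of what was described above:

# covariance and correlation of the two pairs of data vectors
cov(c(1, 2, 3, 4), c(1, 2, 3, 4))        #  1.666667 (divisor n - 1)
cov(c(1, 2, 3, 4), c(-1, -2, -3, -4))    # -1.666667 (same magnitude, opposite sign)
cor(c(1, 2, 3, 4), c(1, 2, 3, 4))        #  1, exact positive linear dependence
cor(c(1, 2, 3, 4), c(-1, -2, -3, -4))    # -1, exact negative linear dependence
plot(c(1, 2, 3, 4), c(-1, -2, -3, -4))   # scatter plot showing the decreasing trend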
(Refer Slide Time: 11:17)
Now, before taking an example, let me show you these things on the R console
also. So, you try to see here: I find the covariance between 1, 2, 3, 4 and
minus 1, minus 2, minus 3 and minus 4; you can see here this comes out to be minus
1.6667, and if I remove the signs, so that both the data vectors have got the
same values, then in this case the magnitude remains the same, but the sign becomes
positive.
So, in this case you can see here the covariance is positive. Now, if I find
the correlation in the same case where the covariance is positive, you can see here
this is obtained by the function c o r, and it comes out to be 1, so the positive sign is
maintained here. Similarly, if I take the data set with negative signs and find
the correlation coefficient between 1, 2, 3, 4 and minus 1, minus 2, minus 3,
minus 4, you can see here this comes out to be minus 1, right.
Now, if I make here the scatter plots, I will show you both the things
here. So, you can see here, I can make the scatter plot of 1, 2, 3, 4 and minus 1,
minus 2, minus 3, minus 4; you can see here it comes out like this, where the direction of
my cursor is indicating the negative relationship.
And similarly, if I make it here positive, that means both the data vectors are
positive, 1, 2, 3, 4 and 1, 2, 3, 4, in this case you can see here that this scatter diagram is
showing an exact positive relationship. Now, I take an example to show you
how the interpretation of the correlation coefficient comes into the picture. Suppose I
take here a data set where I have obtained the marks and the number of hours of study of
20 students; this is the same example which I have considered in the earlier lectures
while making different types of graphics or, say, bivariate plots.
So, we have recorded the marks in the first row of both the tables and the number of
hours per week they studied in the second row, right; this is the same example which we have
considered several times, and this data has been stored inside two variables, marks and
hours, right.
Now, the first thing that comes to our mind is that now we have got the data, and before
using the concept of the correlation coefficient, we would like to see what the type of
relationship is, whether it is linear or not. So, we simply use the command plot and we
plot here the marks versus hours.
So, you can see here how the data is lying: on one axis here are your marks and
on the other axis your hours, right, and you can see here that there is a linear trend. So,
this gives us a sort of assurance that the relationship between marks and hours is quite
close to linear; here I have made the trend line by hand.
And now we have also learned how to make this trend line, so for that I will use the same
command that we discussed earlier, scatter dot smooth, and I will make the line
between the two data vectors marks and hours. And you can see here that this line is now
also indicating that the relationship is nearly linear, and this gives us the confidence that, ok, in this case
I can use the correlation coefficient.
(Refer Slide Time: 15:29)
And then I find the correlation coefficient between marks and hours; you
can see here this comes out to be 0.96. So, what is this 0.96 indicating? If
you look into this curve, you can see here that these deviations, which I am
plotting in orange color, are very small, and these points are lying very close to the
trend line.
So, this is indicating that on average the degree of linear relationship
between marks and hours is very strong, and this is nearly 96.79 percent. Similarly, if you
interchange the variables, so earlier I had taken marks and hours and now I
take hours and marks, you can see here that the value of the correlation coefficient
remains the same; this is what we discussed in the last lecture also, right.
So, in this case you can see here the sign of the correlation coefficient is positive, and this
is indicating that the relationship between x and y is positive, which you can see
here. And we can conclude that as the number of hours of study per week increases,
the marks obtained by the students also increase, because there is a positive
relationship and the value of the correlation coefficient is pretty high. The data is also
showing a linear trend, so the conclusions we are drawing from the data are sound.
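The commands used in this example can be summarized as follows; this sketch assumes the data vectors marks and hours from the earlier lectures are already in the workspace:

# assuming marks and hours (20 observations each) are already stored
plot(marks, hours)             # scatter plot, to check for a linear trend
scatter.smooth(marks, hours)   # the same plot with a smoothed trend line added
cov(marks, hours)              # positive value: reported in the lecture as 365.5368
cor(marks, hours)              # reported in the lecture as about 0.9679
cor(hours, marks)              # the same value: cor() is symmetric in its arguments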
(Refer Slide Time: 17:08)
And this is here the screenshot; I will show it to you on the R console also before I
take one more example, a negative one. I already have
stored the data, so you can see here this is the data for marks and this is the data for
hours.
So, if I first plot marks versus hours, or say hours versus marks, whatever you want,
this gives you, as you can see here, a nice scatter plot indicating the linear trend, and if
you try to make here a trend line also, give scatter dot smooth, and you can see here
that a line is also plotted, right.
And now I find the covariance between marks and hours, which will indicate the
direction. So, you can see here that the value of the covariance is 365.5368 and it is
positive. So, it is matching with our graphical conclusion that the relationship is
positive.
Now, I find the correlation coefficient between marks and hours, and
you can see here this is 0.96, right. So, you can see here that this correlation coefficient is
taking care of the variances as well as the covariance of the two variables.
(Refer Slide Time: 18:41)
Now, I will take one more example and try to illustrate something more. Suppose
there are 10 patients, and those patients are given some medicine; the quantity of the
medicine is measured in milligrams (mg), and the time, say in hours, that the medicine
took to start showing its effect is recorded, and this data is compiled here,
say quantity of medicine and time in hours. So, this is indicated here: this is
the data of person number 1, or patient number 1.
So, this is indicating that 30 mg of medicine was given and it took 4 hours of time.
Similarly, the second data set is for patient number 2: 45 mg of medicine
was given and it took 3.6 hours of time, and so on. So, the data on the quantity is stored
inside a variable quantity, and the time is stored in effect dot time. One thing:
please do not use the variable name time, because time is used by the R software also, so be
careful. So, now I will take this data set and first make a plot here.
(Refer Slide Time: 20:09)
So, you can see here, this is the screenshot: on the x axis is the quantity and on the y axis
is the time, that is, effect dot time, and you can now see here how these
observations are coming out. You can see here that as the values of x
are increasing, the values of y are decreasing.
So, this shows here a sort of negative trend, right, and this information we would like to
verify with the covariance function or the correlation function; but now let me plot here
the trend line also.
So, you can see here that the trend line is also indicating that the relationship is almost
linear, and we can safely use the concept of the correlation coefficient, right; so this is the
outcome. And now I find the correlation between quantity and effect dot time;
this comes out to be minus 0.9885454, and well, you can control the number of digits
also. Now if you try to see, this value is negative, so the negative sign
is indicating that the relationship is decreasing: as the values of x are increasing, the
values of y are decreasing, and the magnitude here is 0.9885454, close to 0.99.
So, close to 0.99 means the relationship is nice and linear, and the degree of
linear relationship is pretty high; the maximum value is 1. In the case of 1, all the
observations would be lying exactly on the line, whereas in this case
this is just 0.988, so the points are very close to the line. So, now this gives us the
information and the confidence that we can use here the concept of the correlation
coefficient, and this degree is coming out to be 0.988.
So, now I can conclude that as the quantity of medicine increases, the number of
hours to take effect decreases. I will show you this on the R console
also, and this is the screenshot of what I have done, ok. So, let me first copy this data;
the earlier data was there because I had used it earlier.
(Refer Slide Time: 22:39)
So, this here is quantity; first let me clear all the things, clear the screen by control l, then
get this quantity, and then I copy the effect dot time variable and the data contained
in it. So, this here is effect dot time; you can see here quantity is obtained here and effect
dot time is obtained here. Now, I would like to know what the nature of the relationship is,
so I make here a plot between quantity and effect dot time.
And you can see here this is the plot, the same plot that we have just obtained in
the slides, and now if you want to make here a smooth scatter plot by adding a trend
line, I have to use the command scatter dot smooth, and this gives me here this
graph with the same points, but with a trend line. And now first I find here
the covariance between the two, because the covariance will assure us of the sign of
the direction, whether this is positive or negative. So, you see, this covariance comes out
to be minus 22.625.
So, this minus sign is indicating here that the relationship is negative, and this is verified
in this graph also; if you observe the direction of my cursor on the screen, this
is decreasing, right. Now, I find the correlation coefficient between quantity
and effect dot time, and this comes out to be minus 0.988. So, once again this sign is
coming from the covariance, and it is indicating that quantity and effect dot time are
negatively correlated: as the quantity increases, the time taken to take effect decreases,
right.
And this is quite obvious also; we know from our experience that when we increase
the dose of the medicine, then it acts faster and the time to react becomes smaller and
smaller, well, up to a certain extent; beyond a certain limit that is not advisable, and
you have to depend on the doctor's advice, right. So, in this lecture I have shown
you how to take a decision on whether you want to compute the
correlation coefficient or not.
So, the steps are: first, find the scatter plot or the smooth scatter plot with the
trend line and look at the trend. In case you are convinced that, yes, there can be a linear
trend, or the relationship between the two variables is approximately linear, whether
positive or negative, then you decide that, well, in this case I can use the concept of the
correlation coefficient to measure the degree of linear relationship, and then you use
the formula for the coefficient of correlation.
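These steps can be put together in a short sketch; only the first two pairs (30 mg, 4 hours) and (45 mg, 3.6 hours) are from the lecture, the remaining eight values being illustrative stand-ins, since the full data set is not reproduced here:

# quantity of medicine (mg) and time to take effect (hours)
quantity    <- c(30, 45, 50, 60, 70, 80, 90, 100, 110, 120)
effect.time <- c(4, 3.6, 3.5, 3.2, 3.0, 2.8, 2.6, 2.4, 2.3, 2.1)

scatter.smooth(quantity, effect.time)   # step 1: check that the trend is roughly linear
cov(quantity, effect.time)              # step 2: the sign gives the direction (negative here)
cor(quantity, effect.time)              # step 3: the magnitude gives the degree of linearity
# the lecture's full data set gives a correlation close to -0.9885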
So, I stop here, and I would request that you please take some more data sets,
plot them, compute the values of the correlation coefficient and see
what you get. So, you practice, and I will take up the topic of rank correlation in the next
lecture, and then I will see you in the next lecture.
Lecture – 30
Association of Variables – Rank Correlation Coefficient
Welcome to the lecture on the course Descriptive Statistics with R software. You may recall that in
the last two lectures, we discussed the concept of the correlation coefficient and we learned
how to compute it in the R software. You may also recall that when we started the discussion on
the topic of association of two variables, we had discussed three possible situations where we
would like to measure the degree of association. First, when the variables are continuous
and the data is collected on some continuous scale. The second situation is where the data is
collected as ranks; that means the observations are obtained and converted into ranks,
and then we need to find the correlation coefficient. And the third one is where the data is
categorical, where you count the number or the frequency. So, in this lecture we are going
to consider the second case, where the data is obtained as the ranks of the observations.
Now, the first question is where such situations can occur. Although I explained this in the
earlier lecture, I will take a quick review. Suppose there is some fashion show going
on, a model comes on the stage, and there are two persons who are looking
at the performance and giving some scores. Now, what do you expect? You expect that
in case the performance is good, then both the judges should give some higher scores, and in
case the performance is bad, then both the judges should give lower scores.
Now, in practice it is very difficult for the judges to give exactly the same scores; suppose
the scores are to be given between zero and hundred, it is very difficult for them to
give the same scores. If the person is good, they may give a score of say 80, or say 85. So, now
the question is this: how do we measure the association between the opinions expressed through
those marks by the two judges? What we can do is rank the scores of the different
candidates. Suppose there are ten candidates. So, those ten candidates are judged by the two
judges. Judge one has given his scores to the ten candidates, and judge two also has given
his scores to the ten candidates. Now, instead of using the scores, we will find the
ranks: which candidate got the highest rank, which candidate got the second
highest rank and which candidate got the lowest rank, and these ranks will be calculated for
both the judges; and then we would like to find the association, and the direction of the
association, between the ranks given by those two judges, and this can be achieved by using the
concept of the rank correlation coefficient. What do we expect on the basis of what we have learnt in
the case of the correlation coefficient? If both judges give similar scores or have a
similar opinion, that is, if a person is good then he or she is good for both, then the
correlation should be positive, and it should be quite strong; and in case there is an opposite
opinion, that means the judges feel just the opposite, one judge says good and the other
judge says bad, then in that case we expect the correlation to be negative. And how
strong or how weak it is depends on the magnitude of the correlation
coefficient. So, you will see that similar concepts will also be there in the case of the rank
correlation coefficient.
So, now we start our discussion, and we will simply assume here that we have two
variables X and Y and observations on X and Y are available, right, and after this, whatever
the observations are, they are ranked with respect to X and Y, and the ranks of those observations
are recorded. What does this mean? In the example which I have given, suppose
judge one has given the scores, say, 90, 20, 60 and 35. So,
this is observation number one, observation number two, observation number three and
observation number four; these are essentially the values of the xi's. So, this is my x1,
this is x2, this is x3 and this is x4. Now, I will find the ranks. If you order
these observations, you can see here that the smallest observation is 20,
and after this we have 35, then we have 60 and then we have 90. So, 90 is the largest
observation, right! So, if you give the ranks, the smallest observation gets a
rank equal to 1, and since there are 4 observations, the largest observation gets rank
4, the second largest observation, which is here 60, gets rank 3, and the third
largest observation, which is 35, gets here the rank 2. So, you can see here, 1, 2, 3 and 4:
these are the ranks given to these observations. Now, if I write down the ranks: what
is the rank of x1, which has the value 90? Its rank is 4, which is written in red
color. The second observation is x2 equal to 20; what is the rank of this observation 20?
It is 1, you can see here, and this I can write down here as 1; similarly, the 4 was
coming from here. Next, if you try to see x3, the value of x3 is 60, and
its rank is 3, so this comes over here and I write here 3; and similarly x4
is equal to 35, and 35 has got the rank 2, so this comes here and we write 2.
So, now you can see here that whatever scores were given by the judge, they have been
converted into ranks; similarly there will be a second judge, and we will convert the
scores of judge two also into ranks, and then we will find the correlation
coefficient between the two sets of ranks. Now, in case you think mathematically about how to obtain
the value of the correlation coefficient in this case, you may recall that for the Pearson
correlation coefficient, we had taken the xi and yi to be values on a continuous scale; now
the values here are integers, because they are the ranks 1, 2, 3, 4 and so on. Now, in case
you choose x1, x2, ..., xn in the Pearson correlation coefficient to be 1, 2, 3, 4 up to n,
and similarly y1, y2, ..., yn also to be the integers 1 to n, then just computing the Pearson
correlation coefficient with these two sets of values 1 to n and 1 to n, you
will get the value of the rank correlation coefficient, and this is the idea of how the expression for the
rank correlation coefficient has been obtained. Well, I am not going to give you here the derivation,
but definitely I think that you should know it; this correlation coefficient
was given by Spearman, and that is why it is also called Spearman's rank correlation
coefficient.
(Refer Slide Time: 11:12)
So, now let us continue with our discussion and try to understand how we
are going to do it. So, we discuss how to compute the Spearman's rank correlation
coefficient. Suppose there are n candidates who participated in a talent competition,
and suppose there are two judges, indicated by X and Y; both these
judges judge the performance of the n candidates and give ranks to every
participant, or they give scores which are finally converted into ranks. So, we have here
a situation like this: one judge, here X, gives the scores to the candidates.
So, you must know how this data will look: there
are 1, 2 up to n candidates, judge X gives them the scores x1, x2, ..., xn, and then these
scores are converted into ranks. These ranks are going to be indicated by the numbers
1, 2 up to n; and similarly there is another judge, judge Y, and this judge
Y also gives the scores y1, y2, ..., yn, and these scores are once again converted into
ranks, say 1, 2 up to n; and now we are going to find the correlation coefficient between
the two sets of ranks 1 to n. Definitely, these numbers 1 to n will not occur in the same
sequence. For example, suppose judge X says that in his opinion the best candidate is
candidate 2, so he has given candidate 2 the rank n, whereas suppose judge Y has given
the same candidate the rank, say, 3, because he thinks this candidate is only in the third
place. So, you can see here that the person is the same, the same person, but he has been
given two different ranks, n and 3, and this is what we want to find out: whether the ranks
given by the two judges are similar, or different, or very different. So, that is the objective
of introducing this correlation coefficient.
So, now if you see what I am going to do here: judge X gives ranks to the n candidates, and
suppose the rank one is given to the worst candidate, that is, to the candidate who scored
the lowest score; and similarly, whosoever got the second lowest score out of x1, x2, ..., xn
has been given the rank 2, and similarly, whosoever is the best candidate, the one who got
the highest score, has been given the rank n. And similarly, judge Y also has given ranks
to the same n candidates; remember, the candidates are the same, and whatever scores he
has given, based on them he has converted the scores into ranks 1 to n, based on the scores
y1, y2, ..., yn, exactly in the same way as judge X has done it. Now, we understand that
every participant has got two ranks, given by two different judges. What do we expect?
We expect both the judges to give higher ranks to the good candidates and lower ranks to
the bad candidates. Now, our objective is this: we want to measure the degree of
association between the two different judgments through the ranks, and we would like to
find a measure of the degree of association for these two different sets of ranks.
Now, the question is how to do it and what it will indicate. So, in order to measure the
degree of agreement between the ranks of two judges, or two data sets in general, we use the
Spearman rank correlation coefficient. One thing you have to notice and always keep in
mind: we are not using here the original observations, but only their ranks.
So, now let me show you how we do it. First we define the rank of
an observation: the rank of xi is denoted as rank(xi), that is, rank with xi inside the
argument. How is it obtained? All the observations x1, x2, ..., xn are ordered,
and then whatever the numerical position of xi is, that is recorded as its rank, right!
So, as I have shown you here, this is how these ranks have been
obtained, right!
So, exactly in the same way these ranks have been obtained for the data in X, and similarly these
ranks have been obtained for the data in Y, and the ranks in Y are denoted by the simple statement
rank(yi). Now, what we do is the following: we consider the ranks of xi and yi and find
their difference. If you try to see what I am doing, I take the ith person, I see what
rank is given by judge X and what rank is given by judge Y, and whatever their difference is,
I compute it and denote it by di, so di = rank(xi) - rank(yi). Now, the expression for
the Spearman's rank correlation coefficient, denoted as capital R, is given by

$R = 1 - \dfrac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$

and this R is called the Spearman rank correlation coefficient, and it lies between minus 1 and plus 1.
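This formula can be transcribed directly into R; a minimal sketch, assuming there are no tied scores (with ties, R's built-in computation applies a tie correction):

# Spearman's rank correlation from the formula above, assuming no ties
spearman_R <- function(x, y) {
  d <- rank(x) - rank(y)              # differences of ranks, d_i
  n <- length(x)
  1 - 6 * sum(d^2) / (n * (n^2 - 1))  # R = 1 - 6 * sum(d_i^2) / (n(n^2 - 1))
}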
One thing which I would like to address here: it does not matter whether the observations
have been arranged in ascending or descending order when their ranks are computed. What I
am trying to say is that if rank 1 is given to the worst candidate or to the best candidate,
it will not change the rank correlation coefficient, provided both the judges have used the
same criterion: if judge X has given the lowest rank to the worst candidate and judge Y
also has given the lowest rank to the worst candidate, and judge X has given the highest
rank to the best candidate and the same has been followed by judge Y, that he also has
given the maximum rank to the best candidate, then in both cases, if you compute the rank
correlation coefficient, its values will come out to be the same; the only thing is that the
ordering has to be the same in both cases, either ascending or descending. So, you give the
ranks 1, 2, 3, 4 up to n, or you give the ranks n, n minus 1, n minus 2 down to 1, right!
Okay? Now, what is
the interpretation of the values of R? We have seen that this R will lie between minus 1 and
plus 1. So, when I say that R is equal to plus 1, this means that the same ranks have been
given to all the candidates by the two judges, exactly the same ranks: there are n candidates,
and whatever rank has been given by judge X to a particular candidate, the same rank is given
by judge Y, and this is true for all the candidates. So, in this case both the judges have
exactly the same opinion, and that is reflected by R equal to plus 1. Here the plus is
indicating the direction, that both the judges are giving opinions in the same direction,
and the magnitude 1 is indicating the perfect relationship, as in the case of the simple
correlation or Pearson correlation coefficient. Similarly, in case the two judges give just
the opposite ranks to all the candidates, that is the other extreme; that means whosoever
is the best candidate in the opinion of judge X is the worst candidate in the opinion of
judge Y, and in this case the value of the rank correlation coefficient will come out to be
minus 1. So, this is once again a case of perfect relationship, and this relationship is
negative; the negative sign is indicating the direction. So, R equal to plus 1 means a
perfect positive relation, and R equal to minus 1 means a perfect negative relation.
And now, in case you choose any other value of R between minus 1 and plus 1, we will have
interpretations similar to those of the Pearson correlation coefficient. For example, if the
rank correlation coefficient is, suppose, 0.95, then I would say that, okay, both the judges
are giving more or less a similar opinion; but if the rank correlation coefficient is,
suppose, 0.02, then I would say, well, the opinions given by the two judges are not
dependent on each other; and definitely, if the correlation coefficient is, suppose, minus
0.9, then I would say that the two judges are giving nearly opposite opinions: whatever
judge X thinks is good, the other judge thinks just the opposite, that it is bad, and vice
versa. Now, let me take here an example, and I will show you
how to compute the rank correlation coefficient by hand, and then I will show you how to
compute it with the R software. It is important for me to show you these small calculations,
because many times people try to compute the rank correlation coefficient not on the basis
of the ranks, but simply on the basis of the observed values; here you have to be careful
when you implement the concept of the rank correlation coefficient. Whenever you are given
data, see whether you have been given the original scores or the ranks. If you are given
the original scores, then first you need to convert them into ranks and then compute the
value of the rank correlation coefficient; but in case you have been given the ranks, then
you can use them directly and compute the value of the correlation coefficient, right!
So, in this example, I am considering the scores, and we will convert them into
ranks; I am taking a very simple and small example, so that I can show you the calculation.
Suppose there are 5 candidates and they have been judged by two judges. Judge one has
given a score of 75 to candidate number 1, 25 to candidate number 2, 35 to candidate
number 3, 95 to candidate number 4 and 50 to candidate number 5, and similarly judge two
has given 70 to candidate number 1, 80 to candidate number 2, 60 to candidate number 3, a
score of 30 to candidate number 4 and a score of 40 to candidate number 5. So, you can see
here, we do not have the ranks, so first we need to find the ranks. So, first let me find
here the ranks in the case of judge one: what is the minimum value among all these values
75, 25, 35, 95 and 50? You can see here, this value is 25. So, now I decide that my
ordering will be that we give rank equal to one to the candidate having the minimum score.
So, this candidate who has got the score 25 gets the rank 1. Now, once again I find the
minimum among the remaining values, 75, 35, 95 and 50; this comes out to be 35, so rank 2
has to be given to the candidate whose score is 35, and you can see here, this is the
candidate who has been given the rank 2. Similarly, I compute the minimum value out of the
remaining values, 75, 95 and 50; this comes out to be 50, and then I give rank 3 to the
candidate whose score was 50, as you can see here. Now, once again I find the minimum of
the remaining values, 75 and 95, right!, and this comes out to be 75, so I give the rank 4
to the candidate who got 75 marks; you can see here, this candidate has been given the
rank 4. And now the value which is left is the maximum value, and the maximum value here
is 95, so this gets the maximum rank 5. The
same operation is done on the scores given by judge 2. So, you can see here, out of the
values 70, 80, 60, 30 and 40, the smallest value is 30, so it has been given the rank 1.
After this, the second smallest value is 40, so this candidate has been given the rank 2.
Then the third smallest value is 60, so this candidate has been given the rank 3; similarly
the fourth smallest value is 70, and this candidate has been given the rank 4; and finally
the maximum score is 80, and this candidate has been given the rank 5, right! Now, I find
the difference between the rank of xi and the rank of yi. Actually, here I have both
options: either I consider rank of xi minus rank of yi, or rank of yi minus rank of xi, and
both will give the same correlation coefficient; you can see in the formula of this rank
correlation coefficient that we are using di squared, so the sign of di will not make any
difference. So, now if you take the values here, I will highlight: 4 and 4, the difference
here is 0, this is 4 minus 4; the second pair, try to observe my pen, is 1 and 5, so 1 minus
5 is equal to minus 4; similarly, the next pair is 2 and 3, and 2 minus 3 is equal to minus
1; then the fourth pair is 5 and 1, so 5 minus 1 is plus 4; and similarly the difference
between 3 and 2, which is 3 minus 2, is plus 1. Now, after obtaining the values of these di's,
I substitute them in the expression for the rank correlation coefficient, which is given
here. So, these di's we have obtained, and the number of observations here is 5, so n is
equal to 5; the sum of the di squared is 0 + 16 + 1 + 16 + 1 = 34, and the value comes out
to be R = 1 - (6 × 34)/(5 × 24) = minus 0.7. What is this indicating? You can see here, this
value is minus 0.7. So, the minus is indicating the direction of the association between the
ranks given by the two judges, right! It indicates that the two judges have opinions in
opposite directions, and the degree of the association between them is 0.7. So, this is
indicating that the two judges have quite different opinions about the candidates who
participated, right!
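This hand calculation can be checked in R with the rank() function; a small sketch:

# verifying the worked example: d = (0, -4, -1, 4, 1), sum(d^2) = 34, n = 5
judge1 <- c(75, 25, 35, 95, 50)
judge2 <- c(70, 80, 60, 30, 40)
d <- rank(judge1) - rank(judge2)     # 0 -4 -1  4  1
1 - 6 * sum(d^2) / (5 * (5^2 - 1))   # -0.7, as obtained manually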
Now, the next question is how to compute the rank correlation coefficient in the R software.
The commands for the Pearson correlation coefficient and for the rank correlation
coefficient are similar; in the R software, we use the same command for computing both.
The only difference is that when we specify the option method, we have to give spearman
instead of pearson. If you remember, when we computed the Pearson correlation
coefficient in the last lecture, I had given the option method equal to pearson, and at
that time I also explained to you about the rank correlation. Now I am giving you more
explanation: we simply need to specify method equal to spearman, and then
the entire computation procedure and all other R commands remain the same and
have the same interpretation. So, now let us try to understand it here. The R command to
compute the Spearman rank correlation is cor, and inside the argument you have to give the
data vectors x and y; in this case there are some other options which have to be used in
case you want to compute the correlation coefficient based on ranks, that is, the rank
correlation coefficient. The c o r will remain the same, x and y will remain the same as
the data vectors, and the option use, to use all the data by giving the command
"everything" inside the double quotes, will remain the same as in the case of the Pearson
correlation coefficient; the only change is that now the method is going to be "spearman",
which has to be given inside the double quotes with the c command, and once you do this,
you will get the value of the rank correlation coefficient.
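As a small illustration, here is the call with the four scores from the ranking example earlier in this lecture; the second data vector is my own illustrative addition:

# Spearman rank correlation via cor(); x holds the scores 90, 20, 60, 35 from above
x <- c(90, 20, 60, 35)
y <- c(80, 30, 50, 45)   # illustrative scores for a second judge
cor(x, y, use = "everything", method = "spearman")
# the ranks of x and y agree exactly here (4, 1, 3, 2 in both), so the value is 1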
So, this is about the option use that handles the presence of missing values; it works the same way as in the case of the Pearson correlation coefficient.
(Refer Slide Time: 33:20)
So, now I take the same example in which I have computed the rank correlation coefficient
manually; you can see here that the values are the same: 75, 25, 35, 95, 50 for the first
judge, and similarly the other values, 70, 80, 60, 30, 40, for the second judge. So, I give
here the data x equal to the scores given by judge 1 and y as the scores given by judge 2.
Remember that I am not giving here the ranks, I am simply giving the scores, and R will
take care of converting them into ranks.
So, I name this data, for the sake of convenience, as judge1 for the data given by judge 1
and judge2 for the data given by the second judge, and now I find the correlation
coefficient between judge1 and judge2 using the option everything, and now my method
becomes spearman; and you can see here that this value comes out to be minus 0.7.
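Putting the whole computation together, as just described:

# the scores (not the ranks) go in; method = "spearman" does the ranking internally
judge1 <- c(75, 25, 35, 95, 50)
judge2 <- c(70, 80, 60, 30, 40)
cor(judge1, judge2, use = "everything", method = "spearman")   # -0.7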
And this is the same value which we had obtained manually, and this is the screenshot of
this operation; but I will show you the same operation on the R console also.
So, first I copy the data of judge 1, here judge1, and the data of judge 2, and then I
give the command for finding the correlation coefficient using the method spearman,
and you can see here this comes out to be minus 0.7. Now, I will stop here; we have now
learned how to compute the association when the data is in the form of ranks. So, once
again, as usual, I will request that you please look into the books, study more about
this rank correlation coefficient, take some examples and execute them on the R
software, and practice; the more you practice, the more you learn. And I will see you in
the next lecture, where we will discuss how to measure the association when the variables
are discrete or counting in nature.
Lecture – 31
Association of Variables – Measures of Association for Discrete and Counting Variables: Bivariate Frequency and Contingency Tables
Welcome to the course Descriptive Statistics with R software. You may recall that in the last
couple of lectures, we have understood how to measure the association between continuous
variables and ranked data, and we had the concepts of the correlation coefficient and the rank
correlation coefficient. Now the third case that is left is how to measure the association
between two counting or categorical variables.
So, in this lecture we are going to understand two concepts, bivariate frequency tables and
contingency tables, and we will also see how to implement these concepts in the R software; in
the next lecture, we will continue with some more measures of association in the case of
counting variables or counting data. So, let us just start our discussion here. First of all, you
have to understand what a bivariate frequency table is. You may recall that we have already
discussed the concept of a frequency table, and in that case we had essentially considered the
univariate frequency table. A univariate frequency table means there is only one variable on
which the data has been collected, and the data was tabulated. Now suppose there are two
variables and we want to tabulate the data; this type of table will be called a bivariate table.
Now the question is how this comes into the picture, and how the measures of association in
this case of counting variables come into the picture. Suppose you want to know, in a college,
what the choice of subject is, between say mathematics and biology, among boys and girls,
male and female students. What are we going to do? We will take a sample of students
consisting of both boys and girls and we will ask their choices: do they like mathematics or
biology? Now, what we expect is that in case there is no preference of subject with respect to
gender, then among the students who choose to study mathematics or biology there should
be the same number of boys and the same number of girls; but if a particular gender shows
more preference for a particular subject, that would indicate that yes, the choice of subject
depends on the gender.
So, what are we going to do here? Let us take this simple example, and it is the same example:
suppose we want to know if boys and girls in a college have an inclination to choose between
mathematics and biology.
So, obviously, as we said, if there is no preference, no discrimination between the two subjects,
then we expect that the total numbers of boys and girls opting for mathematics and biology should
be nearly the same. And now we have to collect the data to draw a conclusion on such an
opinion.
So, the data is collected on boys and girls with respect to biology and mathematics, and this
data is obtained in the form of frequencies; now we need to summarize this data in a frequency
table and, based on that, devise a measure based on this frequency data, or summarized data,
to study the association between two such variables, which are basically counting in nature.
Now suppose I take a sample of 1, 2, 3, 4, up to 10 students, and we ask each and every
student individually what their choice is, and we note down their gender. So, I am denoting
the gender of male students as M and the gender of female students as F, and for the subjects,
mathematics is denoted here by Mth and biology is denoted here by Bio. So, now the data is
collected as follows. Suppose we ask the first student; the first student is a male and he says that
yes, he prefers biology. After this we ask the second student; the second student is a girl, so we write here F for female, and she answers that she prefers biology. Similarly, we take the third student, who is a male, and this boy answers that he prefers mathematics, and so on.
Now you can count how many boys are here: 1, 2, 3, 4, and 5. So the number of males here is 5, and obviously the number of females is again 5: 1, 2, 3, 4, and 5. Now let us look at the data on the subjects: mathematics appears here 1, 2, 3, 4, 5, 6 times, so there are 6 students who prefer mathematics, and there are 1, 2, 3, 4 students who prefer biology. But you can see that from this type of raw data, from this type of frequency, we are unable to conclude anything. So what we need to do is create a bivariate frequency table like this one, where on one side we will put the gender and on the other side the subject.
Now there are two genders, male and female, and there are two subjects, mathematics and biology, and we count how many males prefer maths, how many females prefer maths, how many males prefer biology, and how many females prefer biology. You can see that the number of males who choose mathematics is one, two, three, four; the males who choose biology, that is the first student only; and the female students who prefer maths are here, one and two. In the same way we collect the other data, and all of it is compiled in such a table. So the number of male students choosing maths is four, and this number we denote by n11; n indicates a frequency, and in the subscript 11, the first 1 corresponds to the first row and the second 1 to the first column. Similarly, the number of female students who prefer mathematics is given in this cell, and this number is 2, so we write this frequency as n12; in the subscript 12, the 1 is the row, the first row, and the 2 is the column, the second column. Similarly, the number of male students choosing biology is one, denoted as the frequency n21, where 2 denotes the second row and 1 the first column. And similarly, the number of female students choosing biology is n22 = 3, the data in the second row and second column. So I have indexed each frequency by row and column: the address of a particular cell frequency is given by two numbers, the row and the column.
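For concreteness, here is a minimal sketch in R of this 2 x 2 table; the object name counts and the dimension labels are my own illustrative choices:

counts <- matrix(c(4, 2,   # Maths:   4 males, 2 females
                   1, 3),  # Biology: 1 male,  3 females
                 nrow = 2, byrow = TRUE,
                 dimnames = list(subject = c("Maths", "Biology"),
                                 gender  = c("M", "F")))
counts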
Now the next step is to count the numbers row-wise and column-wise. Suppose I count the numbers in the first row: this is 4 + 2, which is 6, and I denote this number as n1+. The 1 in n1+ indicates the subject in the first row, which is mathematics, and the + indicates that we have added the frequencies over the columns. So n1+ = 6 tells us how many students prefer maths. Similarly, if I sum the frequencies in the second row, this is denoted as n2+, and this number is 1 + 3, which equals 4. Once again, in n2+, the 2 indicates the subject in the second row, which is biology, and the + indicates that the addition has been taken over the second subscript, the column. So n2+ = 4 tells us that there are four students who prefer biology.
Now the same exercise can be done over the columns. When I add the numbers in the first column, this comes to 4 + 1, and using the same philosophy for the symbols, I denote this sum as n+1, with +1 as the subscript. Here the + stands in the first position, indicating that the sum has been taken over the rows, that is, down a column, and the 1 indicates the first column. So n+1 = 5 is the total number of male students, whichever subject they choose out of maths and biology. Similarly, for the second column, I add n12 and n22, which is 2 + 3 = 5. So n+2 indicates that the sum of the frequencies has been taken down the second column, and this is the total number of female students, counting those preferring maths and biology together. So the entire data we obtained from the sample has been classified into a two-by-two table, and this type of table is called a contingency table; in particular, this one is called a two-by-two contingency table.
Now let us see what the different symbols indicate in general; I have summarized them here. In general, nij is the frequency in the (i, j)th cell, or in better terminology, the absolute frequency. Similarly, n1+ indicates the row total n11 + n12, that is, the sum of the frequencies in the first row of the table. Similarly, n2+ = n21 + n22; you can see that the 2 remains the same and the + indicates that the sum has been taken over the columns, so this gives us the sum of the frequencies in the second row of the table. Similarly, in the case of columns, n+1 = n11 + n21: the sum is taken over the row indices 1 and 2, which is indicated by the + sign, while the column index remains the same, and this gives us the column total, the frequency of the first column. And similarly n+2 is
equal to n12 + n22, and this quantity gives us the sum of the frequencies in the second column. Now look at n = 10 in this table; what is this n = 10? It is the sum of all the frequencies, and it can be obtained in different ways. First, it is n11 + n12 + n21 + n22. It can also be obtained as the sum of the row totals, n1+ + n2+, which is again 10. And similarly, taking the sum over the columns, it can be represented as n+1 + n+2. So, as mentioned in the last line, n equals the sum of all the cell frequencies, which is the same as the sum of the row totals and the sum of the column totals, and so n indicates the total frequency. These are going to be our general symbols and notations for a contingency table.
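Written compactly, and holding for any contingency table once the ranges of i and j are enlarged accordingly, these identities read:

$$n = \sum_{i}\sum_{j} n_{ij} = \sum_{i} n_{i+} = \sum_{j} n_{+j}$$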
So now let me make it more general. Suppose I take, in general, two discrete variables on which the observations are obtained by counting. You can see, for
example, in the earlier case X was denoting the gender and Y was denoting the subject, and we had divided the data into two classes for X and two classes for Y: the two classes of X are male and female, and the two classes of Y are maths and biology. Similarly, I can make it more general: assume that for the data on the X variable we have created k classes, denoted x1, x2, ..., xk, and similarly for Y we have created l classes, say y1, y2, ..., yl, and nij is the absolute frequency of the (i, j)th cell corresponding to the observations (xi, yj); obviously i goes from 1 to k and j goes from 1 to l. In this general case, the frequencies can be represented in a k cross l contingency table, similar to the two-by-two contingency table we constructed, and this contingency table will look like this. I have used different colours in this table.
You can see that the part in blue colour indicates the absolute frequencies of the different classes. So n11 indicates the absolute frequency of the (x1, y1) cell; similarly, nij indicates the frequency of the (xi, yj) cell, and so on. All these values of n in blue colour indicate the absolute frequencies of all the classes.
Now we find the row sums and the column sums. Adding all the frequencies in the first row gives n1+ = n11 + n12 + ... + n1j + ... + n1l. The values in this added column, which are the totals of the rows, or the sums of the frequencies in the different rows, are called marginal frequencies, and more precisely the marginal frequencies of X, that is, of the classes x1, x2, ..., xk. Now we do the same operation on the columns. Taking the frequencies in the first column and adding them gives n+1 = n11 + n21 + ... + nk1.
Doing the same operation for each and every column, the column totals are denoted by n+j, where n+j is the sum over i from 1 to k of the nij, that is, the sum of all the frequencies in the jth column. These are also called marginal frequencies, and they are the marginal frequencies of Y, that is, of the classes y1, y2, ..., yl.
Now finally, consider n: this n can be obtained as the sum of all the marginal frequencies denoted in red colour, or as the sum of all the cell frequencies denoted in blue colour. This sum can also be obtained by adding the row totals, n1+ + n2+ + ... + ni+ + ... + nk+, which gives the same value n; and similarly, if you add the marginal frequencies of the columns, that will also give you the same value, as I have denoted in this expression. This is called the total frequency. So this is how we interpret and construct the contingency table, and this contingency table is our k cross l contingency table. Why? Because there are k rows, the classes of X, and l columns, the classes of Y.
So now we have understood that all the data can be represented by different types of frequencies, and we simply summarize them here once again. As we have discussed, the nij are the absolute frequencies. Now, what are these nij representing? They give us the counts of the different combinations of X and Y occurring together, so the values nij represent the joint frequency distribution of X and Y. You may recall that when we discussed the frequency table in the univariate case, we had only one variable, but now I have two variables, X and Y. There we discussed how the frequencies are distributed over different
class intervals, which was compiled in a univariate frequency table. But now, since we have two variables, the frequency inside a cell is determined by two values: the value of X and the value of Y.
So this joint frequency distribution tells us how the values of both the variables behave jointly. Just to inform you, here I am taking only two variables, but there can be more: three variables, four variables, and corresponding to those numbers we can create a suitable contingency table. Suppose we have three variables X, Y, Z, with X having two classes, Y having three classes, and Z having four classes; then we will create a 2 cross 3 cross 4 contingency table.
Now the next symbols: ni+ was the sum of the frequencies in a row, and n+j the sum of the frequencies in a column. These values represent the marginal frequency distribution of X and the marginal frequency distribution of Y. What does a marginal frequency distribution tell us? It tells us how the values of one variable behave within the joint distribution of X and Y. You can think of it this way: each cell count is determined by, or affected by, the two variables X and Y together. So one obvious question is, given the joint distribution of the two variables X and Y, what is the contribution of X and what is the contribution of Y? This information can be dug out from the bivariate frequency table or the contingency table.
Similarly, when we discussed the concept of the frequency table, we had two types of frequencies: absolute frequencies and relative frequencies. The advantage of using relative frequencies was that the sum of all the relative frequencies is always equal to one, and the relative frequency of every cell or value always lies between 0 and 1. This is very similar to the concept of probability. In fact, these frequency tables, in the univariate or the bivariate case, represent the probability distribution of discrete variables.
So in this case also, in place of the absolute frequencies, we can use relative frequencies, where a relative frequency is obtained simply by dividing the absolute frequency by the total frequency, and a new contingency table can be created from the bivariate frequency table based on the relative frequencies. If we use relative frequencies in place of absolute frequencies, then similar information is provided, and we call it the joint relative frequency distribution. Similarly to the marginal frequency distribution, we will now have the marginal relative frequency distribution, and there will be one more concept, which we call the conditional frequency distribution.
Let us try to understand this. The relative frequency of any class, say the (i, j)th class corresponding to (xi, yj), is obtained as nij/n and is indicated by the symbol fij. So, similar to the nij, these fij represent the joint relative frequency distribution of X and Y. Now we obtain one more quantity, which is called the conditional frequency distribution. It arises in two cases: when the value of Y is given, or when the value of X is given. When the value of Y is given, say Y equal to some particular value yj, the conditional frequency distribution of X given Y = yj is obtained as nij divided by n+j, that is, the cell frequency divided by the marginal frequency of that class, and this is denoted as f of X given Y = yj, with the subscript i given j; that is the standard symbol for indicating conditional frequencies.
The second case is when X is given. Suppose the value of X is given as xi; then the conditional frequency distribution of Y given X = xi is obtained as nij/ni+, which is again the ratio of the cell frequency to the marginal frequency, and this is denoted as f of Y given X = xi, with the subscript j given i. The vertical line symbol here is read as 'given'. The conditional frequency distribution gives us the information of how the values of one variable behave when the other variable is kept fixed. For example, we have considered the case of gender versus subject. Suppose I want to know the behaviour of the subject choices for a given gender; this type of information can be obtained through the concept of the conditional frequency distribution. I will take an example to show you how to interpret such values, but before that, let me write all the symbols in general.
When I find the sums of the relative frequencies in the rows and columns corresponding to X and Y, they give the marginal relative frequencies, in analogy with the marginal frequencies. Summing the relative frequencies corresponding to X for the ith class gives fi+, the sum of all the relative frequencies in that particular row. Similarly, the marginal relative frequency distribution of the Y values, or the classes of Y, is denoted f+j, and this is the sum over i from 1 to k of the fij. Similarly, the conditional relative frequency distribution of X given Y = yj is denoted by f of X given Y, with i given j in the subscript, and the conditional relative frequency distribution of Y given X = xi is denoted by f of Y given X, with j given i in the subscript.
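Collecting the symbols just described in one place:

$$f_{ij} = \frac{n_{ij}}{n}, \qquad f_{i+} = \sum_{j=1}^{l} f_{ij}, \qquad f_{+j} = \sum_{i=1}^{k} f_{ij}, \qquad f^{X \mid Y=y_j}_{i \mid j} = \frac{n_{ij}}{n_{+j}}, \qquad f^{Y \mid X=x_i}_{j \mid i} = \frac{n_{ij}}{n_{i+}}$$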
So now let me take a very simple example, and let us first understand how these values are obtained and how to interpret them; after that I will show you how to create the contingency table and these marginal frequencies inside the R software. Suppose a soft drink was served to some persons, and those persons have been divided into three groups depending on their age: the first group is children, the second group is young persons, and the third group is elder persons. They were asked how the drink tastes, and they were given two options, whether the taste is good or the taste is bad, and based on that we have obtained the data. The data obtained has been counted and compiled in a 2 cross 3 contingency table as follows. In the rows I am writing the variable taste, with its two classes good and bad, and in the columns I am taking the three age classes: children, young persons, and elder persons. Then, based on the data collected from such hundred persons, we count how many children said that the drink is good; suppose this number is 20. Similarly, we count how many young persons said that the drink is good, which is 30 here, and how many elder persons said that the drink is good, and this number is 10. Similar information was obtained for the 'bad' responses: there are 10 children who said that the drink is bad, 15 young persons who said that the drink is bad, and 15 elder persons who said that the drink is bad. Now you see, in this data we have three groups on age and two groups on taste. So there are three classes of age and two classes of taste, and one can see that taste and age are not independent: people in different age groups are giving different opinions. Had the drink been very good, we would expect all 100 persons to say the drink is good, but this is not happening here. So this indicates that the variables X and Y
are not independent but they are correlated, and my question here is how to measure this association. Different types of measures have been suggested, and all those measures try to capture this association in different ways. The objective here is to understand what these measures are and how to compute and interpret them.
Now after this, let us find the row sums and column sums. You can see that the sum of the first row, 20 + 30 + 10, is 60. This 60 is a marginal frequency: by looking at this number 60, I can see that there are 60 persons out of 100 who said that the drink is good. Similarly, the marginal frequency of the second row, 10 + 15 + 15, is 40, so this number 40 indicates that out of 100 persons there are 40 saying that the drink is bad. On the same lines, let me add the numbers in the columns. Adding 20 and 10 gives 30, so this number 30 tells us that out of 100 persons there are 30 children. Similarly, in the second column I get 45, which is 30 + 15, indicating that there are 45 young persons among these hundred persons, and in the last column, for elder persons, the value is 25, indicating 25 elder persons in the sample of 100. So these marginal frequencies of the rows and columns each give us a particular type of information, and this information has been obtained by holding one of the variables fixed and suppressing the other. For example, when I ask how many persons said that the drink is good, we are suppressing the information on age: we simply add children, young persons, and elder persons together. Similarly, when I want to find how many persons fall in the different age categories, I am suppressing the
variable taste: I am not bothered about who said good or who said bad, but simply counting how many children said the drink is either good or bad, and similarly how many young persons or elder persons did so. This is the type of information we obtain from the marginal frequencies, and the same information can also be obtained in terms of relative frequencies. For that, I simply have to divide each frequency by the total frequency.
You can see that the total frequency here is 100. So the first cell has relative frequency 20/100, and similarly the other cells have 30/100, 10/100, 10/100, 15/100, and 15/100. Now, when I sum them row-wise, the sums are 60/100 and 40/100 in the first and second rows, so they give us the marginal relative frequencies; and when I add down the columns, the numbers are 30/100, 45/100, and 25/100. So this gives the same information in terms of relative frequencies, and obviously the sum of all the relative frequencies is equal to one.
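As a sketch, here is one way to reproduce this table and its marginals in R; the object name drink and the dimension labels are my own illustrative choices, and addmargins() is the standard R function for appending the totals:

drink <- matrix(c(20, 30, 10,    # good
                  10, 15, 15),   # bad
                nrow = 2, byrow = TRUE,
                dimnames = list(taste = c("good", "bad"),
                                age   = c("child", "young", "elder")))
addmargins(drink)               # absolute frequencies with marginal totals
addmargins(drink / sum(drink))  # relative frequencies with marginal totals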
Now, let us see what type of information we get from this table. I am not going to discuss each and every piece of information, but I will give you some. One thing is clear: this is a joint frequency distribution, and it informs us how the values of both variables behave jointly. Looking at the marginal frequency distribution, the 60 means there are 60 persons who said that the drink is good, and the 40 means 40 persons said that the drink is bad. In general, I can write this as 60 out of 100, and if I multiply by 100 this gives the value in percent. So I can say that 60% of
the persons said that the drink is good, and similarly 40% of the persons said that the drink is bad. Similarly, looking at the column totals, 30%, 45%, and 25% of the persons in the sample are children, young persons, and elder persons respectively. So I can say that there are 45% young persons and 25% elder persons in the sample.
Similarly, let us look at the frequency distribution in terms of conditional frequencies. The conditional frequency distribution tells us how the values of one variable behave when the other variable is kept fixed. For example, consider the value 20/60: this says that among the persons who said the drink is good, 20 out of 60, that is (20/60) x 100%, about 33%, are children. How has this been obtained? Look at where the 20 and the 60 come from: the 20 is the cell frequency and the 60 is the marginal frequency of 'good'. I am fixing one variable here, namely taste equal to good; that is the idea of the conditional frequency distribution. Having fixed taste at 'good', the conditional frequency is the absolute frequency 20 divided by the marginal frequency 60. Similarly, there is another piece of information on the 'bad' side: the cell frequency for children is 10 and the marginal frequency of 'bad' is 40, so 10/40, equivalently 25%, of those who said the drink is bad are children. And similarly, along the same row, 30/60 = 50% of those who said the drink is good are young persons: again I fix the variable taste at 'good' and take the cell frequency nij divided by the marginal frequency. In the same way, for the young persons who said the drink is bad, the cell frequency 15 is divided by the marginal frequency 40, which comes out to be 37.5%: among those who said the drink is bad, 37.5% are young persons.
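In R, both directions of conditioning can be read off with prop.table(); reusing the drink matrix sketched above:

prop.table(drink, margin = 1)  # given taste (each row sums to 1)
prop.table(drink, margin = 2)  # given age group (each column sums to 1)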
So, what we have done here is understand how to create a bivariate frequency table, how to convert it into a contingency table, and what different types of information this contingency table gives us. The next question is how to create this frequency table and the contingency table in the R software, and this I will discuss in the next lecture.
Until then, please try to revise all these concepts and try to understand what they are telling us. Once you understand them, getting the output from the software is very simple, but the main thing will be how to interpret those values. So you practice here, and I will see you in the next lecture. Till then, good bye.
Lecture – 32
Association of Variables
Measures of Association for Discrete and Counting Variables: Contingency Table with
Welcome to the next lecture of the course Descriptive Statistics with R Software. You may recall that in the last lecture we started our discussion on measuring the association between two discrete variables, on which the observations were obtained as numbers obtained by counting. In that lecture we discussed that from a given set of data we can create a bivariate frequency table and a contingency table; from that contingency table we can obtain the marginal frequency distributions and the conditional frequency distributions, both in absolute and in relative terms. We took an example and understood how these values come about and how to interpret them.
Now, in this lecture, I will show you how these contingency tables can be obtained inside the R software. I will take an example and show you how to obtain the contingency table, and after this I will introduce some quantitative measures to find the magnitude of the association, or the degree of association, between the two variables. So first I take up the topic of how to construct the contingency table in R.
As usual, I will assume that we have two data vectors, x and y. You may recall that in the case of a univariate frequency distribution, with a data vector x, we used the command table to find the frequency table, and when this table was divided by the length of x, we got the frequency table in terms of relative frequencies. Similarly, when we want to tabulate bivariate data, the same command is used, that is, table, t-a-b-l-e; the only difference is that now inside the argument you have to give the two variables or two data vectors, and if you have more than two, you can list all those data vectors separated by commas. So table(x, y) is used to cross-classify the factors and build a contingency table of the counts at each combination of the factor levels. If you use the command table(x, y), this gives you the output in the form of a contingency table with absolute frequencies, and similarly, if you divide this table by the length of a data vector, it returns a contingency table with relative frequencies.
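A minimal sketch with made-up vectors (the values here are purely illustrative):

x <- c("A", "B", "A", "A", "B")
y <- c("yes", "yes", "no", "yes", "no")
table(x, y)              # absolute frequencies
table(x, y) / length(x)  # relative frequencies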
Now, in case you want to find the marginal frequencies, there is a command addmargins, a-d-d-m-a-r-g-i-n-s, and this command is used along with the table command; it adds the marginal frequencies to the contingency table that was constructed by the command table. So the entire command to add the marginal frequencies to the contingency table becomes addmargins with, inside the argument, the table to which the margins are to be added, that is, the contingency table provided by the command table(x, y). And if you want to obtain the marginal frequencies in terms of relative frequencies, then you simply use the addmargins command with, inside the argument, the contingency table of relative frequencies.
Refer Slide Time: (4:51)
So now let me take a very simple example, convert the given data into a contingency table, and then obtain the marginal frequencies. Suppose there are twenty persons, divided into three categories with respect to their age, as children, young persons, and elder persons, and all of them were given a drink and asked about its taste. Note that this is similar to the example I took in the last lecture, where I took a hundred persons; my objective here is to show you how the contingency table is constructed from the raw data, and showing a hundred observations would be cumbersome, so I am taking only twenty observations. You can see that there are twenty persons, one to ten and then eleven to twenty. The first person is a child who, asked how the drink is, responds good; the second person is a young person who said the drink is good; the third person is an elder person who said the drink is bad; and then the
fourth person is a child who said that the drink is bad, and so on; this is how we have obtained the data. Now I store this data into two data vectors: one called person, in which I store the age-category data shown in the table.
So, I have simply typed it in. Then, in the second data vector, called taste, I have collected the data
on the taste: good, bad, and so on. This data has been assigned to the two vectors in the same order. Same order means: the first person is a child and this child said that the taste is good, and if you look at the data vectors, the first entries are indeed child and good, and so on. So these observations are written in exactly the same order as in the table.
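The exact sequence of the twenty responses is only partly read out in the lecture, so here is one possible reconstruction; any ordering with the same category counts (child: 2 good, 2 bad; elder person: 6 good, 4 bad; young person: 4 good, 2 bad, as counted below) produces the same table:

person <- c(rep("child", 4), rep("elder person", 10), rep("young person", 6))
taste  <- c("good", "good", "bad", "bad",   # children: 2 good, 2 bad
            rep("good", 6), rep("bad", 4),  # elders:   6 good, 4 bad
            rep("good", 4), rep("bad", 2))  # young:    4 good, 2 bad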
Now, after this, I use the command table with the arguments person and taste separated by a comma, and this command provides the contingency table with absolute frequencies. When I execute this command on the R console, I get a table like this one. Now, how to interpret and read this table is the more important thing to learn. One variable here is taste and another variable here is person; the person variable has three categories, child, elder person, and young person, exactly as denoted in the data, and similarly taste is divided into two categories, bad and good. This classification has been done by the R software automatically, by counting how many persons fall in each category. For example, this 2 indicates that there are two children who are saying that the taste is bad; similarly, marked with a different colour pen, this 6 means that there are six elder persons who are saying that the taste is good; and similarly this 2 indicates that there are two young persons out of 20 who are saying that the taste is bad. Next, we would like to obtain the
marginal frequencies, and for that I use the command addmargins, giving inside the argument the same command that produced the contingency table. You can see that the resulting contingency table is the same as before, but with one more column and one more row added, each labelled Sum. What are these sums? This value 4 has been obtained as 2 plus 2; similarly the 10 here is 4 plus 6, and the 6 is 2 plus 4. Similarly, for the first column, 2 plus 4 plus 2 equals 8, and for the second column, 2 plus 6 plus 4 equals 12. So these sums are the marginal frequencies, along the rows on one side and along the columns on the other, and finally the value 20 is the sum of all the observations, all the frequencies: 2 plus 2 plus 4 plus 6 plus 2 plus 4 equals 20. So this is the total frequency.
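With the reconstructed vectors above, the call and its printed result (layout approximate) would be:

addmargins(table(person, taste))
#               taste
# person         bad good Sum
#   child          2    2   4
#   elder person   4    6  10
#   young person   2    4   6
#   Sum            8   12  20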
This is the screenshot of how I obtained it on the R console.
Now let us find the same thing with respect to relative frequencies. In order to find the contingency table with relative frequencies, first we need the total number of observations. We have two options here: we have two variables, person and taste, and both have 20 observations, so I can use the command length on person, or length on taste; both give the same value because both variables have the same number of observations. Once I execute the command length on either vector, I get the value 20, the total number of observations. Now, to obtain the contingency table with relative frequencies, I use the same table command but divide it by the length of a data vector: table(person, taste)/length(person). Once you do this, you get an outcome like this one. First, let me show you how a value arises; suppose I take this value 0.1: where is it coming from?
In the earlier slide, we had obtained the frequency 2 corresponding to bad taste and a child. Now this 2 is divided by the total number of observations, which is 20, and this equals 1/10, which is 0.1. This is the same value here: it is nothing but nij/n, which here is 2/20 = 0.1. Similarly, let us see how this value 0.3 is coming.
Refer Slide Time: (14:06)
I again use nij/n, which here equals 6/20, and this comes out to be 3/10, which is 0.3. This is how all the other values in this table are obtained. Now, in case I also want the marginal relative frequencies, I have to use the command addmargins applied to the same command that produced the contingency table, and if you do this, you can see that this part stays the same, because it corresponds to the contingency table, while an additional row and column appear. The first question is what these additional rows and columns indicate. Take the first row: its sum is 0.1 plus 0.1, which is 0.2, and this is the first value in the added column. Similarly, 0.2 plus 0.3 equals 0.5, the second value, and the third value is 0.1 plus 0.2, which equals 0.3. They give us the values of the marginal relative frequencies. Similarly, looking into the columns, summing 0.1 plus 0.2 plus 0.1 gives 0.4, and in the second column, 0.1 plus 0.3 plus 0.2 gives 0.6. So these two values, 0.4 and 0.6, are also marginal relative frequencies.
And this is the screenshot of the operations to be done on the R console. Before going further, let me show you all these operations on the R console: I will take the same example, enter the same data set, and obtain the contingency tables with respect to the absolute frequencies as well as the relative frequencies.
You can see here, first I create these two data vectors on the R console, person and then taste, so now I have the data on person and on taste. I clear the screen and now create the contingency table; you can see here person and taste, the same table we obtained before. Now I just want to show you what happens if you interchange person and taste, giving taste first and then person inside the argument: obviously the data remain the same, but the rows and columns are interchanged, as you can observe here. In case I want the marginal frequencies of, say, table(person, taste), you can see they are obtained like this, the same as we had just obtained. Now I find the same contingency table with respect to relative frequencies: I take the same command but divide it by the length of the data vector, choosing the data vector person, and you can see that we get this outcome, with the same values shown on the slides. And if I want the same contingency table divided by the length of the other data vector, I choose taste instead of person, and you will see that in both cases you get the same outcome, because the length of the two data vectors is the same, namely 20. And now, to add the marginal relative frequencies, I use the same command but wrap it in addmargins, and you can see that this gives us these values; the sum of all the marginal relative frequencies is coming out to be one. This is the same output which I had shown you on the slides. So you can see that finding such contingency tables, with absolute frequencies or relative frequencies, is not difficult once you have a data set.
Video End Time: (20:16)
And now I will discuss one more new topic. I am going to discuss a tool which is called the chi-squared statistic, and its role is to give us an idea of the association by quantifying its degree, similar to the correlation coefficient in the case of continuous variables. One thing you have to keep in mind: the chi-squared statistic is actually used in tests of hypothesis, for example goodness-of-fit tests of probability density functions, and when we use it there, certain conditions apply; for example, the cell frequencies should be greater than 5, and so on. But here I am taking an artificial example, so I have kept the frequencies low; if you have more data, these frequencies will obviously be higher. So while computing the statistic in the R software, you may get a sort of warning, but you need not worry about it here; you essentially have to follow the computation and the interpretation of the statistic.
Refer Slide Time: (21:48)
So this statistic is called Pearson's chi-squared statistic; the symbol χ is a Greek letter, written like this, and the statistic is used to measure the association between the two variables in a contingency table. For the k × l contingency table that we created earlier, the statistic is given by

$$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{\left(n_{ij} - \dfrac{n_{i+}\, n_{+j}}{n}\right)^2}{\dfrac{n_{i+}\, n_{+j}}{n}}$$

You can recall that nij is the absolute frequency of the (i, j)th cell, ni+ and n+j are the marginal frequencies of X and Y, and the small n is the total number of observations, the total frequency. So this is the statistic which gives us an idea about the degree of association, and its value satisfies 0 ≤ χ² ≤ n[min(k, l) − 1], where min(k, l) means whichever of k and l is the minimum.
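A small sketch of this computation in R; chisq_manual is a name I am introducing just for illustration, and it evaluates exactly the double sum above:

chisq_manual <- function(tab) {
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n  # n_i+ * n_+j / n
  sum((tab - expected)^2 / expected)
}
chisq_manual(table(person, taste))  # same value as chisq.test() reports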
Now, what is the interpretation of this statistic? Given the data, one can compute it, and if the value of χ² is close to zero, this indicates that the association between the two variables is weak. So a value of χ² close to zero indicates a weak association between the two variables X and Y, and
recall that χ² has the limits zero and n[min(k, l) − 1]. So, in case the value of χ²
is close to the second limit, which is n[min(k, l) − 1], this indicates a strong association between the two variables. Remember, this range is not something fixed like zero to one or minus one to plus one; the value depends on the size of the contingency table. Well, that is a drawback, and it was overcome in some further modifications that we are going to discuss. Now, if you get any other value of χ², not close to zero and not close to n[min(k, l) − 1], then it indicates the degree of association between the two variables accordingly, as low, moderate, or high. One aspect of this statistic is that it is symmetric: which of the variables you take in the rows or in the columns makes no difference. For example, you saw that we constructed the frequency table as (person, taste) and as (taste, person); in both cases this statistic comes out to be the same.
Now let me give a particular case which is very popular: a two-by-two contingency table. Suppose there are variables X and Y with two classes each, x1, x2 and y1, y2, and the absolute frequencies in the cells are a, b, c, and d, written in green colour. The marginal frequency of the first row is a + b, the marginal frequency of the second row, corresponding to x2, is c + d, and similarly the marginal frequency of the first column is a + c and that of the second column, y2, is b + d. If you substitute all these values into the χ² statistic, it simplifies to

$$\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

where obviously n = a + b + c + d, the total frequency. Okay. Now, what I will do after this is take a simple example based on such two-by-two data and compute this χ² statistic; later on I will introduce some more statistics and measure the degree of association in the same example using the different statistics. So the data: a sample of one hundred students was collected, and they were judged whether they are weak or good in academics, that is, good or bad in their studies, based on their academic performance. After this, a group of the students was given tuition, and after the tuition they were judged once again; the idea was to know whether this tuition is helpful, whether this extra study is really helping the students improve their academic performance. That is the question which I would like to answer on the basis of the given sample of data. So, what we did is the following.
Refer Slide Time: (27:44)
This sample of one hundred students is divided into two groups, weak and strong in academics, and some of the students, from both the weak and the strong group, were given tuition; after that, their academic performance was judged again, and the data was compiled in the following contingency table. You can see there are weak students and strong students, and students who were given the tuition and who were not given the tuition. It was found that there are 30 weak students who were given the tuition and 10 strong students, 10 good students, who were also given the tuition. Similarly, there were 20 weak students who were not given the tuition and 40 strong students who were not given the tuition. Based on this, we would like to find whether there is any association between the tuition and the academic performance.
Refer Slide Time: (28:50)
So we find the marginal frequencies. You can see the marginal frequencies are 30 plus 10 equals 40 and 20 plus 40 equals 60 for the rows, and similarly 30 plus 20 equals 50 and 10 plus 40 equals 50 for the columns, and the total sum is one hundred. Based on what we have in this table, I use the same formula, and
I compute this value: the value of χ² comes out to be 16.66; this is a manual calculation. Then the value of the upper limit, which is n[min(k, l) − 1]: here n is one hundred, k is two, and l is two, so min(2, 2) = 2, and 2 minus 1 is 1, so the upper limit is one hundred. So now we have to see whether this value 16.66 is close to zero or close to one hundred, and based on that we have to take a call on what is really happening. You can see that this value is not close to zero, but on the other hand it is also not close to one hundred. Right.
So one may conclude that there is a moderate or a lower association; it depends on how you want to interpret it. There is no hard and fast rule to decide what is low, what is moderate, and what is strong, but in my opinion I can say that there is a moderate association here.
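As a quick cross-check of this manual calculation, the two-by-two shortcut formula from above can be evaluated directly in R:

a <- 30; b <- 10; c <- 20; d <- 40            # tuition example cell counts
n <- a + b + c + d                             # 100
n * (a*d - b*c)^2 / ((a+b)*(c+d)*(a+c)*(b+d))  # approx 16.67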
Now I take the same example which I had considered earlier, and I find the χ² statistic for it. This is the example in which we collected data on 20 persons and recorded their responses on the taste of a drink along with their age category, that is, child, young person, or elder person. You have already obtained the contingency table using the command table(person, taste). Now, to obtain the χ² statistic, I am giving you two things: first the command, and second its application to this example. The command is chisq.test, c-h-i-s-q dot test; this is a short form for the chi-squared test. You can see that inside the argument I give the data in terms of the contingency table, that is, table(person, taste), the same command as before. After this I use the dollar symbol, which is given on your keyboard, and then I write statistic, s-t-a-t-i-s-t-i-c, and this will
give you this outcome. The outcome shows the value of χ², written here as X-squared, because χ is a Greek letter and R cannot print the Greek letter in its output; this value comes out to be 0.277 and so on. But here you will also see a warning message, saying that in chisq.test the chi-squared approximation may be incorrect. Why is this happening? I just informed you that the χ² statistic, which we have used here to find the association between the two variables person and taste, is actually used for tests of hypothesis, to test whether there is a significant association between the two variables, here person and taste. When we apply the test of hypothesis, there are certain conditions for the applicability of the test, and one of the conditions is that each and every cell frequency should be greater than 5. Here, in this case, there are several frequencies which are not greater than 5, such as 2, 2, 4, 2, and 4, and that is why this warning is issued: the test is telling you that you are computing the χ² statistic, but some frequencies are smaller than 5, so the value may not be very accurate and you may reach a wrong conclusion. But that is related to the test of hypothesis; here our objective is simply to show how to calculate the statistic. I could take a bigger data set, but then you would not be able to match what R is doing with what you can obtain manually. Rather, my request is that you take the same example, create the same contingency table yourself, and compute this value; it will come out to match.
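For reference, the call and the figure quoted above, with the printed layout approximate:

chisq.test(table(person, taste))$statistic
# X-squared
# 0.2777778
# Warning message: Chi-squared approximation may be incorrect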
So now let us see how you can compute these things on the R console.
You may see that I already have the data on person and taste here, and if you create the table of person and taste, you get the same contingency table. I clear the screen so that you can see everything clearly, and now I compute the χ² statistic: I use the same command that I used in the slides.
Now let us come back to our slides. By looking at the value of χ², you can judge whether the association between the academic performance and the tuition is significant or not, by computing the value of n[min(k, l) − 1] and so on. But now I want to address another aspect. In this case you need to compute the limits of the range: one is zero, but the other is n[min(k, l) − 1]. So the value by itself does not give you a clear-cut indication. For example, if you remember, in the case of the correlation coefficient the value lies between minus one and plus one, or its magnitude lies between zero and one, so just by looking at the value of r you can very easily communicate whether the association is high or low and so on. So a modification was suggested to the χ² statistic, and a new statistic, a modified version of the χ² statistic, was defined: Cramer's V statistic. What happened in this case is the following.
The range of the Pearson χ² statistic depends on the sample size and on the size of the contingency table, that is, on the number of rows and the number of columns. This issue was solved, and a modified version of the Pearson χ² statistic was presented as Cramer's V statistic. For the same k × l contingency table, it is defined as

$$V = \sqrt{\frac{\chi^2}{n\,[\min(k, l) - 1]}}$$

and the advantage of this V statistic is that it always lies between zero and one.
Refer Slide Time: (37:36)
So now making inferences about the degree of association becomes simpler. For example, if the value of V is close to zero, that simply indicates a low association between the variables; if the value of V is close to one, this indicates a high association; and for any other value between zero and one, the magnitude indicates whether there is a low, moderate, or strong association. So now, you can see that in the earlier case we had obtained
the value of χ² to be 16.66, and we had concluded that the association is moderate. So now, for the same example, I compute this value: using the value of χ² equal to 16.66, the V statistic comes out to be 0.40. Ideally, V should lie between zero and one, and this value is lying somewhere in the middle range, so once again this indicates a moderate association.
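Plugging the numbers in directly, as a quick sketch independent of any package:

chi2 <- 16.66; n <- 100; k <- 2; l <- 2
sqrt(chi2 / (n * (min(k, l) - 1)))  # approx 0.41, quoted as 0.40 above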
If you want to compute this V statistic in the R software, we need a special package called lsr. First we need to install this package using the command install.packages with "lsr" inside the argument within double quotes; once you do it, the package will be installed. I am not showing this here because I discussed it in the starting lectures. After that you need to load the library using the command library with lsr inside the argument. Now recall what we had obtained earlier in this data set.
Refer Slide Time: (39:35)
This is the example where we had 20 persons who responded on the taste of a drink.
So I am going to calculate this value for this data. Remember one thing: I am taking two examples here, one which I am doing manually and another which I am doing with the R software, and this drink example is the one we have just done on the R software. So now I compute the value of Cramer's V statistic, and once again I show you two things: the command and the interpretation. The command is cramersV, c-r-a-m-e-r-s followed by a capital V, which you have to remember, and then I give the contingency table for which I want to compute the value of Cramer's V statistic. Once I do this, it gives me the value 0.11. And yes, once again you will see the warning message; it appears for the same reason as before, that this is based on the χ² statistic and some of the cell frequencies,
Refer Slide Time: (40:57)
as you can see, are smaller than 5, such as 2, 2, 4, and so on.
So this is how you can do it, and this is the screenshot, but I would like to show it on the R console as well.
I already have installed this package on my computer.
After that, I use the command cramersV, and inside the argument
I give the contingency table for which we want to compute it. You can see it is not difficult at all, and it is easier to interpret because all the values lie between zero and one. So by looking at the value 0.11, one can conclude that some association is there, but it is quite low: the taste and the age group have only a weak association here.
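The whole sequence, as a sketch, assuming the lsr package is installed; the same small-frequency warning as before is expected:

library(lsr)
cramersV(table(person, taste))  # approx 0.118, quoted as 0.11 above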
Refer Slide Time: (42:09)
Now, after this there is another coefficient which is used to measure the degree of association in such a case, and this is called the corrected contingency coefficient. This is simply the corrected version of Pearson's contingency coefficient, which is based on the χ² statistic once again, and this contingency coefficient, which I am denoting here as C_corr, where corr is an abbreviation for corrected, is defined as C_corr = C / C_max, where C is given by C = √(χ² / (χ² + n)) and C_max is given by C_max = √((min(k, l) − 1) / min(k, l)). This statistic also has the advantage that it lies between zero and one, so it is more convenient to draw conclusions or statistical inferences using a value which lies between zero and one. So, this also has a similar interpretation: if the value of C_corr is close to zero, that would indicate a lower association between the two variables; in case the value of C_corr is close to one, this is going to indicate a higher association between the two variables; and other values of C_corr between zero and one would indicate the corresponding degree of association between the two variables.
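Base R has no ready-made function for this corrected coefficient, so here is a minimal sketch (my own helper, not from the lecture) that computes it directly from the definition above for any contingency table:

    # corrected contingency coefficient C_corr = C / C_max
    cc_corr <- function(tab) {
      n    <- sum(tab)
      chi2 <- suppressWarnings(chisq.test(tab)$statistic)  # Pearson chi-square
      C    <- sqrt(chi2 / (chi2 + n))
      m    <- min(nrow(tab), ncol(tab))
      Cmax <- sqrt((m - 1) / m)
      unname(C / Cmax)
    }

Calling cc_corr on the same table used for Cramer's V returns the corrected coefficient, which always lies between zero and one.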
Refer Slide Time: (43:36)
And now, in case I try to take this example again, just as a reminder:
This is the students versus tuition example where we had found the value of χ² to be 16.66; now for the same thing I would like to find out this contingency coefficient. So, you can see here, from this value I was concluding that there is a moderate association; then I used the Cramer's V statistic, which was also indicating a moderate association; and now let me see. So, if I try to compute the value of C based on the value χ² = 16.66, then the value of C comes out to be 0.38; obviously n here is 100, the value of C_max comes out to be 0.71, and finally the value of the corrected contingency coefficient comes out to be 0.54. Since this value 0.54 is lying somewhere in the middle of zero and one, once again I can say that this is indicating a moderate association between the two variables. Now, we have considered different types of measures to find out the degree of association between two variables where the variables are in the form of count data. Well, you can see here that
different statistics are giving us different values and they have different interpretations. If there is an association present inside the data, then ideally all the statistics should indicate the same thing; there can be a small difference, for example, the Cramer's V statistic is close to 0.40 whereas the contingency coefficient is giving 0.54 and so on, but definitely they are indicating, yes, that there is a moderate association. So, this is how we go. Now, definitely the question comes: how to decide whether the association is really low, moderate or strong? For that you have to use your own judgment, and this power to judge you can very easily develop by practice. So, I would request you to please take some more examples and try to practice. Now, I would like to stop with this topic of measuring the association between different types of variables, whether they are continuous, ranked observations or count observations, and you have also learnt how to compute these measures with the R software. So, the main thing is this: if you understand the concepts, it is easy to compute them, but the main question is how to interpret them. The main objective which you have to emphasize here is how to choose the right tool, how to compute it correctly, and then how to draw the correct statistical inferences out of that. So you practice and learn this technique, develop this technique, and I will see you in the next lecture; till then, good bye.
Lecture – 33
Welcome to the lecture on the course descriptive statistics with R software; in this lecture we are going to discuss the fitting of linear models. So, you see what happens: whenever we have a sample of data, the first information is obtained from the graphical tools, and suppose we have data on two variables and both the variables are associated, they are not independent, then definitely these observations will have some correlation structure. So, the first step will be to create a plot, like a scatter plot, a smooth scatter plot and so on. Those plots are going to give us an idea that there is an association present in the data. Now, after this we try to use different types of tools, for example the correlation coefficient, to quantify the degree of association or the degree of linear relationship. Now, the final question which remains is: can we find out the mathematical relationship between the
two variables, or can we find a statistical model between the two variables, and if yes, how to do it? So, that is the question which we are going to entertain in this topic, fitting of linear models. Now, we are assuming that there are two variables, or there can be more than two variables, and there exists some relationship between those variables, right, and whenever there is a relationship, there are going to be two types of variables: one is the output variable, that means the variable on which we obtain the values of the output, and the second is the input variable. So, whatever values are given to the so-called input variable, based on that we will have an output. For example, suppose I say that whenever we are doing some agriculture, then the yield of a crop depends on the quantity of fertilizer: in case I try to increase or decrease the quantity of fertilizer in the field, then up to a certain extent the crop will increase or decrease. Yeah! Obviously, if you try to increase it too much, then the crop will get burnt up. So, this is a case where we can see the relationship between the yield of the crop and the quantity of fertilizer. Similarly, if I try to extend this relationship, then we know from our experience that the outcome of a variable, in this case the yield of the crop, does not depend only on one variable, the quantity of fertilizer, but it depends on other variables also: quantity of fertilizer, temperature, rainfall, irrigation, moisture and so on. So, now I have given you two situations, where the outcome is dependent on one variable and on more than one variable. So, now, how to handle these situations? This is what we are going to understand in this lecture.
Refer Slide Time: (04:03)
We assume that a relationship exists between two variables, or, to generalize, that a relationship exists between a variable and more than one variable and so on; that we will see later on. In this type of relationship, we assume that the outcome of a variable is affected by one or more than one variable, for example, the example which we have just considered, that the yield of a crop increases with an increase in the quantity of fertilizer, right! Similarly, try to observe the phenomenon of an electric fan. Whenever we try to increase the speed, how is it done? That is controlled by a switch, and we move the switch from position number one to position number two, position number two to position number three, and in this process, what is happening inside the switch is that the quantity of current is increased. So, I can say that when I am trying to increase the position of the switch from one to two, two to three and so on, then automatically inside the switch the amount of current flowing in the circuit increases, and finally the outcome of the fan, which is the RPM, that is, rotations per minute, increases, or in simple words the speed of the fan increases. Similarly, in another example, we have seen that as the temperature of the weather increases, the quantity of water consumed is more. You can see that we consume more water during summer than in winter, right! So, in these types of examples, I am saying that the speed of an electric fan, which is measured by the rotations per minute, increases as the voltage or current increases; people drink more water as the weather temperature increases; and similarly, in the same example, the yield of a crop depends on other variables also, like the quantity of fertilizer, rainfall, temperature and so on.
So, now we are assuming that such types of relationships can be expressed through models, and model is a very fancy word; nowadays everybody wants to find out a model among the variables. So how to do it? But the first question comes: what is a model? A model is only a sort of mathematical or statistical relationship among the variables, and this relationship is such that it represents or depicts the phenomenon, whatever is happening. So, when we talk of modeling, then modeling and finding a relationship among the variables are the same thing, and in such a case, the relationship is characterized by two things: one is variables and another is parameters. So, the first question comes: what is the difference between variables and parameters? This I will try to address in a very simple language through a very simple example, but before going further, let me clarify what types of relationships can exist. There can be different types of relationships, and broadly I can classify them into two parts: linear and nonlinear. Now, in this course we are going to entertain only the linear relationships.
Now, I come back to my earlier issue of how to start. So, whenever there is a phenomenon, or whenever something is happening, then in that phenomenon you have to observe that usually there will be two types of variables, or the variables can be divided into two categories: one category is input variables and another category is output variables. Now, the first question comes: how to take a call, or how to decide which of the variables is an input variable and which of the variables is an output variable? This I can explain to you with two
simple examples. For example, we know from our experience that whenever a student studies more, usually he or she will get more marks. So, now I have here two variables: one is the marks in the examination and the second is the number of hours a student studies. Now, in this situation there are two possibilities: one possibility is that I can assume that the marks depend on the number of hours studied, and the second option is that the number of hours studied depends on the marks obtained. Now, in such a situation there is no mathematical rule or statistical rule which can tell you; it is only your experience with the data and the information about the phenomenon that is going to help you in taking a call, or in taking a decision, about which variable is affecting which. So, in this case, we have the two options as given on the slide: that the marks depend upon the number of hours a student studies, or that the number of hours of study depends upon the marks obtained by the student, right! Now, which
one is correct? The first statement is correct and the second statement is wrong. Similarly, if I try to take another example: the yield of a crop depends on the quantity of fertilizer and the temperature of the weather, or the reverse happens, that the quantity of fertilizer or the temperature of the weather depends on the yield. In this case, the second option is not possible, and we know from our experience that the weather temperature and the quantity of fertilizer are both going to affect the amount of yield up to a certain extent. So, in this case, the quantity of fertilizer and the temperature of the weather become the input variables and the yield of the crop becomes the output variable. So, in this case, if you try to see, we have the two options which are written on the slide: that the yield of a crop depends upon the rainfall and the weather temperature, or, the second option, that the rainfall and the weather temperature depend upon the yield of the crop. Obviously, the first sentence is correct and the second sentence is wrong; and similarly, in the first example it was the second statement that was wrong, right! So, this is how we try to decide in a phenomenon what is going to be an input variable
and what is going to be an output variable in any given situation. Now the next question: whenever we are trying to write down a model, the model is essentially going to be a mathematical equation. Yes, statistical concepts will be used to obtain that mathematical equation, but in that equation there will be two components, one is variables and another is parameters. So, I will take here a very simple example to explain to you what the difference between the two is, and what the roles of variables and parameters in a model are. Now, I am going to take the example of the equation of a straight line which you have possibly studied in class 10, 11 or 12: y = mx + c. You see, y = mx + c has two types of components: one category I can define as x and y, and another category is m and c. Out of these two categories, one set of quantities is the variables and another set of quantities is the parameters. Now, how to take a call, how to decide what is what?
So, we know that the equation of a straight line is given by y = mx + c, where c is the intercept, and suppose this line looks like this. So, here x is on the x-axis and this x is going to indicate the values on the x-axis, and this is the y-axis and y is going to denote the values on the y-axis, and this quantity here is c. So, c is going to be the intercept term, right! And this angle is represented by m in terms of the tangent of the angle, the trigonometric function. So, this is how we denote this equation y = mx + c. Now, in this equation, if you see, there are two sets which I can point out: one is x and y, and the second is m and c, and out of these, one set represents the parameters and the other set represents the variables. Now, I can give you a simple query: please match these two columns, column one and column two; that is the simple type of matching question that you have solved earlier. So, now let us see how you can do it. Suppose, as the first option, you know the values of x and y, say x = 4 and y = 2; then my question is: can we know the entire information, or all the information, about the line? In this case your y = mx + c will become 2 = 4m + c. Now, looking at this, do you think that you have the entire information about the line? Certainly not. I just know that there is a point with x = 4 and y = 2. Now, the second option is this: instead of having the values of x and y, I know the values of m and c, say m = 5 and c = 6; in this case, the equation becomes y = 5x + 6. Now, my question is: do we know the entire information about this line, or can we really know, by looking at this equation, all the information contained in it about the line? The answer is yes; if you wish, I can even plot this line here, say 1, 2, up to here, so here is 6. So, this line is somewhere here, like this, where this is going to represent the intercept term c, and similarly this angle is given by the quantity 5, tan of the angle is equal to 5. So, now one can see that by looking at the values of m and c, I can have the entire information about this line. So, option one was incorrect and option two is correct.
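Just as a small sketch of this point (my own illustration, not part of the slides), knowing m = 5 and c = 6 is enough to draw the whole line in R:

    # the parameters m = 5 and c = 6 determine the entire line
    curve(5 * x + 6, from = 0, to = 10, ylim = c(0, 60),
          xlab = "x", ylab = "y")   # plots y = 5x + 6
    points(4, 2)   # by contrast, the single point (4, 2) from option one fixes no line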
So, now let us look in more detail at what is happening in this equation. You have to keep in mind that your ultimate objective is to know the equation y = mx + c. Now, when I say that I want to know the equation y = mx + c, then it is equivalent to knowing m and c. If you tell me the values of m and c, then I know the entire line; if we know m and c, we know the entire line, whereas just by knowing one pair x and y, I do not know the entire line. So, in this situation, such quantities as m and c are called parameters, and the other quantities, on which we try to collect the values, are called variables. So, if I take the earlier example, the marks obtained in the examination depend upon some parameter m times the number of hours of study, plus c. So, we see that the marks and the number of hours of study are my variables, and m and c are the parameters. So, what we try to do is to conduct an experiment and collect the data on the variables, right! So, now I can solve the matching question and say that x and y are the variables and m and c are the parameters. So, if you try to see, what is the advantage, or how are you making the decision? The parameters are those values which give you the entire information about the model. So, whenever we say that we want to find out a model, in simple words I am trying to say that I want to know the values of the parameters. Whenever we hear a sentence like 'we want to construct a model', that is equivalent to saying 'I want to estimate, or I want to know, the values of the parameters on the basis of a given sample of data', right! So, now, how is this data collected, and how do we indicate it in our statistical language, how do we make symbols and notation for the data? We try to take an example.
Refer Slide Time: (20:29)
So now we take an example here; this is the same example which we considered about the quantity of fertilizer and yield. Suppose X is a variable which is denoting the quantity of fertilizer in kilograms, and Y is the variable which is denoting the yield of a crop in kilograms. So, what is being done here: we are trying to conduct an experiment and collect the data, so that we know what we have in our hands, and our objective is to find out the relationship between X and Y, that is, the quantity of fertilizer and the yield of the crop. So, now we conduct the experiment and collect the observations in the following way. Suppose I take a plot of some fixed size - yeah, don't change the size of the plot, keep it fixed - and then we put 1 kilogram of fertilizer in the field, and after some time we get, say, 6 kilograms of yield. So, this is going to give us the value of x1, and this 6 kg of yield is going to give me the value of y1. Similarly, I try to repeat the experiment, and on a plot of the same size I put 2 kilograms of fertilizer.
So, this quantity will be denoted as x2, which is the second observation on X; then after some time, suppose we observe that 7 kg of yield is obtained. So, in this case the value of Y for the second observation is denoted as y2, and similarly I can keep on repeating: for example, I can take 3 kilograms of fertilizer, whose value is going to be denoted by x3, and then suppose we obtain 6 kg of yield, and this value here is y3. So, you can see, we are going to obtain paired observations: when I give the value of x1 then I get the value of y1, when I give the value of x2 then I get the value of y2, when I give the value of x3 then I get the value of y3 and so on, and suppose we say that we have obtained n observations. So, all these paired observations are going to be denoted by (x1, y1), (x2, y2), up to (xn, yn), right!
Now, once we have obtained such paired observations (x1, y1), (x2, y2), …, (xn, yn), the first information is given by the graphical plots. So, we try to plot this data on a scatter diagram. For example, I have just made it here; these are the points which are indicating the data points. Suppose this here is x1 and this here is y1; so this data point is denoting the value of (x1, y1). Similarly, suppose this here is x2 and this here is y2; so this data point is the location of the point (x2, y2), and so on, right! So, now you can see that by looking at this graph you can decide whether there is going to be a linear trend or not, right! You can see here that the things are going in this direction, so there is the presence of a linear trend in the data, or there can be a nonlinear trend also, but my objective here is: by looking at the values of these observations, how to know the equation of this line, or how to know the equation of the curve, and this equation is going to be found in such a way that it is
representing the population. What is the meaning of this sentence? You see, whenever we are trying to make a model, the model is given for the entire population, and the problem is that we do not know the entire population, so we have to work on the basis of a given sample of data. For example, have you ever heard a statement like: this medicine controls the body temperature of Americans for seven hours, and the same medicine controls the body temperature of Indians for 10 hours, and the same medicine controls the body temperature of, say, German people only for five hours? It doesn't happen; medicine is medicine, and the effect of the medicine on similar types of persons will also be the same, and that will be valid for the entire population all over the world, right! But when we are trying to know the duration of the temperature control, we conduct an experiment by giving doses of the medicine to some people, we obtain the data, and then we try to find out the equation of the curve or the line, and based on that we make a conclusion, and this conclusion is valid for the entire
population; this is the entire process of modeling, but here in this course, we are going to find only the equation, right! For the remaining part, there is a course on linear regression analysis, and the tools of linear regression analysis give you all the information on how to construct a linear model, right! But here we are just going to concentrate on one aspect, okay.
So, now we try to take the same example which we had considered in the earlier lectures, and we try to see how to get this equation of the curve. You may recall that we had considered an example in which we had recorded the marks in the examination obtained by twenty students, out of five hundred marks, and the number of hours they studied in a week. So, this data is given here; for example, in the first row the marks obtained are given, and in the second row the number of hours studied by the corresponding students are given. Say this is student number one: he or she studied for 23 hours and got 337 marks out of 500, and so on. So, this data is here for twenty students: this is student number one, this is student number two, this is student number three, up to student number twenty, right! So, now we want to know what the relationship between the marks obtained and the number of hours studied in a week is. Although we know from our experience that the marks obtained by students increase as the number of hours studied increases, we would like to see whether this statement is correct or not, right!
So, suppose I have stored this data on marks inside a variable named marks, like this, in the R software, and similarly the data on the number of hours is stored inside a variable named hours, like this, inside the R software, and how this data is presented is what you have to see. So, we have collected this data, and this is how we are going to represent it: x1 is equal to 23 and y1 is equal to 337, so the data inside the hours and marks data vectors is given in the same order. You can see here, this is 23, 337; so 337 is in the first position of marks and 23 is in the first position of hours. Similarly, we have the second value, 25, for x2, and 316 for y2; this is again in the same order, 316 here and 25 here. So, the data for the paired observations is given in two different vectors, but the order of the observations remains the same in both the vectors; this is very important and you should keep it in mind. So you can see here, this 23 occurs here, 25 occurs here, 26 occurs here, 337 occurs here, 316 occurs here, 327 occurs here, and after this, these are the paired observations, 23 and 337, 25 and 316, and
this is what you have to keep in mind. Now, what is your first step? First of all, I would try to create a plot, which we actually already have done, by using the command plot with the arguments hours and marks, right! So, you get a plot like this, and you can see here that there is a sort of linear trend, and in case you make plots like the scatter smooth plot, they will also give a sort of estimated smooth line, but my objective here is to know how this line is created.
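For reference, a minimal sketch of the two plotting commands mentioned here, assuming the data vectors hours and marks are already stored in the workspace:

    plot(hours, marks)            # simple scatter plot of the paired data
    scatter.smooth(hours, marks)  # scatter plot with a smooth estimated line added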
So, before going into the details of this, let me give you a small piece of information on the notations. In the language of statistics, the variables are denoted by English alphabets like A, B, C, X, Y, Z and so on, whereas the parameters are indicated by Greek letters like α, β, γ and so on; that is our standard language. So, now take the equation which I had just expressed as y = mx + c; we had understood that m and c are the parameters. I try to represent it in the statistical way: we write the variable here as y, and instead of m and c I use β and α, so that the equation becomes y = α + βX. That is the standard notation when we are trying to say that the model is linear.
Now, we consider the same example and we move forward. So, you can see here, in the same scatter plot, these points, these small circles, are going to denote the data points, and by my experience I have drawn a line here, which is indicated in red color; I have done it manually, and I feel that this is a line which represents the relationship between the values of X and Y, and my objective is to know the equation of this line which is here in red color. So, suppose I denote the mathematical equation of this line by y = α + βX, and now you know why I am using here α and β and not m and c. So, in this equation y = α + βX, this X is going to denote the number of hours, which is here, and Y is going to indicate the marks obtained, which are plotted on the y-axis here, and now my objective is very simple: I want to find out the relationship between X and Y in terms of y = α + βX, and you also understand, as we have already discussed, that this line is known to us only when the parameters α and β are known to us. So, I can say that if we know α and β, then the entire equation will be known to us. So, now the fundamental question comes in front of us: how to know these α and β?
So, that is the objective which I am going to explain now, but before that, try to observe: suppose, in the same figure, I take this small section and I try to enlarge it. So, this is the figure here, which is simply the enlarged part of this circle. You can see that this point is here, this point is here, and so on; there are one, two, three, four points here, right! So, now, if you try to see what you are doing: you are saying that this red line is y = α + βX, and you assume that all the observation points should lie on this line; that is your idea, that all the observations should lie on this line, so that you can say that this is the mathematical equation between X and Y and this is how the experiment is being controlled. But in practice this will not happen; all the points are not going to lie exactly on the same line. So, you can see here, if this point is here, then you want, or you expect, that this point should lie somewhere here on the line, and similarly if you take another point, this point here, you expect that it should lie exactly on this line. So, you can observe in this graph that this is not really happening, and there is some difference between these two points, and that difference, or the deviation between, say, this point and this point, indicates a sort of error which is happening in our approximation, right!
Similarly, you can see that in all the other cases also there is an error: here, here, here and here. So, you can now notice that there is a deviation between the observed points and the corresponding points lying on the line, right! And one thing you have to observe here is that in this case there are two values, X and Y, one indicated on the x-axis and the other indicated on the y-axis. So, if you try to see, in the case where I am making this comparison with the line, the value of X remains the same and only Y is changing. Why is Y changing? You can see, one here is this y, the observed value, and another here is this Y, the value on the line. So, this difference is essentially what is called the error.
And now we would like to find out the values of α and β such that these errors are minimum, and you can see that these errors are occurring in each and every observation: here, here, here, here and so on. So, now the question is how to compile all these errors, so that by minimizing that quantity I can find out the values of α and β. So, one objective is to minimize the sum of such errors: I can simply measure each error and then take the sum of all the errors,
but you can see here that when we want to measure these errors, these errors are measured with respect to this line, this red line. So, this observation which I am indicating here has got an error e1, but it is above the line, and this second observation here, for which the error is e2, is lying below the line. So, now we need to account for the direction of the points, whether they are lying above the line or below the line. We assign that all the points which are lying above the line will be indicated by a plus sign, and all the observations which are lying below the line will be indicated with a negative sign. So, now, when I try to add all these errors, some errors are in the positive direction and some errors are in the negative direction with respect to the line, and hence when I try to sum them, their sum may become very close to zero or exactly zero, and this might indicate as if there were no error in the data; or, if this sum is very small, very close to zero, that would indicate that the amount of error in the data is very small, which is not correct. So, now I need to devise a methodology by which I can change the
sign, or get rid of the sign. So, I have two options: either I take absolute values or I square these errors. I opt here to consider the sum of squares of these errors; that is the better option. Why am I calling it the better option? It is just due to mathematical simplicity, I can say at this stage, and in this case I can find out clear-cut expressions for the values of α and β. In case you take absolute values, that is also possible, but that I am not discussing in this lecture.
So, now you can see what happens when you really try to represent the values of an observation. As we have discussed, there are two values corresponding to every observation: for example, this is the value here which I am observing inside the experiment, and these values correspond, say, to X equal to x1. So, this observed quantity will have the coordinates x1 and small y1. Now, I assume that this point should lie on the line, and this line is denoted by capital Y = α + βX. So, the corresponding point on the line is going to have the coordinates x1 and capital Y1, right! And there is some error in this data, and we express this error as e1. So, now, if I try to express this fact in general, I can say that every pair of observations satisfies an equation like yi = α + βxi + ei, where the ei are the errors, which can be in the positive direction or in the negative direction, and now we are going to find out the values of α and β such that the sum of squares of these ei is minimum.
So, now we take a call that we will try to find out the equation of this line - which line? this red colored line - on the basis of the given data set, using all the n paired observations on xi and yi, such that the line passes through the maximum number of points and the deviations from the line are as small as possible.
So, you can see in the same picture that this difference, or this error, or this deviation, is essentially the difference between this value, capital Y1, and this value, small y1, on the y-axis. So, this difference is denoted as yi minus capital Yi, and this difference between small y and capital Y can be positive or negative, but you have to use the same structure throughout: either you measure all of them by yi minus capital Yi, or all of them by capital Yi minus yi, right! So, now we will try to find out the values of α and β such that the sum of squares of these deviations ei is minimum.
How to obtain it, right! For that I can use the principle of maxima and minima, and we minimize the quantity S = Σ_{i=1}^{n} ei², which is the sum of squares of all the deviations ei, and this is denoted by capital S. So, now my objective is defined: I want to find out the values of the parameters α and β such that the line passes through the maximum number of the given data points and the sum of squared deviations, or errors, from the line is minimum, and this is called the method of least squares or the principle of least squares. The principle of least squares simply says: try to find out the equation of the line in such a way that the line passes through the maximum number of the given data points and the sum of squared deviations from the line is as small as possible. So, now, in order to find this out, we write down the sum of squares of errors like this, and you know that ei is given by ei = yi − α − βxi. So, I replace the ei here by (yi − α − βxi)², and now I have to use the principle of maxima and minima. For that I need to find the first order derivatives of this S with respect to α and β, I need to put them equal to zero, and then I need to check, by finding the second order derivatives, whether the maximum or the minimum has been achieved. So, I find the first order partial derivative of S with respect to α, this gives us this equation, and, following the principle of maxima and minima, I put this first order condition equal to zero.
So, now, in case you try to solve it, you get here Σ_{i=1}^{n} (yi − α − βxi) = 0. This quantity is simply nȳ − nα − nβx̄, right! Because x̄ and ȳ are defined as x̄ = (1/n) Σ_{i=1}^{n} xi and ȳ = (1/n) Σ_{i=1}^{n} yi, right! So, now you can solve this equation, and this gives you ȳ − α − βx̄ = 0, which gives us α = ȳ − βx̄, which is here. So, now, on the basis of the given set of data, I can find out the values of x̄ and ȳ, because we have the observations xi and yi, i going from 1 to n. So, this value of α is going to be known to us if β is known: ȳ is known from the sample data, x̄ is also known from the sample data, and β is unknown. So, now my next step is to find the value of β.
So, we follow the same process: we obtain the first order partial derivative of S with respect to β and set it equal to 0; this gives the first order equation, and if you solve it - that is pretty simple algebra - you will get the value of β like this, which is β = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)², and this is the value of β that can be obtained on the basis of the given set of data on xi and yi, right! So, this estimated value of β, which is obtained on the basis of the given sample, is denoted as β̂; we just write β and put a hat over it, right! So, now I can obtain the value of β as β̂ from the given sample of data, and I substitute the value β = β̂ in the equation for α. Once I substitute β = β̂, I get the value of α, which now I can find out on the basis of the given set of data; this is denoted as α̂. So, now you can see we have obtained the two values of the parameters, α̂ and β̂. So, α̂ is the value of α and β̂ is the value of β that can be obtained on the basis of the given sample of data.
Now, mathematically, the next issue is: how do I know whether these values of α and β are minimizing the sum of squared deviations or not? So, I find the second order partial derivatives of S with respect to α and β and substitute the values α = α̂ and β = β̂, and these values come out to be positive, which confirms that the minimum has been achieved; that you can verify yourself. So, now I have the values of α̂ and β̂ like this. This β̂ is given by this expression, which tells us the value of β on the basis of the given sample of data, and this β̂ is called the least squares estimate of β; this is based on the principle of least squares. Similarly, the value of α which is obtained here is α̂ = ȳ − β̂x̄, and this can also be computed from the given sample of data; this is called the least squares estimate of α. So, now you can see, the equation y = α + βx + e was our original model that we wanted to fit, and now we have found the value of α to be α̂ and of β to be β̂, and the error term now drops out, because we have obtained this equation in such a way that the sum of the ei² is minimum. So, that is why y = α̂ + β̂x is called the fitted model, right! And now we try to compute these two values, the values of α̂ and β̂,
on the basis of the example that we were considering earlier. So, you can see here, marks is your Y and the number of hours per week is your X. So, this is the value of y1, this is the value of x1, this is the value of x2, this is the value of y2 and so on; this is the whole data set. Now I try to compute the values of x̄ and ȳ: ȳ = (1/20) Σ_{i=1}^{20} yi comes out to be 389.9, and x̄, which is simply the sample mean of all the values of the xi's, comes out to be 35.1. Similarly, if you substitute all the values in the expression for β̂, this comes out to be β̂ = 6.3, and the expression for α̂ comes out to be 168.65. So, now you can see that your model becomes: marks = 168.65 + 6.3 × hours. So, this is your fitted model, and you can see that this model has been obtained on the basis of the given sample of data only, right! Okay.
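A minimal sketch of this manual computation in R (my own illustration; it assumes the 20 paired observations are stored in the vectors hours and marks in the same order):

    xbar <- mean(hours)   # comes out to about 35.1
    ybar <- mean(marks)   # comes out to about 389.9
    # least squares estimates from the formulas derived above
    beta_hat  <- sum((hours - xbar) * (marks - ybar)) / sum((hours - xbar)^2)
    alpha_hat <- ybar - beta_hat * xbar
    c(alpha_hat, beta_hat)   # about 168.65 and 6.3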
So, we stop here now in this long lecture, and you have seen how we have computed the values of the parameters α and β on the basis of the given sample of data, but we have done this computation manually. Now I will show you in the next lecture how to get all these values from the R software directly, but it is important for you to understand how the values inside the software are obtained and what the computations, the philosophy and the concepts used in the computation are. So,
you try to understand this concept and I will see you in the next lecture, till then goodbye.
Lecture – 34
Welcome to the lecture on the course descriptive statistics with R, and welcome to the last lecture of the course. Yes! Whenever the last lecture comes, this gives happiness to the students and to the teachers also. So, let us try to understand what we are going to do in this lecture. You may recall that in the last lecture we discussed the principle of least squares, and we also understood how one can obtain the estimates of the parameters, or the values of the parameters, on the basis of a given set of data in the case of a model y = α + βX. We had taken an example and we had solved it manually. The idea was to expose you to the basic concepts. Now, in this lecture we will learn how to obtain the same result using the R software, and after that, since considering only one input variable is not a very realistic thing, we will see: in case you want to extend the principle of least squares to the case where you have more than one independent variable, or more than one input variable, how to do it, and how to use the R software to obtain the least squares estimates of the parameters in that model. So, this is what we are going to do in this lecture. So, the first thing: what is the command for obtaining the least squares estimates, or fitting a linear model, using the R software?
So, in the case of the R software, the R command for fitting a linear model is lm; this is an abbreviation where 'l' means linear and 'm' means model, and this command lm is used to fit the linear model. Well, you can see here that this lm has various arguments and so on, but here we are going to use a very simple thing, just based on the formula, and you should know why I am not going to discuss all the details. I had explained to you in the earlier lecture that whenever we try to fit a linear model, that does not end the story. After that you have to check how the model is going to perform for the entire population. There are different types of statistical assumptions which are needed, and the estimated values are examined with different types of statistical tools, like tests of hypothesis, goodness of fit and so on. So, well, here I am not conducting the course on linear regression analysis, but I am simply trying to use one of the methods for finding out the model which is used in the case of linear regression analysis, and this command lm in the R software was developed in order to support the further statistical tools related to linear regression analysis. That is why you will see that in the command lm there are many, many arguments, but I will not go into those details; I request that if you want to understand them, first try to take a course on linear regression analysis and then try to understand those concepts. The interpretation of all the terms inside the arguments can be studied from the help on the lm command. Right.
So, now my objective here is to show you how to get the results, but briefly I will give you the idea of what we are going to use: there is an option here, 'formula'. Basically, we will be using the option formula, and this is going to give us a sort of symbolic description of the model to be fitted. Right. And the details of the model specification are given under separate headings. Next there is 'data': this is an optional argument, where the data is given in the structure of a data frame, or a list or an environment coercible by as.data.frame, and so on, but we are not going into that detail. Similarly, there is another option here, 'subset', which gives you the possibility of using an optional vector specifying a subset of observations to be used in the fitting process, but anyway, we are not going to use it; we are simply going to use the option 'formula'.
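Just to make these optional arguments concrete, here is a small hypothetical sketch (the data frame df and its columns are my own invention, not from the slides):

    # a toy data frame, only to illustrate the 'data' and 'subset' arguments
    df <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
    lm(y ~ x, data = df)                # 'data' tells lm where to find x and y
    lm(y ~ x, data = df, subset = 1:5)  # 'subset' restricts the fit to rows 1 to 5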
Refer Slide Time: (5:24)
So, now, how to use this formula I will try to show you with an example, the same example we had considered earlier, where we collected the data on 20 students, on their marks and the number of hours they studied per week. And for this data, if you remember, in the last lecture β̂ was obtained as 6.3 and α̂ was obtained as 168.65, and the model was fitted like this, but this was all done manually. Now I will show you how to do it in the R software. So, as we had discussed, we have already stored this data in the two data vectors marks and hours, and how to store this data I discussed in the last lecture. Right. All this data is coming in the same order, so that the observations are paired; I mean the first observation in hours pairs with the first observation in marks, and so on.
Now this is the most important part, how to give the command. First you write down the instruction lm, then write down here the output variable y, then use the tilde sign ~ (this is present on the keyboard of the computer which you are using), and after this you use the variable giving the input data. So y is your output variable here and x is your input variable, right? And they have to be given in this format. So, in our case our output variable is marks and the input variable is hours, and they are given in this framework, just joined by the tilde sign. Right, and the sign, I can make it bigger, something like this; you will see it on your keyboard. Now, once I do it, this will give me this type of outcome, so first we try to understand the meaning of this outcome. Once I say lm(marks ~ hours), this is going to indicate that our model is marks = α + β × hours + some error. Right, and this will inform the R software that we had obtained β̂ = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)². So this command will inform the R software that the values of the xi's are coming, for example, I will use a different colour, so these values of the xi's are coming from the data which is given here in hours, and similarly the values of the yi's for the computation are coming from the first data vector marks. Right, and so it uses the formula
marks ~ hours, and the outcome comes out like this. So you can see here this is written as coefficients; the coefficients means the values of α and β, and this is specifying the value of α, which is also called the intercept. So, this value is coming out to be 168.647, and the value of β is coming here; so this is the value of α̂ and this is the value of β̂. So, this value indicates the intercept term, and this indicates that 6.304 is the coefficient associated with the variable hours, and you can see these are the same values which you had obtained manually; you can compare here.
So, you can see here this is pretty straightforward to obtain such a result inside the R
software.
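A minimal sketch of this command, assuming the vectors marks and hours hold the 20 paired observations in the same order:

    fit <- lm(marks ~ hours)  # output variable marks, input variable hours
    fit                       # prints the coefficients: intercept (alpha-hat) and hours (beta-hat)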
Now I will try to show it on the R console, and this is here the screenshot of the same thing.
Video Start Time: (10:10)
So first I will try to prepare my data. This is the marks data which I have entered, and this is my data on hours which is entered here, so you can see here the marks data is like this, the hours data is like this, and now I am trying to find out the model: lm(marks ~ hours). So, you can see here that the outcome is coming out like this. Right? And here you have to be careful about how you specify x and y; for example, in case you make a mistake and write hours in place of marks and marks in place of hours, then the linear model will be fitted between hours and marks, and you can see here that this is entirely different from the first one. You can see the first value earlier was 168, but here the value is .22. This is a wrong thing to do, because in this case, when you are trying to fit a model like this one, you are saying that the number of hours is your output variable and your input variable is the marks, which is not correct in this case.
Video End Time: (11:50)
So now let us try to come back to our slides, and now I try to give you one more concept. Now you see, we have learned how to find out a model, or the least squares estimates of α and β, in the case of a linear model where you have only one input variable, but in practice, you can imagine that any process is not going to be controlled by only one variable; there will be more than one variable, and we had discussed such examples in the last lecture, for example, the yield of a crop will depend on several factors: quantity of fertilizer, quality of fertilizer, temperature, rainfall, irrigation and so on. So now, in case you want to extend this principle of least squares to the case when you have more than one input variable, how to do it? Well, this is a topic which is basically taught in linear regression analysis under the topic of the multiple linear regression model, but here my objective is not to teach you regression analysis; my objective is to show you how you can obtain the values of the parameters on the basis of a given sample of data using the R software in a case when you have more than one input variable. So in this case, I will take an example of the linear model and I will show you logically how you can extend the model which you have considered here into a multiple framework when you have more than one input variable, but yeah, this is not a very general technique that is valid for all the models; every model has its own way to extend it to a multivariate situation. Here, please try to learn that if I extend my linear model in this particular way, then how the R software can be used to find out the values of the parameters.
So now you may recall that we had considered the model here y = α + βx. Now I am saying that there you had only one input variable, but now I am trying to extend it, and suppose I say that there are more than one input variables. In the first case I had denoted the input variable by the notation x, so now suppose I say there are p input variables; I can denote them as x1, x2, …, xp. Right. Strictly, in the symbols and notation of regression analysis, these small x1, x2, …, xp should be capital X1, X2, …, Xp, but here my idea is something different from the objective in linear regression analysis. So I try to extend it, and since I am going to consider here an additive model, I write down a term like βx for each of the variables: for x1 this will become β1x1, for x2 this will become β2x2, and similarly for xp this will become βpxp, and then I add them together. So this is what I am writing: y = α + β1x1 + β2x2 + … + βpxp + e, where e is the
random error involved in the observations. Now, similar to the simple scatter plot that we had obtained in the last lecture, here in this case you have more than one input variable, so matrix plots are more useful in verifying whether the relationship between y and x1, x2, …, xp is linear or not. One thing which you have to keep in mind here is that in the earlier case we tried to establish the linear association between x and y, but now you have a group of x's, x1, x2, …, xp, inside one group, and there is one single output variable y, so we are trying to verify the linearity of a single variable y with respect to a group of variables x1, x2, …, xp. This is not so straightforward. One option is that I can make individual plots, y versus x1, y versus x2, y versus x3 and so on, and then conclude that if the relationship of y with respect to each of x1, x2, …, xp comes out to be linear, then we can expect that their joint effect may also be linear, but this is a little bit tricky situation and you need some experience to handle this condition; here I would just like to inform you how you can do it.
So now, first I will try to show you how to construct such a matrix plot. In this case, try to see how you obtain the observations: you conduct the experiment, and for every experiment you will have an observation on the first input variable, an observation on the second input variable, …, an observation on the pth input variable, and then the value of y. If you remember, earlier it was simply (xi, yi) because there was only one input variable, so here the observations are going to be (xi1, xi2, …, xip, yi). So, this is going to represent the ith tuple of observations on the x's and y. Right. So now we assume that each set of observations will satisfy this equation.
So we can write for the first set of observations that it satisfies the equation like this: instead of x1 I have x11, instead of x2 I have x12, …, instead of xp I have x1p; the associated random error here is e1 and the obtained value of y here is y1, and similarly we have obtained n observations which all satisfy this type of equation. Now, using a very simple theory of mathematics based on vectors and matrices, this entire set of equations can be expressed in the form of vectors and matrices. So this I am giving here; I am not explaining the theory of matrices here, but I assume that you know it. So all these observations are contained inside the n × 1 vector (y1, y2, …, yn)′, all the parameters α, β1, β2, …, βp are contained in another vector consisting of p + 1 elements, and the associated matrix of the data on the input variables x1, x2, …, xp is given here like this, where the first column 1, 1, …, 1 is indicating the presence of the intercept term. So now I can express this vector as y, this entire matrix as X, this parameter vector as β, and this entire error vector as e.
So now we have this model here, y = Xβ + e. Now, in case you try to use the principle of least squares: if you remember, in the last lecture we defined the sum of squares of the ei's as Σ_{i=1}^{n} ei², and this quantity can now be written as e′e, where e is a vector, so e′e here is (y − Xβ)′(y − Xβ). Now, in case you differentiate this quantity with respect to β and put it equal to 0, then after solving you get β = β̂ = (X′X)⁻¹X′y, which I am writing here. Right. I am not giving you the details here, but if you try to see, this is a simple extension of the least squares estimate that you obtained in the model with one input variable; now these things are replaced by matrices. So you can see here that X is the matrix of observations on the input variables, so this is known to us; similarly, y is the vector of the observed values y1, y2, …, yn. So this X and y are known, so I can compute the value of β̂, and this will be called the least squares estimator of the parameter β, and if β looks like we saw above, then β̂ will look like α̂, β̂1, β̂2, up to β̂p. So, this also has p + 1 elements.
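Although the slides carry the matrix algebra, the formula β̂ = (X′X)⁻¹X′y can also be typed directly in R. A minimal sketch with made-up numbers (the first two rows echo the example below; the rest are invented for illustration):

    # hypothetical data: two input variables and one output
    x1 <- c(7, 3, 3, 4, 6)
    x2 <- c(560, 220, 340, 80, 150)
    y  <- c(16.68, 11.50, 12.03, 14.88, 13.75)
    X  <- cbind(1, x1, x2)  # first column of ones carries the intercept
    beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^(-1) X'y
    beta_hat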
Now I will try to show you the computation of β̂ using an example, and I have taken just two input variables in this example to keep it simple. In this example, 25 observations have been collected on the time taken by a courier person in delivering parcels. So, it is recorded how much time the courier person takes in delivering the parcels, and obviously this time is going to depend on how many parcels are being delivered by the courier person and also on how much distance the courier person has to travel. So the number of parcels is going to be denoted by x1, the distance travelled by the courier person is denoted by x2, and whatever time is taken is denoted by y. So, the interpretation of the data goes like this: there is a courier person who has to deliver 7 parcels, and suppose the courier person travels 560 meters, and the total time in doing this job is 16.68 minutes. Similarly, there is another courier person who delivers three parcels; the person travels 220 meters and takes 11.5 minutes, and so on. So, this is how we have obtained these 25 sets of data, and all the data on y, x1 and x2 has been stored inside data vectors inside the R software. Right. As we have done many times in the past.
So essentially now we are going to fit here a model yi = α + β1x1i + β2x2i + ei, where the observations are indexed by i, and every observation will satisfy this model. All this data, in the same order, has been stored in three vectors: the time of delivery in deltime, the number of parcels in parcelno, and the travelled distance inside the data vector whose name is distance. Right. So once again I will explain: this first observation corresponds to the first observation of the parcel number and to the first observation in the distance which is given here, I say observation number one. Right. So, this data is given over there, and we already have it inside the R software.
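Looking ahead, the lm command extends naturally to this model with two inputs; a minimal sketch, assuming the three data vectors deltime, parcelno and distance described above are already in the workspace:

    fit2 <- lm(deltime ~ parcelno + distance)  # two inputs joined by +
    fit2  # prints alpha-hat, beta1-hat (parcelno) and beta2-hat (distance)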
Now I will first try to show you how to create a matrix plot. Right. What is a matrix plot? You see, when we discussed the simple case where I had only one input variable and one output variable, I could make a scatter plot, and a scatter plot gives us an idea of how the things look, whether there is a linear trend or a nonlinear trend, but definitely, when we have more than one input variable, what we are looking at is the joint effect of one variable y with respect to a group of variables x1, x2, …, xp. Finding out such a curve is more difficult, so what we try to do is to create the scatter plots pairwise between y and x1, y and x2, y and x3 and so on, and then what we interpret is that if all the relationships, that means the relationship of y with respect to each of x1, x2, …, xp, are linear, then we can expect that the joint association between y and all of x1, x2, …, xp is also linear. Well, you need some experience in interpreting such results, but here I would like to show you how to create the plot.
The command for creating the matrix plot is pairs, p a i r s, and inside the argument we give the data vectors along with some more information. Another, more general structure is pairs where, inside the argument, we give a formula, just as in the case of linear models with the function 'lm', and after that there are different options: data, subset, na.action and so on. So here you can see I have given the details: x gives the coordinates of the points, given as the numeric columns of a matrix or a data frame. Similarly, formula is the same thing that we discussed in the case of the linear model: we write the formula with a tilde sign (~) followed by the variables, separated by plus signs. Right. Each of these terms gives a separate variable in the plot, and data is the data frame from which the data for the formula has to be taken. Right. Then there are options like subset and na.action, exactly in the same way as we used them in the case of the linear model.
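As a small sketch of both calling styles just described, assuming the three data vectors deltime, parcelno and distance are already in the workspace:

    # Matrix plot from the numeric columns of a matrix
    pairs(cbind(deltime, parcelno, distance))
    # Equivalent formula interface, with a main title for the graph
    pairs(~ deltime + parcelno + distance,
          main = "Matrix plot of delivery time data")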
Refer Slide Time: (27:57)
Now I will create a matrix plot of the given data set just by writing the formula. You see, I am writing pairs, and inside the argument I write the formula: a tilde sign followed by the three variables whose scatter plots I want, namely delivery time, parcel number and distance. So I am not discriminating here between input variables and output variable. You can get more information about the command pairs from the help, but here I am using one option, 'main', to give the title of the graph. So, you can see here this is my graph, and the title is the same, "Matrix plot of delivery time data", which is printed here. I will show you this on the R console also, but first let us try to understand this picture. The matrix plot looks like this, so first try to understand the graphic. What is there on the y-axis? It is the same thing which is written here. And what is on the x-axis? I will use a different color so that you can observe that it is the same thing which is written here. So in this box, the x-axis denotes the parcel number and the y-axis denotes the
delivery time, and this is the scatter plot of two variables, parcel number versus delivery time, or the number of parcels versus the delivery time, and you can see here that this trend is nearly a linear trend, and a positive trend rather. Similarly, if you come to this block, what is mentioned here is the same thing which is written here, say distance. So this is the plot for distance, and what is happening on the y-axis? That is the delivery time; you have to take the y-axis label from the y-axis side and the x-axis label from the x-axis side. So this is a graphic of the distance travelled by the courier person versus the delivery time, and you can see here that this curve also comes out to have a sort of linear trend. Similarly, if you look at the third case, here on the x-axis we have the distance and on the y-axis we have the number of parcels.
So this is a graph of the distance versus the parcel number, or the number of parcels, and you can see that here also, if you look at the trend, it again shows a nearly linear trend. Then, if you observe the direction of my pen, it is crossing this matrix diagonally. What I have shown you so far are the plots in the upper diagonal, on this side. Now, what is happening in the lower diagonal? It is the same thing. For example, these two blocks will match, these two will match, and these two will match. So usually we look either at the upper diagonal or at the lower diagonal, as they give the same information. Okay, and yeah, some of these blocks, which I am indicating here by a cross, are not used. Why? Because such a block would be a plot of, for example, delivery time versus delivery time, which has no meaning. Right.
So, by looking at this matrix plot, we can get an idea of whether the joint relationship of y with x1 and x2 is linear.
Refer Slide Time: (32:07)
So in this case I can take a call: yes, it is approximately linear. Well, in this case I will show you one thing more. Since we are dealing here with only two input variables, and I want to see the joint effect of y with respect to x1 and x2, there are three variables in all, y, x1 and x2, so I can also take the help of a three-dimensional plot. If you remember, we discussed earlier a three-dimensional plot using the command scatterplot3d. So we try to use it here, though this is not always possible, because sometimes the data has more than three dimensions; it is your judgment what you want to do. If you want to use the command scatterplot3d, then first you need to load the library, so use the command library with scatterplot3d inside the argument, and now I plot these three variables, delivery time, number of parcels and distance, with scatterplot3d, and the plot comes out like this. So, this is giving you a sort of plane here, and you can see that most of the points are lying close to this plane. Now you can also use an option to change the direction of
this cuboid. For example, in the next command I use the option angle = 120, and I rotate this figure by 120 degrees. So you can see here that the axes have changed, and now I am looking at a different structure, something like this. So, just by looking at different types of pictures you can finally conclude whether you want to fit a linear model to this data or not.
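A minimal sketch of the plotting commands just described, assuming the scatterplot3d package was installed earlier and the three data vectors exist in the workspace:

    library(scatterplot3d)                      # load the already installed package
    scatterplot3d(parcelno, distance, deltime)  # default view of the three variables
    scatterplot3d(parcelno, distance, deltime,
                  angle = 120)                  # the same plot rotated to 120 degrees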
Now we have concluded, on the basis of the given set of data, that yes, we are confident a linear model can be fitted. So now we use the R command. You can see here we are interested in the model with two variables x1 and x2 and coefficients β0, β1 and β2. So I need to estimate three parameters, β0, β1 and β2, and I will use the same command that we used earlier, lm, but now with a small change: whatever my output variable is, I give it first as such, and then comes a tilde sign (~), which indicates that the variables to
be used as input variables are starting. So now I am using two variables, named parcelno and distance, and they are given in the format where they are separated by a plus sign. If you have more variables, suppose the model is y = β0 + β1x1 + β2x2 + β3x3 + β4x4, then all those variables are added in the same way, as x1 + x2 + x3 + x4 and so on. The formula remains the same, and if you run it in the R software, you will get this type of outcome. So this is giving us the fitted model; this is essentially saying that y is equal to β̂0 + β̂1x1 + β̂2x2, and the outcome has to be read like this: these three values are the values of the coefficients associated with these terms. So, 2.19579 is the value of the intercept term, which in our symbols is β̂0; 1.67803 is the value of the coefficient associated with parcelno, which is actually β̂1; and the third value is the coefficient associated with the variable distance, which in our symbols is β̂2.
So, my fitted model becomes deltime = 2.196 + 1.68 × parcelno + 0.013 × distance, that is, delivery time = 2.196 + 1.68 × (number of parcels) + 0.013 × (distance travelled). So, you can see that we have now obtained a model with two input variables, and the same story continues with more than two variables. I will show you these things on the R console also.
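Putting the pieces together, here is a minimal sketch of the fit just described; the coefficient values quoted in the lecture are reproduced as comments and assume the full 25-observation data set is loaded.

    # Fit the linear model: delivery time explained by parcels and distance
    fit <- lm(deltime ~ parcelno + distance)
    coef(fit)
    # With the full course data, the lecture reports:
    # (Intercept)   parcelno   distance
    #   2.19579     1.67803    0.01311
    # A fitted value for, say, 7 parcels and 560 meters is then
    # 2.19579 + 1.67803*7 + 0.01311*560, i.e. about 21.28 minutes:
    predict(fit, newdata = data.frame(parcelno = 7, distance = 560))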
And you can see here the screenshot of what I am going to show you now.
So, let me first create this data vector. You can see here, and then similarly I create the data vector on the parcel number, it is here, and similarly I create the data on the distance travelled, like this. So, you can see here this is my data: deltime like this, parcelno, the number of parcels, like this, and distance like this. Now I first clear the screen so that you can understand what is going to happen. I first create the matrix plot: I copy this command, you can see what I am doing, I copy and paste this command, and you get this type of plot, which is the same plot I just explained to you inside the slides. After this I obtain the scatter plot in three dimensions, so first I need to load the library, so I load the library scatterplot3d. We had already installed this library earlier, so it is already there, because you need to install a package only once, and after that, whenever you want, you just load the library. So you can see here this is the scatterplot3d, and if you change the angle, say angle = 120, it gives us this changed picture, and if you make it, say, angle = 90, it gives you a different picture like this one. So, by using these things you can get an idea of what you are really looking at. Okay. Next I fit a linear model with this data set. Right. I close these pictures, clear the screen by Control+L, and paste the command over here, and this gives me the output. So you can see here these are the values of the coefficients that we have obtained: this value 2.19 is the least squares estimate of the intercept term, this 1.67803 is the least squares estimate of the coefficient associated with the variable number of parcels, and this 0.01311 is the value of the least squares estimate of the coefficient associated with the variable distance.
And this same screenshot has been given in the slides also. So now I would like to stop this lecture here. I have tried to give you the idea of the principle of least squares and how to implement it inside the R software. Sometimes there are nonlinear curves, and if it is possible to make a transformation to change that nonlinear form into a linear form, say by taking a logarithm or an exponential, then you can use the command 'lm' to find the least squares estimates in the transformed linear model that was originally nonlinear. But in that case you have to keep in mind that if the original data is given as y, and you transform the variable into log of y so that the curve becomes linear, then you need to input the data on log of y. These things you have to keep in mind, and using this technique you can obtain the least squares estimates very easily using the R software. But definitely, you will need to learn more things if you want to go further, which is usually not possible for me to do in the same course.
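As a sketch of the transformation idea just mentioned, assuming a hypothetical response y whose relationship with an input x becomes linear on the log scale (these variable names and values are illustrative, not from the course data):

    # If y = a * exp(b * x), then log(y) = log(a) + b * x is linear in x.
    x <- c(1, 2, 3, 4, 5)                  # illustrative input values
    y <- c(2.7, 7.4, 20.1, 54.6, 148.4)    # roughly exp(x), for illustration
    fit_log <- lm(log(y) ~ x)              # least squares on the transformed data
    coef(fit_log)                          # estimates of log(a) and b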
But now we come to the end of this course; this was the last lecture, and I hope you enjoyed it and understood it. Well, I am saying that this is the end of the course, but practically, for you, this is the beginning of a new course. I have given you only the basic fundamentals. I have told you very basic things, but believe me, these are the things which are needed everywhere.
If you want to conduct a Monte Carlo simulation, find a good model, or do any data mining or anything else, the first step is the tools of descriptive statistics, and even inside those things, for example in data mining, you use the tools which we have discussed here in this course on descriptive statistics. You have seen that one tool will not give you complete information about the entire data set; there are different types of features hidden inside the data. Data is so naive that it will not tell you "I have this, I have this, I have this"; you are the only one who has to use different types of tools to get the information from the data, and it is very important that you use the correct tool on the correct data. If you use the wrong data with the correct tool, or the correct tool on the wrong data, you will not get a correct statistical outcome, and then people make up different types of stories that statistics is not telling the truth, because they do not know what the appropriate tool is for the given data. So my request is: please try to understand these topics in more detail before you apply them to a real data set. Definitely you also need to study a book which will give you more properties and more applications of these tools before you become eligible to use them in a real situation. And in a real situation you cannot say that you will use only the measures of central tendency, or of variation, or of association; all the tools can be applied to any data set, but it is only you who has to take a call, take a correct decision, about which of the tools have to be used in a given situation, whether graphical, analytical, or a combination of them. So, learn more statistics and enjoy the course. I will take a leave and see you sometime soon once again; till then, goodbye, and God bless you.
THIS BOOK IS NOT FOR SALE
NOR COMMERCIAL USE