
DESCRIPTIVE STATISTICS WITH R SOFTWARE

Prof. Shalabh
Mathematics
IIT Kanpur
INDEX

S. No.  Topic                                                                  Page No.

Week 1
1       Introduction to R Software                                              1
2       Basics and R as a Calculator                                           18
3       Calculations with Data Vectors                                         35
4       Built-in Commands and Missing Data Handling                            53
5       Operations with Matrices                                               83

Week 2
6       Objectives, Steps and Basic Definitions                               109
7       Variables and Types of Data                                           131
8       Absolute Frequency, Relative Frequency and Frequency Distribution     161
9       Frequency Distribution and Cumulative Distribution Function           187

Week 3
10      Bar Diagrams                                                          225
11      Subdivided Bar Plots and Pie Diagrams                                 265
12      3D Pie Diagram and Histogram                                          289
13      Kernel Density and Stem-and-Leaf Plots                                324

Week 4
14      Arithmetic Mean                                                       350
15      Median                                                                374
16      Quantiles                                                             391
17      Mode, Geometric Mean and Harmonic Mean                                416
18      Range, Interquartile Range and Quartile Deviation                     447

Week 5
19      Absolute Deviation and Absolute Mean Deviation                        483
20      Mean Squared Error, Variance and Standard Deviation                   527
21      Coefficient of Variation and Boxplots                                 562

Week 6
22      Raw and Central Moments                                               608
23      Sheppard's Correction, Absolute Moments and Computation of Moments    627
24      Skewness and Kurtosis                                                 653

Week 7
25      Univariate and Bivariate Scatter Plots                                680
26      Smooth Scatter Plots                                                  701
27      Quantile-Quantile and Three-Dimensional Plots                         723
28      Correlation Coefficient                                               746
29      Correlation Coefficient Using R Software                              773

Week 8
30      Rank Correlation Coefficient                                          792
31      Measures of Association for Discrete and Counting Variables (Part 1)  814
32      Measures of Association for Discrete and Counting Variables (Part 2)  837
33      Least Squares Method - One Variable                                   883
34      Least Squares Method - R Commands and More than One Variable          913
Lecture 01

Introduction to R Software

Welcome to the course on Descriptive Statistics with R Software. This is our first lecture. I believe that the candidates attending this course have a basic knowledge of the R software; in case you do not, there is another MOOC course, Introduction to R Software, and I would request you to go through those lectures. But in order to have a quick review of the mathematical tools, statistical tools and the commands which we are going to use in this course, I will demonstrate them here in the next couple of lectures. So, the next couple of lectures will be devoted to an introduction to the R software, and the tools which we are going to use in the statistical part will be demonstrated quickly.

Refer Slide Time: (01:10)
So, the first question is: what is this R software, why is it so important for us to use, and what is the role of the R software in mathematics, statistics and the other basic sciences? We all know that the use of software is desirable, and moreover it is an essential part of any statistical or mathematical analysis; rather, I would say that any analysis is incomplete without the use of mathematics or statistics. So you definitely need to learn how to use such software in all the basic sciences, whether physics, chemistry, mathematics, statistics, or any discipline in the sciences, arts, medical sciences, engineering sciences or anything else. Now I would like to concentrate here on statistical software. There are different types of statistical software available: for example, there is SPSS, there is SAS, there is Minitab, there is Stata, there is MATLAB, and on the same lines there is another software, which is R.

Refer Slide Time: (02:23)
Now the question comes: what is the difference between R and the others, and how did R come into the picture? R was developed by a team called the R Development Core Team, and the software is freely available on the website www.r-project.org. It is free software; this is its biggest advantage. Most other statistical software is paid software, and sometimes it is difficult for a common person to afford it. So R gives us a solution, and it gives us a very strong edge over the other software: not only is it free, it also supports many free packages. By free packages I mean that in R there is a base package which includes most of the commonly used features, but there are also specialized tasks which can be completed using the additional packages available with the R software.

Refer Slide Time: (03:46)
That is why R is very popular. R is an environment for data manipulation, statistical computing, graphical display and data analysis, and one can think of other uses as well. It gives us an effective way to handle data, and storage of the outcomes is also possible. The storage can be done one value at a time, or inside a loop, or in other, more automated ways: you run the program, and the outcome of the analysis is stored in some file which can be exported to different types of software. Simple as well as complicated computations are possible, and so are simulations, like Monte Carlo simulation in statistics. If you are involved in research, it is very difficult to survive without doing simulations: giving the mathematical aspect and the statistical aspect alone is not sufficient; you also need to do Monte Carlo simulations to demonstrate the utility and application of your tools using data. R supports this as well. And, as you know, most software produces two types of outcome: graphics as well as numerical values. The numerical values, as I already told you, can be stored in a file, and similarly the graphics can also be stored. There are two options: the graphics can be displayed on screen,

Refer Slide Time: (05:35)
as well as stored, and hard copies are also possible. R also has a programming language: there are not only built-in functions that can be used, but you can also create your own programs. This is the main advantage: you can use the built-in functions as well as create your own functions using the programming language. R has its own programming language, different from other languages; although it is very similar to them, it has its own syntax and commands. And the R software is basically a statistical computing environment: there are many, many things in statistics, and also in mathematics, which are directly supported through the built-in functions or through the contributed packages, and the environment which R provides is very suitable for statistical and mathematical calculations and computations, and for getting graphical output.

Refer Slide Time: (06:43)
As I said, the biggest advantage of the R software is that it is free and open source. What does this mean? It means R is not like a black box: if you really want to see what is happening inside a program, you can just open it and get an idea of how R is computing something. As I said earlier, some packages are built in, already embedded in the R software, and there are also contributed packages. Contributed packages mean that somebody doing their own research develops something new; they can write a program, and after scrutiny, checking and verification by the R Development Core Team, the program can be uploaded on the R website, and then anybody can download it and use it in their own research work. So in R it is also possible to contribute our own packages, and this gives us a very strong advantage over the other software; this type of facility is not really available in many of them. All the commands that we use can be saved, and their output can be used later, and so on. The R software is available for all common operating systems, such as
Refer Slide Time: (08:19)

Windows, Unix, Linux and Macintosh. Whatever graphics you produce can also be stored in different formats: the most common formats are PostScript and PDF, although graphics can also be stored in JPEG format, or copied directly from the software and pasted into other software such as MS Office or MS Word, and so on.

Refer Slide Time: (08:46)
Now, after this brief introduction to R and its advantages, I will show you how you can install R on your computer. In order to install the software, you have to go to the website. Once you go there, you will see a sort of screenshot of the webpage, shown here, and there are links: you can come through this link, or through this CRAN mirror, and you simply click there. After that the software will be downloaded, and once you download it you simply have to click on the installer; it will ask you for various options, you simply make a few clicks, and the software will be installed on your computer. Now I will come to the desktop and show you how things happen. You can see here there is an R icon; I simply double-click on it, and you get this type of window. In order to make it clearer, I will increase the font size so that you can see clearly whatever I am doing. You will get this type of screen, and the screen you are seeing here is called the 'console'. Pressing Ctrl + L, that is, the two keys Ctrl and L together, will clear the screen, and we will use the console for executing different types of commands. So now let me come back to the slides, and I will show you what is really happening.

Refer Slide Time: (10:51)

If you look here, this is a screenshot of the same console which I just showed you in the R software. You can see there is a sign here, something like greater than (>); this is the place where you write all the R commands. You type all the commands there, and this place is called the command line. That will be our common terminology: come to the console, or type the command on the command line, and so on.

Refer Slide Time: (11:27)
So you need to be familiar with this. What do we do? Whenever we want to execute a command, I go to the command line, type it there and press Enter, and that will execute the command. When you are trying to work with the R software, there are two options. The first option is to use the R software directly: you type your commands, store your commands, and see the outcome directly on the R console. The second option is that there is other software available, free software, which acts as an interface between R and us. For example, one such software is RStudio, and another is Tinn-R, and there are similar ones as well. Using RStudio or Tinn-R or similar software will help you more in executing the commands of R and getting the outcome. But here my objective is not to teach you that software, so I will be working only on the console, and I will leave it up to you which software you want to use. The RStudio software is available at the website www.rstudio.com, and similarly the Tinn-R software is available at its own site. If you wish, you can just download one, install it on your computer, and start working with it.

Refer Slide Time: (12:58)

Now I will quickly go over the commands and tools which we need in order to work in statistics using the R software. As I told you, in R there is a base package. The base package contains the essential libraries which are common among users and are required for statistical work. Some libraries are part of the base package, and some libraries have to be downloaded from the website; such libraries are needed to execute particular types of tasks. So first I will demonstrate how to install a package and bring it to a platform where it is available to use on any data set. The first command which I would like to illustrate is install.packages. install.packages is a function which is used to download libraries, and its syntax is as follows. Suppose I want to download the package ggplot2, which is used for graphics: I have to type install.packages and, inside the brackets, within double quotes, write ggplot2, the name of the package which I want to install. Similarly, if I want to install the package graphics, this can be done by typing install.packages and, inside the brackets, within double quotes, graphics. And if I want to install another package, say cluster, I simply type install.packages and, inside the brackets, within double quotes, write cluster. Once you press Enter, the R software will guide you through the installation: you simply have to choose a mirror, click a couple of times, and the package will be installed. I am not demonstrating it here; I would like to leave it up to you to practice, for example with the commands sketched below. After installing a package, suppose I want to use it.
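As a minimal sketch, here are the install commands named above, as they would be typed at the R prompt (an internet connection and a CRAN mirror are needed for the download):

    # download and install packages from CRAN (run once per machine)
    install.packages("ggplot2")    # graphics package
    install.packages("cluster")    # cluster analysis package
    # the graphics package mentioned above ships with base R and needs no install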

Refer Slide Time: (15:44)
Then I need to use another command: library. The syntax is very, very simple: it is simply library, and inside the brackets you write the name of the package which you would like to use. For example, if I want to use the package cluster, then I have to load the library as library and, inside the brackets, cluster. Similarly, if I want to use the package ggplot2, then I write the command library and, inside the brackets, ggplot2; and if you want to use the package graphics, again just write library with graphics inside the brackets. So this is how you install a package, and this is how you load the library before you start using it. There are some libraries which come as part of the base installation of R. For example, there is one very popular library called MASS, M A S S in capital letters. MASS stands for Modern Applied Statistics with S-Plus, a book written by Venables and Ripley: M from Modern, A from Applied, S from Statistics and S from S-Plus. S-Plus is a software which is very similar to R. So if you want to use the data sets or the functions from that book, you have to load the package MASS by writing the command library and, inside the brackets, capital M, capital A, capital S, capital S; and you do not need to download it, because this package is already available with the base installation. Similarly, there are different libraries for doing specialized jobs. For example, if you want to use generalized additive models, there is another library, mgcv, and in order to use generalized additive models you use the command library with mgcv. So this is how you can install a package and use a library. And in case you want some help, or want to know what the contents of a library are, what is contained in it, then you simply have to use the function help, H E L P.
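As a quick sketch of the loading step (package names as used above; MASS and mgcv ship with standard R installations):

    library(cluster)   # load an installed package for the current session
    library(MASS)      # Modern Applied Statistics with S-Plus
    library(mgcv)      # generalized additive models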

Refer Slide Time: (18:13)
Suppose I want to know the contents of a library, say cluster. I simply have to type library and, inside the brackets, help = followed by the name of the library, which here is cluster. Once you do this, it will give you different types of information: what the package is, what the version is, what the date is, when it was incorporated, what the priority is, who wrote it, and much more such information will be available.
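A minimal sketch of this help query (the exact fields printed depend on the installed version):

    library(help = "cluster")   # show the description and index of the cluster package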

Refer Slide Time: (18:53)
For example, you can see here a screenshot of the same thing. To show you, I can simply copy this command onto the console, and you can see these are the details about the package cluster: there is a lot of information, complete information, about the package. This is what I meant when I said that R is not a black box: you have the complete details. And if you want to know the programming behind a package, you can find that out as well. So let me come back to the slides; here you can see the screenshot which I showed you.

Refer Slide Time: (19:51)
After this, one very important final word about using the R software. Whenever you start new programming, you will always be using some variables, given names like x, y, z, a, b, c and so on. It is possible that today you write a program in which you use the variable names x and y, and after some time you write another program in which you use the same names x and y. What will happen? As soon as you define the new variables x and y, the earlier x and y will be overwritten. It is also possible that a friend of yours comes to the same computer and defines x and y in a different way, so that the x and y you had defined earlier are lost. So it is always a better option to remove the variable names before you leave, or before you start a new program. In order to remove a variable, the command is rm, and inside the brackets you write the name of the variable. For example, if I have three variables, x, y and z, then I simply need to write rm and, inside the brackets, x, y, z. Once you do this, it will remove the variables x, y and z. You can remove one variable at a time, or more than one variable at a time.
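A small sketch of clearing variables (the ls() call, which lists what is currently in the workspace, is an addition beyond the lecture):

    x <- 1; y <- 2; z <- 3
    rm(x, y, z)        # remove the three variables
    rm(list = ls())    # or remove everything currently in the workspace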

Refer Slide Time: (21:36)

This is your choice. And once you are done, if you want to close the session, if you want to quit R, then the command is q followed by a pair of parentheses. Once you type q with the opening and closing brackets, you will come out of the R program. So now I will stop here, and I would request you to please have a quick look at the basics of R and its commands; in case you have gone through them earlier, that will help you, and I will see you in the next lecture. Till then, goodbye.
Lecture 02

Calculations with R Software – Basics and R as a Calculator

Welcome to the next lecture on Descriptive Statistics with R Software. You may recall that in the earlier lecture we had a quick review of the R software and the related information. Now, in this lecture and in the next couple of lectures, I am going to give you an idea of the basic mathematical operations in the R software. These are the operations which we are going to use again and again in the forthcoming lectures. So this will be a quick introduction to the minimum required mathematical operators and how to use them in R, in this lecture and in the next couple of lectures.

Refer Slide Time: (00:59)

You can see here that as soon as you start the R software, you get on the console a sign like the greater-than (>) sign. I explained this in the earlier lecture: this is the prompt sign on the console, after which you write the commands for their execution. You type the command after the prompt sign and press Enter, and the command will be executed. One thing I would like to make clear here: when we assign a value to a variable in mathematics, we always write, suppose I want to assign the value 2 to a variable x, x = 2. The same thing is followed in R also: if I have a variable x and I want to assign it the value 2, then I write x = 2. But before that, let me explain one thing more. When R started, the initial assignment operator was not this equality sign but the less-than (<) and hyphen (-) signs together. So in the older versions of R, to assign a value, one had to write less-than followed by hyphen.

Refer Slide Time: (03:08)

For example, if I want to assign the value 40 to a variable x, then I need to write x <- 40, like this. But in the recent versions of the R software the equality sign is also accepted, so I can write either x = 40 or x <- 40; both are acceptable in the recent versions of R. You just have to keep this in mind. Now I will show you how these things happen and what it means when I say that I am assigning the value 40 to x.

Refer Slide Time: (03:10)

Suppose I copy this command and paste it on the R console. You can see that when I then type x, it gives me x = 40. And similarly, if I write x = 40 using the equality sign, then x also gives me 40. Or I can take another example, just to discriminate between the two: if I write x = 20, you now see that x becomes 20.
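In sketch form, the two assignment styles at the prompt (the [1] printed by R marks the first element of the result):

    x <- 40   # old-style assignment operator
    x         # [1] 40
    x = 20    # equality sign, accepted in recent versions
    x         # [1] 20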

Refer Slide Time: (03:36)
Once you are assigning a value to a variable, there are two options: you can assign a numeric value to the variable, and you can also assign a variable to a variable. For example, suppose I assign x = 40, and now I say I will multiply this x by 3 and store the outcome in another variable, y. Then I can write y = x*3; the star (*) is the multiplication operator, which I will explain again after a couple of slides, and similarly the minus (-) sign is the difference operator. So if I want to multiply x by 3 and store the result in a variable y, I write y = x*3. And similarly, if I want to find the difference between x and y, that is, x - y, and store the value in another variable z, then I write z = x - y, as here. I can show you on the R console how this happens.

Refer Slide Time: (05:03)

Suppose I take x = 20 and then y = 3*x, 3 multiplied by x. If I display them, the value of x is 20, and the value of y comes out to be 20 into 3, which is 60. And suppose I find the value of x - y and store it in another variable, z. If I write z = x - y and look at the value of z, it comes out to be minus 40, that is, 20 - 60. So you can see here how we assign or store a value inside a variable.
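The same steps in sketch form:

    x <- 20
    y <- 3 * x    # store a computed value in y
    z <- x - y    # 20 - 60
    z             # [1] -40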

Refer Slide Time: (05:48)

Now, if you look here, there is another symbol: the hash (#). The hash is used to indicate that the given text on the command line is only a comment; it is not to be executed. Once you write the hash followed by anything you like, any command or any syntax, nothing is going to be executed: it is taken as a comment, and all mathematical operators after it are ignored.

Refer Slide Time: (06:20)

Let us consider x = 20 and y = 40, and once again I define z = x - y, which gives me the value -20, that is, 20 - 40. Now I define another variable, m, but I put the hash (#) sign first and type # m = x - y. You will see that m has no value; asking for it gives an 'Error', because the line has not been executed: it is stored on the R console only as a comment, and a comment cannot be used. The use of the comment sign is this: when we write a longer program, we explain the features and the names of the variables inside the program, so that after some time, when we have forgotten about it, by looking at the comments I can recall what was the meaning of x, what was the meaning of y, or what was the meaning of z.
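A small sketch of the comment behaviour:

    x <- 20; y <- 40
    z <- x - y      # executed: z is -20
    # m <- x - y    # a comment: nothing is executed
    m               # Error: object 'm' not found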

Refer Slide Time: (07:28)
Now, the next aspect. A very important point in using R is this: there is a difference between lowercase and uppercase letters. For example, if I use a small x, say x = 40, and I also write capital X = 40, then these two are different. I can show you with an example on the R console itself.

Refer Slide Time: (08:01)
For example, suppose I use the command x = 40. You can see here that the value stored in x is 40. But now if I set capital X = 70, then typing capital X gives me 70, and if I recall the value of small x, it is still 40. I can even find the value of X - x, capital X minus small x, which comes out to be 70 minus 40, that is, 30. So you can see that lowercase and uppercase letters are interpreted differently.

Refer Slide Time: (08:44)

This is the screenshot of the same thing which I have done; I have used different values, but the idea is the same. Now, another point: in statistics you will always be dealing with data, which is why you are doing a computation. So in statistics, whenever we want to input data into R, we always have to write the letter c.

Refer Slide Time: (09:12)
Suppose I want to enter the values 1, 2, 3, 4, 5. Then I need to input the data in R by writing c, small c, not capital C, and then 1 comma 2 comma 3 comma 4 comma 5: all the values are separated by the comma (,) operator. If you do not use the c command, it will give you different kinds of trouble. For example, I can show you on the R console.

Refer Slide Time: (09:43)

Let me type x = 1, 2, 3. You will see that this gives an error. But if I use x = c(1, 2, 3), then it gives me x as 1 2 3. So always remember to enter the values of the data using the c operator.
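In sketch form:

    # x <- 1, 2, 3 gives: Error: unexpected ','
    x <- c(1, 2, 3)   # c() combines values into a data vector
    x                 # [1] 1 2 3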

Refer Slide Time: (10:17)

Now I come to the basic operations. If you want to do addition, the operator is the plus (+) sign, the usual mathematical sign. First I will show you the outcome, and then I will show you on the R console as well. Suppose I want to add 5 and 3: I simply have to type 5+3, and this will give me the value 8; this is the screenshot. Similarly, if you want to multiply 5 and 3, the multiplication operator is the star (*), so I simply write 5*3 and this gives me the value 15; here is the screenshot. Similarly, if you want to subtract two values, the operator is the same as in usual mathematics, the subtraction sign (-), the hyphen: if I want to subtract 3 from 5, I write 5-3, in the same way as we do in usual mathematics, and this gives me the value 2; here is the screenshot. And if I want to use division, the division operator is the slash (/): if I want to divide 5 by 2, I simply write 5/2, and you can see the answer comes out to be 2.5; here is the screenshot of the operator.

Similarly, suppose I want to write 3 squared. In this case I have two options: I can write 3 ^ 2, using the caret (^) sign, or I can write 3 ** 2, using the double star. You can see I have used two commands here, and both indicate the same value, 3 squared. So whenever you want to write a mathematical expression involving a power, there are two options: one is to use the caret (^) sign, and the other is to use the double star (**) sign. In both cases 3 squared gives 9, and these are the two operations which I will show you very soon.

Refer Slide Time: (12:45)

Similarly, when you are giving a power, the power can be an integer or a fractional value. For example, if I want to find the value of the square root of 3: the square root of 3 is simply 3 to the power 1/2, which is 3 raised to the power 0.5. So I can write this value either as 3 ^ 0.5 or as 3 ** 0.5, which I have done here, and you can see that in both cases the value comes out to be 1.73; these two are the screenshots obtained after this execution.

Refer Slide Time: (13:32)

Similarly, if you want to find the value of 1 over the square root of 3: this is 3 to the power -1/2, that is, 3 raised to the power -0.5, and it can be written either with the caret (^) sign or with the double star (**) sign; I have written it here, and here is the screenshot. And in case you have more than one operator at the same time, suppose you are using the plus (+) sign, minus (-) sign, multiplication (*) sign and division (/) sign together, then the same BODMAS rule of precedence that we have in mathematics applies in the R software also, as sketched below.
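A small sketch of the power notation and operator precedence (^ binds before * and /, which bind before + and -):

    3 ^ 0.5                  # [1] 1.732051, the square root of 3
    3 ** -0.5                # same as 3 ^ -0.5, i.e. 1 over the square root of 3
    9 + 4 - 5 ^ 8 * 3 / 7    # 5^8 is evaluated first, then * and /, then + and -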

Refer Slide Time: (14:26)

So now I will show you all these things on the R console, so that you gain some confidence. For example, I take x = 5 and y = 10. If I want to add them, I can write x + y, which gives 15, or if I write 5 + 10 directly, this also gives the value 15.

Similarly, if I want to subtract, 10 - 5 gives 5, and if I compute y - x, this is also 5. Similarly, if I want to multiply 10 and 5, I write 10*5 and this gives the value 50, and if I multiply x and y, which take the same values, this again gives 50. Now, if I want to divide 10 by 5, I write 10/5 and this gives the value 2, and similarly y/x again gives the value 2.

Refer Slide Time: (15:48)

Now I clear the screen by pressing Ctrl + L. Suppose I want to find the square root of 2: I type 2 ^ 0.5, which gives 1.414214. Similarly, if I want the value of 1 over the square root of 2, I type 2^-0.5, and this gives me that value. And if you want the value of 2 cubed, there are two options: I can write 2^3, which gives the value 8, and similarly 2**3 also gives 8. And if I have some other mathematical expression, like 9+4-5^8*3/7, the value is again computed using the BODMAS rule. These are very simple operators, but before we go into the statistical part you definitely need to practice them, so that you are more conversant. So I would request you to please practice them, and I will see you in the next lecture. Till then, goodbye.
Lecture 03

Calculations with R Software – Calculations with Data Vectors

Welcome to the next lecture on Descriptive Statistics with R software. You may recall that in the earlier lecture we learnt how to use the R software as a calculator, and we learned different types of mathematical operators: addition, subtraction, multiplication, division. But all of that applied when we were adding or subtracting plain numbers. Now, in this lecture, we are going to consider vectors of numbers, which we call data vectors, and we would like to learn how these mathematical operators, addition, subtraction, division, power and so on, work with data vectors. You will have two options here: the first option is working with a data vector versus a number, and the second option is working with a data vector versus a data vector. So let us start.

So now we are simply going to look into this aspect of the R software: how to handle data vectors with the different mathematical operators, addition, subtraction, multiplication, division and so on. Let me take different types of examples, and through those examples I am going to explain to you how things happen.
So let me take a data vector consisting of the four values 3, 4, 5 and 6. As I told you earlier, all these values have to be entered with the c operator, so I write c and, inside the brackets, the four values 3, 4, 5, 6, and then I write hat (^) and 2. What does this mean? Let me come back to my problem, what I really wanted to do: suppose I want to find the values of 3 squared, 4 squared, 5 squared and 6 squared. You have to notice that the power on each and every number is the same: two, two, two, two. So once I write the data vector c(3, 4, 5, 6) ^ 2, it gives me the outcome 3 squared, 4 squared, 5 squared, 6 squared, which has the values 9, 16, 25 and 36 respectively. So what you have observed is that once the power operator is used with a vector, the power operates on each and every number inside the data vector. Now let me execute it on the R console.

I will show you here on the R console. If I run c(3, 4, 5, 6) ^ 2, this gives 9 16 25 36. Similarly, if you want cubes, you can write c(3, 4, 5, 6) ^ 3, and this gives the values of 3 cubed, 4 cubed, 5 cubed and 6 cubed; and instead of the hat you can also use the double star (**) operator, which, as you can see, is the same thing. So this is what you have to keep in mind: when you do the

operation with a vector, the power gets distributed over each element. Now I take another option, where I use two vectors: the data is inside a vector, and the power, which I had earlier taken as a scalar, is itself now a vector. Let us see what happens. Essentially we want to find the values of 3 squared, 4 cubed, 5 squared and 6 cubed. You can see that the powers are 2, 3, and then once again 2 and 3, so the repeating pair of powers can be written as the data vector c(2, 3). Once this operates over the data vector c(3, 4, 5, 6), the 2 and the 3 are applied pairwise and recycled: the 2 goes on the 3 and squares it, the 3 goes on the 4 and cubes it, then again the 2 goes on the 5 and the 3 goes on the 6. So the result is 9, 64, 25, 216, the values of 3 squared, 4 cubed, 5 squared and 6 cubed respectively, and here is the screenshot of the outcome.

What you need to learn here is that whenever you give a data vector and apply powers which are themselves given as a data vector, the powers are distributed over the data vector in the same sequence in which they are given.

Let me show you on the R console how it happens. I take the vector c(3, 4, 5, 6) and apply the power operator c(2, 3), and you can see the result comes out as described.
Now let me take another example on similar lines, where I want to compute the values 1 squared, 2 cubed, 3 to the power 4, 4 squared, 5 cubed and 6 to the power 4. We need to observe that the powers are 2, 3, 4 and then once again 2, 3 and 4, so the repeating powers 2, 3, 4 can be given inside a data vector, c(2, 3, 4), and the values 1, 2, 3, 4, 5 and 6 can be given in another vector. Once they are operated, the 2 comes to the first place, the 3 to the second place and the 4 to the third place; then the three powers come around again, with the 2 on the fourth value, the 3 on the fifth and the 4 on the sixth. So I get 1 squared, 2 cubed, 3 to the power 4 and so on, and the values obtained are 1, 8, 81, 16, 125 and 1296. You can practice this yourself. So what you have to learn here is that whenever we find the power of a vector where the powers themselves are given in the form of a data vector, the powers move over the data vector exactly in that order.

Now the next question: what happens if the length of the data vector is not a multiple of the number of powers to be operated? In the example here I have six values, 1, 2, 3, 4, 5, 6, and three powers, 2, 3, 4, so the powers 2, 3, 4 fit over 1, 2, 3, 4, 5, 6 exactly. So what really happens if this is not a multiple?
So let me take another example to show you the outcome, and after that I will show it on the R console. Suppose I take the data vector c(2, 3, 4, 5), using the c operator, and another data vector containing the values 3, 4, 5. What really happens once I use the power operator? The 3 comes over the 2 and makes 2 cubed; the 4 comes over the 3 and makes 3 to the power 4; the 5 comes over the 4 and makes 4 to the power 5; and after this the 3 starts again and comes over the 5, making 5 cubed, but then there is no place left for the 4 and the 5 to go. So we compute 2 cubed, 3 to the power 4, 4 to the power 5 and 5 cubed, but there is no room for the remaining powers. In this case R computes the values on the basis of whatever data is available, but it gives a warning message, saying clearly that the longer object length is not a multiple of the shorter object length. Whenever R gives you a warning, it is a warning in the literal sense: it will not harm you, but you have to be careful. The second kind of message is an error message, which means one is making a mistake and the program will not run at all; with a warning the program runs, but it tells you to be careful while executing it. So let me show you on the R console first.

Suppose I take the same data vector, c(3, 4, 5, 6), and raise it to the powers c(2, 3, 4). You can see that the lengths are not multiples, and it gives these values: the 9 is 3 squared, the 64 is 4 cubed, the 625 is 5 to the power 4, that is, 5 into 5 is 25 and 25 into 25 is 625, and the last value, 36, is 6 squared. But after that it tells you that the longer object length is not a multiple of the shorter object length, and gives you a warning message. So you have to be careful.
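Putting the power examples together in sketch form:

    c(3, 4, 5, 6) ^ 2                # [1]  9 16 25 36
    c(3, 4, 5, 6) ^ c(2, 3)          # [1]   9  64  25 216, powers recycled as 2 3 2 3
    c(1, 2, 3, 4, 5, 6) ^ c(2, 3, 4) # [1]    1    8   81   16  125 1296
    c(3, 4, 5, 6) ^ c(2, 3, 4)       # [1]   9  64 625  36, plus a recycling warning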
So that was the power operator. Similarly, let me take the other operators, like multiplication, addition and division. First multiplication. Again I take two examples: the multiplication of a data vector with a scalar, and then with a vector. You can see here I have taken the data vector c(2, 3, 4, 5), which is to be multiplied by 6. When you multiply a scalar value 6 with a data vector, the scalar multiplies each and every value of the data vector. For example, suppose I want to find the values 2 into 6, 3 into 6, 4 into 6 and 5 into 6: in this case the 6 is common, so the 6 appears once, the values 2, 3, 4 and 5 go inside the data vector, and the outcome is 2 into 6, that is, 12, then 18, 24 and 30. This is the screenshot; later on I will show you on the R console also.

Now I consider two data vectors. In this example one data vector consists of the values 2, 3, 4, 5, and the other is -2, -3, -4 and 6, and the two data vectors are multiplied with each other. Essentially we are finding the values 2 into -2, 3 into -3, 4 into -4 and 5 into 6. The values 2, 3, 4 and 5 are combined in one data vector, the values -2, -3, -4 and 6 are combined in another, and when the multiplication operator comes into the picture, the first value of one data vector is multiplied by the first value of the other, the second value by the second value, the third by the third, the 4 with the -4, and the 5 with the 6. So you can see there is an element-wise multiplication: the first position is multiplied with the first position, the second position with the second position, and so on; here is the screenshot of the same operation. And now another example, where the length of the first data vector is a multiple of the length of the second, but they are not equal: there are four values, 2, 3, 4, 5, in the first data vector, and two values, 6 and 7, in the second data vector.

So essentially we are finding the values 2 into 6, 3 into 7, 4 into 6 and 5 into 7. You can see that the two values 6 and 7 are repeated: they come around again in the second data vector, while the values 2, 3, 4 and 5 form the first data vector, and the multiplication sign is the R operator star. In this case the 6 is multiplied with the 2 and the 7 with the 3, and once this process is complete, the 6 is again multiplied with the 4 and the 7 with the 5. So the operation proceeds in this particular way; this is what you have to keep in mind. And similarly, I will show you what happens when the length of the second vector is not a multiple of the number of data points in the first vector: we have a similar outcome as in the case of the power operator.

Here I take an example with the four values 2, 3, 4, 5 in the first data vector and the values 6, 7, 8 in the second data vector. Once they are multiplied, by the same rule the 6 gets multiplied with the 2, the 7 with the 3 and the 8 with the 4. After this the process is repeated once again, and the 6 is multiplied with the 5, but then there is no place left to multiply by the 7 and the 8. So we simply find the values 2 into 6, 3 into 7, 4 into 8 and 5 into 6, and there would have to be two more values to be multiplied by 7 and 8, but they are not present. That is why we get the warning message that the longer object length is not a multiple of the shorter object length. So now let us do this operation on the R console: here is the screenshot of what we are going to do, but I would like to show you live. Let me clear the screen by pressing Ctrl + L.

Now, I take the data vector c(2, 3, 4, 5) and multiply it by 7. You can see that the result is 2 into 7, 3 into 7, 4 into 7 and 5 into 7, which is 14, 21, 28 and 35 respectively. Similarly, if I multiply the data vector c(2, 3, 4, 5) with another data vector, c(5, 6, 7, 8), I get the following: the 10 comes from 2 into 5, the 18 from 3 into 6, the 28 from 4 into 7 and the 40 from 5 into 8. And similarly, if I set it up so that the first data vector has four values while the second data vector has only two, I again have a clean outcome without any warning: the 10 comes from 2 into 5, the 18 from 3 into 6, the 20 from 4 into 5 and the 30 from 5 into 6. Now, if I add one more number to the second data vector, making it 5, 6, 7 instead of 5, 6, and multiply it by the data vector c(2, 3, 4, 5), you can see I get a warning message. Why? The 10 is the outcome of the first value, 2, multiplied by the first value, 5; the 18 is the multiplication of 3 with 6 (please look at the highlighted part); the 28 is the outcome of 4 multiplied by 7; and after this, the 5 is multiplied by the recycled 5, which is 25, but then there are no numbers left to multiply by 6 and 7. That is why it gives me a warning message, so you have to be careful in this case.
Now I will do the same thing with addition. Once again I take a vector of four values, 2, 3, 4, 5, and I add the scalar 20. You can see that the 20 is added to the 2, to the 3, to the 4 and to the 5, so I get the values 2 + 20, which is 22, then 3 + 20 = 23, 4 + 20 = 24 and 5 + 20 = 25. Essentially I wanted to add the number 20 to the values 2, 3, 4 and 5; that is why the 20 is written once, as the common value, and the values 2, 3, 4 and 5 are written as the data vector c(2, 3, 4, 5). This is the output of the operation, the screenshot of the operation,
and similarly, if you take vectors whose lengths are not multiples, we again have a similar outcome. If I take a vector of four values, 2, 3, 4, 5, and another vector of three values, 6, 7 and 8, the numbers of values in the first and second vectors are not the same. In this case the 6 is added to the 2, the 7 to the 3 and the 8 to the 4, which gives the outcomes 8, 10 and 12; then the 6 is added again to the 5, giving 11, but there are no numbers left to add to the 7 and the 8. So essentially I get the values 2 + 6, 3 + 7, 4 + 8 and 5 + 6. The values 6, 7 and 8 are combined in one vector, and ideally there would have to be two more numbers in the data vector c(2, 3, 4, 5) to pair with the 7 and the 8; had there been, I would not have got the warning. So it is the same style, the same operation, that we get here, and you can see the screenshot of the same operation which you have just learned. Before going into further details, let me show you how these things happen.
So let me first clear the screen. I take the data vector c(2, 3, 4, 5) and add the scalar 20: you can see that every number 2, 3, 4, 5 has 20 added to it, and that is the outcome we get. Similarly, if I take the same data vector c(2, 3, 4, 5) and add c(6, 7, 8, 9), you can see the outcome: the first 8 comes from 2 plus 6, the 10 from 3 plus 7, the 12 from 4 plus 8, and the 14 from adding 5 and 9 together. And in the same operation, if I remove one number, so that the lengths of the vectors are no longer multiples, I get a similar outcome with a warning message: the 8 is 2 plus 6, the 10 is 3 plus 7, the 12 is 4 plus 8, and the 11 is 5 plus the recycled 6. So the same thing continues here.
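The addition cases in sketch form:

    c(2, 3, 4, 5) + 20             # [1] 22 23 24 25
    c(2, 3, 4, 5) + c(6, 7, 8, 9)  # [1]  8 10 12 14
    c(2, 3, 4, 5) + c(6, 7, 8)     # [1]  8 10 12 11, plus a recycling warning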

Now I will simply take some examples on similar lines for subtraction and division, which will give you a clear idea of how things happen.
So let me come to the R console and take the same example, where the data vector consists of the four values 2, 3, 4, 5. I subtract the value 1 from 2, 3, 4, 5, which means each and every value in the data vector has 1 subtracted from it, so the answer we expect is 2 minus 1, 3 minus 1, 4 minus 1 and 5 minus 1, and when I press Enter I get exactly that: 1 2 3 4. Similarly, if I subtract two vectors whose lengths are multiples, say the second vector is c(2, 3), then I expect the outcome to be 2 minus this 2, 3 minus this 3 (please look at the highlighted part), then again 4 minus the 2 and finally 5 minus the 3, and the answer comes out to be 0 0 2 2. Similarly, if I subtract a vector of the same length, say c(2, 3, 4, 5) minus c(2, 3, 4, 5), the outcome is 2 minus 2, 3 minus 3, 4 minus 4 and 5 minus 5, and the answer comes out to be 0 0 0 0. Just for illustration, to make you more comfortable, suppose I take some other values, say 7, 8, 9 and 10: this means I am computing 2 minus 7, 3 minus 8, 4 minus 9 and 5 minus 10, and you can see the answer comes out to be -5 -5 -5 -5: the first -5 from 2 minus 7, the second from 3 minus 8, the third from 4 minus 9 and the last from 5 minus 10. Similarly, if you take some other values, say 8, 2, 1 and 3, then c(2, 3, 4, 5) - c(8, 2, 1, 3) gives -6 1 3 2. And similarly, if you take values where the lengths are not multiples, say c(2, 3, 4, 5) minus c(8, 2, 1, 3, 7), the values come out to be 2 minus 8, 3 minus 2, 4 minus 1 and 5 minus 3, which is -6 1 3 2, and for the last value the 2 is recycled, so 2 minus 7 gives -5; but after this there are no more values, the two vectors are not of multiple lengths, and that is why I get a warning message. So you can see that the same type of operation works.
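The subtraction cases in sketch form:

    c(2, 3, 4, 5) - 1                 # [1] 1 2 3 4
    c(2, 3, 4, 5) - c(2, 3)           # [1] 0 0 2 2
    c(2, 3, 4, 5) - c(7, 8, 9, 10)    # [1] -5 -5 -5 -5
    c(2, 3, 4, 5) - c(8, 2, 1, 3, 7)  # [1] -6  1  3  2 -5, plus a recycling warning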

Similarly I try to show you here how the division operator works. So let me take here a data vector

of say 5 10 15 and 20, four values and if I try to divide each and every value by here 5 so I'm trying

to give here a scalar data vector divided by a scalar value 5. So you can see here this value is

coming out to be 1 2 3 4, this 1 is coming because 5 is being divided by 5, this 2 is coming

because 10 is being divided by 5, this 3 is coming because 15 is being divided by 5, and this 4 is

coming because this 20 is being divided by 5 and similarly if I try to take the same data vector but

now instead I try to divide it by here say another data vector, say here 5 and 10, two value, so in

16

50
this case you can see here both the data vectors are of multiple lengths. So what will happen here

that once I enter, I get the outcome 1 1 3 2. This 1 is coming because I am trying to divide this 5

with the first value in the second data vector 5 which is 5 divided by 5 which is 1. This 1, this is

highlighted 1, is the outcome of this 10 divided by this 10 which is here 1. The third value 3, this is

an outcome of 15 divided by the first value 5 which is 3, 15 divided by 5 is 3 and the last value 2

here is an outcome of this 20 divided by 10 which is 2. So you can see here that the division

operator also works exactly on the same lines as the power operator, multiplication operator,

addition operator and so on. And similarly, if I try to take the two vectors of say different lengths,

so for example, I can say here I want to divide here c 5 10 15 20 by here say c 5 10 3. So now in this case

the first data vector has four values and the second data vector has only here three values. So in this

case we expect that we will get a warning message and yeah that is what is happening you can see

here that these first three values, 5 10 and 15, these are being divided by 5 10 and 3, for the first

value 5 divided by 5 which is 1, the second value 1 this is here 10 divided by here this 10, third

value here is 15 divided by here this value here 3 which is here 5 but now after this, the fourth value

20: 20 is being divided by here this 5, which is 4, but after that there are no values, because the two vectors are not of multiple lengths. So we are getting here a warning message.
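Again for reference, a minimal sketch of this division session on the console, using the values from this example (the exact warning text may vary slightly between R versions):

> c(5, 10, 15, 20) / 5
[1] 1 2 3 4
> c(5, 10, 15, 20) / c(5, 10)     # shorter vector recycled to c(5, 10, 5, 10)
[1] 1 1 3 2
> c(5, 10, 15, 20) / c(5, 10, 3)  # lengths 4 and 3 are not multiples
[1] 1 1 5 4
Warning message:
longer object length is not a multiple of shorter object length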

So you can see here that these operations on data vectors can have different types of outcomes in comparison to other software. So here you have to be careful, but I can assure you that these types of operations are going to be very, very useful when we

are going to start with the statistics part. So in the next lecture, I will try to take some other aspect

on the computation of the values using R software and till then, you please practice these operators,

try to take more example, try to create more example yourself and then try to verify whatever you

are getting, is it really matching with the mathematical operation which you can do by your own

hand, manually and we will see you in the next lecture, till then good bye.


Lecture - 04

Calculation with R Software - Built-in Commands and missing data

handling

Welcome to the next lecture on descriptive statistics with R software. You may recall that in

the earlier lectures, we had discussed, that how R is going to be used for different types of

computations, when we are trying to use the scalars as well as data vectors. In this lecture, we

will continue with the same topic, and I would try to show you that in R there are two types of computations: one using simple operators, which you have to write yourself, and besides those there are some built-in functions, and those built-in functions can be used directly over the data or the

data vector, to obtain the required outcome. In this lecture I will also show you that, how you

are going to handle the R software, when some values or one value in our data vector is

missing, that is, not available due to some reason. So, let us start our lecture here. Now, what I will do is take some examples, and through those examples I will try to show you, as an illustration, how to use the built-in functions. There are many, many functions available, so it is not practically possible for me to cover all the built-in functions, but I will take a sufficient number of examples so that you are comfortable in using all the other functions.

Refer slide time :( 1:54)


So, let me take here the first example: suppose I want to find out the maximum among some values. For example, I am taking here a data vector which contains four values: four point two, three point seven, minus two point three and four. And I really want to know which of these four values is the maximum. So, you can see here,

obviously because, there are only four values, so you can see here, that this 4.2 is the maximum

value and once you enter it here, it will give you, this value 4.2. and if you try to, do it over the

R console, here is the screenshot. Well, I will also be showing you later on, that how to do this

thing and one thing what you have to observe here, that earlier I had told you, that whenever

you are trying to give a data, you have to give it in the form of a data vector using the c

command. But, there are some built-in functions, which can be used, without using the c

command. So, I am giving you here, an example of this, max function. Here, you can see here,

in this case and in the second case, I am finding the maximum among the same set of values,

but here I have used, here the c command. so here, I am trying to combine that data using the

c command, for these four values, 4 point two, three point seven, minus two point three and

four and here also I get the same outcome and here is the screenshot of this thing, but here, I

would like to give you one advice, well it is more difficult to keep a track that, which built-in


function is to be used with the c command and which built-in functions are used without the c command. So I will say the simple rule of thumb is this: always use the c command. So, whenever I have to give a data vector, without creating any confusion, without creating any problem, I will simply try to give the data vector using the c command. Right.

Okay. Similarly, in case if you want to find out the minimum, again I am trying to take the same values.

Refer slide time: (4:15)

But here, in this case, the command is m i n, whereas in the case of maximum it was m a x. Right. So now, in case if I try to write down here m i n and inside the bracket I try to give all the data values, and if I try to press enter on the R console,

then I get here, the value minus two point three and you can see yourself here, that out of four

point two, three point seven, minus two point three and 4, this minus two point three is the

minimum value. Okay. So, minimum and maximum are functions where you need not write the entire program yourself; somebody has already done the programming to find

out the minimum and maximum values, and those programs have been named as, say, m i n and m a x, and you simply need to use them. But definitely, as I told you, R is not a black

box, so in case, if you really want to know that, what logic has been used, what programming

has been used, it is possible to look into the steps, which are used in finding these values.

Okay. Here, is the same operation and again, I'm trying to show you here, that here I am, not

using this c command and here, I am using c command. I mean, say the data is combined, using

the c command, whereas in the first case the data is not combined using the c command but, in

both the cases the outcome is the same. So, again I will advise, you that always use the c

command, to combine the data vector and here is the output or the screenshot. Now, I will try

to show you, that how these things are happening on the R console. Right.

Refer slide time: (6:09)

So, let me, go to here, this thing first I try to copy this command and I come to here,


Refer slide time: (6:19)

R console and I try to paste this here. So, you can see here, this will give you, four point two

and yeah. In case if I want to write down here the c command, you can see here, so now the same data vector has been combined using the c command, and if you see, the answer is going to be the same. And similarly, in case if you try to take here another example, suppose here, among 60, 20, say 56, 87, 97, 35 and so on, you can see here the maximum value here is 97. Right. Okay. Now, similarly in case if you try to find out the

minimum, then I can do here, I can use here the command, m i n, over the same data set, 4

point two, three point seven, minus two point three and four and you can see here, now the

minimum value is coming out to be, minus 2 point 3 and similarly in case if I try to use here

this c command here, even then you will get the same outcome. Right. And similarly if you

try to take another example here, and if I try to find the minimum of the data vector among 56, 23, 98, 65, 74, 34 and so on, it comes out to be 23. So, you see here, I am trying to take


a very small data set, where you can verify yourself that whether it is giving you a minimum

value or a maximum value, but these built-in functions are very useful when you are trying to

deal with a huge data set where there may be five thousand values, ten thousand values or

twenty thousand values or so on, there you cannot find out these values manually, so these

functions help you there. Now, if you try to understand I have taken here, two examples, m i n

the minimum and m a x, maximum and I have illustrated how you can operate it over a data

vector. Now, similar to minimum and maximum, there are some other built-in functions, which

are available in R. There is a long list, but here, in this slide I am trying to give you an overview

of those built-in function and how to use those built-in function, the process and the procedure

is exactly the same what you have learnt in the case of minimum and maximum.
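Before moving to the other built-in functions, here is a compact sketch of the maximum and minimum commands on the data used above; the range command in the last line is an additional built-in, not shown on the slides, that returns both values at once:

> max(c(4.2, 3.7, -2.3, 4))
[1] 4.2
> min(c(4.2, 3.7, -2.3, 4))
[1] -2.3
> range(c(4.2, 3.7, -2.3, 4))   # minimum and maximum together
[1] -2.3  4.2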

Refer slide time: (8:55)

For example, in case if you are interested in finding out an absolute value. Right. Then you

simply have to write here, a b s inside the brackets, you simply have to give the here the data


vector, even a single value or a vector value, both are acceptable. Similarly, in case if you want

to find out the square root of any value or any data vector, the function is s q r t and similarly,

in case if you want to round a value, the function is r o u n d; if you want the floor, the function is f l o o r; if you want the ceiling, then the function is c e i l i n g; if you want to find out the sum of numbers or of a data vector, then the function here is s u m, and for the product the function is p r o d. Similarly, there are functions for different types of logarithms, a function for the exponential, functions for the trigonometric functions sine, cos, tan, etc., and the hyperbolic functions sinh, cosh, tanh, etc., are also there, and there is a long list. But now, I will try to take here some examples and I will try to show you how you are going to use them.
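As a hedged illustration of a few of the functions just listed, on some small made-up values that are easy to verify by hand:

> abs(-3.2)
[1] 3.2
> round(3.567, 2)    # round to 2 decimal places
[1] 3.57
> floor(2.7)
[1] 2
> ceiling(2.3)
[1] 3
> log(exp(1))        # natural logarithm of e
[1] 1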

Refer slide time: (10:16)

For example, in case if you want to find out, the square root of any value, suppose I want to

find out, the square root of 4. Right. So, in this case I simply have to write it s q r t and inside


here, single value 4. And if you try to see it here, this value will come out to be 2.

Now, I will try to illustrate that if you want to find out the square root of a data vector, then exactly in the same way as addition, subtraction, multiplication and division are operated over each and every element in the data vector, this square root operator will also be executed over all the elements of the data vector. For example, in case if I try to

take here a data vector c, say four, nine, sixteen and twenty five and suppose if I try to find out,

the square root of this data vector, then this will be operated like this, say here, c square root

of four, square root of nine, square root of sixteen, square root of twenty-five, and this answer

will come out to be here, two, three, four and here five. So, you can see here, in this case, I'm

trying to do the same thing and the outcome is coming out to be here, two, three, four and five.

And here is the screenshot of the same operation. Right. Okay.

Refer slide time: (11:48)


Now, similarly, you can try another example yourself: try to find out the square root of nine, and then the square root of another data vector containing the values 9, 25, 36, 49. So, I'm just leaving it for your practice. Okay.

Refer slide time: (12:05)

Now, another important aspect: sum and product. There is a built-in function here, s u m, and this function finds out the sum of all the values inside the data vector. For example, in case if I want to find out the value of 2 plus 3 plus 4 plus 5, then I can give these values in the form

of a data vector, consisting of four values, 2, 3, 4, 5 and then I simply have to operate it here

and write s u m of this data vector and this will, give us the value of 2 plus, 3 plus, 4 plus, 5.

that is the summation of all the values inside the data vector and this value will come out to be

here 14 and this is the screenshot here. Similarly in case if I say, I want to find out the product

of say 2, 3, 4, 5, that is 2 into 3 into 4 into 5 that means, I can combine this data in the form of

a data vector, c, two, three, four, five and then if I use here the built-in function p r o d, then


this will give us the multiplication of all the values inside the data vector. Right. So, this p r o

d function, this is used to find out the product of all the values inside the data vector. So, for

example here, I try to use, this command over the R console and this value comes out to be

120. 2, 3’s are 6, 6, 4’s are 24, 24, 5’s are 120. Right. And this is the screenshot of the same

operation. Now, I try to come to the R console, so that I can show you, these operations. Right.

So I will try to show you here, the square root sum and product operation.
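A compact sketch of the sum and product commands on this data vector; the second call below is an extra illustration, not shown on the slides, of the fact that sum accepts several vectors at once:

> sum(c(2, 3, 4, 5))
[1] 14
> sum(c(2, 3), c(4, 5))   # several data vectors can be summed together
[1] 14
> prod(c(2, 3, 4, 5))
[1] 120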

Refer slide time: (14:04)

For example, if I try to find out here, square root of here, see here nine, you can see here this

comes out to be here three. And if I try to find out the square root of our data vector, say here 4, 9, 16, 25, this will give us the value of square root of 4, square root of 9, square root of 16, square root of 25, which is 2, 3, 4 and 5. Okay. Now, let me take here some negative value, let us see what happens: -4, say here 9, 16 and here 25. So, this will give here, what is here N a N? This is something new for us, so I will try to show you the use of this N a


N etc. and all these things after a couple of slides in the same lecture. So, this is trying to show you, in very simple words, that there is some issue, some problem, and it is showing that it is not possible to compute it. Right. So, that is where you have to be careful: for this minus 4 it is giving us something like a not-available type of value, and for all other values it is giving you the correct values. Okay.
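As a side note, a minimal sketch of this behaviour, with one extra, hedged illustration not covered in this lecture: if a complex square root is actually wanted, the input can be given as a complex number.

> sqrt(c(-4, 9, 16, 25))
[1] NaN   3   4   5
Warning message:
In sqrt(c(-4, 9, 16, 25)) : NaNs produced
> sqrt(-4 + 0i)   # complex input gives a complex root
[1] 0+2i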

Refer slide time: (15:31)

So now, let me give you an idea of the sum function. So, if I try to find out here the sum of, say, five values 5, 6, 7, 8, 9 combined in a data vector, this comes out to be here 35, and this sum operator is also valid for negative values; for example, the sum of -3, -4, -5 and say -6 will come out to be -18, so there is no issue, and in case if I try

to take here, two data vectors also, that also can be done, that we already have seen in the earlier


lecture and similarly if I try to show you here, now the product function, say if I want to make

it here, product of here c 2, 3, 4, 5. Right. So, this comes out to be hundred twenty 2, 3’s are 6

and 6, 4’s are 24, 24, 5’s are, 120. So, it is trying to give us the product of all the values present

inside the data vector. Now, if I try to take here one value to be here negative, then let me see

whether this operates or not. So, you can see here that the answer is coming out to be minus

120 because, this sum and product function they are the valid mathematical functions and they

are operative over positive as well as negative values. Right. And similarly, if you want to see the use of the absolute function, I can show you here: the absolute value of, say, this minus nine is actually nine. And similarly, if I try to take here a data vector consisting of c here -9, -6, 7 and here 8, you can see that there are two negative values and two positive values in this data vector. So now, once you try to operate the absolute function, minus 9 will become 9, minus 6 will become 6, and 7 and 8 are already positive values, so their absolute values remain the same. So you can see here that these operations are not difficult to do in the R software.

Refer slide time: (17:43)


And they will help us here in solving many complicated functions very easily. Besides this, I would also try to show you, as a quick revision, that whenever we are trying to operate on a data vector, it is also possible to assign the outcome to a new variable. For example, in case if I take here a data vector of two, three, four, five, and

suppose I want to obtain the square of all the values say, 2 square, 3 square, 4 square, 5 square,

then I can simply denote it here, by here x hat 2 and this value can be stored into a new variable

here y. Right. and once you try to see the value of here y, this will come out to be 4, 9, 16, 25,

which is 2 square, 3 square, 4 square, and 5 square, and this is here the screenshot. So this is a pretty simple operation that will help us in assigning the outcome of an operation into a new variable. Right. For example, in statistics we will see at many, many places that we need to find out the sum of squares.
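A compact sketch of this assignment idiom, which we will use just below for the sum of squares:

> x <- c(2, 3, 4, 5)
> y <- x^2      # store the squares in a new variable y
> y
[1]  4  9 16 25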

Refer slide time: (19:04)


So basically, suppose I have some data vector here consisting of some values x1, x2, up to xn, and suppose I want to find out the sum of squares of these values; that means you first square the values and then find out the sum. So now, using the built-in functions, you can see here what I can do. What do I have to do here? First, I need

to define here all the values inside a data vector, so I am trying to define here, a data vector

here c, 2, 3, 4, 5. Right. And now, this is asking me to find out the square, so what I can do

here? That, I need to find out the square, so I can write down here x hat 2 or I can write down

even here x multiplication x and this is going to give me the value of here, say two square,

three square, 4 square and five square and now, I need to find out the value of sum of two

square, three square, four square, five square. So now, I can use here the built-in function sum x hat 2, or here sum x star x, and this is going to give me the value of 2 square plus 3 square plus 4 square plus 5 square. So, you can see here that in case if I want to find out the

value of this function here Z, which is the sum of squares of different values

Refer slide time: (20:42)



then I can write it down here simply as sum x hat 2, and this value can be stored in a variable, say here z, and the sum of 2 square plus 3 square plus 4 square plus 5 square comes out to be here 54, and this is the outcome here. So let me try to show you it over the R console also, so that you get more confidence.

Refer slide time: (21:10)


So, you try to see here: I define here a vector 2, 3, 4 and 5, and I try to find out here x square. Right. And suppose if I try to find out here z is equal to the sum of x square, and if you try to see what is the value of z, this comes out to be 54. Right. And similarly, in case if you want to find out the product of 2 square, 3 square, 4 square, 5 square, then instead of sum I can write down here the product of x square, and this will come out to be 14400. What is this value? This is the product of 2 square into 3 square into 4 square into 5 square, or simply the product of 2 square, 3 square, 4 square, 5 square. Right. So, these types of operations are possible

and with this example I will try to illustrate here,

Refer slide time: (22:19)


The computation of the following function is very important and very popular in statistics. Suppose I want to find out the value of this function here:

$z = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$.

What is here $\bar{x}$? $\bar{x}$ here is the arithmetic mean; it is simply $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. So, what I am trying to show you here is that if I have got the values, I can find out their arithmetic mean, and for that there is a built-in function, which is called mean, m e a n. So now, what I want here first is that the arithmetic mean should be subtracted from each and every value $x_i$; then this difference has to be squared, and then I want to find out the sum of all those squared values. Now, this value can be further simplified: if you try to open the square, this becomes

$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x}\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2n\bar{x}^2$,

since $\sum_{i=1}^{n} x_i = n\bar{x}$. So, this quantity comes out to be the same as $\sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, which I

have written here. So now, in order to compute this value, what I can do: I already have computed the first part, which I had written here as sum of x hat 2. And what about this n and this x bar? n is the number of elements in the data vector, and for that there is a built-in command in R which is called length. So if I say here length of x, this is going to give me the length of the data vector, that is, the total number of data points in the data vector x. So instead of n, I can use here the command length of x, and for x bar I already have a built-in command, mean of x, so I can write down here mean of x and then square it. Well, don't worry, all these commands like mean, length, etc. we are going to discuss in further slides, but here I wanted to give you an example to show you how these built-in functions are going to be useful in computing a very important quantity in statistics.
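Putting the pieces together, here is a minimal sketch of two equivalent ways of computing this quantity for the data used above:

> x <- c(2, 3, 4, 5)
> sum((x - mean(x))^2)                 # direct form of z
[1] 5
> sum(x^2) - length(x) * mean(x)^2     # expanded form of z
[1] 5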


Refer slide time: (25:56)

So now, in case if you try to do it on the R console: suppose I take here the data set, say 2, 3, 4, 5, the same values. I have written here sum of x square, which corresponds to the first term; length of x, which corresponds to n; and mean of x square, which corresponds to x bar square; and the value of this function is stored into a new variable here z. So, once you try to execute it, first I show you here what is the length of x, which is coming out to be 4, and what is the value of z, which comes out to be 5, and this is the screenshot of the same operation. Right. Okay.

Refer slide time: (26:44)


And similarly if I try to take another example on the same lines, suppose I try to take here two

data vectors, say here, x 1 and x 2, consisting of four values 2, 3, 4, 5 and 6, 7, 8, 9 and suppose

I want to find out the sum of the product. So, this is something like this, 2 into 6 plus, 3 into 7

plus 4 into 8 plus 5 into 9. So, I need to find out the value of this thing. So now, I can do it very easily using these built-in operators. First I need to find out the multiplication, so for that

I can simply use here the operator x1*x2, that means, I'm trying to multiply the components of

two vectors x 1 and x 2 and whatever is the outcome, that I am going to sum. So, this value

will come out to be 110. Right. So, before I go further, I will try to show you on the R console how these things are happening.
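A minimal sketch of this computation with the values from the example:

> x1 <- c(2, 3, 4, 5)
> x2 <- c(6, 7, 8, 9)
> sum(x1 * x2)    # 2*6 + 3*7 + 4*8 + 5*9
[1] 110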

Refer slide time: (27:44)


So, suppose I try to take here the same data vector that we have taken earlier, 2, 3, 4, 5. First, if you try to see here the length of x: there are 4 elements, so it should come out to be 4. And similarly, if you want to find out the mean of x, it will come out to be 3.5, that is, 2 plus 3 plus 4 plus 5 divided by 4, which is equal to 3 point 5. Now, in case if you execute the same command here, sum of x square minus length of x into mean of x square, you can see here this function is coming out like this and the value is coming out to be 5. And similarly, if I try to take here another data vector, say 6, 7, 8 and 9, and if I try to find out the sum of the product of the x and y data vectors, this is coming out to be 110. What is this value? This is first multiplying the corresponding elements of the two data vectors x and y, and then finding the sum of the products. So, with this illustration you can see here very clearly that these built-in functions will help us in many types of operations that

we are going to learn in future lectures. Right.

Now, after this I would like to address, another small topic, which is very, very useful, suppose

someone is asked to collect, suppose five thousand data values and after the values are

collected, they are manually entered on a computer. Now there are various possibilities that not all the values are actually present in the data vector and some data is missing.


So, in case if a value is missing in any data vector, then how to handle it? this type of situation

may occur, for example, someone is asked to collect the data from say five houses and suppose

he goes to the third house and the house is locked. So he will try to indicate that this data is not

available, so he will use some symbol, so in R, in case if the data is missing, there are some

standard symbols, which are used and there are some functions and commands, which help us

in modifying the statistical and mathematical tools to handle the missing data. So, this is what we are going to learn now in the next couple of minutes.

Refer slide time: (30:48)

So, before I go to discuss the handling of missing data, first let me inform you of a few small things. There are two words, TRUE and FALSE, which are written in capital letters: capital T, capital R, capital U, capital E, and all capital letters in FALSE. These two are the logical operators, and

these operators are used to compare different expressions. We know, that in mathematics, there


are two types of operators: the mathematical operators and the logical operators. For example, take 'less than' or 'more than': if I say five is more than three, is this true or false? It is true. But if I say five is smaller than three, then it is false. So, here I just want to know the answer in terms of true and false; I am not interested in how much larger or how much smaller it is. So, these are our logical operators, so

these capital-letter words TRUE and FALSE are the logical operators, and they are reserved words; that means, as soon as you write capital T, capital R, capital U, capital E, R will automatically assume that you are going to use a logical operator. So, you cannot use it to define a new variable of your own; that will not be acceptable. Also, in place of the entire words TRUE or FALSE, you can use just the first letters, capital T or capital F, to denote the words TRUE and FALSE respectively.

So, T can be used in place of TRUE and F can be used in place of FALSE, right, and remember one thing: these TRUE and FALSE have to be written only in capital letters.

In case if you are writing this in small letters or even if a single letter in these two words is

small, this will not remain as a logical operator and R will not consider it as a logical operator.

So, this small true and small false, this is not possible, and they are not the same as the TRUE

and FALSE in capital letters

Refer slide time: (33:04)


As soon as, we get any data, then I try to input the data. First option is this, I can input the data

in the form of a data vector using the command c. Now, I would like to know: is there any value which is missing in the data? So, the first question is how to know whether any data value is missing inside the data vector. In order to know this, there is a built-in command here,

what we call here? Say, is dot n a and inside the bracket, you have to give the data vector, in

which you want to find, is any value missing. So, for example, I try to take here data vector

here, consisting of four values 10, 20, 30, 40. Here you can see all the values are present and no value is missing. So I try to execute the command is dot na, and inside the bracket say x, and you will get here an outcome like this: FALSE, FALSE, FALSE, FALSE. That means this value 10 is not missing, so the statement 'is 10 missing' is FALSE. Similarly, this FALSE corresponds to this 20: I'm asking with this command 'is 20 missing', and the answer is FALSE, no, it is not missing, it is present. Now, both the values 30 and 40 are also available, so this command is dot na is giving me a FALSE statement. So this

is the screenshot, of the same operation, I will try to show you over the R console also,


Refer slide time: (34:53)

And now let me take here another example, Here I am trying to replace the value 20 from the

earlier data vector by here N A. Right. you can see here I am writing here, capital N and capital

A, this is also a reserved word, I will try to discuss it after a couple of slides but, this is also a

reserved word and this is used in R, to indicate that the value is not available. Right. so now,

in this case, whosoever is entering the data, he has to be told, that in case if the data is missing,

he or she has to write NA in place of the missing data, so now, the data vector comes out to be

consisting of here, four values, 10, NA, 30 and 40. So, now when I try to operate the command

here is dot na, this is giving me the outcome here FALSE, TRUE, FALSE, FALSE, this means

here what? This FALSE corresponds to this 10: I'm trying to ask here 'is 10 missing', and the answer is no, it is available, hence the outcome is FALSE. Now, I come to the second value here, NA, and I ask is dot na of this value, and the answer says yes, this value is missing, and hence the answer is TRUE; my statement is true. And similarly, for


30 and 40, these two values are available, so this is trying to say, that these values are not

missing, they are available,

Refer slide time: (36:28)

And after this: suppose there is more than one missing value; even then there is no problem at all. For example, in the same data vector, if I make two values, 20 and 30, missing in this new x,

so once I try to operate the command here, is dot na, then it is giving me here, FALSE and

FALSE for the values, for those values which are available and it is giving me the answer,

TRUE and TRUE, for the values which are not available, so by this operation I can always find

out whether the values are missing or not. So as long as, I am getting here, all FALSE, that

means all the values are available and if I am getting even a single TRUE, that means value is

missing in the data vector and now, I will try to show you that in case if the value is not available

it is missing in the data vector, then what happens in case if I try to operate any built-in function.

Right. Ok. But, before that let me come back to here.


Refer slide time: (37:33)

R console, so that I can show you whether it is really happening or not. So, if I try to take here, for example, the x data vector here, c here 20, 30, 40 and here 50. Right. So, this is my x here; I try to execute is dot na on x, and it is giving me FALSE, FALSE, FALSE,

FALSE, all that means, all the data values are available. Now, in the same data set, I try to

replace, the second value by NA and now, you can see here, my this x becomes here, 20 NA,

40, 50, and if I try to repeat here the same command is dot na, then for the missing value it is giving me here TRUE. Right. And similarly, if I try to make here more than one value missing and operate the same command, I get here FALSE for those values which are available and TRUE for those values which are absent, which are missing. So, for the missing 20 and 40 I am getting the outcome FALSE, TRUE,

FALSE, TRUE. Okay.
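A compact sketch of these checks; the which command in the last line is an extra illustration, not shown in the lecture, that turns the logical outcome into the positions of the missing values:

> x <- c(20, NA, 40, NA)
> is.na(x)
[1] FALSE  TRUE FALSE  TRUE
> which(is.na(x))   # positions of the missing values
[1] 2 4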

Refer slide time: (38:40)


So now, let me come back to our slide and let me try to show you here, that how the things are

going to happen when some value is missing. So, suppose I try to consider the data set here in which one value is missing: 10, 20, NA and 40. Now, suppose I want to find out the mean, mean of x. The sample mean is defined here as the sum of all the values, x1 plus x2 plus up to xn, divided by the total number of observations n. Right. So now, in case if you try to find out the mean of 10, 20, NA and 40, this will become 10 plus 20 plus NA plus 40 divided by the number of elements in the data vector, which is here 4, and you will see that it is not really mathematically possible to add the value NA to any numerical value. So, this answer will come out to be here NA. Right. Whereas, in case if you try to use this command with the option n a dot r m equal to TRUE; this whole command I am going to explain later on, whenever we are going to deal with the statistical functions, but here I want to give you an idea of how you are going to modify the same command when there is a missing value in the data vector.

Right. So, in this case suppose I know, that there is one value, which is missing in the data set,

I have to modify my command, mean of x’s, mean of x. So, I can write down here, mean of


here x and I'm trying to write down n a dot rm, that means, NA value has to be removed and

this option can be TRUE or FALSE; here this option is TRUE. By writing TRUE, that means all the NA values have to be removed and the arithmetic mean has to be calculated on the basis of the available

numerical values. So, in case if I try to write down, this command here, then this arithmetic

mean will be found here as 10 plus, 20 plus, 40 divided by 3 not 4. Right. And then I will get

here a value here 23.33 and so on. So, this is how we try to work when some values are missing,

but let me show you this thing over the R console.

Refer slide time: (41:10)

So, let me take here, the data here, x to be here 10, 20 say here, NA, and here 40. Right. So you

can see here, this is my data here, and if I try to write down here mean of x, this is giving me here NA; and if I try to find out here mean of x with n a dot r m equal to TRUE, then you get here the value 23.33. So, here again, just for the sake of illustration,

I will try to show you that instead of here using the entire word T R U E, I can also use here,

capital T and the answer is coming out to be same, but on the other hand, in case if I try to


make it here in small letters, say t r u e in small letters, this will give me an error, and even if I try this with n a dot r m equal to a small t, that is, not capital T, this is also going to be an error. So, this is how we try to handle missing values.
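A minimal sketch of this missing-value handling with the values from the example:

> x <- c(10, 20, NA, 40)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)   # (10 + 20 + 40) / 3
[1] 23.33333
> sum(x, na.rm = TRUE)    # the same option works for sum
[1] 70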

Refer slide time: (42:13)

But, before concluding the Lecture, I would try to inform you, that in R, in some places, you

will be getting an outcome, like here N U L L in capital letters. This is going to be the outcome,

which is returned by some of the functions. Remember one thing: NA and NULL are not the same thing, and even capital N A and small n a are not the same thing; capital N, capital A is a reserved word and R is case sensitive. The difference between NA and NULL is

the following; NA is a place holder, place holder means, yes, inside the class, a student has

been assigned a seat, but the student is absent today, it does not mean that a student does not

exist. So in this case the student is missing, so that is going to be given by NA, not available,

at that point, but in case if I am trying to use the word, here N U L L, this NULL stands for

something, which never existed. So that is the difference between the use of NA, NULL, which


we have to be careful about while we try to use them. So here I would like to stop, and I would request you to please attempt your assignment, take some exercises from the books, or even create your own exercises: just take a data set, do small manipulations, and verify them with manual calculation, checking whether you are getting the same thing which you used to get manually. Try to replace some values in the data vector by NA

and try to see, what is really happening? If one value is missing, two values missing and even

if no value is missing, how you are going to obtain or how you are going to interpret the value

of this logical operator TRUE and FALSE. Right. So, you practice and we will see you in the

next lecture, till then good bye.

Lecture 05

Calculations with R Software - Operations with Matrices

Welcome to the lecture on the course descriptive statistics with R software. In the last couple of lectures, we

have understood that how R is going to be useful for doing various types of mathematical operations, and

we also understood that how the missing data values can be handled inside the R software. Now in this

lecture I will try to give you an idea of how to handle matrices in R software. So, what is a matrix?

Refer slide time :( 0:52)

If you try to see from the mathematical point of view, a matrix is simply a rectangular array which has got, say, n rows and p columns, and this will be denoted as a matrix of order n cross p. For example, I can always write a matrix in the standard notation as

$x = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$.

So, this is here a matrix, and what we are going to assume is that all these values are

some numerical values, some real values. Right. So, I will say that all these entries in the matrix are some numerical values, say some real numbers. Right. And in case if I want to denote a particular element, for example this x 1 2, then x 1 2 is going to be denoted like x 1 comma 2. So, that means this is the element in the first row and in the second column. So, a question arises here: what is

the difference between a data vector and a vector or matrix in terms of matrix theory? A data vector is just a data vector, but in matrix theory we have a number of different commands and a different structure. So, first you have to decide whether the data has been given in the form of a data vector or in the form of vectors and matrices; the operations on these two are going to be different, and that is what we are going to see in the further lectures.

Refer slide time :( 02:52)

So, the first question comes: how to create a matrix? In order to create a matrix we have a command here, m a t r i x, matrix, and inside the bracket you can see here I am writing several things: nrow, ncol, data. This nrow is giving the information on how many rows I want, that is, the number of rows; ncol, similarly, is giving the information on the total number of columns in the matrix. And what data has to be given? This has to be given using the c command as a data vector, which is then arranged inside the matrix. So, here if you try to see, I am using

here nrow is equal to 4, which means the number of rows here is 4, and the number of columns here is 2, so there are going to be 4 into 2, that is 8, values in the data vector. So, I'm trying to write those values here: 1 2 3 4 5 6 7 8. Now the data is going to be arranged in four rows and two columns. So, you can see here there are 4 rows, 1 2 3 4, and there are two columns here, 1 and 2, and the data is going like this: 1 2 3 4 and then from here 5 6 7 and here 8. So, yes, there can be a question why the data goes column wise and why it cannot go row wise. That I will try to address, but here at this moment I would request you to

please try to observe how the matrix has been created. I have simply given the number of rows, number of

85
columns and the data and based on that a matrix of order 4 by 2 has been created. Right. And because, for

your remembrance

Refer slide time :( 04:52)

The parameter nrow defines the number of rows in a matrix the parameter ncol defines the number of

columns in the matrix and the parameter data defines what data has to be given inside the matrix. Right.

And usually in case if you are not giving any option whether the data has to be entered in row wise or say

column wise, the default is column wise as we have seen in the earlier slide.
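A compact sketch of this matrix creation with the default column-wise filling:

> x <- matrix(nrow = 4, ncol = 2, data = c(1, 2, 3, 4, 5, 6, 7, 8))
> x
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8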

Refer slide time :( 05:18)

So, now I'm going to consider here the same matrix here x which I have just denoted and suppose my issue

is this, suppose my query is this: I want to access a particular element. So, how to obtain a particular element? Suppose I want to obtain this element 7. So, now what is the address of 7? The address of 7 is this: it is located in the third row and second column. So I will try to write down here the name of the matrix, small x,

and inside this square bracket I will try to write down the address. So, in this case this address is going to

be x bracket 3 comma 2. And once I try to type x 3 comma 2 on the R console, I will get here the value 7

which is the same value here like this. So before I go further I would try to show you that how the things

are going to happen on the R console. So I will try to create the

Refer slide time :( 06:23)

same matrix here, and you can see here this is the matrix x. Right. And suppose if I want to obtain a particular element, say 3 comma 2, I am getting here 7; so this 7 corresponds to this thing. Similarly, if you

want to find out, see here, 2 comma 2, this is 6; but if you try to find out here 2 comma 7, you can see here that this value does not exist. Right. Okay.
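A minimal sketch of this element access, using the column-wise matrix x created above; the exact wording of the error message may differ slightly across R versions:

> x[3, 2]
[1] 7
> x[2, 2]
[1] 6
> x[2, 7]               # no such column exists
Error in x[2, 7] : subscript out of bounds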

Refer slide time :( 06:53)

Now I try to address here the second issue that in case if you want to enter the data row wise. Then what

option you have to give? You simply have to give an option here or add a parameter here byrow is equal to

TRUE. So, you can see here all the other parts of the syntax are the same as what we used in the earlier slide, but I have used here one new thing: byrow is equal to TRUE. And in this case you can see that the data 1 2 3 4 5 6 7 8 is going to be entered row by row, like this: 1 2, 3 4, 5 6, 7 and 8. Right.

Refer slide time :( 07:39)

So, this is how we are going to do it, and this is the screenshot; I will try to show you on the R console also.

Refer slide time :( 07:49)

And similarly in case if the data has to be entered column wise then I simply have to add the parameter

here byrow and which is now here FALSE that means I don't want the data to be entered in the row wise

mode. So, obviously once this statement is FALSE, the opposite is true: the data has to be entered column wise. So, in case if I try to execute these things over the R console, I can show you here: I try to use here byrow equal to FALSE. Let me choose my font size here so that you can see it clearly. And you can see here that this is byrow equal to FALSE. Right. And now, in case if I try to see what is here x, this is the same thing; but now in case

if I try to do the same thing and if I try to make it here TRUE or I can use it capital T also here now you

can see what, what is the outcome here? The data here in the first case and the data here in the second case

these are arranged in different ways. First in the case of column wise and then the second case it is row

wise, right.
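A compact sketch of the two filling modes side by side:

> matrix(nrow = 4, ncol = 2, data = c(1:8), byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
> matrix(nrow = 4, ncol = 2, data = c(1:8), byrow = FALSE)   # same as the default
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8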

Refer slide time :( 09:09)

So, this is the screenshot of the same operation, for your information only.

Refer slide time :( 09:11)

And similarly if I want to find out the transpose of a matrix what is the transpose of a matrix? When we

try to interchange the rows and columns then it is called the transpose, for example if I say I have here a

matrix here x is equal to 1 5 2 6 3 7 4 8 then this matrix can be given in R console using this command

matrix, by the same command that we have used earlier, and its outcome will look like this. Now, in case if I want to find out the transpose, the command is here t x, where t means transpose,

and inside the bracket you have to give the name of the matrix of which you want to find out the transpose.

So, you can see here your earlier

Refer slide time :( 09:48)

Refer slide time :( 10:03)

matrix was 1 2 3 4 and then here 5 6 7 and here 8. But now this number of rows and number of columns

are changed here the number of rows are here four and number of here columns are here two but now once

I try to find out

Refer slide time :( 10:23)

the transpose the number of rows, they are here two and number of columns here are here four and the data

is now 1 2 3 4 and then 5 6 7 and here 8. So, this t x is the command to find out the transpose of a matrix in R, and now I will try to show you on the R console also. For example, if I try to

take here the same matrix that we had taken earlier. So, here you can see this is your x matrix, and I try to find out the transpose as t of x. Now this has changed: the first row becomes the first column, the second row becomes the second column, the third row becomes the third column, and the fourth row becomes the fourth column. So, this is how we can find out the transpose of a matrix.
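A minimal sketch of the transpose on this matrix:

> x <- matrix(nrow = 4, ncol = 2, data = c(1, 2, 3, 4, 5, 6, 7, 8))
> t(x)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8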

Refer slide time :( 11:22)

Now next I try to address that how we are going to use the operations of matrix addition and subtraction

in the matrix setup. So, we know that if I have a matrix like here 1 2 3 4 and if I try to multiply it by here a scalar 5, then this operation is done on each and every element.

So, 1 into 5, 2 into 5, 3 into 5 and 4 into 5. And now, in case if I try to make here the multiplication of a matrix with say another matrix, say 5 6 7 8, then the elements are multiplied and added row by column: the first entry is 1 into 5 plus 2 into 7, then 1 into 6 plus 2 into 8, then 3 into 5 plus 4 into 7, and then 3 into 6 plus 4 into 8.

So, this is how the multiplication is done in mathematics, and this is what is taught to all of us. So, now in this case I try to take here the same matrix in which the data vector is 1 2 up to 8.

Refer slide time :( 12:54)

And now, in case if I try to multiply this matrix by here 5, the operator that you have to use here is star (*); this is the same operator that was used for multiplication. So, remember: when you are trying to multiply a matrix by a scalar, then the operator is only star. When you are trying to multiply a matrix by a matrix, then we will have a different operator. So, in this case, if you try to see here, if your x matrix is like this

and if you try to obtain here 5 into x then you can see here that this element is multiplied by 5 this element

is multiplied by 5 this element is multiplied by 5 and each and every element is multiplied by 5 and here

you are getting the outcome 1 into 5, 3 into 5, 5 into 5, 7 into 5, 2 into 5, 4 into 5, 6 into 5 and 8 into 5 that

is 40.

Refer slide time :( 13:59)

And here is the screenshot of the same thing. So, I will try to show you here it on the R console also.

Refer slide time :( 14:16)

Now you can see here this is your here x and now you are trying to make it here 5 into x, so you can see

here that every element has been multiplied by 5. And now I'm trying to consider the multiplication of a

matrix by a matrix. So, you already have created a matrix here x and you already have created a matrix here

transpose of x; that is already there, so I would like to utilize that. Right. So, now I am trying to multiply the transpose of the matrix x with the matrix x, and now you can see here this is the operator. So,

this is a matrix and this is a matrix. So, when you are trying to multiply a matrix by a matrix of suitable order, then you have to use the operator percentage star percentage, %*%. And remember one thing: all those rules for matrix multiplication from mathematics have to be satisfied here. For example, if you have here two matrices A and B, they can be multiplied only if their orders are compatible:

if A is of order m cross n, then B has to be of order n cross p. So, these two orders have to match, otherwise this won't be valid. So, this is how we try to do it here, and I will try to show you on the R console also.
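Assuming x here is the row-wise matrix with data 1 to 8, a compact sketch of both kinds of multiplication:

> x <- matrix(nrow = 4, ncol = 2, data = c(1:8), byrow = TRUE)
> 5 * x            # scalar times matrix: element by element
     [,1] [,2]
[1,]    5   10
[2,]   15   20
[3,]   25   30
[4,]   35   40
> t(x) %*% x       # (2 x 4) %*% (4 x 2) gives a 2 x 2 matrix
     [,1] [,2]
[1,]   84  100
[2,]  100  120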

Refer slide time :( 15:40)

And if I try to show you here, you can see that t(x) %*% x will come out like this.

Refer slide time :( 15:48)

Now I will try to pick here some more examples to make you understand. And suppose I try to take here

say here two other matrices, which are just matrices of order two by two and whose data values are

1 2 3 4 and data has been entered by putting the parameter byrow is equal to true. Right. So, in case if you

try to insert that data, the outcome will come out like this: there are 2 rows, 1 and 2, and 2 columns, 1 and 2, and the data here is entered by row, 1 2 and then 3 4. And similarly, I try to take another data set

11 12 13 14 and on the similar lines I try to create here another matrix of 2 by 2 and I call it here as a z. so,

this matrix will look like this. The data will be 11 12 13 14 and so, now I have here 2 matrices of order this

2 by 2, and I will try to show you how to multiply them. Right. So, you can see here there are two matrices here, y and here z, and if I try to multiply here y percentage star percentage z, you get here like this. Right. And here is the screenshot of the same operation for your understanding.

Refer slide time :( 17:27)

And this is again the same screenshot, which has been obtained over here. Right?

Refer slide time :( 17:29)

And now after multiplication I would try to address the addition and subtraction. Addition and subtraction

is quite simpler, simpler in the sense that you have to use the same operator that you had used earlier. For

addition you simply have to use the plus operator and for subtraction you have to use the minus operator.

But again I would repeat that all the rules of matrix operation they have to be satisfied here before you try

to do any matrix operation. And this is pretty common that once you are trying to handle a complicated

structure where you are trying to deal with the various matrices many times the orders of the matrices do

not match and this gives you error that you have to actually see what is really happening. So, just be careful

so, when I am trying to add here two matrices here A and B I will assume that they have got the same order

say m cross n, and when I am trying to subtract here two matrices, I will again assume that they have got the same order, that means the same number of rows and the same number of columns. Now I'm trying to consider

here the same matrix which I had created earlier and you can see here that this was another matrix which I

had created five into x. so, now I have here two matrices here x which is of order 4 by 2 and then I have

here another matrix here 5 into x which is of order 4 by 2. So, I can add it together so now I try to add x

and 5 x, and you will see here what will really happen: all the corresponding elements of x and 5 x will be added. And similarly, in case if I try to do subtraction here, then the corresponding elements of the two matrices will be subtracted. So, using the plus operator I can do addition and using the minus operator I can do subtraction, which is pretty straightforward. Right. So, I can show you here on the R console also. So, now let me take here one more example of the same matrices y and z that we have created earlier, to show you the addition and subtraction operations.
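A minimal sketch of these operations on the matrices y and z from the example:

> y <- matrix(nrow = 2, ncol = 2, data = c(1, 2, 3, 4), byrow = TRUE)
> z <- matrix(nrow = 2, ncol = 2, data = c(11, 12, 13, 14), byrow = TRUE)
> y + z
     [,1] [,2]
[1,]   12   14
[2,]   16   18
> y - z
     [,1] [,2]
[1,]  -10  -10
[2,]  -10  -10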

Refer slide time :( 19:51)

So, if you try to see here earlier I had created these two matrices, y and here z of order 2 by 2 the y was

order 2 by 2 z was order 2 by 2 using the data set 1 2 3 4 and 11 12 13 14. Now in case if I try to make it

here addition so you can see here or rather I can show you this 1 and 11 will be added 2 and 12 will be

added 3 and 13 will be added and 4 and 14 will be added and the same thing will happen to the subtraction

also. Right. So, these are the operations here: if I try to use the addition operator and if I try to use the subtraction operator, two matrices of the same order are added and subtracted. Right. So,

before I go further, let me try to show you it on the R console also. So, you can see here, okay, first let me go through here x, and you can see here this was our matrix x, and now you want to do here x plus 5 into

x so, this is here something like this and if you want to see here 5 into x is what here so it is like this so,

you can see here that x and 5 x are added together. Right. And similarly, if you try to see here, this is your x and your 5 x is like this, and now I try to make here the subtraction 5 x minus x. So, you can see here that the corresponding elements have been subtracted, and if you try to make it x minus 5 into x, then again all the values, with a negative sign, will occur. And similarly, if you try to recall, I

had created this y matrix and z matrix so, I can make it here y plus z you can see here this is like this I mean

the corresponding elements are added and if I try to do subtraction y minus z then the corresponding

elements are subtracted. So, you can see here that it is not really a difficult operation. Now I would like to

address

Refer slide time :( 19:51)

here another issue that once you are trying to deal with the vectors and matrices then sometimes you need

to access a particular part of the matrix; that can be a particular row, that can be a particular column, or that can be a particular submatrix. Then how to do it? So, I will try to show you here: suppose I try to create here a matrix, which is the same matrix which I have created earlier, and suppose I want to access the third

row so, my matrix has been given by say here by the name x one, two, three, four, five, six, seven, eight

and I want to use here or call that third row. So, in that case I simply have to write down the name of the

matrix and then I have to write down the address, address you can see here this is the address of row and

here this is the address of the column, which is actually left blank here. So, as long as you give the column address as blank, this will indicate that the entire row is needed. And it is not difficult to remember, because

if you try to see this 3 comma blank inside the square brackets this is the same address which is given over

here. So, that is not even difficult to remember. Sometimes people do get confused about where to put the blank, on the row side or on the column side; so, don't worry for these things, you simply look at the matrix, look at the row or column that you want to access, and simply try to give the same address as given

in the matrix. And similarly in case if I want to access the second column so, second column here is like

this, you can see here it is consisting of the values 2 4 6 8. So, again the address of the column here is a blank, then a comma, and then 2, inside the square brackets. So, I can write down here the same address: the matrix name, and then inside the square bracket this is here the row address, which is left blank, and this is here the column address. And this value will come out to be here 2 4 6 8. Yes, here you have to be a little bit careful: I am trying to call the second column, so ideally this should be printed vertically as a column, 2 4 6 8, but this doesn't happen in R. If you want to call a row or if you want to call a column, the outcome will look similar. But whether you are calling a row or a column can be decided only by

looking at the command whether you have said x inside the bracket blank space comma 3 or you have said

x inside the square bracket 3 comma blank space. Right. And I will try to show you it on the R console

also, but before that let me show you something more. Suppose I want to recall or access a submatrix of a matrix; a submatrix of a matrix means a particular section of that matrix. Suppose, from the same matrix, I want to recall only a certain part and leave the rest: so this is the submatrix which I want to call. This is pretty simple; always remember one thing: whatever you want to recall, just give the correct address. Right. So I can use x, and now I have to give the address, the rows and columns which I want to choose. I am choosing the rows 1, 2 and 3, so I can give 1 colon 3, and what about the columns? I am choosing two columns, the first and the second, so I use 1 colon 2, and this is my address. As soon as I write this down and press enter on the console, I will get the same submatrix. So you can compare: this is the same matrix, and this is the screenshot. So, I will show you these things on the R console here.
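Before the console walkthrough, here is a minimal sketch of the indexing commands described above, again assuming x was created with byrow = TRUE so that its rows are 1,2 / 3,4 / 5,6 / 7,8:

x <- matrix(1:8, nrow = 4, byrow = TRUE)
x[3, ]        # third row: 5 6 (a blank column address means all columns)
x[, 2]        # second column: 2 4 6 8 (printed flat, just like a row)
x[1:3, 1:2]   # submatrix of rows 1 to 3 and columns 1 and 2
x[3:4, 1:2]   # submatrix of rows 3 and 4: 5 6 / 7 8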

So, I take x, the same matrix that we considered earlier. Now suppose I want to recall the second row: you can see I simply type nothing after the comma, and I get the second row, 3 and 4. Similarly, if I want the fourth row, this is 7 and 8. And if I want the first column, then I leave the row address blank, type comma 1, and this gives me the first column, 1 3 5 and 7. But again, you can see that the printed structures are the same, so from the output alone you will not be able to decide whether you have recalled a particular row or a particular column; only by looking at the addresses, or the structure of those addresses, can I find out whether I have recalled a column or a row. Right. Similarly, in case I want a submatrix, suppose one consisting of the first three rows, 1 2 3, and two columns, the first and the second; then you can see I am getting the same thing. And if I want only the first two rows and the first two columns, then I can give the row address as rows 1 and 2 and the column address as columns 1 and 2, and you will get the same matrix; from here you can see this is the submatrix which you have obtained. Suppose I want another submatrix, consisting of the third and fourth rows and the first and second columns. So I can write x and, inside the square brackets, the rows 3 colon 4 and then the columns 1 colon 2, and you can see that you are getting 5 6 7 8, and this is the same

matrix which you obtained here. Right. So, I have tried my best to consider only those commands related to matrix theory which are going to be useful for us. But besides these, most of the matrix operations are possible in R, and built-in functions are available; for example, if you want to find the inverse of a matrix, there is the command solve (s-o-l-v-e).
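A minimal sketch of solve is given here; the matrix m is a hypothetical 2 x 2 invertible example, not one taken from the lecture.

m <- matrix(c(2, 1, 1, 3), nrow = 2)  # a small invertible matrix
minv <- solve(m)                      # inverse of m
m %*% minv                            # matrix product; returns (numerically) the identity matrix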

But the list of such built-in functions is very, very long, so I would leave it up to you: whenever you want to use a particular operation related to matrix theory, please consult a book or the R software help menu and see how that matrix

operation can be done. And I would like to stop here, and I would request that you please practice more so that you get more conversant with these things; from the next lecture we will start with the statistics part. So, enjoy the course, practice it, and I will see you in the next lecture. Till then, good bye.

Lecture 06

Introduction to Descriptive Statistics - Objectives, Steps

Welcome to the lecture on the course descriptive statistics with R software. Now you may kindly recall

that in all the earlier lectures, we had discussed and we had an idea that how R software is going to help

us in different types of computations. Now from this lecture we are going to start the discussion on the

topics of Statistics. But, here I would like to tell you one thing or I would like to clarify one thing. The

topics in descriptive statistics, whatever I am going to consider here, they are pretty elementary and I will

try to go to the depth, as much as possible, under the given time frame, but my idea is not really here to

teach you statistics. My idea here is that most of the topics, you will see, you already know. My objective is to make you comfortable, so that in case you want to use the R software for the computation of those topics, you are comfortable and confident. Once you are confident in handling the basic topics, I am sure there should not be any problem in handling the advanced topics in statistics. Besides this, people use these statistical tools very often, but sometimes they don't know why they are using them; sometimes they don't know the interpretation of different quantities. So that will be my other objective: for whatever tools of descriptive statistics I am going to handle, I will try to discuss their concept, their implications, and their computation using the R software. Right. So, in this lecture I am definitely not going to use the R software for any computation, but I will give you an overview of what descriptive statistics is, how it helps, and the ways in which it can give us different types of

hidden information in a given set of data.

Refer Slide Time :( 2: 49)

So, I will be considering only the basic aspects in this lecture, and possibly in the next lecture also. Now, one of the basic, fundamental questions here is: what really is statistics? I would say, simply, that statistics is a science which tries to take out the information contained inside the data and converts it into a form which is useful for making a decision. This decision can be at your office level, at the level of policy formulation, for forecasting, at the country level, or anything else. So, what are we going to do here? We are trying to collect the data, we are trying to analyze the data, and based on that we will try to make some statistical inferences, and those inferences are drawn from the numerical facts which we are going to call data. So, data is a very important thing in statistics; it gives you some information, and this information is hidden inside it. And what is descriptive statistics?

Refer Slide Time :( 4: 03)

This is essentially the starting point for knowledge discovery on the basis of data. Yes, the discovery of knowledge can be done in different ways, by looking at a picture, by reading some subject, and similarly statistics also gives us tools to discover the knowledge that is contained inside the data. So, data is essentially a very important source of information; data contains much information inside, but the problem is the following: if I know something, then I can speak and inform you, but data cannot speak, data cannot listen, and data cannot understand the language we speak. For example, if I am speaking the Hindi or English language, possibly you can understand it, but if I speak in a language which you don't understand, then I will not be able to transfer the knowledge. Similarly, data has its own language. For example, you have seen that if somebody has some problem in speaking or hearing, then there is a sign language, and that language can be understood only by those people who know it. So, similarly, data also has a language, which is based on different types of symbols, notations and interpretations. So, our objective here is that we want to know the tools by which we can draw out the information contained inside the data using the tools of descriptive statistics.

Refer Slide Time :( 5: 53)

So, essentially I can say here that statistics is a language of data, and it provides a scientific way to extract and retrieve the information hidden inside the data. And remember one thing: statistics cannot do any miracle. Sometimes, you might have heard, people make different types of jokes or comments on statistics, that statistics lies or something like this. So, I would like to inform you that statistics never lies. Right. Statistics is simply based on the data, and it is our capability that decides how much information we can retrieve from inside the data; for example, if somebody is speaking to us in a sign language, then it depends on our capability how much I can correctly interpret it. So remember one thing: statistics also cannot change the process or the phenomenon. Whatever process is happening, that will happen, and as a statistician I am not allowed to alter or change the process; I simply have to collect the data on the basis of the process which is happening, which is continuing, and on that basis I have to take a call, I have to take a decision, as to which tool is the most appropriate in this situation to draw or extract the information hidden inside the data. Right. And definitely, the inferences we are going to draw on the basis of a statistical tool are going to be used for different purposes; for example, one of the basic purposes is forecasting. Statistical tools provide forecasting, but remember one thing: this is not like an astrologer's parrot, where the parrot chooses a chit and then one reads what one's future is. Statistics does forecasting, but on the basis of scientific principles, the principles of statistics. Now, whenever you come to this aspect, that you have to choose a tool on a data set to extract the correct information, then in this process I can divide the entire scenario of decision-making into two parts.

Refer Slide Time :( 8: 19)

One part I would say is this: suppose my data is here, and there are two options, that the data is correct, or the second option, that the data may be wrong. What do you really mean by the data being correct or wrong? Suppose I want to know the average height of the students in a class; then obviously I have to collect the data on the height. But suppose I am collecting the data on the weight, and based on that I am trying to infer about the height; then it is not appropriate. Right. So, in that sense, I am trying to say that the data has to be chosen correctly, matching the objective of the study. The second aspect is the choice of tool. There are many, many statistical tools which are available, and we are going to study them in the further lectures. But the main thing is this: one has to choose the correct tool to diagnose the problem and to solve the problem. Just like if you go to a medicine shop or to a doctor, there are thousands of medicines, but what does the doctor do? The doctor decides which medicine is most suitable for the given problem. Similarly, in statistics also we have many, many tools, and we need to make a decision as to which tool is going to be appropriate to draw the correct statistical inference for a given problem. So now I know how the choice of data can be correct or wrong, and how the choice of statistical tool can be correct or wrong. This gives four options. The first option is that I choose the wrong statistical tool over a wrong set of data; in this case, the decision is not going to be correct. The next option is that I choose the wrong statistical tool and the data is correct; even in this case, using the wrong tool over correct data, I will get an incorrect decision. Similarly, in case I take the wrong data and the correct statistical tool, then using the correct statistical tool over wrong data will also give us an incorrect decision. Now, the last option, which is correct, is that I have to use the correct statistical tool over the correct data. So, this is the only option: unless and until you choose the correct statistical tool and you collect your data correctly, you will not be able to get the correct statistical inference out of it. The rule

is very simple: garbage in, garbage out. Right. And one thing I would like to mention: sometimes people come to us and ask us to do the statistics very quickly, for example, 'I have to submit my thesis tomorrow, I have this data, please try to help me.' Well, it is not so simple at that stage, because you are the one who has understood the entire process, and as a statistician I have not been told about your

process. So first you need to explain the entire process to me; then I will try to understand the data generating process, and only then can I do something. It is also possible that the type of tool which you need may not always be available but needs to be developed. So, statistics always needs some time to understand the phenomenon. Now, another popular question which I am asked: when I come to the aspect of descriptive statistics, there are two types of tools. One is graphical tools and the other is analytical tools.

Refer Slide Time :( 12: 23)

So, there are different types of graphical tools, such as two-dimensional and three-dimensional plots, the scatter diagram, pie diagram, histogram, bar plot, stem-and-leaf plot, box plot, etc.; there is a long list. In this case people do ask me which of the graphs is more suitable, or they also feel that if they use a larger number of graphs then their analysis is going to be better. I would say this is only a myth: you simply have to choose the correct graph and use the correct number of graphics. So, the appropriate choice of graph

and the appropriate number of graphics is what is going to help you in getting the correct information; a small sketch of a few of these graphical commands in R is given below. Similarly, I then come to the aspect of the analytical tools,
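As a small sketch, here is how a few of these graphs can be produced in R; the data vectors marks and grades are hypothetical, chosen only to show the commands.

marks <- c(55, 62, 48, 70, 66, 58, 73, 61)   # hypothetical numerical data
hist(marks)              # histogram
boxplot(marks)           # box plot
stem(marks)              # stem-and-leaf plot
grades <- c("A", "B", "B", "C", "A", "B")    # hypothetical categorical data
barplot(table(grades))   # bar plot of the frequencies
pie(table(grades))       # pie diagram of the same frequencies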

Refer Slide Time :( 13: 13)

the analytical tool, I can say there are different aspects, on which, we try to analyze the data, for example I

would try to find out what is the central tendency of the data? What is the variation in the data? And what

is the structure of the data? And what type of relationship are existing inside the data? So, for example,

when I come on the aspect of measure of central tendency then we have different tools mean, median,

mode, geometric mean, harmonic mean, quantiles, etc. and similarly when we are trying to, understand the

variability in the data, then we have different types of tools, variance, standard deviation, standard error

mean deviation, absolute deviation, range, etc. So, you can see that in this case there are two aspects: one is the central tendency of the data, and the other is the spread or the variation in the data. Now, in case you want to study the central tendency of the data, then out of this list you have to choose the

appropriate tool, and similarly, in case you want to study the nature of the variation in the data, then you have to choose the appropriate tool for that. And similarly, in case you want to find out the structure of the data in terms of symmetry, peakedness, etc.,

Refer Slide Time :( 14: 26)

then you have to choose a proper tool for symmetry, and there are the concepts of skewness and kurtosis; these concepts are going to give us more information about the structure of the data. And

then, another aspect can be, if I have, a data on say more than two aspects something like height or weight

or say height, weight and age, then there may exist some relationship in the data also or there may exist

some coherent structure inside the data. Then, in order to study those aspects, we have the tools of the correlation coefficient, rank correlation, multiple correlation, partial correlation coefficients, correlation ratio, intra-class correlation coefficient, linear regression, nonlinear regression, etc. So, there is a long list.
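As a small preview, here is a sketch of how some of these analytical tools are computed in R; the vectors x and y are hypothetical data, and the individual tools are discussed in detail in the later lectures.

x <- c(12, 15, 11, 19, 14, 16, 13)   # hypothetical data on one variable
mean(x)      # a measure of central tendency
median(x)    # another measure of central tendency
var(x)       # variance, a measure of variability
sd(x)        # standard deviation
range(x)     # smallest and largest values
y <- c(24, 31, 20, 40, 29, 33, 25)   # hypothetical data on a second variable
cor(x, y)    # correlation coefficient between x and y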

Refer Slide Time :( 15: 12)

So, when we talk about descriptive statistics, descriptive statistics is not a single tool; it is a collection of an appropriate number of tools, which may include graphical tools as well as analytical tools, and the choice of analytical tool also depends on what exactly you want to study. Many times people come to us and ask, 'Sir, can you please do some statistical analysis on my data?' In that case I would always request them: please let me know what you really want to know from this set of data. Based on that, I am going to take an appropriate decision and decide which of the
statistical tool can give an answer to your query, and then I would try to use it. So another question crops up here: which of the tools is the better option, the graphical tools or the analytical tools? My suggestion is: please use both of them, because, if you try to see, descriptive statistics is the starting point for any analysis. What do you have in your hand? You simply have a set of data; data are some numerical

values. So, you can always imagine that in front of you, there are 20 values, there are hundred values, there

are 2,000 values or there can be two million or so two billion values. And all those values are sitting silently

and you are the one, who is going to, start the knowledge discovery on the basis of given set of data. Right.

So, I would say, don't make a rule, but depending on the condition, try to use both types of tool, later on in

these lectures, I will show you that, how the graphical tools and, and how these analytical tools can be used,

under what type of conditions, and how they can be computed with the R software. Now, what is the difference between the use of a graphical tool and an analytical tool? Graphical tools provide us a visualization; this gives us first-hand information. And what about analytical tools? They give us the information in quantitative form. So, graphics will

give us information, but we have to look into this and then we have to draw a proper statistical inference.

And this analytical tool will give us a number, which we have to interpret to make a correct statistical

inference. So, I would say usually, or in most of the cases, the graphical tools and analytical tools, both

work together because the process is the same, data is the same and data is never telling you, that please

use, only the graphical tool or please use, only the analytical tool, this is only you, who is going to take a

call that whether graphical tools have to be used or say, these analytical tools have to be used. So, please

try to make an appropriate decision keeping in mind the objective of your study and the type of information

which is contained inside the data. And why does statistics come into the picture? Statistics comes into the picture

Refer Slide Time :( 18: 40)

because variation always exists in all the process. What do you mean by variation? For example, if I say,

suppose you try to take a plot and you try to put, say this hundred grams of seeds and say after a month,

you will get a crop and suppose you will get, one kilogram of seeds. Now in case, if you try to repeat the

same thing, try to use the same plot or the same sizes of plots and put the same quantity of seeds, do you

think that in all the plots you are always going to get exactly one kg of yield? This is practically very difficult; there will be some difference: one plot may give you one kg, another plot may give you one point one kg, and another plot might give you 900 grams, and so on. So, the variation always exists in all processes, and in statistics our basic objective is that we want to understand the process of this variation, we want to control this variation, and we want to draw a statistical inference out of the data with minimum variability. So, this is one of the basic objectives. So, in statistics, or in descriptive statistics, what

we really want to do, we have a set of data, now we are going to use that data, on a statistical tool, that may

consist of say analytical tool, as well as graphical tool. Now I will be getting some information from the

graphical tool, I will be getting some information from the analytical tool and now, this is my responsibility,

that I have to combine the information coming from both the aspects together and I have to convert it, into

a piece of information, which is useful, which is interpretable or that can be conveyed, to the experimenter,

who might not have any knowledge about the statistics. Right.

Refer Slide Time :( 20: 50)

So, I can say, that using the information gained by the tools, of descriptive statistics and combining them

together, to reach to a meaningful conclusion, to depict the information hidden inside the data, is the

objective of any statistical analysis and proper interpretation of those outcome is very important and all

these outcomes, all these inferences, are made only on the basis of data. So the next question comes: how does this data arise? There are two kinds of processes: a deterministic process, and a non-deterministic process. A deterministic process means you know the outcome in advance, while non-deterministic processes are those where you really do not know the outcome in advance. In statistics, whenever we are trying to understand the data generating process, the

data generating process is always random or say, non-deterministic and that is why, the role of statistics

comes into picture, once there is no random variation things will become purely mathematical.

Refer Slide Time :( 22: 06)

So, a simple question that arises here, why should we collect the data? So data is collected with different

types of objective. First is, to verify the theoretical findings, for example suppose if I say, in children the

height increases as the, weight increases or the weight of that child increases as the height increases, suppose

that is my theoretical finding and if I want to verify it whether, this is really happening in real life or not?

So, I need to collect the data and I need to verify this finding, secondly I have some objective and I really

want to know, the outcome of that process, so I have to collect the data, which is being generated from that

process, and then I have to use a statistical tool to get the correct statistical outcome. And remember one thing,

the inference what we are going to draw, that is just on the basis of the collected data, you cannot argue,

that some statistical inferences is coming and which is from some other source, beyond the data. So,

particularly when we are talking of the tools of descriptive statistics, we try to speak of information or we

try to report the information, which is coming only from the given set of data. And yes, the information

which is coming from this data we try to convert into the form of statistical inferences, which are further used in the development of statistical models, which are used for policy decisions, classification, forecasting

and many, many other things possible.

Refer Slide Time :( 23: 55)

Now, in case you want to make a study or a statistical experiment, what are the steps involved? Right. The first step is: please identify the objective of the statistical analysis, which is missing in most of the cases, in my experience. Right. People simply try to collect the data, and after collection of

the data, many times they try to decide what type of statistical inferences they can draw from this data?

Well that is not bad, but at least my suggestion is that before you collect the data, please try to, decide the

objective of your study and try to ask, why I am collecting the data and based on that you have to take

further steps. Okay. How to get the data, the data can be obtained from a laboratory experiment, from a

survey, from some primary sources, or from some secondary sources, called as, ‘Primary Data or say,

Secondary Data’. But whatever is the data, I am not bothered about that, my objective is this, I have to use

the, correct statistical tool and I believe that by using the correct aesthetical tool, I can get a correct, correct,

valid and meaningful interpretation of the result,

Refer Slide Time :( 25: 16)

that is my belief and with this objective I am moving forward. So, the next question comes, what is an

observation? So, the unit, on which we try to measure the data, that is called an ‘Observation’. What does

this mean? Suppose I want to measure the height, so first I have to decide height of children or say height

of elders, suppose I decide, that I need to measure the height of the children between the ages, say 5 years,

to 7 years. So, what would I do? I will try to collect some children whose ages are between five years

and seven years and then I will try to record the heights of those children. So, the heights of those children

will be some numerical value and they will be called as an, ‘Observation’. So, similarly in case if I want to

find out the number of persons, number of cars, monthly expenditure on say food in any family, then these

are also my observations, which are trying to cater, to some objective of statistical analysis. Next definition

which I would like to discuss here is: what is called a population? The collection of all the units is called a ‘Population’. For example, in the earlier example, when I wanted the data on the height of children between five and seven years, you were trying to collect the data only on some of the children. But do you

think, they are the only children, no, there are many, many children in that city, in that locality, in that

country. What, what are you trying to do? You are simply trying to choose, some of the children and then

you are trying to record the data. So, the collection of all the data, that can be locality, that can be city, that

can be country, that depends on the objective, that is called as a, ‘Population of children’ whose ages are

between five years and seven years.

Refer Slide Time :( 27: 30)

Similarly, suppose I want to find out the average age of the, of all the female students, in class ten, in a

school, on the basis of a sample. Then my population is going to consist of all the female students in class

ten, in that school, that will be my population. But if I want to study, the average age of all the female

students in that city, then all the female students in that city, who are studying in class ten, that will consist

of my population.

Refer Slide Time :( 28: 07)

And similarly in say, in another example if my objective is that I want to know, that how many female

employees have salaries more than the, male employees in a given company, on the basis of a sample. So,

in that case, my population will consist of all, the female employees in the company. From there I will try

to choose, a small sample.

Refer Slide Time :( 28: 30)

Now, the next question is: what is a sample? A sample is only a subset of the population. A basic question comes: why do we use a sample? Well, that is the main objective and main advantage of using statistics.

We are always interested in finding out a statistical conclusion, which is for the entire population, maybe

of country, maybe of city or may be of village and there will be large number of people, who have to work

to collect the data, that is very difficult. So, the advantage of statistics is that, that the statistics says, that

instead of collecting the data on the entire population, if one can collect the data, on a small fraction, that

we call a sample, then on the basis of sample of the data, the Statistics can help in getting a reliable,

statistical inferences which are going to be valid for the entire population. So, that is why collection of

sample is very important in statistics. So, whenever we are trying to collect the data, in a sample, we

believe, that whatever are the characteristics, whatever are the features, which are present in the population,

they are also present in the sample. For example, you have seen that if you go to a market and you want to

buy, some wheat and there is a bag of hundred kilogram of wheat, usually you will not open the entire bag,

but you will try to take a small sample, maybe consist of say this 20 grains, 40 grains hundred grains, you

simply try to look at those grains and based on the quality of the grains, you try to make an inference for

the entire bag, which is of hundred kg. So, now, this is my sample, sample is possibly consisting of the

grains of wheat, may be 20 gram, 30 gram, 40 gram and based on that whatever we are going to conclude,

that is going to be valid for the entire bag; it's not even just the entire bag, but, I will say, the entire wheat available in that shop. So that is why the collection of data in a sample is very important, and we believe that the data has been collected in such a way that the sample is representative. This will be our basic assumption, and it goes without saying that in all statistical analysis the sample

is representative, what is the representative sample? That means all the characteristics which are present in

the population, are also present in the sample, for example in case if the quality of the wheat is not so good,

suppose there are ten grains which are infected by some insects. Then we assume that in the population also a similar proportion will continue: in case ten percent of the seeds in my sample, in my hand, are not good, I believe that 10% of the wheat in the entire bag is also not good. So,

this is how we go.
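As a minimal sketch of this idea in R: assuming, hypothetically, that the population units are simply numbered 1 to 100, the built-in function sample draws a simple random sample from them.

population <- 1:100          # hypothetical population of 100 units
s <- sample(population, 10)  # a random sample of 10 units, drawn without replacement
s                            # the sampled units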

Refer Slide Time :( 32: 05)

So the basic foundation is that if my sample is good, if the sample is representative, then my statistical inferences are also going to be good. And there are various ways in statistics which help us in choosing a representative, or say correct, sample; we have different types of sampling schemes, like simple random sampling, stratified sampling, cluster sampling, systematic sampling, multi-phase sampling, multistage sampling, etc., which help and guide us in choosing a correct, good and representative sample. But definitely this is not the objective of this course, and I am not going to discuss the different sampling procedures for collecting good data. So, in this lecture I have given you a brief background. It may not look very mathematical, but believe me, it is very important for us to understand what we are going to do under what type of conditions; only then will I be able to take the correct decision, and I have given this lecture with this objective only. So, I will continue with some more basic definitions in the next lecture. Try to understand this lecture, and try to create a foundation inside your mind to understand the tools of descriptive statistics, and I will see you in the next lecture. Till then, goodbye.

LECTURE-07

Introduction to Descriptive Statistics – Variables and Types of Data

Welcome to the lecture on the course Descriptive Statistics with R Software.

(Refer Slide Time: 00:19)

You may recall that in the last lecture we had a small discussion on different aspects of

descriptive statistics.

Now from this lecture I will be moving towards more mathematics, well in this lecture this is a

very small amount of mathematics but my idea in this lecture is to make you understand what

are the different types of terminologies and how they are represented. Once we understand the nomenclature and how things are represented, that will help us in better understanding the topics in the further lectures.
(Refer Slide Time: 01:08)

Now let us try to start our discussion with the first topic, what is a variable? Whenever we are

trying to conduct any statistical analysis, before that, there is a collection of data, and even

before that, there is always an objective and the objective is based on the research problem, or

in simple words, what we really want to know. Once I decide this question what we really want

to know based on that I try to collect that data on the relevant variable from a population, and

then I tried to collect the data on that aspect which I want to know and this aspect in simple

language is a variable. So I can briefly say that once a research question is fixed and the

population of interest is identified, then we try to collect the data on something, data on what?

We tried to collect the data on a statistical variable, what does this mean? That whatever is my

132
objective based on that I will try to collect the data on a relevant question which is going to be

interpreted by that variable.

Before I go further I must tell you that there is a strong mathematical definition of these

variables, random variables what we are going to discuss in the further slides, but here my

objective is not to go to that mathematical level, my modest objective is that for a beginner how

the things have to be understood, for example I will be dealing with the definitions of

continuous random variable, discrete random variables, and if you come on the area of measure

theory in the statistics there is a hardcore mathematical definition of these concepts, but

definitely my idea here is to give you a flavor or make you understand, what are these things

and how they are going to be used in the collection of the data, so please keep this thing in mind

once you try to understand these topics, right.

So I can say here whatever is the information in which we are interested that is captured inside

that, or through a variable,

(Refer Slide Time: 00:47)

Now, in statistics there is a convention that these random variables are always represented by capital letters, and in our case I will usually denote the random variables by X, Y, Z, etc.; when typed, they are usually set in a mathematical mode, which is a sort of italics mode. Right, so this is what you have to keep in mind: whenever you see a capital letter, usually that is indicating the variable.

The number of variables can be one, or there can be more than one also, so whenever we are

trying to deal with one variable, then the statistical analysis is usually called as univariate

statistical analysis or univariate analysis, and whenever we are trying to deal with more than

one random variables at the same time then we call it as multivariate analysis or multivariate

statistical analysis.
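As a small sketch of this distinction in R, with hypothetical values: a single data vector corresponds to a univariate analysis, while a data frame holding several variables together corresponds to a multivariate analysis.

height <- c(150, 160, 155)                   # one variable: univariate data
hw <- data.frame(height = c(150, 160, 155),  # two variables recorded together:
                 weight = c(45, 52, 48))     # multivariate data (hypothetical values)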

(Refer Slide Time: 04:58)

Now what is the role of variable? The observations are collected on the variable, now I’ll try to

take some examples to make you understand that how the variables are defined and how the

observations are collected on them. Suppose I want to know in some college that how many

male students, how many female students and how many transgender students are there, so in

that case my variable will be gender, gender of the student, and this I will denote as say here

capital X, and I’ll type it as italics like as here, and this variable will take three possible values,

one will be here, the student can be male, the student can be female or the student can be

transgender.

Similarly, in case if I want to denote or if I want to study about some country in Asia, then I can

also define that my variable here is say country in Asia, and I will denote this by here capital X.

(Refer Slide Time: 06:30)

Now this capital X can take different values, the countries in Asia can be India, they can be

Bangladesh, this can be China, this can be Thailand, this can be Bhutan and so on. So now you

can see I have done here two things, I have defined the variable and I also have given an idea

that the variable can take different possible values, right.

Similarly, to take another example, suppose X denotes any odd number; what are the different possible values X can take in this case? The odd values can be 1, 3, 5, 7, 9 and so on. So, if you look at this example, I have given you one more aspect: the number of values which a variable can take can be finite or not, so this is what you

have to keep in mind when we are going into the further lectures. And in this example, I have

defined the random variable as X which you always have to keep in mind.

(Refer Slide Time: 07:53)


Now what will be our next step? Once I have defined the variable, then I would try to draw a

representative sample or simply a sample from the population, and whatever are my sampling

units, I’ll try to collect or I will try to record data on them. For example in case if I want to

record the ages of children, in the age group of 5 to 7 then suppose I try to choose say 10

children whose ages are between 5 and 7 years from a population of that city, of that country, or

that state whatever you want, and now I would try to record the age of those children. So the

value of the age that will always be denoted by the corresponding small letters, so if I try to say

here I have denoted the random variable by X, then the values which X is going to take they are

denoted by small x like as here, so the value of the ages of those children will be denoted as

small x.

And suppose if I want to find out the average height of the students in a school, then my

objective will be to collect the data on the height of some students of that school, so I would try

to represent here capital X to be the height of the student, and whatever are the values of

heights, heights of students this I’m going to denote by here small x, so I can now say here X is

going to denote my height, small x is going to denote my values of height, and now I start

collecting the data,

(Refer Slide Time: 10:04)

Suppose I take two students, student number 1 and student number 2 and I try to measure the

height, and again I try to measure the height of the second student. So suppose I find that the

height of the first student is 150 centimeters and height of second student is 160 centimeters.

(Refer Slide Time: 10:44)

Now how to denote this 150 and 160 that is the question which I’m going to now address here.

(Refer Slide Time: 10:54)

So height equal to 150 centimeter and height equal to 160 centimeters, they are the two values

of the variable height, these are the two values of X.


And values of X are denoted by small x, so now I can denote here that the first value which is

here 150 centimeter this is here the value of random variable x and this is the first value, so I

can denote it here say x1 and I can write down here x1 is equal to 150 centimeter.

Similarly, the height of the second student or the value of the height of the second student, this

is x and this is my second student so I can write down here x2, and this I can write down here

160 centimeters, right,

(Refer Slide Time: 11:58)

so you will see in statistics a very common sentence: let x1, x2, ..., xn be a sample from some population. Once I write 'let x1, x2, ..., xn be a sample or a random sample', what does this mean? It simply means that x1, x2, ..., xn are some numerical values, nothing more than that. And they are the numerical values of what? They are the

numerical values of the data recorded on the variable X. So, if I say X is height, and suppose I have collected 20 students and recorded their heights, then the values of those heights are going to be denoted by x1, x2, ..., x20. So this is the simple interpretation of this notation, which is going to be used in all the further lectures.
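A minimal sketch of this notation in R, using the two heights from the example plus some hypothetical additional values:

x <- c(150, 160, 148, 155, 162)  # observed values; x1 = 150, x2 = 160, and so on
x[1]        # the first value, x1
x[2]        # the second value, x2
length(x)   # n, the number of observations in the sample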

(Refer Slide Time: 13:15)

Now the next aspect which I’m going to address about the variables is that there are two types

of variables: one is the quantitative variable, and the other is the qualitative variable. Under the quantitative variables we have two types: one is called the discrete variable, and the other is called the continuous variable. Once again I would like to reiterate here that there is a strong

mathematical definition of these quantitative variable, discrete variable, continuous variable,

qualitative variable in a statistics, but definitely my objective is not here and I’m not going into

that detail, but my simple objective is to make you understand that given a situation when you

are trying to collect the data, the corresponding variable will be a discrete variable, continuous

variable or qualitative variable or a quantitative variable. Once you are able to judge this thing,

then after that you will see that the tools for example quantitative variables, qualitative

variables they are different, the statistical tool for discrete data for the continuous data they are

also different, so that will help you in choosing the correct and appropriate statistical tool, okay.

(Refer Slide Time: 14:40)

So first I address the aspect of quantitative variables. In very, very simple words, I can say, without going into the mathematical details, that quantitative variables represent some measurable quantities; measurable quantities means that the values of X, the numerical values of X,

can be obtained and once these numerical values are obtained then they can be ordered in a

logical or a natural way, possibly this is one of the most simple definition I can give you about

this quantitative variable, and what does this actually mean?

(Refer Slide Time: 15:25)

Let me try to take some example to make you understand, suppose I want to buy a shirt, and I

go to a shop then the shop keeper will ask me what is the size of your shirt, that can be 38, 39,

40, 41, 42, 43, 44, 45, 46 right, so if I try to take here the size of the shirt this can be 39, 40, 41,

42 and so on, what does this mean? 39 means there will be some dimensions, some size of the

shirt, and if I say what is the difference between 39 size shirt and say 42 size shirt? We

understand that the size of the shirt with number 42 is going to be larger than the size of the

shirt containing the number 39, so you can see here that this 39 and 42 they are representing

some numerical values and they can be ordered, that means I can always say that the size 42

shirt is larger than the size 39 shirt.

And similarly, if I take the example of the cost or the price of a vegetable, say the per kg price of some vegetable: this price can be 30 rupees a kilo, 35 rupees a kilo, 40

rupees a kilo, 45 rupees a kilo, what does this mean? Once I say you have to give 35 rupees for

one kilo of the vegetable, you know what you have to give. And in case I say that the price of the same vegetable in one shop is 35 rupees per kg, and the price of the same vegetable in another shop is 40 rupees a kg, then you can always make a conclusion by putting them into

order that the price of the vegetable in the second shop whose rate is 40 rupees a kg is higher

than the rate of the vegetable in the earlier shop where the price was 35 rupees a kg. And

similarly if I say the price of the vegetable is 50 rupees then you will say that it is more

expensive than the earlier shops.

And similarly if I want to count the number of colleges in a city, this number can be 2, 5, 10, 8,

12, 15, 20 so whatever is there, so now once again these values have some interpretation, if I

say I have two cities one say city has 10 number of colleges and say another city has 20 number

of colleges, then I can always order them and can say that the second city has more number of

colleges than in the first city.

Similarly if I try to measure the heights of the children say 1.2 meter, 1.23 meter, 1.32 meter

and so on, then you know that this 1.2 meter and 1.23 meter has some interpretation and you

can visualize these things. Suppose there are two children, and suppose you record the height of the first child as 1.2 meter and the height of the second child as 1.3 meter; then you can always

infer that the second child is taller than the first child, so this is what we really mean by

quantitative variables, I have a variable like as price, height, number etc, and I can obtain the

values, numerical values on them and those numerical values have some interpretation,

interpretation means they can be ordered and they will have some meaning in their numerical

value, right.

(Refer Slide Time: 19:25)

Now I’ll try to address the aspect of qualitative variables. This qualitative variables represent

the measurable quantities, so this is same as in the case of quantitative variable, then what is the

difference? The difference lies here that the values of the random variable or the values of the

variable which are denoted here as say small x they cannot be ordered in a logical and natural

way.

Once again, I would say this is possibly one of the simplest definitions to understand for a common person not having a statistical or a strong mathematical background. What does this actually mean? It means that the values cannot be ordered in a logical and

natural way, so let me take here some example and I try to explain it to you, now suppose I

want to collect the data on the names of cities in this country India, okay,

(Refer Slide Time: 20:28)

So I will define my variables as say here X, and now this variable will take different values, for

example Kanpur, Mumbai, Kolkata, Delhi and so on. So these values are going to be

represented as say x1, this is the first value of the variable that it can take.

Mumbai is the second value which the variable can take, and Kolkata is the third value. This variable is very well understood, but how to order these values, how to put them in a natural way? Well, as soon as I say this, you may try to associate some number with each city, but here I am not associating any such aspect with the variable; I am calling the cities only by their names. In case I associate, say, the number of persons staying in a city, then it will

become a quantitative variable, or if I associate the area under that city then this will become a

quantitative variable, but up to now this is only a qualitative variable.

Similarly if I take another example and I say I want to record the colours of the hair, for

example now they can take different values say black, so I can denote black by capital X1,

white by capital X2, and brown by capital X3 denoting the first, second and third values which

the variable can take.

Yes, I can very well understand that the colours of the hairs are black, brown, white or

something else, but how to quantify it? Unless and until I associate a degree or some measure in some scientific way, the colours will remain only colours, and there is no way that I can order them, for example to say that white is better than grey or grey is better than brown, and so on. Here I am only recording the data, so that is why the colour of the hair, this

variable is a qualitative variable.

Similarly if I take another variable here tastes of food, which can be sweet, which can be salty

or which can be neutral and so on, so I can denote the first value that the variable takes X1 to be

sweet, second value of the variable X2 to be salty, and third value of the variable to be neutral,

and these values sweet, salty, or neutral they are only the qualitative variable, they are not the

quantitative variable, I cannot say that sweet takes 20 or salty takes 30, so this is the idea

behind the qualitative variable.

And similarly many times in examination or in any competition we try to judge the

performance of a candidate by marking it good, excellent or bad. In that case this performance can be a variable, denoted by X; good is the first value which the variable takes, excellent is the second value, and bad is the third value of X. So again, good, excellent, bad:

(Refer Slide Time: 23:56)

these are only some qualitative things, I can understand them, I can observe them but I cannot

quantify them unless and until I say that if a student is in the excellent category then it is better

than the student who are in the category of good and bad, or similarly if a student is in the

category of good student then he will be considered as better than the student in the bad

category, so unless and until I make these types of rules which are again trying to denote a

quantitative way up to that point, the variable will remain only as a qualitative variable, right.

But in statistics now we have a problem, statistics work only with the quantitative data with

some numbers, so in case if I have got a qualitative variables, unless and until I associate a

number with this, I cannot operate my statistical tool, and here at this point I will try to inform

you that the statistical tool for qualitative variable and quantitative variable they are different in

most of the cases, so you have to be very very careful when you are trying to use a tool whether

you are applying it on a qualitative variable or a quantitative variable, so for example now I will

try to take an example to show you that how we try to handle the qualitative variable, by

associating a number with them. So suppose I consider the variable X as taste; taste takes three possible values here: capital X1 denoting sweet, capital X2 denoting salty, and capital X3 denoting neutral.

(Refer Slide Time: 25:44)

So what I will do is associate a number with each of these three indications: sweet, salty, and neutral.

Suppose I decide that I will assign 1 to sweet, assign number 2 to salty, and assign number 3 to neutral. But remember one thing: once I am trying to assign these numbers

1, 2 and 3, they are only indicating the category, means if I’m assigning 2 to salty and 1 to

sweet this does not mean that salty is 2 times of sweet, this will be a wrong interpretation, so

here I’m simply trying to assign a number to indicate the category.

(Refer Slide Time: 26:32)

Now after this I’ll try to address the discrete variables, in some situation the variable on which

we want to record the data, that can take only a finite number of values, and in a very simple

way or in an informal way, I can say that the variables are counted, for example in case if I

want to find what are the number of children in different families, so this number can be 1

child, 2, 3 and so on, they cannot be that a family has 1.2 children or 2.4 children, these value

will not exist or they will not have any interpretation, so in this case I am simply trying to count

the number as a whole number.

Similarly if I try to find out the number of branches of a school in a city this number can be 1,

2, 3, 4, 5, 6, 7, but this number cannot be 2.5 or 5.5 or 6.7, so in this case these values are being

counted, so all those variables where we are going to record the data on the basis of only

counting, they can be categorized as discrete variable for all practical purposes.

Now, in contrast to this definition of counting, there is another option: a variable can also take values in fractions, say 2.1, 2.2 and 2.3, and so on. Those variables which can take an infinite number of values are called

continuous variable, so basically there are two categories discrete variable and say continuous

variable,

(Refer Slide Time: 28:36)

So, for a discrete variable the values are counted, while in the continuous variable case the number of values which a variable can take can be infinite; in simple words, informally, I can say that the values are measured, not counted, and that is very important to note. They are being measured, and it all depends on how we are going to measure them, by which instrument, but that is a separate

aspect.

Now let me take an example, suppose I want to measure the length of a road, the length of a

road can be 1.5 kilometer, this can be 1.52 kilometer, or this can be 1.521 kilometers and so on,

so in this case you can continue as long as you want depending on the instrument, depending on

the length, so in this case I am measuring the value and this value can take infinite number of

values, so this type of data usually we collect under the headship of continuous variable.

(Refer Slide Time: 29:53)

Now I try to address another aspect, this is called grouped data. Suppose you have got large

number of values, then in that case it is possible to group those values in certain categories or

certain groups; then what will happen is that the nature of the exact values will be lost, and each value will be identified only by its category. Let me take an example: suppose I measure heights. These heights can be 1.5 meter, 1.7 meter, 2.2 meter, 2.5 meter, 3.3 meter, 3.6 meter, and so on.

Now I can make three groups: group 1, where the heights are between 1 and 2 meters; group 2, where the heights are between 2 and 3 meters; and the last group, in which the heights are between 3 and 4 meters. Now there are two values here, 1.5 and 1.7 meters, which are lying between 1 and 2 meters, so I can write down here 2 values.

Similarly here 2.2 and 2.5 these are two values which are lying between 2 and 3, so I can write

down here 2 values, and similarly 3.3 and 3.6 they are lying between 3 and 4 so I can write

down here 2 values. Now I will have only this information, and this information will be hidden.
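A minimal sketch of this grouping in R, using the six heights from the example: the built-in function cut assigns each value to its interval, and table counts the values in each group.

heights <- c(1.5, 1.7, 2.2, 2.5, 3.3, 3.6)      # the six heights
groups <- cut(heights, breaks = c(1, 2, 3, 4))  # intervals (1,2], (2,3], (3,4]
table(groups)   # 2 values in each group; the exact heights are no longer visible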

(Refer Slide Time: 31:50)

So now looking at say this value here 2, I cannot find whether this value was 2.1 meter, 2.2

meter, 2.3 meter or something else, I do not know, so these values will become simply

unknown to me.

(Refer Slide Time: 32:12)

So whenever we are dealing with grouped data, we have to keep in mind that the values are grouped together, their individual values are lost, and we will be working only with the data represented in the form of those groups.

Now I would like to briefly address another aspect that is called as primary data and secondary

data. So what is the difference between the primary data and secondary data? So you see

suppose I have an objective to study, and based on that either I go to the field or I ask some of

my investigator to go to the field and collect the data directly, and I try to work on this thing, so

this data will be called as primary data.

Whereas the second option is this: I can go to some offices, like the municipalities of the cities, or there are different agencies, like the National Sample Survey Organisation, who collect the data from time to time on different aspects; I can request them to give me the data, and

I try to work on that data. So this data has been collected by somebody else, either a person or

an agency and we are trying to use the data from that source, so this type of data is called as

secondary data.

(Refer Slide Time: 33:30)

So I can say here very briefly that the data which is originally collected by an investigator for

the first time with an objective to study any statistical query or statistical investigation, that will

be called as primary data.

And the data which has already been collected by some person or some agency for any

objective or say for any statistical query or for any statistical investigation, and we are trying to

borrow the data or we are trying to collect the data from their agency and then we are using, this

is called as secondary data.

Well the definition of this primary and secondary data is very relative, some data which is

primary for someone, may be secondary for say for the other person. So I’m not going into that

detail but this was just for the, for your information.

(Refer Slide Time: 34:24)

Now the next question comes, how this data comes into picture? How this primary data is

obtained, how this secondary data is obtained? So very briefly I can give you different ways in

which this type of data are collected. So in order to collect the primary data, one of the

important source is direct personal investigation that the person goes directly to the respondent,

and he or she ask the question and he or she tries to record the answers directly.

Second thing is this indirect oral investigation that the person will go, he or she will ask

different types of questions and based on that he will try to make a judgment that what is the

numerical value of the variable.

The third popular option is this: some questionnaires are sent through postal mail, email or e-forms; nowadays Google Forms are very popular, and there are some websites offering online services which try to help us in conducting a survey, so they also try to give us the primary data.

And sometimes we send our surveyors to the field and we do not allow them to ask anything; we give them a questionnaire and we ask them to please hand it to the concerned person, who will fill in the data and give it back to us. And then there are many other ways also, but this is how we try to collect the primary data.

(Refer Slide Time: 36:00)

For the secondary data, there are some published sources; for example, there are some reports or data sets available from the offices of a country responsible for sample surveys. For example, in India we have the National Sample Survey Organization and the Central Statistical Organization, and at the world level we have the United Nations and its different wings, who try to collect the data from time to time and publish it, and we can use that data.

The second option is the data which is collected by some survey agencies; we can use that. And the third thing is, for example, there are some public offices where we record our data, for

example municipalities. Whenever there is a birth of a child or death of any person, we try to go

to the municipality and we try to report it there and they try to keep the data, so this type of data

is also available in those municipalities and we simply try to take it from there.

So now, in this lecture, I have given you a background under which we will be working in further lectures. From the next lecture onwards, I will take one topic at a time; I will try to give you the basic idea. I will not be going much into theory, but I will try to explain to you with different examples what the different concepts are, and I will try to show you how to obtain those things using the R software. So now I will be going into the different tools of descriptive analysis from the next lecture. Please try to review this lecture, try to revise it, try to understand the concepts and let them settle down inside your brain, and we will see each other in the next lecture; till then, good bye.

Lecture 08

Absolute Frequency, Relative Frequency and Frequency Distribution

Welcome to the next lecture on the course descriptive statistics with R software. You may recall

that, in the earlier lectures, we had discussed different aspects related to statistics. And we have

understood that whenever, we want to do any statistical analysis, how are we going to start? And

how we are going to control the process of obtaining the observations, different types of associated

variables, discrete continuous etc.? So, now I assume that, we have collected a sample. And as I

said earlier, I will always be assuming that, my sample is representative. Which means, that all the

salient features, which are present in the population, they are also present in my sample. So now,

we are at a place, where we have collected the observations. And now, we want to move further.

So, first of all, whenever the data comes to our hand, as I told you, that there are two options, we

can start with one is graphical tool and another is analytical tool. So, first of all I would try to take

the first analytical tool, which will give us an idea, that how are we going to combine, the data

present the data and how we can do it, which will give us, some information that is contained

inside the data. And based on that, we will try to take further decisions on what type of graphical tools and tools for analysis of the data can be used. So, we start this lecture, and first we are going

to address, that once the data comes, why it needs to be classified. You can always assume, inside

your mind, that whenever you conduct an experiment, you will get the data. Now, I am also

assuming, that the data is collected on the relevant variable, which you want to study, for example,

if you want to study the height, then the data is on the height; if you want to study the weight, the data is on the weight. So, now you have collected the data; this data can be 20 observations, 100 observations, 200 observations, 2 million observations, 2 billion observations.

So, in case, if all the values are just before you, can you really get an idea of, what is the information

hidden inside it? It is very difficult. Because, as I said, data cannot speak, data cannot come outside

of your computer or outside your experiment, to tell you that. Okay. I have this piece of

information, this is only you, who has to use appropriate tool to get it out, to take it out. So, first

thing what we try to do? We try to rearrange the data in some required format,

Refer slide time: (3:32)

And for that, we would like to classify, the data into, different groups and from different aspects.

For example, I can make groups of those observations which are similar, or which are dissimilar. All those units which look similar to each other can be put into one group; similarly, all other units which look similar among themselves can be put in another group. And then, based on

that, we can extract the information, different types of information, through those groups. So, this

is what we are going to now study. So the classification of the data consists of a very simple thing,

that this is a simple process of arranging, the data into groups or classes, according to resemblance

and similarities. Okay.

Refer slide time: (4:27)

Now, what are the functions of this classification, why do we make this classification? The

biggest advantage is that this will condense the data, you can assume that on one of the walls,

different numbers are written, continuously. And if you try to look at those numbers, there are

thousands and ten thousand, 1 million numbers, you can't, get any information out of that. So, you

need to condense the data. So one of the important objectives of classification is that we would like to condense the data, to condense it in a way from which we can draw some relevant

information. And whenever you are trying to make a statistical experiment, generally your

objective is to compare something, for example, if there is a new medicine, which claims that it,

can control the body temperature for, say, 12 hours, then you would like to compare it with the earlier existing medicine, to see whether this improvement is happening or not. So,

this condensation of the data or the classification of the data has to be done in such a way which

can help us in comparing different types of things, different types of aspect, different types of

quantities, different types of natures. And in many many situations, usually we are interested in

studying the relationship, for example, modeling, statistical modeling is a very popular word.

Which is nothing but, a sort of relationship. We want to find out the relationship between, input

and output variables. So, the models cannot be obtained in a single shot, the models are obtained

on the basis of statistical data, and this is the starting point. With descriptive statistics, we try to gather the information contained in the data in small pieces and then we try to combine them

together in getting a model. So, we would like to condense the data in such a way, which helps us,

in studying different types of relationships. And the data has to be condensed in such a way or the

data has to be grouped in such a way that it is compatible with our statistical tool also. This is the thing we always have to keep in mind. Usually, what do people do? First they will collect

the data and then they will try to choose the statistical tool, what I always suggest is that, you first

try to fix your objective and then try to see, what type of statistical tool can be used to fulfill or to

give an answer of that objective and whatever is the requirement of that tool, try to collect your

data, according to that and this will help us. So, another important function of classification is that,

this should facilitate the statistical treatment of the data.

Refer slide time: (7:29)

So now, moving further, let me first introduce the basic definitions. These are absolute and

relative frequencies. One thing I would like to make it clear here, that in order to teach in this

course, I have two option, that first I try to take the theory, formulas etc. and then I try to take an

example. But, rather I would prefer in most of the situation, that I should start with an example

and then, I try to develop the theory so that, you can make a one-to-one correspondence between

the theory and the definitions. That will help you in applying or choosing the tools in R

software. Okay. So, now let me take a simple example, suppose there are ten persons who

participated in a test and their results were declared, their results were declared in two categories,

either they passed or they failed. So the candidate, who has passed, he or she has been assigned,

the letter capital P and the candidate who got failed he or she has been assigned letter capital F.

So, now you can see here that this is the data of 10 persons, who either got pass or fail, and their outcome is recorded here as, say, P, F, P, F, F, P, P, F, P, P. Well, I am deliberately taking a very small data set, so that whatever mathematical manipulations I am doing, you should be able to see them with your own eyes. But you can always think that this data can be

very very large, there can be ten thousand candidates, there can be 1 million candidate, there can

be 10 million candidates. So, now how to combine this data? How to condense this data? So, we

are going to use the concept of absolute frequencies and relative frequencies to condense the data.

And then later on, we will try to put them in some proper format for example, in the form of a

table, to get more clear information. So, now I would try to denote here two categories, there are

two categories. One is here pass and say, another is here fail. So, I can now in general, represent

these categories as say here, a 1 and here a 2. So, this a 1 category will represent the candidates

who have passed and a 2 category will represent those candidate who have failed. So, now I have

introduced here a word category. So now you can see, category contains all the observation, which

are similar to each other, for example, the category of candidate who pass this will contain all the

candidate, who have passed. The category fail, contains all the candidates, who got failed in the

test. Right. So these are called, ‘Categories’. So, I can see here, that there are some number of

candidates who passed and some number of candidate who failed. So let me count it. So firstly let

me count here, how many candidates passed, 1, 2, 3, 4, 5 and here, 6. So there are 6 candidate who

passed. What about fail? 1, 2, 3 and 4. So there are four candidates, who failed. So now, this

number, the number of candidates who passed and the number of candidates who failed? This is

denoted by, say n1 and n2, n1 and here n 2. So, this category, this is here and n 1 and this category

this is going to be denoted by here, n 2. And this number n 1 and n 2, they are simply trying to

represent the number of candidates in the category a1 and in the category a2, or simply the number of candidates who passed and the number of candidates who failed, that is, the number of candidates who belong to category a1 and to category a2. So this n1 and n2 are simply trying to represent the number of units present in the categories a1 and a2. So the number of observations

in a particular category, they are called as, ‘Absolute Frequency’.

Refer slide time: (12:20)

Now, one of the drawback, in absolute frequency is that, means, if I try to give that, Okay, there

are 100 candidates who pass; there are 300 candidate who failed. But, you are not trying to see that

how many candidates appeared. So, in order to incorporate, this feature, we have a concept of

relative frequency. for example, there are 6 candidate which passed, there are 4 candidates which

failed, I can also say, that there are 6 candidate out of 10, who passed and there are 4 candidates

out of 10, who failed. So, I can denote, the relative frequency of the class a1, to be the number of

persons in the class, divided by total numbers, in total number of observations that are available

and n1 is the number in a1. And this is denoted here as f1; f1 is going to be n1 upon n1 plus n2, that is, f1 = n1/(n1 + n2). So this is called the 'Relative Frequency'. And in this case, this number is 6 upon 10, because we have observed 10 observations, so that is 0.6, or this can be called as '60%'. So, I can say that there are

60% candidate, who passed. And similarly, the relative frequency of the second class a2, that is

denoted by f2. And the definition of f2 is similar to the definition of f1, that is, the total number of elements in the category a2 divided by the total number of observations, f2 = n2/(n1 + n2). So, that is going to be the number of candidates who failed, that is, 4/10, which is here 0.4,

or I can say 40%. So, this will give us an information, about the number of candidates, who passed

or failed with respect to the total number of candidates, who appeared in the examination. And this

can also give us the number in terms of percentage. So this is the basic idea, of the absolute and

relative frequencies. Now, next question comes, how to compute this absolute and relative

frequencies in the R software, in order to compute this absolute frequency, first we need to define

our data vector, the data vector as I said, in the earlier lecture, the data vector will consist all the

numerical values and they are combined using the c operator. Right.

Refer slide time: (15:17)

So, I can see here that any data vector, that is going to be in the format of c and then here, the value

x 1, x 2, up to here, x n. means, if I assume that there are n values, for example, in the earlier case

there are 10 observation and is going to be 10 and so on. And so, after this, the command to obtain

the absolute frequency is table, t a b l e all in small letters. So, when I try to write down, this

command table and inside the bracket, the data vector, then this will create the absolute

frequencies of the data, which is given inside the arguments, under the vector, data vector. And

suppose you want to find out the relative frequency, so now, if you try to understand, what is the

relationship between frequency and relative frequency? Frequency means, absolute frequency. The

relative frequency is, absolute frequency divided by total frequency. Total frequency means, total

number of observations. So, once I am trying to write down all the observation in a data vector,

then whatever is the length of the data vector, that is your total frequency. So, this total frequency

can be computed by the, by the command, length. a length and inside the brackets, you have to

give the data vector, say here, x. so, now in case if you want to find out the absolute frequency,

then I simply have to use the same command, table and inside the arguments, I have to give the

data vector, say here x and I need to divide all the values by length of x.. Remember one thing,

length of x is going to be a scalar value. But, now if you recall, when we did the division operation

of our data vector with respect to a scalar, then we had learned that the division happens in each

and every element of the data vector. So, this value will give us, the absolute frequency divided by

total number of observation and this will give us the information about the relative frequency.
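As a minimal sketch of these two commands, with a hypothetical data vector x of ten coded values:

    x <- c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1)  # hypothetical coded observations
    table(x)                              # absolute frequencies of each distinct value
    length(x)                             # total number of observations, here 10
    table(x)/length(x)                    # relative frequencies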

Refer slide time: (17:53)

So, now let me, try to take an example and try to show it, on the R software also. So, for example,

I will take the same example, but now, I'm doing one thing more, means, earlier all the ten

candidates, who were categorized in two categories, as pass or fail. Now, I will try to assign them,

an indicator value. Because, as we discussed in the earlier lecture, that unless and until, you try to

assign a numerical value to our data. We cannot operate any statistical tool. So now I have here

two values, one is a here pass and and there is, here fail. So, what I try to do here, for pass, I try to

represent it by number one and for fail, I try to represent it by number two. Once you do it, then

this data P that will be represented by one and this F will be represented by 2 and similarly, variable

is your P that will be replaced by 1 and wherever is your F that will be replaced by 2. So now, this

is our data. And now, I need to type this data into a data vector, before I can expose this to R

software. So you can see here, that I have, created here a data vector like this and you can see here,

this one is here this 1, this 2 here is this 2, this one is here this 1, this 2 here is this 2, this 2 is this,

this 1 is this, this 1 is this, this 2 is this, this one is here 1 and this 1 one here is one. And based on

that, I have created my data vector here, result. So you can see here, the outcome and here is the

screenshot of the same result; well, I will show you on the R console also, but before that, let me

try to explain it. And then, it will be more convenient for me to show it on the R software.
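With the coding pass = 1 and fail = 2 applied to the outcomes P, F, P, F, F, P, P, F, P, P, the data vector described here is:

    result <- c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1)  # pass coded as 1, fail coded as 2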

Refer slide time: (20:01)

Now, I will simply use my table command and I write here, t a b l e and inside the argument I try

the data vector name, that is, result; as soon as I write here table with result inside the argument, it will give us this type of outcome. So, first we try to understand, what is the meaning of this

result? You can see here, there are four numbers here, one, two, six and here, four. So, what this

one is indicating? And what two is indicating, first of all. This one is indicating, the category. Your

category one and then category two. And now, what this six and four are indicating, six is

indicating the number of elements in category one. And similarly, this four is indicating number

of elements in category two. And this is nothing but these are your absolute frequencies, six is the

absolute frequency of category one that is class1 and 4 is the absolute frequency of class 2. And

similarly, in case if you want to obtain the same result, with respect to the relative frequencies,

Refer slide time: (21:54)

Then, what we have to do? I would write here the same command table, inside the arguments the

data vector, whose name is result and I would divide it, by the length of the data vector. Once you

do it, you will get here an outcome like, this. What is this showing you? you can see here, there

are 4 values, 1, 2, .6 and .4. This 1 and 2, as earlier, they are trying to show you, the categories.

category one and category two and .6 is actually, 6/10, six here is n 1 and 10 here is n 1plus n 2,

where you can see here n 1 is equal to 6 and n 2 is equal to here 4 and similarly this, 0.4 is 4 upon

10, which is here n 2 divided by n 1 plus n2. So this, 0.6 and 0.4, these are the relative frequencies

of categories 1 and 2 respectively. And this is here, the screen shot that we are going to get, when

we try to execute this thing on the R console. So, let me now first come to R console and try to

show you here, that how the things are happening, so first I will try to create here, a vector here

result and then I would try to show here, different things.
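Continuing with the result vector defined above, a sketch of the console session reproducing the outputs discussed here:

    table(result)                 # absolute frequencies: 6 in category 1, 4 in category 2
    length(result)                # total number of observations: 10
    table(result)/length(result)  # relative frequencies: 0.6 and 0.4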

Refer slide time: (23:50)

So, you can see here, I have created here a, data vector, result like this. Now, I would try to create

or try to find out here the absolute frequencies by using the command table, t a b l e and inside the

argument, I will say, the data vector whose name is result, so you can get here, the similar output,

that I discussed, this 1, this is indicating the category 1, this 2 this is indicating the category 2, this

6 is indicating the number of elements in category 1 and this 4 is indicating the number of

elements in category 2. But, definitely this is going to give you the result, in terms of absolute

frequency, now in case if you want to have this result in terms of relative frequency, then first I

will show you, that what is the value of here, the length of result, this you can see here, this is 10

and you can count here 1, 2, 3, 4, 5, 6, 7, 8, 9 and here, 10. So there are 10 values in the data vector result. So now, if I try to write down table of result divided by the length of result, then the outcome

comes out to be like this one. So you can see here, this one is representing the category 1 and this

point 6 is trying to denote the relative frequency of class 1 and this 2 is denoting the category 2 or

class 2 and this 0.4 is denoting the relative frequency of the class 2. So, this is how you can obtain

the absolute and relative frequencies. And these absolute and relative frequencies, you can see, will be more prominent when you have a qualitative variable. Now, to give you an idea of whatever we have done here, I would like to put it in the right words. So, I had a set of data of 10 candidates in terms of two categories, a1 and a2, pass and fail. And we have found the number of persons in the passed category and the number of persons in the failed category. So then, I would say that I have rearranged the entire data set into 2 groups; so, this is an arrangement of ungrouped data into grouped data.

Refer slide time: (26:38)

This arrangement of ungrouped data in the form of groups is called the 'Frequency Distribution

of Data’. This is a standard terminology and we always call it, please create a frequency

distribution, that means, you need to group the data and then you have to condense the data,

condense means, you can see that, the, the ten data values, have been condensed only into, two

categories and they are based only on two value, six and four, where six is the frequency of class

one and four is the frequency of class two. And now, what we try to do? Whatever is the data, this

data is condensed into different groups and for that, I try to create different groups, for example,

these groups a1 and a2 are not coming from the sky; you are the one who has created these groups, say, pass or fail. So, I will try to divide the entire data into different groups, and these groups are

called as ‘Classes’. So, the meaning of a class is simply a group and for the given set of data, we

always try to create suitable number of groups. Well I'm saying here, suitable number of groups,

well, there is not a very hard and fast rule to decide how many groups there can be. For that you

have to, use your common sense and some basic information, about the experiment to decide that

how many groups can really help you in getting the data or the information which is contained

inside the data. Means, obviously if the number of group is, is too small or too large, then possibly,

it will be more difficult to handle the data. So, we need to have some suitable number of groups

that we will try to see, with some lectures in the coming slides. Now, in every group, you have to

define the boundaries, for example, in this case where we have only pass or fail, well there are no

mathematical boundaries; they are categorized only by the two categories pass and

fail. But, suppose you try to record the age or height, of say, some number of candidates, then the

age can be, 5 years, 7 years, 9 years, 12 years, 20 years, 18 years, 21 years, 30 years and so on. So,

these ages can be defined in some groups, like 5 years to 10 years, 10 years

to 15 years or this can be 0 to 10 years, 10 to 20 years, 20 years to 30 years and so on. So, in this

case, they will be representing our class, so whenever we are trying to define a class, there are

going to be two values, one is the lower boundary of the class and upper boundary of the class

Refer slide time: (29:36)

and these are called as, ‘Lower Limit and Upper Limit’. And when we are trying to find the

difference between these two limits, the lower limit of the interval and the upper limit of the

interval, then this is called as, ‘Width’, of the class or ‘Class interval’. And when you are trying to

define a class, there are two values, lower value and upper value. And this will be a sort of interval.

Now, whatever is the value in the mid of this interval, this is called as, ‘Mid Value’. And the

advantage of this mid value is that, when we are trying to group the data, the data will be scattered

over the entire interval. But, we assume that, that entire set of data is going to be concentrated only

at the mid value. So, in case, if you try to see here, what will really happen, suppose I try to take

here two intervals, say ages five years to ten years, and ten years to fifteen years. And

suppose I try to collect the data, the age comes out to be seven years, it will come here. It comes

out to be eight years that will come here. Now, it will come out to be nine years that will come

here and so on. There will be many many observations in this interval. and similarly, all those

ages which are lying between 10 years, 11 years, 12 and so on, up to 15 years, they will be lying

in this interval, so all these values are lying over here, in this interval at different locations. But,

we assume that once they are grouped, then all are going to lie in the mid of the interval and the

mid of the interval is, simply going to be five years, plus ten years, divided by two, is equal to

seven years, six months. And similarly here, the mid value is, ten years, plus 15 years, divided by

2, this comes out to be 12 years and six months. So, this is what we assumed that, the value of the

variate lies in the middle of lower and upper limits.
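The midpoint arithmetic, as a quick check in R:

    (5 + 10)/2   # 7.5, i.e., 7 years 6 months, the mid value of the class 5 to 10 years
    (10 + 15)/2  # 12.5, i.e., 12 years 6 months, the mid value of the class 10 to 15 years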

Refer slide time: (32:14)

And whatever is the number of observations which are lying inside a particular class, this is simply

called the 'Absolute Frequency', or in simple words, we also call it the 'Frequency'. And when we are trying to divide this absolute frequency by the total number of observations, then this value is called the 'Relative Frequency' of that class.

Refer slide time: (32:51)

Now, there is another aspect, which is cumulative frequency. What is the cumulative frequency?

As the word suggests, cumulative means, you are trying to accumulate; you are trying to add more

and more. So the cumulative frequency is also defined for a particular class, and it is defined for all the classes separately. The cumulative frequency corresponding to any variate value is the number of observations less than or equal to that value; for a class, it is the total number of observations less than or equal to the upper limit of the

class. What does this mean? Okay.

Refer slide time: (33:39)

Let me, try to take an example and try to illustrate all these things, then it will become more clear

and easy to understand. Now, in this example, there are twenty participants, who participated in a

race and time taken to complete, the race is recorded in seconds. So, this 32 means that, the first

participant took thirty two seconds. This 35 means, the second participant took thirty five seconds.

This 45 means the third participant took forty five seconds, and so on. Now, what you

have to look here, this is very important. And yeah, I would also request you that you please try to

concentrate on this example, that how I’m trying to create class intervals and what are the steps

which are involved. Because, in this lecture, I am going to explain, the example in detail and in

the next lecture, I will try to implement the same example in R software. And one and when you

are trying to implement it, it is important for you to, to do, the same steps in R software, which

you are doing here . Okay. So, now looking at this data, first we try to see, what is the minimum

value in this data and what is the maximum value in this data? So, I can see here, that 32 is the

minimum value and here, 84 is the maximum value. So the minimum value is 32 seconds and

maximum value is 84 seconds. Now, looking at these two values minimum and maximum, I have

to define here the width of the class interval. And this width is going to decide the number of

intervals also. So by looking at this data, suppose suitably, I choose that the width of the interval

should be of 10 seconds. So I try to create here, different classes, like this, class one say here, a 1

this is consisting of 30 to 41 seconds. That means, this interval a 1 will contain all the values of

the time, which are lying between 30 seconds and 41 seconds. And similarly, the next class is a 2

which will contain all the values between 41 seconds and 52 seconds. And similarly, I have here

Class a 3, 4, 5 and here 6. So I have created here, six classes and you can notice, that all these six

classes, they can contain my all that data and that is another point, while creating these groups,

that these groups and the limits have to be defined in such a way, such that the entire data can be

accommodated among these groups. And now, in this group you will see here 30 and 41; this 30 is the lower limit and 41 is the upper limit. And similarly, in the last class, 81 is the lower limit and 91 is the upper limit. So now, I have created the groups in which the entire data

can be summarized.

Refer slide time: (37:27)

Now, I will try to present it in some suitable tabular form, so that I can easily understand it. So,

now you can see here, I have created a table, and the first column here is the class interval; in this column you can see that I am giving the same class intervals which I had denoted here as, say, a1, a2, a3, a4, a5 and a6. And then, in the next column, I am finding out the

midpoint. So, the midpoint of Class a 1 is 35.5, which is coming from 30 plus, 41 divided by 2 and

so on. And similarly I have found the, midpoints of other classes, so we are going to now assume

that whatever data is spread in the interval 30 to 41 is going to be concentrated at 35.5; in other words, I will say that all the data in this interval will have the single value 35.5. And as I said earlier, and as we had discussed in the earlier

lecture also, once you try to group the data, the information on the individual values is lost, and only the information is available on which category or which class the data belongs to. Now, after this, this is the third column, in which I am trying to count the absolute

frequency, or frequency. For example, here, this is the interval 30 to 41. So, if you try to look at the data, how many values are lying between 30 and 41? So, this is the first value, this is here the second value, and then the third value, the fourth value, and you can see here, this is the fifth value here. And similarly, if you try to find out here how many values are lying between 41 and 50, I can say here, 45 is the first value, then, moving further, 42 is the second, and another 42 here is the third, so there are only three values which are lying between 41 and 50. So you can see here, this is represented here, where I am trying to make a

circle and similarly, I have found the, I have counted the number of observations in that particular

category, and I have done it manually and I have written here, so this five is indicating that there

are five values in the interval 61 to 70 and so on. Now, the total number of observation here, you

can see is 20. And this we try to denote by here and this is equal to here 20. So now, when I try to

divide the absolute frequency by the total number of observations, I get here the relative frequency

in the next column. So, you can see here, this 5 is coming here, this 3 is coming here, this 3 is

coming here and this is here the total number of observations. And once I try to divide it, I get here

the values of relative frequencies. The advantage of relative frequency is that all the relative

frequencies, they will always be lying between, 0 and 1 and so, they can be easily converted into

percentages. Now, in the last column, I am trying to find out the, cumulative frequency and here,

I would like to, once again explain you, that how these cumulative frequencies are found. Now,

this is my table here; I can rewrite here class a1, a2, a3, a4, a5 and class a6. Now, I am trying

to say, the total number of observation in the given set of data, in Class a 1 is 5 or this a 1 is, having

the limit 30 to 41. So I can also say that there are only 5 observations in the entire data set, where I am trying to make a circle there; there are only 5 observations whose values are smaller than 41.

Now, let me come to the second category, a2, which is going from 41 to 50. So you can see here that the total number of observations whose values are smaller than 50 is the sum of the absolute frequencies of group 1 and group 2, of class 1 and class 2. So this is here 5 plus 3, that is 8. And similarly, if you try to look at a3: a3 has a limit

51 to 60. So, this is trying to give you, a value here, which is the sum of all the absolute frequencies

up to the third class, which is 5 plus, 3 plus, 3, that is 11. and similarly for the a 4 this is trying to

give you the total of all the absolute frequency, from class 1 to class 4 and similarly for the class

5, this is trying to give you the sum of all the absolute frequencies up to the class, 5 and similarly

this sixth and the last group is trying to give us, the sum of all the absolute frequencies up to the

class 6 and this is obviously going to be the total number of observation. So, this is what we mean

by cumulative frequency. So, by looking at the value of the cumulative frequency, I can always

find that how many values are smaller than this value. And now, the same thing I can just, make

general. And now, I will say that, instead of here,

Refer slide time: (43:49)

6 class interval, I have here, k class interval, in general. And there are total number of observation,

are here n and this observations are divided into k class interval. And this class intervals are

denoted by a1, a2, ..., ak, in such a way that class a1 contains n1 observations, a2 contains n2 observations, and ak contains nk observations, respectively. So, obviously, if you want to find out the relative frequency of the j-th class, this will be the number of observations in the j-th class divided by the total number of observations, fj = nj/n, and j goes from 1, 2, up to k. And now

all this information can be combined together, in this format class interval here, I can write down

a1, a2, ..., ak, then the absolute frequencies n1, n2, ..., nk, the relative frequencies f1, f2, ..., fk, and if

required, I can also add the information on the cumulative frequency. So, this table, what we have

drawn here, this is called the 'Frequency Table' or the 'Frequency Distribution'. Why do we call it a distribution? Because we are trying to see how the values of a variable are distributed. So, now

in this lecture, if you try to see, I have taken an example and then, based on that example, I have

tried to give you, the different definition, concepts and how the things are implemented. But,

whatever I have done, that is manually. Now, in the next lecture, I will continue with the same

example, but then, I will try to show you, that how the things can be implemented over the R

software.

So, you practice it, you try to learn it, you try to understand it and we will meet in the next lecture,

till then, Good bye.

Lecture 09
Frequency Distribution and Cumulative Distribution Function

Welcome to the next lecture on the course Descriptive Statistics with R software. You may recall

that in the earlier lecture, we started a discussion on the frequency distribution. We had

understood the concept of absolute frequencies, relative frequencies and we had put them in a

table, what we called as, ‘Frequency distribution’. And once we are trying to construct the

frequency distribution, we have different types of variables, discrete variable, continuous

variable. So we had completed our discussion on the discrete variable and we started the

discussion on how to construct the frequency distribution or the frequency tables, based on a

relative frequency from a continuous data. So, in the earlier lecture, we had taken an example

and we had seen, that how manually, you will try to create the groups and based on that, you will

try to compute the frequencies and based on that you will try, to construct the frequency

distribution or frequency table. So, we will continue, on the same lines and we would like to see

today, that whatever we have done manually, manually means we had made all the calculation

by hand, now we would like to implement it on the R software, using the same steps same

concept, same methodologies and let us try to see how the things are going to work and

essentially how you are going to obtain a frequency distribution table, using the R software in

case of continuous data. Okay. So, let us start our discussion.

Refer slide time :( 01:57)

You may recall that in the earlier lecture, I had taken an example like this one, I had recorded the

time, taken by 20 participants in a race and I had created a data vector, time and this data vector

consists of 20 values. Now after this, our objective was to create a frequency distribution, you

may recall, what was our first step? First step was that, we tried to find out, what is the minimum

and maximum value in this dataset. So, you may recall that we had identified that 32 is the

minimum value and here 84 is the maximum value and using this minimum and maximum

values, we need to decide, that how many class intervals, we would like to have. We had

discussed, that all this data has to be grouped in some suitable number of class intervals or in

simple words classes. So, first we need to decide that, how many class intervals we can make?

So, in this example we had decided to make, the class intervals of the width 10 seconds and

based on this you can,

Refer slide time :( 03:32)

have a look on this table, we had constructed this table. So, first we had constructed this class

interval and based on that we had found this frequency and based on that we had found the

relative frequency and finally we had found the cumulative frequency. The same thing we are

going to now do in the R software.
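The actual 20 time values appear only on the lecture slide and are not reproduced in this transcript. Purely so that the commands which follow can be run, here is a hypothetical stand-in vector with the same minimum (32), maximum (84) and class counts (5, 3, 3, 5, 2, 2) that the lecture reports:

    # Hypothetical stand-in data, NOT the exact values from the slide:
    time <- c(32, 35, 45, 83, 74, 55, 42, 66, 36, 61,
              51, 42, 38, 63, 69, 65, 58, 77, 39, 84)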

Refer slide time :( 04:00)

So, the first step in the construction of a frequency distribution is to find out the, range of the

data values. The range of the data values is defined as, say here maximum value minus minimum

value, once you get a range then you will have an idea that this range has to be partitioned in

how many class intervals. So, in order to find out the range, we have a command in R software

as r a n g e and range and in order to use this, I have to give the name of the command, range and

inside these arguments, these brackets, I have to give here the data vector and if I try to do so

then the outcome will be the minimum and maximum values of the data contained in this data

vector, that is the first step. So, what will be that second step? Looking at the value of the range I

have to decide the number of class intervals, I have to decide the width of the interval and then I

need to partition this range into different segments, what we will call as classes.
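A sketch of this first step, with the time vector defined above:

    range(time)  # 32 84, the minimum and maximum of the data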

Refer slide time :( 05:29)

So, you can see here, I have operated this function range over the data of time, and you can see

here that this is giving us 32 as the minimum value and 84 as the maximum value and once you

can obtain the range, the next step is this, how to divide this range into suitable number of

classes. So, we had decided in the earlier lecture, that we are going to have the classes of width

10 seconds. So, we had the classes like 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80 and 81 to

90. So, now this range, 32 and 84 will be converted into 30 and 90 and this range will be

partitioned in different classes. So, now I have another task: how to create these intervals. So,

you know when we are trying to divide a range into different segments of equal length, we can

use a command for a sequence; we are simply trying to create a sequence whose values are at some fixed interval. So, in order to do this thing,

Refer slide time :( 07:01)

I try to create here, a sequence and in order to do this thing, I'm going to use here the command

seq, I would like to address here that it is not really possible to give you here the details of all the

R commands. So, for that you need to learn first the R software and once you learn the R

software, the basic things, then you will be able to use them or the statistical function. So, in case

if you want to do it you can go to the slides available, in my another lecture on R software that is

introduction to R software, its slides and the videos are available on the NPTEL website, you can

have a look. So, anyway without explaining the use and function of this operator s e q, I would

try to use it and the first value inside the argument is giving me the starting point and the second

value in the argument is giving me the end point, that means I need to create a sequence from 30

to 90, and this by = 10 is providing a value for what should be the width of the interval, that means, the points at which the sequence has to be broken. So, once I

start with the 30 then the sequence will be broken at 30 40 50 60 70 80 and 90. Right. So, in case

if I try to store the outcome of this command, which breaks the range at seven points at an interval of 10 units, I would like to store all the things in a new variable here, breaks; well,

I’m going to use this breaks variable later on, in the construction of frequency table. So, once I

try to execute it on the R console I will get it here this value and this is here the, the screenshot.
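A sketch of this step:

    breaks <- seq(30, 90, by = 10)  # break points for the class intervals
    breaks                          # 30 40 50 60 70 80 90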

Right? Now once I get, this sequence, at an interval of 10 units is starting from,

Refer slide time :( 09:14)

30 to 90, then I need to convert this numeric vector into a factor, once again you need to know

what are the factors in R software and for that I will again request you to have a look on the

lectures on introduction to R software and in order to achieve this, we have a command here cut

and this command is used in this particular way. I would try to use the command here c u t, cut

and then I need to specify the data vector for which I would like to operate the, function cut and

then I would like to define here the breaks; breaks is going to be a numeric vector of two or more unique cut points, or a single number, greater than or equal to two, giving the number of intervals into which this data vector has to be cut. So, this breaks argument is going to control the

number of partitions in this data vector and here I have here a command, small letters r i g h t,

right and this is equal to here FALSE and you can see here this is the logical operator, as we

discussed, capital TRUE and capital FALSE are the logical operators. This right = FALSE is denoting that the intervals are to be closed on the left and open on the right; the argument right asks whether the intervals should be closed on the right, and here we want to set it as FALSE. So, in case you want the opposite, then instead of FALSE you have to give TRUE. So, let us try to execute this command

over the data vector.
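A sketch of the command just described:

    time.cut <- cut(time, breaks, right = FALSE)  # intervals closed on the left, open on the right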

Refer slide time :( 11:14)

and then we try to see, what happens? And what is the interpretation? And once I complete this

thing, then I will take you to the R console and I will try to show you, what is really happening.

Okay. So, if you try to see here, I try to take the same data vector here time and I am trying to

use here the break points which I have already generated as breaks; remember, we have generated the breaks as 30, 40, 50, 60, 70, 80 and 90. So, now I am trying to tell this function: please use the data of the time vector and create the classes using the cut points in breaks, and right is

going to be FALSE, that means all the intervals are going to be closed on the left hand side and

they will be open on the right hand side. In case you want the intervals to be closed on the right hand side instead, then you need to use here TRUE, but anyway, I have not used

it here and whatever is the outcome of this thing I am trying to store it in a new variable here

time dot cut. So, this is simply indicating, that the time has been operated with the, command cut

and the outcome of this time dot cut will look like this and here is the screenshot. You can see

here there are values here 30 to 40, 30 to 40, 40 to 50, 80 to 90. And so, on and here there are

some here levels, what are these values indicating? You see, this is very important in any

software that you need to understand what the software is doing because unless and until you you

understand it you will not be able to execute the, the correct command on the given outcome. So,

first we try to understand,

Refer slide time :( 13:14)

what is the meaning of this outcome? You will recall that the data in your time vector was, 32,

35, 45, 83, 74 and so on. And if you try to see here what is the outcome of this variable time dot

cut. I have simply copied and pasted these two data vectors over here; this is 30 to 40, 30 to 40, 40 to

50 and so on. And you can see here from here to here there are only here 20 values. So, what is

really happening? That these are my individual data’s, these are my individual values in the time

vector. And now I have created the intervals, in which these values are going to lie. So, these are

the intervals in which the values lie. For example suppose I take this value here 32, this is my

first value and what is the first value in that variable time dot cut, this is 30 to 40 so you can see

here this is indicating that this value 32 is lying in the interval, 30 to 40. Similarly, the second

value here, the second interval that is indicating, that where the second value of the vector time is

lying. So, you can see here 35 is lying, between 30 and 40. So, the second interval is indicating

the interval of the second value and similarly if you move forward, the third interval here is

indicating the interval in which this third value 45 is lying and so on. So, you can see here, there

are 20 values here and there are 20 values in the data vector time dot cut. So, every value is

corresponding to the interval in which it is lying or the 20 values or the 20 intervals in the time

dot cut vector they are indicating or they are providing the interval in which the corresponding

values are lying. So, this is how the interpretation of this time dot cut goes. Right. Now what we

have to do? We have got this data and now I have to create a frequency table. So, we had

learned in the earlier lecture that, in order to do so, we have a function here, table; using the table

command, we had constructed the frequency table in the case of discrete data also or in the case

of qualitative data that was finally converted into some numerical values indicating the, the

variables values. Okay. So, the same command I will use here, but now this command is going to

be used over this new variable time dot cut, because now we have got this interval and then we

have got the data also time, now I need to create the frequency distribution using this data vector.

Refer slide time :( 16:42)

So, as we had discussed earlier, now we are going to use the absolute frequencies of this data

vector using the table function. So, the usage will be table, you have to type table all in a small

letter and inside the arguments you have to write the data on the variable. So, here you try to

write the data. So, now once you try to execute table(variable), it will create the absolute frequencies corresponding to the data which is contained in the argument under the name variable. So, table with the variable inside the argument will simply try to create or simply try to

inform you the values of the absolute frequencies, with respect to different intervals. So, now

when I try to execute it over the R software,

Refer slide time :( 17:43)

you can see here, once I try to do it as table and inside the bracket I say time dot cut. So, you can

see here I get this type of outcome, what is this telling you? This is trying to say that there are

five values, in the interval 30 to 40. So, 5 is essentially the absolute frequency of class a1 which

is equal to 30 to 40. And similarly here this here is 3 this, 3 is trying to indicate that there are 3

values in the interval 40 to 50 which is our class a 2 and similarly there are 3 values here which

are between 50 to 60, there are 5 values which are lying in the interval 60 to 70 there are 2 values

which are lying in the interval 70 to 80 and there are 2 values which are lying in the interval 80

to 90. Well, you can see here one thing, that here the intervals are continuous, because I have

used here open interval here and close interval here and that is the way we try to create the

frequency distribution in a continuous data. Well, in this case because all the values are going to

be some integers. So, that is why you have to make sure that if there is a value 40 on the

boundary then where it is going to be added, in the earlier interval or in the next interval. So, you

have to be careful. Okay. And now this is the screenshot, but you can see here one thing, usually

in that textbooks, whenever we try to write down the frequency table, they are not written

horizontally, but they are written vertically like this, means here you will have here class

intervals a1, a2 and so on, and here you will have the frequencies f1, f2 and so on. But this is the outcome you can see here: this frequency table is actually coming out horizontal, and we would like to make it vertical.
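A sketch reproducing this step; the counts shown are the ones reported in the lecture:

    table(time.cut)
    # [30,40) [40,50) [50,60) [60,70) [70,80) [80,90)
    #       5       3       3       5       2       2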

Refer slide time :( 20:19)

So, in order to make it vertical, we have a command here, cbind this function is used to print the

frequency distribution in the column format; up to now, the intervals and their respective

frequencies they are coming in rows, now I want to make it column wise. So, I have to do

nothing, the same outcome whatever we have obtained here, I simply have to operate the cbind

function over that. So, you can see here I try to obtain here the cbind and inside the argument I

am simply trying to write down the table and inside the argument I'm writing time dot cut and

this is the same command, if you try to see here what I have used here where I am circling. So,

whatever is the outcome of this, command this is being used here with the function cbind and

you can see here you get here a vertical table. And this is you’re here, frequency distribution. So,

it's the same thing, there are five values in the interval, 30 to 40 there are three values in the

interval 40 to 50 and so, on. So, this is the lower limit, say a1 a2 and this is my interval a1 to a 2

and this is a frequency, f 1 this is the frequency f 2 and so on. So, that is the same table that we

had obtained earlier; there is a difference, say between 31 to 40 and 41 to 50, but that can also be adjusted by using the appropriate intervals, so that is not going to make much difference. We had taken, manually, intervals like 31 to 40 and 41 to 50 because we were doing it by hand and all our data values were integers. So, in case you have fractional data, the same concept will continue.
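A sketch of this step:

    cbind(table(time.cut))  # the same frequencies, printed as a vertical column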

Refer slide time :( 22:24)

But now, before going further let me try to show you these things over the R console. So, let me

try to start here, with the data. So, you can see here your data was here like this, say here time.

So, I would try to copy it here.

Refer slide time :( 22:42)

I will come here to the R console and I will say here: time, time is equal to c, and inside the bracket I

have to give that data. So, you can see here now this is my here data. Right. On which I would

like to create my, frequency distribution. Now the first step is this I want to find out the range of

this data. So, I have to operate range of say time and this comes out to be here 32 84. So, now

this is giving us an idea, that we can have an interval of 10 units starting from 30 till 90.

Refer slide time :( 23:26)

So, now after this, what I have to do? I need to first create the sequence, at an interval of 10.

Refer slide time :( 23:41)

So, I would try to create here a sequence using breaks, and if you try to see here, this breaks comes out to be the same outcome that we had obtained earlier: the values 30, 40, 50, 60, 70, 80 and 90 here. Right.

Refer slide time :( 23:58)

208
Now after this, what we have to do?

Refer slide time :( 24:03)

I simply have to create here, I have to use here the command cut. So, I try to use the data, on

time and breaks.

Refer slide time :( 24:03)

Time is here, breaks is here and I have to use this data, to get the values of time cut and you can

see here time dot cut comes out to be like this. Right, And after this, what I have to do? I simply

have to operate the table function over this, as you can see on this slide; I simply have to use here

the table function. So, if I try to use here the table function. So, you can see here table time dot

cut and you can see here, this is, the same outcome here, same outcome here, this is the

frequency table that we have obtained, but you can see here that this frequency distribution table

is in horizontal. So, I want to make it vertical. So, I can use the command here, cbind and then I

will try to type here the same command or the same data that we want to convert into vertical,

which is time dot cut and the outcome of the function table, time dot cut has to be converted into

a vertical table. So, you can see here I am getting this thing here. So, now you can see here this is

the same frequency distribution that you used to obtain by the manual calculations. Right.

Refer slide time :( 24:03)

Now there is another issue; the issue is that, in this case, you have obtained the frequency

distribution with respect to absolute frequencies. Now suppose alternatively, you want to find out

this frequency distribution with respect to relative frequencies, then how to do it? So, as we had

discussed in the earlier lecture, that there is a very close connection between absolute frequencies

and relative frequencies. The relative frequencies are simply obtained, by dividing the absolute

frequency by total number of observation, for which we had used the command length. So, in

order to find out the frequency distribution with respect to the relative frequency in the same data

set, I simply have to divide the variables at appropriate places, by length of the data vector. So,

212
yes so, in the last lecture, we had obtained the frequency distribution, using absolute frequency

and then I had divided it by length of n that has given us the relative frequency. So, the same

concept, I am going to use it here once again. So, what I will do here that in order to obtain the

frequency distribution with absolute frequency, I have this function table and inside the argument

I have to give the data on the variable, for which I want to create the frequency distribution. Now

I will try to divide it, by the length of the variable. So, the length of the variable is simply going

to be the number of observation present in the data vector and once you try to do it you will get

the frequency distribution with respect to,

Refer slide time :( 27:43)

relative frequency. So, if you try to see here, I have obtained first the frequency distribution with respect to relative frequency. You can see here, I have divided table(time.cut) by the length of the vector time.cut, and I am now getting these values, 0.25, 0.15, 0.15 and so on. What are these values? If you try to see what your outcome was in the earlier case, the outcome was 5, 3, 3, 5, 2, 2. So, I copy this here, 5, 3, 3, 5, 2, 2, and now I am going to divide it by 20, because there are 20 observations, and now you can see that whatever this outcome is, 0.25, it is nothing but 5 divided by 20, and this is the screenshot. And once again, if you want to put it in vertical columns, in the vertical format, then you simply have to use the same command cbind, but now cbind will be operated over the expression which was used to obtain the frequency distribution with respect to the relative frequency, and here is the same outcome in vertical columns, and this is the screenshot. So, I would like to show you the same outcome on the R console also. You can see here. Right. So, we had obtained the data here, time.cut; time.cut was this thing, and now we are going to obtain the frequency distribution. So, this will be table(time.cut) divided by length(time.cut). You can see here, you are getting the same outcome, and if you want to make it vertical, then you simply have to operate the command cbind; write this, and you can see here, this is once again the frequency distribution where the frequencies are in terms of relative frequencies. Right. So, this is how we try to create our frequency distributions.
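As a small sketch (continuing with the hypothetical time vector from above), the relative-frequency version only divides the table by the number of observations:

table(time.cut) / length(time.cut)         # relative frequencies, e.g. 5/20 = 0.25
cbind(table(time.cut) / length(time.cut))  # the same, as a vertical column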

Refer slide time :( 30:26)

Now after this, the next thing comes: the last column, which was the cumulative distribution function, that is, the cumulative frequencies we would like to compute. As we had discussed, the cumulative frequency gives us an idea, up to a certain point, of how many values are less than or equal to that particular value. So, in order to compute the cumulative frequencies, we can use here a function cumsum, which abbreviates cumulative sum. In order to use it, we simply have to apply this function cumsum on the variable for which we want to create the cumulative totals, but the variable has to be first operated on with the table function, because once we have data in some variable, it first needs to be converted into a frequency table, and then, based on that frequency table, the sums of those frequencies can be obtained. So, that is why the complete command will be the cumsum function on the table of the variable. Right. And in case you do it, this will produce the cumulative frequencies.
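In code form, again with the hypothetical time vector from above, this is a one-line call:

cumsum(table(time.cut))  # running totals of the class frequencies: 5 8 11 16 18 20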

Refer slide time :( 32:01)

So, let me try to show it on the data set that we are considering. Now, if you try to see, we had this dataset,

Refer slide time :( 32:07)

that was obtained for the absolute frequencies. I had here six intervals and then six absolute frequencies, 5, 3, 3, 5, 2 and 2, and now, if I try to obtain the cumulative frequencies here, I had shown you in the slide,

Refer slide time :( 32:36)

in the earlier slide, that this is how we are going to find out the cumulative frequencies: the first cumulative frequency is the first frequency itself, the second cumulative frequency will be the sum of the first and second frequencies, the third cumulative frequency is going to be the sum of the first, second and third frequencies, the fourth cumulative frequency is going to be the sum of the first four frequencies, the fifth cumulative frequency is going to be the sum of the first five frequencies, and the last one, the sixth cumulative frequency, is going to be the sum of all six frequencies, which is equal to the total number of observations, and you can see that you are getting here the values 5, 8, 11, 16, 18, 20. Now, once you obtain the outcome of cumsum, you can verify here,

Refer slide time :( 33:29)

you are getting the same outcome here: 5 8 11 16 18 20. So, this is how the cumulative frequencies are obtained. Now, in case you want to find out these cumulative frequencies with respect to relative frequencies, and if you want to represent this outcome in a vertical way, you have to use the same commands: in order to turn this horizontal outcome into a vertical outcome, just use the cbind command, and in case you want to present it with respect to the relative frequencies, just divide inside the command by the length of the data vector. So, let me show you here. You can see here that in this slide, I am simply trying to express the outcome of the previous slide in vertical columns by using the function cbind, and here is the screenshot.

Refer slide time :( 34:40)

And similarly, in case I want to produce the same cumulative frequencies with respect to the relative frequencies, then I simply have to use the same command, table of the variable, divide it by the length of the variable, and then use the cumsum command over this expression. So, using the cumsum function in this way will produce the cumulative relative frequencies of the data contained in the variable. So, again, I will try to show you here.
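A minimal sketch of the two remaining variants, still with the hypothetical time vector:

cumsum(table(time.cut) / length(time.cut))        # cumulative relative frequencies
cbind(cumsum(table(time.cut) / length(time.cut))) # the same, as a vertical column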

Refer slide time :( 35:12)

Now I have operated cumsum on the earlier expression, table(time.cut), but now it is divided by length(time.cut). So, now this is giving me the running sums of the relative frequencies: this here is the first relative frequency, this is the sum of the first and second relative frequencies, and so on. And if I want to present it in a vertical way,

Refer slide time :( 35:50)

then I simply have to use the cbind function over this. Right. So, now I will try to show you on the R console also.

Refer slide time :( 36:02)

So, you can see here that I already have obtained the data here, time.cut. Right. Now I will find the cumsum of the table of time.cut, and you can see here, this is the outcome, and in case you want to make it vertical, then you simply have to use the cbind function, you can see here, like this. And in case you want to obtain these cumulative sums with respect to the relative frequencies, I can show you here: I simply have to write down cumsum of table of the variable time.cut, divided by length of time.cut, the same variable in which I stored the data, and you can see here, now this is the outcome with respect to the relative frequencies. But again, this outcome is in the horizontal direction, and suppose I want to convert it into the vertical direction; then I have to use the cbind function on the same expression, and executing it, you can see here, I am getting the same data which was earlier in the horizontal way, now coming in vertical columns. So, now we come to the end of this lecture, and I have given you the basic idea of how to create frequency distributions using R, and I would like to emphasize one thing more here. You should not assume that this lecture or this course is trying to teach you pure statistics. My objective is that many people are using these things, so I want to give them a basic idea related to the use and interpretations, and I want to show them how to do the same thing in the R software. So, my motive is very simple. Definitely, this lecture cannot substitute for reading the books, or save you from reading the chapters on frequency distribution from a proper book. So, I would request you, please try to have a look at the chapters on frequency distribution and data tabulation in any good book, try to see the concepts, try to learn the concepts, and then this lecture will help you in brushing up those concepts and will teach you how to use them in the R software. So, try to take different examples from the books and from the assignments, practice them, and try to see how the interpretations are being made, and we will see you in the next lecture. Till then, good bye.

Lecture - 10

Graphics and Plots - Bar Diagram

Welcome to the next lecture on the course Descriptive Statistics with R Software. Up to now, what have we done? We have considered the aspect of frequency distribution in the last couple of lectures, and before that we had two introductory lectures, where we had learned that once we get the data in our hands, there are two types of tools that can be applied: one is graphical tools and another is analytical tools. In the last two lectures, where we had done the frequency distribution, that was the first step, where you make an arrangement of your data so that the data becomes compatible to be exposed to the graphical and analytical tools. So now, in this lecture and in the next couple of lectures, I will first target the graphical tools, and after that I will continue with the analytical tools. So now the first question comes: why should we use the graphical tools? We know that graphics are very easy to understand, and that is why we take the help of graphics to extract and to understand the information which is contained inside the data. We are going to use the information extracted from graphical tools as well as analytical tools together, so first let me try to explain to you why we need these graphical tools.

Refer Slide Time :( 02: 25)

So now, suppose I want to convey that a person is happy or sad. Well, you can explain in several sentences how happiness looks, or how the face of a sad person looks, and so on, but in case you use the smileys, what do you see here? The mood of a person is very easily conveyed by these three smileys. Just by looking at the structure, anybody can say very quickly that the first one is indicating happiness; for the blue one, anybody can say after looking at this face that the face is sad; and for the green face, the middle one, anybody can have a look and say very easily that this is reflecting that the person is okay. Similar is the information that is expressed in graphical mode. As we had discussed, when we start in statistics, we have only a sample of data, and the size of the data can be very small or very, very large, and each and every data value contains some information, but we want to have the information in some combined way, and that is why our first target was the frequency distribution. Now, once again, we are trying to combine the information in some graphical way, and we would like to condense the information in the form of a graphic, so that we can have some idea about the information that is hidden inside the data.

Refer Slide Time :( 03: 56)

So there are various types of advantages. Graphics can explain the hidden information very compactly and very quickly, in a way that is very easy to understand for a common person. Nobody needs a much stronger knowledge of mathematics and statistics to understand the behaviour of a smiley face or the behaviour of a curve; one can easily understand it, so that is the advantage. Once again, I would say that whenever you are trying to conduct any statistical analysis, there are various types of graphics that can be used, but sometimes there is a myth that unless and until you use more graphics, the analysis is not good. Rather, I have heard people saying that the goodness of a statistical report depends on the number of graphics: the higher the number of graphics, the better the report. Well, this is wrong. If somebody has some problem, it does not mean that the doctor who gives more medicine is the better doctor; the doctor is good if he gives the appropriate medicine in the appropriate quantity. The same is the message in statistics also. We have to use appropriate graphics, and an appropriate number of graphics also, right, and only the use of appropriate graphics in the correct number will give us the correct information in a fruitful way. So in statistics there are various types of graphical tools that can be used.

Refer Slide Time :( 05: 35)

There are two-dimensional plots, three-dimensional plots, scatter diagrams, pie diagrams, histograms, bar plots, stem and leaf plots, box plots, and many, many more, and particularly with the advent of software, these graphics have become very popular, because they are very easy to create and they can be created in a very small amount of time. Just like all other software, R also has the capability to create graphics; not only to create the graphics, but it also gives you the option to save the graphics in different formats, like PostScript format, JPEG format, PDF format and so on. So in R there is a long list of graphics that can be created.

Refer Slide Time :( 06: 29)

For example, the plots which you have learnt earlier, like bar plots, pie charts, box plots, grouped box plots, scatter plots, coplots, histograms, normal Q-Q plots, and there is a long list: all sorts of two-dimensional, three-dimensional, coloured plots. There are many, many possibilities, and it simply depends on your capability how many graphics you can learn. Well, I am going to explain here some selected types of graphics. My idea is not to teach you the graphics themselves; my idea is to show you how one can create the graphics in the R software and what the different available options are, and then I will give you several examples, and I believe that after that you will be confident enough to learn how to create any other graphics according to your need.

Refer Slide Time :( 07: 34)

Well, so let me try to start here with one of the very basic graphs; this is called the bar diagram. The bar diagram is essentially used to visualize the relative or absolute frequencies of the data, that is, the relative or absolute frequencies of the values that are observed for a variable. This bar diagram will consist of one bar for each category, and one of the very important characteristics of the bar diagram is the height of the bar: the height of the bar is simply proportional to the frequency or to the relative frequency. The height of each bar is determined by the absolute frequency or the relative frequency of the respective class, and this height is shown on the y-axis. Whenever we are trying to create a bar like this, the bar has two things: one is the width of the bar, and another is the length or height of the bar. What we have to keep in mind is that when we are considering the bar diagram, the width is not important; the width of the bar is immaterial and can be chosen arbitrarily. Only the length is important, and this is going to represent the frequency, the absolute frequency or the relative frequency. But one thing I would like to emphasize here: most of the time you will see that, whenever we create a bar diagram, the widths of the bars are taken to be the same. That is just so that the graph looks nice, and so that somebody who does not know the theory of the bar diagram does not get confused about why the widths are so different; that is the only reason, right. So now, suppose we have a frequency distribution of discrete data, or of qualitative data that has been converted into numerical values through some proxy variables, like earlier, where we had given values to tastes: sweet denoted by 1, salty denoted by 2, and so on. Similar to that,

Refer Slide Time :( 10: 16)

now we assume that we have this type of frequency distribution, where these are my classes A1, A2, ..., Ak-1, Ak, so there are altogether k classes, and f1, f2, ..., fk-1, fk are the frequencies, the absolute frequencies. They simply represent that f1 values belong to class A1, f2 values belong to class A2, and so on, and once they are divided by the total number of observations, denoted by n, the third column gives us the values of the relative frequencies. So f1 upon n is the relative frequency of class A1, f2 upon n is the relative frequency of class A2, and so on. So now suppose I want to create a bar diagram; these are the basic fundamentals. First I need to create the x and y axes, and on the x-axis I need to create the bars. For example, class A1 is denoted by one bar, class A2 can be denoted by another bar, and so on up to Ak. The widths of these bars can be the same or they can be different, but it is always advised to have equal widths so that the diagram looks better. Now, if you look at the heights of the bars on the y-axis, the height of the first bar represents the frequency of class A1, that is the point f1; the height of the second bar is at the point f2; and similarly, the height of the last bar is fk. So you can see that the height of each bar is proportional to the frequency. Now, instead of the frequency, one can also use the relative frequency; in that case f1 is going to be replaced by f1 upon n, and so on, so the heights are going to represent the relative frequencies. So now I have two options: I have this bar, and the height of the bar is f1, the absolute frequency, or f1 by n, the relative frequency. The advantage of using the relative frequency is that the maximum value of the relative frequency is always 1, so it becomes easier to compare the heights of the bars.

Refer Slide Time :( 14: 08)

So in case you see a diagram like this one, a small bar and a taller bar, by looking at this diagram I can always say that one has a lower frequency and the other has a higher frequency. Suppose these bars indicate the number of shirts sold in a shop on a given day; I have now two shops, and sales are represented by the heights of the bars, so by looking at the heights of the bars, I can very easily conclude which of the shops is selling more shirts, right. Now the question is: how to create this bar diagram, or bar plot, in the R software?

Refer Slide Time :( 14: 56)

In the R software, we have a command barplot. This barplot command helps us in the construction of bar diagrams. When we try to construct a graph, there are many, many parameters, and you would like to handle those parameters so that you can get the outcome in the required format, which is more suitable to understand. For example, if you want to create a graphic, there will be an x-axis and a y-axis, so you would like to put some desired labels on your x-axis and y-axis saying what they are representing; you need to control the widths of the bars; different bars represent different things, so you would like to add information inside the graph about which bar is indicating what; you would like to give different colours to the bars; and there are many more things that you can do. So when we try to use this command barplot, I have two options: in simple language, I will say a simple bar diagram, and I will call the other option a bar diagram with more options. If you simply want to create a bar diagram, then you need to use this command barplot, and inside the arguments you simply have to give the data in the argument called height, and that will work. So if I simply type barplot with a data vector, this will give us a simple bar diagram. But suppose you want to modify it, you want to improve it so that it looks better; in that case, the detailed command of barplot is like this, and you can see here that these are the arguments: the first value is nothing but the height, which is the data; the second is width, equal to 1, which is going to control the widths of the bars; similarly there are space, names.arg, legend.text, beside = FALSE, horiz = FALSE, and so on. You can see here that this is a long list, so the next question comes: how will you learn all these things? I would suggest the simpler option is to take the help on the command barplot, because it is practically impossible to keep all the commands always in your mind. The best option is this: R is free software that will always be available with you. Look into the help page of barplot, and there you will see that the interpretation of each and every parameter that is expressed inside the arguments is very well explained. So whenever you need it, you simply read that part and execute it. To make you comfortable and to make you understand, I will take some of the options and add them one by one, through two examples.

Refer Slide Time :( 18: 33)

But before that, in case you really want to take the help on barplot, how do you do it? In order to do it, you simply have to use the command help, and inside the arguments, within double quotes, you need to write barplot, that is, help("barplot"). This syntax is not only for barplot, but for all the graphics and all the commands; this is one of the ways to obtain the detailed help. Now I will show you on the R console how this happens and how you get all this information, but before that, please have a look at this slide and the next slide. You will get the same information as mentioned here; you can see, this is the detailed documentation.

Refer Slide Time :( 19: 33)

And in the next slide, you have all the details, like what is represented by height: either a vector or a matrix of values describing the bars, and so on. What is width? An optional vector of bar widths. What is space? The amount of space left before each bar. What is names.arg? A vector of names to be plotted below each bar or group of bars, and so on; there is a long list. So what you need to do is simply read this help page, and this will be your best teacher to understand what is really happening, and based on that you can proceed, right.

Refer Slide Time :( 20: 16)

So you will see here, now I try to copy and paste this command on the R console.

Refer Slide Time :( 20: 23)

You can see here, as soon as I press enter, this will go to the internet, and a website containing the help on this barplot will be opened. This is one possible way; in order to use this type of help, you need to have an internet connection, or else you can also go to the help menu inside the R software itself and get the details there. But here, as soon as I press enter, you can see what is happening.

Refer Slide Time :( 20: 53)

You can see here that this internet site is opened, and actually this command has taken you directly to the R server, and on the R server you have the latest help, whatever is documented, available for you. So you can see the advantage of R: you are getting the best possible help over here.

Refer Slide Time :( 21: 29)

And if I scroll down, you can see here that this part is the same part which I had copied and pasted earlier,

Refer Slide Time :( 21: 39)

and if you move further, you can see here that there are different arguments, like height, and for height the help gives you all this information, something like this,

Refer Slide Time :( 21: 45)

and if you scroll more, there is a detailed interpretation for width, a detailed interpretation for space, and so on; you can see here, this is a long list. So now it depends on your capability how much you want to learn and how beautiful or how informative a graphic you want to create, right. And you can see here, I have simply copied and pasted this just to give you an idea, right. So now, one very important aspect of barplot: whenever you want to construct the bar plot, please decide, you are constructing the bar plot on what? On the individual data or on the categories? The answer is this: we want to create a bar plot on the categories. For example, in case we have data which has been categorized into two categories, one and two, and suppose you have got a hundred data values, would you like to plot those hundred values, or

Refer Slide Time :( 23: 02)

you would first translate those hundred values into two categories- category 1, category 2 and

based on that you will have here their frequencies or the absolute frequencies and actually,

we would like to plot the frequencies. So first you need to input your data in the form of a

frequency and the frequency can be obtained by using the command table. So that is why

suppose I have some data vector here x.

Refer Slide Time :( 23: 31)

So if I want to create the bar plot using this command barplot, first I need to transform this data into a frequency table using the command table, and then I have to create the bar plot. Similarly, if you do not want to use the absolute frequencies and you want to use the relative frequencies, then the same command has to be modified: in this case, I will operate the command barplot over the table divided by the length of x. We have learnt that once we use this, it gives us the frequency distribution using the relative frequencies, so I will create a bar plot on the data which is given by table(x) divided by length(x).
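As a compact summary of the two calls just described (a sketch, where x stands for any data vector of category codes):

barplot(table(x))              # bar heights are absolute frequencies
barplot(table(x) / length(x))  # bar heights are relative frequencies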

Refer Slide Time :( 24: 42)

So why not consider a simple example, and we will try to see how these things happen. Let me take a very simple example: I have data on ten persons, and we have recorded whether the person is a graduate or a non-graduate. In case a person is a graduate, it is denoted by G, and if the person is a non-graduate, it is denoted by N. So we have this data, G N G and so on. As we had discussed earlier, this data cannot be exposed to a statistical tool; we need to convert it into numbers. So what have we done? We are giving a graduate the value 1 and a non-graduate the value 2. Now, in this case the first person is a graduate, so I am giving it the value 1; the second person is a non-graduate, so I am giving it the value 2; the third person is a graduate, so I am giving it the value 1; and so on. So now I have data which is 1, 2, 1, 2, 1, 1, 1, 2, 1, 1. This is my data, on which I would like to create my bar plot. So I store this data using the c command in a variable quali, where I have used a short form of qualification. You can see here, I have entered this data, it looks like this, and this is the screenshot. Well, I will show you on the R console also. So let us first go to the R console.

Refer Slide Time :( 26: 22)

First let me create this data vector over here, right.

Refer Slide Time :( 26: 25)

So you can see here, this is the data. Now please observe what is really happening. Say I have been told to use the command barplot on this data to get the bar plot, so I try to use barplot on this quali.

Refer Slide Time :( 27: 01)

Now, if you observe, what are you getting here? Do you think this is exactly what you wanted? For example, I have just copied and pasted this graph over here. If you look at this graphic, you will realize that no, this is not matching what we wanted, because this is giving me a graph of the raw data values 1, 2 and so on, right; it is giving bars for the individual values 1 and 2.
here 2.

Refer Slide Time :( 27: 22)

So if you look at that data, for the first two people the values are 1 and 2,

Refer Slide Time :( 27: 27)

and this graph is showing the first person as 1 and the second person as 2, and there are such bars for all 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 observations. What is this? We did not want this. Well, this is a very common mistake that we make while creating bar plots, or pie charts (which I will show you later on): we directly expose the original data set. But what we have to do here is first create the frequency distribution of this variable quali and then give the command barplot.

Refer Slide Time :( 28: 09)

And now, this is what I am trying to do here. You can see, first I create the frequency table using the table command on the data quali; this comes out like this, with the values 1 and 2 and the counts 7 and 3, which simply indicates that there are 7 persons who are graduates and 3 persons who are non-graduates. Then I use the command barplot on this table(quali), and now I am getting this bar chart. You can see here, this is bar number one and this is bar number two. Bar number 1 represents the graduates, and bar number 2 represents the non-graduates, and by looking at the heights, you can see that the first height indicates that there are seven persons, the absolute frequency is seven, and similarly the second is three, so there are three persons. So by looking at this bar diagram, I can very easily conclude that there are seven persons who are graduates and three persons who are non-graduates. And in case you ask what the y-axis is indicating here: this is nothing but the frequency, the absolute frequency. And the widths of these two bars can be arbitrary, but anyway, as I said, it is nice to have bars with equal widths for easy-to-understand graphics, right. So this is how we create this. Now I will show you on the
you on the

Refer Slide Time :( 28: 09)

R console, but before that, let me also show you here, in case you want to create this bar diagram using the relative frequencies, how you are going to do it. First I create the frequency distribution with the relative frequencies: you can see here, once I execute the table command on quali divided by length of quali, I get this type of frequency distribution, where one entry indicates the graduate category and the other the non-graduate category, and this 0.7 is actually 7 upon 10, and the 0.3 is 3 upon 10, which are the relative frequencies of the two classes. And when I execute the barplot command over this data, I get this type of graph. Now you can see, again, this is my class A1 of graduates and this is my class A2 of non-graduates, so this graph is exactly the same as the earlier one, but there is a difference on the y-axis. Now you can see, these values are 0.0, 0.1, 0.2 and so on, and the height of the first bar indicates 0.7, which is the same 0.7 as here, and the height of the second bar A2, you can see, is 0.3, which is the same 0.3 coming here. So this is how we try to create the bar plots.
how we try to create the bar plots.

Refer Slide Time :( 31: 47)

But now let me show you on the R console also how you are going to do it. So now I correct it, and I make a bar plot of table(quali). You can see here, this is the graphic, and on the y-axis, where my cursor is, this is 0, 1, 2, 3, 4, 5, 6, 7, indicating the frequency, and if you use the same command divided by the length of quali, you can see the graph here.

Refer Slide Time :( 32: 31)

The graph remains the same, but now you can see that on the y-axis these frequencies have changed, and now they are relative frequencies. So the height of each bar is simply proportional to the relative frequency.

Refer Slide Time :( 32: 44)

So this is how we try to create the bar diagram. Now let me take one more example to help you understand better. I have collected the data of, say, a hundred customers. These hundred customers are visiting a shop, and there are three salespersons, indicated by the numbers 1, 2 and 3: salesperson number one, salesperson number two and salesperson number three. When these customers enter the shop, they are attended by a particular salesperson. Now, which salesperson has attended which customer, this data is recorded for the first hundred customers entering the shop, and so this data, you can see, consists of the numbers 1, 2 and 3 only: 1, 1, 2, 1, 2, 3 and so on. Now you can see, this small data set consisting of a hundred values is only numbers, numbers and numbers; they are not giving you any fruitful information. So my first attempt will be to make some suitable graphic to extract this information, and I will first attempt to create a bar diagram. So I store all this data in a data vector called salesper, which is a short form of sales person, right.

Refer Slide Time :( 34: 09)

And now, after this, if you make the same mistake and try barplot directly on salesper, you will not get the correct information, right.

Refer Slide Time :( 34: 22)

So first let me show you these things on the R console also, and we will move together. So let me first create this data vector.
Refer Slide Time :( 34: 22)

So I have created this data vector, and you can see, this is my data, right. Now, if I try to create a bar plot of this directly, you can see, you are getting a graph like this one; definitely you do not want this, and yes, I accept that I have made a mistake here, because the correct option is this: I do not have to use the data directly, but I have to first create the frequency table. So I first create the frequency table. You can see, looking at these hundred data values you are not getting much information, but by looking at these three values, you are getting the information that there are 28 customers who are attended by salesperson number one, 43 customers who are attended by salesperson two, and 29 customers who are attended by salesperson three, right.

Refer Slide Time :( 35: 40)

Now I create a bar plot over this. So I write down the command barplot, and inside the brackets I have to give the data, and you can see that this is giving me this bar plot, right,
me this bar plot, right,

Refer Slide Time :( 36: 01)

and similarly, in case you want to create this bar plot with respect to the relative frequencies, I need to divide by the length of the salesper data vector, and you can see, now in this case the scale that earlier showed 10, 20, 30, 40 and so on on the y-axis has changed to relative frequencies, right.

Refer Slide Time :( 36: 28)

So the same thing I have copied and pasted here for your understanding, and then I will do something more with this. You can see that this is the same bar plot that we have just obtained, and similarly, in case you want this bar plot with respect to the relative frequencies, this is obtained here.

Refer Slide Time :( 36: 50)

And now I will add some features to this plot. Suppose I want to give it a title. You can see here, this command is the same one that we used earlier, but now, in order to add a title to the graph, I am using the parameter main; inside the double quotes, I am writing the title I want, Customers attended by sales person, and if you execute it, you will get the same graph, but you can observe that you are getting this title. So the moral of the story is this: in case you want to add a title to the graph, use the parameter main.
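A minimal sketch of the call just described; it assumes the data vector salesper created earlier (its hundred individual values are not repeated here):

barplot(table(salesper),
        main = "Customers attended by sales person")  # main= adds the plot title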

Refer Slide Time :( 37: 45)

And I will show it on the R console also, and you can see, now we have this title added, right.

Refer Slide Time :( 38: 08)

Now let me make some more changes to this graph. Suppose I want to add some legends and axis titles to this bar chart. First look at this bar diagram: you can see, I have added a label to bar number one, SP1, that is, salesperson one; a label to bar number two, SP2, salesperson two; a label to bar number three, SP3; and here I have added a title on the x-axis and a title on the y-axis. So now we need to see how these things can be done. Well, I will tell you the very simple way: go to the help menu, see what the different parameters are that can do the job, and it will also explain to you how the values have to be given; sometimes they are numbers, sometimes they are inside double quotes. Please look in the help menu and use it. For example, if you look here, the barplot, table and main parts of the command are from my earlier plot. Now I want to add three names, SP1, SP2 and SP3, and the help tells me to give the names of the bars inside double quotes, whatever names you want. So I am giving three names, "SP1", "SP2", "SP3", inside double quotes and separated by commas; this is the format. And when I want to give a title on the x-axis, I have to use xlab equal to, and inside the double quotes I give whatever I want to write; for example, I have given here Sales person (SP), and this entire thing is printed on the x-axis, as you can see here, and the SP1, SP2, SP3 labels are coming below the bars. Similarly, if I want to give a title on the y-axis, the parameter is ylab, so I write ylab equal to, and whatever title I want on the y-axis, I write it inside double quotes; you can see, I have written Number of customers, and it appears here.

Refer Slide Time :( 40: 32)

So I can execute this thing on the R console also and you can see the outcome which you are

getting over here, right,

Refer Slide Time :( 40: 39)

You can see, earlier there was no title and no legends over here, and now I execute it, and you can see that the labels SP1, SP2, SP3 and the titles on the x and y axes are added over here.

Refer Slide Time :( 40: 57)

Now suppose I want to add some colours, and I want to make these bars different shades, different colours, right. In order to add the colours, I have to use one more parameter, col, and the rest of the command is simply the same as in the last slide. To add the colours, I give col equal to, and then I give three colours, red, green and orange; these colours are given inside double quotes, and remember one thing: red, green and orange are reserved words. Reserved words means that once you write down red, green, orange, R will understand them as the colours red, green and orange, and there is a list of different colour names which R understands. So you simply have to give these colours inside the brackets using the c command, and the same order will be followed in your bar diagram also: for example, red is coming in the first bar, the second place is green, so green is coming in the second bar, and the third place is orange, so orange is coming in the third bar, right.
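Putting all these options together, a sketch of the full call as described on the slides (again assuming the salesper vector from earlier):

barplot(table(salesper),
        main = "Customers attended by sales person",  # plot title
        names.arg = c("SP1", "SP2", "SP3"),           # label below each bar
        xlab = "Sales person (SP)",                   # x-axis title
        ylab = "Number of customers",                 # y-axis title
        col = c("red", "green", "orange"))            # one colour per bar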

Refer Slide Time :( 42: 26)

So we can do the same thing on the R console, and let us see what we obtain. You can see, this is the same graphic, but as soon as I execute it, these colours are added, right. Similarly, you can add many more things to the graphic to make it more informative and more useful, and there is a long list, but it is definitely not possible for me to cover all the options in this lecture. However, I believe that I have given you sufficient background to understand that it is not difficult at all to add different types of parameters, and it is not difficult at all to create nice graphics using the R software, which is completely free. So you are creating all these graphics just for free. The only thing is, yes, you need to spend some time learning it, but if by spending some time in learning you save a lot of money from buying expensive software to create all these graphics, I think it is not a bad bargain. So please try to take different datasets and create more graphics, and in the next lecture I will take up more graphics and explain all their features. So, practice, enjoy it, and I will see you in the next lecture again. Till then, good bye.

DESCRIPTIVE STATISTICS WITH R SOFTWARE

LECTURE 12

GRAPHICS AND PLOTS – SUBDIVIDED BAR PLOTS, PIE AND 3D PIE DIAGRAMS

Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the last lecture, we had considered some graphics, and in particular the construction of the bar diagram. Now we will continue with the topics on graphics, and we will try to learn some other types of graphs in this lecture. We are essentially going to discuss subdivided bar diagrams, pie diagrams and three-dimensional pie diagrams, okay.

So let me start our discussion with the first topic, the subdivided bar diagram, which is also called the component bar diagram. What does this mean, and what does it convey? You have seen that when we created the bar diagram, we had created these types of bars, with every bar indicating a class, A1, A2 and so on, but those bars indicate only one value at a time. Now suppose there is a situation where the value inside a bar is itself subdivided and depends upon some other values. Then what we will do is create the bar and subdivide it: this is the component of the first aspect, this is the component of the second aspect, and, taking the third aspect, this is the component of the third aspect. So now you can see, inside these classes A1 and A2, you are also able to compare different things. For example, I can compare the contribution of one part: I can see that in class A1 the contribution of the green diagonal lines is less. Similarly, if you want to compare the orange lines, they can be compared by their heights, and if you want to compare the third category, you can compare by the red lines. So what happens is that these subdivided bar diagrams divide the total magnitude of the variable into various parts.

Let me take an example and show you how these things work and how we are going to do it in the R software. Suppose I take data on three shops, shop number one, shop number two and shop number three, and the data is recorded on the number of customers who are visiting, say, between 10:00 and 11:00 a.m. in the

morning, on four consecutive days, which I am denoting by day 1, day 2, day 3 and day 4. So this is a sort of two-way table in which the rows are indicating the shops and the columns are indicating the days, and the interpretation is like this. Suppose I take the value 2: this means there are two customers who visited shop one on day one, whereas there are 20 customers who visited shop two and 30 customers who visited shop three on the first day. Similarly, if you take any other number, say this 15, it indicates that there are 15 customers who visited shop two on day three, and so on. So now you can see that there are two aspects, one is the shop and another is the day, and these two together determine the number of customers visiting the shops during 10:00 to 11:00 a.m. in the morning, right. So now, how do we plot a subdivided bar diagram? In this case, what I would like is the following: you can see that if you make a simple bar diagram, it is not very convenient or informative, because it will give you the information either on the shops or on the days, but these data values depend on two aspects, the shop and the day. So this is the advantage of using the subdivided bar diagram: I can represent both the aspects together, okay.

So what I would like is the following. Suppose I create three bars: the first bar indicating shop number one, the second bar indicating shop number two and the third bar indicating shop number three. So on the x-axis I denote the shops, and along the y-axis the components for the days are stacked. For example, day one is represented by this component here, this here, and this here in the three bars; similarly, day two might be indicated by these orange-lined parts; day three I can take like this; and if you go for day four, this is

the dotted area. So you can see that this height indicates day 1, and this orange height, in each of the bars, indicates day 2, and so on. So now, looking at this type of graphic, you can tell at a single glance how many people visited a particular shop on a particular day, and this is called a subdivided bar diagram. Now we want to construct it, but before you use the command to plot this subdivided bar diagram, you have to think about how you are going to input the data in your R command. Why? If you remember, for the bar diagram you input the data using the c command, just as a simple data vector, but in this case it is not a data vector; the data is given in two dimensions.

So I can use the aspect of matrix theory and I can use the matrix command to input my data and

you can see here, I am trying to give here data, if you try to write down this matrix here, I can

represent this data as 2 20 30 26 53 40 42 15 25 30 75 and100. So this is going to provide us a

matrix. So what I would like to do here that first I try to create the data matrix. So now you can

see here that in this matrix, there are one, two, three, and four, there are four rows and there are

268
three columns. So now you may recall that we already have discussed the use of matrix theory or

how we do? You would like to provide the data inside the matrix. So I use the same command

here and I try to store the outcome in the data vector or the data variable say here cust which is

indicating the customers. So I will use here the command matrix. Now as per the rules of this

command, I will try to provide the number of rows by the parameter nrow equal to four, number

of columns by the parameter ncol equal to 3 and now I have to give the data which I want to

insert inside the matrix. So this data is given row wise. So I'm trying to give here this command

byrow is equal to 3 that means TRUE and data is given in the format like here 2 to 20 to 30 and

then here 26 and then 53 and then 40 and so on. So you can see here now I have given here this

data and once I try to see the outcome of this command, you can see here I get here a matrix of

order 4 by 3 data where this column is denoting the shop number 1, shop number 2 by the second

column and shop number 3 by the third column and what about these rows? These rows are

denoting the days. So now you can see here the data, what is here in this matrix, it is the same as

data given in this two-by-two table, right.
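A minimal sketch of this step, together with the basic plot call that follows; the twelve counts are exactly those read out above:

cust <- matrix(nrow = 4, ncol = 3, byrow = TRUE,
               data = c(2,  20, 30,    # day 1: shop 1, shop 2, shop 3
                        26, 53, 40,    # day 2
                        42, 15, 25,    # day 3
                        30, 75, 100))  # day 4
barplot(cust)  # one stacked bar per column (shop); the stacked sections are the days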

Once you enter the data, after this you have to use the command barplot, and inside the arguments you have to give the name of the variable that contains the data in matrix format. This command will create a subdivided or component bar diagram where the columns of the matrix denote the bars of the diagram. So each bar denotes a column, and there will be some sections inside it, and these sections denote the frequencies in cumulative format. What does this mean? For example, if you look at this data matrix, the first column is 2 26 42 30. These are my frequencies, and now they will be denoted in cumulative format. How will it look? I will just plot it, and then I will explain it to you, right. So remember one thing: in subdivided bar diagrams, the boundaries marked on the bars are essentially the cumulative frequencies, and in case you want to find out the individual frequencies by looking at the subdivided bar diagram, it is pretty simple: subtract two consecutive cumulative frequencies to get the difference, and that indicates the value of that particular class. Suppose I take the cumulative frequency of the first two classes and subtract from it the cumulative frequency of class one; whatever the difference is, that gives the value of the frequency for class 2, okay.
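A quick worked check with the first column (shop one): the section boundaries are the cumulative sums 2, 28, 70, 100, and differencing them recovers the class frequencies.

cumsum(cust[, 1])              # 2 28 70 100: the component boundaries of bar 1
diff(c(0, cumsum(cust[, 1])))  # 2 26 42 30: the original frequencies recovered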

So now, I have given here the data, right, and when I execute the command barplot(cust), where cust is the name of the variable in which I have given the data in matrix format, I get this type of subdivided bar diagram, or component bar diagram. First, let us try to understand what this is showing us. You can see, there are four sections in each bar: one is black, the second is dark gray, then lighter gray, and then a still lighter gray. These are four different colors which are used inside each bar to divide it into four components. What is your bar here, and what are your components? The first component is here, the second component is here, the third component is here, and this is your fourth component. So you can see, as the name suggests, the bar of the diagram is subdivided. Now, what is happening on the x-axis? The first bar denotes shop one, the second bar denotes shop two, and the third bar denotes shop three.

Well, the basic barplot command will not give you all this information, but in the further slides I will show you how you can insert legends on the x-axis and y-axis, how you can add titles, and how you can give different colors to the bars. In this slide, I am simply trying to explain to you the interpretation of a bar and its components. Now, if you look at the y-axis, these values are 0, 50, 100, 150 and so on, and these values are the values of the cumulative frequency. How? Let me explain it by taking the first bar, of shop one. You can see, where I am pointing, this is a very small section of black color, so if you move from bottom to top on the y-axis, the height of this section is actually 2,

and what is this 2? This 2 is actually this first value. Now, whatever the boundary of the dark gray and light gray components is, where I am making a cross, this boundary is the cumulative sum of the first two classes. What are the first two classes? If you look at the first column of the data table, for shop one, the first frequency is 2, the second frequency is 26, the third frequency is 42 and the fourth frequency is 30. So this border line indicates the cumulative frequency of two classes, the first and the second: the frequency of the first class is 2, the frequency of the second class is 26, so their sum becomes 28, and this value here is actually 28. Similarly, if you come to the next partition, where I am making a small circle so that you can see it on the screen, what is this point? This point is again representing a cumulative frequency. Cumulative frequency of what? Of the first, second and third classes: the frequency of the first class is 2, of the second class 26, and of the third class 42, so the value at this circle indicates 2 + 26 + 42, which is equal to 70. And if you come to the last border, where I am making a square, what is this point? This point is the sum of all the frequencies: the first class has frequency 2, the second class 26, the third class 42, and the fourth class 30, and their sum is going to be 100, which is what is being denoted here as 100. The same story goes for shop two and shop three; similarly, you can identify the partitions and read the component bar diagram for shop two and shop three. Now, what is the advantage of creating this type of bar diagram? Let us have a look at this bar diagram. If you compare the total heights of these bars, or the heights of particular components, what are they indicating?

height of shop number one first component is smallest, the height of thus bar 2 which is

indicating the shop 2 has more height than the height of the shop 1 that is the was the first bar

and third bar has the highest height. So that is indicating that the number of customers visiting

shop one, shop two and shop three. So one can very clearly see from this graphic that the number

of customer visiting the shop number three, they are the highest and for that, you don't need to

look into the data. Now in case if you want to find out, on a particular day, which shop had the highest number of customers, what do you have to do? You simply have to compare the components corresponding to that day across the bars. For example, in case if you want to see that on

day four, which shop was visited most by the customers? So you can see here, in bar number three, the height of this component, and try to look at the height of this part in the second bar; you can see that this component is smaller than that one. So I can say very clearly, by looking at the last component of these three bars, that the number of customers who

visited on day four were the highest in shop number three, then in shop number two and the

lowest was in shop number one because this height is the smallest. Similarly if you want to see

what really happened on day two? So you can see here, by comparing the dark gray part, this

part, in the three bars, you can simply compare and can look into the heights of the components

and whichever height is more, you can say very clearly that the number of customers going to that shop is the highest. Now let me try to first show you this graphic on the R console so

that you get more convinced.
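By the way, you can also check these border values numerically before going to the console. Here is a minimal sketch using only shop one's column, the values we just read out:

    shop1 <- c(2, 26, 42, 30)   # class frequencies for shop 1
    cumsum(shop1)               # 2 28 70 100: the borders of the stacked bar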

So first let me try to copy this data vector here. So you can see here I have created here a data

matrix like this and after that you can see here, my command was barplot and name of the

variable; in this case, the name is cust. So I can write down here barplot, c u s t, and you can see here this is the same graph which we have just obtained, right. Come back

to our slide and try to do something more.

Now you can see, my objective is that I would like to add some colors and I would like to add

some legends on x and y axis. I want to add some labels, so how to get it done? You see, adding

colors will definitely make the components more informative. They will be more easy to

visualize. The choice of colors depends on you and in R software there is a particular code for

every color. Well, I'm trying to use here the simple colors like red, green, orange and brown, for which the R names have the same spelling, but in case if you want to use any particular shade, please look into the help menu of the R software and decide what color you want and what is the correct

spelling of the command to give that color. So I'll try to write down here the command and I will

explain what is really happening. So you can see here, first, barplot cust, that is the

same command to have the bar plot, this subdivided bar diagram. Now I want to add here these

labels. Please look into the diagram - shop one, shop two and shop three and I would like to add

here that this is my x-axis which is indicating the shops. So how to do it? In order to add these

names, you have a command here or a parameter in the bar plot command which is names

dot arg, n a m e s dot a r g and then you have to give the name of the bars which you would like

to put inside the double quotes separated by comma. So suppose I have here three bars and I

want to give them the names shop one, shop two and shop three. So I have enclosed them within double quotes

and I have separated it with here comma and all these values, they are converted into a data

vector using the c command. Now in order to put a legend on the x-axis, for example, here I am

using here shops, and this I am doing by using the command xlab. xlab is going to specify what should be the label on the x-axis. So here I have the same thing, that I am trying to

put the name inside the double quotes and I want here the shop so these names are user defined

and it completely depends on you. Similarly on the y-axis also, I want to give here a name -days.

So this is given here by ylab, right, and this is here the days inside the double quotes and now

you can see here, in this bar I have given the first component as red, the second component as green, the third component as orange and the fourth component as brown. So I need

to give these colors in the same sequence in which I want to put from bottom to top. So I'm

trying to make here a data vector of red, green, orange and brown colors and each of the color is

written inside the double quotes and they are separated by comma and then all these colors are

put into a data vector using the command c and the name of the parameter under which I'm going

to store this data is c o l which is the short-form of color and once you try to do it and then you

try to execute this command, you will get the same outcome.
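Just as a sketch, the whole command could look like this. Note that only shop one's column was read out above, so the frequencies for shop two and shop three below are assumed values, chosen purely for illustration:

    # day-wise frequencies for three shops; columns 2 and 3 are assumed
    cust <- matrix(c(2, 26, 42, 30,     # shop 1 (from the example)
                     8, 30, 48, 39,     # shop 2 (assumed)
                     10, 35, 55, 45),   # shop 3 (assumed)
                   nrow = 4)
    barplot(cust,
            names.arg = c("shop 1", "shop 2", "shop 3"),
            xlab = "shops",
            ylab = "days",
            col = c("red", "green", "orange", "brown"))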

So I can show you here on the R console also how these things are happening. So on this

R console, I try to copy and paste this command and I try to execute it. So you see here you are

getting the same outcome which I have shown you, right. So this is here red color, this is green

color, this is orange, this is brown and on the x axis, we have a label shops and different bars

have got the names shop 1, shop 2, shop 3, and on the y axis you have the label days, and so on. Now let me come back to our slide.

Now in this slide or in this graphic, in case if you want to make interpretation, you can also do it.

For example, just by comparing the height of the brown component, you can compare that how

many people visited each shop on day four and which shop had the greater number of customers, and simply by comparing the heights of the orange components, you can once

again compare which shop was visited more by the customers, so the height of the

component is simply proportional to the number of customers visiting a shop right and on this y-

axis, as I told you, this is giving you the value of cumulative frequency, right. So this is how you

can create the subdivided bar diagram, and yeah, there are many other options available here, and if you want to explore them more, I would ask you to look into the help menu, okay. And you can also see here I have shown you different aspects, like adding labels and adding colors, so now you can see that this graphic is almost the same as what you used to obtain from expensive paid software. The same thing can be obtained in the R

software without any cost and it is not that difficult. The only thing is this, yes you need to study

the commands, but that is also not difficult; the help menu is always there. You simply have to

look into the help menu and then just type the commands, okay.

Now after this I come to another chart, which is the pie diagram. Pie diagrams are also used to visualize the absolute and relative frequencies. What happens in the pie diagram is that a circle is created and the circle is divided into different segments, and these segments denote a

particular category like a category 1, category 2, category 3, category 4 and the size of these

sections, like this one here, or the size of category 2, or the size of category 3, actually depends upon the relative frequency, and the size of each segment is controlled by the angle. Well, I can use the red color to make it more clear: this is the angle

which is going to determine the size of that category 1. Similarly this is the angle which is going

to determine the size of category 3, and this size is determined by the angle, which equals relative frequency multiplied by 360 degrees. So whatever relative frequency you have obtained, just multiply it by 360, and whatever angle you get, we need to create this angle over here, and that will

give you the segment of the pie diagram and this type of diagram is called as pie diagram. Now

this pie diagram can be created in two dimensions or three dimensions. For example, here I'm making it a two-dimensional plot, but the same plot can also be made like this, something like this, and more beautiful, and so on; here you can see this is the height, and so on. So I will try to discuss two dimensional and three dimensional pie diagrams both

here.

In order to construct the pie diagram in R, the command here is pie, and inside the arguments you have to give the data. This data is given by a vector called x. Now I will be

more often using the symbol x to denote the data vector and after that, there is a long list of the

arguments or the parameters which can be used here to give labels, control the size, control the

colors and so on, right. So but in our case I have chosen some popular aspects. For example, first

aspect is x, which is giving the data vector; the second parameter is labels, which gives a description to the slices; the third parameter is radius, which is

indicating the radius of the circle of the pie chart; then another parameter here is main, which is going to indicate the title of the chart; c o l, colors, is going to indicate the colors of the slices that we can choose; and the last option which I will show you here is clockwise, which is used to indicate whether the slices are drawn clockwise or anti-clockwise, and for that you can use the logical values TRUE or FALSE, written in capital letters. So if you want more details, I will

request you that you please go to your R and try to look into the help.
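To summarize these parameters, here is a minimal sketch of the call; all the values are arbitrary and only show the syntax:

    x <- c(40, 35, 25)                    # arbitrary frequencies
    pie(x,
        labels = c("category 1", "category 2", "category 3"),
        radius = 1,                       # radius of the circle
        main = "A sample pie diagram",    # title of the chart
        col = c("red", "green", "blue"),  # colors of the slices
        clockwise = TRUE)                 # draw slices clockwise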

For example, I can show you here: if you want help on pie, simply type help with pie inside the double quotes, and you can see here, you will come to the website of the R software where they have given all the details, but for this you need an

internet connection, right. You can see here there are many many options. So definitely, I am not

going into those details but I will try to continue with these things.

So now I will try to explain this with an example. Suppose 10 persons are asked whether they are graduate or non graduate, and their data is recorded as G for graduate and N for non graduate, like here: graduate G, non graduate N. Then, in order to convert this into numerical values, we will use the number 1 to denote a graduate and

number 2 to denote a non graduate person. So the data on the third person which is here G can be

converted or can be written as 1, the data on the 4th person which is non graduate can be written

as or can be denoted as 2. So if we have now here this data vector and we want to create a pie

chart for this thing, ok. Now I try to collect this data using the c command under a variable

named quali which is a short form of qualification. So this is the data which I have stored here

and this is a screenshot. Now in case if I want to create here the pie chart, you can see here that

there are now two categories, 1 and 2, indicating the graduate and the non graduate.

So in case if I want to create here a pie diagram I would simply use here pie and then here quali

and as soon as you do it you will get a graph like this one, but now my question is: is this what you want? Think about it. If you try to look into this graphic, this pie diagram, it is giving you 1, 2, 3, 4 and so on, up to ten categories, but just now you indicated that there are only two categories, 1 and 2, for the graduate and non graduate. Then why is this happening? Now you

may recall in the earlier lectures while creating the bar plot, I explained this aspect that whenever

we are trying to plot the bar plot or the pie chart, we are essentially plotting the frequencies. So

whatever is the data, that has to be converted first into the frequencies and then I have to

create this chart on the frequency, right.

So first I try to use the table command and I try to convert this data into frequency table. So you

can see here there are two categories 1 and 2 and this is indicating that there are seven persons in

category 1 and 3 persons in category 2 and then I try to make it here the pie diagram. You can

see here now this is giving us a pie chart or a pie diagram that we wanted. So this white is

indicating that there are seven persons and this blue is indicating that there are 3 persons. So by

looking at this angle, you can see that this segment is much larger than this segment. So this is

giving us a clear idea that the number of graduates is higher than the number of non graduates, and this is the screenshot, but I would like to show you first on the R console how things are happening.

So first I try to create the data, so you can see here this is my data on qualification, quali, and then I make a frequency table of this data quali, which is like this, and then I would try to

use the pie command over table quali and you can see here, you are getting here the same

outcome that we had in the slides.
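For reference, the whole sequence is just these three lines; the ordering of the ten values is assumed here, since the lecture only fixes that there are seven 1's and three 2's:

    quali <- c(1, 2, 1, 2, 1, 1, 1, 2, 1, 1)  # 1 = graduate, 2 = non graduate
    table(quali)       # frequency table: seven 1's, three 2's
    pie(table(quali))  # pie diagram of the frequencies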

Now I will come to the next aspect of this pie diagram, making it more beautiful

by adding colors and labels etc. So you can see here this is the same pie diagram but here I have

added a label, a title, and I have added here the labels - graduate and non graduate and I have

used different colors- red color and blue color. So how to do it? Now I have to use the different

options. By different options I mean: if I want to give these graduate and non graduate labels, I have to give them by using the parameter labels, l a b e l s, equal to graduate inside the double quotes and the second label non graduate inside the double quotes, separated by this

comma and both this graduate and non graduate labels are combined using the c command and

this title, the persons with qualification, is given here by the parameter main; main is used to indicate the title, and whatever title I want, I'm giving it inside the double quotes, you can see here. After this I am giving a vector of colors as I did earlier: the colors

red and blue, they are written again inside the double quotes and separated by comma and they

are combined with the c command and they are stored in the parameter c o l and if you try to do

it here, you can see here that you are getting the same thing. So I would try to show you here on

the R console.
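Putting it together, the command sketched above is:

    pie(table(quali),
        labels = c("Graduate", "Non graduate"),
        main = "Persons with qualification",
        col = c("red", "blue"))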

So you can see here, you are getting the same graphics over here. Now I would just take a quick

example to show you what really happens when we have a large amount of data.

For example here you can see, I am taking a simple example where there are 100 customers who

are visiting a shop and they are attended by three salespersons, whom we call 1, 2 and 3, and it is recorded which customer was attended by which salesperson, like: the first

customer was attended by salesperson 1, second customer was attended by salesperson 2 and so

on, right.

So this is the data, and I try to collect all the data inside this data vector salesper, which is indicating the salesperson, and then I try to create a frequency table. You can see here there are three categories, indicating salespersons 1, 2 and 3. Now, by looking

at this data, you may not have an idea of the number of ones, twos and threes, but by looking at this frequency table, I can say very clearly that the first salesperson has attended 28%, the second has attended 43% and the third attended 29%, and if I try to create a pie diagram, this is now

given over here.

So this is indicating the category 1, this is indicating the category 2 and this is indicating the

category 3, and similarly if you want to make it more beautiful and more informative by adding labels, then I simply have to use the same parameters labels, main and colors, and then I

have to define what colors I want, what heading I want, and what labels I want to give it here.

For example, I am giving here sp1 sp2 and sp3 to the salesperson 1, 2 and 3, right.
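As a sketch, since the full 100-value vector is not reproduced here, we can build a vector with the stated frequencies 28, 43 and 29 directly using rep; the title and colors below are assumed choices:

    salesper <- rep(c(1, 2, 3), times = c(28, 43, 29))  # illustrative rebuild
    table(salesper)    # 28, 43 and 29 customers for salespersons 1, 2, 3
    pie(table(salesper),
        labels = c("sp1", "sp2", "sp3"),
        main = "Customers attended by salespersons",  # assumed title
        col = c("green", "red", "blue"))              # assumed colors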

So let me quickly show you what will happen here. I try to use this data, store this data, and then I make a table of salesper, which looks like this, and then I create a pie diagram of it; you can see here, this is the same pie diagram that you obtained. Now I would like to add these colours etc., so I can do it here, and you can see, I'm going to get the same outcome which I just showed you.

Now I would like to stop in this lecture. We have learned how to create the pie diagram which is

essentially a two-dimensional pie diagram. Now in the next lecture I will continue and I will show

you how to create the three-dimensional pie diagram. So now I would request you, please try to take some data from the books and try to create such diagrams on the R console and experiment with them, and my suggestion will be: please don't restrict yourself only to the parameters that I have used here. I am doing so because of the limitation of time, but

please go through the help menu, try to read the different interpretations of the different parameters and try to use them inside the R software for these diagrams, and that will give you

more confidence and you will become much better in producing more beautiful and more

informative graphics. So you practice and we will see you in the next lecture till then, Good Bye.

DESCRIPTIVE STATISTICS WITH R SOFTWARE

LECTURE 12

GRAPHICS AND PLOTS – 3D PIE DIAGRAM AND HISTOGRAM

Welcome to the next lecture on the course Descriptive Statistics with R Software.

(Refer Slide Time: 00:19)

You may recall that we had a discussion on different types of graphics in the last lecture, and

we had concluded our lecture with a discussion on pie diagrams.

So in this lecture I’m going to address two topics, two more types of graphics, one is 3

dimensional pie diagram and another is histogram. This pie diagram and 3 dimensional pie

diagram, they are more or less similar, the only difference is in their look. The construction, the

structure and the interpretation, they are the same as in the case of pie diagram.

So let us start our discussion first with the 3 dimensional pie diagram. As in case of pie diagram

there are different slices, and those slices represent the absolute or the relative frequencies.

(Refer Slide Time: 01:29)

Similarly in case of 3 dimensional pie charts or 3 dimensional pie diagram, they also represent

the absolute and relative frequencies. The difference between a pie diagram and a 3

dimensional pie diagram is that in case of pie diagram there is a slice, but in a 3 dimensional pie

diagram there is a circular slab and this slab is partitioned into different segments or slices, and

every segment or every slice represents a category of the frequency distribution and the size of

each segment, this depends on the relative frequency, and this is the same case as it happens in

the case of pie diagram also.

And here also the size of each segment is determined by the angle, and the angle is determined

by the same formula as in the case of pie diagram that is relative frequency into 360 degree, so

a pie diagram is a circular diagram which is partitioned into different segments and the size of

the segment is determined by the angle.

Similarly, the 3 dimensional pie diagram, this is a sort of circle having a third dimension as its

height,

(Refer Slide Time: 03:00)

and same way as in the case of pie diagram, we create the slices and the size of the slice is again

determined by the angle.

(Refer Slide Time: 03:13)

So let us now try to first understand how to create a 3 dimensional pie diagram in R software.

So in order to construct the 3 dimensional pie diagram, we have a command here, pie3D: pie in small letters, then the number 3 and a capital D, and then here is the data vector, exactly as in the case

of pie diagram.

And then there are certain parameters which are given for different types of options, like labels and other things. The difference between the pie diagram and the 3 dimensional pie diagram is

that construction of pie diagram is the part of the base package of R that is inbuilt in the base

package, but in order to construct the 3 dimensional pie diagram, we need to install a package

or a library, so in order to do it, we need here a library plotrix, p l o t r i x,

(Refer Slide Time: 04:33)

so first we need to install this library using the command install.packages and inside the

arguments, inside double quotes we have to write plotrix, p l o t r i x, and once I do it then I

have to use the command library plotrix.
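So, on the console, the two commands are simply:

    install.packages("plotrix")  # one-time installation of the package
    library(plotrix)             # load the package in the current session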

In case if you execute these two commands on the R console, you can get the library plotrix on

your computer and if you try to see I have installed this package on my computer and this is the

screenshot and so on,

(Refer Slide Time: 05:18)

now I will try to take some examples to show you how to create the 3 dimensional pie diagram.

So once again I’m continuing with the same example that I considered in the case of the pie diagram: I have data on 10 persons and we have recorded their educational qualification in 2

categories graduate and non-graduate and this data has been indicated by 1 for graduate and 2

for non-graduate and we have this data vector and we have stored the data in a variable name

quali, and so now we have here this data vector quali consisting of two numbers 1 and 2 and

we would like to create a 3 dimensional pie diagram for this data.

(Refer Slide Time: 06:12)

So obviously as we have discussed earlier that whenever you want to create pie diagram or say

3 dimensional pie diagram you need to input that data in the form of frequencies. So what I’ll

try to do? First I would try to create the frequency table using the data quali, using the

command table quali and you can see here I already had done it in the last lecture, so I’m

simply reproducing here a screenshot, and after this, you simply have to use the command

pie3D, remember one thing D here is in capital letter, and then you have to use the data that is

obtained by table quali.

And once you do so you will get here an outcome like this one, so you can see here this is your

3 dimensional pie diagram,

(Refer Slide Time: 07:13)

the third dimension has been added by this height here, and in case if you want to make it here

more informative

(Refer Slide Time: 07:22)

by adding the names to the slices like as non-graduate and graduate and if you want to add here

title of the graphics like as persons with qualification, and if you want to change the colours

you can use the similar commands what we used in the case of pie diagram. For example, if you

try to see pie3D table quali this is the same command that we used earlier.

And now, in order to give the two categories graduate and non-graduate, I’m using a parameter labels, l a b e l s: labels equal to graduate and non-graduate, whatever names we want to give, inside the double quotes, and these two values are combined in a vector using the c command.

And similarly, if you want to give that title, this title is given by the parameter main, so I have to write main is equal to, and inside the double quotes I write whatever title I would like to have.

And then you can see here one slice is in red colour, and another slice is in blue colour, so once

again I will use the similar command here: col is equal to red and blue, the R names for the two colours, inside the double quotes and separated by a comma; they are combined with the c operator to specify the colours of the slices. And once you try to do it

you will get a 3 dimensional pie diagram like this.
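For reference, a sketch of the command just described is:

    library(plotrix)  # pie3D comes from the plotrix package
    pie3D(table(quali),
          labels = c("Graduate", "Non graduate"),
          main = "Persons with qualification",
          col = c("red", "blue"))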

Now I would like to show it to you on the R console, so first I try to load here the library so you

can see now there is no error, the library has been loaded,

(Refer Slide Time: 09:17)

and now I define here the data quali, and if I want to make the 3 dimensional pie diagram, I first have to create the frequency table of quali with the table command, and if you try to do it, I get here a 3 dimensional pie diagram like this one.

(Refer Slide Time: 09:45)

And similarly, in case if you want to add here more information I can execute the same

command over here, I will try to copy and paste the same command and you can see here that

you are getting the same graph that you had obtained, right.

Now I will try to show you another feature in the same 3 dimensional plot; you can see here that these two slices are joined.

(Refer Slide Time: 10:20)

In order to make it more informative I can separate it, so that the graphic will look like this that

you can see here that the red and blue parts are separated.

(Refer Slide Time: 10:35)

In order to make this type of graph, I can use here one parameter that is explode, so you see

here in this command which is the same as the earlier one, but now I’m adding here a new

parameter explode, e x p l o d e, all in small letters, equal to 0.2; actually this 0.2 is the factor that is going to decide how much separation you want, for example in this case, this is

the space between the two slices or two slabs, so this 0.2 factor is going to determine this thing,

so I’ll try to show you on the R console so that you are more comfortable and then I’ll try to

take one more example and I’ll try to show you all those things very quickly. So if you try to see here, now I have used the parameter explode, and suppose I try to change this explode value: instead of 0.2, suppose I make it 0.8, you can see what happens. You can see here now the separation becomes more,

(Refer Slide Time: 12:12)

so by increasing the value of the parameter explode, you can increase the separation between the slices.
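The only change from the earlier command is the extra parameter; a sketch:

    # larger values of explode give a wider gap between the slices
    pie3D(table(quali),
          labels = c("Graduate", "Non graduate"),
          main = "Persons with qualification",
          col = c("red", "blue"),
          explode = 0.2)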

So now I would take one more example to make you comfortable, so once again I’ll try to use

the same data set which I had used earlier in the pie chart,

(Refer Slide Time: 12:31)

so that was the data on the 100 customers; they were attended by 3 salespersons, 1, 2 and 3, and this is the data here that was stored in a variable

salesper, and now I’m trying to create the frequency table using that table command,

(Refer Slide Time: 12:51)

and now you can see here there are three categories 1, 2 and 3, and if I try to make the simple 3

dimensional pie diagram using the command pie3D, I simply have to use the same command

and I have to change the name of the variable which is now here the salesper, so you can see

here this is the standard 3 dimensional pie diagram which is using the default values, so you can

see here this 1, 2 and here 3 they are indicating the 3 classes that is the sales person 1, sales

person 2 and sales person 3.

(Refer Slide Time: 13:34)

And similarly if you want to give here title or the names to these slabs, right, you can also do it

here by using the same command labels, main, colour, but now I have here 3 categories, so now

I’m using here green, red and blue, 3 colours, exactly on the same lines and you will get here

this type of plot.

(Refer Slide Time: 13:57)

And in case if you want to use the parameter explode, so for example here I’m trying to use

here explode is equal to 0.3, then you will get here three separated slices, so you can see here

these slices are now separated.

So I’ll try to show you on the R console also so that you are more comfortable; first I’ll try to

copy this data, and then I’ll try to make it here pie3D on the sales person, but I need to give it in

the form of a table, so table of this one, so right, you can see here this is the same graphic that I

have just shown you.

(Refer Slide Time: 14:43)

And similarly, if you want to make it clearer by adding the titles, colours and legends, you can use the same command here, and I can show you the outcome coming out like this; this is the same output that I just showed you.

(Refer Slide Time: 15:00)

And in case if you want to use the explode option, just add this as one of the parameters inside the arguments and you will see, now this is separated,

(Refer Slide Time: 15:17)

and once again, in case, if you want to make the separation bigger you simply have to increase

the value of explode, suppose I make it here 0.8, so you can see here now the separation

becomes more,

(Refer Slide Time: 15:35)

so now it essentially depends on the choice of the experimenter, what exactly he or she wants.

Now after this pie diagram, let me try to introduce here histogram.

(Refer Slide Time: 15:48)

So a histogram is a graphic, but this is used for continuous data. You can recall, we had discussed the aspects of discrete data, continuous data and so on. A histogram does the same thing that a bar diagram or a pie chart does, but the difference is that bar diagrams and pie diagrams are essentially for categorical variables, where the values are indicated by some numbers representing the category, whereas the histogram is for continuous data. The histogram first categorizes the data into different groups, and then it plots the bars for each category, and in this case the data is always continuous; or I would say, whenever the data is continuous, please plot a histogram.

Now there is a difference between the bar plot and histogram, you may recall in case of bar

diagram I had told you that the height of the bar is simply proportional to the frequency or

relative frequency, width of the bar is immaterial, so we don’t bother about it which has no

interpretation, but this does not happen in the case of histogram.

In case of a histogram, the frequency is essentially proportional to the area of the bar, and the area of the bar is given by the height of the bar multiplied by the width of the bar. So now, in this case, you can see that the bars in the histogram have to be

controlled with respect to height and width both, you will notice in most of the cases the width

of the bars is kept the same in a histogram, and the reason for this is just to make it simple to understand: if you have 2 bars which you have to compare by their areas, versus 2 bars which you have to compare only with respect to their heights because the widths are the same, then which is more convenient? Obviously the height of a bar is easier to compare than the area of a bar, okay.

(Refer Slide Time: 18:28)

Now let us try to see how you are going to create the histogram based on the frequency

distribution, you may recall that we had some data, some continuous data and we had discussed
that this data is divided into different classes, and those classes have lower limits and upper

limits, and this is called the class interval.

And the size of the interval will provide us the width, and every class will have some frequency or relative frequency, which is the number of values belonging to that particular category.

So now, if you try to understand the construction of a histogram, what do we really try to do? We try to create here 2 bars,

(Refer Slide Time: 19:35)

say this will have the limits, for example this value will be your a0, this will be your here a1,

and this will be your here a2.

(Refer Slide Time: 19:43)

Now we have got the data x1, x2, ..., xn, suppose n values are there. Now I try to see whether this x1 belongs to class interval 1 or class interval 2; this is my class interval 1 and this is 2. Suppose x1 belongs here, so its value on the X axis inside this bar lies somewhere here; now I take the second value, suppose this value lies over here, the third value lies over here, the fourth value over here, the fifth value here

and so on some f1 number of values will be lying inside the bar 1, and similarly here f2 number

of values will be lying inside the bar 2.

So one thing we do is assume that all these values are spread around the mid value; the mid value is determined by (a0 + a1)/2 for the first interval, and for the second interval, a1 to a2, the mid value will be (a1 + a2)/2. So what I'll assume here is that all the values are concentrated at the mid value.

(Refer Slide Time: 21:17)

So what I’ll try to see here that the frequency of the class interval 1 a0 to a1 is f1, so assuming

that all the values are at one place, I’ll try to make it here the frequency f1, and similarly the

height of this one will become here f2, and I’m assuming that the width of both the intervals are

the same, so this is how the histogram is constructed.

(Refer Slide Time: 21:57)

Now obviously, in case if you try to create a histogram where one bar is very thin and another is very wide,

(Refer Slide Time: 22:12)

this is not so convenient to compare the two areas, so that is why it is emphasized that for all

practical purposes the widths of the bars are kept the same.

And now instead of this frequency, I can also have here relative frequency f1/n and say f2/n,

but it depends on the need and requirement what we really want to do.

(Refer Slide Time: 22:40)

Now in R software, the histograms are constructed by the command h i s t, and inside the arguments we have to give the data vector x, and you will see that in this case you don't need to

create the table, histogram function or the function hist will automatically create the frequency

table and then it will create the bars, so this is different from the case of bar diagrams or pie

diagrams.

Now in histogram I have two options, histogram can be created using the absolute frequencies

or it can be created using the relative frequencies, so in case if I want to use the absolute
frequency then there is no issue; this command hist will take care of it, and that is the default

choice, but in case if you want to create the histogram using the relative frequencies then you

have to add here one more parameter f r e q is equal to capital F that means frequency is equal

to FALSE. So as soon as you set freq to FALSE, the function h i s t will automatically consider the relative frequency for the construction of the bars.
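In short, assuming x holds your numeric data vector:

    hist(x)                # default: bar heights are absolute frequencies
    hist(x, freq = FALSE)  # bar heights on the relative frequency (density) scale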

Now in case if you want to improve your histogram as we have done in the case of bar chart,

pie chart and so on,

(Refer Slide Time: 24:20)

there are some more choices of parameters which can be given inside the arguments, so

obviously x is going to determine the data vector, the numerical values for which

the histogram has to be constructed.

Now in case if we want to give the title of the chart then this is controlled by main, m a i n, in

case if you want to change the colours of the bar then we have to use the parameter c o l, in case

if you want to add any description on the x axis then we have to use xlab, and in case if you

want to control the limits on the x axis then you have to use the command xlim.

And similarly in case if you want to control the limits on the y axis also then you can use it here

ylim, and there are more options but I would suggest you that you please try to look into the

help using the command, help hist inside the double quotes, inside the arguments, that will give

you more information.
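Here is an illustrative sketch putting these options together; all the values are arbitrary and only show the syntax:

    hist(x,
         main = "Title of the chart",   # title
         col = "grey",                  # colour of the bars
         xlab = "label on the x axis",  # description on the x axis
         xlim = c(100, 200),            # limits on the x axis (arbitrary)
         ylim = c(0, 15))               # limits on the y axis (arbitrary)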

(Refer Slide Time: 25:40)

Now let me take here an example to show you the construction of histogram, so here in this

example we have the heights of 50 persons recorded in centimeters. Now, if you look at these values, at first glance are you getting any idea of the information contained inside the data? It is very difficult, and that is the advantage of using the histogram: it will try to reveal the information contained inside this set of data. So I have stored all this data in a variable height using the command c,

(Refer Slide Time: 26:30)

and after this, if I try to use the command h i s t on the variable height, h e i g h t, we get this type of graph. So you can see here, this is giving us the intervals 120 to 125, then 130, 135, 140 and so on, and if I ask what are the values which are

contained inside this bar, so all those values which are less than 125, they are stored in this bar,

and I can look at the height of the bar here; since the widths of all the bars are the same, by looking at this value I can say that there are 5

values which are smaller than 125.

(Refer Slide Time: 27:30)

Similarly, in case if I try to look at this interval, the frequency here is 2, so I can say here there

are two values which are lying between 125 and 130.

Similarly, in case if I try to look at this interval, this is starting from 155 to 160 and the

frequency here is say close to 7,

(Refer Slide Time: 28:00)

so that is indicating that there are 7 values between 155 and 160, and this is also the same as the height of the next bar here, so these two bars have the same frequency. So I can say that the number of persons having heights from 155 centimeters to 160 centimeters, and from 160 centimeters to 165 centimeters, are the same, so this type of information is revealed from this

type of graphics.

Now in case if you want to improve the look by adding colours or by adding say legends or

controlling the limits you can use the parameters, and how to use those parameters, I will try to

show you here some of them,

(Refer Slide Time: 29:00)

but I would request you to have a look on the help and then try to see. So for example here I am

trying to give the title of the chart as, say, heights of persons, and I have changed the colour of the bars, and on the X axis I am giving a legend, heights, and on the Y axis I'm giving the title number of persons.

So in order to get a graph like this one, I simply have to add the parameters inside the hist

command. So I’m using the parameter main, with main equal to heights of persons inside the double quotes, and that is going to give me the title of the graph.

And similarly this green colour, this is controlled by this command c o l, so I’m trying to give

here col equal to green inside the double quotes; and this title on the X axis, heights, is going to be controlled by xlab, so I'm giving the name heights inside the double

quotes.

And similarly the name on the Y axis is controlled by ylab which once again I’m trying to give

it inside the double quotes. And similarly if you try to add some more parameters over here,

you can make it more informative depending on your choice, depending on your wish,

depending on your requirement.
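So the complete command sketched above is:

    hist(height,
         main = "Heights of persons",
         col = "green",
         xlab = "heights",
         ylab = "number of persons")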

So now I stop here in this lecture and once again I would request you that you please try to

choose some dataset from the book which are continuous and try to create this histogram, and

similarly, try to practice the 3 dimensional pie diagram and try to use different types of parameters; try to give them different values, for example I have shown you that when we use explode equal to 0.2 or 0.8, the separation between the slices changes. That will give you a better idea of how the graphics can be controlled and how the graphics can be made more informative.

Similarly in the histogram also there are some other parameters which we have not used here,

but I would request you to have a look on the help menu and try to see how they are used, and

try to experiment them. So keep on practicing and we will see you in the next lecture, till then

good bye.

Lecture – 13

Graphics and Plots - Kernel Density and Stem-Leaf Plots

Welcome to the lecture on the descriptive statistics with R software. You may recall that, in the last

lecture, we had a discussion on the construction of the histogram. Now, in this lecture, we will continue with the same idea and we will try to discuss the kernel density plot. And after that, I will take up the stem-and-

leaf plot.

Refer slide time: (0:41)

Now let me start our discussion. In case if you recall, what do you do in case of construction of

histogram? First you have a frequency distribution in which you have class intervals, and then you try to

plot the bars, where the lower and upper limit of the bars are indicating the lower and upper limit of the

class interval. And in case if your data becomes quite large, then what will happen? Ideally you assume

that, whatever are the observations which are contained inside the bar, they are assumed to be

concentrated at the midpoint of the class interval. So, in case your number of data points becomes large, then obviously you will have to create more bars. And, if I say it in very simple words, in case you want to make your histogram more precise, then you need to make more bars.

So, we first try to understand what will really happen, and then, based on that, we try to see how to represent the data. Actually this can be done through the concept of the kernel density plot.

Refer slide time: (2:05)

So, first you try to see here what happens. In case of a histogram, the continuous data is artificially categorized into different class intervals. The choice and width of the class intervals are very crucial in the construction of a histogram. For example, you may recall that in case you have some data, the histogram may look like this, or another alternative is this: in case you try to make the width of the bars smaller, then it may look like this, and so on. So, if you try to see, in case you want to make your histogram more precise, or if the number of data points becomes very, very large, you need to create more bars. And what you ideally assume here is that whatever data is

concentrated inside a bar is concentrated at the midpoint of the interval. Right. So, what can we do? We can join these points, like this, like this.

And similarly, in case you want to do it here, this will look like this; these are straight lines and so on, and these points can be extended so that they vanish on the x-axis. But now, in case the number of bars is increasing, then another option is that I can join the midpoints of the bars by a smooth curve, like this; you see, I will put my pen here and then I will try to make a small curve like this, or something like this, and this curve has been drawn in such a way that it is passing through most of the midpoints of the class intervals. This curve is called a frequency density plot, or in simple

words this is called a density plot. Now, this is the concept. Our next issue is: when we really want to implement it for computation, how do we get it done? In order to construct such plots, we take the help of kernel functions, and based on that, we create or construct the kernel density plots.

Refer slide time: (5:09)

These density plots are like smoothened histograms. Smoothened histograms means we are trying to create the histograms and then join the midpoints of the bars by a smooth curve; this is really what we mean. Now, this smoothness depends on the number of bars: in case you have a large number of bars, the curve is going to be smoother, and as the number of bars becomes larger and larger and tends towards infinity, this curve will become a perfectly smooth curve; this is the basic idea. In a plot, this smoothness is controlled by a parameter which is called the 'Bandwidth'. The density plots help us in visualizing the distribution of

the data over a continuous interval or a time period. And in case, if I want to explain this kernel density

plot in very simple words, then I would say that this is only a variation of a histogram where the histogram is being constructed by using the concept of kernels. Right. So, a density plot is simply a variation of a histogram that uses kernel smoothing to smoothen the plot by smoothing out the noise; this noise is coming because of the class intervals. Right.

Refer slide time: (6:40)

And whenever, we are trying to create such a kernel density plot, the peaks of the density plot, they

display where values are concentrated over the interval. That is the value of the frequency. And the

advantage of kernel density plots over the histogram is that the shape of the density plot is not affected by the number of bins (number of bins means number of bars), whereas in case of a histogram, the shape of the histogram is determined by the width of the bins, the number of bins and so on. So that is why density plots play a better role than the histogram when you have large data. These density plots are constructed using the concept of the kernel density estimate. So, let us try to first understand: what is this kernel density estimate?

Refer slide time: (7:45)

What are these kernel density functions? A kernel density plot is essentially defined by a kernel density function, and you can see how this kernel density function is defined; I am writing here a

mathematical function $\hat{f}_n(x)$, where $x$ is the variable on which we are trying to collect the data. This is given by

$\hat{f}_n(x) = \dfrac{1}{nh}\sum_{i=1}^{n} K\left(\dfrac{x - x_i}{h}\right), \quad h > 0,$

where h is

some positive value. Here, small n is denoting the sample size and small h is indicating the bandwidth; this is the parameter that is going to control the smoothness of the kernel density plot. And this K here is called the kernel function. The kernel function is not arbitrarily defined; this function has some properties which have to be satisfied in case a function has to be treated as a kernel function, and these properties are similar to the properties of a probability density function.

Well, for those who do not have a background in statistics, I can tell them that in statistics we have probability density functions for continuous random variables; these are functions which have certain properties, and these functions help in determining the probabilities of events. For example, you have heard of the normal density function, the gamma density function, the t distribution, the chi-square distribution, the F distribution and so on. So here, I'm not going into that detail, but I just wanted to give

you some idea. And in the case of a kernel function, in case I try to choose different types of K, that is, different kernel functions, we will get different estimates and, consequently, different types of plots. Now I can say briefly that the kernel density plots are going to be constructed on the values of the kernel function that we obtain on the basis of the given set of data. So we have data, we estimate the values of the kernel density function, and then we plot them. Obviously, in case you change the form of the kernel function, your estimate may also change, but definitely the kernel choices are taken in such a way that there is not much difference among different kernels, and the aim is to provide a function which more efficiently presents the true frequency distribution of the data. Okay. So, just to give you an idea of what I am going to use here, and what choices are available in R.

Refer slide time: (11:06)

We are going to discuss here three choices of kernel function. One function is the rectangular kernel, the second is the Epanechnikov kernel, and the third is the normal distribution kernel, also called the Gaussian kernel. Right. What is this rectangular kernel? The rectangular kernel is defined as a function which

1
 if  1  x 1
takes value K ( x)   2 . Similarly Epanechnikov kernel, this takes the value,
 0 elsewhere.

3
 1  x  if | x | 1
2
K ( x)   4 . Similarly the Gaussian kernel, this is based on the normal
 0 elsewhere.

distribution. We know that the probability density function of the normal distribution with parameters mu and sigma square is given by

$\dfrac{1}{\sigma\sqrt{2\pi}} \exp\left(-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty.$

Right. And actually,

when we are trying to construct such kernel density plots in R software, in case you don't give any choice, this Gaussian kernel is the default choice. But I will try to show you how the density plot looks when we try to change the type of kernel. So, let us take here the same example that I had taken in the case of the histogram, and I will try to construct the kernel density plot in the R software and compare it with the histogram also. So now, let us try to consider an example in which

Refer slide time: (13:12)

the heights of 50 persons are recorded in centimeters, and we would like to create a density plot for this data. So, this data has been stored in a variable height, like this.

Refer slide time: (13:30)

And after this, in case you want to plot this kernel density plot, the command in R software is plot density: I write plot, inside its argument I write density, and then inside that argument I give the data. So this is the command for plotting the density; remember, this is plot, inside the argument you have to write density, and then, once again inside the argument, you have to give the data. So here, the data is given by the variable height, and in case you execute it, you

will get this type of graph. So, you can see here the number of observations, 50. The bandwidth here is determined by the Gaussian kernel, which is the default kernel when we are not specifying anything, and you can see here, this curve looks like this, a smooth curve, something like this.

So, this is called a 'Kernel Density Plot'. And here, these values are something like the class intervals in the case of a histogram. This type of smooth curve helps us in getting an idea about the distribution of the numerical values, and these types of curves are actually more useful when you have a large amount of data; they will give you much better information than the histogram. Okay.
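So the whole plot comes from a single line:

    plot(density(height))  # Gaussian kernel with the default bandwidth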

Refer slide time: (15:17)

Now, in case I try to plot the histogram and this density plot side-by-side, it will help you in comparing the two. So you can see here, I have plotted the histogram of the same data and here is the Gaussian kernel plot. Now, you can see how things are happening over here. So, one thing

what you have to keep in mind here, that the histogram is starting from hundred twenty, whereas, kernel

density plot is starting with zero. What is really happening is that the width of these bars is now made smaller and smaller, and that is controlled by the choice of the kernel, and then all these frequencies have been joined together to give this type of curve. So, you can see the similarity between the two graphs. For example, you can see here, the maximum frequency is at the same place. But this is

giving us more information. But before going further, let me try to plot this data on the R console and show you here, so that you are more confident.

Refer slide time: (16:33)

So, now let me copy this data on the R console and you can see here, this is the data on the height and

now I use this plot density command, and if I execute it on the R console, you can see here you get this type of kernel density plot, based on the choice of the Gaussian kernel. Right. And now, I will show you one

more aspect,

Refer slide time: (17:09)

You can adjust these kernel density plots using a parameter, adjust. This adjust is going to be a parameter that takes some numerical value, and, as you will see here, the structure or smoothness of the curve is going to be controlled by this adjust parameter. For the sake of understanding and explanation, I am taking here two possible values of adjust, 0.5 and 1, and

then I am plotting the same data using the same command. So you can see here the structure of this plot and the structure of that plot: the data sets are the same in both situations, and the same Gaussian kernel density function has been used to construct these densities, but their structures are different. The first is giving you more variation, and the second one is smoother; you can see, in the first case the value of the bandwidth chosen is 2.698, and in the second case the chosen bandwidth is 5.397, which corresponds to adjust equal to one. Right. Now, I would say that a question comes: what should be the proper value of

adjust? For that, I will say: try to create a histogram, look into your experimental data, try certain values of adjust and plot your curve, and whichever curve represents the situation in a more authentic way, more honestly, choose that value of adjust.

Refer slide time: (19:13)

And similarly, in case if you want to increase the value of adjust, for example, in the next slide I am

trying to create the same density plot, using the adjust equal to one and two. And you will see that, as I try

to increase the value of adjust parameter, the smoothness increases more. For example, you can see here,

this first graph, this is the same data with adjust equal to 1, and the second graph, this is with adjust

equal to 2. You can see here, with adjust equal to 2, this is more like a Gaussian curve or normal curve.

Because, here the bandwidth has increased from 5.397 to 10.79, nearly double. Right. So, this is how you

play with your adjust parameter.
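As a sketch, the plots just discussed come from:

    plot(density(height, adjust = 0.5))  # rougher curve, smaller bandwidth
    plot(density(height, adjust = 1))    # the default amount of smoothing
    plot(density(height, adjust = 2))    # smoother, roughly double the bandwidth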

Refer slide time: (20:06)

And now, in this slide, I am simply giving you all three graphics, with adjust equal to 0.5, 1 and 2, together, so that you can make a better inter-comparison. You can

see, that as the value of adjust is increasing, this degree of smoothness here is increasing. So, this is how

you can play with this data. So, first let me show you, this execution on the R console.

13

336
Refer slide time: (20:43)

So, you can see here, I'm trying first the density plot with adjust equal to 0.5, then I take the value 1, and then I take the value 2; you can see here, this is becoming more normal. And suppose, just for the sake of illustration, I take the value to be 10; you can see here, this is now even closer to the normal curve, and so on. Right. Ok. Now, let me try to address another feature of the kernel density plot.

Refer slide time: (21:30)

That is, in case I try to use a different kernel function. So, you can see here, on the upper left-hand side, I'm trying to create a density plot using the Gaussian kernel, which is the default kernel, and for that I am now giving the choice as kernel, k e r n e l, equal to, inside the single quotes, gaussian, g a u s s i a n. And similarly, I am taking another kernel here, that is the rectangular kernel. If you try to compare both these graphs, you can see the overall shape is almost the same; even if you plot the graph here, it will look something like this, on similar lines. But the structures are different. So, now you are the one who has to decide which of the kernels is going to give you the better representation of the data. And you can notice that the bandwidth in both the cases is the same, 5.397. Right. And if you want to change the bandwidth and the choice of kernel both together, then you use the parameter adjust as well as the kernel choice.
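A sketch of these calls, again with a hypothetical vector x; 'gaussian', 'rectangular', 'epanechnikov' and 'triangular' are among the kernel names accepted by density(), and the kernel choice can be combined with adjust:

plot(density(x, kernel = "gaussian"))                  # default kernel
plot(density(x, kernel = "rectangular"))               # same bandwidth, different shape
plot(density(x, kernel = "epanechnikov", adjust = 2))  # kernel choice and adjust together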

Refer slide time: (22:46)

And similarly, in case you try to choose other kernels: in this first graphic, I'm using the Epanechnikov kernel, and in this one I'm using a triangular kernel, and so on. So, you can see here, the structure is more or less similar, but they are not exactly the same. So once again, I will show you all these graphics on the R console to make you more confident, and then I will move to another graph. So you can see here, first I'm creating the plot with the Gaussian kernel, then the density plot based on the rectangular kernel, which was reproduced in the slides; similarly, if I take the Epanechnikov kernel, then it comes out like this, and if I take a triangular kernel, then again this curve changes, you can see here. Right. So, these kernel density plots will help you in getting smooth density plots, and they convey more or less the same information as histograms.

Refer slide time: (24:06)

Now, we come to the next topic, that is stem and leaf plots. These are once again another way of representing the data. In this case, the absolute frequencies in different classes are represented in a different way: for example, we represented the absolute frequency using bar diagrams in the case of discrete data and a histogram in the case of continuous data. Right. So, in this case we have a data set, and the graphic is going to be a sort of combination of graph and numbers, graph and text; that is why this is also called a 'Textual Graph'. It means it uses text as well as a graph. So, stem and leaf plots show the absolute frequency in different classes, just like the frequency table or the histogram; they present that same information. The only difference is that this is based on a quantitative variable and these are textual graphs, and the presentation in this graphic organizes the data according to their most significant numeric digits. This type of graphic is actually more suitable for a small data set. So, I would advise you, in case of large data sets, to go for a histogram or, say, a density plot.

Refer slide time: (25:44)

And this stem and leaf plot is actually a sort of tabular presentation, where each data value is going to be split into two parts: one is called the 'Stem' and the other is called the 'Leaf'. Usually, in a stem and leaf plot, the stem represents the first digit or first digits of the data, and the leaf usually represents the last digit of the data. For example, in case I have a data value 56, then 56 is going to be split into two parts, five and six: this five is called the stem and this six is called the leaf. Similarly, in case I say that I have a number whose stem is two and leaf is eight, this means that the number is 28. So, this is how we interpret the stem and the leaf.

Refer slide time: (26:45)

In order to create a stem-and-leaf plot, what do we have to do? First, we need to separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is the final digit. The stem may have as many digits as needed, but each leaf contains only a single digit; this is the key point. Then we write down the values of the stems in a vertical column in such a way that the smallest value is on the top, we draw a vertical line, and after this, we write the values of the leaves in each row, to the right of the stem. This is again done in increasing order.

Refer slide time: (27:39)

So, first let me give you the R command, and then I will show you how the stem and leaf plot looks, which will make you understand better. In R, we have a command stem, s t e m; this command produces the stem and leaf plot of the values in a data vector x, so we need to write stem and, inside the argument, x. Then the size of the stem, that means how far it has to go, is controlled by a parameter which we call 'scale'. In case I choose two values of scale, scale equal to 1 and scale equal to 2, the interpretation will be that choosing the value 2 gives me a stem and leaf plot which is roughly twice as long as the one with the default value, scale equal to 1. So, when we want to construct the stem and leaf plot, the R command is stem; inside the argument I have to give the data, and then I can give the scale value, which is a number controlling the length of the plot. Similarly, there is another parameter, width, which controls the width of the plot. So, scale controls the length of the plot and width controls the width of the plot, and the default value of width is usually taken as 80.

Refer slide time: (29:18)

So, let me take an example over here and then we try to understand it. Suppose there are 15 lots of an item, and the numbers of defective items in those lots are found to be as follows: there are 46 defective pieces, 24 defective pieces, 53 defective pieces, and so on. That is a small data set, consisting of only 15 values, and we would like to construct the stem-and-leaf plot for this data set. So, I put all this data inside a variable defective; this will look like this, and then I will execute the R command over here.
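As a minimal sketch of this step: the complete 15-value data set is not reproduced in this text, so the vector below only collects the values quoted in this lecture and should be treated as a partial, illustrative reconstruction.

defective <- c(18, 24, 34, 35, 38, 44, 46, 48, 49, 53, 54, 56)  # values quoted in the lecture
stem(defective)             # default: scale = 1
stem(defective, scale = 2)  # roughly twice as long, finer stems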

Refer slide time: (30:03)

So, as soon as I say stem(defective), the outcome will look like this, which I will show you on the R console also. You can see here that this part is the stem and this part is the leaf, and these are the vertical lines which I was discussing. You can think of these vertical lines as a partition, so that the stem and the leaf are separated. So, the first question comes: what is it trying to convey? The interpretation of this graphic is easy to understand in case I add the scale parameter. So, first I will show you here

Refer slide time: (31:04)

the outcome, the screenshot of the outcome; first I will give you the explanation, and then I will show you how to produce it on the R console.

Refer slide time: (31:13)

So, now, first I'm discussing the role of the parameter scale. You can see I have created two stem plots here, number one and number two. In number one I am using scale 2, and in number two I'm using the scale value one that I used earlier. You can see that there is a difference: in one of them there are only one, two, three, four rows, whereas in the other there are one, two, three, four, five, six, seven. Then, if you come to the leaf part corresponding to stem two, you can see there are four values, 4 4 5 8, whereas in the case of scale equal to two there is only one value there: the three is not present here, the five is not present here, and similarly the one is not present over here. So, when I increase the value of the scale from 1 to 2, this gives me a clearer stem-and-leaf plot. So, now let us see, in figure number one, what the interpretation is. First, look at the data; I will change the color of my pen so that you can see it clearly. In the data set, there is a value one eight. Now look at this entry: it is indicating one as the stem and eight as the leaf, and this is the same value, 18, which is given here. Similarly, if you look at the second entry, 2 | 4, what is this indicating? This is also indicating a value inside our data, and if you look at the data, this is the value 24; this value is indicated here. These are the cases where the stem has a single leaf only. Now I take the third row. You can see this is indicating three as its stem; the first leaf is four, so the first value is 34. The next leaf is five, so the stem is three and the leaf is five, and the value is thirty-five; the third leaf is eight, so the stem is three and the leaf is eight, and the number is 38. And you can see that these numbers are present in your data: 34 is present in the data here, 35 is present here, and similarly 38 is present here in this data. Similarly, if you interpret this fourth row, it is representing four values, 44, 46, 48 and 49, and you can see that these values are also present in this data: 44, 46, 48 and 49. And similarly, for the other rows you can see 53, 54 and 56, and these are again given in the data set. So you can see that the stem and leaf plot is trying to represent a sort of histogram in terms of the frequencies of the given data set. How?

Refer slide time: (35:29)

Let me show you here: this is the screenshot of the two plots, with scale equal to 1 and scale equal to 2.

Refer slide time: (35:36)

So, in order to compare, I am making the histogram here; the histogram will look like this, and I have rotated it so that it becomes comparable with the stem-and-leaf plot. You can see, in this data set, this is 1 | 8. Now, if you look at the histogram, this first bar has height 1, indicating that the number of values between 10 and 20 is only 1. The same thing is happening here also: we read 1 | 8 as 18, and the next value is 24, so the class intervals are 10 to 20, 20 to 30, and so on; the interval 30 to 40 comes because of this stem 3. So, you can see that the number of values in the first interval is only one, namely 18, and this is indicated on the y-axis as 1. In the second case also, there is only one value, which is 24, so again this is shown in the second bar, where again there is only one value. Similarly, if I take the third one, the third interval is 30 to 40 and there are three values, 34, 35 and 38, so the frequency f3 becomes 3, and you can see that this interval is represented in this bar, where the frequency is 3. Similarly, in the fourth case, the frequency f4 is 4, and this value is indicated over here; you can see that this frequency is 4.

So now, you can see that the stem-and-leaf plot and the histogram are more or less comparable; the only difference is that the stem and leaf plot also gives an idea about the individual values, whereas inside the histogram the individual values are lost. So, I'm not going to weigh the advantages and disadvantages of these two graphics, but I will say that depending on the need, you create a graphic according to your need. Now, I will show you this stem plot on the R console. First, let me copy this data, and then you can see here, this is the defective data. Right. Then I create the stem plot; you can see this is the same stem plot that was presented in the slides. Then I play with the scale part: this is what you obtained here, with scale equal to one, so you can see these two are the same thing. And if you add scale equal to 2, you get this, which is the same thing that we produced on the slides.

So, I will now stop here, and I will also close our discussion on the different types of graphics in one dimension. There are some graphics available for two dimensions, when we have two variables, but I will discuss those when we deal with data on two variables; at this moment I am first taking up all the issues when we have data on only one variable. I am not saying that these are the only possible graphics available. There are many more, and day by day new types of graphics are coming into the picture because of the use of software. But I'm sure that this background will surely help you, and it will take away the fear that it is difficult to create graphics in R software in comparison to a software where you simply have to do some clicks. What you have to do is just practice, and if you practice, I'm sure you will be very comfortable in making more beautiful and more interactive graphics. So, practice it and we will see you in the next lecture with a new topic. Till then, Good Bye.

Lecture 14
Central Tendency of Data – Arithmetic Mean
Welcome to the next lecture on the course Descriptive Statistics with R Software. Up to now, in all the earlier lectures, we handled how to create different types of graphics, and they were part of the graphical tools of descriptive statistics. From now onwards, we are going to handle the analytical tools in one dimension, that means when we have only one variable. When we handle more than one variable, then again I will introduce the graphical tools and analytical tools for two-dimensional data. So, the first step after we get the data is that we would like to get some quantified information that is hidden inside the data. As we had discussed, the data is very silent; data cannot speak, data cannot tell you 'well, I have this value, I have this information'. Graphical tools will give you a graphical view, visual information; from that you have to use your knowledge, your common sense, your statistical knowledge, your information from the experiment, and you need to combine them to get a clear-cut conclusion. Now, we would like to quantify that information. So, when we talk of the information contained inside the data, there can be enormous information contained, but our question is: how should we take it out? We had discussed that we would try to extract the information on different aspects, like central tendency, variation, symmetry, etc. So, we are going to start with a new topic in which I'm going to discuss the central tendency of the data, and then I will discuss different types of tools which are used to extract information on the central tendency of the data. In this lecture, I am going to discuss the arithmetic mean.

Refer Slide Time :( 2: 58)

So now, whenever we try to conduct an experiment, there will be several aspects, and we try to collect the data on those aspects. Finally, the data set may contain several variables, and every variable may have many observations. Our basic objective is that we want to know the information contained inside the data, which is not directly visible, so we are trying to develop tools which can help us in digging out the information from every observation. Now, the question is: what would we like to have? Suppose I have a hundred data points and every observation tells me something; the alternative is that, instead of having a hundred pieces of information, I have summary information, which may provide more insight to a common person and will be more useful. So, here we are looking forward to understanding some summary measures which can give us the information hidden inside the data on different aspects.

Refer Slide Time :( 4: 14)

Now, let me take a simple example to explain my view. Suppose I want to decide my clothing for a trip, and I have two choices: Lucknow, in Uttar Pradesh, which is quite hot during the month of May, the summer season; and the other city, Srinagar in Jammu and Kashmir, which remains cool during the month of May. Now, we have collected the data on the day temperature of last year, on, say, five days, and this data comes out to be 35, 37, 36, 40 and 38 degrees centigrade for Lucknow, and 20, 18, 17, 22 and 23 degrees centigrade for Srinagar. Now, what information can I get from this data? This data can be large; I have taken only 5 values for the sake of understanding, but these values may be hundreds, thousands, or even millions. From this data, I would like to know, for example, what type of clothing I should take there: in case it is cold, I would take some woollen clothing, and if it is hot, I will take some simple cotton clothing. Right. So, now, how to get this information? By looking at this data, yes, it is telling me that the temperature here is quite high and the temperature there is usually low. But we would like to have the information in a summary. Now, what we observe is that it is the human tendency to compile the information in terms of averages. For example, in a class, some students might have got 45%, somebody has got 55%, somebody has got 65%, and there will be more marks, but I am more interested in the average performance of the class.

Refer Slide Time :( 6: 35)

So, if I say that the average marks in the subject in a class are 60%, then it makes more sense. Similarly, in case I am trying to buy a tablet of a medicine, and suppose the shopkeeper or the doctor tells me that this tablet can control the body temperature and bring the temperature down for six hours, what does this mean? The doctor is telling the average value and we understand it very easily: this six hours cannot always be exactly six hours; it may be five hours, it may be seven hours, it may be five point five hours, or it may be six point five hours. But this data has been collected, and the doctor has found a sort of arithmetic mean or an average, and he is convinced that if this medicine is given to a person having a fever, then this tablet can control the body temperature for, on an average, six hours. This is what we mean. So, in statistics, this concept is referred to as the average or the central tendency of the data. Central tendency of the data means, for example, if I have data which I plot like this, then I would try to see what is the point around which the data is concentrated; for example, here you can see this is giving us a central value around which the data is concentrated. In statistics, we have different types of measures to study this central tendency of the data,

Refer Slide Time :( 8: 14)

for example, arithmetic mean, geometric mean, harmonic mean, median, quantiles, mode, etc.

Refer Slide Time :( 8: 19)

So, we are going to discuss these measures one by one, and I will also show you how to compute them in the R software. First, let me explain the arithmetic mean for ungrouped data. Ungrouped data means we have a variable X and we have collected the data on X, say $x_1, x_2, \ldots, x_n$. For example, if X is, say, height, then $x_1$ is going to be the height of the first person, say 152 centimetres, $x_2$ is going to be the height of the second person, say 165 centimetres, and so on. So we have a total of n observations, denoted by $x_1, x_2, \ldots, x_n$, and these x's are small letters. Okay. Now, the arithmetic mean of these observations is defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
This means I first have to sum $x_1 + x_2 + \cdots + x_n$, and then I have to divide this sum by the number of observations n; this is the meaning of this symbol. In order to compute it in R, the command is mean, and I write mean and, inside the argument, x, where my data is contained in the vector x created using the c command.
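As a minimal sketch with made-up heights:

x <- c(152, 165, 158)   # hypothetical heights in centimetres
mean(x)                 # (152 + 165 + 158)/3 = 158.3333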

Refer Slide Time :( 9: 58)

So, let us take an example and try to see. Well, in case you want to know more about this mean: for example, I will be discussing another aspect, how to handle the missing values, but there can be a trimmed mean also, and there are some more parameters with this command; I would request you to go to the help of mean and look into the different parameters. Now, coming back to my example: this is the same example that we had considered earlier, that there are 20 participants who participated in a race, and their times in seconds have been recorded here, like 32 seconds, 35 seconds, and so on. This data has been captured inside a variable here, say, time.

Refer Slide Time :( 10: 46)

Now, in case I want to find the mean of this variable, I simply have to type mean and, inside the argument, the variable name time, and I get the value 56. You can see here, this is the screenshot. So, I will show you how it works on the R console.

Refer Slide Time :( 11: 10)

So, let me store the data over here; you can see, this is my variable time, and when I find the mean of time, the variable in which the data is contained, this comes out to be 56.

Refer Slide Time :( 11: 24)

Now, I try to address one more aspect. If you remember, when we started our course, in some initial lecture I had given you the idea that there can be many situations where some data might be missing, and this data is represented as capital N and capital A, that is NA, which is a reserved value. So, in case the data is missing, how would you compute the mean and other quantities? The way I am going to explain how to compute the mean when data is missing, the same concept will be used in all other cases: when you want to find the variance, the standard deviation or the median, the same concept and the same option will be used. So, here I will explain it in detail, and after that I will use it quickly. When some data is missing, the mean command is still used to find the arithmetic mean of the data in the x vector, but another parameter is added: na.rm = TRUE, with TRUE in capital T, R, U, E. This is telling R: please compute the mean after removing the NA values, that is, the missing values. This is what it is trying to say. So, in order to understand it,

Refer Slide Time :( 13: 12)


let me take the same example, and you can see here that I have replaced the first two values by NA. I have underlined them so that you can easily see it, and so in this case my data vector will contain the first two values as NA and NA. In order to store these values, I am using a different name, and the name I am giving is time.na: the '.na' here is not a reserved word, but is simply indicating that this is the same data on the time variable that we used earlier, now with NAs. Right.

Refer Slide Time :( 13: 54)

So, in case I try to do it here, you can see that if you find the mean only of the time.na vector, where some data is missing, this will not give you any numerical value; it will simply give you NA. Why? Because this is going to compute the sum NA plus NA plus the numerical values, 56 plus whatever the values are, divided by 20. Right. That is why this value comes out to be NA: this NA, plus this NA, plus this value, plus this value, and so on, divided by 20, definitely cannot be computed, so this gives you NA. Whereas, in case you add the option na.rm = TRUE, what is it doing? It finds the sum of all the values after removing the NAs, divided by the number of observations, which now becomes 20 minus 2: there are 20 observations and 2 observations are missing, so this is going to be 18, and the sum is divided by 18. Now you can see that this gives me the value 58.5, whereas, if you recall, the mean of time was 56. This has changed because it has been computed after removing the missing values. Now, in case I set na.rm to FALSE, like this, then you will see that this again gives me the outcome NA, because FALSE is the default: when you use mean(time.na), na.rm is always taking FALSE as the default. So, whenever we try to find the value of the mean, R assumes by default that all the values are available. In case they are not available, you need to inform the R software: yes, there are some values which are not available, please compute the mean after removing those missing values. Okay.
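A minimal sketch of this behaviour, using a shortened hypothetical vector in place of the full 20-value data:

time.na <- c(NA, NA, 56, 70, 48)   # hypothetical data, first two values missing
mean(time.na)                      # NA: the sum involves missing values
mean(time.na, na.rm = TRUE)        # 58: mean of the three available values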

Refer Slide Time :( 16: 32)

Now, here is the screenshot; I will show you the same on the R console.

Refer Slide Time :( 16: 35)

But before that, let me show you that in the case which I just discussed, the mean of time, which contains 20 values, is computed as the sum of all the 20 values divided by 20, and its value comes out to be 56; whereas when you use the data with missing values, it is actually based on only 18 values, and this value comes out to be 58.5. Now, let me come to the R console and show you what is really happening. So, I copy this data,

Refer Slide Time :( 17: 14)

data here, so you can see the time.na vector contains these values. Now, if I find the mean of time.na, this comes out to be NA; but in case I add one more parameter, na.rm = TRUE, then you can see this comes out to be 58.5, the same outcome that we discussed in the slides. Okay.

Refer Slide Time :( 17: 50)

So, let me come back to our slides and take up another aspect. Up to now, we have discussed the arithmetic mean for ungrouped data. Now, I'm going to discuss how to compute the arithmetic mean in the case of grouped data. You remember that in the case of grouped data, first you need to construct a frequency table; so now we will learn how to compute the arithmetic mean from the frequency table.

Refer Slide Time :( 18: 22)

So, you may recall that while constructing the frequency table, we constructed class intervals on the basis of the given set of data, dividing it into a suitable number of intervals of suitable widths. The interval boundaries I am denoting as $e_1$ to $e_2$, $e_2$ to $e_3$, and so on. The first value $e_1$ in the first interval, and $e_2$ in the second, are called the lower limits, and similarly $e_2$ in the first interval and $e_3$ in the second interval are the upper limits of the intervals, and so on. In this way, I have created k such classes. Then I find the midpoint of each interval: the midpoint of the first interval I denote by $m_1$, which is simply $(e_1 + e_2)/2$; similarly, the midpoint of the second interval is denoted by $m_2$, which is again the lower limit plus the upper limit divided by 2, and so on. Similarly, I find the mid values of all the intervals, and based on the given data set I find the absolute frequencies. So, there are $n_1$ values in the first interval, $n_2$ values in the second interval, and so on, $n_k$ values in the kth interval. We also know that the sum of all these, $n_1 + n_2 + \cdots + n_k$, we denote by n, and the relative frequencies are obtained as $f_1 = n_1/n$, $f_2 = n_2/n$, and so on, $f_k = n_k/n$, where n is the total frequency. Obviously, in case you sum all the relative frequencies, this will come out to be the sum of all the $n_i$'s divided by n; the sum of all the $n_i$ is n, so this becomes n upon n, which is equal to 1, as written here. Right.

Refer Slide Time :( 20: 31)

Now, I will define the arithmetic mean for this grouped data. It is defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{K} n_i m_i,$$
where $m_i$ is the midpoint of the ith interval. Now, in case you try to simplify it, $n_i/n$ is the relative frequency $f_i$, so the alternative is that I can simply find it as
$$\bar{x} = \sum_{i=1}^{K} f_i m_i.$$
Based on this, there is another version of this type of mean in the case of grouped data, which is called the 'Weighted Arithmetic Mean', and the weighted arithmetic mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{K} w_i m_i}{\sum_{i=1}^{K} w_i},$$
where the $w_i$'s are the weights assigned to the values. Right. So, this is a more generalized function which is useful in many, many statistical data sets.

Refer Slide Time :( 21: 38)

Now, in case you want to find the arithmetic mean of the grouped data set: we know this is going to be simply $\sum_{i=1}^{K} f_i m_i$. If you remember, when we were discussing different types of mathematical operations using the R software, we had discussed that this type of quantity can be obtained from two data vectors, f and m. Right. Here f is going to be the data, say $f_1, f_2, \ldots, f_K$, and m is going to be the data on the mid values, $m_1, m_2, \ldots, m_K$. In order to do it, R already has a built-in function, which is called weighted.mean, w e i g h t e d dot m e a n, and inside the arguments I have to give the two vectors m and f, where m contains all the midpoints and f contains all the frequencies. Obviously, when you want to compute the arithmetic mean of this grouped data, the first objective will be to find the frequencies, and in order to find the frequencies, you may recall that we had used the command table; the table will have two types of components, the first being the intervals and the second the frequencies. So, we need to extract the frequencies from the outcome of the table command. Now, you have to be watchful here: in this function, weighted.mean, I have used the symbol f to indicate the absolute frequencies, whereas, if you observe this slide, in the formula I used f to indicate the relative frequencies. So, in this example, in order to compute the weighted mean, I will be using the indicator f to indicate the data vector of absolute frequencies and not the data vector of relative frequencies.

Refer Slide Time :( 23: 54)

So, let me take an example and show you how things work. Once again, this is the same example that I discussed earlier: there are 20 participants who participated in a race and their time is recorded in seconds, and this data has been recorded under the variable time. Right.

Refer Slide Time :( 24: 17)

And now, we had converted this data earlier into the form of a frequency table, and you can see that I have created the class intervals like this: 31 to 40, 41 to 50, 51 to 60, and so on. There are altogether six class intervals, so K here is equal to 6. Then I have found the midpoints, the first being (31 + 40)/2, and so on; these are the values of m1, m2, up to m6. Their absolute frequencies have been obtained as 5 for the first class, 3 for the second class, 3 for the third class, and so on; these are the values of f1, f2, f3, up to f6, and this is the sum of all the frequencies, which is equal to n. Right?

Refer Slide Time :( 25: 09)

So now, just to give you a brief recall of how we had found the frequency distribution, I have taken a slide from the earlier lecture. You may recall that first we had defined a sequence between 30 and 90, at an interval of 10, by using the command seq, and we had stored this data inside a variable breaks. So, breaks had the values 30, 40, 50, 60, 70, 80, 90. Then, using the data time and this data vector breaks, and taking the right-hand side of each interval to be open, we had used the R command cut, c u t, to convert the data into factors, and this data was stored in time.cut. The outcome was like this; all this we had discussed earlier. Right.

Refer Slide Time :( 26: 03)

And based on this time.cut data, we had found the frequency table of the data in the time vector by using table with the argument time.cut, and this was the frequency distribution that we had obtained. Right? Now, what do we have to do? We need to find the weighted arithmetic mean, that is, the arithmetic mean for this grouped data. This can be done in the following way. The first step is that we need to extract the frequencies from this frequency table: this is the frequency table and we want to extract only the data vector 5, 3, 3, 5, 2, 2, because these are the values of f1, f2, f3, f4, f5 and f6. In order to do it, we operate a command, as.numeric, on this frequency table data. That is a s dot n u m e r i c, all in small letters, and inside the argument I have to give the data vector, which is going to be the outcome of the frequency table. In this case, your frequency table is given by table with the argument time.cut, so I operate this command over it, as.numeric of table of time.cut, and this gives me this outcome. You can see that this data is the same data that you obtained: this 5 is the same as this 5, this 3 is the same as this 3, this 3 is the same as this 3, this 5 is the same 5, this 2 is the same 2, and this 2 is the same 2. Right?
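Putting these steps together, a minimal sketch of the extraction (assuming the time vector of this example is already in the workspace):

breaks <- seq(30, 90, by = 10)                 # class boundaries
time.cut <- cut(time, breaks, right = FALSE)   # right-open intervals, as described earlier
f <- as.numeric(table(time.cut))               # absolute frequencies: 5 3 3 5 2 2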

Refer Slide Time :( 28: 16)

So now, once we have obtained this vector f, which is the vector of frequencies, we can similarly define the vector m of the midpoints, and then I simply have to use the command weighted.mean with m and f, and this will come out to be 56. Right. Okay. So, let me now operate this on the R console to show you.

Refer Slide Time :( 28: 45)

So now, if you see, we had already entered the data on time, which is here. Now, I first need to create the frequency table, so I will simply copy and paste the commands that we used earlier. This is the breaks vector; then I execute the command to get the data time.cut, and then I find the frequency table using time.cut. You can see here, this is the same data set. Now, I extract the data from this table using the command as.numeric; you can see this data is the same: this line which I am highlighting and this line which I'm highlighting are the same. Right. Now, I need to define the vector of midpoints, this is m, and once you use the function weighted.mean, this weighted mean comes out to be 56. So, this is how you can obtain the weighted mean in the case of grouped data.

Refer Slide Time :( 30: 16)

And now, here is the screenshot of the same operation that we have just done. Now, I would like to stop here in this lecture, and if you see what we have done: we have simply learned the concept of the arithmetic mean and we have learnt how to execute it in the R software, both for grouped data and for ungrouped data. That is pretty simple, but it is also important to learn what different other parameters can be used in the command mean, which can be looked up through the help menu. So, I would request all of you to take a small data set, say only a few numbers, compute the arithmetic mean by hand, manually, and then compute it using the R software. Once you see that both the things are matching, it will give you more confidence that yes, the software is doing the same thing we wanted to do. So, you practice and we will see you in the next lecture. Till then, Goodbye.

Lecture – 15
Central Tendency of Data - Median

Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture, we had discussed the idea of central tendency and we had planned to discuss several measures of central tendency of the data. In the last lecture, we explained the concept of the arithmetic mean. Now, in this lecture, I am going to consider the aspect of partitioning values, and under that topic I will consider the median. Okay. So, let us first try to understand what these partitioning values are.

Refer slide time :( 1:02)

If you try to see, if I create the frequency distribution, then on the x-axis we have the values, say the class intervals or the x values, and on the y-axis we have the frequency values, and suppose we have got a frequency distribution like this one. Now, you can see that the entire frequency is covered under this curve: these are the frequency values and these are the different values of the frequency on the curve. Now, we would like to know how these values are going to be partitioned. For example, suppose I want to divide this into four equal parts, so I mark here the first, the second, the third and the fourth. You can see, from here to here, this is the area containing nearly 25% of the total frequency. Similarly, from here to here, this is an area covering another 25% of the frequency; and similarly, this is also 25 percent of the total frequency and this is also 25 percent of the total frequency. So, these values are called partitioning values; since they divide the total frequency into four equal parts, I can call them quartiles. The first value can be called the 'First Quartile', the second value the 'Second Quartile', and so on. Essentially, what is happening is that we are dividing the entire frequency into four parts. Similarly, in case you want to divide it into more parts: for example, if this is the frequency curve, then I can divide it into ten parts, one, two, three, ..., ten, and so on. Similarly, I can partition it in other ways also; for example, in case I want to partition it into, say, a hundred equal parts, 1, 2, 3, ..., there will be a hundred such partitions. It is also not necessary that these partitions have to be of the same length; they can be of different lengths. So, by looking at the partitioning values, we can have an idea of how the frequency is distributed over the entire range of the frequency distribution, and this will also give us an idea of how the frequency is concentrated in different regions of the frequency curve. We will take up all these partitioning values one by one, try to understand them, and see how to compute them in the R software. Now, I can say very simply that the frequency distribution is partitioned to have an idea about the concentration of the values over the entire range of the frequency distribution. And as I said, there are several measures: median, quartiles, deciles, percentiles. So, let us start our discussion with the median.

Refer slide time :( 5:05)

Now, going with that definition, suppose I plot the frequency curve like this, and suppose I would like to divide the entire frequency into two equal parts, such that 50% of the frequency is on the left side of this red vertical line and 50% of the frequency is on the right-hand side of it. The value corresponding to this line is called the 'Median'. So, the median is the value which divides the observations into two equal parts, such that at least 50% of the values are greater than or equal to the median and at least 50% of the values are less than or equal to the median. Median is a measure which divides the total frequency into two parts. So, if I say that the median of my frequency distribution is, suppose, 20, then I can say that 50% of the values are less than 20 and 50% of the values are more than 20. Okay. Now, if you compare the median with the arithmetic mean, then in all those situations where we have extreme observations, that means some observations taking very, very high values, the median is preferred. Why? Suppose I take two values, two and four, and find their arithmetic mean: the arithmetic mean is (2 + 4)/2 = 3. But if I take 2, 4 and 100, then the arithmetic mean is (2 + 4 + 100)/3 = 106/3, which is approximately 35.3. You can see there is a huge difference between three and thirty-five point three, and this difference is coming because a new value, hundred, has been added, and this hundred is very different from two and four. So, the median is a better average than the arithmetic mean in case we have extreme observations. Right.

Refer slide time :( 7:47)

Now, I will give the definition and show how to compute the median in two cases: one is ungrouped data and the other is grouped data. First, we try to understand the median and its computation when the data is ungrouped. Let us say we have observations $x_1, x_2, \ldots, x_n$; there are n values and they are ungrouped, or you can say they are the values of some discrete variable. Now, what do I do? I order the observations and present the ordered values as $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where the subscript is written inside brackets. What does this mean? It means that $x_{(1)}$ is the smallest value, the minimum among $x_1, x_2, \ldots, x_n$, and $x_{(n)}$ is the highest or maximum value among $x_1, x_2, \ldots, x_n$. For example, suppose I take 4 observations: $x_1 = 20$, $x_2 = 10$, $x_3 = 60$ and $x_4 = 5$. These are the four values. I find the minimum value among twenty, ten, sixty and five; this is five. So, the first ordered value, which I denote as $x_{(1)}$, becomes five. After that, I once again find the minimum among the remaining values, which are twenty, ten and sixty; this gives me the second ordered value, $x_{(2)} = 10$. Similarly, $x_{(3)}$ is the minimum among the remaining values, 20 and 60, so it is equal to 20, and obviously the largest value, $x_{(4)}$, is 60. So, you can see the difference between the plain observations and the ordered observations, and what the relationship is: the first ordered value $x_{(1)}$ is the same as the fourth unordered value; the second ordered value is the same as the second unordered value; the third ordered value, 20, is the same as $x_1$, the first unordered value; and the fourth ordered value, 60, is the same as the third unordered value. So, you can see how the ordered and unordered values are interrelated. Okay. So now, the first step in finding the median of ungrouped data is to order it. Once you order it, there are two situations: the number of observations can be odd or even. In case the number of observations is odd, the median is the $\left(\frac{n+1}{2}\right)$th ordered value, $\bar{x}_{med} = x_{\left(\frac{n+1}{2}\right)}$; and in case the number of observations is even, the median is the average of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2}+1\right)$th ordered observations, $\bar{x}_{med} = \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right)$. This gives you the definition of the median: you simply have to pick the appropriate ordered value according to this rule, and that will give you the median. Now, we consider the median for grouped data. We know that whenever we have grouped data, or data on a continuous variable, the first step is that we create the frequency table, and in the frequency table we will have classes.

Refer slide time :( 12:58)

So, now we start our discussion by assuming that we had the data and from the data we have created the frequency table. This frequency table has classes, and there are K classes, denoted as $A_1, A_2, \ldots, A_K$. The entire frequency is distributed among the K classes, and we assume that the absolute frequency of the ith class is $n_i$; so there are $n_i$ observations in the ith class $A_i$. From this absolute frequency, I can compute the relative frequency, and we denote the relative frequency $f_i = n_i/n$, where n is the total frequency. One thing I would like to mention, and I would like to draw your attention to it: please notice the definition and symbol of $f_i$. In the case of the median, I denote by $f_i$ the relative frequency, but later on, when we consider other types of measures, there is a possibility that I may define $f_i$ to be the absolute frequency, so be watchful. Now, after this, what do we have to do? I have got the classes $A_1, A_2, \ldots, A_K$, and we know that, by definition, the median is the value where the total frequency is divided into two equal parts; so there will be a class where that half of the frequency is lying. I try to find the class where half of the frequency lies, and let this class be denoted as $A_m$: the interval or class which includes the median. So, I can define this median class as the mth class for which $\sum_{i=1}^{m-1} f_i < \frac{1}{2}$ and $\sum_{i=1}^{m} f_i \geq \frac{1}{2}$. This is going to be my median class.

Refer slide time :( 15:40)

Now, the expression for finding the median in the case of grouped data is given by
$$\bar{x}_{med} = e_{m-1} + \frac{\frac{1}{2} - \sum_{i=1}^{m-1} f_i}{f_m}\, d_m.$$
Here, you can see there is a quantity $e_{m-1}$, which denotes the lower limit of the median class. Similarly, there is a quantity $d_m$, which denotes the width of the median class; width means upper limit minus lower limit, so this is the class width. Then there is $f_m$, which is the relative frequency of the median class. Based on this, we compute the median of the given grouped data, and we denote it as $\bar{x}_{med}$, 'med' being the short form of median. Now, let us take an example and see how to do it. Here, I would like to inform you that when we compute the median in the R software, then, at least to the best of my knowledge, there is only one command available inside the base package to compute the median. So, the R software has no separate commands for computing the median for grouped and ungrouped data. Through this example, I will show you how these different values, like the relative frequency of the median class and so on, are chosen, and then I will show you how the median is computed in the R software. But I will not be able to show you how to specifically compute the median for grouped data; one can write a small program or a small function to compute such a thing, but I will not be handling it here.
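For completeness, here is a minimal sketch of such a function, following the grouped-data formula above; grouped.median and its arguments lower, upper and n are my own names, not built-in R objects.

grouped.median <- function(lower, upper, n) {  # hypothetical helper, not part of base R
  f  <- n / sum(n)                      # relative frequencies f_i
  cf <- cumsum(f)                       # cumulative relative frequencies
  m  <- which(cf >= 0.5)[1]             # index of the median class
  below <- if (m > 1) cf[m - 1] else 0  # sum of f_i below the median class
  lower[m] + (0.5 - below) / f[m] * (upper[m] - lower[m])
}

Plugging in the frequencies of the example that follows (median class with lower limit 25, width 5, relative frequency 18/31, and cumulative frequency 12/31 below it), this reproduces the value 25.97 obtained on the slides.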

Refer slide time :( 17:57)

So, let us consider this example, in which data is collected on the time taken by a customer to arrive at a shop inside a shopping mall, and this time is recorded on different days of the month. Assuming that there are 31 days in the month, this data has been collected: for example, on day one the customer takes 30 minutes, on day two the customer takes 31 minutes, and so on. Now, I will find the median from this data, first considering it as ungrouped data; then I will group it and once again find the median. Okay.

Refer slide time :( 18:43)

So, now in this case, when I consider this data as ungrouped data: the number of observations is 31, so n = 31, and the value of (n + 1)/2 is equal to 16. Now, what we have done is to order the data (I'm not showing the ordering here; you can do it), and then I find the ((n + 1)/2)th ordered value, which is the 16th value in the ordered data. This value comes out to be 26, so 26 minutes is the median time. Now, in case I convert the same data into an even number of observations, I can drop the last observation and consider only 30 observations; in that case the number of observations becomes 30, and then n/2 is 15 and (n/2) + 1 is 16. According to the definition of the median, the median is going to be the mean of the (n/2)th ordered value and the ((n/2) + 1)th ordered value, so essentially the arithmetic mean of the fifteenth and sixteenth ordered values. From the data, we find that these two ordered values are 26 and 27, so the median comes out to be twenty-six point five. This is how we compute the median in the case of ungrouped data with an even number of observations.
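A small sketch of this check, assuming the minutes data vector (defined a few slides later) is already available:

n <- length(minutes)          # 31 observations
sort(minutes)[(n + 1) / 2]    # odd n: the 16th ordered value, 26
x30 <- minutes[1:30]          # drop the last observation, leaving an even n
mean(sort(x30)[15:16])        # (26 + 27)/2 = 26.5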

Refer slide time :( 20:35)

And similarly, in case I consider this data as grouped data, then we create the frequency table. You can see, I have already created the frequency table here; these are my class intervals of width five units, like 20 to 25, 25 to 30, and so on. In the second column I have the absolute frequencies, and in the third column I have computed the relative frequencies of all the classes. Now, you can see the advantage of working with the relative frequencies: the total relative frequency, summed over the five classes (i going from one to five), is always going to be one. So, I simply need to find the class m for which the sum of the $f_i$ up to class m minus one is smaller than 0.5 and the sum up to class m is greater than or equal to half. Once I do this, I observe that it is the third class, for which the sum of the relative frequencies of class 1 and class 2 comes out to be 12 upon 31: since m is 3, m minus 1 is 2, so essentially I am finding $f_1 + f_2$, where $f_1$ is 0 and $f_2$ is 12 upon 31. This comes out to be smaller than 1/2. And if I find the sum of $f_1$, $f_2$ and $f_3$, this comes out to be 30 upon 31. How is this 30 coming into the picture? It comes from the absolute frequency of class one, which is zero, plus the absolute frequency of class two, which is 12, plus the absolute frequency of class three, which is 18; so this is essentially 12 plus 18, which is equal to 30, and 30 upon 31 is greater than half. So now, I can say my median class is the third class, and m is equal to 3.

Refer slide time :( 22:47)

Now, I find the lower limit of the median class, which is 25; the relative frequency of the median class, which is 18 upon 31; and the width of the median class interval, which is 30 minus 25, equal to 5. Once I substitute all these values into the expression and simplify, I get 25.97. So, you can see there is not much difference in the values of the median when we compute it from the grouped data or the ungrouped data. Right. For the ungrouped data, you may recall, this value was coming out to be 26, and for the grouped data it comes out to be 25.97. So, if your data is proper, then practically there is no difference whether you compute the median by this formula or by that formula, and possibly this is the reason that R has not implemented a separate grouped version; 25.97 is also close to 26, so for all practical purposes there is not much difference.

Refer slide time :( 24:18)

Now, I come to the aspect of the R software. Inside the R software, to compute the median, the command is median, m e d i a n, where x is my data vector. In median also there are several options, so I would once again request you to look into the help menu and see what different parameters can be given inside the arguments. But here I would certainly like to address how you would compute the median in case some data is missing, represented as NA. In that case, use the same command median, give the data vector, and use the option na.rm = TRUE; this will give you the value of the median.
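A minimal sketch with made-up values:

x <- c(20, 10, 60, 5)        # hypothetical data, even number of values
median(x)                    # (10 + 20)/2 = 15
x.na <- c(NA, x)
median(x.na)                 # NA, since a value is missing
median(x.na, na.rm = TRUE)   # 15, computed after removing the NA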

Refer slide time :( 25:12)

So, I now collect all the data on the minutes inside this data vector, minutes, and then I simply find out the median of minutes; this comes out to be 26. You can see here that this is matching with the value that you had obtained earlier. And this is the screenshot. I would also like to show you, with the same data, how you are going to handle the case when the data is missing. So, inside the same data set, I make the first two values unavailable and replace them by NA. Now, in this case, I store the data inside a new data vector, minutes.na. Well, I would like to address one thing here: in R there is an option to name variables using the dot sign, or say full stop sign. That is why I am writing minutes dot na; it is not a built-in function or a rule, it is simply denoting, just for the sake of convenience, that this is the same data of minutes, but now with missing values. So now, using this data set, I find out the median using the same command, median of minutes.na, and I am using here na.rm equal to TRUE. You can see here that this is the screenshot, and this value comes out to be again 26. So,

before I do something more, let me show you how to compute these things in the R software. So, first I store this data on the R console,

Refer slide time :( 26:54)

see here minutes. So you can see here, these are the values of minutes, and then I try to find out the median of minutes. You can see here, this comes out to be like this. And similarly, I consider here the missing values: I once again store the data inside a new vector, minutes.na. You can see here, this is the data minutes.na, and I try to find out the median of this data. Right. And if you try to see, I have not used the option na.rm equal to TRUE, so the result comes out to be NA; that is a very common mistake. So, now let me use here na.rm equal to TRUE. You can see here, this value is once again coming out to be 26. Now, I would like to stop here. I have given you a detailed overview of the median: what is the concept, how to compute it, and how to compute it in R. Please try to practice it: take some data, calculate the median manually, and then do the same thing with the software, and try to see what is the difference. Usually I expect that unless the data is extremely heterogeneous, this difference will be very, very small. And for all practical purposes, the values of the median that you compute from the R command, either for the grouped data or the ungrouped data, will not differ much; so, for all practical purposes, you can accept them. So you practice, and I will see you in the next lecture. Till then, goodbye.

Lecture - 16

Quantiles



Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we started a discussion on the partitioning values. We discussed the concept of median, and we learned how to compute it manually and then how to do it in the R software. Now I would like to extend this concept further. As we have seen, the median is the value in a frequency distribution which divides the total frequency into two equal parts. So now the question is this: why only two equal parts? There can be more; there can be four, there can be ten, there can be hundred also, and these partitions can be of equal or of unequal widths. So, let us try to understand this concept in this lecture. These partitioning values are generally called quantiles. So, quantiles are nothing but the values which divide the frequency distribution into different partitions.

Refer slide time :( 01:30)

So, in case I plot here the frequency distribution, like this one, we have understood that if I divide it into two parts, the dividing value is called the median. And if I say now I want to divide it into four equal parts, for example, this can be the first part, then the second part, then the third part, and then here the fourth part. So you can see here, this is the first partition, which I can denote here by q1; this is the second partition, which I can denote by q2; the third partition q3; and the fourth partition q4. And similarly, I can divide it into more equal parts, say ten equal parts: I partition it here 1, 2, 3, 4, and here 5, 6, 7, 8, 9, 10. So this will become here the first partition, the second partition, and so on; and similarly I can increase the number of partitions also. And these partitions can be of equal widths or say unequal widths. Right. So, in general, these partitioning values are called quantiles. Just like the median partitions the total frequency into two equal parts, quantiles partition the total frequency into some other proportions, and these proportions are decided by us.

Refer slide time :( 01:30)

So, for example, consider the 25 percent quantile. If you remember the definition of median, the median was the value which splits the data into two parts such that at least 50 percent of the values are less than or equal to that value and at least 50 percent of the values are greater than or equal to that value. Similarly, the 25 percent quantile splits the data into two parts such that at least 25 percent of the values are less than or equal to that quantile and at least 75 percent of the values are greater than or equal to that quantile. So, if I try to plot it here, it will look like this: this is here the quantile Q, this is the 25 percent part, and this on the right hand side is the 75 percent part. So this is here the quantile value. And similarly, if I define the 50 percent quantile, the 50 percent quantile will split the data into two parts such that at least 50 percent of the values are less than or equal to the quantile and at least 50 percent of the values are greater than or equal to the quantile. So once again, if I plot the frequency distribution here, the value of the quantile divides the total frequency into two parts, such that this part is 50 percent of the total frequency and this part with the vertical lines is also 50 percent of the total frequency. And this value which partitions the total frequency into two parts is called here a quantile; this is essentially the 50 percent quantile, and you may notice that the 50 percent quantile is nothing but your median. So, this is the

Refer slide time :( 05:08)

basic idea of the quantiles. So, now, just as we have defined the 25 percent quantile and the 50 percent quantile, I can similarly choose any value, say 3 percent, 5 percent, 11 percent, 18 percent, 25 percent, 90 percent and so on, whatever we want. So, in general, I can extend this to a general definition of quantiles, where we choose here a value α. The definition of the α·100 percent quantile means that I choose the value of α between 0 and 1. If I take the value of α to be, say, 0.20, then this becomes the 20 percent quantile; if I take α to be 0.30, then this becomes the 30 percent quantile, and so on. So, this α·100 percent quantile is the value which divides the data into two proportions, one consisting of α·100 percent and another partition containing the (1 - α)·100 percent. And this division has been made in such a way that at least α·100 percent of the values are smaller than or equal to the value of that quantile and at least (1 - α)·100 percent of the values are greater than or equal to that quantile. So, in general, if I try to plot it here, for example, I can see here that this region consists of α·100 percent of the area; this region with the dots, on the right hand side, is the (1 - α)·100 percent of the area; and these are the values of the frequencies. So this is a graph on the x and y axes like this: this here is the x axis and here the y axis. And this value which is dividing it here, this will be your quantile, or more specifically the α·100 percent quantile. So, now I can choose different values of α: the value of α can be a single value, the values of α can be a sequence at equal widths, or the values of α can be a sequence where the values are at unequal widths. So, first, let us try to understand the basic definition of this quantile; and in order to understand this definition, I suggest you please recall the discussion in the earlier lecture, where we discussed the concept of ordered data and unordered data, and based on that we defined the median for ungrouped data.

Refer slide time :( 08:16)

Now here, in this case also, let me consider the data given as x1, x2, ..., xn, and yes, this is ungrouped data. Right. And this data has been ordered. So, the value x(1) is denoting the minimum value among x1, x2, ..., xn, and x(n) is denoting the largest value, the maximum value in the data; similarly, x(2) denotes the second ordered value, that is, the second smallest value. And now, if you recall the definition of median for the two cases when the number of observations was odd or even, that definition is now extended in terms of nα. So, the definition of the α quantile is the following: first decide whether nα is an integer or not. If nα is not an integer, then I choose k to be the smallest integer greater than nα, and then, corresponding to this value of k, I find the kth ordered value. This kth ordered value is going to provide the α·100 percent quantile. Similarly, if nα is an integer, then the quantile is going to be the arithmetic mean of two

values: the (nα)th ordered value and the (nα + 1)th ordered value. Both these values can be obtained from the data; simply take the arithmetic mean of these two values, and that will give you the value of the α·100 percent quantile.
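This textbook definition can be written as a small R function, just as a minimal sketch; the name my.quantile is made up, and this is not the algorithm that R's built-in quantile() uses by default:

my.quantile <- function(x, alpha) {
  x <- sort(x)                             # order the data
  n <- length(x)
  if (n * alpha == floor(n * alpha)) {
    (x[n * alpha] + x[n * alpha + 1]) / 2  # n*alpha is an integer: average two ordered values
  } else {
    x[ceiling(n * alpha)]                  # smallest integer greater than n*alpha
  }
}
my.quantile(c(2, 4, 7, 9), 0.5)            # (4 + 7) / 2 = 5.5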

So, after understanding this definition, I can see that just by choosing different values of α, I can divide the total frequency into different numbers of groups. So, for example, in case I try to

Refer slide time :( 11:05)

choose the α to be here 0.25, 0.5, 0.75, and obviously the last value of α will be 1, because α always lies between 0 and 1. So what essentially am I trying to do? If I create here the frequency distribution like this, then I am trying to divide the entire frequency into four equal parts: α equal to 0.25 corresponding to 25 percent, the second section another 25 percent, the third section another 25 percent, and the last section another 25 percent. So these are the four partitions: this is the first partition, this is the second partition, this is the third partition, and this is the fourth partition. And correspondingly, I can write here the values on the x-axis: say this value to be q1, this value to be q2, this value to be q3, and whatever is the fourth, the final value, this is here q4. These values are called quartiles. So, the quartiles are the particular values of the quantiles when the entire frequency distribution is divided into four equal parts, and they are denoted by q1, q2, q3 and q4. So, this q1 will denote the first quartile, which has 25% of the

observations. And q2 is the second quartile, which has 50% of the observations. What does this mean? If you try to see here, my frequency distribution is like this: this is here the first quartile and this is here the second quartile. So now, this q2 is going to be the value which takes care of this entire data up to it. That is why the two parts of 25% and 25% are added, and they correspond to 50 percent of the data; that is why I am saying here that the second quartile has 50% of the observations, and this is the same as the median. Similarly, the third quartile is represented by q3, which has 75% of the observations, and then it goes without saying that the fourth quartile q4 will take care of all the hundred percent of the observations. So the rule is very simple: I am trying to divide the entire frequency into four equal parts, and those partitioning values are called quartiles; they are particular values of quantiles. Okay. Similarly, in case you want to divide the total frequency into ten equal parts: for example, here we have divided it into four equal parts, and now I am changing it to ten equal parts.

Refer slide time :( 14:25)

Then, in this case, these values are called deciles. So, deciles are the values which divide the entire frequency distribution into ten equal parts, and we denote them as D1, D2, ..., D10. This will look like this: if I have this frequency distribution, then I am dividing it into parts one, two, three, four, five, six, seven, eight, nine, ten. So the partition values on the x axis are here D1, here D2, here D3, and so on up to here D10. And every partition, this partition, this partition, each one, will take care of only ten percent of the frequency. So I can say here that this first partition takes care of only 10% of the frequency, and so I define D1 as the first decile, which has only 10% of the observations. Similarly, when I come to the second partition, which is here D2, this takes care of 10% plus 10% of the frequency; so I can see here that the second decile is the value of the quantile which has 20% of the observations, or 20% of the total frequency. And similarly, if you go for the third, fourth and so on: the third decile will take care of 30% of the values, the fourth will take care of 40% of the values, the fifth will take care of 50% of the observations, and that is going to be the same as the median; and similarly, the ninth decile will take care of 90% of the observations. So, when we divide the total frequency into ten equal parts, we call the quantiles deciles. Now, similarly, in case I divide the total frequency into hundred equal parts: suppose this is my frequency

Refer slide time :( 16:30)

graph, and I try to make here one, two, three, four, see here, hundred partitions. So, hundred partitions means every partition will take care of one percent of the total frequency. The partition values here are denoted P1, P2, and the last one will be here P100. So the percentiles are the values of quantiles which divide the given data into 100 equal parts, and they are denoted as P1, P2, ..., P100. The first percentile is denoted by P1, and this takes care of 1 percent of the total frequency, or 1 percent of the observations. Similarly, the second percentile will take care of 2 percent of the observations, and that is denoted here by P2. Similarly, if I take the 50th percentile, this will take care of 50 percent of the observations, and P50 is the same as the median; and similarly, if you go for, say, the 90th percentile, then this will take care of 90 percent of the total frequency, and it is denoted by P90. You may have heard that some examinations have a condition that the candidates should have obtained marks lying in the top 20 percentile or top 30 percentile. What does this mean? If you try to see, what is the top 20 percentile? The maximum value of the percentile can be hundred, so top 20 percentile means those who are lying between the 80th percentile and the 100th percentile. So what are they asking? Suppose a large number of students have appeared in the examination, a frequency curve of the marks obtained has been prepared, and those frequencies have been divided into, say, hundred equal parts. So this first value is denoting the first percentile P1, the second is denoting P2, somewhere here is P80, that is the 80th percentile, and finally here is P100, the final percentile. What they are saying is that any candidate who has scored marks lying in this region is eligible to appear in the examination; that means they have got marks which are greater than the 80th percentile. So this is the basic interpretation, and this is what they want.

Refer slide time :( 19:24)

Now, after this, let me explain how to compute these different types of quantiles in the R software. The basic function is quantile, q u a n t i l e, and inside the arguments we can give different types of things, but the compulsory one is the data vector, which I am denoting here as x. After this, we have several options, but I am going to illustrate the use of two of them, probs and na.rm, and I will also discuss the option type. The first parameter gives the data vector of the data for which the quantiles are needed. probs, p r o b s, denotes a vector of probabilities between 0 and 1; this is essentially the value of α. So it depends which of the quantiles you want to obtain: by controlling the values in probs, we can generate different types of quantiles, like quartiles, deciles, percentiles or something else. The third option is na.rm; we know that in case there are missing values in the data, then if I say na.rm equal to TRUE, the quantiles will be computed on the basis of the data that is available inside the vector x. After this, there is another parameter, type; type takes a value between 1 and 9, that is, it can be 1, 2, 3, 4, 5, 6, 7, 8 or 9, and this informs the quantile command which of the nine available algorithms for computing the quantiles is to be used. What does this mean? Well, you see, once we have got the data, the data has to be ordered using some algorithm; then, based on that, the algorithm has to partition the values, and then the algorithm needs to choose the correct value of the quantile. This entire process is based on certain algorithms; different people have given different types of algorithms to compute the quantiles, and it is possible that when we compute the quantiles based on different algorithms, their values may differ a little bit. But that is essentially the choice of the experimenter, of those who want to compute it, which of the algorithms they want to use. So R has this facility that you can choose any of the algorithms available inside the R software to compute the quantiles. Okay. So, for example,

Refer slide time :( 22:59)

if I say type 1, that is, I write type equal to 1, then this is going to use the algorithm based on the concept of the inverse of the empirical distribution function. If you choose type equal to 2, this is similar to type 1, but here the averaging is done at the points of discontinuity. And similarly, if you choose type equal to 3, then this is based on the nearest even order statistic. But definitely I am not going to discuss these algorithms here; my simple objective is to show how to use them on a given set of data. So now I will take an example, and I will show you how you can compute the quantiles, or in particular quartiles, deciles, percentiles, or anything else.
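Just as a minimal sketch, with a made-up data vector, the type argument is passed like this; the default in R is type = 7:

x <- c(2, 4, 7, 9, 12)               # hypothetical data
quantile(x, probs = 0.25, type = 1)  # inverse empirical distribution function
quantile(x, probs = 0.25, type = 2)  # type 1 with averaging at discontinuities
quantile(x, probs = 0.25)            # default algorithm, type = 7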

Refer slide time :( 23:55)

So, now I consider the example which I had also considered earlier: there is data on the heights of 50 persons, recorded in centimeters. This data has been stored inside a variable here, height. And now we would like to compute

Refer slide time :( 24:14)

different types of quantiles of this data. So, when I write quantile and height, where the height data vector has to be given inside the argument brackets, this here is the outcome. You can see here there are two rows. The first row is this one, showing 0 percent, 25 percent, 50 percent, 75 percent and 100 percent, and just below it there are the values 121.0, 137.5, 146.5, 158.0 and 166.0. So the first value is the quantile at zero percent; the second value is the value at twenty-five percent; the third value, 146.5, is the fifty percent quantile; the fourth value, 158.0, is the seventy-fifth percent quantile; and the last value, 166.0, is the hundred percent quantile. So essentially, if you try to see here, the twenty-five percent value is indicating the first quartile Q1, this is indicating the second quartile Q2, the third value is indicating the third quartile Q3, and Q4 is the last value, which is the value of the fourth quartile. You can see here that Q2 is also the value of the median, and it will also be the same as the value of P50, or say D5. Right? So what you have to observe here is that when you use the command quantile with only the data vector, the default output is the quartiles. Now, in case you want to

Refer slide time :( 26:23)

find out other things, like deciles and percentiles, then you simply have to control the parameter probs. But before I go to deciles and percentiles, I would like to show you here that in case I choose probs to be a sequence from zero to one at an interval of 0.25, then this command seq is going to create a data vector like 0.00, 0.25, 0.50, 0.75 and 1.00. So now I am asking R to compute the quantiles of the data in the height vector using the probabilities which are given here. You can see here that the outcome will look like this, and now you have to understand how things are happening: this 0% is the same as the 0.00; this 25% is the same as the 0.25, the second value in the probs vector; 50% is the same as 0.50, the third value in the probs vector; 75% is the same as 0.75, the fourth value in the probs vector; and 100% is the same as the last value, 1.00, in the probs vector. And you can see here that these are the values of the quantiles, and essentially they are the quartiles. So you can compare these values with the values which you obtained directly by using quantile; they are the same.
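A minimal sketch of both calls, assuming the height data vector from this example:

quantile(height)                           # 0%, 25%, 50%, 75%, 100% by default
quantile(height, probs = seq(0, 1, 0.25))  # the same quartiles, stated explicitly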

Refer slide time :( 28:01)

And this is here the screenshot, which I will now show you on the R console. So, let me first copy the data on the R console. You can see here, this is the height data, and I am simply finding here the quantile of height. Right. This is giving me these values; and if I use here the probs with the sequence from 0 to 1 at an interval of 0.25, you can see here that these values are the same values which are here. Right. Okay.

Refer slide time :( 28:45)

Now, after this, I show you how you would find out the other types of quantiles, like percentiles and deciles. So, first, we try to understand how to compute deciles. You have understood by now that it is very simple: the command is the same, and you simply have to change the probs vector, the data inside the probs vector. Deciles essentially correspond to controlling the value of α to be 0.1, 0.2, up to here 0.9. So I simply have to generate here a sequence starting from zero and ending at one at an interval of 0.1. Using the command seq, I generate such a sequence, and this value comes out here like this, you can see. And now I simply use the same command which I used earlier, but in this case I have simply replaced the earlier value 0.25 by 0.10. You can see here that I am getting the values of the deciles. For example, as we have explained earlier, the zero percent is corresponding to 0.00, 10 percent is corresponding to 0.1, twenty percent is corresponding to 0.2, and similarly 100 percent is corresponding to one. And these are the values of D1, D2, ..., D10: this is here D1, this is here D2, this is here D3, D4, D5, D6, D7, D8, D9 and here D10. Right. So, these are the values of the deciles.
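A one-line sketch, again assuming the height vector:

quantile(height, probs = seq(0, 1, 0.1))   # the deciles D1, ..., D10; the 0% value is printed as well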

Refer slide time :( 30:33)

And similarly, this is here the screenshot; I will show it to you on the screen also.

Refer slide time :( 30:37)

Now we try to compute the percentiles P1, P2, ..., P100. So now it is pretty simple: I simply have to generate a sequence from 0 to 1 at an interval of 0.01. I generate this sequence, and you can see here these are the values which are obtained: 0.00, 0.01, 0.02, up to here 1.

Refer slide time :( 31:04)

So, now, just using the earlier command, I generate the quantiles. I simply have to replace the value in probs by 0.01, and you can see here that the values of all hundred

percentiles have been obtained. For example, this is the value of the first percentile, this is the value of the second percentile, this is the value of the third percentile; and similarly, at the end, this is the value of the 99th percentile and this is the value of the 100th percentile. Similarly, in case an examination requires that the candidates must have marks more than the 80th percentile, then this is controlled by this 80%, and P80 here is 159; this means the candidate needs to have marks more than 159. Now, before

Refer slide time :( 31:59)

I go further, let me show you these things on the R console. So, I compute here the deciles, and you can see here the decile values; and similarly, if I change here only the interval value, then this will give me the hundred values which are the percentiles, and they have been presented on the slide. So let me come back to my slide.

Refer slide time :( 32:27)

Now I will continue with the same example, and I will show you how we are going to compute the quantiles in case the data is missing. So, I have replaced the first two data values by NA and stored this data into height.na, as we have done earlier, and now I will compute

Refer slide time :( 32:50)

the same things as before for this height.na. You can see here, as soon as I give only the quantile function without any specification of na.rm, this will give me an error. Right. So I add na.rm equal to TRUE in the quantile command, and this gives me the values: the first quartile, second quartile, third quartile and fourth quartile. Right. And similarly, in case I want to find out the deciles in the same data, I have to use the same command and use the probs, which is the sequence from 0 to 1 at an interval of 0.1. This will give me the deciles; you can see here, these are the deciles.
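A minimal sketch, assuming the height.na vector just described:

quantile(height.na)                                        # error: missing values not allowed
quantile(height.na, na.rm = TRUE)                          # quartiles from the available values
quantile(height.na, probs = seq(0, 1, 0.1), na.rm = TRUE)  # deciles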

Refer slide time :( 33:48)

That is what we have obtained, and this is the screenshot of the same operation which I just showed you.

Refer slide time :( 33:55)

And now I will show you this on the R console to make you confident. So, I store this data as height.na. You can see here, this is my height.na, and if I try to find out the quantiles of height.na, this will give me an error. So now I have to add here na.rm equal to TRUE, and this will give me the values of the quartiles. Similarly, in case I want to find out the deciles, then I have to add here the argument probs equal to a sequence starting from 0 to 1 at an interval of 0.1, which is here like this. So, you can see here, these are the deciles.

Refer slide time :( 35:01)

And similarly, in case you want to compute the percentiles, you simply have to change this interval value to 0.01. Okay.

Refer slide time :( 35:06)

You can see here, these are the different percentiles.
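A short sketch, again assuming the height vector; the name p is made up here:

p <- quantile(height, probs = seq(0, 1, 0.01))   # all the percentiles
p["80%"]   # for example, the cut-off behind a 'top 20 percentile' condition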

Refer slide time :( 35:13)

Now I would like to show you how to compute some other percentiles which are not at equal widths. Suppose I want to compute here the 14th percentile; so I will give it here the value 0.14. Suppose I want to compute the 23rd percentile; so I will give it here the value 0.23. Suppose I want to compute the 79th percentile; so I will give it the value 0.79. Okay. And now, if I write down here quantile of the data height with probs given as above, you can see here that you are getting the 14th, 23rd and 79th percentiles. So, this way you can actually compute different types of percentiles. And if you learn more statistics, there is a topic called testing of hypothesis, where we define the level of significance; the definition of the level of significance is nothing but something related to computing a particular percentile. So, this concept is going to be very, very useful for you when you go for further courses in statistics. So, I would like to stop here, and I would once again request you to look into these concepts, try to practice them on the R software, and we will meet in the next lecture again. See you and goodbye.

Lecture 17

Descriptive Statistics with R Software

Central Tendency of Data – Mode, Geometric Mean and Harmonic Mean

Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we had a discussion on computing the quantiles, which are the partitioning values, and they help us in determining the central tendency of the data. So we will continue on the same topic of the different tools to measure the central tendency of the data, and in this lecture we are going to address three more tools: mode, geometric mean, and harmonic mean.

So let us start our discussion with the mode. What is a mode? You might have noticed in your day-to-day life that if you go to a clothing shop, then whenever you want a shirt, your size is usually available. How does this happen? Suppose you are the shopkeeper and you want to open a clothing shop, and suppose you want to sell shirts. You want to know which sizes of shirt are in demand and which are not, so that you keep the most popular sizes in your shop in more quantity. In this case, what you would like to do is take a sample of data: you will ask people what is the size of their shirt, and then whichever size of shirt has more frequency, you will keep more of it, and whichever size has a smaller frequency, you will keep a smaller number of shirts. This is basically done with the help of the mode.

Similarly, suppose a fruit juice shop owner wants to know which fruit juice is preferred more, or, as I said, a clothing shop owner wants to know which size of shirt or trouser is most in demand, and so on. In such cases, the concept of mode is used. The mode of n observations x1, x2, ..., xn is the value which occurs most often compared with all the other values.

So essentially, the mode is the value which occurs most frequently in the set of observations. How is this word 'frequently' coming into the picture? It comes from the frequency distribution, or the frequencies of the values; so the definition of the mode is interconnected with the frequencies of the observations or the frequency distribution of the observations. One advantage of the mode is that it is not at all affected by extreme observations. For example, if I have a data set here 1, 1, 1, and say here 3, 4, and 6, you can see here that the number 1 is occurring three times, 3 is occurring one time, 4 is occurring one time and 6 is occurring one time. So the value occurring the maximum number of times is here 1, and the mode here is going to be 1. Now, in case I add here a value 500, the mode is not going to change, because 500 also appears only one time. That is the basic idea, that the mode is not at all affected by the extreme observations. And in case you try to plot the frequency curve, suppose I have these two types of frequency curves: one goes like this, and say another goes like this. You can see that in the first there is only one peak, but in the second, although the highest frequency occurs at only one value, there are two peaks; so we call such a distribution bimodal, and in the first case it is called unimodal. So all the distributions having only one mode are called unimodal, and all those distributions which have two modes are called bimodal.

So now we try to define the mode for two cases: when we have grouped data and when we have ungrouped data. Here I would like to inform you that in the base package of the R software, there is no direct command to find out the mode. Well, there is a command mode, m-o-d-e, but be careful: that mode is not used to find the mode that we are trying to find as a measure of central tendency. That command is used to describe the storage type of the data, for example whether the data is numeric or not, something like this. So please be careful. I will show you here that by writing some small commands we can find out the mode, but you cannot use the built-in function m-o-d-e of the R software for this purpose. So be careful.

So first we discuss the mode for ungrouped data. For ungrouped data, or for discrete variables, the mode is very simple: the mode of a variable is the value of the variable which has got the highest frequency, and obviously this is well defined in the case of a unimodal distribution. So what you have to do here: you will have the value x1 with frequency f1, x2 with frequency f2, and so on, xn with frequency fn, and you simply have to first find the maximum value among the frequencies. Suppose this maximum value is, say, fm; then, corresponding to fm, whatever is the value xm, that is going to be the mode. Obviously, this is true when we have a unimodal distribution, where there is only one mode.

So, in order to find out this mode in the R software, we go like this. I am simply going to write here two steps, and they simply copy the same thing that I just told you: create a frequency distribution, choose the maximum value of the frequency, and, corresponding to the maximum value of the frequency, choose the value of x; that is going to give you the value of your mode. So, when we compute this mode in R, I give you here two steps, step one and step two. Step one is very simple: whatever your data is, a data vector or possibly a matrix (how that is useful I will show you later), try to convert it into a table. I store my data inside a variable named data, d a t a, and whatever is the outcome of the table of this data I am going to denote as modetab; that is a short form for the mode table of the data.

To do this, I use here the command table that we used earlier, and inside the arguments I write here as.vector, a s dot v e c t o r; that means you ask R to treat the data as a vector, and this data which has to be treated as a vector is given inside the variable named data. I will show you later, but here I can tell you that in the outcome of this command, the first row will be a sorted list of all the unique values in the data vector data. Now, after this, you have to operate here one more command. This command I am writing here: this is names, and inside the arguments you have to write the data that you obtained in the first step, which is called modetab; then inside the square brackets (remember, these are square brackets here) you have to write modetab, the data that you obtained in step one, followed by double equal to. What is this double equality sign? This is the sign of the logical operator for equality. For example, we have the less than sign, the greater than sign, and the equality sign; a single equality sign means equal to, but the logical equal sign is denoted by two equality signs. And then I am trying to find out the maximum value of the data which is inside modetab.

Here I would just like to inform you that I have used here the command names on modetab, and this kind of construction is generally taught when you learn the R software's logical operators and how to extract the names from R data and so on; I am not going into those details here, but I would simply request you to please try out this command. So now I simply take an example here and show you how these things are happening.
try to take an example here and try to show you that how these things are happening.

So I am simply trying to take the data which is here like this inside the data vector and from there

I am trying to create the table of this data using the same command and here is the outcome. You

can see here. So you can see here that here are in the data vector of the values are 10, 10, 10, 10,

2, 2, 3, 4, 5, 6. So essentially when we try to create the frequency table, so it is trying to show

you in the first row these are the sorted values of all the unique values in data. You can see here

that the unique values here are 2, then here 3, then here 4, then here 5, then here 6 and this value
7

422
is repeated again and again four times that is 10 and now in the second row, this is trying to

count that how many times a value is occurring. For example, if you try to see here 2 is occurring

two times. One, two. So this is here two. Three is occurring here only one time so this is here

one. 4 is occurring here only one time. So this is here one. Five is occurring here say one time.

So this is here one. Now and then 6 is occurring here say one time this is here one. And now 10

is occurring here 1, 2, 3 and here 4 times so this is here 4 and this is here the 10. So what

essentially I need to do. In order to find out the mode I simply have to extract the values from

this frequency first.

The maximum frequency is 4, here, and then, in the second step, I have to find out the value corresponding to this maximum frequency, which is here 10. So the command in the second step extracts this value 10 from the first row, and this gives me here 10. This is how you can compute the mode; but definitely, I would like to mention that this is not the only way out, and you can also define it in your own way using your own logic. And this is here the screenshot, and I would request you to please execute it on your own data also. For example, I can show it to you on the R console, but unless and until you do it with your own hands on the R console, you may not really understand it. So I have got our data, you can see here, and with this data I am getting the value modetab; you can see here, this is the modetab, and after this I am using the command in step two to get the value of the mode. This is here ten.

Now I will show you another aspect. As I told you, the command which I just introduced can be used on a data vector or on a data matrix. So suppose I have data in the form of a matrix, and we have learned how to write data inside a matrix. There is a three by three matrix, and the data values are given inside this data vector: the values are 1, 2, 2, double 3, 4, 5, double 6, so the matrix will look like this. When I say I would like to find out the mode of this matrix, essentially I am saying that I want the mode of the first column, the second column, and the third column. In the first column you can see that the value 2 occurs two times, so here the mode should be equal to 2; in the second column the value 3 occurs two times, so here the mode should be equal to 3; and in the third column the value 6 occurs two times, so here the mode should be equal to 6.

So this is what I mean when I repeat the same command on this data matrix. I simply copy and paste the same thing; you can see here, I am using the same command, but now it gives me this type of value, and then I use the command in the second step, and it gives me the outcome 2, 3, 6. One caution here: since as.vector pools all the entries of the matrix into a single vector, what the command actually returns are all the values tied for the highest overall frequency; with this particular matrix, the values 2, 3 and 6 each occur twice overall, which is why the outcome 2, 3, 6 coincides with the column-wise modes that we found manually, here 2, 3 and here 6.
So my idea was simply to inform you that this data can be a data vector or a data matrix, and here is the screenshot of the same operation that we have done.

And now we come to the aspect of how to compute the mode for grouped data, or data on a continuous variable. In the case of a continuous variable, the mode of the data is the value of the variable with the highest frequency density corresponding to the ideal distribution. What is an ideal distribution? It is the distribution which would be obtained if the total frequency were increased indefinitely, that is, made very, very large, and if at the same time the widths of the class intervals were decreased indefinitely. You may recall that we had a discussion on the histogram and the frequency curve: we had seen that the frequency curve, or the frequency density or density plots, are more useful when you have a large number of data points, and in that case the bins of the histogram are made as small as possible and the number of data points as large as possible. So this is the same thing being said here: if you make a frequency curve like this, the bins are going to be very, very small, and then you will arrive at this type of frequency curve. The highest value of the frequency is where you will get the value of the mode.

Now, in order to compute the mode for this grouped data, the first step is to create a frequency table, and in this frequency table we just need three things: first, the class intervals; second, the midpoints of the class intervals; and third, the absolute frequencies. One thing you have to notice here is that in this case I am using the symbol f to denote the absolute frequency, whereas earlier, in some of the lectures, I have used the symbol f to denote the relative frequency; I am just trying to keep the standard symbols so that you do not face any problem once you read from the books.

So, if you try to see, I have created different class intervals. Now I simply find the midpoints of these class intervals, say m1, m2, ..., mk; they are obtained simply as the value of the lower limit plus the upper limit divided by 2. Corresponding to the first class I have frequency f1, corresponding to the second class I have frequency f2, and corresponding to the kth class I have frequency fk. Now, what do we have to do? We simply have to find the maximum value among these frequencies; whatever the maximum value is, say fm, I have to identify where this fm is lying, find the corresponding value of m, and based on that I will compute the value of the mode. The class where this maximum frequency occurs is called the modal class. In order to compute the mode for the grouped data, the expression is given like this.

Well, this is based on a certain derivation, but I am not going into that detail. Here, the value el is the lower limit of the modal class, dl is the class width, f0 is the frequency of the modal class, f with the subscript minus 1 denotes the frequency of the class just below the modal class, and the frequency f1 indicates the frequency of the class just after the modal class. So, for example, if I have here the modal class, say Am, and then I have here one class before it and one class after it, then the frequency of this Am is going to be f0, the frequency of the class before it is going to be f-1, and after this, the frequency of the class following it is going to be f1.
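Using the quantities just defined, the standard textbook expression for the grouped-data mode, written out explicitly here since the slide itself is not reproduced in the text, is

$$ \bar{x}_{\text{mode}} = e_l + d_l \cdot \frac{f_0 - f_{-1}}{2 f_0 - f_{-1} - f_1} $$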

So, based on that, we will compute the mode. Now we consider an example, and I will show you how you are going to compute the mode from this data. This is again the same example that we considered earlier: the time taken by customers to arrive at a shop inside a mall on different days was recorded over 31 days, so there are 31 values on the number of minutes that the customers take.

Now we prepare the frequency table. You can see here, I have made five class intervals, and I have computed their midpoints; for example, in the first case the midpoint is computed as 15 plus 20 divided by 2, equal to 17.5, and its frequency here is zero. Similarly, the other midpoints have been calculated along with their corresponding frequencies. Now, among these frequencies, f3 equal to 18 is the maximum frequency which occurs. So the modal class is going to be the class corresponding to the maximum frequency; here, 25 to 30 is going to be the modal class. So I take l equal to 3; the interval 25 to 30 is the modal class. Based on that, you can see that the frequency of the modal class, which we denoted as f0, is 18; the frequency just before the modal class, which is f-1 as per our notation, is 12; and the frequency just after the modal class, denoted as f1, is equal to 1. So, substituting these values, we will find the value of the mode from the given expression.

class is 5 and these are the frequencies modal class frequencies f0. The frequency of the class

just before the modal class as 12 and frequency of the class just after the modal class as 1 and if

you try to substitute all these values over here you get here the value 31.52. Now I would like to

address here one thing first. Inside R, there is no built-in function to compute the mode of the

grouped data what we have just obtained through formula. Well you can write a small function

or a small program to compute it but definitely I am not going to consider with that idea over

here. And now I would try to address two more tools geometric mean and harmonic mean which

are also two different measures to find out the central tendency of the data. So first I try to

address geometric mean.

The geometric mean is useful in calculating the average value of ratios or rates of interest, say in the banking and finance sectors, and so on. The geometric mean is not really applicable when any of the observations is 0. Why?

This is going to be clear from the definition of the geometric mean. One condition for the geometric mean is that all the observations should be positive. So let x1, x2, ..., xn be the n observations, which are all greater than 0. These observations can be on a discrete variable, forming an ungrouped data set, or they can be data on a continuous variable, forming grouped data. When the observations are ungrouped, the geometric mean is defined as $\bar{x}_G = (x_1 \cdot x_2 \cdots x_n)^{1/n}$; so what we are doing here is simply taking all the observations x1, x2, ..., xn, multiplying them, and then taking the nth root by raising the product to the power 1 upon n. And similarly, in case you have grouped data in which every xi has frequency fi, then the geometric mean is defined by $\bar{x}_G = (x_1^{f_1} \cdot x_2^{f_2} \cdots x_n^{f_n})^{1/N}$. So this is the product of the xi raised to the power fi, and

the power here is 1 upon capital N, where N is the sum of all the frequencies. Now, in case you want to find out the geometric mean in the R software, once again there is no direct command, but writing down this command is very simple. If you remember, in the initial lectures we discussed different types of built-in functions, and using those built-in functions we can create the command for the geometric mean. So, if you see here, I have created this command, but there are different ways to construct it.

So, suppose I denote the data vector as x. Then what are we going to do? In the case of ungrouped data, first I find the product of all the observations, and then I take the power 1 upon n. The product can be computed by prod of x, where x contains all the data; for the power I use the hat symbol; and n, the number of observations in x, can be determined by length of x. So this power can be written as hat and, inside the bracket, 1 upon length of x.
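A minimal sketch with hypothetical values:

x <- c(2, 4, 8)           # made-up positive observations
prod(x)^(1/length(x))     # (2*4*8)^(1/3) = 4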

This is pretty simple, and similarly, in case you have grouped data, the data vector has values x1, x2, ..., xn with frequencies f1, f2, ..., fn, such that x1 has frequency f1, x2 has frequency f2, and xn has frequency fn. So I write the data inside the data vector x and all the frequencies inside the vector f; finding these frequencies is not difficult, since by creating a frequency table and extracting the frequencies, as we have done earlier, one can obtain them.

Now, the product x1 raised to the power f1, multiplied by x2 raised to the power f2, and so on, multiplied by xn raised to the power fn, can be written as prod of x raised to the power f; and then the whole thing is raised to 1 upon N, where N is the sum of all the fi. The sum of all fi can be determined by the function sum of f, and writing it like this you will get the value of the geometric mean in this case. So you can see here that just by using the built-in functions in R, you can find out the geometric mean.
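A minimal sketch for the grouped case, with made-up midpoints and frequencies:

x <- c(17.5, 22.5, 27.5)   # hypothetical class midpoints
f <- c(2, 5, 3)            # hypothetical frequencies
prod(x^f)^(1/sum(f))       # grouped geometric mean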

Now, if I take an example, the same example in which I considered the data on minutes, the data is contained here like this; and if you want to find out the geometric mean considering this data as discrete data, you simply have to use this expression. You can see here, this is the value of the geometric mean, and this is the screenshot. Considering this data as grouped data, we can find the frequency table that we have already discussed, and based on that I have this data here, the same data.

And now, first I need to extract the frequencies. How can this be done? You may recall that in the earlier lecture, in the case of the median, I showed you how to extract the frequencies from grouped data. I am not going to explain it here again, but I will simply use the same commands that I used in the case of the median; if you have forgotten, I would request you to go back to the lecture on the median and see how the breaks and cut commands were used to extract the data. So now, using the method explained earlier, I construct here a data vector breaks, which is a sequence from 15 to 40 at an interval of 5, because our class intervals start from 15 and go up to 40 and are of width 5. This data comes out to be 15, 20, 25, 30, 35, 40. Then I operate the command cut on the given data vector minutes, using the breaks, with the right ends of the intervals open, so right equal to FALSE. Once you do it, you will get here this type of data; you can see here, these are the class intervals that we have obtained. Now I have to find out the frequencies of this minutes.cut data.

In order to find out the frequencies, I simply operate here the command table with the data minutes.cut inside the arguments; it will give me the frequency table. The first row indicates the class intervals and the second row denotes the frequencies.

Now, how to extract this frequency vector? For that we use the command as.numeric, and I store this value in f: as.numeric applied to the data provided by table of minutes.cut, which comes out to be like this. So now I have obtained the vector f, and now I have to collect all the midpoints and create a vector x. I collect all the midpoints 17.5, 22.5, and so on, and now I have here two vectors, x and f, and based on that I can use this function, which we have just developed, to compute the geometric mean. Here, if you try to see, I have given the screenshot, but I would request you to simply copy these commands, paste them into your R console, and see whether you are getting the same outcome or not. And this is the same outcome which you will be getting here.

Now, after this, I come to the last topic on the measures of central tendency, that is, the harmonic mean. The harmonic mean is also defined for grouped data and for ungrouped data. For discrete data, the harmonic mean is defined as $\bar{x}_H = \dfrac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}}$. So doesn't it look as if I am finding the mean of the inverses of the observations and then once again taking the inverse? The same idea gives the definition for the continuous, grouped data, for which the harmonic mean is $\bar{x}_H = \dfrac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \frac{f_i}{x_i}}$; here you have to note that xi has frequency fi, the same terminology that we used in the case of the geometric mean for grouped data.

Now, in case you want to compute the harmonic mean in the R software, once again I would like to inform you that there is no built-in command inside the base package of R, but writing down a small command to compute the harmonic mean is not difficult; just by using the built-in functions, and by looking at the structure of how the mean has been defined, one can easily do it.

So, if you try to see, the command which I am writing here is for the quantity $\dfrac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}}$. You can see here that if I denote 1 upon xi as, say, yi, then this quantity becomes 1 upon the average of the yi, which is nothing but 1 upon the mean of y. So this can be written simply as 1 upon mean of 1 upon x in the R software. That is the same thing which I am writing here: if x is a data vector, then the harmonic mean in the case of discrete data is computed as 1/mean(1/x).
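A minimal sketch with made-up values:

x <- c(1, 2, 4)     # hypothetical positive observations
1 / mean(1 / x)     # 3 / (1 + 1/2 + 1/4) = 12/7, approximately 1.714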

And similarly, in case you have continuous data in the grouped format, where x1, x2, ..., xn are the values occurring with the frequencies f1, f2, ..., fn respectively, the harmonic mean can be computed as the total frequency divided by the sum of the terms fi upon xi, that is, sum(f)/sum(f/x) in R. If you try to see, f divided by x, in the R notation, gives the terms fi upon xi; I sum them up, and then I divide the total frequency, sum of f, by this sum.
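A minimal sketch of this weighted form, with the same made-up midpoints and frequencies as before:

x <- c(17.5, 22.5, 27.5)   # hypothetical class midpoints
f <- c(2, 5, 3)            # hypothetical frequencies
sum(f) / sum(f / x)        # grouped harmonic mean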

Now, if I take the same example of the minutes data that we considered earlier, this is the data, which I have stored in the variable minutes; and if you simply execute the command to compute the harmonic mean that we just discussed on this data, you will get this value over here.

This is not difficult at all. Similarly, if you want to treat this data as continuous data: you obtain the frequencies exactly as in the geometric mean case, repeating the same steps up to that point, and finally you get the frequencies as shown here. So now you have the f vector and the x vector, where x contains the class midpoints; keep in mind that in this case the $x_i$'s are the midpoints. If you now execute the command sum(f)/sum(f/x), you will get this value here. These things are not very difficult to obtain; you can see the screenshot of the same thing here.
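A minimal sketch for the grouped case, assuming the standard weighted definition (midpoints partly and frequencies wholly hypothetical):

x <- c(17.5, 22.5, 27.5)     # class midpoints
f <- c(3, 6, 4)              # hypothetical absolute frequencies
sum(f) / sum(f / x)          # weighted (grouped) harmonic mean with n = sum(f)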

Now I would like to stop here in this lecture, and I would also like to close the topic of measures of central tendency. In this chapter we have discussed different types of tools: arithmetic mean, geometric mean, harmonic mean, median and mode, for grouped data and for ungrouped data, and as far as possible, wherever available, I have explained how to compute them in the R software. So once again I would request you: take different types of data sets, and on the same data sets compute each and everything, the quantiles, mean, median, mode, harmonic mean and geometric mean. Try to convert the same data into grouped as well as ungrouped data, operate these tools, see how much difference you get, and think about why this difference arises; that you will get from the theory of statistics. So you practice it, and we will see you in the next lecture. Till then, good bye.


Lecture – 18

Variation in Data – Range, Interquartile Range and Quartile Deviation

Welcome to the lecture on the course Descriptive Statistics with R Software. Now, you may

recall that, when we started the topics of descriptive statistics, we have taken several aspects.

One option was the central tendency of the data, which we have discussed in the last couple of

lectures. Now, we will aim to discuss the topic of variation in data. So the first question is: what is this variation? Why is it important? How is it useful? What type of information is it going to give us, and what are the different quantitative measures of such variation? In this lecture, we will try to develop the concept of, and the need for, measures of variation, and we will discuss three possible measures: range, interquartile range and quartile deviation.

So, let us start our discussion. You have seen that whenever we have data, we simply want to dig out the information contained inside it. And, as we have discussed, the data itself cannot tell you what its properties are. So, in the last couple of lectures, we have concentrated on the central tendency of the data.

Refer Slide Time: (01:56)

And we have seen that those measures of central tendency give us an idea about the location where most of the data is concentrated. What does this mean? Suppose I have some data which I plot through a graphical measure, with an x-axis and a y-axis, and I can see that the data is concentrated somewhere around a point. So the plot gives us the information about where all this data is concentrated. But there is another thing: there is a deviation between the centre point and the individual points, and some points are close to the centre while some points are away from it. So suppose I have two data sets of this type, and their scales on the x- and y-axes are the same, so there is no issue of scale.

Now, one data set looks like this and the other like that. In both cases most of the data is again concentrated in the same place, but the deviations, that is, the differences between the individual points and the centre point where the mean is located, are changing. In the first figure the region of scatter is narrow, while in the other figure the region is of bigger size. So there can be two different data sets which have the same mean; for example, suppose in both figures the point on the x-axis that generates the mean is mu. But the spread of the values around the mean is different. And, similarly, I can take any other point instead of the mean also.

So, what we can see from these two figures is that there can be two different data sets which have the same arithmetic mean but different concentrations around the mean. Now, the question is this: from the graphics I can show you that there are two different data sets in which the individual observations are scattered around the central point in different ways. Graphically I can view it, but how do I quantify it? For example, I can take a simple example to explain what type of information is conveyed by the measures of variation. Suppose there are three cities and we have measured the weather temperature in those cities on six days.

Refer Slide Time: (05:29)

Those temperatures in degrees centigrade are recorded in this table; please have a look at the data given inside it. I am taking three cities, city 1, city 2 and city 3, and in the rows we have the different days, one to six. If you observe city number 1, the temperature on day one is zero, on day two zero, on day three zero, and the same temperature continues for all six days. So if you find the mean of these temperatures, it comes out to be zero: the arithmetic mean of the temperatures of city 1 is zero.

Now, similarly, if you observe city number 2: the temperature on the first three days is -15, and the temperature on days four, five and six is +15. Once again, if you find the average, it will be the sum of -15, -15, -15, +15, +15, +15 divided by 6, and this again comes out to be zero. So in this case also the mean comes out to be zero. So now there are two cities, city 1 and city 2, in which the arithmetic mean comes out to be the same, zero and zero. But if you look into the data, do you think that the data in city 1, which is all zeros, and the data in city 2 are the same? The answer is no. They are different, but somehow their arithmetic mean comes out to be zero.

Now, if you observe the data of city number 3: on day one the temperature is 11, on day two 9, on day three 10, on day four 8, on day five 12, and on day six 10. If you find the arithmetic mean of these temperatures, (11 + 9 + 10 + 8 + 12 + 10)/6, it comes out to be ten. So the arithmetic mean of the temperature in city 3 is 10. In this case the temperatures are a little different compared to city one and city two. So in these three cities I have artificially taken three different types of data sets and computed their arithmetic means. What you have to notice is that the arithmetic means in city 1 and city 2 are the same, but their data values convey different information.
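These numbers are easy to verify in R; the vectors below simply transcribe the table:

city1 <- c(0, 0, 0, 0, 0, 0)
city2 <- c(-15, -15, -15, 15, 15, 15)
city3 <- c(11, 9, 10, 8, 12, 10)
mean(city1)   # 0
mean(city2)   # 0
mean(city3)   # 10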

Refer Slide Time: (09:15)

Now, can I say that, since the mean temperatures in city 1 and city 2 are both zero, the patterns of the temperature in the two cities are the same? No, because city 2 has a peculiar characteristic: its temperatures lie at two extremes, -15 degrees centigrade and +15 degrees centigrade, whereas in city 1 the temperature is actually constant, always zero. So in city 2 there is some variation in the data. Now, if you look into the data of city 3, this also has some variation. But if you look at the temperature patterns of these three cities, what do you think: can I say that the information provided by the city 3 temperatures is more reliable? No.

Refer Slide Time: (10:41)

Let us see whether this statement is right or wrong. First, let me plot this data on a simple graph and see what type of information I get. It is a very simple plot, and I will show you later on how to plot this type of data in the R software. For city number 1, the temperature is constant at zero: all the dots, denoting the temperatures on days one to six, lie at zero. Similarly, for city 2 there are two values, -15 and +15: the three observations on days one, two and three are all -15, and the three temperatures on days four, five and six are all +15.

But can you see the pattern in figure one and figure two? The average value will be somewhere at about zero in both cases. Now, if you draw such a diagram for city 3: on day one the temperature is 11, on day two it is 9, on day three it is 10, and so on. Looking at this pattern, don't you see that the points are scattered in different places? You can compare figure number one, figure number two and figure number three, and in the third figure the mean value is somewhere around 10. So now our objective is very simple: we have understood that the mean value gives us some information, and the variation in the values gives us a different type of information. From graphics I can see it, but I would like to quantify it.

Refer Slide Time: (13:03)

So, now the next question is how to get it done. The location measures, that is, in simple language, the measures of central tendency, are not enough to describe the behaviour of the data. There is another aspect, the concentration or dispersion of the data around any particular value, which is another characteristic of the data that we would like to study. And now the question is how to capture this variation, and in order to capture it, various statistical measures of variation or dispersion are available.

Refer Slide Time: (13:47)

Now we are going to see what these measures are, how to use them, what their interpretation is, and how to implement these tools in the R software; these are the objectives in this lecture. So let me show you a simple graph. Here I have made two types of dots in this picture: one set in green colour and another in red colour, both concentrated around the same place. These are two data sets and I have simply plotted their scatter plots. In both cases the arithmetic mean is going to be at about the same point. But the data set in green colour is closer to the central value, whereas the data in red colour is more scattered from the centre of the data. So in the case of the green dots the variation extends only up to a small region, whereas in the case of the red dots the variation goes further out, and the two enclosing regions sketched around the dots give us an idea of the scatteredness of the data. So I would now like to devise some tools which can measure this scatteredness or concentration: which data set is more concentrated around the mean, and which is more scattered around it. The mean is the usual choice, but otherwise I can measure it around any particular value also.

Now, before going further, I am going to address one simple issue. Sometimes people ask me: if two different data sets have the same variation, is it possible that they also have the same mean? So I am creating two hypothetical data sets graphically to show you the answer. The answer is that by looking at the variation of the data you cannot really comment on the central tendency of the data, and vice versa. For example, on the last slide we have seen two data sets which have the same mean but different variation. Now I will show you two data sets which have the same variation but different means.

Refer Slide Time: (16:45)

Now, if you see here, I have taken two data sets, one denoted by red dots and another by green dots. Well, I have prepared it by hand, so essentially I have tried to make the two spreads very similar. The mean of data set one is at one place and the mean of data set two at another, with the usual x-axis and y-axis. You can see that the variability in the two cases is nearly the same; I have tried to make it as close as possible graphically. But they have different means, marked as mean 1 for the first data set and mean 2 for the second. So even if two data sets have the same variability, they can have different means.

Usually, the spread, scatteredness or concentration can be measured around any particular point, but we will see that measuring this concentration or spread around the mean value, and particularly the arithmetic mean, is more preferable. There is actually a statistical reason: when we compute measures like the variance and standard deviation around the arithmetic mean, they have certain statistical advantages. This is not really the platform, or the course, where I can explain the advantages of using the arithmetic mean, but as I go further in the lectures I will show you what the most preferable location is, with respect to a given tool, around which we should measure the variability.

Refer Slide Time: (18:58)

So now we have understood that the different measures of variation, or measures of dispersion, help in measuring the spread and scatteredness of the data around any point, preferably the arithmetic mean value. There are different types of measures available in statistics; there is a long list, but I will take up some of them in the given time, those which I can show you easily in the R software. So there are different measures such as the range, interquartile range, quartile deviation, absolute mean deviation, variance, standard deviation, and there are some graphical tools also. I will consider these measures in this lecture and the forthcoming lectures, and I will show you how to compute them in the R software as well.

So now I will take these measures one by one, and for each tool there will be several questions to answer. The first will be: what is the definition of the measure? The second: how to compute it in the R software? The third: how to interpret it? In some of the cases I will also show you how to handle missing values, that is, how to compute the tool in the presence of missing observations. And lastly, wherever possible, I will show you the tools for grouped and ungrouped data. This scheme will continue throughout the topic of variation in data.

Refer Slide Time: (20:56)

So let us take the first topic, the range. First I will assume, and this will remain valid for all the other lectures, that I have a variable X on which we are collecting n observations, denoted by small x1, x2, ..., xn. For example, if X is height, then I am collecting data on the heights of n persons: the height of the first person is denoted by x1, the height of the second person by x2, and the height of the nth person by xn. So x1, x2, ..., xn are going to be some numerical values. In simple words, we have a set of observations x1, x2, ..., xn, which is our data set, and our objective is to define the tools and compute them using this data. The range is defined as the difference between the maximum and minimum values of the data. It is pretty simple: find the maximum value among x1, x2, ..., xn, find the minimum value among them, and take the difference between the two. This will give us the value of the range.

Refer Slide Time: (22:46)

So this is pretty simple actually. Now, once you get the range, how do you interpret it? The rule is simple: the data set having the higher value of range has more variability. So if you have more than one data set and you want to measure variability in terms of the range, just find the ranges of all the data sets and compare them; the data set whose range comes out smallest is thought to have the smallest variability. So the data set having the lower value of range is preferable. And if we have two data sets whose ranges are denoted Range 1 and Range 2, then if Range 1 is greater than Range 2, we say that the data set behind Range 1 has more variability than the data in the data set behind Range 2.

One thing I would like to make clear: we are going to discuss different types of tools, range, interquartile range, quartile deviation, absolute mean deviation, variance, standard deviation, and so on, and whenever we measure variability, we make such decisions only with respect to the same measure. Suppose you have two data sets and you find the range of one data set and the standard deviation of the other; comparing the range of the first data set with the variance or standard deviation of the second is not appropriate. So my advice is that whenever you want to compare variability, use the same tool and then make the comparison.

Refer Slide Time: (25:07)

Now I come to the aspect of how to compute the range in the R software. I denote by x the data vector containing the data values, x <- c(x1, x2, ..., xn). As we have defined it, the range is the maximum value minus the minimum value, so I can use the built-in commands in R that we discussed in the earlier lectures: the maximum of x1, x2, ..., xn is computed by max(x), with the data vector inside the argument, the minimum is computed by the command min(x), and then you find the difference between the two: max(x) - min(x).

Now, suppose x has some missing values, denoted by NA (capital N, capital A), and I store this data in another data vector, say xna. If you want to compute the range of such a data vector with missing values, you simply use the same commands max and min, but inside the argument you give the data vector xna together with the option na.rm = TRUE (capital T, R, U, E). What will happen is that when you operate the max command on this data set, it will first remove the missing values and then compute the maximum value; similarly, when we use the min command on this data vector with na.rm = TRUE, the operator will remove the missing values denoted by NA and then compute the minimum value. This is how you compute the range in case some data is missing.
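A minimal sketch with hypothetical values:

xna <- c(NA, 5, 9, NA, 2, 7)                      # hypothetical data with missing values
max(xna, na.rm = TRUE) - min(xna, na.rm = TRUE)   # range ignoring the NAs: 9 - 2 = 7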

One thing I would like to caution you about here. As you have seen, in the R software the names of the functions usually describe the value they return: mean gives the arithmetic mean of the data vector, and median gives the median of the data vector. Similarly, when you use the command range, r a n g e, it appears as if it is going to give you the value of the range, that is, the maximum value minus the minimum value, but that does not happen. The range command will give you two values: the minimum value and the maximum value of the data inside the data vector. So be careful.

Refer Slide Time: (29:04)

So, here I would like to make here a note of caution, that if you try to use the command r a n g e,

then this will return a vector containing the minimum and maximum values of the given

argument.

Refer Slide Time: (29:22)

So just be careful; and if you recall, the same thing happened in the case of mode also: the command m o d e gives some other information, although by its name it appears as if it would give the value corresponding to the maximum frequency. Similar is the case with range, so you need to be careful. Now I will take an example and show you how to compute the range on a given set of data; I will take the same example which I used in the earlier lectures.

Refer Slide Time: (29:55)

So I have observed the time taken by 20 participants in a race; the times are given in seconds, and this data is recorded inside a variable time. This is my data vector, and now I simply execute the R command max(time) - min(time), and I can see that this gives the value 52. You can also verify it from the given data: the maximum value is 84 and the minimum value is 32, and subtracting, 84 minus 32, gives 52. Just to show you the outcome of the command range: you can see that it gives two different values, 32, the minimum value of time, and 84, the maximum value in the data given inside the time data vector.
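In command form, the two computations above are:

max(time) - min(time)   # 52, the range
range(time)             # 32 84 (minimum and maximum, not their difference)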

Refer Slide Time: (31:12)

Before I show you this on the R console, let me give you one more example and then I will come back; in this slide you can see the screenshot of the R console. Now I will show you, in the same example where we recorded the time taken by twenty participants in a race, how to handle missing data values.

Refer Slide Time: (31:22)

I have made the first two values NA, meaning they are not available, and all other values are the same. Now I record this data inside a new variable, time.na; the name simply indicates that some data is not available inside the time vector. (There is no rule that you always have to use the dot; that is your choice.) Having stored this data, I try to find the maximum value of time.na minus the minimum value of time.na, but this gives an output of NA. Why? Because I have not used the option na.rm = TRUE. So I correct myself and find the maximum and minimum of time.na using na.rm = TRUE. As soon as we give na.rm = TRUE, the max command understands that there are missing values in time.na which have to be removed before computing the maximum, and the same thing happens with min: it understands that the missing values have to be removed first. The resulting value comes out to be 49.
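In command form:

max(time.na) - min(time.na)                               # NA, since na.rm was not set
max(time.na, na.rm = TRUE) - min(time.na, na.rm = TRUE)   # 49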

Refer Slide Time: (33:07)

So now I will show you this on the R console; here you can see the data and the screenshot of the operation on the R console.

Video Start Time: (33:22)

So let me first copy this data set into the variable time; you can see the data time here. Now, if you operate the command range(time), you can see it comes out to be 32, 84, the same outcome that we had received. But if you find max(time) - min(time), you get the value of the range. Similarly, to handle the missing data: I create another data vector, time.na, in which the first two values are missing. If I try to find the range of time.na, it tells me something is wrong, so I use the option na.rm = TRUE, and it gives me the minimum and maximum values. But my objective is not to find the minimum and maximum values themselves; I want to find their difference. So I write max(time.na, na.rm = TRUE) - min(time.na, na.rm = TRUE), and this gives 49. But if by mistake you do not give na.rm = TRUE, you can see the outcome will be NA.

Video End Time: (35:27)

Refer Slide Time: (35:28)

Now after this I come to another topic, the interquartile range. Just as the range measures the difference between the maximum and minimum values, we have another measure, called the interquartile range, which simply measures the difference between the third and first quartiles. Now, you may recall what a quartile was.

Refer Slide Time: (36:00)

If you recall, we had discussed that if we have a frequency distribution like this one, then the distribution is divided into four equal parts: the first 25 percent of the frequency is covered up to the first quartile, denoted Q1; the next 25 percent of the frequency is contained between Q1 and Q2, so Q2 essentially covers the first fifty percent of the frequency, and this is the median; similarly we have Q3 and finally Q4. Now consider Q1, Q2 and Q3, where Q2 is the median, and look at this area: the area between Q1 and Q2 consists of 25 percent of the total frequency, and the area between Q2 and Q3 consists of another 25 percent. So altogether, the entire area between Q1 and Q3, which I am denoting by dots, takes care of 50 percent of the total frequency. The interquartile range is defined as the difference between the 75th and 25th percentiles or, equivalently, between the third and first quartiles, and it is denoted IQR = Q3 - Q1. And as I have shown you in this figure, the IQR, or interquartile range, covers the centre of the distribution and contains 50 percent of the observations.

Refer Slide Time: (38:16)

Now, how to make the decision? Once again the rule is the same: the data set having the higher value of interquartile range has more variability; that is the interpretation. Obviously we would always like to have a data set with smaller variability, so the data set with the lower value of interquartile range is preferable. Suppose we have two data sets and their interquartile ranges are computed as IR1 and IR2; if IR1 is greater than IR2, then we say that the data behind IR1 is more variable, or has more variability, than the data behind IR2. So with these two examples you can see that the range and the interquartile range both measure the same aspect of the data, the variation, but they do it in different ways. Now, how to compute it in the R software? This is pretty simple; there are two ways: you can write your own program, using the command to compute the quartiles and finding their difference, or you can use the built-in command in R for the interquartile range.

Refer Slide Time: (39:36)

So if my data vector is x, consisting of the observations c(x1, x2, ..., xn) as we had assumed, then the interquartile range is computed by the command IQR, with the data vector inside the argument, and if the data vector has some missing values and is denoted xna, then the command is modified to IQR(xna, na.rm = TRUE). With this I would like to introduce one more measure, called the quartile deviation. The quartile deviation and the interquartile range are very closely related to each other, and after this I will show you how to compute it in the R software.

Refer Slide Time: (40:32)

The quartile deviation is another measure of the variability in the data, and it is defined as half the difference between the 75th and 25th percentiles, or half the difference between the third and first quartiles; so it is essentially half the value of the interquartile range, and half of the interquartile range is called the quartile deviation. It is not really difficult to give the definition: we simply take the difference Q3 minus Q1, which is nothing but the interquartile range, and divide it by 2. The decision making in this case is the same as for the interquartile range: the data set having a higher value of quartile deviation is said to have more variability.

Refer Slide Time: (41:36)

Now, if you want to compute the quartile deviation in the R software, it is pretty simple: you have already learned how to compute the interquartile range, so you simply write the same command and divide it by 2, and if the data vector has some missing values, then again you write the same command as defined for the interquartile range in the case of missing data and divide it by 2. This command will give you the value of the quartile deviation. Now I will take an example to show you how to compute these things, and then I will show you on the R console also. Again I am going to take the same example,

Refer Slide Time: (42:17)

where I have stored the data on the times of the 20 participants. Now, if I simply operate the command IQR with time inside the argument, this gives me the value of the interquartile range, which comes out to be 27. Similarly, if I want to find the quartile deviation, I simply write IQR(time), the same command I used here, divided by 2, and this value comes out to be 13.5.
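In command form:

IQR(time)       # 27, the interquartile range
IQR(time) / 2   # 13.5, the quartile deviation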

Refer Slide Time: (42:53)

And, this is here the screenshot of the operation.

Refer Slide Time: (42:55)

And now I will show you how to handle the missing values. Once again I take the same example as earlier: in the time data, the first two values have been replaced by NA, so they represent missing values, and the data is given inside the data vector time.na.

Refer Slide Time: (43:17)

Now, if you simply try IQR(time.na), this is obviously going to give you an error because there are missing values. So what you have to do, if you want to compute the interquartile range, is give the command IQR with the data containing the missing values, time.na, and write the option na.rm = TRUE. This tells R that while computing the IQR the missing values have to be removed first, and the value comes out to be 25.25. And if you want to find the quartile deviation, just use the same command and divide it by 2; this will give you the value of the quartile deviation.
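In command form:

IQR(time.na, na.rm = TRUE)       # 25.25
IQR(time.na, na.rm = TRUE) / 2   # 12.625, the quartile deviation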

Video Start Time: (44:09)

So now I will show you this on the R console. You already have the data time entered, so I write IQR(time); this comes out to be 27, and if you want the quartile deviation you just divide by 2, which comes out as shown. See what happens if I use lowercase letters, iqr: this gives an error. So keep in mind that the IQR command is case sensitive: lowercase iqr and uppercase IQR are different things.

Video End Time: (44:47)

Refer Slide Time: (44:48)

And similarly, I take the data with the missing values, time.na. This is my data vector, and if I try to find the IQR of time.na without giving the argument na.rm = TRUE, you can see this gives an error. So I add the option na.rm = TRUE and get the value, and if you want the quartile deviation, the IQR divided by two, you simply divide the interquartile range by two, and this gives the value 12.625.

So, in this lecture I have given you the concept of variation in data, and I have introduced measures which are based on only a couple of values: the range is based on two values, the minimum and maximum, and the quartile deviation and interquartile range are both based on two values, the first quartile and the third quartile. At this moment I would request you: please go through the lecture, take some examples and practice them in the R software. Try different experiments, obtain the values, and see what type of information the values you obtain for the range and interquartile range give about what is contained inside the data. So you practice it, and I will see you in the next lecture. Till then, goodbye.

Course Title
Descriptive Statistics with R Software
Lecture - 19
Variation in Data - Absolute Deviation and Absolute Mean Deviation
Prof. Shalabh
Department of Mathematics and Statistics
IIT Kanpur

Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture we started our discussion on how to measure the variation in the data, and in that lecture we considered three possible tools: range, quartile deviation and interquartile range. The measure of range depends only on two values, the minimum and the maximum, and the quartile deviation as well as the interquartile range depend on two values, the first quartile and the third quartile. So these measures are based on only two values at a time, either the minimum and maximum or the first and third quartiles, although they take care of the entire data set in an indirect way, since the minimum, maximum and quartiles are computed on the basis of all the observations.

But now there is another concept for measuring variation: why not measure the variation of the individual data points from the central value and then combine all these deviations, all this information, together? So now we are going to start a discussion on tools for measuring variation which are based on the individual deviations of the data from the central value, or from any other value. In this lecture we are going to discuss two topics: first the absolute deviation, and then the absolute mean deviation. We will try to understand the concept, and I will show you how to compute them in the R software also.

(Refer Slide Time 02:24)

Now we are going to discuss another aspect: the measures which are based on the deviations. What does this mean? If you remember, in the earlier lecture I drew this type of data, and the mean value is somewhere in the middle, as I am assuming here. These are my data points, and the blue lines indicate the differences between the central value, say the mean (it can be another value also), and the individual data points. These are the deviations.

(Refer Slide Time 03:16)

So the concept we are going to discuss here is that tools like the range, interquartile range and quartile deviation were based on specific values and partitioning values: specific values meaning the minimum and maximum, while the partitioning values were the first and third quartiles. Now we would like to have a measure which can capture the deviation of every observation around any given value. So in this figure, instead of taking the mean, I can also take any known value, say A. How to get it done?

(Refer Slide Time 04:06)

So the first question is how to measure the deviation, and to begin I would like to measure the deviation around an arbitrary value A; later on I will choose an appropriate value of A. You can see in this graph that if these are the data points and this is the known value A, then these are my deviations. This is x1, this is x2, and this is x3; the difference between x1 and A I denote by d1, and similarly, the difference between x2 and A is denoted by d2, and the difference between A and x3 by d3. So, in general, I will say that di = xi - A is going to measure the deviation.

(Refer Slide Time 05:23)

Now, if xi is greater than A, then for all such observations the deviations will be positive; if the value of xi is smaller than A, then all such deviations will be negative; and if xi is exactly equal to A, then such deviations are going to be zero. So this di can take three possible kinds of values: zero, less than zero, and greater than zero.

Now there is another issue: if I have the observations x1, x2, ..., xn, then corresponding to every observation I have a deviation, d1, d2, ..., dn.

(Refer Slide Time 06:15)

Now d1, d2, ..., dn are individual values. Suppose you have 20 observations; then you will have 20 values of the deviation, some positive, some negative, some zero, and it will be very difficult to get a summarized picture by looking at the deviations individually. As we discussed in the case of measures of central tendency, we would always like to have a summary measure: all the information contained in d1, d2, ..., dn has to be somehow combined into a single quantity, by looking at which I can always compare things.

(Refer Slide Time 07:15)

So one option is that once I have the values d1, d2, ..., dn, I can find the average of these deviations, $\frac{1}{n}\sum_{i=1}^{n} d_i$. But this average may be very, very close to 0. Why? Because it is possible that some of the deviations are positive and some are negative, so when we take the average of positive and negative values, the mean may come out to be 0 or very close to 0. And once $\frac{1}{n}\sum_{i=1}^{n} d_i$ is zero, it possibly indicates as if there is no variation in the data, and if this value is pretty small, it indicates as if there is very little variation in the data.

Suppose I make a figure in which my central value is somewhere in the middle and my observations are two on one side and two on the other. In this case, if this is A and the points are x1, x2, x3, x4, then the deviations d1, d2 will have the opposite sign to d3, d4, considering that all of them are measured with respect to A.

(Refer Slide Time 09:02)

So now, if I find the value (d1 + d2 + d3 + d4)/4, with suppose two positive and two negative values, then their average may be close to 0 or exactly 0. This may be misleading because it may not give us the correct information.

(Refer Slide Time 09:24)

So, obviously, this example gives us a clear idea that we need a measure in which the signs are not considered. Why? Because I am interested only in the scatteredness of these green circles around this red point; I am not interested in their individual signed values. So we need not consider the signs of these deviations di, but we need to consider only their magnitudes.

(Refer Slide Time 10:00)

So now, after looking at this example, we have understood that I need a summary measure which can express the variation in the data based on the deviations, and these deviations have to be considered only in their magnitude. But obviously, in the data, some deviations are going to be positive and some negative. As long as we have positive deviations there is no problem, but I need to convert the negative deviations into positive ones; that means I need to convert their sign from negative to positive.

Now the next question is how to convert the sign. In mathematics we have two options: either I consider the absolute value, or I square the deviations. Based on these two approaches, we have two types of measures: one is the absolute deviation and the other is the variance. In this lecture I am going to consider the concept of absolute deviations, and in the next lecture I will consider the concept of variance.

(Refer Slide Time 11:25)

So I will be considering the two options for making the negative deviations positive, that is, for considering their magnitude: take their absolute values, or take their squared values. In this lecture I am going to concentrate on absolute values. So I have the observations x1, x2, ..., xn, I take their deviations from the value A, and after this I take the absolute value of all these deviations; now I need to combine all this information into a single measure.

Now the question is how to combine it; we will come to that. Before discussing how to combine such information, let me first clarify the symbols and notation that I am going to use in this lecture and the next. We have two types of data sets: one on a continuous variable and another on a discrete variable. For a discrete variable we simply use the observations as such, but for a continuous variable we group them: we convert the data into a frequency table and then extract the mid-values of the class intervals and the corresponding frequencies to construct our statistical measures, the same thing we did in the case of the arithmetic mean, if you remember. For ungrouped data the mean was defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the sum of the xi divided by the number of observations, and in the case of continuous or grouped data it was defined as $\bar{x} = \frac{\sum_{i=1}^{K} f_i x_i}{\sum_{i=1}^{K} f_i}$, where in the second case the xi's denote the mid-values of the class intervals.

So I would now try to introduce these measures for grouped data as well as for ungrouped

data.

(Refer Slide Time 13:51)

So, first, let me explain that if our variable is discrete, or the data is ungrouped, then we denote the set of observations by x1, x2, ..., xn, and there is no issue: we use these values directly.

(Refer Slide Time 14:17)

Whereas if I have grouped data: suppose my variable is X and I have the observations tabulated in K class intervals in the form of a frequency table like this one, with a first class interval, a second class interval, and so on up to the Kth class interval; after that I find the midpoints of these class intervals. The midpoint of the class interval e1 to e2 is (e1 + e2)/2, the class interval e2 to e3 has the midpoint (e2 + e3)/2, and these midpoints are going to be denoted by x1, x2 and so on up to xK.

(Refer Slide Time 15:14)

So I am denoting the midpoints of the intervals by x1, x2, ..., xK, which is different from the earlier case, where x1, x2, ..., xn denoted the values obtained directly from the data. Corresponding to these intervals we have the absolute frequencies, which I will denote by f1, f2, ..., fK, and the sum of f1, f2, ..., fK is going to be denoted by n, which is again different from the case of ungrouped data.

(Refer Slide Time 16:01)

So this is what you have to keep in mind: when I consider the case of ungrouped data, x1, x2, ..., xn denote the values which I observe directly, and when I consider grouped data, x1, x2, ..., xK denote the mid-values of the class intervals, and the corresponding frequencies in those class intervals are f1, f2, ..., fK. Similarly, the symbol small n denotes the number of observations in the case of ungrouped data, whereas small n denotes the sum of all the frequencies in the case of grouped data. So now, with these notations, I will start the discussion on the first measure, that is, the absolute deviation.

(Refer Slide Time 16:51)

So, first, concentrate on this slide. You can see that I have obtained n such deviations in absolute value. Now, if I find the mean of all such deviations, this will give me a sort of summary measure of absolute deviation.

(Refer Slide Time 17:18)

And the definition of the absolute deviation is that if you have data x1, x2, ..., xn, then the absolute deviation of these observations around a value A, which is known, is defined as follows.

(Refer Slide Time 17:37)

Now I have two cases, for ungrouped or discrete data and for grouped or continuous data. In the case of discrete data, we simply take the arithmetic mean of the absolute deviations, $\frac{1}{n}\sum_{i=1}^{n} |x_i - A|$, and in the case of continuous data we again find the mean of the absolute deviations using the concept of the arithmetic mean for a frequency table, $\frac{1}{n}\sum_{i=1}^{K} f_i |x_i - A|$ with $n = \sum_{i=1}^{K} f_i$. So you can see that this is how we define the absolute deviation: the sum of absolute deviations divided by n, which is the arithmetic mean of the absolute deviations, and in the case of continuous data this is again the arithmetic mean of suitably defined absolute deviations.

(Refer Slide Time 18:27)

Now the next question comes: how to choose this A? Well, in this definition I am assuming that A is known to us, but if I choose different values of A, then the values of the absolute deviation will change. By a little algebra, we find that the absolute deviation takes its smallest value when A is chosen to be the median of the values. So we replace the value of A by the median of x1, x2, ..., xn, and this gives us another measure, which is called the absolute mean deviation. So for the observations x1, x2, ..., xn, the absolute deviation is going to be minimum when it is measured around the median, that is, when A is the median of the data x1, x2, ..., xn.

(Refer Slide Time 19:36)

Remember one thing: we computed the median in the case of grouped data and ungrouped data by different expressions. So whenever we find the value of the absolute mean deviation, we need to compute the value of the median suitably. Suitably means: if you have discrete data, order the observations and then take the middle value, depending on whether your number of observations is odd or even, as we discussed while handling the topic of the median. And similarly, if you have grouped data, then please convert it into a frequency table and use the expression for computing the median from that table; remember that there we computed the median in a different way, not by using the direct function. This is what you have to keep in mind. So if I have discrete data, then I simply replace A by the median of x1, x2, ..., xn, which is denoted by x̄med, and then I take the arithmetic mean of those absolute values; this is called the absolute mean deviation in the case of ungrouped data. Similarly, if you have grouped data on a continuous variable, then again I find the value of the absolute deviation by replacing A by x̄med, but remember that this median is suitably computed for grouped data. And also note that the definitions of n in the case of grouped and ungrouped data are going to be different.

(Refer Slide Time 21:43)

So this is called the absolute mean deviation.

(Refer Slide Time 21:48)

Now, what are the absolute deviation and the absolute mean deviation doing? Either of them tries to present the information on the variability of the data in a comprehensive way. The absolute mean deviation measures the spread and scatteredness of the data around, preferably, the median value, in terms of absolute deviations.

(Refer Slide Time 22:26)

And if you want to know how to make a decision, that is, if you have more than one data set and want to know which data set has more variability, then how to do it? Again, we have the same rule as we discussed in the case of the interquartile range and quartile deviation. The data set having the higher value of the absolute mean deviation, or even of the absolute deviation, is said to have more variability, and the data set with the lower value of the absolute mean deviation or the absolute deviation is preferable. Suppose we have two data sets, we compute their absolute mean deviations, and we obtain AMD1 and AMD2; then, if the value of AMD1 is greater than the value of AMD2, the data behind AMD1 is said to have more variability than the data behind AMD2. One thing to keep in mind: there are two interpretations, more variability or, second, less concentration; they have a similar meaning.

(Refer Slide Time 23:51)

What am I trying to say? More variability means that the data is more scattered around the value; when I say less concentration, it means the data is less concentrated around the median value. Usually, people express their inferences using either the term variability or the term concentration, so you have to be careful: more concentration means less variability.

(Refer Slide Time 24:31)

Now the question is how to compute it on the basis of R Software? So now I have two cases

for ungrouped data and second for grouped data. So in case of ungrouped data, I'm denoting

the data vector by here x. So this is going to be x = c (x1, x2,...,xn) separated by comma.

Now, once again, I will try to address that in R, there is no built-in function to compute the

absolute deviation or the absolute mean deviation, but by using the built-in commands inside

the base package of R, it is not difficult at all to write down the command for absolute

deviation or for the absolute mean deviation. Let us see how.

(Refer Slide Time 25:24)

If you see what I am trying to do: I have the absolute values |xi - A|, and for that I have the function abs(x - A); then I find their sum divided by the number of observations, which is nothing but the mean of the absolute values of xi - A.

(Refer Slide Time 26:05)

So I have used this concept here: first I write down the absolute values of the deviations of the data vector from a given point A, and then I find the mean of all such absolute values. So by writing mean(abs(x - A)) we will find the value of the absolute deviation around any given value A. Here A is assumed known, and if I replace A by the median of x, then I know how to compute the median: the command is median(x). So I find the absolute values of the deviations of x from median(x) using the command abs, and then I find their arithmetic mean, mean(abs(x - median(x))).

(Refer Slide Time 26:58)

So this command will give me the value of absolute mean deviation.
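Putting these pieces together, a minimal R sketch, assuming x is the ungrouped data vector and A is a chosen constant:

# absolute deviation around an arbitrary value A
mean(abs(x - A))
# absolute mean deviation: A replaced by the median of x
mean(abs(x - median(x)))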

(Refer Slide Time 27:04)

Now, in case I have ungrouped data with missing values, what I say is: suppose I have this data vector x, and whatever values are missing in this vector are denoted by NA, and those values are collected inside the data vector xna. So in case I have missing values in ungrouped data, then the absolute deviation is going to be computed by this command. You see, I have done nothing new here. If you have understood how to write the command and how to handle the missing values: I have written the same command as earlier, but I have added here the option na.rm = TRUE.

So now this is telling that please try to compute the values of xna - A after ignoring the

missing values using the command a b s and then whatever is the outcome, operate the

command mean on it. And similarly, in case if you want to find out the absolute mean

deviation in such a case, simply try to replace A by median of x.

So in the same command, I'm simply trying to replace A by the median, but remember one thing: the median also has to be computed using the option na.rm = TRUE. And then the first na.rm is going to be used by the command median, and the second na.rm is going to be used by the command mean. So you simply have to be a little bit careful while writing this expression; sometimes we make a mistake with the brackets. Okay.
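A sketch of the missing-value versions, assuming xna is the data vector containing NA entries:

# absolute deviation around A, ignoring the missing values
mean(abs(xna - A), na.rm = TRUE)
# absolute mean deviation; the inner na.rm is used by median, the outer by mean
mean(abs(xna - median(xna, na.rm = TRUE)), na.rm = TRUE)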

(Refer Slide Time 28:47)

Now, similarly, when we have grouped data, I will assume that my data vector here is x, which is essentially denoting the midpoints of the class intervals, and this data has the frequency vector f. So x1, x2,...,xK are the midpoints of my class intervals with frequencies f1, f2,..., fK. Right.

Now, in this case, the absolute deviation is simply going to be $\frac{1}{n}\sum_{i=1}^{K} f_i |x_i - A|$. So I try to compute this xi - A by the function abs, as abs(x - A), and then I have to multiply by the

corresponding fi. So I can use here the operator *, and then I have to find the sum. So I can say: first of all, find the absolute value of x - A; then multiply it by the vector f; then find the sum of the resulting vector; and this has to be divided by n, where this n is going to be Σ fi. So I can divide it by sum(f). Right. And the same thing has been written here in this command.
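That command, as a sketch, assuming x holds the class midpoints and f the frequencies:

sum(f * abs(x - A)) / sum(f)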

(Refer Slide Time 30:40)

So once again you can see here that there is no built-in function, but you have to be careful,

and you can easily write such commands. And similarly, if you want to compute the absolute

mean deviation, what I have to do? I simply have to replace this A by here median.

(Refer Slide Time 31:03)

So just in order to make it clear that here we are not going to use the command median

directly, but we need to compute the value of the median separately, I am using here a black

colour font and I'm denoting here as xmedian so that you remember that this value has to be

computed separately using the functions and command that we have used in the case of

computation of median in case of grouped data. Right. Okay.

(Refer Slide Time 31:48)

So now let me take here an example, and I will try to take the same example. I will first treat the data set as an ungrouped data set and then I will convert it into a grouped data set. So, first, I consider the case of the ungrouped data set. You have the same example that we have used earlier: we have the data of 20 participants, and this data has been recorded inside a variable, say time, and this is the data vector here.

(Refer Slide Time 32:22)

And in order to compute the absolute deviation, I try to choose some value of A. Just for the sake of understanding, I'm choosing here the value of A to be 10. And now, once I operate this command that we have just noted, mean(abs(time - A)), this will give me the value 46. And if I try to find out the median of the time, you can see here, this value will come out to be 46.5, because this is a discrete variable and you have ungrouped data. So you can use this command, and now, using this command directly inside the outer command, I'm trying to compute the absolute values of the deviations of time from

the median and then I'm trying to find out the mean. So this is giving me the value of absolute

mean deviation around median. So this value comes out to be here 14.5.

So as we had discussed that this absolute mean deviation is going to be minimum when it is

measured around the median, so you can see here when I try to choose here this value to be

here A equal to 10, then this value was coming out to be 46, but now this value is coming out

to be 14.5.
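In R notation, the steps just described are (time is the vector of the 20 recorded times from the earlier lectures, not reproduced here):

A <- 10
mean(abs(time - A))             # absolute deviation around A = 10: 46
median(time)                    # 46.5
mean(abs(time - median(time)))  # absolute mean deviation: 14.5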

(Refer Slide Time 33:44)

Okay. Before I try to go to the R console, let me show you here the screenshot of the same

operation that we are going to do, but before that, let me consider this data as a grouped data

and show you how to compute the median in this case also.

(Refer Slide Time 34:00)

So now, considering the same dataset as grouped data, I have divided the entire data into six class intervals and I have found their mid-values xi, shown here in the second column. So you can see here this value is (31+40) divided by 2, so this is 71 by 2, which is 35.5, and so on for the other values, and here are the frequencies. So f1, the frequency of class 1, is 5; f2, which is the frequency of the second class, is 3; then f3; then f4; then f5; and then f6.

(Refer Slide Time 34:51)

Now what is our objective? We simply want to find out the frequency vector and median

from this given set of data.

(Refer Slide Time 34:53)

Now how to find out the frequency vector and how to find out the median, I'm not going to

explain here in detail because I already have discussed it in more detail when I was trying to

compute the median and I also have used it in the earlier lectures. So but here just for your

understanding and so that you can recall, I will simply try to give you the broad steps and

these broad steps are simply I have taken from the slides that I had used earlier.

So my first objective is this. I want to obtain the frequency vector. So we had used the

command sequence s e q to generate a sequence from 30 to 90 at an interval of 10 and we had

denoted this by breaks b r e a k s, and this breaks has an outcome like 30, 40, 50, 60, 70, 80,

90 and this is here the screenshot. We have done it earlier.
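That step, as a sketch:

breaks <- seq(30, 90, by = 10)
breaks   # 30 40 50 60 70 80 90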

(Refer Slide Time 36:04)

And then, after this, we wanted to obtain the frequencies. In order to obtain the frequencies, first we need to classify the time data into the class intervals using the widths that were defined by the breaks. For that we have used the command cut, c u t. The cut command was operated over the data vector time using the breaks, and we had used the option right = FALSE so that the intervals are open on the right-hand side, and I had stored these values in the time.cut vector. These values were obtained here like this, and this is here the screenshot.
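As a sketch:

time.cut <- cut(time, breaks, right = FALSE)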

(Refer Slide Time 36:51)

And here in this case if you recall the last row, which is indicated with levels, that is going to

indicate the class intervals. Now after this we had used the command table to tabulate the

values of time.cut and the frequency table was obtained here like this. This was the frequency

table and the second row over here, 5, 3, 3, 5, 2, 2, this is going to give us the frequency

values.

(Refer Slide Time 37:32)

So we had obtained the frequency vector from this table using the command as.numeric, with the name of the frequency table inside the argument, and this gives me here the outcome 5 3 3 5 2 2, and this outcome is the same as given here in this vector. So this is our frequency vector.
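Those two steps, as a sketch (the object name freq.table is just an illustrative label):

freq.table <- table(time.cut)   # frequency table of the class intervals
f <- as.numeric(freq.table)     # frequency vector: 5 3 3 5 2 2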

(Refer Slide Time 38:03)

Now we need to find out the vectors of midpoints. So now how to obtain it?

(Refer Slide Time 38:12)

We had defined this vector x to be the midpoints of the class intervals. So these points are the midpoints of the class intervals. How are they obtained? You see, one thing you have to be very careful about here: when we are compiling the data through the R Software using the command table, then in this case I have to use the midpoints of the class intervals and the frequencies which are provided by the R Software.

Right.

(Refer Slide Time 39:07)

So here if you try to see, in the earlier slide, we had obtained the class intervals like this one.

So the midpoint of this interval is (30+ 40) divided by 2 which is equal to 35. The midpoint

here is 45 and so on, and this is different than what we obtained earlier manually.

(Refer Slide Time 39:39)

You may recall that this value was 35.5 and so on, but anyway, we are now doing it in the software and this will not make much difference. So this is now here my vector x. After this, I have to use the expression for finding out the median. If you remember, we had used the notation m, em, fm, dm, and we had this expression to find out the median, which we discussed in our lecture on the median.

(Refer Slide Time 40:11)

So now, using this em, which denotes the lower limit of the median class; this fm, which gives me the frequency of the median class, which is here; the sum of these fi's, which is going to be the sum of the frequencies of the first two classes, that is, classes one and two; and dm, which here is the width of the median class, this is 10; using these values over here, I get the value here 56.66.
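A hedged sketch of that computation, following the median formula from the earlier lecture (em, fm and dm are read off the frequency table by hand; they are not produced by a built-in R routine):

f <- c(5, 3, 3, 5, 2, 2)   # class frequencies
n <- sum(f)                # 20
em <- 50                   # lower limit of the median class [50, 60)
fm <- 3                    # frequency of the median class
dm <- 10                   # width of the median class
xmedian <- em + (n/2 - (5 + 3)) / fm * dm   # 5 + 3: frequencies of classes one and two
xmedian                    # 56.66667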

(Refer Slide Time 40:45)

So what you have to notice here is that, in this case, we have to find out the median separately.

(Refer Slide Time 40:50)

And now, using these commands, I have here the data vector of frequencies f; the data vector x, which holds the middle points; the value of the median, 56.6; and the value of the constant A, which I have chosen myself. So now, using the same commands for finding

out the absolute deviation with value A and finding out the absolute mean deviation around

the median, but here the median has been defined separately, we get these values here.
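Continuing the sketch above (f and xmedian as computed there), the grouped-data commands with the values of this example are:

x <- c(35, 45, 55, 65, 75, 85)       # class midpoints as computed by R
A <- 10
sum(f * abs(x - A)) / sum(f)         # absolute deviation around A: 46
sum(f * abs(x - xmedian)) / sum(f)   # absolute mean deviation: roughly 14.16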

(Refer Slide Time 41:26)

And in case you try to compare the results that we have obtained for the grouped and ungrouped data, you can see there is not much difference, actually, right. This is for the ungrouped data and then for the grouped data. You can see here the values of the absolute deviations around A equal to 10 are coming out to be the same, and similarly, the values which we have obtained for the absolute mean deviation in the case of grouped and ungrouped data are not much different. This is 14.5 and 14.16, so about 14.2; so there is not much of a difference between the two here.

(Refer Slide Time 42:10)

And now I will try to show you these things on the R console. So we come to our example, where we first try to find out the absolute deviation.

(Refer Slide Time 42:25)

So I can show you that here I have the data here time, and I can take A to be here 10, and if I

try to find out the absolute deviation, this is coming out to be 46, and if I try to replace A by

median of time, this is coming to be here 14.5 and you can verify here that these are the same

data set or the same result that we had obtained earlier.

(Refer Slide Time 42:59)

Now I come to the grouped data case, where we have the frequency vector here like this, which we have copied here; the vector of the midpoints, which I am writing here like this; and the value of xmedian, which has been obtained separately, here like this. So you can see here: f is coming out like this, x like this, and xmedian like this. Right.

(Refer Slide Time 43:38)

Now let me just clear the screen so that you can see clearly. Now, using these values, I'm trying to compute the absolute deviation and the absolute mean deviation. You can see here this is coming out to be 46, and similarly, if I try to compute here the absolute mean deviation, this will come out like this. So you can see here that it is not very difficult to do.

(Refer Slide Time 44:10)

Now I will just quickly show you the same example for the case where you have got missing values in the data. So I will take the same data set, assuming that the first two values are replaced by NA, and I will show you how to compute these values using this data with missing values.

(Refer Slide Time 44:27)

So this data has been stored in the variable time.na, and you can see here that I'm trying to

choose here A = 10, and then I'm simply trying to use the command here for computing the

absolute deviation when data is missing, and then I'm trying to find out the median, and then I

am trying to compute the absolute mean deviation using the command that I have just

discussed and these values are coming out to be here like this.

(Refer Slide Time 44:55)

So I will try to show you on the R console also, and if you remember the time.na data has

already been there. Now if I try to use the same command here, I’m getting the absolute

deviation, and if I try to find out the absolute mean deviation, then it is going to be this.

(Refer Slide Time 45:19)

See here that these are the same values as in the screenshot of what we have just done. Right.

(Refer Slide Time 45:26)

So now I would like to stop this lecture. You have seen how we have formulated the measure to find out the variation in the data based on the absolute value. Notice that there is no built-in function or command in the base package of R to compute it directly. So, using the earlier concepts of built-in functions, we have defined the commands to compute the absolute deviation around any arbitrary value, or around the median, which gives the absolute mean deviation.

You also need to remember that the computation procedures for grouped and ungrouped data are different in the R Software. So you have to be careful while doing it. I would request that you please take some more examples and practice, and we will see you in the next lecture. Till then, good bye.

Lecture – 20
Variation in Data – Mean Squared Error, Variance and Standard Deviation

Welcome to the next lecture on the course Descriptive Statistics with R software. You may

recall that, in the earlier lecture, we had considered the aspect of measuring the variation, by

using the absolute deviation and absolute mean deviation, and if you recall, we had developed

that tool, on the concept of deviations of the observations around any arbitrary point or around

some central tendency value like as mean, or say median like this. We also discussed that

whenever we want to develop these types of tools, we had to convert the deviations only into a

positive value, that means, we needed to convert or we needed to consider only the magnitude of

the deviations, and for that we have two options. You try to consider the absolute value of the

deviation, or the second option is that we can consider the squared values of those deviations. So now, in this lecture, we are going to discuss the second aspect, that is, how to build up a measure of variation by considering the magnitude of the deviations by squaring them. So, in this lecture, we are going to discuss the concepts of mean squared error, variance and standard deviation, and we will try to see how to implement them in the R software. So, before we start, let us try to fix our notations once again. Although I had done it in the earlier lecture, here I will just be doing it quickly.

Refer Slide Time: (02:00)

So, we may be considering here, two types of variable, one discrete variable on which we will

have ungrouped data, and in this case, the variable is going to be denoted by capital letter X, and

on this variable, we are going to obtain the n observations and these observations are denoted

here like small x1, small x2, small xn.

Refer Slide Time:(02:25)

Similarly, when we are trying to consider, a continuous variable here, and we have grouped data

on that variable, this means, we have a variable, continuous variable X on which we have

obtained the observations and those observation have been tabulated in K classes, or say K class

intervals, and the entire tabulation has been presented in the form of a frequency table. For

example, here you can see the frequency table in which all the observations have been converted

into groups, and these groups are the class intervals; they are denoted here by e1 to e2, e2 to e3, and so on. So, we have here K class intervals, or say K groups, and after this, whatever is the midpoint of the first interval is going to be denoted here by x1. So, x1 is going to denote the midpoint of the first class interval, x2 is going to denote the midpoint of the second class interval, and so on. And all these x1, x2,..., xK, now, in the case of grouped data, are going to denote the midpoints and not the values of the observations, as happens in the case of ungrouped data, and the frequencies of these intervals are denoted by f1, f2,..., fK. So, f1 is going to denote the frequency of the first class interval, f2 is going to denote the frequency of the second class interval, and so on. And the sum of all these frequencies is denoted here by n. So, that is going to be our basic notation for grouped and ungrouped data. So, as soon as I say that we are

going to define the measures on grouped data, I will be using these notations and as soon as I

say that, I am going to develop the tool for the ungrouped data, then I will be using the earlier

symbols and notations, Right! Now, I come on the aspect of, developing a tool called as mean

squared error and I will be following the lecture almost, on the same line as I did in the earlier

lecture. You may recall that first, I define the absolute deviations and these deviations were

defined around any arbitrary value A, and then I developed the measure, and then I replaced A,

by some measure of central tendency and we defined the absolute mean deviation by replacing

A, by the median, because median was the value, around which the absolute deviations were

minimum. Now, similarly on the same lines, now instead of absolute deviations, I will be

considering the squares of the deviations, Right! So, if you remember,

Refer Slide Time:(05:28)

that in the case of absolute deviations, I have used the quantity absolute value of xi minus A.

Now, I will try to, consider the squared values of xi minus A, and I will write down here the

squares of these deviations as $(x_i - A)^2$. So, these deviations are the squared deviations around

any arbitrary point A. Now, I will try to obtain, this quantity for each and every observation, say

x1 minus A whole square, x2 minus A whole square, and up to here xn minus A whole square,

and then I will try to take the arithmetic mean of all these values. Once I do this, then in the case of ungrouped data on a discrete variable, this quantity is denoted as $s^2(A) = \frac{1}{n}\sum_{i=1}^{n}(x_i - A)^2$, which is called the mean squared error with respect to A. And similarly, whenever we have a continuous variable or grouped data, then in that case the mean squared error with respect to any arbitrary value A is defined as $s^2(A) = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - A)^2$. So, you can see here that now the summation goes over the number of classes K, and here the small n is the sum of all the frequencies, and this quantity is a sort of weighted mean, where the weights have been given by the frequencies fi. So, this is how we try to define the mean squared error in the case of grouped and ungrouped data with respect to any arbitrary value A, Right!

Refer Slide Time:(07:46)

Now, one can choose any value of A, but it can be shown mathematically that the mean squared error is going to assume the minimum value when A is chosen as the arithmetic mean; or, in simple words, the mean squared error takes the minimum value when the deviations

are measured around the arithmetic mean. So, now what I will do? I will try to replace capital A

by x bar, which is the sample mean, or the mean of the observations, and when I try to do it, then

what happens is that I simply replace A here by x bar, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, that is the arithmetic

mean, and then I try to define here the deviations, x1 minus x bar whole square, x2 minus x bar

whole square, and up to xn minus x bar whole square. And then I try to find out the average of these things, simply the arithmetic mean, and this quantity is denoted here, in the case of ungrouped data on a discrete variable, by the values given here. So, you can see here, this is the same

thing, and this quantity, which is essentially $s^2(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and which in general we denote by $s^2$, is called the variance. In case you try to simplify this expression, you can write down the same thing here as $\frac{1}{n}\sum_{i=1}^{n}(x_i^2 + \bar{x}^2 - 2 x_i \bar{x})$. So, this comes out to be $\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \bar{x}^2 - \frac{2\bar{x}}{n}\sum_{i=1}^{n} x_i$. Now, this can be further simplified to $\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \bar{x}^2 - 2\bar{x}^2$, because the last term becomes twice of x bar square. So, this quantity becomes $\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$. So, the alternative expression for the variance is given by this, which is the same thing. So, actually, you can use any of these expressions to compute the value of the variance in the case of ungrouped data.

Refer Slide Time:(11:05)

And similarly, when we have grouped data on a continuous variable, then in this case the variance is defined as $s^2 = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2$, where K is the number of classes, $\bar{x} = \frac{1}{n}\sum_{i=1}^{K} f_i x_i$, and the small n is defined as the sum of all the frequencies. Similarly, if you want to simplify this expression as we did in the earlier case, this will come out to be $\frac{1}{n}\sum_{i=1}^{K} f_i (x_i^2 + \bar{x}^2 - 2 x_i \bar{x})$, and if you try to see, this is $\frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 + \bar{x}^2 \frac{\sum_{i=1}^{K} f_i}{n} - 2\bar{x}\,\frac{\sum_{i=1}^{K} f_i x_i}{n}$. The last quantity here becomes the same as x bar, and in the middle term the numerator Σ fi becomes n, so n upon n is one. So this quantity comes out to be $\frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 + \bar{x}^2 - 2\bar{x}^2 = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 - \bar{x}^2$, which is the same quantity given over here. So, any of these expressions can be used to compute the variance in the case of grouped data.

Refer Slide Time:(13:10)

Now, after giving the definition of variance, I would like to address here one more aspect. You have seen that in the definition of variance, I am trying to take the average of n values. So, this is defined as $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. In statistics, there is another expression of variance that is quite popular, and in the case of ungrouped or discrete data it is given by $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. You can see here that this expression is like the earlier one; the only difference is that instead of 1 upon n, I have here 1 upon n minus 1. So, this is what you have to keep in mind, and similarly, in the case of continuous or grouped data, the divisor that was earlier n now becomes n minus 1. Now an obvious question comes: why am I using this expression? Actually, the properties of this expression, when we have divisor n or when we have divisor n minus 1, are different. If you have the idea of an unbiased estimator in the context of statistical inference, then when I use the divisor n minus 1, this form of the variance is an unbiased estimator of the population variance. Whereas, when I am using the divisor n, then $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is not an unbiased estimator of the population variance. So, that is why many times the software uses this definition. For example, in the R software, the definition with divisor n minus 1 is used. So, that is the reason I would like to inform you here that whenever you are using any software, please try to look into the manuals of the software and see what it is trying to do. Well, in case the data is very large, then the values of the variances may not differ much, but in case the data is small, then the values of the variances computed by using the divisor n or n minus 1 may differ. So, you should know what you are obtaining; it should not happen that you are assuming that the divisor is n while the divisor inside the software is actually n minus 1. Okay, so just be careful.

Refer Slide Time:(16:15)

Now, after the variance, I come to another concept, that is called the standard deviation. You have possibly used two types of terminologies; one is standard deviation and the second is standard error. In general, people do not differentiate between these two names, but here I will try to explain to you what the difference between the two is. But to start with, I will try to use the common terminology, and I will use here the standard deviation, Right! So, when I say that s square is

going to denote my variance, then it is actually the sample variance; sample variance means the variance calculated on the basis of a given sample of data. Right! So essentially, we are trying to compute the sample variance, but we always call it, without loss of generality, the variance. When I take the positive square root of s square, then this is called the sample standard deviation. So, once again, you can see here, I am writing the word sample inside the bracket just to indicate to you that the common language is simply

the standard deviation, but here actually we are trying to compute the standard deviation on the

basis of a given sample of data. Right! Now, I'm saying that the sample variance, or the variance, or the standard deviation, has been computed on the basis of a given sample of data; what does this mean? If you try to see what happens in statistics, you are usually trying to collect a

sample of data, and on the basis of sample of data your objective is to compute the population

counterpart. You may recall that in the beginning of the lectures, we have discussed the concept

of population and sample. So, suppose I would like to compute the total number of people who are eating, say, more rice in a country like India, which is a very big country. Now, if I want to find out the arithmetic mean, the average number of people,

who are eating more rice than wheat, then what I have to do? I have to compute the number of

such persons all over the country, which is very difficult to compute, unless and until you, you

execute a census. So, we try to take a sample of data that means, I will try to choose a small

number of observation and based on that, I will try to find out the mean, and that will be called a

sample mean. Similarly, if I want to find out the variance of the data that I have collected inside

the sample, then that will be called as sample variance, but there will always be a counterpart

like as population mean or population variance. Population mean, means the arithmetic mean of

all the population. Similarly the population variance means, the variance of all the units inside

the population. So, what happens is that when we try to compute the positive square root of the population variance, this is called the standard deviation. But the problem is that we do

not have the entire population in our hand. So, we always work on the basis of a sample of data,

and that is why in a common language, people, do not differentiate much between the two

definitions, standard error and standard deviation. But, once you are trying to do a course on

statistics, as a student you must know it, and that is my idea in explaining this concept here in the next couple of slides. Okay.

Refer Slide Time:(20:50)

So now, I will try to denote sigma square as the population variance, and the positive square root of sigma square will be called the standard deviation; this is actually the population standard deviation. But, as I said, using sigma to denote the standard deviation and sigma square to denote the population variance is the more popular notation among practitioners.

Refer Slide Time:(21:16)

Now, next question comes. What is the advantage of having a standard deviation or say standard

error in place of variance? Now, if you try to see suppose I am trying to collect the data on the

height of some children, say in meters. So, the arithmetic mean will be in meters. But, the

variance will be in meter square. So if I say that there are two data sets whose variances are, say, 16 and 36, then it is more convenient to compare the two values if they are in the same units as the original data set. So, what we try to do is find out the positive square roots of 16 and 36, which are 4 and 6 respectively. So now, once I say that I have got a

data set in which I have measured the heights of the children in meter, and then they have got the

arithmetic mean of say of 1.2 meter and standard deviation of say 0.5 meter. Then, it is more

easy to understand, and similarly if I say that I have got two data sets in which the standard

deviations are 4 and 6. Then it is more easy to understand that the standard deviation of the

second data set is higher than the standard deviation of the first data set; that is the only reason, actually. So, the standard deviation or standard error has the advantage that it has the same unit as the original data, so it is easy to compare. For example, if I have a variable on which I have taken observations denoted as, say, small x, then if this x has been obtained
in the unit meter, then variance s square will be in meter square, which is not so convenient to

interpret also. on the other hand, if I have obtained the observation x in meter, then the standard

deviation s will also be in meter, and which is more convenient to use, more convenient to

interpret. That is the reason that why people prefer to use the tool of standard deviation or

standard error.

Refer Slide Time:(23:40)

Now the question comes: what does this variance or standard deviation actually measure? The variance, or equivalently the standard deviation, measures how much the observations vary, or how the data is concentrated around the arithmetic mean. For example, if I try to take here a data

set and suppose this data set is like this and suppose there is another data set which is here like

this. So, you can see here in both the cases, the mean is somewhere here. But, these deviations, in

the case of red dots, and in the case of green dots, they are different, and the deviations in the

case of red dots, they are larger. So, in this case, if I try to find out the values of the variances, say variance 1 here and variance 2 here, and I compute the values of variance 1 and variance 2 on the basis of the given sets of data, then we will find that variance 1 is smaller than variance 2. So, whenever I have values of variance, say variance equal to 4 and variance equal to 10, then this obviously indicates that the data with variance 4 is more concentrated around the mean value, and the data with variance 10, which suppose I'm denoting with the red dots, is more scattered around the mean value, like this. So, this is how we try to interpret the value of the variance.

Refer Slide Time:(25:45)

So, obviously, when we want to make a decision on the basis of a given value of variance, then a lower value of variance, or equivalently of the standard error or standard deviation, indicates that the data is highly concentrated, or less scattered, around the arithmetic mean. Whereas a higher value of variance, or equivalently of the standard deviation or standard error, indicates that the data is less concentrated, or highly scattered, around the mean. So, this is the interpretation,

Refer Slide Time:(26:23)

and obviously on the other hand if I have a data set which has got the higher value of variance or

the standard error or standard deviation, then I can say simply that the data set has got more

variability. In statistics, usually we always prefer to have a data set which has got the lower value

of variance, or a lower value of standard deviation or standard error. So, in case I have got two data sets, and suppose we compute the variances, and suppose these variances come out to be var1 and var2, then if var1 is greater than var2, we say that the data in the first data set has more variability, or less concentration, around the mean than the data in the data set used for computing var2.

Now, there is a very simple rule. We would always like to have a data which has got lower

variance and if you remember in the initial slides I had discussed, that one of the basic objective

in a statistical tool is that, we would always like to have a data, in which the variability is less.

Okay.

Refer Slide Time:(27:41)

Now, in case you try to compare the variance and the absolute mean deviation, then you know that when there are some outliers or extreme observations inside the sample, the median is less affected than the arithmetic mean. In this case, that means if the data has very high variability, or the data has extreme observations or outliers, then using the absolute mean deviation is a better option and it is preferred over the variance or standard deviation. Whereas the variance has its own advantages. For example, if you are working with statistical tools, then the variance has some nice statistical properties. So, from the statistics point of view, from the algebra point of view, from the statistical analysis point of view, it is easier to operate on the variance mathematically and algebraically than on an absolute function like the absolute mean deviation.

Refer Slide Time:(29:10)

Next, we try to understand what the difference between standard deviation and standard error is. To understand this, we have a concept which is called a statistic. You see, the spelling of the subject statistics is s t a t i s t i c s, but here we are not using the last s, and this is called only a statistic. A statistic is a function of random variables. So, if you have random variables X1, X2,..., Xn, then any function of X1, X2,..., Xn is called a statistic. For example, if you have random variables X1, X2,..., Xn and you try to find out the arithmetic mean, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, then this X bar is itself a random variable. So, this is called a statistic.

Now, the concept is very simple. Whenever you are trying to find out the standard deviation of a

statistic, then the outcome is again going to be a function of only the random variables, and this

standard deviation is called as standard error. So, whenever we try to find out the standard

deviation of a statistic, it is called as standard error.

Refer Slide Time:(30:44)

What does this mean actually? Actually, ideally what happens, that standard deviation is a

function of some unknown parameters. For example, if you try to understand, suppose mu is

representing the population mean. Right. The mean of all the units inside the population, which

is very very large, and practically it is very difficult to find out the mean of entire population. So,

usually it is unknown. Then, ideally, the standard deviation is defined as the positive square root of the variance of all the values of x, which is denoted here as the square root of sigma square, $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$. So mu here is actually the population mean of all the values xi. But since this mu is not known, this is unknown, you cannot compute it. This value cannot be computed, because how will you get the value of mu?

Refer Slide Time:(32:02)

So, in that case the sigma square cannot be found. So, one option is that we can replace the value of mu by its sample counterpart. When I say mu here, what is this? This is the population mean. So, this is equal to 1 upon the total number of units in the population, which is the population size, times the sum over the population of the values xi. So, I try to replace it by the sample value. That was for the population, and now for the sample I replace it by 1 upon n, where n is the number of observations, times the sum of the xi, and I denote this arithmetic mean by x bar. So, now mu is unknown to us, but the sample mean x bar is known to us. So, what can I do? I can replace mu by the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and then in that case the standard error, which is the positive square root of the variance, becomes $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, and this quantity is called the standard error.

So, always remember: a standard error will always be a function of the observed values. So, in simple language, I can say that the standard error will always refer to a standard deviation which can always be computed on the basis of a given sample of data. You have got the data, you are asked to compute the standard deviation, you can compute it, and in that case this will be called the standard error,

Refer Slide Time:(34:19)

and in that case, if you try to see here more specifically, the population variance was defined like this, and now this becomes $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ for the case when we have ungrouped data, and $s^2 = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2$ in case we have grouped data. So, this is basically the definition of the sample variance, and in common language we usually do not call it the sample variance again and again; we simply say: find out the variance of the given set of data. Now, after this, I will come to the aspect of how we are going to compute the variance or standard deviation in the R software. Right?

Refer Slide Time: (35:13)

So I will take the first case here, the case of ungrouped data. In this case, the data vector is going to be denoted by small x, and the R command for computing the variance is var, and inside the argument you have to give the data vector. But remember one thing: this command var(x) gives the variance with divisor n minus 1, that is, $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. So, in case you want to obtain the quantity $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, then what I can do is multiply and divide by the quantity n minus 1, so that I can write it as $\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. So, in case you are very particular about getting the divisor n in the variance, then I would suggest that you multiply the variance of x by the factor n minus 1 upon n. How? For example, you can see here, now I'm writing in red color: the quantity var(x) will give you the variance of x, and the quantity (n - 1)/n will be the factor by which, if you multiply the variance, you will get the variance with divisor n, where n is the length of the x vector, that is, the number of observations present in the data set.
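A minimal sketch of this adjustment:

n <- length(x)
var(x)                 # R's default: divisor n - 1
var(x) * (n - 1) / n   # variance with divisor n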

Refer Slide Time: (36:45)

Now we are going to consider the case when we have grouped data. In the case of grouped data, you know that there is no built-in command in the base package of R. So, I need to compute the mean value of the given data set separately, along with the midpoints and the frequencies from the frequency table. So, in this case we are going to compute the mean, say xmean, separately, and if you try to see, your expression for the variance was $\frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2$. So, now this x bar becomes xmean, and this quantity here I am writing as (x - xmean)^2; then it has to be multiplied by the frequency vector f; then all the elements of this vector have to be summed; and this sum has to be divided by n, which is the sum of all the elements in the frequency vector f. So, this is how this expression has been obtained to find out the variance in the case of grouped data.
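A sketch, assuming x holds the class midpoints and f the class frequencies:

xmean <- sum(f * x) / sum(f)      # grouped arithmetic mean
sum(f * (x - xmean)^2) / sum(f)   # grouped variance with divisor n = sum(f)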

Refer Slide Time: (37:57)

And in case you have some missing data, then, in the case of ungrouped data, if the data vector x has some missing values coded as NA, we are going to denote this data vector as xna, and in that case the command remains the same, but I have to give here the option na.rm = TRUE. Right.
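For example, as a sketch:

var(xna, na.rm = TRUE)   # variance (divisor n - 1), ignoring the NA entries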

Refer Slide Time: (38:21)

And similarly, if you want to compute the standard deviation, that is very simple: simply find the square root of the variance that you obtained earlier. So, if I find the square root, which is the function sqrt, of the variance of x that we had obtained earlier, then this is going to give you the standard deviation, or the standard error. All right. But in this case, notice that the divisor is going to be n minus 1; in case you want the divisor to be n, then simply take the square root of the earlier expression that we had obtained. Right. So, finding the standard deviation is simply equivalent to finding the square root of the variance and considering its positive value.
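A sketch of both versions:

n <- length(x)
sqrt(var(x))                 # standard deviation with divisor n - 1
sqrt(var(x) * (n - 1) / n)   # standard deviation with divisor n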

Refer Slide Time: (39:08)

And similarly, in the case of grouped data also, whatever expression you have obtained for computing the variance, just take the square root of that variance, and this is how you can compute the standard deviation in the case of grouped data.

Refer Slide Time: (39:23)

Now I take the same example, or the same data set, that I considered earlier, where we have the data of 20 participants who participated in a race; their times taken to complete the race were recorded, and this data is stored here in this data vector. Now I try to find out the variance of this data vector; I use the command var, with the name of the data vector, that is time, inside the argument, and I get here the value 283.3684. Here you can see the divisor is n minus 1, which you always have to keep in mind, and I will show you how to find out the variance with divisor n in the next slide. Similarly, if you want to find out the standard deviation, then you simply have to take the square root of the variance. So, for whatever variance I have obtained here, I'm operating the function sqrt, which is the function to find out the square root, and I get here the value 16.83355.
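In R, these two steps are simply:

var(time)         # 283.3684
sqrt(var(time))   # 16.83355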

Refer Slide Time: (40:30)

Similarly, if you want to have the variance or standard deviation with the divisor n, then in this case we have learned that we need to multiply the variance of x by the factor (n - 1)/n, where n is the length of the data vector. So, I simply multiply the variance we obtained earlier by this (n - 1)/n, and I get here the value of the variance with divisor n, and similarly, if I take the square root of this value, then I will get the standard deviation based on the divisor n.

Refer Slide Time: (41:11)

So, you can see that it is not difficult to get the variance or standard deviation for a given set of data, and here is the screenshot, which I will show you. Now, next we try to find out the variance in the case of grouped data. So, we consider the same example and we convert it into grouped data; we find the frequency table, and from there we will find the frequency vector, the vector of midpoints, and the arithmetic mean. We have already discussed how to find the frequency vector and the vector of midpoints and how to create the frequency table, so I will not discuss it here, but I will very briefly give you the background so that you can look into the earlier lectures to see how to get it done.

Refer Slide Time: (41:54)

So, you see, the given data has been classified into 6 class intervals, and these are the midpoints, and these are the absolute frequencies that we have already obtained; now we need to find out the frequency vector and the mean from the given data.

Refer Slide Time: (42:08)

So, we had used the command seq to create the breaks and then the cut command, and from the outcome of this cut command we had created a frequency table using the table command; after that we had operated as.numeric on the output of the earlier command, which has given us the frequencies like this, and we had created the vector of the midpoints from the given output, like this. Right. So now we have obtained the x and f vectors.

Refer Slide Time: (42:41)

So now, we already have obtained x and we already have obtained f; now I need to find out the mean of x. You can see here that in this case the mean of x is going to be defined by $\bar{x} = \frac{1}{n}\sum_{i=1}^{K} f_i x_i$. Right. So you can see here, I'm defining here xmean, and this is the sum of fi xi divided by n, which is sum(f), and when I do it here, I get the value of xmean to be 56.
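With the midpoints and frequencies of this example, the computation reads:

x <- c(35, 45, 55, 65, 75, 85)
f <- c(5, 3, 3, 5, 2, 2)
xmean <- sum(f * x) / sum(f)
xmean   # 56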

Refer Slide Time: (43:12)

Now I use the command, or syntax, that we defined earlier with the given set of data here, and this gives me the value 269, and if I try to find out the standard deviation of this quantity, this will give me the value of the standard deviation in this case.
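Continuing the sketch, the grouped variance and standard deviation are:

sum(f * (x - xmean)^2) / sum(f)         # 269
sqrt(sum(f * (x - xmean)^2) / sum(f))   # about 16.4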

Refer Slide Time: (43:35)

And this is the screenshot of whatever we have obtained here,

Refer Slide Time: (43:40)

and similarly, at the same time, I will show you what to do if you have some missing data: how to handle the situation and how to find out the variance and standard deviation. So, in this earlier example I am taking the first two values to be missing, and I am storing the data inside a new vector, time.na, where two values are missing. If I want to find out the variance, it is simply the variance of time.na with the option na.rm = TRUE, and this will give me the value of the variance as 250.2647. This will have the divisor n minus 1, and in case you want to convert it to the divisor n, then you know how to get it done. And if I take the square root of this value, this will give me the standard deviation when missing values are present in the data. Right.

Refer Slide Time: (44:32)

And this is here the screenshot. You can see here, now I will come to the R console and I will show you how you are going to obtain these values. Okay.

Video start time (44:48)

So I first take here the time data. You can see here I am copying the data on time; I will put it in the R console and I try to find out here the variance of time, and you see I get here this value. If I want to find out the standard deviation, I simply have to find out its square root, and you can see here this is the value. Similarly, these values have been obtained with the divisor n minus 1; if you want to have the divisor n in the variance, then you just use the same command and you get here the outcome, and if you want to find out the standard deviation with the divisor n, then you simply have to find out the square root of the earlier expression, and this gives you the value 16.40732, and this is here the same screenshot which I

have shown you here. Right. And similarly, if you want to find it out in the case of this grouped data, then first I need to create here the data vectors x and f, which I have already done. So, you can see here, I can clear the screen; x here is my midpoints, f is here like this, and if I try to find out the mean in the case of grouped data, this is coming out to be 56. Now if I try to find out the variance in this case, this is going to be by the same command that we discussed, like this; this is 269. And if you want to find out the standard deviation, you simply have to find out its square root, and this is coming out like this. Similarly, if you want to see how things work in the case of missing values in the data, I simply need to use this command here, and you can see here that this data, time.na, I have already copied here. If you try to find out the variance with na.rm = TRUE, this gives me this value, and if you want to find out the standard deviation in this case, then this is simply the square root of the value that we have just obtained, and it is coming out here like this. And if you try to see here, this is the same screenshot that I have shown you here.

Video end time (47:36)

So now I would like to stop this lecture. We have discussed the concept of variance, which is one of the very popular tools to find out the variability in the data. Please try to understand it, try to grasp it, and try to see how this measure has been developed. Please take some datasets and try to compute the variance with divisor n and with divisor n minus 1, try to compute the mean squared errors around any arbitrary value, and get comfortable in computing the variance, standard deviation and standard error using the R software. So, you practice, and we will see you in the next lecture. Till then, good bye.

Lecture – 21

Variation in Data – Coefficient of Variation and Boxplots

Welcome to the next lecture on the course descriptive statistics with R software. Now, you may

recall that, up to now what we have done? We have considered two aspects of data, one is the

central tendency of the data, and another is variation in data. And, both these aspects they are

very important part of the information which is contained inside the data. Now, I am coming to

another aspect: suppose we have a data set and we want to know the variation in the data set; that should also depend on the measure of central tendency. What does this mean? Up to now, we have taken the aspects of central tendency and measures of variation separately, one by one. Now, I would like to have a measure which can give me the information contained inside the data based on the arithmetic mean and variance together. This will help us in getting an idea about the variability in the data in various types of situations where using either the mean or the variance alone may not really be advisable and may not really give us the correct information. So, in this lecture we are going to first discuss a statistical tool to measure the variation, called the coefficient of variation. And after this, I will consider one quantitative measure and another graphical measure, based on the R software, to have combined information on various aspects.

Refer Slide Time: (02:38)

So, let us start our discussion with the coefficient of variation. The coefficient of variation

measures the variability of a data set without reference to the scale or units of the data. What

does this mean? Suppose I have got two different data sets, one measured in, say, centimeters and another measured in, say, meters. In case you simply try to combine the means and variances of the two data sets and compare them, it might be a little bit difficult.

Similarly, in case if you want to compare the house rents say in India, and say house rent in US,

the house rents in India, they are given in Indian rupees, and they are with respect to the salaries

that we get here. Similarly, if you go to US, the house rents are going to be in terms of US

dollars, and they are also with respect to the salary structure in US, and the salary structures in

US and India, they are very different. So sometimes you have heard that people simply try to

multiply the dollar by the exchange rate and they try to say that, Okay, I am earning this much or

I'm spending this much, or I am paying this much of house rent. So, how to handle these types of

situation, that can be done using the concept of coefficient of variation, Right.

So, this coefficient of variation will be useful to us, when we try to compare the results from two

different surveys or they are obtained from two different tests and they are obtained on different

scales. For example, suppose I have got here two data sets, and we find the arithmetic means and standard errors of the two data sets. So, the sample mean of the first data set is denoted, say, x bar 1, and x bar 2 is the arithmetic mean of the second data set. And similarly, I find the standard errors s1 and s2, so the standard error s1 corresponds to the first data set and the standard error s2 corresponds to the second data set. Now, there are two aspects, mean and standard error, or central tendency and variation. Now, how to compare the two data sets? That is the question that we are going to entertain here.

Refer Slide Time: (05:30)

Now, the definition of coefficient of variation, as we had discussed in the earlier lecture in case

of variance, that the variance can be for the entire population, which is usually unknown or the

variance can be for the sample, that is called as sample variance, which is computed. Similarly,

in the case of coefficient of variation, we have two versions, one for the population and one for

the sample. But, here when we are trying to discuss the tools of descriptive statistics, we want to

compute everything on the basis of given sample of data. So, I am going to discuss here the

sample version of the coefficient of variation. Please keep this thing in mind. Okay, so, once I

have got the data say x1, x2,…., xn, this can be either for the grouped data or say ungrouped data

or whatever you want. Then the coefficient of variation is defined as the standard deviation upon the mean. So, if you remember, we had denoted the sample variance by s square. So, the standard deviation, or in simple language the standard error, whichever you want to call it (I am treating both with a similar meaning, without any loss of generality), is denoted by s, and the sample mean we have already denoted by x bar. So, this coefficient of variation, briefly denoted as CV, is defined as $CV = \frac{s}{\bar{x}}$. Now, if you try to see, a standard deviation will always be positive. So, this coefficient of variation is properly defined only when the mean is positive, or $\bar{x} > 0$. This definition of CV is based on the sample. Now, if you want to understand what the population counterpart is: if I say that sigma square is the population variance and $\mu$ is the population mean, then the CV of this population will be defined as $\sigma / \mu$. But, since we do not know the values of $\sigma$ or $\mu$, we replace them by s for sigma and x bar for mu. So, this gives us the sample-based definition of the coefficient of variation, which can be computed using the data x1, x2,…, xn. Right.

Now, the next question is how to take that decision? Because, coefficient of variation is also a

measure of the variation in the data. So, if I have two data sets, then how we are going to

measure it. And, now you can see here, that if I try to take here two data sets. In which, suppose

the arithmetic mean of one data set is greater than the arithmetic mean of data set 2. And,

suppose the standard deviation of the first data set is smaller than the standard deviation of data

set 2. So, what is happening? In one data set mean is larger, but the standard deviation is less,

and in other data set, just opposite is happening. In that case, which of the data you have to

choose? That cannot be answered directly by using the mean or standard deviation. So, in these

situations the coefficient of variation helps us, and we simply try to find out the coefficients of variation of both the data sets. And, the higher value of coefficient of variation is going to

indicate that the variability is higher. So, the data with higher CV is said to be more variable than

the other.

Refer Slide Time: (10:14)

So, that is again similar to the interpretation of variance: higher the value of variance means more variability, and higher the value of CV means there is more variability. Just to explain this in more detail, let me take a simple example, where two experimenters have collected the data on

the heights of same group of children, one experimenter has taken the observations in meters and

the other experimenter has taken the observations in centimeters. And, they have found the average

height and standard deviation of the two data sets. So, you can see here, I have tabulated the

information, this is the first experimenter, second experimenter. And, first experimenter has

found the average height to be 1.50 meters, and standard deviation to be 0.3 meters. And,

similarly the second experimenter, he has found the average height to be 150 centimeters, and

the standard deviation to be 30 centimeters. Now, this is the usual tendency to compare the

standard deviations. So, you can see here, here the value of a standard deviation is 30. Whereas

here this value is here 0.3 in the first case. So, obviously in the first look, it appears that 30 is

much much greater than 0.3. And, this indicates that the variability in the second data set is very

very high. But, this conclusion is actually wrong. Because, if you try to see both sets of

measurements, they have been taken on different scales, but they have got the same value. The

heights, say $\bar{x}_1$ and $\bar{x}_2$, have the values 1.50 meters and 150 centimeters, which are the same, and similarly the standard deviations, 0.3 meters and 30 centimeters, are also the same.

So, now, in this case, how to report it, how to identify it, how to know it, that is the question. So, in this case, this coefficient of variation comes to our help and rescues us. So, if I try to find

out the value of coefficient of variation in both the cases, then in the first case the coefficient of

variation comes out to be standard deviation divided by mean, which is 0.3 divided by 1.5, and

this comes out to be see here 0.2. And, in the second case also the coefficient of variation comes

out to be 30 upon 150, which is equal to 0.2. So, you can see here, that both the values are same.

And, this is indicating that both data sets have the same variability. And, this was not possible by

looking only at the values of mean and standard deviation.
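
As a quick numerical check of this example in R (the two lines below use only the summary figures quoted above, not any raw data):

0.3 / 1.50   # experimenter 1 (meters):      CV = 0.2
30 / 150     # experimenter 2 (centimeters): CV = 0.2

The CV is unchanged when only the scale of measurement changes, which is exactly the point of the example.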

Refer Slide Time: (13:24)

So, similarly this coefficient of variation also helps us in comparing the data sets on two

completely different measurements. These measures, these observations can be obtained on

different scales. But, the advantage the coefficient of variation has is that the coefficient of variation is dimensionless. So, this helps us in the comparison of the variation in two data sets.

For example, in India if I take an example of rents of houses in a metro city and in a village. We

know that the rents in a metro city in India are very high, whereas the rents in a village are very, very low. And, similarly, if I try to take, say, another example, the rents of

houses in Mumbai, they will be in Indian rupees and the rent of houses in London, that will be in

pounds. Now, how to compare when the data has been obtained in the same unit, as in the first

case when we are trying to find out the rents in India in a metro city and a village and when the

data has been obtained in different units. As in the case of, rents of houses in Mumbai and

London, so how to compare? Then again, in this case, this coefficient of variation helps us.

Refer Slide Time: (14:57)

Now, the question is how to use this concept of coefficient of variation in making a decision.

So, the data set having higher value of coefficient of variation is thought to have more

variability. So, definitely when we have lower value of coefficient of variation this is preferable.

For example, in case if I have two data sets and suppose we have computed their coefficients of

variation as CV1 and CV2. Then, if CV1 is greater than CV2, then we consider or we say that

the data with CV1 has more variability, or say less concentration around the mean value, than the data with CV2, that is, the second data set. And, similarly, in case I have the opposite, that CV1 is smaller than CV2, that means the data with CV1 has a smaller variability than the data with CV2,

Right.

Refer Slide Time: (15:53)

Now, the next aspect is how to compute it on the R software. So, as such there is no built-in

command inside the R software to compute the coefficient of variation. But, computing the

coefficient of variation is very simple and straightforward. This is only a function of standard

deviation and arithmetic mean. And, we already have learnt how to compute the standard

deviation and how to compute the arithmetic mean. So, just by using the same commands, we

can always compute the coefficient of variation. So, this is how we are going to compute the

coefficient of variation. Okay. So, if I say that we have got a data vector x, then the coefficient of

variation is going to be defined like this. What is this? The coefficient of variation is simply here standard deviation upon mean. So, the standard deviation is nothing but the square root of the variance, and the mean is given by the function mean of x. So, this is what I'm trying to do: first I'm trying to find out the square root of the variance, that will give me the standard deviation,

divided by mean of x. And, this will give me the value of coefficient of variation.
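
Since the slide itself is not reproduced in the transcript, the following is a sketch of the syntax being described, using an illustrative data vector:

x <- c(12, 15, 9, 22, 18)   # illustrative data vector
sqrt(var(x)) / mean(x)      # square root of the variance, divided by the mean
sd(x) / mean(x)             # equivalent, using the built-in sd() function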

Now, if I ask you how you would compute the coefficient of variation in case the data is

missing, then it is very simple. How? We already have learnt how to compute the standard

deviation when the data is missing, and we also have learnt how to compute the arithmetic mean

when the data is missing. So, you simply have to use the same syntaxes, same functions and you

have to write down the syntax for computing the coefficient of variation. So, if you recall that

what we had done earlier. Suppose my data vector x has got some missing values which are

denoted by NA, and I'm denoting this data vector as xna. So, now I would try to find out the

variance of this xna by using the command with the argument na dot rm is equal to TRUE.

So, this will help us in finding out the variance when the data vector has missing values and then

we try to find out the square root, which will give me the standard deviation. So, this function

will give me the standard deviation in the presence of missing data. And, similarly the mean

function on the data vector xna with the argument na dot rm is equal to TRUE, will give me the

value of the mean when the data is missing. And, by using this command, we can always find out

the value of the coefficient of variation. And, now using the same command, you can also write

down the syntax and command for computing the coefficient of variation in case of grouped

data, that is not so difficult, Right.
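
A minimal sketch of the same computation in the presence of missing values, with an illustrative vector (the na.rm = TRUE argument removes the NA values inside var() and mean()):

xna <- c(12, NA, 9, 22, NA)                              # data vector with missing values
sqrt(var(xna, na.rm = TRUE)) / mean(xna, na.rm = TRUE)   # CV ignoring the NAs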

Refer Slide Time: (19:03)

Now, I will try to take a small example to show you how to measure it. Suppose I have

collected the data on 20 participants, in the time taken in a race and this data has been recorded

inside a variable time. So, this is the same example that I have used earlier, now in case if I want

to find out the coefficient of variation of time, so CV of time that is simply here standard

deviation of time divided by mean of time. So, I am using here this syntax and this is giving me

this value 0.3 and so on.
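
The 20 observations themselves are not reproduced in the transcript, so as a hedged sketch, the computation can be wrapped into a small helper function and applied to the lecturer's time vector (the name cv is my own choice, not from the lecture):

cv <- function(x, na.rm = FALSE) sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
# cv(time)   # would reproduce the value (about 0.3) reported above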

Refer Slide Time: (19:45)

And, suppose the data is missing; in that case I have the same data in which I have

replaced first two values by NA, and this data has been stored inside a new data vector time dot

na. And, I try to use the same command to compute the standard deviation in the presence of

missing value and the command for mean, in the presence of missing values and I try to take the

ratio and this will give me the coefficient of variation. And, this value comes out to be here 0.27.

Refer Slide Time: (20:19)

And, you can see here, this is the screenshot of the same thing that I have done. Now, I will try

to show you on the R console also.

Video Start Time: (20:27)

I'm trying to take the command, and this data has already been copied, which is the same data set named time. And, if I try to find out the coefficient of variation, this is giving me this thing, Right.

And, similarly, in case you want to find out in the presence of missing values, then, I already

have stored the data time dot na, you can see here and if I try to use the same command on this

data set, you will get here the same value. And, the same value has been reported here in this

slide, this is the screenshot.

Video End Time: (21:08)

Now, I have completed the different types of measures of variation that we had aimed at. Now, I'm trying to address another aspect: when we get the data, the data has all sorts of features, and we would like to have all sorts of features, like measures of central tendency, partitioning values, variation and so on. Up to now, what we have done is take all these aspects one by one, that is, how to compute the maximum among the data values, minimum, range, quartiles and so on. In the R software there is a command by which you can compute all these values, like the minimum value, maximum value and different types of quartiles, in a single shot.

Refer Slide Time: (22:18)

So, I'm trying to discuss now this command, which is a summary command. So, in R, there is a

command, summary (s u m m a r y). And, this summary command provides us comprehensive information on different types of quartiles, mean, minimum and maximum values

of the data sets. So, if my data vector is denoted by x, then we use the command s u m m a r y

summary, and inside the arguments, here, x. And, if you try to use this, then the outcome of this command will give us information on the minimum value, maximum value, arithmetic mean, first quartile, second quartile (this is the median) and the third quartile of the data set

which is contained inside the data vector x.
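
As a sketch with an illustrative vector (the exact console formatting may differ slightly):

x <- c(32, 35, 45, 50, 60, 71, 84)   # illustrative data vector
summary(x)
# Min. 32, 1st Qu. 40, Median 50, Mean approx. 53.86, 3rd Qu. 65.5, Max. 84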

Refer Slide Time: (23:13)

Now, let us try to take an example to understand it. And, now I take the same example that I took earlier, where we have the data on 20 participants in the time taken in a race, and the data

has been stored in a variable here time. So, now when I say summary time, then on the execution

we get here an outcome like this one. So, you can see here there are 6 values, first value here,

second value here, third value here, fourth value, fifth and sixth. So, now if you try to understand

what are they trying to give us. The first value here is giving us the minimum value. Minimum of

the values contained in that time data vector, which is 32, and we can see here this is here the 32.

Similarly, the second value is giving us the value of the first quartile. Remember, this is the quartile, not the quantile, Right. So, you know how to compute the first, second or third quartile, but here, in this case, you need not compute it separately; you get it inside the same outcome.

Similarly, the third one is the median value, which is the second quartile. The fourth value is here the mean, that is, the arithmetic mean. Similarly, the fifth value is the value of the third quartile, and the last one, the sixth one, is the maximum value of these observations, which is here 84,

you can see here this is the value. So, by using this summary command you can get all this

common information in a very comprehensive way and that is the advantage here.

Refer Slide Time: (25:30)

And, here you can see here this is here the, the screenshot. I will try to show you on the R

console also so if you try to come here.

Refer Slide Time: (25:54)

If you see here, the data on time is already stored there, and if I try to write down here summary of time, we get here this value, Okay. Now let us come back to our slides and try to discuss another aspect. So, now you can see that this summary command is

trying to give you different types of information, minimum, maximum, quartiles, mean, in a

comprehensive way and when we started the discussion on descriptive statistics I had told you

that there are two types of tools, one are quantitative tools and others are graphical tools. So, now

the next aspect is this: whatever information has been provided by the summary command, can it be represented in a graphical way? And now, what will be the advantage of this thing?

Suppose you have two data sets or even more than two data sets, and if you want to compare all

the characteristics at the same time, you can use the summary command on all the data vectors

and also you can create the graphics. So, graphics will give you a visual comparison of the

information contained inside the different data sets, and in order to do so, there is a graphic

which is called as box plot. So, now I try to discuss what is a box plot, and how to construct it

inside the R software, okay.

Refer Slide Time: (27:24)

So, this box plot is actually a graph which summarizes the distribution of the variable by using

different types of information like the median, quartiles, minimum and maximum. But remember, this

will not give you the value of the mean. So, this box plot actually looks like this. Now you can see here the graph: there are two lines here, one at the bottom and the other at the top, in green color; please try to observe the pen. Of these two lines, the bottom one is giving the minimum value of the data set and the upper one is giving the maximum value of the data set.

Now, in case you try to find out the difference between the minimum and maximum values, what will you get? You will get here the value of the range. And similarly, if you try to look at these three lines which I am indicating here, first, second and third. So, if you try to see, this is here a

sort of box, and the lower edge of the box, which is here, is giving you the information on the first quartile. Now, I will change the color of the pen, so you can see it clearly.

Similarly, the upper edge, which is here, this is giving us the information on third quartile. So, if

you try to find out the difference between the two, don't you think that this will

give you an idea of the quartile deviation and also in some sense, it will give you the information

on interquartile range. These two measures we had discussed as the measure of variation. Now,

finally, if you try to look at the line in the middle of the box, this line is going to give you information on the median, which is the second quartile Q2. So, you can see that inside this box there are several measures which are combined, and just by looking at this difference and this difference, you can compare the variation; by looking at the middle value, you can compare

the median and so on.

Refer Slide Time: (30:12)

So, let's try to first see the applications through the software. Inside the R software there is a

command here box plot and inside the arguments you have to give the data vector for which you

want to create a box plot, and there are several arguments, that is, several options available in the plotting of the box plot. I would request you to go to the help menu and try to see what are the

different arguments, and what are their uses in creating the box plot, you can make the legends,

names, colors etcetera, shapes, etc.
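
A small sketch of such optional arguments (all of them standard boxplot() or title options; the labels are illustrative):

x <- c(32, 35, 45, 50, 60, 71, 84)
boxplot(x,
        main = "Time taken in a race",  # plot title
        ylab = "time",                  # y-axis label
        col  = "lightblue")             # fill colour of the box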

Refer Slide Time: (30:45)

So, let me take here the same example which I considered earlier. So, this is the same data set on time, and now I have created here the box plot on the data set time. So, you can see here, this upper value is going to give me the maximum value of this time, which is here 84. You can see

here this is somewhere here and similarly the bottom line this is going to give me the minimum

value of this data set, which is here 32; somewhere here you can see it in the data set. Now, these three values, the bottom value, the second value and the third value, are trying to give me the information on the first quartile, the second quartile in the middle, and the third quartile at the upper edge.

So, you can see here these values are somewhere here, so you can compare it with the values that

you have just obtained in the summary command and by looking at the difference of these two

edges here and these two edges here you can have the idea on the variation in the data in terms of

range and quartile deviation, or say interquartile range. Now, you can see here why this plot is

called as box plot? You can see here that all the information is contained inside this box so that

was possibly the reason that it was called as box plot.

Refer Slide Time: (32:29)

Now, what is the use of this box plot, how it is going to help us?

Refer Slide Time: (32:38)

So, now what I have done here first you please try to notice, what are the first two values which I

am indicating here, these are 32 and 35.

Refer Slide Time: (32:48)

Now, what I do: in the same data set, I try to make them 320 and 350, very high values, and I give this data a new name, time1, and then I try to create the box plot of time1. You can see here that this

box plot is very, very different than what you have obtained in the earlier case. Not only this, it is

also showing you that there are two observations which are possibly the extreme observation. So,

by looking at this graph this is giving you information, well when you are trying to analyze that

data please try to take a look at the values, which are somewhere between 300 and 350. These

values are unusual because they are very, very far away from the remaining data, all our data is

lying between 50 and 100 here, where these two values are lying between 300 and 350. But my

objective was that I want to compare it. So, I try to artificially make these two plots side-by-side.
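
A minimal sketch of the manipulation just described, using an illustrative stand-in for the race data (the original 20 values are not reproduced in the transcript):

time  <- c(32, 35, 45, 50, 60, 71, 84)   # illustrative stand-in for the time data
time1 <- time
time1[1:2] <- c(320, 350)                # make the first two values very large
boxplot(time1)                           # the two extreme values appear as separate points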

Refer Slide Time: (34:01)

So, I have simply copied and pasted these two graphs manually and you can see here that the

first graph is for the data set time, and the second graph is for the data set time1. But now you can

see, still we are not very comfortable. Why? Because the ranges on the y-axis in the two data sets are different. So, they are not really comparable; well, they are comparable, but you need to put in harder work in the comparison. So, we would like to make a graph in such a way where we

have only one common boundary on, say, the x and y axes, and both plots should be inside the same boundary, so that they are comparable. So, in order to do so, what do we have to do? This I am now going to discuss and demonstrate.

Refer Slide Time: (34:59)

We have a graphic which is called a grouped box plot. This graphic will combine different data sets and it will create their box plots inside a single frame, using the data in the format of what is called a data frame. What is this data frame, and why is it needed?

Suppose we have here three data vectors, which I am indicating by x, y and z. So, what we want

here is the following that we want here a graphic, which is enclosed inside this rectangle where,

there are three box plots like this and they are indicating the box plots for the three data vectors

x, y, z. So, in order to create this combined box plot, first we need to combine that data, and in

order to combine the data, we have a concept of data frame, and we do as follows that we use the

command here data dot frame and inside the arguments we try to give the names of the data

vector which are separated by commas and this will give us a data set in the framework of the

concept of data frame and this data will be a combined data set.
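
A short sketch of this idea with three simulated vectors (the names x, y, z follow the slide; the data are illustrative):

x <- rnorm(50, mean = 10, sd = 2)
y <- rnorm(50, mean = 12, sd = 1)
z <- rnorm(50, mean = 11, sd = 3)
boxplot(data.frame(x, y, z))   # three box plots inside one common set of axes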

Now, at this stage a question crops up: what is a data frame? Well, a data frame is a method to combine different types of data sets in R. Well, it is not really possible for me to explain this

concept in detail here but these concepts have been explained in the lectures on the course

introduction to R software, if you wish you can have a look on those lectures or you can look

into the help menus of the R software to have an idea about the concept of data frame, Right. But

in this lecture, I will be using only this command just to combine the data sets, so I am not going

into that detail. I have given you the command that how to use it. If you want some advanced

features, advanced knowledge above this concept, I would request you to look into the books and

help menus, Okay.

Refer Slide Time: (37:47)

So, now my objective is very simple, I would try to create a box plot for this data set which has

been combined using the concept of data frame.

Refer Slide Time: (37:56)

So, now, what I do here is the following: I try to take the same data sets time and time1, and I try to combine them and create a data frame using the command data dot frame, and inside the argument, time separated by a comma and then time1, and I store this data as, say, data box plot.

Refer Slide Time: (37:56)

And, after this, I use the command box plot and give the data inside the argument as data box plot, which has been obtained through the data frame. Now you can see here that both the graphics have been combined together, Right. This is the box plot for time, and this is the box plot for time1, and here you can see that it is indicating the presence of two extreme

observations, Right. So, in this case you can see that we have combined the two box plots, but

they are not really informative, because the ranges of the two box plots are very different, Right. This one here is indicating that there are two extreme observations, so this is giving a different box plot than what we wanted. The ideal thing is that, after looking at this picture, you first try to remove the extreme observations and create the box plot again,

and you will see that this will give you more information, and I would like to illustrate the same

thing with a different example.

Refer Slide Time: (39:39)

So, this is here the screenshot of the creation of data frame of time and time1 data.

Refer Slide Time: (39:47)

Now, I will take another example, to show you the utility of box plots. Suppose the marks of ten

students in two different examinations are obtained and we would like to compare the marks

using the concept of box plots. So, the marks in the first examination are stored here inside the data vector named marks1, and the marks of those ten students in the second examination are stored here, in the data vector marks2. So, what I want here is the

following? I want here a graphic like this one, where there are two box plots one indicating the

marks1 and another box plot indicating the marks2. So, in order to do so, first I try to create here

a data frame using the command data dot frame and inside the arguments, I try to give the data

vector, which I would like to combine. And, this data suppose I am trying to store as say data

marks, Right, Okay.
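
The actual marks are not reproduced in the transcript, so the following sketch uses illustrative values and a hypothetical name, datamarks, for the combined data:

marks1 <- c(45, 52, 60, 48, 55, 62, 50, 58, 47, 53)   # illustrative marks
marks2 <- c(55, 65, 70, 60, 72, 68, 75, 63, 66, 71)   # illustrative marks
datamarks <- data.frame(marks1, marks2)
boxplot(datamarks)   # two box plots, side by side, on one common scale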

Refer Slide Time: (40:59)

Now, this is the screenshot of this operation, and now I try to create the box plot of this combined data, which has been obtained in the framework of a data frame, like this: box plot of the data marks.

Refer Slide Time: (41:03)

Now, you can see here a very nice and clear picture. By looking at these two values, these green color lines, you can easily compare which data set, or which of the two groups of students, has got the lower marks, or the minimum marks. Similarly, if you try to look at the red-pink lines which I am highlighting here, by comparing these two you can have an idea about the

maximum marks obtained by the students. Similarly, if you try to look in the middle value these

are trying to give you the idea of the medians. So, by looking at these values you can see here,

that the median in the marks2 case is higher than the median marks in the case of marks1.

Similarly, in case if you try to compare this orange color part, this will give you the idea of the

first quartile, and similarly, if you try to compare the third quartiles, highlighted by these violet lines, you can see here that the third quartile Q3 in the case of marks2 has a higher value than in the case of marks1. And you can see here the range in the marks1

and the range in the marks2. So, you can see here that very clearly that the range of the marks2 is

higher than the range of the data in marks1, and similarly you can also compare the quartile

deviation and interquartile ranges, so you can see here this is how we can obtain it, now I would

try to show you the construction of the box plots and the group box plots on the R software,

Right.

Refer Slide Time: (43:14)

So, you can see here we already have the data on time.

Refer Slide Time: (43:19)

So, if I try to create here the box plot, it will look like: box plot of, say, here, time.

Refer Slide Time: (43:29)

And you can see here this comes out to be like this.

Refer Slide Time: (43:34)

Now, similarly if you try to take another data set time1.

Refer Slide Time: (43:48)

Where we have increased the values of the first two observations.

Refer Slide Time: (43:50)

Then in this case this data set is here time1 and you can see here that there are two extreme

values.

Refer Slide Time: (43:52)

And, if you try to create here the box plot of the same.

Refer Slide Time: (43:54)

This comes out to be here like this which we had reproduced in the slides, Right.

Refer Slide Time: (44:03)

And, in case you try to combine here the data, see here: data time is equal to data dot frame, and inside the bracket, time separated by a comma and then time1, argument closed.

Refer Slide Time: (44:32)

You will get here the combined data on time and time1 in the data frame mode, and you can see here, this is the data which I have obtained here.

Refer Slide Time: (44:37)

Now, I will simply try to make here the box plot of the same data set.

Refer Slide Time: (44:43)

So, you can see here, if I try to create the box plot, this comes out to be here like this, which we had reproduced in the slides.

Refer Slide Time: (44:53)

And, similarly if I try to take here the data on marks.

Refer Slide Time: (45:04)

So, the data on marks1 here is like this, and the data on marks2 here is like this.

Refer Slide Time: (45:00)

You can see here, this is the data on marks1 and marks2, Right.

Refer Slide Time: (45:14)

And, I would like to combine this data into a data frame.

Refer Slide Time: (45:19)

So, I try to use the data frame command, and you can see here this data has been combined, see here, into a data frame like this, Right.

Refer Slide Time: (45:29)

So, I would now like to create a box plot of the same thing, the box plot of the data marks. You can see here, now, this looks like this, so you can compare it and can draw some fruitful conclusions, Okay.

So, now I would stop here in this lecture and I would also complete the discussion on the topics

of the different measures of variation. So, we have discussed different types of measures of variation, and every measure will give you different information and a different numerical value. Your

experience in dealing with the data sets and using these measures will give you more insight into

the aspect of how to interpret, how to say whether the variability is low or high; this is always a relative term. But, remember one thing: from the statistics point of view, in case the data has very high variability, then most of the usual statistical tools will not really work well.

They will give you some information, but that information may be misleading. So, it is very important to use the appropriate tool to bring the information out from the data regarding its inherent variability. Different samples taken by different people from the same population may have different variation, so if you try to use different types of tools, ideally all the tools

should give you the same information, but they will have different numerical values. So, I would

request you to please try to look into the books, try to understand the concept of variation in the data, try to look at the different drawbacks and different advantages of all these tools; all the tools

cannot be applied in all the situations, and more importantly, how to compute them on the R

software, this is what you have to learn. So, I would request you that you take some datasets and

try to employ all the tools whatever you have done up to now. Different measures of central

tendency, different measures of variations and try to see how these values are trying to provide

you different pieces of information. So, you practice, and we will see you in the next lecture with

a new topic. Till then, good bye!

Lecture - 22

Moments - Raw and Central Moments

Welcome to the lecture on the course descriptive statistics with R software.

Refer Slide Time: (00:18)

From this lecture, we are going to start a new topic, and this is about moments. What are these moments, and why do we need them? You may recall that, up to now, in the earlier lectures, we have considered two aspects of the data information. First is the central tendency of the data, and second is the variation of the data, and we have developed different types of tools, like the arithmetic mean, standard deviation, variance and so on, to quantify that information. Similar to this,

there are some other aspects, like the symmetry of the frequency curve, or how the values are

concentrated in the entire frequency distribution. Similarly, there is another aspect, what is the

hump of the frequency curve? So, in order to study all these things, and some other important properties of a statistical tool, we need a concept which is called the concept of moments. So

essentially you will see in this lecture that, moments are some special type of mathematical

functions, and using these moments, we can quantify different types of information, which is

contained inside the data or the frequency table. So, let us first try to understand what are these

moments?

Refer Slide Time: (2:17)

So, moments are essentially used to describe different types of characteristics of a frequency

distribution, different features of a frequency distribution, like central tendency, dispersion, symmetry, peakedness, etc. Now, if you try to understand how these things have been

developed? So, if I try to see here, when I wanted to study the central tendency, then we defined

arithmetic mean, and what we did, we had observations x1, x2 up to here xn, then what we did, we

just added them, and we divided it by the number of observations, say here n. So, this gave us the

information about the central tendency of the data that, where the mean or the average value of

the data set lies in the frequency distribution or a frequency curve, Right! Similarly, when we

studied the variation in the data, we defined a quantity like variance or absolute deviation, I will

try to show you both, what we did, in case of variance, we had the data x1, x2 up to here xn, and

then what we did, we took the differences of these observations from their arithmetic mean, which

we call as deviations. So, we took the deviations of each observation around the arithmetic mean.

So, we obtained x1 minus x bar, x2 minus x bar, up to here xn minus x bar, and after this, we

squared them, all the deviations were squared, and after this what we did, we simply took the

arithmetic mean of these squared deviations, and similarly when we computed the absolute

deviations, then we had considered the observations, x1, say here x2, up to here xn, and then when

we wanted to study the absolute, say mean deviation, then what we did, we simply found the

median of these observations, and we took the deviations from their median, like this: x1 minus the median, x2 minus the median, up to xn minus the median, and after this, what we did, we simply took the absolute

value of these things, and after this we just took the arithmetic mean of these values, now you

can see here, what are we trying to do? In the case of arithmetic mean, we are trying to find out

here $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; in the case of variance, we are trying to find out $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$; and in the case

of absolute deviation, we are trying to find out the arithmetic mean of the absolute deviations of

xi, from the median. So, now looking at these three examples, you can see here, what are we

trying to do, we are essentially trying to consider the deviations, say x1 minus, some arbitrary

value here A, the deviation of second observation from some arbitrary value here A, and

deviation of the nth observation from some arbitrary value A, and then I am trying to use a

suitable power function on this deviation. For example, if you try to look over here, please try to

observe my pen, here you are trying to use the square. So, I can make it more general:

instead of using here the square, I can replace it by here some general quantity, say here r. So, I

try to consider, now here the deviation of observation x1 minus A, x2 minus A, xn minus A, that

is the deviations of observation around any arbitrary value A, A is say any value, any arbitrary

value which we have to choose. So, now I try to take the rth power of these expressions, and then

I try to find out their average; this I can express as $\frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r$, and now you can see that this

function has several advantages, for example, if you try to take here r =1, and A = 0, you get here

nothing but the arithmetic mean $\bar{x}$; if you take here r = 2, and A equal to, say, $\bar{x}$, then you get

here, variance of x. Similarly, if you try to choose other values, we can define some other

functions which are going to represent the different characteristics of the frequency distribution.

So, in general I can call this function as, say rth moment, and now you know that this is what I

have defined for the data, which is observed as, x1, x2, xn, and that is ungrouped data, and if you

want to define the same quantity for, say here, some grouped data, then this can be simply

expressed as $\frac{1}{n}\sum_{i=1}^{K} f_i (x_i - A)^r$, and so on. So, now this is the basic idea behind defining a

function which is called as moment. Now, what is the advantage of this thing, you will see here

that when I try to take specific values of r and A, we get different types of characteristics, for

example, when I choose r equal to 1 and A equal to 0, I get the arithmetic mean, which is giving us the information on the central tendency of the data. Similarly, if I take r equal to 2, and A is

equal to sample mean, then this quantity is giving us the information about the variation in the

data. Similarly, we can think about that there are some other properties of the frequency curve

like the symmetry or its hump, the hump of the curve; how to quantify that information? So, on a

similar line as these two values are going to give us the information on central tendency and

variation in the data. So, we can use the concept of moment to quantify the information, about

the symmetry of the frequency curve or the hump of the frequency curve, but in order to do so,

first we need to understand the concept of moment, and what are the basic definitions in grouped

case, in ungrouped data case, and beside the idea which I have given you here, there are some

other aspects. So, in this lecture, I am going to consider the raw moments and central moments, and in the next lecture I will try to show you how to compute them on the R software. So, in this lecture, please try to understand the basic concepts; they are going to help you in the forthcoming lectures and in the future.

Refer Slide Time: (11:58)

Before going into the further details, let me specify my notation for grouped and ungrouped data.

Now, I'm going to consider two cases, case one where the data is ungrouped, and the variable X

is discrete, and we have obtained n observations on X, which are denoted as small x1, small x2,

small xn.

Refer Slide Time: (12:28)

Similarly, the second case is when we have data which is grouped, and we have a variable X,

capital X which is continuous in nature, and the data is obtained on X, and the same data has

been tabulated in K class intervals in a frequency table like as here. First column of the

frequency table is giving us the class intervals here, like e1 to e2, e2 to e3 and so on. So, these are essentially K class intervals here, and the second column is giving us the information on the midpoints of these class intervals, denoted here as x1, x2, xk. So, x1 is

going to give us the information on e1 plus e2, divided by 2 and so on, which is obtained from

here. So, the second column is giving us the value of x1, x2, xk, but here x1, x2, xk are the

midpoints, and similarly in the third column we have the frequency data f1, f2, fk which is

denoting that, the first interval e1 to e2, whose midpoint is x1, has frequency f1, e2 to e3 which is

having the frequency f2 and so on, and the sum of all these frequencies is denoted here by n. So, I

can see here that the midpoints x1, x2, xk they have got the frequency f1, f2, fk respectively.

Refer Slide Time: (14:04)

Now, after this, I first try to define the moments about any arbitrary point. So, let me define here the general rth moment as we have just discussed. The rth moment of a variable x about any arbitrary point A, based on the observations x1, x2, xn, is defined as follows. For the case of ungrouped, discrete data, this is simply $\mu'_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r$, the arithmetic mean of the rth powers of the deviations xi minus A; this is denoted as $\mu'_r$, where this symbol is $\mu$ and this symbol here is called prime. So, this is the basic definition of the rth moment

for the ungrouped data around any arbitrary point A. Now, similarly if you have group data on a

continuous variable, the same can be similarly defined: now I am trying to find out the weighted mean of the rth powers of the deviations, where the weights are given by the frequencies, that is, $\mu'_r = \frac{1}{n}\sum_{i=1}^{K} f_i (x_i - A)^r$, Right! Although, I agree that these two, the $\mu'_r$ for the grouped and

ungrouped data, they should have different type of notation, but we are using here same notation,

because at a time we are going to handle either the discrete case, or the continuous case. So, this

is how we try to define the rth moment of a continuous variable x, around any arbitrary point A,

and here this n is going to be the sum of all the frequencies. So, you can see here, this is the same

function that we have just developed.
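
Although the R computations are taken up in the next lecture, a minimal sketch of this definition for ungrouped data can be written directly from the formula (the function name moment_about is my own, not from the lecture):

moment_about <- function(x, A, r) mean((x - A)^r)   # rth moment of x about A

x <- c(2, 4, 4, 6, 9)                 # illustrative data
moment_about(x, A = 0, r = 1)         # = mean(x), the arithmetic mean
moment_about(x, A = mean(x), r = 2)   # = sample variance with divisor n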

Refer Slide Time: (16:32)

Now, I try to give you the idea of what is called here raw moments. Now you will see that, most

of the things they are simply the particular case of what we have obtained. So, as you had seen in

the case of arithmetic mean, we are trying to take the mean of say xi. So, similarly instead of

taking here xi, I am trying to take the rth power of xi, and this is going to give us the value of the

rth moment, or this is called as the rth raw moment. So, if you try to see this is nothing but in the

earlier case, if you try to substitute A equal to 0, you get the same thing. So, the rth sample moment around the origin (A = 0) is called a raw moment, and it is defined as $\mu'_r = \frac{1}{n}\sum_{i=1}^{n} x_i^r$ in case the

data is discrete, and similarly in case, if you have the group data on a continuous variable, the

definition of discrete can be extended to continuous case, where you are simply trying to find out

the average of the group data, say fi, xi, and then you try to convert xi, into xi raised to power of

here r, and here n is going to be the sum of all the frequencies, that is, $\mu'_r = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^r$. So, this is how we define the rth raw moment in the case of continuous data. One thing you would like to notice here is that in the first

line of the definition I am using here a word sample, you can see my here pen, why I'm trying to

write down here the sample, and that is inside the bracket, what does this mean? Now, you can

see here, I will try to explain this concept in this color, so you can notice it, suppose I try to take

the first case, where I have defined the rth moment for the ungrouped data, you can see here, that

I am trying to take here the sum over the observed data, what does this mean, that I have taken a

sample of data and I am trying to find out the average over the sample values, but if you try to

see, there are two counter parts one is the sample and another is the population. What is the

sample, this is only a small fraction from the population. So, as we have computed this quantity

on the basis of a sample, the similar quantity can be estimated for the entire population units, and

essentially in statistics, we are interested in knowing the population value, but since this is

unknown to us, so we try to take a sample and we try to work on the basis of sample. So, as we

have defined the moments, or say rth moment or raw moment or other types of moment that we

are going to define, we are going to define on the basis of sample, but surely there exists a value

in the population. So, the counterpart of the sample is the population, and when we try to

compute the same moment, on the basis of all the units in the population, that would be called as

the population moment, or the moment based on the population values. So, that will be defined

here, for example, if I try to define this $\mu'_r$ for the population values, and suppose I say that there are, say, capital N values in the population, then in this case this is going to be $\mu'_r = \frac{1}{N}\sum_{i=1}^{N} x_i^r$. So, N is here the total number of units in the population. Now, there are two

counter parts, one for the population and, say another for the sample but obviously our interest

here, is to compute the values on the basis of sample. So, in this lecture and in the previous

lecture, I will try to define these values, and you have to just be careful that whatever we are

doing here, they are on the basis of sample. So, in practice we always do not call it as a sample

moment, but we simply call it moment, but the interpretation is that, that the moments have been

computed on the basis of given sample of data. So, this is what you always have to keep in mind,

Okay. So, now we try to take a particular example,

Refer Slide Time: (22:03)

of this raw moment, and I simply try to choose two values r equal to 1, and r equal to 2, and then

in the case of ungrouped data, you can see here, as soon as I say r equal to 1, I get this value: $\mu'_1 = \frac{1}{n}\sum_{i=1}^{n} x_i$, and if you try to identify what this is, this is nothing but your arithmetic mean, and this $\mu'_1$ is called the first raw moment. And similarly, if you try to put r equal to 2, then I get here $\mu'_2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2$; this quantity $\mu'_2$ is called the second raw moment, and if

you try to recall, where this quantity was occurring, you may recall that when you wrote the

expression of the variance of x, this was written as $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, which we have written as $\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$. So, you can see here that the first term here, this one, is nothing but your second raw moment, and this quantity here, $\bar{x}$, is simply the first raw moment. So, now this

will give you an idea of why these raw moments are needed, and now you can see here that the raw moments are going to help us in defining the variance of x, Right! So, in this case, I can redefine the variance as $\mu'_2 - \mu'^2_1$. And similarly, if you try to take the grouped data, then when I substitute r equal to 1, I get here $\mu'_1 = \frac{1}{n}\sum_{i=1}^{K} f_i x_i$, which is

the arithmetic mean once again, and when I try to put r equal to 2, then I get $\mu'_2 = \frac{1}{n}\sum_{i=1}^{K} f_i x_i^2$, defined here like this, and this also has the same interpretation: if you try to see the variance of x, we had defined it as $\frac{1}{n}\sum_{i=1}^{K} f_i (x_i - \bar{x})^2$, and we had denoted it as $\frac{1}{n}\sum_{i=1}^{K} f_i x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{K} f_i x_i\right)^2$. So, this is nothing but your $\mu'_2 - \mu'^2_1$. Now, see here by this example that the interpretation and the utility

of the moments in case of grouped and ungrouped data, this is similar, Right! What you have to

just observe here that, if I try to choose here r equal to 0, then in this case this $\mu'_0$ becomes 1,

and this is true for both the cases, for ungrouped data and for grouped data. Now, after discussing

the concept of raw moment,

Refer Slide Time: (25:51)

let me define here what is called here as central moments. So, we had discussed the rth moment,

around any arbitrary point, say A, $\frac{1}{n}\sum_{i=1}^{n}(x_i - A)^r$, and in this case, if you simply try to choose A to be the sample mean $\bar{x}$, then whatever is the moment, that is called a central moment. So, the

moments of a variable X about the arithmetic mean of the data, they are called as central

moments. So, now if I want to define the rth central moment based on the sample data x1, x2, xn,

then in the case of ungrouped data, I simply have to replace A by $\bar{x}$. So, this quantity becomes

here like this, so if you try to see here this quantity here is nothing, but the rth power of

deviations, and then the arithmetic mean of these rth powers of the deviations has been taken, and this quantity is defined here as, say, $\mu_r$; there is no prime in this case. So, that is the standard notation: the raw moments are denoted by $\mu'_r$, and the central moments are denoted by $\mu_r$. So, the rth central moment in the case of ungrouped data is given by $\mu_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$, and a similar definition can also be extended to the case of grouped data, and in this

case, this is the rth power of the deviation, and they are multiplied by the frequency fi, and then

the arithmetic mean of this quantities have been taken where, n is equal to the sum of all

frequencies. And what you have to just notice is that in this case you are going to compute the arithmetic mean by this expression, that is, $\bar{x} = \frac{1}{n}\sum_{i=1}^{K} f_i x_i$, whereas in the case of discrete data you compute $\bar{x}$ simply as $\frac{1}{n}\sum_{i=1}^{n} x_i$. So, this quantity is called the rth central moment of

our variable X. Now,

Refer Slide Time: (28:34)

let me try to choose here, r equal to 1, and r equal to 2, and see what happens to the first and

second central moments. So, first I try to take the case of ungrouped data, and this case if I try to

substitute r equal to 1, you can see here what I will get: $\mu_1 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$. So, this I can write down as $\frac{1}{n}\sum_{i=1}^{n} x_i - \frac{n}{n}\bar{x}$, so this quantity becomes $\bar{x} - \bar{x}$, which is equal to 0.

Now, I can say that here, that the first central moment, in case of discrete data will always be

equal to 0, and in the next slide, I will show you, that the same is true in the case of continuous

data also. So, in general I can say, that the first central moment is always 0, similarly when I try

to substitute r equal to 2, then what I get here is $\mu_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Now, can you

identify this thing, what is this thing? This is nothing but your sample variance, what we have

defined in the earlier lectures. Now, I can say, that the second central moment which is denoted

by $\mu_2$, is representing the variance, or the sample variance. One thing where I would like

to have your attention is the following, you can see here, the second central moment is going to

represent the variance, and first central moment is always 0. Whereas, the first raw moment that

was denoting the arithmetic mean. So, when you are trying to interpret these moments you have

to be careful while making an interpretation for the arithmetic mean. Arithmetic mean is

represented by the first raw moment, while the first central moment is simply denoting the average of the deviations around the mean, which is always 0.
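
A small sketch verifying these two facts numerically (the function names are my own; divisor n matches the definitions above, whereas R's built-in var() uses divisor n - 1):

x <- c(2, 4, 4, 6, 9)                           # illustrative data
raw     <- function(x, r) mean(x^r)             # rth raw moment
central <- function(x, r) mean((x - mean(x))^r) # rth central moment

central(x, 1)             # first central moment: 0 (up to rounding error)
central(x, 2)             # second central moment: the variance with divisor n
raw(x, 2) - raw(x, 1)^2   # the same value, via the raw moments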

Refer Slide Time (31:21)

Now, in this step, I would also try to show you here that when I try to write down this expression, $\mu_2$ was written as $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, which was written as $\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$. So, I can write down here: this first quantity is nothing but my $\mu'_2$, and this quantity here, x bar square, is $\mu'^2_1$. So, $\mu_2$ becomes $\mu'_2 - \mu'^2_1$. So, you can see here that this is a relationship between the second central moment and the first and second raw moments.

So, what am I trying to explain to you here? I am trying to show you here that there exists a

relationship between the central and raw moments. I have shown you here that, how to express

the second central moment as a function of raw moments, and I have taken here the example of r

equal to 2. Similarly, if you try to take r equal to 3, r equal to 4 and so on. Then you can obtain a

similar type of relationship between central and raw moment.

One more important aspect which arises here is the following: usually, you will see that in

statistics, we are considering the 1st, 2nd, 3rd and 4th moments. Whether the raw moments or the

central moments. Well, one can very easily compute the higher-order moments say 5th, 6th and

so on. But up to now, what has happened that we have the interpretation for the first four

moments. For example, the first moment, that is, the first raw moment, is quantifying the information on the central tendency of the data. The second central moment

is giving us the information on variability. Similarly, I will try to show you in the forthcoming

lectures, that third moment is giving us the information on the symmetry of the frequency curve.

That is called as property of skewness, and similarly the fourth moment will give us the

information about the hump of the frequency curve. The peakedness of the frequency curve and

that property is called as kurtosis. But what is indicated by fifth moment, sixth moment and so

on, that is still an open question. So, that is the reason, that usually we are interested in finding

out the moments up to order four. There also exists a clear-cut relationship between rth central

moments, as a function of raw moments. But here I am not showing you here, I am not

discussing it here. But I will request you please try to have a look in the books, and there it is

clearly mentioned. But here I would certainly show you, that what is the relationship of first four

central moment with respect to the raw moments.

Refer Slide Time (35:26)

Now, if I try to take r equal to 1 and r equal to 2 in the case of continuous, grouped data, then

we have the simple similar interpretations. That in the case of first central moment, this is going

to be $\mu_1 = \frac{1}{n}\sum_{i=1}^{K} f_i x_i - \bar{x}$. So, you can see here, this is nothing but x bar minus x bar, which is equal to 0. So,

once again I have shown you that the first central moment in case of continuous data, is always

0, and similarly, if you substitute here r equal to 2, then this quantity is nothing but the sample

variance, and in this case also, this $\mu_2$ can be represented as $\mu'_2 - \mu'^2_1$. So, this is again the same

outcome that we have obtained in the case of ungroup data. So, you can see here and notice that

this relationship is not going to change in the case of group and ungroup data. Now, if I try to

choose here r equal to 0, then I will always get $\mu_0$ equal to 1, whether we have ungrouped data or a

grouped data set.
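
A hedged sketch of the grouped-data versions, with illustrative midpoints and frequencies:

xmid <- c(5, 15, 25, 35)                         # class midpoints (illustrative)
f    <- c(4, 10, 8, 3)                           # class frequencies (illustrative)
n    <- sum(f)
xbar <- sum(f * xmid) / n                        # grouped arithmetic mean
mu   <- function(r) sum(f * (xmid - xbar)^r) / n # rth central moment, grouped data
mu(1)   # first central moment: 0
mu(2)   # second central moment: the grouped-data variance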

Refer Slide Time (36:50)

Now, after this, I will show you what is the relationship between central moments and raw moments. You have observed that the zeroth central moment and the zeroth raw moment both take the value 1, and the first central moment here always takes the value 0. These are the

two points where you have to be careful. I already have shown you this result, where I have

shown you how I can express the second central moment $\mu_2$ as a function of the first raw moment and the second raw moment. Just using a similar concept, I can also express the third central moment as a function of $\mu'_1$, $\mu'_2$ and $\mu'_3$, namely $\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu'^3_1$, and similarly the fourth central moment as a function of $\mu'_1$, $\mu'_2$, $\mu'_3$ and $\mu'_4$, namely $\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu'^2_1 - 3\mu'^4_1$. And you may recall that $\mu'_3$ is nothing but $\frac{1}{n}\sum_{i=1}^{n} x_i^3$, and $\mu'_4 = \frac{1}{n}\sum_{i=1}^{n} x_i^4$ is the fourth raw moment. So, you can see here that these are the relationships of

central moments and raw moments. Well, I am trying to give you here only four relationships, as I said, but using the binomial expansion you can obtain a general relationship between the rth central moment as a function of the first r raw moments, and that is available in all the

standard statistics books. So, I will not do it here. But I would like to stop here. I have given you

the basic concepts of moments. Well, this lecture was purely theoretical. But we also know, that

any development is based on theoretical constructs. Unless and until you understand the basic fundamentals, you will not be able to understand what the software is trying to do, and these are

very important concepts in the subject statistics. So, that was my objective to give you an

exposure of this concept. Now, in the next lecture, I will try to consider the concept of absolute

moments, and I will try to show you how these moments can be computed on the R software, and after that I will introduce the concepts of skewness and kurtosis. So, please take a break, revise the lecture, try to read from the books, and I will see you in the next lecture. Till

then, Good bye!

Lecture – 23

Sheppard’s Correction, Absolute Moments and Computation of Moments

Welcome to the next lecture on the course descriptive statistics with R software. You may recall that in the last lecture we started a discussion on the concept of moments, and we discussed raw moments and central moments. Now, in this lecture, I will introduce you to another type of moments, which are called absolute moments, and I will show you how the raw moments, central moments and absolute moments are computed on the R software. But before we go to the concept of absolute moments, let me introduce one small topic, which is about Sheppard's correction. Okay.

Refer Slide Time (1:06)

So, you may recall that whenever we have continuous data, or say grouped data, what do we do? We try to group the data in class intervals, and the frequency of that group is going to indicate how many values are present in that interval. Now, if you try to see

what are we trying to do? We have here a sort of interval, say $e_1$ to $e_2$, and we assume that the frequency of this interval is concentrated at the midpoint $x_1$; if you remember, $x_1$ was the midpoint of the interval. So, on the y axis, this value here shows the frequency, say $f_1$, for example of the first class. But now see what is happening: there were $f_1$ values in the class interval $e_1$ to $e_2$, and these values were scattered at different locations inside this interval. But when you group all this information, you assume that these values are concentrated at $x_1$: this frequency comes here, this frequency comes here, and so on, and you assume that all these values are concentrated only at the midpoint, the number of such values being $f_1$.

So, in some sense what are we doing? We are trying to group the observations. But when we

are trying to group the observation, the information contained inside the individual

observation is lost. What does it mean that the information is lost? Suppose I have two values, one is 5 and another is 10, and suppose the mid value of the interval is 6. So, I am assuming that 5 becomes 6 and that the value 10 also becomes 6. After this, if you observe the two values 6 and 6, you cannot differentiate which 6 is representing the value 5 and which is representing the value 10. In general, we have lost the information about the individual values of $x_i$: whether the values were 5, 6, 7, 8, 9 or whatever, we simply assume that they are concentrated at the middle value $x_i$. So,

you can see that when we group the observations, some error is introduced, and obviously, when you compute the moments on the basis of the grouped data, this error is reflected in the values of the moments. When the moments do not represent the true values, they will consequently give us the wrong value of the mean, the wrong value of the variance and the wrong values of other quantities which are based on moments. So, it is very important for us that whenever we have grouped data, we apply a correction to the values of the moments, so that the modified or corrected values of the moments are used, which in turn will give us the correct information. Okay.

Professor Sheppard worked in this direction, and he provided some expressions, based only on the moments and the class-interval width, which explain how these changes can be made so that the moments reflect the values without the grouping error; in simple words, Professor Sheppard suggested how this grouping error can be treated. So, let us start the discussion in this direction.

Refer Slide Time (6:53)

So, in grouped data we assume that the frequencies are concentrated at the middle point of the class interval. This assumption may not always hold true, and the so-called "grouping error" is introduced in the data.

Refer Slide Time (7:09)

Now, how to improve these values and how to take care of the grouping error? This effect can be corrected in the calculation of the moments by using the information on the width of the class interval, and this is pretty simple. Let us assume that small $c$ denotes the width of the class interval. Then Professor W. F. Sheppard proved that if the frequency distribution is continuous and the frequency tapers off to zero in both directions, that is, on the left-hand side and on the right-hand side, then this grouping effect can be corrected as follows.

Refer Slide Time (7:51)

and he provided the values of the raw moments and central moments after applying the changes. So, in the case of raw moments, Sheppard's corrections are applied as follows: on the left-hand side I indicate the corrected values of the moments (and the same holds for the central moments below), and on the right-hand side I indicate the moments based on the given data without any correction. You can see that the first raw moment remains the same; there is no error in this case, so the value of the first raw moment and the so-called corrected first raw moment are equal,
$$\mu_1'(\text{corrected}) = \mu_1'.$$
But in the case of the second raw moment, the corrected second raw moment is a function of the second raw moment $\mu_2'$, adjusted by the quantity $c^2/12$:
$$\mu_2'(\text{corrected}) = \mu_2' - \frac{c^2}{12}.$$
So, what are we doing? In order to take care of the grouping error, we simply subtract the quantity $c^2/12$ from the raw moment, where $c$ is the width of the class interval, and we get a new value of the second raw moment in which the grouping error has been taken care of. Similarly, the corrected third raw moment is
$$\mu_3'(\text{corrected}) = \mu_3' - \frac{c^2}{4}\,\mu_1',$$
and the fourth raw moment, after taking care of the grouping error, is obtained by
$$\mu_4'(\text{corrected}) = \mu_4' - \frac{c^2}{2}\,\mu_2' + \frac{7}{240}\,c^4,$$
where once again $c$ is the width of the class interval. Similarly, in the case of central moments we can modify the second, third and fourth central moments; the first central moment is always 0, so there is no grouping error in that case.

The corrected value of the second central moment, after taking care of the grouping error, is the value of the second central moment minus $c^2/12$:
$$\mu_2(\text{corrected}) = \mu_2 - \frac{c^2}{12}.$$
There is no change in the third central moment; it is not affected by the grouping effect, so the original value of $\mu_3$ and the corrected value of $\mu_3$ remain the same. Similarly, the value of the fourth central moment, after taking care of the grouping effect, is given by
$$\mu_4(\text{corrected}) = \mu_4 - \frac{c^2}{2}\,\mu_2 + \frac{7}{240}\,c^4.$$
So, basically these are the parts which have to be taken care of during the computation, and once I have explained how to compute $\mu_r'$, the raw moments, and $\mu_r$, the central moments, then at least for the first four moments you can simply write a simple syntax in the R software. Whatever is the expression for computing a particular moment, that value has to be adjusted just by adding or subtracting a few terms, as proposed by Professor Sheppard, so the implementation of Sheppard's correction in R software is not difficult at all.
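For instance, here is a minimal R sketch of how these corrections could be applied; the function name sheppard_raw and the inputs m1 to m4 (the uncorrected raw moments) and width (the class-interval width c) are hypothetical names introduced only for illustration, not part of any package:

# Hedged sketch: Sheppard's corrections for the first four raw moments.
# m1..m4 are the uncorrected raw moments of the grouped data; 'width' is
# the common width c of the class intervals.
sheppard_raw <- function(m1, m2, m3, m4, width) {
  c(m1_corr = m1,                                  # first raw moment: unchanged
    m2_corr = m2 - width^2 / 12,                   # mu2' - c^2/12
    m3_corr = m3 - (width^2 / 4) * m1,             # mu3' - (c^2/4) * mu1'
    m4_corr = m4 - (width^2 / 2) * m2 + (7 / 240) * width^4)  # mu4' corrected
}

The corrections for the central moments follow the same pattern, with the third central moment left unchanged.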

Now, after this, I will come to the aspect of absolute moments. So, how do these absolute moments come into the picture?

Refer Slide Time (12:24)

You have seen what we did when we introduced the idea of absolute deviation. We had observations $x_1, x_2, \ldots, x_n$. Then what did we do? We chose an arbitrary value $A$, we subtracted this arbitrary value $A$ from every observation, we considered the absolute values of these deviations, and we simply found the arithmetic mean of all such quantities. So, this was simply $\frac{1}{n}\sum_{i=1}^{n} |x_i - A|$. Now, suppose I consider the $r$th power of this, $\frac{1}{n}\sum_{i=1}^{n} |x_i - A|^r$. What will happen? If I take $A = \bar{x}$ and $r = 1$, this becomes simply the absolute deviation around the mean, and if I take $r = 2$, this becomes $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$, which is the same as the variance of $x$. This gives us the idea of defining the $r$th absolute moment, and the quantity defined here is called the $r$th absolute moment about $A$. So, the $r$th absolute moment about the arithmetic mean, based on the sample observations $x_1, x_2, \ldots, x_n$, is defined in the case of ungrouped data as
$$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|^r.$$
This is simply the $r$th power of the absolute deviations: we find all such deviations and compute their arithmetic mean. So, this is called the $r$th absolute moment about the arithmetic mean. Similarly, in the case of grouped data, the $r$th absolute moment about the arithmetic mean is defined as
$$\frac{1}{n}\sum_{i=1}^{K} f_i\,|x_i - \bar{x}|^r.$$
You can see that this is the same philosophy and the same way of development; the only difference is that I have to adjust it for the case of grouped observations, and here $\bar{x}$ is obtained as $\frac{1}{n}\sum_{i=1}^{K} f_i x_i$.
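Since the definition is so direct, it can be written as a one-line R sketch for ungrouped data; the function name abs_moment is a hypothetical helper introduced here for illustration:

# Hedged sketch: r-th absolute moment about the arithmetic mean (ungrouped data).
abs_moment <- function(x, r) mean(abs(x - mean(x))^r)
# abs_moment(x, 1) is the absolute mean deviation about the mean, and
# abs_moment(x, 2) equals (1/n) * sum((x - mean(x))^2), the variance of x.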

Refer Slide Time (15:32)

Now, after this, I come to the aspect of how to compute these in the R software. When we want to compute the values of the moments on the basis of a given sample of data, this facility is not available in the base package of R. In order to compute the moments, and later, when we measure the departure from symmetry and the peakedness of the frequency curve by computing the coefficients of skewness and kurtosis, we need a special package, and we need to install it before we can compute the moments.

So, in order to compute the moments, we first need to install a package called moments, then we load it as a library, and then we operate. The first step is to install the package moments; to do so, you simply write install dot packages and, inside the argument within double quotes, you write "moments", m o m e n t s, and the package will be installed. After this, you need to load the package with library(moments); in fact, whenever you need to compute the moments, you need to load this library. After this, all the sample moments are computed by the command all dot moments, a l l dot m o m e n t s. This function has several arguments, x, order.max, central, absolute and na.rm, and just by controlling these parameters inside the argument, you can generate different types of moments: raw moments, central moments and absolute moments. So, what is happening in the R software is that we have only one command, all.moments, and just by giving different choices of TRUE and FALSE to the various parameters inside the argument, we can compute the different types of moments; there is no separate command for raw, central or absolute moments. What you now have to understand is which choice of the parameters, or of the values inside the argument, gives you which type of moments. Okay.

Refer Slide Time (19:27)

So, first let me explain the meaning of these arguments. The first value here is x; this denotes the data vector. Then there is another parameter, order dot max; it is written here as 2, but it gives the information on the number of moments to be computed. I have explained this in the next slide as well, but let me explain it here too: 2 is the default value, and in case you want to compute three, four, five or six moments, you simply have to choose the appropriate value here. The next argument is central; central stands for the central moments, and the default value taken inside the command is FALSE, that is, the logical FALSE. In case you want the central moments, you just have to use the logical TRUE: in place of FALSE you simply type TRUE, and it will give you the central moments. Similarly, the next option is absolute; absolute will give you the values of the absolute moments. The default value taken in the command all.moments is FALSE, but if you want to compute the absolute moments, you simply replace this FALSE by the logical TRUE. The last argument, na dot rm equal to FALSE, is already known to you; it helps us in the case of missing values. If you want to compute the moments when missing values are present, you simply change this logical FALSE to the logical TRUE. So, just by handling the different arguments with logical TRUE and logical FALSE, you can generate the values of the different moments.
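To summarize, a compact sketch of the four usage patterns, with a small made-up data vector x, might look like this:

library(moments)                                    # load the package first
x <- c(2, 5, 7, 8, 10)                              # illustrative data vector
all.moments(x, order.max = 4)                       # raw moments (the default)
all.moments(x, order.max = 4, central = TRUE)       # central moments
all.moments(x, order.max = 4, absolute = TRUE)      # absolute moments
all.moments(c(x, NA), order.max = 4, na.rm = TRUE)  # raw moments, NA removed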

Refer Slide Time (21:38)

For example, in this slide I explain all these things in detail so that you can have a look: this is about order.max, this is about central, this is about absolute, and this is about na.rm, the same things which I have just explained. Right.

Refer Slide Time (21:51)

Now, after this, I will take an example and show you how these things are computed. I am taking again the same example that we have used a couple of times in the earlier lectures: we have the data of 20 participants in a race, and this data has been stored in a variable time. Now I would like to compute different types of moments for this data. As I said, first we need to use the command install.packages to install the package moments, and then we have to load this package by using the command library(moments). Right.

Refer Slide Time (22:30)

Now, after this, I will show you how to compute raw moments, central moments and absolute moments. First I take the raw moments, and suppose I want to compute the first two raw moments; I have to control this through order.max = 2. So, I use the command all.moments on the data vector time and give order.max = 2; in fact, even if you do not give this option you will get the same outcome, because 2 is the default, but my objective is to show you how the things are being controlled. Once you execute it, you will get this type of outcome. Now, the next question is: what is the interpretation of the values in this outcome? The first value, 1.0, is denoting the value of $\mu_0'$, that is, the value of the raw moment at $r = 0$. Similarly, the second value, 56.0, is denoting the value of $\mu_1'$, the raw moment at $r = 1$; this is the first raw moment. And the last value, 3405.2, is the value of $\mu_2'$, the raw moment at $r = 2$, denoting the second raw moment, which is $\frac{1}{n}\sum_{i=1}^{n} x_i^2$. Now, if I repeat the same command just by changing order.max to 4, what will happen? You can see that in the earlier case the maximum value of $r$ up to which the moments are computed was $r = 2$, the same value as in order.max = 2. So, suppose I want to compute the first four raw moments; I simply have to give the value 4 here, and when I execute this command with order.max = 4, I get this outcome. You can see that the first value is the value of $\mu_0'$, the second value, 56, is the value of $\mu_1'$, the third value, 3405.2, is the value of $\mu_2'$, the fourth value is the value of $\mu_3'$, and the last value, 15080073.2, is the value of $\mu_4'$. This maximum order, four, is indicated by the order.max, and this gives you the first four moments. Now, before going further, let me show you how to compute this in the R software.

Video Start Time (26:00)

So, first I create the data vector, which is here like this; you can see the data vector here. I already have installed this package on my computer, but you will need to install the package; I am simply loading it, and now the package moments is loaded. Now I use the same command, all.moments, on the data vector time, and I want to compute the moments up to order 2, that is, $\mu_0'$, $\mu_1'$ and $\mu_2'$; this is like this. If I compute the first four raw moments, starting from $\mu_0'$ through $\mu_4'$, it gives an output like this, and suppose you want to compute eight moments, you simply have to give order.max = 8 and you will get the outcome.

Video End Time (27:02)

A very important point which you have to notice here is that in the command all.moments, if you are not giving any option such as central, absolute or na.rm, the default outcome is the raw moments; this is what you have to always keep in mind,

Refer Slide Time (27:33)

and whatever outcomes we have obtained here are the default outcomes of the command all.moments, which are simply the raw moments. Right.
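If you want to convince yourself of this, a small cross-check (a sketch, assuming the time vector has already been created) is to compute the raw moments directly from their definition:

# Each element should match the corresponding entry of
# all.moments(time, order.max = 4):
sapply(0:4, function(r) mean(time^r))   # mu_r' = (1/n) * sum(x_i^r)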

Refer Slide Time (27:53)

Now I will show you how to compute the central moments. Again, I will repeat the same thing: first I will compute the central moments up to order 2, and then up to order 4. My command remains the same as earlier, all.moments on the time data vector with order.max = 2, and I simply have to add one more argument, central = TRUE; the default is central = FALSE, so I add central = TRUE inside the argument, separated by a comma. The first value in the outcome indicates the value of $\mu_0$, and it was already shown that it will always take the value 1. Similarly, the second value, 0.0, is indicating the value of $\mu_1$, which is $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})$, and this always takes the value 0. If you come to the last outcome, 69.2, this is indicating the value of $\mu_2$, which is $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$, and you can see that this is nothing but the variance of $x$, where $x$ here is actually the data in the time vector. Similarly, if you repeat the command and wish to compute the moments up to order 4, you need to make only one change, order.max = 4, with the same command, and you can see the outcome here: the first value is denoting $\mu_0$, the second value $\mu_1$, the third value $\mu_2$, the fourth value $\mu_3$, that is, the third central moment, and the last value is denoting the fourth central moment, $\mu_4$. So, you can see that it is not really difficult to compute.
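As a quick sanity check (again a sketch, assuming time is the data vector used above), the same values can be obtained directly from the definition of the central moments:

# Hedged cross-check against mu_r = (1/n) * sum((x - xbar)^r); the third
# element (r = 2) is the variance 69.2 mentioned above:
sapply(0:4, function(r) mean((time - mean(time))^r))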

Video Start Time (30:20)

Now I will show you the computation of these values in the R software on the time data. You can see that I execute the command like this and get the moments up to order 2, that is, $\mu_0$, $\mu_1$ and $\mu_2$. Similarly, if I make it 4, I get the first four values, and if you want, say, the first 8 moments, you simply have to set order.max = 8, and here is the outcome. So, you can see that it is not really difficult to compute these values. Now let us come back to our slides and finally see how to compute the absolute moments.

Video End Time (31:06)

Refer Slide Time (31:09)

Now we have understood that, just as we computed the central moments, we can compute the absolute moments just by controlling the argument values. So, in this case, if I want to compute the absolute moments up to the second order, that is, $r = 0, 1, 2$, my command remains the same as earlier, all.moments on time with order.max = 2, and now I add one more option, absolute = TRUE; the default value is absolute = FALSE, but now I need the absolute moments, so I give absolute = TRUE, and this gives me these values. As in the earlier examples, the first value gives us the absolute moment at $r = 0$, which is always equal to 1; the second value gives us the first absolute moment, which is $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$; and the last value gives us the second absolute moment, $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|^2$, which is here the variance. Similarly, if you want to compute the moments up to order 4, I just make order.max = 4 and use the same command, and this gives me this outcome: the first value is the absolute moment at $r = 0$, the second value is the first absolute moment, the third value is the second absolute moment, the fourth value is the third absolute moment, and the last value is the fourth absolute moment. So, this is how you can compute the absolute moments, and I will show it on the R console also.
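Once more, as a hedged cross-check of the definition, the same values can be reproduced directly:

# Should match all.moments(time, order.max = 4, absolute = TRUE):
sapply(0:4, function(r) mean(abs(time - mean(time))^r))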

Video Start Time (33:35)

So, you can see here the time data, and then I compute the absolute moments with order.max = 2, which gives me the first three values. Now I choose the first four moments, and these values are given to me like this. Similarly, if you want to compute any higher-order values, say up to the ninth moment, the output comes out like this. Remember one thing: the counting starts from zero, so when you take order.max = 9 there will be ten values.

Video End Time (34:11)

Refer Slide Time (34:13)

This is the screenshot of what we have just done.

Refer Slide Time (34:17)

After this, I will show you the use of the last option, for the case when we have some missing values. I take the same data set in which the first two values have been removed and substituted with NA, and this data vector has been stored as time.na.

Refer Slide Time (34:38)

Now, after this, in case I want to compute the raw moments, I have to simply use the same command all.moments on the data vector; suppose I want to compute the first four moments, so I give order.max = 4, and then I have to say na.rm = TRUE (if you do not write it, the default value is FALSE). Once you execute it, you will get this outcome. These values are computed in the same way as in the earlier example; the only thing is that now the missing values have been removed before the raw moments are computed. Similarly, if you want to compute the first four central moments, you use the same command that you used earlier, with the data given by time.na and with na.rm = TRUE, and this will give you the outcome, which you can read in the same way: this is the value of $\mu_0$, this is the value of $\mu_1$, the third value is the value of $\mu_2$, the fourth value is the value of $\mu_3$, and the last value is the value of $\mu_4$.

Refer Slide Time (35:55)

And similarly, if you want to compute the absolute moments in this case, you again use the same command for the computation of the absolute moments and just add na.rm = TRUE, and you get the value of the absolute moment for $r = 0$, the value of the absolute moment for $r = 1$ (that is, the first absolute moment), then the second absolute moment for $r = 2$, the absolute moment for $r = 3$, and finally the last value, for $r = 4$, indicating the fourth absolute moment.
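Putting the three cases together, a minimal sketch of the commands for the data vector time.na (the version of time containing the two NA values) is:

all.moments(time.na, order.max = 4, na.rm = TRUE)                   # raw moments
all.moments(time.na, order.max = 4, central = TRUE, na.rm = TRUE)   # central moments
all.moments(time.na, order.max = 4, absolute = TRUE, na.rm = TRUE)  # absolute moments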

Video Start Time (36:28)

Now I will show you how to get this done in the R software. First I create the data vector here; you can see that this is my data. Now I compute the raw moments, first using the option na.rm = TRUE: after executing this command you will get the same outcome that we obtained earlier. Next, I remove the na.rm = TRUE and show you what happens: you will see that it gives you NA NA NA NA, and only the first value comes out to be 1, because that remains 1 whatever the value of $r$. Similarly, if you want to compute the central moments here, just use this command and you will see the outcome, and if you want to compute the absolute moments, simply use the command for the absolute moments and you get this outcome.

Video End Time (37:38)

And similarly, if you want to compute higher-order moments in the case of missing values, simply control the value of order.max. So, now I have given you the basic concepts of moments, and I have explained how to compute them on the basis of a given set of data. Now it is your turn: take some data sets, practice, and observe how, just by choosing the logical TRUE and logical FALSE inside the arguments given to the different parameters, you can generate raw moments, central moments and absolute moments. In the next lecture, I will show you the use of the third- and fourth-order moments by considering the concepts of skewness and kurtosis. So, you practice, and I will see you in the next lecture. Till then, good bye.

Lecture - 24

Moments - Skewness and Kurtosis

Welcome to the next lecture in the course Descriptive Statistics with R Software. You may recall that in the last two lectures we discussed the concept of moments: the raw moments, central moments and absolute moments. We also learned how to compute them in the R software. Now I am going to introduce the concepts of skewness and kurtosis, which are again two features of a frequency curve, and our objective is to understand, firstly, what these features are; secondly, how to quantify them; and thirdly, how to compute them in the R software. When we try to quantify them, you will see that we need the definitions of the moments, and particularly the central moments, and that was the reason I explained those concepts earlier. So now, let us start this discussion.

Refer Slide Time :( 1:39)

First we try to understand what skewness is. The dictionary or literal meaning of skewness is lack of symmetry. What does this mean? The symmetry here refers to the symmetry of the frequency curve or the frequency density; you have seen how we computed the frequency table and, from there, drew the frequency density curve. So, when we say there is a lack of symmetry, then what is symmetry? Symmetry here is like this; this is the basic meaning of symmetry. Now I am saying that this symmetry may be lacking, and when the symmetry is lacking, what will happen? In an ideal situation, if this is my symmetric curve, with the frequency distributed symmetrically around the mean, then a curve lacking symmetry will look like this or like this. Now, what is the interpretation of these curves? Suppose

I take an example where we are counting the number of persons passing through a place where many, many offices are located. We know what the phenomenon is: usually the offices start at nine or ten o'clock in the morning. The traffic at that point will be very low, say around 7 a.m. or 8 a.m.; then the traffic will start increasing, keep increasing up to say 10 o'clock, 10:30 or 11 o'clock in the morning, and after that the traffic will decrease. If I want to show this phenomenon through a curve, the curve will look like this. Suppose this is the time axis, with 7 a.m. here, 10 a.m. here, and so on up to, say, 3 p.m., and here is the number of persons passing through that point. From 7 a.m. this frequency, the number of persons, is very small; it starts increasing and keeps increasing up to say 10 o'clock; after that, everybody has come to the office, fewer people are coming, and finally, by 3 p.m., this number will have decreased. On the other hand, the opposite happens in the third case. Suppose I mark the points 12 p.m., 1 p.m., up to say 6 p.m., then 7 p.m. and 9 p.m., and we record the same data, the number of persons, which denotes the frequency. Now, what will happen? The office hours are, say, from 9 a.m. to 5 p.m. or 6 p.m., so in that marketplace, or in that place where there are many offices, people will be working inside the offices, and in the evening, when the office hours close, they will leave. So, from say 12 o'clock or 1 o'clock, the number of persons passing through that point will be very small; the number will start increasing from say 4 p.m. or 5 p.m., be maximum between say 5 p.m. and 6 p.m., and once everybody has left the office, the number of persons passing through that point will sharply decrease, so that at 7 p.m. or 8 p.m. the number will be very, very small. How to denote this phenomenon through a curve? This type of phenomenon can be expressed by this curve: initially, at 12 o'clock, the number is very small; then around 5 p.m. or 6 p.m. the number increases, and it decreases after say 7 p.m. In both these cases you can see what is happening: more data is concentrated here, in the first figure, and more data is concentrated on the right-hand side in the last figure.

These are the areas in red color where more data is concentrated. Now, if you take the third figure, you can see that in this case the curve is symmetric around the mid value: if you break the curve into two parts at this point and fold it, it will look symmetrical. That is what I mean: suppose the curve is like this; if I break it at the middle and fold it, the two halves will look the same. This is what we mean by symmetry, and this feature is missing in the first and last curves: if you break the first curve at this point and the last curve at this point, where I am moving my pen, they will not be symmetric. So, the objective here is how to study this departure from symmetry on the basis of a given set of data. I would like to know, on the basis of the given data, whether the data is concentrated more on the left-hand side or more on the right-hand side of the frequency curve. This feature is called 'skewness', and in order to quantify it we have a coefficient of skewness. So, skewness gives us an idea about the shape of the curve obtained from the frequency distribution or frequency curve of the data, and it indicates the nature and concentration of the observations towards higher or lower values of the variable.

Refer Slide Time :( 9:03)

You can see here that I have plotted three figures; call them Figure 1, Figure 2 and Figure 3. I will call the shape in Figure 3 a bell-shaped curve. Why is it called bell shaped? Have you ever seen a bell? A bell is like this, with a ring at the top. You can see that the structure of this curve is symmetric, and that is why the shape in Figure 3 is called a 'bell-shaped curve'; the bell-shaped curve is a symmetric curve. Now, when the symmetry is lacking, the frequency curve will look as in Figure 1 or Figure 2: the frequency curve is stretched more towards the right or towards the left, indicating that more values are concentrated in this region in Figure 1 and in this region in Figure 2. In these cases, we say that the curves are skewed. So, a frequency distribution is said to be skewed if the frequency curve of the distribution is not the bell-shaped curve and is stretched more on one side than on the other. Now, how to identify this? Because we now have two types of lack of symmetry, one in Figure 1 and one in Figure 2, we give each a name.

Refer Slide Time :( 11:01)

So let me label the figures: this is Figure 1, this is Figure 2 and this is Figure 3 from the last slide. In Figure 1 you can see that the curve is more stretched on the right-hand side; when the curve is more stretched on the right-hand side, it is called a 'positively skewed curve'. Similarly, if the curve is more stretched on the left-hand side, it is called a 'negatively skewed curve'. And in the case of a symmetric curve, we say that it has zero skewness. When we want to discuss the property of skewness, we state whether the frequency curve is positively skewed, negatively skewed or symmetric, and this is how we express the finding from the data. But there will definitely be one more thing: suppose I take two curves, like this and like this, and both curves are positively skewed. The question is that both curves are positively skewed, but their structures are different; one curve lacks symmetry more than the other. Just saying 'less' or 'more' will not help us; we need to quantify it. So, our next objective is how to quantify this lack of symmetry, and in order to understand this, we have the concept of the coefficient of skewness.

Refer Slide Time :( 13:02)

The definition of the coefficient of skewness depends on the second and third central moments of the data. You may recall that we denoted the second central moment by $\mu_2$ and the third central moment by $\mu_3$. Now, the coefficient of skewness is denoted by $\beta_1$, and it is defined as the square of the third central moment divided by the cube of the second central moment,
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3},$$
and this is called the 'coefficient of skewness'. There is another coefficient of skewness, defined as the square root of this $\beta_1$, taken with the sign of $\mu_3$, and denoted by $\gamma_1$, so that $\gamma_1 = \mu_3 / \mu_2^{3/2}$. Now, what is the difference between the two measures $\beta_1$ and $\gamma_1$? You can see that in the case of $\beta_1$, $\mu_3$ can be positive or negative, but $\mu_3^2$ will always be positive, and similarly $\mu_2^3$ will always be positive, so $\beta_1$ will always be positive. It therefore gives us information on the magnitude of the skewness, that is, the magnitude of the lack of symmetry, but $\beta_1$ is not able to inform us whether the skewness is positive or negative. Whereas, when we use the next coefficient, $\gamma_1$, it also gives us information about the sign: $\gamma_1$ gives us information on the magnitude as well as the sign, and the sign can be positive or negative, indicating positive or negative skewness. So, this is the basic difference between the two measures $\beta_1$ and $\gamma_1$, and you will see that the R software provides the value of $\gamma_1$. One thing you have to notice here, and I can explain it on this slide itself, is that the $\beta_1$ and $\gamma_1$ I have defined here are for the population. But what happens in our case?

Refer slide time :( 15:55)

We have got a data set, say $x_1, x_2, \ldots, x_n$. In this case, what do we do? We compute the values of $\mu_2$ and $\mu_3$ on the basis of the data $x_1, x_2, \ldots, x_n$; these are called sample moments. So, we compute the sample central moments of orders 2 and 3, and we replace $\mu_2$ and $\mu_3$ by their sample moments. In this case, the $\beta_1$ based on the sample values I denote by $\beta_{1s}$; this is the same thing, I have simply computed the second and third sample central moments and substituted them in the definition of $\beta_1$. Next, the coefficient $\gamma_1$ I now denote by $\gamma_{1s}$, s meaning sample; it is given simply by the square root of $\beta_{1s}$, taken with the sign of the third sample central moment. So, this $\gamma_{1s}$ gives us information on the magnitude and the sign, whereas $\beta_{1s}$ gives information only on the magnitude.
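For concreteness, here is a minimal sketch of $\gamma_{1s}$ computed directly from the sample central moments; the function name skew_coef is a hypothetical helper, not a library function:

skew_coef <- function(x) {
  m2 <- mean((x - mean(x))^2)   # second sample central moment
  m3 <- mean((x - mean(x))^3)   # third sample central moment
  m3 / m2^(3/2)                 # gamma_1s; it carries the sign of m3
}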

Refer slide time :( 17:20)

Now, what is the interpretation? I divide the interpretation based on $\gamma_1$ and $\gamma_{1s}$, and both are actually the same. The first case: if $\gamma_1 = 0$, the distribution is symmetric; if $\gamma_1 > 0$, that is, positive, the distribution is said to be positively skewed; and if $\gamma_1 < 0$, the distribution is negatively skewed. The same continues for $\gamma_{1s}$: if $\gamma_{1s} = 0$, this means symmetry; if $\gamma_{1s}$ is positive, the distribution is positively skewed; and if $\gamma_{1s}$ is negative, the distribution is negatively skewed. So you can see that, having the coefficient of skewness, it is not difficult to know the feature of the frequency curve with respect to symmetry: I can see whether my distribution is symmetric or not, and if not symmetric, whether it is positively or negatively skewed. Now a simple question comes: what happens to the mean, median and mode in these different types of distributions, when the distribution is symmetric, positively skewed or negatively skewed? So, now I give you brief information.
Refer slide time :( 18:49)

When we have a positively skewed distribution, in this case $\gamma_1$ will be greater than 0. Now, if you recall where the mode will be: the mode will be somewhere here, at the value with the maximum frequency; corresponding to the maximum frequency, this gives me the value of the mode. The median is the value which divides the entire area under the curve into two equal parts, so it will be somewhere here, and the mean will be somewhere here. So, in the case of a positively skewed distribution, the mean will have the highest value, followed by the median and the mode: the mode will be smaller than the median, and the median will be smaller than the mean. The opposite happens when we have a negatively skewed distribution, where $\gamma_1$ is smaller than 0: the mode will be somewhere here, corresponding to this maximum frequency, the median will be somewhere here, and the mean will be somewhere here. So, in the case of a negatively skewed distribution, the mode will be greater than the median, and the median will be greater than the mean. In the case of a symmetric distribution, the values of the mean, median and mode are all going to be the same; in this case $\gamma_1$ is equal to 0, and the mean, median and mode will be somewhere here.

Refer slide time :( 20:19)

There are some more coefficients of skewness which have been given in the literature, so I will briefly give you an idea. One coefficient of skewness is based on the mean and mode, given by mean minus mode divided by the standard deviation,
$$\frac{\bar{x} - x_{\text{mode}}}{\sigma_x},$$
where $\sigma_x$ denotes the standard deviation. This is essentially the value of $s$ that we used in the earlier lectures, but I am using $\sigma_x$ here to denote the standard deviation because that is a standard notation in many books. And this is essentially the same as the quantity based on the mean and median,
$$\frac{3(\bar{x} - x_{\text{median}})}{\sigma_x},$$
because you may recall that we have the relationship that $\bar{x} - x_{\text{mode}}$ is approximately equal to $3(\bar{x} - x_{\text{median}})$ under certain conditions. These two measures lie between $-3$ and $+3$, and we say that if these coefficients are greater than 0 the curve is positively skewed, if they are negative the curve is negatively skewed, and if the coefficient is 0 the curve is symmetric.

Refer slide time :( 21:40)

And similarly, we have two more measures, based on the definitions of quartiles and percentiles. The coefficient of skewness based on quartiles is given by
$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)},$$
and similarly the coefficient based on percentiles is given by an analogous formula with percentiles in place of quartiles. Both these coefficients lie between $-1$ and $+1$, and they have the same interpretation as earlier: a positive value of the coefficient indicates a positively skewed curve, a negative value indicates a negatively skewed curve, and a 0 value indicates a symmetric curve.
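A hedged sketch of the quartile-based coefficient, using the base R function quantile() (the helper name quartile_skew is introduced here only for illustration):

quartile_skew <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.50, 0.75)))  # Q1, Q2, Q3
  ((q[3] - q[2]) - (q[2] - q[1])) / ((q[3] - q[2]) + (q[2] - q[1]))
}
# Positive result: positively skewed; negative: negatively skewed; 0: symmetric.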

Refer slide time :( 22:22)

Now, after this, I come to the next aspect, which is called 'kurtosis'. Please observe this picture: I have drawn three curves, which let me call curves 1, 2 and 3. I request you to observe the hump of each curve, which I am marking here in yellow color. By looking at these three curves, can we say what the feature related to the peak of the curve is? The peak of curve 3 is the highest, followed by the peak of curve 2, followed by the peak of curve 1. So, the question is how to show this property of the peaks and how to quantify it. This property, kurtosis, describes the peakedness or flatness of a frequency curve; flatness means how flat the curve is at the peak. From these curves you can see that one curve has a higher peak and the other curves have lower peaks. But how to compare them; what is more and what is less? What we do is measure the peakedness with respect to the peakedness of the normal distribution. What is the normal distribution? In statistics we have a probability density function which we call the normal distribution, or sometimes the 'Gaussian distribution'. So, before giving you an idea about this peakedness, let me give you the idea of the normal distribution.

Refer slide time :( 24:38)

The probability density function of a normal distribution is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty < x < \infty,$$
and this probability density function is controlled by two parameters, $\mu$ and $\sigma^2$. So, we denote this distribution by $N(\mu, \sigma^2)$, where $N$ stands for normal and $\mu$ and $\sigma^2$ are the parameters of the density function; $\mu$ indicates the mean and $\sigma^2$ indicates the variance. If I draw this curve, it will look like this: this value here indicates the mean, the spread around the mean is governed by $\sigma^2$, and the curve is symmetric around the mean. We can compute the coefficients of skewness and kurtosis for the normal distribution, and in this case the coefficient of skewness $\gamma_1$ comes out to be zero and the coefficient of kurtosis $\gamma_2$ comes out to be zero. This was the reason we drew conclusions on the basis of the coefficient of skewness being zero, positive or negative, and we are going to do the same in the case of kurtosis. Since the curve of the normal distribution has zero kurtosis $\gamma_2$, we can compare the peaks of other curves with respect to the normal curve, and this is what we are doing in the next picture.
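If you want to see this bell shape yourself, a small sketch using the built-in density function dnorm(), with the assumed values $\mu = 0$ and $\sigma = 1$, is:

# Draws the symmetric bell-shaped N(0, 1) density curve:
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
      xlab = "x", ylab = "density", main = "Normal density N(0, 1)")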

Refer slide time :( 26:21)

Here, if you look at the curve in the middle, curve number two, this is the curve of the normal distribution, and we compare the flatness or peakedness of the others with respect to it. You can see that curve number three has a higher peak than curve number two, and similarly, curve number one has a smaller peak than curve number two. So, what do we do? All those curves which have higher peaks than the normal curve are called 'leptokurtic'; the peakedness of the normal distribution is called 'mesokurtic'; and those curves which have a lower peak than the normal curve are called 'platykurtic'. This is how we characterize less or more peakedness with respect to the peakedness of the normal distribution, the mesokurtic curve.

Refer slide time :( 27:46)

So, now I can explain that the shape of the hump, the middle part of the frequency curve of the normal distribution, has been accepted as the standard one, and the property of kurtosis examines the hump or flatness of a given frequency curve or distribution with respect to the hump or flatness of the normal distribution, which we have just understood.

Refer slide time :( 28:15)

Those curves which have a hump like the normal distribution are called 'mesokurtic'; curves with greater peakedness, or say less flatness, than the normal distribution curve are called 'leptokurtic'; and those curves which have less peakedness, or say greater flatness, than the normal distribution are called 'platykurtic'.

Refer slide time :( 28:46)

Now the Pearson’s is how to quantify it. So, we have a coefficient of kurtosis and there are

different types of coefficient of kurtosis, but here we are first going to consider, the Karl

Pearson's coefficient of kurtosis, which is denoted by  that is the standard notation. And similar

to the coefficient of skewness you can see here that this  is also depending on the  and ,

what are this  and ?  is the second central moment and  is the 4th central moment. And

the coefficient of kurtosis is defined as the 4th central moment, divided by the square of second

central moment, the value of  for a normal distribution; this comes out to be 3. So, what we try

to do? That we try to define another measure, which is be  - 3 and we denoted by here . Now

the advantage of is that that just by looking at the value of , we will get the idea of the

magnitude and if this is greater than 0, is smaller than 0 or equal to 0, will give us the idea about

the, nature of hump. So, that is why we have two coefficient of kurtosis and in R software, this

is produce in the outcome.
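As with skewness, a minimal sketch of $\beta_{2s}$ computed directly from the sample central moments (kurt_coef is a hypothetical helper name) is:

kurt_coef <- function(x) {
  m2 <- mean((x - mean(x))^2)   # second sample central moment
  m4 <- mean((x - mean(x))^4)   # fourth sample central moment
  m4 / m2^2                     # beta_2s; beta_2s - 3 gives gamma_2s
}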

Refer slide time :( 31:27)

So, now you can see the same thing: for a normal distribution, $\beta_2 = 3$ and $\gamma_2 = 0$. If $\beta_2 > 3$, or $\gamma_2 > 0$, we say that the curve is leptokurtic; if $\beta_2 = 3$, or equivalently $\gamma_2 = 0$, we say that the frequency distribution or frequency curve is mesokurtic; and if $\beta_2 < 3$, or $\gamma_2 < 0$, we say that the distribution is platykurtic. You can see that in the same figure that we had drawn, for leptokurtic curves we have this, for mesokurtic we have this, and for platykurtic we have this. So, this is about the coefficient of kurtosis.

Refer Slide Time :( 31: 19)

Some properties, which I am not going to prove here: the coefficient of kurtosis $\beta_2$ will always be greater than or equal to one, and the coefficient of kurtosis will always be greater than the coefficient of skewness $\beta_1$; in fact, $\beta_2$ will always be greater than or equal to $\beta_1 + 1$. These are some properties just for your information; I am not going to use them here.

Refer Slide Time :( 31: 46)

But I would also like to define the sample-based coefficient of kurtosis, because in $\beta_2$ and $\gamma_2$ I simply used the population values. Now I do not have to do anything new: I need the two central moments $\mu_2$ and $\mu_4$; I compute the sample-based moments, i.e., the values of $\mu_2$ and $\mu_4$, on the basis of the given set of data, and I simply replace them in the coefficient of kurtosis. I denote this coefficient of kurtosis by $\beta_{2s}$, where s means sample, and similarly the coefficient $\gamma_2$ is now transformed to $\gamma_{2s}$, which is the same thing, $\beta_{2s} - 3$. They have the same interpretation as in the case of $\beta_2$ and $\gamma_2$: for a leptokurtic distribution, $\beta_{2s}$ will be greater than 3 and $\gamma_{2s}$ will be greater than 0; $\beta_{2s}$ will be 3, or $\gamma_{2s}$ will be 0, in the case of a mesokurtic distribution; and $\beta_{2s}$ will be smaller than 3, or $\gamma_{2s}$ smaller than 0, in the case of a platykurtic distribution. Right.

Refer Slide Time :( 32: 57)

So now we have a fair idea of how to measure two types of characteristics of a frequency curve, one being the symmetry and the other the peakedness, and they are quantified by the coefficients of skewness and kurtosis. The next question is how to compute them in the R software. As we have seen in the case of computing the moments, we need the package moments; obviously, in order to compute the coefficients of skewness and kurtosis, we need the information on the moments $\mu_2$, $\mu_3$ and $\mu_4$. So, we first need to install the package moments, and then, based on that, we can compute the coefficient of skewness and the coefficient of kurtosis. Let us try to understand it here. In order to compute the coefficients of skewness and kurtosis in R, first you need to install the package moments by the command install dot packages, writing moments inside the argument, and then you have to load this package moments. Right. Now, the sample-based coefficient of skewness, which was our $\gamma_{1s}$, will be computed by the expression skewness, s k e w n e s s, with the data vector x inside the argument. The option na dot rm concerns the missing values: FALSE means there are no missing values, and if there are missing values, we simply write na dot rm equal to TRUE. Similarly, the sample-based coefficient of kurtosis, which we have defined as $\beta_{2s}$, will be computed by the command kurtosis, k u r t o s i s, all in small letters, with the same arguments inside: the data vector x, and na dot rm equal to FALSE if you do not have any missing values, or na dot rm equal to TRUE if there are missing values. Okay.
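So the whole workflow, written out as a short sketch for a generic data vector x, is:

# install.packages("moments")   # once, if the package is not yet installed
library(moments)
skewness(x)        # gamma_1s, the sample coefficient of skewness
kurtosis(x)        # beta_2s, Pearson's measure, to be compared with 3
kurtosis(x) - 3    # gamma_2s, if the excess form is preferred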

Refer Slide Time :( 35: 05)

So, you can see that this is not a difficult thing. If you have missing values and you store the data with those missing values inside a data vector x.na, then the command becomes skewness(x.na, na.rm = TRUE) for computing the coefficient of skewness, and the coefficient of kurtosis in this case is given by kurtosis(x.na, na.rm = TRUE). Right.

Refer Slide Time :( 35: 31)

Now I will take an example and show you how to measure the skewness and kurtosis in the data. I am going to use the same example in which we had collected the timings of twenty participants in a race, and this data has been stored inside the variable time. After this, we simply have to use the command skewness, s k e w n e s s, with the data vector inside the argument, and this gives us the value 0.05759762. This indicates that the skewness is greater than zero, so the frequency curve in this case is positively skewed. Similarly, when you operate the kurtosis command on the time vector, it gives us the value 1.7, and since this $\beta_{2s}$ is smaller than 3, we say that the distribution is platykurtic. Now, I will show you this in the R software.

Refer Slide Time :( 36: 56)

You can see here that I have the data on time; first I need to load the package with library(moments), and I have already installed this package on my computer. Now I compute the skewness of time, which comes out like this, and the kurtosis of time. Right. You can see that this is the same thing which we have just obtained, and this is the screenshot of the same operation that I just did.

Refer Slide Time :( 37: 32)

Now I will take one more example, where I show you how to compute the coefficients of skewness and kurtosis when we have some data missing. I take the same example in which I have removed the first two observations and replaced them by NA, NA, meaning the data is missing, and this data is stored in the data vector time.na. Now I use the command skewness on time.na with the option na.rm = TRUE, which means: please remove the NA values and then compute the skewness. Similarly, for the kurtosis I use the same command kurtosis on time.na with the option na.rm = TRUE, and this gives me the value of the coefficient of kurtosis when the two values are missing. You can see that the coefficient of skewness comes out to be negative, less than zero, which means that the frequency curve based on the remaining observations is now negatively skewed; this indicates that when the first two observations are deleted, the nature of the skewness changes. Similarly, for the kurtosis this value is 1.81, and since this $\beta_{2s}$ is smaller than 3, we say that the distribution is platykurtic; this indicates that even after removing the first two observations, the nature of the curve with respect to kurtosis remains the same. And now, I will show this in the R software also.

Refer Slide Time :( 39: 21)

I have already stored this data time.na on my computer; it looks like this. Now I compute the skewness, which comes out to be the same as what we have just observed, and if I compute the coefficient of kurtosis, it is here like this.

Refer Slide Time :( 39: 39)

The next slide is the screenshot of the same operation that we have done. Now I will stop this lecture. You can see that we have discussed the coefficients of skewness and kurtosis, which give you two more pieces of information about your frequency curve besides the central tendency and variation. Now you know how to find the behaviour of the frequency curve with respect to central tendency, variation, lack of symmetry and peakedness. Just by looking at the data you were not getting all these things, but now you know how to quantify them and how the graphics will look: if you plot the frequency curves of the data which I have taken, say time or time.na, and check whether the features of the curve match the information given by the coefficients of skewness and kurtosis, you will see that they match. This is the advantage of these tools of descriptive statistics: instead of looking at huge data sets, you simply look at these values graphically as well as quantitatively, and they give you very reliable information, but you should know how to use this information and how to interpret the data. Up to now, from the beginning, I have used the tools for data on only one variable; we have discussed the univariate tools of descriptive statistics. From the next lecture, I will take up the case when we have more than one variable, and in particular I will consider two variables. When we have data on two variables, they also have some hidden properties and characteristics, so how to quantify them and how to obtain the information graphically are the topics which I will take up from the next lecture. So, you practice these tools, try to understand them and enjoy them, and I will see you in the next lecture. Till then, good bye.

Lecture – 25
Association of Variables - Univariate and Bivariate Scatter Plots

Welcome to the course on Descriptive Statistics with R Software. You may recall that in the earlier lectures up to now, we have considered the descriptive tools which are used in a univariate setup. Univariate setup means there was only one variable, and we computed the measures of central tendency, measures of variation etc. on only one variable. Now what will happen when we have more than one variable?

For example, there can be two variables which are interrelated. So, the question arises: how do we know whether the two variables are interrelated or not, and in case they are interrelated, how do we measure their degree of association? So now, from this lecture, we are going to study the descriptive tools which are used for more than one variable. There will be two types of tools: graphical tools and analytical, that is, quantitative tools.

So, in this lecture we are going to study univariate and bivariate scatter plots, which is a graphical procedure.

(Refer Slide Time: 01:42)

So, now let me start the discussion here and let me first take a few examples. For example, we know from our experience that if a student studies for more hours, usually he will get more marks in the examination. So now, if I take these two as variables, the number of hours of study and the marks obtained in the examination, then from experience we know that both these variables are associated and they are related.

But you can think: if you have got a data set on two variables, how would you know, on the basis of the given values, whether the two variables are related or not? And in case they are related, how would you show it graphically and how would you quantify the degree of association? So, the first example which I have just taken is that the number of hours of study affects the marks obtained in an examination. Similarly, we also know that when the weather temperature increases, for example during summer, we use more electrical appliances like coolers, air conditioners and so on, so the electricity or power consumption increases.

So, I can say that the two variables, power consumption and weather temperature, are related, and their tendency is that as the temperature increases, the power consumption also increases. Similarly, in another example, we know that the weights of infants and small children increase as their heights increase, under normal circumstances. So now, my question is this: from these types of data sets and this type of information, we have used our experience and based on that we are trying to conclude these things. But how do we do it statistically, how do we do it mathematically, and moreover, how do we show it graphically and how do we quantify it?

So, now I will be considering the association between two variables. I will consider two variables and I will first show you the different types of plots that are available.

(Refer Slide Time: 04:19)

So, now my question is this: I have got the observations on two variables and both the variables are assumed to be related to each other. The first question is how to know whether the variables are really related; and if they are related, how to know the degree of relationship between the two variables. There are various graphical procedures, like two dimensional plots, three dimensional plots and so on.

And there are some quantitative procedures also, like correlation coefficients, contingency tables, chi square statistics, linear regression, non-linear regression and so on. So, we will try to study these tools one by one.

(Refer Slide Time: 05:02)

So, now first let me describe the setup: what the variables are and how the observations have been obtained; and now we are interested in creating the graphs. I simply assume here that there are two variables, denoted by capital X and capital Y, and that small n pairs of observations are available on these two variables. These observations are denoted as (x 1, y 1), which occur together, (x 2, y 2), which occur together, and lastly (x n, y n), which occur together. What does this mean? Suppose I take X as the height of children and Y as the weight of children, and now I try to collect the observations on these two variables like this. Suppose I take a child and measure his height, and suppose this height comes out to be 100 centimeters.

So, for this child number one, the height is coming out to be 100 centimeters, so I denote it as x 1 equal to 100 centimeters. Then I try to find out the weight of the same child, child number 1, and suppose this weight comes out to be 20 kilograms. So, I denote this first value as y 1 equal to 20. So now, this (x 1, y 1), which is here (100, 20), is my first paired observation. Similarly, if I take a second child, child number 2, and measure the height of this child, suppose it comes out to be 90 centimeters. Height is denoted by capital X, so I can denote the height of the second child by x 2; and similarly, I try to find out the weight of this child.

Weight is given by Y and its value is given by small y, so I write down y 2, which indicates the weight of the second child, and suppose we observe that this weight is 17 kilograms. So now, this (x 2, y 2), which is equal to (90, 17), is my second pair of observations, and so on; we can collect more observations, and these observations will be given as (100, 20), (90, 17) and so on. This indicates that this is the value of (x 1, y 1) and this is the value of (x 2, y 2). As soon as I write that there are n pairs of observations, that means these are some numerical values which we have obtained by experiment or by observing the data in some phenomenon.
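As a small sketch of this setup in R: the first two pairs below are the ones from the lecture, while the remaining three pairs are hypothetical, added only so that the vectors are complete and the code runs.

# heights (cm) and weights (kg) of the children; one position per child
child.height <- c(100, 90, 110, 95, 105)   # x values; the last three are made up
child.weight <- c(20, 17, 23, 18, 21)      # y values; the last three are made up
cbind(child.height, child.weight)          # each row is one paired observation (x_i, y_i)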

(Refer Slide Time: 08:58)

Now, after doing this, my first question will be: I would like to know whether there is any relationship between the two variables or not, and I would like to judge that on the basis of the given set of data. In order to do this, we have a plot which is called a scatter plot, and in a scatter plot what we do is plot the paired observations in a single graph. How? For example, I have here two variables X and Y. So, I take the value of x 1 and y 1 and plot this point over here. Similarly, I take the value of x 2 and the value of y 2 and plot it here.

So, this point is the point (x 1, y 1) and this point is (x 2, y 2); this is called a scatter plot. Now, this scatter plot can reveal different types of natures and trends of possible relationships. For example, we can broadly divide the nature of the relationship into linear or non-linear. Here you can see that I have plotted a scatter plot here and here. These circles indicate the values of the observations; all these circles are essentially plotting the values of the x i's and y i's.

But if you try to look at the pattern in this graph, what you see is that there is a sort of trend: as the value of X increases, the value of Y also increases, and this is happening, moreover, in a linear fashion; there is a sort of linear relationship. So, I can say that by looking at the scatter diagram I can conclude that the relationship in this case is linear; that means there exists a linear relationship between X and Y.

Similarly, if you look at the second figure, you can see that once again these points are plotted here, but these points are not showing a really linear trend; the trend is something like this. So, this is indicating that the relationship between X and Y is non-linear, right.

(Refer Slide Time: 11:47)

After this, once we have decided that the relationship is linear or non-linear, how do we see whether the strength of the relationship is more or less? We are now going to consider only the linear relationship; a similar type of conclusion will also hold for the non-linear relationship, but I will not consider it here in this lecture. So, look at these two graphics, figure number 1 and figure number 3, on the left hand side.

In figure number 1, the points are concentrated inside this band, and in figure number 3, the points are concentrated in this band. Now observe the width of this band and compare it with the band width of figure number 1: you can see that in the case of figure number 3, the observations are scattered more than in the case of figure number 1, but in both the cases the trend is almost linear, which I am denoting with the red colour. In both the cases you can see that the trend is nearly linear.

But what is happening is this: in figure number 1 you can see that the points are very close to the line; all the scatter points are lying very close to the line in red colour. In the case of figure 3, the points are lying quite far away from the line; with the orange lines I am trying to denote the deviations. When I compare these deviations of the observations from the trend line, the red colour line, in figure number 3 and figure number 1, I can say that the strength of the linear relationship in figure number 1 is more.

Why? Because in figure number 1 the points are lying closer to the line, and in figure number 3 the points are quite away from the line in comparison to figure number 1. Now, there is another thing which you have to observe in figure number 1 and figure number 3; try to observe my line in purple colour. You can see that as the values of X increase, the values of Y also increase. And the same thing is happening in figure number 3: as the values of X increase, the values of Y also increase. You can see in figure number 1, this is my X and this is my Y, and now if I take another X here, this is my corresponding Y, right.

So, this is indicating that the relationship is increasing, or we call it positive. This is what I mean by the title strong positive linear relationship: the relationship is linear and it is strongly positive in comparison to the relationship in figure number 3, where the relationship is positive and linear, but moderately positive. Similarly, now observe figure number 2 and figure number 4: here you can see that as the values of X increase, the values of Y decrease; this is happening in figure number 2, and the same thing is happening in figure number 4.
So, in this case I can say that the relationship between X and Y is decreasing in both the cases, figure 2 and figure 4. Now, if you create here a line, called the trend line, you can see it is like this, and in figure number 4 it is like this. Now try to analyze the deviations of the individual observations from this line. If you first observe figure number 4, the points are lying quite away from the trend line in comparison to figure number 2, because in figure number 2, if you observe, the points are very very close to the line in comparison to figure number 4.

So, I can now say that in figure number 2 the relationship between X and Y is quite strong, and since it is a decreasing relationship, we call it a negative linear relationship, because the relationship is linear; and similarly, in the case of figure 4, I will call it a moderate negative linear relationship.

(Refer Slide Time: 17:44)

And similarly, in case you plot X and Y and get no clear relationship, as is happening here in figure number 5: you cannot say whether there is an increasing trend or a decreasing trend, or whether the trend is something like this.

So, in this case, by looking at the scatter plot, I can see that there is no clear relationship, and we do not even know whether it is linear or non-linear, or whether it is positive or negative. So now, in this lecture, we are going to study the aspect of linear relationship. We will assume that in whatever we are going to do, the relationship between X and Y is linear, and there will be two aspects: graphical and quantitative, right.

(Refer Slide Time: 18:37)

So, now what I will do is take some examples and show you the commands in the R software; I will also show you how to execute them and how to understand the outcome. First, I am going to discuss here the plot command. One thing which I would like to make clear is that this plot command can be used to create the scatter diagram in the univariate as well as the bivariate setup. I had not covered it when I covered the univariate graphics because I knew that I was going to cover the topic of plot here, so why not cover it together, right. So, in case you have only one variable, the univariate case, then the data on that variable is stored in a data vector x and the R command is plot, p l o t, and inside the argument you have to write the data vector x, ok.

(Refer Slide Time: 19:49)

Now, I take a simple example and show you how the graph will look. I have once again taken the same example which I considered in the earlier slide: we have collected the data on the heights of 50 persons, which have been recorded here, and this data has been stored inside a variable whose name is height, like this.

(Refer Slide Time: 20:16)

Now, I would like to plot this data. So, I give the command plot and, inside the argument, height, and you will get this type of outcome on the R console. My objective here is not just to show you the figure, but to show you how to interpret it. You can see here on the x axis, this is giving me the index. What is the index? For example, if you look in this data set, this is my first observation and this is my second observation.

So, this index gives the order in which the data has been given: the first observation will have index 1, the second observation will have index 2, the third observation, 130, will have index equal to 3. And in this plot, what has been done? They take index number one and whatever is the value of the x data, for example here the height, and they plot it here; then they take index number two and plot the data wherever it lies.

So, you can see here on the y axis we have the height. So, this is a scatter diagram of only one variable. From here you can actually get information on central tendency or dispersion, whether the data is more scattered or more concentrated around a particular value. So, all types of measures of central tendency and dispersion can be viewed from this type of graph, ok.
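As a sketch, since the 50 recorded heights are not reproduced in this transcript, the same univariate plot can be mimicked with simulated values; the numbers below are placeholders, not the lecture's data.

set.seed(1)
height <- round(rnorm(50, mean = 155, sd = 12))   # hypothetical heights in cm
plot(height)   # x axis: the index 1 to 50, y axis: the height values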

(Refer Slide Time: 22:10)

Now, I come to bivariate plots. A bivariate plot means there are two variables, and these are the plots which give us the first visual information about the nature and degree of relationship between the two variables: whether the two variables are related or not, and if they are related, whether their relationship is linear or non-linear. All this type of information can be obtained from these bivariate plots. In bivariate plots, we take two variables: X, whose values are plotted on the X axis, and Y, whose values are plotted on the Y axis; then the values x 1, x 2 and so on up to x n, and similarly on the y axis all the values y 1, y 2 up to y n, will be plotted like this, and they will show you the trend and degree of relationship. So, we will take some examples and see what the commands are and how they have to be executed.

(Refer Slide Time: 23:16)

So, now, in order to create a bivariate plot, suppose we have two variables, which I am denoting by small x and small y, in which the data on these two variables have been stored. In case I want to make a plot with the values of the x variable on the x axis and the values of the y variable on the y axis, this can be executed by the command plot, but now inside the arguments I have to give the variables x and y separated by a comma. This is how we do it. There is another option available in this command, which is type; just by giving various values to type, we can create different types of graphics. For example, if I set type equal to p, then it will give me only the points, which is the default; and if I set type equal to l, this is going to give me the lines. You can see here that I have made the p and the l bold, which indicates what the meaning is, so you can easily remember them. If you want both points and lines to be present, then I have to use type equal to b, which comes from both. Similarly, if you use type equal to c, then I will get only the lines part of b, that is, the lines alone from the plot in which we have both points and lines.

Similarly, if I use the option o, then I will get an overplotted graph, meaning the points and lines are both plotted over each other inside the graph. If I take type equal to s, then I will get a graph which looks like stair steps. And finally, if I choose type equal to h, then this will give me lines which look as if I have created a histogram of this data; these are high density vertical lines.
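A compact way to see all these type options side by side is a sketch like the following; the x and y vectors are small made-up data used only for illustration.

x <- 1:10
y <- c(2, 4, 3, 6, 8, 7, 11, 10, 13, 15)   # hypothetical values
op <- par(mfrow = c(2, 4))                 # arrange the panels in a 2 x 4 grid
for (t in c("p", "l", "b", "c", "o", "s", "h")) {
  plot(x, y, type = t, main = paste("type =", t))   # one panel per type option
}
par(op)                                    # restore the previous graphics settings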

(Refer Slide Time: 25:38)

So, now let me take an example and show you how these different types of graphics look, but before that, there are some more options available in the plot command. For example, in case you want to give a title, then you have to use main inside the argument, and you just give whatever you want as the title of the plot. Similarly, if you want to give a subtitle to the plot, then you have to use the command sub; and if you want to give the title on the x axis, then the command here is xlab.

And similarly, for the y axis title the command is ylab; and if you want to maintain the aspect ratio, that means how the graph will look, whether it should be more stretched in the x direction or in the y direction, this is given by the option asp. If you give a numerical value to this, it will maintain the aspect ratio; and definitely, if you want more information, I suggest you look into the help of the plot command, right.

(Refer Slide Time: 26:40)

Now, I will take a simple example to demonstrate how to use this command and how the graphs will look. As we have discussed, we know that the number of hours of study of a student affects the marks obtained in the examination. So, the number of hours of study and the marks obtained in the examination are related, but this is from experience. We have collected the data on 20 students: how many hours every week they studied and, finally, how many marks they obtained in the examination out of 500.

The marks and the numbers of hours of these 20 students have been recorded as follows: for example, if a student studied 23 hours per week, then he got 337 marks; if a student studied 25 hours every week, then he got 316 marks, and so on. So, the first row denotes the marks and the second row denotes the number of hours per week which a student studied, and this data for 20 students has been obtained here.

(Refer Slide Time: 27:53)

So, now, first let me compile all the data into data vectors. I create a data vector here, marks, in which I store all the observations given in the first row; and in the second case, I have taken a variable here, hours, in which I have stored the data given in the second row. So, I have created two data vectors, marks and hours, as I will be using this example again and again. So now, I have explained to you the genesis of this data set.
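In R, this step looks like the following sketch. Only the first two pairs (337 marks with 23 hours, and 316 marks with 25 hours) are given explicitly in this transcript, so the remaining eighteen values in each vector below are hypothetical placeholders that merely follow the increasing trend described in the lecture.

# marks out of 500 and weekly hours of study for 20 students
marks <- c(337, 316, 327, 340, 345, 352, 360, 365, 372, 380,
           388, 395, 404, 412, 420, 430, 438, 445, 452, 460)   # all but the first two are placeholders
hours <- c(23, 25, 24, 26, 27, 27, 28, 29, 30, 31,
           32, 33, 34, 35, 36, 37, 38, 39, 40, 42)             # all but the first two are placeholders
length(marks); length(hours)   # both should be 20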

(Refer Slide Time: 28:44)

Now, I will simply make a plot of these two data vectors, hours and marks. I will be using the plot command for bivariate plots, which was plot x, y. So, I say here plot, and inside the arguments I give the two data vectors, hours and marks, separated by a comma, and this will give me a graphic like this one. You can see here these are the data points, and by looking at these data points you can see that there is going to be a sort of linear trend; trend means that most of the points are going to lie nearly on a straight line.

Now, whether these points lie close to the line or away from the line will give some idea about the extent of the relationship between the marks obtained in the examination and the number of hours studied, right.

(Refer Slide Time: 29:46)

Now, in case I want to change this type, I use here the type l; if you remember, l was used as the type to create the graph with lines. You can see that once I do it, I get a graph like this one, in which all the data points are interconnected by such lines. And if you wish, you can compare it with the earlier curve: if you take this point, this, this, and so on, and connect them by lines, that will give you a line plot.

(Refer Slide Time: 30:33)

Similarly, if you choose the option type equal to b, b means both lines and points. You can see here this is the combination of the first two graphs that we obtained: all those points are here and they are connected by these lines, like this, this, this and so on. This is how you can obtain this type of plot.

(Refer Slide Time: 31:07)

Similarly, if I use the option type equal to o, o means overplotted. What do you mean by overplotted? In the earlier graph, if there were two points, they were simply joined like this; but now the line passes through the points themselves. You can see here these are the points, joined from point to point, so the lines and the dots are both overplotted. This type of graph will again give us a different type of information.

(Refer Slide Time: 31:49)

Now, in case I use the type h: we have discussed that h means a sort of histogram, or high density vertical lines. You can see here we have this data, and each data point has been joined to the x axis like this. So, this is another type of plot which can be obtained by using the type h.

(Refer Slide Time: 32:16)

And similarly, if you want to create a stair-steps type of plot, then use type equal to s. If you see the plot which we have obtained here, these are my data points; you can see what my data points were in the first curve.

You can see this point here, and the same points are here, and now they have been joined by a stairs type of plot. These points go up like steps; you have seen that when we climb to the roof, there are stairs like this. That is why this plot is called a stair-steps type plot.

(Refer Slide Time: 33:01)

Now, in case you want to make the plot more informative, suppose you want to add the title of the plot and the titles on the x and y axes. You have to use different options, and if you remember, we discussed these things in more detail when we discussed the graphics in the case of univariate data. Exactly in the same way, suppose I want to give here a title like, marks obtained versus number of hours per week.

We have discussed that this can be given by the command main. So, I write here main is equal to the title inside double quotes. Similarly, if you want to give a title on the x axis, suppose I want to give, number of weekly hours, then in order to do this we have the command xlab. So, I write xlab is equal to the title inside double quotes, and this gives me an outcome like this.

Similarly, in case you want a title on the y axis, suppose I want to give, marks obtained, then for that we have the command ylab. So, I write ylab is equal to the title which I want to give inside double quotes, and the outcome is given over here. There are some other options also; I would say please look into the help menu and try them. So now, I will try to show you how these graphics using the plot command are constructed inside the R software.
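Putting these pieces together, the full command described above would look like this sketch, using the hours and marks vectors created earlier:

plot(hours, marks,
     main = "marks obtained versus number of hours per week",   # main title
     xlab = "number of weekly hours",                            # x axis title
     ylab = "marks obtained")                                    # y axis title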

(Refer Slide Time: 34:48)

So, let us come to the software part. You can see here that I already have stored the data, h o u r s, hours, and the marks; you can see here like this. This is the same data that we obtained in the example; you can verify it. Now I plot marks against hours, and as soon as you press enter, you get this type of graphic, right. You can see here this is the same graphic that we obtained.

Now, in case you simply change the order of the variables: at this moment, marks are coming on the x axis and hours are coming on the y axis. Suppose I want to interchange them: in place of marks I will give hours, and in place of hours I will give marks. So, I type plot hours and marks, and as soon as I press enter, you can see that the graphic has changed, like this.

(Refer Slide Time: 36:02)

So, now I will continue with the same command and clear the screen, but I will show you the different options which you can use with the type option. Our first option was type equal to l, by which I get the lines. You can see that as soon as I press enter, the graphic changes; you can see here this is now l. And suppose you want to have lines and points both: then I have to give type equal to b, and you see here you get a different graphic. Similarly, if you want to use the overplotted option by choosing type equal to o, small o, then as soon as I press enter, I get this curve; this is the same graphic which we have seen in the slides.

And similarly, if you want to have a histogram, or high density lines, then using type equal to h we get this type of graphic, as you can see here. And finally, if you want stair-type graphics, then I have to use type equal to s, and as soon as I do so, I get a graphic like this one. So, you have now seen that creating a graphic is not difficult at all. Now, suppose you want to add some more features: there are some default features, for example the name of each variable will be coming on the x and y axes, but suppose you want to add a title on the x axis, a title on the y axis and a main title, then you have to use the same commands which I have shown you here on my slides.

(Refer Slide Time: 38:02)

Now, I will copy this command and paste it on the R console. You can see here it comes like this; and as soon as I press enter, the graphic changes, and now you have the titles on the x and y axes as well as the main title. Similarly, in case you want to change the colours, this is also possible using the options. I have shown you the basic operations; you can make the plot as beautiful and as informative as you want.

So, now I will stop this lecture. In this lecture, I have given you an idea of the plot command. I am not saying at all which command or which graphic or which type is the best; it depends only on you people, or the experimenter. The experimenter has to decide, or you people have to decide, what type of information you want from the graphic and which graphic is more suitable to provide that information in the correct way. This also comes by practice and experience.

So, at this moment the objective is to learn how to create graphics and how to make them more informative and more interactive using the R commands. So, please take some data set, experiment on it, give different types of options, create the plots and see what type of information they provide; try different values of type.

And you practice, and we will see you in the next lecture; till then, good bye.

Lecture – 26

Association of Variables – Smooth Scatter Plots

Welcome to the lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lecture, we started a discussion on the association of variables. We had considered how to construct bivariate plots using the command plot, and we had used different types of options to create more interactive plots. Now, when we use the plot option to create the scatter plot, we would like to have two types of information: one is the direction or the trend, and the second is the magnitude, and the magnitude was judged on the basis of whether the strength of the relationship is strong, moderate, weak and so on.

So now, the question is: by looking at the scatter diagram, how would you know whether the strength is more or less? In order to do so, we had created a line manually, but that can also be done by the software. So now, we are going to consider plots where we create the scatter plot and also add a smooth line; that line will give us a sort of fit, and by comparing the observations with that fit we can judge whether the strength is less or more, whether it is weak or strong, and so on. So, in this lecture we are going to consider the smooth scatter plots.

(Refer Slide Time: 01:56)

So, now we assume that there are two variables, that those variables are related, and that we have obtained n paired observations, say (x 1, y 1), (x 2, y 2), up to (x n, y n). Now, the objective is that we want to create a scatter plot, and inside the scatter plot we want to have a line which is called the fitted line; why this is called a fitted line will become clear to you after some lectures. When I do so, this type of graphic will provide us the information on the trend or relationship between the variables.

In order to construct such a graphic in the R software, we have a command scatter dot smooth, s c a t t e r dot s m o o t h. This command produces a scatter plot and also adds a smooth curve to the scatter plot.

(Refer Slide Time: 03:09)

So, now, how to do it and what are the details? This command, scatter dot smooth, is based on the concept of loess, l o e s s, which is the locally weighted scatter plot smoothing method. It is used for local polynomial regression fitting, and in this case it fits a polynomial surface which is determined by one or more numerical predictors using local fitting.

Definitely, I am not going to discuss the details of loess and so on; we are simply going to use it. But you will see that later on there are some options whose details are written in terms of l o e s s, loess, so I thought that I should tell you, so that you do not get confused at a later stage. Well, if you want more details about this scatter dot smooth function, please look into the help of this command and you will get more details, right.

(Refer Slide Time: 04:11)

So, now, the more detailed command for scatter dot smooth, which will give you a scatter plot and a smooth curve, is the following. You can see here the command is the same, scatter dot smooth. Now, I give here the data x and y; y here is actually NULL by default, because then we make the plot with only one variable, but if you want to make a bivariate plot you can give both the x and y data. Then there are different options, such as span; span controls the smoothness for this loess. Then there is a degree, here given as one; this degree is the degree of the local polynomial which is used for fitting. Then it asks for the family; this family can be symmetric or gaussian, and these are different methods for fitting the data. In case we use gaussian, it indicates that the fitting is done on the basis of the least squares method; we will consider the least squares method at a later stage. In case the option symmetric is used, then a descending M estimator is being used to obtain the polynomial.

Similarly, if you want to give the labels on the x axis or the y axis, then these are the commands xlab and ylab. Similarly, if you want to give the limits, for example if you want to provide the limits on the y axis, this is given by ylim, and it means the range; if you remember, range gives you the two values, minimum and maximum, which we had discussed. After that, if you are handling missing data, then you have to use the option na dot rm is equal to TRUE or FALSE. There are some more options available, but I would request you to please look into the help menu and try to understand how to use them.
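As a sketch, several of the options described above can be combined in one call; the hours and marks vectors are the ones from the earlier example, and the particular span, family and label values chosen here are illustrative.

scatter.smooth(hours, marks,
               span = 2/3,            # controls the smoothness of the loess fit
               degree = 1,            # degree of the local polynomial
               family = "gaussian",   # local fitting by least squares
               xlab = "number of weekly hours",
               ylab = "marks obtained")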

(Refer Slide Time: 06:13)

Now, I will take an example and show you how to plot such graphics. I am going to consider the same example which I discussed in the last lecture, where we obtained the data on the marks obtained by the students out of 500 and the number of hours they studied in a week. This data was obtained for 20 students and is given like this: the first row here is the marks of the students out of 500, and the second row gives the information on the number of hours they studied in a week, ok. This data I already have stored in two variables, which I am calling marks and hours, exactly in the same way as I did in the last lecture.

(Refer Slide Time: 07:01)

Now, after this I would like to create a scatter plot with a smooth curve using this data.

(Refer Slide Time: 07:14)

And for that, I use the command scatter dot smooth, and inside the argument I give the names of the variables, hours and marks; and if you see here, this is the graphic that we are going to obtain, right. You can see here these points are the same points which occurred in the scatter plot that we constructed in the last lecture.

But now there is a line which has been added, and this line helps us in knowing how much the individual observations deviate from it. If these deviations are small, or if these points lie very close to the line, I can say that the strength is quite high. Suppose you had got the same data with this line, but the points were lying here and there and so on; this may happen, and in that case I would say that the strength is weak, or the degree of the linear relationship is weak.

(Refer Slide Time: 08:35)

So, this is how we do it, and now I will show you on the R console also how to operate it. I already have entered the data on hours and marks, as you can see here.

(Refer Slide Time: 08:43)

And I make the plot using this scatter smooth command, and you can see here we are getting the same curve which we have shown, right, ok. So, now let us come back to our slides.

(Refer Slide Time: 09:03)

And now, if you see, I am taking a very small data set here. I am just considering 10 values, and these 10 values indicate the weights of 10 bags of grains. This data, which is recorded here for 10 bags in kilograms, is stored inside the variable weight, right. I am trying to plot the scatter smooth graph for this data set; it is plotted here and the graph comes out like this.

So, you can see here these are the points, and this is indicating that possibly the relationship is not actually linear, but a sort of non-linear relationship. Please keep this figure in mind, because I will give you some more information later on using the same data set, ok. And, if you want to plot it here on the R console, I can copy this data here, and it comes out like this; and you can see what I am trying to do here.

(Refer Slide Time: 10:28)

I am simply trying to make a scatter smooth plot of this, and it comes out like this. One thing which you have to notice here is that in this example, I have used only one variable. You can see in this graphic that the value on the x axis is only an index. So, what we learn here is that the scatter smooth command can also be used for getting the curve for univariate data, when we have only one variable, right.
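A sketch of this univariate use, with ten made-up bag weights standing in for the lecture's data, which is not reproduced in this transcript:

weight <- c(20, 22, 21, 25, 24, 28, 30, 29, 33, 35)   # hypothetical weights in kg
scatter.smooth(weight)   # only one variable: the x axis is just the index 1 to 10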

(Refer Slide Time: 11:10)

And now I will take an example where there are more data points, and I will show you how this curve looks in the univariate case with a single variable. So, I have collected the heights of 50 persons, like this; it is the same example that we considered earlier, and this data has been stored inside a variable, height, like this. Based on that, I will create a scatter smooth plot of this data on height.

(Refer Slide Time: 11:36)

So, you will see here it looks like this, and it is indicating, well, that this can approximately be a sort of linear relationship, but it also suggests that the relationship here is curvilinear. On the x axis, this is only the index of the observations, and on the y axis these are the values of the heights. The reason why I am showing this graphic is that I want to create another graphic on the same data set and compare both of them together. But if you want to see the same graphic on the R console,

(Refer Slide Time: 12:24)

you can see here that I store the data in the data vector height, and I create here the scatter smooth graph of height; it comes out like this.

(Refer Slide Time: 12:35)

So, this is the same one which we have shown here on this slide.

(Refer Slide Time: 12:41)

Now, there are some more options available for this scatter smooth curve, which will give you more interactive graphics; we already have discussed them here. So, you can just look into the help and try to experiment with the various options.

(Refer Slide Time: 12:59)

Now, I would like to discuss another type of smooth scatter plot. There is another command in the R software which produces a smooth scatter plot, but this plot is a little bit different from what we obtained earlier; this is essentially a smoothed and coloured density plot, and it is obtained through a two dimensional kernel density estimate. You may recall that when we were discussing the graphics in the univariate case, we had created the frequency density curve, or density plots, using kernel estimates. There we had defined the kernel functions; those kernel functions had some nice statistical properties, similar to the properties of a probability density function. That kernel function was for one variable, so it was a univariate kernel function.

Similarly, when we have more than one variable, then in statistics it is possible to handle them through probability density functions which are functions of more than one variable. Suppose you have two variables x and y, or three variables x, y and z; then it is possible to define the joint probability density function of x and y, or the joint probability density function of x, y and z. And similarly, the kernel functions can also be defined in a multivariate setup. So, in this case, this plot is going to use the concept of kernel density estimates in two variables, and that is why it is called a two dimensional kernel density estimate. Well, we are not going into the details of what those kernel density estimates in two dimensions are, but definitely you should know what we are going to do and how the outcome has been obtained.

And definitely, if you want to know more details: we already have understood that one of the biggest advantages of the R software is that it is not a black box. You can go to the site of the R software, and there you will see the help menu, and there will be all the details of how this scatter plot has been constructed, ok. So, let us now come back to our slides. In order to create this type of smooth scatter plot, the command is smoothScatter, but try to see the difference: one letter, the capital S of Scatter, is in capital; that you have to keep in mind. The spelling is smoothScatter: s m o o t h in small letters, then S in capital, and then c a t t e r in small letters; and inside the arguments you have to give the data vector, right.

And similarly, if you want to have more options, you can see here they are given, but definitely I am not going to discuss them.

(Refer Slide Time: 16:09)

But I have given them on the slides, which are with you, and you can have a look; if you try to use them, you will get more informative and better graphics.

(Refer Slide Time: 16:18)

So, just have a look at the help on this smoothScatter and you will get all this information, right.
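A minimal sketch of the command (note the capital S). Here it is applied to the hours and marks vectors from before; nbin and colramp are standard arguments of smoothScatter, set to illustrative values.

smoothScatter(hours, marks,
              nbin = 128,   # number of bins used for the 2D kernel density estimate
              colramp = colorRampPalette(c("white", "blue")))   # darker tint where the density is higher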

(Refer Slide Time: 16:31)

So, now I will first take the same example which I just took in the case of univariate data, and I take the data on weight, where we had collected the weights of 10 bags of grains in kilograms, right; this data has been given inside the data vector weight. Earlier, we had obtained the smooth scatter plot using the command scatter dot smooth, and that curve was looking like this.

Now, I will use this new command smoothScatter on the same data set; we will obtain the new graphic and I will compare it with the earlier one.

(Refer Slide Time: 17:14)

So, you can see here that when I use the command smoothScatter for the data on weight, I get a graphic like this one. What are these points and what are they showing? If you recall, the earlier plot that was obtained was like this. Now, if you put both these plots side by side, you can see the similarity. You can see here this is a point, this is a point, and this is a point on the earlier plot; let me call these points point number 1, 2 and 3, and they were created as dots.

Now, the same points have been created with this new command smoothScatter, which are here like this: point number 1, point number 2 and point number 3. So, this point is here, this point is here and this point is here, right. And similarly, if you see here, now I will use a different colour, so that

you can observe the movement of my pen: these two points are here. Similarly, this point is here, and similar to this, points 1, 2 and 3 are here. So, you can see that both these plots give us similar information, but their structures are different. Now it depends on the experimenter, or the statistician, or you people; you have to decide, under the given circumstances, which graph is going to give you the better picture. One situation where this smoothScatter type of plot will be more useful is when you are trying to obtain the observations and you are not 100 percent confident about the values, whether a value is 20 or 20.1 or 19.8; in that case, the type of graphics that we have obtained now using the smoothScatter command will be more useful, because they will also show you the uncertainty involved in the point. Definitely, when I say the value is 20, the margin of error should be as small as possible, and if the value is exactly 20 then there is no error; whether the value is 20, or possibly 19.8, 19.5 or 19, or 20.2, 20.5, 21 or 22, is indicated by the decreasing tint of the colour. Now, if you see in this graphic, suppose you observe this black coloured one: you can see that the values in the centre are darkest.

Similarly, if you take any other point here, the middle part has the darkest colour. And as we move away from the centre, like this, here you see my pen in red colour, we move towards the outside; or in this point number 1, this is the centre part, and when we move away from the centre you can see that the colour becomes lighter.

So, this is what is indicated by this type of curve or graphic: as the colour becomes lighter, that shows the level of uncertainty. If you are confident, if you are sure that your data value is correct, that is indicated by a stronger colour. But if you are not confident about the data, then that variation or uncertainty is indicated by a lower tint of the same or similar colours.

So, this is how you have to take a decision about which graphic you would like to use.
(Refer Slide Time: 22:21)

And now, I take the earlier example, where we had collected the heights of 50 persons and this data was recorded inside the vector height.

(Refer Slide Time: 22:33)

Then, if I create the smooth scatter plot, it will look like this; you can see that because the number of data points is quite large, the plot is more concentrated, right.

(Refer Slide Time: 22:45)

And if you compare it with the earlier scatter plot: earlier we had obtained this, and now, using the new command smoothScatter, we have obtained this plot.

You can see here this point and this point are similar. And similarly, on the right hand side corner, these two points and these two points are similar. So, you can see that both these graphics give similar types of information, but in different ways. So now, before I go further, let me plot these things on the R console. First, let me copy the data on weight here, and you can see it.

(Refer Slide Time: 23:27)

So, this data is given as weight, and now I copy the same command, smoothScatter, for the data set weight.

(Refer Slide Time: 23:40)

And you will see that, as soon as I execute it, it gives me this type of graphic. Up to now, you could see that I have taken two examples where I considered univariate data and created this plot.

Now, I will take a bivariate data set, and I will show you how the information can be retrieved and how the information is presented in this smooth scatter plot.

(Refer Slide Time: 24:06)

So, you may recall that we had considered an example where we collected the data on the marks obtained by the students and the number of hours they studied every week. This data was stored in the variables marks and hours, and based on that, I will make the smooth scatter plot using the command smoothScatter with hours and marks.

(Refer Slide Time: 24:29)

And if you execute it, you will see that you get this type of curve. This is indicating that there is a sort of linear trend, and it is also giving you an idea of the maximum deviation in this data, for example like this.

You simply have to visualize the difference between the green line and the darkest part of the data, the darkest colour of the data; the darkest colour is here in the centre. This will give us information about how things are happening. So, I will show you how this is executed on the R console.

(Refer Slide Time: 25:08)

So, you can see here I already have stored the data on hours and marks because we have just used it.

(Refer Slide Time: 25:16)

And now I give here the smooth scatter plot command, and you get a graphic like this one. So, now I will stop this lecture. You have seen that we have discussed how to construct the smooth scatter plots. We have discussed two types of plots, and for each type of graphic there are different options; I am not discussing them here because they work exactly on the same lines as we have seen several times in the past.

So, my request is that you please try to experiment with them. You can take even the same data set and try to see how you can add or change the labels on the axes and the colours of the dots, how you can give different types of titles, and how you can incorporate more information by using the different options available with the command. So, practice it, and we will see you in the next lecture with some more graphics; till then, goodbye.

Descriptive Statistics with R Software
Prof. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology, Kanpur

Lecture- 27
Association of Variables - Quantile – Quantile and Three Dimensional Plots

Welcome to the next lecture on the course Descriptive Statistics with R Software. You may recall that in the earlier lectures, we started our discussion on the association of two variables, and we had discussed several types of two dimensional plots. Those plots were trying to give an idea about the direction and the degree of association between two variables.

Now, I am going to consider two dimensional plots which are used in different contexts. In this lecture, I am going to talk about the Quantile Quantile Plot, and after that I will give you a brief introduction and a brief description of the Three Dimensional Plots which are available in the R software.

Now, there is a situation where two samples have been obtained from some population. We want to know, on the basis of the given samples of data, whether the two samples have been drawn from the same population or from different populations. These types of situations are very, very useful for several statistical tools. For example, one good example is the case of testing of hypotheses. We conduct one sample tests, two sample tests and so on. In those types of tests, there is a requirement that the sample comes from a particular type of population, and those populations are characterized by some probability density function.

For example, you must have seen in the books the popular sentence: let x 1, x 2, ..., x n be a random sample from a normal population with mean μ and variance σ². Definitely, in this case we are saying that there is a population which is very big and practically unknown to us, and this population is characterized by the normal density function.

Now, I draw a small sample, say 20 observations, 30 observations or 100 observations and so on, and I would like to know whether the sample is coming from a normal population or not. This type of information is needed because tools like the t test, the z test and so on, and the different types of tests which are used in the testing of hypotheses, are constructed assuming that the population is characterized by the normal distribution.

So, unless and until this assumption is verified, the further statistical inferences will be questionable. One question I would like to address here is how to judge whether a particular given sample is coming from a normal population; and if I extend this concept, then suppose two persons have brought two samples to me, and I would like to know whether these samples are coming from the same population or from different populations.

In order to make such a comparison, one option is that I can compare the quantiles of the samples, and then I can conclude whether the two samples are coming from the same population or not. In case they are coming from the same population, I would expect the quantiles of both the samples to be the same. Similarly, when I want to test whether a sample is coming from a normal population or not, I would compare the quantiles computed on the basis of the sample with the quantiles of the normal probability density function; if they match, then I can say, yes, my sample is coming from a normal population with a certain mean and a certain variance. These types of plots are called quantile quantile plots. So, first let us understand the concept and the interpretation, and then I will show you how to use them in the R software.

(Refer Slide Time: 05:18)

So, in the case of quantile quantile plots, what do we do? We consider two variables and find their quantiles, and when the quantiles of the two variables are plotted against each other in a two dimensional plot, we get the quantile quantile plot. For example, if I have two variables X and Y, and based on them I have got two samples, say x 1, x 2, ..., x n and y 1, y 2, ..., y m, these samples may be of the same size or of different sizes.

Then I would plot the quantiles of x on the X axis and the quantiles of y on the Y axis, and I would try to see how they match; in case they match, then I would say yes, they are coming from the same population, otherwise not. It is something like this: the 25 percent quantile of x against the 25 percent quantile of y, the 40 percent quantile of x against the 40 percent quantile of y, the 70 percent quantile of x against the 70 percent quantile of y; if they match, then I can join these points and that is going to be a straight line.

And similarly, if I want to know whether a sample x 1, x 2, ..., x n is coming from a normal population with mean μ and variance σ², then in this case I would plot the quantiles of x on one axis and the quantiles of the normal distribution N(μ, σ²) on the other axis, and then I would try to see the pattern.

So, in case I do so, these types of graphics provide us a summary of whether the distributions of the two variables are similar or not with respect to location, as we plot the quantiles of the variables against each other.
(Refer Slide Time: 07:40)

In order to plot such quantiles in R software, we have the command qqplot, and inside the
arguments we give the data vectors. So, if I have two data vectors x and y, we use the
command qqplot(x, y), and this is going to give us a QQ plot of the two data sets. Similarly,
there is another command, qqnorm. This qqnorm produces a normal quantile quantile plot of the
values in the data, and in this case they are compared with the quantiles of the normal
distribution.

So, here the command is qqnorm, and inside the argument you have to give the data vector x.
Similarly, there is an option to add a line inside this normal QQ plot; this line is based on
the theoretical quantiles, which by default are normal. The quantiles being plotted can be
controlled by the probs argument. You may recall that we had used the probs argument to define
the probabilities, that is, to indicate for which probabilities the quantiles have to be
computed; you may look into the lecture on the quantiles where we had used it. So, the command
here is qqline, and inside the arguments we give the data vector. This will also plot a line
inside the QQ plot.
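
A minimal sketch of these three commands, again with simulated data (the vectors x and y are hypothetical, created only for illustration):

x <- rnorm(50)    # hypothetical data vector
y <- rnorm(50)    # second hypothetical data vector
qqplot(x, y)      # QQ plot of the two data sets against each other
qqnorm(x)         # sample quantiles of x against theoretical normal quantiles
qqline(x)         # adds the reference line to the normal QQ plot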

(Refer Slide Time: 09:44)

So, the first question is how to make interpretations from these quantile quantile plots. I will
take here different types of possible situations and, based on that, I will show you how we are
going to reach a conclusion or draw a statistical inference.

Suppose I plot the quantiles of the data on x and the data on y in a two dimensional plot; these
dots are trying to show the data. And if you try to see in this case, all this data is lying on
a straight line, and this line is made at an angle of 45 degrees from the x axis.

So, in this case you can see that all the points of the quantiles are lying on a straight line
which is drawn at an angle of 45 degrees from the x axis. This is indicating that the two samples
have a similar distribution. In practice it is not always possible to get such a 100 percent
clear straight line, but the plot will look like this. So, in this case, if I try to draw a
trend line here, it will look like this, and you can see that the points are lying nearly on the
straight line. So, we can say that, yes, the two samples are coming from two populations which
have similar distributions.

(Refer Slide Time: 11:51)

Similarly, in a further case, suppose we get a quantile plot like this one, where all the data
points are lying below the straight line and no point is lying in this upper region. So, in this
case, what I can conclude is that the y quantiles are lower than the x quantiles, and this has
the interpretation that the y values have a tendency to be lower than the x values. This
obviously indicates that the distributions from which the samples on x and y have been drawn
are not similar.

And in practice, in case you are getting data like this and suppose the trend line is passing
through like this, you can see here that most of the points are lying in the lower region below
the line and there are very few points which are lying above the line. So, in this case, in
general, I can say that the y values have a tendency to be lower than the x values and hence the
distributions are not the same.

(Refer Slide Time: 13:24)

Similarly, the opposite of this can also hold true: all the data points are lying above the line
and there is no data point on the lower side of this line. This is indicating that the x
quantiles are lower than the y quantiles, which has the interpretation that the x values have a
tendency to be lower than the values of y. And hence the two samples from x and y are not coming
from the same distribution.

(Refer Slide Time: 14:10)

Similarly, in case you are getting a QQ plot or quantile quantile plot in this way, you can see
that there are some data points on the lower side of this line, then suddenly there is a break,
and after that there are a few points towards the end which are above the line. So, in this
case, this is indicating that there is a break point up to which the y quantiles are lower than
the x quantiles, and after that point the y quantiles are higher than the x quantiles.

So, you can see here that this is the region where there is a break point: below it the quantiles
are lying in the lower direction, and in the upper part they are lying in the upper direction
from the line. So, in this case also, we can interpret that the two samples are coming from two
different populations, and those populations are not the same.

Now, I can do one thing: I can take two samples, one from x and one from y. And similarly, in
case I take one of the quantiles to be the theoretical quantile from the normal distribution,
then I can compare the quantiles of a data set with the quantiles of a normal distribution.

(Refer Slide Time: 15:45)

So, let me show you through an example, and I will make this quantile quantile plot, popularly
called the QQ plot, using this data set. This is the same example that I have used a couple of
times earlier: it is about the heights of 50 persons which are recorded in centimeters, and this
data is stored inside the data vector height.

(Refer Slide Time: 16:14)

And now, after this, I first prepare a qqnorm plot of this data set. You can see here that these
dots are indicating the quantiles, and the trend looks something like this line. You can see
that approximately it looks linear; there are some points here and here which go away from the
line, but most of the points are lying on the straight line.

So, possibly I can conclude that the quantiles indicated on the y axis, which are computed on
the basis of the given sample, and the theoretical quantiles of the normal population, which
have been computed using the PDF of the normal distribution, are nearly matching. So, one can
safely assume that this data is coming from a normal population.

(Refer Slide Time: 17:27)

Now, in case I try to add here a line using the command qqline, the command will be
qqline(height), and you can see here that this line has been added to the same quantile quantile
plot. That is helping us in comparing how much the points deviate from the line: you can see
that in the middle this deviation is less, while in the beginning and towards the end this
deviation is more, something like this. So, this will help us in concluding whether the samples
are coming from a normal population or not.
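
For reference, the complete sequence of commands for this example is sketched below; the ten values inside height are hypothetical stand-ins, since the actual 50 recorded heights are on the slide:

height <- c(168, 172, 155, 160, 178, 165, 171, 162, 158, 175)  # hypothetical heights in cm
qqnorm(height)   # sample quantiles against theoretical normal quantiles
qqline(height)   # reference line for judging the deviations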

(Refer Slide Time: 18:08)

Now, in another example, I will take the same data set on two variables which I had used
earlier. In this data set, 20 students have given their data on the marks obtained and the
number of hours they studied every week: the first row is giving the marks and the second row
is giving the number of hours studied per week, and this data is contained in the two data
vectors marks and hours.

(Refer Slide Time: 18:39)

(Refer Slide Time: 18:48)

So, when I try to make a qqplot, the command will be qqplot and, inside the arguments, the two
data vectors marks and hours. This qqplot is going to help us here: I have got two data sets,
one is the marks and another is the number of hours. Suppose the marks are coming from
population number 1 and the hours are coming from population number 2.

So, I would try to see whether these two populations are the same or not, whether population 1
and population 2 have got similar characteristics or different characteristics. In this case
also, you can see that there is a line which can pass through these points, and the angle of
this line is going to be 45 degrees. So, one can conclude that most of the points are lying on
the line or near to the line, and hence I can say that the samples are coming from similar
populations.

Here, these are the deviations that we need to look at; I am trying to create this trend line,
and you can see how the points lie and whether the trend is linear or something else. So,
looking at this data set, we can have this idea. Now, before going further, let me show you
these operations on the R software.

(Refer Slide Time: 20:30)

So, I first create the data vector height; you can see here the data which is contained in
height, and now I will plot the qqnorm of height.

(Refer Slide Time: 20:44)

And as soon as I press enter, you can see that I am getting the same plot which I have shown
you on the slides. If I now use qqline, meaning I would like to add a trend line to this QQ
plot, you can see that we have this data, this is the QQ plot, and we have got here this line;
try to follow the cursor.

(Refer Slide Time: 21:11)

Next, I will make a qqplot with marks and hours. I have already stored the data in marks and
hours like this, and if you make a qqplot between marks and hours, you will get a QQ plot like
this one.

(Refer Slide Time: 21:31)

And also, please have a look: if I change the order of the variables, say instead of
qqplot(marks, hours) I write qqplot(hours, marks), you can see that the axes will simply
interchange, but the information and the conclusion we are going to draw remain the same. Now,
let us come to the next topic.
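
A small sketch of this comparison, with hypothetical marks and hours vectors standing in for the actual classroom data:

marks <- c(68, 75, 52, 80, 91, 60, 77, 85, 49, 70)  # hypothetical marks of 10 students
hours <- c(10, 12, 6, 14, 18, 8, 13, 16, 5, 11)     # hypothetical weekly study hours
qqplot(marks, hours)   # quantiles of marks on the x axis, hours on the y axis
qqplot(hours, marks)   # same information with the axes interchanged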

Now, I am going to discuss briefly how to create three dimensional plots. In R software we have
a facility to create several types of three dimensional plots; well, it is not possible for me
to give the details of all the plots, but surely I will show you how to create these plots and
how to get started, and I will give you examples of how the different types of 3D graphics are
made.

So, one question you may ask is: in what type of situation are these three dimensional plots
useful? Whenever we are dealing with multivariate data and we want to study the interdependence
of the variables on each other, then in that case we would try to make such plots. For example,
taking the example in my slides, we know that for children, height, weight and age all change
with time: as the age increases, the height also increases and the weight also increases, and
so on. So, how to explore this type of interdependence? For that we will create here the
scatter plot.

(Refer Slide Time: 23:36)

So, now I am going to take some examples and I will show you the commands with those examples.
The first plot which I am going to consider here is the scatter plot, that is, a three
dimensional scatter plot, and this is created by the command scatterplot3d, that is,
s c a t t e r p l o t 3 d, all in small letters and 3 as a number.

And inside the arguments I have to give the data vectors for which I need to create this plot,
and this command will plot a three dimensional point cloud of the data x, y and z. But for this
I need a special package; it is not included in the base package of R, and the package which is
needed here is scatterplot3d.

So, first you need to install the package using the command install.packages and, inside the
arguments within double quotes, type scatterplot3d; after installing it, you need to load it by
using the command library(scatterplot3d). You already know how to do this; otherwise you can
simply use this command on the R console and install it.
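
In code, the two steps look like this: a one-time installation followed by loading the package in the current session:

install.packages("scatterplot3d")   # download and install the package once
library(scatterplot3d)              # load the package for the current session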

(Refer Slide Time: 24:52)

Now, I will take the example which I just discussed: we have taken the data on 5 persons for
their height, weight and age, and we would like to create a three dimensional plot for this
data set. Well, I am taking only 5 data values because I want to show you how the picture will
look, so that you can see inside the picture how these values appear. So, person number 1 has a
height of 100 centimeters, a weight of 30 kilograms and an age of 10 years, and so on we have
this data set.

(Refer Slide Time: 25:34)

Now, I have copied this data set into three vectors: height, weight and age, like this. And
before that, I have installed the package scatterplot3d and loaded it on the R console.
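
A minimal sketch of this setup follows. Only the first person's values (100 cm, 30 kg, 10 years) are given above, so the remaining four triples here are hypothetical:

height <- c(100, 110, 120, 127, 135)  # heights in cm; values after the first are hypothetical
weight <- c(30, 33, 37, 41, 46)       # weights in kg; values after the first are hypothetical
age    <- c(10, 11, 12, 13, 14)       # ages in years; values after the first are hypothetical
library(scatterplot3d)                # the package must already be installed
scatterplot3d(height, weight, age)    # three dimensional point cloud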

(Refer Slide Time: 25:52)

So, now in case you use the command scatterplot3d with height, weight and age inside the
arguments, you will get this type of picture. You can see here these dots, which are indicating
the values 1, 2, 3, 4 and 5. And this is a sort of cube or cuboid, and the graph is giving us
information about how the points are lying inside that cuboid, right.

So, by looking at such graphs you can have an idea of what is happening. It is also possible to
create surfaces, which are called surface plots, and they will give you an idea of how the
variation in the data is happening, or how the data is behaving, by looking at these
observations.

Now, in this type of plot there are various options by which you can draw more meaningful
inferences. For example, suppose I want to change the direction; by direction I mean that, as
you can see here, on one axis we have age, on another axis we have height, and on another axis
we have weight.

(Refer Slide Time: 27:22)

Now, I will change the direction of this cuboid, so I now take weight on this side, height on
the x axis and age on the other side. Then again I use the same command scatterplot3d with
height, weight and age, but now I am giving an option angle = 120. So, by giving the option
angle, I can control how much the cube or cuboid has to be turned or rotated.

So, the earlier picture is now rotated at an angle of 120 degrees, and you can see here that
these are my points; one can now see a sort of curve, whereas earlier it was showing as if there
is a straight line. So, by making different cuboids with changing angles, you can have an idea
of what is really happening.

(Refer Slide Time: 28:23)

And similarly, in case I want to change the color of the points, I can use here the option
color =; for example, here I have used red inside the double quotes, and you can see that the
color of these points now comes out to be red. This option can also be combined with the angle
option. Before I go further, let me show you this on the R console.
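
Both options together, as a sketch (assuming the height, weight and age vectors and the scatterplot3d package from above):

scatterplot3d(height, weight, age, angle = 120)                  # rotate the cuboid
scatterplot3d(height, weight, age, angle = 120, color = "red")   # rotation plus red points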

(Refer Slide Time: 28:58)

So, I will collect here the data: this is my data on height, this is my data on weight and this
is the data on age. And now I need to first load the library; I have already installed this
package on my computer, but you can do it yourself, right. And now, if I use this command to
create the scatter plot of height, weight and age, you can see here that I get this type of
picture.

(Refer Slide Time: 29:35)

(Refer Slide Time: 29:41)

And if I add here, say, angle = 120, this picture changes; well, I can show you both things
together. Suppose now I make the angle not 120, but say 150.

(Refer Slide Time: 30:07)

So, you can see here that these points change; this direction is now changed. Similarly, if you
want to add the color (color and angle can also be combined), say color = "red". You can see
that this now gives me red points, and suppose I make the color blue, now the points are blue.

So, by looking at these types of patterns, you can have more information. One good exercise
would be to write a program in which the angle changes continuously, say by 1 degree, 2 degrees
and so on. Then you will have a picture which is continuously rotating, and you can have a three
dimensional view, which is possible in R just by writing a small function, right. A sketch of
such a loop is given below.
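
This is a minimal sketch of such a rotating view, assuming the height, weight and age vectors and the scatterplot3d package from above; the step size and the pause are arbitrary choices:

for (a in seq(0, 360, by = 5)) {            # sweep the viewing angle in steps of 5 degrees
  scatterplot3d(height, weight, age, angle = a)
  Sys.sleep(0.1)                            # short pause so that the rotation is visible
}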

(Refer Slide Time: 31:11)

Similar to this three dimensional plot, we have some other types of graphics which I am not
going to discuss here, but I am just informing you about them. One is the contour plot, which
gives you the plot with contour lines; we have the dot chart; we have image plots, which
produce a picture with colors as the third dimension; and we have the mosaic plot, which is for
multi dimensional diagrams, particularly in the case of categorical variables or contingency
tables that we are going to use later on. And there is another one, the perspective plot, which
is obtained by the command persp, and in this case you get surfaces over the x-y plane.

(Refer Slide Time: 31:57)

So, I will simply take an example here; although I am not going to discuss it in detail, I will
show you how the perspective plot is created, how it looks and what the advantage is. For
example, I have created the data x as a sequence between minus 10 and 10 with 30 observations,
and then y is the same as x.

Then I have created a function to compute the value $f(x, y) = 10\sin(r)/r$, where
$r = \sqrt{x^2 + y^2}$. Then I obtain the z matrix as outer(x, y, f), I use a logical operation
to handle the undefined value at $r = 0$, and then I give the other parameters.

(Refer Slide Time: 32:42)

And then I create the plot with persp(x, y, z) together with some other parameters: theta equal
to 35, phi equal to 30, expand equal to 0.5 and color equal to lightblue; the graph will look
like this.
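
Putting the pieces together, the commands described above can be sketched as follows; this follows the well-known persp demonstration, with the parameter values as read from the slide:

x <- seq(-10, 10, length = 30)    # 30 points between -10 and 10
y <- x                            # the same grid for y
f <- function(x, y) { r <- sqrt(x^2 + y^2); 10 * sin(r) / r }  # the function to be plotted
z <- outer(x, y, f)               # matrix of z values over the (x, y) grid
z[is.na(z)] <- 1                  # handle the undefined 0/0 value at r = 0
persp(x, y, z, theta = 35, phi = 30, expand = 0.5, col = "lightblue")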

(Refer Slide Time: 32:55)

And similarly, in case you add some more options here, specifying the tick type, shades and so
on, you can obtain a different perspective plot. I will simply show it on the R console so that
you are confident that these things are possible. So, I will simply copy all these commands at
the same time.

(Refer Slide Time: 33:22)

I will remove this, and you can see that I have simply copied these commands over here; now I am
going to plot this surface, and when I execute this command, I get this type of plot. Similarly,
if I use the other command shown here, as soon as I execute it the color changes, there are
shades, and it is more informative.

Now, I would like to stop here with the graphical tools for studying the association between two
variables or more than two variables. In the given time frame, I have taken some representative
or important types of plots, but this does not mean that these are the only plots. There are
many other plots, and you have seen that in case you want to make your graphics more
informative, the simple rule is: take help from the R software about the syntax and commands,
see what type of information they can give you, use those options inside the arguments, and
control your graphic in the way you want.

And now you can see that with these one dimensional, two dimensional and multi dimensional
graphics, you can produce very good graphics of the kind people try to create with expensive
software; the only thing is that here you have to study a little bit and understand. But the
advantage is that you can control each and every parameter and each and every characteristic of
the graph, whereas in the case of built-in packages you do not have many options.

So, in case you spend some time and try to learn more, you will become successful in creating
good graphics, which will give you lots of hidden information contained inside the data. In the
next lecture, I will develop some tools so that we can get such information in a quantitative
way. So, practice with these graphics: take some examples, try to create graphics, experiment
with them, take different combinations of the values of the parameters and see what you get. I
will see you in the next lecture; till then, good bye.

Descriptive Statistics with R Software
Prof. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology, Kanpur

Lecture – 28
Association of Variables – Correlation Coefficient

Welcome to the lecture on the course Descriptive Statistics with R Software. You may
recall that in the earlier lectures we started a discussion on the aspect of association
between two variables and more than two variables and we have two types of tools-
graphical tools and analytical tools. In the last couple of lectures, we have discussed the
graphical tools, to study the association between or among the variables. Now the next
question is how to quantify that association? For example, we have seen in graphics that
there can be an association that is strong or that is weak. But now, how do we quantify this:
what value represents a strong and what value represents a weak association?

So, in order to do so we have a concept of correlation coefficient. And this correlation


coefficient is a measure of the degree of linear relationship between two variables. But when I
say the relationship between two variables, these variables can be of different types: they can
be continuous, they can be discrete, they can be categorical variables, and so on.
different tools depending on the nature of variables. We have the concept of correlation
coefficient, we have the concept of rank correlation we have the concept of contingency
tables and so on.

So, we will try to do it in the next couple of lectures. So, in this lecture we are going to
discuss the correlation coefficient, and I will be giving you here only the concept and theory
of the correlation coefficient: what is the formula, what is the structure, how you use it, and
what are the interpretations in different situations. In the next lecture I will show you how to
compute it in the R software and how to interpret the graphics and numerical values together.
So, let us start our discussion.

So, we have already understood what is called the association between or among variables. Now I
will take several examples to explain how you would differentiate the association between
continuous variables, between discrete variables, and so on. If I take an example, we all know
that as the number of hours of study increases, the students usually obtain more marks. This
number of hours can be expressed in hours, minutes and so on; it is essentially time, and marks
can also be measured on a continuous scale: they can be 70 or 70.50 also.

So, in this case you can see that there is a relationship between the time spent on studies and
the marks obtained in the examination, and we would like to see whether this association is
increasing or decreasing and what the strength of the relationship is. Similarly, if I take
another example, we know that during summer, when the weather temperature is high, the
consumption of electricity increases because people are using air coolers, air conditioners and
so on.

So, one can feel that as the weather temperature increases, the consumption of electricity also
increases. But now we would like to verify it on the basis of a given sample of data: we would
try to quantify the degree of association, we would like to see how to judge whether the
association is strong, moderate or weak, whether the trend is increasing or decreasing, and how
to quantify it. Similarly, if I take another example, we know that for small babies and
children, as their age increases, their weight also increases up to a certain age; after a
certain age, the height and weight both stabilize.

So, I can say that as the age of those babies increases, their height and also their weight
increase. So, there is an association between the two variables, and in these cases the
variables are continuous.

(Refer Slide Time: 05:26)

So, we have noticed that when the number of hours of study increases, the marks obtained by the
students also increase; that is, the number of hours of study affects the marks obtained in an
examination. Similarly, the power or electricity consumption increases when the weather
temperature increases, and the weight of infants and small children increases as their height
increases under normal circumstances. In these cases, you can observe that the two variables
are continuous in nature. So, the question is how to quantify this association.

(Refer Slide Time: 06:04)

Similarly, let me take another example; in this example I will consider variables which are
discrete, whose values are obtained only at points, right. For example, suppose I want to know
whether, in a college, male students prefer mathematics more than female students do, or whether
male students prefer biology over mathematics. In this case, what we have to do is simply count
the number of male and the number of female students who prefer each subject.

So, here the numbers are going to be the observations on a discrete variable. Why is it called
discrete? Because the number of students can only be an integer: there can be 5 students or 6
students, but there cannot be 5.50 students. So, in this case we would try to see what the
nature of association is between gender and subject.

Similarly, in case some vaccine or medicine is given to some patients, then we would try to see
how many patients are getting affected. If there is a significant number of patients who are
affected by the medicine, then one can conclude that, yes, the effect of the medicine and the
number of patients are associated, and we would like to see what the nature of the association
is in this case of discrete variables.

So, whenever we have discrete or, say, counting variables, we would like to know, for example,
whether male students prefer mathematics over female students or not; and in another example, we
want to know whether the vaccine given to the diseased persons was effective or not. These
observations are based on the counting of the two variables, and in this case the variables are
discrete in nature, and their values are obtained only as numbers.

(Refer Slide Time: 08:13)

Similarly, there can be a third situation. For example, in a viva or in a fashion show, the
candidate or the model appears before the interviewers or judges. In the case of a fashion
show, for example, the model comes on the stage and there is a group of people who try to judge
the performance, and based on that they give marks.

Now, what do we expect? We expect that if a model is good, then all the judges will give high
scores, and in case the model is bad, then all the judges will give low scores. But it is
possible in real life that whenever a model comes, there will be a certain number of judges who
give higher scores and a certain number of judges who give lower scores.

So obviously, we would like to see what the correlation, or the nature of association, is
between the ranks given by two judges. How do we obtain the ranks? The marks given by the judges
are converted into ranks, and finally we would be interested in knowing the nature of
association of those ranks, not of their original values. So, in this case, we have the concept
of the rank correlation coefficient. Now we are going to consider different types of situations.

So, in the case of ranked observations, there can be a situation, for example, where two judges
give ranks to fashion models; or there is another example where a person has cooked food and two
persons give ranks to the food they tasted, or their scores are ranked. In these cases, the
observations are obtained as the ranks of two variables; the two judges or two persons are those
two variables. So, we have different types of situations, and those situations are described by
the nature and behavior of the variables.

So, my objective now is to consider the nature of the variables one at a time and to show you
what the different measures are, how to interpret them and how to compute them. In this lecture,
first I am going to consider the case where the variables are continuous, for which we have the
concept of the correlation coefficient; after that, when we go to ranked data, we will discuss
the rank correlation coefficient.

And when we have counting variables, we will discuss different types of coefficients like the
contingency coefficient, the chi-square coefficient and so on. So, now we start our discussion
where we have two continuous variables and we want to study the association between them.

(Refer Slide Time: 11:41)

What is the meaning of association? The association can be linear or non-linear. How do you know
whether the relationship is linear or not? We have learnt that we can plot the data in a scatter
diagram: in case the data looks like this, we say that there is a linear trend, and if the trend
looks like this, we say that the trend is not linear, right.

So, now our basic framework is that we have two variables, capital X and capital Y, measured on
a continuous scale, and both these variables are linearly related. Keep in mind that we are now
talking about a relationship which is linear in nature. We know that if the relationship is
linear, it can be expressed in a mathematical format by the equation of a line, Y = a + bX,
where a and b are some unknown constant values. For example, the equation of a line in the
standard form is presented as y = mx + c.

So, here a represents c, which is the intercept term, and b represents the value of m, that is,
the slope of the line. Now you can see that Y and X are related and will have some degree of
association. How do we study this degree of association? For that, we have a tool which is
called correlation. So, correlation is a statistical tool to study the linear relationship
between two variables, right.

(Refer Slide Time: 13:32)

So, I can say that two variables are said to be correlated if a change in one variable results
in a corresponding change in the other variable. What does this mean? For example, we have taken
the example of marks versus the time spent on studies.

So, we know that when students increase their study time, the marks obtained in the examination
will also change; here, the change in one variable is causing a change in the other variable.
Similarly, in the case of the height and weight of small children, suppose height and weight
are my two variables: when the height increases, usually the weight also increases, and when
the weight changes, the height also changes. So, a change in weight also causes a change in
height.

So, this is what we are trying to say, and in these cases we can say that the two variables are
correlated. If you look at how this word arises: correlated is co-related. When we say that two
variables are correlated, we are saying that a change in the value of one variable is causing a
change in the other variable.

Now, this change can be positive or negative. In simple words, if the values of X increase and
the values of Y also increase, it is a positive relationship; and if the opposite happens, that
is, the values of X increase but the values of Y decrease, then it is a negative relationship.

So, based on that, we have definitions of positive correlation and negative correlation. We can
say that if two variables deviate in the same direction, that is, an increase (or, equivalently,
a decrease) in one variable results in a corresponding increase (or decrease) in the other
variable, then the correlation is said to be positive, or the variables are said to be
positively correlated.

So, what will happen in this case? Suppose I make the scatter plot: if the Y value is here and I
increase the value of X, then Y will also be somewhere here; if I increase it more, it will be
here, like this. So, we will have a graph like this.

The trend line will go like this, and in this case I can say that the observations on X and Y
are positively correlated and the nature of the correlation is positive. So, what is happening
here is that if the value of X is increasing, then the value of Y is also increasing. The
opposite situation is when the values of X are increasing but the values of Y are decreasing;
and a third situation is where the values of X are increasing and something is happening in the
Y values, but the nature of the change is not clear.

(Refer Slide Time: 17:25)

So, the next case is when two variables deviate in opposite directions. What does this mean? As
one variable increases, the other variable decreases, or vice versa; in this case, the
correlation is said to be negative and the variables are said to be negatively correlated. So,
what happens in this case? Suppose I plot the data: I have a value of x, say $x_1$, and
somewhere here is $y_1$; then I increase the value of x to $x_2$, and the value $y_2$ comes
here, which is lower than $y_1$. Similarly, if I take another value $x_3$, then the
corresponding value $y_3$ comes over here.

So, in this case you can see that as the values of x increase, the values of y decrease, and
there is a negative trend, right. Similarly, if there are two variables and one variable changes
while the other variable remains constant on average, that is, there is very little or no change
in the other variable, then the variables are said to be independent, or to have no correlation.

For example, suppose I take the value $x_1$ and the corresponding value comes out to be $y_1$
somewhere here; then I take $x_2$ and the value comes out to be $y_2$; then $x_3$, and the value
comes out here, and so on, with some more values.

So, there is no clear-cut trend in the data. This indicates that when we change the value of x,
there is practically no change in the value of y; in this case, the variables are said to be
independent of each other and do not affect each other, and we say that they have no
correlation, or zero correlation.

(Refer Slide Time: 19:26)

Now, let us try to represent these situations. Suppose I take the observations on two variables
x and y and we make a scatter plot; you can see in figure 1 that the observations are like this,
and a trend line can be fitted which passes through like this.

So, in this case, when the values of x are increasing, the values of y are also increasing, and
so we say that the relationship is positive and there is a positive correlation. Similarly, in
figure 2, as the values of x increase, the values of y decrease; the observations go like this,
and in this case the trend line can be fitted like this.

So, in this case we would say that x and y are negatively correlated. And similarly, suppose we
change the value of x and the values of x are increasing, but there is no change or no trend in
the y values; we do not know how the y values are going to behave, they remain constant on
average, and there is no change when we change the value of x.

It is very difficult to say anything in this case; we say that x and y are independent of each
other, and that the correlation between x and y is zero, or x and y have no correlation.

(Refer Slide Time: 21:07)

Similarly, let me take two situations having a positive relationship, in figure A and figure B;
if you look, the trend line will look like this. In figure A, you can see that the points are
lying closer to the line than in the case of figure B, right.

I am assuming that the scales of both figures are the same, since otherwise there can be
confusion. So, when the points are lying close to the line, we say that there is a strong
relationship, and in this case the relationship is positive. So, we say that there is a strong
positive relationship between X and Y.

And similarly, in figure B there is a positive relationship, but this relationship is not as
strong as in the case of figure A. So, we call it a positive correlation, but a moderate one.
Now we have understood the concept of correlation; next we need to define a quantity which can
measure it, and for this we have the definition of the correlation coefficient, which is based
on the concepts of variance and covariance. What variance is, we have already discussed; but
what is covariance?

You can recall that in the case of variance, what we did was measure the variability of the
observations around the mean value, say the arithmetic mean. Now, if there are two variables and
both variables affect each other, that is, they are interrelated, then when the value of one
variable changes, the value of the other variable also changes.

So, there is a sort of co-variation between the two values. Similar to the concept of variance,
we have the concept of covariance: as variance measures the variability of a variable,
covariance measures the co-variability of two variables, right.

(Refer Slide Time: 23:50)

Now, we are going to address the question of how to quantitatively measure this degree of linear
relationship; for that, we are going to use the concept of the correlation coefficient, which is
based on the concepts of covariance and variance. First we address: what is covariance?

(Refer Slide Time: 24:07)

So, as we have discussed, this covariance is very similar to the concept of variance. When there
is only one variable, its variation exists and is measured by the variance; when there are two
(or even more than two) variables, their individual variations exist, that is, each of the two
variables has its own variance. Besides their own variances, they will also have a co-variation,
obviously only if they affect each other; if they are independent, then there is no concept of
co-variation.

(Refer Slide Time: 24:59)

So, the question now is how to quantify and how to measure it. But before going further, you may
recall that we had defined the variance of a variable x as
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. What we try to do here is write a similar
function in which, as i goes from 1 to n, one factor $(x_i - \bar{x})$ is kept and the other
$(x_i - \bar{x})$ is replaced by $(y_i - \bar{y})$.

And this will be the sort of quantity that measures the co-variation between x and y. We assume
that there are two variables, represented as X and Y, and obviously we are assuming that these
variables are related or correlated. Now, we have obtained n pairs of observations on these two
variables, and these observations are expressed as
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

So, these are numerical values, and we have already understood, while discussing the graphical
techniques, how to obtain such observations. Now, the covariance between the variables x and y
based on the sample observations is defined as
$$\text{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$

So, you can see that the $(x_i - \bar{x})$ are the deviations in the $x_i$'s and the
$(y_i - \bar{y})$ are the deviations in the $y_i$'s. We are taking the cross product of the
deviations in x and the deviations in y, and we are finding the average of those cross products
of deviations. Here $\bar{x}$ and $\bar{y}$ are the sample means of x and y, defined as
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, right.

And I have given you here the definition of covariance in the case of ungrouped data. Similarly,
in case you want it for grouped data, then a similar definition can be given as
$$\text{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{k} f_i (x_i - \bar{x})(y_i - \bar{y}).$$

But you have to remember that here these symbols and notations have a different interpretation:
now the $x_i$'s and $y_i$'s indicate the mid values of the class intervals, the original data
having been grouped into k groups with frequencies $f_i$, and so on. But anyway, I will consider
here only one case.

(Refer Slide Time: 28:18)

Now, the next question is how to compute this covariance on the basis of a given set of data in
the R software. If x and y are the two data vectors, then the command to compute the covariance
is cov, all in small letters, and inside the arguments we give the data vectors, as cov(x, y).
But here you have to remember one thing: this command cov in the R software will give us the
value of covariance with the divisor n minus 1.

So, in case you want to find out the covariance between x and y having the divisor n, what you
need to do is multiply the value returned by R by $(n-1)/n$; that is,
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{n-1}{n}\,\text{cov}(x, y),$$
where cov(x, y) is the R command.
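
As a sketch with hypothetical data vectors:

x <- c(2, 4, 6, 8, 10)                 # hypothetical data
y <- c(1, 3, 2, 7, 9)                  # hypothetical data
n <- length(x)
cov(x, y)                              # covariance with divisor n - 1 (R default)
((n - 1) / n) * cov(x, y)              # covariance with divisor n
mean((x - mean(x)) * (y - mean(y)))    # the same value, computed from the definition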

And this is the same story that we discussed in the case of variance: the variance was defined
in two ways, with divisor n and with divisor n minus 1, and we had discussed that when we have
the divisor n minus 1, we get an unbiased estimator of the population variance. The same story
continues here: when we take the divisor n minus 1, the result is an unbiased estimator of the
population covariance; but anyway, I am not going into the details of estimation and statistical
inference.

But this is for your information, so that in case you really want to compute a particular type
of covariance, with divisor n or n minus 1, you should at least know how to do it, and you
should also know what R is giving you. Anyway, I will not take an example here to show the
computation on the R console; that I will show you in the next lecture when computing the
correlation coefficient. Now I come to the definition of the correlation coefficient.

(Refer Slide Time: 30:47)

This coefficient of correlation is also called the Karl Pearson coefficient of correlation;
there is a reason, because Karl Pearson constructed the coefficient. This correlation
coefficient is denoted by r, equivalently written as $r_{xy}$, indicating that r is a function
of the two variables x and y, and it is defined as
$$r_{xy} = \frac{\text{cov}(x, y)}{\sqrt{\text{var}(x)\,\text{var}(y)}}.$$
So, essentially, this is the covariance between x and y divided by the standard deviation of x
and the standard deviation of y.

So, that is what I said: in order to define the coefficient of correlation, we need the concepts
of covariance and variance. We now know that the covariance is written as
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$; the quantities in the denominator are
the variance of x and the variance of y, and we take the square root of their product. So, this
is the expression for the correlation.

So, this can be written out in full if you want to make it more clear:
$$r_{xy} = \frac{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}.$$
And if you try to further simplify it, then the numerator will become
$\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y}$, and in the denominator
$\sum_{i=1}^{n}(x_i - \bar{x})^2$ will become $\sum_{i=1}^{n} x_i^2 - n\bar{x}^2$ and, similarly,
$\sum_{i=1}^{n}(y_i - \bar{y})^2$ will become $\sum_{i=1}^{n} y_i^2 - n\bar{y}^2$; you may recall
that we had solved these expressions when we discussed the concept of variability and the
concept of variance.
Similarly, let us see how the expression involved in the definition of covariance, namely
$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$, simplifies to
$\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y}$. It can be written as
$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + n\,\bar{x}\bar{y}.$$
And now, if you try to observe what is happening: what is the quantity $\sum_{i=1}^{n} x_i$? It
is nothing but $n\bar{x}$, and $\sum_{i=1}^{n} y_i$ is $n\bar{y}$. So, the two middle terms each
become $-n\,\bar{x}\bar{y}$, and we can write
$$\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y} - n\,\bar{x}\bar{y} + n\,\bar{x}\bar{y} = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y},$$
which is the same quantity we obtained above, right.

(Refer Slide Time: 34:48)

So, having understood the basic definition and the mathematical form of the coefficient of
correlation, let us try to understand what it is doing and what the different interpretations
are. Essentially, r is measuring the degree of linear relationship; "linear" is a very important
term here. You always have to keep in mind that the correlation coefficient can be used only to
measure the degree of linear relationship. In my experience, I have seen that many people use
the correlation coefficient blindly and even try to use it to measure the degree of a non-linear
relationship; this is actually wrong.

And if you look at the mathematical form of this correlation coefficient, it is only a
mathematical function, a formula: whenever you give it some values of x and y, it will give you
some numerical value. But the interpretation of those values will be wrong, and they will not
indicate the information contained inside the data.

So, this is my humble request to all of you that please use this correlation coefficient
only to measure the degree of linear relationship and in order to do so, first use the
scatter plot, try to see whether the relationship is linear or not that can be increasing or
decreasing whatever you want. But the trend has to be linear and only then one should
use the definition of correlation coefficient.
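
As a small illustration of this advice, one can first look at the scatter plot and then compute r from its definition; R's built-in function cor() gives the same value. The data here are hypothetical:

x <- c(2, 4, 6, 8, 10)               # hypothetical data
y <- c(3, 7, 8, 12, 15)              # hypothetical data
plot(x, y)                           # check first that the trend is linear
cov(x, y) / sqrt(var(x) * var(y))    # correlation coefficient from the definition
cor(x, y)                            # built-in command, same value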

This correlation coefficient is also called the Bravais Pearson correlation coefficient and also
the product moment correlation coefficient. Why is it called the Bravais Pearson correlation
coefficient? Actually, Professor Karl Pearson presented the first mathematically rigorous
treatment of correlation, and he acknowledged Professor Auguste Bravais, because Professor
Bravais had made an initial mathematical contribution by giving the mathematical formula for the
correlation coefficient.

So, that is the reason it is sometimes called the Bravais Pearson correlation coefficient, and
it is also called the product moment correlation coefficient.
Why is this called the product moment correlation coefficient? You might recall that we had
learned the definition of the rth central moment, which was given as
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$.

So, this was a sort of arithmetic mean of the rth power of the deviations in the values of x.
This was valid only when we have one variable; but now suppose I have two variables. What I can
do is the following: I consider the deviations of x and the deviations of y, take the rth power
of the deviations of the $x_i$'s and the sth power of the deviations of the $y_i$'s, and then
find the arithmetic mean of the product of these deviations. This is denoted as
$\mu_{rs} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r (y_i - \bar{y})^s$ and is called the
(r, s)th product moment.

So, that is why this is called the product moment correlation coefficient: in case you
substitute r equal to 1 and s equal to 1, this gives you the value
$\mu_{11} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$, which is your covariance
between x and y.

(Refer Slide Time: 39:12)

Now, we discuss the magnitude and sign of the correlation coefficient and their interpretation.
The value of the correlation coefficient lies between minus 1 and plus 1; well, I am not giving
you the mathematical proof here. So, $-1 \le r \le 1$, with the values minus 1 and plus 1
inclusive. What is the interpretation? You can see here: this is minus 1, this is plus 1, and
this is zero; these are the limits of r.

So, what happens if r is negative, lying on this side, and what happens if r is positive, lying
on this side? When we compute the value of r on the basis of a given set of data and it comes
out to be positive, this indicates that there is a positive association between x and y, and
hence x and y are positively correlated. Similarly, if the value of r comes out to be negative,
this indicates that there is a negative association between the two variables, and hence x and y
are negatively correlated.

Similarly, if r is equal to zero; that means, if you compute the value of the correlation
coefficient and it comes out to be zero (well, zero is a theoretical value, but even if it is
very close to zero), then this indicates that there is no association between X and Y, and hence
X and Y are uncorrelated.

(Refer Slide Time: 40:54)

So, we have seen that the value of r has two components: one is the sign and the other is the
magnitude. The sign of the correlation coefficient indicates the nature of the association, that
is, whether the relationship has an increasing trend or a decreasing trend. The positive sign of
r indicates that there is a positive correlation; this means that as the values of one variable
increase, the values of the other variable also increase, and similarly, if the values of one
variable decrease, the values of the other variable also decrease.

So, a plus sign of r gives us the information that the relationship is positive, and the degree
of linear relationship is the magnitude of r. Similarly, if r has a negative sign, the negative
sign indicates a negative correlation: as the values of one variable increase, the values of the
other variable decrease, and similarly, if the values of one variable decrease, the values of
the other variable increase.

(Refer Slide Time: 42:33)

And what about the magnitude of r? The magnitude of r indicates the degree of linear
relationship. We have seen that r lies between minus 1 and plus 1, so there are two extremes,
minus 1 and plus 1, and one value in the middle. When we say r is equal to 1, this indicates a
perfect linear relationship; what does this mean? If I plot the scatter plot between x and y,
then all the points are lying exactly on the same line.

So, if we draw a line on this graphic, it will look like this: there is a 100 percent perfect
relationship and all the values lie exactly on the line. Similarly, r equal to 0 indicates that
there is no correlation, zero correlation; and any other value of r between 0 and 1 in magnitude
indicates the degree of linear relationship: the higher the magnitude of r, the higher the
degree of linear relationship, and this relationship can be positive or negative.

So, when r is equal to plus 1, this indicates a perfect linear and increasing relationship
between X and Y, and when r is equal to minus 1, this indicates a perfect linear and decreasing
relationship between X and Y. For example, with x here and y here, in the case of a decreasing
relationship the points will look like this, and if you draw a trend line, it will be a perfect
straight line. So, this is the case of a decreasing relationship, and the one above was the case
of an increasing relationship, right.
(Refer Slide Time: 44:32)

And now I will simply show you these things graphically, so that you can see what this actually
means; you simply have to keep your attention on my pen, right. First look at this figure:
inside all these figures, this axis is indicating the values of X and this one the values of Y.

So, we can see in this picture that as the values of X increase, the values of Y decrease, like
this. And if you create a trend line, it will be something like this, and you can see that all
the points are not lying exactly on the same line.

So, in this case the sign of the correlation coefficient will be negative, and definitely the
magnitude is not 1; but most of the points are lying very close to the line, so the magnitude of
the correlation coefficient can be close to 0.90. Similarly, in the next figure, you can see
that when we increase the value of X, the values of Y decrease, and the trend line will be like
this. Now, if you compare this picture, most of the points are lying close to the line, but
these points are not as close as in figure 1.

So, if you look at the value of r here, it is minus 0.50: the negative sign indicates a
decreasing linear relationship, and the magnitude here is 0.50, well, very close to half, which
is lower than the value of 0.90 as in the earlier case. Now, in the third case, you can see that
as the values of X increase, the values of Y also change, but there is no clear relationship.
So, this is the case of zero correlation or no correlation, and it is represented here by r
equal to 0.00. The first two cases in the above panel were indicating a negative relationship.

Now, consider the lower panel, where we have the increasing trend; you can see that in all these
panels there is an increasing trend in the data. In the first picture, as the values of X
increase, the values of Y increase, and if you draw a trend line, it is like this; but
definitely the points are not so close to the line, so the value of r is plus 0.50 here.

The positive sign is indicating the increasing relationship, and 0.5 is the magnitude of r,
which indicates that there is a linear relationship but, obviously, all the points are not lying
exactly on the line. Similarly, if the value of the correlation coefficient increases to r equal
to 0.90, the sign is positive in this picture, and as the values of X increase, the values of Y
also increase; the trend line will be like this. If you compare the first two pictures, you can
see that in this region the points are lying closer to the line in comparison to the points in
the first picture.

So, that is why this difference is indicated by the magnitude of the correlation coefficient,
which goes from 0.50 to 0.90. Finally, in the last picture you can see that as the values
of X are increasing, the values of Y are also increasing, and all the points are lying exactly on the
same line. So, this is the case of a perfect increasing linear relationship, and this is
indicated by the value of r equal to plus 1.

So, this is how we try to get the information about the magnitude and direction of the
relationship by looking at the scatter diagram. So, there are six diagrams I have
presented here, and they will give you a fairly good idea of how things are going
to be done. Now I will try to address a very important and very interesting
observation.

(Refer Slide Time: 49:40)

You see, whenever the value of r is close to zero, or say r equal to zero, this may indicate
two things; there can be two types of interpretations: either the variables are independent,
or the relationship is non-linear. Why? Because the correlation coefficient is only
indicating the degree of relationship when it is linear.

So, now what happens if the relationship between X and Y is non-linear? In this case the
degree of linear relationship computed by the correlation coefficient may be low, and so
the value of r is close to zero, and this will be indicating as if the variables are
independent, but that is not correct, because there exists a non-linear relationship. So, in this
case r is close to zero even if the variables are clearly not independent. For example, if
I say there is a trend like this one, you can see here that the relationship is very, very
clear, there is a sort of sine curve, but the correlation coefficient in this case will give you
a value close to zero.
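
As a quick check of this point, here is a small R sketch (not from the lecture; the data are generated purely for illustration) where y is completely determined by x through one arch of the sine curve, yet the correlation coefficient comes out as essentially zero:

x <- seq(0, pi, length.out = 100)   # one arch of the sine curve
y <- sin(x)                         # exact non-linear dependence, no noise at all
plot(x, y)                          # the curved trend is clearly visible
cor(x, y)                           # yet r is essentially 0 (zero up to floating point)

The rise on the left half and the fall on the right half cancel out in the covariance, which is exactly why r fails to detect this non-linear relationship.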

So, be careful with these types of interpretation and remember that when X and Y are
independent, then the correlation coefficient between X and Y will be equal to 0, but the
converse is not true. So, this is a very important point for you to keep in mind: what is
the meaning of r equal to zero?

(Refer Slide Time: 51:24)

Similarly, another property: the correlation coefficient is symmetric; that means, the correlation
coefficient between X and Y is the same as the correlation coefficient between Y and X.
What does this mean? That if somebody finds the correlation coefficient between height
and weight, and another person finds the correlation coefficient between weight and
height, then both are going to be the same.
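
A two-line R sketch makes this property concrete; the height and weight values below are made up purely for illustration:

height <- c(160, 172, 155, 181, 168)       # hypothetical heights in centimeters
weight <- c(55, 70, 50, 84, 66)            # hypothetical weights in kilograms
cor(height, weight) == cor(weight, height) # TRUE: the order of arguments does not matter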

(Refer Slide Time: 51:49)

Similarly, one very nice property which the correlation coefficient has is that this quantity is
independent of the units of measurement of X and Y. So, what is the advantage? Suppose
one person measures the height in meters and weight in kilograms and finds the
correlation coefficient, say r1. Now there is another person who measures the height and
weight of the same set of people, but he measures the height in centimeters and weight in
grams, and he finds the correlation coefficient as, say, r2; then in this case both r1 and r2
are going to be the same, they will have the identical value, and that is a very nice
advantage of using the correlation coefficient.
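
This too can be verified in a couple of lines of R; the numbers are again hypothetical, and the rescaling below stands for the change from meters and kilograms to centimeters and grams:

height <- c(1.60, 1.72, 1.55, 1.81, 1.68)  # hypothetical heights in meters
weight <- c(55, 70, 50, 84, 66)            # hypothetical weights in kilograms
r1 <- cor(height, weight)                  # first person's units: m and kg
r2 <- cor(height * 100, weight * 1000)     # second person's units: cm and g
all.equal(r1, r2)                          # TRUE: r is unaffected by the units

This works because the correlation coefficient standardizes both variables, so any positive rescaling cancels out.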

Now I will stop this lecture here; that was a pretty long lecture. My objective
was to give you the background and development of the correlation coefficient concept.
In the next lecture I will show you how to compute this in the R software and
how to interpret it. In the meantime, please try to read from other books and try to
develop the concept of correlation coefficient in more depth, and I will see you in the
next lecture, till then good bye.

Descriptive Statistics with R Software
Prof. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology, Kanpur

Lecture- 29
Association of Variables - Correlation Coefficient using R Software

Welcome to the next lecture on the course Descriptive Statistics with R software. You
may recall that in the earlier lecture, we started a discussion on the concept of association
between two continuous variables, and we learned about the correlation coefficient. And
we also have understood now how to interpret the values of the correlation coefficient
with respect to the magnitude and direction of the association.

Now, in this lecture I am going to demonstrate how you are going to compute the
value of the correlation coefficient using the R software, how you are going to
implement it, and how you are going to use it when you get a data set. So, first, just a quick
review of what we had done earlier.

(Refer Slide Time: 01:10)

So, you may recall that we had discussed the concept of covariance between the two
variables X and Y, for which we had obtained the n pairs of observations
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$; based on that we have computed the covariance
between x and y as shown on the slide. This was for the ungrouped data, and similar
definitions follow for the grouped data also.

(Refer Slide Time: 01:37)

And we also had discussed that if I have two data vectors in R software, which I denote
as x and y, then the command cov gives the covariance between x and y. So, you write cov
and inside the argument write the data vectors; this will give us the value of the covariance.
But remember one thing: this command for the covariance between x and y in R software is
going to give you the value of the covariance in which the divisor is n minus 1, whereas we
had defined the covariance as

$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$,

which has the divisor n.

(Refer Slide Time: 02:18)

So, after this we had defined the correlation coefficient; this was defined as, say, r, and
finally we had obtained the expressions here like this, a simplified version of this
correlation coefficient, and after that we had understood how to interpret the values
and the sign.

(Refer Slide Time: 02:36)

Now, the next question is how to compute this correlation coefficient inside the R
software. So, if I say that I have the same setup, that I have two data vectors x and y, then
the correlation coefficient between x and y is computed by the command cor, and inside
the argument we have to write the data vectors. So, cor with the arguments x and y will
compute the correlation coefficient between the data vectors x and y.

And when you try to look into the help of the cor function, then there are several
options, and I am going to detail here some important things which we are actually going
to use. You will see here the function cor, and inside the argument I am writing several
options.

So, now we try to understand them one by one. This x and y, as we know, are going to
denote the data vectors. Now, there is here another option, use, and I have written it here
with the value everything inside double quotes.

(Refer Slide Time: 03:57)

Now, what does this everything mean, and what is the use of this option use?
Actually, this use is an optional character argument in the cor function which specifies a
method of computing the covariance in the presence of missing values. If you remember,
earlier we had used the option na.rm; this argument has a similar
utility.

So, in this case we have several options to give here: say everything, when you want to
use all the observations, or all.obs, complete.obs, na.or.complete, or
pairwise.complete.obs, and so on, because there are different situations in which one
would like to compute this coefficient of correlation.

So, we are simply going to use here the option everything, where I want to compute
the correlation based on all the data, right; the remaining details you can look up in the help
menu. After this there is another option here, which now I am trying to denote in blue
color so that you can see it clearly: this is method, and inside the c() function in the
argument, I am writing here three options: pearson, kendall and spearman.

Actually, there are several types of correlation coefficients. Up to now what we have
studied is r, and if you remember, I had told you that this r is also called the Karl
Pearson coefficient of correlation, and this is how this pearson is coming here. Another
correlation coefficient is the rank correlation coefficient, which we will discuss
in the next lecture, and this rank correlation is also called Spearman's
correlation coefficient or Spearman's rank correlation coefficient.

So, this option here, which now I am highlighting in red color, spearman inside the double
quotes, is used when we want to compute the rank correlation coefficient.
And similarly, there is another form of the correlation coefficient which is selected by
kendall, which we are not using at the moment.

So, essentially we are going to use here the option pearson; the correlation coefficient
that was defined earlier as

$r = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

is actually computed by this option pearson, right.

And this method is mentioned here in the next slide: method is a character string
indicating which correlation coefficient (or covariance) is to be computed, one of pearson,
kendall or spearman; this can be abbreviated inside the quotes, and the default is
pearson.
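
Putting the two options together, a minimal call looks like the sketch below; the data vectors are made up, but the use and method values are the ones accepted by R's cor function:

x <- c(10, 12, 15, 19, 23)                           # hypothetical data vector
y <- c(8, 11, 13, 20, 24)                            # hypothetical data vector
cor(x, y)                                            # default: Pearson on all data
cor(x, y, use = "everything", method = "pearson")    # the same call, options spelled out
cor(x, y, use = "complete.obs", method = "spearman") # rank correlation, missing values dropped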

(Refer Slide Time: 07:13)

Now, I try to illustrate this first with some examples. If you try to see here, I have taken
two data vectors 1, 2, 3, 4 and 1, 2, 3, 4, the same values, and I am trying to compute the
covariance between them. Once I do it, this value comes out to be 1.6667, as
you can see on the slide. And now what I do in the next example, which I am
denoting now in purple color: I take the same first data vector c(1, 2, 3, 4), but
in the second data vector I change the signs.

So, now if you try to see what happens here, the value of the covariance comes out to be
-1.6667. So, you can see here that these two values, which I am now highlighting
in red color, this and this, have the same magnitude, but only the sign is
opposite, and here is the screenshot. What is the meaning of this?

Now, if you try to look into this slide with the definition of the correlation coefficient, we had
understood that r can take a positive value and r can take a negative value; but if you try
to see the denominator, both of its terms will always be greater than zero.
So, the only quantity which can change sign is the covariance between x and y;
this can be greater than zero or smaller than zero.

So, the direction of the correlation coefficient is determined by the covariance; I can
say that the sign of the correlation coefficient is determined by the covariance. And this is
what I am trying to show you here: if the data points in the first case and in the
second case have got opposite signs, then this is reflected by this negative
sign here, right.

(Refer Slide Time: 09:50)

And now let us plot these data points to see how they look. For example, now I try
to find out the correlation coefficient, by which I will also show you the direction.
So, when I try to find out the correlation coefficient between the two data vectors 1, 2, 3,
4 and 1, 2, 3, 4, which are the same, this correlation coefficient comes out to be one,
as you can see.

And you see this is also indicated in the scatter plot which is given here; these are
the points 1, 2, 3, 4, and they are lying exactly on a straight line. So, this is a case where
we have exact positive linear dependence, and this is the screenshot of the same
operation; I will show it to you on the R software also.

(Refer Slide Time: 10:38)

And similarly, if I try to take the data vectors with exactly opposite signs, say the first data
vector with all positive values and the second data vector minus 1, minus 2, minus 3,
minus 4, having the opposite signs, and if I try to find out the correlation coefficient
between the two, this comes out to be minus 1.

And you can see here this is indicated in the scatter diagram: these are the four
points, this relationship is decreasing, and this is the case of exact negative linear
dependence between the two variables; here is the screenshot of both
things.

(Refer Slide Time: 11:17)

(Refer Slide Time: 11:21)

Now, before taking an example, let me try to show you these things over the R console
also. So, you try to see here: I try to find out the covariance between 1, 2, 3, 4 and
minus 1, minus 2, minus 3 and minus 4; you can see here this comes out to be minus
1.6667, and if I try to remove the signs, so that both the data vectors have got the
same values, then in this case the magnitude remains the same, but the sign becomes
positive.

So, in this case you can see here the covariance is positive. Now, in case I try to find
out the correlation in the same case where the covariance is positive, you can see here
this is obtained by the function cor, and this comes out to be 1; the positive sign is
maintained here. Similarly, if I try to take the data set with negative signs and find
out the correlation coefficient between 1, 2, 3, 4 and minus 1, minus 2, minus 3,
minus 4, you can see here this comes out to be minus 1, right.
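
For reference, the console steps just described can be reproduced with the following short sketch, using the same toy vectors as in the lecture:

x <- c(1, 2, 3, 4)
y <- c(-1, -2, -3, -4)
cov(x, x)   #  1.666667 : same values, positive covariance
cov(x, y)   # -1.666667 : same magnitude, negative sign
cor(x, x)   #  1        : exact positive linear dependence
cor(x, y)   # -1        : exact negative linear dependence
plot(x, y)  # the four points lie exactly on a decreasing straight line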

Now, if I try to make here the scatter plots, I will show you both things
here. So, you can see here, I can make the scatter plot of 1, 2, 3, 4 against minus 1,
minus 2, minus 3, minus 4; you can see here this comes out like this, where the direction of my cursor is
indicating the negative relationship.

(Refer Slide Time: 12:56)

And similarly, if I try to make it positive, that means both data vectors are
positive, 1, 2, 3, 4 and 1, 2, 3, 4, in this case you can see here that this scatter diagram is
showing an exact positive relationship. Now, I try to take an example to show you
how the interpretation of the correlation coefficient comes into the picture. Suppose I try to
take here a data set where I have obtained the marks and the number of hours of study of
20 students; this is the same example which I have considered in the earlier lectures
while making different types of graphics, or say, bivariate plots.

So, we have recorded their marks in the first row of both tables and the number of
hours per week they studied in the second row, right; this is the same example which we have
considered several times, and this data has been stored inside two variables, marks and
hours, right.

(Refer Slide Time: 14:05)

Now, the first thing that comes to our mind is that now we have got the data, and before
using the concept of the correlation coefficient, we would like to see what the type of
relationship is, whether it is linear or not. So, we simply use the command plot and we try
to plot here the marks versus hours.

(Refer Slide Time: 14:23)

So, you can see here the data is lying like this: on this axis are the marks and
on this axis are the hours, right, and you can see that there is a linear trend here. So,
this gives us a sort of assurance that the relationship between marks and hours is
close to linear; for now, I have drawn the trend line by hand.

(Refer Slide Time: 15:02)

And now we also have learned how to make this trend line; for that I will use the same
command that we discussed earlier, scatter.smooth, and I will try to draw the line
for the two data vectors marks and hours. And you can see here that this line is now
also indicating that the relationship is nearly linear, and this gives us confidence that, ok, in this case
I can use the correlation coefficient.

(Refer Slide Time: 15:29)

And then I try to find out the correlation coefficient between marks and hours; you can
see here this comes out to be 0.96. So, what is this 0.96 indicating? If
you try to look into this curve, you can see that these deviations, which I am
plotting in orange color, are very small, and the points are lying very close to the trend line.

So, this is indicating that, on average, the degree of linear relationship
between marks and hours is very strong; r is nearly 0.9679. Similarly, if you
try to interchange those variables, so earlier I had taken marks and hours and now I
try to take hours and marks, you can see here that the value of the correlation coefficient
remains the same; this is what we have discussed in the last lecture also, right.

So, in this case you can see here the sign of the correlation coefficient is positive, and this
is indicating that the relationship between x and y is positive, which you can see here.
And we can conclude that as the number of hours of study per week increases,
the marks obtained by the students also increase, because there is a positive
relationship and the value of the correlation coefficient is pretty high. So, that is why I can
say that my conclusions based on the data are sound, and the data is
also showing a linear trend.
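
The whole workflow of this example can be written down in a few lines of R. The twenty recorded values are not reproduced here, so the vectors below are short made-up stand-ins; with the lecture's actual data the correlation comes out as the 0.9679 quoted above:

marks <- c(45, 58, 62, 70, 78, 85)   # hypothetical stand-in for the recorded marks
hours <- c(10, 14, 16, 20, 24, 27)   # hypothetical stand-in for the hours per week
plot(marks, hours)                   # step 1: inspect the scatter plot
scatter.smooth(marks, hours)         # step 2: add a smoothed trend line
cov(marks, hours)                    # step 3: the sign gives the direction
cor(marks, hours)                    # step 4: the degree of linear relationship
cor(hours, marks)                    # symmetry: the same value with arguments swapped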

(Refer Slide Time: 17:08)

(Refer Slide Time: 17:19)

And this here is the screenshot; I will show it to you on the R console also before I
take one more example, of the negative kind. I already have
stored the data, so you can see here this is the data for marks and this is the data for
hours.

So, if I first try to plot marks versus hours, or say hours versus marks, whatever you want,
this gives us, as you can see here, a nice scatter plot indicating the linear trend. If you try
to make here a trend line also, try to give scatter.smooth, and you can see here
that a line is also plotted, right.

(Refer Slide Time: 17:36)

And now I try to find out the covariance between marks and hours; that will indicate the
direction. So, you can see here that the value of the covariance is 365.5368 and it is
positive. So, it is indicating, and it is matching with our graphical conclusion, that the
relationship is positive.

Now, I try to find out the correlation coefficient between marks and hours, and
you can see here this is 0.96, right. So, you can see that this correlation coefficient is
taking into account the variances as well as the covariance of the two variables.

(Refer Slide Time: 18:41)

Now, I will take one more example and try to illustrate something more. Suppose
there are 10 patients, and those patients are given some medicine; the quantity of the
medicine is measured in milligrams (mg), and the time, say in hours, that the medicine
takes to start showing its effect is recorded. This data is compiled here,
say quantity of medicine and the time in hours. So, this is indicated here: this is
the data of person number 1, or patient number 1.

So, this is indicating that 30 mg of medicine was given and it took 4 hours of time.
Similarly, the second data point is for patient number 2: 45 mg of medicine
was given and he took 3.6 hours of time, and so on. So, the data on the quantity is stored
inside a variable quantity, and the time is stored in effect.time. One thing:
please don't use the variable name time, because time is already used by the R software, so be
careful. So, now I will take this data set and first make a plot here.

(Refer Slide Time: 20:09)

So, you can see here this is the screenshot: on the x axis is the
quantity and on the y axis is the time, that is, effect.time, and you can now see that the
observations are coming out like this. So, you can see here that as the values of x
are increasing, the values of y are decreasing.

So, this shows a sort of negative trend, right, and this information we would like to
verify with the covariance function or the correlation function; but first let me plot here
the trend line also.

(Refer Slide Time: 20:43)

So, you can see here that the trend line is also indicating that the relationship is almost
linear, and we can safely use the concept of the correlation coefficient, right; so this is the
outcome. And now I try to find out the correlation between quantity and effect.time,
and this comes out to be minus 0.9885454; well, you can control the number of digits
also. And now, if you try to see, this value is negative, so the negative
sign is indicating that this relationship is decreasing: as the values of x are increasing, the
values of y are decreasing, and the magnitude here is 0.9885454, close to 0.99.

So, close to 0.99 means the relationship is nice and linear, and the degree of
linear relationship is pretty high; the maximum value is 1. In the case of 1, all
the observations would be lying exactly on the line, whereas in this case
this is just 0.988, so the points are very close to the line. So, now this gives us the information and
confidence that we can use here the concept of the correlation coefficient, and this degree is
coming out to be 0.988.
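
A sketch of this example in R follows. Only the first two patients' values (30 mg, 4 hours and 45 mg, 3.6 hours) are given in the lecture; the remaining eight pairs below are invented to make the code self-contained, so the printed numbers will differ from the minus 0.9885454 quoted above:

quantity <- c(30, 45, 50, 60, 70, 80, 90, 100, 110, 120)          # dose in mg
effect.time <- c(4, 3.6, 3.4, 3.0, 2.8, 2.5, 2.2, 2.0, 1.8, 1.6)  # hours to take effect
scatter.smooth(quantity, effect.time)  # decreasing, roughly linear trend
cov(quantity, effect.time)             # negative: the direction of the relationship
round(cor(quantity, effect.time), 3)   # round() controls the number of digits shown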

(Refer Slide Time: 22:20)

So, now I can conclude that as the quantity of medicine increases, the number of
hours to take effect decreases, and I will show you this on the R console
also; this is the screenshot of what I have done. So, let me first copy this data;
the earlier data was still there because I had used it before.

(Refer Slide Time: 22:39)

So, this here is quantity; first let me clear the screen by Control+L, then
get this quantity, and then I copy the effect.time variable and the data contained
in it. So, this here is effect.time; you can see here quantity is obtained here and
effect.time is obtained here. Now, I would like to know the nature of the relationship.
So, I try to make here a plot between quantity and effect.time.

(Refer Slide Time: 23:15)

And you can see here this is the plot, the same plot that we have just obtained in
the slides; and now, if you want to make here a smooth scatter plot by adding a trend
line, I have to use the command scatter.smooth, and this gives me this
graph, which has the same points, but with a trend line. And now, first I try to find out
the covariance between the two, because the covariance will tell us the sign,
that is, whether the direction is positive or negative. So, you see, this covariance comes out
to be minus 22.625.

So, this minus sign is indicating that the relationship is negative, and this is verified
in this graph also; if you try to observe the direction of my cursor on the screen, this
is decreasing, right. Now, I try to find out the correlation coefficient between quantity
and effect.time, and this comes out to be minus 0.988. So, once again this sign is
coming from the covariance, and it is indicating that quantity
and effect.time are negatively correlated: as the quantity increases, the time taken
to take effect decreases, right.

And this is quite obvious also; we know from our experience that when we increase
the dose of a medicine, it acts faster and the time to react becomes smaller and
smaller, well, up to a certain extent; beyond a certain limit that is not advisable, and
you have to depend on the doctor's advice, right. So, now in this lecture I have shown
you how to take a decision on whether you want to compute the
correlation coefficient or not.

So, the steps are: first, obtain the scatter plot or the smooth scatter plot with a
trend line, and try to look at the trend. In case you are convinced, yes, there can be a linear
trend, or the relationship between the two variables is approximately linear, which can be
positive or negative, then you decide that, well, in this case I can use the concept of
correlation coefficient to measure the degree of linear relationship, and then you
use the formula for the coefficient of correlation.

So, I stop here, and I would request that you please try to take some more data sets,
try to plot them, then try to compute the values of the correlation coefficient, and see
what you get. So, you practice, and I will take up the topic of rank correlation in the next
lecture; I will see you then.

So thank you very much and see you soon, goodbye.

Lecture – 30
Association of Variables – Rank Correlation Coefficient

Welcome to the lecture on the course Descriptive Statistics with R software. You may recall that in

the last two lectures, we had discussed the concept of the correlation coefficient and we learned
how to compute it in the R software. So, you may recall that when we started the discussion on

the topic of association of two variables, we had discussed three possible situations where we

would like to measure the degree of association. First was, when the variables are continuous,

and the data is collected on some continuous scale. Second situation is where the data is

collected as ranks, that means the observations are obtained and they are converted into the ranks

and then we need to find out the correlation coefficient, and the third one is where the data is say

categorical, where you try to count the number or the frequency. So, in this lecture we are going

to consider that now the second case where the data is obtained as the ranks of the observations.

Now, the first question that comes up is where such situations can occur. Although I had explained this in the
earlier lecture, I will take a quick review. Suppose there is some fashion show that is going

on and suppose a model comes on the stage and suppose there are two persons who are looking

at the performance and they are giving some scores. Now, what do you expect? You expect that

in case if the performance is good, then both the judges should give some higher score and in

case if the performance is bad, then you expect that both the judges should give lower scores.

Now, in practice it is very difficult that the judges give exactly the same scores; suppose

the scores are to be given, say, between zero and hundred, it is very difficult that they are

giving the same scores. If the person is good, they may give a score of say 80 or say 85. So, now

the question is this, how to measure the association between the opinions expressed by those

marks by those two judges? What we can do is rank the scores of the different

candidates. Suppose there are ten candidates. So, those ten candidates are judged by the two
judges. Judge one has given his scores to the ten candidates, and judge two also has given

their scores to the ten candidates. Now instead of using the scores, we will try to find out the

ranks, that is, which candidate got the highest rank, which candidate got the second
highest rank, and which candidate got the lowest rank, and these ranks will be calculated for

both the judges, and then we would like to find out the association and direction of the

association between the ranks given by those two judges and this can be achieved by using the

concept of rank correlation coefficient. What do we expect on the basis of what we have learnt in

the case of correlation coefficient? If both judges are giving the similar scores or they have the

similar opinion (if a person is good, then he or she is good in the eyes of both), then the

correlation should be positive and it should be say quite strong and in case if there is an opposite

opinion, that means the judges feel just the opposite, one judge says good and the other
judge says bad, then in that case we expect that the correlation should be negative, and now how

close or how strong or say how weak, this depends on the magnitude of the correlation

coefficient. So, you will see that the similar type of concepts will also be there in case of rank

correlation coefficient as in the case of Pearson's correlation coefficient.

Refer Slide Time: (05:49)

So, now we discuss our lecture, and we will simply assume here that we have two
variables X and Y, and observations on X and Y are available, right, and after this, whatever
the observations are, they are ranked with respect to X and Y, and the ranks of those observations
are recorded. What does this mean? Suppose, in the example which I have given, suppose I

say there is a judge one and he has given the scores say here 90, 20, say 60 and say here 35. So,

this is my observation number one, observation number two, observation number three and

observation number four. So, these are now essentially the values of xi’s. So, this is my here x1,

this is x2, this is x3 and this is x4. Now, what I can say I will find out the ranks, if you try to order

these observations, you can see here that the smallest observation here is 20, that is the smallest

observation and after this we have 35, then we have 60 and then we have 90. So, 90 is the largest

observation, Right! So, if you try to give the rank, so the smallest observation will be getting a

rank equal to 1 and there are 4 observations. So, the largest observation will be getting the rank
4, and the second largest observation, which is here 60, will be getting the rank 3, and the third
largest observation will get here the rank 2. So, you can see here now 1, 2, 3 and here 4,

these are the ranks given to these observations. Now, if I try to write down here the ranks, what

is the rank of this x1, which is having the value 90? Its rank here is 4, which is written in red

color. Now, second observation is here x2 equal to 20, what is the rank of this observation 20,

this is here 1, you can see here, and this I can write down here as 1. Similarly, this 4 was
coming from here, and next if you try to see x3, the value of x3 is 60, and what is

its rank, this rank here is 3. So, this comes over here and I write here this 3 and similarly here x4

is equal to here 35 and 35 has got the rank 2. So, this comes here and we write it here two. So,

now you can see here that whatever scores were given by the judge, they have been

converted into the ranks and similarly there will be a second judge and we will try to convert the

scores of judge two also in the order of ranks and then we will try to find the correlation

coefficient between the two ranks. Now, in case you think mathematically about how to obtain

the value of the correlation coefficient, in this case, you may recall that in the case of Pearson

correlation coefficient, we have taken this xi and yi to be the values on the continuous scale, now

the values are here as integers and because they are the ranks 1, 2, 3, 4 and so on. Now in case if

you choose x1, x2, ..., xn in the case of the Pearson correlation coefficient to be 1, 2, 3, 4 up to n and so
on, and similarly y1, y2, ..., yn also to be the integers 1 to n, then just by computing the correlation

coefficient or the Pearson correlation coefficient with the two sets of values 1 to n and 1 to n, you

will get the value of the correlation coefficient, and this is the idea of how the expression for the

rank correlation coefficient has been obtained. Well I'm not going to give you here the derivation

between the two, but definitely I think that you should know, and this correlation coefficient

was given by Spearman. So, that is why this is also called as Spearman's rank correlation

coefficient.
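
Incidentally, R has a built-in function rank() that performs exactly this conversion from scores to ranks, which gives a convenient check on the small example above:

scores <- c(90, 20, 60, 35)   # the four scores from the example
rank(scores)                  # returns 4 1 3 2: rank 1 goes to the smallest score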

Refer Slide Time: (11:12)

So, now let us try to continue with our discussion and try to understand how we are
going to do it. So, we discuss how to compute the Spearman's rank correlation
coefficient. Now, suppose there are n candidates who participated in a talent competition

and suppose there are two judges and these two judges are indicated by X and Y and both these

judges, they judge the performance of n candidates and they try to give the ranks to every

participant or they try to give the scores which are finally converted into ranks. So, I will say

here now we have a situation like this: one judge is here X, who is trying to give us the scores,
and there are suppose n candidates. So, you must know how this data will look like:
there are 1, 2 up to here n candidates, and judge X gives them the scores x1, x2, ..., xn, and then these

scores are converted into the ranks. So, these ranks are going to be indicated by the numbers say

here 1, 2 up to here n and similarly there is another here judge which is here judge Y, this judge

Y also gives the scores as y1, y2, suppose, up to yn, and these scores are once again converted into
ranks, say 1, 2 up to here n, and now we are going to find out the correlation coefficient between
the two sets of observations 1 to n; definitely these numbers 1 to n will not occur in the same

sequence. For example, suppose judge X says that in his opinion the best candidate is candidate 2. So, he
has given candidate 2 the rank, see here, n, whereas suppose judge Y has given candidate 2 the rank,
say, here 3, because he thinks that he is only at the third position. So, you can see here that the person
here is the same, the same person, but he has been given two different ranks, n and
3, and this is what we want to find: whether the ranks given by the two judges
are similar, or they are different, or very different. So, that is the objective of introducing this

correlation coefficient.

Refer Slide Time: (14:21)

So, now if you see what I'm going to do here: judge X gives ranks to the n candidates, and

suppose he says that whosoever is the worst candidate, he gives the rank one, and this rank 1 is

given to that candidate who had scored the lowest score; and similarly, whosoever got the second
lowest score out of x1, x2, ..., xn, he is given the rank 2, and similarly whosoever is the best

candidate, the one who has got the highest score has been given the rank n, and similarly judge Y

also has given ranks to the same n candidates. Remember, the candidates are the same, and
whatever scores he has given, based on that he has converted the scores into ranks 1 to n, based
on the scores y1, y2, ..., yn, exactly in the same way as judge X has done it. Now, what we

have to do: now we understand that every participant has got two ranks, which are given by
two different judges. What do we expect? We expect both the judges to give higher ranks to
the good candidates and lower ranks to the bad candidates. Now, our objective is this: we want to

measure the degree of association between the two different judgments through the ranks and we

would like to find out a measure of the degree of association for these two different sets of ranks.

Refer Slide Time: (16:07)

Now, the question is how to do it and what will it indicate? So, in order to measure the
degree of agreement between the ranks of two judges or two data sets, in general, we use the
Spearman rank correlation coefficient. So, one thing you have to notice and always keep in
mind: we are not using here the original observations, but we are using only their ranks.

Refer Slide Time: (16:35)

So, now let me develop or tell you how we do it. So, first we try to define here the rank of
an observation: the rank of xi is denoted as rank(xi), that is, rank with xi inside the
argument. How has it been obtained? All the observations x1, x2, ..., xn have been ordered,
and then, from there, whatever is the numerical value of the rank, this has been recorded, right!

Refer Slide Time: (17:07)

So, as I have shown you here, if you see here, this is how these ranks have been
obtained, right!

Refer Slide Time: (17:12)

So, exactly in the same way these ranks have been obtained for the data in X, and similarly these

ranks have been obtained for the data in Y, and the ranks in Y are denoted by the simple statement

rank of yi. Now, what we do is consider the ranks of xi and yi and try to find
out their difference. So, if you try to see what I am doing, I take here the ith person and I see what is the
rank given by judge X and what is the rank given by judge Y, and whatever is their difference, I
try to compute it here and denote it by $d_i = \text{rank}(x_i) - \text{rank}(y_i)$. Now, the expression
for the Spearman's rank correlation coefficient, denoted as capital R, is given by

$R = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$,

and this R is called the Spearman rank correlation coefficient, and it lies between minus 1 and plus 1.

Refer Slide Time: (18:31)

One thing which I would like to address here: it does not matter whether the observations
have been arranged in ascending or descending order before their ranks are computed. So, what I am trying to say

that if rank 1 is given to the worst candidate or to the best candidate, it will not change the rank

correlation coefficient provided both the judges have used the same criteria: if judge X has given
the lowest rank to the worst candidate and judge Y also has given the lowest rank to the

worst candidate and judge X has given the highest rank to the best candidate and same has been

followed by the judge Y also, that he also has given the maximum rank to the best candidate,

then in both the cases if you try to compute the correlation coefficient or the rank correlation

coefficient, their values will come out to be the same; the only thing is that the ordering has to be

the same in both the cases either ascending or descending. So, you give the ranks 1, 2, 3, 4 up to

n, or you try to give the ranks n, n minus 1, n minus 2 down to 1, right. Okay? Now, what is
the interpretation of the values of R? So, we have seen that this R will lie between minus 1 and

plus 1. So, when I say that R is equal to plus 1, this means that same ranks have been given to all

the candidates by the two judges, exactly the same rank; there are n candidates, and

whatever the ranks have been given by judge X to a particular candidate, the same rank is given

by the judge Y and this is true for all the candidates. So, in this case both the judges are going to

give exactly the same opinion, and that is reflected by R equal to plus 1. So, here this plus
is trying to indicate the direction, that both the judges have given their opinion in the same
direction, or that the direction of the opinion of both the judges is the same, and the value one is
indicating the perfect relationship, as we have seen in the case of simple

correlation or the Pearson correlation coefficient. Similarly, in case the two judges give just the
opposite ranks to all the candidates, that is another extreme; that means whosoever is the best

candidate in the opinion of judge X, is the worst candidate in the opinion of judge Y, then in this

case the value of rank correlation coefficient will come out to be minus 1. So, this is the case of

say a perfect relationship once again, and this relationship is negative. So, this negative sign is

indicating the direction. So, R equal to plus 1 means perfect positive relation and R is equal to

minus 1 means perfect negative relation, this is the interpretation.

Refer Slide Time: (21:58)

And now in case if you try to choose any other value of R between minus 1 and plus 1, similar to

the interpretations of the Pearson correlation coefficient, we will have similar interpretations in
the case of the rank correlation coefficient. For example, if the
rank correlation coefficient is suppose 0.95, then I would say, okay, both the judges are giving more
or less a similar opinion; but definitely, if the rank correlation coefficient is suppose 0.02,
then I would say, well, the opinions given by the two judges are not dependent on each other;

but definitely if the correlation coefficient is suppose minus 0.9, then I would say that, well, both
the judges are giving just the opposite assessments, and they have got nearly
opposite opinions: whatever judge X thinks is good, the other judge thinks just the opposite, that
it is bad, and vice versa. Now, let me take here an example and I will show you

how to compute the rank correlation coefficient manually, and then I will show you how to compute it

on the R software. It is important for me to show you these small calculations here, because

many times people try to compute the rank correlation coefficient not on the basis of the ranks,

but simply on the basis of the observed values; but here you have to be careful when you
are trying to implement the concept of rank correlation coefficient. Whenever you are given

a data set, try to see whether you have been given the original scores or you have been given the

ranks. So, if you are given the original scores, then first you need to convert them into ranks and

then try to compute the value of rank correlation coefficient, but in case if you have been given

the ranks, then you can directly use them and compute the value of correlation coefficient, right!

So, in this example, I am trying to consider the scores, and we will convert them into
ranks; I'm taking a very simple and small example so that I can show you the calculation.

Suppose there are 5 candidates and they have been judged by two judges, now the judge one has

given the score 75 to candidate number 1, 25 to candidate number 2, 35 to candidate number 3,

95 to candidate number 4 and 50 to candidate number 5 and similarly judge two has given 70 to

candidate number 1, a score of 80 to candidate number 2, a score of 60 to candidate number 3, a

score of 30 to candidate number 4 and a score of 40 to the candidate number 5. So, you can see

here, we do not have the ranks. So, first we need to find out the ranks. So, first let me find out

here the ranks. In the case of judge one, what is the minimum value among all these values

75, 25, 35, 95 and 50. So, you can see here, this value here is 25. So, now I decide that my

ordering will be that we will give rank equal to one to the candidate having the minimum score.

So, the candidate who has got the score 25, which is here, he or she gets the rank 1. Now,

once again I try to find out the minimum or maximum, whatever operation you want to do

among the remaining values. So, I try to take here 75, 35, 95 and 50. So, this comes out to be 35.

So, the rank 2 has to be given to the candidate whose score is 35. So, you can see here, now this is

here the candidate who has been given the rank 2. Similarly, I try to compute the minimum value

out of the remaining values, which are 75, 95 and 50, and this comes out to be here 50, and then I try

to give rank equal to 3 to the candidate whose score was 50. So, you can see here, this I am doing

it here. Now, once again I try to find out the minimum value among the remaining values, which are
here 75 and 95, right, and this comes out to be 75. So, I try to give the rank 4 to the candidate

who has got 75 marks. So, you can see here this is here and this candidate has been given the

rank 4 and similarly, now the value which is the maximum or the value which is left here is the

maximum value, and maximum value here is 95. So, this will give the maximum rank 5 and the

same operation is done on the scores given by judge 2. So, you can see here out of this 70, 80,

60, 30 and 40 values, the smallest value here is 30 and so, this has been given the rank 1. Now,

after this I try to find out the second smallest value; the second smallest value here is 40. So, this
candidate has been given the rank 2. Now, I try to obtain the third smallest value; the third smallest
value here is 60. So, this candidate has been given the rank 3, and similarly I try to obtain the
fourth smallest value, this is here 70, and this candidate has been given the rank 4, and finally the

maximum score here is 80, and this candidate has been given the rank 5, right! Now, I try to

find out the difference between the rank of xi and rank of yi. Actually here, I have both the

options: either I consider the rank of xi minus the rank of yi, or the rank of yi minus the rank of xi, and

this will give you the same correlation coefficient, you can see here in the formula of this rank

correlation coefficient here, we are using di squared, so the sign of di will not make any

difference. So, now if you take the values here, I will try to highlight here 4 and here 4;

this difference here is 0, this is 4 minus 4, second value here is, try to observe my pen, this is 1

and here 5. So, this is 1 minus 5 is equal to minus 4. Similarly, next value here is 2 and 3, this is

2 minus 3 is equal to minus 1, then the fourth value here is 5 minus 1, is here plus 4 and similarly

the difference between 3 and 2, which is 3 minus 2, is here plus 1. Now, after obtaining the value

of these di's,

Refer Slide Time: (30:01)

I will try to implement them on the expression for the rank correlation coefficient, which is given

here. So, these di's we have obtained, and the number of observations here is 5. So, n is

equal to here 5 and this value comes out to be minus 0.7, what is this indicating? Now, you can

see here, this value here is minus 0.7. So, this minus is indicating the direction, direction of the

ranks given by the two judges, right! So, it indicates that the opinions of the two judges are negatively
associated, and the degree of the association between them is 0.7. So, this is indicating that both the

judges have got say quite different opinion about the candidates who participated, right!
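
To double-check the hand computation, the following R sketch evaluates the formula directly on the ranks and compares it with R's built-in Spearman option, using the scores from the example:

judge1 <- c(75, 25, 35, 95, 50)           # scores given by judge 1
judge2 <- c(70, 80, 60, 30, 40)           # scores given by judge 2
d <- rank(judge1) - rank(judge2)          # differences of ranks: 0 -4 -1 4 1
n <- length(d)
1 - 6 * sum(d^2) / (n * (n^2 - 1))        # formula by hand: -0.7
cor(judge1, judge2, method = "spearman")  # built-in computation: also -0.7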

Refer Slide Time: (31:00)

Now the next question is how to compute the rank correlation coefficient in the R software?

The computation of the Spearman rank correlation coefficient and the computation of the Pearson
correlation coefficient are similar. In R software, we use the same command for computing
both. The only difference is that when we try to specify the option method, then we have to

be careful. If you remember, when we discussed the computation of Pearson correlation

coefficient in the last lecture, I had given the option, say, method equal to pearson; at

that time I also explained to you about the rank correlation. Now I am giving you more

explanation: we simply need to specify the method equal to spearman, and then

the entire computation procedure and all other R commands, they will remain the same, they will

have the same interpretation. So, now let us try to understand it here. So, the R command to

compute the Spearman rank correlation is cor, and inside the argument you have to give the

data vectors here x and y, and in this case there are some other options which have to be used in

case if you want to compute the correlation coefficient based on rank, that is the rank correlation

coefficient: cor will remain the same, x and y will remain the same as the data vectors, the option
use, to use all the data by giving the keyword everything inside the double quotes, will
remain the same as in the case of the Pearson correlation coefficient; the only change will occur here,
that now the method is going to be spearman, which has to be given inside the double
quotes, and once you do it here, then you will get the value of this rank

correlation coefficient.

Refer Slide Time: (33:11)

So, this is about the option use that handles the presence of missing values; that is the same thing
that we discussed in the case of the Pearson correlation coefficient.

Refer Slide Time: (33:20)

So, now I try to take the same example in which I have computed the rank correlation coefficient

manually; you can see here that the values are the same:

Refer Slide Time: (33:32)

75, 25, 35, 95, 50 in the first case,

Refer Slide Time: (33:40)

and here also 75, 25, 35, 95, 50. So, similarly the other values are the same. So, I try to

give here the data x equal to the scores given by the judge 1 and y here as the scores given by the

judge 2. Remember that I am not giving here the ranks, but I am simply giving these scores, and R

software will automatically convert them into ranks.

Refer Slide Time: (34:05)

So, I try to name this data, for the sake of convenience, as judge1 for the data given by judge 1
and judge2 for the data given by the second judge, and now I try to find out the correlation
coefficient between judge1 and judge2 using the option everything, and now my method

becomes spearman, and you can see here that this value comes out to be minus 0.7,

Refer Slide Time: (34:33)

and this is the same value which we had obtained manually

Refer Slide Time: (34:37)

and this is here the screenshot of this operation, but I will try to show you the same operation on

the R console also.

Refer Slide Time: (34:44)

So, first I try to copy the data of judge 1, say here judge1, and the data of judge 2, here judge2, and then I
try to take the command for finding the correlation coefficient using the method spearman

Refer Slide Time: (35:17)

and you can see here this comes out to be minus 0.7. Now, I will stop here. We have now
learned how to compute the association when the data is in the form of ranks. So, once
again, as usual, I will request that you please try to look into books, try to study more about
this rank correlation coefficient, try to take some examples and try to execute them on the R
software, and practice it; the more you practice, the more you learn. I will see you in the next
lecture, where we will discuss how to measure the association
when we have counting variables. Till then, goodbye!

Lecture – 31
Association of Variables – Measures of Association for Discrete and Counting
variables: Bivariate Frequency and Contingency Tables

Welcome to the course Descriptive Statistics with R software. You may recall that in the last

couple of lectures, we have understood how to measure the association between continuous
variables and ranked data, and we had the concept of the correlation coefficient and the rank correlation

coefficient. Now the third case that is left is how to measure the association between two

variables which are counting in nature or they are discrete in nature.

So, in this lecture we are going to understand two concepts; bivariate frequency tables and

contingency tables and we will also see how to implement the concepts on the R software and in

the next lecture, we will continue with some more measures of association in the case of

counting variables or counting data. So let's just start our discussion here. First of all, you have to
understand what a bivariate frequency table is. You may recall that we have already

discussed the concept of a frequency table and in that case, we have essentially considered the

univariate frequency table. Univariate frequency table means there is only one variable on which

the data has been collected and the data was tabulated. Now suppose there are two variables and

we want to tabulate the data. So this type of table will be called as bivariate table.

Now the question is how does this come into the picture and how do the measures of association in
this case of counting variables come into the picture. Now suppose you want to know, in a college,
what is the choice of subjects between say mathematics and biology among boys and
girls, that is, male and female students. Now what are we going to do? We will take a sample of

students consisting of both boys and girls and we will ask their choices; do they like mathematics

or biology. Now, what we expect is that if the choice of the subject does not depend on
the gender, then among the students who are considering to study mathematics or biology,
there should be about the same number of boys and the same number of girls; but if a particular gender
shows a stronger preference for a particular subject, that would indicate that yes, the choice of the subjects
between biology and mathematics is being dominated by the gender.

So what are we going to do here, let us take this simple example and it's the same example that

suppose we want to know if boys and girls in a college have say an inclination to choose between

mathematics and biology.

So obviously as we said that if there is no choice, no discrimination between the two subjects

then we expect that the total number of boys and girls opting for mathematics and biology should

be nearly the same. And now we have to collect the data to make a conclusion on such an

opinion.

So the data is collected on boys and girls with respect to biology and mathematics and this data is

obtained in the form of a frequency and now we need to summarize this data into a frequency

table and based on that, we need to devise a measure based on this frequency data or summarized

data to study the association between two such variables which are basically counting in nature.

Now suppose I take a sample of 1, 2, 3, 4, up to 10 students and we ask individually, each and

every student, what is their choice and we note down their gender. So I am denoting the gender

of male students as M and the gender of female students as F, and the subjects: mathematics is

being denoted here by Mth and biology is being denoted here by Bio. So now the data is

collected as follows. Suppose we ask the first student and first student is a male and he says that

yes, he prefers biology. After this we ask the second student, and the second student is a girl. So we
write here F for female, and she answers that she prefers biology. And similarly we try to

take the third student who is a male and this boy answers that he prefers mathematics and so on

and similarly we try to ask all the students.

Now you can see here how many boys are here 1, 2, 3, 4, and 5. So number of males here are 5

and obviously the number of females here are once again 5. 1, 2, 3, 4, and 5 and now we try to

see what is the data on the subject mathematics. This is here 1, 2, 3, 4, 5, 6. So there are 6 students
who prefer mathematics and there are 1, 2, 3, 4 students who are preferring biology, but now you

can see here, from this type of data, from this type of frequency we are unable to conclude

anything. So what we need to do here we try to create a bivariate frequency like this one where

on one side, we will express the gender and on other side, we will make the subject.

Now there are two genders male and female and there are two subjects see here mathematics and

here biology, and then we try to count these numbers: how many males are preferring maths,
how many females are preferring maths, how many males are preferring biology, and

how many females are preferring biology. So you can see here that in this case the number of

males who are choosing mathematics is number one, number two, number three, and number

four. And the males who are choosing biology, there is here only the first student. And similarly, the
female students who are preferring maths, they are here one and two. And similarly, if we collect

the other data also and all this data is compiled here in such a table. So you can see here the

number of male students who are choosing maths is four and this number we are denoting by n11,

n is indicating the frequency, and what is the meaning of 11: the first 1 corresponds to the first row and
the second 1 to the first column. Similarly, the number of female students who are preferring here mathematics, this

number is given in this cell and this number here is 2 and so we write this frequency as n12 and

12 means 1 is the row, this is the first row and 2 is the column, this is the second column. And

similarly the male students, who are choosing biology this number here is one and this is denoted

as the frequency n21, so once again 2 is going to denote here the second row and this 1 is going to

denote the column. And similarly, the number of female students who are choosing biology is

here n22 is equal to 3, this means the second row and second column the data in the second row

and second column. So you can see here I have denoted here the frequency as here say row and

column. So the address is given by two numbers row and column for a particular type of

frequency in any class. So these are essentially the absolute frequencies.

Now the next step is this I try to count the numbers row and column wise. Suppose I count the

numbers in the first row. This is here 4 + 2 and this is here 6 and if you try to see, I am trying to

denote this number here as say n1+. So this 1 is going to indicate the subject in the first row
which is here mathematics, this one, and this + is indicating that we have added the frequency

over the column. n1+ equal to 6 is going to give us information that how many students are

preferring maths and similarly in case if I try to sum the frequencies in the second row, this is

denoted here as say n2+ and this number is 1 + 3 which is equal to here 4. So once again here this

2, in this n2+ , 2 is indicating the subject in the second row which is here biology and + is

indicating here that this addition has been obtained on the second subscript which is column. So

this n2+ is equal to 4, this is giving us the information that there are four students who are

choosing or preferring biology.

Now the same exercise can be done in columns. So when I try to add the numbers in the columns

here like this, then this comes here 4 + 1 and using the same philosophy of the symbols I am

denoting this sum as n+1, +1 is coming as a subscript. So this + is indicating that the sum has

been obtained on the first column or this is the sum of the frequencies in the first column and this

1 here, this is indicating the first column. So if you try to see, this number here n + 1 this is equal

to 5 is indicating the male students here who are choosing maths as well as biology or any

subject out of this. And similarly if I try to come on the second column here and then I try to add

here n12 and n22, this is here 2 + 3 which is equal to here 5. So this number here n+2 this is going

to indicate that the sum of the frequencies has been obtained based on the second column by n+2.

So this is essentially giving us the number that how many female students are preferring maths

and biology. So you can see here that in this case the entire data whatever we had obtained here

on the basis of the sample has been classified into a two by two table and this type of two by two

table is called as contingency table and in particular this will be called as two by two

contingency table.

Now if you try to see what the different symbols are indicating in general, this I

have summarized here. And you see here what is here nij in general this is the frequency in the

ijth cell and essentially this is the absolute frequency in better terminology and similarly when

I'm trying to take here n1+, this is indicating the row total n11 plus n12. So this is indicating the

first row of the data or the sum of the frequencies in the first row of the table. Similarly n2+ is

equal to n21 + n22. So you can see here this 2 remains the same and this + is indicating that the

sum of this 1 and this 2 has been obtained.

So this is going to give us the sum of the frequencies in the second row of the table and similarly

in the case of columns n+1 is equal to n11 + n21. So you can see here the sum is obtained over this

1 and 2 and this is indicated by this + sign and this one and this one they remain the same and

this is going to give us the column total of the frequency of the first column. And similarly n+2 is
equal to n12 + n22 and similarly this quantity is going to give us the sum of the frequencies in the

second column. Now if you try to look in this table and try to see here n equal to 10, what is this

n equal to 10. This is indicating the sum of all the frequencies and if you try to see here, this can

be obtained in different ways. First is this, this is n11 + n12 + n21 + n22. This can also be obtained

as here sum of the frequencies in the rows. So this is n1+ + n2+ this is again equal to 10. So the

sum has been obtained from the row and similarly if you try to take the sum of the columns, then

again this can be represented as n+1+n+2. So this is what I have mentioned here in the last line

that n is equal to sum of all the frequencies which is here and this is the same as sum of the

frequencies in the row and sum of the frequencies in the column and so n is going to indicate the

total frequency. So that is going to be our general symbols and notations in contingency table.
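To make these symbols concrete, here is a minimal R sketch of the boys/girls table above (the object name tab is just illustrative), using the base R functions rowSums, colSums and sum to recover the marginal and total frequencies:

tab <- matrix(c(4, 2,
                1, 3), nrow = 2, byrow = TRUE,
              dimnames = list(subject = c("Mth", "Bio"),
                              gender  = c("M", "F")))
rowSums(tab)  # n1+ = 6, n2+ = 4, the marginal frequencies of the subjects
colSums(tab)  # n+1 = 5, n+2 = 5, the marginal frequencies of the genders
sum(tab)      # n = 10, the total frequency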

So now if I try to make it more general. Suppose I try to take in general, two variables, two

discrete variables on which the observations are obtained as counting. So you can see here for
example, in the earlier case X was denoting the gender and Y was denoting the subject. And we

had divided the data into two classes for X and two classes for Y. Two classes of X are male and

female and two classes of Y are maths and biology. Similarly I can make it more general and we

assume that on the data on X variable, we have created k classes which are denoted as x1, x2, ..., xk, and similarly for the Y we have created l classes, say y1, y2, ..., yl, and this nij is the absolute

frequency of the ijth cell corresponding to the observations xi, yj and obviously i goes from 1 to

k and j goes from 1 to l. So now, in the general case, these frequencies can be

represented in k cross l contingency table, similar to what we have represented in the two-by-two

contingency table and this contingency table will look like this. I have used here different types

of colors so that you can get an idea.

So you can see here this part in blue color, this is going to indicate the absolute frequencies of

different classes. So n11 here is indicating the absolute frequency in x1, y1 cell. Similarly here

say nij that is going to indicate the frequency in the xi and yjth cell and so on. So all these values

n in blue color they are going to indicate the absolute frequency of all the classes.

Now we try to find out the row sums and column sums. So we try to add all these frequencies

here. So this here n1+ is equal to n11 + n12 + ... + n1j + ... + n1l. So this is denoted here and

this quantity is called as marginal frequency for the values here in this column which are

indicating the total of the rows or the sum of the frequencies in different rows they are called as

marginal frequencies and more precisely this is called as marginal frequencies of X or x1, x2, xk

different classes. Now we try to do the same operation on the columns. Suppose I take the

frequencies in the first column and I add them here. So this will be n11 + n21 and so on, up to nk1 in general.

So this sum is going to be denoted here as a n+1 and similarly if we try to do the same operation

for each and every column here, here and so on. So these frequencies are denoted by here n+j and

if you try to see, they are the summation of nij over i from 1 to k, that means the sum of all the frequencies in

the column. So this is also called as marginal frequencies and they are essentially denoting the

marginal frequencies of Y, that means the marginal frequencies of the classes y1, y2, up to yl and so on.

Now finally if you see this here n now this n can be obtained as a sum of all the frequencies

which I am denoting here in red color or say all the frequencies which are denoted here in blue

font, blue color. So this is the sum of all here n and this sum can also be obtained by adding the

values in the rows, that is n1+ + n2+ + ... + ni+ + ... + nk+, and this value is going to be the same as

n and similarly if you try to add here the marginal frequencies of the columns that will also give

you the same value here and this I have denoted here in this expression. So this is called the total

frequency. So this is how we try to interpret and we try to construct the contingency table and

this contingency table is our k cross l contingency table. Why? Because there are K number of

rows and there are l number of columns.

So now we have understood that all the data this can be represented in different types of

frequencies and we simply try to summarize it here once again. This thing here and as we have

discussed, these nij’s are going to describe the absolute frequencies. Now in case if you try to

see what are these nij’s representing. These nij’s are giving us different numbers about the

choices between X and Y which are occurring together. So these values of nij they represent the

joint frequency distribution of X and Y. Now you may recall that when we had discussed the

frequency table in a univariate case then we had only one variable but now here I have two

variables X and Y and we had discussed that how the frequencies are distributed in different
class intervals that was compiled in a univariate frequency table but now since we have here two

variables or the frequencies inside the cell, they are determined by two values; the value of X and

the value of Y that is why this is called as joint distribution.

So this joint frequency distribution tells how the values of both the variables behave jointly, and

just to inform you here that here I am trying to take only two variables but these variables can be

more. There can be three variables, there can be four variables, and corresponding to those numbers we can create a suitable contingency table. Suppose we have three variables X, Y, Z: X having two classes, Y having three classes and Z having four classes. So then we will create a

table of the order 2 by 3 by 4 or 2 into 3 into 4 contingency table.

Now the next symbol here ni+ this was the sum of the frequencies and similarly n+j. This was

again the sum of the frequencies in rows and columns. So these values are representing the

marginal frequency distribution of X and marginal frequency distribution of Y. What does this

marginal frequency distribution tell us? The marginal frequency distribution tells how the

values of one variable behave in the joint distribution of X and Y. So now you can think here that

these values are going to be determined by two values and we assume that the value is affected

by two variables X and Y. So obviously one question comes that when we have the joint

distribution of two variables X and Y, what is the contribution of X and what is the contribution of Y. So this information can be dug out from this bivariate frequency table or the

contingency table by finding out the marginal frequencies.

Similarly when we had discussed the concept of frequency table then we had two types of

frequencies; absolute frequencies and relative frequencies. The advantage of using the relative

frequencies was that the sum of all the relative frequencies will always be equal to one and the relative frequency of every cell or any value will always be between 0 and 1. So this is very similar to the concept of probability. So in fact these frequency tables, in the univariate or say bivariate case, they are indicating or they are representing the probability distribution of a discrete

variable in the case of say this probability theory.

So now in this case also, in place of absolute frequency, we can also use the relative frequency

and relative frequency will be obtained simply by dividing the absolute frequency by the total

frequency and a new table or a new contingency table can also be created using the bivariate

frequency table based on the relative frequency. So in case if we try to use the relative frequency

in place of absolute frequency then the similar information is provided and we call as joint

relative frequency distribution. And similar to marginal frequency distribution. Now we will
have marginal relative frequency distribution and there will be one more concept what we call as

a conditional relative frequency distribution.

So what is this we try to understand here. You see the relative frequency of any class or say any

class corresponding to xi, yj or ijth class, this will be obtained by nij/n and this is indicated by the

symbol fij. So similar to nij’s, now this fij’s will represent the joint relative frequency distribution

of X and Y. Now we try to obtain one more quantity which is called as conditional frequency

distribution. This conditional frequency distribution is obtained in two cases. When the value of

Y is given or the value of X is given. So when the value of Y is given say Y equal to some

particular value yj then in this case the conditional frequency distribution of X given Y equal to

yj is obtained by nij divided by n+j that is the frequency divided by the marginal frequency of that

class and this is denoted here say F of x given Y and Y is given as Y equal to yj and there is a

subscript here i given j; that's a standard symbol for indicating the conditional frequencies.
Now the second case will be that, instead of Y being given, suppose X is given. The value of X is

given and suppose X is given as xi then in that case the conditional frequency distribution of Y

given X equal to xi is obtained by nij/ni+ so that is again the ratio of say this here frequency

divided by marginal frequency and this is denoted as say f of y given X equal to xi and in the

subscript, we write j given i. This symbol here, the vertical line, is read as 'given'.

So these conditional frequencies or the conditional frequency distribution gives us an

information that how the values of one variable behave when another variable is kept fixed. For

example, we have considered the case of gender versus subject. Now suppose I want to know

what is the behavior of the subjects for a given gender. Then this type of information can be

obtained by the concept of conditional frequency distribution. So I will try to take an example to

show you that how to interpret such values. But before that let me just write all the symbols in

general.

So when I try to find out the sum of all the frequencies in the rows and columns corresponding to

X and Y, they will give us the marginal relative frequency similar to the concept of marginal

frequency. So when I try to sum all the relative frequencies corresponding to X then for the ith

class I get fi+ which is here and this is the sum of all the frequencies in that particular class or say

particular row. Similarly the marginal relative frequency distribution of Y values or the classes in

Y, this is denoted as f+j, and this is the sum of fij over i from 1 to k. And similarly the

conditional relative frequency distribution of X given Y equal to yj this will be denoted by fx

given Y and here i given j in the subscript and similarly the conditional relative frequency

distribution of Y given X equal to Xi this is denoted by here f of Y given X and in the subscript j

given i.
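As a hedged illustration (reusing the object tab from the earlier sketch), the base R function prop.table returns exactly these three kinds of relative frequency distributions:

prop.table(tab)              # joint relative frequencies fij = nij / n
prop.table(tab, margin = 1)  # conditional distribution given the row, nij / ni+
prop.table(tab, margin = 2)  # conditional distribution given the column, nij / n+j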

So now let me take a very simple example and we try to first understand that how these values

are obtained and how to interpret them and after that I will show you that how to create the

contingency table and this type of marginal frequencies inside the R software. Suppose I have an

example here where a soft drink was served to some persons and those persons have been

divided into three groups depending on their age. First group is children. Second group is young

person. And third group is elder persons. And they would ask that how the drink taste and they

were given two options whether the taste is good or the taste is bad and based on that we have

obtained the data. For example, you can see here in the row whatever the data I have obtained

this has been counted and compiled in a 2 cross 3 contingency table like as follows. You can see

here, in the columns I'm writing here three classes of children, young persons, and elder persons, and in the rows, I am taking another variable taste, which has two classes, good and here bad,

and then based on the data collected from such hundred persons if we try to count that how many

children said that the drink is good, and this number is, suppose, 20; so 20 children said the taste is good here. And similarly we try to count how many young persons said that the drink is

good and similarly, we try to count that how many elder persons said that the drink is good and

this number is here 10. And similar information was obtained for children and there are 10

children who said that the drink is bad. Similar to this, there are 15 young persons who said that

the drink is bad and similar to this, there are 15 elder persons who said that the drink is bad. Now

you see, in this data, we have three groups on age and two groups on taste. So there are three

classes of age and two classes of taste and one can see here that the taste and age are not

independent. Different people in different age groups they are giving a different opinion. Had

this drink been very good, then we would expect that all the persons, all the 100 persons, would have said

the drink is good but this is not happening here. So this is indicating that the variables X and Y

are not independent but they are correlated, and my issue, my question here, is this: how to

measure this association?

So there are different types of measures which have been suggested and all those measures try

to measure this association in different ways. So the objective here is to understand what are

those measures and how are they going to give us information.

Now after this I try to find out the row sums and column sums, so you can see here the sum of

20 + 30 + 10 here is 60. So this 60 is going to give us the information on the marginal frequency.

So I can see here by looking at this number 60 that there are 60 persons out of 100 who said that the

drink is good and similarly this marginal frequency which is here 40, this is obtained as the sum

of this 10, 15, and 15 and this is here 40 so this number 40 is essentially indicating that out of

100, there are 40 persons which are saying that the drink is bad. Now on the same lines, let me

try to add the numbers in the column. So you can see here I try to add here this 20 and this here

10 and this gives me value here 30, 20 + 10 is equal to 30. So this number here at 30 this is

giving us an information that out of 100 persons there are 30 children and similarly in the second

column I get this number 45 which is equal to 30 + 15. This and this number. So that is

indicating that there are 45 young persons out of this hundred persons and similar to this and the

last column of elder person, this value is 25. So that is indicating that there are 25 elder persons

in the sample of 100 persons. So you can see here that this marginal frequencies in rows and

columns, they are giving us a particular type of information and this information has been

obtained by making one of the effect to be the constant. For example, when I say that how many

persons said that the drink is good then we are suppressing the information on the age and we try

to add simply children + elder person + young persons together. And similarly, when I want to

find out that how many persons are there in different categories then I am trying to suppress the
variable taste. I am not bothering who said good or who said bad but I am simply counting that

how many children said the drink is either good or bad. Similarly, how many young person or

elder person said that the drink is good or bad, right. So this is the type of information what we

obtained from the marginal frequency and now the similar information can also be obtained in

terms of relative frequencies. What we have to do: in the same frequency table, I simply have to divide

each and every frequency nij by total frequency n.

So you can see here that total frequency here is 100. So first cell has a relative frequency 20 by

100 and similarly the other cells have 30 by 100, 10 by 100, 10 by 100, 15 by 100 and 15 by 100. Now once again, when I try to sum them row wise, then the sum is 60 by 100 and 40 by 100

in the first and second rows. So they are essentially trying to give us the marginal relative
frequencies, and in the columns when I try to add it here these numbers are 30 by 100, 45 by 100 and

25 by 100. So this is trying to give us the same information in terms of relative frequencies and

obviously here the sum of all the frequencies will always be equal to one.
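A small R sketch of this computation (the object name drink is illustrative, and the addmargins command used here is introduced formally in the next lecture):

drink <- matrix(c(20, 30, 10,
                  10, 15, 15), nrow = 2, byrow = TRUE,
                dimnames = list(taste = c("good", "bad"),
                                age = c("children", "young", "elder")))
prop.table(drink)              # every nij divided by the total frequency 100
addmargins(prop.table(drink))  # adds the marginal relative frequencies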

Now so if you try to see what type of information, I have got here from this table well I'm not

going to discuss here each and every information but I will try to give you some information. So

one thing is clear that this is a joint frequency distribution and it is informing us that how the

values or both the variables behave jointly. So when I try to see here about the marginal

frequency distribution, so you can see here in this case, where I am now making here as circle if

you try to see here 60. So there are 60 persons who said that the drink is good and there are 40

persons who said that the drink is bad. So in general I can also write it there as 60 out of 100, and if I multiply by 100 this will give us the value in percent. So I can say in general that 60% of
the persons said that the drink is good, and similarly 40 persons, or 40%, said that the drink

is bad. And similarly, if I try to look at the values in the column say here, here, and here then

30%, 45% and here 25% of the persons in the sample are children, young persons, and elder persons. So you can see here that I can say that there are 45% young persons and say 25% elder persons

in the sample.

Similarly, if I try to find out the frequency distribution in terms of conditional frequencies then

this conditional frequency distribution is giving us an information that how the values of one

variable behave when another variable is kept fixed. So you can see here I am obtaining here a

value 20 by 60 and I am saying that 20/60 into 100% children said that the drink is good. How

this has been obtained? If you try to see how this 20 and 60 values are coming. So now you can
see here I will try to make here a circle in a red color, if you try to see this data here, this is here

that 20 and this is here the 60, the marginal frequency. So you can see here, I'm trying to fix one

variable here which is good. This I'm now fixing and that is the idea of the conditional frequency

distribution that I have fixed the variable here good and then I am trying to find out the

conditional frequency by taking the absolute frequency 20 and the marginal frequency 60 and

this is going to give us an information that how many children said the drink is good and

similarly there will be another information that well how many children said that the drink is bad

so for that I try to take the information on here the bad and here the frequency is 10 and the

marginal frequency here is 40. So I try to take here 10/40 or this is equivalent to 25% of the

children said that the drink is bad and similarly if I try to come under columns you can see here

that I'm trying to take here different values from here and I am trying to say here that 30/60

which is equal to 50% of the young person said that the drink is good. So I'm again fixing the

variable here good and then I'm trying to take the frequency here say nij divided by marginal

frequency and similarly, I can have the information about the young children, sorry the young

person, who said the drink is bad. So this frequency of young persons who said the drink is bad can be divided by the marginal frequency, and this comes out to be 37.5% of the people who said that the drink is

bad.
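Continuing with the illustrative drink matrix from the sketch above, these conditional frequencies can be verified in R by dividing each cell by the corresponding marginal frequency:

prop.table(drink, margin = 1)                     # each row divided by its row total
drink["good", "children"] / sum(drink["good", ])  # 20/60, about 33.3%
drink["bad", "young"] / sum(drink["bad", ])       # 15/40 = 0.375, i.e. 37.5%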

So now if you try to see what we have done here we have understood the concept of how to

create the bivariate frequency table and in turn how to convert them into a contingency table and

this contingency table is going to give us different type of information. Now the next question is

how to create this frequency table and contingency table or the contingency table from the

frequency table using the R software. So this I would try to discuss in the next lecture.

In this lecture you please try to revise all these concepts and try to understand them what they are

trying to say, what they are trying to tell. Once you understand them then getting the output from

the software is very simple but the main thing will be how to interpret those values. So you

practice here and I will see you in the next lecture. Till then good bye.

Lecture-32

Association of variables

Measures of Association for Discrete and Counting Variables: Contingency table with

R, Chi-Squared Statistics, Cramer’s V Statistics and Contingency Coefficient

Welcome to the next lecture on the course, descriptive stats it with R software. You may

recall that in the last lecture we started our discussion on measuring the association between

two discrete variables, on which the observations were obtained as numbers obtained by

counting, and in that lecture we had discussed that from the given set of data, we can create a

table, and a contingency table. From that contingency table, we can obtain marginal

frequency distribution and the conditional frequency distributions, and these marginal and conditional

frequency distributions can be obtained in terms of absolute frequency and relative

frequency, and we had taken an example and we understood how these values are coming and

we understood how to compute them manually.

Now, in this lecture I will try to show you that these contingency tables can be obtained

inside the R software. So I will take an example and I will try to show you that how to obtain

the contingency table and after this I will introduce some measures, some quantitative

measures to find out the magnitude of the association or the degree of association between

the two variables. So now first I try to take the topic that how to construct the contingency

table in R software. Right.

Refer Slide Time: (2:03)

So as usual I will assume that we have got here a data vector and suppose we have two data

vectors x and y. You may recall that in the case of univariate frequency distribution, if I have

the data vector as x then we had used the command table to find out the frequency table. And

when this table was divided by the length of x then we had got the frequency table in terms of

relative frequencies. Similar to that when we want to tabulate the bivariate data, the same

command is used that is table t a b l e, the only difference is that now inside the argument you

have to give that two variables or two data vectors, and similarly if you have more than two

you can express all those data vectors here separated by comma. So, this table (x, y) is used

to cross classify the factors to build a contingency table of the counts at each combination of

the factor levels. Right, and if you try to use this command table (x, y) this will give you an

output in the form of a contingency table with absolute frequencies, and similarly if you try to

divide this table by the length of the data vector, then it will return a contingency table with relative

frequencies.

Refer Slide Time: (3:36)

838
Now in case if you want to find out the marginal frequencies, then there is a command here

addmargins, a d d m a r g i n s, and this command addmargins is used along with that table

command, and this gives us the marginal frequencies to the contingency table that was

constructed by the command table. So the entire command to obtain or to add the marginal

frequencies in the contingency table will become addmargins and inside the argument try to

tell, adding margin to what? So add margin to the contingency table which was provided by

the command table (x, y), and in case if you want to obtain the marginal frequencies in terms

of relative frequencies, then you simply try to use this addmargins command inside the

argument and inside that argument, try to write down the contingency table of which you

want to obtain the marginal relative frequencies. Okay.
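A minimal hedged sketch with assumed toy vectors x and y, just to show the shape of these commands:

x <- c("a", "a", "b", "b", "b")
y <- c("u", "v", "u", "u", "v")
addmargins(table(x, y))              # absolute frequencies plus marginal sums
addmargins(table(x, y) / length(x))  # relative frequencies plus marginal sums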

Refer Slide Time: (4:51)

So now I try to take a very simple example, and I try to convert the given data into a

contingency table and then I would try to obtain the marginal frequencies. Suppose there are

twenty persons and they have been divided in three categories with respect to their age as a

child, young person, and elder persons and all of them were given a drink and the taste of the

drink was asked? You must note here that this is a similar example which I took in the earlier

lecture, in the last lecture, where I took hundred persons but my objective here is to show you

that how the contingency table is constructed from the raw data. So, showing you here

hundred observation is more difficult so that is why I am taking here only twenty

observations. So, you can see here that there are twenty persons one to ten and then here

eleven to twenty and first person is a child, and the child has been asked that how's the drink,

and he responds good. Similarly, the second person is a young person who said the drink is

good. Similarly, the third person is an elder person who said that drink is bad, and then the

fourth person is a child who said that the drink is bad and so on and this is how we have

collected the data on the age and taste here. Right.

Refer Slide Time: (6:30)

So now you see I would try to store this data into two data vectors. One is here person in

which I would try to store the data set which is here this and here this.

Refer Slide Time: (6:41)

So, I have simply typed it.

Refer Slide Time: (6:48)

And then in the second variable which is here taste here I have collected the data,

Refer Slide Time: (6:52)

On this taste: good, bad and so on. And this data has been assigned to these vectors in the same

order.

Refer Slide Time: (7:00)

Same order means if you try to say here first person here a child and this child said that the

taste is good, you can see here and now if you try to see here in the data vector,

Refer Slide Time: (7:13)

this is the thing here child is saying the taste is good and so on.

Refer Slide Time: (7:20)

So, these observations are written exactly in the same order as in the table.

Refer Slide Time: (7:25)

And now after this I have to use the command table and inside the arguments, person

separated by comma, taste, and this command will provide a contingency table with absolute frequencies, and when I try to execute this command on the R console, I get here a table like this one. So now how to interpret this table and how to read this table, that is more important to learn. You can see here one variable here is taste and another variable here

is person, and this person has three categories child elder person and young person and these

categories are the same what you have denoted in the data, and similarly that taste is also

divided into two categories bad and good. And this classification has been done by the R

software automatically by Counting that how many persons are in which category. Now this

is showing you here, for example, if you try to see this is here your contingency table data or

the frequencies, so these values are your absolute frequencies.

For example this two is indicating that there are two children which are saying that taste is

bad and similarly if you try to see here, I will use a different colour pen say here six, so six

means that there are six elder person who are saying that taste is good, and similarly if you

try to take another data here say here 2, so this 2 is indicating that there are two young

persons out of 20 who are saying that the taste is bad. Now next, we would like to obtain the

marginal frequencies so you can see here I am using here the command addmargins and

inside the argument I am using the same command which was obtained here to get this

contingency table. Now you can see here, this contingency table is the same which is

obtained here but now there is one more column and one more row which is added in this

case here you can see here this is here sum and sum, so what is this sum this value here is 4

you can see here this has been obtained by 2 plus 2 is equal to 4, and similarly if you see here

10, this is here 4 plus 6 is equal to 10, and similarly if you see here is this 6 the 6 is here 2

plus 4 is equal to 6, and similarly if you try to take here this first column, 2 plus 4 plus 2 this

is equal to here 8, and similarly for the second column, if you try to add 2 plus 6 plus 4 this is

here 12. So you can see here this sum is indicating the marginal frequencies, so these are the

based on row and similarly here, this is the marginal frequencies based on columns, and

finally this value here 20, this 20 is the sum of all the observations or sum of all the

frequencies, that is 2 plus 2 plus 4 plus 6 plus 2 plus 4, which is equal to 20. So, this is

how we are going to obtain the table with marginal frequencies.
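For completeness, here is one ordering of the twenty responses that is consistent with this table (the order of the individual persons does not change the contingency table), written compactly with rep:

person <- rep(c("child", "elder person", "young person"), times = c(4, 10, 6))
taste  <- rep(c("bad", "good", "bad", "good", "bad", "good"),
              times = c(2, 2, 4, 6, 2, 4))
addmargins(table(person, taste))  # reproduces the table with margins shown above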

Refer Slide Time: (11:22)

Now this is here the screenshot which I will show you that how I have obtained on the R

console,

Refer Slide Time: (11:31)

Now we try to find out the same thing with respect to the relative frequencies. In order to find

out the contingency table with the relative frequencies, first we need to find out the total

number of observations. In order to find out the total number of observations we have two

options since we have got here two variables one is here person, and another here is taste and

you observe that both these variables have got 20 observations. So now I can use here the command length of person, or I can also use here length of taste. Both are going to give

us the same values because both the variables have got the same number of observations. So,

once I try to operate here the command length of this vector, I will get here a value here 20

which is indicating the total number of observations. Now in order to find out the

contingency table, I have to use the same command but now I have to divide it by the length

of the data vector, so I try to use here the same command table, person, taste and now it is

divided by length of the data vector, and once you try to do it, you will get here an outcome

like this one. So first, let me try to show you that how you are getting this value, so suppose if

I take here this value here 0.1 how it is coming if you try to see here,

Refer Slide Time: (13:18)

In the earlier slide, we had obtained the frequency here this year two which is corresponding

to bad taste and a child. Right. So now this two is being divided by the total number of

observations which is here 20 and this will be equal to here 1 upon 10 and which is equal to

here 0.1.

Refer Slide Time (13:47)

And this is here the same value, so this is nothing but your nij upon n, which is here 2 upon 20 and is equal to 0.1, and similarly if you try to find out here how this value

has been obtained 0.3

Refer Slide Time: (14:06)

So, you can see here the corresponding value here is 6.

Refer Slide Time: (14:10)

So, I try to use here nij upon n, which is equal to here 6 upon 20, and this comes out to be 3 upon 10, which is equal to here 0.3. So, this is how all other values in

this table are obtained. Now in case, if I also want to find out the marginal relative

frequencies, so I have to use here the command addmargins and I have to use the same

command which you have used to find out the contingency table, and in case if you try to do

it, you can see here this part here, this is the same as this part here because this is

corresponding to the contingency table, and now there is additional row and column here

which are here like this, this and here this. So the first question comes what are these

additional rows and columns are indicating? So I will try to show you here that suppose if I

try to take here the first row, so for the sum of first row which is 0.1 plus 0.1 is equal to 0.2

and this is indicated here in the first value in this column. Similarly, if you try to add here 0.2

plus 0.3 which is equal to here 0.5 so this is the second value, and similarly the third value

here is 0.1 plus 0.2 which is equal to here 0.3. So they give us the values of marginal relative

frequencies. Similarly, if you try to look into the columns, so if I try to sum here these things

so 0.1 plus 0.2 plus 0.1 this is being given here as 0.4, and similarly in the second column 0.1

plus 0.3 plus 0.2 in this direction, this is giving us the value 0.6. Right. So, these two values

here 0.4 and 0.6, they are also the marginal relative frequencies.

Refer Slide Time: (16:40)

And this is here the screenshot of the operations which are to be done on the R console. Now

going further, let me try to show you all these operations on the R console. So I will

try to take the same example and I will try to enter the same data set, and I will try to obtain

the contingency tables with respect to the absolute frequencies as well as with respect to the

relative frequencies.

Video Start Time: (17:10)

So you can see here, first I will try to create these two data vectors here on the R console and

similarly I try to create this data vector taste so you can see here now I have the data on

person, and here taste. Right, Okay. I clear the screen and I try to now create here the

contingency table. So you can see here this is the person and here taste. This is the same table

what you have obtained. Now I just want to show you that if you try to interchange person

and taste, so first, I try to give inside the argument taste and then person. You see what

happens? So obviously the data will remain the same but only the rows and columns are

interchange that you can observe here. So, in case if I want to find out the marginal frequency

of anyone say person comma taste, you can see here this is obtained here like this, so this is

the same thing what we had just obtained. Now I try to find out the same contingency table

with respect to the relative frequencies. So you can see here I'm trying to take here the same

command but now I'm trying to divide it by the length of the data vector and I'm choosing

here the data vector to be person and you can see here that we have got here this type of

output and these are the values which are the same which we have shown you on the

slides, and now in case if I want to find out the same contingency table with respect to the

length of another data vector, so I try to choose here the data vector taste inside of here

person and you will see here that in both the cases, you are going to get the same command

because length of the two data vectors here is the same which is here 20. Right, and now in

case if I want to add the marginal relative frequencies so I have to use the same command but

I have to add here one command addmargins so you can see here that this gives us this value.

Right. So, you can see here now the sum of all the marginal relative frequencies is coming

out to be here one. Right, and this is the same output which I had shown you on the screen.

So you can see here that finding out search contingency table with respect to absolute

frequencies or relative frequencies is not difficult once you have some data set.

Video End Time: (20:16)

And now I would try to discuss one more new topic. So now I'm going to discuss a tool

which is called as the χ² (chi-squared) statistic, and the role of this χ² statistic is that it tries to give us an idea

by quantifying the degree of association, similar to what we had in the case of continuous

variable, we had correlation coefficient. Right. So one thing you have to keep in mind that

when I am going to introduce the χ² statistic: actually this χ² statistic is used for testing of hypothesis, and the χ² statistic is based on a probability density function which is called as the χ² probability density function, and when we try to use this statistic, there are certain conditions

in the case of test of hypothesis. For example, the cell frequency should be greater than 5 and

so on but here you see I am taking an artificial example so in this example means, I have kept

the frequencies to be low means if you have more data then obviously these frequencies are

going to be higher, so while computing the statistics on the R software, you may get sort of

warning but you need not to worry for these things you have to follow essentially the

procedure and the concept.

Refer Slide Time (21:48)

So this χ² statistic, or this is called as Pearson's chi-squared statistic; this symbol chi is a Greek letter which is written like this, χ, and this χ² statistic is used to measure the association between the variables in our contingency table. Suppose there are two variables; so the χ² statistic for a k cross l contingency table, what we have created earlier, is given by this quantity:

$$\chi^2 = \sum_{i=1}^{k}\sum_{j=1}^{l} \frac{\left(n_{ij} - \frac{n_{i+}\,n_{+j}}{n}\right)^2}{\frac{n_{i+}\,n_{+j}}{n}}$$

So, you

can recall that this nij is giving you the value of the absolute frequency, and these ni+ and n+j are the marginal frequencies of x and y, and small n here is the total number of observations or the total frequency. So this is the statistic which gives us an idea about the

degree of association, and this value satisfies 0 ≤ χ² ≤ n[min(k, l) - 1], where min(k, l) means whatever is the minimum value out of k and l; this can be k or l, whichever is minimum.

Refer Slide Time (23:24)

Now what is the interpretation of this statistic? Given the data, one can compute this statistic, and in case the value of χ² is coming close to zero, this will indicate that the association between the two variables is weak. So, a value of χ² close to zero indicates a weak association between the two variables x and y, and similarly you can see here,

Refer Slide Time: (23:53)

that the χ² has the limits zero and n[min(k, l) - 1]. So, in case if the value of

Refer Slide Time: (24:01)

χ² is close to the second limit, which is n[min(k, l) - 1], then this would indicate a strong

association between the two variables remember this thing this is not between like zero or

one or minus one or plus one, something like this. So, this value depends on the size of the contingency table; well, that's a drawback and that was overcome in some further modification that we are going to discuss. Now, in case if you are getting any other value of χ² which is not close to zero or to n[min(k, l) - 1], then suitably that will indicate the degree of association between the two variables, say as low, moderate or say high, in general. One aspect of this χ² statistic is that it is

symmetric, symmetric means that which of the variables you are taking in the row or say

column this will not make any difference for example, you had seen that we had constructed

the frequency table with person and taste, and with taste and person; in both the cases this χ² statistic will remain the same.
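As a hedged sketch of this definition (chisq_manual is a hypothetical helper name, not a built-in R function), the statistic can be computed directly from any contingency table, and its symmetry is visible because swapping rows and columns does not change the sum:

chisq_manual <- function(tab) {
  n <- sum(tab)                                      # total frequency
  expected <- outer(rowSums(tab), colSums(tab)) / n  # ni+ * n+j / n for every cell
  sum((tab - expected)^2 / expected)                 # the chi-squared sum
}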

Refer Slide Time: (25:18)

Now, let me just give a particular example which is more popular, say in case if I have only a two by two contingency table. So, suppose there are two variables x and y which have

got two classes each x1, x2 and y1, y2 and absolute frequencies in these cells are a, b, c and d

which I have written in green colour. And now, if you try to find out the sum of the rows, that is the marginal frequency, this will come out to be a plus b, and similarly the marginal frequency of the second row corresponding to x2 is c plus d; and similarly the marginal frequency with respect to the first column is a plus c, and the marginal frequency with respect to the second column y2 is b plus d. In this case, if you try to substitute all the values inside the χ² statistic, this will simplify to

$$\chi^2 = \frac{n(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

where here obviously n is equal to a plus b plus c plus d, which is indicating the total frequency. Right! Okay. Now, after this what I will do is the following,

that I will take a simple example based on this two cross two data and I will try to compute this χ² statistic, and later on I will introduce some more statistics and then I would try to measure the degree of association in the same example using different statistics. So, this data is about a sample of hundred students that was collected, and they were judged whether they are weak in academics or good in academics, that is, good in their studies or bad in their studies, based on their academic performance. Now after this a group of students was given a tuition and after the tuition they were judged once again, and the idea was that we

wanted to know whether this tuition is going to be helpful or not, whether this tuition means

extra studies are really helping the students and improving their academic performance? That

is the question which I would like to know on the basis of given sample of data. So, what we

have done here?

Refer Slide Time: (27:44)

This sample of hundred students is divided into two groups, weak and strong in academics, and some of the students, from say weak and strong both, they were given tuition and after that their academic performance was judged, and after this the data was compiled in the following contingency table here. So, you can see here there are weak students and there are strong students, and students who were given the tuition or who were not given the tuition. So, it

was found finally that there are 30 weak students who were given the tuition and there were

10 strong students, 10 good students who were also given the tuition. Similarly, there were 20

weak students who were not given the tuition and there were 40 strong students who were not

given that tuition. And based on that we would like to find whether there is any association

between tuition and the academic strength of the students.

Refer Slide Time: (28:50)

So, we try to find out the marginal frequencies. So, you can see here the marginal frequencies are here 30 plus 10 is equal to 40, 20 plus 40 is equal to 60, and similarly 30 plus 20 is equal to 50, and 10 plus 40 is equal to 50, and the total sum here is hundred. Based on that I try to substitute all these values of frequencies in terms of a, b, c, d.

Refer Slide Time: (29:12)

what we have done here

Refer Slide Time: (29:12)

in this table and I try to use the same formula here and

Refer Slide Time: (29:18)

I try to compute this value, and this value of χ² comes out to be 16.66; this is a manual calculation, Right! And then the value of the upper limit, which is n[min(k, l) - 1]: so, you can see here the value of n here is hundred and the k here is two and l here is two; the minimum value between 2 and 2 is 2, so 2 minus 1 is 1, and this upper limit here comes out to be hundred. So, now you can see

here whether this value 16.6 is close to zero or close to hundred this is what we have to see

and based on that we have to take a call what's really happening. So, you can see here that

this χ² value is not close to zero, but on the other hand it is also not close to hundred. Right.

So, one may conclude that well there can be a moderate association or a lower association, it

depends on you how you want to interpret it there is no hard and fast rule to decide what is

low and what is moderate and what is strong but, yeah in my opinion I can say that well there

is a moderate association.
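This manual calculation can be checked with the chisq_manual sketch given earlier (the row and column labels below are assumed from the slide):

tuition <- matrix(c(30, 10,
                    20, 40), nrow = 2, byrow = TRUE,
                  dimnames = list(tuition = c("yes", "no"),
                                  student = c("weak", "strong")))
chisq_manual(tuition)  # 16.666..., matching the hand computation of 16.66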

Refer Slide Time: (30:30)

Now I try to take the same example which I had considered earlier, and I would try to find

out the χ² statistic. So, this is the same example which I just considered, Right!, in which we have collected the data on 20 persons and their responses for the taste of a drink are recorded based on the category of the age, that is, child, young person or elder person.

Refer Slide Time: (30:54)

So, you have already done this job that you had obtained the contingency table here using the

command table(person, taste). Now in order to obtain this χ² statistic, I am giving

you here two things, one is the command and second is the application on this example. So,

the command here you can see here is c h i s q dot test. So, this is a short form of χ² test and

you can see here that inside the argument I am trying to give the data in terms of contingency

table. So, this is table, then person comma taste, and the brackets closed; this is the same command which is given here. Now after this I am using here a command dollar, which is

given on your keyboard and after this I have to write down statistic, s t a t i s t i c and this will

give you this outcome. So, the outcome looks like this: this is showing that the value of χ², which is written here, say, as X because chi is a Greek letter and in R we cannot type the Greek letter, is coming out to be 0.277 and so on, but here you will see that there is also a warning message, and it is saying that in the χ² test of this data the χ² approximation may be incorrect. Why is this happening? I just informed you that the χ² statistic which we have

used here to find out the association between the two variables persons and tastes, this is

actually used for test of hypothesis, to test the hypothesis whether there is a significant association between the two variables, see here in this case persons and tastes. So,

when we try to apply the χ² test of hypothesis, then there are certain conditions for the applicability of the test, and one of the conditions is that each and every frequency should

be greater than 5 and here you can see here in this case there are several frequencies which

are not greater than 5 like as 2, 2, 4, 2 and 4 and that is why this outcome is given here in

terms of warning, and the χ² test is telling you: well, you are trying to find out the χ² statistic but some of the frequencies are smaller than 5, so that means the values may not be so

accurate and you may have a wrong conclusion but that is related to the test of hypothesis and

here our objective is very simple, I just want to show you how to calculate the χ² statistic. I can take a bigger data set also, but then you will not be able to match what R is doing and what you can obtain manually; rather my request will be that you try to take the same example, try to create the same contingency table yourself and try to compute this χ² value; this will come

out to be the same.

Refer Slide Time: (34:31)

So, now let us try to see that how you can compute these things on the R software on the R

console.

Refer Slide Time:(34:41)

So, you may see here that I already have this data on person, which we just used, and taste here like this, and if you try to create here a table, say here person and taste, you get here the same

contingency table. Right.

Refer Slide Time: (35:04)

So, I try to clear the screen so that you can see everything clearly and now I try to compute

the χ² statistic. So, you can see here I just use the same command what I have used in the

slides and this is giving you the same outcome. Okay.

Refer Slide Time: (35:21)

Now, let us come back to our slides. Now you see, by looking at the value of χ², you can judge

whether the association between the academic performance and the tuition is significant or

not. Right. Just by computing the value of n[min(k, l) - 1] and so on. But now I would try to address another aspect. You can see here, in this case you need to compute the values of the range: well, one is zero but another is n[min(k, l) - 1]. So,

the value itself is not giving you a clear cut indication, for example if you remember in the

case of say correlation coefficient that was lying between minus one and plus one or the

magnitude will lie between zero and one. So, just by looking at the value of r, you can very

easily communicate whether the association is high or low and so on. So, here a modification was suggested in the χ² statistic, and a new statistic which is a modified version of the χ² statistic was defined as Cramer's V statistic, and in this case what happened:

Refer Slide Time: (36:40)

31

867
That in the case of this Cramer's V statistic: the range of the Pearson χ² statistic depends on the sample size and the size of the contingency table, so these limits depend on the situation, that is, on what is the number of rows, what is the number of columns and so on. So, this issue was solved and a modified version of the Pearson χ² statistic was presented as Cramer's V statistic for a k cross l contingency table, for the same table, and this was defined as

$$V = \sqrt{\frac{\chi^2}{n[\min(k, l) - 1]}}$$

and the advantage of this V statistic was that it lies

between zero and one.
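A hedged sketch of this formula (cramers_v is a hypothetical helper, distinct from the cramersV function of the lsr package shown a little later), reusing chisq_manual and the tuition table from the earlier sketches:

cramers_v <- function(tab) {
  chi2 <- chisq_manual(tab)                      # Pearson's chi-squared statistic
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))  # divide by the maximum and take root
}
cramers_v(tuition)  # sqrt(16.66 / 100), about 0.41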

Refer Slide Time: (37:36)

So, now making inference about the degree of association became simpler. For example, we can conclude that if the value of V is close to zero, that would simply indicate a low association between the variables, and in case the value of V is close to one, then

this will indicate the high association between the variables and similarly if you have any

other value say between zero and one, then depending on the magnitude of the value that

would indicate whether there has been a moderate association or a lower association or a

strong association. So, now you can see here that in the earlier case we had obtained

Refer Slide Time: (38:16)

the value of χ² to be here 16.66, and we had concluded that the association is moderate. So,

now

Refer Slide Time: (38:28)

for the same example I try to compute this value. So, using that value of χ² to be 16.66, I try to compute the value of the V statistic, and this comes out to be 0.40. So, that would indicate

once again that ideally, V should lie between zero and one but it is lying somewhere in the

middle. So, I can say that there is a moderate association.

Refer Slide Time: (38:52)

And similarly, if you want to compute this V statistics in the R software for that we need a

special package which is called as l s r. So, first we need to install this package by using the

command install dot packages and inside the argument within the double quotes you have to

write lsr, and once you do it the package will be installed; I am not showing you here because I discussed it in the starting lectures. And after that you need to load the library by using the

command library inside the argument lsr, and if you see, what we had obtained earlier, in this

data set.

Refer Slide Time: (39:35)

This example where we had 20 persons who responded for the taste of the drink.

Refer Slide Time: (39:39)

So, I am going to calculate this value for this thing. So, remember one thing, I am taking here

two examples, one I am trying to do manually and another I am trying to do on the basis

of R software. Right. So, this is an example which we had just done on the R software. So,

we had obtained here this type of contingency table and

Refer Slide Time: (40:02)

now I would try to compute the value of Cramer’s V statistics. So, once again I would like to

show you here two things what is the command and what is the interpretation? So, you can

see here the command here is cramersV, c r a m e r s V, where the V is in a capital letter, that you have

to remember, and then I have to give the contingency table for which I would like to compute

the value of Cramer's V statistic, and once I do it this gives me the value 0.11. Right, and

yeah, once again you will see here the warning message this warning message is coming out

because of the same reason that this is based on the χ² statistic and the total number of

frequencies in every cell

Refer Slide Time: (40:57)

you can see here they are, say, smaller than 5, like 2, 2, 4 and so on. Right?

Refer Slide Time: (41:03)

So, this is how you can do it, and this is the screenshot, but I would like to show you it on

R console also.

Refer Slide Time: (41:10)

So, first I try to load the library.

Refer Slide Time: (41:11)

I already have installed this package on my computer.

Refer Slide Time: (41:15)

So, I simply have to load it

Refer Slide Time: (41:20)

and after that

Refer Slide Time: (41:22)

I will try to use the command cramersV, and inside the arguments

Refer Slide Time: (41:32)

I have to give the contingency table for which we want to compute it. So, you can see here that this comes out like this. Right.

Refer Slide Time: (41:43)

Okay. So, you can see it is not difficult at all, and it is easier to interpret because all the values are going to lie between zero and one. So, by looking at the value 0.11, one can conclude that an association is there, but it is quite low. Right. So, the taste does not depend much on the age.

Refer Slide Time: (42:09)

Now, after this there is another coefficient which is used to measure the degree of association in such a case, and this is called the corrected contingency coefficient. It is simply a corrected version of Pearson's contingency coefficient, which is again based on the χ² statistic. This corrected contingency coefficient, which I denote as C_corr (corr is an abbreviation for corrected), is defined as

    C_corr = C / C_max,  where  C = √( χ² / (χ² + n) )  and  C_max = √( [min(k, l) − 1] / min(k, l) ).

This statistic also has the advantage that it lies between zero and one, so it is more convenient to draw conclusions, or statistical inferences, from a value which lies between zero and one. It also has similar interpretations: if the value of C_corr is close to zero, that indicates a lower association between the two variables; if the value is close to one, that indicates a higher association between the two variables; and other values between zero and one indicate the corresponding degree of association between the two variables.
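Before plugging in the numbers, here is a minimal R sketch of this formula as a small helper function; the name ccorr is made up for illustration:

    # corrected contingency coefficient from chi-squared, n and table size k x l
    ccorr <- function(chi2, n, k, l) {
      m    <- min(k, l)
      C    <- sqrt(chi2 / (chi2 + n))
      Cmax <- sqrt((m - 1) / m)
      C / Cmax
    }
    ccorr(16.66, 100, 3, 2)   # k = 3 assumed; only min(k, l) = 2 matters; about 0.54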

Refer Slide Time: (43:36)

And now I take this example again, just as a reminder.

Refer Slide Time: (43:42)

This is the students-versus-tuition example where we found the value of χ² to be 16.66; now, for the same data, I would like to find this contingency coefficient. You can see that from this χ² value I concluded that there is a moderate association; then I used Cramér's V statistic, which also indicated a moderate association; and now let me see what happens in the case of this contingency coefficient.

Refer Slide Time: (44:09)

So, if I compute the value of C based on χ² = 16.66 with n = 100, the value of C comes out to be 0.38, the value of C_max comes out to be 0.71, and finally the value of the corrected contingency coefficient comes out to be 0.54. This value 0.54 is lying somewhere in the middle of zero and one, so once again I can say that it is indicating a moderate association between the two variables. Now, we have considered different types of measures to find the degree of association between two variables where the variables are in the form of counting data. Well, you can see here that

different statistics give us different values and have different interpretations. But I personally believe that if there is an association present in the data, then ideally all the statistics should indicate the same thing; there can be a small difference — for example, Cramér's V statistic is about 0.40 whereas the contingency coefficient is about 0.54 — but they are definitely all indicating that there is a moderate association. So, this is how we go. Now, the question certainly comes: how to decide whether the association is really low, moderate or strong? For that you have to use your own judgment, and this power of judgment you can easily develop with practice. So, I would request you to take some more examples and practise. Now, I would like to stop with this topic of measuring the association between different types of variables — continuous, ranked or counting observations — and you have also learnt how to compute these measures with the R software. The main thing is this: if you understand the concept, it is easy to compute the measures, but the main question is how to interpret them. So, the objective which you have to emphasize here is how to choose the right tool, how to compute it correctly, and then how to draw the correct statistical inferences from it. So practise and develop this technique, and I will see you in the next lecture; till then, good bye.

Lecture – 33

Fitting of Linear Models: Least Squares Method – One Variable

Welcome to the lecture on the course Descriptive Statistics with R Software. In this lecture we are going to start with a new topic,

Refer Slide Time: (00:22)

fitting of linear models. So, you see what happens: whenever we have a sample of data, the first information is obtained from graphical tools. Suppose we have data on two variables and both variables are associated — they are not independent — then these observations will have some correlation structure. So, the first step is to create a plot, such as a scatter plot, a smooth scatter plot and so on. Those plots give us an idea that an association is present in the data. After this, we use different types of tools, for example the correlation coefficient, to quantify the degree of association or the degree of linear relationship. Now, the final question which remains is: can we find a mathematical relationship, or a statistical model, between the two variables, and if yes, how to do it? That is the question which we are going to entertain in this topic of fitting linear models. We are assuming that there are two variables, or there can be more than two variables, and that there exists some relationship between those variables. Right. Whenever there is a relationship, there are going to be two types of variables: an output variable, that is, the variable on which we observe the outcome, and input variables. Based on whatever values are given to the so-called input variables, we obtain an output. For example, in agriculture the yield of a crop depends on the quantity of fertilizer: if I increase or decrease the quantity of fertilizer in the field, then up to a certain extent the crop will increase or decrease — obviously, if you increase it too much, the crop will get burnt. So, this is a case where we can see the relationship between the yield of the crop and the quantity of fertilizer. Similarly, extending this relationship, we know from our experience that the outcome — in this case the yield of the crop — does not depend on only one variable, the quantity of fertilizer, but on other variables also: quantity of fertilizer, temperature, rainfall, irrigation, moisture and so on. So, I have now given you two situations, where the outcome depends on one variable and on more than one variable. How to handle these situations is what we are going to understand in this lecture,

Refer Slide Time: (04:03)

we assume that a relationship exists between two variables, or, more generally, between one variable and more than one variables — that we will see later on. In this type of relationship, we assume that the outcome of a variable is affected by one or more other variables, for example the case we have just considered, that the yield of a crop increases with an increase in the quantity of fertilizer. Right! Similarly, consider the phenomenon of an electric fan. Whenever we increase the speed, this is controlled by a switch: we move the switch from position number one to position number two, from position number two to position number three, and in this process the quantity of current inside the switch is increased. So, when I increase the position of the switch from one to two, two to three and so on, the amount of current flowing in the circuit automatically increases, and finally the outcome of the fan — the RPM, that is, rotations per minute, or in simple words the speed of the fan — increases. Similarly, in another example, as the temperature of the weather increases, the quantity of water consumed is more: you can see that people consume more water during summer than winter. Right! So, in these types of examples, the speed of an electric fan, measured in rotations per minute, increases as the voltage or current increases; people drink more water as the weather temperature increases; and similarly, in the same example as before, the yield of a crop depends on other variables also, like the quantity of fertilizer, rainfall, weather temperature, irrigation, etc. Right!

Refer Slide Time: (06:26)

So, now we are assuming that such relationships can be expressed through models. Model is a very fancy word — nowadays everybody wants to find a model among the variables — so how to do it? But first, what is a model? A model is simply a mathematical or statistical relationship among the variables, expressed in such a way that it represents or depicts the phenomenon that is happening. So, when we talk of modelling, then modelling and finding a relationship among the variables are the same thing, and in such a case the relationship is characterized by two things: one is variables and the other is parameters. So, the first question is: what is the difference between variables and parameters? This I will try to address in very simple language through a very simple example. But before going further, let me clarify what types of relationships can exist: there can be different types of relationships, and broadly I can classify them into two parts, linear and nonlinear. In this course we are going to entertain only the linear relationships.

Refer Slide Time: (08:17)

Now, I come back to the earlier issue of how to start. Whenever there is a phenomenon — whenever something is happening — you will observe that there are usually two types of variables, or the variables can be divided into two categories: one category is input variables and the other category is output variables. Now, the first question is how to decide which variable is an input variable and which is an output variable. This I can explain with two simple examples. For example, we know from our experience that when a student studies more, usually he or she will get more marks. So, I have two variables here: one is the marks in the examination and the second is the number of hours a student studies. Now there are two possibilities: one possibility is that the marks depend on the number of hours studied, and the second is that the number of hours studied depends on the marks obtained. In such a situation there is no mathematical or statistical rule which can give you the answer; it is only your experience with the data, and your information about the phenomenon, that will help you take a call on which variable affects which. So, in this case we have the two options given on the slide: the marks depend upon the number of hours a student studies, or the number of hours of study depends upon the marks obtained by the student. Right! Now, which one is correct? The first statement is correct and the second statement is wrong. Similarly, take another example: the yield of a crop depends on the quantity of fertilizer and the temperature of the weather, or, the reverse, the quantity of fertilizer and the temperature of the weather depend on the yield. In this case the second option is not possible; we know from our experience that the weather temperature and the quantity of fertilizer both affect the amount of yield up to a certain extent. So, here the quantity of fertilizer and the temperature of the weather become the input variables and the yield of the crop becomes the output variable. Of the two options written on the slide — the yield of a crop depends upon the rainfall and weather temperature, or the rainfall and the weather temperature depend upon the yield of the crop — obviously the first sentence is correct and the second sentence is wrong, just as the second option was wrong in the first example. Right! So, this is how we decide, in a phenomenon, what is going to be an input variable and what is going to be an output variable in any given situation. Now, the next question: whenever we write down a model, the model is essentially going to be a mathematical equation — yes, statistical concepts will be used to obtain that equation — but in that equation there will be two components, one is variables and the other is parameters. So, I will take a very simple example to explain the difference between the two, and the role of variables and parameters in a model. I am going to take the equation of a straight line, which you have possibly studied in class 10, 11 or 12: y = mx + c. You see, y = mx + c has two types of components: one category is x and y, and the other category is m and c. Out of these two categories, one set of quantities is the variables and the other set is the parameters. Now, how to take a call, how to decide which is which? So, let me try to explain.

Refer Slide Time: (13:30)

So, we know that the equation of a straight line is given by y = mx + c, where c is the intercept. Suppose the line looks like this: here is the x-axis, where x indicates the values on the x-axis, and here is the y-axis, where y denotes the values on the y-axis, and this quantity here is c, the intercept term. Right! And this angle is represented through m, as the tangent of the angle — the trigonometric function tan. This is how we read the equation y = mx + c. Now, in this equation there are two pairs: one is x and y, and the other is m and c, and one of these sets represents the parameters while the other represents the variables. So I can give you a simple query: please match the two columns — column one and column two — the simple type of matching question that you have solved in school. So, now let us try to understand through an example,

Refer Slide Time: (15:04)

how you can do it. Suppose, as the first option, you know the values of x and y, say x = 4 and y = 2. Then my question is: can we know all the information about the line? In this case y = mx + c becomes 2 = 4m + c. Now, looking at this, do you think you have the entire information about the line? Certainly not; I just know that there is a point with x = 4 and y = 2. The second option is that instead of knowing the values of x and y, I know the values of m and c, say m = 5 and c = 6. In this case the equation becomes y = 5x + 6. Now my question is: do we know the entire information about this line — can we really recover, by looking at this equation, all the information about the line? The answer is yes. If you wish, I can even plot this line: counting 1, 2, up to 6 on the y-axis, the line passes like this, where this 6 represents the intercept term c, and this angle is given by the quantity 5, that is, tan of the angle is equal to 5. So, one can see that by looking at the values of m and c, I have the entire information about the line. So, option one was insufficient and option two gives the full picture.

Refer Slide Time: (17:08)

So, now let us look in more detail at what is happening in this equation. Keep in mind that your ultimate objective is to know the equation y = mx + c. When I say that I want to know the equation y = mx + c, this is equivalent to knowing m and c: if you tell me the values of m and c, then I know the entire line, whereas just by knowing one pair x and y, I do not know the entire line. So, in this situation, the quantities m and c are called parameters, and the quantities on which we collect observations are called variables. If I take the earlier example, the marks obtained in the examination depend on the number of hours of study as marks = m × (number of hours of study) + c. Here the marks and the number of hours of study are the variables, and m and c are the parameters. So, what we do is conduct an experiment and collect data on the variables. Right! Now I can answer the matching question: x and y are the variables, and m and c are the parameters. So, what is the advantage, and how do we make the decision? The parameters are those values which give us the entire information about the model. Whenever we say that we want to find a model, in simple words I am saying that I want to know the values of the parameters; whenever we hear a sentence like "we want to construct a model", that is equivalent to saying "I want to estimate, or know, the values of the parameters on the basis of a given sample of data". Right! So, now, how is this data collected, and how do we denote it in the statistical language — what symbols and notation do we use for the data? We take an example here and try to understand,

Refer Slide Time: (20:29)

so now we take an example — the same example about the quantity of fertilizer and yield. Suppose X is the variable denoting the quantity of fertilizer in kilograms, and Y is the variable denoting the yield of a crop in kilograms. We conduct an experiment and collect the data, and our objective is to find the relationship between X and Y, that is, between the quantity of fertilizer and the yield of the crop. We collect the observations in the following way. I take a plot of some fixed size — yes, do not change the size of the plot — and put 1 kilogram of fertilizer in the field, and after some time we get, say, 6 kilograms of yield. This gives us the value of x1, and the 6 kg of yield gives the value of y1. Similarly, I repeat the experiment: on a plot of the same size I put 2 kilograms of fertilizer — this quantity is denoted as x2, the second observation on X — and suppose after some time we observe 7 kg of yield, so the value of Y for the second observation is denoted as y2. Similarly, I can keep on repeating: I can take 3 kilograms of fertilizer, denoted by x3, and suppose we obtain 6 kg of yield, which is y3. So, you can see that we obtain paired observations: when I give the value x1 I get the value y1, when I give the value x2 I get the value y2, when I give the value x3 I get the value y3, and so on. Suppose we have obtained n observations in this way; then all these paired observations are denoted by (x1, y1), (x2, y2), …, (xn, yn). Right!

Refer Slide Time: (23:12)

Now, once we have obtained the paired observations (x1, y1), (x2, y2), …, (xn, yn), the first information is given by graphical plots, so we plot this data in a scatter diagram. For example, in the plot here, the points indicate the data: suppose this is x1 and this is y1, so this data point denotes the pair (x1, y1); similarly, this is x2 and this is y2, so this data point is the location of the pair (x2, y2), and so on. Right! Looking at this graph, you can decide whether there is a linear trend or not: you can see that the points move in this direction, so a linear trend is present in the data — there can also be a nonlinear trend. But my objective is this: by looking at the values of these observations, how do we find the equation of this line, or of the curve? And this equation is to be found in such a way that it represents the population. What does this mean? You see, whenever we build a model, the model is for the entire population, but the problem is that we do not know the entire population, so we have to work on the basis of a given sample of data. For example, have you ever heard a statement like: this medicine controls the body temperature of Americans for seven hours, the same medicine controls the body temperature of Indians for ten hours, and the same medicine controls the body temperature of German people for only five hours? It doesn't happen — a medicine is a medicine, and its effect on similar types of persons will be the same, valid for the entire population all over the world. Right! But when we want to know the duration of the temperature control, we conduct an experiment by giving doses of the medicine to some people, we obtain the data, we find the equation of the curve or the line, and based on that we draw a conclusion, and this conclusion is valid for the entire population. This is the entire process of modelling, but here in this course we are going to find only the equation. Right! For the remaining part there is a course on linear regression analysis, and the tools of linear regression analysis give you all the information on how to construct a linear model. Right! Here we are going to concentrate on only one aspect. Okay. So, that is now our objective.

Refer Slide Time: (27:10)

So, now we take the same example that we considered in the earlier lectures, and we see how to get this equation of the curve. You may recall the example where we recorded the marks in the examination obtained by twenty students, out of five hundred marks, and the number of hours they studied in a week. This data is given here: in the first row the marks obtained, and in the second row the number of hours studied by the corresponding student. So, student number one studied for 23 hours and got 337 marks out of 500, and so on; the data is given for all twenty students — student number one, student number two, student number three, up to student number twenty. Right! Now, we want to know the relationship between the marks obtained and the number of hours studied in a week. Although we know from our experience that the marks obtained by students increase as the number of hours increases, we would like to see whether this statement is correct or not. Right!

Refer Slide Time: (28:31)

So, suppose I have stored this data on marks inside a variable named marks, like this, in the R software, and similarly the data on the number of hours is stored inside a variable named hours, like this. Now, note how this data is represented. We have collected the data so that x1 = 23 corresponds to y1 = 337, and the data inside the hours and marks vectors is given in the same order: you can see 337 in the first position of marks and 23 in the first position of hours. Similarly, we have the second values x2 = 25 and y2 = 316, again in the same order — 316 here and 25 here. So, the data for the paired observations is given in two different vectors, but the order of the observations remains the same in both vectors; this is very important and you should keep it in mind. You can see 23, 25, 26 occurring here and 337, 316, 327 occurring there, giving the paired observations (23, 337), (25, 316) and (26, 327). Right!
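As a minimal sketch of this step — only the first three of the twenty pairs are shown here; the full vectors on the slide contain all twenty values in the same order:

    marks <- c(337, 316, 327)   # y values: marks of the first three students
    hours <- c(23, 25, 26)      # x values: hours studied, in matching order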

Refer Slide Time: (30:12)

This is what you have to keep in mind. Now, what is the first step? First of all, I create a plot — which we have actually already done — by using the command plot with hours and marks inside the argument. Right! You get a plot like this, and you can see that there is a sort of linear trend; plots like the scatter smooth will also overlay a sort of estimated smooth line. But my objective here is to know how this line is created. So, before going into the details, let me give you a small piece of information on the notation. In the language of statistics, the variables are denoted by English letters like A, B, C, X, Y, Z, whereas the parameters are denoted by Greek letters like α, β, γ, δ and so on; that is our standard language. So, the equation which I had expressed as y = mx + c — where we understood that m and c are the parameters — I now write in the statistical way: the variable y on the left, and on the right α in place of c and β in place of m, giving y = α + βX. That is the standard notation when we say that the model is linear.
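A quick sketch of the plotting step mentioned above, assuming the full marks and hours vectors are in the workspace:

    plot(hours, marks)             # scatter plot: hours on x-axis, marks on y-axis
    scatter.smooth(hours, marks)   # the same plot with a smoothed trend line added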

Refer Slide Time: (32:09)

Now, we consider the same example and move forward. In the scatter plot, the small circles denote the data points, and from my experience I have drawn a line, shown in red — I have done it manually — which I feel represents the relationship between the values of X and Y. My objective is to know the equation of this red line. Suppose I denote its mathematical equation by y = α + βX — note that I am now using α and β and not m and c. In this equation, X denotes the number of hours, which is plotted on the x-axis, and Y denotes the marks obtained, plotted on the y-axis. My objective is simple: I want to find the relationship between X and Y in terms of y = α + βX. As we have already discussed, this line is known to us only when the parameters α and β are known: if we know α and β, then the entire equation is known. So, the fundamental question in front of us is: how to know this α and β?

Refer Slide Time: (34:08)

So, that is the objective which I am now going to explain, but before that, try to observe the following. In the same figure, I take this small section and enlarge it; this figure is simply the enlarged part of the circled region, and you can see there are one, two, three, four points here. Now, what are we trying to do? We say that this red line is y = α + βX, and the idea is that all the observation points should lie on this line, so that we can say that this is the mathematical equation between X and Y and this is how the experiment is being governed. But in practice this will not happen: the points are not all going to lie exactly on the same line. If a point is here, you expect it to lie there on the line; similarly, if you take another point, you expect it to lie exactly on the line. You can observe in this graph that this is not really happening, and there is some difference between the observed point and the corresponding point on the line — a deviation which is a sort of error occurring in our approximation. Right! Similarly, in all the other cases there is an error as well. So, you can notice that there is a deviation between the observed points and the corresponding points lying on the line. Right! One more thing to observe: each point has two values, X and Y, one indicated on the x-axis and one on the y-axis. When I compare an observed point with the line, the value of X remains the same; only Y changes — there is the observed y and the corresponding Y on the line — and this difference is essentially what is called the error.

Refer Slide Time: (37:47)

And now we would like to find the values of α and β such that these errors are minimum, and you can see that such errors occur at each and every observation. So, the question is how to compile all these errors so that, by minimizing that quantity, I can find the values of α and β. One objective is to minimize the sum of such errors: I can simply measure each error, take the sum of all the errors, and minimize it,

Refer Slide Time: (38:29)

but you can see that these errors are measured with respect to the red line. This observation here has an error e1 and lies above the line, while this second observation, with error e2, lies below the line. So, we need to account for the direction of the points, whether they lie above or below the line. We assign a plus sign to all the points lying above the line and a negative sign to all the points lying below the line. Now, when I add all these errors, some are in the positive direction and some in the negative direction with respect to the line, and hence their sum may become very close to zero, or exactly zero, and this might suggest that there is no error in the data — or, if the sum is very small, that the amount of error in the data is very small — which is not correct. So, I need to devise a methodology to get rid of the sign. I have two options: either I take the absolute values or I square the errors. I opt here for the sum of squares of the errors; this is the better option simply due to mathematical simplicity, I can say at this stage, and in this case I can find clear-cut expressions for the values of α and β. Taking absolute values is also possible, but I am not discussing that in this lecture.

Refer Slide Time: (40:53)

So, now consider how the values of an observation are represented. As we discussed, there are two values corresponding to every observation. For example, this value here, observed in the experiment, corresponds to X = x1, so this point has coordinates (x1, y1). Now, I assume this point should lie on the line, denoted by Y = α + βX, so the corresponding point on the line has coordinates (x1, Y1). Right! There is some error in this data, and we express this error as e1. In general, every pair of observations satisfies an equation of the form yi = α + βxi + ei, where the ei are the errors, which can be in the positive or the negative direction. Now we are going to find the values of α and β such that the sum of squares of the ei is minimum. So, we take the call that we will find the equation of this red line on the basis of the given data set — using all the n paired observations (xi, yi) — such that the line passes through the maximum number of points and the deviations of the points from the fitted line are minimum.

Refer Slide Time: (43:10)

So, you can see in the same picture that this difference, this error or deviation, is essentially the difference between the value Y1 on the line and the observed value y1, measured on the y-axis. This difference is denoted as yi − Yi; it can be positive or negative, but you have to use the same convention throughout — either measure them all as yi − Yi or all as Yi − yi. Right! So, now we will find the values of α and β such that the sum of squares of these deviations ei is
minimum,

Refer Slide Time: (44:06)

how to obtain it? Right! For that I can use the principle of maxima and minima, and we minimize the quantity S = Σ_{i=1}^{n} e_i², the sum of squares of all the deviations ei. So, my objective is now defined: I want to find the values of the parameters α and β such that the line passes through the maximum number of the given data points and the sum of squared deviations, or errors, from the line is minimum. This is called the method of least squares, or the principle of least squares, which simply says: find the equation of the line in such a way that the line passes through the maximum number of the given data points and the sum of squared deviations from the line is as small as possible. So, now, in order to find the values of α and β using the principle of least squares,

Refer Slide Time: (45:16)

we write down the sum of squares of errors S, and, knowing that e_i = y_i − α − βx_i, we replace each e_i to get S = Σ_{i=1}^{n} (y_i − α − βx_i)². Now I have to use the principle of maxima and minima: I need to find the first-order derivatives of S with respect to α and β, put them equal to zero, and then check, via the second-order derivatives, whether a minimum has been achieved. So, I find the first-order partial derivative of S with respect to α; this gives us an equation, and

following the principle of maxima and minima, I put this first-order condition equal to zero. Solving it, you get

    Σ_{i=1}^{n} y_i − nα − β Σ_{i=1}^{n} x_i = 0,

which is simply n ȳ − nα − nβ x̄ = 0, because x̄ and ȳ are defined as x̄ = (1/n) Σ_{i=1}^{n} x_i and ȳ = (1/n) Σ_{i=1}^{n} y_i. Right! So, now you can solve this equation, and it gives ȳ − α − β x̄ = 0, and hence α = ȳ − β x̄. Now, on the basis of the given data I can find the values of x̄ and ȳ, because we have the observations xi and yi, i = 1, …, n. So, this value of α is known to us if β is known: ȳ is known from the sample data, x̄ is also known from the sample data, and β is unknown. So, my next objective is to find this β.

Refer Slide Time: (47:43)

So, we follow the same process and obtain the first-order partial derivative of S with respect to β; I set it equal to zero, and this gives the

second of the first-order equations. Now, if you solve it — pretty simple algebra — you get the value of β as

    β̂ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²,

which can be obtained on the basis of the given data on xi and yi. Right! This estimated value of β, obtained on the basis of the given sample, is denoted as β̂ — just write β and put a hat over it. Right! So, now I can obtain the value of β as β̂ from the given sample of data, and I substitute β = β̂ into the earlier equation for α. Once I substitute β = β̂, I get the value of α, which can now also be found from the given data; it is denoted as α̂. So, now you can see that we have obtained values for the two parameters, α̂ and β̂: α̂ is the value of α and β̂ is the value of β that can be obtained on the basis of the given sample of data.
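As a minimal R sketch of these two formulas — assuming the full marks and hours data vectors from the earlier example are in the workspace:

    # least squares estimates from the closed-form expressions derived above
    beta.hat  <- sum((hours - mean(hours)) * (marks - mean(marks))) /
                 sum((hours - mean(hours))^2)
    alpha.hat <- mean(marks) - beta.hat * mean(hours)
    c(alpha.hat, beta.hat)   # about 168.65 and 6.3, as computed in this lecture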

Refer Slide Time: (49:36)

Now, mathematically, the next issue is: how do I know whether these values of α and β are minimizing the sum of squared deviations or not? So, I find the second-order partial derivatives of S with respect to α and β and substitute the values α = α̂ and β = β̂; these come out to be positive, confirming that a minimum has been attained — that you can verify yourself. So, now I have the values α̂ and β̂. This β̂ is given by the expression above, which tells us the value of β on the basis of the given sample of data, and it is called the least squares estimate of β, as it is based on the principle of least squares. Similarly, the value of α obtained here, α̂ = ȳ − β̂ x̄, can also be computed from the given sample of data, and it is called the least squares estimate of α. So, now look at the equation y = α + βx + e, which was our original model: we have found the value of α to be α̂ and of β to be β̂, and the error term drops out, because we have obtained this equation precisely by making the sum of the ei² as small as possible. The resulting equation, y = α̂ + β̂x, is therefore called the fitted model, while y = α + βx + e is simply called the model. Now we try to compute these two values, α̂ and β̂, on the basis of the given sample of data,

Refer Slide Time: (51:28)

using the example that we were considering earlier. Here, marks is your Y and the number of hours per week is your X; this is the value of y1 and this is the value of x1, this is the value of x2 and this is the value of y2, and so on — this is the whole data set. Now I compute the values of x̄ and ȳ: ȳ = (1/20) Σ_{i=1}^{20} y_i comes out to be 389.9, and x̄, which is simply the sample mean of all the values of the xi's, comes out to be 35.1. Similarly, if you substitute all the values into the expression for β̂, it comes out to be β̂ = 6.3, and the expression for α̂ comes out to be 168.65. So, now you can see that your model becomes: marks = 168.65 + 6.3 × hours. This is your fitted model, and you can see that this model has been obtained on the basis of the given sample of data only. Right! Okay. So, we stop here now in this long lecture. You have seen how we computed the values of the parameters α and β on the basis of a given sample of data, but we did this computation manually; in the next lecture I will show you how to get all these values from the R software directly. It is important for you to understand how the values inside the software are obtained, and what computations, philosophy and concepts are used in that computation. So, try to understand this concept, and I will see you in the next lecture; till then, goodbye.

Lecture-34

Fitting of Linear Models: Least Squares Method – R Commands and More

than One Variables

Welcome to the lecture on the course Descriptive Statistics with R, and welcome to the last lecture of the course. Yes! Whenever the last lecture comes, it brings happiness for the students, and for the teachers also. So, let us see what we are going to do in this lecture. You may recall that in the last lecture we discussed the principle of least squares, and we understood how one can obtain the estimates, or values, of the parameters on the basis of a given set of data for the model y = α + βX. We took an example and solved it manually; the idea was to expose you to the basic concepts. Now, in this lecture we will learn how to obtain the same result using the R software. After that, since considering only one input variable is not very realistic, we will see how to extend the principle of least squares to the case where you have more than one input variable, and how to use the R software to obtain the least squares estimates of the parameters in that model. So, this is what we are going to do in this lecture. The first question is: what is the command for obtaining the least squares estimates, that is, for fitting a linear model, in the R software?

Refer Slide Time: (2:05)

In the R software, the command for fitting a linear model is lm — an abbreviation with 'l' for linear and 'm' for model — and this command lm is used to fit linear models. You can see that lm has various arguments and so on, but here we are going to use a very simple form, based only on the formula argument, and you should know why I am not going to discuss all the details. I explained in the earlier lecture that fitting a linear model does not end the story: after that you have to check how the model is going to perform for the entire population. There are different statistical assumptions which are needed, and the estimated values are examined with different statistical tools like tests of hypotheses, goodness of fit, and so on. Here I am not teaching a course on linear regression analysis; I am simply using one of the methods for finding the model which is used in linear regression analysis, and the command lm in the R software was developed to provide the further statistical tools related to linear regression analysis. That is why you will see that the command lm has many, many arguments, but I will not go into those details; I would request that, if you want to understand them, you first take a course on linear regression analysis and then try to understand those concepts. The interpretation of all the terms inside the argument can be studied from the help on the lm command. Right.
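A quick sketch of how to pull up that documentation from the R console:

    help(lm)   # full argument list and details; ?lm is equivalent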

Refer Slide Time: (4:16)

So, my objective here is to show you how to get things done, but briefly I will give you an idea of what we are going to use. There is an option here, formula: basically we will be using the formula option, which gives a symbolic description of the model to be fitted — the details of the model specification are given separately. Right. Next there is data, an optional argument where the data is given in the structure of a data frame, list or environment (for example via as.data.frame), but we are not going into that detail. Similarly, there is another option, subset, which lets you give an optional vector specifying a subset of the observations to be used in the fitting process; but anyway, we are not going to use it. We are simply going to use the option formula.

Refer Slide Time: (5:24)

So, how to use this formula I will show you with an example — the same example we considered earlier, where we collected data on 20 students: their marks and the number of hours in a week which they studied.

Refer Slide Time: (5:35)

And from this data, if you remember, in the last lecture β̂ was obtained as 6.3 and α̂ was obtained as 168.65, and the model was as shown — but this was all done manually; all the computations were done using a simple calculator.

Refer Slide Time: (5:57)

Now I will show you how to do it in the R software. As we discussed, we have already stored this data in two data vectors, marks and hours; how to store this data I discussed in the last lecture. Right. All the data comes in the same order, so that the observations are paired: the first observation (xi, yi) means the first observation of marks together with the first observation of hours.

Refer Slide Time: (6:26)

Now, this is the most important part: how to give the command. First write the instruction lm, then write the output variable, say y, then the tilde sign ~ — it is present on the keyboard of the computer which you are using — and after that the variable giving the input data, that is, lm(y ~ x). So, y is your output variable and x is your input variable. Right? They have to be given in this format. In our case the output variable is marks and the input variable is hours, joined by the tilde sign: lm(marks ~ hours). Once I run it, this gives me this type of outcome, so first let us understand what this outcome means. When I say lm(marks ~ hours), this indicates that our model is marks = α + β × hours + some error. Right, and this informs the R software how to compute

    β̂ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)².

So this command informs the R software that the values of the xi's are coming from the data given in hours, and similarly the values of the yi's are coming for

the computation of these quantities from the first data vector, marks. Right. So it uses formula = marks ~ hours, and the outcome comes out like this. You can see it is written as Coefficients: the coefficients are the values of α and β. The first one, labelled (Intercept), is the value of α̂, the intercept term, which comes out to be 168.647, and the value 6.304 appearing under hours is β̂, the coefficient associated with the variable hours. You can see these are the same values which you had obtained manually; you can compare here,

Refer Slide Time: (9:52)

with this, and here this.

Refer Slide Time: (9:54)

So, you can see here this is pretty straightforward to obtain such a result inside the R

software.
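A minimal sketch of this fit, assuming the marks and hours vectors are defined; the abridged output in the comments corresponds to the coefficients quoted above:

    fit <- lm(marks ~ hours)   # fit the model marks = alpha + beta * hours
    fit                        # printing the object shows the coefficients:
    # (Intercept)        hours
    #     168.647        6.304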

Refer Slide Time: (10:02)

Now I will try to show it on the R console, and this is here the screenshot of the same thing.

920
Video Start Time: (10:10)

So, first I prepare my data: this is the marks data which I have entered, and this is my data on hours, entered here — you can see the marks data is like this and the hours data is like this. Now I fit the model with lm(marks ~ hours), and you can see it comes out like this. Right? Here you have to be careful about how you specify x and y. For example, if you make a mistake and write hours in place of marks and marks in place of hours, then the linear model is fitted between hours and marks the other way around, and you can see the result is entirely different from the first one — the first value was about 168, but here it is 0.22. This is the wrong thing to do, because with a model like that you are saying that the number of hours is your output variable and the marks are your input variable, which is not correct in this case. So, this is how we obtain this result. Right.
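As a minimal sketch of this caution, with the same two vectors:

    lm(marks ~ hours)   # correct here: marks is the output, hours the input
    lm(hours ~ marks)   # a different model: treats hours as the output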

Video End Time: (11:50)

Refer Slide Time: (11:52)

So, now let us come back to our slides, and I will give you one more concept. We have learned how to find a model — the least squares estimates of α and β — in the case of a linear model where you have only one input variable. But in practice, you can imagine that a process is not controlled by only one variable; there will be more than one, and we discussed such examples in the last lecture: for instance, the yield of a crop will depend on several factors — quantity of fertilizer, quality of fertilizer, temperature, rainfall, irrigation and so on. So, if you want to extend this principle of least squares to the case when you have more than one input variable, how to do it? Well, this is a topic which is usually taught in linear regression analysis under the heading of the multiple linear regression model, but my objective here is not to teach you regression analysis; my objective is to show you how you can obtain the values of the parameters on the basis of a given sample of data using the R software when you have more than one input variable. So, I will take the example of the linear model and show you logically how you can extend the model considered here to a multiple framework with more than one input variable. Note that this is not a completely general technique valid for all models — every model has its own way of being extended to a multivariate situation — but here, please try to learn how, if I extend my linear model in this particular way, the R software can be used to find the values of the parameter estimates. Okay.

So, you may recall that we considered the model y = α + βx, where you had only one input variable, and now I am trying to extend it, supposing that there is more than one input variable. In the first case I denoted the input variable by x; now suppose there are p input variables, which I denote by x1, x2, …, xp. Right. Strictly, in the symbols and notation of regression analysis these small x1, x2, …, xp should be capital X1, X2, …, Xp, but my idea here is somewhat different from the objective in linear regression analysis. So, I extend the model, and since I am going to consider an additive model, I write a βx-type term for each of the variables: for x1 this becomes β1x1, for x2 this becomes β2x2, and similarly for xp this becomes βpxp, and then I add them together. This is what I write in the second line: y = α + β1x1 + β2x2 + … + βpxp + e, where e is again a random error involved in the observations. Now, similar to the simple scatter plot that we obtained in the last lecture, here you have more than one input variable, so matrix plots are more useful in verifying whether the relationship between y and x1, x2, …, xp is linear or not. One thing to keep in mind is that in the earlier case we tried to establish the linear association between one x and y, but now you have a group of x's — x1, x2, …, xp — on one side and a single output variable y on the other, so we are trying to verify the linearity of a single variable y with respect to a group of variables x1, x2, …, xp. This is not so straightforward. One option is to make individual plots — y versus x1, y versus x2, y versus x3 and so on — and if the relationship of y with each of x1, x2, …, xp comes out to be linear, then we can expect that their joint effect may also be linear. But this is a little bit tricky, and you need some experience to handle this situation; here I would just like to show you how you can construct such plots in the R software.
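A minimal sketch of such a matrix plot in base R — y, x1 and x2 are placeholder names for data vectors of equal length:

    pairs(cbind(y, x1, x2))   # matrix of all pairwise scatter plots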

So, first consider how we obtain the observations for such a plot: we conduct the experiment, and for every experimental run we get an observation on the first input variable, an observation on the second input variable, …, an observation on the p-th input variable, and then the value of y. If you remember, earlier an observation was simply the pair (xi, yi), because there was only one input variable; here the i-th observation is the tuple (xi1, xi2, …, xip, yi), which represents the i-th set of observations on y and the x's. Right. Now we assume that each set of observations satisfies the model equation.

Refer Slide Time (18:28)

So, for the first set of observations we can write that it satisfies the equation as follows: instead of x1 I have x11, instead of x2 I have x12, …, instead of xp I have x1p, the associated random error is e1, and the obtained value of y is y1, that is, y1 = α + β1x11 + β2x12 + … + βpx1p + e1. Similarly, we have n observations satisfying such equations. Now, using simple mathematics based on vectors and matrices, this entire set of equations can be expressed in matrix form — I am not explaining the theory of matrices here, I assume you know it. All the observations y1, y2, …, yn are contained in an n × 1 vector; all the parameters α, β1, β2, …, βp are contained in another vector of p + 1 elements; and the associated matrix of data on the input variables x1, x2, …, xp is as given here, with its first column of ones, 1, 1, …, 1, indicating the presence of the intercept. So, I denote the vector of observations by y, this whole matrix by X, the vector of parameters by β, and the vector of errors e1, e2, …, en by e. Right.

Refer Slide Time: (20:14)

So now we have the model y = Xβ + e. Now, in case you want to use the principle of least squares: if you remember, in the last lecture we defined the sum of squares of the ei's as e1² + e2² + ... + en², that is, the sum of ei² over i = 1 to n. This quantity can be written as e'e, where e is now a vector, so e'e = (y - Xβ)'(y - Xβ). Now, if you differentiate this quantity with respect to β and put it equal to 0, then after solving you get β = β̂ = (X'X)⁻¹X'y, which I am writing here. Right. I am not

giving you the details here, but you can see this is a simple extension of the least squares estimate that you obtained in the model with one input variable; the scalar quantities are now replaced by matrices. You can see that X is the matrix of observations on the input variables, so this is known to us; similarly, y is the vector of observed values y1, y2, ..., yn, so X and y are both known and I can compute the value of β̂. This is called the least squares estimator of the parameter β, and if β looks like (α, β1, β2, ..., βp)', then β̂ will look like (α̂, β̂1, β̂2, ..., β̂p)'. So this also has the same order as β, which is of order (p + 1) × 1. Right.
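For concreteness, this matrix formula can also be evaluated directly in R. The following is only a sketch with made-up numbers (y, x1 and x2 are hypothetical); the lecture itself uses the lm command shown later.

# least squares by hand: betahat = (X'X)^(-1) X'y
y  <- c(10, 12, 15, 19, 24)
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(100, 150, 210, 300, 380)
X <- cbind(1, x1, x2)                        # first column of 1's carries the intercept
betahat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^(-1) X'y
betahat                                      # (p + 1) x 1 vector: alpha-hat, beta1-hat, beta2-hat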

Refer Slide Time: (22:25)

Now I will show you the computation of β̂ using an example, and I have taken just two input variables in this example to keep it simple. In this example, 25 observations have been collected on the time taken by a courier person in delivering parcels. So it is recorded how much time the courier person takes in delivering the parcels, and obviously this time is going to depend on how many parcels are being delivered by the courier person and also on how far the courier person has to travel. So the number of parcels is denoted by x1, the distance travelled by the courier person is denoted by x2, and the time taken is denoted by y. The interpretation of the data goes like this: there is a courier person who has to deliver 7 parcels, and suppose the courier person travels 560 meters and the total time in doing this job is 16.68 minutes. Similarly, there is another courier person who delivers 3 parcels, travels 220 meters and takes 11.5 minutes, and so on. So this is how we have obtained this set of 25 observations, and all the data on y, x1 and x2 have been stored inside data vectors in the R software. Right. As we have done many times in the past.

Refer Slide Time: (24:18)

So essentially now we are going to fit the model yi = α + β1x1i + β2x2i + ei, where the observations are indexed by i and every observation satisfies this model. All the data, in the same order, have been stored in three vectors: the delivery time in deltime, the number of parcels in parcelno, and the travelled distance inside the data vector whose name is distance. Right. So once again I will explain that the first observation of deltime corresponds to the first observation of parcelno and to the first observation of distance, which is given here,

Refer Slide Time: (25:06)

I mean observation number one here. Right. So this data is given over there, as we already understood from the earlier example.
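For reference, the data vectors are created with the c command, as done many times earlier in the course. Only the first two of the 25 observations are quoted in the lecture, so the sketch below shows just those two and omits the rest:

deltime  <- c(16.68, 11.50)   # delivery time in minutes (y); remaining 23 values omitted
parcelno <- c(7, 3)           # number of parcels (x1); remaining 23 values omitted
distance <- c(560, 220)       # distance travelled in meters (x2); remaining 23 values omitted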

Refer Slide Time: (25:16)

Now I will first show you how to create a matrix plot. Right. What is a matrix plot? You see, when we discussed the simple case where I had only one independent variable, or one input variable, and one output variable, I could make a scatter plot, and the scatter plot gave us an idea of how things look, whether there is a linear trend or a nonlinear trend. But when we have more than one input variable, what we are looking at is the joint effect of one variable y with respect to a group of variables x1, x2, ..., xp. Finding out such a curve is more difficult, so what we try to do is create the scatter plots pairwise between y and x1, y and x2, y and x3 and so on, and then we interpret them like this: if the relationship of y with respect to each of x1, x2, ..., xp is linear, then the joint association between y and all of x1, x2, ..., xp is also expected to be linear. Well, you need some experience in interpreting such results, but here I would like to show you how to create such a matrix plot. Okay.

So the command for creating the matrix plot is pairs, p-a-i-r-s, and inside the argument we give the data vectors together with some more information. A more general structure is pairs with a formula inside the argument, just as in the case of linear models with the operator 'lm', and after that there are different options to be given: data, subset, na.action and so on. So here you can see I have given the details: x gives us the coordinates of points, given as the numeric columns of a matrix or a data frame; similarly, formula is the same thing that we discussed in the case of the linear model, so we write the formula with a tilde sign (~) followed by the variables separated by plus signs. Right. Each of these terms will give a separate variable in the plot, and data is the data frame from which the data for the formula have to be taken. Right.

Refer Slide Time: (27:49)

Then there are the options subset and data, exactly in the same way as we had in the case of the linear model, so I am not discussing them here.
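To make the two calling styles concrete, here is a small sketch; it uses R's built-in iris data only as a stand-in, since any numeric columns will do:

pairs(iris[, 1:3])                                                # matrix/data-frame form
pairs(~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)   # formula form with a tilde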

Refer Slide Time: (27:57)

Now I will create a matrix plot of the given data set just by writing the formula. You see, I am writing pairs and then, inside the argument, the formula: I write a tilde sign and then the three variables whose scatter plots I need, namely delivery time, parcel number and distance. So I am not discriminating here between what is my input variable and what is my output variable. You can get more information about this command pairs from the help, but here I am using one option, 'main', to give the title of the graph. So you can see here this is my graph, and its title, Matrix plot of delivery time data, is the same as the one printed here. I will show you this on the R console also, but first let us try to understand this picture. The matrix plot looks like this, so first try to understand the graphic. What is there on the y-axis? It is the same thing which is written here. And what is here on the x-axis? I will use a different colour so that you can observe that it is the same thing which is written here. So in this box, the x-axis is denoting the parcel number and the y-axis is denoting the
delivery time, so this is the scatter plot of two variables, parcel number versus delivery time, or the number of parcels versus the delivery time, and you can see here that this trend is nearly linear, and a positive trend at that. Similarly, if you come to this block, what is being mentioned here is the same thing which is written here, namely distance. So now this is the plot involving distance, and what is on the y-axis here is the delivery time; you have to take the y-axis label from the y-axis side and the x-axis label from the x-axis side. So this is a graphic of the distance travelled by the courier person versus the delivery time, and you can see here that this curve also comes out to have a sort of linear trend. Similarly, if you look at the third case, you can see that on the x-axis we have the distance and on the y-axis we have the number of parcels.

So this is a graph of distance versus the parcel number, or the number of parcels, and you can see here also that if you look at the trend, it again shows a nearly linear trend; and if you observe the direction of my pen, it is crossing this diagonally. Now, what I have shown you here are the plots above the diagonal, on this side. What is happening below the diagonal is the same thing: for example, these two will match, these two will match, and these two will match. So usually we look either above the diagonal or below the diagonal; they are going to give us the same information. Okay, and some of these blocks, which I am indicating here by a cross, are not used. Why? Because, for example, in this case it would be the plot of delivery time versus delivery time, which has no meaning. Right. So by looking at this matrix plot we can get an idea of whether the joint relationship of y with respect to x1 and x2 is linear or not.

Refer Slide Time: (32:07)

So in this case I can take the call: yes, it is approximately linear. Well, in this case I will show you one thing more. Since we are dealing here with only two input variables and I want to see the joint effect of y with respect to x1 and x2, there are three variables in all, y, x1 and x2, so I can also take the help of a three-dimensional plot; if you remember, we had discussed earlier a three-dimensional plot made by using the command scatterplot3d. So we try to use it here. This is not always possible, because sometimes the data has more than three dimensions, but it is your judgment what you want to do. If you want to use the command scatterplot3d, first you need to load the library, so use the command library with scatterplot3d inside the argument, and now I plot scatterplot3d with these three variables, delivery time, number of parcels and distance, and it comes out like this. This gives you a sort of plane, which is here in this case. So you can see here that most of the points are lying close to this plane. Now you can also use an option to change the orientation of this cuboid: for example, in the next command I use the option angle equal to 120, and I rotate this figure by 120 degrees. So you can see that the axes are changed, and now I am looking at a different structure, something like this. So just by looking at different types of pictures you can finally conclude whether you want to have a linear model fitted to this data or not.
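A minimal sketch of these commands, assuming the three data vectors already exist and the scatterplot3d package has been installed:

library(scatterplot3d)                                     # load the package (installed once beforehand)
scatterplot3d(deltime, parcelno, distance)                 # default orientation
scatterplot3d(deltime, parcelno, distance, angle = 120)    # rotated view of the same point cloud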

Refer Slide Time: (34:07)

Now we have concluded, on the basis of the given set of data, that yes, we are confident a linear model can be fitted. So now we use the R command. You can see we are interested in the model where I have two variables x1 and x2 and the coefficients are α, β1 and β2. So I need to estimate three parameters, α, β1 and β2, and I will use the same command that we used earlier, lm, but now with a small change. Whatever is my output variable, I give it as such, and then there is a tilde sign (~) indicating the formula, that is, that the variables to be used as input variables are starting. So now I am using two variables, named parcelno and distance, and they are given in this format, separated by a plus sign. If you have more variables, suppose the model is β1x1 + β2x2 + β3x3 + β4x4, then all those variables will be added here as x1 + x2 + x3 + x4 and so on. So the formula remains the same, and if you run it in the R software you will get this type of outcome, which is giving us the fitted formula. Essentially this is writing that y is equal to something like α + β1x1 + β2x2, and the outcome has to be read like this: these three values are giving us the values of the coefficients associated with these terms. So 2.19579 is the value of the intercept term; 1.67803 is the value of the coefficient associated with parcelno, which is actually β̂1; and the last value is the coefficient associated with the variable distance, so in our symbols this is the value of β̂2; and yes, this intercept term 2.19579 is the value of α̂.

So my model becomes: deltime, that is, the delivery time, = 2.196 + 1.68 × parcelno, the number of parcels, + 0.013 × distance. So you can see we have now obtained a model with two input variables, and the same story can be continued with more variables. I will show you these things on the R console also.
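In code, the fit described above is a one-line extension of the earlier lm call, assuming the three data vectors already exist:

fit <- lm(deltime ~ parcelno + distance)   # two input variables, separated by +
fit                                        # prints the intercept and the two slope coefficients
coef(fit)                                  # alpha-hat, beta1-hat, beta2-hat as a named vector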

Refer Slide Time: (36:47)

And you can see here, this is the screenshot of what I am going to show you now.

Video Start Time: (36:50)

So let me first create this data vector here. You can see it here; then similarly I create the data vector on the parcel number, it is here, and similarly the data on the distance travelled, like this. So you can see here this is my data: deltime like this, parcelno, the number of parcels, like this, and distance here like this. Now I first clear the screen so that you can follow what is going to happen. I will first create the matrix plot; I copy this command here, so you can see what I am doing: I copy and paste this command and you get this type of plot, which is the same plot I just explained inside the slides. After this I obtain the scatter plot in three dimensions, so first I need to load the library here; I load the library scatterplot3d. We had already installed this library earlier, so it is already there, because you need to install a package only once, and after that, whenever you want, you just load the library. So you can see here this is the scatterplot3d, and if you change the angle, say angle equal to 120, it gives us this changed picture, and if you make it, say, angle equal to 90, it gives you a different picture like this one. So by using these things you can get an idea of what you are really doing. Okay. Next, I fit a linear model to this data set. Right, I close these pictures, clear the screen by Control+L, and paste the command over here, and this gives me the output. So you can see here these are the values of the coefficients that we have obtained: this value 2.19 is the least squares estimate of the intercept term, this 1.67803 is the least squares estimate of the coefficient associated with the variable number of parcels, and this 0.01311 is the value of the least squares estimate of the coefficient associated with the variable distance travelled.

Video End Time: (39:58)

And this same screenshot has been given in the slides also. So now I would like to stop this lecture. I have tried to give you the idea of the principle of least squares and how to implement it inside the R software. Sometimes there are nonlinear curves, and in case it is possible to make some transformation to change the nonlinear form into a linear form, say by taking a log or an exponential, then you can use the command 'lm' to find the least squares estimates in that transformed linear model that was originally nonlinear. But in that case you have to keep in mind that if the original data are given as y, and you transform the variable into log of y so that the curve becomes linear, then you need to input the data on log of y. These things you have to keep in mind, and using this technique you can obtain the least squares estimates very easily using the R software. But you will definitely need to learn more if you want to go further, which is usually not possible for me to do in the same course; there is usually an entire course on the topic of linear regression analysis.
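As a sketch of this transformation idea (the data here are simulated, not from the course), suppose y = exp(alpha + beta*x) apart from noise; taking logs makes the relationship linear:

x <- 1:10
y <- exp(0.5 + 0.3 * x) * exp(rnorm(10, sd = 0.05))   # simulated nonlinear data
fit <- lm(log(y) ~ x)                                 # input log(y), not y, as the response
coef(fit)                                             # estimates of alpha and beta on the log scale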

But now we come to the end of this course; this was the last lecture, and I hope you enjoyed it and understood it. Well, I am saying that this is the end of the course, but

practically, for you, this is the beginning of a new course. I have given you only the basic fundamentals. I have told you very basic things, but believe me, these are the things which are going to create the foundation for everything further.

In case you want to conduct a Monte Carlo simulation, you want to find a good model, or you want to do any data mining or anything else, the first step is the tools of descriptive statistics; and even inside those areas, for example in data mining, you use the tools which we have discussed here in this course on descriptive statistics. And you have seen that one tool will not give you the complete information about the entire data set; there are different types of features which are hidden inside the data. Data is so naive that it does not tell you 'I have this, I have this, I have this'; you are the only one who has to use different types of tools and get the information from the data, and it is very important that

you use the correct tool on the correct data. In case you do anything else, like using the wrong data with a correct tool or a correct tool with the wrong data, you will not get a correct statistical outcome, and then people make up different types of stories that statistics is not telling the truth, because they do not know what the appropriate tool for a given data set is. So my request is that you please try to understand these topics in more detail before you try

to apply them on a real data set. You definitely also need to study a book which will give you more properties of these tools and more applications of these tools before you become ready to use them in a real situation. And in a real situation, you cannot say that you can use only the measures of central tendency, or of variation, or of association; all the tools can be applied to any data set, but it is only you who is going to take the call, take the correct decision, as to which of the tools have to be used in a given situation, whether graphical or analytical or a combination of them. So, learn more statistics and enjoy the course. I will take leave, and see you sometime soon once again; till then, goodbye, and God bless you.
