Data Analytics With Python
S. No Topic Page No
Week 1
1 Introduction to data analytics 1
2 Python Fundamentals - I 33
3 Python Fundamentals - II 54
4 Central Tendency and Dispersion - I 83
5 Central Tendency and Dispersion - II 108
Week 2
6 Introduction to Probability- I 127
7 Introduction to Probability- II 155
8 Probability Distributions - I 177
9 Probability Distributions - II 198
10 Probability Distributions - III 225
Week 3
11 Python Demo for Distributions 246
12 Sampling and Sampling Distribution 256
13 Distribution of Sample Means, population, and variance 287
14 Confidence interval estimation: Single population - I 304
15 Confidence interval estimation: Single population - II 324
Week 4
16 Hypothesis Testing- I 342
17 Hypothesis Testing- II 364
18 Hypothesis Testing- III 380
19 Errors in Hypothesis Testing 394
20 Hypothesis Testing: Two sample test- I 422
Week 5
21 Hypothesis Testing: Two sample test- II 442
22 Hypothesis Testing: Two sample test- III 464
23 ANOVA - I 480
24 ANOVA - II 494
25 Post Hoc Analysis (Tukey’s test) 513
Week 6
26 Randomized block design (RBD) 542
27 Two Way ANOVA 563
28 Linear Regression - I 583
29 Linear Regression - II 601
30 Linear Regression - III 614
Week 7
31 Estimation, Prediction of Regression Model Residual Analysis 634
32 Estimation, Prediction of Regression Model Residual Analysis - II 652
33 Multiple Regression Model - I 674
34 Multiple Regression Model-II 695
35 Categorical variable regression 714
Week 8
36 Maximum Likelihood Estimation- I 744
37 Maximum Likelihood Estimation-II 761
38 Logistic Regression- I 785
39 Logistic Regression-II 802
40 Linear Regression Model Vs Logistic Regression Model 818
Week 9
41 Confusion matrix and ROC- I 838
42 Confusion Matrix and ROC-II 860
43 Performance of Logistic Model-III 883
44 Regression Analysis Model Building - I 895
45 Regression Analysis Model Building (Interaction)- II 910
Week 10
46 Chi-Square Test of Independence - I 928
47 Chi-Square Test of Independence - II 949
48 Chi-Square Goodness of Fit Test 971
49 Cluster analysis: Introduction- I 990
50 Clustering analysis: part II 1009
Week 11
51 Clustering analysis: Part III 1026
52 Cluster analysis: Part IV 1046
53 Cluster analysis: Part V 1068
54 K-Means Clustering 1083
55 Hierarchical method of clustering -I 1109
Week 12
56 Hierarchical method of clustering- II 1134
57 Classification and Regression Trees (CART : I) 1162
58 Measures of attribute selection 1187
59 Attribute selection Measures in CART : II 1206
60 Classification and Regression Trees (CART) - III 1224
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee
Lecture No 1
Introduction to Data Analytics
Welcome, students, to this course on Data Analytics with Python. Today is the introductory class, and this lecture is an introduction to data analytics.
(Refer Slide Time: 00.34)
The objective of this course is to introduce conceptual understanding using simple and practical examples, rather than a repetitive, point-and-click mentality. Most students, when they use software for data analytics, just want to click and get the result; they do not bother about what is actually happening inside the software. This course should make you comfortable using analytics in your career and your life.

You will learn how to work with real data. You might have learnt many different methodologies, but choosing the right methodology is important, and this course will help you choose the right data analytical tools.
(Refer Slide Time: 01:17)
Another objective of the course: look at this picture of a person using a ladder; he did not know how to use it correctly for the purpose it is intended. Similarly, the danger in using quantitative methods does not generally lie in the inability to perform the calculations, because with developments in computer technology there are many packages available for doing data analytics. The real threat is a lack of fundamental understanding of why to use a particular technique or procedure, how to use it correctly, and how to correctly interpret the results. This course will focus on how to choose the right technique, use it correctly and interpret the results.
(Refer Slide Time: 02:01)
So what are the learning objectives of this class? After completing this lecture, you will be able to define what data is and why it is important, define what data analytics is and its types, and explain why analytics is so important in today's business environment. Then we will see how statistics, analytics and data science are interrelated; there seems to be some overlap, and we will clarify the differences and how they are connected. In this course we are going to use a package called Python, and I will explain how and why it is important to use Python. At the end of this session we will explain the four important levels of data: nominal, ordinal, interval and ratio. Now we will go to the content.
(Refer Slide Time: 02:54)
We will define data and its importance. There are three terms: variable, measurement and data. Next we will see what is generating so much data, then how data adds value to business, and why data is important.
(Refer Slide Time: 03:11)
Variable, measurement and data are terms we are going to use frequently in this course. So what is a variable? A variable is a characteristic of any entity being studied that is capable of taking on different values. For example, X is a variable; it can take any value, say 1, 2, 0 and so on. Measurement is when a standard process is used to assign numbers to particular attributes or characteristics of a variable. To substitute a value for X, you have to measure the characteristic of the variable; that is measurement. Then what is data? Data are recorded measurements. There is a variable, you measure the phenomenon, and after measuring it you substitute a value for the variable; the value the variable takes is your data. So X is the variable and, for example, the number 5 is the data; how you arrive at that 5 is the measurement. Then, what is generating so much data?
(Refer Slide Time: 04:33)
Data can be generated in different ways: by humans, by machines, and by human-machine combinations. Nowadays everybody has a Facebook account, a LinkedIn account; we are on various social networking sites. So the availability of data is not the problem. Data can be generated anywhere information is generated, and it is stored in structured or unstructured format.
(Refer Slide Time: 05:06)
So how does data add value to business? Assume the data collected from various sources is stored in a data warehouse. From the data warehouse, the data can be used to develop a data product; here we are using the term data product, and in the coming slides I will explain exactly what a data product is, with some examples. The same data, shown on the right-hand side, can also be used to get more insights from the data.
What do we mean by a data product? For example, algorithmic solutions used in production, marketing and sales. A recommendation engine is one example of a data product. Suppose you go to Flipkart or Amazon to buy a particular product; the software itself will recommend the next possible product you could buy. That is a recommendation engine. Even when you watch YouTube videos on a particular topic, YouTube itself suggests other relevant videos; that is also a recommendation engine, one example of a data product. So with the help of data you can form a data product, or you can get insights from the data, and that adds business value.
(Refer Slide Time: 06:27)
This is an example of a data product: the driverless Google car. The whole concept of the Google car is built on data; it detects everything required for driving the car. The next example is the recommendation engine: as I told you, when you buy a product, it suggests other products that can be purchased along with it. Another very common example of a data product is Google itself. Google has a lot of applications, and one example of a data product is Google Maps. Google Maps helps you find the right route, which road has traffic, which road has a toll booth; this kind of information we get from Google Maps. So Google Maps is another example of a data product.
(Refer Slide Time: 07:20)
Now, why is data important? Data helps in making better decisions. Data helps solve problems by finding the reason for underperformance: if a company is not performing well, by collecting data we can identify the reason. Data helps one evaluate current performance, and it can also be used for benchmarking the performance of a business organization. After benchmarking, data helps one improve performance. Data also helps one understand consumers and markets, especially in a marketing context: you can understand who the right consumers are and what kind of preferences they have.
(Refer Slide Time: 08:16)
Next we will define data analytics and its types. In the coming two or three slides we will define data analytics, see why analytics is important, look at data analysis, see how data analytics is different from data analysis, and finally see the types of data analytics.
(Refer Slide Time: 08:40)
Data analytics is the scientific process of transforming data into insights for making better decisions. Even without data, without doing analytics, you can make decisions, but you cannot make better decisions. By virtue of your experience and intuition you can take decisions, and sometimes they may be correct, but if you make decisions with the help of data, you will be able to make better decisions. Professor James Evans has defined data analytics this way: it is the use of data, information technology, statistical analysis, quantitative methods and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions.

You see there are many terms appearing here: IT, statistical analysis, quantitative methods, and mathematical and computer-based models. We will see how these are interrelated in the coming slides. Generally, among students, there is confusion about whether analysis and analytics are the same or different.
(Refer Slide Time: 10:13)
Why is analytics important? Opportunities abound for the use of analytics and big data, such as determining credit risk, developing new medicines (healthcare analytics is an emerging area that helps identify the correct medicines), and finding more efficient ways to deliver products and services. For example, in the banking context data analytics is used for preventing fraud and uncovering cyber threats: with the help of data analytics you can find possible cyber crimes, detect them and prevent them. Data analytics is also important for retaining the most valuable customers: we can identify which customers are valuable and which are not, and focus more on the valuable ones.
(Refer Slide Time: 11:08)
Now, what is data analysis? It is the process of examining, transforming and arranging raw data in a specific way to generate useful information from it. Data analysis allows for the evaluation of data through analytical and logical reasoning to lead to some sort of outcome or conclusion in some context. Data analysis is a multi-faceted process that involves a number of steps, approaches and diverse techniques. We will see that in a coming lecture.
(Refer Slide Time: 11:41)
Now let us compare data analysis and data analytics. Data analysis is about what has happened in the past: we explain what has happened, how it happened and why it happened. Data analysis is like a kind of post-mortem study of what has happened in the past.
(Refer Slide Time: 12:13)
On the contrary, analytics is about what will happen in the future; with the help of analytics we can predict and explore possible future events.
(Refer Slide Time: 12:25)
Data analysis can also be qualitative: we can explain how and why a story ended the way it did. In quantitative analysis we can say, for example, how sales decreased last summer. As I keep repeating, analysis is studying what has happened in the past.
(Refer Slide Time 13:12)
So analysis is not exactly equal to analytics. Similarly, data analysis is different from data analytics, and business analysis is different from business analytics. Analytics is studying future events with the help of past data.
(Refer Slide Time 13:34)
Next we come to the classification of data analytics. Based on the phase of the workflow and the kind of analysis required, there are four major types of data analytics: descriptive analytics, diagnostic analytics, predictive analytics and prescriptive analytics. We will see these four types in detail in the coming classes.
(Refer Slide Time 13:57)
This picture shows the difficulty and the kind of value we get from different types of analytics. Descriptive analytics answers the question, what happened? Diagnostic analytics helps answer, why did it happen? Predictive analytics helps answer, what will happen? Prescriptive analytics helps answer, how can we make it happen? Looking at the level of difficulty, descriptive analytics has the lowest difficulty; on the contrary, prescriptive analytics has a higher difficulty, and also higher value, in the sense of the business value it adds. So where there is more difficulty, there is more value.
(Refer Slide Time 14:54)
Then, what is descriptive analytics? Descriptive analytics is the conventional form of business intelligence and data analysis. It seeks to provide a depiction or summary view of facts and figures in an understandable format, to either inform or prepare data for further analysis. Descriptive analysis, or in other words descriptive statistics, summarizes raw data and converts it into a form that can be easily understood by humans. It can describe in detail an event that has occurred in the past.
(Refer Slide Time 15:40)
Common examples of descriptive analytics are company reports that simply provide a historic review, such as data queries, reports, descriptive statistics, data visualization and data dashboards.
(Refer Slide Time 16:00)
Next we go to diagnostic analytics. Diagnostic analytics is a form of advanced analytics which examines data or content to answer the question, why did it happen? It is like consulting a doctor: the doctor tries to understand why this has happened. That kind of analytics is diagnostic analytics. Diagnostic analytical tools aid an analyst in digging deeper into an issue so that they can arrive at the source of the problem, just as a doctor identifies the source of a disease. Similarly, if something has happened, for example a company is not performing well, diagnostic analytics will help identify the core reason for it. In a structured business environment, tools for descriptive and diagnostic analytics go in parallel: whether it is descriptive or diagnostic analytics, the analytical tools used can be the same; only the purpose may be different.
(Refer Slide Time 17:09)
For example, data discovery, data mining and correlations; these tools can be used for descriptive analytics as well.
(Refer Slide Time 17:20)
Now we come to predictive analytics. Predictive analytics helps to forecast trends based on current events. Predicting the probability of an event happening in the future, or estimating the accurate time at which it will happen, can be determined with the help of predictive analytical models. Many different but co-dependent variables are analysed to predict a trend in this type of analysis. One of the tools of predictive analytics is regression analysis: there may be some independent variables and one or more dependent variables, and we study how these variables are interrelated. That kind of study is predictive analytics.
(Refer Slide Time 18:11)
When you look at this picture, you see that with the help of historical data and different predictive algorithms you can come up with a model. Once the model is developed, new data can be fed into the model and we can get predictions about future events.
(Refer Slide Time 18:35)
Examples are linear regression, time series analysis and forecasting, and data mining. These are techniques for predictive analytics.
(Refer Slide Time 18:46)
The last one is prescriptive analytics: a set of techniques to indicate the best course of action. It tells what decision to make to optimize the outcome. The goal of prescriptive analytics is to enable quality improvements, service enhancements, cost reductions and increased productivity.
(Refer Slide Time 19:13)
In prescriptive analytics, some of the tools we can use are optimization models, simulation models and decision analysis. These are the tools under prescriptive analytics.
(Refer Slide Time 19:27)
Next we are going to see why analytics is so important. In this section we will look at what is happening with the demand for data analytics, and at the different elements of data analytics.
(Refer Slide Time 19:44)
This picture shows Google Trends data up to 2017. The blue line represents 'data scientist'; the orange line represents 'statistician' and operations researcher. You can see the trend is increasing, which means people are searching for the term 'data scientist' in the Google search engine more and more often. The search count is increasing, which means there is demand for that particular job.
(Refer Slide Time 20:19)
This is a newspaper clipping from the Times of India. There is a lot of news about data scientists and the future requirement for data scientists; data scientists are earning more than CAs and engineers. You can look at this link for further details.
(Refer Slide Time 20:37)
And you can see the demand for data analytics. This is also a newspaper clipping: with companies across industries striving to bring their research and analysis departments up to speed, the demand for qualified data scientists is rising. This is an emerging field, and many companies are looking for qualified data scientists. So if you take this course and complete it, you may be qualified to get into these companies.
(Refer Slide Time 21:07)
Many times students have different understandings of what data analytics, statistics, data mining and optimization are. When we say data analytics, there are different elements: statistics; business intelligence and information systems; modelling and optimization; simulation and risk, that is, the ability to do what-if or sensitivity analysis; visualization; and data mining. These are the components of data analytics, and these different domains are interrelated.
(Refer Slide Time 21:47)
Next we will see what kind of skill set is required to become a data analyst, and then the small difference between a data analyst and a data scientist.
(Refer Slide Time 21:59)
To become a data analyst, the fundamental requirement is knowledge of mathematics. Next, you need knowledge of technology, that is, hacking skills. Hacking skills, looked at in a positive way, mean knowing how to use the given data to get more information. The third skill is business and strategy acumen: you should have knowledge of the domain, of the business and of its strategy.

These three skills are required for a good data scientist. It is very difficult for one person to have all three skills; that is why good data analysts are hard to find. Somebody may be very good at mathematics but may not have good knowledge of business; some people may be very good at technology, in the sense of information technology, but may not have good business knowledge. So we need a combination of all three skills; otherwise, a group of people, some from mathematics, some from computer science and some with domain knowledge, have to work together to form a good data science team. These three together form data science.
(Refer Slide Time 23:31)
Now, what is the difference between data analysts and data scientists? The difference is in the kind of role they play. The role of a data analyst is tied to a business domain: if he is good at doing analytics in the area of marketing, he can be called a marketing analyst; if the person is from the finance area, he can be called a finance analyst. He is an analyst, a data analyst. But the role of a data scientist is a little bigger, because the data scientist needs knowledge of advanced algorithms and machine learning and should be able to come out with a data product, which I told you about previously. So the data scientist can come out with a data product.
(Refer Slide Time 24:30)
In this course we are going to use Python. In the next lecture I will give a basic introduction to Python; here we will see why we are going to use it. Python is very simple and easy to learn. Most importantly, it is free and open-source software. It is interpreted, not compiled: a compiler processes the whole program at once, but an interpreter need not work that way; it can execute even a single line of the program. Python is dynamically typed: in some other languages you have to declare every variable and its type, whether it is an integer or a float, but here you need not; the type is determined dynamically. It is extensible, in the sense that code written in some other language can be extended with the help of Python, and it can be embedded, meaning a program made in Python can be embedded in other platforms. It also has an extensive library.
(Refer Slide Time 25:45)
Python is usable for desktop and web applications, data applications and networking applications; most importantly, it can be used for data analysis and data science, for machine learning, for IoT (Internet of Things) and artificial intelligence applications, and for games.
(Refer Slide Time 26:05)
Another reason for choosing Python is that many companies use Python as a language, for example Google, Facebook, NASA, Yahoo and eBay.
(Refer Slide Time 26:23)
With Python we are also going to use the Jupyter Notebook, which I will explain in the next class. It is a client-server application in which you edit code in a web browser; it is easy for documentation, easy for demonstration and has a user-friendly interface.
(Refer Slide Time 26:39)
In the last session of this lecture we will explain the four different levels of data: the types of variables and the levels of data measurement. We will compare the four levels of data, namely nominal, ordinal, interval and ratio, and we will see why it is useful to know these different levels of data.
(Refer Slide Time 27:03)
One way of classifying data is into categorical data and numerical data. Examples of categorical data are marital status, political party and eye colour. Numerical data can be discrete or continuous. Discrete data may be the number of children or defects per hour; continuous data may be weight or voltage. The difference between discrete and continuous is this: the number of children can be two or three, but 2.5 children is not possible, whereas for a continuous variable, between 0 and 1 there are infinitely many values, so it is continuous.
(Refer Slide Time 27:56)
Next we will see the different levels of data measurement. We have just seen one classification of data, into categorical and numerical. Another way of classifying data is into nominal, ordinal, interval and ratio data.
(Refer Slide Time 28:14)
What is nominal data? A nominal scale classifies data into distinct categories in which no ranking is implied. Examples of nominal data are gender and marital status. For gender, suppose you are conducting a questionnaire and you code male as 0 and female as 1. The 0 and 1 just represent the gender; you cannot do any arithmetic operations with them. For example, you cannot find the average: the software will give you some number, but it has no meaning. Similarly, marital status, married or unmarried, is an example of nominal data.
(Refer Slide Time 29:01)
The next level of data is the ordinal scale. It classifies data into distinct categories in which ranking is implied; here the numbers are ranks. For example, you may ask customers to rank their level of satisfaction: satisfied, neutral, unsatisfied. Faculty rank is another example: professor, associate professor, assistant professor; a rank order is followed, for example 1 for professor, 2 for associate professor, 3 for assistant professor. Student grades A, B, C, D, E, F are also ordinal, because the numbers 1, 2, 3 represent ranks.
(Refer Slide Time 29:45)
The next level of data is the interval scale. The interval scale is an ordered scale in which the difference between measurements is a meaningful quantity, but the measurements do not have a true zero point. An example of interval data is the year. The current year is 2019, and you can add and subtract: add five years and it is 2024, or subtract nine years and it is 2010. But you cannot multiply: if you multiply 2019 by 2020 you end up with a big number that has no meaning, because zero has no meaning on this scale. Another example of an interval scale is Fahrenheit temperature: in the Fahrenheit scale, zero represents a freezing point, but it is not the absence of heat, whereas on the Kelvin scale, minus 273 degrees Celsius (0 K) is the absence of heat. So Kelvin is a different kind of scale; we will see that next.
(Refer Slide Time 30:52)
The ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point. Weight, age, salary and Kelvin temperature come under the ratio scale, because 0 Kelvin means the absence of heat. On the ratio scale you can do all kinds of arithmetic operations. With nominal data you cannot do any arithmetic operation, and with ordinal data you also cannot; with interval data you can add and subtract but you cannot multiply; but with ratio data you can do all kinds of arithmetic operations: add, subtract, multiply and divide.
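As a small illustration (a sketch with made-up example values, not the course data set), here is how the four levels might be represented in pandas and which operations are meaningful at each level:

```python
import pandas as pd

levels = pd.DataFrame({
    # nominal: categories with no order; only counting/mode make sense
    "gender": pd.Categorical(["male", "female", "female"]),
    # ordinal: ordered categories; ranking and min/max make sense
    "grade": pd.Categorical(["B", "A", "C"],
                            categories=["F", "E", "D", "C", "B", "A"],
                            ordered=True),
    # interval: differences are meaningful, but there is no true zero
    "year": [2010, 2015, 2019],
    # ratio: true zero point, so all arithmetic (including ratios) is meaningful
    "salary": [30000.0, 45000.0, 52000.0],
})

print(levels["gender"].value_counts())   # counts for nominal data
print(levels["grade"].min())             # lowest grade, valid because the categorical is ordered
print(levels["year"].diff())             # year-to-year differences for interval data
print(levels["salary"].mean())           # mean is meaningful for ratio data
```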
(Refer Slide Time 31:35)
Now consider the usage potential of the various levels of data. Nominal data has the least usage potential; next comes ordinal, then interval, then ratio. Ratio data has the highest usage potential.
(Refer Slide Time 31:56)
This is more important: why do we still have to know the different types of data? Because the type of data helps us choose the right analytical tool. For example, if the data is nominal, you can do only non-parametric tests; if the data is ordinal, here also you can do only non-parametric tests. But if the data is interval, you can do parametric tests: with interval data you can do all of the above plus addition and subtraction. With ratio data you can do all of the above plus multiplication and division, and you can go for parametric statistical methods. So the purpose of classifying data into nominal, ordinal, interval and ratio is to choose the right analytical tool, whether parametric or non-parametric. Sometimes students have nominal data but go for a parametric test; that should not be done. That is the purpose of knowing the nature of the data. So in this class we have seen an introduction to data analytics, the importance of data analytics and the classification of data analytics. We have seen the difference between analytics and analysis, and we have seen the different types of data.
In the next class we will learn what Python is, how to install Python, and what kind of descriptive analysis we can do with the help of Python. We will meet again with another lecture. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee
Lecture No 2
Python Fundamentals - I
Good morning, students. The last class was the introduction class, in which we saw the importance of data analytics and certain classifications of data analytics. This is my second lecture, on Python fundamentals, because we are going to use Python. In this lecture I have three objectives.
(Refer Slide Time: 00:50)
First, I will tell you how to install Python; second, we will see some fundamentals of Python; third, some data visualization. For data visualization I am going to give only the theory in this class; in the next class we will take some sample data and visualize it using the Python software.
(Refer Slide Time: 01:13)
As I told you, the first topic is how to install Python. There are 5 steps, which we will see in detail in the coming slides. In step 1 we visit the website www.anaconda.com from the address bar of the web browser. In step 2 we click on the Download button. In step 3 we download the Python 3.7 version for the Windows operating system. In step 4 we double-click on the file to run the application. In step 5 we follow the instructions until the installation process completes. I have taken screenshots of all 5 steps while installing on my laptop, and I am going to show each step as a screenshot.
(Refer Slide Time: 02:03)
The first step is to type www.anaconda.com in the address bar of the web browser.
(Refer Slide Time: 02:13)
In the second step, once you have typed it, you will see this screen. At the location marked here there is a Download option; on the left side also, I have circled a Download option. Click it to download.
(Refer Slide Time: 02:36)
In the third step, note that there are two versions of Python, Python 3.7 and Python 2.7. In this course we are going to use the latest version, Python 3.7.
(Refer Slide Time: 02:46)
In the fourth step, double-click on the downloaded file to run the application. For example, I have stored this Anaconda installer in the F drive.
(Refer Slide Time: 02:59)
You have to agree to the licence agreement, terms and conditions.
(Refer Slide Time: 03:06)
Then it will be installed in the C drive; click Next.
(Refer Slide Time: 03:15)
Again click Next
(Refer Slide Time: 03:29)
Now we have installed Anaconda. I will explain how to open the Jupyter Notebook; let me switch the screen.
(Refer Slide Time: 03:46)
Yes, this is the screen. Initially you can see a box shown in blue; sometimes it will be shown in green, which I will show you later. This is what the Jupyter Notebook looks like.
(Refer Slide Time: 04:04)
Next, there are several interfaces for using Python: there is Spyder and there is Jupyter, but we prefer Jupyter for a few reasons: you edit code in a web browser, it is easy for documentation, easy for demonstration, and it has a user-friendly interface. That is the reason we are using Jupyter, but it is not compulsory; if you are already comfortable with some other interface you can continue with that.
(Refer Slide Time: 04:34)
Anaconda consists of two pieces of software: Python, shown on the left-hand side, and the Jupyter application, on the right-hand side. These are combined together and kept in the Anaconda software package.
(Refer Slide Time: 04:49)
When you type Jupyter from the Start menu, you will get this screen.
(Refer Slide Time: 04:58)
Then, when you click Launch, you will get this screen. Now, from the beginning, I am going to explain how to start the Python Jupyter Notebook.
(Video Starts: 05:08)
You have to type Jupyter and open Jupyter Notebook. When you click it, you will get this screen. If you want to create a new notebook, go to New and then Python 3. It comes up as Untitled; there you can change the name. Give it the name 'Introduction to Python'.
(Video Ends: 05:40)
You see a box appearing; this is called a cell. I have marked it in red. A cell can be entered using the Enter key.
(Refer Slide Time: 05:51)
Sometimes that box will appear green; green indicates it is in edit mode. Sometimes the box will appear blue.
(Refer Slide Time: 06:01)
Blue indicates it is in command mode. Below the Help menu there is a dropdown with an option called Markdown. If you type something and select Markdown, that cell is used for documentation: it contains documentation text that is not executed as code; it is only for our understanding.
(Refer Slide Time: 06:22)
Now, more about the Jupyter Notebook. Command mode allows editing the notebook as a whole; to leave edit mode, press the Escape key. Execution can be done in three ways. You can press Ctrl+Enter, which runs the cell and keeps the focus on the same cell; another way is to press Shift+Enter, which runs the cell and moves to the next one; and the third way is the Run button on the Jupyter interface, which you can click directly to execute your code. A comment line is written starting with the # symbol: when you want to add notes for understanding your program you can use #, and those lines will not be executed.
(Refer Slide Time: 07:17)
There are some important Jupyter Notebook shortcut keys (used in command mode). Pressing A creates a cell above; pressing B creates a cell below; pressing D twice (D, D) deletes the cell; pressing M turns the cell into a Markdown cell; and pressing Y turns it back into a code cell.
(Refer Slide Time: 07:46)
We now move on to the fundamentals of Python. In the coming slides we will see: loading a simple delimited data file; counting how many rows and columns were loaded; determining which type of data was loaded; and then looking at different parts of the data by subsetting rows and columns. These activities are important because once we load a data set it may have any number of rows and columns, and sometimes we need to operate on only a few rows or a few cells. You should know, from a big data file, how to use only a particular row or a particular column, and sometimes a collection of rows or a collection of columns, for specific operations.
(Refer Slide Time: 08:49)
This is the reference book I am following for this course, especially for this lecture: Pandas for Everyone, by Daniel Y. Chen.
(Refer Slide Time: 09:04)
Now we are going to learn how to load a simple delimited data file. This is fundamental, because before doing data analysis the first step is to load the data into Python. For that purpose we are going to import some basic libraries: pandas, numpy and matplotlib.pyplot as plt. First we import these three basic libraries, then we load the data. The data source is www.github.com/gennybc/gapminder. I have already downloaded this data set, and I am going to show you how to load it into Python. Before that, I am going to open the file in Excel to show you the columns and rows.
(Video Starts 10:07)
Looking at this file, reading the columns you can see country, year, population, continent, life expectancy (given the short name lifeExp) and GDP per capita. How many rows are there? Coming down, this is a CSV file, and there are 1705 rows in the spreadsheet, that is, 1704 data rows plus the header. The last row is Zimbabwe, year 2007, with its population, continent, life expectancy and per capita income. Now I am going to import this CSV file into Python, and I am going to call this data set df: df = pd.read_csv(...), where pd is the short form of pandas (pandas stands for panel data). I am using read_csv because it is a CSV file I am going to read.

For the location of the file, you can directly copy the path, but note one thing: when you copy the path directly from Windows you generally get backslashes (\), and in the code you have to change them to forward slashes (/). So I changed it to C:/users/ET cell/desktop/gapminder-five year data.csv, and the path should be within quotes. Now I read it into df. Once I read it, you see the rows start from 0; that is very important, it is 0-indexing: 0, 1, 2, 3, 4 and so on. I am able to see whatever I saw in the CSV file a few minutes before: country, year, population, continent, life expectancy and GDP per capita. That is how I read it, with pd.read_csv.
Suppose I have loaded the data and I want to see the headings of the file, meaning the columns. For that you type print(df.head()); when you execute it you get the first 5 rows, that is rows 0, 1, 2, 3 and 4 of the data set. The next command: suppose I want to know the size of the file, that is, how many rows and how many columns there are. For that there is a command called shape. So print(df.shape); df is the variable name we gave when loading the CSV file. When you type print(df.shape), we come to know how many rows and how many columns there are. One more thing: you should not type parentheses after shape, because shape is used without parentheses. So I remove the parentheses and run it again; yes, it shows how many rows and how many columns. We go to the next one: now I want to know the column names. If I type print(df.columns)...
Please note that here also there are no parentheses when I type print(df.columns). This is the output: country, year, pop, continent, lifeExp, gdpPercap, and the dtype is object. I will show you how this command runs: type print(df.columns) and you get this output. Students, while watching the video you should open your laptop, type these commands and verify the answers.

The next command gets the data type of each column: print(df.dtypes). That gives a summary of the whole data set and the nature of each column. We will see how it appears. Looking at the output: country is an object, year is an integer, pop is a variable of type float (float means it has decimals), continent is an object, which means character data, lifeExp is a float, which means you get its value in decimals, and similarly gdpPercap is also in decimals; the dtype shown is object. Now we go to Jupyter and run this command. You see, in line 8, print(df.dtypes) gives exactly what I have shown in the slide: country object, year integer, population float, and so on.
(Video Ends: 17:10)
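To summarize the commands shown in this demo, here is a minimal sketch; the file path below is only an illustrative placeholder, so replace it with wherever you saved the gapminder CSV on your machine:

```python
import pandas as pd
import numpy as np                 # imported as in the lecture; not used in this snippet
import matplotlib.pyplot as plt    # imported as in the lecture; not used in this snippet

# load the delimited (CSV) file into a DataFrame called df
df = pd.read_csv("C:/users/ET cell/desktop/gapminder-five year data.csv")

print(df.head())     # first 5 rows; the index starts at 0
print(df.shape)      # (number of rows, number of columns); shape is an attribute, no parentheses
print(df.columns)    # column names: country, year, pop, continent, lifeExp, gdpPercap
print(df.dtypes)     # data type of each column (object, int64, float64, ...)
```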
(Refer Slide Time: 17:11)
This is the classification of data types from the perspective of pandas and Python. A string is the most common data type; it holds characters. 'int' is a whole number, an integer. 'float' is a number with decimals. 'datetime' is used to represent dates and times; it is not loaded by default and needs to be imported whenever it is required. We will see that later.
(Refer Slide Time: 17:38)
One more command gives more information about the data: if you type df.info you get full details about each column. We will do that one.
(Video starts: 17:53)
Look at this: when I print df.info, I get the data columns. There are 6 columns. Country has 1704 rows, non-null, object type; non-null means all the values are filled, there are no missing values. Similarly, year has 1704 non-null rows and is an integer. Population is a float, continent is an object, lifeExp is a float, gdpPercap is a float, and the memory usage is shown.

Now suppose there is a big data file and we want to see specific rows or specific columns. How do we do that? First, get the country column and save it to its own variable. If you look at the data I showed initially, country is one of the columns; I want to pick up only that country column and save it. I am going to give it the name country_df = df['country']: you see you have to open a square bracket, and within the square bracket the column name goes in quotes. Suppose in the country column I want to see the first 5 rows: you type print(country_df.head()), and that shows the first 5 rows. Now, from the full data, we have fetched only the country column and we can see its first 5 rows, that is rows 0 to 4. There may also be a requirement to see the last five observations; for that purpose you type print(country_df.tail()), and then you can see the last 5 rows from the bottom of the country_df object.
(Video ends: 21:15)
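A minimal sketch of what was just demonstrated, assuming df has already been loaded as above:

```python
df.info()                            # summary: 1704 non-null entries per column, dtypes, memory usage

country_df = df['country']           # pick out a single column as a Series
print(country_df.head())             # first 5 values of the country column
print(country_df.tail())             # last 5 values of the country column
```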
There may be a requirement to see more than one column at a time. So I am going to save the result under another name, called subset: subset = df[['country', 'continent', 'year']]. You see there are double square brackets, because I want to fetch the country, continent and year columns. Then I am going to look at the head, that is, the first 5 rows of this subset. We will go to Python.
(Video starts: 17:46)
I am going to call it subset, with country, continent and year. Suppose I want to see the first 5 rows of this data set called subset: you see that I am able to fetch 3 columns at a time, that is country, continent and year. In the same way, from the subset file I want to see the last 5 rows, so print(subset.tail()); let us see what we get. You see there are 3 columns, and these are the last 5 rows from the bottom.

So far we have been looking at different columns; now we want to subset rows by index label, and there is one command for that called loc. First we look at the initial file, that is print(df.head()). Next, I want to locate the 0th row; for that purpose type print(df.loc[0]), with square brackets. If you want to know the first row you have to enter 0, because Python counts from 0, so print(df.loc[0]) shows the first row. You see the 0th row: country Afghanistan, year 1952, with its population, continent Asia. This is the way to access a particular row; you can verify that the 0th row is the country Afghanistan, year 1952. Dear students, whatever commands I am typing will be given to you when you take this course, so you can practice them yourself; you need not worry if you are not getting them at this stage, all the code will be given to you.

Suppose I want to get the 100th row from the file df: you type print(df.loc[99]), and you can access exactly what is in the 100th row; the 100th row is the country Bangladesh, year 1967, with its population. This is the way to access different rows for our calculation purposes. So far we have seen how to load a CSV file into Python, and we have seen some basic commands.
(Video Ends: 25:21)
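A minimal sketch of these subsetting commands, again assuming df is the loaded gapminder DataFrame:

```python
subset = df[['country', 'continent', 'year']]   # double square brackets select several columns
print(subset.head())                            # first 5 rows of the three selected columns
print(subset.tail())                            # last 5 rows of the three selected columns

print(df.loc[0])                                # row with index label 0 (Afghanistan, 1952)
print(df.loc[99])                               # row with index label 99, i.e. the 100th row
```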
We have seen how to know the size of the file, how to access a particular row, and how to subset smaller data sets from a given big file, which can then be used for further analysis. In the next class we will see how to access different columns; that will continue in the next lecture. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee
Lecture No 3
Python Fundamentals - II
We will continue our lecture on how to access different rows and columns, because it has very important applications.
(Refer Slide Time 00:33)
When the data file is very big, sometimes you need to access only some rows or some columns for your calculations. We will learn how to access particular rows and particular columns, that is, looking at columns, rows and cells. When I use the command print(df.head()), the first column shows the index 0, 1, 2, 3, 4, followed by country, year, population, continent, lifeExp and gdpPercap.
(Refer Slide Time: 01:04)
Suppose I want to get the first row. We know that Python counts from 0, so if you want the first row you type print(df.loc[0]); loc means location, and the 0 goes in square brackets. When you run it you get the details of the first row.
(Refer Slide Time: 01:26)
If I want to know the hundredth row, I print df.loc[99]. We know that Python counts from zero, so if I want the 100th row I have to type 99, in square brackets, and then I can see the details of the 100th row.
(Refer Slide Time: 01:42)
Suppose we want to know the last row in the data set. Print df.tail(n=1). If you try to use -1 as a row label it will not work; we will see why. If you want to know the last row, simply type df.tail(n=1) and you will get it. We will see that.
(Video Starts: 02:01)
Now we use this command to see the details of the last row. Next, we can subset multiple rows at a time. For example, there may be a requirement to select the 1st row, the 100th row and the 1000th row. For that purpose you type print(df.loc[[0, 99, 999]]); you see there are two square brackets. Type it and run it: yes, we are able to see the 1st row, 100th row and 1000th row.

There is another way to subset rows, by row number, using the command iloc; previously we used loc, now we are going to use iloc. Suppose I want to get the 2nd row: if I type print(df.iloc[1]), I get the details of the 2nd row. Suppose I want to know the 100th row using the iloc command: that gives the details of the 100th row. And if I want to access the last row using the iloc command, I can directly type print(df.iloc[-1]); that gives the details of the last row. You can open the Excel file and verify what the last row is, and so on.
(Video Ends: 04:27)
(Refer Slide Time: 04:27)
An important note here about the iloc command: we can pass in -1 to get the last row, but we cannot do the same thing with loc. That is the difference between the loc command and the iloc command.
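A short sketch of that difference, assuming df still has its default integer index (0, 1, 2, ...):

```python
print(df.iloc[-1])     # works: iloc selects by position, so -1 means the last row
print(df.tail(n=1))    # another way to look at the last row

# df.loc[-1] would raise a KeyError here, because loc selects by index label
# and no row has the label -1 in the default RangeIndex.
```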
(Refer Slide Time: 04:42)
Suppose we want to get the first, 100th and 1000th rows using the iloc command. We type print(df.iloc[[0, 99, 999]]). Let us see what answer we get.
(Video Starts: 04:58)
Yes, see, we get the 1st, 100th and 1000th rows.
(Video Ends: 05:21)
(Refer Slide Time: 05:21)
So far we have been subsetting rows. Now we will see subsetting columns. The Python slicing syntax uses a colon; a colon on its own refers to everything, here all the rows. So if you want to get particular columns using the loc or iloc syntax, you write something like df.loc[:, <columns>], where after the colon you specify which columns you need, to subset the columns.
(Refer Slide Time: 05:49)
In the next slide we are going to subset columns with loc. Note the position of the colon: it is used to select all rows.
(Refer Slide Time: 05:58)
You see, subset = df.loc[:, ['year', 'pop']]: I want to see only two columns, that is year and population. When you type it this way you get all the rows, but details of only two columns, year and population. Then, when you type print(subset.head()), you get the first 5 rows. We will see how it appears.
So you will see how it appearing.
(Video Starts: 06:26)
Subsets equal to Subset is object because from the df is the initial object which has all the details.
Now I am going to fetch only few columns from the df object that I am going to saved in the
name subset, subset is the object. So all the rows but I need only year column and population
column so I am going to type I want to see the first 5 rows, see that I am able to see 1 st 5 rows,
only for 2 cells. That is year and population. This is the way to get only 2 cells from the 2
columns from the Big Files.
(Video Ends: 07:21)
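A minimal sketch of column subsetting with loc, using the column names as they appear in the gapminder file:

```python
subset = df.loc[:, ['year', 'pop']]   # ':' keeps all rows; the list picks two columns by name
print(subset.head())                  # first 5 rows of just year and pop
```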
(Refer Slide Time: 07:21)
Here is another example, subsetting columns with iloc. iloc allows us to use integers, and -1 will select the last column. The same as we saw previously: in subset = df.iloc[:, [2, 4, -1]] the colon represents all the rows, and [2, 4, -1] picks the columns by position. Then we can see the first 5 rows using the command print(subset.head()).
(Video Starts: 07:46)
You see that we are able to see the population column, the life expectancy column and the last column. You can open the Excel sheet and verify whether we are getting the right answer or not.
(Video Ends: 08:26)
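The corresponding sketch with iloc, picking columns by position (0-based, with -1 meaning the last column):

```python
subset = df.iloc[:, [2, 4, -1]]   # pop (position 2), lifeExp (position 4) and the last column (gdpPercap)
print(subset.head())              # first 5 rows of the three selected columns
```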
(Refer Slide Time: 08:26)
There is another way of subsetting columns, using the command called range. First we make a range of numbers and save it in an object called small_range: small_range = list(range(5)). Printing small_range gives 0, 1, 2, 3 and 4. Now this small_range object can be used to access the corresponding columns.
(Refer Slide Time: 08:57)
That is, the 1st, 2nd, 3rd, 4th and 5th columns. We will try this.
(Video Starts: 09:09)
small_range is an object; we are going to create a range. If we look at what small_range is, it runs from 0 to 4, which means the first 5 positions. Now we subset using that object called small_range with the iloc command: df.iloc[:, small_range]. We see that we get 5 columns: country, year, population, continent and life expectancy.
(Video Ends: 10:21)
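A small sketch of range-based column subsetting:

```python
small_range = list(range(5))          # [0, 1, 2, 3, 4]
subset = df.iloc[:, small_range]      # first five columns: country, year, pop, continent, lifeExp
print(subset.head())
```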
(Refer Slide Time: 10:22)
So far we have subset only rows or only columns. Now we are going to subset rows and columns simultaneously. For example, using the loc command, if you type print(df.loc[42, 'country']) we can check, at label 42 in the country column, what the cell value is; the value is Angola. We will try this.
(Video Starts: 10:47)
We are going to see, at label 42 in the country column of the file, what value is there: it is Angola.
(Video Ends: 11:09)
(Refer Slide Time: 11:09)
Similarly, using iloc we can see what is at label 42 in the 0th column; now we represent the column also by its number, 0. You will see what value it is, and you can verify the answer: you can open the Excel file and check whether we are accessing the cell correctly or not.
(Video Starts: 11:29)
print(df.iloc[42, 0]): at row 42, column 0, the value is Angola.
(Video Ends: 11:46)
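A sketch of accessing a single cell in both ways:

```python
print(df.loc[42, 'country'])   # by row label and column name -> 'Angola'
print(df.iloc[42, 0])          # by row position and column position -> 'Angola'
```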
(Refer Slide Time: 11:46)
Next we can subset multiple rows and columns at once. For example, get the 1st, 100th and 1000th rows from the 1st, 4th and 6th columns. So now we are going to fetch rows, columns and the corresponding cells simultaneously: print df.iloc with the row positions 0, 99, 999 and the column positions 0, 3, 5. Let us see what the answer is.
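A minimal sketch of the command:

print(df.iloc[[0, 99, 999], [0, 3, 5]])   # rows 1, 100 and 1000; columns 1, 4 and 6 (by position)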
(Video Starts: 12:13)
Accessing rows and columns like this is a very important skill, because nowadays data files come with a lot of rows and a lot of columns. We do not need all the columns and all the rows for further analysis; sometimes we need only specific rows or specific columns. So these basic commands, which show how to access particular rows and columns, will be very useful when we do further analysis using Python. Here is the output: the 1st row, the 100th row, the 1000th row, the 1st column and so on.
(Video Starts: 13:08)
(Refer Slide Time: 13:08)
And there is another way: if you use the column names directly, it makes the code a bit easier to read than referring to columns by number. To represent a column by its name we simply type the column name. So we use this command: print df.loc with the row labels 0, 99, 999, and then directly type the column names country, lifeExp and gdpPercap; you see there is a square bracket around each list.
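A minimal sketch, assuming the column names are 'country', 'lifeExp' and 'gdpPercap' as in the gapminder file:

print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])   # same rows, columns chosen by name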
(Video Starts: 13:36)
Note that you have to type the name exactly as it appears in the file, for example lifeExp with a capital E. Using names like country and lifeExp is the easier way to read the code, as long as we remember the exact column names.
(Video Ends 14:48)
(Refer Slide Time 14:49)
Not only that: instead of listing labels, if you put a slice such as 10:13, the corresponding rows will be displayed. So with print df.loc[10:13, ...] the rows labelled 10, 11, 12 and 13 will be shown, for the columns country, lifeExp and gdpPercap. We will try this command.
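A short sketch of this slice (with loc the end label 13 is included):

print(df.loc[10:13, ['country', 'lifeExp', 'gdpPercap']])   # rows labelled 10 through 13 inclusive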
(Video Starts: 15:11)
That means we can see a range of rows at a time: the rows labelled 10, 11, 12 and 13.
(Video Ends: 16:17)
(Refer Slide Time: 16:17)
Okay. Next, with print df.head(n=10) we are able to see the first 10 rows.
(Refer Slide Time: 16:23)
Now a question: for each year in our data, what was the average life expectancy? To answer this we need to split our data into parts, one per year, then take the life expectancy column of each part and calculate its mean.
(Refer Slide Time: 16:38)
There is a command I am going to use called groupby; if we look at the raw data, it is not grouped. When you use the command print df.groupby year, life expectancy, mean, you get, for each year, the corresponding mean. In the year 1952 the mean of the life expectancy variable is 49.05; in 1957 it is 51.09. If we look at the data it is not in this order, so the groupby on year is what groups all the values with respect to year. We will see and verify the answer.
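A minimal sketch of the command, assuming the life-expectancy column is named 'lifeExp':

print(df.groupby('year')['lifeExp'].mean())   # average life expectancy for every year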
(Video Starts: 17:15)
When you open that Excel file you will see that it is in some other form: it is not grouped by year, and different years appear in different places. So this groupby command helps you group the data year-wise. You see that for 1952 the life expectancy was about 49 years. As the year increases, the life expectancy also increases, due to the advancement of medical facilities and the rising standard of living.
(Video Ends: 18:42)
(Refer Slide Time: 18:42)
Now we can form a stacked table, again using the groupby command. You type multi_group_var = df.\ — the \ is used to break a long command across lines; otherwise you can write it straight on one line, no problem. We group df by year and continent, take the life expectancy and GDP per capita columns, and then find the mean. We will get an output meaning that in 1952 the life expectancy in Africa was 39, in the Americas 53, in Asia 46 and in Europe 64. We will try this command.
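A sketch of the stacked (multi-index) table, with the column names assumed as before:

multi_group_var = df.\
    groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].\
    mean()                      # mean life expectancy and GDP per capita per year and continent
print(multi_group_var)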
(Video Starts: 19:28)
When we execute this command we get an output that is a stacked table. It is very useful for interpreting the whole dataset; it is a way of summarizing the data in the form of a table. You see that it is now arranged year-wise and continent-wise: 1952 Africa, 1957 Africa, and so on. If you look only at the Africa rows, the life expectancy goes from 39 in 1952 to 41 in 1957, 43 in 1962 and 45 in 1967; we can interpret the table this way. Suppose now you have to flatten this table.
(Video Ends 21:24)
(Refer Slide Time: 21:24)
If you need to flatten the data frame, you can use the reset_index method: just type flat = multi_group_var.reset_index(). Then you see the data is flattened again; the same data set, which was in the stacked table form, is now back in the simple flat form. We will try this command.
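A minimal sketch of the flattening step, continuing from the sketch above:

flat = multi_group_var.reset_index()   # move the year/continent index back into ordinary columns
print(flat.head(15))                   # first 15 rows of the flattened table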
(Video Starts 21:48)
This is data manipulation: when working from a big data file we have to learn these fundamental data manipulation methods, which will be very useful in coming classes. So we are able to use the reset_index command to flatten that stacked table. See that we can now view the first 15 rows, and the data is flattened back into the normal form.
(Video Ends 22:41)
(Refer Slide Time: 22:41)
The next one is grouped frequency counts. Using the nunique command we can get a count of unique values in a pandas Series. So when you type print df.groupby continent, country, nunique, you get the number of unique countries in each continent, which is a frequency count. We will try this command.
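A short sketch of the grouped frequency count:

print(df.groupby('continent')['country'].nunique())   # number of distinct countries per continent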
(Video Starts: 23:04)
Printing it, we see Africa 52, Americas 25, Asia 33. If you go back to the Excel data you can interpret what the 52 for Africa means, what the 25 for the Americas means, and so on: they are counts of distinct countries.
(Video Ends: 23:49)
(Refer Slide Time: 23:49)
Now, a basic plot using two things: year and life expectancy. We are going to create a new object called global_yearly_life_expectancy by grouping life expectancy by year and taking its mean. Then we print it, and for each year we get the corresponding mean life expectancy. You will see this.
(Video Starts: 24:17)
There is a new object named global_yearly_life_expectancy. See that it lists year against mean life expectancy. Suppose we want to plot it; we will now see how to plot this data.
(Video Ends: 25:28)
(Refer Slide Time: 25:28)
Simply call .plot() on that object name. That automatically produces a plot of the output we got, with year on the x axis and average life expectancy on the y axis. We will run this.
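A minimal sketch of the whole step, assuming df and the 'lifeExp' column as before and that matplotlib is available:

import matplotlib.pyplot as plt

global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
global_yearly_life_expectancy.plot()   # year on the x axis, mean life expectancy on the y axis
plt.show()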
(Video Starts 25:40)
What this plot says is that over the years 1950 to 2000, as the year increases, the life expectancy also increases.
(Video Ends: 26:07)
(Refer Slide Time 26:07)
So far we have seen only a simple plot; in the coming slides we will see some other visual representations of the data: the histogram, frequency polygon, ogive curves, pie chart, stem and leaf plot, Pareto chart and scatter plot.
(Refer Slide Time: 26:21)
Suppose this is the data: the rows are East, West and North, and the columns are the first, second, third and fourth quarters.
(Refer Slide Time: 26:30)
The easiest way to show this is a graph, here a bar graph or bar chart, in which the different regions are shown in different colors. This is one method of visual representation of the data. If you look at it, the East region in the third quarter has the most sales.
(Refer Slide Time: 26:53)
Another way to represent the data visually is the pie chart, with slices for the first quarter, second quarter, and so on. If you look at it, the third quarter, shown in blue, has the most sales. Most importantly, a pie chart can be drawn only for a categorical variable; if the variable is continuous you cannot use a bar chart or a pie chart. So the pie chart is used only for categorical, that is count, data.
(Refer Slide Time: 27:31)
Another one is the multiple bar chart; this is yet another way to represent the data visually.
(Refer Slide Time: 27:39)
See, this is the frequency table.
(Refer Slide Time: 27:25)
Next is the frequency polygon. This figure is drawn from the frequency table shown in the previous slide; for example, for the class below 20 the frequency is around 13 or 14. When you connect the midpoints of the class intervals, the line you get is called the frequency polygon; a similar curve can be drawn for the cumulative frequency. But you have to be careful: you can connect the midpoints only when the data is continuous. If the data is not continuous, you cannot connect them.
(Refer Slide Time: 28:24)
Next is the histogram, which was constructed from the given table.
(Refer Slide Time: 28:30)
The lower limits of the class intervals go on the x axis and the frequency is shown on the y axis; note that this is continuous data. That was the histogram. The purpose of the histogram is to give you a rough idea of the nature of the data and what kind of distribution it follows: whether it follows a bell shaped curve, or whether the data is skewed right or skewed left.
(Refer Slide Time: 29:03)
Next is the frequency polygon, which I have already shown you: if the midpoints of the histogram bars are connected, the result is called a frequency polygon. The frequency polygon is used to see the trend
(Refer Slide Time: 29:20)
of the data. The next one is the ogive curve, which is the cumulative frequency curve. For each class, for example 20-under-30, the upper limit of the interval is taken on the x axis and the cumulative frequency on the y axis. For the first interval, 20 to 30, the cumulative frequency at 30 is 6; at 40 it is 24, and that is what is marked.
The advantage of the ogive curve is that if we want to know how many values lie below a given point, that can be read directly from the curve. That is the purpose of the ogive curve.
(Refer Slide Time: 29:56)
Next is the relative frequency curve. It is exactly similar, except that instead of the actual frequency the relative frequency is plotted.
(Refer Slide Time: 30:08)
Okay. The next way to represent the data is the Pareto chart. The Pareto chart has applications in quality control as well; it is used to identify which variables are the most important. If you look at this Pareto chart, there is a frequency axis, on the x axis the different causes are named, poor wiring, short in coil, defective plug, other, and there is one more axis in terms of cumulative percentage.
For example, suppose I am a quality control engineer and my motor is failing very often. There are different reasons for the failure of the motor, and I want to know which are the main ones. So what have I done? First I build a frequency table: due to poor wiring the motor failed 40 times, so the frequency is 40; due to a short in the coil it failed 30 times; due to a defective plug it failed 25 times; and due to other reasons it failed, say, below 10 times. The first step in drawing the chart is to arrange the causes in descending order of their frequency. Those values are taken on the x axis, and the cumulative percentages are plotted on the second axis. How do we interpret the chart? You see that the cumulative value corresponding to the first two causes is about 70%.
So 70% of the failures are due to only two reasons, poor wiring and a short in the coil. The meaning of this is that if you are able to address these 2 problems, 70% of the failures can be eliminated. So the purpose of a Pareto chart is to identify what is critical for us. Generally this is called the 80-20 principle, or the Pareto principle: 80% of the problems are due to 20% of the causes. It need not always be exactly 80; here 70% of the failures are due to just 2 factors, poor wiring and short in coil. So this is the Pareto chart.
(Refer Slide Time: 32:33)
The next one is the scatter plot. Whatever we have seen so far was for one variable; the scatter plot is used for two variables. Here the x axis is registered vehicles and the y axis is gasoline sales. The scatter plot says that when the number of registered vehicles increases, gasoline sales also increase. So the scatter plot is used to see the trend, and the relationship, in the data.
(Refer Slide Time: 32:59)
Some basic principles for an excellent graph: the graph should not distort the data; it should be simple and should not contain unnecessary adornments, so heavy decoration is not required; the scale on the vertical axis should begin at 0; all axes should be properly labeled, whether the x axis or the y axis; the graph should contain a title; and the simplest possible graph should be used for a given set of data. These are the basic principles of an excellent graph.
(Refer Slide Time: 33:39)
When you look at this slide, the left hand side is a bad representation: there are a lot of decorations and unnecessary pictures. The right hand side is a simple graph with year on the x axis and wage on the y axis, and it clearly shows a trend. The left hand side gives no idea of what is happening to wages with respect to year.
(Refer Slide Time: 34:04)
In another example, look at the left side picture and the right side picture; both show the same data. In the left side picture the scale is 0 to 100, while on the right it is 0 to 25. Just by changing the scale we get a different impression: when the scale is stretched the data looks flat, and when drawn on a smaller scale it looks as if there is a lot of variation. The lesson is that we have to use a proper scale to draw the picture.
(Refer Slide Time: 34:40)
The next graphical error is having no 0 point on the vertical axis. On the left side of the figure the months January to June are given on the x axis and monthly sales on the y axis, but the problem is that the axis does not start from 0. On the right side a small break is shown in the axis: even though there is no data between 0 and 36, you make a small break like this so the reader knows the axis starts from 0. So the right hand side is the correct way of drawing the graph; this is a basic requirement.
In this lecture you have seen how to access particular rows and columns using basic commands, and then the different visualization techniques and the theory behind them. In the next class we will take some sample data and, with its help, try to visualize it using different tools such as the pie chart, bar chart, pictogram, Pareto chart and simple graph. Thank you; we will see you in the next class.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee
Lecture No 4
Central Tendency and Dispersion
Good morning students, today we move to lecture 4. In this lecture we are going to talk about central tendency and how to measure dispersion. The lecture objective is to cover the different types of central tendency and dispersion.
(Refer Slide Time: 00:42)
What is a measure of central tendency? Measures of central tendency yield information about particular places or locations in a group of numbers. Suppose there is a group of numbers and it has to be summarized by a single number; that single number is what we call the central tendency. It is a single number that describes the characteristics of a set of data.
(Refer Slide Time: 01:08)
Some of the central tendencies we are going to see in this lecture are the arithmetic mean, weighted mean, median and percentile. Under dispersion we are going to talk about skewness, kurtosis, range, interquartile range, variance, standard score and coefficient of variation.
(Refer Slide Time: 01:25)
First we will see the arithmetic mean. Commonly called simply the mean, it is the average of a group of numbers and is applicable for interval and ratio data. This point is very important: it is not applicable for nominal and ordinal data. It is affected by every value in the data set, including extreme values; one problem with the mean is exactly this sensitivity to extreme values. It is computed by summing all the values in the data set and dividing the sum by the number of values.
(Refer Slide Time: 02:01)
See here I have used the notation µ; the Greek letter µ represents the mean of the population, µ = ΣX / N = (X1 + X2 + … + XN) / N, where N is the population size. Suppose in your class the average mark is 60: the marks of all the students can then be represented by the single number 60, and 60 gives an idea about the performance of the whole class.
(Refer Slide Time: 02:55)
Next, what is the sample mean? Note the difference in notation: for the population mean we used µ, while for the sample mean we use X bar,
X bar = ΣX / n = (X1 + X2 + … + Xn) / n.
For example, for the 6 elements 57, 86, 42, 38, 90, 66, we add them and divide by 6 to get the sample mean 63.167.
(Refer Slide Time: 03:19)
Now, how to find the mean of grouped data? The mean of grouped data is nothing but the weighted average of the class midpoints, with the class frequencies as the weights. The formula is
µ = Σ(f × M) / Σf = (f1 m1 + f2 m2 + f3 m3 + … + fi mi) / (f1 + f2 + … + fi),
where Σf, the sum of all the frequencies, is nothing but N. We will see an example.
(Refer Slide Time: 03:58)
See, this is the grouped data. The class intervals are given, the frequencies are given, the class midpoints are given, and the product of frequency and midpoint can also be found. For example, for 20 to 30 the frequency is 6. Suppose these are the marks obtained in your class: between 20 and 30 there are 6 students, between 30 and 40 there are 18 students, and so on.
When the data is in this grouped format, how do we find the mean? First you have to find the class midpoints: for the class interval 20 to 30 the midpoint is 25, for 30 to 40 it is 35, and likewise 45, 55, 65 and 75. Next you multiply frequency by class midpoint: 6 × 25 = 150, 18 × 35 = 630, 11 × 45 = 495, and so on.
Following the formula, the sum of the last column is 2150, and Σf, the sum of the frequencies, is 50, so the mean of this grouped data is 2150 / 50 = 43.
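A quick numerical check of this grouped mean, a sketch using the frequencies and midpoints from the table:

import numpy as np

f = np.array([6, 18, 11, 11, 3, 1])       # class frequencies
m = np.array([25, 35, 45, 55, 65, 75])    # class midpoints
grouped_mean = np.sum(f * m) / np.sum(f)  # 2150 / 50
print(grouped_mean)                       # 43.0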
(Refer Slide Time: 05:29)
Now we will go to the next central tendency, the weighted average. If you look at the previous calculations, each value was given equal weightage. That is not always the case: some values may carry a higher weightage, and for that case we have to go for the weighted average. Sometimes we want to average numbers but assign more importance, or weight, to some of them; the average we need then is the weighted average.
(Refer Slide Time: 05:58)
So the weighted average is the sum of the products of weight and value divided by the sum of the weights: weighted average = Σ(w × x) / Σw, where x is a data value and w is the weight assigned to that data value. The sums are taken over all data values.
(Refer Slide Time: 06:20)
We will see one application of the weighted average. Suppose your midterm test score is 83 and your final exam score is 95, with weights of 40% for the midterm and 60% for the final exam. Compute the weighted average of your scores; if the minimum average for an A grade is 90, will you earn an A? First we find the weighted average: the midterm mark 83 has weight 0.4 and the final exam mark 95 has weight 0.6. Multiplying and dividing by the sum of the weights, 0.4 + 0.6 = 1, we get (0.4 × 83 + 0.6 × 95) / 1 = 90.2. Since 90.2 crosses 90, you will get the A grade.
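The same calculation as a one-line sketch with NumPy:

import numpy as np

weighted_avg = np.average([83, 95], weights=[0.4, 0.6])   # (0.4*83 + 0.6*95) / (0.4 + 0.6)
print(weighted_avg)                                       # 90.2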
(Refer Slide Time: 07:12)
Now we will go to the next central tendency, the median: the middle value in an ordered array of numbers. It is applicable for ordinal, interval and ratio data. Notice that while the mean is applicable only for interval and ratio data, the median is also applicable for ordinal data; the point to remember is that it is not applicable for nominal data. One advantage of the median is that it is unaffected by extremely large and extremely small values.
(Refer Slide Time: 07:46)
Next we will see how to compute the median; there are 2 procedures. The first procedure: arrange the observations in an ordered array. If there is an odd number of terms, the median is the middle term of the ordered array; if there is an even number of terms, the median is the average of the middle two terms. The other procedure: the median's position in an ordered array is given by (n + 1) / 2, where n is the number of data points.
(Refer Slide Time: 08:15)
We will see this example. I have taken some numbers arranged in ascending order, 3, 4, 5, 7, up to 22. There are 17 terms in the ordered array, so the position of the median is (n + 1) / 2 = (17 + 1) / 2 = 18 / 2 = 9. The median is the 9th term, which here is 15. If 22, the highest number, is replaced by 100, the median is still 15; if 3 is replaced by -103, the median is still 15. This is the advantage of the median over the mean: the median is not disturbed by extreme values.
(Refer Slide Time: 08:59)
Previously the number of items was odd; now let us see the other situation. There are 16 terms in the ordered array, an even number, so the position of the median is (n + 1) / 2 = (16 + 1) / 2 = 8.5. We therefore look between the 8th and 9th terms: the 8th term is 14 and the 9th term is 15, so the median is their average, 14.5. Again, if 21 is replaced by 100 the median is the same 14.5, and if 3 is replaced by -88 the median is still 14.5.
(Refer Slide Time: 09:42)
Now let us see how to find the median of grouped data, that is, when the data is given in the form of a frequency table. In this case the formula for the median is
Md = L + ((N/2 − cfp) / fmedian) × W,
where L is the lower limit of the median class, cfp is the cumulative frequency of the class preceding the median class, fmedian is the frequency of the median class, W is the width of the median class, and N is the total number of frequencies.
(Refer Slide Time: 10:26)
See, this is an example. As I told you, before using the formula we first have to find the median class. When you add the frequencies you get 50: 6 + 18 + 11 + 11 + 3 + 1 = 50, and 50 / 2 = 25. In the cumulative frequency column, look where 25 lies: it is not in the 30-to-40 class, it lies in the 40-to-50 class, because the cumulative frequency reaches 24 at 40 and 35 at 50.
So the median class for this grouped data is 40 to 50. As usual L, the lower limit of the median class, is 40, N is 50, and the cumulative frequency of the preceding interval is 24. So
Md = 40 + ((50/2) − 24) × 10 / 11,
because the class width is 10 and the frequency of the median class is 11. Simplifying, you get 40.909; this is the way to find the median of grouped data.
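A tiny sketch of the same grouped-median arithmetic in Python:

L, N, cfp, f_med, W = 40, 50, 24, 11, 10   # values read off the frequency table
Md = L + ((N / 2 - cfp) / f_med) * W
print(Md)                                  # 40.909...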
(Refer Slide Time: 11:45)
Now the mode: the most frequently occurring value in a data set. It is applicable to all levels of data measurement: nominal, ordinal, interval and ratio. Sometimes the data set may be bimodal; bimodal means the data set has two modes, that is, two numbers are repeated the same (highest) number of times. Multimodal data sets contain more than two modes.
(Refer Slide Time: 12:12)
See this sample data. For this data set the mode is 44, because 44 appears the most times, five times in all. So the mode is 44: there are more 44s than any other value.
(Refer Slide Time: 12:37)
This is the formula for finding the mode of grouped data: Mode = LMo + (d1 / (d1 + d2)) × W, where LMo is the lower limit of the mode class, d1 and d2 are the differences between the frequency of the mode class and the frequencies of the preceding and following classes, and W is the class width. First we have to find the mode class: look at the frequency column, where 18 is the highest frequency, so the corresponding class interval, 30 to 40, is the mode class. Here LMo = 30, d1 = 18 − 6 = 12 (the previous frequency is 6), d2 = 18 − 11 = 7 (the next frequency is 11), and W = 10, so
Mode = 30 + (12 / (12 + 7)) × 10 = 36.31
is the mode of the grouped data. We have studied
mean, median and mode for grouped and ungrouped data. Now the question is: when do we use the mean, when the median, and when the mode? Many times, even though we study mean, median and mode, we are not told exactly when to use each of them.
(Refer Slide Time: 14:00)
For example, look at this data set: it is left skewed data because the tail is on the left hand side. An example would be a very easy question paper, with marks on the x axis and frequency on the y axis: more students get high marks, so an easy paper gives left skewed data. In that case the mean is furthest to the left, the median next, and the mode to the right.
Another example is where the question paper is very tough; this gives right skewed data. Since the paper is tough, more students get lower marks, which is why the skew is on that side. Here the mean is furthest to the right, the median in the middle, and the mode to the left. There is also a third situation in which the data is symmetric, following a bell shaped curve.
After looking at this hypothetical problem, the question arises: when to use the mean, the median or the mode? Look at the location of the median: whether the data is left skewed or right skewed, the median is always in the middle. So whenever the data is skewed you should go for the median as the central tendency; if your data follows a bell-shaped curve you can use the mean, the median or the mode, there is no problem at all.
The clue for choosing the correct central tendency is therefore to first plot the data. After plotting the data you get an idea of the skewness of the data set: how it is distributed, whether it is right skewed, left skewed or bell shaped. If the data is skewed, go for the median as the central tendency; if it follows a bell-shaped curve, go for the mean, median or mode.
(Refer Slide Time: 16:39)
Now we go to the next one, the percentile. You might have seen that CAT or GATE examination performance is expressed in terms of percentiles, not percentages, because the percentile has an advantage over the percentage: the percentage is an absolute measure while the percentile is a relative one. The measures that divide a group of data into 100 parts are called percentiles.
For example, if somebody says "my score is the 90th percentile", it indicates that at most 90% of the data lie below it and at least 10% of the data lie above it. The median and the 50th percentile have the same value. Percentiles are applicable for ordinal, interval and ratio data, but not for nominal data.
(Refer Slide Time: 17:44)
We will see an example of how to compute a percentile. The first step is to organize the data into an ascending ordered array and calculate the pth percentile location. Suppose I want to know the location of the 30th percentile: I have to find the value i = (P / 100) × n, where n is the number of data points; i is nothing but the percentile's location index.
If i is a whole number, the percentile is the average of the values at the i-th and (i + 1)-th positions. If i is not a whole number, the percentile is at the whole-number part of i + 1 in the ordered array.
(Refer Slide Time: 18:35)
Look at this example. The raw data is given as 14, 12, 19 and so on up to 17; I have arranged it in ascending order, the lowest value being 5 and the highest 28. Suppose I want to know the 30th percentile. First I find i = (30 / 100) × 8 = 2.4; i is the location index explained on the previous slide. Since i is not a whole number, I take i + 1 = 2.4 + 1 = 3.4, and the whole-number portion of 3.4 is 3. So the 30th percentile is at the 3rd location of the array. The value at the 3rd location is 13, which means a person who scored 13 marks is at the 30th percentile.
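A small sketch implementing this location rule; the array below is a hypothetical stand-in consistent with the values mentioned in the example (lowest 5, highest 28, third value 13):

def percentile_by_location(sorted_data, p):
    # location rule from the slide: i = (p / 100) * n
    n = len(sorted_data)
    i = (p / 100) * n
    if i == int(i):                                   # whole number: average of positions i and i+1
        return (sorted_data[int(i) - 1] + sorted_data[int(i)]) / 2
    return sorted_data[int(i + 1) - 1]                # otherwise: whole-number part of i+1

marks = [5, 12, 13, 14, 17, 19, 25, 28]               # hypothetical ordered data, n = 8
print(percentile_by_location(marks, 30))              # 13, matching the worked example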
(Refer Slide Time: 19:26)
So far we have talked about the different central tendencies; now we move to measuring dispersion. Measures of variability describe the spread, or dispersion, of a set of data. Dispersion indicates the reliability of a measure of central tendency, because many times the central tendency alone can mislead people; the reliability of a central tendency is judged by its corresponding dispersion.
It is also used to compare the dispersion of various samples. That is why, whenever you report the data, you should show not only the mean but also the dispersion, because the reliability of the mean is explained by the dispersion.
(Refer Slide Time: 20:14)
Look at this data. The first case shows no variability in cash flow, and the second case shows a lot of variability, yet the mean is the same in both. If you look only at the mean, the two data sets look identical; but when you look at the values themselves, the second data set has much more variability. The quality of the mean is explained by its variability, which is nothing but the dispersion.
(Refer Slide Time: 20:51)
There are different measures of variability: the range, the interquartile range, the mean absolute deviation, the variance, the standard deviation, z scores and the coefficient of variation. We will see them one by one.
(Refer Slide Time: 21:07)
Suppose there is ungrouped data, as shown here, and you have to find the range. The range is nothing but the difference between the largest and the smallest value in a set of data. It is very simple to compute; the problem is that it ignores all data points except the two extremes. In this data set the largest value is 48 and the smallest is 35, so the range is 48 − 35 = 13. You see that only two values are used; the values in between are not taken into consideration when finding the range.
(Refer Slide Time: 21:46)
The range is a quick estimate of the dispersion of a set of data. Next we go to quartiles: the quartiles are the measures that divide a group of data into 4 subgroups, and we call them Q1, Q2 and Q3. 25% of the data set lies below the first quartile Q1, 50% below the second quartile Q2, and 75% below the third quartile Q3. So Q1 is the 25th percentile and Q2 is the 50th percentile, which is nothing but the median; this is a very important point, Q2 is the median. Q3 is the 75th percentile. Another point is that the quartile values are not necessarily members of the data set.
(Refer Slide Time: 22:34)
You see this: Q1 marks the first 25% of the data set, Q2 the first 50%, and Q3 the first 75%. The quartiles simply divide the whole data set into 4 groups: the first 25%, the second 25%, the third 25% and the last 25%.
(Refer Slide Time: 22:59)
Here is an example of finding the quartiles. Suppose the data 106, 109 and so on is given and we have arranged it in ascending order. First we find Q1, which as I said is the 25th percentile, so we need the location of the 25th percentile. The location index is i = (25 / 100) × 8 = 2. Since 2 is a whole number, as explained previously we take the average of the value at that position and the next one: the 2nd value is 109 and the 3rd is 114, so Q1 = (109 + 114) / 2 = 111.5.
Q2 is the 50th percentile: (50 / 100) × 8 = 4, again a whole number, so we average the 4th value, 116, and the 5th value, 121, giving (116 + 121) / 2 = 118.5; Q2, our median, is 118.5. Then for Q3, (75 / 100) × 8 = 6, again a whole number, so we average the 6th and 7th values: (122 + 125) / 2 = 123.5.
(Refer Slide Time: 24:28)
This is the way to calculate Q1, Q2 and Q3. The next term is the interquartile range: the dispersion in the data set is measured with the formula IQR = Q3 − Q1. As we know, Q3 is the 75th percentile and Q1 the 25th percentile, so the range of values between the first and third quartiles is called the interquartile range; it is the range of the middle 50% of the data. Why do we use the interquartile range? Because it is less influenced by extreme values: it ignores the very low and very high values and uses only the middle of the data, which is not affected by extremes. For that purpose we use the interquartile range.
(Refer Slide Time: 25:15)
Now we will go to deviations from the mean. The data set given is 5, 9, 16, 17 and 18; to find the deviations from the mean we first find the mean, which is 13. On the graph, 13 is marked as a line. For the first value, 5, the difference is 5 − 13 = −8; that distance is the first deviation. The second value is 9, so 9 − 13 = −4. These deviations are shown by the lines in the figure.
Notice that there are negative deviations and positive deviations, and if we simply add the deviations the sum will in general be 0. That is why we go for the mean absolute deviation.
(Refer Slide Time: 26:12)
You see the X values given here: two values give negative deviations and three give positive deviations. When you add them the sum becomes 0, so we cannot measure the dispersion this way. One remedy is to remove the negative signs and take only the absolute values. When you take the absolute values the sum is 24, and 24 / 5, since there are 5 data points, = 4.8, which is called the mean absolute deviation. It is the average of the absolute deviations from the mean, MAD = Σ|X − µ| / N.
(Refer Slide Time: 26:46)
There is a problem with the mean absolute deviation; I will tell you what it is shortly. Next we will see the population variance: it is the average of the squared deviations from the arithmetic mean, σ² = Σ(X − µ)² / N. As before, the X values are there and the mean is there; when you simply add the deviations you get 0, and in the mean absolute deviation we handled this by taking only the absolute values. Now instead we square the deviations, and squaring has some advantages.
One advantage is that we remove the negative sign; the second is that larger deviations are weighted more heavily when you square them. For example, −4 squared is 16, while −8 squared is 64.
The next one is the population standard deviation. We already have the variance, but the variance is a squared quantity, which makes comparison awkward. If two numbers such as 12 and 13 are given, we can intuitively say which is higher and which is smaller; but if the squared numbers 144 and 169 are given, the comparison is not in the original units. We want the measure in the actual units of the data, so for comparison purposes we take the square root of the variance, σ = √(σ²).
(Refer Slide Time: 28:25)
So 5.1 is the population standard deviation for that data set. Next we go to the sample variance: the formula is the same except that we divide by n − 1, s² = Σ(X − X bar)² / (n − 1).
(Refer Slide Time: 28:37)
Why do we divide by n − 1? The reason is to make the sample variance an unbiased estimator of the population variance. It is a matter of degrees of freedom: since we have already used the data to compute the mean, we lose one degree of freedom. So it is very important that whenever you compute the sample variance, the denominator is n − 1. For the data in this example the sample variance works out to 221,288.67.
(Refer Slide Time: 29:06)
Next is the sample standard deviation: just take the square root of the sample variance, which gives 470.41. The square root of the variance is nothing but the standard deviation.
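A small sketch contrasting the population and sample versions in NumPy, using the earlier data set 5, 9, 16, 17, 18:

import numpy as np

x = np.array([5, 9, 16, 17, 18])
print(np.var(x), np.std(x))                  # population variance 26.0, population std ~5.1
print(np.var(x, ddof=1), np.std(x, ddof=1))  # sample variance 32.5 (divides by n-1), sample std ~5.7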
(Refer Slide Time: 29:17)
Why do we study the standard deviation? Because the standard deviation is an indicator of, for example, financial risk: the higher the standard deviation, the more the risk; the lower the standard deviation, the less the risk. In a quality control context, when we manufacture something, say in plant A and plant B or shift A and shift B, the lesser the variance, the higher the quality of the product. The same applies to process capability: lesser variance means higher process capability. It is also used for comparing populations, say the household income of 2 cities or employee absenteeism in 2 plants: wherever there is a lesser standard deviation, the data set is more homogeneous.
(Refer Slide Time: 30:12)
Look at this example of µ and σ for two financial securities, A and B. The return rate is 15% for both, so they give equal returns, but look at the standard deviation σ, which in a financial context measures the risk: the first is 3% and the second is 7%, so security B carries the higher risk. Since the mean is the same, we always go for the one with the lesser standard deviation.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee
Lecture No 5
Central Tendency and Dispersion
Good morning students; today we move to lecture number 5, continuing Central Tendency and Dispersion from where we stopped in the previous lecture. What we are going to see today is, first, one important property of the normal distribution; second, skewness and kurtosis; and then the box and whisker plot, which is a different way of looking at the dispersion.
(Refer Slide Time 00:51)
See this empirical rule, which applies if the histogram is bell shaped. Look at this normal distribution, the bell shaped curve. The yellow band says that if you travel a distance of 1 sigma on either side of the mean, you cover 68% of all observations. For the second one, travelling 2 sigma on either side of the mean covers 95% of all observations. The third: travelling 3 sigma on either side of the mean of a normal distribution covers 99.7% of all observations. This is the very important empirical rule.
Even though you will study the normal distribution and its properties in detail in the coming lectures, I wanted to introduce this idea now because it will be very useful later; we can say the normal distribution is the father of all the distributions. If you have any doubt about the nature of a distribution, if you are not really sure what distribution the data follow, you can assume a normal distribution. But there is a limitation of this empirical rule: it is applicable only to bell shaped curves. There may be situations where the actual phenomenon does not follow a bell shaped curve, and then the empirical rule will not work. So we will go to another rule in the next slides.
(Refer Slide Time 02:24)
As I showed in the previous slide, within µ ± σ you cover 68% of all observations, within µ ± 2σ you cover 95%, and within µ ± 3σ you cover 99.7%. Actually this 1, 2, 3 is nothing but Z; I will tell you in coming classes what Z is.
(Refer Slide Time 02:57)
Previously we saw the property of the normal distribution, the bell shaped curve. Sometimes a phenomenon does not follow the bell shaped curve; then you cannot use the property we just studied, and you have to go for another formula to find how many observations are covered within 1σ, 2σ and 3σ of the mean. This idea was given by Chebysheff and is called Chebysheff's theorem.
A more general interpretation of the standard deviation is derived from Chebysheff's theorem, which applies to histograms of all shapes, not just the bell shape. The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1 − (1 / k²).
How do we read this formula? Suppose a phenomenon does not follow the normal distribution and you want to know what percentage of observations is covered within 2σ of the mean. Substituting k = 2, 1 − (1 / k²) = 1 − 1/4 = 3/4, and 3/4 means 75%. So if you travel 2σ on either side of the mean for a distribution that is not normal, you are guaranteed to cover at least 75% of all observations.
Previously, for the normal case, it was 95%. For k = 2 the theorem states that at least 3/4 of all observations lie within two standard deviations of the mean, which is a lower bound compared with the empirical rule's 95%. So when the data does not follow a normal distribution we can use Chebysheff's theorem instead.
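A tiny sketch of the Chebysheff bound for a few values of k:

for k in (2, 3, 4):
    bound = 1 - 1 / k**2                    # minimum proportion within k standard deviations
    print(k, round(bound * 100, 1), '%')    # 75.0 %, 88.9 %, 93.8 %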
(Refer Slide Time 05:14)
We will refer to these two properties, the 1 sigma, 2 sigma and 3 sigma distances, many times in coming classes; for now I just wanted to give an idea about the normal distribution, and we will go into detail later. The next way to measure dispersion is the coefficient of variation: the ratio of the standard deviation to the mean, expressed as a percentage, CV = (σ / µ) × 100. It is a measure of relative dispersion. We already have the standard deviation, so what is the purpose of the coefficient of variation? We will see that in the next slide.
(Refer Slide Time 05:53)
Look at this: stock A and stock B (or stock 1 and stock 2) with µ1 = 29, σ1 = 4.6 and µ2 = 84, σ2 = 10; suppose we have to choose which is better. If you compare only the means, 29 versus 84, the second stock is better. If you compare the standard deviations, 4.6 versus 10, the lower the standard deviation the better, so the first stock is better. Now there is a contradiction: with respect to the mean, stock B is better; with respect to the standard deviation, stock A is better.
To resolve this we need a trade-off, and in this situation we go for the coefficient of variation, CV = (σ / µ) × 100. For the first case, CV1 = (4.6 / 29) × 100 = 15.86%; for the second, CV2 = (10 / 84) × 100 = 11.90%. The lower the coefficient of variation, the better the option, so the stock with the smaller relative variation, stock B, is chosen.
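The same comparison as a short sketch:

mu1, sigma1 = 29, 4.6
mu2, sigma2 = 84, 10
cv1 = sigma1 / mu1 * 100    # ~15.86 % relative dispersion for stock A
cv2 = sigma2 / mu2 * 100    # ~11.90 % relative dispersion for stock B
print(cv1, cv2)             # the lower coefficient of variation is preferred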
(Refer Slide Time 07:18)
Now we will see how to find the variance and standard deviation of grouped data. We have already seen the standard deviation of raw (ungrouped) data, both for a population and for a sample. Similarly, for grouped data the population variance is σ² = Σ f (M − µ)² / N, where M is the class midpoint, and for the sample variance the denominator is again n − 1: s² = Σ f (M − X bar)² / (n − 1).
We will see this example. Here grouped data is given in the form of a table: these are the intervals and these are the frequencies. Between 20 and 30 there are 6 values, between 30 and 40 there are 18, between 40 and 50 there are 11, between 50 and 60 there are 11, between 60 and 70 there are 3, and so on. First we find the midpoints of the intervals: 25, 35, 45, 55, 65, 75. Then we multiply f by the midpoint of each interval, giving 150, 630, and so on, and obtain the grouped mean µ = 43 as before.
Next we find M − µ for each class: 25 − 43 = −18, 35 − 43 = −8, and so on. Then we square these values and multiply each by its frequency f, giving f(M − µ)². When you add them up you get Σ f (M − µ)² = 7200, and N, the sum of the frequencies, is 50. So 7200 / 50 = 144 is the population variance of this grouped data. If you want the standard deviation of this grouped data, just take the square root of the variance, which is 12.
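A short numerical check of the grouped variance, reusing the frequencies and midpoints from the table:

import numpy as np

f = np.array([6, 18, 11, 11, 3, 1])
m = np.array([25, 35, 45, 55, 65, 75])
mu = np.sum(f * m) / np.sum(f)               # 43.0
var = np.sum(f * (m - mu) ** 2) / np.sum(f)  # population variance of the grouped data
print(var, np.sqrt(var))                     # 144.0 and 12.0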
The next aspect is the shape of a set of data, that is, what distribution it follows. We can look at the skewness, the kurtosis and the box and whisker plot; these are three methods. First, what is skewness? Skewness is the absence of symmetry: as I told you, the data may be left skewed, right skewed or symmetric, and this absence of symmetry is captured by the skewness. The application of skewness is to find out the nature of the shape, whether it is skewed or symmetric.
Next is kurtosis, the peakedness of a distribution. There are three types: leptokurtic, mesokurtic and platykurtic. Leptokurtic means high and thin, mesokurtic is the normal in-between shape, and platykurtic is flat and spread out.
Then we have the box and whisker plot, a graphical display of a distribution that reveals skewness. The application of the box and whisker plot is to check whether the data are symmetric and, if not, what the nature of the skewness is.
(Refer Slide Time 10:58)
See the skewness pictures. The left one, in orange, is negatively skewed: as I told you in the previous lecture, skewness is named by looking at the tail, and here the tail is on the left hand side, so it is left skewed or negatively skewed. Come to the blue one on the extreme right: its tail is on the right hand side, so it is right skewed or positively skewed. In the middle one there is no skewness, so it is symmetric.
(Refer Slide Time 11:29)
The skewness of a distribution can be judged by comparing the relative positions of the mean, median and mode. If the distribution is symmetrical, mean = median = mode. If the distribution is skewed right, the median lies between the mode and the mean, and the mode is less than the mean; in the figure, the mode is at the peak on the left, the median in the middle and the mean furthest to the right. If the distribution is skewed left, it is the mirror image: the median again lies between the mode and the mean, but the mode is greater than the mean. As I told you, if you want to choose the central tendency for your distribution you should check its nature: if it is skewed right or left, go for the median, because the median is always in the middle of the distribution irrespective of the skewness.
(Refer Slide Time 12:58)
This slide shows the same thing as just explained: for a negatively skewed distribution the mean is to the left, then the median, then the mode; for a positively skewed distribution the order is reversed; and in the middle one there is no skewness.
(Refer Slide Time 13:11)
How to find the coefficient of skewness? A summary measure of skewness (Pearson's coefficient) is
S = 3 (µ − Md) / σ.
We will see an example. µ1 is 23, median1 is 26, σ1 is 12.3; applying the formula, S1 = 3 × (23 − 26) / 12.3 = −0.73, which is negative, so the distribution is negatively skewed. For the middle one, µ2 = 26 and median2 = 26, so 26 − 26 = 0 and S2 = 0; for this distribution the skewness is 0, that is, it is symmetric. For the right one, µ3 = 29, the median is 26, σ3 is 12.3, and substituting we get a positive value for S3, so the skewness is positive.
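A short sketch of Pearson's skewness coefficient for the three cases on the slide:

def pearson_skew(mean, median, std):
    # Pearson's second coefficient of skewness
    return 3 * (mean - median) / std

print(pearson_skew(23, 26, 12.3))   # about -0.73, negatively skewed
print(pearson_skew(26, 26, 12.3))   # 0.0, symmetric
print(pearson_skew(29, 26, 12.3))   # about +0.73, positively skewed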
(Refer Slide Time 14:20)
Kurtosis, as I told you, is the peakedness of a distribution. A leptokurtic distribution is high and thin, which means a highly homogeneous distribution in which the values are very close together. The second, mesokurtic, is the normal in-between shape. The last one, platykurtic, is flat and spread out.
(Refer Slide Time 14:48)
Next we will go to the box and whisker plot. There are five positions in the box and whisker plot: the median Q2, the first quartile Q1, the third quartile Q3, and then the minimum and maximum values in the data set.
(Refer Slide Time 15:05)
We will see it here: this point is the minimum value, the box runs from Q1, the first quartile, through Q2 to Q3, the third quartile, and then the line extends up to the maximum. Why is it called a box and whisker plot? Because the lines extending from the box look like the whiskers of a cat; hence box and whisker plot.
(Refer Slide Time 15:25)
You see how the skewness can be identified with the help of the box and whisker plot: by looking at the position of the middle line (the median) inside the box, we can characterize the distribution. If it sits towards the right side of the box, the data is left skewed; if towards the left side, the data is right skewed; and if it is exactly in the middle, the distribution is symmetric, as a normal distribution would be. So far we have covered the theory of the various central tendencies and dispersion measures.
Now I am going to switch over to Python. For whatever theory we have covered here, I am going to explain how to use Python to get the central tendencies, the skewness, the box and whisker plot and the various dispersion measures. So we will go to the Python mode.
(Video Starts: 16:26)
Okay, now we come to the Python environment. First, as I told you, we have to import pandas as pd; pd is only a convenient alias for the pandas library. Then we import NumPy, numerical Python, as np. So the first step is to import the required libraries. The second step is to import the data set; I already know the path of the data set, and its name is IBM_313_marks.xlsx.
I am going to save it in an object called table: table = pd.read_excel with the path of the 'IBM-313' file, where read_excel is the command for reading an Excel file; otherwise you can simply type the file name there. Now print the table and let us see the data. Look at this: there is a serial number, MTE which is the midterm examination marks, mini project, total, the end term examination marks and the total marks.
This total is out of 100 and this one is out of 50, and you see the index starts from 0, 1, 2, 3. Now I want to find the mean of the total, that is, of the end term examination. So x = table[...], the object name followed by square brackets in which you write the column name; that means I take only the column named total and store those values in the variable x. Now x is nothing but the last column, and if you want its mean you use np.mean.
By the way, if you type np. and press Tab you get the various options available under np: maximum, minimum, mean, median and so on, so you need not remember them all, you can check them one by one. So we go to np.mean, call it on the variable x and execute with Shift+Enter; we get the value 46.90 as the average marks.
Next, the median: np.median(x) gives 45. Then we go for the mode. For that you have to import from SciPy: from scipy import stats; stats is another library. Then stats.mode is called on the variable x and we see what the mode is. The mode is the most frequent value: suppose five students got the same mark of 30, then the mode would be 30. Okay, we will come back to this later.
Next we go to the percentile. Here I introduce another command, np.array: a = np.array([1, 2, 3, 4, 5]) just creates an array. Then p = np.percentile(a, 50): what do I want to know? The 50th percentile, that is, which value in this array is the 50th percentile. Executing print(p), we get 3 as the 50th percentile, which is nothing but the median. This array is very small, just for illustration; you can take a large one and run it the same way.
then you can run it. Then now, we will go to another command in Python is for loop. For loop is
suppose I have taken a variable k saved three variables, one is Ram, Seeing the characters in the
code 65, 2.5. Suppose if I print k, what will happen. You see that it is printing Ram, 65, 2.5. But
there is a requirement that I have to print one by one.
First I have to print Ram, and then I have to print 65, and then print 2.5. Now here at a time I am
getting all the answer but I want to print one by one. So for that purpose, see that for i in k. This
is the syntax, there should be a colon, that is one print i. So what will happen first in k this is
array. So first, for i in k will take the value Ram, second i will take the value of 65. Third, i will
take the value of 2.5.
Now if we execute this print i, see that one by one. So this is the one by one, I am getting this
output. So this is the example of for loop. So for i in k. the k is in which variable. So the i value
will change. if you want to print i. So the first it print Ram and then 65 and 2.5, because why I
122
am showing that we are going to use this for loop incoming examples. So I want to give an idea
about how to use for loop in Python.
Now we go to range: for i in range(10, 20, 2). In the range function the first argument is the starting value, the second is the ending value and the third, 2, is the increment; then print(i). If we run it, we get 10, 12, 14, 16, 18, incremented by 2 and ending before (excluding) 20. Now it prints one value per line, but suppose I want to print the values on one line separated by commas, 10, 12, 14 and so on; for that, in the same print command I use end=','. If I run it, we get 10,12,14,16,18 on one line; the end=',' argument is what gives the output in this horizontal way. Next we go to functions in Python, which have very useful applications.
There are some built-in functions: for example, print is a built-in function, and so are max and min. You can also create your own functions and then call them wherever required. The syntax is def greet(): ending with a colon, then print('Hi') and print('good evening'); this is the way of defining your own function.
After defining the function you have to call it. Suppose I call greet(); when I execute it, the function runs and prints Hi and good evening. Another example: suppose I want to add two numbers, so def add(p, q): — the names can be anything, but the colon is important, otherwise it will show a syntax error — then c = p + q and print(c). If I call this function as add(6, 4) I get 10; with other numbers, say add(10, 4), I get 14. We have seen how to create a function; now let us find the minimum and maximum values in
the data set. Suppose I create a new array: data = [1, 3, 4, 463, 2, 3, 6], taken randomly. I want to see the minimum and the maximum value in this array. The minimum value is 1 and the maximum is 463; the commands for that are min and max. Now, instead of typing min(data) and max(data) every time, I can wrap these built-in functions in my own function and call it whenever needed. So with the same data, 1, 3, 4, 463 and so on,
I define a function min_and_max(data): inside it, minimum_value = min(data) and maximum_value = max(data), and then return, because I want to get the output back. Here the indentation is important: the lines inside the function must have the same indentation, generally a Tab, which is equivalent to four spaces. So return minimum_value, maximum_value; we will run it and see what happens.
I get the output by calling the function: min_and_max(data), because my function name is min_and_max, and it returns 1 and 463. Functions are very useful in Python because, when you are writing a large program, you need not repeat routine steps every time; you simply call the function whenever it is required, which saves a lot of your time and energy.
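A minimal sketch of the user-defined function described above; the names follow the lecture's wording, though the notebook itself is not shown.

def min_and_max(data):
    # wrap the built-in min() and max() in one reusable function
    minimum_value = min(data)
    maximum_value = max(data)
    return minimum_value, maximum_value

data = [1, 3, 4, 463, 2, 3, 6]
print(min_and_max(data))       # (1, 463)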
Now suppose I want to know the range of the data; the range is nothing but the maximum value minus the minimum value. For that I define a function named rangef (you can give it any name):
def rangef(data):
    minimum_value = min(data)
    maximum_value = max(data)
    return (maximum_value - minimum_value)
If I call rangef(data), I get 462, which is nothing but 463 - 1.
Now we will go to quartiles. We have already seen the quartiles Q1, Q2, Q3; Q1 is the 25th percentile, and there is an inbuilt function for it in NumPy. I create an array, a = np.array([1, 2, 3, 4, 5]), and then Q1 = np.percentile(a, 25); np.percentile gives you the required percentile. If I execute this, the 25th percentile of this array is 2. In the same way, np.percentile(a, 50) gives Q2, which is 3; this is the median, and since the array has an odd number of elements the middle value 3 is obviously the 50th percentile. For the third one, Q3 = np.percentile(a, 75) = 4, which is the 75th percentile.
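A short sketch of the quartile calculation just described, using the same five-element array.

import numpy as np

a = np.array([1, 2, 3, 4, 5])
q1 = np.percentile(a, 25)   # 2.0
q2 = np.percentile(a, 50)   # 3.0, the median
q3 = np.percentile(a, 75)   # 4.0
print(q1, q2, q3, q3 - q1)  # the last value is the inter-quartile range, 2.0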
Next, spread is measured in terms of the inter-quartile range. As we already know, it is Q3 - Q1, so here the IQR is 4 - 2 = 2. Now we will see how to find the variance. There are two variances: one is the sample variance and the other is the population variance. In NumPy, np.var(x) gives the population variance of x. Here x is the Total column of the data, which we have saved in a variable called x, and the variance comes out as 262.781. There is another option: import the statistics library. Then statistics.pstdev(x) gives the population standard deviation, which is 16.2105, and statistics.stdev(x) gives the sample standard deviation. The only thing to remember is that, in the statistics library, you must use the function starting with p if you want the population value; otherwise, by default, the statistics library gives the sample standard deviation, whereas np.std gives the population standard deviation.
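A sketch of these dispersion calculations; x stands for the Total column mentioned in the lecture, which is not reproduced here, so placeholder values are used for illustration.

import numpy as np
import statistics

x = [43, 50, 62, 30, 48]        # placeholder for the lecture's Total column

print(np.var(x))                # population variance (ddof=0 by default)
print(statistics.pstdev(x))     # population standard deviation
print(statistics.stdev(x))      # sample standard deviation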
Next is the skewness. For that you have to import from SciPy: from scipy.stats import skew, and then skew(x) gives the skewness of x. The skewness is a positive value, so the data is right skewed. Next we will go to the box and whisker plot; for drawing plots you have to use Matplotlib, so import matplotlib.pyplot as plt. There is an inbuilt function plt.boxplot(x, sym='*'), and plt.show() displays it. Executing this we get the box and whisker plot. The star symbols indicate that those data points are outliers; an outlier is a value that goes beyond the maximum or below the minimum whisker.
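A minimal sketch of the skewness and box-plot steps, again with placeholder data since the lecture's data set is not reproduced here.

import matplotlib.pyplot as plt
from scipy.stats import skew

x = [43, 50, 62, 30, 48, 120]   # placeholder data with one large value

print(skew(x))                  # a positive value indicates right-skewed data

plt.boxplot(x, sym='*')         # outliers are drawn with the '*' symbol
plt.show()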
The position of the middle line (the median) inside the box helps you identify the nature of the distribution: if it lies toward the left side, the data is right skewed. Look at this plot; the median is a little to the left, so the data is positively (right) skewed. With that we stop the discussion of central tendency. What we have seen so far are the various measures of central tendency and the different ways of measuring dispersion.
(Video Ends: 31:40)
Whatever we learnt in the theory part, we ran in Python and obtained the answers. There are many sources available on the internet, different courses and videos, which you can also refer to for this class. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 06
Introduction to Probability - I
Good morning, students. Today we move to lecture number 6, an introduction to probability. The concept of probability is fundamental in every field, whether you call it statistics, analytics, or data science; if you look at the titles of many books, they read "Probability and Statistics", because the two concepts cannot be separated and always go together. In statistics we take a sample and predict about the population, so whatever we say with the help of a sample must always have some probability attached to it, because we can never be 100% sure that what the sample tells us will hold exactly. Whenever a prediction is made, a probability has to be attached to it. Today's lecture is only an introduction to probability; I am not going to teach it in full detail.
(Refer Slide Time: 01:21)
I will teach only the ideas that are important for us. The lecture objectives are: to comprehend the different ways of assigning probability; to understand and apply marginal, union, joint, and conditional probabilities; to solve problems using the laws of probability, including the laws of addition, multiplication, and conditional probability; and to use a very important theorem, the Bayes rule, to revise probabilities. These are my lecture objectives.
(Refer Slide Time: 01:51)
We will start with the definition of probability. Probability is the numerical measure of the likelihood that an event will occur. The probability of any event A must lie between 0 and 1, inclusive, and the sum of the probabilities of all mutually exclusive, collectively exhaustive events is 1 (later I will explain what mutually exclusive and collectively exhaustive mean). So the summation of the probabilities is always equal to 1; for example, P(A) + P(B) + P(C) = 1 when A, B, and C are mutually exclusive and collectively exhaustive events.
(Refer Slide Time: 02:32)
This is the range of probability: for an impossible event the probability is 0, for a certain event the probability is 1, and for a 50% chance the probability is 0.5. The point is that probability always lies between 0 and 1.
(Refer Slide Time: 02:50)
There are three methods of assigning probability: the classical method, based on rules and laws; the relative frequency of occurrence, based on cumulative historical data; and subjective probability, based on personal intuition or reasoning. Let us see how to find the probability using these three methods.
(Refer Slide Time: 03:11)
First is classical probability: the number of outcomes leading to an event divided by the total number of possible outcomes. Each outcome is equally likely, so there is an equal chance of getting each of the different outcomes. It is determined a priori, that is, before performing the experiment we already know which outcomes can occur; for example, when we toss a coin there are two possibilities, head or tail, and we know them in advance. It is applicable to games of chance, and it is objective: everyone correctly using this method assigns an identical probability, so everyone gets the same answer for a problem because the possible outcomes are known in advance.
(Refer Slide Time: 04:00)
So, mathematically, P(E) = ne / N, where N is the total number of possible outcomes and ne is the number of outcomes in E.
(Refer Slide Time: 04:15)
Relative frequency probability is based on historical data; it is computed after performing the experiment as the number of times an event occurred divided by the number of trials, which is nothing but a frequency divided by the sum of frequencies. Here also, everyone correctly using this method assigns an identical probability, because the data are already known.
(Refer Slide Time: 04:42)
The formula is the same: P(E) = ne / N, where ne is the number of times the event occurred and N is the total number of trials.
(Refer Slide Time: 04:50)
Then there is subjective probability, which comes from a person's intuition or reasoning. Subjective means that different individuals may correctly assign different numerical probabilities to the same event; it is a degree of belief. Sometimes subjective probability is useful: for example, if you introduce a new product and want to know its probability of success, you can ask an expert; the same applies to a new movie or a new project. Based on intuition or experience, the person can give some probability of success or failure. Other examples are site-selection decisions and sporting events, for instance the chance of one cricket team winning; these are intuitive probabilities.
(Refer Slide Time: 05:42)
In this course there is certain terminology with respect to probability that you have to know; it is very fundamental, and even though you might have studied it in previous classes, this is just to recollect it. We will revise the following terms: experiment, event, elementary event, sample space, union and intersection, mutually exclusive events, independent events, collectively exhaustive events, and complementary events.
(Refer Slide Time: 06:07)
First, what are an experiment, a trial, an elementary event, and an event? An experiment is a process that produces outcomes; there may be more than one possible outcome, but only one outcome per trial. A trial is one repetition of the process. An elementary event is an event that cannot be decomposed or broken down into other events. An event is an outcome of an experiment; it may be an elementary event or an aggregate of elementary events, and it is usually represented by an uppercase letter, for example A or E.
(Refer Slide Time: 06:48)
Let us look at an example. The table gives the population of a small town with four families, A, B, C, and D. We asked two questions: whether there are children in the household, and how many automobiles the family owns. Family A has children and owns 3 automobiles; family B has children and owns 2 automobiles, and so on. With the help of this table we will try to understand these terms. The experiment is, for example, to randomly select, without replacement, two families from the residents of the town. An elementary event is one particular outcome, for example the sample that includes families A and C. Events are statements such as "each family in the sample has children in the household" or "the sample families own a total of 4 automobiles". For the second event, for example, A and C together own 4 automobiles and B and D together own 4 automobiles, while A and D own more than 4.
(Refer Slide Time: 08:07)
That is an example of an event. What is the sample space? The set of all elementary events for an experiment is called the sample space. If you roll a die, you can get 1, 2, 3, 4, 5, or 6; these form the sample space. There are different methods of describing a sample space: listing, a tree diagram, set-builder notation, and a Venn diagram. We will see each of them.
(Refer Slide Time: 08:30)
First, listing. The experiment is to randomly select, without replacement, two families from the residents of the town, so each ordered pair in the sample space is an elementary event, for example (D, C). The different possibilities are (A, B), (A, C), (A, D), (B, A), (B, C), (B, D), (C, A), (C, B), (C, D), and so on; this listing is the sample space for selecting two families from the residents.
(Refer Slide Time: 08:58)
Here we sample without replacement: once family A is taken, we do not select A again, and similarly for B. Another way to express the same sample space is with the help of a tree diagram, which is very useful and easy to understand. With the four families A, B, C, D we can have the combinations (A, B), (A, C), (A, D), (B, A), (B, C), (B, D), (C, A), (C, B), (C, D), (D, A), (D, B), (D, C); the tree diagram shows these sample points in an easily readable form.
(Refer Slide Time: 09:55)
Now the set-builder notation for a random sample of two families: S = {(X, Y) : X is the family selected on the first draw, and Y is the family selected on the second draw}. It is a concise description for large sample spaces; in mathematics this kind of notation is used.
(Refer Slide Time: 10:15)
The sample space can also be shown as a Venn diagram: each dot in the diagram represents a different point of the sample space.
(Refer Slide Time: 10:26)
Next is the concept of the union of sets. The union of two sets contains one instance of each element of the two sets. For example, X = {1, 4, 7, 9} is one set and Y = {2, 3, 4, 5, 6} is another; X union Y is {1, 2, 3, 4, 5, 6, 7, 9}. In the Venn diagram the union is the combined region of both sets. Another example: C = {IBM, DEC, Apple} and F = {Apple, Grape, Lime}; the union of C and F is {IBM, DEC, Apple, Grape, Lime}; Apple appears in both sets, but we take it only once.
(Refer Slide Time: 10:15)
The intersection contains only the common elements: X intersection Y is {4}. Similarly, for C = {IBM, DEC, Apple} and F = {Apple, Grape, Lime}, C intersection F, the common element between the two sets, is {Apple}; in the Venn diagram the overlapping portion is the intersection.
(Refer Slide Time: 11:54)
Next, mutually exclusive events. Events with no common outcomes are called mutually exclusive: the occurrence of one event precludes the occurrence of the other. For example, C = {IBM, DEC, Apple} and F = {Grape, Lime}; C intersection F has nothing in common, so it is the null set. Similarly, for X = {1, 7, 9} and Y = {2, 3, 4, 5, 6}, X intersection Y is empty; in the Venn diagram the two sets do not overlap, so they are mutually exclusive. Another example: when we toss a coin there are two possible outcomes, head or tail, and we cannot get both at the same time, so the two events are mutually exclusive.
(Refer Slide Time: 12:46)
Then, independent events: if the occurrence of one event does not affect the occurrence or non-occurrence of the other, the events are independent. In that case the conditional probability of X given Y equals the marginal probability of X, and the conditional probability of Y given X equals the marginal probability of Y. So one way to test for independence is to check whether P(X|Y) = P(X) and P(Y|X) = P(Y); if so, X and Y are independent events. We will go into detail later with the help of an example.
(Refer Slide Time: 13:28)
A set of collectively exhaustive events contains all elementary events of an experiment; for example, E1, E2, E3 may form a sample space with three collectively exhaustive events. If you roll a die, the outcomes 1, 2, 3, 4, 5, 6 together are collectively exhaustive.
(Refer Slide Time: 13:52)
Then complementary events: all elementary events that are not in A form its complement A', so P(A') = 1 – P(A). Next, counting the possibilities: in probability, many different combinations can arise, and the following rules are very useful for counting them. The first is the mn rule, the second is sampling from a population with replacement, and the third is sampling from a population without replacement.
(Refer Slide Time: 14:28)
First the mn rule: if one operation can be done in m ways and a second operation can be done in n ways, then there are mn ways for the two operations to occur in order. The rule extends easily to k stages: with stages n1, n2, ..., nk, we simply multiply the numbers of ways. For example, if we toss 2 coins the total number of sample events is 2 x 2 = 4, because the first coin has 2 possible outcomes and the second coin another 2, so there are 4 possibilities in total.
(Refer Slide Time: 15:07)
Another example is sampling from a population with replacement: a tray contains 1000 individual tax returns, and 3 returns are randomly selected with replacement from the tray. How many possible samples are there? There are 3 trials, and in each trial there are 1000 possibilities because you can choose any one of the 1000 returns; so the first trial has 1000, the second 1000, and the third 1000. Multiplying, there are 1000 x 1000 x 1000 = 1000 million possible samples with replacement.
(Refer Slide Time: 15:45)
In the case of sampling without replacement, the number of possible samples decreases. A tray contains 1000 individual tax returns, and 3 returns are randomly selected without replacement from the tray. How many possible samples are there? That is
NCn = N! / (n! (N - n)!) = 1000! / (3! (1000 - 3)!) = 166,167,000.
Previously, with replacement, there were 1000 million samples; now there are only about 166 million, because we are sampling without replacement.
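These two counts can be checked quickly in Python; math.comb is the standard-library combination function.

from math import comb

with_replacement = 1000 ** 3           # 1,000,000,000 ordered samples
without_replacement = comb(1000, 3)    # 166,167,000 samples
print(with_replacement, without_replacement)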
(Refer Slide Time: 16:29)
There are different types of probability: marginal probability, union probability, joint probability, and conditional probability, each with its own notation. Marginal probability is the simple probability of a single event, P(X). Union probability, P(X U Y), is the probability of X or Y occurring; in the Venn diagram it is the whole combined region. Joint (or common) probability, P(X ∩ Y), is the probability of X and Y occurring together, the middle overlapping portion. Conditional probability, P(X | Y), is the probability of X occurring given that Y has occurred; here the probability of the outcome of X depends on the outcome of Y, and we read it as "the probability of X given that Y has occurred". The Venn diagrams show the notation for each of these.
(Refer Slide Time: 17:23)
Then we will go for general law of addition so, P(XUY) = P(X) + P(Y) – P(X∩ Y)
(Refer Slide Time: 17:35)
We will take a small example and, from it, understand the concept of these probabilities. A company wants to improve the productivity of a particular unit and is considering a new office layout. One design will reduce the noise; the second design will give more storage space. We are going to ask the employees which of these two designs will improve productivity.
(Refer Slide Time: 18:13)
Here is the problem: a company conducted a survey for the American Society of Interior Designers in which workers were asked which changes in office design would increase productivity. There are two designs: one reduces the noise, the other improves the storage space, and respondents were allowed to select more than one type of design change. The table shows the outcome: 70% of the people responded that reducing noise would increase productivity, and 67% responded that more storage space would increase productivity.
(Refer Slide Time: 19:04)
Suppose one of the survey respondents is randomly selected and asked which office design change would increase worker productivity. What is the probability that this person would select reducing noise or the design that provides more storage space?
(Refer Slide Time: 19:29)
Let N represent the event "reducing noise", meaning that design is chosen, and let S represent the event "more storage space", the other option. The probability of a person responding N or S can be written as a union probability using the law of addition, that is P(N U S).
(Refer Slide Time: 19:56)
Using the formula, P(N U S) = P(N) + P(S) – P(N ∩ S). Here P(N) is 0.70, P(S) is 0.67, and the proportion who said yes to both designs is 0.56. Substituting these values, we get 0.70 + 0.67 – 0.56 = 0.81; that is, 81% of the people said that at least one of the two designs would increase productivity.
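A tiny sketch of this addition-law calculation, using the survey percentages quoted above.

p_n = 0.70            # P(N): noise reduction would help
p_s = 0.67            # P(S): more storage space would help
p_n_and_s = 0.56      # P(N ∩ S): said yes to both

print(p_n + p_s - p_n_and_s)   # P(N U S) = 0.81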
(Refer Slide Time: 20:33)
What we solved with the Venn diagram in the previous slide can also be solved with a contingency table, which is very helpful; we just have to construct the table. In the rows I have taken noise reduction: 70% of the people said yes, so the remaining 30% said no (0.70 and 0.30). In the columns I have taken the increased-storage-space design: 67% in total said that increasing storage space would increase productivity, so the remaining 33% said no. The 0.56 in the intersection cell is the proportion who said yes to both designs, noise reduction and increased storage space. Once you know this 0.56, the remaining cells follow by subtraction: 0.70 - 0.56 = 0.14 is the proportion who said no to storage space but yes to noise reduction; 0.67 - 0.56 = 0.11 is yes to storage space but no to noise reduction; and 0.33 - 0.14 = 0.19 is no to both. From this table we can read a lot of information.
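A sketch of how the full contingency table can be filled in from the three numbers given; pandas is used only for display, and the row and column labels are my own.

import pandas as pd

p_n, p_s, p_both = 0.70, 0.67, 0.56

table = pd.DataFrame(
    {"Storage: Yes": [p_both, p_s - p_both],
     "Storage: No":  [p_n - p_both, 1 - p_n - (p_s - p_both)]},
    index=["Noise: Yes", "Noise: No"],
).round(2)
print(table)    # rows: 0.56 0.14 / 0.11 0.19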
(Refer Slide Time: 21:49)
In general, suppose event A has levels A1 and A2 in the rows and event B has levels B1 and B2 in the columns. Whatever is inside a cell is a joint probability, for example P(A1 ∩ B1), the combination of both events; whatever is at the margin of the table, the row and column totals, is a marginal probability. This is the notation traditionally followed.
(Refer Slide Time: 22:26)
The same answer can be obtained from this contingency table. We want the percentage of people who agreed to N or S, that is P(N) + P(S) - P(N ∩ S), and all of these values can be read directly from the table: P(N) = 0.70, P(S) = 0.67, P(N ∩ S) = 0.56, so we get 0.70 + 0.67 - 0.56 = 0.81.
(Refer Slide Time: 22:57)
Then we go to conditional probability. The probability of N is 0.70 and the proportion who said yes to both designs is 0.56. Suppose you want P(S | N) = P(N ∩ S) / P(N); I will explain conditional probability in detail later, but for now take this formula. P(N ∩ S) = 0.56 from the previous table (you can also see it in the Venn diagram), P(N) = 0.70, and substituting you get 0.56 / 0.70 = 0.8.
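The same conditional probability as a one-line calculation, using the two values read from the table.

p_n = 0.70                  # P(N)
p_n_and_s = 0.56            # P(N ∩ S)
print(p_n_and_s / p_n)      # P(S | N) = 0.8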
(Refer Slide Time: 23:37)
In the same office design problem there is another conditional probability, P(N | S): given that a respondent agreed to the storage design, the probability that they also chose noise reduction. It is computed the same way, from the joint probability divided by the marginal probability of S. We will now take another small problem to explain these concepts. A company's data revealed that 155 employees worked in 4 types of positions. The table shown in the next slide is the raw values matrix, also called the contingency table, with the frequency counts for each category and the subtotals and totals, giving the breakdown of these employees by type of position and by sex.
(Refer Slide Time: 24:46)
Look at this table: the rows give the type of position held in the organization, managerial, professional, technical, or clerical, and the columns give the sex, male or female. The intersection of managerial and male is 8, which counts the employees who work in a managerial position and at the same time are male; these cell entries are the joint counts. At the extreme right and at the bottom of the table we are given the total counts.
(Refer Slide Time: 25:23)
Now, if an employee of the company is selected randomly, what is the probability that the employee is female or a professional worker? We have to find P(F) + P(P) - P(F ∩ P). From the table, P(F) = 55/155 = 0.355, P(P) = 44/155 = 0.284, and P(F ∩ P) = 13/155 = 0.084, so P(F U P) = 0.355 + 0.284 - 0.084 = 0.555.
There is another problem.
(Refer Slide Time: 26:15)
Shown here are the raw values matrix and the corresponding probability matrix of the results of a national survey of 200 executives, who were asked to identify the geographic location of their company and their company's industry type. The executives were allowed to select only one location and one industry type, because the same person cannot work in different locations and different industries at a time. With that we will conclude this session, after a quick recap.
We have seen the different types of probability, how to assign probability, and different counting rules: the mn rule, sampling with replacement, and sampling without replacement. We also revised terms that we will use frequently in this course, such as event, joint probability, marginal probability, and so on. Then, with the help of a sample problem, we saw how to find the union of two events, P(A U B), and the intersection, P(A ∩ B).
We also saw how to find the marginal probability and the conditional probability. With that we will close and continue in the next lecture. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Management Studies
Indian Institute of Technology – Roorkee
Lecture – 07
Introduction to Probability - II
Dear students, in lecture number 7 we continue with the concept of probability. We will take one example and, through it, try to understand marginal probability, joint probability, and conditional probability. The problem: a company's data reveal that 155 employees worked in 4 types of positions. Shown here again is the raw values matrix, also called the contingency table, with the frequency count for each category and the subtotals and totals, giving a breakdown of these employees by type of position and by sex. Look at this contingency table: the rows give the kind of position they hold, managerial, professional, technical, or clerical,
(Refer Slide Time: 01:20)
and the columns give their sex. In the contingency table the rows represent the type of position, managerial, professional, technical, or clerical, and the columns represent sex, which was also recorded in the data set. The question is: if an employee of the company is selected randomly, what is the probability that the employee is female or a professional worker, that is, what is P(F U P)? Here F is female and P is professional. As per the law of addition of probability,
P(F U P) = P(F) + P(P) - P(F ∩ P).
P(F) can be found by going back to the previous slide.
(Refer Slide Time: 02:22)
There are 55 females out of 155 employees in total, so P(F) = 55/155 = 0.355. Then P(P), the probability of a professional: there are 44 professionals, so 44/155 = 0.284. Next, P(F ∩ P), employees who are female and at the same time work in a professional position, is 13, so 13/155 = 0.084. Therefore P(F U P) = 0.355 + 0.284 - 0.084 = 0.555.
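A quick check of this union probability with the counts quoted from the contingency table.

total = 155
females, professionals, female_professionals = 55, 44, 13

p_f_or_p = (females + professionals - female_professionals) / total
print(round(p_f_or_p, 3))   # 0.555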
(Refer Slide Time: 03:10)
Shown here are the raw values matrix and the corresponding probability matrix of the results of a national survey of 200 executives who were asked to identify the geographic location of their company and their company's industry type. Two questions were asked: the company's geographic location and the kind of industry. The executives were allowed to select only one location and one industry, because they can work in only one industry at one location.
(Refer Slide Time: 03:45)
This table shows the industry type, finance, manufacturing, or communications, labelled A, B, C, and the geographic location, Northeast, Southeast, Midwest, and West. For example, in finance (A) there are 56 people, in manufacturing (B) there are 70, and in communications (C) there are 74; in the Northeast location there are 82 people, in the Southeast 34, in the Midwest 42, and in the West 42.
(Refer Slide Time: 04:21)
The first question: what is the probability that the respondent is from the Midwest? This answer can be read directly from the table. The second question: what is the probability that the respondent is from the communications industry or from the Northeast? Here we need the addition of two probabilities, P(C) and P(D). The third question: what is the probability that the respondent is from the Southeast or from the finance industry?
(Refer Slide Time: 04:55)
From the given table we compute the probabilities by dividing each cell by the grand total of 200; after dividing we get values such as 0.12, 0.05, 0.04, 0.04. This is the probability matrix, and from it we can pick up whatever value we need to answer the questions.
(Refer Slide Time: 05:20)
For example, going back: what is the probability that the respondent is from the Midwest? The Midwest is F, so it is 0.21. Next, what is the probability that the respondent is from the communications industry or the Northeast? You have to compute P(C U D) = P(C) + P(D) - P(C ∩ D). Similarly, for the probability that the respondent is from the Southeast or the finance industry, P(E U A) = P(E) + P(A) - P(E ∩ A). These values can be picked up directly from the probability table. Now we go to mutually exclusive events, returning to the 155-employee position table. Suppose you want P(T U C), where T is the technical position and C is the clerical position. In general P(T U C) = P(T) + P(C) - P(T ∩ C), but here the intersection is not possible, because a person cannot hold two types of positions at a time. So these are mutually exclusive events, the intersection term becomes 0, and P(T U C) = P(T) + P(C) = 69/155 + 31/155 = 0.645. This is an example of
mutually exclusive events.
(Refer Slide Time: 07:00)
Another one is P(P U C) = P(P) + P(C); here too there is no intersection, and the intersection component of the addition formula from two slides before becomes 0, because a person cannot hold two positions at the same time. This is another example of mutually exclusive events.
(Refer Slide Time: 07:18)
Let us see another problem. A company has 140 employees, of which 30 are supervisors; 80 of the employees are married, and 20% of the married employees are supervisors. If a company employee is randomly selected, what is the probability that the employee is married and is a supervisor? Whenever this kind of problem comes, if you can construct the contingency table, you can pick up whatever is asked directly from it. So let us start from the given data.
(Refer Slide Time: 08:15)
First we construct the contingency table. There are 140 employees, out of which 30 are supervisors, and 80 are married. For the cell of those who are married and at the same time supervisors, you multiply: (80 x 0.2) / 140 gives the required probability.
(Refer Slide Time: 08:43)
Let us see how we get the 0.1143. We know the probability of a married employee, P(M) = 80/140 = 0.5714, and we are given that 20% of the married employees are supervisors, so P(S | M) = 0.20. To find the probability that an employee is married and at the same time a supervisor, we use the multiplication rule: P(M ∩ S) = P(M) x P(S | M) = 0.5714 x 0.20 = 0.1143.
(Refer Slide Time: 09:32)
Once you know one cell of the contingency table, filling in the remaining cells is easy, and then whatever value you want can be picked up. For example, I have filled in the first cell, 0.1143; knowing the row total and the column total, I can subtract and get the remaining cells. That is the usefulness of the contingency table. For instance, P(S) = 30/140 = 0.2143, so P(Sbar) = 1 – P(S) = 0.7857, the "not supervisor" column total. If you want P(Mbar ∩ Sbar), those who are not married and at the same time not supervisors, that is the corresponding cell, about 0.3286. P(M ∩ Sbar), those who are married but not supervisors, is P(M) – P(M ∩ S) = 0.5714 – 0.1143 = 0.4571. In short, once you know one cell of the contingency table, the rest can be found.
(Refer Slide Time: 10:54)
Next, the special law of multiplication for independent events. The general law is P(X ∩ Y) = P(X) x P(Y | X) = P(Y) x P(X | Y). The special law applies when X and Y are independent: then P(X | Y) = P(X), because the outcome of X does not depend on the outcome of Y, and similarly P(Y | X) = P(Y). Substituting these into the general law, P(X ∩ Y) = P(X) x P(Y), but only when both events are independent.
(Refer Slide Time: 11:45)
Next is the law of conditional probability, which we have also seen before: the conditional probability of X given Y is the joint probability of X and Y divided by the marginal probability of Y, that is, P(X | Y) = P(X ∩ Y) / P(Y). By rearranging, this can also be written as P(X | Y) = (P(Y | X) x P(X)) / P(Y).
(Refer Slide Time: 12:14)
Here P(A ∩ B) is the joint probability of A and B, P(A) is the marginal probability of event A, and P(B) is the marginal probability of event B.
(Refer Slide Time: 12:54)
We will take an example of how to find a conditional probability. Of the cars on a used-car lot, 70% have air conditioning (AC), 40% have a CD player, and 20% have both. What is the probability that a car has a CD player given that it has AC? That is, we want P(CD | AC).
(Refer Slide Time: 13:19)
As I told you, just draw the contingency table, because all the values are given: 70% of the cars have AC (so 30% do not), 40% have a CD player (so 60% do not), and 20% have both, the 0.2 cell. Once you know these values, the other cells can be found by subtraction. If you want P(CD | AC), by definition it is P(CD ∩ AC) divided by P(AC), which is 0.2 / 0.7 = 0.2857. Given AC we consider only the AC row: 70% of the cars have AC and 20% have both AC and a CD player, so 20% out of 70% is 28.57%. That is the conditional probability.
(Refer Slide Time: 14:25)
This conditional probability can also be explained with a tree diagram, which is easy to visualize. The first branching is AC versus no AC; from each of these, the second branching is CD versus no CD. The probabilities are 0.7 for AC, and the joint probabilities are 0.2 (AC and CD), 0.5 (AC and no CD), 0.2 (no AC and CD), and 0.1 (no AC and no CD). If you want the probability of CD given AC, you divide 0.2 by 0.7; for the next arc, no CD given AC, it is 0.5 divided by 0.7, and so on. A tree diagram makes this very easy to follow.
(Refer Slide Time: 15:12)
Now the definition of independent events: if X and Y are independent, the occurrence of Y does not affect the probability of X occurring, and the occurrence of X does not affect the probability of Y occurring; X and Y are not connected. In symbols, if X and Y are independent then P(X | Y) = P(X) and P(Y | X) = P(Y), as we have seen before.
(Refer Slide Time: 15:44)
Here is the condition again: two events A and B are independent when P(A | B) = P(A), that is, when the probability of one event is not affected by the other event. This condition is used for testing independence.
(Refer Slide Time: 16:05)
We will take one example to see the practical application of independent events, using the data given earlier, where we asked which industry the executives work in, finance, manufacturing, or communications, and their geographic location. The question is: test the matrix for the 200 executive respondents to determine whether industry type is independent of geographic location; that is, find out whether there is any dependency between the geographic location and the kind of industry. For example, in India most software companies, as you know, are in the south, so is there a connection between geographic location and the kind of industry located there? We will take the example of finance and the West region.
(Refer Slide Time: 17:05)
We want P(A | G) = P(A ∩ G) / P(G). P(A ∩ G) can be read directly from the table as 0.07, and P(G) is 0.21, so P(A | G) = 0.07 / 0.21 = 0.33. But P(A) is 0.28. So P(A | G) is not equal to P(A); if they were equal, the events would be independent. Since they are not equal, there is a dependency between the geographic location and the type of industry located there. This is one way to test for independence.
(Refer Slide Time: 18:06)
Take another example, A given D, where the actual counts are given (you can work either with counts or with probabilities). The count for A ∩ D is 8 and the total for D is 34, so P(A | D) = 8/34 = 0.235, while P(A) = 20/85 = 0.235. Both P(A | D) and P(A) are the same, so these are independent events. Whenever P(A | D) = P(A), the two events are independent; this is the way to test independence.
(Refer Slide Time: 18:55)
Next we come to an important application, the Bayes rule or Bayes theorem. It is used to revise probabilities and has many applications in higher-level probability theory. It is an extension of the conditional law of probability that enables revision of an original probability in the light of new information: P(X | Y) = P(Y | X) x P(X) divided by a summation term that I will explain in the next slides.
(Refer Slide Time: 19:29)
Let us see where it comes from. P(X | Y) can be written as P(X ∩ Y) / P(Y), and similarly P(Y | X) = P(X ∩ Y) / P(X); in both, the numerator is the same joint probability P(X ∩ Y). Therefore P(X | Y) x P(Y) = P(Y | X) x P(X). Now suppose I know P(X | Y) and I want the reverse, P(Y | X). From the relation above, P(Y | X) = P(X | Y) x P(Y) / P(X). Here the denominator is simply P(X) because there are only two outcomes; if there are more outcomes, the denominator becomes a sum of terms, which is nothing but the sum of the different joint probabilities. We will see this with the help of an example.
(Refer Slide Time: 21:12)
Here is a typical example. Machines A, B, and C all produce the same two parts, X and Y. Of all the parts produced, machine A produces 60%, machine B produces 30%, and machine C produces 10%. In addition, 40% of the parts made by machine A are part X, 50% of the parts made by machine B are part X, and 70% of the parts made by machine C are part X. A part produced by this company is randomly sampled and determined to be an X part. With the knowledge that it is an X part, revise the probabilities that the part came from machine A, B, or C. First we will solve this in a tabular format.
(Refer Slide Time: 22:01)
There are three machines, A, B, and C: 60% of the parts are produced by machine A, 30% by machine B, and 10% by C. We have already seen how the formula for conditional probability arises; now I will show you an application of the Bayes theorem.
(Refer Slide Time: 22:29)
Suppose there are two suppliers, supplier A and supplier B. Say 40% of the product is supplied by supplier A and the remaining 60% by supplier B. From my past experience I know that 2% of the parts supplied by supplier A are defective, and that supplier B supplies 3% defective products out of his 60%.
Using their parts I have assembled a new machine, and now the machine is not working. I want to know: given that the machine is not working, what is the probability that the faulty part was supplied by supplier A, and what is the probability that it was supplied by supplier B? This is the application of the Bayes theorem, and we will see it with the help of an example.
(Refer Slide Time: 23:36)
The problem: a particular type of printer ribbon is produced by only two companies, Alamo Ribbon Company and South Jersey Products. Alamo produces 65% of the ribbons and South Jersey produces 35%. From past experience we know that 8% of the ribbons produced by Alamo are defective and 12% of the South Jersey ribbons are defective. A customer purchases a new ribbon.
Given that the ribbon is defective, what is the probability that Alamo produced it, and what is the probability that South Jersey produced it? It is just like the previous example: the machine is not working, and we ask for the probability that the part came from supplier A or from supplier B.
(Refer Slide Time: 24:39)
First we write down the marginal and conditional probabilities. P(Alamo) = 0.65, since 65% of the product is supplied by Alamo, and P(South Jersey) = 0.35. From past experience, the proportion of defectives among Alamo's ribbons is P(D | Alamo) = 0.08, and among South Jersey's ribbons P(D | South Jersey) = 0.12. Note that 0.08 and 0.12 need not sum to 1, but 0.65 and 0.35 must sum to 1 because those are the shares of the total supply; the 8% is 8% out of Alamo's 65%.
Now we apply the formula in reverse: we know the ribbon is defective and we want the probability that it was supplied by Alamo. The numerator is P(D | Alamo) x P(Alamo), and the denominator is the sum over all possibilities, P(D | Alamo) x P(Alamo) + P(D | South Jersey) x P(South Jersey). Here 0.08 and 0.65 are given, so the numerator is 0.08 x 0.65 = 0.052, and the other term is 0.12 x 0.35 = 0.042; adding these two gives 0.094. Dividing, P(Alamo | D) = 0.052 / 0.094 = 0.553: if the ribbon is defective, there is about a 55% chance it was supplied by Alamo. Similarly, if the product is defective, the probability that it was supplied by South Jersey has numerator 0.12 x 0.35 = 0.042 with the same denominator, giving 0.447; that is, there is a 44.7% chance that the defective ribbon came from South Jersey.
(Refer Slide Time: 27:01)
In tabular form it is very easy. The events are Alamo and South Jersey: Alamo supplies 65% and South Jersey 35%; the conditional probabilities are that Alamo supplies 8% defective ribbons and South Jersey 12%. First we find the joint probabilities, 0.65 x 0.08 = 0.052 and 0.35 x 0.12 = 0.042, and add them to get 0.094. Then each joint probability is divided by this 0.094 to give the revised probabilities. We knew P(D | supplier), and we have found the reverse, P(supplier | D); that is the advantage of the Bayes theorem. The revised probabilities are 0.553 for Alamo and 0.447 for South Jersey.
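A small sketch of this Bayes revision, using the prior shares and defect rates quoted in the problem.

priors = {"Alamo": 0.65, "South Jersey": 0.35}   # P(supplier)
defect = {"Alamo": 0.08, "South Jersey": 0.12}   # P(defective | supplier)

joint = {s: priors[s] * defect[s] for s in priors}   # 0.052 and 0.042
p_defective = sum(joint.values())                    # 0.094

posterior = {s: joint[s] / p_defective for s in joint}
print(posterior)   # roughly {'Alamo': 0.553, 'South Jersey': 0.447}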
(Refer Slide Time: 27:54)
The same calculation can be shown in pictorial (tree) form: the first branching is Alamo versus South Jersey, and from each the branches are defective versus not defective, with 8% and 12% defective respectively. Multiplying along the branches gives 0.052 and 0.042, and adding them gives 0.094. In this lecture I have explained mutually exclusive events, the law of multiplication, independent events, and the concept of the Bayes theorem, and I have illustrated the application of the Bayes theorem with a problem. With that we conclude this lecture. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 08
Probability Distributions
Good morning, students. We are entering the 8th lecture of this course on data analytics with Python. Today's topic is probability distributions, a very interesting one. We are going to see some empirical distributions and their properties, then discrete distributions, where we will cover the binomial, Poisson, and hypergeometric distributions, and continuous distributions, where we will cover the uniform, exponential, and normal distributions.
(Refer Slide Time: 01:00)
First of all, what is a distribution, and what is the purpose of studying distributions? A distribution describes the shape of a batch of numbers; that is the meaning of distribution. Given a set of numbers, you want to see what shape it follows: if it is bell shaped we call it a normal distribution, if it forms a rectangular shape we call it a uniform distribution, and so on. A distribution is characterized by its parameters, such as the mean and variance; with their help you can draw the distribution.
(Refer Slide Time: 01:53)
Why study distributions? They serve as a basis for standardized comparison of empirical distributions: if we compare an observed phenomenon with the standard distributions, we come to know which distribution it follows. They also help us estimate confidence intervals for inferential statistics (we will see what a confidence interval means in the coming classes), and they form a basis for more advanced statistical methods; for example, a good fit between the observed distribution and a certain theoretical distribution is an assumption of many statistical procedures. As an illustration, suppose we are doing a simulation and the arrival pattern follows a Poisson distribution: if you can show from the collected data that arrivals follow a Poisson distribution, then the mean, variance, and other population parameters are already defined for it. So whenever you can match a natural phenomenon with a standard, well-defined distribution, you can use its parameters directly; that is the purpose of studying distributions.
(Refer Slide Time: 03:01)
Next, what is a random variable? To construct a distribution we need the relation between X and its corresponding probability p(x); here X is the random variable. A random variable is a variable that contains the outcome of a chance experiment; it is a way of quantifying the outcome. Suppose we toss a coin and set X = 1 for a head and X = 0 for a tail; then X is the random variable, taking the values 1 and 0, where 1 means head and 0 means tail. More generally, it is a variable that can take different values in the population according to some random mechanism. A random variable can be discrete, taking distinct, countable values, for example the year; or continuous, for example mass.
(Refer Slide Time: 04:02)
Now, probability distributions. The probability distribution function, or probability density function (PDF), of a random variable X gives the values taken by the random variable together with their associated probabilities. If you relate X to the corresponding probabilities p(x) or f(x) and plot those points, you get the distribution. For a discrete random variable the PDF is also known as the PMF, the probability mass function. Example: let the random variable X be the number of heads obtained in 2 tosses of a coin. When you toss twice, the possible outcomes are head-head, head-tail, tail-head, and tail-tail; these form the sample space.
(Refer Slide Time: 04:53)
Here is the probability mass function of that discrete random variable. Tossing a coin 2 times, the probability of 0 heads is 1/4, the probability of one head is 1/2, and the probability of 2 heads is 1/4; the sum must be 1. On the x-axis the values of the random variable are marked and on the y-axis the corresponding probabilities; this plot is the distribution.
(Refer Slide Time: 05:23)
Now we will do a small numerical problem on the probability distribution of a random variable X. A probability distribution for a discrete random variable X is given: the values of X and the corresponding probabilities, an empirical distribution. Suppose you want P(X <= 0): you add the probabilities for every value of X that is less than or equal to 0, for example 0.20 + 0.17 + 0.15 + 0.13 = 0.65. Suppose you want the probability that the random variable lies between -3 and 1, P(-3 <= X <= 1): you add the probabilities from -3 to 1, 0.15 + 0.17 + 0.20 + 0.15 = 0.67.
(Refer Slide Time: 06:12)
How do we plot a discrete distribution? Take, for example, the number of crises, with the probability of each number of crises given: the probability of 0 crises is 0.37, of one crisis 0.31, and so on. On the x-axis you mark the random variable and on the y-axis you plot the probability, so 0.37 is plotted above 0, 0.31 above 1, and so on. These points cannot be connected, because the variable is discrete: there are values at x = 1 and x = 2, but at x = 1.5 there is no value. That is why, for a discrete distribution, you cannot join the points, and that is why it is called a discrete distribution.
(Refer Slide Time: 07:17)
The requirements for a discrete probability mass function are that each probability lies between 0 and 1 inclusive and that the total of all the probabilities equals 1, as we have already seen.
(Refer Slide Time: 07:30)
The next term is the cumulative distribution function. The cumulative distribution function of a random variable X, denoted F(x), associates every value x in the range of possible values with P(X <= x); it is obtained by just adding up the probabilities. The CDF always lies between 0 and 1, that is, 0 <= F(x) <= 1.
(Refer Slide Time: 08:04)
Then there is a very important quantity, the expected value of X. Let X be a discrete random variable with set of possible values D and pmf p(x). The expected value or mean value of X, denoted E(X) or µx, is E(X) = µx = Ʃ x·p(x); that is, Ʃ x·p(x) is the expected value of X.
(Refer Slide Time: 08:32)
What do the mean and variance of a discrete random variable mean? Look at pictures (a) and (b): the mean is the same for both distributions, but the variances differ; the left figure shows a lot of variance and the right one less. A probability distribution can be viewed as a loading of mass, with the mean as the balance point; the mean is a kind of balance point about which the distribution sits. Parts (a) and (b) illustrate equal means, but part (a) illustrates a larger variance. In the second case, the probability distributions illustrated in parts (a) and (b) differ in shape even though they have equal means and equal variances.
(Refer Slide Time: 09:25)
Now we will see how to find an expected value. Use the data below to find the expected number of credit cards that a customer of a retail outlet carries. X is the random variable, the number of credit cards a customer has, and P(x) is the corresponding probability. For X = 0, P(x) = 0.08, meaning the probability of a person having 0 credit cards is 8%; the probability of a person having, say, 6 credit cards is 1%. To find the expected value, multiply each x by its corresponding probability and sum: 0(0.08) + 1(0.28) + 2(0.38) + ... + 6(0.01) = 1.97, which we can round to 2. That means that, on average, a randomly selected customer carries about 2 credit cards. This is an example of what the mean of a distribution means.
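A short sketch of this expected-value calculation. The lecture reads out only some of the probabilities, so the values for 3, 4, and 5 cards are assumptions chosen to make the probabilities sum to 1 and reproduce the 1.97 quoted above.

x_values = [0, 1, 2, 3, 4, 5, 6]
p_values = [0.08, 0.28, 0.38, 0.16, 0.06, 0.03, 0.01]   # middle three assumed

expected = sum(x * p for x, p in zip(x_values, p_values))
print(expected)   # 1.97, roughly 2 credit cards on average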
(Refer Slide Time: 10:31)
Now we will see how to find the variance and standard deviation of an empirical distribution. Previously we saw µx = Ʃ x·p(x); now we find the variance. Let X have pmf p(x) and expected value µ. The variance of X, denoted V(X), σx², or σ², is V(X) = Ʃ (x - µ)²·p(x), which can also be written as E[(X - µ)²]. The standard deviation is the square root of the variance.
(Refer Slide Time: 11:14)
We will see an example. The quiz scores for a particular student are given: 20, 25 and so on. Find the variance and standard deviation. Before finding the standard deviation, you first have to find the mean, because the mean is required: adding the scores and dividing by the number of elements gives 21. First we construct a frequency distribution: 12 appears 1 time, 18 appears 2 times, 20 appears 4 times and, for example, 25 appears 3 times.
Then we find the probabilities. The probability is nothing but the relative frequency; as I told you, one definition of probability is relative frequency. The cumulative frequency is obtained by adding up the frequencies, 1 + 2 = 3, 3 + 4 = 7, then 8, 10, 13, so the total frequency is 13. Each probability is a frequency divided by this total: 1 divided by the sum of all frequencies, 2 divided by the sum of all frequencies, and so on.
The mean µ can also be found another way. Instead of ƩF.M / ƩF, since the relative frequencies are already known, we can use the expected-value formula Ʃ x . p(x): 12 x 0.08 + 18 x 0.15 + 20 x 0.31 + 22 x 0.08 + ...
Ʃ x . p(x) = 21. One way is to add all the values and divide by the number of elements; otherwise, from the empirical distribution, where x is 12, 18, 20, ... and the probabilities are given, the mean is Ʃ x . p(x). Now we are going to find the variance.
(Refer Slide Time: 13:40)
So the variance is p1 . (12 - µ)² + p2 . (18 - µ)² + ..., that is, 0.08 (12 - µ)² + 0.15 (18 - µ)² and so on; adding the terms gives the variance, and taking the square root gives the standard deviation. Substituting µ = 21: 0.08 (12 - 21)² + 0.15 (18 - 21)² + 0.31 (20 - 21)² + ..., which simplifies to a variance of 13.25 and a standard deviation of 3.64. So what have we done in this problem? The data were given, first we constructed the empirical distribution, then we used the variance formula to find the mean and variance.
There is also a shortcut formula for the variance. Starting from E(X - µ)², if you expand the square and simplify you get
V(X) = [Ʃ x² p(x)] - µ² = E(X²) - [E(X)]².
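A minimal sketch of the same calculation in Python; the entries 22 and 24 and their frequencies are assumed values completing the table, since only part of it is read out above. The result lands near 13.1 rather than 13.25 because the lecture rounds the relative frequencies to two decimals before substituting.

```python
import numpy as np

# Quiz-score empirical distribution; 22 and 24 and their frequencies are
# assumed entries completing the frequency table described in the lecture.
x = np.array([12, 18, 20, 22, 24, 25])
f = np.array([1, 2, 4, 1, 2, 3])
p = f / f.sum()                            # relative frequencies

mu = np.sum(x * p)                         # mean, 21.0
var = np.sum((x - mu) ** 2 * p)            # definition: sum (x - mu)^2 p(x)
var_shortcut = np.sum(x ** 2 * p) - mu**2  # shortcut: E(X^2) - [E(X)]^2
print(mu, var, var_shortcut, np.sqrt(var)) # about 21, 13.1, 13.1, 3.6
```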
(Refer Slide Time: 14:48)
Let us now find the mean of another discrete distribution. The formula for the mean is µ = E(X) = Ʃ x . p(x). X is given and p(x) is given; multiply each x by p(x), and when you sum the products the total is 1. So the mean of this empirical distribution is 1.
(Refer Slide Time: 15:07)
Next, find the variance and standard deviation of the same discrete distribution. We know σ² = Ʃ (x - µ)² . p(x). X is given and p(x) is given; first find (x - µ), then (x - µ)², then multiply (x - µ)² by p(x) and sum, which gives 1.2. So the variance is 1.2, and taking the square root, the standard deviation is about 1.10.
(Refer Slide Time: 15.37)
Suppose we take another distribution where X is given and p(x) is given, and we compute x . p(x). When you plot it, you see that the mean need not be exactly 1 or 2 or 3; it may lie between 1 and 2. So the mean value need not be discrete; only the random variable is discrete here.
189
(Refer Slide Time: 16:00)
Some very important properties of expected values: the expected value of a constant is the constant itself. For the sum of two random variables, E(X + Y) = E(X) + E(Y). E(X | Y) is not a division, it is a conditional expectation, similar in spirit to a conditional probability, so E(X | Y) will not in general equal E(X) divided by E(Y). Similarly, E(XY) is not equal to E(X) multiplied by E(Y) unless the variables are independent.
If they are independent, you can write E(XY) = E(X) E(Y); otherwise you cannot. If a random variable comes along with a constant, that constant can be taken out of the expected value: for E(aX), the a can be brought outside, giving a E(X), where a is the constant. In the same way, E(aX + b) = a E(X) + b, since the expected value of the constant b is b itself, where a and b are constants.
(Refer Slide Time: 17:11)
190
Then the properties of variances: the variance of a constant is 0. If X and Y are two independent random variables, then Var(X + Y) = Var(X) + Var(Y), and also Var(X - Y) = Var(X) + Var(Y). Be careful here: suppose there are two groups, group 1 and group 2, and you want the variance of their difference; you still add their variances. If b is a constant, then Var(b + X) = Var(X), because the variance of b is 0. If a is a constant, then Var(aX) = a² Var(X), because when the constant comes out of the squared term it appears as a². Therefore, if a and b are constants, Var(aX + b) = a² Var(X), since the variance contributed by b is 0. If X and Y are two independent random variables and a and b are constants, then Var(aX + bY) = a² Var(X) + b² Var(Y).
(Refer Slide Time: 18:24)
191
Then covariance: for two discrete random variables X and Y with E(X) = µx and E(Y) = µy, the covariance between X and Y is defined as Cov(X, Y) = σxy = E[(X - µx)(Y - µy)], which simplifies to E(XY) - µx µy. In general, the covariance between two random variables can be positive or negative. If the random variables move in the same direction, the covariance is positive; if they move in opposite directions, the covariance is negative. Properties of covariance: if X and Y are independent random variables, their covariance is 0, since E(XY) = E(X) E(Y) when they are independent. Cov(X, X) is simply the variance of X, and similarly Cov(Y, Y) is simply the variance of Y.
(Refer Slide Time: 19:31)
192
Then the correlation coefficient: the covariance tells the sign, but not the magnitude, of how strongly the variables are positively or negatively related. The correlation coefficient provides such a measure of how strongly the variables are related to each other. The covariance only gives the direction, not the magnitude, but the correlation gives the magnitude. For two random variables X and Y with E(X) = µx and E(Y) = µy, the correlation coefficient is defined as Cov(X, Y) divided by σx σy.
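To make the covariance and correlation formulas concrete, here is a small sketch on a purely hypothetical paired data set (the numbers are not from the lecture):

```python
import numpy as np

# Purely hypothetical paired observations, only to illustrate the formulas.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 6.0])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance, n - 1 in the denominator
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation coefficient
print(cov_xy, corr_xy)
print(cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1)))  # reproduces corr_xy
```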
(Refer Slide Time: 20:06)
Dear students, now we are going to study some special distributions in the discrete category and the continuous category. In the discrete category we will study the binomial distribution, the Poisson distribution and the hypergeometric distribution; in the continuous category we will study the uniform, exponential and normal distributions. In this class I will explain the theory and the corresponding parameters; then, in the practical class, we will use Python to find the mean, variance and probabilities of these distributions.
(Refer Slide Time: 20:46)
193
The first one is the binomial distribution. Let us consider an example to explain the concept. Consider the purchase decisions of the next 3 customers who enter a store. On the basis of past experience, the store manager estimates that the probability that any one customer will make a purchase is 0.30. What is the probability that 2 of the next 3 customers will make a purchase?
(Refer Slide Time: 21:18)
Now look at the tree diagram. For the first customer there are 2 possibilities: S is a purchase, F is no purchase, and X is the number of customers making a purchase. The first customer can purchase or not purchase, the second customer again has both possibilities, and so does the third customer. Listing the experimental outcomes: the first branch is success, success, success (SSS); then success, success, failure; success, failure, success; success, failure, failure; failure, success, success; failure, success, failure; failure, failure, success; and failure, failure, failure. So we have written down all the possibilities. Now the question is: out of 3 customers, what is the probability that 2 customers will make a purchase? SSS means all 3 customers purchased.
So for SSS the random variable takes the value x = 3. In the second case two customers purchased and the third did not buy, so x = 2, because X counts the number of customers making a purchase. The first possibility gives x = 3, the second 2, the third 2, the fourth 1, then 2, 1, 1, 0. Now the question is: what is the probability that 2 out of 3 customers will make a purchase?
(Refer Slide Time: 23:00)
The outcomes with exactly 2 purchases are SSF, SFS and FSS. For SSF the probabilities along the branch are p, p and 1 - p, and we know p = 0.3, so the branch probability is 0.3² x 0.7 = 0.063. For the second outcome we again get success p, failure 1 - p, success p, so p² (1 - p) = 0.3² x 0.7 = 0.063. For the third possibility, failure then success then success, it is (1 - p) p p, again p² (1 - p) = 0.063. So each branch has probability 0.063.
The number of such branches is 3C2, because the question asks: out of 3 customers, what is the probability that 2 will buy? The value of 3C2 is 3, which is exactly why there are 3 such branches in the tree; that is the meaning of 3C2.
(Refer Slide Time: 24:20)
195
Now we will find the probability when x = 0, which means nobody is buying.
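A small sketch of the whole purchase distribution with scipy, under the same assumptions as the example (n = 3, p = 0.3); the per-value probabilities in the comments are computed, not read from the slide.

```python
from scipy.stats import binom

# Purchase example: n = 3 customers, probability of a purchase p = 0.3.
n, p = 3, 0.3
for x in range(n + 1):
    print(x, binom.pmf(x, n, p))
# x = 0 gives 0.343 (0.7^3, nobody buys); x = 2 gives 0.189 (3 branches of 0.063)
```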
(Refer Slide Time: 24:33)
Students, we have studied variance, covariance and the correlation coefficient; let us now relate them. We know the sample variance formula: variance = Ʃ(x - x̄)² / (n - 1). Variance is for one variable; if you move to two variables, the analogous quantity is called the covariance.
So covariance = Ʃ(x - x̄)(y - ȳ) / (n - 1); here also we divide by n - 1. After variance and covariance, the next one is the correlation coefficient, which is Cov(x, y) divided by the standard deviation of x times the standard deviation of y. So when you divide the covariance by the corresponding standard deviations, you get the correlation coefficient.
196
Next we look at 'm', the slope of a regression equation. It is nothing but Cov(x, y) divided by the variance of x. So we have variance, covariance, correlation coefficient and the slope of the regression equation, and you can see they are all related. Variance is for one variable only; what is its meaning? How far each value is from its mean: take the deviation, square it, sum the squared deviations, and the mean of that sum gives the variance. For covariance there are two variables, and we measure how each variable moves away from its own mean: Ʃ(x - x̄)(y - ȳ) divided by n - 1. If you want the correlation coefficient, the covariance is divided by the corresponding standard deviations. If you want the slope of the regression equation, divide Cov(x, y) by the variance of x and you get m, which is nothing but the slope of the regression equation.
Dear students, so far we have seen the need for studying distributions, then how to construct a discrete probability distribution, how to find the mean and variance of a discrete distribution, and then the properties of the expected value.
Next we saw the properties of the variance, and then how the mean, variance and covariance are interrelated. In the next class we will continue with some discrete distributions and some continuous distributions in detail. Thank you.
197
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture - 09
Probability Distribution – II
Dear students, in this lecture we are going to see some discrete distributions and continuous distributions. In the discrete category we will study the binomial, Poisson and hypergeometric distributions; in the continuous category we will study the uniform, exponential and normal distributions. First, let us see what the application of the binomial distribution is.
(Refer Slide Time: 00:48)
In other words, what is the need for the binomial distribution? We will take one practical example that I will first solve manually using basic probability, and then I will show how the binomial distribution helps us solve the same problem much more quickly. Consider the purchase decisions of the next 3 customers who enter a store.
On the basis of past experience, the store manager estimates that the probability that any one customer will make a purchase is 0.30. Now, what is the probability that 2 of the next 3 customers will make a purchase?
(Refer Slide Time: 01:29)
198
The problem is drawn in the form of a tree diagram: first customer, second customer, third customer. When the first customer comes, there are 2 possibilities, purchase or no purchase; a purchase is marked S for success and no purchase F for failure. The second customer may also purchase or not purchase, and likewise the third customer. Look at the first branch: all three customers purchase, so success, success, success, that is S, S, S.
If you take X, at the bottom of the slide, as the number of customers making a purchase, then in this branch all 3 customers buy. Look at the second branch: success, success, failure, so out of 3 customers, 2 customers purchase. In this way all the possibilities are displayed. Now, the problem asks about 2 out of 3 customers making a purchase.
How many branches have exactly 2 customers purchasing? There are 1, 2, 3 such branches, and this 3 is nothing but 3C2: out of 3 customers, the number of ways in which 2 can buy is 3C2, and 3C2 is 3.
(Refer Slide Time: 02:57)
199
Now consider the branches with exactly two purchases. The first is purchase, purchase, no purchase, that is success, success, failure; success has probability p and failure 1 - p, so the branch probability is p . p . (1 - p) = (0.30)² x 0.70 = 0.063. For the second branch the first customer buys, the second does not, and the third buys, so it is success, failure, success, with probability p (1 - p) p = p² (1 - p) = 0.063. The third choice is no purchase, purchase, purchase, FSS, so (1 - p) p p, again p² (1 - p) = 0.063.
(Refer Slide Time: 03:48)
If I tabulate this, each value of x corresponds to certain branches. When x = 0, what is the chance that none of the 3 customers buys? That is the branch 0.7 x 0.7 x 0.7. When x = 1, what is the probability that exactly 1 customer buys; when x = 2, that 2 customers buy; and when x = 3, that all 3 customers buy? We can work all of this out manually using basic probability.
The question asked is: what is the probability that 2 customers will buy out of 3? Listing every branch gives the full probability distribution, but doing it manually takes a lot of time. With the help of the binomial distribution you can capture all these possibilities at once:
P(X = 2) = nCx . p^x . q^(n - x), where n = 3, x = 2, p is the probability of success and q is the probability of failure.
The term nCx counts the different combinations in which the event can happen, and p^x q^(n - x), here p² q^(n - 2), is the probability of each such branch; multiplying them gives the required probability. That is the purpose of the binomial distribution.
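A quick check of this formula, assuming the same numbers as the example (n = 3, x = 2, p = 0.3):

```python
from math import comb
from scipy.stats import binom

n, x, p = 3, 2, 0.3
manual = comb(n, x) * p**x * (1 - p)**(n - x)   # 3C2 * 0.3^2 * 0.7 = 0.189
print(manual, binom.pmf(x, n, p))               # both give about 0.189
```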
(Refer Slide Time: 05:35)
What are the properties of the binomial distribution? The experiment involves n identical trials, and each trial has exactly 2 possible outcomes, success or failure. Each trial is independent of the previous trial; in our example, whether customer one buys or not does not affect the second customer, whose intention to purchase is not based on the previous customer.
p is the probability of success on any one trial and q is the probability of failure on any one trial; p and q are constant throughout the experiment, which is an important assumption. X is the number of successes in the n trials; in the previous case we were looking at x = 2.
(Refer Slide Time: 06:23)
202
Sometimes a ready-made table is available for looking up binomial probability values. For example, with n = 10 trials, x = 3 and p = 0.04, the table entry is 0.150. The tables are provided so that you do not have to use your calculator, which may take more time.
(Refer Slide Time: 07:34)
Now we will find the mean and variance. Suppose that for the next month the clothing store forecasts that 1000 customers will enter the store. What is the expected number of customers out of the 1000 who will make a purchase? The answer is µ = np, so n = 1000 times the probability of success 0.3, which equals 300. So out of 1000 customers, we expect about 300 customers to make a purchase.
For the next 1000 customers entering the store, the variance of the number of customers who make a purchase can be written as npq. With n = 1000, p = 0.3 and q = 1 - p = 0.7, the variance is 210 and the standard deviation is 14.49.
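These two moments can also be read off from scipy; a minimal sketch with the numbers above:

```python
from scipy.stats import binom

n, p = 1000, 0.30
mean, var = binom.stats(n, p, moments='mv')   # mean = np, variance = npq
print(mean, var, var ** 0.5)                  # 300.0, 210.0, about 14.49
```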
(Refer Slide Time: 08:24)
Then we go to the next distribution. The Poisson distribution describes discrete occurrences over a continuous interval. Generally the Poisson distribution is used for rare events; it is a discrete distribution, so X takes only discrete values. Each occurrence is independent of any other occurrence, the number of occurrences in an interval can vary from 0 to infinity, and the expected number of occurrences must remain constant throughout the experiment. These are the assumptions of the Poisson distribution.
(Refer Slide Time: 09:09)
204
Some examples of applications of the Poisson distribution: arrivals at a queuing system follow a Poisson distribution; at an airport, people, airplanes, automobiles and baggage may all arrive according to a Poisson process. At banks, the arrival pattern of people, automobiles and loan applications follows a Poisson distribution, and in computer file servers, read and write operations follow a Poisson distribution.
Defects in manufactured goods can also be modelled with a Poisson distribution, for example the number of defects per 1000 feet of extruded copper wire; note that n is very large and the probability of success is very low. Other examples are the number of blemishes per square foot of painted surface (a blemish is a kind of defect in the paint) and the number of errors per typed page.
(Refer Slide Time: 10:15)
205
The probability function for the Poisson distribution is P(X) = (λ^X e^(-λ)) / X!, where X can take only discrete values, λ is the mean, that is, the long-run average, and e is approximately 2.71. The mean of the Poisson distribution is λ, the variance is also λ, and the standard deviation is the square root of λ. This distribution is sometimes described as a one-parameter distribution, because λ is its only parameter. A special property of the Poisson distribution is that the mean and the variance are the same, both equal to λ.
(Refer Slide Time: 10:55)
206
Another caution while using the Poisson distribution: the units of λ and X must be the same. For example, suppose λ = 3.2 customers per 4 minutes and we want the probability of 10 customers in 8 minutes. You have to adjust λ by multiplying it by 2, so λ becomes 6.4 customers per 8 minutes. Now the units of X and λ are the same, and you can use the probability function P(X) = (λ^X e^(-λ)) / X! and substitute X = 10, which gives about 0.053. In the second case, λ = 3.2 customers per 4 minutes and X = 6 customers per 8 minutes; again you have to scale λ to 6.4 per 8 minutes and then compute P(X). The point is that the units of λ and X should be the same.
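A sketch of this unit adjustment with scipy, assuming the same rates as in the example:

```python
from scipy.stats import poisson

lam_4min = 3.2            # 3.2 customers per 4 minutes
lam_8min = 2 * lam_4min   # rescale so lambda and x share the same 8-minute unit

print(poisson.pmf(10, lam_8min))   # P(exactly 10 customers in 8 minutes), about 0.053
print(poisson.pmf(6, lam_8min))    # P(exactly 6 customers in 8 minutes)
```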
(Refer Slide Time: 11:49)
Here also a Poisson probability table is available. The columns show µ; so for µ = 10 and X = 5 you read off the entry, which is 0.0378. You need not use your calculator; you can read the value directly from the table. In the practical class, however, we are going to find these probabilities with the help of Python.
(Refer Slide Time: 12:23)
207
Next we go to the hypergeometric distribution. The binomial distribution is applicable when selecting from a finite population with replacement, or from an infinite population without replacement. Whenever sampling without replacement from a finite population comes up, we have to think of the hypergeometric distribution: it is applicable when selecting from a finite population without replacement.
(Refer Slide Time: 12:52)
The properties of the hypergeometric distribution: we are sampling without replacement from a finite population. The number of objects in the population is denoted N, and each trial has exactly 2 possible outcomes, success or failure, similar to the binomial distribution. However, the trials are not independent, and this is one property that differs from the binomial. In the binomial distribution, since we sample with replacement, the trials are independent; here we sample without replacement, so the trials are dependent. That means the probability of success p is not fixed; it changes from draw to draw. X is the number of successes in the n trials. The binomial is an acceptable approximation when N/10 ≥ n; otherwise it is not.
(Refer Slide Time: 13:53)
Now the probability function: for discrete values we use a probability mass function, and for continuous variables we use a PDF, a probability density function. The probability mass function of the hypergeometric distribution is P(x) = (ACx)(N-ACn-x) / NCn. The capital letters refer to the population: N is the population size and A is the number of successes in the population; the small letters refer to the sample: n is the sample size and x is the number of successes in the sample.
The mean is A n / N. The variance of the hypergeometric distribution is A (N - A) n (N - n) / [N² (N - 1)], and the square root of the variance is the standard deviation.
(Refer Slide Time: 14:57)
209
We will see one problem. Three different computers are checked from the 10 in a department, and 4 of the 10 computers have illegal software loaded. What is the probability that 2 of the 3 selected computers will have illegal software loaded? Looking at the problem, we see it is a finite population sampled without replacement, so we should think of the hypergeometric distribution.
N is the population size, N = 10. A = 4, because we know that 4 of the 10 computers in the population have illegal software. We want the probability that 2 of the 3 selected computers have illegal software, so x = 2 and n = 3. There are two sets of quantities: for the population, N = 10 and A = 4; for the sample, n = 3 and x = 2.
Substituting everything into the formula gives 0.3. What is the meaning of this 0.3? The probability that 2 of the 3 selected computers have illegal software loaded is 30%.
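The same answer can be obtained from scipy's hypergeometric distribution; note that scipy's argument order differs from the notation above:

```python
from scipy.stats import hypergeom

# scipy order is pmf(k, M, n, N): M = population size (10), n = successes in
# the population (4 computers with illegal software), N = sample size (3),
# k = number of successes wanted in the sample (2).
print(hypergeom.pmf(2, 10, 4, 3))   # 4C2 * 6C1 / 10C3 = 36/120 = 0.30
```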
(Refer Slide Time: 16:38)
210
Then we will go to continuous probability distributions. A continuous random variable is a variable that can assume any value in a continuum, that is, an uncountable number of values. The thickness of an item, the time required to complete a task, the temperature of a solution and height are examples of continuous random variables. Such a variable can potentially take any value, limited only by our ability to measure precisely and accurately.
First we will see the uniform distribution. The uniform distribution is the probability distribution that assigns equal probability to all possible outcomes of the random variable. This is also the idea behind random numbers: when we say random numbers, the probability of choosing any number is the same. Because of its shape it is also called the rectangular distribution.
(Refer Slide Time: 17:31)
211
Look at the uniform distribution. It is defined on an interval a ≤ x ≤ b, and the probability density function is f(x) = 1/(b - a), where 'b' is the upper limit and 'a' is the lower limit. Outside this interval the value of f(x) is 0, and the total area under the curve equals 1.
(Refer Slide Time: 17:51)
212
Now we will see a problem using the uniform distribution. Suppose a uniform probability distribution is defined over the range 2 ≤ x ≤ 6. Then f(x) = 1/(b - a), where b is 6 and a is 2, so f(x) = 1/(6 - 2) = 0.25. What does this 0.25 mean? Between 2 and 6 the density is the same everywhere, so the density at 3, 4 or 5 is 0.25. The mean = (a + b)/2, where a is 2 and b is 6, giving 8/2 = 4. Similarly, the standard deviation = (b - a)/√12 = 4/√12, which is about 1.15.
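A minimal check with scipy's uniform distribution, which is parameterised by loc = a and scale = b - a:

```python
from scipy.stats import uniform

a, b = 2, 6
u = uniform(loc=a, scale=b - a)   # uniform distribution on [2, 6]
print(u.pdf(4))                   # density 1/(b - a) = 0.25
print(u.mean(), u.std())          # mean 4.0, standard deviation about 1.15
```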
(Refer Slide Time: 18:56)
213
We will see another example. Suppose a uniform random variable is defined between 41 and 47. First we find the probability density function: 1 divided by (b - a) = 1/(47 - 41) = 1/6. So the height of this rectangular distribution is 1/6, with lower limit 41 and upper limit 47.
(Refer Slide Time: 19:21)
214
Now the uniform distribution is defined on the interval 41 to 47. Suppose we want the probability that x lies between 42 and 45. That probability is (x2 - x1)/(b - a), where x2 is 45 and x1 is 42. So 45 - 42 = 3, and 3 divided by 6 = 1/2. So this area is 0.5.
(Refer Slide Time: 20:04)
We will see another example of the uniform distribution. Consider a random variable x representing the flight time of an airplane travelling from Delhi to Mumbai. Suppose the flight time can take any value in the interval from 120 to 140 minutes. Because the random variable x can assume any value in this interval, x is a continuous rather than a discrete random variable.
(Refer Slide Time: 20:30)
Let us assume that sufficient actual flight data are available to conclude that the probability of the flight time falling within any one-minute interval is the same as the probability of it falling within any other one-minute interval contained in the larger interval from 120 to 140. That is the defining property of the uniform distribution: every small interval of the same width has the same probability.
With every one-minute interval being equally likely, the random variable x is said to have a uniform probability distribution. The upper limit is 140 and the lower limit is 120, so the density is 1/(140 - 120).
(Refer Slide Time: 21:20)
216
So the height of this rectangular distribution is 1 divided by 20, with a = 120 and b = 140. Suppose you are asked to find the probability that the flight time is between 120 and 130 minutes. What you have to do is compute (130 - 120) divided by (140 - 120), which simplifies to 0.50. So the probability that the flight time is between 120 and 130 minutes is 50%.
(Refer Slide Time: 22:05)
217
Next is the exponential distribution, which is useful for describing quantities such as the time between arrivals at a service facility, the time required to complete a questionnaire, or the distance between major defects in a highway. Whenever the word 'between' appears in this sense, time between arrivals or distance between defects, it is appropriate to use the exponential distribution.
(Refer Slide Time: 22:43)
The density function for the exponential probability distribution is f(x) = (1/µ) e^(-x/µ), where µ is the mean.
(Refer Slide Time: 22:52)
We will see how to construct an exponential distribution. Suppose that x represents the loading time for a truck at a loading dock and follows such a distribution. If the mean or average loading time is 15 minutes, µ = 15, then the appropriate probability density function is f(x) = (1/15) e^(-x/15).
(Refer Slide Time: 23:13)
What is the interpretation of this exponential distribution? The probability that the loading time is less than 6 minutes is the area to the left of 6, and the probability that the loading time is between 6 and 18 minutes is the shaded portion of the curve between 6 and 18.
(Refer Slide Time: 22:39)
In many applications of the exponential distribution we use the cumulative distribution function. For an exponential distribution, the cumulative distribution function is P(X ≤ x0) = 1 - e^(-x0/µ), where x0 is some specific value of x. It is nothing but the integral of the density up to x0; if you integrate the distribution over that interval and simplify, you get this expression.
(Refer Slide Time: 24:08)
We will see one example of the exponential probability distribution. The time between arrivals of cars at a petrol pump follows an exponential distribution with a mean time between arrivals of 3 minutes; note that it is the mean time between arrivals, so µ = 3. The petrol pump owner would like to know the probability that the time between 2 successive arrivals will be 2 minutes or less, so we need P(x ≤ 2).
(Refer Slide Time: 24:43)
220
To find P(x ≤ 2) we substitute into the cumulative distribution function: 1 - e^(-2/3) = 0.4866. This 0.4866 corresponds to the shaded area. Suppose instead you want the probability that the time between 2 successive arrivals is less than 7 minutes; the probability increases, because as x increases there is more chance that the gap between two successive arrivals falls below it. That is the way to interpret the exponential distribution.
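The same value with scipy's exponential distribution, whose scale parameter is the mean µ:

```python
from scipy.stats import expon

mean_gap = 3                          # mean time between arrivals, in minutes
print(expon.cdf(2, scale=mean_gap))   # P(X <= 2) = 1 - e^(-2/3), about 0.4866
print(expon.cdf(7, scale=mean_gap))   # larger x gives a larger probability
```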
(Refer Slide Time: 25:23)
Now, there is a very important relationship between the Poisson and exponential distributions. The Poisson distribution provides an appropriate description of the number of occurrences per interval: given an interval, how many occurrences happen in it. The exponential distribution provides an appropriate description of the length of the interval between occurrences.
So the gap between one occurrence and the next is described by the exponential distribution, while the number of occurrences within the interval is described by the Poisson distribution. That is the relation between the Poisson and exponential distributions.
(Refer Slide Time: 26:14)
221
What is the relationship between the mean of the Poisson distribution and the mean of the exponential distribution? Suppose the average number of arrivals is 10 cars per hour; 10 cars arrive in that interval, and this is the mean of the Poisson distribution. Then 1/10 of an hour is the mean of the exponential distribution, that is, the mean time between arrivals. So if µ is the mean of the Poisson distribution, 1/µ is the mean of the corresponding exponential distribution; that is the link between the Poisson and exponential distributions.
(Refer Slide Time: 27:05)
Next we enter a very important distribution, the normal distribution; we can say it is the father of all distributions, because if some phenomenon is happening and you are not aware of its distribution, you can assume that it follows the normal distribution. The normal distribution is bell shaped and symmetrical, and its mean, median and mode are equal. The location of the normal distribution is characterized by µ, the spread is characterised by σ, and the random variable has an infinite theoretical range, from minus infinity to plus infinity.
(Refer Slide Time: 27:50)
The density function of a normal distribution is f(X) = (1/(σ√(2ᴨ))) e^((-1/2)((X - µ)/σ)²), where e is a mathematical constant with value about 2.71, ᴨ is the mathematical constant 3.14, µ is the population mean, σ is the population standard deviation and X is any value of the continuous variable.
(Refer Slide Time: 28:18)
223
By varying the parameters µ and σ we obtain normal distributions of different shapes. Dear students, so far we have seen some discrete and continuous distributions: among the discrete distributions we talked about the binomial and Poisson distributions, and among the continuous distributions we saw the exponential and uniform distributions. The very important normal distribution will be covered in the next class. Thank you very much.
224
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology Roorkee
Lecture – 10
Probability Distributions - III
Welcome back, students. Now we are going to discuss another important continuous distribution, the normal distribution. The normal distribution can be called the mother of all distributions because, if you are not aware of the nature of the distribution underlying some phenomenon, you can assume that it follows the normal distribution. Most of the statistical tests and analytical tools we are going to use in this course rest on the assumption that the data follow a normal distribution, so knowing the properties, behaviour and assumptions of the normal distribution is very important for this course. One of its properties: the normal distribution is a bell-shaped curve.
(Refer Slide Time: 01:16)
The curve forms a bell shape. It is symmetrical: if you fold it, both sides are the same. Another important property is that the mean, median and mode are equal. The location is characterized by the mean µ, the spread is characterized by the standard deviation σ, and the random variable has an infinite theoretical range, from minus infinity to plus infinity.
(Refer Slide Time: 01:48)
225
The formula for the normal probability density function is f(X) = (1/(σ√(2ᴨ))) e^((-1/2)((X - µ)/σ)²), where e is the mathematical constant 2.71828, ᴨ is the mathematical constant 3.14, µ is the population mean, σ is the population standard deviation and X is any value of the continuous variable.
(Refer Slide Time: 02:18)
The shape of the normal distribution changes with its spread: by varying the parameters µ and σ we obtain different normal distributions. For example, in one curve σ is very small, in another σ is moderate, and in another σ is very large.
(Refer Slide Time: 02:39)
226
Changing µ shifts the distribution left or right: increasing or decreasing µ moves the curve to the right or to the left. Changing σ increases or decreases the spread: when you decrease σ the spread decreases, and when you increase σ the spread increases.
(Refer Slide Time: 03:01)
There is another normal distribution, the standardized normal distribution. Any normal distribution, whatever its mean and standard deviation, can be transformed into the standardized normal distribution. What you have to do is transform X units into Z units, where Z = (X - µ)/σ. The standardized normal distribution has mean 0 and variance (and standard deviation) 1.
227
(Refer Slide Time: 03:36)
The translation from X to the standardized normal, the Z distribution, is done by subtracting the mean of X and dividing by the standard deviation. The conversion from a normal distribution to the standardized normal distribution uses the Z transformation
Z = (X - µ) / σ,
where X is the random variable, µ is the mean of the population and σ is the standard deviation of the population.
(Refer Slide Time: 04:05)
The formula for the standardized normal probability density function is obtained by substituting Z = (X - µ)/σ in the previous equation, which gives f(Z) = (1/√(2ᴨ)) e^(-Z²/2), where ᴨ is the mathematical constant and Z is any value of the standardized normal distribution.
(Refer Slide Time: 04:26)
The standardized normal distribution, also known as the Z distribution, has mean 0 and standard deviation 1. Values above the mean have positive Z values, and values below the mean have negative Z values.
(Refer Slide Time: 04:44)
Let us see how to do the conversion from a normal distribution to the standardized normal distribution. If X is distributed normally with a mean of 100 and a standard deviation of 50, and X = 200, then the corresponding Z value is (X - µ)/σ = (200 - 100)/50 = 2.0. This says that X = 200 is two standard deviations above the mean of 100, that is, 2 increments of 50 units. The Z value is nothing but the number of standard deviations from the mean; here there are 2 increments of 50, which is why Z = 2.
(Refer Slide Time: 05:41)
Look at the conversion now; this will be very convenient. We were asked: when X = 200, what is the corresponding Z value? The red curve shows the original normal distribution and the black one shows the standardized normal distribution. You can see that the mean of the original distribution, µ, becomes 0 on the standardized scale, and X = 200 in the normal distribution corresponds to Z = 2 in the standardized normal distribution, where the mean is 0 and σ = 1. Note that the distribution is the same; only the scale has changed. We can express the problem in original units or in standardized units, but there is an advantage to standardizing. Why do we convert to the standardized normal distribution? Sometimes you are required to find an area under the distribution; if you do not standardize, you cannot use the Z statistical table, and every time you want an area you would have to integrate.
That is a very cumbersome process, which is why every normal distribution is converted to the standardized normal distribution: you can look up the area for the Z value directly from the table, which simplifies the task.
230
(Refer Slide Time: 07:13)
In a continuous distribution, probability is measured by the area under the curve, and it always has to be expressed between two points a and b. If you want the probability of exactly a or exactly b, that does not enclose any area, so the probability is 0. In the context of a continuous distribution, probability means area under the curve; for a discrete probability distribution, the probability can be read directly by looking at x and the corresponding P(x).
(Refer Slide Time: 07:53)
231
The total area under the curve is 1 and the curve is symmetric, so half the area is above the mean and half is below: P(-∞ ≤ X ≤ µ) = 0.5 and similarly P(µ ≤ X ≤ +∞) = 0.5, so the total area is 1.
(Refer Slide Time: 08:16)
Suppose you want the area for Z less than 2.00; that area, P(Z < 2.00), is 0.9772.
(Refer Slide Time: 08:30)
One way is to read it directly from the Z table. In the table, the rows give the Z value to the first decimal point and the columns give the second decimal point. If you want Z = 2.00, you look at the row 2.0 and the column .00 and read the corresponding area. The value inside the table gives the probability from minus infinity up to the desired Z value.
When looking at a statistical table, especially a Z table, you should check carefully how the area is tabulated. There are 2 possibilities: sometimes the area is given from minus infinity up to the Z value, and sometimes the area is given only from 0 to positive values of Z. If you want a negative value of Z, you can use the symmetry of the distribution: read the value for the positive Z and carry it over to the negative side.
(Refer Slide Time: 09:53)
Now the procedure for finding a normal probability P(a < X < b) when X is distributed normally. The first step is to draw the normal curve of the problem in terms of X; whenever you are going to find an area, it is always good to draw the distribution so you can read the situation intuitively from the picture. The next step is to translate the X values into Z values, and then use the standardized normal table to get the area.
(Refer Slide Time: 10:34)
233
Let X represent the time it takes to download an image file from the internet, and suppose X is normal with mean 8 and standard deviation 5. We want P(X < 8.6), that is, the probability that the download time is below 8.6. First mark the mean, then locate the value 8.6; since we are asked for less than 8.6, we need the area to the left of it. You could integrate the normal density from minus infinity to 8.6 and get the area, there is no problem with that, but it is a very time-consuming process.
(Refer Slide Time: 11:26)
An easier way is to convert the normal distribution into the standard normal distribution: the X value is converted to the Z scale, and then the table gives the area for the corresponding Z value. With X normal with mean 8 and standard deviation 5 and X < 8.6, use Z = (X - µ)/σ, which gives Z = (8.6 - 8)/5 = 0.12. Now with Z = 0.12 you can read the probability directly from the normal table.
(Refer Slide Time: 12:06)
For Z = 0.12, the area to the left is 0.5478.
(Refer Slide Time: 12:15)
Now suppose we want P(X > 8.6), so we need the area on the right side. We again convert to the Z scale; since the total area is 1, P(X > 8.6) = 1 - P(Z < 0.12) gives the right-hand area. When Z = 0.12 the corresponding left-hand area is 0.5478, so subtracting from one we get 0.4522.
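Both tail areas can be reproduced with scipy's normal distribution; a minimal sketch with the same mean and standard deviation:

```python
from scipy.stats import norm

mu, sigma = 8, 5
print(norm.cdf(8.6, loc=mu, scale=sigma))   # P(X < 8.6), about 0.5478
print(norm.sf(8.6, loc=mu, scale=sigma))    # P(X > 8.6) = 1 - cdf, about 0.4522
```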
(Refer Slide Time: 12:56)
Suppose X is normal with mean 8 and standard deviation 5, and we want P(8 < X < 8.6). Two values of X are given and both have to be converted: when X = 8 we get Z = 0, and when X = 8.6 we get Z = 0.12. So we need the area between Z = 0 and Z = 0.12.
(Refer Slide Time: 13:28)
236
One way, using the table, is first to find the area from minus infinity up to Z = 0.12, which is 0.5478, and then subtract the area to the left of Z = 0, which we know is 0.5. The remainder is 0.0478.
(Refer Slide Time: 13:50)
Now the reverse problem: the probability is given and you have to find the X value. Let X represent the time it takes to download an image file from the internet, with X normal with mean 8 and standard deviation 5. Find X such that 20% of the download times are less than X. There are 2 points here: one is 'less than X' and the other is '20%'. So when the area on the left-hand side equals 0.2, what is the corresponding X value? For that, first you have to find the Z value, and from the Z value you find the X value.
(Refer Slide Time: 14:35)
237
Now look at the table: when the left-hand area equals 0.2, the corresponding Z value is -0.84.
(Refer Slide Time: 14:45)
We know that Z = -0.84, and from the formula Z = (X - µ)/σ, knowing Z, we can solve for X. One more thing: when finding the Z value you should be careful about which kind of Z table you are using. If the table gives the area from 0 to positive Z, and you are measuring an area on the left-hand side, you will get the magnitude of Z from the table but you have to attach a negative sign to it. So we should be careful.
With µ = 8.0, X = 8.0 + (-0.84)(5) = 3.80. So 20% of the download times from the distribution with mean 8 and standard deviation 5 are less than 3.8 seconds.
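This inverse lookup is what scipy's percent point function (ppf) does; a sketch with the same numbers:

```python
from scipy.stats import norm

mu, sigma = 8, 5
z = norm.ppf(0.20)                       # about -0.8416, 20th percentile of Z
x = norm.ppf(0.20, loc=mu, scale=sigma)  # same cut-off on the original scale
print(z, x)                              # x = mu + z * sigma, about 3.79
```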
(Refer Slide Time: 15:47)
Another important topic is assessing normality, because the normality assumption is very important for other types of inferential statistics. I will tell you why: we will be studying a concept called the Central Limit Theorem, where the sampling distribution of the sample mean follows a normal distribution. Many analytical and statistical tools rest on the assumption that the data follow a normal distribution. That is why, as soon as you collect the data, the first step is cleaning it, and part of that process is verifying whether the data follow a normal distribution or not; otherwise you may end up choosing the wrong statistical or analytical techniques.
(Refer Slide Time: 16:35)
239
It is important to evaluate how well the data set is approximated by a normal distribution. Normally distributed data should approximate the theoretical normal distribution: the distribution is bell shaped, the mean is equal to the median, the empirical rule applies, and the interquartile range of a normal distribution is about 1.33 standard deviations. These are ways to test for normality.
(Refer Slide Time: 17:04)
Another way to assess normality is to construct charts or graphs and look at the shape of the distribution. For a small or moderate-sized data set, do a stem-and-leaf display and a box-and-whisker plot and check whether they look symmetrical. As I told you at the beginning of the lectures, if the stem-and-leaf plot follows a bell-like shape, we can say the data follow a normal distribution. In the box-and-whisker plot, the middle line, the median line, should be in the middle of the box; only then can we say the data set follows a normal distribution. For a large data set, check whether the histogram or polygon appears bell shaped; you can draw a histogram and verify whether it follows a normal distribution.
Another way is to compute descriptive summary measures: check whether the mean, median and mode are similar in value, whether the interquartile range is approximately 1.33 σ, and whether the range is approximately 6 σ. These are some descriptive measures to check whether the data follow a normal distribution. You can also compute the skewness: when the skewness is 0, we can say the data follow a normal distribution.
(Refer Slide Time: 18:31)
Some more checks on the observed distribution of the data set: do approximately 2/3 of the observations lie within ±1 σ of the mean? Do approximately 80% of the observations lie within ±1.28 standard deviations? Do approximately 95% of the observations lie within ±2 standard deviations of the mean? If so, we can say the data follow a normal distribution. Next, the Z table.
(Refer Slide Time: 19:02)
241
You see that this Z table starts from 0, not from minus infinity. The columns give the second decimal; when Z = 0 the tabulated area is 0, because only one side is given, the area from 0 up to Z. So if you want the area from minus infinity, you have to add 0.5: for example, for Z = 0.0 you add 0.5 to the table value. There is another important point which I want to share with you.
(Refer Slide Time: 19:39)
See, between Z = 0 and Z = 1 the area is 0.3413. If you want the area from minus infinity to 1, you have to add 0.5 to that, giving 0.8413. There is one more point about the normal distribution; I will come back to that.
(Refer Slide Time: 20:04)
242
Suppose X is normally distributed with µ = 485 and σ = 105, and we want P(485 ≤ X ≤ 600). When X = 485 the standardized value is 0, and when X = 600 the corresponding Z value is (600 - 485)/105 ≈ 1.1, so the required area, from Z = 0 to Z = 1.1, is 0.3643. Dear students, so far we have seen the properties of the normal distribution, the standardized normal distribution and how the two are related, and how to find areas with the help of the table. One more property follows.
(Refer Slide Time: 20:51)
Look at the shape of the normal distribution: the x axis is the variable and the y axis is the probability density, and the curve does not touch the x axis. You may have the doubt: why does the normal distribution not touch the axis? Suppose I am plotting the ages of the students in a class and they follow a normal distribution with an average age of, say, 19. If I closed the curve at the ends, I would rule out the possibility that somebody's age is around 30 or around 10.
Since this normal distribution was drawn with the help of a sample, I do not know for sure whether such rare values of X, say X = 30 or X = 10, can occur. So why do we not close the curve? Why does the normal distribution not touch the x axis? Because we are giving provision for rare events: X may take a very high value or a very low value, but we are not sure about that. That is why the normal distribution does not touch the x axis. Another doubt may arise when you look at the Z table.
may know when you look at the Z table.
When you look at the Z table, the value of Z usually stops around 3.5. The question may come: why does the statistical table go only to about 4 or 5 at most? Remember, at the beginning of the class I said that if you travel one σ on either side of the mean you capture about 68% of the area, if you travel 2 σ on either side you capture about 95% of the area of the normal distribution, and if you travel 3 σ on either side you cover about 99.7% of the data. So why does the table not go far beyond 3? Because the probability of a Z value beyond ±3 is only about 0.3%; equivalently, the probability of an x value being extremely high or extremely low is only about 0.3%.
Since there is only about a 0.3% chance that the value exceeds 3 in magnitude, statistical tables go only up to about 3.5 or 4, not beyond. This is also the other reason why we do not close the curve at the x axis: the probability of such extreme events is only about 0.3%, provided the data follow a normal distribution. Now I will summarize: so far we have seen different types of probability distributions.
In the previous class we saw some continuous distributions, and in this class we have seen an important one, the normal distribution. We have learned the properties of the normal distribution and the standard normal distribution, how to convert a normal distribution to a standard normal distribution, and how to refer to the Z table to find areas. In the next class, with the help of Python, we will see how to find the area under the curve and how to find the mean and standard deviation of different distributions. Thank you very much.
245
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 11
Python Demo for Distribution
Dear students, in the last lecture we saw probability distributions; in this lecture, with the help of Python, we will solve some problems on probability distributions.
(Refer Slide Time: 00:44)
The problems are taken from a book written by Ken Black; the title of the book is Applied Statistics. Now I am importing scipy, importing numpy as np, and from scipy.stats importing binom, which is for binomial calculations. One more thing: you can also embed a picture in the notebook. For example, the empirical distribution picture I have taken from a source is displayed using an exclamation mark followed by square brackets, with the link placed in parentheses rather than inside the square brackets; when you run the cell you get the picture directly. I am executing this.
Now we will see the problem. A survey found that 65% of all financial consumers were very satisfied with their primary financial institution. Suppose that 25 financial consumers are sampled; if the survey result still holds true today, what is the probability that exactly 19 are very satisfied with their primary financial institution? Looking at the problem, we have to see what kind of distribution to use: there are 2 possibilities, satisfied or not satisfied.
Since there are only 2 possible outcomes, we can go for the binomial distribution: print binom.pmf with k equal to 19, the number of successes in the current context, n equal to the sample size, and p the probability of success. We can see that the answer is about 0.09, so the probability that exactly 19 are very satisfied with their primary financial institution is 0.09.
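A sketch of the commands described above as they might appear in the notebook (the embedded image link is omitted here):

```python
from scipy.stats import binom

# 65% of consumers are satisfied; sample of 25; probability that exactly 19
# of them are very satisfied.
print(binom.pmf(k=19, n=25, p=0.65))   # about 0.09
```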
(Refer Slide Time: 02:43)
We go to the next problem, also taken from that book. According to the US Census Bureau, approximately 6 percent of all workers in Jackson, Mississippi are unemployed. In conducting a random telephone survey in Jackson, what is the probability of getting 2 or fewer unemployed workers in a sample of 20? Here we want 2 or fewer, that is, the probability of 0 plus the probability of 1 plus the probability of 2.
So we use the cumulative distribution function. For that you type binom.cdf(2, 20, 0.06), where 2 represents 'less than or equal to 2', 20 is the sample size, and 0.06 is the probability of success. When you run this you get a cumulative probability of about 88.5%.
247
We will take another problem: solve the binomial probability for n = 20, p = 0.4 and x = 10. Using binom.pmf(10, 20, 0.4) you get the answer 0.117.
We will go to the next distribution, the Poisson distribution. For Poisson calculations you have to import the library: from scipy.stats import poisson. First we find a Poisson probability mass function with poisson.pmf(3, 2), where 3 is the value of x and 2 is the mean. We will see another problem: suppose bank customers arrive randomly on a weekday afternoon at an average of 3.2 customers every 4 minutes. What is the probability of exactly 5 customers arriving in a 4-minute interval on a weekday afternoon?
Looking at the problem, we know that the arrival pattern follows a Poisson distribution, and you have to be careful that the unit of the mean and the unit of x are the same; here both are per 4 minutes, so there is no problem. You simply call poisson.pmf(5, 3.2), where 5 is the x value and 3.2 is the arrival rate, which gives about 11.39%. We will see one more problem: bank customers arrive randomly on a weekday afternoon at an average of 3.2 customers every 4 minutes; what is the probability of having more than 7 customers in a 4-minute interval on a weekday afternoon?
Here we have to find the probability of x greater than 7. First we find the cumulative probability up to 7 and save it in an object called prob, equal to poisson.cdf(7, 3.2); then 1 minus this value gives the probability of more than 7. We will see another problem on the Poisson distribution.
On a bank has an average random arrival rate of 3.2 customers every 4 minutes what is the
probability of getting exactly 10 customers during 8 minutes interval. now it should be very
careful here the unit of x and unit of lambda are different, because it's a 4 minutes it is 8 minutes
so you have to convert into same units so multiply by 32 by 2 you will get 6.4, so lambda equal
to 10 so Poisson dot pmf of 10, 6.4 will give you the answer for 0.0527.
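A sketch of the Poisson calculations just described, using the lecture's numbers:

from scipy.stats import poisson

print(poisson.pmf(3, 2))          # P(X = 3) when the mean is 2, ~0.180

# P(exactly 5 arrivals in 4 minutes), lambda = 3.2 per 4 minutes
print(poisson.pmf(5, 3.2))        # ~0.114

# P(more than 7 arrivals in 4 minutes)
prob = poisson.cdf(7, 3.2)
print(1 - prob)                   # ~0.017

# P(exactly 10 arrivals in 8 minutes): rescale lambda to 3.2 * 2 = 6.4
print(poisson.pmf(10, 6.4))       # ~0.053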
(Refer Slide Time: 06:47)
We will go to the uniform distribution next. Suppose the amount of time it takes to assemble a plastic module ranges from 27 to 39 seconds and the assembly times are uniformly distributed. Describe the distribution: what is the probability that a given assembly will take between 30 and 35 seconds? First we build the array, so u equals np.arange(27, 40). There are two functions, range and arange: arange gives an array, while range simply gives a list.

So 27 is the starting value, and we write 40 because the stop value is exclusive, so 39 will be the last value in the array; the increment is 1. Now this is our uniform distribution. Then, from scipy.stats import uniform, and we find the mean of this distribution: uniform.mean with loc as the starting point, 27, and scale as 12, so it covers 27 + 12 = 39. That is the syntax, and the mean is 33. Otherwise, for a uniform distribution, finding the mean is not a complicated formula; you simply compute (a + b) / 2.

Then we compute the cumulative distribution, cdf: uniform.cdf over np.arange(30, 36), because the question asked for 30 to 35 and to include 35 you have to go up to 36, with increment 1, starting point (loc) 27 and scale 12. This gives the cumulative probability at each point from 30 to 35: the probability up to 30 is 0.25, up to 31 is 0.33, up to 32 is 0.42, and so on. Since we want the probability between 30 and 35, and for 30 the value is 0.25 while for 35 it is 0.667, subtracting 0.667 - 0.25 gives about 0.417, the probability that a given assembly will take between 30 and 35 seconds.
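A short sketch of this uniform-distribution example (loc = 27, scale = 12 as described above):

import numpy as np
from scipy.stats import uniform

u = np.arange(27, 40)                         # 27, 28, ..., 39
print(uniform.mean(loc=27, scale=12))         # 33.0, same as (27 + 39) / 2

# Cumulative probabilities at 30, 31, ..., 35
print(uniform.cdf(np.arange(30, 36), loc=27, scale=12))

# P(30 < X < 35) = F(35) - F(30)
print(uniform.cdf(35, 27, 12) - uniform.cdf(30, 27, 12))   # ~0.417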
(Refer Slide Time: 09:33)
We will see one more problem. According to the National Association of Insurance Commissioners, the average annual cost of automobile insurance in the United States in a recent year was 691 dollars. Suppose automobile insurance costs are uniformly distributed in the United States from 200 dollars to 1182 dollars; what is the standard deviation of this uniform distribution? So we have to find the standard deviation of this distribution. Before that we will check the mean; the mean is given as 691 dollars.

We verify this with uniform.mean: the starting point (loc) is 200 and the scale is 982, that is, 1182 minus 200, and this gives exactly 691 dollars. If you want to know the standard deviation of the uniform distribution, the formula is different; it is not the usual standard deviation formula, so uniform.std with loc 200 and scale 982 gives the standard deviation, 283.48.
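A sketch of the same check in scipy (the standard deviation of a uniform distribution is its range divided by the square root of 12):

from scipy.stats import uniform

# Insurance cost uniform on [200, 1182]: loc = 200, scale = 1182 - 200 = 982
print(uniform.mean(loc=200, scale=982))   # 691.0
print(uniform.std(loc=200, scale=982))    # ~283.48, i.e. 982 / sqrt(12)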
Next we move to the normal distribution. Here also I have inserted a picture of the normal distribution, taken from a link, again using the exclamation-mark and square-bracket image syntax; when you execute it you get the picture of the probability distribution. First we have to import the library norm from scipy, so from scipy.stats import norm. Then we give the value, the mean and the standard deviation: 68, 65.5, 2.5. Suppose x equals 68, the mean of the normal distribution is 65.5 and the standard deviation is 2.5; what is the probability? We run that (you have to run the import first as well), and the probability is 0.8413.
That gives the cumulative probability of x being less than or equal to that value. If you want the probability of x being greater than a value, you have to subtract the cdf from 1. Suppose we want the value 68 and above: we already know the area up to 68, and since the total area under the normal distribution is 1, 1 minus that value gives the right-side area. Suppose you want the probability between x1 and x2, that is, value 1 less than or equal to x less than or equal to value 2: it is very simple, print norm.cdf at the upper limit minus norm.cdf at the lower limit (the mean and standard deviation are already declared). For example, if we want the area between the x values 63 and 68, this simple subtraction saves a lot of time. Now suppose we ask: what is the probability of obtaining a score greater than 700 on a GMAT test that has mean 494 and standard deviation 100, assuming GMAT scores are normally distributed?

This is another example of P(x > 700) when the mean equals 494 and the standard deviation is 100. Because we want x greater than or equal to 700, we find the cdf at x equal to 700 and subtract from 1, so print 1 minus norm.cdf(700, 494, 100) gives the answer. What is the probability of randomly drawing a score of 550 or less? We need P(x less than or equal to 550), so norm.cdf(550, 494, 100) gives the answer.

What is the probability of randomly obtaining a score between 300 and 600 on the GMAT examination? (This problem is taken from Statistics for Management by Levin and Rubin.) Now the upper limit is 600 and the lower limit is 300; for the probability between 300 and 600, print norm.cdf(600, 494, 100), the upper limit, minus norm.cdf(300, 494, 100), the lower limit.
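A compact sketch of the normal-distribution calls described above (the numeric results in the comments are what scipy returns, rounded):

from scipy.stats import norm

# P(X <= 68) when mean = 65.5, sd = 2.5
print(norm.cdf(68, 65.5, 2.5))                              # ~0.8413

# GMAT scores: mean = 494, sd = 100
print(1 - norm.cdf(700, 494, 100))                          # P(X > 700)        ~0.0197
print(norm.cdf(550, 494, 100))                              # P(X <= 550)       ~0.7123
print(norm.cdf(600, 494, 100) - norm.cdf(300, 494, 100))    # P(300 < X < 600)  ~0.829
print(norm.cdf(450, 494, 100) - norm.cdf(350, 494, 100))    # P(350 < X < 450)  ~0.255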
What is the probability of getting a score between 350 and 450 on the same GMAT exam? Here the limits are 350 and 450; this is another example, similar to the previous one. Now we are going to do the reverse. So far we have been finding the cdf, the cumulative probability. Now suppose the area is given: if the area is given we want to know the x value, or, if it is a standard normal distribution, the z value, because the default function is the standard normal distribution with mean equal to 0 and standard deviation 1.

For an area of 0.95 the corresponding z value is 1.645; this value you can read from the table, the same way I explained in my theory lecture. For this we use norm.ppf, the percent point function, which is the inverse of the cdf. So norm.ppf of 1 - 0.6772 gives the z value on the left-hand side; when we check, the z value is about minus 0.459.
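A sketch of the inverse lookup with the percent point function:

from scipy.stats import norm

# Inverse of the cdf: area -> z value
print(norm.ppf(0.95))          # ~1.645
print(norm.ppf(1 - 0.6772))    # ~-0.46, a left-tail z value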
(Refer Slide Time: 15:31)
Now we will see an example of the hypergeometric distribution. The example says: suppose 18 major computer companies operate in the United States and 12 are located in California's Silicon Valley. If 3 computer companies are selected randomly from the entire list, what is the probability that one or more of the selected companies are located in Silicon Valley? The thing to notice here is "one or more", so we have to find the probability of getting one or more selected companies.

So, from scipy.stats import hypergeom, and the value equals hypergeom.sf, where sf means survival function. If it is one or more, that means greater than 1 - 1 = 0, so the first argument is 0; 18 represents the population size, 3 is the number we are choosing, that is, the sample size, and 12 is the number of successes in the population, the capital A in the notation we used in the theory lecture. The resulting probability is 0.9754.
(Refer Slide Time: 16:45)
We will see another example: a western city has 18 police officers eligible for promotion, and 11 of the 18 are Hispanic. Suppose only 5 of the police officers are chosen for promotion; if the officers chosen for promotion had been selected by chance alone, what is the probability that one or fewer of the 5 promoted officers would have been Hispanic? What we need here is "one or fewer", so we have to find the cumulative probability, and for a hypergeometric distribution that is the cdf.

I am going to save it in an object: the value equals hypergeom.cdf(1, ...), where 18 represents the population size, 5 the sample size (because we are choosing 5), and 11 the number of successes in the population. When you run this you get 0.04738.
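A sketch of both hypergeometric calculations. In scipy's documented argument order the shape parameters are (M, n, N): M the population size, n the number of successes in the population, N the sample size; because the hypergeometric distribution is symmetric in n and N, swapping those last two arguments, as described in the lecture, gives the same numbers.

from scipy.stats import hypergeom

# Silicon Valley: M = 18 companies, n = 12 in Silicon Valley, N = 3 sampled
print(hypergeom.sf(0, 18, 12, 3))    # P(X >= 1) = P(X > 0), ~0.9755

# Promotions: M = 18 officers, n = 11 Hispanic, N = 5 chosen
print(hypergeom.cdf(1, 18, 11, 5))   # P(X <= 1), ~0.0474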
(Refer Slide Time: 17:43)
Now we go to the next example, on the exponential distribution; we will take a sample problem. A manufacturing firm has been involved in statistical quality control for several years. As part of the production process, parts are randomly selected and tested, and from the records of these tests it has been established that defective parts occur in a pattern that is Poisson distributed, at an average of 1.38 defects every 20 minutes during production runs. Use this information to determine the probability that less than 15 minutes will elapse between any 2 defects.

You have to look at two things in this problem: one is that the mean of the Poisson distribution is given, and the second is the phrase "between any 2 defects". As I told you in the theory lecture, whenever it is the time between two events you have to go for the exponential distribution. First we find the mean of the exponential distribution: it is 1 divided by the mean of the Poisson distribution. Here the Poisson mean is 1.38, so we call the exponential mean mu1, where mu1 equals 1 divided by 1.38.

Now, what was asked is the probability of less than 15 minutes. From scipy we have to import the exponential function, and we have to find the cumulative probability, expon.cdf. The value 0.75 appears because 15 divided by 20 is 0.75: the Poisson mean was per 20 minutes, while the question about the exponential distribution asks about 15 minutes.
So we divide 15 by 20 so that the units match. We then need the cumulative distribution function of the exponential distribution: the lower limit of x is 0 and the upper limit is 0.75, so expon.cdf with the upper limit 0.75, the lower limit (loc) 0 and the scale 1/1.38 gives 0.644.
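A minimal sketch of this exponential calculation, measuring time in 20-minute units so that the mean time between defects is 1/1.38:

from scipy.stats import expon

# 15 minutes = 15/20 = 0.75 of a 20-minute interval
print(expon.cdf(0.75, loc=0, scale=1/1.38))   # ~0.645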
Students, so far we have seen the binomial distribution and how to use Python for it. Then we have seen the Poisson distribution and the uniform distribution. We have also seen the normal distribution, the exponential distribution and the hypergeometric distribution. We will continue in the next class with a new topic, sampling and sampling distributions. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 12
Sampling and Sampling Distribution
Dear students, we are going to the next lecture, on sampling and sampling distributions. The objective of this lecture is to describe a simple random sample and explain why sampling is important.
(Refer Slide Time: 00:40)
We will explain the difference between descriptive and inferential statistics and define the concept of a sampling distribution; determine the mean and standard deviation of the sampling distribution of the sample mean; see the very important theorem of this class, the central limit theorem, and its importance; determine the mean and standard deviation of the sampling distribution of the sample proportion; and at the end we will see the sampling distribution of sample variances.
(Refer Slide Time: 01:17)
The whole of statistics can be classified into 2 categories: one is descriptive statistics and the other is inferential statistics. Descriptive statistics is only for collecting, presenting and describing the data as it is; it is very basic statistics. Inferential statistics is about drawing conclusions or making decisions concerning a population based on sample data: with the help of sample data we are going to infer something about the population. So when we say population, you should know what the population is and what the sample is.
(Refer Slide Time: 01:58)
The population is the set of all items or individuals of interest, for example all likely voters in the next election, all parts produced today, or all sales receipts for November. The sample is a subset of the population, like 1000 voters selected at random for interview, a few parts selected for destructive testing, or random receipts selected for audit. These are examples of samples.
(Refer Slide Time: 02:33)
When you look at the left-hand side there is a bigger circle, that is the population; from there some members are picked, and the collection of those picked values is called a sample.
(Refer Slide Time: 02:35)
The question may come: why do we sample? It is less time-consuming than a census and less costly to administer than a census. It is possible to obtain statistical results of sufficiently high precision based on samples. Because the testing process is sometimes destructive, sampling can save the product. If accessing the whole population is impossible, sampling is the only option. Sometimes you have to go for a census too; in a census we examine each and every item in the population.
(Refer Slide Time: 03:17)
Suppose you need higher accuracy and you are not comfortable with sample data; then you go for a census. The reasons for taking a census are that a census eliminates the possibility that a random sample is not representative of the population (many times there is a chance that the sample you have taken may not represent the population), or that the person authorizing the study is uncomfortable with sample information; in those cases you go for a census.
(Refer Slide Time: 03:40)
We will see what sampling is. Sampling is selecting some items from the population. It can be classified in two ways: one is random sampling and the other is non-random sampling. In random sampling the concept of randomness is taken care of; in non-random sampling the randomness is not there. Sometimes we may go for non-random sampling even though it is not so comfortable and not good for many statistical analyses; sometimes we have to go for non-random sampling.

But in random sampling, the generalizations you make with the help of random samples are highly robust. So what is random sampling? Every unit of the population has the same probability of being included in the sample; that is the concept of randomness. A chance mechanism is used in the selection process: for example, we can use a random number table or a calculator to choose someone randomly, which eliminates bias in the selection process. This is also known as probability sampling.

Then we go to non-random sampling: every unit of the population does not have the same probability of being included in the sample. It is open to selection bias, and it is not an appropriate data collection method for most statistical methods; so it is not a good method for many statistical analyses. It is also known as non-probability sampling.
(Refer Slide Time: 05:18)
260
For random sampling techniques there are 4 ways of selecting a random sample: one is the simple random sample, the second is the stratified random sample (proportionate or disproportionate), the third is the systematic random sample, and the fourth is cluster or area sampling. In a simple random sample every object in the population has an equal chance of being selected, and objects are selected independently.
(Refer Slide Time: 05:48)
Samples can be obtained from a table of random numbers or computer random number generators. The simple random sample is the ideal against which other sampling methods are compared; it is the best method.
(Refer Slide Time: 05:59)
Suppose there are 20 states and I want to choose some states randomly for a study. The first task is to give each state a 2-digit serial number, 01, 02, and so on; this is only for illustration purposes, since the actual number of states is more than 20.
(Refer Slide Time: 06:22)
Next I am using the random number table to choose the states randomly. You can see this random number table and follow any 2-digit sequence, because a random table can be read in any direction. Suppose I am reading left to right: 99, 43, 78, 76, 61, 45 and so on, then 53, and the next is 16. Since 16 is a valid serial number, I choose serial number 16 and go back to the list to find the corresponding state, which is Tamilnadu. So one state is chosen; the next valid random number is 18, which is Kerala, so the next state is chosen.
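Instead of a printed random number table, a computer random number generator can do the same job. A minimal sketch, assuming hypothetical serial numbers 1 to 20 standing in for the numbered states (the seed is only for reproducibility):

import numpy as np

states = np.arange(1, 21)                      # hypothetical serial numbers 01..20
rng = np.random.default_rng(seed=1)
sample = rng.choice(states, size=2, replace=False)
print(sample)                                  # two randomly chosen serial numbers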
(Refer Slide Time: 08:08)
Then we go to stratified sampling. The population is divided into non-overlapping subpopulations called strata, and a random sample is selected from each stratum; this has the potential for reducing sampling error. We can go for proportionate stratified sampling, where the percentage of the sample taken from each stratum is proportional to the percentage that the stratum forms within the population. We can also go for disproportionate stratified sampling, where the proportions of the strata within the sample are different from the proportions of the strata within the population.
(Refer Slide Time: 08:41)
For example, take a stratified random sample of a population of FM radio listeners. What I have done is divide the whole population into 3 strata: 20 to 30, 30 to 40, and 40 to 50. Each stratum is homogeneous within itself; between the strata there may be differences, maybe different variances, but within the same stratum the members show similar kinds of behavior. Why does it reduce the sampling error? If you choose something from the 20 to 30 stratum, all the selected members will have similar characteristics, and if you choose some members from 40 to 50, that sample will also have similar characteristics. So between the strata it is heterogeneous, and within a stratum it is homogeneous.
(Refer Slide Time: 09:39)
The next method is systematic sampling; it is convenient and relatively easy to administer. The population elements are ordered in a sequence. The first sample element is selected randomly from the first k population elements; thereafter the sample elements are selected at a constant interval k from the ordered sequence frame. What is k? k is the population size divided by the sample size; it represents the size of the selection interval. We will see an example.
(Refer Slide Time: 10:12)
Suppose the purchase orders from the previous fiscal year are serialized 1 to 10,000, so capital N is 10,000, and a sample of n equal to 50 purchase orders needs to be selected for an audit. Here k is 10,000 divided by 50, that is 200; k is the interval. The first sample element is randomly selected from the first 200 purchase orders; assuming we have chosen the 45th purchase order, from the 45th you keep adding 200, so 45, 245, 445, 645, and so on.
(Refer Slide Time: 10:12)
Then we go to cluster sampling. Here the population is divided into non-overlapping clusters or areas, and each cluster is a miniature of the population. A subset of the clusters is selected randomly for the sample; if the number of elements in the subset of clusters is larger than the desired value of n, these clusters may be subdivided to form a new set of clusters and subjected to a further random selection process. Each cluster behaves like the population itself. Now you may ask the difference between stratified sampling and cluster sampling.

In stratified sampling the items within a stratum are homogeneous, but in cluster sampling each cluster is highly heterogeneous and acts like the population. For example, the apparel cluster in Ludhiana or the apparel cluster in Tirupur are examples of clusters: each cluster reflects the overall characteristics of the population, with its own variation.
(Refer Slide Time: 12:04)
The advantages of cluster sampling are that it is more convenient for geographically dispersed populations, it reduces travel costs to contact the sample elements, and it simplifies the administration of the survey, because the cluster itself acts as a population. When the unavailability of a sampling frame prohibits using other random sampling methods, we can go for cluster sampling. The disadvantages are that it is statistically less efficient when the cluster elements are similar, and that the costs and problems of statistical analysis are greater than for simple random sampling.
(Refer Slide Time: 12:40)
The next kind of sampling technique is non-random sampling. The first one is convenience sampling: the sample is selected based on the convenience of the researcher. The next one is judgement sampling: sample elements are selected by the judgement of the researcher. For example, suppose you are administering a questionnaire that can be understood only by a manager; then you have to look only for managers, so the researcher is judging who should fill in the questionnaire.

Then there is quota sampling: sample elements are selected until quota controls are satisfied. Suppose in Uttarakhand there are some districts and from each district I have to collect some samples; I may have a quota, for example, for how many samples have to be collected in Haridwar district and how many in another district. That is quota sampling. Snowball sampling is very familiar: survey subjects are selected based on referrals from other survey respondents.

Suppose you approach one respondent, and once the survey is over you ask him to refer his friends; that is snowball sampling. It is a very common method in research.
(Refer Slide Time: 13:56)
Then there are some errors when we go for sampling. Data from non-random samples are not appropriate for analysis by inferential statistical methods; that is a very important drawback, because you cannot generalize when there is no randomness. Sampling error occurs when the sample is not representative of the population; if the sample does not represent the population, then whatever analysis you do becomes futile.

Then there are non-sampling errors: apart from the sampling procedure itself, there may be missing data, problems in recording, problems with data entry, or analysis errors. Sometimes poorly conceived concepts, unclear definitions and defective questionnaires also lead to error. Response error occurs when people have not understood the questionnaire, for example when there are options like "do not know" or "will not say", and sometimes respondents may overstate their answers. These are the possible errors when you go for sampling. There is one more kind of error, type 1 and type 2 error, which we will see in the coming classes.
(Refer Slide Time: 15:19)
Now we go to the sampling distribution of the mean; here x-bar represents the sample mean. Proper analysis and interpretation of a sample statistic requires knowledge of its distribution, that is, the sampling distribution. For example, we start from a population whose mean is mu, select a random sample, and from the sample compute a sample statistic. Note that it is "statistic", not "statistics", without the s: whatever quantity you compute from the sample is called a statistic, such as the t statistic, the F statistic or x-bar; since we calculate it from the sample we call it a statistic.

With the help of the sample mean we can estimate the population mean; this is the process of inferential statistics. So what happens is that we assume something about the population (once we assume something about the population it is generally called a hypothesis), then we take a sample randomly, compute a sample statistic, and with the help of the sample statistic we estimate the population mean or the population variance. In the current context we are estimating the population mean.
(Refer Slide Time: 16:37)
This picture shows inferential statistics: there is a bigger circle, that is the population. The population parameter is unknown but can be estimated from sample evidence; the red part shows the sample statistic. So inferential statistics is making statements about a population by examining sample results.
(Refer Slide Time: 17:04)
Here is another example of inferential statistics: drawing conclusions or making decisions concerning a population based on sample results. You see there are different red-colored points, say 1 to 7; these are the sample, and the whole set of points is the population. Inferential statistics is used for estimation, for example estimating the population mean weight using the sample mean weight: if you want to know the average weight of the population, it can be estimated with the help of the mean weight of the sample. Another application of inferential statistics is hypothesis testing.

We can use sample evidence to test the claim that the population mean weight is, for example, 120 pounds or not. We will go into the details of hypothesis testing in coming lectures.
(Refer Slide Time: 17:57)
Now we are entering the sampling distribution. A sampling distribution is the distribution of all the possible values of a statistic for a given sample size selected from the population.
(Refer Slide Time: 18:14)
So, as types of sampling distributions, we can construct the sampling distribution of the sample mean, the sampling distribution of the sample proportion, and the sampling distribution of the sample variance.
(Refer Slide Time: 18:30)
Suppose there is a population with 4 people in it, and the random variable x is the age of the individuals. The values of x are 18, 20, 22, 24; this is the population.
(Refer Slide Time: 18:54)
First we find the population mean: the population mean is the sum of the capital Xi divided by N; generally, whenever you see a capital letter it refers to the population, and the lower-case one is for the sample. So (18 + 20 + 22 + 24) divided by 4 is 21, and similarly the population standard deviation is 2.236. Since there are 4 elements, the probability of choosing each element, 18, 20, 22 or 24, is 1 out of 4, that is 0.25 each, so this follows a uniform distribution.

If we choose only one observation at a time and plot it, the chance of selecting each person from the population is 0.25.
(Refer Slide Time: 19:49)
Suppose we consider all possible samples of size n = 2, that is, we select 2 people with replacement. The first observation may be 18, 20, 22 or 24 and the second observation may be 18, 20, 22 or 24, so the possibilities are 18 18, 18 20, 18 22, 18 24, 20 18, 20 20, 20 22 and so on. There are 16 possible samples; since we are sampling with replacement, pairs like 20 20, 22 22 and 24 24 also appear.

If we find the mean of each sample, the right-side picture shows these means: the mean of 18 and 18 is 18, the mean of 18 and 20 is 19, and so on. When you plot these sample means, they follow a roughly normal shape. Previously, when we took only one observation and plotted it, we got a uniform distribution; when we increase the sample size from 1 to 2, the distribution of the sample mean is no longer uniform.
(Refer Slide Time: 21:00)
Now consider the summary measures of this sampling distribution where we selected 2 with replacement; going back, there are 16 elements, 4 times 4 = 16. The mean, that is the expected value of x-bar, is (18 + 19 + ... + 24)/16 = 21, so mu x-bar equals 21. The standard deviation of this sampling distribution is σ_x̄ = √( Σ(x̄ᵢ - µ_x̄)² / N ): with µ_x̄ = 21, we sum (18 - 21)² + (19 - 21)² + ... + (24 - 21)², divide by 16 and take the square root, which gives 1.58.

Please go back and look at the population values: the population mean is 21 and the population standard deviation is 2.236. When we select 2 with replacement, the mean of the sampling distribution is still 21, but the standard deviation of the sampling distribution is 1.58.
(Refer Slide Time: 22:16)
Next we construct the same kind of table that we constructed previously and, after constructing it, we find that the mean is again 21. So we have found the summary measures for the sampling distribution: the mean of the sampling distribution is 21 and the standard deviation of the sampling distribution is 1.58. Now we compare the population data with the sampling distribution: the population has 4 elements, while each sample has 2 elements. The mean of the population is 21 and the mean of the sampling distribution is also 21, but the standard deviation of the population is 2.236, while the standard deviation of the sampling distribution is 1.58.
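As a small sketch, this enumeration of all 16 samples of size 2 (with replacement) can be reproduced directly and checked against the numbers above:

import numpy as np
from itertools import product

population = np.array([18, 20, 22, 24])

# All 16 samples of size 2 drawn with replacement, and their means
sample_means = np.array([np.mean(s) for s in product(population, repeat=2)])

print(population.mean(), population.std())        # 21.0 and ~2.236 (population)
print(sample_means.mean(), sample_means.std())    # 21.0 and ~1.58  (sampling distribution)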
(Refer Slide Time: 23:08)
We will look at another example, where the population follows an exponential distribution. From this exponential distribution we select 2 observations at a time with replacement. When you select two at a time, find the mean of each sample, construct a frequency distribution of those means and plot it, then for n equal to 2 we get the kind of distribution shown: the parent distribution is exponential, but the distribution of the sample means is already noticeably less skewed than the parent.
(Refer Slide Time: 23:50)
Whatever the parent population, if you select samples from it and plot the sample means, that plot will approach a normal distribution. As another example, take a population that follows a uniform distribution.
(Refer Slide Time: 24:21)
Select 2 observations at a time and plot the sample means; they follow this kind of distribution. Increase the sample size to 5 and it approaches a normal distribution. When n equals 30 it looks like a normal distribution: initially the population was uniform, but as the sample size increases, the distribution of the sample mean behaves like a normal distribution.
(Refer Slide Time: 24:43)
Now, for the expected value of the sample mean, let X1, X2, ..., Xn represent a random sample from the population. The sample mean of these observations is defined as X̄ = (1/n) Σ Xi. Next is the standard error of the mean: different samples of the same size from the same population yield different sample means, and a measure of the variability in the mean from sample to sample is given by the standard error of the mean.

The standard error is σ_X̄ = σ/√n; note that the standard error of the mean decreases as the sample size increases.
(Refer Slide Time: 25:31)
What if the sample values are not independent? If the sample size n is not a small fraction of the population size N, then the individual sample members are not distributed independently of one another; the observations are not selected independently. So a correction is made to account for this: the variance of the sampling distribution, σ²/n, is multiplied by (N - n)/(N - 1), and taking the square root gives σ_X̄ = (σ/√n) · √((N - n)/(N - 1)).
(Refer Slide Time: 26:09)
If the population is normal with mean µ and standard deviation σ, the sampling distribution of X̄ is also normal, with mean µ_X̄ = µ and standard deviation σ_X̄ = σ/√n. When the sample size is not large relative to the population, this σ/√n is used without any finite population correction.

So the Z value for the sampling distribution of the mean is Z = (X̄ - µ) / (σ/√n).
(Refer Slide Time: 21:00)
Let us look at the sampling distribution properties: see the top curve, the normal population distribution, and the normal sampling distribution below it, which has the same mean. Then, for sampling with replacement, as n increases, that is, as the sample size increases, the standard deviation of the sampling distribution decreases. Look at the red curve: a larger sample size gives a smaller standard deviation. Look at the blue one: a smaller sample size gives a larger standard deviation.
(Refer Slide Time: 27:20)
If the population is not normal, we can apply the central limit theorem: even if the population is not normal, sample means from the population will be approximately normal as long as the sample size is large enough. This theorem, the central limit theorem, is very important; through this theorem the concepts of sample and population are connected. What is the result? The mean of the sampling distribution is the population mean.

The standard deviation of the sampling distribution is σ/√n, where σ represents the population standard deviation and n represents the sample size. It is very powerful; it is the fundamental theorem for inferential statistics.
(Refer Slide Time: 28:19)
What happens is that as the sample size gets large enough, the sampling distribution becomes almost normal regardless of the shape of the population. The meaning is: suppose there is a population; you take some samples, and if you plot the sample means, they will follow a normal distribution provided n is large enough. As you keep increasing n, the sampling distribution looks more and more exactly like a normal distribution.
(Refer Slide Time: 28:52)
This is applicable even when the population is not normal: the parent population may follow any distribution, but the sampling distribution will still follow a normal distribution, with µ_X̄ equal to µ and standard deviation σ/√n. In this case also, the population does not follow a normal distribution, but the sampling distribution follows a normal distribution.
(Refer Slide Time: 29:21)
How large is large enough? For most distributions, when n is greater than 25 the sampling distribution is nearly normal. For normal population distributions, the sampling distribution of the mean is always normally distributed; this is a very important result: in that case the sampling distribution of the mean is always normally distributed.
(Refer Slide Time: 29:46)
We will see an example: a large population has mean equal to 8 and standard deviation equal to 3. Suppose a random sample of n = 36 is selected; what is the probability that the sample mean is between 7.8 and 8.2?
(Refer Slide Time: 30:06)
Even if the population is not normally distributed, the central limit theorem can be used when n is greater than 25. So the sampling distribution of x-bar is approximately normal; that is the result we have seen. The mean of the sampling distribution, µ_x̄, is equal to 8, and the standard deviation of the sampling distribution is sigma by root n: sigma is 3 and n is 36, so it is 0.5.
(Refer Slide Time: 30:33)
So what happens? We were asked P(7.8 < X̄ < 8.2), so 7.8 has to be converted to the Z scale; the conversion formula is Z = (X̄ - µ) / (σ/√n). Here X̄ is 7.8, µ is 8, σ is 3 and the sample size is 36; the middle term (X̄ - µ) / (σ/√n) is nothing but the Z value, and the same conversion is applied to the upper limit.

For the upper limit, (8.2 - 8) / (3/√36) = 0.4, and for the lower limit, (7.8 - 8) / (3/√36) = -0.4; when you simplify, P(-0.4 < Z < 0.4) gives a probability of about 0.3108. What is happening is that the extreme left shows the picture of the population with a question mark, meaning the population may follow any distribution; if you select some samples, find the sample means and draw the sampling distribution, it will follow a normal distribution.

So what was asked is the area of the sampling distribution between 7.8 and 8.2, in other words the probability that the sample mean lies between 7.8 and 8.2. The 7.8 has to be converted into the Z scale so that we can refer to the table; the conversion is done with the formula Z = (X̄ - µ) / (σ/√n). After converting, 7.8 corresponds to a Z value of -0.4 and 8.2 corresponds to a Z value of 0.4.
We can look at the statistical table or we can use Python to find the area between -0.4 and 0.4, which gives 0.1554 + 0.1554 = 0.3108. With that we will close this example.
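A minimal sketch of this central limit theorem calculation in Python, using the numbers stated in the problem:

from scipy.stats import norm
import numpy as np

mu, sigma, n = 8, 3, 36
se = sigma / np.sqrt(n)                                 # standard error = 0.5

# P(7.8 < x-bar < 8.2) under the normal sampling distribution
print(norm.cdf(8.2, mu, se) - norm.cdf(7.8, mu, se))    # ~0.3108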
So I am concluding this lecture. What we have seen in this lecture is different sampling techniques. We have seen the importance of sampling, then we have seen probability sampling and non-probability sampling. Next we have seen an explanation of the central limit theorem.
What is the central limit theorem? Whatever the nature of the population, if you take samples from the population and plot the sample means, they will follow a normal distribution; that is the central limit theorem. Because the central limit theorem is a very important theorem, we have also seen one problem using it. We will continue in the next class.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 13
Distribution of Sample Mean, Proportion, and Variance
Welcome students. In the last class we started on various sampling distributions and saw the sampling distribution of the mean. In this class we will continue from the sampling distribution of the mean, then talk about the sampling distribution of proportions and the sampling distribution of the variance. Then I will introduce the concept of the chi-squared distribution, and we will do some problems; with that we can close this lecture. Before going into this lecture, let us just recollect what we did in the previous class.
(Refer Slide Time: 00:31)
This is a population; from the population I am taking sample 1, sample 2, sample 3, sample 4. For each sample I can find the sample mean and sample variance: for example, x̄1 is the mean of sample 1, and s1² is the corresponding sample variance. This is for a continuous variable, continuous in the sense that I am measuring some length or height or something. Now suppose I take the same kind of samples but for a discrete or categorical variable; categorical in the sense that it can have only two values, positive or negative, good or bad.
Suppose I take sample 1 and ask how many good products are in it; that gives a proportion, which I call P1 hat. Another sample gives P2 hat, another P3 hat. I cannot plot these P1, P2, P3 values directly; first I have to construct a frequency distribution and then plot it, and I get another distribution, the sampling distribution of the proportion.
So there are three points here. The first is: you take samples and compute the sample means; if I plot those sample means, they follow a normal distribution. The mean of that sampling distribution is µ_X̄ = µ and its variance is σ²_X̄ = σ²/n. This is my first result: from the population I have taken samples, and if I plot the sample means, they follow a normal distribution.

Similarly, in this lecture, we are going to take samples from the population, and for each sample we find the proportion (a proportion is a probability), so I get P1 hat, P2 hat, P3 hat; these also follow a normal distribution. The third one, which we are also going to see in this class, is that instead of the mean we take the variance of each sample; if I plot those variances, coming from a normal population, they follow a special shape called the chi-square distribution. This is going to be the summary of our class. We will continue.
(Refer Slide Time: 04:01)
Before that, recall from the previous class the sampling distribution of the mean; with its help we can find the lower and upper limits of the sample mean. So what is being asked? The population mean is given, the population variance is given, and we have to find the range of sample means X̄, the lower limit and the upper limit. By the central limit theorem, we know that the distribution of X̄ is approximately normal if n is large enough, with mean µ and standard deviation σ/√n. Let z_α/2 be the Z value that leaves area α/2 in the upper tail of the normal distribution, so that the interval ± z_α/2 encloses probability (1 - α).

The lower limit is µ - z_α/2 · σ/√n and the upper limit is µ + z_α/2 · σ/√n; this comes from the very famous formula Z = (X̄ - µ) / (σ/√n). From this relationship, if you re-adjust it for X̄, you get this equation, and from it you can find the upper and lower limits of X̄, the sample mean.
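A minimal sketch of these limits, using hypothetical illustrative numbers (µ = 50, σ = 10, n = 25, α = 0.05 are not from the lecture, only placeholders):

from scipy.stats import norm
import numpy as np

mu, sigma, n, alpha = 50, 10, 25, 0.05     # hypothetical values for illustration
z = norm.ppf(1 - alpha / 2)                # ~1.96

lower = mu - z * sigma / np.sqrt(n)
upper = mu + z * sigma / np.sqrt(n)
print(lower, upper)                        # x-bar falls in this range with probability 1 - alpha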
(Refer Slide Time: 05:43)
This is what we started in the last class. In sampling distributions there are three things we are going to see: one is the sampling distribution of the sample mean, which we have seen. In this class we are going to see the sampling distribution of the sample proportion and the sampling distribution of the sample variance. First we will see the sampling distribution of the sample proportion.
(Refer Slide Time: 06:06)
P is the proportion of the population having some characteristic; we call P the population proportion. The sample proportion, which we call p̂ (p hat), provides an estimate of P.
(Refer Slide Time: 06:24)
What is the meaning of this estimate of P in the sampling distribution of the sample proportion? We use capital P for the proportion of the population having some characteristic, and we call the sample proportion p̂; it provides an estimate of capital P. The meaning is that with the help of the sample proportion we can estimate the population proportion.

How is the sample proportion found? p̂ equals X divided by n, where X is the number of items in the sample having the characteristic of interest and n is the sample size. The range of the sample proportion is, as usual, 0 ≤ p̂ ≤ 1. p̂ has a binomial distribution, but it can be approximated by a normal distribution when nPQ is greater than 5; here Q is nothing but 1 - P, so it follows a binomial distribution.

As we know, the binomial distribution has the property of having only two alternatives: good or defective, pass or fail, yes or no. So what do we do?
(Refer Slide Time: 07:41)
From the population we take sample proportions, and when you plot the sample proportions they follow a normal distribution. This picture shows different samples taken from the population; for each sample we find the sample proportion, and if you plot those sample proportions they follow a normal distribution. Since it follows a normal distribution, it has two parameters.

The mean of the sampling distribution of the proportion, that is, the expected value of p̂, is nothing but P, the population proportion. And the variance of this sampling distribution is PQ/n, that is, P(1 - P)/n.
(Refer Slide Time: 08:27)
Actually we need not remember this formula; we can derive it, because we have seen in a previous class that the mean of the binomial distribution is nP and the variance of the binomial distribution is nPQ (we use capital P for the population, and Q can be written as 1 - P). Suppose from the population I take proportion 1, proportion 2, proportion 3, proportion 4, and so on; I get p̂1, p̂2, and many such proportions.

If I plot these sample proportions, they follow a normal distribution, so we have to find the mean of this distribution of sample proportions and, similarly, its variance. Since p̂ = X/n and E(X) = nP, the mean is E(p̂) = nP/n, which is nothing but the population proportion P.

So this result says that the mean of the sample proportion equals the population proportion. Similarly, for the variance, Var(p̂) = Var(X)/n² = nP(1 - P)/n², which simplifies to P(1 - P)/n.
(Refer Slide Time: 11:01)
So the Z value for the proportion is Z = (p̂ - P) / σ_p̂, where σ_p̂ = √(P(1 - P)/n).

We will do one small problem. If the true proportion of voters who support proposition A is P = 0.4, what is the probability that a sample of size 200 yields a sample proportion between 0.40 and 0.45? What is asked here: the population proportion is given as 0.4, that is 40%; what is the probability that the sample proportion lies between 0.40 and 0.45?
(Refer Slide Time: 12:03)
So P = 0.4 and n = 200, and we want P(0.40 ≤ p̂ ≤ 0.45), where p̂ is the sample proportion. First we find the standard deviation of the sampling distribution of the proportion: σ_p̂ = √(PQ/n), and with P = 0.4 we get √(0.4 × (1 - 0.4) / 200) = 0.0346. We then convert to the standard normal distribution using Z = (p̂ - P) / σ_p̂, so that we can refer to the table.

So P(0.40 ≤ p̂ ≤ 0.45) becomes P((0.40 - 0.4)/0.0346 ≤ Z ≤ (0.45 - 0.4)/0.0346): for the lower limit, p̂ = 0.40 and P = 0.4, so this term is 0; for the upper limit, (0.45 - 0.4)/0.0346 gives 1.44. So we get P(0 ≤ Z ≤ 1.44).
(Refer Slide Time: 13:44)
When you look at the table, P(0 ≤ Z ≤ 1.44) gives 0.4251. To summarize what we have done: it was asked what the probability is that the sample proportion lies between 0.40 and 0.45. We converted 0.40 to the Z scale, which becomes 0, and 0.45 to the Z scale, which becomes 1.44; then we found the area between Z = 0 and Z = 1.44, which is 0.4251.
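A sketch of the same calculation using the normal approximation directly, without converting to Z by hand:

from scipy.stats import norm
import numpy as np

P, n = 0.4, 200
sigma_p = np.sqrt(P * (1 - P) / n)                               # ~0.0346

# P(0.40 <= p-hat <= 0.45) under the normal approximation
print(norm.cdf(0.45, P, sigma_p) - norm.cdf(0.40, P, sigma_p))   # ~0.425 (table value 0.4251 at z = 1.44)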
Now that we have seen this, we will go to the sampling distribution of the sample variance.
(Refer Slide Time: 14:27)
Let X1, X2, ..., Xn be a random sample from a population. The sample variance is s² = Σ(Xi - X̄)² / (n - 1), and the square root of the sample variance is called the sample standard deviation. The sample variance is different for different random samples from the same population, because every time you may get a different sample; this leads to a very important result which we are going to see.
(Refer Slide Time: 14:55)
The sampling distribution of the sample variance has mean equal to the population variance. The meaning is that from the population you take different samples, and for each sample you find the sample variance; the expected value of the sample variance equals the population variance. Moreover, when you take samples from a normal population, find the sample variances and plot them, they follow a particular right-skewed shape. That distribution is called the chi-square distribution, which you will see in the next slide. So another important result is that if the population distribution is normal, there is a relationship between the sample variance and the population variance: (n - 1)s²/σ² has a chi-squared distribution with n - 1 degrees of freedom.

So the x-axis of that plot is nothing but (n - 1)s²/σ²; this is our chi-square distribution. You may see that there is a similarity you can intuitively connect with the normal distribution; we will see that in the next slide.
(Refer Slide Time: 16:16)
For example, take Z = (X - µ)/σ. If you take different observations X1, X2, X3, ..., square both sides and add them up, you get Σ(Xi - µ)²/σ², and this sum of squared Z values follows a chi-square distribution. Now recall the formula for the sample variance, s² = Σ(Xi - X̄)²/(n - 1); the numerator Σ(Xi - X̄)² can therefore be written as (n - 1)s², so Σ(Xi - X̄)²/σ² is nothing but (n - 1)s²/σ², which follows a chi-square distribution with n - 1 degrees of freedom.

So there is a connection between the Z distribution and the chi-square distribution. The other thing is that, since the values are squared, the Z distribution is a normal distribution extending to both sides, whereas the chi-square distribution looks like this: because we have squared the Z values, there are no negative values, so chi-square is always positive. That is the connection between the Z distribution and the chi-square distribution.
(Refer Slide Time: 18:00)
What happens is that from each sample you take the variance, and when you plot those sample variances they follow this shape; the x-axis is the chi-square value. The chi-squared distribution is a family of distributions depending on the degrees of freedom, n - 1. When the degrees of freedom increase, that is, if you start taking larger samples from the population and plot the variances, the distribution will eventually approach a normal distribution.

So if the degrees of freedom of the chi-square distribution increase, it approaches a normal distribution. What is the chi-square distribution? From the population you take a sample, and for that sample you find the variance; you take many samples, find their different variances, and when you plot those variances they follow this shape. This shape is nothing but the chi-square distribution, and its x-axis is (n - 1)s²/σ².
(Refer Slide Time: 19:03)
Then another important concept is degrees of freedom, because we will use this concept many times. What are degrees of freedom? The number of observations that are free to vary after a sample mean has been calculated. Suppose the mean of 3 numbers is 8, and x1 equals 7 and x2 equals 8; what is the value of x3? Since the mean is already known to us, we can supply any value to x1 and any value to x2, but we cannot give just any value to x3: it must be 9, because the sum has to be 24. We have lost one degree of freedom because we already know the mean.

So the logic here is that when n equals 3, the degrees of freedom are n - 1 = 2: two values can be any numbers, but the third is not free to vary for a given mean. It is like an example where there are three chairs and we ask three students to sit; the first person who enters has three possibilities.
That is three degrees of freedom, because three chairs are available. The second person has two possibilities, that is, two degrees of freedom. For the third one there is only one chair, so there is no option left; one degree of freedom has been lost. In general, if there are n values, you have only n - 1 degrees of freedom. We have just introduced what the chi-square distribution is and how it is connected to the normal distribution; we will do a small problem to understand the application of the chi-square distribution.
(Refer Slide Time: 20:34)
A commercial freezer must hold its selected temperature with little variation; the specification calls for a standard deviation of no more than 4 degrees, that is, a variance not exceeding 16 degrees squared. A sample of 14 freezers is to be tested. What is the upper limit of the sample variance such that the probability of exceeding this limit, given that the population standard deviation is 4, is less than 0.05?

In other words, we want the value of the sample variance whose probability of being exceeded is less than 0.05. We will see in the next slide exactly what this means.
(Refer Slide Time: 21:25)
The first thing is to find the chi-square value for n - 1 degrees of freedom. This is a chi-square distribution; there are 14 samples, so the degrees of freedom are 14 minus 1, that is 13, and the chi-square value corresponding to alpha equal to 0.05 is 22.36.
(Refer Slide Time: 21:46)
So what is asked is: given that the chi-square value for alpha = 0.05 with 13 degrees of freedom is 22.36, and sigma is 4 so σ² is 16, what is the maximum value K of the sample variance?

P(s² > K) = P((n - 1)s²/16 > (n - 1)K/16) = P(χ²₁₃ > (n - 1)K/16) = 0.05. So (n - 1)K/16, that is 13K/16, must equal 22.36, and simplifying gives K = 27.52. The result is that if the sample variance from a sample of size 14 is greater than 27.52, there is strong evidence to suggest that the population variance exceeds 16.
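A minimal sketch of this chi-square calculation with scipy:

from scipy.stats import chi2

n, sigma2, alpha = 14, 16, 0.05
chi2_crit = chi2.ppf(1 - alpha, df=n - 1)     # ~22.36 for 13 degrees of freedom

# P(s^2 > K) = 0.05  =>  (n - 1) * K / sigma^2 = chi2_crit
K = chi2_crit * sigma2 / (n - 1)
print(chi2_crit, K)                           # ~22.36, ~27.52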
That is an application of the chi-square distribution. We will see it in detail later; there are many applications of the chi-square distribution, one is the test of independence and another is the goodness of fit test, which we will see in coming classes.
(Refer Slide Time: 23:11)
Now we will summarize what we have seen in this class: we introduced what sampling distributions are and described the sampling distribution of sample means for a normal population. Then we explained the central limit theorem; we saw the sampling distribution of the mean, the sampling distribution of the variance, and the sampling distribution of proportions.

Then I introduced the concept of the chi-square distribution and how it is connected with the normal distribution, and we saw an application of the chi-square distribution. In the next class we will go to the next topic, confidence intervals. We will continue in the next class. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 14
Confidence Interval Estimation:
Single Population-1
Welcome students to the next lecture. Today we are going to talk about confidence interval estimation for a single population. In the previous lecture we saw sampling distributions, and we saw three results in that lecture: the sampling distribution of the mean, the sampling distribution of the proportion, and the sampling distribution of the variance.

By using those results we are going to estimate some population parameters. What are we going to estimate? We may estimate the mean of the population, the proportion of the population, or the variance of the population. That we will see in this lecture.
(Refer Slide Time: 01:15)
The objectives of this lecture are to distinguish between a point estimate and a confidence interval estimate, and to construct and interpret a confidence interval estimate for a single population mean using both Z and t distributions. In this lecture I am also going to introduce the Z and t distributions. Then we will form and interpret a confidence interval estimate for a single population proportion, and create a confidence interval estimate for the variance of a normal population.
(Refer Slide Time: 01:46)
In this lecture on confidence intervals we are going to see: confidence intervals for the population mean, with two possibilities, when the population variance σ2 is known and when it is unknown; then the confidence interval for the population proportion, estimated by P hat using large samples; and then the confidence interval estimate for the variance of a normal distribution.
(Refer Slide Time: 02:14)
Before getting into the content, we will see what an estimator and an estimate are. An estimator of a population parameter is a random variable that depends on sample information, whose value provides an approximation to the unknown parameter. A specific value of that random variable is called an estimate. For example, X bar is an estimator of the population mean. Similarly, the sample variance s2 is an estimator of the population variance, and the sample proportion P hat is an estimator of the population proportion P. So X bar, s2 and P hat are called estimators.
(Refer Slide Time: 03:16)
A specific value of X bar, s2 or P hat is an estimate. There are two kinds of estimates: a point estimate is a single number, while a confidence interval provides additional information about the variability. A point estimate, being a single number, is not very reliable by itself, but the confidence interval gives you additional information about the variability of that point estimate.
For example, when you look at this picture, a point estimate is a single number and an interval estimate provides additional information about the variability of the point estimate. For instance, if I say tomorrow's temperature will be exactly 35 degrees Celsius, that is a point estimate. If instead I give a lower limit and an upper limit, say 30 to 40, that is a confidence interval. Here 30 is the lower confidence limit and 40 is the upper confidence limit, and the distance between them is the width of the confidence interval; the point estimate is just one single number.
(Refer Slide Time: 04:40)
So, for point estimation, we can estimate the population mean µ with the help of the sample mean X bar, and we can estimate the population proportion P with the help of the sample proportion p.
(Refer Slide Time: 04:55)
Another important property of an estimator is unbiasedness. A point estimator theta hat is said to be an unbiased estimator of the parameter theta if the expected value, or mean, of its sampling distribution equals theta. That is, if E(theta hat) = theta, then theta hat is an unbiased estimator.
For example, when can we say that X bar is an unbiased estimator of the population mean? If the expected value of X bar equals µ, then X bar is an unbiased estimator. The sample mean X bar is an unbiased estimator of µ, the sample variance s2 is an unbiased estimator of σ2, and the sample proportion p is an unbiased estimator of the population proportion P.
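To illustrate this, here is a small simulation sketch (my own example, not from the slides): averaged over many samples, the sample mean is close to µ and the sample variance with the n - 1 divisor is close to σ2.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 5, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)          # sample means
s2 = samples.var(axis=1, ddof=1)     # sample variances with the n - 1 divisor

print(xbar.mean())   # close to mu = 10, i.e. E(X bar) = mu
print(s2.mean())     # close to sigma^2 = 4, i.e. E(s^2) = sigma^2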
(Refer Slide Time: 06:02)
Look at another illustration of unbiasedness. In this picture there are two sampling distributions, one for the estimator theta 1 hat and one for theta 2 hat. The mean of theta 1 hat equals theta, the population parameter, so theta 1 hat is an unbiased estimator. But the mean of theta 2 hat lies away from the population parameter, so theta 2 hat is not an unbiased estimator; we say theta 1 hat is an unbiased estimator of the population parameter.
(Refer Slide Time: 06:52)
We can measure the bias. Let theta hat be an estimator of theta; the bias in theta hat is defined as the difference between its mean and theta, that is, Bias(theta hat) = E(theta hat) - theta. The bias of an unbiased estimator is 0; if it is 0 we say there is no bias.
(Refer Slide Time: 07:18)
How can we find the most efficient estimator? Suppose there are several unbiased estimators of theta; we have seen that the sample mean is one estimator of the population mean. The most efficient estimator, or minimum variance unbiased estimator, of theta is the unbiased estimator with the smallest variance. So, even though there may be different estimators for a population parameter, the one with the smallest variance is the most efficient. Let theta 1 hat and theta 2 hat be two unbiased estimators of theta based on the same number of sample observations. Then theta 1 hat is said to be more efficient than theta 2 hat if the variance of theta 1 hat is less than the variance of theta 2 hat. The point is that there may be different estimators for a population parameter; to decide which is more efficient we compare their variances, and the estimator with the smaller variance is the more efficient one. The relative efficiency of theta 1 hat with respect to theta 2 hat is the ratio of their variances: relative efficiency = variance of theta 2 hat divided by variance of theta 1 hat.
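Here is an illustrative sketch (my own example): for normal data both the sample mean and the sample median are unbiased for µ, but the mean has the smaller variance, so it is the more efficient estimator.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 1.0, 25, 50_000

data = rng.normal(mu, sigma, size=(reps, n))
means = data.mean(axis=1)
medians = np.median(data, axis=1)

print(means.var(), medians.var())        # variance of the mean is smaller
print(medians.var() / means.var())       # relative efficiency, roughly pi/2 for large n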
(Refer Slide Time: 08:41)
Then, confidence intervals. How much uncertainty is associated with the point estimate of a population parameter? In the previous example, if I say the temperature is 35 degrees, how much uncertainty is associated with that point estimate? That uncertainty is expressed with the help of a confidence interval. An interval estimate provides more information about the population characteristics than a point estimate does; compared to a point estimate, the interval estimate gives more information about the population. Such interval estimates are called confidence intervals. For example, suppose this is the population and the population mean may be, say, 40. By taking samples and using the sample mean, I can state a lower limit and an upper limit for this population mean, say 35 to 45; that interval is the confidence interval. I could instead give an exact point estimate, say exactly 40, but that single value of 40 is not very reliable.
(Refer Slide Time: 10:12)
Confidence interval estimate: an interval gives you a range of values. A confidence interval takes into consideration the variation of the sample statistic from sample to sample, because from a large population we may take different samples, and different samples will have different means and variances; we construct the confidence interval taking that variation into account. We construct the confidence interval based on observations from one sample: for example, with the help of X bar from one sample I can state the lower and upper limits for µ. It gives information about closeness to the unknown population parameter, stated in terms of a level of confidence; we can never be 100% confident.
(Refer Slide Time: 11:14)
Let us see what the confidence interval and the confidence level are. The confidence interval is the lower limit and the upper limit; the confidence level is a probability. If P(a < ϴ < b) = 1 - α, then the interval from a to b is called a 100(1 - α)% confidence interval. So this interval from a to b is the confidence interval, and the quantity 1 - α is called the confidence level. The confidence level is a probability; the confidence interval is the lower and upper limit for the population parameter, where α is a value between 0 and 1. In repeated samples from the population, the true value of the parameter ϴ would be contained in 100(1 - α)% of the intervals calculated this way. A confidence interval calculated in this manner is written as a < ϴ < b with a 100(1 - α)% confidence level.
(Refer Slide Time: 12:22)
Next we will see the estimation process. Look at the left-hand side: this is the whole population, whose mean µ is unknown. We want to estimate the value of the mean. So you take a sample (the green one); say the sample mean is 50. With the help of the sample mean you can state a lower limit and an upper limit for the population parameter µ with a certain level of confidence. For example, I may say I am 95% confident that µ is between 40 and 60.
(Refer Slide Time: 13:01)
Then we come to the confidence level. Suppose the confidence level is 95%, also written as 1 - α = 0.95; we will see in detail what α is, it is called the type 1 error. The relative frequency interpretation is: in repeated samples, 95% of all the confidence intervals that can be constructed will contain the unknown true parameter. What is the meaning of this 95%? Suppose you construct an interval with some range, say 40 to 50. If you repeat this experiment 100 times, then about 95% of the time the interval will capture the true mean; only 5% of the time will the true mean fall outside the interval. A specific interval either will or will not contain the true parameter: a particular interval may contain it or may not, but 95% of intervals constructed this way capture the true parameter and only 5% do not.
(Refer Slide Time: 14:23)
The general formula for a confidence interval is: point estimate plus or minus (reliability factor) times (standard error). The reliability factor is Z and the standard error is σ/√n, so X bar ± Z(σ/√n) is the formula for the confidence interval; the plus gives the upper limit and the minus gives the lower limit. The value of the reliability factor depends on the desired level of confidence: the value of Z depends on how much confidence you want, as we will see.
(Refer Slide Time: 15:09)
Now we will see the classification of confidence intervals. We can find the confidence interval for the population mean, for the population proportion, and for the population variance. For the population mean there are two categories: one where σ2, the population variance, is known, and one where σ2 is unknown. Remember that a capital letter refers to the population and a small letter refers to the sample.
(Refer Slide Time: 15:41)
We will see the first case: the confidence interval for µ when σ2 is known, that is, when the population standard deviation is known. What are the assumptions? The population variance σ2 is known and the population is normally distributed; if the population is not normal, we have to use a large sample. The confidence interval estimate is X bar - Z α/2 (σ/√n) < µ < X bar + Z α/2 (σ/√n), where Z α/2 is the normal distribution value for a tail area of α/2. This comes from the statistic Z = (X bar - µ)/(σ/√n); when you rearrange this equation you get the lower limit and the upper limit for µ. Because we are using both tails, the area in each tail is α/2, and the remaining area, 1 - α, is the confidence level. We will also see one more term, the margin of error.
(Refer Slide Time: 17:19)
The confidence interval X bar - Z α/2 (σ/√n) < µ < X bar + Z α/2 (σ/√n) can be written as X bar ± ME, where ME is the margin of error, ME = Z α/2 (σ/√n). Be careful with the terminology: another name for a standard deviation is an error. Here σ/√n is the standard error, and Z α/2 (σ/√n) is the margin of error. Error simply means variation. The standard error arises because, whenever you go for sampling, σ has to be divided by √n; this is a result of the central limit theorem.
(Refer Slide Time: 18:28)
Generally we would like to reduce the margin of error. The margin of error can be reduced by looking at σ, n and Z. If the population standard deviation σ can be reduced, the margin of error will obviously reduce. When you increase the sample size n, we can estimate more accurately and the margin of error will be smaller. The level of confidence also matters: comparing two confidence levels, the smaller confidence level (smaller Z α/2) gives the smaller margin of error. In other words, whenever the confidence level is smaller, the margin of error is also reduced.
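A small sketch (assumed values, not from the slides) showing how the margin of error Z α/2 · σ/√n shrinks as n grows or as the confidence level drops:

import numpy as np
from scipy import stats

sigma = 10.0
for conf in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)      # reliability factor Z_alpha/2
    for n in (25, 100, 400):
        me = z * sigma / np.sqrt(n)
        print(f"confidence={conf:.2f}  n={n:4d}  margin of error={me:.3f}")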
(Refer Slide Time: 19:21)
Then we look at how to find the reliability factor Z α/2. For example, suppose you want a 95% confidence level. The remaining 5% is split between the two tails: dividing 5% by 2 gives 0.025 on the right-hand side and 0.025 on the left-hand side. When you look at the Z table, for a right-tail area of 0.025 the corresponding Z value is 1.96; on the left side it is -1.96. The value 1.96 corresponds to the upper confidence limit and -1.96 to the lower confidence limit. The value of Z is found from the α value: from the standard normal table, the Z value for a tail area of 0.025 is ±1.96.
(Refer Slide Time: 20:25)
Look at this: if the confidence level is 80%, that is 1 - α = 0.80, the table gives 1.28. At 90%, 1 - α = 0.90, the table gives 1.645. Generally we use 90%, 95% or 99%, so these values are worth remembering. Most of the time we use 95%; for 95% the Z value is 1.96 (note it is Z α/2, not Z α). For 99%, the confidence coefficient 1 - α is 0.99 and the Z α/2 value is 2.58.
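These table values can be reproduced with scipy's inverse normal CDF, a small sketch of my own:

from scipy import stats

for conf in (0.80, 0.90, 0.95, 0.99):
    alpha = 1 - conf
    z = stats.norm.ppf(1 - alpha / 2)
    print(f"{int(conf * 100)}% confidence -> Z_alpha/2 = {z:.3f}")
# prints roughly 1.282, 1.645, 1.960, 2.576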
(Refer Slide Time: 21:12)
Next we will see intervals and the level of confidence. As I told you, in this picture I have drawn 7 intervals. Out of the 7, one interval (the blue one) does not capture the true population parameter. This illustrates the confidence level: 100(1 - α)% of the intervals constructed contain µ, and 100α% do not. Each interval extends from the lower confidence limit X bar - Z(σ/√n) to the upper confidence limit X bar + Z(σ/√n). If I say 95% level of confidence, the meaning is: if I constructed the interval 100 times, about 95 of those intervals would capture the true population mean; only 5% of the time would the interval fail to capture the true population parameter.
(Refer Slide Time: 22:37)
Example: a sample of 11 circuits from a large normal population has a mean resistance of 2.20 ohms; this is the sample mean. We know from past testing that the population standard deviation is 0.35 ohms. Determine a 95% confidence interval for the true mean resistance of the population.
(Refer Slide Time: 23:06)
So, what is given: n = 11 and the sample mean X bar = 2.20. Since the confidence level is 95%, the Z value from the table is 1.96, and σ = 0.35 is given. With 11 samples, the interval is X bar ± 1.96(0.35/√11), so the lower limit is 1.9932 and the upper limit is 2.4068.
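A short Python sketch (my addition) reproducing this interval with the values given in the lecture:

import numpy as np
from scipy import stats

xbar, sigma, n = 2.20, 0.35, 11
z = stats.norm.ppf(0.975)                     # 1.96 for 95% confidence
me = z * sigma / np.sqrt(n)
print(xbar - me, xbar + me)                   # about 1.9932 and 2.4068

# equivalent one-liner
print(stats.norm.interval(0.95, loc=xbar, scale=sigma / np.sqrt(n)))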
(Refer Slide Time: 23:39)
How do we interpret this? We are 95% confident that the true mean resistance is between 1.9932 and 2.4068 ohms. Although the true mean may or may not lie in this particular interval, 95% of intervals formed in this manner will contain the true mean; only 5% of the time will the interval miss the true mean. That 5% is called the significance level; another name for it is the type 1 error, and yet another name is the producer's risk. We will see this in detail in coming lectures.
(Refer Slide Time: 24:35)
We will go to the next category: the confidence interval for the mean when σ2, the population variance, is not given. Dear students, I will summarize what we have done so far. We have seen what a point estimate is, what an interval estimate is and the advantage of the interval estimate; then we saw the meaning of the confidence level and the confidence interval; after that we saw how to construct the confidence interval for a population mean when σ2 is known. In the next lecture we will construct the confidence interval for the population mean when σ2 is unknown. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 15
Confidence Interval Estimation: Single Population – II
Dear students, in the previous class we estimated the population mean with the help of the sample mean under the condition that σ2, the population variance, is known. Now we will see the next case, where σ2 is unknown, and see how to estimate the population mean.
(Refer Slide Time: 00:49)
For this purpose you have to use Student's t distribution. Consider a random sample of n observations with mean X bar and standard deviation s from a normally distributed population with mean µ; then the variable t = (X bar - µ)/(s/√n). You can see the connection with Z: Z is (X bar - µ)/(σ/√n), but for the t distribution σ is unknown, so we use the sample standard deviation s instead. The other condition is that n is small, less than 30. So we go for the t distribution when σ is unknown and n is less than 30; then the variable t = (X bar - µ)/(s/√n) follows the Student's t distribution with n - 1 degrees of freedom.
(Refer Slide Time: 01:49)
Now we will see how to construct the confidence interval for µ when σ2 is unknown. If the population standard deviation σ is unknown, we can substitute the sample standard deviation s. This introduces extra uncertainty, since s varies from sample to sample, so we use the t distribution instead of the normal distribution.
(Refer Slide Time: 02:16)
What are the assumptions for the t distribution? The population standard deviation is unknown and the population is normally distributed; if the population is not normal, use a very large sample. Using the Student's t distribution, the confidence interval is (x bar - t n-1, α/2 · s/√n) < µ < (x bar + t n-1, α/2 · s/√n). This comes from the expression t = (X bar - µ)/(s/√n); when you rearrange that equation you get the lower and upper limits for the population mean.
(Refer Slide Time: 03:07)
Student's t-distribution: the t is a family of distributions, because for every value of the degrees of freedom you get a different t distribution. The t value depends on the degrees of freedom, which is the number of observations that are free to vary after the sample mean has been calculated, that is, n - 1.
(Refer Slide Time: 04:04)
Look at the connection between the t distribution and the Z distribution. t distributions are bell shaped and symmetric, but have flatter tails than the normal. When the degrees of freedom is small, say 5, the curve is flatter; as the degrees of freedom increases to 13 and beyond, and finally to infinity, the t distribution behaves like the Z distribution. That is why in many software packages there is no separate option for a Z test, only for a t test: when the sample size is large, the behaviour of the Z distribution and the t distribution is the same.
(Refer Slide Time: 04:45)
Then we look at the Student's t table. There is a difference between the Z table and the t table: in the Z table the value given in the body is an area, but in the t table the area is given at the top (say 0.05) and the value given inside the body of the table is the critical value. For example, if α/2 is 0.05 the corresponding t value is 2.920; the body of the table contains t values, not probabilities, so we should be careful. For example, with n = 3 the degrees of freedom is n - 1 = 2; with α = 10%, α/2 = 0.05, so we look at the 0.05 column and the row for 2 degrees of freedom and find 2.920.
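The same lookup can be done with scipy instead of the printed table (a one-line sketch of my own):

from scipy import stats

print(stats.t.ppf(1 - 0.05, df=2))   # upper-tail area 0.05, 2 degrees of freedom -> about 2.920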
(Refer Slide Time: 05:41)
Here is a comparison between t values and Z values. Take the familiar case first: when the confidence level is 95%, the Z value is 1.96. For the t distribution, when the degrees of freedom is 10 the value is 2.228, at 20 it is 2.086, and at 30 it is 2.042. So the value of t approaches Z as n increases: it starts larger and keeps decreasing, finally approaching 1.96. This table shows that as the degrees of freedom increases, the t value gets close to the Z value of 1.96.
(Refer Slide Time: 06:44)
Now we will see how to find a confidence interval using the t distribution. Example: a random sample of n = 25 has a sample mean of 50 and a sample standard deviation of 8; form a 95% confidence interval for µ. First we find the degrees of freedom: n = 25, so 25 - 1 = 24. The confidence level is 95%, so the significance level is 5%; since we have an upper and a lower limit we divide by 2, giving α/2 = 0.025. With 24 degrees of freedom and α/2 = 0.025, the table gives a t value of 2.06. Substituting X bar = 50, t = 2.06, s = 8 and n = 25, we get a lower limit of 46.698 and an upper limit of 53.302.
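A short sketch (my addition) of the same t interval in Python:

import numpy as np
from scipy import stats

n, xbar, s = 25, 50.0, 8.0
t_crit = stats.t.ppf(0.975, df=n - 1)     # about 2.064
me = t_crit * s / np.sqrt(n)
print(xbar - me, xbar + me)               # about 46.70 and 53.30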
(Refer Slide Time: 07:51)
Now I will go to the next category: estimating the population proportion with the help of the sample proportion.
(Refer Slide Time: 08:04)
Confidence interval for the population proportion: an interval estimate for the population proportion P can be calculated by adding an allowance for uncertainty to the sample proportion; that allowance is based on the standard error.
(Refer Slide Time: 08:21)
Recall that the distribution of the sample proportion is approximately normal if the sample size is large. Its standard deviation is σP = √(PQ/n), where Q = 1 - P; we estimate this with the sample data, which gives the standard error of the sample proportion. To find the lower and upper limits of the population proportion we have to use the sample values, because we may not know the population value P directly; if we knew the population P, there would be no purpose in finding lower and upper limits. We know only the sample proportion. So, with the help of the sample proportion we can find the lower limit and the upper limit of the population proportion. There is a condition: nPQ should be greater than 5, so that the distribution can be approximated by the normal distribution.
(Refer Slide Time: 09:52)
Example: a random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true proportion of left-handers. In this problem P hat = 25/100 = 0.25 and Z = 1.96 because the confidence level is 95%. Substituting these values and taking the minus and plus signs, we get a lower limit of 0.1651 and an upper limit of 0.3349 for the population proportion.
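A short sketch (my addition) of the left-handers interval in Python:

import numpy as np
from scipy import stats

n, x = 100, 25
p_hat = x / n
z = stats.norm.ppf(0.975)
se = np.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error of p hat
print(p_hat - z * se, p_hat + z * se)     # about 0.1651 and 0.3349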
(Refer Slide Time: 10:25)
How do we interpret this? We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%. Although the interval from 0.1651 to 0.3349 may or may not contain the true proportion, 95% of intervals formed from samples of size 100 in this manner will contain it; that is the important point. Put another way, if you repeat this 100 times, about 95 times you will capture the true population proportion and only about 5 times you will not.
(Refer Slide Time: 11:14)
We will go to the last one: how to estimate the population variance. So far we have estimated the population mean and the population proportion; now we are going to estimate the population variance.
(Refer Slide Time: 11:35)
The goal is to form a confidence interval for the population variance σ2. The confidence interval is based on the sample variance: with the help of the sample variance we are going to estimate an interval for the population variance. We assume the population is normally distributed.
(Refer Slide Time: 11:56)
We have already seen that if you take samples from a population, the quantity (n - 1)s2/σ2 follows a chi-square distribution, as I told you previously. We are going to use this result. When you rearrange it for σ2, you get
(n - 1)s2 / χ2 n-1, α/2  ≤  σ2  ≤  (n - 1)s2 / χ2 n-1, 1-α/2.
So, by rearranging the equation for σ2, you can find the lower limit and the upper limit of the population variance.
(Refer Slide Time: 13:13)
The same thing: the 100(1 - α)% confidence interval for the population variance is given by this expression. Looking at the chi-square distribution, the table gives the right-tail area: when the right-tail area is α/2 you get the larger chi-square value, and when it is 1 - α/2 you get the smaller one. Dividing the numerator (n - 1)s2 by the larger value gives the smaller number, which becomes the lower limit of the variance; dividing it by the smaller value gives the larger number, which becomes the upper limit of the population variance.
(Refer Slide Time: 14:07)
We will see an example. You are testing the speed of a batch of computer processors and collect the following data: sample size 17, sample mean 3004, sample standard deviation 74. Assuming the population is normal, determine a 95% confidence interval for σ2, that is, the lower and upper limits of the population variance.
(Refer Slide Time: 14:37)
So, n = 17, and the chi-square distribution has n - 1 = 16 degrees of freedom. With α = 0.05, since we are finding both an upper and a lower limit we divide by 2 to get 0.025. For a right-tail area of α/2 = 0.025 the chi-square value is 28.85; that is the right-side critical value. To get the left-side critical value, you look in the chi-square table for a right-tail area of 1 - 0.025 = 0.975 with 16 degrees of freedom, and the corresponding value is 6.91.
(Refer Slide Time: 15:21)
So, when you substitute: (17 - 1)s2 with s = 74, divided by the chi-square value for α/2 gives the lower limit, and divided by the chi-square value for 1 - α/2 gives the upper limit; the lower limit is 3037 and the upper limit is 12683. Converting to the standard deviation, we are 95% confident that the population standard deviation of CPU speed is between the square roots of these values, that is, between 55.1 and 112.6.
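A short sketch (my addition) of this variance interval in Python:

import numpy as np
from scipy import stats

n, s = 17, 74.0
df = n - 1
chi2_hi = stats.chi2.ppf(0.975, df)       # about 28.85
chi2_lo = stats.chi2.ppf(0.025, df)       # about 6.91

var_lower = df * s**2 / chi2_hi           # about 3037
var_upper = df * s**2 / chi2_lo           # about 12683
print(var_lower, var_upper)
print(np.sqrt(var_lower), np.sqrt(var_upper))   # about 55.1 and 112.6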
(Refer Slide Time: 15:55)
So far we have assumed an infinite population; sometimes the population is finite. A finite population correction is needed when the number of elements in the population is small: if the sample size is more than 5% of the population size and sampling is without replacement, then the finite population correction factor must be used in calculating the standard error. So we have to apply this correction factor when we have a finite population, that is, when the sample size is more than 5% of the population and we sample without replacement.
(Refer Slide Time: 16:34)
Suppose sampling is without replacement and the sample size is large relative to the population size; then we should use the finite population correction factor. Assume the population size is large enough to apply the central limit theorem. Apply the finite population correction factor when estimating the variance of the sample mean: the factor is (N - n)/(N - 1), where N is the population size and n is the sample size.
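A small sketch (with assumed, hypothetical numbers) of applying this correction to the standard error of the sample mean:

import numpy as np

N, n, s = 1000, 100, 12.0                 # hypothetical population size, sample size, s
fpc = (N - n) / (N - 1)                   # finite population correction factor
se_uncorrected = s / np.sqrt(n)
se_corrected = se_uncorrected * np.sqrt(fpc)   # sqrt because the factor applies to s^2/n
print(se_uncorrected, se_corrected)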
(Refer Slide Time: 17:06)
Let a simple random sample of size n be taken from a population of N members with mean µ. The sample mean is an unbiased estimator of the population mean µ, and the point estimate is X bar, the average of the n observations. There is no correction needed for the sample mean itself, but when estimating the variance of the sample mean we have to apply the correction factor: if the sample size is more than 5% of the population size, an unbiased estimator of the variance of the sample mean is (s2/n) multiplied by (N - n)/(N - 1).
(Refer Slide Time: 17:42)
So the 100(1 - α)% confidence interval for the population mean µ is formed in the usual way, using this corrected variance of the sample mean.
(Refer Slide Time: 17:51)
This is applicable to the population proportion as well: when we are estimating the population proportion and the population is finite with a relatively large sample, we have to apply the same kind of correction factor. Let the true population proportion be P, and let P hat be the sample proportion from n observations in a simple random sample. The sample proportion P hat is an unbiased estimator of the population proportion P; here also we apply (N - n)/(N - 1) as the correction factor, and everything else remains the same.
(Refer Slide Time: 18:37)
Now we will summarize what we have done in this lecture. We created a confidence interval estimate for the proportion, and then a confidence interval estimate for the variance of a normal distribution. For each of the proportion and variance estimations we took a numerical example and solved it to understand the concept of parameter estimation. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 16
Hypothesis Testing- I
Welcome, students. Today we are entering a very interesting topic, hypothesis testing. This topic is going to be fundamental for the coming lectures, so you should understand it carefully.
(Refer Slide Time: 00:40)
The class objectives are: first, to explain how to develop null and alternative hypotheses, because solving a hypothesis-testing problem is easy; the most important part is how to formulate the hypotheses. Once you are good at formulating the hypotheses, solving the problem is easy. Then I am going to explain what type 1 and type 2 errors are and how they are connected with hypothesis testing. Next we will do hypothesis testing when σ, the population standard deviation, is known; then hypothesis testing when the population standard deviation is not known; and then hypothesis testing for the population proportion.
(Refer Slide Time: 01:27)
First, what is hypothesis testing? Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected. A hypothesis is an assumption about a population parameter. We know that most of the populations we deal with follow a particular distribution, for example the normal distribution, which has two parameters, the mean and the variance; so we can make an assumption about the population mean as a hypothesis, or about the population variance. The null hypothesis, denoted by H0, is the tentative assumption about the population parameter. The alternative hypothesis, denoted by Ha, is the opposite of what is stated in the null hypothesis. The hypothesis-testing procedure uses data from a sample to test the two competing statements indicated by H0 and Ha: one is the null hypothesis and the other is the alternative hypothesis.
(Refer Slide Time: 02:44)
Next we will see how to develop the null and alternative hypotheses. It is not always obvious how they should be formulated. We should be careful to structure the hypotheses appropriately, so that the test conclusion provides the information the researcher wants. The context of the situation is very important in determining how the hypotheses should be stated. In some cases it is easier to identify the alternative hypothesis first; in other cases the null is easier. Correct hypothesis formulation takes practice, so in this lecture we will take some examples and I will explain how to formulate the null and alternative hypotheses.
(Refer Slide Time: 03:36)
First, the alternative hypothesis as a research hypothesis: most of the time the researcher wants to prove the alternative hypothesis. Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis. In such cases it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support, because many times the researcher wants to support his hypothesis. So first you write the alternative hypothesis. The conclusion that the research hypothesis is true is made only if the sample data provide sufficient evidence to reject the null hypothesis; that is, to accept your alternative hypothesis, the data collected from the sample must give enough evidence to reject the null hypothesis.
(Refer Slide Time: 04:36)
Now some examples of the alternative hypothesis as a research hypothesis. Example: a new manufacturing method is believed to be better than the current method. Assume that in a manufacturing context someone is proposing a new way of doing the work, a new manufacturing method, and we want to test this belief. The alternative hypothesis is that the new manufacturing method is better, because the new method was proposed by the researcher, and the researcher believes there is support for it. So first we formulate the alternative hypothesis, that the new manufacturing method is better; the null hypothesis is just the complement of the alternative hypothesis, that the new method is no better than the old method.
(Refer Slide Time: 05:26)
Take another example: a new bonus plan is developed in an attempt to increase sales. An organization is introducing a new bonus plan and we want to see whether the bonus has any impact on sales. The alternative hypothesis is that the new bonus plan increases sales; we formulate the alternative hypothesis first. Then the null hypothesis is that the new bonus plan does not increase sales. Notice that the null hypothesis always says that nothing has changed; that is why it is called the null, meaning nothing has happened.
(Refer Slide Time: 06:13)
Another example of an alternative hypothesis: a new drug is developed with the goal of lowering the cholesterol level more than the existing drug. There are already drugs available to lower cholesterol, and a researcher has found a new drug believed to reduce cholesterol better than the existing one. So the alternative hypothesis is that the new drug lowers the cholesterol level more than the existing drug, and the null hypothesis is that the new drug does not lower the cholesterol level more than the existing drug. The phrase "does not" represents the null: nothing significant has happened, and that is why it is called the null hypothesis. Next, the null hypothesis as an assumption to be challenged.
We might begin with the belief or assumption that a statement about the value of a population parameter is true. In the hypothesis-testing context we always start the problem assuming that the null hypothesis is true. For example, in India, before the trial of someone who has been accused, the trial begins with the assumption that the person is innocent: the police have to bring evidence to show that the person is not innocent. In some other countries the suspected person has to prove his innocence, which is the reverse; the consequence of our convention is that even if something has happened, if there is no evidence the person goes free. We then use a hypothesis test to challenge the assumption and determine whether there is statistical evidence to conclude that the assumption is incorrect. In this situation it is helpful to develop the null hypothesis first. We will take an example of how to develop the null hypothesis as an assumption to be challenged.
(Refer Slide Time: 08:23)
Example: the label on a milk bottle states that it contains 1000 ml. The null hypothesis is that the label is correct, so H0: µ ≥ 1000 ml. A hint: the null hypothesis is the status quo, the null hypothesis always contains an equality, and it takes the optimistic perspective. If somebody says the bottle contains 1000 ml, we assume that the claim is correct, so we formulate the null hypothesis as µ ≥ 1000 ml. The alternative hypothesis is that the label is incorrect, µ < 1000 ml. Notice that the signs are complementary: since the null hypothesis has ≥, the alternative has <; if the null hypothesis had ≤, the alternative would be >; if the null hypothesis is '=', the alternative is '≠'. The null hypothesis always contains the equality sign, and the alternative hypothesis never contains it. The status quo goes to the null hypothesis; challenging the status quo is the alternative hypothesis.
(Refer Slide Time: 09:49)
Now look at the nature of the null hypothesis. The equality part of the hypothesis always appears in the null hypothesis; that is, the null hypothesis always contains an equals sign. Equality means nothing has happened, the status quo is maintained as it is. In general, a hypothesis test about the value of the population mean µ must take one of the following three forms, where µ0 is the hypothesized value of the population mean.
First form: H0: µ ≥ µ0 and Ha: µ < µ0. Since the null hypothesis has ≥, the alternative has <, because the signs are complementary. This is a one-tailed test called a lower-tailed (left-tailed) test: we look at the sign in the alternative hypothesis, and since it is "less than µ0" the rejection region is in the left tail; if the test statistic falls far enough to the left, we reject.
Second form: H0: µ ≤ µ0 and Ha: µ > µ0. Again the signs are complementary: the null has ≤, so the alternative has >. This is a one-tailed test called an upper-tailed (right-tailed) test; the rejection region is in the right tail, and anything beyond the critical point on the right is rejected.
The last form is the equality: H0: µ = µ0 and Ha: µ ≠ µ0. This is called a two-tailed test: the rejection region is on both sides, so we reject if the value goes below the lower critical point or above the upper critical point. What "value" means here I will explain next.
(Refer Slide Time: 12:11)
I will explain with an example of hypothesis testing. A major hospital in Chennai provides one of the most comprehensive emergency medical services in the world, operating in a multiple-hospital system with approximately 10 mobile medical units. The service goal is to respond to medical emergencies with a mean time of 8 minutes or less. So, the situation is: they have the mobile medical units, and whenever there is an emergency they have to respond in 8 minutes or less. The director of medical services wants to formulate a hypothesis test that uses a sample of emergency response times to determine whether or not the service goal of 8 minutes or less is being achieved. Look at this problem: the director wants to test whether the service goal of 8 minutes or less is being achieved. The status quo here is a response time of 8 minutes or less.
(Refer Slide Time: 13:40)
So the status quo goes to the null hypothesis. The null hypothesis is that the emergency service is meeting the response goal, so no follow-up action is required; another reason it is called the null hypothesis is that when you accept it, no follow-up action, no course of action, is required. We write H0: µ ≤ 8 because that is the status quo; the null hypothesis always takes the optimistic perspective. The opposite is the alternative hypothesis: the emergency service is not meeting the response goal and appropriate follow-up action is necessary, so Ha: µ > 8. Notice that the signs are complementary: the null hypothesis has ≤ 8, so the alternative has > 8. The status quo goes to the null hypothesis, where µ is the mean response time for the population of medical emergency requests.
(Refer Slide Time: 14:49)
Now we come to the type 1 error. Because hypothesis tests are based on sample data, we must allow for the possibility of errors: the conclusion, to accept or reject, is based on sample data, so there is always a possibility of error. A type 1 error is rejecting H0 when it is true. As I told you in the court context, somebody who really is innocent pleads innocent, but the judge does not accept the plea and the innocence is rejected; that incorrect rejection is a type 1 error. The probability of making a type 1 error when the null hypothesis is true is called the level of significance. We call the level of significance α, and most of the time it is 5%; the meaning of this 5% is that the probability of an incorrect rejection is only 5%. Applications of hypothesis testing that control only the type 1 error are often called significance tests.
(Refer Slide Time: 16:10)
A type 2 error is accepting H0 when it is false. It is difficult to control the probability of making a type 2 error; statisticians avoid the issue by saying "do not reject H0" instead of "accept H0". In the hypothesis-testing context, when we conclude, we do not say "accept the null hypothesis", we say "do not reject the null hypothesis", because there is no proof that the null hypothesis is true.
(Refer Slide Time: 16:53)
Look at the table of population conditions and conclusions. When H0 is true and you reject H0, that is a type 1 error, an incorrect rejection. In the other case, H0 is false but you have accepted it; that is a type 2 error. So another name for a type 1 error is incorrect rejection, and for a type 2 error it is false acceptance. We can also say that α, the type 1 error probability, is the producer's risk, and β, the type 2 error probability, is the consumer's risk.
What is the meaning of producer's risk and consumer's risk? Assume that I am the manufacturer, the producer, producing shafts whose diameter should be, say, 50 mm. Suppose a customer comes, takes a sample from my production lot and rejects the lot, saying that my production is not meeting the 50 mm specification. There are two possibilities: either the way the customer measured is wrong, or the sample I provided is not representative. If I actually have good-quality products and the lot is rejected, that is an incorrect rejection, which is the producer's risk. There is another possibility: assume I am actually making 49 mm shafts; the customer measures, concludes it is 50 mm and accepts my lot. That is a false acceptance, a type 2 error. Again there are two possibilities: the sample I provided happened to meet all the requirements while the whole lot did not, meaning my sample was not representative of the population, or the way they measured was wrong. That false acceptance is the type 2 error. In the next lecture we will see the application of type 2 errors in detail.
(Refer Slide Time: 19:29)
There are three approaches to hypothesis testing. The first is the p-value approach, which most statistical packages follow. The second is the critical value method, and the third is the confidence interval method; the confidence interval method is mostly used for two-tailed tests. First we will look at the p-value approach for one-tailed hypothesis testing. What is the p-value? The p-value is a probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis. So the p-value tells us whether the sample supports the null hypothesis or not: if the p-value is high, it supports the null hypothesis and we do not reject it; if the p-value is low, it does not support the null hypothesis and we reject it. What is the test statistic? In the Z context the test statistic is (X bar - µ)/(σ/√n); for a t test it is (X bar - µ)/(s/√n) with n - 1 degrees of freedom. Whatever value we calculate with the help of the sample is the test statistic. If the p-value is less than or equal to α, the value of the test statistic lies in the rejection region; I will show this in the next slide. Reject H0 if the p-value is less than α: a very small p-value does not support the null hypothesis, so we reject it. Here α is the significance level.
(Refer Slide Time: 21:36)
We will see how to do hypothesis testing using the p-value approach. Assume the problem gives α = 10%. We have to calculate the test statistic, the Z value: X bar, the sample mean, will be given, µ is the population mean we have assumed, σ is given, and we divide by √n. Suppose the calculated value is -1.46. For Z = -1.46 we find the corresponding area to the left; that area is the p-value. How do we get it? When the Z value is -1.46 we can read the corresponding left-tail area of the normal distribution.
(Refer Slide Time: 22:39)
We can do that with the help of Python. First you have to import scipy: from scipy import stats. Then the left-tail area for a Z statistic of -1.46 is given by the cumulative distribution function, stats.norm.cdf(-1.46), which gives a probability of about 0.07. Here α is 10%, so the p-value (0.07) is less than α. The region beyond the critical point is the rejection region and the rest is the acceptance region; since Z = -1.46 puts us on the left, on the rejection side, we have to reject the null hypothesis. This is a left-tailed (lower-tailed) test, as explained.
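Putting the steps above together, a short sketch (my addition) of the left-tailed p-value decision:

from scipy import stats

alpha = 0.10
z_stat = -1.46                        # computed as (x bar - mu) / (sigma / sqrt(n))
p_value = stats.norm.cdf(z_stat)      # left-tail area, about 0.072
print(p_value, p_value < alpha)       # True -> reject H0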
(Refer Slide Time: 24:00)
Now suppose it is a right-tailed test and the calculated Z value is 2.29: we got some X bar, µ and σ/√n, and the statistic comes out to 2.29. Assume α = 4%. We mark α = 4% from the right; the corresponding critical value is about 1.75. Now, when the Z value is 2.29, we want the corresponding area to the right side.
(Refer Slide Time: 24:46)
stats.norm.cdf(2.29) gives the area to the left of 2.29, so 1 - stats.norm.cdf(2.29) gives the area to the right. (The first computation is not strictly required here, because α is given directly; it is only to show how to go from a Z value to a probability.) The Z value is 2.29 and the right-side area is 1 minus the left-side area, which is 0.011. So the p-value is 0.011, which is less than α = 0.04. In the picture, the area beyond the critical value is the rejection region and the rest is the acceptance region; with a p-value of 0.011 you are standing in the rejection region, so you have to reject the null hypothesis. If instead the p-value had been larger than α, say 0.05, you would have crossed the boundary into the acceptance region and you would not reject the null hypothesis.
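A short sketch (my addition) of the right-tailed p-value computation for this case:

from scipy import stats

alpha = 0.04
z_stat = 2.29
p_value = 1 - stats.norm.cdf(z_stat)   # right-tail area, about 0.011
# equivalently: stats.norm.sf(z_stat)
print(p_value, p_value < alpha)        # True -> reject H0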
(Refer Slide Time: 26:07)
Now I will go to another method: the critical value approach for one-tailed hypothesis testing. The test statistic Z has the standard normal probability distribution, so we can use the standard normal table to find the Z value with an area of α in the lower tail or upper tail of the distribution. For example, if α = 0.05 for a lower-tailed test, the left-tail area is 0.05 and with the help of Python we can get the corresponding Z value; for an upper-tailed test with α = 0.05, you find the Z value whose cumulative (left-side) probability is 1 - 0.05 = 0.95. The value of the test statistic that establishes the boundary of the rejection region is called the critical value of the test. For 5% it is 1.645 on the right and -1.645 on the left; this defines the critical region. Rejection rule: for a left-tailed test, reject H0 if the calculated Z value is less than -1.645, because then you are standing on the rejection side; for a right-tailed test, reject H0 if the calculated Z value is greater than the table (critical) value.
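These critical values can be obtained with scipy's inverse normal CDF, a small sketch of my own:

from scipy import stats

print(stats.norm.ppf(0.05))       # lower-tailed critical value, about -1.645
print(stats.norm.ppf(1 - 0.05))   # upper-tailed critical value, about  1.645
print(stats.norm.ppf(0.10))       # about -1.28, used in the alpha = 10% example below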
(Refer Slide Time: 27:57)
For example, with σ known and α = 10%, the corresponding Z value is -1.28; that is the critical value defining the critical region. With the help of the sample data you find the Z statistic: if it lies to the left of -1.28 you reject H0, and if it lies to the right you do not reject H0.
(Refer Slide Time: 28:25)
For example, when the left-tail area equals 0.1 the corresponding Z value is -1.28, matching the previous slide. Now for the upper-tailed test with α = 0.05: the right-tail area is 0.05, so the left-side area is 0.95, and the corresponding Z value is 1.645. If the calculated Z value lies to the right of this, for example 1.7, you reject the null hypothesis; if it lies to the left, you do not reject the null hypothesis.
(Refer Slide Time: 29:01)
For example, for a left-tail area of 0.95 the value is about 1.645. Now that we have done hypothesis testing with the p-value approach and the critical value approach, the important point to note is this: in the p-value method, the decision whether to reject the null hypothesis is made by comparing probabilities, the p-value versus α; in the critical value approach, the decision is made by comparing the critical value with the calculated Z value. The decision will be the same; only the comparison differs, sometimes between probabilities and sometimes between critical values, but the end result is the same. So the steps are: step 1, develop the null and alternative hypotheses; step 2, specify the level of significance α, which is very important and must be decided before starting the test; step 3, collect the sample data and compute the test statistic, which may be a t value or a Z value. In the p-value approach, use the value of the test statistic to compute the p-value and reject H0 if the p-value is less than or equal to α. In the critical value method, use the level of significance α to determine the critical value and the rejection rule, then use the value of the test statistic and the rejection rule to determine whether to reject H0.
Dear students, in this lecture so far we have seen what a hypothesis is and what the null and alternative hypotheses are. We have learnt how to formulate hypotheses, and then we have seen hypothesis testing: what the left tail test, the right tail test, and the two tail test are, and the theory of how to test a hypothesis using the p-value approach and the critical value approach.
In the next class we will take one problem; with the help of that problem we will formulate the hypothesis, test it with the p-value approach and the critical value approach, and then compare the results. One more method that I did not cover in this lecture is testing the hypothesis with the help of a confidence interval; that we will do in the next class. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 17
Hypothesis Testing- II
Welcome students. In the last class we saw how to formulate the hypothesis, and then some theory about when the hypothesis should be accepted and when it should be rejected. In this class we will take some practical examples and solve them, so that we understand the concept of hypothesis testing in detail. The class objective is: when the population standard deviation is known, how to do hypothesis testing.
(Refer Slide Time: 01:00)
Here the hypothesis test is about the population mean. We will take one problem: this is an example of a one tailed test about the population mean when Sigma is known, Sigma being the population standard deviation. The assumption is on the population mean. The problem: the mean response time for a random sample of 30 pizza deliveries is 32 minutes. They conducted a sample survey with sample size 30, and from those 30 deliveries they found that the mean delivery time for a pizza is 32 minutes.
The population standard deviation is believed to be 10 minutes, so Sigma is known: this population standard deviation of 10 minutes is nothing but our Sigma. The pizza delivery service's director wants to perform a hypothesis test, at the alpha = 0.05 level of significance, to determine whether the service goal of 30 minutes or less is being achieved.
The manager of that store wants to verify whether the service goal of 30 minutes or less is being achieved or not. So the first step is formulating the hypothesis: the status quo is that the pizza is delivered within 30 minutes.
(Refer Slide Time: 02:19)
Before that we will see what values are given. In any hypothesis test there are two kinds of data: some data from the sample and some from the population. In the sample, the sample mean is 32 minutes and the sample size is n = 30; the sample mean is nothing but x-bar. With respect to the population, the population mean which we have to assume is 30 minutes, and the population standard deviation also has to be given. What is the population standard deviation? Going back: Sigma is 10 minutes.
(Refer Slide Time: 03:05)
Now we will solve this problem using the p-value approach. Step 1 is to develop the hypotheses. As I hinted in the previous class, the status quo should go to the null hypothesis. What is the status quo? Currently the pizza is delivered at an average of 30 minutes or less, so H0: mu ≤ 30. After you write the null hypothesis you should write the alternative hypothesis; the clue is that the signs are complementary, so when the null hypothesis is "less than or equal to", the alternative hypothesis should be "greater than", that is, mu > 30.
Step 2 is to specify the level of significance; alpha is given as 5%. Now we have to decide whether it is a right tailed or left tailed test. As I told you, by looking at the sign of the alternative hypothesis, which is "greater than 30", it is a right tailed test, with alpha = 0.05. Next, compute the value of the test statistic: Z = (X bar - µ) / (σ/√n), where X bar is 32, µ is the assumed mean 30, σ = 10 is the population standard deviation, and n = 30 is the sample size. So z = 1.095, approximately 1.09, and we mark 1.09 on the distribution.
(Refer Slide Time: 04:40)
So what do we have to do? The calculated Z value, or the test statistic, is 1.09, and we have to find the right side area beyond it; that area is the p-value. The meaning is: when the Z value is 1.09, we mark 1.09 and the area to its right is nothing but the p-value. With the help of Python you use 1 - stats.norm.cdf(1.09), because the cdf gives the area from minus infinity up to the Z value; when you want the right side area, the left side area has to be subtracted from 1. Let me draw it one more time: in Python we can get the area when the Z value is 1.09.
For example, approximately here we have to find the right side area. For finding it, stats.norm.cdf(1.09) gives the area from minus infinity up to the Z value, that is, the left side area; but we want the right side area. Since the total area is 1, 1 minus the left side area gives the right side area, which is 0.137. So for Z = 1.09 the p-value is 0.137.
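Putting those steps together, here is a minimal sketch of the calculation in Python, using the slide's values (x-bar = 32, mu0 = 30, sigma = 10, n = 30; the variable names are mine):

    # Sketch of the pizza-delivery p-value computation described above
    import numpy as np
    from scipy import stats

    x_bar, mu0, sigma, n = 32, 30, 10, 30
    z = (x_bar - mu0) / (sigma / np.sqrt(n))   # about 1.095
    p_value = 1 - stats.norm.cdf(z)            # upper-tail area, about 0.137
    print(z, p_value)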
Now we have to determine whether to reject H0 or not. Since the p-value 0.137 is greater than the alpha value of 0.05, we do not reject the null hypothesis. What does this mean? I am drawing the normal distribution one more time; even though I draw it in many places, it is for our understanding. The rejection region is the rightmost area of 0.05: whatever test statistic falls to the right of that boundary is rejected. But the area I found, the p-value, is 0.137.
So, since this area is 0.137, larger than 0.05, the test statistic has landed in the acceptance region, and I have to accept the null hypothesis. In case the p-value were, say, 0.04, I would be standing inside the rejection region, because the rejection region covers the rightmost 0.05 of area, and I would have to reject. But what has happened here is that we have crossed the boundary of 0.05 and entered the acceptance region.
So we have to accept the null hypothesis, though generally we do not say "accept"; we say "do not reject" the null hypothesis. The conclusion is that there is not sufficient statistical evidence to infer that the pizza delivery service is not meeting the response goal of 30 minutes. When we do not reject the null hypothesis, we are saying µ is less than or equal to 30 minutes, that is, the pizza is delivered within 30 minutes.
You know the offer: if the pizza is not delivered within 30 minutes, it is free. So they make sure that all deliveries happen within 30 minutes.
(Refer Slide Time: 08:12)
The same example is shown here: when Z equals 1.09 the corresponding p-value is 0.137, but alpha is 0.05; the red region represents the rejection region. So the test statistic falls in the acceptance region, and we have to accept (do not reject) the null hypothesis.
(Refer Slide Time: 08:32)
Now we will do the same problem with the critical value approach; in both approaches we have to get the same answer. We continue from step 4: determine the critical value and the rejection rule. When alpha equals 0.05, the critical Z value is 1.645. Our calculated Z value, that is, the test statistic, is about 1.09.
So what is the logic? If the calculated Z value, the test statistic, falls in the rejection region we have to reject it; but here it falls in the acceptance region, so we have to accept the null hypothesis.
(Refer Slide Time: 09:31)
The previous example was the p-value approach for a one tailed test; now we will see how to do hypothesis testing for a two-tailed test. As I told you, you know it is a two-tailed test when the sign in the alternative hypothesis is "not equal to". Generally, in a two tailed test there will be an upper limit and a lower limit.
If the test statistic falls above the upper limit of the acceptance region we reject it, and if it falls below the lower limit we also reject it.
(Refer Slide Time: 10:16)
Now, computing the p-value uses the following three steps. First, compute the value of the test statistic Z. Second, if Z is in the upper tail (Z > 0), find the area under the standard normal curve to the right of Z; if Z is in the lower tail (Z < 0), find the area under the standard normal curve to the left of Z. Third, double the tail area obtained in step two. Why is that the logical thing to do? Because it is a two tailed test.
So whatever tail area we found, on the left side or the right side, has to be doubled to obtain the p-value. The rejection rule is: if the doubled p-value is less than or equal to alpha, reject; otherwise accept. In other words, for the test statistic Z you find the tail area and multiply it by 2, because it is a two-tailed test.
And since the distribution is symmetric, instead of multiplying by 2 you can also find the tail area at -Z and the tail area at +Z and add them; if the added p-value is less than or equal to alpha, we have to reject.
(Refer Slide Time: 11:53)
The critical values will occur in both the lower and the upper tail of the standard normal curve; use the standard normal probability distribution table to find Z α/2. Why Z α/2? Because it is a two tailed test: if alpha equals 5%, then alpha/2 is 2.5%, that is 0.025, and when alpha/2 is 0.025 we have to find the corresponding Z value on the left side and on the right side. The rejection rule is: if Z is less than the lower limit, reject; if Z is above the upper limit, reject.
(Refer Slide Time: 12:45)
Here Z is the sample statistic. We will do an example: this example is hypothesis testing with a two-tailed test when Sigma is known. The example is a milk carton filling process: assume that a sample of 30 milk cartons gives a sample mean of 505 ml, and the population standard deviation is believed to be 10 ml. Perform a hypothesis test at the 0.03 level of significance, with the hypothesized population mean 500 ml, to help determine whether the filling process should continue operating or has to be stopped and corrected.
So what is happening? Assume it is an assembly line where the bottles are filled with 500 ml. Generally, overfilling is a problem and underfilling is also a problem; that is why we take H0: mu = 500 ml and Ha: mu ≠ 500 ml. The reason we did not go for a left tail or right tail test is that both overfilling and underfilling are problems for us, so we should go for a two-tailed test.
(Refer Slide Time: 14:08)
What data are given here? As usual, data are given for the sample and the population: n equals 30 and the sample mean is 505 ml. With respect to the population, we assume mu equals 500, the standard deviation Sigma equals 10 ml, and alpha equals 0.03. This two tailed problem we will solve with the help of the p-value approach. First you have to determine the hypotheses.
You see that mu equals 500; as I told you, this is because both overfilling and underfilling cause problems for the company. So the hypotheses are formulated. Next, specify the level of significance: it is given in the problem as 0.03, and since it is a two-tailed test I mark 0.03/2 = 0.015 on the right side and 0.015 on the left side. Next I have to compute the Z statistic: Z = (X bar - µ) / (σ/√n), where X bar is 505, mu is the assumed mean 500, σ = 10, and n = 30, giving Z = 2.74. So you have to mark 2.74.
(Refer Slide Time: 15:50)
For example, assume that 2.74 is marked here; when the Z value is 2.74 we have to find the area to its right. In Python, if you type 1 - stats.norm.cdf(2.74), the right side area is about 0.003. Because the distribution is symmetric, you multiply by two (or, equivalently, add the two tails). What this means is that when Z equals 2.74 the corresponding right side area is 0.00307, and when Z equals -2.74 the left side area is also 0.00307; when you add both you get about 0.0061.
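A minimal sketch of this two-tailed calculation (values from the slide: x-bar = 505, mu0 = 500, sigma = 10, n = 30; the names are mine):

    # Sketch of the two-tailed p-value for the milk-carton example
    import numpy as np
    from scipy import stats

    x_bar, mu0, sigma, n = 505, 500, 10, 30
    z = (x_bar - mu0) / (sigma / np.sqrt(n))   # about 2.74
    p_value = 2 * (1 - stats.norm.cdf(z))      # double the tail area: about 0.0061
    print(z, p_value)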
(Refer Slide Time: 16:42)
So what is happening? That 0.0061 (about 0.0062 after rounding) is less than 0.03; it is still less than the alpha value, so we have to reject H0. What was our null hypothesis? H0: µ = 500, H1: µ ≠ 500. We found that the p-value, about 0.006, is less than alpha = 0.03, so we reject the null hypothesis. When we reject the null hypothesis we are accepting the alternative hypothesis: there is sufficient statistical evidence to infer that the mean filling quantity is not 500 ml. So they have to stop the assembly line immediately and take corrective action; that is the inference.
(Refer Slide Time: 17:57)
That is shown here in picture form: Z = 2.74 is the test statistic, so the right side area is 0.0031, and when the test statistic is -2.74 the left side area is also 0.0031. After adding them it is still less than or equal to alpha, so we have to reject. Alternatively, we can compare the one-sided values: the tail area 0.0031 versus 0.015, which is half of the significance level; 0.0031 is less than 0.015, so we can reject. But many software packages do not report the half p-value and half alpha value.
You will get the added value: this 0.0031 is added with another 0.0031, and the 0.015 is added with another 0.015, so the added p-value is compared with the added alpha value and then we take the decision; if the p-value is less than alpha we have to reject.
(Refer Slide Time: 19:08)
So here we are rejecting. Now I will go to the critical value approach, continuing from the fourth step: determine the critical value and the rejection rule. For alpha/2 = 0.015, we have to find the critical value on the right side, and when the left side area is 0.015 we have to find the negative critical value -Z. If the calculated Z value lies to the right of the upper critical value, or to the left of the lower critical value, we have to reject.
What has happened is that 2.74 is our calculated Z value, the test statistic, while the critical value is 2.17, so 2.74 falls on the rejection side. Since the test statistic 2.74 lies in the rejection region we have to reject H0, and the conclusion is that there is sufficient statistical evidence to infer that the null hypothesis is not true.
So we accept the alternative hypothesis, which means the assembly process is not filling an average of 500 ml. We have to stop that assembly line and take corrective actions.
(Refer Slide Time: 20:50)
So what is the step here? As I told you, when this tail area is alpha/2, that is 0.015, the corresponding critical value on the left side is -2.17. How can you get this? In Python you use the inverse of the cumulative distribution, stats.norm.ppf: when you put in 0.015 you get the lower limit of the critical value, and because the distribution is symmetric, the right side critical value has the same magnitude with a positive sign.
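A one-line check of those critical values (a sketch, not the slide's code; stats.norm.ppf is the function that turns a tail probability into a Z value):

    # Sketch: two-tailed critical values from alpha/2 via the inverse CDF
    from scipy import stats

    alpha = 0.03
    z_low = stats.norm.ppf(alpha / 2)       # about -2.17
    z_high = stats.norm.ppf(1 - alpha / 2)  # about +2.17
    print(z_low, z_high)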
(Refer Slide Time: 21:26)
The same thing here: when alpha/2 is 0.015 the lower limit is -2.17, and with 0.015 on the right hand side the upper limit is 2.17. The Z value we calculated is 2.74, so 2.74 falls on the rejection side and we have to reject. In case the calculated Z value were, for example, 1.5, it would fall inside the limits and we would have to accept.
(Refer Slide Time: 22:02)
We will solve the same problem with the help of the confidence interval approach: the confidence interval approach for a two tailed test about the population mean. Select a simple random sample from the population and use the value of the sample mean x-bar to develop a confidence interval for the population mean µ. If the confidence interval contains the hypothesized value of 500, do not reject H0.
How are we going to develop this confidence interval? We know the familiar relation Z = (X bar - µ)/(σ/√n); when you rearrange it, we can express the interval in terms of x-bar: x-bar + Z (σ/√n) is the upper limit and x-bar - Z (σ/√n) is the lower limit. So with the help of x-bar we express an upper and a lower limit for µ; that formula comes from rearranging Z = (X bar - µ)/(σ/√n).
So the upper limit is X bar + Z σ/√n and the lower limit is X bar - Z σ/√n; by adjusting the Z equation we get this interval. If the hypothesized mean of 500 lies in this interval we have to accept the null hypothesis; otherwise reject it. Strictly, H0 should also be rejected if the hypothesized µ happens to equal one of the endpoints of the confidence interval.
Now, using the formula from the previous slide, x-bar ± Z α/2 (σ/√n): X bar is 505, Z α/2 is 2.17, Sigma is 10, and the sample size n is 30, so the interval is 505 ± 3.9619, roughly from 501.04 to 508.96. In this interval we are not able to capture 500, so we have to reject the null hypothesis. Because the hypothesized value of the population mean, µ = 500 ml, is not in this interval, the hypothesis testing conclusion is that the null hypothesis H0: µ = 500 is rejected.
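As a sketch, the same interval can be computed in a few lines (the names are mine; values as on the slide):

    # Sketch of the confidence-interval approach for the milk-carton test
    import numpy as np
    from scipy import stats

    x_bar, sigma, n, alpha = 505, 10, 30, 0.03
    z_half = stats.norm.ppf(1 - alpha / 2)          # about 2.17
    margin = z_half * sigma / np.sqrt(n)            # about 3.96
    lower, upper = x_bar - margin, x_bar + margin   # about (501.04, 508.96)
    print(lower, upper)   # 500 is outside this interval, so H0 is rejected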
Dear students, what have we seen in this lecture? We took one practical problem, the pizza delivery problem, and learned how to do a one tailed test (a right tailed test in that case); then we learned how to do a two tailed test. In the one tailed test we first solved with the p-value approach and then with the critical value approach; in the two tailed test we also solved first with the p-value approach and then with the critical value approach.
The third method was the confidence interval method; in all three methods the final result for the two tailed problem was the same, namely that we rejected the null hypothesis. In the next class we will start with the t-test. So far the population standard deviation was given; there may be situations where you do not know the population standard deviation, and at that time you should go for the t-test. That we will continue in the next class.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 18
Hypothesis Testing- III
Welcome students. In the previous lecture we saw hypothesis testing when Sigma is known; that test we call the Z test. Now we will go to another category of hypothesis testing procedures where Sigma is not known. Most of the time the population standard deviation is not known to us, and in that case we should go for another test, called the t test.
(Refer Slide Time: 00:49)
I will explain the connection between the Z test and the t test. Previously, remember that we used Z = (X bar - mu)/(σ/√n). In this case, since Sigma is not known to us, Sigma is replaced by s, the sample standard deviation. Another relation between Z and t: as n increases, look at this picture, the blue curve is the Z distribution and the pink curve is the t distribution.
As n increases, that is, as the sample size increases, the behaviour of the Z test and the t test becomes the same. That is why in many software packages there is no separate option for doing a Z test, only an option for doing a t test; for example, for hypothesis testing in SPSS, and even in Minitab, there is no separate column for the Z test, only a column for the t-test.
For doing the Z test, the t test is enough. So what is t? t = (X bar - mu) / (s/√n). Here the degrees of freedom come into play, because the shape of the t distribution is affected by the degrees of freedom; as the degrees of freedom increase, the behaviour approaches the normal. The test statistic has a t distribution with n - 1 degrees of freedom.
(Refer Slide Time: 02:22)
What is the rejection rule? The same rule we saw in the previous lecture for the Z test applies here as well: reject H0 if the p-value ≤ alpha. For the critical value approach, suppose H0: µ ≥ µ0, so the alternative hypothesis is µ < µ0 and it is a left tailed test; in the left tailed test, if the test statistic t is less than -t α, you have to reject.
For example, if H0: µ ≤ µ0, the signs are complementary, so it is a right tailed test: if the calculated t is greater than t α, lying on the right side of the critical value, we have to reject. And if H0: µ = µ0, the alternative hypothesis is H1: µ ≠ µ0; then there will be t α/2 on the right side and -t α/2 on the left side, and if the calculated t value lies beyond either of them we have to reject the null hypothesis.
(Refer Slide Time: 03:55)
For doing the t test in Python we have to import stats (from scipy import stats) and also numpy (import numpy as np). Now X is an array: 10, 12, 20, 21, 22, 24, 18 and 15. The function for doing the t test is stats.ttest_1samp, where X represents this array and the second argument represents our assumed mean. For this problem the null hypothesis is H0: mu = 15 and H1: mu ≠ 15.
When you run this you get the test statistic, which is t = (X bar - mu)/(s/√n). Python calculates the value of x-bar from the given array and the sample standard deviation s; with X, mu = 15, and the sample size n, everything is taken care of. Previously, when doing the Z test with (X bar - mu)/(σ/√n), the x-bar had to be computed from the sample by hand.
But here you need not do that; you just mention the array name and the built-in function takes care of it. The p-value we get is the two-sided p-value. Two-sided means: for a two-tailed test with t = 1.56, we take the area to the right of +1.56 plus the area to the left of -1.56, and that total area is about 0.16. Assume that alpha is 5%: the p-value exceeds 5%, so we have to accept the null hypothesis. This is the way to do the t test in Python.
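The whole example fits in a few lines; this sketch assumes scipy's stats.ttest_1samp, which returns the t statistic and the two-sided p-value:

    # Sketch of the one-sample t test described above
    import numpy as np
    from scipy import stats

    X = np.array([10, 12, 20, 21, 22, 24, 18, 15])
    t_stat, p_two_sided = stats.ttest_1samp(X, popmean=15)
    print(t_stat, p_two_sided)   # roughly t = 1.56, two-sided p = 0.16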
(Refer Slide Time: 06:08)
We will take an example for this. In an ice cream parlor at IIT Roorkee, the following data represent the number of ice creams sold over 20 days, so n equals 20. The shop owner wants to test, at alpha = 5%, the claim that the average is 10 or less. There are 20 data points: the number of ice creams sold on day 1 is 13, on day 2 it is 8, and so on through day 20.
If you are doing this manually, that is, with the help of a statistical table, you have to find x-bar from the sample, then the sample standard deviation, and then use the formula t = (X bar - mu)/(s/√n).
(Refer Slide Time: 06:58)
But Python makes this very easy, so let us go back. H0 is mu ≤ 10, so the alternative hypothesis is mu > 10, which makes it a right tailed test; I shade the right side, and alpha is 0.05. First I store the given values in an object X, so X is 13, 8, 10, and all the values shown in the previous table. Then stats.ttest_1samp(X, 10), where 10 is our assumed population mean and X is the array.
What we get is a t value of about -0.35. Now I draw the distribution: it is a right tailed test with 0.05 in the right tail. The reported p-value of about 0.72 is the two-sided one: the left side area beyond -0.35 plus the right side area beyond +0.35 add up to 0.72 (0.723). Since it is a one tailed test we have to divide by 2, which gives 0.36, and 0.36 is greater than 0.05.
So we have to accept the null hypothesis. In case the one-sided p-value were, for example, 0.03, we would land in the rejection region and would have to reject the null hypothesis. This is the right tailed test.
(Refer Slide Time: 09:20)
What was the t value here, for this one tailed test about the population mean when Sigma is unknown? The value of t is -0.384. When it is -0.384, we mark -0.384 on the t distribution; the left side area is 0.3526, because in Python you can find the area from the t value with stats.t.cdf, where you enter the t value and the corresponding degrees of freedom.
Previously our sample size was 20, so the degrees of freedom are 19; for this calculated t value the corresponding left side area is about 0.35. Similarly, if you put +0.384 you get the area to its left, which has to be subtracted from 1 to get the right side area, also about 0.35. The p-value of about 0.35 is more than alpha = 0.05, so we have to accept the null hypothesis.
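A sketch of that lookup (df = n - 1 = 19 for the 20-day sample; the t value is taken from the slide):

    # Sketch: converting a t statistic to a one-tailed p-value with the t CDF
    from scipy import stats

    p_left = stats.t.cdf(-0.384, df=19)    # left-tail area, about 0.35
    print(p_left)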
(Refer Slide Time: 10:57)
Next we will go for hypothesis testing for a proportion; it is similar to the null and alternative hypotheses for a mean. Here too the equality part of the hypothesis always appears in the null hypothesis. In general, a hypothesis test about the value of the population proportion p must take one of the following three forms. For example, H0: p ≥ p0 is a situation like this: a left tailed test.
Another possibility is H0: p ≤ p0, which gives a right tailed test. How am I naming left tail or right tail? By looking at the sign of the alternative hypothesis. The third form is the two tailed test, H0: p = p0: anything below the lower critical value or above the upper critical value is rejected, similar to what you have seen previously.
(Refer Slide Time: 12:03)
Here the test statistic is Z = (p bar - p0) / σ p bar, where p bar is the sample proportion, p0 is the assumed population proportion, and σ p bar is the standard error of the proportion. The standard error is √(p0 (1 - p0)/n). One assumption here is that n p0 and n (1 - p0) should both be at least 5, because the count follows a binomial distribution, and this condition allows us to approximate the binomial distribution by the normal distribution.
(Refer Slide Time: 12:48)
What is the rejection rule? The same: if the p-value is less than or equal to alpha, reject. For the critical value approach it is also the same: if H0 is p ≤ p0 it is a right tailed test, and if H0 is p ≥ p0 it is a left tailed test. For the right tailed test, if the Z value is more than Z alpha, reject; for the left tailed test, if the Z value is less than -Z alpha, reject.
When it is a two-tailed test, p ≠ p0, if Z lies beyond either -Z alpha/2 or +Z alpha/2 we have to reject.
(Refer Slide Time: 13:26)
We will take an example about the city traffic police. For the New Year's week, the city traffic police claimed that 50% of accidents would be caused by drunk driving. A sample of 120 accidents showed that 67 were caused by drunk driving; use the data to test the traffic police claim at alpha = 0.05. Here also sample data and a population proportion are given.
What sample data are given? First we will solve with the p-value approach. Going back: n is 120, and the number of successes is 67, so we have to find p bar = 67/120; that is the sample data. The hypothesized population proportion P is 0.5 and alpha is 0.05. Now we will use these data to test the claim that 50% of accidents were caused by drunk driving.
(Refer Slide Time: 14:49)
So first we go for the p-value approach. The first step in hypothesis testing is determining the hypotheses: the null hypothesis is H0: P = 0.5 and the alternative hypothesis is P ≠ 0.5; this is given in the problem. The next step is specifying the level of significance, alpha = 5%. The third step is computing the value of the test statistic, that is, the value of Z; for finding Z we need to know the standard error of the proportion.
So σ p bar = √(p0 (1 - p0)/n), where p0 is the assumed population proportion 0.5, giving √(0.5 × 0.5 / 120); the standard error of the proportion is 0.045644. Now we use the usual formula, Z = (p bar - p0) / σ p bar, where p bar is the sample proportion: out of 120 accidents, 67 were due to drunk driving.
So p bar is 67/120, minus the assumed population proportion 0.5, divided by the standard error 0.045644, which gives Z ≈ 1.28. Alpha is 0.05, and it is a two-tailed test. Why do we call it two-tailed? Look at the sign of the alternative hypothesis: if it is "not equal to" it is a two-tailed test, if it is "greater than" it is a right tailed test, and if it is "less than" it is a left tailed test.
Since the sign of our alternative hypothesis is "not equal to", it is a two-tailed test. When alpha = 0.05 is divided by 2, the area on each side is 0.025, and for a tail area of 0.025 the corresponding Z value is 1.96 on the right and -1.96 on the left. Our calculated Z value is 1.28, which lies between them, in the acceptance region, so we have to accept the null hypothesis.
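The hand calculation above can be checked with a short sketch (values from the slide: 67 successes, n = 120, p0 = 0.5; names are mine):

    # Sketch: standard error, Z, and two-tailed p-value for the proportion test
    import numpy as np
    from scipy import stats

    count, n, p0 = 67, 120, 0.5
    p_bar = count / n                        # sample proportion, about 0.558
    se = np.sqrt(p0 * (1 - p0) / n)          # about 0.0456
    z = (p_bar - p0) / se                    # about 1.28
    p_value = 2 * (1 - stats.norm.cdf(z))    # two-tailed p, about 0.20
    print(z, p_value)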
But the methodology we are using here is the p-value approach, so the decision to accept or reject is not made with respect to 1.96 and 1.28; we are going to compare probabilities. The left hand tail is 0.025 and the right hand tail is also 0.025. When the calculated Z value is 1.28 we look at the area to its right, and, because it is a two tailed test, we also look at the area to the left of -1.28; after adding the two tail areas, if the total exceeds 0.05 we accept the null hypothesis, otherwise we reject it. Now, what happens when the Z value is 1.28?
(Refer Slide Time: 18:54)
The corresponding cumulative probability is 0.8997. As I told you, the fourth step is to compute the p-value: when Z = 1.28 the cumulative probability from the left is 0.8997, so the right side area is 1 - 0.8997, which is approximately 0.1; and the comparison value on each side is 0.025.
Now we have to take the decision. You can compare 0.025 and 0.1: since 0.1 is greater than 0.025, the test statistic lies on the acceptance side, so we are going to accept the null hypothesis; that is one way to decide. Otherwise: the right side area is about 0.1, and similarly when the Z value is -1.28 the left side area is about 0.1, so when you add them you get 0.2006, which is greater than 0.05. Since this added value is greater than 0.05, we have to accept the null hypothesis.
So what is the simple rule? If the p-value is less than or equal to alpha, reject the null hypothesis; if the p-value is greater than alpha, accept (do not reject) the null hypothesis. Here the p-value, 0.2006, is greater than alpha, so the second condition is satisfied and we accept the null hypothesis. Going back to the step of computing the p-value: for Z = 1.28 the cumulative probability is 0.8997, so the one-tail area is 1 - 0.8997; because it is a two-tailed test we multiply by two, getting 0.2006, which is greater than 0.05.
So we go to the next step and determine whether to reject H0: because the p-value 0.2006 is greater than alpha = 0.05, we cannot reject it, which means we have to accept the null hypothesis.
(Refer Slide Time: 22:07)
In Python there is an inbuilt function for this: from statsmodels.stats.proportion import proportions_ztest. The proportion test is always a Z test; there is no proportion t-test available, so a proportion test here means the Z test. The count is 67, that is, the number of successes; the sample size is 120; and P is the hypothesized population proportion, 0.5. The syntax is proportions_ztest(count, sample size, P).
We get a Z value of about 1.28 and a p-value of about 0.19; in the previous slides we got about 0.20, and the small difference is due to the approximation used.
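A sketch of that call is below. One hedged note: by default proportions_ztest appears to estimate the standard error from the sample proportion rather than from p0 = 0.5, which would explain why its p-value (about 0.19) differs slightly from the hand calculation above.

    # Sketch of the built-in proportion Z test mentioned above
    from statsmodels.stats.proportion import proportions_ztest

    z_stat, p_value = proportions_ztest(count=67, nobs=120, value=0.5)
    print(z_stat, p_value)   # roughly z = 1.29, p = 0.198 on this data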
(Refer Slide Time: 23:02)
The same proportion test we now solve with the help of the critical value approach: determine the critical value and the rejection rule. For alpha/2, that is a tail area of 0.025, the Z value is 1.96 on the right and -1.96 on the left; if the calculated Z value goes beyond either side we have to reject. What has happened is that the Z value is 1.278, which lies between -1.96 and 1.96, so we have to accept the null hypothesis.
Dear students, we have solved this population proportion test first with the p-value approach and then with the critical value approach; both times we accepted the null hypothesis that P = 0.5, that is, that 50% of accidents are due to drunk driving. To conclude this lecture: first you saw the t test; you go for the t test when Sigma is not known, especially when the sample size is small (below 30). We solved one hypothesis testing problem with the p-value approach and the critical value approach.
After that we solved a problem on the population proportion: the hypothesized population proportion was given and we tested whether it could be accepted or not. In the next class we will see the different types of errors in hypothesis testing. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 19
Errors in Hypothesis Testing
Welcome students. In the last lecture we saw hypothesis formulation and testing: what the null hypothesis and the alternative hypothesis are, and a brief introduction to the errors in hypothesis testing. Then we did the Z test when Sigma is known, and the t-test when Sigma is not known.
(Refer Slide Time: 00:57)
In this lecture we will go into detail about errors in hypothesis testing. We will take an example; this example is taken from the famous book Applied Statistics and Probability for Engineers by Douglas C. Montgomery et al. It is a very interesting book and I would like to recommend it for further reading after this lecture. We are interested in the burning rate of a solid propellant used to power aircrew escape systems.
The burning rate is a random variable that can be described by a probability distribution. Suppose our interest focuses on the mean burning rate: the null hypothesis is mu = 50 centimeters per second and the alternative hypothesis is mu ≠ 50 centimeters per second.
(Refer Slide Time: 01:46)
You see that in the previous slide we assumed mu = 50, but what is the basis for that 50? There are different ways we can arrive at the value in the null hypothesis: one is past experience, our knowledge of the process, or previous tests and experiments. Another possibility is some theory or model regarding the process under study, or external considerations such as design and engineering specifications or contractual obligations.
(Refer Slide Time: 02:23)
Suppose I am following a confidence-interval style of decision rule, and I have found the lower limit 48.5 and the upper limit 51.5 when µ = 50 centimeters per second. If the sample mean is beyond 51.5 I will reject the null hypothesis, and likewise if it is below 48.5 I will again reject the null hypothesis. So this is the decision criterion for testing H0: mu = 50 centimeters per second versus H1: mu ≠ 50 centimeters per second. In this example the sample size is 10 and the population standard deviation is 2.5.
(Refer Slide Time: 03:27)
Using the previous example I will explain the meaning of a type 1 error. The true mean burning rate of the propellant could be equal to 50 centimeters per second; however, for the randomly selected propellant specimens that are tested, we could observe a value of the test statistic x-bar that falls into the critical (rejection) region. What is the rejection region? Below the lower limit or above the upper limit.
So if our x-bar lies in the rejection region we reject the null hypothesis, and if it lies in the acceptance region we accept it. We would then reject the null hypothesis H0 in favour of the alternative H1 when in fact H0 is really true. This type of wrong conclusion is called a type 1 error. What is the meaning of a type 1 error? Even though the null hypothesis is true, because the x-bar value lies on the rejection side we rejected the null hypothesis; this is an incorrect rejection.
(Refer Slide Time: 04:32)
In this slide you see two normal distributions, one on the left hand side and another on the right hand side, with the rejection regions shaded in red. Even when the true mean is 50, sometimes the value of x-bar may fall into the shaded region, and then we reject the null hypothesis even though the mean really is 50; for example, in this case the distribution's mean is 50.
So the meaning of a type 1 error: rejecting the null hypothesis H0 when it is true is defined as a type 1 error. The null hypothesis is actually true, but because the sample was randomly selected, the value of x-bar fell in the rejection region and we rejected the null hypothesis; that rejection is incorrect, and this is called a type 1 error.
(Refer Slide Time: 05:31)
Now I will explain the meaning of a type 2 error. Suppose the true mean burning rate is different from 50 centimeters per second, yet the sample mean x-bar falls in the acceptance region. Now the null hypothesis is not true, but the sample mean is falling in the acceptance region, so we would fail to reject H0 when it is false. H0 is false but we still accepted it; it is a false acceptance, and this type of wrong conclusion is called a type 2 error.
So, in short: a type 1 error is an incorrect rejection, and a type 2 error is a false acceptance.
(Refer Slide Time: 06:22)
You see this situation: there are two normal distributions, one with mean 50, which we have seen, and another with mean 52, and the two populations overlap. My concern is the population whose mean is 50, but there is another population, with mean 52, overlapping it. The overlapping region does not belong to the population with mean 50; it belongs to the population whose mean is 52. But because it lies on the acceptance side of the population with mean 50, we wrongly, falsely accept.
This is called a type 2 error: failing to reject the null hypothesis when it is false is defined as a type 2 error.
(Refer Slide Time: 07:06)
You see the type 1 and type 2 errors in the table: when H0 is correct but we reject it, that is the incorrect rejection, the type 1 error alpha; when H0 is incorrect but we accept it, that is the false acceptance, the type 2 error. We have seen this same table previously.
(Refer Slide Time: 07:28)
Now we will see how to calculate the type 1 and type 2 errors; first the type 1 error. In the propellant burning rate example, a type 1 error will occur when either the sample mean is greater than 51.5 or the sample mean is less than 48.5 while the true mean burning rate is mu = 50 centimeters per second. Suppose the standard deviation of the burning rate, sigma, is 2.5 centimeters per second and n = 10; then, for the sampling distribution centred at mu = 50, the standard error is Sigma by root n, which is about 0.79.
So the type 1 error is the probability that X bar is less than or equal to 48.5 when the true mean is 50, plus the probability that X bar is greater than 51.5 when the true mean is 50.
(Refer Slide Time: 08:26)
On the left hand side it is 48.5: what is the probability of the sample mean lying below 48.5, plus what is the probability of the sample mean lying above 51.5? When you add them, the resulting probability is nothing but your type 1 error. You see that each tail gives alpha/2 = 0.0288; on the right hand side you will see how this comes out.
(Refer Slide Time: 08:55)
Using Python (I have pasted the screenshot of the Python session after running it): for finding the type 1 error we first define a function. I am going to call it z_value(x, mu, sem), where sem is the standard error of the mean; note that whenever we define a function there should be a colon. First I find the Z value, which is (x - mu) divided by the standard error. If the Z value is less than 0, that is, if it lies on the negative side, the tail probability is simply the cumulative distribution function at Z: alpha = stats.norm.cdf(Z). If the Z value is greater than 0 we have to find the right side area, so the left side area has to be subtracted from one: else, alpha = 1 - stats.norm.cdf(Z). Then print alpha. So we are calculating alpha for different values of x, mu, and the standard error of the mean.
First I find the left side area: suppose the X value is 48.5, what is the area? X is 48.5; mu would be 0 for the standard normal distribution, but at present we take mu = 50, because only after converting to the Z scale does it become 0; and the standard error of the mean is Sigma by root n, about 0.79. Now we call the function we defined, giving x = 48.5, mu = 50, and the standard error of the mean; the left side area comes out as about 0.0288, matching the 0.0288 shown on the left of the slide.
Whatever value we get here is alpha/2. Now we have to find the right side: if you put the upper value, 51.5, in place of 48.5, the Z value becomes positive, so the else branch is executed: the left side area is found first and then subtracted from one to give the right side area, which is again about 0.0288.
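A cleaned-up sketch of that function (the names follow the description above; the slide's exact code is not reproduced here):

    # Sketch: tail area beyond x for a sampling distribution N(mu, sem^2)
    import numpy as np
    from scipy import stats

    def z_value(x, mu, sem):
        z = (x - mu) / sem
        if z < 0:
            alpha = stats.norm.cdf(z)        # left-tail area for a negative z
        else:
            alpha = 1 - stats.norm.cdf(z)    # right-tail area for a positive z
        print(alpha)

    sem = 2.5 / np.sqrt(10)      # standard error, about 0.79
    z_value(48.5, 50, sem)       # lower tail, about 0.0288
    z_value(51.5, 50, sem)       # upper tail, about 0.0288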
(Refer Slide Time: 12:00)
When you add 0.0287 + 0.0287 you get the alpha value, about 0.057; this is your type 1 error. What is the meaning of this type 1 error? It implies that 5.7% of all random samples would lead to rejection of the null hypothesis H0 when mu really is 50 centimeters per second. So the possibility of wrongly rejecting the null hypothesis is 5.7%. We can reduce the type 1 error; one possibility is widening the acceptance region.
What is widening the acceptance region? If you make the critical values 48 and 52 instead of 48.5 and 51.5, the acceptance region becomes wider: the lower side is 48 and the upper side is 52, and the tail areas become 0.0057 + 0.0057, which add up to about 0.0114. So the type 1 error can be reduced by widening the acceptance region; that is one possibility. Another possibility is increasing the sample size: in the previous problem we took a sample size of 10, and if you increase it to 16 the value of alpha decreases. A decreasing alpha means we are more accurate in making the decision, that is, the possibility of incorrect rejection is reduced.
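A small sketch that reproduces these comparisons (the helper name type1_error is mine, not the lecture's):

    # Sketch: type 1 error for different acceptance regions and sample sizes
    import numpy as np
    from scipy import stats

    def type1_error(lower, upper, mu, sigma, n):
        sem = sigma / np.sqrt(n)
        return stats.norm.cdf((lower - mu) / sem) + (1 - stats.norm.cdf((upper - mu) / sem))

    print(type1_error(48.5, 51.5, 50, 2.5, 10))   # about 0.057
    print(type1_error(48.0, 52.0, 50, 2.5, 10))   # wider region: about 0.011
    print(type1_error(48.5, 51.5, 50, 2.5, 16))   # larger sample: about 0.016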
(Refer Slide Time: 13:51)
Next we will go to the type 2 error: first I will explain what a type 2 error is, and then I will take two or three examples to calculate its value. In the previous example there were two overlapping populations, one with mean 50 and one with mean 52; the rejection region was defined for the population whose mean is 50, and another population, whose mean is 52, overlaps it.
The overlapping region is the pink one: because the pink portion lies on the acceptance side, we falsely accept it, assuming that that region came from the population whose mean is 50. It is a false acceptance, and this pink region is nothing but the value of your type 2 error.
(Refer Slide Time: 14:45)
How do we find this type 2 error? A type 2 error will be committed if the sample mean x-bar falls between the critical values 48.5 and 51.5 when mu = 52. Note it is 52, not 50: the population with mean 52 has nothing to do with the hypothesized mean of 50, it is a different population, but its values can lie in the acceptance region, so I falsely accept. That is our type 2 error, and the relevant boundary is 51.5.
(Refer Slide Time: 15:20)
So the two populations are overlapped. This is our original mu = 50, with the acceptance region from 48.5 to 51.5; that is for my assumed hypothesis value mu = 50. But actually there is another population whose mean is 52. Extending it, the portion of that population that falls below 51.5 does not really belong to the first one, but it lies on the acceptance side, so that portion is falsely accepted: this portion is called beta, while the tail portion of the first population is called alpha.
So we have to find the value of beta, and one more thing: alpha plus beta is not equal to 1; be very careful about that. The boundary 51.5 is the same for both populations, because the decision rule does not change. Now, for the population centred at 52, we have to find the area to the left of 51.5. How do we do that? First find the Z value: Z = (X - mu)/(σ/√n) = (51.5 - 52) divided by Sigma by root n.
The area to the left of that Z value is your type 2 error. I have done that: beta = stats.norm.cdf((51.5 - 52)/(Sigma/√n)), where the x value is 51.5, the true mean is 52, and Sigma by root n is about 0.79; that beta value is about 0.264. This is the area of the pink region; the Python code for it is shown on the slide.
Now take the true mean to be 50.5: the difference between the true mean and the assumed mean has become smaller. If I put the true mean as 50.5, changing the 52 to 50.5, then again beta = stats.norm.cdf((51.5 - 50.5)/(Sigma/√n)), which is about 0.89; you see that the value of beta has increased. So one point to remember: when the difference between your assumed mean and the true mean decreases, the type 2 error increases.
As an example, suppose there are two products, one original and one a duplicate, and both look similar in colour, texture, and quality; there is more possibility of committing a type 2 error when the difference between the original and the duplicate product is very small. In the same way, whenever the difference between the hypothesized mean and the true mean is small, there is more possibility of committing a type 2 error. That I will explain.
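A sketch of the beta calculation for both true means; this version keeps both boundaries of the acceptance region (the one-line version described above drops the negligible lower-tail term):

    # Sketch: probability that x_bar lands inside the acceptance region (48.5, 51.5)
    # when the true mean differs from the hypothesized 50
    import numpy as np
    from scipy import stats

    def type2_error(mu_true, lower=48.5, upper=51.5, sigma=2.5, n=10):
        sem = sigma / np.sqrt(n)
        return stats.norm.cdf((upper - mu_true) / sem) - stats.norm.cdf((lower - mu_true) / sem)

    print(type2_error(52))     # about 0.264
    print(type2_error(50.5))   # about 0.89: beta grows as the means get closer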
(Refer Slide Time: 19:15)
You see this again: when the true mean is 50.5 there is a larger pink region, that is, more type 2 error. So the point is: as the distance between your assumed mean and the true mean decreases, there is more possibility of committing a type 2 error. Now we will see the computation of this type 2 error, which I have already explained. Computing the probability of a type 2 error may be the most difficult concept, and doing it with the help of a statistical table takes more time; that is why we go for Python, which solves the problem very quickly. This area we have already found.
(Refer Slide Time: 19:57)
Now look at this table, which lists the acceptance region and the sample size. For the same sample size, when you widen the acceptance region, alpha decreases; and when alpha decreases, beta increases. So the relation between alpha and beta is like this: as I told you previously, there are two normal distributions, one with mean 50 with its rejection region, and another with mean 52.
The tail portion of the first distribution is alpha and the overlapping portion is beta; I will change the colour: the green portion is nothing but your type 2 error and the red portion is your type 1 error. Suppose I keep a pen at the boundary line where the two regions meet: if I move it towards the right hand side, alpha will decrease, and when alpha decreases, beta increases.
Suppose instead I move the pen towards the left hand side: beta will decrease but alpha will increase. That is what the table explains: the relationship between alpha and beta is inverse. Apart from that, compare the last columns, where the true mean is 52 versus 50.5: as the difference from the hypothesized mean decreases, the value of beta increases; that is one point.
Now look at another point: keep the acceptance region as it is and increase the sample size from 10 to 16. With the region 48.5 to 51.5 and sample size 10, alpha is about 0.05; for the same region, when I increase the sample size, the alpha value decreases.
Initially our alpha value was 0.05 and now it is about 0.01; look at beta also: initially it was 0.26 and now it is 0.21. So the point we learn here is: when you keep the acceptance region constant and increase the sample size, both alpha and beta decrease. To summarize this slide: for constant n, when you widen the acceptance region, alpha decreases and beta increases; and second, increasing n can decrease both types of error, type 1 and type 2.
(Refer Slide Time: 23:49)
Type 1 and type 2 errors have an inverse relationship: if you reduce the probability of one error, the other one increases, everything else being unchanged. That is the relation between alpha and beta; and remember, alpha + beta is not equal to 1.
(Refer Slide Time: 24:08)
Now let us see the factors affecting type 2 error. True value of the population parameter: beta increases when the difference between the hypothesized parameter and its true value decreases, the point I already told you. Significance level: beta increases when alpha decreases. Population standard deviation: beta increases when sigma increases. Sample size: beta increases when n decreases. That is the relation between the different elements and your type 2 error.
(Refer Slide Time: 24:49)
How to choose between type 1 and type 2 error? The choice depends upon the cost of the error. The first point is: choose a smaller type 1 error when the cost of rejecting the maintained hypothesis is high. For example, in a criminal trial, convicting an innocent person is a very costly mistake, so in that case the value of alpha should be very small. Choose a larger type 1 error when you have an interest in changing the status quo: if you are willing to change the status quo and you increase the alpha value, obviously there are more chances that the null hypothesis gets rejected, that is, the status quo is rejected.
(Refer Slide Time: 25:40)
We will take another problem to find the type 2 error. I am assuming the null hypothesis mu = 8.3 and the alternate hypothesis mu < 8.3; it is a left-tailed test. Determine the probability of type 2 error if the true mean is 7.4, at a 5% significance level, when Sigma is 3.1 and n equals 60. See, I have drawn this: my assumed mean is 8.3, and the question asks, if the true mean is 7.4, what is the value of beta? Any portion beyond the critical value on the left-hand side I will reject, but on the right-hand side I will accept.
But what is happening? Part of the distribution around the true mean 7.4 lies on the acceptance side of the test built around mu = 8.3. I actually should reject, but since that portion lies in the acceptance region I falsely accept it, so that area is nothing but your type 2 error, beta. See that the critical value is the same cut-off for both populations, so we have to find out this right-side area. For this purpose I have developed a function, because this function is very useful; first let us understand it.
So I am going to define a function, and I am going to call it type_2. The parameters it takes are mu1, my assumed mean; mu2, the true mean; Sigma, the population standard deviation; n, the sample size; and alpha, the significance level. For the first distribution, the upper one in the picture, I need the Z value at the critical location. I know the alpha value, so Z is obtained from the inverse of the standard normal distribution, stats.norm.ppf(alpha); substituting alpha gives me the value of Z. If I know the value of Z, I can find X bar, because from the relationship Z = (X bar − mu1)/(Sigma/√n), bringing Sigma/√n to the other side gives X bar = mu1 + Z × Sigma/√n, where the square root comes from numpy, np.sqrt(n).
So I will get X bar, the critical value of the sample mean. Now I have to find Z2: the same X bar, expressed on the normalized scale of the second population, is Z2, so Z2 = (X bar − mu2)/(Sigma/np.sqrt(n)), where mu2 is the mean of the true population, that is, 7.4. You see the condition: if mu1 is greater than mu2 — here mu1 is 8.3 and mu2 is 7.4, so yes, mu1 is greater than mu2 — I will get a positive Z2 value, and if Z2 is positive and I want the right-side area I have to subtract the cumulative probability from 1.
So, if the Z2 value is positive, beta = 1 − stats.norm.cdf(Z2); if the Z2 value is negative, we just read the left-side value, beta = stats.norm.cdf(Z2), and return beta. When you call this in Python as type_2, we have to supply the values: mu1, the assumed mean, is 8.3; mu2, the true mean, is 7.4; Sigma is 3.1; the sample size is 60; and alpha is 0.05. Since Z2 is positive here, the corresponding probability is 0.2729; that is the value you see.
So the right-side area, beta, is 0.2729. When will we commit the type 2 error? The error is made whenever the standardized sample mean falls to the right of the critical value −1.645, because then we accept H0 even though the true mean is 7.4; that area is nothing but your type 2 error.
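To make this concrete, here is a minimal sketch of such a helper in Python. The name type_2 and the argument order (mu1, mu2, sigma, n, alpha) follow the lecture's description, but the exact code on the slide is not reproduced here, so treat this as one possible implementation that reproduces the numbers quoted in the lecture (it also handles the right-tailed case used later in the restaurant example).

    import numpy as np
    from scipy import stats

    def type_2(mu1, mu2, sigma, n, alpha):
        """Probability of a type 2 error (beta) for a one-sided z-test of H0: mu = mu1,
        when the true mean is mu2. Assumes sigma is known."""
        se = sigma / np.sqrt(n)
        if mu1 > mu2:
            # left-tailed test: reject H0 when the sample mean falls below x_crit
            x_crit = mu1 + stats.norm.ppf(alpha) * se        # lower acceptance boundary
            z2 = (x_crit - mu2) / se
            beta = 1 - stats.norm.cdf(z2)                    # true distribution's area above x_crit
        else:
            # right-tailed test: reject H0 when the sample mean falls above x_crit
            x_crit = mu1 + stats.norm.ppf(1 - alpha) * se    # upper acceptance boundary
            z2 = (x_crit - mu2) / se
            beta = stats.norm.cdf(z2)                        # true distribution's area below x_crit
        return beta

    print(type_2(8.3, 7.4, 3.1, 60, 0.05))   # ≈ 0.2729, as quoted in the lecture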
(Refer Slide Time: 30:39)
I will take one more example to solve for the type 2 error. Here the null hypothesis is mu ≥ 12 and the alternative hypothesis is mu < 12, again a left-tailed test. We know that the critical sample mean is X bar = mu0 + Z × Sigma/√n. The assumed mean is 12; since it is a left-tailed test, when alpha equals 0.05 the corresponding Z is −1.645, and with the given Sigma the critical X bar works out to 11.979. If the value of X bar is below 11.979 we will reject, and if X bar is above 11.979 I will accept my null hypothesis.
(Refer Slide Time: 30:39)
Now we look at this graph. You see that my assumed mean is 12; suppose my true mean is 11.99. What will be the value of the type 2 error? When alpha equals 0.05 the critical Z is −1.645 and the critical sample mean is 11.979, so the Z value under the true distribution is Z2 = (X bar − mu2), that is (11.979 − 11.99)/(Sigma/√n). We get this Z value, and the corresponding right-hand-side area is my type 2 error. So beta equals 0.8023.
(Refer Slide Time: 32:09)
I have done some code for this type 2 error: 12 is my assumed mean, 11.99 is my true mean, then come my standard deviation Sigma, my n, and my alpha. What we get is beta = 0.8079, approximately 80%.
(Refer Slide Time: 32:39)
I will go to another problem: the true mean is 11.96, a little farther away from 12. The true mean is moving towards the left-hand side; now what has happened to the value of beta? This portion is my beta, the type 2 error. We have already developed a function for that, so we substitute 12, 11.96, 0.1, 60, 0.05, and what is happening is that the value of beta becomes very small. Again I am stressing this point: when the difference is bigger, the value of beta is very small, and when the difference is closer, beta is very high.
(Refer Slide Time: 33:19)
Now, hypothesis testing and decision making. Let us see what the application of this type 2 error is. So far we have illustrated hypothesis-testing applications referred to as significance tests. In such a test we compare the p-value to a controlled probability of type 1 error, alpha, which is called the level of significance for the test. What did we do to accept or reject a null hypothesis? We compared the p-value with alpha: if the p-value is smaller than alpha we rejected H0, and if the p-value is greater than alpha we accepted it. Now we go to the next point: with a significance test we control the probability of making a type 1 error but not the type 2 error. We recommended the conclusion "do not reject H0" rather than "accept H0", very cautiously, because the latter puts us at risk of making a type 2 error.
Why are we not accepting? Because there is no proof that the value we assumed in the null hypothesis is correct; that is why we say "do not reject it". Now in this example, what we are going to do is ask what the value in our null hypothesis should be so that something called the power of the test can be improved; we will see the definition of the power of a test. With the conclusion "do not reject H0" the statistical evidence is considered inconclusive; you are not able to say anything.
Usually this is an indication to postpone a decision until further research and testing are undertaken. But in many decision-making situations the decision-maker may want, and in some cases may be forced, to take action with both the conclusion "do not reject H0" and the conclusion "reject H0". In such situations it is recommended that the hypothesis-testing procedure be extended to include consideration of making a type 2 error.
So what we are saying is that whenever you do hypothesis testing you have to look at the possibility of committing a type 2 error also; I will show you that with the help of an example.
(Refer Slide Time: 35:48)
That point is called the power of the test. Suppose there is a restaurant, and when you order a dosa, a coffee, or a soup, the order takes different amounts of time because it has to be prepared. Assume the owner of the restaurant has a target service goal of 12 minutes or less; we want to know whether it is being achieved, and what the possibilities of committing a type 2 error are if we assume mu equal to 12. Let us see the problem in detail.
The mean response time for a random sample of 40 food orders is 13.25 minutes; the population standard deviation is believed to be 3.2 minutes. The restaurant owner wants to perform a hypothesis test, with alpha equal to 5% as the significance level, to determine whether the service goal of 12 minutes or less is being achieved. First you have to state the null hypothesis, and the null hypothesis is the status quo.
(Refer Slide Time: 37:00)
The status quo is mu ≤ 12; the alternative hypothesis is mu > 12, so this is a right-tailed test. This value is 12, and if the statistic goes beyond the critical point on this side I will reject. When alpha equals 0.05 the corresponding critical value is 1.645, so if the calculated Z value goes beyond 1.645 I will reject. We substitute into our Z formula, Z = (X bar − 12)/(3.2/√40); if it is greater than 1.645 I will reject.
From this relationship I can recover the critical value of X bar, which is 12.83. So we will accept H0 when the value of X bar is 12.83 or below, and if X bar goes above 12.83 we will reject.
(Refer Slide Time: 38:05)
Here we have assumed H0: mu ≤ 12 and H1: mu > 12; now the question is, what is the logic behind this 12? What I am going to do is supply different true values of mu instead of 12, say 14, 13.6, 13.2, 12.8 and so on, and compute Z = (12.83 − mu)/(3.2/√40). When you substitute mu = 14 in this Z formula, the Z value is −2.31 and the corresponding type 2 error is about 0.01; then 1 − beta is nothing but the power of the test. The power of a test is the probability of rejecting a null hypothesis when it should be rejected. Now, instead of 14, if I take 13.6 the Z value is −1.52 and the corresponding beta is about 0.064. You see that as the true mean becomes closer to 12, the value of beta increases, and whenever the value of beta increases the power of the test decreases.
(Refer Slide Time: 39:43)
How did we get this 0.0104? I have shown it in the next slide. I am calling the function which I used previously, type_2, with the assumed mean 12, the true mean 14, Sigma 3.2, n equal to 40, and alpha equal to 0.05, so my beta is about 0.01; this is for a true mean of 14. If it is 13.6, substituting 13.6 the beta value is about 0.06; when it is 13.2, substituting 13.2 the beta value is 0.23; if it is 12.8 the beta value is about 0.5, and so on.
(Refer Slide Time: 40:35)
Before plotting this, I will define what the power of a test is: the probability of correctly rejecting the null hypothesis when it is false is called the power of the test. For any particular value of mu, the power is 1 − beta. We can show graphically the power associated with each value of mu; such a graph is called a power curve.
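As a rough sketch of how such a power curve could be traced in Python for the restaurant example, the snippet below plots 1 − beta against the true mean. Only the individual beta values for 14, 13.6, 13.2, and 12.8 are quoted in the lecture; the plotting details and variable names here are assumptions.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    mu0, sigma, n, alpha = 12, 3.2, 40, 0.05
    se = sigma / np.sqrt(n)
    x_crit = mu0 + stats.norm.ppf(1 - alpha) * se      # ≈ 12.83, the acceptance boundary

    true_means = np.arange(12.01, 14.51, 0.01)
    beta = stats.norm.cdf((x_crit - true_means) / se)  # P(accept H0 | true mean)
    power = 1 - beta

    plt.plot(true_means, power)
    plt.xlabel("true mean mu")
    plt.ylabel("power = 1 - beta")
    plt.title("Power curve for H0: mu <= 12 (restaurant example)")
    plt.show()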
(Refer Slide Time: 41:08)
Now, the application of the power curve. First I will explain the elements of this power curve. On the x-axis is the true mean and on the y-axis is the value of 1 − beta, the probability of correctly rejecting the null hypothesis. What happens is that when the difference between the true mean and your assumed mean increases, there is a greater chance that you will correctly reject your null hypothesis. You can see that when the true mean is close to 12 the power is low, and when the true mean is around 14.5 the power is high.
So this power curve helps us decide what the possibility of committing a type 2 error is and, at the same time, what value the null hypothesis can take so that we can improve the power of the test, or in other words decrease beta. Dear students, in this lecture we have seen the different types of error that arise while doing hypothesis testing. We have taken one practical example, and I have explained the meaning of type 1 error and type 2 error.
We have also calculated the values of type 1 and type 2 error, and at the end we have seen the power of a hypothesis test, which we plot as a power curve. What we have done is tabulate possible values of mu with the corresponding beta values and the corresponding power of the test. With that we will conclude this lecture; in the next lecture we will go for two-sample hypothesis testing. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 20
Hypothesis Testing: Two Sample Test-I
Dear students, today we are entering into another topic, that is, hypothesis testing for two-sample tests. First I will explain the theory behind the two-sample test.
(Refer Slide Time: 00:38)
Look at this picture: there are two populations, population 1 and population 2. Suppose I take a sample from population 1 and find the sample mean; I call it X1 bar, that is, ΣX1 divided by n1. There is another population, shown in green; from that I take a sample and find its mean using the formula ΣX2 divided by n2.
If I plot them, this is the sampling distribution of X1 bar and the green one is the sampling distribution of X2 bar. The mean of the sampling distribution of X1 bar is µ1, and the mean of the sampling distribution of X2 bar is µ2. The variance of the first is σ1²/n1; that is the variance of X1 bar, and σ2²/n2 is the variance of X2 bar.
Suppose I find the difference of the two sample means; if I plot that difference, it will follow a normal distribution. The mean of this distribution is µ1 − µ2. For its variance, note that for population 1 the variance of the sample mean is σ1²/n1 and for population 2 it is σ2²/n2, and for the variance of the difference we have to add the variances. You might have studied the formula Var(A − B) = Var(A) + Var(B) for independent A and B. So, with σ1²/n1 for the first sample mean and σ2²/n2 for the second, the variance of the distribution shown in blue is σ1²/n1 + σ2²/n2.
(Refer Slide Time: 03:14)
We will use this result in what follows. This is the classification of the different two-sample tests: population means for independent samples, population means for dependent samples, population proportions, and population variances. For population means with independent samples we compare group 1 versus group 2, where the two populations are independent. For population means with dependent samples we compare the same group before and after a treatment, so these are dependent samples.
For population proportions there are two populations: we take proportion 1 from population 1 and compare it with proportion 2 from population 2. Similarly, for comparing the variances of two populations we compare the variance of population 1 with the variance of population 2.
(Refer Slide Time: 04:10)
First we will start with the difference between two means, the population means for independent samples. There are two possibilities: one is that sigma1 square and sigma2 square are known, where sigma1 square is the variance of population 1 and sigma2 square is the variance of population 2. The other category is that the variances of populations 1 and 2 are unknown. When the variances are unknown, we can either assume they are equal or assume they are not equal. If the variances of populations 1 and 2 are known, the test statistic is a Z value; if the variances are unknown, we have to go for the t statistic.
(Refer Slide Time: 04:57)
We will see the assumptions for the case when the variances of the two populations are known. The first assumption is that the samples are randomly and independently drawn; both populations are normal, as we saw in the first slide, and the population variances are known but may be different.
(Refer Slide Time: 05:17)
When sigma1 square and sigma2 square are known and both populations are normal, the variance of the difference of the sample means is nothing but the sum of their variances, that is, σ1²/n1 + σ2²/n2, which I have already explained. The corresponding Z statistic,
Z = ((X1 bar − X2 bar) − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2),
has a standard normal distribution.
For one sample, what was our formula? It was Z = (X bar − µ)/(σ/√n), where the denominator σ/√n is called the standard error. In that formula you take only one sample; here there are two samples and we are finding the difference of the two sample means, so it should be X1 bar − X2 bar. Previously we assumed one population mean; now we assume the difference of the two population means, µ1 − µ2, and the standard error σ/√n is replaced by the square root of the variance we derived above, that is, √(σ1²/n1 + σ2²/n2).
For population means with independent samples, we first state the null hypothesis. Under it, µ1 − µ2 equals some hypothesized difference D0, so the test statistic for µ1 − µ2 is ((X1 bar − X2 bar) − D0) / √(σ1²/n1 + σ2²/n2); previously we wrote µ1 − µ2 as it is, but instead we can write just the hypothesized difference D0.
(Refer Slide Time: 07:19)
When both sigmas are known to us, different types of test are possible. It may be a lower-tailed test; the lower tail is the left one. There the null hypothesis is µ1 ≥ µ2 and the alternate hypothesis is µ1 < µ2; this is a left-tailed test. Equivalently, you can bring µ2 to the left side, so it becomes H0: µ1 − µ2 ≥ 0 and H1: µ1 − µ2 < 0. For the upper-tailed case, H0 is µ1 ≤ µ2 and H1 is µ1 > µ2, or in difference form µ1 − µ2 ≤ 0 versus µ1 − µ2 > 0; that is a right-tailed test. We can also have the two-tailed case, µ1 = µ2 versus µ1 ≠ µ2, which in difference form is µ1 − µ2 = 0 versus µ1 − µ2 ≠ 0. We will see these different tests pictorially.
(Refer Slide Time: 08:28)
The decision rule for the left-tailed test is: reject H0 if the calculated Z value is less than −Zα. The decision rule for the right-tailed test is: reject H0 if the calculated Z value is greater than Zα. For the two-tailed test, reject H0 if the Z value is less than −Zα/2 or greater than Zα/2; either one can happen.
(Refer Slide Time: 08:58)
This distribution is the sampling distribution of the difference of the two sample means. The mean of this distribution is µ(X1 bar − X2 bar) and its standard deviation is √(σ1²/n1 + σ2²/n2); I have already explained how we got these values.
(Refer Slide Time: 09:23)
For the sampling distribution of the difference of two sample means, the expected value is E(X1 bar − X2 bar) = µ1 − µ2, and, as we have already seen, the standard deviation is √(σ1²/n1 + σ2²/n2).
(Refer Slide Time: 09:43)
The interval estimate for two samples is (X1 bar − X2 bar) ± Zα/2 √(σ1²/n1 + σ2²/n2). For one sample, remember, it was X bar ± Zα/2 (σ/√n), so the equation is just extended to the two-sample case: instead of X bar we write X1 bar − X2 bar, Zα/2 stays as it is, and σ/√n is replaced by √(σ1²/n1 + σ2²/n2). What I am trying to say is that even though it is a two-sample Z test, the logic of extending from one sample to two samples is very easy; you need not memorize the formula, you can extend it intuitively.
(Refer Slide Time: 10:52)
We will do a problem on this for the case when Sigma 1 and Sigma 2 are known; the problem is taken from the book Applied Statistics and Probability for Engineers by Montgomery. A product developer is interested in reducing the drying time of a primer paint, and two formulations of the paint are tested. Formulation 1 is the standard chemistry and formulation 2 has a new drying ingredient that should reduce the drying time. From experience it is known that the standard deviation of drying time is 8 minutes, and this inherent variability should be unaffected by the addition of the new ingredient.
Ten specimens are painted with formulation 1 and another ten specimens are painted with formulation 2; the 20 specimens are painted in random order. The sample average drying times are X1 bar = 121 minutes and X2 bar = 112 minutes respectively. What conclusions can the product developer draw about the effectiveness of the new ingredient?
(Refer Slide Time: 12:12)
Assume a significance level alpha equal to 0.05. The first step in hypothesis testing is to identify the quantity of interest: the difference in mean drying time, µ1 − µ2, which is 0 if there is no difference. Next we form the null hypothesis, µ1 − µ2 = 0, or equivalently µ1 = µ2, which says the mean drying times of formulations 1 and 2 are the same. The alternative hypothesis is µ1 > µ2: we expect that the new ingredient is more effective, so its drying time will be less. If formulation 2 dries faster, µ1 will be greater and µ2 will be smaller, so we write µ1 > µ2 as our alternative hypothesis. We want to reject H0 if the new ingredient reduces the mean drying time. For alpha = 5% the test statistic is Z = ((X1 bar − X2 bar) − 0) / √(σ1²/n1 + σ2²/n2); the standard deviation given is 8, so the variance is 64, and the sample sizes are n1 = n2 = 10, which we supply in this formula.
(Refer Slide Time: 13:30)
The rejection rule for the null hypothesis µ1 = µ2 is: reject H0 if the test statistic is greater than 1.645. How did we get this 1.645? Because this is a right-tailed test, and when alpha equals 0.05 the corresponding critical Z value is 1.645; if the calculated Z is greater than 1.645 we have to reject the null hypothesis. Next we calculate the Z test statistic. After supplying X1 bar and X2 bar we get 2.52, and 2.52 lies on the rejection side, so we have to reject the null hypothesis.
(Refer Slide Time: 14:29)
You see that 2.52 lies in the rejection region, so the decision, by comparing with the critical value, is to reject the null hypothesis.
(Refer Slide Time: 14:44)
The same problem can also be done by comparing p-values. What conclusion can we draw? Since the calculated Z, 2.52, is greater than 1.645, we reject H0, that is µ1 = µ2, at the alpha = 0.05 level and conclude that adding the new ingredient to the paint significantly reduces the drying time. Since we reject the null hypothesis, we accept the alternative hypothesis, which says that the new ingredient reduces the drying time.
Alternatively, we can find the p-value for this test. Because it is a right-tailed test, the p-value is 1 minus the cumulative probability at the calculated Z value of 2.52, which gives 0.0059. I will verify how we got this 0.0059 with the help of Python. We compare this 0.0059 with alpha: H0, µ1 = µ2, would be rejected at any significance level alpha greater than 0.0059, and here the p-value is very small, so we have to reject our null hypothesis.
(Refer Slide Time: 15:59)
We have written Python code for this: import pandas as pd, import numpy as np, import math, and from scipy import stats. We then define a function using the standard def syntax, called z_and_p, whose parameters are X1, X2, Sigma1, Sigma2, n1, n2, with the usual meanings. Then we find the Z value: Z = (X1 − X2) / √(σ1²/n1 + σ2²/n2). If the value of Z is less than 0, the p-value is the left-tail probability, which you can read as it is; but if the Z value is positive we need the right-side area, so 1 minus the left-side area gives the right-side area, p = 1 − stats.norm.cdf(z), and then we print Z and p. You can try this code yourself after pausing the video and verify the answer: call z_and_p and supply all the values.
In the previous problem X1 bar is 121, X2 bar is 112, Sigma1 is 8, Sigma2 is 8, n1 is 10, and n2 is 10. We get the Z value 2.51 as the calculated statistic; look at the p-value, and if you go to the previous slide, there also we got 0.0059, so here we get the same with the help of Python. Compared with alpha this is very small, so we reject the null hypothesis.
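A minimal sketch of such a two-sample Z helper is shown below; the name z_and_p and the argument order follow the lecture's description, but the exact code on the slide is not reproduced here, so treat it as one possible implementation.

    import math
    from scipy import stats

    def z_and_p(x1, x2, sigma1, sigma2, n1, n2):
        """Two-sample Z statistic and one-sided p-value when sigma1 and sigma2 are known."""
        z = (x1 - x2) / math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
        if z < 0:
            p = stats.norm.cdf(z)        # left-tail area
        else:
            p = 1 - stats.norm.cdf(z)    # right-tail area
        print(z, p)
        return z, p

    z_and_p(121, 112, 8, 8, 10, 10)      # ≈ (2.51, 0.0059), matching the lecture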
(Refer Slide Time: 18:01)
Now we will go to the second category of problem. In the previous problem we knew Sigma 1 square and Sigma 2 square; in this case Sigma 1 square and Sigma 2 square are unknown but we assume they are equal. There is a reasoning behind why we assume they are equal: a comparison is meaningful only when the variances of the two groups are equal. For example, comparing the performance of third-year students versus fourth-year students is not meaningful; we should compare third-year students with other third-year students. In that sense the variances have to be equal for the comparison to be meaningful. So, for the second case, shown in blue, Sigma 1 square and Sigma 2 square are unknown and we make the additional assumption that they are equal; there is another possibility where they are unknown and unequal, and we will come to that one after some time.
First we take the case of Sigma 1 square and Sigma 2 square unknown but assumed equal. The assumptions we make are: the samples are randomly and independently drawn, the populations are normally distributed, and the population variances are unknown but assumed to be equal.
(Refer Slide Time: 19:15)
Since the population variances are assumed equal, we use the two sample standard deviations and pool them to estimate the common variance, and we use a t value with n1 + n2 − 2 degrees of freedom. The concept here is this: assume there is group 1 with variance S1 square and sample size n1, and group 2 with variance S2 square and sample size n2. Assuming the population variances are equal, we can pool the variances. How do we pool them?
We compute a weighted variance, which we call the pooled variance. It is just like a weighted mean: the formula for a weighted mean is (W1 X1 + W2 X2) / (W1 + W2). Here the weights are the degrees of freedom: for sample 1 the weight is its degrees of freedom, n1 − 1, on the variance S1 square, and for sample 2 the weight is n2 − 1 on S2 square. In the denominator we sum the degrees of freedom, (n1 − 1) + (n2 − 1) = n1 + n2 − 2. So the pooled variance is Sp² = ((n1 − 1) S1² + (n2 − 1) S2²) / (n1 + n2 − 2).
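As a small illustration in Python (using the catalyst-yield summary values that appear in the worked example later in this lecture), the pooled variance can be computed like this; the helper name pooled_variance is just for illustration.

    def pooled_variance(s1, s2, n1, n2):
        """Pooled variance: a degrees-of-freedom weighted average of the two sample variances."""
        return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

    # catalyst example used later: s1 = 2.39, s2 = 2.98, n1 = n2 = 8
    sp_sq = pooled_variance(2.39, 2.98, 8, 8)
    print(sp_sq, sp_sq**0.5)   # ≈ 7.30 and ≈ 2.70, matching the slide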
(Refer Slide Time: 20:59)
The test statistic for µ1 − µ2 is now a t value instead of the Z we used previously:
t = ((X1 bar − X2 bar) − (µ1 − µ2)) / √(Sp²/n1 + Sp²/n2) = ((X1 bar − X2 bar) − (µ1 − µ2)) / (Sp √(1/n1 + 1/n2)),
since Sp² is the same in both terms and can be factored out. Here Sp² is the pooled variance, ((n1 − 1) S1² + (n2 − 1) S2²) / (n1 + n2 − 2), and the degrees of freedom are n1 + n2 − 2.
(Refer Slide Time: 21:38)
For the population means with standard deviations unknown, the same possibilities exist: this one is a left-tailed test, this one is a right-tailed test, and this one is a two-tailed test.
(Refer Slide Time: 21:46)
In the next slide you see the rejection regions: with −tα it is a left-tailed test, with +tα it is a right-tailed test, and the one with ±tα/2 on both sides is the two-tailed test.
(Refer Slide Time: 21:57)
We will take one problem where Sigma 1 square and Sigma 2 square are unknown but assumed equal. Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable. Since catalyst 2 is cheaper, it should be adopted provided it does not change the process yield. A test is run in the pilot plant and results in the data shown in the table. Is there any difference between the mean yields? Use alpha equal to 0.05 and assume equal variances. Looking at this problem, nowhere is the population variance given, so we should go for a two-sample t-test assuming equal variances.
(Refer Slide Time: 22:58)
As usual, step one is to see which parameter of the population we are studying, whether the mean or the variance. Here it is the mean, so the parameters of interest are µ1 and µ2, the mean process yields using catalysts 1 and 2 respectively, and we want to know if µ1 − µ2 = 0. So H0: µ1 − µ2 = 0 and H1: µ1 ≠ µ2, with alpha = 0.05. The test statistic is t0 = ((X1 bar − X2 bar) − Δ0) / (Sp √(1/n1 + 1/n2)), where Sp, the pooled standard deviation, was inside the square root as Sp² and has been brought outside.
(Refer Slide Time: 23:47)
When we look at the statistical table for 14 degrees of freedom, since it is a two-tailed test the area in each tail is 0.025, and the right-hand-side critical value is 2.145; I am writing it here at the bottom, +2.145 on the right-hand side and −2.145 on the left-hand side, since the distribution is symmetric. From the previous slide, the mean of sample 1 is 92.255 and its standard deviation is 2.39, for n1 equal to 8. Similarly, for sample 2 the sample mean is 92.733, the standard deviation is 2.98, and n2 is 8. Therefore we first find the pooled variance using the formula ((n1 − 1) s1² + (n2 − 1) s2²) / (n1 + n2 − 2); after substituting the values we get 7.30, and taking the square root gives the pooled standard deviation 2.70.
(Refer Slide Time: 25:10)
Substituting in this t formula we get −0.35; obviously we have to locate it relative to the rejection region. What we conclude is that the calculated t value, −0.35, lies between −2.145 and +2.145, so it falls on the acceptance side. We have to accept the null hypothesis: H0 cannot be rejected, that is, at the 5% level of significance we do not have strong evidence to conclude that catalyst 2 results in a mean yield that differs from the mean yield when catalyst 1 is used.
(Refer Slide Time: 25:59)
Now, with the help of Python, we will solve the problem where Sigma 1 square and Sigma 2 square are unknown, assuming equal variances. I assign the first set of yields to an array a and the second set to an array b. Then stats.ttest_ind is called with the arrays a and b and equal_var=True; it can be True or False, and after some time we will solve the problem where it is False. When it is True we directly get the test statistic −0.35 and the p-value 0.72, so we have to accept our null hypothesis. We can also see how we got −2.144: stats.t.ppf gives the critical value for a tail probability of 0.025 with 14 degrees of freedom, so we can compare t values as well. The critical t value is −2.14, but our test statistic −0.35 lies in the acceptance region, so we accept our null hypothesis.
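Because the raw yield arrays a and b are not reproduced in this transcript, the sketch below recreates the same numbers from the summary statistics quoted above, using scipy's ttest_ind_from_stats; the slide itself used stats.ttest_ind(a, b, equal_var=True) on the raw arrays.

    from scipy import stats

    # summary statistics from the slide: catalyst 1 and catalyst 2
    t_stat, p_value = stats.ttest_ind_from_stats(
        mean1=92.255, std1=2.39, nobs1=8,
        mean2=92.733, std2=2.98, nobs2=8,
        equal_var=True)                      # pooled-variance two-sample t-test
    print(t_stat, p_value)                   # ≈ -0.35 and ≈ 0.72

    # critical value for a two-tailed test at alpha = 0.05 with n1 + n2 - 2 = 14 df
    print(stats.t.ppf(0.025, 14))            # ≈ -2.145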
Dear students, so far we have seen hypothesis testing for comparing two samples. We have seen two types of problems: number one is where Sigma 1 square and Sigma 2 square are known and we compared the means with a Z test; the other type is where Sigma 1 square and Sigma 2 square are unknown, we assumed equal variances, and we did a t-test. I have also explained the concept behind the standard deviation of the difference of two sample means: if you want the variance of the difference, you have to add the variances, whereas for the mean of the difference you just take the difference of the population means. In the next class we will take a new problem where Sigma 1 square and Sigma 2 square are unknown with unequal variances; with that we will start the next class. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 21
Hypothesis Testing: Two Sample Test-II
Dear students, in the previous class we saw the problem of comparing two populations by hypothesis testing where sigma1 square and sigma2 square are known. Then we saw the next problem, where sigma1 square and sigma2 square are unknown but assumed to have equal variance. In this class we are going to take another category of problem, where sigma1 square and sigma2 square are unknown and assumed unequal.
(Refer Slide Time: 00:59)
What assumptions do we have? The samples are randomly and independently drawn, the populations are normally distributed, and the population variances are unknown and assumed unequal. Because the population variances are assumed unequal, the pooled variance is not appropriate; here we have to use a t value with ν degrees of freedom, where
ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ].
(Refer Slide Time: 01:30)
The t statistic is t = ((X1 bar − X2 bar) − D0) / √(s1²/n1 + s2²/n2). In the previous problem we used Sp²/n1 + Sp²/n2 because the variances were equal, but here they are unequal, so we cannot use Sp² in both places; we have to use s1² and s2² separately, with the degrees-of-freedom formula already explained.
(Refer Slide Time: 01:57)
We will take one sample problem and solve it. Arsenic concentration in public drinking water supplies is a potential health risk. An article in the Arizona Republic (Sunday, May 27, 2001) reported drinking-water arsenic concentrations, in parts per billion (ppb), for 10 metropolitan Phoenix communities and 10 communities in rural Arizona, given in the table. From the data we know X1 bar is 12.5 with s1 = 7.63, and X2 bar is 27.5 with s2 = 15.3.
(Refer Slide Time: 02:38)
What are the steps in the hypothesis testing? As usual, the first step is the parameters of interest: the mean arsenic concentrations for the two regions, µ1 and µ2, and we are interested in determining whether µ1 − µ2 = 0. So the null hypothesis is µ1 − µ2 = 0, or equivalently µ1 = µ2, and the alternative hypothesis is µ1 ≠ µ2, since the signs are complementary. Alpha is not given, so we assume it is 5%. The formula for the test statistic is t0 = (X1 bar − X2 bar) / √(s1²/n1 + s2²/n2).
(Refer Slide Time: 03:40)
For the degrees of freedom, all the data are given in the previous slide; we supply s1², n1, s2², n2 and so on, and we get about 13.2, so the degrees of freedom are approximately 13. Therefore, using alpha = 5%, we would reject H0: µ1 = µ2 if the calculated t value is greater than 2.160 or less than −2.160. There are two ways to get this critical value: we can refer to the t table, or we can use Python directly to get the critical value when the tail probability is 0.025 and the degrees of freedom are 13.
(Refer Slide Time: 04:31)
We compute the t statistic and the calculated t value is −2.77; obviously −2.77 lies on the rejection side, so we have to reject the null hypothesis. What we conclude is that there is evidence of a difference in the means, which means the amount of arsenic is not equal: in some communities it is more and in others it is less.
(Refer Slide Time: 05:00)
The conclusion: because t0 = −2.77 is less than −2.160, we have to reject the null hypothesis. There is evidence to conclude that the mean arsenic concentration in rural Arizona drinking water is different from the arsenic concentration in metropolitan Phoenix drinking water; it is not the same.
(Refer Slide Time: 05:22)
We will use Python to solve this problem. As I told you, stats.t.ppf gives the critical value of the t distribution: when the tail area is 0.025 (because it is a two-tailed test) and the degrees of freedom are 13, we get −2.160. Our calculated t value is −2.77, which lies further to the left, so obviously we have to reject H0. Instead of doing this by hand, it is very simple in Python: take one array with the values given for the metro communities and another array, call it rural, with the values for the rural area. Then call stats.ttest_ind(metro, rural, equal_var=False); do you remember that for the previous problem we wrote True? Now simply write False, and you will get your t value and your p-value for the two-tailed test. The p-value is very small, only about 0.01 compared to alpha, so we have to reject our null hypothesis.
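Because the raw metro and rural arrays are not reproduced in this transcript, the sketch below uses the quoted summary statistics with scipy's ttest_ind_from_stats; treat it as one way to reproduce the lecture's numbers rather than the exact code shown on the slide.

    from scipy import stats

    # summary statistics quoted in the lecture: metro vs rural Arizona (ppb)
    t_stat, p_value = stats.ttest_ind_from_stats(
        mean1=12.5, std1=7.63, nobs1=10,
        mean2=27.5, std2=15.3, nobs2=10,
        equal_var=False)                 # Welch's t-test (unequal variances)
    print(t_stat, p_value)               # ≈ -2.77, with a small p-value (≈ 0.016)

    # critical value at a tail area of 0.025 with about 13 degrees of freedom
    print(stats.t.ppf(0.025, 13))        # ≈ -2.160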
(Refer Slide Time: 06:42)
Now we will go to another setup, where the samples are dependent. These are tests of two related populations, called paired samples or matched samples; they are repeated measures. From the same population we collect data before and after, so we use the difference between the paired observations, di = xi − yi. The logic is this: it is the same group before and after a treatment. Think of a hair-oil advertisement: before applying the oil, what was the length of your hair, and after applying the oil for some time, what is it now? So you take xi from a subject before the treatment and yi from the same subject after the treatment, since the observations come in pairs rather than from independent samples. When you keep collecting these pairs and plot the differences, they will follow a normal distribution; we assume both populations are normally distributed.
(Refer Slide Time: 08:14)
The test statistic for the mean difference is a t value with n − 1 degrees of freedom. Previously we would write X bar; here we use the mean of the differences, d bar, which is the sum of all the differences divided by n and equals X bar − Y bar. sd is the standard deviation of those differences, and dividing it by √n gives the standard error, from which we get the t value.
(Refer Slide Time: 08:46)
Here also, the first one is your left-tailed test, the second one is the right-tailed test, and the third is the two-tailed test.
(Refer Slide Time: 08:51)
Whether it is a left-tailed, right-tailed, or two-tailed test, the statistic is the same: t = (d bar − µD)/(sd/√n) with n − 1 degrees of freedom.
(Refer Slide Time: 09:04)
We will take one example for dependent samples. An article in the Journal of Strain Analysis (volume 18, number 2) compares several methods of predicting the shear strength of steel plate girders. Data for two of these methods, the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are shown in the table. These two methods are different ways of predicting the shear strength. We wish to determine whether there is any difference, on average, between the two methods; because the measurements are made on the same girders, the two samples are paired.
(Refer Slide Time: 09:49)
So we have the Karlsruhe method and the Lehigh method: these are the values for the Karlsruhe method and these are the values for the Lehigh method. We find the differences; look at this, here the differences are positive, but there is a possibility that some differences may be negative as well, and that is no problem.
(Refer Slide Time: 10:09)
The first step: the parameter of interest is the difference in mean shear strength between the two methods, µD = µ1 − µ2. The null hypothesis is that this difference is 0, and the alternative is µD ≠ 0, with alpha equal to 5%. The test statistic is t0 = d bar / (sd/√n). It is nothing but the same thing as before: for a non-paired sample the t formula was (X bar − µ)/(s/√n); here X bar is replaced by the mean of the differences, s by the standard deviation of the differences, and µ by the hypothesized mean difference; all the rest is the same. What I am saying is that every statistical test is linked to the others; that is why you have to follow the order when learning statistics, because if you jump into some lectures in between, they may require certain prerequisites. If you follow the sequence, it will be very easy to connect one statistical test with another.
(Refer Slide Time: 11:28)
When you look at the table, with a tail area of 0.025 (half of alpha) and 8 degrees of freedom, the critical value is 2.306: on the positive side we get +2.306 and on the negative side −2.306; these are the values from the table. If the calculated t value lies beyond either of these limits, H0 will be rejected. What we have is a mean difference of 0.2736 and a standard deviation of 0.1356; when you plug in these values we get t = 6.05, which is far beyond the limit. So we have to reject the null hypothesis. The null hypothesis was µ1 − µ2 = 0 and H1 is µ1 − µ2 ≠ 0; when you reject H0, it means there is a difference; in the hair-oil example, it would mean the hair oil does have an effect in helping the hair grow.
(Refer Slide Time: 12:42)
So we reject, and we conclude that the two strength-prediction methods yield different results. Let us look at the p-value: finding the p-value with the help of a statistical table, especially the t table, is very difficult, but we will use Python to see what the p-value is.
(Refer Slide Time: 12:58)
So you take the first array, call it Karlsruhe, and the second one, call it Lehigh, and call stats.ttest_rel; the _rel means related, that is, dependent samples. Call it with the two arrays and you will get the t value and the p-value; the p-value is less than alpha, so we have to reject the null hypothesis, which means there is a difference.
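Since the nine Karlsruhe and Lehigh readings are not reproduced in this transcript, here is a hedged sketch that recovers the same t value from the summary statistics quoted above; on the slide the call would instead be stats.ttest_rel(karlsruhe, lehigh) on the raw arrays.

    import math
    from scipy import stats

    d_bar, s_d, n = 0.2736, 0.1356, 9         # mean and std of the paired differences
    t0 = d_bar / (s_d / math.sqrt(n))         # paired t statistic with n - 1 = 8 df
    p_value = 2 * stats.t.sf(abs(t0), n - 1)  # two-tailed p-value
    print(t0, p_value)                        # t0 ≈ 6.05, p-value well below 0.05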
(Refer Slide Time: 13:31)
Next we will go to another problem: inferences about the difference between two population proportions. A population proportion arises whenever there is a categorical variable; so far we have measured continuous variables from the population, but if there is a categorical variable we work with counts, which is nothing but a population proportion. For inferences about the difference between two population proportions, we can estimate p1 − p2 and we can do hypothesis tests about the difference p1 − p2.
Before going to the expected value, say this is population 1 and this is population 2. I take samples from population 1 and find a sample proportion each time, and I take samples from population 2 and do the same. Every time, I take the sample proportion from population 1 minus the sample proportion from population 2; if I plot these differences p1 bar − p2 bar, they will follow a normal distribution.
The same logic holds for the variance. Remember that for a single proportion the variance is pq/n; so for population 1 it is p1 q1/n1 and for population 2 it is p2 q2/n2, and for the variance of the difference you add the variances, giving (p1 q1/n1) + (p2 q2/n2). If you want the standard deviation, just take the square root of that; that is how we get this expression.
So the expected value, that is the mean of p1 bar − p2 bar, is p1 − p2, and the standard deviation is σ(p1 bar − p2 bar) = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 ). Here p1 bar and p2 bar are the sample proportions for populations 1 and 2, n1 is the size of the sample taken from population 1, and n2 is the size of the sample taken from population 2.
(Refer Slide Time: 16:15)
If the sample sizes are large, the sampling distribution of p1 bar − p2 bar can be approximated by a normal probability distribution. The sample sizes are sufficiently large if, for each sample, np ≥ 5 and n(1 − p) ≥ 5; only then can we approximate this by the normal distribution.
(Refer Slide Time: 16:41)
You see that the mean of this normal distribution is p1 − p2 and the standard deviation is √( p1(1 − p1)/n1 + p2(1 − p2)/n2 ).
(Refer Slide Time: 16:57)
The interval estimate is, as usual, (p1 bar − p2 bar) ± Zα/2 √( p1 bar(1 − p1 bar)/n1 + p2 bar(1 − p2 bar)/n2 ). For a single sample, can you recollect the formula for the interval estimate? It was p bar ± Zα/2 √(pq/n); there pq involves the population proportion, but we approximate it with the sample proportion, so we use p1 bar, q1 bar and n1, and likewise for the second sample. What I am saying is that everything comes from the earlier single-sample estimation and hypothesis testing.
(Refer Slide Time: 17:41)
We will take one problem on the point estimator of the difference between two population proportions. Let p1 be the proportion of the population of households aware of the product after the new campaign, and p2 the proportion of households aware of the product before the new campaign. We are running a new promotion and have to see the effectiveness of that promotional advertisement, so p1 bar is the sample proportion of households aware of the product after the new campaign and p2 bar is the sample proportion aware before the new campaign. We will find the difference to see whether the new campaign has any impact on awareness. For p1 bar − p2 bar, we know that p1 bar is 120 divided by 250, because it is given that out of 250 households, 120 are aware after the campaign, and before the campaign, out of 150, only 60 are aware.
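A quick sketch of this point estimate in Python, together with the interval formula from the earlier slide; the interval itself is not quoted in the lecture, so the last three lines are only an illustration of that formula.

    import math
    from scipy import stats

    p1_bar = 120 / 250          # aware after the campaign
    p2_bar = 60 / 150           # aware before the campaign
    diff = p1_bar - p2_bar
    print(diff)                 # 0.08, the 8% quoted on the next slide

    # illustrative 95% interval estimate using the formula above
    se = math.sqrt(p1_bar * (1 - p1_bar) / 250 + p2_bar * (1 - p2_bar) / 150)
    z = stats.norm.ppf(0.975)
    print(diff - z * se, diff + z * se)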
(Refer Slide Time: 18:55)
So p1 bar − p2 bar is 8%. For hypothesis testing we focus on tests involving no difference between the two population proportions. Here also there are left-tailed, right-tailed, and two-tailed tests; even for the two-sample population proportion it can be a left-tailed, right-tailed, or two-tailed test.
(Refer Slide Time: 19:15)
The standard error is √( p(1 − p)(1/n1 + 1/n2) ). Here also we can use a pooled estimate of p when we assume p1 = p2 = p. The meaning is that if you assume the two population proportions are the same, then we can pool them: p bar = (n1 p1 bar + n2 p2 bar)/(n1 + n2).
(Refer Slide Time: 19:45)
The test statistic is Z = (p1 bar − p2 bar) / √( p bar (1 − p bar)(1/n1 + 1/n2) ).
(Refer Slide Time: 19:54)
We will take one problem on the hypothesis test about p1 − p2. Extracts of St. John's Wort are widely used to treat depression; St. John's Wort is a plant used as a medicine for treating depression. An article in the April 18, 2001 issue of the Journal of the American Medical Association, titled "Effectiveness of St. John's Wort in Major Depression: A Randomized Controlled Trial", compared the efficacy of a standard extract of St. John's Wort with a placebo in 200 outpatients diagnosed with major depression. Patients were randomly assigned to two groups: one group received St. John's Wort and the other received the placebo.
After 8 weeks, 19 of the placebo-treated patients showed improvement, whereas 27 of those treated with St. John's Wort improved. Is there any reason to believe that St. John's Wort is effective in treating major depression? Assume alpha equal to 5%. We have to see the effect of this medicine in treating depression.
(Refer Slide Time: 21:19)
The parameters of interest are p1 and p2, the proportions of patients who improve following treatment with St. John's Wort (p1) and with the placebo (p2). The null hypothesis is that there is no effect of the new medicine, so we assume p1 = p2; the alternative hypothesis is p1 ≠ p2. It need not be a two-tailed test; it is up to you to decide whether it is a one-tailed or a two-tailed test, but at present we assume only that there is no difference between the treatments in the improvement of the patients.
The test statistic is Z0 = (p1 hat − p2 hat) / √( p hat(1 − p hat)(1/n1 + 1/n2) ), where p1 hat is 27/100, p2 hat is 19/100, and n1 = n2 = 100. Since under H0 the population proportions are the same, we find the pooled proportion, (19 + 27)/(100 + 100) = 0.23.
(Refer Slide Time: 22:30)
We have to reject our null hypothesis if the Z value is greater than +1.96 or less than −1.96. When you substitute, the Z value is 1.35; 1.96 is here, this is the rejection region, and our 1.35 lies in the acceptance region. What we conclude is that since Z0 = 1.35 does not exceed Z0.025 = 1.96, we cannot reject the null hypothesis. When you look at the p-value, it is 0.177, which is more than 0.05. So we have to accept the null hypothesis: there is insufficient evidence to support the claim that St. John's Wort is effective in treating major depression. We accept our null hypothesis, which means there is no evidence that St. John's Wort is effective.
(Refer Slide Time: 23:44)
We will do this with the help of Python; you can type these commands in Jupyter and verify them. We import math and make a function two_sample_proportions(p1, p2, n1, n2), and first we find the pooled proportion as (n1 p1 + n2 p2)/(n1 + n2). This is how we learn to use the two-sample proportion test with Python. In the given problem the sample proportion p1 is 0.27, p2 is 0.19, n1 is 100, and n2 is 100. We get a Z value of about 1.3 and a p-value of 0.17, which is more than our alpha of 0.05, so we accept our null hypothesis. If you want to see where this 1.35 falls, stats.norm.cdf(1.35) gives the corresponding cumulative probability of about 0.91 on the left side. We will now use Python to solve the two-sample proportion test.
(Refer Slide Time: 26:59)
We import pandas as pd, import numpy, import math, and from scipy we import stats. We make a function named two_samp_proportions(p1, p2, n1, n2). First we find the pooled sample proportion using the formula (n1 p1 + n2 p2)/(n1 + n2); then we find the variance as p(1 − p)((1/n1) + (1/n2)), and to get the standard deviation, the standard error, we take the square root of that variance, s_sq. Then Z is (p1 − p2) divided by this standard error. If the Z value is less than 0, the p-value from the table can be taken as it is; if the Z value is positive, we have to subtract the cumulative probability from 1. When we call this function, the one-sided p-value it returns has to be multiplied by 2 because it is a two-tailed test. We run two_samp_proportions with p1 = 0.27, p2 = 0.19, n1 = 100, and n2 = 100.
We get a Z value of about 1.33 and a p-value of 0.17, which is more than our alpha value, so we accept the null hypothesis. With this we will conclude and summarize this class: we have seen two-sample hypothesis testing when Sigma 1 square and Sigma 2 square are unknown and not assumed equal, and then we have seen the two-sample Z test for proportions; we took some problems and solved them. In the next class we will go for comparing two population variances using the F test.
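A minimal sketch of the two-sample proportion helper described above; the name two_samp_proportions and the argument order follow the lecture's description, but the slide's exact code is not reproduced here.

    import math
    from scipy import stats

    def two_samp_proportions(p1, p2, n1, n2):
        """Two-sample z-test for proportions using the pooled proportion; returns z and the one-sided p-value."""
        p = (n1 * p1 + n2 * p2) / (n1 + n2)            # pooled proportion
        s_sq = p * (1 - p) * (1 / n1 + 1 / n2)         # variance of p1 - p2 under H0
        z = (p1 - p2) / math.sqrt(s_sq)
        if z < 0:
            p_val = stats.norm.cdf(z)                  # left-tail area
        else:
            p_val = 1 - stats.norm.cdf(z)              # right-tail area
        return z, p_val

    z, p_one_sided = two_samp_proportions(0.27, 0.19, 100, 100)
    print(z, 2 * p_one_sided)    # z ≈ 1.34 and two-tailed p ≈ 0.18 (the lecture quotes 1.33 and 0.17)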
Data Analytics with Python
Prof. Ramesh Anbanandam
Lecture – 22
Hypothesis Testing: 2 Sample Test-III
(Refer Slide Time: 00:37)
Welcome, students. In this class we will continue with two-sample hypothesis testing and see how to compare the variances of two populations. The agenda for this class is: comparing two population variances; then, since many students have the doubt of when to go for a z test and when to go for a t test, I will clarify that; and the third, most important item is what the sample size should be for doing any statistical analysis.
(Refer Slide Time: 00:58)
Now we will do the hypothesis test for two variances; the goal is to test hypotheses about population variances. You see that with H0: σ2² ≤ σ1² and H1: σ2² > σ1² we have one one-tailed form; the other one-tailed form has the inequalities reversed; and there is also the two-tailed form. Here, the sampling distribution is not a normal distribution but a right-skewed one.
You have to remember that I did not draw a normal distribution: this right-skewed distribution is called the F distribution. In the F distribution we have to find the F statistic, and that F statistic will help us to accept or reject the null hypothesis. As usual, here also there may be a lower-tailed test, an upper-tailed test, or a two-tailed test, but a very important assumption to remember is that the two populations are assumed to be independent and normally distributed.
(Refer Slide Time: 02:15)
465
The test statistic for comparing two population variances is F = (s1²/σ1²) / (s2²/σ2²). If you assume both populations have equal variance, F becomes s1²/s2², with n1 - 1 numerator degrees of freedom and n2 - 1 denominator degrees of freedom.
(Refer Slide Time: 02:55)
As I told you, the critical value for a hypothesis test about two population variances uses F = s1²/s2²; here we are assuming both populations have equal variance, and F has n1 - 1 numerator degrees of freedom and n2 - 1 denominator degrees of freedom.
(Refer Slide Time: 03:17)
466
The decision rule for two variances: most of the time the F test is a right-tailed test, because with a variance we are only bothered about its upper limit; there is no meaning in worrying about the lower limit. What we have to ensure is that the variance does not exceed its upper limit.

It is like a bus that is due at 9 o'clock: if it comes at 9:05 or 9:10 you will be bothered, but if it comes early there is no problem. In the same way, we care only about the upper limit; the lower limit is not that important. The degrees of freedom here are n1 - 1 for the numerator and n2 - 1 for the denominator. For the two-tailed test the hypotheses are H0: σ1² = σ2² and H1: σ1² ≠ σ2², and the rejection region again uses n1 - 1 and n2 - 1 degrees of freedom.
While finding the value of F we have to make sure that the larger variance is in the numerator. So we take s1² to be the larger of the two sample variances, and that goes in the numerator. If you put the larger of the two variances in the numerator, you need not bother about the lower limit of the two-tailed test; you only have to compare against the upper limit when accepting or rejecting the null hypothesis.
467
(Refer Slide Time: 05:12)
We will take one problem. A company manufactures impellers for use in jet turbine engines. One of the operations involves grinding a particular surface finish on a titanium alloy component. Two different grinding processes can be used, and both processes can produce parts with identical mean surface roughness. The manufacturing engineers would like to select the process having the least variability in the surface roughness.
Surface roughness is measured along the surface; a surface cannot be perfectly smooth, so there will always be small variations, and the manufacturers therefore want the process with the least variability in surface roughness. A random sample of n1 = 11 parts from the first process gives a sample standard deviation of s1 = 5.1 microinches, and a random sample of n2 = 16 parts from the second process gives a sample standard deviation of s2 = 4.7 microinches.

We will find a 90% confidence interval on the ratio of the two standard deviations and then check whether the variances are equal or not.
(Refer Slide Time: 06:42)
468
As usual, first we have to form the null hypothesis. The null hypothesis is H0: σ2² = σ1², which means both processes have equal variance. The alternative hypothesis is H1: σ2² ≠ σ1², that is, there is a difference between the variances. Then we find the critical value with alpha equal to 10%.
(Refer Slide Time: 07:06)
The first thing is to find the degrees of freedom: n1 - 1 = 11 - 1 = 10 for the first sample and n2 - 1 = 16 - 1 = 15 for the second sample.
(Refer Slide Time: 07:24)
469
Assuming that the two processes are independent and the surface roughness is normally distributed, we are going to find a confidence interval on the ratio of their variances. The logic is that if σ1² = σ2², the value 1 will be captured inside that interval. The F table gives only the right-tail area, so for a 90% interval the right-tail area is 0.05 and the left-tail area is 0.05.

From the F table we can read directly the F value for a right-tail area of 0.05; if you want the left-side 0.05 point, you have to read the 0.95 level, and only then can you find the lower limit of F. So we write the confidence interval as (s1²/s2²) F0.95,15,10 ≤ σ1²/σ2² ≤ (s1²/s2²) F0.05,15,10. The upper limit is obtained with F0.05, numerator degrees of freedom 15 and denominator degrees of freedom 10.
The larger variance should go in the numerator. The left-side critical value is F0.95 with 15, 10 degrees of freedom, because the F table is read from right to left. So the ratio is s1²/s2² = 5.1²/4.7², and the value F0.95,15,10 has to be read from the table. First we read the upper limit: when the right-tail area is 0.05, the numerator degrees of freedom is 15 and the denominator degrees of freedom is 10, and we see what the F value is.
470
(Refer Slide Time: 09:48)
With numerator degrees of freedom 15 and denominator degrees of freedom 10, the table value is 2.85; that is our F value. Now we need the lower limit: we have to find the F value at the 0.95 level for 15, 10 degrees of freedom. For that we reverse the degrees of freedom to 10, 15; the table value for 10, 15 is 2.54, because in the F table the columns give the numerator degrees of freedom and the rows the denominator degrees of freedom. So 2.85 corresponds to 15, 10 and 2.54 corresponds to 10, 15 at alpha = 0.05. So how do we get the lower limit of F?
(Refer Slide Time: 11:07)
471
If you want F0.95 for 15, 10 degrees of freedom, first reverse the degrees of freedom to 10, 15, read the table at alpha = 0.05, and then take the reciprocal: 1 divided by 2.54 = 0.39. That is how we get 0.39. When you simplify, the lower limit is 0.678 and the upper limit is 1.887; these come after taking the square root, because once the calculation on the variance ratio is complete we take the square root to get the interval on the ratio of standard deviations.

You see that the interval captures 1, which means there is a possibility that σ1 = σ2; the ratio may take the value 1, so we have to accept the null hypothesis that σ1 = σ2. Since this confidence interval includes unity, we cannot claim that the standard deviations of surface roughness for the two processes are different at the 90% level of confidence.

If unity had not been captured, we could not say both variances are equal; but here the lower limit is about 0.68 and the upper limit about 1.89, so the interval does capture unity. There is a possibility that the ratio σ1/σ2 equals 1, that is, σ1 may equal σ2.
(Refer Slide Time: 12:57)
472
Now we will use Python to find the F table values. For that, import pandas as pd, import numpy as np, import math, from scipy import stats, and import scipy. Then scipy.stats.f.ppf takes the probability: q = 1 - 0.05 = 0.95, with numerator degrees of freedom dfn = 15 and denominator degrees of freedom dfd = 10, gives 2.84, matching the table value read earlier. If you want the lower limit, you can read it directly: scipy.stats.f.ppf with q = 0.05, numerator degrees of freedom 15 and denominator degrees of freedom 10 gives 0.39. So this was our lower limit and that was our upper limit in the previous problem.
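For reference, a minimal sketch of these two lookups (q, dfn and dfd are the scipy.stats.f.ppf argument names):

import scipy.stats

# upper critical value: right-tail area 0.05 with 15 numerator and 10 denominator df
upper = scipy.stats.f.ppf(q=1 - 0.05, dfn=15, dfd=10)   # about 2.85
# lower critical value: left-tail area 0.05 with the same degrees of freedom
lower = scipy.stats.f.ppf(q=0.05, dfn=15, dfd=10)       # about 0.39, i.e. 1 / F(0.05; 10, 15)
print(lower, upper)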
(Refer Slide Time: 14:10)
473
So far, the standard deviations (that is, the variances) of the two populations were given. Instead, there is the possibility that the raw data are given: a sample from population 1, which I will call X, and a sample from population 2, which I will call Y, and you have to find the p-value from them. My null hypothesis is H0: σx² = σy², the alternative hypothesis is H1: σx² ≠ σy², and I take alpha equal to 5%, that is, 0.05.
Declare the variables X and Y, import numpy as np, then find the ratio F = variance of X divided by variance of Y. Then find the degrees of freedom: the len function tells you how many elements are in an array, so the number of elements in X minus 1 is the numerator degrees of freedom and the number of elements in Y minus 1 is the denominator degrees of freedom. To get the p-value we use scipy.stats.f.cdf: declare F as found above, then give the numerator and denominator degrees of freedom. When you run it, the p-value is 0.024. On the F distribution, alpha is 0.05, and since 0.024 is less than 0.05 we have to reject the null hypothesis. Rejecting it means accepting that σx² ≠ σy². This is the easiest way to test the variances of two populations.
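A minimal sketch of this procedure is given below; the arrays X and Y are hypothetical stand-ins (no data values are read out in the lecture), and the right-tail area 1 - cdf is used as the p-value for this right-tailed test:

import numpy as np
import scipy.stats

X = np.array([3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2, 2.7])   # hypothetical sample from population 1
Y = np.array([2.6, 2.9, 2.4, 2.8, 2.5, 2.7, 2.6, 2.8])   # hypothetical sample from population 2

F = np.var(X, ddof=1) / np.var(Y, ddof=1)   # ratio of sample variances (ddof=1 gives s squared)
dfn = len(X) - 1                            # numerator degrees of freedom
dfd = len(Y) - 1                            # denominator degrees of freedom

p_value = 1 - scipy.stats.f.cdf(F, dfn, dfd)
print(F, p_value)                           # reject H0 if p_value < alpha (e.g. 0.05)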
(Refer Slide Time: 16:27)
474
Many students may have the doubt of when to go for a Z test and when to go for a t test. Before that, let me summarize what we have done: the one-sample Z test, the one-sample t test, the Z proportion test, the two-sample Z test, and the two-sample t test. In the two-sample t test we had the assumption that the population variances are equal or not equal. Then we did the two-sample Z proportion test, and finally we compared the variances of two populations.
After completing all this, people may have the doubt of when to go for a Z test versus a t test. Whenever sigma is known, that is, whenever the population standard deviation is known, you should go for the Z test. To decide, there are two criteria: one is the sample size and the other is whether sigma is known or unknown. Without considering the sample size, whenever sigma is known, whether the sample size is less than 30 or greater than 30, you should go for the Z test.
When do you go for the t test? Whenever sigma is unknown and n is less than 30, you should go for the t test. There is also the possibility that sigma is unknown but n is greater than 30; then instead of the t test you can go for the Z test because, as we studied previously, the t distribution approaches the Z distribution: for small degrees of freedom the t distribution is flatter, and as the degrees of freedom increase it behaves like the Z distribution.
475
So whenever sigma is unknown and n is greater than 30 you can go for the Z test; that is why many statistical packages do not have a separate tab for running a Z test, only a tab for the t test.
(Refer Slide Time: 19:01)
Another important question students will have is how to choose the sample size: determining the sample size for a hypothesis test about the population mean. There are two distributions here: μ0 is the mean under the null hypothesis and μa is the alternative mean. This is a right-tailed test: any point that falls to the right of the critical value is rejected, and the portion falling on the acceptance side of the distribution is accepted. Here beta is the probability of false acceptance and alpha is the probability of incorrect rejection. Considering alpha and beta together, note that at the critical point the value of x̄ is the same for both distributions.
(Refer Slide Time: 20:00)
476
By equating this x̄ for both distributions we can derive a formula for the sample size that accounts for alpha and beta; it is a very simple derivation. Because the critical line is the same for both distributions, Zα = (x̄ - μ0) / (σ/√n) and Zβ = (x̄ - μa) / (σ/√n). The value of Zβ will be negative because it is a lower value minus an upper value. When you equate these two equations and simplify for σ/√n, you get the value of n:

n = (Zα + Zβ)² σ² / (μ0 - μa)².

If it is a two-tailed test, we have to use Zα/2 instead of Zα.
(Refer Slide Time: 21:24)
477
We will do a small problem on this. Assume that a manufacturing company makes the following statement about the allowable probabilities of Type I and Type II errors. Suppose somebody is manufacturing a shaft whose specified mean diameter is 12 mm. If the mean diameter is 12 mm, I am willing to risk alpha = 5% probability of rejecting the null hypothesis; if the mean diameter is 0.75 mm over the specification, that is, if μ = 12.75, I am willing to take a risk of beta = 10% probability of not rejecting, that is, a false acceptance. So alpha is given and beta is given; alpha is the Type I error, beta is the Type II error, and the actual mean and the alternative mean are given.
(Refer Slide Time: 22:27)
478
If that is the case, what should the sample size be? Alpha = 5% and beta = 10%. For alpha = 5%, Zα = 1.645. Zβ can be found directly; going back, Zβ = (x̄ - μa) / (σ/√n), and everything is given. So Zβ = 1.28, μ0 = 12, μa = 12.75 and σ = 3.2; just substitute these values and n comes out to about 156.
(Refer Slide Time: 23:07)
We will use Python for solving this problem: import pandas as pd, import numpy as np, from scipy import stats, import math. We define a function samplesize(alpha, beta, mu1, mu2, sigma) with Z1 = -1 * stats.norm.ppf(alpha) and Z2 = -1 * stats.norm.ppf(beta), and n = (Z1 + Z2)² σ² / (mu1 - mu2)², the same formula as before; then we print n. When you supply the alpha, beta, mu1, mu2 and sigma values, we get 155.90.
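A runnable sketch of this function, following the lecture's argument order and the formula n = (Zα + Zβ)²σ²/(μ0 - μa)²:

from scipy import stats

def samplesize(alpha, beta, mu1, mu2, sigma):
    z1 = -1 * stats.norm.ppf(alpha)   # Z_alpha, e.g. 1.645 for alpha = 0.05
    z2 = -1 * stats.norm.ppf(beta)    # Z_beta,  e.g. 1.28  for beta  = 0.10
    n = ((z1 + z2) ** 2 * sigma ** 2) / (mu1 - mu2) ** 2
    print(n)

samplesize(0.05, 0.10, 12, 12.75, 3.2)   # prints about 155.9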
So far we have compared two populations, then compared the variances of two populations and tested whether the variances are equal or not; then we saw when to go for a Z test and when to go for a t test; after that, by considering the alpha and beta values, we derived a formula for deciding the sample size, took a small sample problem, and found the value of the sample size. Thank you.
479
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 23
ANOVA- I
Welcome, students. Today we will continue with the sample size calculation; after that we are going to see a very important topic, analysis of variance.
(Refer Slide Time: 00:37)
Today's lecture plan: sample size calculation and one-way ANOVA.
(Refer Slide Time: 00:41)
480
Determining the sample size when estimating μ: we know that the Z formula is Z = (x̄ - μ) / (σ/√n), and in this relationship n is the sample size. Rearranging, Z·σ/√n = x̄ - μ, so √n = Z·σ / (x̄ - μ). Squaring both sides gives n = Z²σ² / (x̄ - μ)².

The quantity x̄ - μ in the denominator is called the error of estimation, or the tolerable error E, so the sample size is n = Z²σ² / E². Many times the value of the population standard deviation is not known; as an approximation, one fourth of the range of the data that was collected can be taken as the standard deviation.
(Refer Slide Time: 02:16)
481
We will do an example of calculating the sample size. The permissible error is 1, the population standard deviation is 4, and we want a 90% confidence level, for which the Z value is 1.645. Substituting, you get 43.30, which is rounded up to 44.
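A quick numeric check of n = Z²σ²/E² for this example (90% confidence, σ = 4, tolerable error E = 1):

import math
from scipy import stats

z = stats.norm.ppf(1 - 0.10 / 2)        # about 1.645 for a 90% confidence level
sigma, e = 4, 1
n = (z ** 2 * sigma ** 2) / e ** 2
print(n, math.ceil(n))                  # about 43.3, rounded up to 44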
(Refer Slide Time: 02:42)
We will do another problem to find the sample size. The permissible error is 2, the range is 25, and at a 95% confidence level Z = 1.96, a value you get from the table. The estimated sigma is one fourth of the range, 6.25; substituting into Z²σ²/E², we get 38.
(Refer Slide Time: 03:08)
482
Now we will see how to find the sample size when estimating the population proportion. Here the formula for Z is different: Z = (p̂ - P) / √(PQ/n), where p̂ is the sample proportion and P is the population proportion, so the error is p̂ - P. In the same way as before, Z·√(PQ/n) = p̂ - P; square both sides to get Z²PQ/n = (p̂ - P)², so n = Z²PQ / (p̂ - P)². Calling (p̂ - P)² the squared error E², this gives n = Z²PQ / E².
(Refer Slide Time: 04:24)
483
We will do one problem here. The permissible error is given, and for a 98% confidence level the Z value is 2.33, which you get from the table. The estimated population proportion P is 0.40, so Q = 0.60. Substituting these values, n comes out to 1448. Notice that whenever you estimate a population proportion you generally ask yes-or-no type questions, and that is when you obviously have to go for a larger number of samples. Sometimes the value of P, given here as 0.4, may not be known to you.
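A quick check of n = Z²PQ/E² for this example; the table value Z = 2.33 is used as in the lecture, and the permissible error E = 0.03 is inferred from the stated answer of 1448 (the exact E on the slide is not read out):

import math

z = 2.33                 # table value for a 98% confidence level
p, q, e = 0.40, 0.60, 0.03
n = (z ** 2 * p * q) / e ** 2
print(n, math.ceil(n))   # about 1447.7, rounded up to 1448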
(Refer Slide Time: 05:08)
In that case, how do we decide the sample size? Determining the sample size when estimating P with no prior information: the plot has P on the x-axis and the sample size on the y-axis, and when P = 0.5 the maximum sample size is required. The logic to understand is that if you do not know the population proportion, assume P = 0.5: if P goes above 0.5 the required n decreases, and if P goes below 0.5 the required n also decreases. It is better to assume P = 0.5 so that you get the maximum sample size.
(Refer Slide Time: 06:02)
484
In this situation the error is given and the confidence level is 90%, so Z = 1.645; with no prior estimate of P we use P = 0.50. Substituting P = 0.50, n comes out to 271. So when the population proportion is not known, take P = 0.5.
(Refer Slide Time: 06:30)
Next we will go to the next topic, ANOVA. Why is ANOVA required? So far we have seen the two-sample Z test and the two-sample t test; ANOVA is needed whenever there is a requirement to compare more than two populations. First we dealt with one population, then we compared two populations. If you want to compare more than two populations, we go for ANOVA. Of course, you could compare pairwise: if the groups are 1, 2 and 3, you could compare 1 and 2, 2 and 3, and
485
1 and 3, three comparisons in all, but we will not go that way, and there is a reason. We could compare the means pairwise using t tests of the difference of means, but each test carries its own Type I error, so the overall error becomes 1 - (1 - α)^k, where k is the number of pairwise comparisons. For example, if there are 5 means and you use α = 0.05, then 5C2 = 10, so you must make 10 pairwise comparisons, and every comparison has a confidence level of 0.95.

If you make 10 comparisons, the overall confidence level is 0.95^10, and 1 minus the confidence level is the error: 1 - 0.95^10 is about 40%, that is, 40 percent of the time you would reject the null hypothesis of equal means in favour of the alternative. That is why we should not use two-sample t tests when we want to compare more than two populations; we should go for ANOVA.
(Refer Slide Time: 08:32)
Hypothesis testing with categorical data: we are going to work with group 1, group 2, group 3. We have seen the two-sample Z proportion test; if you want to compare the proportions of three or more populations, you should go for the chi-square test. Similarly, if you want to compare the means of more than two populations, you should go for ANOVA.
486
So the chi-square test can be viewed as a generalization of the Z test of proportions; in the same way, ANOVA can be viewed as a generalization of the t test to the comparison of differences in means across more than two groups. Like chi-square, if there are only two groups the two analyses produce identical results, so a t test and ANOVA can both be used with two groups and both will give you the same answer.
(Refer Slide Time: 09:38)
Why is this concept of ANOVA so important? Suppose in a production process there are different input variables X1, X2, ..., Xp; these are controllable inputs. There are also uncontrollable inputs Z1, Z2, Z3. The inputs may be raw materials, components and subassemblies, and the output is the quality characteristic Y of the product. This quality characteristic Y is generally affected by X1, X2, X3, so we need to find which combination of inputs provides a better Y value for measurement, evaluation, monitoring and control; for that purpose ANOVA is required. The purpose of ANOVA is to find out how the input variables affect the quality characteristic.
(Refer Slide Time: 10:42)
487
When we want to improve quality, applying quality engineering techniques means a systematic reduction of process variability, because improving quality is nothing but reducing variance. With acceptance sampling a lot of variability remains. Acceptance sampling works on a plan (n, c), say (10, 2), where n = 10 and c = 2 is the acceptance number: inspect the items and count the defective products; if the number of defectives is 2 or less, accept the whole lot, and if it is more than 2, reject the whole lot. It is a simple, almost intuitive procedure, with mathematics behind it based on the relevant distribution, but considerable variability is still tolerated.
In statistical process control, once the process has started we control the process parameters, for example with control charts. Design of experiments, however, happens before the process starts: before manufacturing, at the laboratory level, you can see which parameters affect the quality of the product, so you can control those variables and improve the quality.
488
With the help of design of experiments we can maintain a high level of quality; that is why people are interested in design of experiments. The basis for design of experiments is nothing but ANOVA, analysis of variance. That is why I am showing the connection between design of experiments and ANOVA.
(Refer Slide Time: 12:40)
I am going to explain the concept of analysis of variance with the help of an example. There are three teaching methodologies: one way of teaching is with the help of the blackboard, another is with the help of case studies, and the third is with PowerPoint presentations. Suppose I want to know which teaching methodology is more effective, or whether there is any difference: is there any influence of teaching methodology on student performance?

In total 9 students were taken, and 3 students were allotted randomly to each group. What you see here are the marks obtained by the students in the blackboard group, the case presentation group, and the group where the teacher uses only PowerPoint presentations. What is the null hypothesis here?
(Refer Slide Time: 13:43)
489
The null hypothesis is H0: μ1 = μ2 = μ3; the alternative hypothesis is that not all of μ1, μ2, μ3 are equal. Even though the technique is called analysis of variance, I am comparing means: with the help of the concept of variance I am going to compare the means of the three populations, not the variances themselves.

For group 1 the sample mean is 3, for group 2 it is 4, and for group 3 it is 2. The overall sample mean is (4 + 3 + 2 + 2 + 4 + 6 + 2 + 1 + 3) / 9 = 3. What I am going to do is find the overall variation, which I will call SST, the total sum of squares.
Why am I calling it variance? Look at the variance formula: variance = Σ(X - X̄)² / (n - 1). The numerator, Σ(X - X̄)², is what I am going to call the sum of squares. I am going to find the overall variation, SST, and split it into two categories: variation due to the treatment plus variation due to error. So SST, the total sum of squares, equals SSB (between columns, sometimes called the treatment sum of squares) plus SSE. Obviously, if the variation due to treatment,
490
that is SSB, is dominating, we can say that the teaching methodology is an influencing variable. First I am going to find the overall variation.
For that variance I am going to compute only the numerator, which I call SST. What is SST? It measures how far each element is from the overall mean. The overall mean is 3; in the first column the first element is 4, giving (4 - 3)², the second element is 3, giving (3 - 3)², the third is 2, giving (2 - 3)², and continuing this way over all nine elements the total comes to 18. This 18 is the total sum of squares. Now I will see how much of this 18 is due to the teaching methodology.
I am going to call that part SSB; some books call it SS treatment, the treatment sum of squares. What is the treatment sum of squares? The first column has three elements and a column mean of 3, so its contribution is 3 × (3 - 3)², where the leading 3 is the number of samples in the column and the 3 being subtracted is the overall mean. The second column also has 3 elements and its mean is 4, so it contributes 3 × (4 - 3)². The third column has 3 elements and mean 2, so it contributes 3 × (2 - 3)². This gives 0 + 3 + 3 = 6, so SSB = 6. Next I will find SSE, the error sum of squares, the inherent variation.
For the inherent variation, look at column 1: its mean is 3, so we take (4 - 3)² + (3 - 3)² + (2 - 3)², each element minus the mean of its own column. For the second column the mean is 4, so it contributes (2 - 4)² + (4 - 4)² + (6 - 4)². For the third column the mean is 2, so it contributes (2 - 2)² + (1 - 2)² + (3 - 2)². When you simplify, that is 12. So SST has been split into two parts: one is the variation due to the treatment, the
491
other is the error variance, or individual variance. The next step is to find the degrees of freedom for SST.
What are the degrees of freedom? There are 9 elements, so the degrees of freedom for SST is 9 - 1 = 8. For SSB there are 3 columns, that is, 3 treatments, so the degrees of freedom is 3 - 1 = 2. For SSE, the first column has three elements and therefore 3 - 1 = 2 degrees of freedom, the second column likewise 2 degrees of freedom, and the third column also 2, so in total 6 degrees of freedom.
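The same decomposition can be checked numerically; a small sketch using the three groups from the example:

import numpy as np

groups = [np.array([4, 3, 2]), np.array([2, 4, 6]), np.array([2, 1, 3])]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()                                         # 3

sst = ((all_data - grand_mean) ** 2).sum()                           # total sum of squares = 18
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)     # treatment sum of squares = 6
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)               # error sum of squares = 12
print(sst, ssb, sse)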
(Refer Slide Time: 19:32)
After that we have to find the F value. The F value is MSB divided by MSE, where MSB is the mean square between columns and MSE is the mean square error. What does "mean" mean here? If you divide SSB by its corresponding degrees of freedom you get MSB, and if you divide SSE by its corresponding degrees of freedom you get MSE. From the previous slide, SSB is 6 with 2 degrees of freedom, so MSB = 6 / 2 = 3; SSE is 12 with 6 degrees of freedom, so MSE = 12 / 6 = 2; therefore F = 3 / 2 = 1.5. Next, this is the calculated value, and you refer to the F table. In the F table, assume alpha = 5%: if
492
you look at the table it gives 5.14, but your calculated value of 1.5 lies on the acceptance side of the F distribution, so you have to accept the null hypothesis.
This is a simple intuition for how the concept of ANOVA works; in the next class we will cover more of the theory behind ANOVA. In this class we have seen how to find the sample size for hypothesis testing, and then we started the concept of ANOVA. For ANOVA I took one example and explained SST, the total sum of squares, the treatment sum of squares, and the error sum of squares.

I also explained when we should go for ANOVA, solved one problem, compared the calculated F value with the table F value, and stated the conclusion. In the next class we will solve the same problem with the help of Python, and then we will go for post hoc analysis. Post hoc analysis is needed because whenever you reject the null hypothesis, we have to say which pairs of means are different. We will see that in the next class. Thank you very much.
493
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 24
ANOVA- II
Dear students, in the previous class I explained the concept behind analysis of variance. In this class we will solve the same problem with the help of Python, because in the previous class we solved it manually. This was the problem that was given.
(Refer Slide Time: 00:47)
The problem is that there are 3 different teaching methodologies, and we have to say whether the teaching methodology influences student performance.
(Refer Slide Time: 01:01)
494
In Python I have taken the arrays a = [4, 3, 2], b = [2, 4, 6] and c = [2, 1, 3]. If you type stats.f_oneway(a, b, c) and run it, you get the calculated F value and the p-value. If alpha = 5%, the p-value is more than 5%, so we have to accept the null hypothesis; that is what our previous result showed as well.
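For completeness, the call described above in runnable form:

from scipy import stats

a = [4, 3, 2]
b = [2, 4, 6]
c = [2, 1, 3]

f_stat, p_value = stats.f_oneway(a, b, c)
print(f_stat, p_value)   # F = 1.5, p is about 0.30, more than 0.05, so H0 is accepted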
(Refer Slide Time: 01:32)
Now we will use another command, pandas.melt; we will see the purpose of this command for doing ANOVA. pd.melt allows you to unpivot data from a wide format into a long format, that is, data with each row representing a single data point.
(Refer Slide Time: 01:51)
495
For that purpose, import pandas as pd, import numpy as np, import math, from scipy import stats, import scipy, import statsmodels.api as sm, from statsmodels.formula.api import ols, and from matplotlib import pyplot as plt. First we load the data: the data given in the Excel file is saved in an object called data, so data = pd.read_excel('oneway.xlsx').

When I run this, the data appears with column 1, column 2, column 3, and 0, 1, 2 as the index.
(Refer Slide Time: 02:42)
496
Next, to run the ANOVA I need the data in the following format: T.M1 means teaching methodology 1; under teaching methodology 1 there are some numbers, under teaching methodology 2 some numbers, and under teaching methodology 3 some numbers. One column gives the treatment and the next column gives the value. Suppose I want the data in this format.
For that you have to use the following command: data_new = pd.melt(data.reset_index(), id_vars = ['index'], value_vars = ['Teachin Method1', 'Teachin Method2', 'Teachin Method3']), followed by data_new.columns = ['index', 'treatment', 'value']. This is the syntax for using the melt function.

If you run data_new, you get the new form: previously the data was in the wide 0, 1, 2 format, and now all the teaching methodology 1 observations are grouped together, then group 2, then group 3, with just two columns besides the index, one for treatment and one for value. The melt command is used for converting the data into this format.
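A self-contained sketch of this reshaping step; instead of reading oneway.xlsx, the same three columns are built in code so the snippet runs on its own (the column names follow the lecture):

import pandas as pd

# wide format: one column per teaching method (values from the example)
data = pd.DataFrame({'Teachin Method1': [4, 3, 2],
                     'Teachin Method2': [2, 4, 6],
                     'Teachin Method3': [2, 1, 3]})

# unpivot to long format: one row per observation
data_new = pd.melt(data.reset_index(), id_vars=['index'],
                   value_vars=['Teachin Method1', 'Teachin Method2', 'Teachin Method3'])
data_new.columns = ['index', 'treatment', 'value']
print(data_new)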
(Refer Slide Time: 04:22)
So model = ols('value ~ C(treatment)', data = data_new).fit(), where ols is the ordinary least squares method. Then if you write anova_table = sm.stats.anova_lm(model, typ = 1) and run it, you get the ANOVA table. It shows the
497
degrees of freedom for treatment, which is 2 because there were 3 columns, and the residual degrees of freedom, which is 6 because column 1 has 3 elements, giving 3 - 1 = 2 degrees of freedom, and similarly 2 each for columns 2 and 3, totalling 6. The treatment sum of squares is 6 and the error sum of squares is 12, so the mean squares are 6 / 2 = 3 and 12 / 6 = 2, and the F value is 3 / 2 = 1.5, with the corresponding p-value. This matches the 1.5 we computed manually; with the help of Python we get the same result.
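A self-contained sketch of the model fit; the long-format data_new is built directly here with the same nine observations, so the snippet does not depend on the Excel file:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data_new = pd.DataFrame({
    'treatment': ['Teachin Method1'] * 3 + ['Teachin Method2'] * 3 + ['Teachin Method3'] * 3,
    'value': [4, 3, 2, 2, 4, 6, 2, 1, 3],
})

# ordinary least squares with treatment as a categorical factor, then the ANOVA table
model = ols('value ~ C(treatment)', data=data_new).fit()
anova_table = sm.stats.anova_lm(model, typ=1)
print(anova_table)   # treatment: df 2, sum_sq 6; residual: df 6, sum_sq 12; F = 1.5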
(Refer Slide Time: 05:31)
Now we come to the formal definition of ANOVA, a conceptual overview. Analysis of variance can be used to test the equality of 3 or more population means. Data obtained from observational or experimental studies can be used for this analysis. We want to use the sample results to test the following hypothesis: in ANOVA the null hypothesis is H0: μ1 = μ2 = μ3, and there may be any number of groups.
(Refer Slide Time: 06:17)
498
The alternative hypothesis is that not all population means are equal. So H0: μ1 = μ2 = μ3, and Ha: not all population means are equal. If H0 is rejected, we cannot conclude that all population means are equal; rejecting H0 means that at least 2 population means have different values.
(Refer Slide Time: 06:36)
What are the assumptions for analysis of variance? For each population, the response (dependent) variable is normally distributed; in our previous example the student's performance is the dependent variable and the teaching methodology is the independent variable. The variance of the response variable, denoted σ², is the same for all populations. Why is this assumption
499
required? When you compare more than 2 groups, the basic assumption is that the variances of the groups are the same; this concept was explained when we conducted two-sample tests. And the observations must be independent.
(Refer Slide Time: 07:16)
Look at this normal distribution: it is the sampling distribution of x̄ given that the null hypothesis is true; the sample means are close together because there is only one sampling distribution when H0 is true. Now look at the picture with three normal distributions: that is the sampling distribution of x̄ given that H0 is false. What is H0 here? H0: μ1 = μ2 = μ3. If it is false, the samples do not come from the same population but from different populations, so the sample means come from different sampling distributions and are not as close together when H0 is false.
(Refer Slide Time: 08:05)
500
Analysis of variance can be classified into one-way ANOVA and two-way ANOVA; there is one more design in between, called RBD, the randomized block design, which we will see when we come to it. In one-way ANOVA we do the F test; the F test helps you decide whether the null hypothesis is accepted or rejected, and when you reject the null hypothesis, the Tukey-Kramer test helps you find which pairs of means are equal and which are not. On the other side is two-way ANOVA, where we also look at interaction effects; that we will see in coming classes. In this class we will see how to do the F test and the Tukey-Kramer test.
(Refer Slide Time: 08:49)
501
What is the general ANOVA setting? The investigator controls one or more factors of interest; in our previous example the teaching methodology is the factor. Each factor contains 2 or more levels: for instance, if pressure is a parameter, it may be high or low, so high is one level and low is another. Levels can be numerical or categorical; in our example they are categorical, but they need not be, they may also come from a continuous variable.

Different levels produce different groups, and we can think of each group as a population. We then observe the effect on the dependent variable, in other words, the effect of the treatment on the dependent variable. Next is the experimental design, the plan used to collect the data; we will see how the treatment influences the data.
(Refer Slide Time: 09:59)
502
The first method is called a completely randomized design. In our previous example the students were allocated to the 3 groups randomly; that is an example of a completely randomized design, and there is no bias. If instead you considered the students' IQ level and then allocated students to different categories of classes, that would be a biased method. So here the experimental units are assigned randomly to the different levels, and the subjects are assumed homogeneous.

With only one factor, or one independent variable, it is called one-way ANOVA: here the teaching methodology is the independent variable and the student performance, that is, the marks, is the dependent variable. The factor can have 2 or more levels. If you analyze one factor it is one-way ANOVA; if there are 2 independent variables it is two-way ANOVA.
(Refer Slide Time: 10:57)
503
So what are we doing? The basic concept is that we find the variation between the treatments, the treatment sum of squares SSB, which goes in the numerator; dividing it by its degrees of freedom gives MSB. Then SSE, the variation within the treatments, divided by its degrees of freedom gives MSE. So we find the variance between the treatments and the variance within the treatments, then we do the F test, and after that I will explain the ANOVA table.
(Refer Slide Time: 11:44)
What is the null hypothesis for the CRD, the completely randomized design? H0: μ1 = μ2 = μ3 = ... = μk, and the alternative is that not all population means are equal. Assume that a simple random sample of size nj has been selected from each of the k populations or treatments; in our
504
previous example there were 3 treatments, that is, 3 levels of the factor. For the resulting sample data, let xij be the value of observation i for treatment j, nj the number of observations for treatment j, x̄j the sample mean for treatment j, sj² the sample variance for treatment j, and sj the sample standard deviation for treatment j.
(Refer Slide Time: 12:32)
First we find the between-treatments estimate of the population variance σ². The estimate of σ² based on the variation of the sample means is called the mean square due to treatments, denoted MSTR; in our example we called it MSB, the mean square between columns. MSTR = Σ nj (x̄j - x̿)² / (k - 1), where nj is the number of elements in column j, x̄j is the mean of column j, x̿ is the overall mean, and k is the number of columns (here k - 1 = 3 - 1). The denominator is the degrees of freedom associated with the treatment sum of squares, and the numerator is the sum of squares due to treatments, SSTR; so MSTR is SSTR divided by its degrees of freedom.
(Refer Slide Time: 13:34)
505
Between-treatments estimate of the population variance σ²: in the formula for the mean square due to treatments from the previous slide, k is the number of groups (the number of columns), nj is the sample size for group j, x̄j is the sample mean for group j, and x̿, the grand mean, is the mean of all the data values, the overall mean.
(Refer Slide Time: 13:58)
Next we will see the within-treatments estimate of the population variance. The estimate of σ² based on the variation of the sample observations within each sample (this "within each sample" is the important term) is called the mean squared error, denoted MSE. The mean
506
squared error is MSE = Σ (nj - 1) sj² / (nT - k): for column j, nj is the number of elements and nj - 1 its degrees of freedom, multiplied by sj². Where does this come from? The formula for sj² is Σ(x - x̄j)² / (nj - 1), so instead of writing the numerator we can write it as (nj - 1) sj²; that is why the sum is written Σ (nj - 1) sj². The denominator nT - k is the degrees of freedom associated with the error sum of squares; I will explain nT in the next slide. Here k is the number of groups and nT is the total number of observations, so nT - 1 is the overall degrees of freedom; if you subtract from it the degrees of freedom between the columns, you get the degrees of freedom for SSE, the error sum of squares. In our previous example nT - 1 = 9 - 1 = 8, and with 3 columns the treatment degrees of freedom is 2, leaving 6 degrees of freedom for MSE.
(Refer Slide Time: 15:43)
507
Comparing the variance estimates: the F test. If the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of MSTR / MSE is an F distribution with numerator degrees of freedom k - 1, the number of columns minus 1, and denominator degrees of freedom nT - k, the total number of observations minus the number of groups. If the means of the k populations are not equal, the value of MSTR / MSE will be inflated because MSTR overestimates σ².
What does this mean? We compute F = MSTR / MSE, and there are three possibilities: it may equal 1, be less than 1, or be greater than 1. If it equals 1, the variation due to the treatment equals the variation due to individual error. If it is greater than 1, the variation due to the treatment is larger compared with the variation within. If MSE, the error due to individual differences, is larger compared with the treatment, F becomes less than 1.

So, again, if the means of the k populations are not equal, the value of MSTR / MSE will be inflated because MSTR overestimates σ². Hence we reject H0 when the F value becomes very large, that is, if the resulting value of MSTR / MSE appears too large to have been selected at random from the appropriate F distribution.
(Refer Slide Time: 17:52)
508
This is the situation: when F is a big number we obviously land in the rejection region and reject the null hypothesis. The null hypothesis was H0: μ1 = μ2 = μ3 and the alternative is that not all of the means are equal, so when we reject the null hypothesis we conclude that these means are not all equal. One more thing: this is the F distribution; it is not a normal distribution but a right-skewed distribution.
(Refer Slide Time: 18:35)
This is the ANOVA table setup. The sources of variation are the treatments and the error, so the sums of squares are the treatment sum of squares and the error sum of squares. The degrees of freedom are k - 1 for treatments and nT - k for error; in general the
509
SST degrees of freedom is nT - 1, where nT is the total number of observations, and subtracting (k - 1) from (nT - 1) gives nT - k. MSTR is SSTR divided by its corresponding degrees of freedom, the mean treatment sum of squares, and MSE is SSE divided by its corresponding degrees of freedom, the mean error sum of squares.

The F ratio is MSTR divided by MSE; the denominator should always be the error term (remember this when you go for two-way ANOVA: the denominator always contains the error term). Then we can find the corresponding p-value; this is what we did previously in the first example.
(Refer Slide Time: 19:41)
Generally, SST divided by its degrees of freedom, nT - 1, is the overall sample variance that would be obtained if you treated the entire set of observations as one data set. With the entire data set as one sample, the formula for the total sum of squares is SST = Σj=1..k Σi=1..nj (xij - x̿)².

This total sum of squares can be split into two parts: the treatment sum of squares and the error sum of squares. If the treatment sum of squares dominates, even without going for further tests we can say there is an influence of the treatment on the response variable.
(Refer Slide Time: 20:35)
510
ANOVA can be viewed as the process of partitioning the total sum of square and the degrees of
freedom into their corresponding sources that is treatment and error. Dividing the sum of square
by the appropriate degrees of freedom provides the variance estimates and the F value used to
test the hypothesis of equal population means.
(Refer Slide Time: 21:02)
What are the hypotheses? The null hypothesis, as usual, is μ1 = μ2 = μ3; the alternative hypothesis is that not all population means are equal. The test statistic is the ratio of the mean treatment sum of squares to the mean error sum of squares.
(Refer Slide Time: 21:19)
511
The p-value approach is as usual for hypothesis testing: if the p-value is less than or equal to alpha, we have to reject the null hypothesis. If you are using the critical value approach and the F value is greater than the value you get from the table, then also we reject the null hypothesis. In this class we took one problem and solved it with the help of Python, and then I explained the theoretical background behind ANOVA.

I explained the total sum of squares, the treatment sum of squares, and the error sum of squares, along with their corresponding degrees of freedom. In the next class we will take an extension of this: once we reject the null hypothesis, we have to say which means are equal and which are not; that analysis is post hoc analysis. We will continue the next lecture with the new topic of post hoc analysis in ANOVA. Thank you very much.
512
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 25
Post Hoc Analysis(Tukey’s test)
Dear students, in the previous class we saw the theoretical background behind analysis of variance and also solved a problem. Once you reject the null hypothesis of an ANOVA problem, you are accepting the alternative hypothesis; that need not mean all the means are unequal, since some pairs may be equal and other pairs not. So whenever you reject the null hypothesis, we have to say which two means differ, and for that purpose there is one more statistical analysis, post hoc analysis. One useful test is the Tukey-Kramer test, another is the HSD test. In this lecture we will see how to use post hoc analysis in ANOVA.
(Refer Slide Time: 01:22)
The lecture objective: after completing this lecture you should be able to use the Tukey test and the least significant difference (LSD) test to identify specific differences between means.
(Refer Slide Time: 01:34)
513
We will take this problem from an engineering perspective. Experimental design methods are also useful in engineering design activities where new products are developed and existing ones improved. With design of experiments, engineers can determine which subset of the process variables has the greatest influence on process performance; that is the main objective of design of experiments: finding which variables have the greatest influence on the performance of the product.
(Refer Slide Time: 02:09)
What are the benefits of doing design of experiments? It improves the process yield; it reduces the variability in the process, which leads to less rejection and closer conformance to the nominal or target requirements, so the quality of the product is improved; it
514
reduces design and development time, because the experiments are done before making the product, so the time spent on redesign is reduced; and at the same time it reduces the cost of operations, because waste is minimized.
(Refer Slide Time: 02:48)
Some terms to remember in design of experiments: a conjecture is the original hypothesis that motivates the experiment; the experiment is the set of tests performed to investigate the conjecture; the analysis is the statistical analysis of the data from the experiment; and the conclusion is what has been learned about the original conjecture from the experiment. Often the experiment will lead to a revised conjecture, a new experiment, and so forth.
(Refer Slide Time: 03:23)
515
We will solve one one-way problem in this class, and then I will explain how to use post hoc analysis. The problem is this: a manufacturer in the paper industry uses the paper for making grocery bags and wants to improve the tensile strength of the product. The product engineer thinks that tensile strength is a function of the hardwood concentration in the pulp, and that the range of hardwood concentration of practical interest is between 5% and 20%; the idea is that when you increase the hardwood concentration, the tensile strength will increase.

A team of engineers responsible for the study decides to investigate four levels of hardwood concentration: 5%, 10%, 15% and 20%. They decide to make up 6 test specimens at each concentration level using a pilot plant, and all 24 specimens are tested on a laboratory tensile tester in random order. The data from this experiment are shown in the table.
(Refer Slide Time: 04:40)
516
The rows represent the hardwood concentrations 5%, 10%, 15% and 20%; the entries are the observations, with the experiment repeated 6 times at each level, and the average value is given. Here the treatment is the percentage of hardwood concentration.
(Refer Slide Time: 05:02)
First we will plot the data with the help of a box-and-whisker plot. I take the first data set, 5%, as the array [5, 8, 15, 11, 9, 10]; the next variable, 10%, is the array [12, 7, 13, 18, 19, 15]; similarly I take all 6 values for 15% and for 20%. To draw the box plot, put the arrays for 5%, 10%, 15% and 20% into box_plot_data, then write plt.boxplot(box_plot_data) and plt.show(), and we
517
get the box-and-whisker plot. What is happening here? You see the means are not equal; there are large differences, and it appears that as the percentage of hardwood increases, the tensile strength increases.
(Refer Slide Time: 06:13)
This is the typical data layout for a single-factor experiment: the treatments are labelled 1, 2, ..., a and the observations are taken row-wise. y11 is the response for the first treatment and first sample, y12 for the first treatment and second sample, and so on up to y1n. If I write y1., that is the total of the first row; y2. is the second row total; with a treatments (a levels), ya. is the total for level a. If I write ȳ1., that is the row 1 mean; ȳ2. the second row mean; ȳa. the level a average. y.. is the grand total of all observations and ȳ.. is the grand average.
(Refer Slide Time: 07:17)
518
So the total sum of squares is the formula we saw previously; writing it this way is an easy way to solve the problem in an examination, and it saves a lot of time. SST = Σi=1..a Σj=1..n (yij - ȳ..)², where yij is the individual response and ȳ.. is the overall mean.

The treatment sum of squares is SSTreatment = n Σi=1..a (ȳi. - ȳ..)², where ȳi. is the mean of row i, ȳ.. is the overall mean, and n is the number of samples in each row. SSE = Σi Σj (yij - ȳi.)², each observation minus its own row mean, squared; this is the error sum of squares.
(Refer Slide Time: 08:25)
519
There is a shortcut formula for this, which is very useful if you are using a calculator. When you
simplify the previous slide, SST equals the sum over i = 1 to a and j = 1 to n of yij squared, minus
y.. squared divided by N, where N = a times n, a is the number of levels and n is the number of
observations at each level. That is the total sum of squares. The treatment sum of squares is
SS treatment = (1/n) times the sum over i = 1 to a of yi. squared, minus y.. squared divided by N.
Notice that the term y.. squared divided by N appears in both expressions, so once you calculate it
you can use it in both. We know that SST = SS treatment + SSE, so by subtraction you can get
SSE; this will save a lot of time in the examination. The previous formulas assume equal sample
sizes. If the sample sizes are unequal, SST is the same, the sum of yij squared minus y.. squared
divided by N, but SS treatment becomes the sum over i of yi. squared divided by ni, minus
y.. squared divided by N, because the term ni, the sample size of level i, is introduced.
(Refer Slide Time: 09:51)
520
Consider the paper tensile strength experiment described previously. We can use the analysis of
variance to test the hypothesis that the different hardwood concentrations do not affect the mean
tensile strength of the paper. So the null hypothesis is that the different hardwood concentrations
do not affect the mean tensile strength, that is, hardwood concentration and tensile strength are
independent.
So here the null hypothesis is that all the treatment effects are zero; the alternative hypothesis is
that there is an effect of treatment.
(Refer Slide Time: 10:30)
521
Taking alpha equal to 1%, the sums of squares for the analysis of variance are computed as
follows: SST is the sum of yij squared minus y.. squared divided by N. Substituting into this
formula, the total sum of squares is 512.96 and SS treatment is 382.79; subtracting gives SSE, so
SSE = 512.96 − 382.79 = 130.17.
(Refer Slide Time: 11:01)
This is the ANOVA setup. When you supply these values, we have SST and SS treatment, and
subtracting gives the error SSE. For the degrees of freedom: there are 24 observations in total, so
24 − 1 = 23; there are 4 treatment rows, so 4 − 1 = 3; and 23 − 3 = 20 for error. Dividing 382.79 by
3 degrees of freedom gives 127.6, and dividing 130.17 by 20 gives 6.51, so F = 127.6 / 6.51 ≈ 19.6.
That looks large, but this is only the calculated value; we still have to compare it with the critical
value.
522
Since F at 1% with (3, 20) degrees of freedom is 4.94, we compare our calculated value against
this critical value. You can get it from scipy.stats.f.ppf: we want the value with a right-tail area of
0.01, but Python gives the left-tail quantile, so we call scipy.stats.f.ppf(1 − 0.01, 3, 20), with
numerator degrees of freedom 3 and denominator degrees of freedom 20; the corresponding F
value is 4.94. Our calculated F value is 19.6, which is far beyond this, so we reject the null
hypothesis.
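A quick sketch of this critical-value check (the numbers follow the slide):

import scipy.stats

# Upper-tail critical value: ppf gives the left-tail quantile, so use 1 - 0.01
f_crit = scipy.stats.f.ppf(1 - 0.01, 3, 20)   # numerator df = 3, denominator df = 20
print(f_crit)                                 # about 4.94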
(Refer Slide Time: 13:43)
523
Alternatively, we can run the ANOVA directly with scipy.stats.f_oneway, passing the 5%, 10%,
15% and 20% arrays. We get an F statistic of about 19.6 and a p-value of 3.59 × 10⁻⁶, so both
ways give the same result.
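The same one-way ANOVA in a single call; this sketch reuses the four arrays assumed in the box-plot sketch above.

from scipy.stats import f_oneway

conc_5  = [7, 8, 15, 11, 9, 10]
conc_10 = [12, 17, 13, 18, 19, 15]
conc_15 = [14, 18, 19, 17, 16, 18]
conc_20 = [19, 25, 22, 23, 18, 20]

f_stat, p_value = f_oneway(conc_5, conc_10, conc_15, conc_20)
print(f_stat, p_value)   # roughly F = 19.6 and p of the order of 1e-6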
(Refer Slide Time: 14:06)
Now we will solve the same problem assuming the data set has already been entered into an Excel
file. So import pandas as pd, import numpy as np, import scipy, import statsmodels.api as sm, and
from statsmodels.formula.api import ols. Then df = pd.read_excel with the file name
concentration.xlsx. When you display df you see four columns, one for each concentration
(percentage) level.
Now we have to convert this into just two columns: one column holding the concentration level
and another holding the values of the response variable.
(Refer Slide Time: 14:56)
524
For that I have to use the melt function. I save the result in an object called data_r1:
pd.melt(df.reset_index(), id_vars = ['index'], value_vars = the concentration columns), that is, the
columns for 5%, 10%, 15% and 20% concentration. Then data_r1.columns is set to index,
treatments and value. The model is ols('value ~ C(treatments)', data = data_r1).fit(), and when I
write model.summary() I get this result.
(Refer Slide Time: 15:44)
What we get here is the regression model; now we obtain the ANOVA table with
aov_table = sm.stats.anova_lm(model, typ = 1). In this table the treatment row has 4 − 1 = 3
degrees of freedom and the error has 20 degrees of freedom. The treatment sum of squares divided
by 3 gives 127.59, and the error sum of squares 130.17 divided by 20 gives 6.51; dividing 127.59
by 6.51 gives the calculated F value, and the p-value is very, very low. So we can reject the null
hypothesis.
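Putting the whole pipeline together as a sketch. The file name concentration.xlsx comes from the lecture, while the exact column labels inside the Excel sheet (and hence the melted column names 'treatments' and 'value') are assumptions.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_excel('concentration.xlsx')     # assumed layout: one column per concentration level

# Reshape from wide to long: one label column and one response column
data_r1 = pd.melt(df.reset_index(), id_vars=['index'])
data_r1.columns = ['index', 'treatments', 'value']

model = ols('value ~ C(treatments)', data=data_r1).fit()
aov_table = sm.stats.anova_lm(model, typ=1)
print(aov_table)                             # treatment df = 3, error df = 20, F about 19.6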
(Refer Slide Time: 16:37)
When the null hypothesis is rejected in ANOVA, we know that some of the treatment or factor
level means are different. ANOVA does not identify which means are different; methods for
investigating this are called multiple comparison methods or post hoc analysis, and that is what we
will do here.
(Refer Slide Time: 16:57)
One technique for post hoc analysis is Fisher's least significant difference (LSD) method. It
compares all pairs of means with the null hypothesis H0: mu i = mu j for all i ≠ j, using the t
statistic; this is nothing but a two-sample t-test. The statistic is (yi. bar − yj. bar) divided by the
square root of MSE/n + MSE/n; since the sample sizes are equal, we write this as the square root
of 2·MSE/n.
(Refer Slide Time: 17:56)
Assuming a two-sided alternative hypothesis, the pair of means i and j would be declared
significantly different if the absolute value of the difference of their means is greater than the
LSD; then we say there is a significant difference between that pair. Bringing the previous
formula to the left-hand side, LSD = t(α/2, a(n − 1)) · √(2·MSE/n), where a is the number of levels
and n is the number of observations in each treatment; I have just rearranged that form.
(Refer Slide Time: 18:34)
527
If the sample sizes are different in each treatment, the LSD is defined as LSD = t(α/2, N − a) ·
√(MSE·(1/ni + 1/nj)). That means for each pair there will be a different LSD because the sample
sizes are different; you should be very careful about that.
(Refer Slide Time: 19:00)
We will apply Fisher's LSD method to the hardwood concentration experiment. There are four
means, n = 6 and MSE = 6.51. With alpha = 5% (so α/2 = 0.025) and 20 degrees of freedom, the t
value is 2.086. The treatment means are listed on the slide.
(Refer Slide Time: 19:24)
528
When you substitute, the LSD value is 3.07. We have to compare all the pairs: 1 and 2, 1 and 3,
1 and 4, 2 and 3, 2 and 4, 3 and 4. Any pair of treatment averages that differs by more than 3.07
implies that the corresponding pair of treatment means are different. So the next step is to take
each pair of means and find the absolute difference; if the absolute difference is greater than 3.07
we conclude that those two means are different.
We already have this ANOVA table; now we will do the LSD test. Import math, then first find the
t value: t = -1 * scipy.stats.t.ppf(0.025, 20). Why multiply by −1? Because we want the right-side
(upper-tail) value, so the t value should be positive. Then n = 6 and MSE = 6.51, which we already
know, so LSD is this formula, t multiplied by math.sqrt(2*MSE/n), and we get 3.07, the same 3.07
as before.
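The LSD computation collected in one place (values as in the slide):

import math
import scipy.stats

n = 6                                        # observations per treatment
MSE = 6.51                                   # mean square error from the ANOVA table
t_crit = -1 * scipy.stats.t.ppf(0.025, 20)   # ppf(0.025, 20) is negative, so flip the sign
LSD = t_crit * math.sqrt(2 * MSE / n)
print(LSD)                                   # about 3.07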
(Refer Slide Time: 20:49)
529
Now we take all the pairs: first 4 versus 1, 4 versus 2, 4 versus 3, then 3 versus 1, 3 versus 2 and
2 versus 1. The absolute differences are 11.17, 5.50 and so on; wherever the difference is greater
than 3.07 we conclude, for example, that mu 4 is not equal to mu 1. But for 3 versus 2 the
difference is less than 3.07, so we conclude mu 3 = mu 2; all other pairs are different.
(Refer Slide Time: 21:29)
In this problem we see that there are significant differences between all pairs of means except 2
and 3. This implies that 10% and 15% hardwood concentration produce approximately the same
tensile strength, that is, those two means are equal, and that all other concentration levels tested
produce different tensile strengths.
530
(Refer Slide Time: 11:01)
We will go for another post hoc analysis, the Tukey-Kramer test. The Tukey-Kramer test tells
which population means are significantly different; it is done after the hypothesis of equal means
has been rejected in ANOVA. It allows pairwise comparison: all the means are compared as pairs,
and each absolute mean difference is compared with a critical range. So in this test we find the
absolute difference of each pair of means and compare it with the critical range obtained from the
studentized range (Tukey) table.
(Refer Slide Time: 22:40)
So the Tukey-Kramer post hoc analysis determines whether there is any significant difference
between the means. After we reject the null hypothesis, the figure on the slide illustrates a case
where mu 1 = mu 2 but mu 3 is different; which pairs of means are equal is what we can find with
the help of the Tukey-Kramer test.
(Refer Slide Time: 23:05)
Here we have to find the critical range. The critical range is QU · √((MSW/2)·(1/nj + 1/nj′)),
where MSW is the mean square within columns (otherwise called MSE), and nj and nj′ are the
sample sizes of the two groups being compared. QU is the value from the studentized range
distribution (I will show you the table) with c and n − c degrees of freedom.
Here c is the number of columns (groups), n is the total sample size, the value is read for the
desired level of alpha, and MSW is the mean square within, which is nothing but MSE. We take a
pair of groups j and j′, compare them and find the absolute difference of their means. If that
absolute difference is greater than the critical range, we say the two means are different; if it is
within the critical range, we say the two means are the same.
(Refer Slide Time: 14:20)
532
This is the problem we solved previously; remember, for this problem we rejected the null
hypothesis.
(Refer Slide Time: 24:30)
First, having rejected the null hypothesis, we proceed with the Tukey-Kramer test. We compare
the means X1 bar and X2 bar, X1 bar and X3 bar, X2 bar and X3 bar, X1 bar and X4 bar, X2 bar
and X4 bar, X3 bar and X4 bar. The absolute difference of X1 bar and X2 bar is 5.67, the second
one is 7, the third one is 1.33, X1 bar and X4 bar is 11.17, and so on.
(Refer Slide Time: 15:04)
533
Next we find the critical range. Find the QU value from the table: here the number of treatments c
is 4 and n − c is 20 degrees of freedom; for the desired level of alpha equal to 5% the value 3.96 is
read from the table.
(Refer Slide Time: 251:22)
When you look at the table, 4 is c and 20 is your n − c, so the corresponding value when alpha =
0.05 is 3.96. The Q table gives the critical values of Q for alpha = 0.05 on top and 0.01 at the
bottom. This is the ANOVA table; it shows that MSE, the mean squared error, is 6.51 and MS
treatment is 127.6. The value 6.51 will be used in the next slide.
(Refer Slide Time: 16:00)
534
The third step is to compute the critical range: QU, which we got from the table as 3.96, times the
square root of (MSW/2)·(1/6 + 1/6), where MSW = 6.51 as shown in the previous table and the
sample sizes are equal; that value is 4.124. Then we find the difference of each pair. For example,
for x1 and x2 the absolute difference is 5.67, which is greater than 4.124, so we can say mu1 is not
equal to mu2.
But look at x2 and x3: this difference is less than 4.124, so we say mu2 = mu3. This is the same
observation we got previously: the means of population 2 and population 3 are the same.
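The same critical range can be computed in Python instead of reading the Q table; this sketch uses scipy.stats.studentized_range, which is available from SciPy 1.7 onwards.

import math
from scipy.stats import studentized_range

c, n_total = 4, 24                 # number of treatments, total observations
MSW = 6.51                         # mean square within (MSE) from the ANOVA table
n_group = 6                        # equal group sizes here

q_crit = studentized_range.ppf(1 - 0.05, c, n_total - c)                 # about 3.96
critical_range = q_crit * math.sqrt((MSW / 2) * (1 / n_group + 1 / n_group))
print(q_crit, critical_range)                                            # roughly 3.96 and 4.12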
(Refer Slide Time: 27:00)
535
Other than |X2 bar − X3 bar|, all of the absolute mean differences are greater than the critical
range. Therefore there is a significant difference between each pair of means except the 10% and
15% concentrations, at the 5% significance level. So only for these two concentrations is there no
difference in tensile strength; all other pairs are different.
(Refer Slide Time: 27:25)
This we can also solve with the help of Python. We import: from statsmodels.stats.multicomp
import pairwise_tukeyhsd (HSD stands for honestly significant difference), and from
statsmodels.stats.multicomp import MultiComparison. Then mc = MultiComparison(
data_r1['value'], data_r1['treatments']), using the long-format data we prepared earlier in this
lecture.
Then mcresult = mc.tukeyhsd(0.05) for alpha = 0.05, and mcresult.summary() gives this result.
Students, I have taken a screenshot of the output; you should enter these commands into the
Python prompt yourself and verify the result. What is happening here: for group 1 = 10% and
group 2 = 15% the reject column says false, which means the rejection is false only for this pair;
that is, the means at 10% concentration and 15% concentration are equal, and all other pairs of
means are not equal.
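A self-contained version of this Tukey HSD step; the long-format frame is rebuilt inline from the same specimen values assumed in the earlier sketches.

import pandas as pd
from statsmodels.stats.multicomp import MultiComparison

data_r1 = pd.DataFrame({
    'treatments': ['5%'] * 6 + ['10%'] * 6 + ['15%'] * 6 + ['20%'] * 6,
    'value': [7, 8, 15, 11, 9, 10,
              12, 17, 13, 18, 19, 15,
              14, 18, 19, 17, 16, 18,
              19, 25, 22, 23, 18, 20],
})

mc = MultiComparison(data_r1['value'], data_r1['treatments'])
mcresult = mc.tukeyhsd(alpha=0.05)
print(mcresult.summary())        # only the 10% vs 15% pair should show reject = False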
(Refer Slide Time: 28:48)
536
We will do another problem. The following table shows the observed tensile strength (in lb/in²) of
cloths having different weight percentages of cotton. Check whether having a different weight
percentage of cotton plays any role in tensile strength. What we are examining in this problem is
whether, as the percentage of cotton in the cloth changes, the tensile strength changes; we will see
whether there is any connection between the percentage of cotton and the tensile strength of the
cloth.
(Refer Slide Time: 29:23)
Here the weight percentages are taken in rows: 15%, 20%, 25%, 30% and 35%, with five
observations each; the row totals are given, the grand total is 376 and the grand mean is 15.04.
(Refer Slide Time: 29:39)
537
First we find the treatment sum of squares; some books call it SSB (between sum of squares) or
SSA (among sum of squares), others write SS treatment. For the treatment sum of squares: in
treatment 1 there are 5 elements, so it is 5 times (the mean of the first row minus the overall mean)
squared, that is 5 × (9.8 − 15.04)², plus 5 × (15.4 − 15.04)² for the second row, and similarly for
the means of the third, fourth and fifth rows.
That gives the treatment (among-columns) sum of squares, SS treatment = 475.76. As we did
previously, we can then find SSE: we know SST is 636.96, so SSE = 636.96 − 475.76 = 161.20.
This is the ANOVA setup: the sum of squares for cotton weight percentage and the sum of squares
for error, with their degrees of freedom. There are 5 rows, so 5 − 1 = 4 for treatments; there are 25
elements, so 25 − 1 = 24 in total, and 24 − 4 = 20 degrees of freedom for error.
Dividing 475.76 by 4 gives 118.94; dividing 161.20 by 20 gives 8.06; so the F value is
118.94 / 8.06 = 14.76.
(Refer Slide Time: 31:19)
538
With alpha equal to 5% (that is, a left-tail probability of 0.95), numerator degrees of freedom 4 and
denominator degrees of freedom 20, the critical F value is about 2.87. Our calculated F value is
14.76, which lies far to the right of this, so we have to reject the null hypothesis. After rejecting the
null hypothesis we refer to the Q table.
(Refer Slide Time: 31:59)
The Q value from the table for alpha equal to 5% is 4.23. With MSE = 8.06 (the value from the
ANOVA table), the critical range is 4.23 × √(8.06/n) = 5.37 with n = 5. You have to compare this
5.37 with each pair of treatment averages: any pair that differs in absolute value by more than 5.37
implies that the corresponding pair of population means is significantly different.
539
(Refer Slide Time: 32:37)
We compare all the pairs: y1 versus y2, y1 versus y3 and so on. Wherever the difference is more
than 5.37 we conclude, for example, that mu1 is not equal to mu2 and mu1 is not equal to mu3;
but wherever it is less than 5.37 we conclude mu1 = mu5, mu2 = mu3, mu2 = mu5 and mu3 =
mu4. That is the Tukey-Kramer test.
(Refer Slide Time: 33:22)
Then we do it with the help of Python. I import the data, calling it df3, with pd.read_excel because
I saved it in Excel format. Then I use the melt command, whose use I have shown in the previous
two classes: pd.melt(df3.reset_index(), id_vars = ['index'], value_vars = the cotton-percentage
columns for 15, 20, 25, 30 and 35), and then data.columns = ['id', 'treatments', 'value'].
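A sketch of this step for the cotton data, combined with the Tukey comparison described next. The Excel file name and the column labels below are hypothetical; only the overall shape of the calls follows the lecture.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison

df3 = pd.read_excel('cotton.xlsx')                  # hypothetical file name; one column per cotton weight %

data = pd.melt(df3.reset_index(), id_vars=['index'])
data.columns = ['id', 'treatments', 'value']

# One-way ANOVA followed by Tukey's HSD on the same long-format frame
model = ols('value ~ C(treatments)', data=data).fit()
print(sm.stats.anova_lm(model, typ=1))

mc = MultiComparison(data['value'], data['treatments'])
print(mc.tukeyhsd(alpha=0.05).summary())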
(Refer Slide Time: 34:08)
So mc = MultiComparison(data['value'], data['treatments']), then mcresult = mc.tukeyhsd(0.05),
and mcresult.summary() shows the result. Wherever the reject column is false, the corresponding
means are equal; in the other places they are not equal. So whatever result we got by hand we have
also checked with the help of Python.
Dear students, in this lecture we solved a one-way ANOVA problem and rejected the null
hypothesis. Once we reject the null hypothesis, we have to say which pairs of means are equal or
not equal, and for that we used another set of tests called post hoc analysis. There were two tests:
one is the least significant difference method and the other is the Tukey-Kramer method; we also
solved another problem with the help of Python. Thank you.
541
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 26
Randomize Block Design (RBD)
Dear students, in the previous class we saw one-way ANOVA, that is, the Completely Randomized
Design, which we call CRD. In this class we will see another technique called Randomized Block
Design.
(Refer Slide Time: 00:42)
The class objectives are to estimate the various components in an experiment involving random
factors. In ANOVA we consider certain factors and measure their effect on the variance. But
unknowingly there is a possibility that some other variable may influence our response variable;
the variance due to such unknown variables is what we are going to remove before doing the
analysis, and then we will see what the effect is. We will also understand the blocking principle
and how it is used to isolate the effect of nuisance factors. So what we are doing in a randomized
block design is isolating the effect of nuisance factors, and then designing and conducting an
experiment using a randomized block design. A completely randomized design (CRD) is useful
when the experimental units are homogeneous. If the experimental units are heterogeneous,
blocking is often used to form homogeneous groups.
(Refer Slide Time: 01:50)
Why do we have to go for RBD, the randomized block design? A problem can arise whenever
differences due to extraneous factors, ones not considered in the experiment, cause the mean
squared error term in the F ratio to become large. Due to such a nuisance factor, the value of the
mean squared error becomes very high, and in such cases the F value can become very small,
signaling no difference among treatment means when in fact such differences exist. So in the MSE
there may be some error that is due to external factors; we are going to find out how much error is
due to the external factor, remove it, and then compute the F value.
(Refer Slide Time: 02:45)
543
Experimental studies in business often involve experimental units that are highly heterogeneous;
as a result, randomized block designs are often employed. Blocking in experimental design is
similar to stratification in sampling. In stratified sampling, if the samples are heterogeneous, we
group (stratify) them based on certain criteria so that each stratum contains homogeneous samples.
(Refer Slide Time: 03:21)
Here also, it is similar to stratified sampling. Its purpose is to control some of the external sources
of variation by removing such variation from the MSE (mean square error) term. This design tends
to provide a better estimate of the true error variance and leads to more powerful hypothesis tests
in terms of the ability to detect differences among treatment means.
544
(Refer Slide Time: 03:47)
We will take one example, the air traffic controller stress test. An air traffic controller has to
schedule various aircraft, deciding what time each has to land and take off; he is the person who
allocates the different slots for landing and takeoff, so this is a very stressful job. We will see one
problem on this. A study measuring the stress of air traffic controllers resulted in a proposal for
modification and redesign of the controller's workstation.
They are planning to redesign the workstation because the workstation may affect the stress level;
if the workstation is very cramped, people get stressed more. After consideration of several
designs for the workstation, three specific alternatives were selected as having the best potential
for reducing controller stress.
The key question is: to what extent do the three alternatives differ in terms of their effect on
controller stress? So we are going to see to what extent the different workstation designs affect the
stress of the air traffic controller.
(Refer Slide Time: 05:16)
545
In a completely randomized design, a random sample of controllers would be assigned to each
workstation alternative; that is what we would generally do. However, controllers are believed to
differ substantially in their ability to handle stressful situations, so the sample is not homogeneous:
different controllers are affected differently by the workstation designs. What is high stress to one
controller might be only moderate or even low stress to another.
Hence, when considering the within-group source of variation, the MSE (mean square error), we
must realize that this variation includes both random error and error due to individual controller
differences. In fact, the managers expected controller variability to be a major contribution to the
MSE term.
(Refer Slide Time: 06:26)
546
This is the setup. There are three workstation designs, which we call system A, system B and
system C. Look at controller 1: when he works on system A, the stress level measured is 15. Each
controller is observed on each of the three workstation designs, and the stress is measured with a
questionnaire, so 15 is a score; the higher the score, the higher the stress.
We solve this example using ANOVA in Python. We have to import pandas as pd, import numpy
as np, import scipy, import statsmodels.api as sm, and from statsmodels.formula.api import ols.
The data in the table from the previous slide I have typed into Excel and imported; the file name is
RBD.xlsx, and I save it into a data frame. When I show the output, the columns are system A,
system B and system C.
So value is the dependent variable: the formula is value tilde C(treatments), with data equal to that
data frame, then .fit(). Then anova_table = sm.stats.anova_lm(model, typ = 1); lm stands for linear
model. Remember this typ = 1, because whenever there is a two-way ANOVA you have to use
typ = 2. Looking at the ANOVA table, the mean square error is about 3.27 and the p-value is more
than 0.05.
So we are accepting the null hypothesis. What is the meaning of accepting the null hypothesis
here? The null hypothesis is that the average stress level is the same for the three workstation
designs: H0: mu A = mu B = mu C, where these are the mean stress levels for workstation designs
A, B and C, and the alternative H1 is that they are not all equal. So at present, without blocking,
what I am concluding is that there is no connection between workstation design and the average
level of stress.
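A sketch of this one-way (no blocking yet) analysis. RBD.xlsx is the file named in the lecture, while the column labels after the melt are an assumption.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_excel('RBD.xlsx')                      # assumed columns: System A, System B, System C

# Long format: the row index identifies the controller, the melted column gives the system
data = pd.melt(df.reset_index(), id_vars=['index'])
data.columns = ['blocks', 'treatments', 'value']

model = ols('value ~ C(treatments)', data=data).fit()   # blocking ignored for now
print(sm.stats.anova_lm(model, typ=1))                  # treatment p-value above 0.05 here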
(Refer Slide Time: 10:02)
548
Next, I am going to do blocking. Going back: the error sum of squares is 49, and 49 divided by 15
gives a mean square error of about 3.27. From this error I am going to remove the variance that is
due to the blocking effect, and then conduct the test again and see what happens. This was the
given data slide.
(Refer Slide Time: 10:33)
The treatment means: there are three treatments, which I call system A, system B and system C.
So x.1 bar is 13.5, x.2 bar is 13.0 and x.3 bar is 15.5.
549
(Refer Slide Time: 10:52)
This is our ANOVA setup. Compare it with the CRD (completely randomized design, or one-way
ANOVA): there, there was no blocking row, only treatment and error. Now we introduce blocking,
so what are the degrees of freedom? Total: the total number of elements minus 1. Treatments:
there are k treatments, so k − 1. Blocks: there are b blocks, so b − 1. The error degrees of freedom
is nT − 1 − (k − 1) − (b − 1), which works out to (k − 1)(b − 1).
MS treatment is the treatment sum of squares divided by k − 1. The mean square for blocking is
SSBL divided by b − 1; we will not actually use that value, we will use only MS treatment divided
by MSE, where MSE = SSE / ((k − 1)(b − 1)). The blocking portion is the part we subtract out.
(Refer Slide Time: 12:03)
550
So xij is the value of the observation corresponding to the treatment j in the block i. x.j bar is the
sample mean of jth treatment xi. bar sample mean of ith block, x double bar is overall sample
mean.
(Refer Slide Time: 12:22)
What is step 1? First we find SST, the total sum of squares: the sum over i = 1 to b and j = 1 to k of
(xij minus the overall mean) squared, each individual element minus the overall mean, squared. In
that way we get SST = 70. Then compute the sum of squares due to treatments: b is the number of
blocks (replications), so SSTR = 6 × [(13.5 − 14)² + (13.0 − 14)² + (15.5 − 14)²] = 21, where 13.5,
13.0 and 15.5 are the treatment (column) means and 14 is the overall mean.
551
(Refer Slide Time: 13:17)
The third step is to compute the sum of squares due to blocks. Each block contains k treatments,
so SSBL = k times the sum of (xi. bar minus x double bar) squared, the row (block) means minus
the overall mean, squared. The block means are 16, 14, 12, 14, 15 and 13, so
SSBL = 3 × [(16 − 14)² + (14 − 14)² + (12 − 14)² + (14 − 14)² + (15 − 14)² + (13 − 14)²] = 30.
This much variance is due to blocking. To compute the sum of squares due to the error term, we
know that from SST we have to subtract the treatment sum of squares and the blocking sum of
squares: SSE = 70 − 21 − 30 = 19. This 19 is the true SSE, because the SSBL amount, which is due
to the extraneous (nuisance) variable, has been removed.
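A quick numerical check of these sums of squares from the quoted means (grand mean 14, SST = 70 given on the slide):

treatment_means = [13.5, 13.0, 15.5]      # systems A, B, C
block_means = [16, 14, 12, 14, 15, 13]    # the six controllers
b, k = 6, 3                               # number of blocks, number of treatments

SSTR = b * sum((m - 14) ** 2 for m in treatment_means)   # 6 * 3.5 = 21
SSBL = k * sum((m - 14) ** 2 for m in block_means)       # 3 * 10 = 30
SSE = 70 - SSTR - SSBL                                   # 70 - 21 - 30 = 19
print(SSTR, SSBL, SSE)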
(Refer Slide Time: 14:47)
552
Having found SSE, the values work out as shown: MS treatment = 21/2 = 10.5 and MSE = 19/10 =
1.9 (the block mean square is computed but not used for the test). So F = 10.5 / 1.9 ≈ 5.53 and the
p-value is 0.024; with alpha equal to 5% we have to reject the null hypothesis. Previously, without
blocking, we accepted the null hypothesis; going back, you can see we accepted it without
blocking.
After blocking, our decision has completely changed: we have rejected the null hypothesis.
(Refer Slide Time: 15:34)
553
We will do it with the help of Python: import pandas as pd, import numpy as np, import scipy,
import statsmodels.api as sm, from statsmodels.formula.api import ols. ols is ordinary least
squares, because regression and ANOVA are like two sides of the same coin. The sequence of
learning is first ANOVA and then regression, because there is a close relationship between them;
two lectures after this one we will go to regression analysis. So I have imported these.
(Refer Slide Time: 16:13)
There are three columns, so I use the melt command to bring all the values into long format; the
resulting columns are blocks, treatments and value. Then model = ols('value ~ C(blocks) +
C(treatments)', data = data).fit(); you can see that now there is a blocking term, which was not
there previously.
Then sm.stats.anova_lm(model, typ = 1) gives the ANOVA table. The p-value for the treatments is
0.024, which is less than 0.05, so I reject the null hypothesis.
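The blocked model written out as a self-contained sketch (same assumed file and column names as the earlier stress-test sketch):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_excel('RBD.xlsx')
data = pd.melt(df.reset_index(), id_vars=['index'])
data.columns = ['blocks', 'treatments', 'value']

# Adding C(blocks) removes the controller-to-controller variation from the error term
model = ols('value ~ C(blocks) + C(treatments)', data=data).fit()
print(sm.stats.anova_lm(model, typ=1))       # treatment p-value about 0.024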
(Refer Slide Time: 17:07)
554
What are we concluding? Finally, note that the ANOVA table provides an F test for the treatment
effect but not for the blocks. The reason is that the experiment was designed to test a single factor,
workstation design. The blocking, based on individual stress differences, was done to remove such
variation from the MSE term; the study was not designed to test specifically for individual
differences in stress.
So what blocking does is exactly this: the error due to blocks is removed while finding the
influence of workstation design on stress level.
(Refer Slide Time: 17:51)
555
We will do one more problem using the randomized block design. An experiment was performed
to determine the effect of four different chemicals on the strength of a fabric. These chemicals are
used as part of a permanent-press finishing process. Five fabric samples were selected, and a
randomized complete block design was run by testing each chemical type once, in random order,
on each fabric sample.
The data is shown in the table on the next slide. We will test for a difference using ANOVA with
alpha equal to 1%.
(Refer Slide Time: 18:29)
This is the table: there are different chemical types and different fabric samples. The number of
replications is five, because each chemical type is tested once on each of the five fabric samples.
The row means (row averages) are also given.
(Refer Slide Time: 18:49)
556
What do we do? I have typed this data into the Excel file RBD2.xlsx. Using the melt command I
bring it into two variables: one is value, the response variable, and the other is treatment; that is the
purpose of pd.melt. Then I run model = ols with value as the dependent variable and treatment as
the independent variable, with data equal to the frame obtained after the melt command, then .fit(),
and anova_table = sm.stats.anova_lm(model, typ = 1).
What is happening? The mean square error is about 0.48. Here we did not do the blocking; we will
do the blocking next. Even so, we are rejecting the null hypothesis, because the p-value is less than
0.01.
(Refer Slide Time: 20:00)
557
Now we find SST and SS treatment; this shortcut formula is convenient with a calculator: the sum
of yij squared minus y.. squared divided by ab, using the notation I explained earlier, where a is
the number of treatments and b is the number of blocks. So SST = 25.69 is the total sum of squares
and the treatment sum of squares is 18.04.
(Refer Slide Time: 20:27)
SS blocks is (1/a) times the sum of y.j squared, minus y.. squared divided by ab, which gives 6.69.
For the error term, from SST we subtract the sum of squares due to treatments and the sum of
squares due to blocking, so SSE = 25.69 − 18.04 − 6.69 = 0.96. This is the true error, with the
blocking effect removed.
558
(Refer Slide Time: 20:54)
So what is happening? The mean square error here is 0.08. Go back: what was the mean square
error without blocking? It was about 0.48; now it is 0.08, because we have removed the error due
to blocking. So the F value, when you compare it, is significantly higher, 75.13, which gives a
much stronger case for rejection. Previously the F value was 12.58; now it is 75.13, so you can
certainly say that you reject the null hypothesis.
(Refer Slide Time: 21:50)
The ANOVA is summarised in the previous table. Since F = 75.13, which is greater than the table
value of 5.95 obtained from the F table, and the p-value is very low, we conclude that there is a
significant difference among the chemical types as far as their effect on strength is concerned.
(Refer Slide Time: 22:11)
Previously we did it the traditional way; now we will use Python for the blocking, that is, the
randomized block design: import pandas as pd, import statsmodels.api as sm, from
statsmodels.formula.api import ols, from statsmodels.stats.anova import anova_lm. Then save the
file into df with df = pd.read_excel(); displaying df gives this output.
(Refer Slide Time: 22:41)
Again we use the melt command; after the melt the data takes this long format. The fabric samples
0, 1, 2, 3, 4 repeat as one group per chemical: the first group is chemical 1, the next is chemical 2,
then chemical 3 and chemical 4; that is the purpose of the pd.melt command.
Now there are three columns: the fabric sample, the chemical type and the value. Value is the
dependent variable, chemical type is the treatment (independent variable), and fabric sample is the
blocking variable.
(Refer Slide Time: 23:28)
Now the model is ols with the formula value tilde C(fabric) + C(chemical), where fabric is the
blocking effect and chemical is the treatment effect, with data equal to data, then .fit(). When you
run this, we get the same F value that we got with the traditional manual method, along with the
corresponding p-value. So what have we done? We took one problem and solved it without
blocking; in this problem we were already rejecting. Then we did the blocking.
After blocking we are still rejecting, but when you look at the value of F it has increased
significantly. Without blocking you may conclude wrongly: you may accept the null hypothesis
because the error term is bigger. After blocking, the error term becomes much smaller and you
may reverse the decision and reject the null hypothesis. That is the application of blocking.
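A sketch of the fabric-strength model with blocking. RBD2.xlsx is the file named earlier; the column names 'fabric' and 'chemical' after reshaping are assumptions.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_excel('RBD2.xlsx')                     # assumed layout: one column per chemical type

data = pd.melt(df.reset_index(), id_vars=['index'])
data.columns = ['fabric', 'chemical', 'value']      # fabric sample is the block, chemical is the treatment

model = ols('value ~ C(fabric) + C(chemical)', data=data).fit()
print(sm.stats.anova_lm(model, typ=1))              # F for chemical about 75, p-value very small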
Dear students, to summarize this lecture: we have seen what a randomized block design is and
when we go for one. We took a problem, solved it without blocking and saw the result, then solved
the same problem with blocking and saw how the result changed. We also used Python code both
without blocking and with blocking and compared the results. In all, we took two problems and
solved both with and without blocking. In the next class we will go to another type of ANOVA,
the two-way ANOVA. Thank you very much.
562
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 27
2 Way ANOVA
Dear students in the previous class we have seen randomized block design. In this class we will
go to the next topic that is factorial experiments or 2 way ANOVA.
(Refer Slide Time: 00:37)
The learning objectives are to design and conduct engineering experiments involving several
factors using the factorial design approach, to understand how ANOVA is used to analyze the data
from these experiments, and to know how to use two-level factorial designs.
(Refer Slide Time: 00:58)
563
Let us see what a factorial experiment is. A factorial experiment is an experimental design that
allows simultaneous conclusions about two or more factors. In the previous two problems we did
not see the simultaneous effect of two variables at a time: in the completely randomized design
(CRD) we took only one independent variable, and in the randomized block design we took one
independent variable and one blocking variable, with no interaction.
In this lecture we will see that when there are two independent variables there is a possibility of
interaction, and we are going to study the interaction effect as well. The effect of a factor is
defined as the change in response produced by a change in the level of the factor; it is called a
main effect because it refers to a primary factor in the study. For example, factor A may have
'a' levels (say low and high, that is, two levels) and factor B may have 'b' levels (say low, medium
and high, that is, three levels). The experiment will involve collecting data on all a·b treatment
combinations. Factorial experiments are the only way to discover interaction between variables.
(Refer Slide Time: 02:27)
564
Look at this diagram. On the left side, factor A has two levels, low and high, and the two lines for
factor B (B low and B high) are parallel; parallel lines mean there is no interaction. On the right
side, when factor A goes from the low level to the high level the lines cross: where we would
expect the response for B high to remain above B low, it comes out on the lower side. Whenever
the lines cross like this, there is an interaction effect.
(Refer Slide Time: 03:15)
565
The simplest type of factorial experiment involves only two factors, say A and B. There are
'a' levels of factor A and 'b' levels of factor B; this two-factor factorial is shown in the next table.
The experiment has n replicates, and each replicate contains all a·b treatment combinations.
(Refer Slide Time: 03:38)
Look at this: factor A has 'a' levels and factor B has 'b' levels, so there are a·b cells of
observations. The observation in the ij-th cell for the k-th replicate is denoted yijk. In performing
the experiment, the a·b·n observations would be run in random order; thus, like the single-factor
experiment, the two-factor factorial is a completely randomized design.
(Refer Slide Time: 04:10)
566
We will take an example and use it to see how to do the two-way ANOVA. As an illustration of a
two-factor factorial experiment, we will consider a study involving the Common Admission Test
(CAT); for example, to get admission to an MBA program you have to take the CAT. It is a
standardized test used by graduate schools of business to evaluate an applicant's ability to pursue a
graduate program in the field.
In India the result is usually expressed as a percentile, but assume here that scores on the CAT
range from 200 to 800 in absolute terms; higher scores imply higher aptitude.
(Refer Slide Time: 05:04)
In an attempt to improve student performance on the CAT, a major university is considering
offering the following three CAT preparation programs: the first is a 3-hour review, the second is
a one-day program and the third is an intensive 10-week course. The 3-hour review session covers
the types of questions generally asked in the CAT; the one-day program covers relevant exam
material along with the taking and grading of a sample exam; and the intensive 10-week course
involves identifying each student's weaknesses and setting up an individualized program for
improvement.
567
(Refer Slide Time: 05:04)
One factor in this study is the CAT preparation program, which has three treatments: the 3-hour
review, the one-day program and the 10-week course. Before selecting a preparation program to
adopt, further study will be conducted to determine how the proposed programs affect CAT
scores. So there are three ways of preparing for the CAT examination, and we are going to see
how these learning methods affect performance on the CAT, that is, the CAT score.
(Refer Slide Time: 06:34)
Factor 2 also has three treatments. The CAT is usually taken by students from three types of
colleges: those with an undergraduate business school, those with an engineering background, and
colleges of arts and sciences. Therefore a second factor of interest in the experiment is whether the
student's undergraduate college affects the CAT score. For example, at present many IITs allow
even arts and science students to be admitted into the MBA program, so they too take the CAT
examination.
Continuing the problem: the second factor of interest is whether the student's undergraduate
college affects the CAT score. This second factor, undergraduate college, also has three
treatments: a student may come from a business studies background, an engineering background,
or an arts and sciences background. So what we are going to learn is whether their undergraduate
college (their background) affects their performance on the CAT score.
Sometimes the engineering students may do better in the CAT examination, sometimes students
from a business (BBA) background may do better, and sometimes there is a perception that arts
and science students, who may not have come across many quantitative subjects in their
undergraduate study, may not do well; that is what we will see in this problem.
(Refer Slide Time: 08:21)
569
The data is laid out in table format: the rows are the preparation programs and the columns are the
colleges. The table represents the 9 treatment combinations for the two-factor CAT experiment:
one factor is the preparation program and the other is the college the student belongs to. A student
may be from a business background and take the 3-hour review, from an engineering background
and take the 3-hour review, from an arts and science background and take the 3-hour review, and
so on; there are 9 possible combinations.
(Refer Slide Time: 09:04)
What is the replication? In experimental design terminology, the sample size of 2 for each
treatment combination indicates that we have two replicates. For example, for students from a
business background who took the 3-hour review, the two values shown (such as 580) are the
scores of two students. Similarly, two students with an engineering background took the 3-hour
review, and so on; in total there are 18 observations.
(Refer Slide Time: 09:55)
The analysis of variance computations answer the following questions, which is the important part
of this lecture. Main effect of factor A: do the preparation programs differ in terms of their effect
on the CAT score? Main effect of factor B: do the undergraduate colleges differ in terms of their
effect on the CAT score, that is, does the undergraduate background affect performance on the
CAT?
Then the interaction effect of factors A and B: do students from some colleges do better with one
type of preparation program, whereas others do better with a different type of preparation
program? That is the interaction we are going to examine.
(Refer Slide Time: 10:54)
571
The term interaction refers to a new effect that we can now study because we used a factorial
experiment. If the interaction effect has a significant impact on the CAT score, we can conclude
that the effect of the type of preparation program depends on the undergraduate college; that is the
learning: if the interaction is significant, the effect of the preparation program depends on the
undergraduate college background.
(Refer Slide Time: 11:29)
This is the two-factor factorial experiment setup: SSA is the sum of squares for factor A, SSB the
sum of squares for factor B, SSAB the sum of squares for the interaction, and SSE the sum of
squares for error. The degrees of freedom are: nT − 1 for the total, a − 1 for factor A (with 'a'
levels), b − 1 for factor B, and (a − 1)(b − 1) for the interaction.
Then we have the mean square for factor A, the mean square for factor B, the mean square for the
interaction, and the mean square error. We are going to test the effect of factor A, the effect of
factor B and the effect of the interaction. For the effect of factor A the F ratio is MSA divided by
MSE; for factor B it is MSB divided by MSE; for the interaction it is MSAB divided by MSE.
Note that the denominator is always the error mean square; many students make mistakes there, so
remember that the denominator should always be the error term.
(Refer Slide Time: 12:49)
573
The ANOVA procedure for the two-factor factorial experiment requires us to partition the total
sum of squares into four parts: the sum of squares due to factor A, the sum of squares due to factor
B, the sum of squares due to the interaction, and the sum of squares due to error. The partition is
SST = SSA + SSB + SSAB + SSE.
(Refer Slide Time: 13:48)
The notation: xijk is the observation corresponding to the k-th replicate taken from treatment i of
factor A and treatment j of factor B; xi. bar is the sample mean of the observations in treatment i of
factor A; x.j bar is the sample mean of the observations in treatment j of factor B; xij bar is the
sample mean of the observations in the combination of treatment i of factor A and treatment j of
factor B; and x double bar is the overall sample mean of all nT observations.
(Refer Slide Time: 14:33)
The first step is to find the mean of each cell individually. Here x11 bar is 540, x12 bar is 500 and
x13 bar is 500, with a row total of 2960; x21 bar is 500, x22 bar is 590 and x23 bar is 550; and
x31 bar = 580, x32 bar = 590, x33 bar = 445. The overall sum is 9270 and the overall mean is 515.
(Refer Slide Time: 15:12)
575
Now the factor A (row) means: x1. bar = 493.33, x2. bar = 513.33, x3. bar = 538.33. The factor B
(column) means: x.1 bar = 540, x.2 bar = 560, x.3 bar = 445.
(Refer Slide Time: 15:45)
The first step is to find the total sum of squares: SST is the sum over all observations of (xijk
minus the overall mean) squared, which comes to 82,450. Step 2 is to compute the sum of squares
for factor A: SSA = b·r times the sum over i of (xi. bar minus the overall mean) squared, where
xi. bar is the row mean.
So SSA = 3 × 2 × [(493.33 − 515)² + (513.33 − 515)² + (538.33 − 515)²], with b = 3 levels of
factor B and r = 2 replicates, which gives 6,100.
(Refer Slide Time: 16:56)
576
Now compute the sum of squares for factor B. Whenever you write SSB, the factor a·r comes in
front: SSB = a·r times the sum over j = 1 to b of (x.j bar minus x double bar) squared, where
x.j bar is the column mean. Here a = 3 and r = 2 replications, and going back to the earlier slide the
column means are 540, 560 and 445, so SSB = 6 × [(540 − 515)² + (560 − 515)² + (445 − 515)²] =
45,300.
Next we compute the sum of squares for the interaction, SSAB: here there are two summations,
i = 1 to a and j = 1 to b, of r times (xij bar minus the row mean minus the column mean plus the
overall mean) squared, where xij bar is the cell mean. For example, for the first cell the cell mean
is 540, the row mean is 493.33, the column mean is 540 and the overall mean is 515. Continuing
this for all cells, SSAB = 11,200.
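A quick check of the factor sums of squares from the quoted row and column means (grand mean 515):

row_means = [493.33, 513.33, 538.33]   # factor A: preparation programs
col_means = [540, 560, 445]            # factor B: colleges
b, r = 3, 2                            # levels of factor B, replicates per cell (a = 3 levels of factor A)

SSA = b * r * sum((m - 515) ** 2 for m in row_means)   # about 6,100
SSB = 3 * r * sum((m - 515) ** 2 for m in col_means)   # about 45,300
print(SSA, SSB)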
(Refer Slide Time: 18:48)
577
Then we find SSE. SSE is obtained by subtracting from SST the sums of squares due to factor A,
factor B and the interaction AB, so SSE = 19,850.
(Refer Slide Time: 19:05)
I have filled in the values: for factor A the sum of squares is 6,100, for factor B 45,300, for the
interaction 11,200 and for error 19,850, so the total sum of squares is 82,450. That 82,450 is split
into four parts: due to factor A, due to factor B, due to the interaction, and due to error. For the
degrees of freedom: there are 18 observations, so 18 − 1 = 17 in total; factor A has 3 treatments, so
3 − 1 = 2; factor B also has 3 treatments, so 3 − 1 = 2; and the interaction has (number of levels of
factor A minus 1) times (number of levels of factor B minus 1) = 2 × 2 = 4. The remaining 9
degrees of freedom go to error.
The mean squares: 6,100 divided by 2 is 3,050; 45,300 divided by 2 is 22,650; 11,200 divided by 4
is 2,800; and 19,850 divided by 9 is about 2,206. The F values: 3,050 / 2,206 = 1.38; 22,650 /
2,206 = 10.27; 2,800 / 2,206 = 1.27; each with its corresponding p-value. So what happens here:
for factor A we accept the null hypothesis, which means there is no effect of factor A; for the
interaction we also accept the null hypothesis, so there is no interaction effect; but there is an effect
of factor B, because its p-value is less than 0.05.
(Refer Slide Time: 21:46)
The data given there I have entered into an Excel file, so I read it with df2 = pd.read_excel().
(Refer Slide Time: 22:02)
579
When I display df2 the data is in this format: the values are in the first column (18 values, indexed
from 0). The preparation program column shows the first six rows for the 3-hour review, then the
one-day program, then the 10-week course, and within each program the college column cycles
through business background, engineering background and arts and science background. So for
each preparation program we have students from a business background, an engineering
background and an arts and science background.
(Refer Slide Time: 22:55)
580
Here the formula is 'value ~ C(college) + C(preparation program) + C(college):C(preparation
program)'; the colon term represents the interaction. Then model = ols(formula, df2).fit() (you can
write the formula inline or specify it separately), and anova_table = anova_lm(model, typ = 2);
when you write typ = 2 it is for the two-way ANOVA, whereas for the one-way ANOVA we wrote
typ = 1.
When you print the ANOVA table we get this result, and the meaning is the same as what we did
manually: we accept the null hypothesis for the preparation program, we accept the null hypothesis
for the interaction, and we reject the null hypothesis for the college. So there is no interaction,
there is no effect of the preparation program, but there is an effect of the college background the
student belongs to.
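A self-contained sketch of this two-way model. The Excel file name and the exact column names (written without spaces so they can be used directly in the formula) are assumptions.

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df2 = pd.read_excel('cat_scores.xlsx')    # hypothetical file; assumed columns: value, college, program

# Main effects plus the interaction term (the colon); typ=2 for the two-way table
formula = 'value ~ C(college) + C(program) + C(college):C(program)'
model = ols(formula, data=df2).fit()
print(anova_lm(model, typ=2))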
Students may belong to an engineering background, a business background or an arts and science
background, and what we are concluding is that the college background affects their performance
on the CAT score: it may be that those from engineering perform better, or that those from arts
and science do not perform as well. So the conclusion is that the college they belong to is an
important variable for their performance in the CAT examination. Looking at the two-way
ANOVA table, the preparation program is not a significant variable, the interaction between
college and preparation program is also not a significant factor, but the college they belong to is a
significant factor. There are three possibilities for the college background, arts and science,
business, and engineering, and that factor affects their performance on the CAT score.
Dear students, in this class we have studied what a two-way ANOVA is. We took one problem and solved the two-way ANOVA by hand, I explained the theoretical background behind it, and then we solved the same problem with the help of Python and interpreted the result. In the next class we will go to another topic, regression analysis, because analysis of variance and regression analysis are like two sides of the same coin.
Even an ANOVA can be solved with the help of regression analysis, and a regression problem can be solved with the help of ANOVA. So in the next class I will meet you with another new topic called regression techniques. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 28
Linear Regression - I
Dear students, in this class we will go to the new topic called regression analysis. The objectives of this lecture are:
(Refer Slide Time: 00:34)
We will study the simple linear regression model — "simple" means only one independent variable is considered — and then the least square method, which is the principle behind the regression model. Then we will see the coefficient of determination: the goodness of a regression model is generally explained with the help of this coefficient of determination, called R square, which we will see in detail later.
We will also cover the model assumptions, then tests for significance — hypothesis testing can also be done with the help of regression analysis — and finally how to use the estimated regression equation for estimation and prediction.
(Refer Slide Time: 01:19)
Many problems in engineering and science involve exploring the relationship between two or more variables. So far we have worked with a single variable: we have taken a sample and, with its help, predicted a population parameter, sometimes comparing means and sometimes comparing variances. In this lecture we are going to take two different variables.
Regression analysis is a statistical technique that is very useful for these types of problems, where cause and effect has to be measured. The model can also be used for process optimisation, such as finding the level of temperature that maximizes yield, or for process control purposes. When there are many independent variables, we can also ask which independent variable most affects our dependent variable.
(Refer Slide Time: 02:12)
We will see one example. A table is given with an X variable called hydrocarbon level and a Y variable called purity. In this table Y is the purity of oxygen produced in a chemical distillation process, and X is the percentage of hydrocarbons present in the distillation unit. Now we are going to see what the influence of X is on the purity of oxygen.
(Refer Slide Time: 02:43)
So I entered the data in Excel and saved it under the file name reg2.xlsx. To import the data I use pd.read_excel with the path specified. Then X is the hydrocarbon-level column of the data, and Y is the purity column, which is my dependent variable. Using plt.figure and sns.regplot(x, y, fit_reg=True), followed by plt.scatter(np.mean(X), np.mean(Y), color='green') to mark the mean point in green, I get a scatter plot of hydrocarbon level against oxygen purity.
What is happening is that whenever the hydrocarbon level increases, the oxygen purity also increases; there is a positive relationship. If you want to quantify the magnitude of how X influences Y, then you should go for the regression equation, which we will do in the coming slides.
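A minimal sketch of that plotting step, assuming column labels 'hydrocarbon level' and 'purity' in reg2.xlsx (the labels are assumptions; only the file name comes from the lecture):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_excel("reg2.xlsx")        # use the path where the file is stored
X = data["hydrocarbon level"]            # assumed column label
Y = data["purity"]                       # assumed column label

plt.figure()
sns.regplot(x=X, y=Y, fit_reg=True)      # scatter plot with a fitted line
plt.scatter(np.mean(X), np.mean(Y), color="green")  # mark the mean point
plt.xlabel("hydrocarbon level")
plt.ylabel("oxygen purity")
plt.show()
```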
(Refer Slide Time: 03:58)
Now for the theory behind the simple linear regression model. The equation that describes how Y is related to X and an error term is called the regression model. The simple linear regression model is y = β0 + β1 x + ε, where β0 and β1 are called the parameters of the model and ε is a random variable called the error term. What we are going to do is estimate the value of Y with the help of the independent variable X.
Because X by itself is not enough to predict Y — there may be unknown variables other than X — the error due to those unknown variables, otherwise called the unexplained variance, is what we call the error term.
(Refer Slide Time: 04:52)
The simple linear regression equation is E(Y) = β0 + β1 X. Comparing with the previous slide, there is no error term here, because in calculating β1 we have taken care that the error is minimized; also, in the previous slide we wrote Y, whereas now it is the expected value of Y. So what we are predicting is the mean value of Y, not the actual value of Y. The graph of the regression equation is a straight line, because the power of x is 1. β0 is the Y intercept of the regression line and β1 is the slope. E(Y) is the expected value of Y for a given X value, and the expected value is nothing but the mean value.
(Refer Slide Time: 05:41)
This is an example of a positive linear relationship, with X on the x-axis: as the value of X increases, the expected value of Y also increases, so the slope β1 is positive. The intercept, this distance here, is β0.
(Refer Slide Time: 06:04)
This is an example of a negative linear relationship: as X increases, the expected value of Y decreases, so here the slope is negative.
(Refer Slide Time: 06:16)
Here there is no relationship, because the line is parallel to the x-axis, so the slope is 0. The meaning is that irrespective of the value of X, the expected value of Y is the same; here X and Y are independent.
(Refer Slide Time: 06:35)
The estimated simple linear regression equation is ŷ = b0 + b1 x. Generally, if I write capital Y = β0 + β1 X with capital letters and Greek parameters, that is for the population; what we write in small letters is for the sample. So ŷ is the estimated regression line, b0 is the y intercept of the line, b1 is the slope of the line, and ŷ is the estimated value of y for a given x value.
(Refer Slide Time: 07:09)
The principle behind the least square method is that the sum of the squares of the errors has to be minimized. Suppose I have some x values and y values and I have plotted the points, with x on the x-axis and y on the y-axis. My objective is to draw a line; ideally that line would pass through all the given points, but that is not possible.
So what I am going to do is draw a line so that the error is minimized. For example, this is e1, this is e2, this is e3, and so on; each error is the actual value minus the predicted value. The line is ŷ = b0 + b1 x, and the vertical distance from each point to the line is its error.
So what I have to do is square each error and minimize the sum of those squares. I could draw different lines — this way, or that way — and for each line find the sum of the squared errors; whichever line has the minimum sum of squared errors is the best line. That principle is called the least square method.
Why do we square? There is a logic behind this: if you do not square, the positive and negative errors cancel and the sum becomes 0. It is the same logic as in the formula for variance, Ʃ(X − X̄)² / (n − 1): we square because otherwise Ʃ(X − X̄) = 0, since the positive and negative deviations cancel. The square transformation has one more implication: suppose the deviation is small, say 0.5; when you square it, it becomes 0.25. Suppose the deviation is 5; squared, it is 25. What is happening? A smaller deviation gets a smaller penalty and a larger deviation gets a larger penalty. That is the beauty of the square transformation.
(Refer Slide Time: 10:19)
In the estimation process, what happens initially is that we assume a regression model Y = β0 + β1 X + ε. From that regression model we get the regression equation E(Y) = β0 + β1 X — notice that the regression equation has no error term. The unknown parameters here are β0 and β1. We have to estimate the value of β0 and β1, and more importantly β1, because what we are going to test is whether β1 is 0 or different from 0; if β1 = 0, there is no relation between X and Y.
So this is our population equation. From it, what I am going to do is collect a sample for my X variable and Y variable, where X is the independent variable and Y is the dependent variable. With the help of the sample data we build a regression equation that applies only to the sample: ŷ = b0 + b1 x, where b0 and b1 are sample statistics.
So this equation is valid only for the sample. Now I am going to test whether the estimated values of b0 and b1 are valid at the population level also. Sometimes you can construct a regression equation from a sample and estimate b0 and b1, but they may not be significant at the population level.
In that case we conclude that β1 = 0, and if β1 = 0 there is no relation between X and Y; we will see this in the coming slides.
(Refer Slide Time: 12:42)
How is regression different from the concepts we studied previously? Earlier, with the help of x̄ we predicted the population mean, with the sample variance we predicted the population variance, and with the sample proportion we predicted the population proportion. In regression the picture is a smaller circle, the sample, inside a bigger circle, the population.
I have some x and y values from the sample, and with their help I build the regression equation ŷ = b0 + b1 x. Now I am going to check whether this relationship is valid for the population as well, that is, whether Y = β0 + β1 X holds. In other words, using the sample model I am going to test whether this relationship is valid for the population or not.
Sometimes you can construct a regression equation from sample data and say there is a relation between x and y, but when you go to the population level there may be no relation between x and y. So what is the difference between regression modelling and our previous hypothesis testing? In hypothesis testing we tested only one parameter at a time — a mean, a variance, or a population proportion.
Now I have constructed a model with the help of sample data and I am testing that model at the population level, checking two or three parameters simultaneously: β1 is one parameter, β0 is another, and in multiple regression there are different independent variables with their own parameters. This is the logic of regression modelling.
So what do we do with the regression equation built from the sample data? What is required is to find the value of b1, the slope, and b0, the y intercept. Suppose the line looks like this and there are points (x1, y1), (x2, y2), and so on up to (xn, yn) on the x–y plane. For each point there is an error — e1, e2, up to en. So first we find the error for each value, then we square the errors, then we sum them, and then we ask: for what values of b1 and b0 is this sum of squared errors minimized? That is what we do next. Here I am going to write the line as y = mx + b, the traditional notation from school; you can use any notation. The error term is actual minus predicted: for the first point the actual value is y1 and the predicted value is the y on the line.
(Refer Slide Time: 16:57)
Similarly, −2b is the same in every term, so it multiplies y1 + y2 + ... + yn; m² is a constant, so it multiplies x1² + x2² + ... + xn²; 2mb is a constant, so it multiplies x1 + x2 + ... + xn; and b² appears n times, giving n·b². That is how the expansion is grouped.
(Refer Slide Time: 19:52)
So, grouping all the terms of the expanded squared error: the squared terms give y1² + y2² + ... + yn²; −2m is a constant, so it multiplies x1 y1 + x2 y2 + ... + xn yn; −2b is a constant, so it multiplies y1 + y2 + ... + yn; m² multiplies x1² + x2² + ... + xn²; 2mb multiplies x1 + x2 + ... + xn; and the last term is b² added n times, that is n·b².
Now I want to write each sum in terms of its average. The mean of the squared y values is (y1² + y2² + ... + yn²) / n, which I write as y² bar, so the first group becomes n·(y² bar). Similarly, x1 y1 + x2 y2 + ... + xn yn can be written as n·(xy bar), so the second group becomes −2mn·(xy bar); the third, −2b times the sum of the y values, becomes −2bn·ȳ; the m² term becomes m²n·(x² bar); the 2mb term becomes 2mbn·x̄; and the last term is n·b². So the simplified sum of squared errors is
E = n·(y² bar) − 2mn·(xy bar) − 2bn·ȳ + m²n·(x² bar) + 2mbn·x̄ + n·b².
(Refer Slide Time: 23:00)
So we use the maxima–minima principle. In general, for a maximum or minimum we set dy/dx = 0; you might have studied in school that if d²y/dx² is negative the point is a maximum, and if it is positive the point is a minimum. Here we want a minimum, so we set the derivative to zero.
In our squared-error expression the variables are m, the slope, and b, the y intercept, so because there are two variables we partially differentiate with respect to each. Differentiating with respect to m: the first term has no m, so it gives 0; the second gives −2n·(xy bar); the third has no m, so 0; the m² term gives 2mn·(x² bar); the 2mb term gives 2bn·x̄; and the b² term gives 0. Setting this equal to zero and dividing through by the constant 2n, we are left with
−(xy bar) + m·(x² bar) + b·x̄ = 0.
I am going to write this in y = mx + b form: moving terms and dividing by x̄ gives
m·(x² bar / x̄) + b = xy bar / x̄.
This is of the form m·x + b = y, so it says the point with x coordinate x² bar / x̄ and y coordinate xy bar / x̄ lies on the line. In other words, if you want to draw the best line — the one that minimizes the sum of squared errors — it has to pass through this point.
(Refer Slide Time: 26:27)
Then we find the other equation, because to know the slope we need two points and we already have one. Now we partially differentiate with respect to b: the first term has no b, so 0; the second also 0; the third gives −2n·ȳ; the m² term has no b, so 0; the 2mb term gives 2mn·x̄; and the b² term gives 2nb. Setting this to zero and dividing by the constant 2n gives
−ȳ + m·x̄ + b = 0, that is, ȳ = m·x̄ + b.
This is again in y = mx + b form, and it says the line passes through the point (x̄, ȳ). This is a very important result: if you want to draw the best line, it has to pass through the average value of x and the average value of y. So we have two points: (x̄, ȳ) and (x² bar / x̄, xy bar / x̄).
Dear students, having obtained these two points from the least square principle, we can find the slope. The slope of a line through two points (x1, y1) and (x2, y2) is (y2 − y1) / (x2 − x1). Purely for convenience, take point 1 as (x² bar / x̄, xy bar / x̄) and point 2 as (x̄, ȳ). Then
m = (ȳ − xy bar / x̄) / (x̄ − x² bar / x̄).
Multiplying numerator and denominator by x̄ gives (x̄·ȳ − xy bar) / (x̄² − x² bar), and multiplying both by −1 gives
m = (xy bar − x̄·ȳ) / (x² bar − x̄²).
This is the formula for the slope. In fact, if you look at the numerator, it is nothing but the covariance of (x, y), and the denominator is the variance of x; I will explain how the numerator is the covariance.
We know from our probability class that the covariance of x and y is Cov(x, y) = E[(x − x̄)(y − ȳ)]. Expanding, this is E[xy − x·ȳ − x̄·y + x̄·ȳ]. When you bring E inside, E[xy] becomes xy bar; ȳ is a number, so E[x·ȳ] = ȳ·E[x] = x̄·ȳ; similarly E[x̄·y] = x̄·ȳ; and E[x̄·ȳ] = x̄·ȳ, a constant. So Cov(x, y) = xy bar − x̄·ȳ − x̄·ȳ + x̄·ȳ; after cancelling, the remainder is xy bar − x̄·ȳ. Going back, you see that xy bar − x̄·ȳ is exactly the numerator of the slope formula, so the numerator is nothing but the covariance of (x, y).
The variance of x, which we also studied in school, is Var(x) = E[(x − x̄)²]. Expanding gives x² − 2x̄·x + x̄²; bringing E inside, this becomes x² bar − 2x̄² + x̄² = x² bar − x̄². When we go back to the slope formula, the denominator matches this, so the slope has the covariance in the numerator and the variance of x in the denominator.
Actually the variance, the covariance, the correlation coefficient, and the regression slope are all related. The sample variance formula is Ʃ(x − x̄)² / (n − 1); that involves one variable. The covariance involves two variables: Ʃ(x − x̄)² can be written as Ʃ(x − x̄)(x − x̄), and if you replace one (x − x̄) by (y − ȳ) you get the covariance, Ʃ(x − x̄)(y − ȳ) / (n − 1).
The correlation coefficient is the covariance divided by the product of the two standard deviations, while the slope is the ratio of the covariance to the variance of x. The covariance is Ʃ(x − x̄)(y − ȳ) / (n − 1) and the variance of x is Ʃ(x − x̄)² / (n − 1); the (n − 1) terms cancel, so the formula for the slope is
b1 = Ʃ(x − x̄)(y − ȳ) / Ʃ(x − x̄)².
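To see this relationship numerically, here is a small sketch with made-up x and y values, checking that covariance divided by variance matches the slope from numpy's least-squares line fit:

```python
import numpy as np

# Hypothetical data, only to illustrate the identity slope = Cov(x, y) / Var(x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # xy bar - x bar * y bar
var_x = np.mean(x ** 2) - np.mean(x) ** 2           # x^2 bar - (x bar)^2
slope = cov_xy / var_x

# Cross-check with numpy's built-in least-squares line fit (degree 1).
slope_np, intercept_np = np.polyfit(x, y, 1)
print(slope, slope_np)   # the two slopes agree up to floating-point error
```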
(Refer Slide Time: 33:14)
That is why we get this formula for b1: an easy way to remember it is that the slope is nothing but the covariance divided by the variance. In this class we started on regression analysis. I explained the importance of regression and how regression differs from our traditional hypothesis testing. In the regression equation there is a y intercept and a slope; using the least square method I derived the formulas for finding the slope and the y intercept.
The formula for the slope is nothing but the covariance divided by the variance, and I have also linked how the variance, covariance, correlation coefficient, and regression coefficient are interrelated. The advantage is that you need not memorize the formula: if you know the variance formula and the covariance formula, you can easily find the slope of the regression equation. In the next class we will continue with an example, and I will explain how to use the regression equation for prediction. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 29
Linear Regression - II
Dear students, in the previous class I derived the formula for the slope of a simple linear regression equation and its y intercept, and I explained the concept of the least square method. In the slope formula, the slope is nothing but the covariance of the two variables — one independent and one dependent in simple linear regression — divided by the variance of the independent variable.
We also obtained another important result: if you want to draw the best line, that line has to pass through the point given by the average of the x values and the average of the y values, that is, (x̄, ȳ). Now let us see what we are going to do in this class.
(Refer Slide Time: 01:22)
I have taken a small problem, and with its help I am going to find the slope and the y intercept and then explain their practical meaning. Writing the line as ŷ = b0 + b1 x, the slope b1 is nothing but the covariance divided by the variance. We know that the formula for the covariance is Ʃ(x − x̄)(y − ȳ) / (n − 1) and the variance is Ʃ(x − x̄)² / (n − 1). Because the (n − 1) in the numerator and the denominator is the same, when you cancel it the remaining formula is
b1 = Ʃ(x − x̄)(y − ȳ) / Ʃ(x − x̄)².
(Refer Slide Time: 02:22)
If you are using a calculator, for examination purposes this is a very useful notation convention: Sxx = Ʃ(x − x̄)², Syy = Ʃ(y − ȳ)², and Sxy = Ʃ(x − x̄)(y − ȳ).
(Refer Slide Time: 03:01)
The formula for the slope is then simply b1 = Sxy / Sxx. If I want the error sum of squares — I will explain its meaning shortly — the formula is SSE = Syy − (Sxy)² / Sxx. Why is this so useful? If you know these three quantities, Sxx, Syy, and Sxy, you can find the slope and the error sum of squares, and later this is also convenient for finding the coefficient of determination, which I will explain in the next lecture.
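Here is a small sketch of that calculator shortcut, using hypothetical x and y values, that computes Sxx, Syy, Sxy, the slope, and SSE = Syy − Sxy²/Sxx:

```python
import numpy as np

def s_values(x, y):
    """Return the shortcut quantities Sxx, Syy, Sxy."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    return sxx, syy, sxy

# Hypothetical data, just for illustration.
x = [1, 2, 3, 4, 5]
y = [3, 5, 4, 6, 8]

sxx, syy, sxy = s_values(x, y)
slope = sxy / sxx              # b1 = Sxy / Sxx
sse = syy - sxy ** 2 / sxx     # error sum of squares
print(slope, sse)
```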
(Refer Slide Time: 03:52)
So what is simple linear regression? Suppose we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn).
(Refer Slide Time: 04:07)
Previously we saw the formula for the slope. The formula for the y intercept is b0 = ȳ − b1·x̄. Here xi is the value of the independent variable for the i-th observation, yi is the value of the dependent variable for the i-th observation, x̄ is the mean of the independent variable, ȳ is the mean of the dependent variable, and n is the total number of observations used to compute the means.
(Refer Slide Time: 04:46)
In simple linear regression, this is your x axis and this is your y axis. The plotted points are the observed, actual values, and this line shows the predicted values; the line is generally written as b0 + b1 x. The final objective in a regression equation is to find the slope of this line and its y intercept, because if you know the slope and the y intercept you can construct the regression equation.
As explained in the previous class, this is error 1, this is error 2, this is error 3, and the concept of the least square method is that the sum of the squares of these errors has to be minimized. That idea is what gives us the values of the slope and the y intercept.
(Refer Slide Time: 05:57)
We will take one example of simple linear regression — it is called simple because there is only one independent variable; with more than one independent variable we call it multiple linear regression. This small problem will show how to construct a regression equation and how to use the formulas for the slope and the y intercept. An auto company runs a periodic special week-long sale. As part of the advertising campaign, the company runs one or more television commercials during the weekend preceding the sale.
Data from a sample of 5 previous sales are shown in the next slide. Before introducing a new product, the company goes for television advertisement. The question in this problem is: is there any effect of television advertisement on the sales of cars?
(Refer Slide Time: 06:56)
The data are: 1 TV ad, 14 cars sold; 3 TV ads, 24 cars sold; 2 TV ads, 18 cars sold; 1 TV ad, 17 cars sold; 3 TV ads, 27 cars sold. Here the dependent variable y is the number of cars sold and the independent variable x is the number of TV ads.
So we have to find the effect of the number of ads on the number of cars sold. The general perception is that when the frequency of ads is higher, there will be more sales.
(Refer Slide Time: 07:51)
The first task is to find the slope of the estimated regression equation. First we find x̄ and ȳ; then for each value of x we compute x − x̄ and for each value of y we compute y − ȳ; then we multiply these and sum the products, which gives 20. Then we square each x − x̄ and sum, which gives 4. So the slope is b1 = 20 / 4 = 5. The y intercept of the estimated regression equation is b0 = ȳ − b1·x̄, and we already know b1.
Taking b1 = 5, ȳ = 20, and x̄ = 2, we get b0 = 20 − 5 × 2 = 10, so the estimated regression equation is ŷ = 10 + 5x. Be careful: this is the estimated regression equation, so ŷ is an estimated value, not the actual value — it is the mean value. How do we interpret it? Keeping other things constant, when x increases by 1 unit the sales increase by 5 units (5 is the increment per unit of x, not the predicted total, which at x = 1 would be 10 + 5 × 1 = 15). We are comparing the rate of increase of x with the rate of increase of y.
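You can verify these hand calculations with a few lines of Python using the five (TV ads, cars sold) pairs from the slide; this is just a numerical check of b1 = 20/4 = 5 and b0 = 20 − 5 × 2 = 10:

```python
import numpy as np

x = np.array([1, 3, 2, 1, 3])        # number of TV ads
y = np.array([14, 24, 18, 17, 27])   # number of cars sold

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)   # 5.0 and 10.0, matching the hand calculation
```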
(Refer Slide Time: 09:35)
So this shows the line y = 5x + 10. Here 10 is the y intercept — when you extend the line, this is where it cuts the y axis — and 5 is the slope. We have plotted values only up to about 4, but suppose the number of TV advertisements is 5: substituting gives 25 + 10 = 35. This line is the trend line, the regression line.
(Refer Slide Time: 10:08)
We will do this with the help of Python: import numpy as np, import matplotlib.pyplot as plt, import seaborn as sns, import pandas as pd, import matplotlib as mpl, import statsmodels.formula.api as sm, from sklearn.linear_model import LinearRegression — sklearn is the library for running linear regression — and from scipy import stats. First I entered the values in Excel and saved the file; the data object is called tbl, and tbl = pd.read_excel with the path where I stored my Excel file.
(Refer Slide Time: 11:16)
The first task is to plot the scatter plot. A scatter plot gives a rough idea of what happens to y as x increases (we could also compute the correlation). What do we see here? As the number of TV ads increases, the number of cars sold also increases. The code is tbl.plot('TV ads', 'cars sold', style='o'), plt.ylabel('cars sold'), plt.title('sales in UK regions'), plt.show(). That will show this graph.
(Refer Slide Time: 12:01)
Next I store the TV ads in a variable: t = the TV ads column and c = the cars sold column. Cars sold is the dependent variable and TV ads is the independent variable. Then import statsmodels.api as sm, and t = sm.add_constant(t), because we need the intercept term. Then model1 = sm.OLS(c, t) — OLS means ordinary least squares — where c is the dependent variable and t is the independent variable; result1 = model1.fit(), and print(result1.summary()).
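Put together, the statsmodels steps just described look roughly like this sketch; the file name tv_ads.xlsx and the column labels 'TV ads' and 'cars sold' are assumptions standing in for the lecture's Excel file:

```python
import pandas as pd
import statsmodels.api as sm

tbl = pd.read_excel("tv_ads.xlsx")    # hypothetical file name; use your own path
t = tbl["TV ads"]                     # independent variable (assumed column label)
c = tbl["cars sold"]                  # dependent variable (assumed column label)

t = sm.add_constant(t)                # adds the intercept column
model1 = sm.OLS(c, t)                 # ordinary least squares
result1 = model1.fit()
print(result1.summary())              # coefficients, R-squared, F statistic, p-values
```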
This is the output of the linear regression. The most important part is the coefficients: the equation reads y = 10 + 5 × (TV ads), since the constant is 10 and the coefficient of TV ads is 5. There are many other terms here: the model is OLS, the method is least squares, the date it was run, the number of observations is 5, and the residuals — a residual is nothing but the error. The R-squared is 0.877; I will explain the meaning of 0.877 in the next class.
Then there is the adjusted R-squared, whose value is interpreted for multiple regression equations, and the F statistic, which I will also explain in the next class, along with several other fit indices. There is the standard error and the t value — I will explain the t value — and one more thing: look at the probability value. If alpha is 5 percent and the probability is less than 0.05, you can say the coefficient is significant; I will explain this too in the next class. So this is the output of our regression equation. We will take another problem on regression analysis.
(Refer Slide Time: 14:33)
The problem: the data in the file hardness.xlsx provide measurements of the hardness and tensile strength for 35 specimens of die-cast aluminium. It is believed that hardness (measured in Rockwell E units) can be used to predict the tensile strength (measured in thousands of pounds per square inch). So the tasks are: construct a scatter plot; assuming a linear relationship, use the least square method to find the regression coefficients b0 and b1; interpret the meaning of the slope b1 in this problem; and predict the mean tensile strength for die-cast aluminium with a hardness of 30 Rockwell E units. Here the tensile strength and hardness are given for this data set, and we are going to construct a regression equation. So we will switch to Python and I will show you how to do that.
(Refer Slide Time: 15:27)
For this: import pandas as pd, import numpy as np, from sklearn import linear_model, import statsmodels.api as sm, and from sklearn.metrics import mean_squared_error. First I load the data into an object called data: I run these imports and read the Excel file with pd.read_excel, storing it in the object called data.
In the data you can see the tensile strength and hardness columns. What are we going to do? There are 35 records, and we are going to split them into two parts, one for training and the other for testing. For that purpose, from sklearn.model_selection import train_test_split. Then x is the hardness column's values, reshaped — the reshape command converts the one-dimensional array into a two-dimensional array — and y is the tensile strength column.
So x is the independent variable and y is the dependent variable. The next step is x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2). What is the meaning of this 20%? 20% of the data will be kept for testing our model and the remaining 80% will be used for building the regression model. The random_state can be given any number, so that when you repeat the program you get the same split.
This matters because the 20% of the data is chosen randomly out of the 35 records. If you use the shape command you can see the result: we get (28, 1), so 28 records are for training and 7 records are for testing; you can check the lengths as well. Now we construct the regression model: from sklearn.linear_model import LinearRegression, then regression = LinearRegression(), and we fit the model on the training data.
Now we look at the fitted values: the y intercept is 7.045 and the regression coefficient is about 1.9974. Next we predict for the test data set — this gives the predicted y values when the test x values are given as input. Then we compute the mean squared error, which is about 35; the smaller the error, the better the model. Next, the score of this regression model on the test data (x_test as independent variable, y_test as dependent variable) is 0.53, and on the training data it is 0.45; I will explain the meaning of this score next.
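A runnable sketch of that train/test workflow, assuming hardness.xlsx has columns labelled 'hardness' and 'tensile strength' and choosing an arbitrary random_state (both are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data = pd.read_excel("hardness.xlsx")
x = data["hardness"].values.reshape(-1, 1)   # 2-D array expected by sklearn
y = data["tensile strength"].values

# Hold out 20% of the 35 specimens for testing; fix random_state for repeatability.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

reg = LinearRegression().fit(x_train, y_train)
print(reg.intercept_, reg.coef_)             # y intercept and slope

y_pred = reg.predict(x_test)
print(mean_squared_error(y_test, y_pred))    # smaller is better
print(reg.score(x_test, y_test))             # R squared on the test data
print(reg.score(x_train, y_train))           # R squared on the training data
```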
(Refer Slide Time: 19:21)
The meaning of the score is nothing but R squared, and R squared is the coefficient of determination. The coefficient of determination is SSR divided by SST, where SSR is the regression sum of squares and SST is the total sum of squares. In our problem, reg.score(x_test, y_test) gives an R squared of 53%, meaning that 53% of the variability of y in the test data is explained with the help of the independent variable.
For the training data set it is 0.45, that is, 45% of the variability of y is explained with the help of the independent variable x. The mean squared error is nothing but SSE divided by (n − 2); the smaller this value, the better the model fits; otherwise it is not a good fit.
Now we will see another concept related to regression: in machine learning, regression is one category of supervised learning. Machine learning techniques are classified into two categories, supervised learning methods and unsupervised learning methods.
Regression is an example of supervised learning because, in advance, we label which variable is independent and which is dependent. In unsupervised learning we cannot label in advance which variable is dependent and which is independent. So what we have been doing is supervised learning: in the context of machine learning we call it supervised learning, and in statistics we call it simple regression.
Dear students, in this class we used the formulas for the y intercept and slope derived in the previous class. I took one sample problem, explained how to use those formulas, and also solved it with the help of Python. I then took another problem in which the data set is divided into two parts: one part, the training data set, is used for building the model, and the other part is the test data set.
The test data set was used for validating the model we constructed. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 30
Linear Regression – III
Dear students, in the previous class I took a sample example and explained how to construct a regression equation.
(Refer Slide Time: 00:37)
In this class I will explain the meaning of the coefficient of determination, how to test statistical hypotheses, and how to construct confidence intervals for the regression model parameters; for example, the coefficient of x, b1, is one such parameter.
(Refer Slide Time: 01:01)
Now I will explain the goodness of fit, that is, the coefficient of determination R squared. Suppose there is no independent variable; then the easiest prediction for y is simply the mean of y. What does "no independent variable" mean? Suppose I have the demand for the first eleven months and I want to predict the demand in the twelfth month; the easiest way is to take the mean of the previous data and use that as the prediction, without considering any independent variable.
So without any independent variable, for an actual point y the prediction is ȳ, and the distance from the actual value to ȳ is the total error, actual minus predicted. Now suppose I know one independent variable that I assume affects my dependent variable; then the regression line is ŷ = b0 + b1 x. What happens now is that part of the total error — the distance from ȳ up to the regression line — is explained with the help of the independent variable x, and the remaining part — the distance from the regression line to the actual point — is the unexplained error. And this is only one point; there are many y values at different places, not a single point on a line.
If we add up the squared total errors over all points we get the total sum of squares, SST. Then SST = SSR + SSE, where SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the error sum of squares. The logic is this: the total error from ȳ to the actual point is split into the part we can explain with the help of the independent variable x — that is SSR — and the remaining, unexplained part — that is SSE.
When you look at this, there is a connection with ANOVA: in ANOVA we wrote the same thing, SST = SS treatment + SSE, and the treatment there plays the role of the independent variable here. Now the formulas: SST = Ʃ(y − ȳ)², the explained part is SSR = Ʃ(ŷ − ȳ)², and the unexplained error is SSE = Ʃ(y − ŷ)². So the total sum of squares equals the regression sum of squares plus the error sum of squares.
From this we get the coefficient of determination, R squared, which is the explained error divided by the total error: R² = SSR / SST. SSR is the error we are able to explain, because it is due to the independent variable x, and SST is the total sum of squares.
R squared cannot be more than 1. If it equals 1, the numerator equals the denominator, which means the line passes through every point; if it is less than 1, SSE is positive and SSR is smaller than SST. So the upper limit of R squared is 1, the lower limit is 0, and the interval for R squared is 0 to 1.
(Refer Slide Time: 07:32)
Here is the relationship among SST, SSR, and SSE: SST = Ʃ(y − ȳ)², SSR = Ʃ(ŷ − ȳ)², and SSE = Ʃ(y − ŷ)². Remember that previously I showed Syy; that Syy is nothing but Ʃ(y − ȳ)². This is a very handy shortcut if you are using a calculator: you can quickly get SST, SSR, and SSE, and from those you can easily find R squared, which is SSR divided by SST.
(Refer Slide Time: 07:58)
R squared = SSR / SST, where SSR is the sum of squares due to regression and SST is the total sum of squares.
(Refer Slide Time: 08:07)
As I was saying previously, what is the meaning of this R squared? In our problem we get about 0.877, and I will show how it comes: the coefficient of determination formula is R² = SSR / SST; in our problem SSR is 100 and SST is 114, and I will show how we get SSR = 100.
(Refer Slide Time: 08:42)
What is the formula for SSR? As we saw, SSR = Ʃ(ŷ − ȳ)². First you find the regression equation; substituting the first value of x gives ŷ1, and you compute (ŷ1 − ȳ)²; substituting the second value of x gives ŷ2, and you compute (ŷ2 − ȳ)²; and so on. When you sum these, that is your SSR.
SST is the numerator of the variance of y, Ʃ(y − ȳ)², and from that you can find SST. In our problem R² is 100 divided by 114, about 0.877. What is the meaning of this R squared? We know it lies between 0 and 1, and here the regression relationship is very strong: 87.7%, roughly 88%, of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
The remaining 12% or so we are not able to explain, and that may be due to two reasons: one, we may have missed some other independent variable that affects the car sales; two, we have fitted a linear regression but the actual data may follow a non-linear relationship. That is why we do not get exactly 1.
In the Python output you can see that the R squared is 0.877; the meaning is that 87.7% of the variability in cars sold can be explained with the help of the number of TV ads, our independent variable.
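A short sketch of that decomposition for the TV-ads data, computing SST, SSR, SSE, and R squared directly from the fitted line ŷ = 10 + 5x:

```python
import numpy as np

x = np.array([1, 3, 2, 1, 3])          # TV ads
y = np.array([14, 24, 18, 17, 27])     # cars sold
y_hat = 10 + 5 * x                     # fitted values from the estimated equation

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)         # error (unexplained) sum of squares

r_squared = ssr / sst
print(sst, ssr, sse, r_squared)        # 114, 100, 14, about 0.877
```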
(Refer Slide Time: 10:56)
From R squared we can find r, the correlation coefficient. The sample correlation coefficient is r_xy = (sign of b1) × √R²; in our problem the sign of b1 is positive, so r is the positive square root of the coefficient of determination, where b1 is the slope of the regression equation.
(Refer Slide Time: 11:23)
In our problem the equation is y = 10 + 5x, so the sign is +, and the square root of 0.877, about 0.94, is the correlation coefficient. Remember that the range of the correlation coefficient is −1 to +1, while the range of R squared is 0 to 1. For the correlation coefficient, −1 is a perfect negative correlation, +1 is a perfect positive correlation, and 0 means no correlation. For R squared, 1 means a perfect model — all the variability of y can be explained with the help of the independent variable — and 0 means there is no relationship between x and y.
(Refer Slide Time: 12:13)
Another important point is the assumptions about the error term ε. To judge the goodness of the model, R squared alone is not enough; you also have to plot the error term and look at its behaviour, where the error is nothing but the actual minus the predicted value. The assumptions are: first, the error ε is a random variable with mean 0, so the errors should appear random, with the positive errors balancing the negative errors so that their sum is 0.
Second, the variance of ε, denoted σ², is the same for all values of the independent variable; this is the concept called homoscedasticity. The meaning is that the spread of the errors should be the same across the values of the independent variables — only then is the comparison meaningful.
Third, the values of ε are independent: there should not be any pattern in the error term. Sometimes when you plot the errors there is an increasing trend or a decreasing trend; that kind of pattern is not allowed — the errors have to be distributed randomly. Fourth, the error ε is a normally distributed random variable. Now, testing for significance.
(Refer Slide Time: 14:03)
As I told you at the beginning of the class, whatever regression equation and goodness of fit we have tested so far applies only to the sample data: the sample equation is ŷ = b0 + b1 x, while the population relationship is Y = β0 + β1 X. Now we are going to see whether this model is valid at the population level, and for that purpose we set up a hypothesis.
To test for a significant regression relationship we must conduct a hypothesis test to determine whether the value of β1 is 0. If β1 is 0, there is no relation between x and y at the population level. There may be a relation between x and y in the sample data, but that does not guarantee a relation between x and y at the population level. The testing can be done by two methods, a t-test and an F test. Both the t-test and the F test require an estimate of s², the variance of the error ε in the regression model; s, its square root, is the standard error.
(Refer Slide Time: 15:22)
What is the estimate of the standard error? In an ordinary data set, if there are two data sets and one has a smaller variance than the other, the first one is more homogeneous. In the same way, if there are two regression models, the model with the smaller standard error is the more suitable model. The mean square error provides the estimate of σ², and the notation s² is used: s² = MSE.
How do we get MSE? MSE = SSE / (n − 2). Why n − 2 degrees of freedom? The logic is n − 1 − k, where k is the number of independent variables; here we have only one independent variable, and we already know that the total degrees of freedom is n − 1, so n − 1 − k = n − 2. SSE can be found from the formula I mentioned at the beginning of the class, Ʃ(yi − ŷi)²; substituting ŷi = b0 + b1 xi and bringing the minus sign inside gives Ʃ(yi − b0 − b1 xi)².
(Refer Slide Time: 16:51)
The term s is the square root of s², and the result is called the standard error of the estimate. So when you take the square root of the mean squared error you get the standard error. In our problem, using the shortcut method, SSE = Syy − (Sxy)² / Sxx and MSE = SSE / (n − 2); take the square root of that to get the standard error of the estimate.
(Refer Slide Time: 17:37)
Now we go for hypothesis testing. The null hypothesis is β1 = 0, which means there is no relation between x and y; the alternate hypothesis is β1 ≠ 0. The test statistic is t = (b1 − β1) / S_b1, where S_b1 is the standard error of the coefficient b1. Since under the null hypothesis β1 = 0, the statistic is simply b1 / S_b1.
(Refer Slide Time: 18:06)
What is the meaning of β1 = 0? It means there is no relation between x and y: when you plot data of that kind, the null hypothesis is accepted. When β1 is not equal to 0 — as you can see for this kind of data set — there is some relation between x and y; in this case the null hypothesis is rejected, so we say β1 ≠ 0.
(Refer Slide Time: 18:30)
This is very important: how to find the standard error of the coefficient of x, that is, of b1. S_b1 = Se / √(Ʃ(x − x̄)²), where Se is the standard error of the estimate. Intuitively, the overall error Se is scaled by how spread out the independent variable x is, and that gives S_b1.
(Refer Slide Time: 19:04)
What is the rejection rule? Reject H0 if the p-value is less than or equal to alpha — we have seen this many times — otherwise accept H0. Here t_{α/2} is used because it is a two-tailed test, since we are testing β1 = 0 against β1 ≠ 0, and when you look at the t table you use n − 2 degrees of freedom.
(Refer Slide Time: 19:42)
So: first state the hypotheses, β1 = 0 versus β1 ≠ 0; specify the significance level, alpha = 5%; select the test statistic, t = b1 / S_b1; and state the rejection rule: reject H0 if the p-value is less than or equal to 0.05, or equivalently if |t| is greater than 3.182 with n − 2 degrees of freedom (there are 5 data points, so 5 − 2 = 3 degrees of freedom).
(Refer Slide Time: 20:16)
Now compute the value of the test statistic: b1 = 5 and S_b1 = 1.08, so t = 5 / 1.08 ≈ 4.63. To determine whether to reject H0: because t = 4.63 lies beyond the value that puts an area of 0.01 in the upper tail, the two-tailed p-value is less than 0.02, so we reject the null hypothesis and conclude that there is a relation between x and y.
(Refer Slide Time: 20:51)
Here also H0: β1 = 0 and H1: β1 ≠ 0; this is the two-tailed test (the one-sided versions would be right-tailed or left-tailed). The formula is t = (b1 − β1) / S_b1, where S_b1 = Se / √Sxx, Se = √(SSE / (n − 2)), and Sxx = Ʃx² − (Ʃx)² / n. Remember it has n − 2 degrees of freedom.
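Here is a small sketch of this t-test for the TV-ads example, computing Se, S_b1, the t statistic, its two-tailed p-value, and the 95% confidence interval for β1 (scipy is used only for the t distribution):

```python
import numpy as np
from scipy import stats

x = np.array([1, 3, 2, 1, 3])        # TV ads
y = np.array([14, 24, 18, 17, 27])   # cars sold
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

sse = np.sum((y - (b0 + b1 * x)) ** 2)
se = np.sqrt(sse / (n - 2))                      # standard error of the estimate
sb1 = se / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of b1

t_stat = b1 / sb1                                # about 4.63
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))

t_crit = stats.t.ppf(0.975, df=n - 2)            # 3.182 for 3 degrees of freedom
ci = (b1 - t_crit * sb1, b1 + t_crit * sb1)      # 95% confidence interval for beta1
print(t_stat, p_value, ci)
```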
(Refer Slide Time: 21:19)
Next, we can use the 95% confidence interval for β1 to test the same hypothesis as in the t-test. With the help of the confidence interval we can decide whether the null hypothesis should be accepted or rejected: H0 is rejected if the hypothesized value of β1 is not included in the confidence interval. The hypothesized value of β1 is 0, so if 0 lies inside the confidence interval we accept the null hypothesis; otherwise we reject it.
(Refer Slide Time: 21:55)
The confidence interval for β1 has the form b1 ± t_{α/2} · S_b1, where b1 is the coefficient of x from our regression equation and S_b1 is computed from the formula I gave earlier; you substitute them here.
(Refer Slide Time: 22:22)
The conclusion: 0 is not included in the confidence interval, so we reject the null hypothesis.
(Refer Slide Time: 22:43)
Previously we used the t-test. When the number of independent variables is more than one, we would have to do the t-test several times — if there are 5 independent variables, five individual tests. As I have told you, whenever we compare more than two we should go for ANOVA, that is, the F-test. So when there are more independent variables — or as a general method for testing the hypothesis β1 = 0 — we go for the F test.
Here the F statistic is MSR divided by MSE. In ANOVA we wrote MS treatment divided by MSE; MS treatment corresponds to our mean regression sum of squares, MSR.
(Refer Slide Time: 23:43)
So F = MSR / MSE, where MSR = SSR / k, k being the number of independent variables (the numerator degrees of freedom), and n − k − 1 is the degrees of freedom for the error term.
(Refer Slide Time: 24:02)
What is the rejection rule? Reject H0 if the p-value is less than or equal to alpha, or equivalently if the calculated F value is greater than the value from the F table. For the F distribution we have to look at the degrees of freedom: the numerator degrees of freedom — in this problem there is only one independent variable, so 1 — and the denominator degrees of freedom, n − 2 = 5 − 2 = 3.
(Refer Slide Time: 24:33)
So, H0: beta 1 = 0, H1: beta 1 ≠ 0, alpha = 0.05, F = MSR / MSE. To apply the rejection rule we need the
table value: with numerator degrees of freedom 1 and denominator degrees of freedom 3, the critical F
value at alpha = 0.05 is 10.13.
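As a hedged sketch of this step, the table value and the p-value of the F statistic can be checked with scipy (the 21.43 used below is the F value quoted later for this example):

from scipy import stats

df_num, df_den, alpha = 1, 3, 0.05
f_crit = stats.f.ppf(1 - alpha, df_num, df_den)   # critical value, about 10.13
f_stat = 21.43                                    # MSR / MSE from the example
p_value = stats.f.sf(f_stat, df_num, df_den)      # upper-tail p-value, well below 0.05
print(f_crit, p_value)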
(Refer Slide Time: 24:51)
So, now we will use Python code to do the same thing: import numpy as np, import
matplotlib.pyplot as plt, import seaborn as sns, import pandas as pd, import matplotlib as mpl, import
statsmodels.formula.api as sm, from sklearn.linear_model import LinearRegression, from scipy
import stats. Then we read the regression data and save it in an object called tb1.
Now what is happening: you see that the p-value for the TV ads variable, with alpha = 5%, is below 0.05,
so TV ads is a significant variable; if it were more than 5%, say 0.06, we would not include this
independent variable, TV ads, in the regression equation. The F statistic is 21.43; I will go back and
verify this answer. We can find MSR as I said earlier: first you have to find SSR, the regression sum of
squares, which is Ʃ(ŷ - ȳ)²; dividing SSR by k, the number of independent variables, gives MSR.

So the F value is 21.43, and see the probability: it is well below 0.05. There are two things here. For the
model as a whole, since the p-value of the F statistic is less than 0.05, the model is valid; and we can
also check each individual independent variable. See, here it is less than 0.05, so this variable is
significant. See the lower limit and upper limit: there is no 0 between 1.563 and 8.43. So we cannot
accept, we have to reject, the null hypothesis.
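For readers who want to reproduce this output, here is a minimal sketch using statsmodels. The five (x, y) pairs below are an assumption on my part, chosen because they reproduce the statistics quoted in the lecture (slope 5, standard error 1.08, t = 4.63, F = 21.43, interval 1.563 to 8.437); the lecture itself reads the data from an Excel file.

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data consistent with the lecture's reported statistics
tb1 = pd.DataFrame({'tv_ads': [1, 3, 2, 1, 3],
                    'sales':  [14, 24, 18, 17, 27]})

model = smf.ols('sales ~ tv_ads', data=tb1).fit()
print(model.summary())   # slope 5, std err 1.08, t = 4.63, F = 21.43, CI roughly (1.563, 8.437)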
(Refer Slide Time: 27:02)
You see the F value is 21.43 in Python and 21.43 here, so we can verify our Python result against what
we did manually. Some cautions about the interpretation of the significance test, and this is very
important: rejecting H0: beta 1 = 0 and concluding that the relationship between x and y is significant
does not enable us to conclude that there is a cause-and-effect relationship present between x and y.
Just because there is a correlation we cannot say there is a cause-and-effect relationship.
So, just because we are able to reject H0: beta 1 = 0 and demonstrate statistical significance does not
enable us to conclude that the relationship between x and y is linear. Dear students, in this class we
took one sample problem, fitted a regression equation, and then carried out hypothesis testing on that
regression equation to test the significance of the independent variable.
There are two ways to do the significance test: one is the t-statistic method and the other is the F-test
method; in both methods we got the same answer. Then I explained the meaning of the coefficient of
determination, that is r², and from r² how we can get r. The next class will go for the multiple
regression equation, where we will consider more than one independent variable, and I will also
explain some important assumptions of the regression equations. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 31
Estimation Prediction of Regression Model Residual Analysis: Validating Model
Assumptions - 1
Dear students, in the previous class I explained the confidence interval for the x coefficient, that is b1;
for b1 we found the lower limit and the upper limit. In this class we will find the confidence interval
for y and the prediction interval.
(Refer Slide Time: 00:43)
So, today's class agenda is: first I will explain what a point estimate and an interval estimate are, then
the confidence interval for the mean value of y and the prediction interval for an individual value of y.
(Refer Slide Time: 00:43)
We will take one problem; first I will solve this problem with the help of Python, then I will explain the
meaning of the confidence interval and the prediction interval. Data were collected from a sample of
10 ice cream vendors (restaurants) located near college campuses; for the ith observation in the
sample, xi is the size of the student population and yi is the quarterly sales of ice cream. The values of
xi and yi for the 10 restaurants in the sample are summarized in the table.
(Refer Slide Time: 01:29)
So, in this problem the independent variable is student population and the dependent variable is
sales; there are 10 data points like this.
(Refer Slide Time: 01:40)
For the given data set, first we will run the regression model: import pandas as pd, import matplotlib
as mpl, import statsmodels.formula.api as sm, from sklearn.linear_model import LinearRegression,
from scipy import stats, import seaborn as sns, import numpy as np, import matplotlib.pyplot as plt.
First we load the data: we read it with pd.read_excel, giving the path where I have stored my Excel
file, so this is the data.

So, what is in it: there are 10 data points for 10 restaurants; the student population is in thousands,
and sales are also in thousands, for a product called ice cream. First we will plot the scatter plot.
(Refer Slide Time: 02:29)
So, data.plot('population', 'sales', style='o'); the y label is ice cream sales and the title is sales versus
population. What is happening: there seems to be a positive trend; when the student population is
larger there are more sales. We will fit a regression model for this.
(Refer Slide Time: 02:56)
So, import statsmodels.api as sm; St_pop = data['population'], that is, I store the student population
in a variable called St_pop; sales = data['sales']; St_pop = sm.add_constant(St_pop), because I need to
have the constant in the regression equation. Then model1 = sm.OLS(sales, St_pop), where sales is our
dependent variable and St_pop is our independent variable; result1 = model1.fit(); and
print(result1.summary()).

So what we are getting here: look at this, this is the constant value. So I can write sales = 60 + 5 *
St_pop, where St_pop, the student population, is our independent variable. What is the interpretation
of this 5? If the student population is increased by one unit, the sales will increase by 5 units. Look at
this R square: R square is very good, that is 90.3%; I will explain the meaning of adjusted R square in a
coming class.
Then we have to remember: the standard error is 0.58, the t value is 8.68, the probability is 0.00, and
these are the lower limit and upper limit. There is another way we can do it: we can get the y-intercept
and x coefficient directly from sklearn: from sklearn.linear_model import LinearRegression;
x = data['population'].values.reshape(-1, 1); y = data['sales'].values.reshape(-1, 1);
reg = LinearRegression(); reg.fit(x, y). The LinearRegression object shows copy_X=True,
fit_intercept=True, n_jobs=1, normalize=False.
What is the meaning of fit_intercept? Sometimes when you fit a regression line there is a y intercept,
which is this much distance on the y axis. Sometimes you do not need the y intercept; in that case you
have to set fit_intercept=False. The other option is normalize: there are y values and x values, and if
you normalize the x values and y values and then run the regression, you will get standardized
regression coefficients.

Here normalize=False is written, so we are not going to normalize the data set. Then reg.intercept_[0]
is your intercept and reg.coef_[0, 0] is your x coefficient. Look at the previous slide also: there also we
got 60 and 5, the same result.
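The two approaches described above can be put together in one runnable sketch. The ten (x, y) pairs below are an assumption: they are the textbook values this example follows, chosen because they reproduce the quoted results (intercept 60, slope 5, R² about 0.903).

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({'population': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
                     'sales': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]})

# statsmodels route: add a constant, fit OLS, read the coefficients
St_pop = sm.add_constant(data['population'])
result1 = sm.OLS(data['sales'], St_pop).fit()
print(result1.params)            # const about 60, population about 5

# sklearn route: intercept and coefficient directly, plus a prediction at x = 10
x = data['population'].values.reshape(-1, 1)
y = data['sales'].values.reshape(-1, 1)
reg = LinearRegression().fit(x, y)
print(reg.intercept_[0], reg.coef_[0, 0], reg.predict([[10]])[0, 0])   # 60, 5, 110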
(Refer Slide Time: 06:08)
So, in our ice cream example, the estimated regression equation ŷ = 60 + 5x provides an estimate of
the relationship between the size of the student population x and quarterly sales y. So, this is our
regression equation.
(Refer Slide Time: 06:27)
So, in this regression equation the slope is 5 and the y intercept is 60.
(Refer Slide Time: 06:38)
Then we will see what a point estimate is. We can use the estimated regression equation to develop a
point estimate of the mean value of y for a particular value of x, or to predict an individual value of y
corresponding to a given value of x. So the value we predict can be interpreted as a mean value, and
separately we can predict an individual value. For instance, suppose your manager wants a point
estimate of the mean quarterly sales for all restaurants located near a college campus with 10,000
students.
(Refer Slide Time: 07:23)
So if the student population is 10 (that is, 10,000 students), when you substitute 10 into the equation
you get 60 + 5 * 10 = 110. That is your point estimate of the mean quarterly sales of all restaurants
located near a campus with 10,000 students: 110, i.e., 110,000 dollars. With the fitted regression
object we can also use the predict function, reg.predict: when you give the input x value, you get the y
value 110.
(Refer Slide Time: 07:45)
Now suppose the manager wants to predict the sales of an individual restaurant located near a college
with 10,000 students. In this case we are not interested in the mean value over all restaurants located
near campuses with 10,000 students; we are just interested in predicting the quarterly sales of one
individual restaurant. As it turns out, the point estimate for an individual value of y is the same as the
point estimate for the mean value of y.
Hence we would predict quarterly sales of 60 + 5 * 10 = 110, that is, 110,000 dollars. What I am saying
is that the point estimate is the same whether you are building a confidence interval or a prediction
interval. You may see the similarity with what we did in interval estimation earlier: x̄ ± Z σ/√n. There,
x̄ is the point estimate; here, the value 110 we get after substituting x = 10 is the point estimate. A
point estimate alone is not reliable, so we need an interval estimate.

So, for an interval estimate in that earlier context, x̄ + Z σ/√n is the upper limit and x̄ - Z σ/√n is the
lower limit. How we find the upper and lower limits in the regression context I will explain in the next
slide.
(Refer Slide Time: 09:17)
First we will plot it, marking the mean value of x and y: x = data['population'], y = data['sales'], so x is
the population and y is the sales; plt.figure(); sns.regplot(x, y, fit_reg=True); plt.scatter(np.mean(x),
np.mean(y)). So we got this regression line. You see that the best-fit regression line has to pass
through (x̄, ȳ); that is why this point is the mean of x and this point is the mean of y.
(Refer Slide Time: 10:06)
So what is confidence interval estimation? The confidence interval is an interval estimate for the mean
value of y for a given value of x, whereas the prediction interval is used whenever we want an interval
estimate for an individual value of y. In the confidence interval case what we are predicting is the
expected value, E(y) = a + bx, so whatever value we get after substituting x is the mean value.

So what will happen is that the margin of error is larger for the prediction interval: for the prediction
interval the margin of error will be larger and for the confidence interval the margin of error will be
smaller.
(Refer Slide Time: 10:56)
So, for confidence interval estimation, take xp as the particular or given value of the independent
variable x, and yp as the value of the dependent variable y corresponding to the given xp. The expected
value E(yp) is the mean of the dependent variable y corresponding to the given xp, and ŷp = b0 + b1 xp
is the point estimate of E(yp) when x = xp; that is why 60 + 5 * 10 = 110.
(Refer Slide Time: 11:47)
In general we cannot expect ŷp to equal E(yp) exactly. If we want to make an inference about how
close ŷp is to the true mean value E(yp), we have to estimate the variance of ŷp. The formula for the
estimated variance of ŷp at a given xp, denoted s²(ŷp), is s²(ŷp) = s² (1/n + (xp - x̄)² / Ʃ(xi - x̄)²);
this is the variance of the predicted mean of y.
(Refer Slide Time: 12:36)
So, the confidence interval is written as ŷp ± t(α/2) s(ŷp). This is analogous to what we did previously,
x̄ ± Z σ/√n: instead of x̄ we write ŷp, instead of Z we write t(α/2), and the standard error is s(ŷp). The
formula s²(ŷp) = s² (1/n + (xp - x̄)² / Ʃ(xi - x̄)²) can be derived very easily; I am not deriving it, you can
refer to any book for it. We can substitute the values here: s² is the square of the standard error of the
estimate, n is 10, xp is 10 because that is the given value of x, and x̄ is 14. When you substitute
everything, you get 110 ± 11.415.
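A sketch of this computation, assuming the same ten data points as above (they give x̄ = 14, SSE = 1530 and hence s² = 191.25, consistent with the values quoted):

import numpy as np
from scipy import stats

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
n, xp = len(x), 10.0
y_hat_p = 60 + 5 * xp                                  # point estimate, 110
s = np.sqrt(191.25)                                    # standard error of the estimate, sqrt(MSE)
s_yhat_p = s * np.sqrt(1 / n + (xp - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
margin = stats.t.ppf(0.975, n - 2) * s_yhat_p          # t(0.025, 8) = 2.306 times the std error
print(y_hat_p - margin, y_hat_p + margin)              # about 110 +/- 11.415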
(Refer Slide Time: 13:44)
So, the green dotted line shows the upper limit and the lower one shows the lower limit. You see that
the confidence interval is not a straight line, it is somewhat curved; what happens is that when x
equals x̄ (here 14) the interval is at its narrowest. We will see that special case.
(Refer Slide Time: 14:09)
Now we will plot this confidence interval. What is happening here: you see it is not a straight line, it is
somewhat curved. The confidence interval is narrowest when x equals x̄; we will see that special case.
(Refer Slide Time: 14:27)
The estimated standard deviation of ŷp is smallest when xp = x̄. In the previous equation, when you
substitute xp = x̄, the second term becomes 0, so what remains is s √(1/n) = s/√n. You see this is
similar to the result of the central limit theorem: the standard deviation of the sampling distribution
of the mean is σ/√n; it is similar to that.
(Refer Slide Time: 14:58)
Now we will go for the prediction interval for an individual value of y. Instead of estimating the mean
value of sales for all restaurants located near campuses with 10,000 students, we want to estimate the
sales of an individual restaurant located near a particular college with 10,000 students. When you
predict the y value for an individual restaurant, two components of variance have to be added: one
component is the variance of individual y values about the mean value yp, which is given by s²; the
other is the variance associated with using ŷp to estimate E(yp), which is given by s²(ŷp).

So if you want to form a prediction interval these two variances have to be added: one variance for y
and another variance for ŷp.
(Refer Slide Time: 16:03)
You see that s²(ind) = s² + s²(ŷp); when you add them, s² is common, so we get the formula
s(ind) = s √(1 + 1/n + (xp - x̄)² / Ʃ(xi - x̄)²). When we substitute the values into this formula, we get
14.69.
(Refer Slide Time: 16:17)
Now take the value of t(α/2); when you multiply it by 14.69 you get 33.875, which is the margin of
error, so the prediction interval is 110 ± 33.875.
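The corresponding sketch for the prediction interval, under the same assumed data:

import numpy as np
from scipy import stats

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
n, xp, s = len(x), 10.0, np.sqrt(191.25)
s_ind = s * np.sqrt(1 + 1 / n + (xp - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))   # about 14.69
margin = stats.t.ppf(0.975, n - 2) * s_ind                                            # about 33.875
print(110 - margin, 110 + margin)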
(Refer Slide Time: 16:37)
So, the black line shows the prediction interval and the green line shows the confidence interval;
neither is a straight line. When you look at this, you can see the prediction interval has a larger margin
of error compared to the confidence interval.
(Refer Slide Time: 16:58)
Now, confidence interval versus prediction interval: confidence intervals and prediction intervals
show the precision of the regression results; narrower intervals provide a higher degree of precision.
So after doing regression analysis, when you plot the confidence and prediction intervals they should
be narrow; if they are wide, it means the model is not a good model.
(Refer Slide Time: 17:23)
Now we will use Python to plot this prediction interval and confidence interval. For that purpose:
from statsmodels.stats.outliers_influence import summary_table; st, data1, ss2 =
summary_table(result1, alpha=0.05). The fitted values are data1[:, 2], that is, we are referring to the
third column; predict_mean_se (the standard error of the mean prediction) is data1[:, 3], the fourth
column; predict_mean_ci_low and predict_mean_ci_upp, the lower and upper limits of the confidence
interval for the mean, come from the fourth to sixth columns, data1[:, 4:6].T; and predict_ci_low and
predict_ci_upp, the prediction interval limits, come from data1[:, 6:8].T. What is happening here is that
the summary table gives us all the results, so we slice columns 2, 3, 4:6 and 6:8 to pick out the
particular values; that is the reason for this.
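Here is the same slicing written out as a runnable sketch, refitting the model on the assumed textbook data so the block is self-contained:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import summary_table

data = pd.DataFrame({'population': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
                     'sales': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]})
result1 = sm.OLS(data['sales'], sm.add_constant(data['population'])).fit()

st, data1, ss2 = summary_table(result1, alpha=0.05)          # ss2 holds the column names
fittedvalues = data1[:, 2]                                   # predicted values (third column)
predict_mean_se = data1[:, 3]                                # std error of the mean prediction
predict_mean_ci_low, predict_mean_ci_upp = data1[:, 4:6].T   # confidence interval for the mean of y
predict_ci_low, predict_ci_upp = data1[:, 6:8].T             # prediction interval for an individual y
print(predict_mean_ci_low[:3], predict_ci_low[:3])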
(Refer Slide Time: 18:41)
You see, this is predict_mean_ci_low, so we are getting the lower limit of the confidence interval here,
and predict_mean_ci_upp is the upper limit. The first two arrays are for the confidence interval and
the bottom two are for the prediction interval: predict_ci_low is the lower limit and predict_ci_upp is
the upper limit.
(Refer Slide Time: 19:19)
So, this picture is produced as follows, taking x as before: fig, ax = plt.subplots(figsize=(8, 6));
ax.plot(x, y, 'o', label='data'); ax.plot(x, fittedvalues, 'r-', label='OLS'); ax.plot(x, predict_ci_low, 'b--');
ax.plot(x, predict_ci_upp, 'b--'); ax.plot(x, predict_mean_ci_low, 'g--'); ax.plot(x,
predict_mean_ci_upp, 'g--'); ax.legend(loc='best'); plt.show().

So, when you run this command you will get this kind of plot. The green lines show the confidence
interval and the blue lines show the prediction interval. In this picture, 'r-' represents the red colour,
'b' represents blue, 'g' represents green, and the dashes specify what kind of line pattern we want in
the picture. Now, what we have done in this class: we have explained what a point estimate is, what a
confidence interval is, and what a prediction interval is.
The point estimate is the same for a particular value of x for both the confidence and the prediction
interval. Another point which you have learnt in this lecture is that the confidence interval is not a
straight line but a curved band, and similarly the prediction interval. Another point is that the
prediction interval has a larger margin of error compared to the confidence interval. After that, with
the help of Python, I ran the code to show you how to plot the confidence and prediction intervals.
Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 32
Estimation Prediction of Regression Model Residual Analysis: Validating Model
Assumptions - II
This lecture is on validating regression model assumptions. The lecture objectives are understanding
different types of residual analysis and plotting residual plots using Python.
(Refer Slide Time: 00:37)
Residual analysis: validating model assumptions. First we will see what residual analysis is. Residual
analysis is the primary tool for determining whether the assumed regression model is appropriate.
The residual for observation i is yi - ŷi, where yi is the actual value and ŷi is the value predicted by the
model. The difference between the actual and predicted values is the error, otherwise called the
residual. So yi is the observed value of the dependent variable and ŷi is the estimated value of the
dependent variable.
(Refer Slide Time: 01:12)
Assumptions about the error term epsilon: we know that y = beta 0 + beta 1 x + epsilon. What are the
assumptions about this error term? Number one, the expected value of the error is 0; two, the variance
of the error term, denoted sigma squared, is the same for all values of x; three, the values of the error
are independent; four, the error term epsilon has a normal distribution. We will check these
assumptions by drawing various residual plots in this lecture.
(Refer Slide Time: 01:47)
Why are these assumptions important? These assumptions provide the theoretical basis for the t-test
and F test used to determine whether the relationship between x and y is significant, and for the
confidence interval and prediction interval estimates. In the previous class we did the t test and F test
to test the hypotheses H0: beta 1 = 0 versus H1: beta 1 ≠ 0. Since the significance of beta 1 can be
tested only by these two methods, the t-test and F test, validating the assumptions about the error
term is all the more important.

If the assumptions about the error term epsilon appear questionable, the hypothesis tests about the
significance of the regression relationship and the interval estimation results may not be valid; that is
why we have to verify these assumptions.
(Refer Slide Time: 02:47)
We will take an example; this example is adapted from Statistics for Business and Economics by
Anderson, Sweeney and Williams. The student population xi is 2, 6, 8, 8, 12 and so on, and the sales of
ice cream are given, so we have fitted the regression line ŷi = 60 + 5xi. When you substitute the x
values: the first actual value is 58 and our predicted value is 70, so the difference is 58 - 70 = -12; the
next actual is 105 against a prediction of 90, so the difference is 15. This yi - ŷi is the residual. The
residuals have to have certain properties, and those properties we will check.
(Refer Slide Time: 03:41)
Residual analysis is based on the examination of graphical plots. We are going to plot the residuals
from the previous slide and then check certain assumptions. There are four plots we are going to use
in this class. The first is a plot of the residuals against the values of the independent variable x: on the
x axis we take the x values and on the y axis we take the residuals. The second is a plot of the residuals
against the predicted values of the dependent variable, ŷ: on the x axis we have ŷ and on the y axis we
have the residuals.

The third is the standardized residual plot: we are going to standardize the residuals. We know how to
standardize; for example, standardizing x gives z = (x - x̄) / σ, where σ is the standard deviation. In the
same way we will standardize the residuals and then plot them. The last one is the normal probability
plot.
(Refer Slide Time: 04:51)
So, we will check these assumptions. First, we plot the residuals against the x values; the residuals are
plotted in this way. What inference can we get? Number one, the plot is not following any pattern; if it
is not following any pattern, the errors are independent and there is no problem with the
assumptions.
(Refer Slide Time: 05:15)
This we have done with the help of Python; I have the screenshot, and at the end of this lecture I am
going to run all these codes so you can verify it. Import seaborn as sns; before that we have to import
the data set, which I will show you at the end of the class. Then sns.residplot is used for plotting the
residuals: the first variable is student population, the x value; the y value is sales; the colour is green;
and you will get this output.

So this is the Python output of the residual plot against student population. Note that the y axis is not
sales, it is the residuals; it cannot be sales, because sales would not become 0.
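A minimal sketch of this plot, again using the assumed ten textbook data points:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({'student_population': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
                     'sales': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]})

# residuals of a simple linear fit of sales on student_population, plotted against x
sns.residplot(x=data['student_population'], y=data['sales'], color='g')
plt.xlabel('student population (1000s)')
plt.ylabel('residual')
plt.show()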
(Refer Slide Time: 06:20)
One assumption is that the variance is the same for all values of x. In the figure given here the points
look like a rectangular band, which means this assumption is valid; it is a good pattern. What does this
graph imply? The residual plot should give an overall impression of a horizontal band of points. That
means that even though the x value is increasing, the variance stays the same, and we get a horizontal
band of points; this is the way to check the assumption that the variance is the same for all values of x.

Sometimes what may happen is that when the value of x increases the variance increases; that should
not be the case. The next figure is an example of that.
(Refer Slide Time: 06:54)
What is happening here is a violation of the assumption: the variance of yi is not the same for all values
of x. When x increases the variance increases, and the plot takes a conical shape, so it is a
non-constant variance. This is a violation of our regression model assumptions. The assumption of
constant variance of the error is violated if you get this kind of shape. If the variability about the
regression line is greater for larger values of x you get this kind of picture, and that is not correct.
(Refer Slide Time: 07:30)
Another type of picture you may get when you plot the residuals against x is a curvilinear one, a kind
of non-linear shape. It suggests that instead of fitting a linear regression equation you should try a
curvilinear regression model, or that a multiple regression model should be considered, if the plot has
this shape. Previously we plotted the residuals against x; now we are going to plot the residuals against
ŷ, that is, our predicted values. The pattern of this residual plot is the same as the pattern of the
residual plot against the independent variable x.

It is not a pattern that would lead us to question the model assumptions. Why do we go for this plot?
Because if there are many independent variables, you would have to plot the residuals against each
independent variable separately.
(Refer Slide Time: 08:19)
So, instead of going through the different independent variables one by one, if you plot the residuals
against the predicted values, then from that one plot we can verify whether the model is valid or not.
(Refer Slide Time: 08:35)
For simple linear regression, both the residual plot against x and the residual plot against ŷ provide
the same pattern; for multiple regression analysis the residual plot against ŷ is more widely used
because of the presence of more than one independent variable. Whenever there is more than one
independent variable, instead of plotting against x we should plot against ŷ.
(Refer Slide Time: 08:58)
Then we will go to the next residual plot, the standardized residuals. Many of the residual plots
provided by computer software packages use a standardized version of the residuals. What is the
standardized version? A random variable is standardized by subtracting its mean and dividing the
result by its standard deviation; in this way z = (residual i - mean residual) / standard deviation of the
residuals.

With the least squares method the mean of the residuals is 0, because the residuals sum to zero; thus
simply dividing each residual by its standard deviation provides the standardized residual. So in the
least squares method you simply divide each residual by its standard deviation, and that gives you the
standardized residual.
(Refer Slide Time: 09:57)
In the Python code here (this is a screenshot of the program): import pandas as pd; from
statsmodels.formula.api import ols; from statsmodels.stats.anova import anova_lm; import
matplotlib.pyplot as plt. The data set name is ice cream; the independent variable is student
population and the dependent variable is sales. To get your regression model: reg1 = ols(formula =
'sales ~ student_population', data = df1), and then fit1 = reg1.fit().
(Refer Slide Time: 10:43)
So, when you print the summary you get this kind of regression output. This is your r square, this is
your adjusted r square (I will explain the meaning of adjusted r square in multiple regression), and
this is the F statistic. So this says y = 60 + 5 x1, where x1 is the student population.
(Refer Slide Time: 11:07)
So, when you use print(anova_lm(fit1)) for our model fit1, you get the ANOVA table for the regression
analysis. For the residual the degrees of freedom is 8, because there are 10 data points and the degrees
of freedom is n - p - 1, where p is the number of independent variables; since there is only one
independent variable, its degrees of freedom is 1. This is the sum of squares for student population
and this is the sum of squares for error; sum of squares divided by degrees of freedom gives the mean
sum of squares.

The F value is the mean regression sum of squares divided by the mean error sum of squares; the p
value is very low, so we can say that the model is valid.
(Refer Slide Time: 11:52)
Next I will tell you how to find the standardized residuals. The standard deviation of the ith residual is
s(yi - ŷi) = s √(1 - hi), where s is the standard error of the estimate. In the previous output MSE is
191.25; the standard error of the estimate is s = √(SSE / (n - 2)), which is the square root of MSE, so
taking the square root of 191.25 gives the standard error.

That is the value of s, and you can find hi = (1/n) + ((xi - x̄)² / Ʃ(xi - x̄)²).
(Refer Slide Time: 12:41)
There is an illustration: for each x we compute xi - x̄, because this value is needed for the formula on
the previous slide, then (xi - x̄)², then (xi - x̄)² / Ʃ(xi - x̄)². From that you compute hi, then s(yi - ŷi),
and together with the residual yi - ŷi you get the standardized residual.
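A sketch of this calculation with numpy, assuming the same ten textbook data points; it reproduces the standardized residuals discussed next (about -1.08 for the first observation):

import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

y_hat = 60 + 5 * x
residuals = y - y_hat
n = len(x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))                  # standard error of the estimate
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage h_i
std_resid = residuals / (s * np.sqrt(1 - h))                   # standardized residuals
print(std_resid)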
(Refer Slide Time: 13:15)
Let us do it: the standardized residual is (yi - ŷi) / s(yi - ŷi). When you plot x against the standardized
residuals, most of the data points lie between +2 and -2; that means roughly 95% of the values are
within the limits, so this is acceptable.
(Refer Slide Time: 13:47)
The assumption is valid. We can plot the standardized residuals against the independent variable x;
for that you use these commands: influence = fit1.get_influence(); resid_student =
influence.resid_studentized_external. You can see that resid_student is an array; these are the
studentized residuals.

Now you can plot student population against these studentized residuals and you will get this figure;
all the data points are between +2 and -2, so this assumption is valid.
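The same studentized residuals obtained through statsmodels, as a self-contained sketch on the assumed data (the externally studentized version, as in the lecture's screenshot):

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

df1 = pd.DataFrame({'student_population': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
                    'sales': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]})

fit1 = ols('sales ~ student_population', data=df1).fit()
influence = fit1.get_influence()
resid_student = influence.resid_studentized_external   # externally studentized residuals

plt.scatter(df1['student_population'], resid_student)
plt.axhline(2, linestyle='--')                          # most points should fall between +2 and -2
plt.axhline(-2, linestyle='--')
plt.xlabel('student population')
plt.ylabel('studentized residual')
plt.show()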
(Refer Slide Time: 14:22)
The standardized residual plot can provide insight about the assumption that the error term ‘e’
has the normal distribution. If this assumption is satisfied the distribution of the standardized
residual should appear to come from a standard normal probability distribution.
(Refer Slide Time: 14:41)
For the studentized residuals: when looking at the standardized residual plot, we should expect to see
approximately 95% of the standardized residuals between -2 and +2. We see in the figure that for the
ice cream example all standardized residuals are between -2 and +2; therefore, on the basis of the
standardized residuals, this plot gives us no reason to question the assumption that the error term has
a normal distribution.
(Refer Slide Time: 15:10)
Next we will plot the normal probability plot. Another approach for determining the validity of the
assumption that the error term has a normal distribution is the normal probability plot; in many
software packages you will see the normal probability plot. To show how a normal probability plot is
developed, we introduce a concept called the normal score. Suppose 10 values are selected randomly
from a normal probability distribution with mean 0 and standard deviation 1, and that sampling
process is repeated over and over, with the values in each sample of 10 ordered from smallest to
largest.
(Refer Slide Time: 15:52)
For now let us consider only the smallest value in each sample. The random variable representing the
smallest value obtained in repeated sampling is called the first order statistic; the second smallest is
the second order statistic, and so on.
(Refer Slide Time: 16:14)
For this first order statistic, when the sample size is 10, the expected value is -1.55 if the data are
coming from a normal distribution. Similarly, the expected values for the second and further order
statistics are obtained from the table. Now we are going to compare the standardized residual values
with this table; we already have the standardized residuals, for example for i = 1, xi = 2, it is -1.0792.

These standardized residuals we are going to compare with the normal scores.
(Refer Slide Time: 16:57)
What is happening here: when we look at this picture, for the first order statistic the normal score is
-1.55, so in our data we have to find the smallest standardized residual, which is -1.71; that is paired
with -1.55. Next we find the next smallest standardized residual, -1.07, and pair it with the next
smallest normal score, and so on. In this way we map our standardized residuals against the normal
scores: the smallest normal score -1.55 is paired with -1.71, and the largest normal score +1.55 is
paired with the largest standardized residual from our dataset, 1.4230.
(Refer Slide Time: 17:49)
Now we will see the normal probability plot; using this data we will plot it. If the normality assumption
is satisfied, the smallest standardized residual should be close to the smallest normal score, the next
smallest standardized residual should be close to the next smallest normal score, and so on; that is
what we mapped in the previous slide.
If we develop a plot with the normal scores on the horizontal axis and the corresponding standardized
residuals on the vertical axis, the plotted points should cluster closely around the 45-degree line
passing through the origin if the standardized residuals are approximately normally distributed. That
is the property of this plot, and such a plot is referred to as a normal probability plot.
(Refer Slide Time: 18:36)
So, this is the normal probability plot: on the x axis the normal scores are written and on the y axis the
standardized residuals. The line through the origin is at 45 degrees, and all the points are close to it,
not deviating from the line but clustering around this green line; then we can say that the data follow a
normal distribution. Suppose the points were scattered away and not clustered around this green line;
then we would say the data do not follow a normal distribution.
(Refer Slide Time: 19:15)
This also we have done with the help of Python: from scipy import stats; import statsmodels.api as sm.
We take the residuals as res = fit1.resid, then probplot = sm.ProbPlot(res); sm and the other libraries
have already been imported. Then fig = probplot.qqplot(line='45') and h = plt.title('qq plot of
residuals of OLS fit'); we get this plot, and you see that all the points are around this red line, so we can
say the normality assumption is validated.
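A runnable sketch of this step (the model is refit on the assumed data so the block is self-contained; fit=True is used here so the standardized residuals can be compared against the 45-degree line, which may differ slightly from the lecture's exact call):

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

df1 = pd.DataFrame({'student_population': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
                    'sales': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]})
fit1 = ols('sales ~ student_population', data=df1).fit()

res = fit1.resid                          # residuals of the fitted model
probplot = sm.ProbPlot(res, fit=True)     # fit=True standardizes the residuals
fig = probplot.qqplot(line='45')          # QQ plot against the 45-degree reference line
h = plt.title('qq plot - residuals of OLS fit')
plt.show()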
Now what we are going to do: I have prepared these commands in Python, I am going to run all the
Python code, and I am going to show how to get these residual plots and how to verify that the model
meets the regression assumptions. So far I have shown screenshots of the Python output; now I am
going to run it and explain how to get the residual plots and what their interpretation is.
For that, I have taken one regression example; the file where I have stored the data is named ice
cream. First I will run the library imports, then I will show the data set. This data set has two variables:
student_population is the independent variable and sales is the dependent variable. For this data set I
am going to run the regression. We get the regression output: you see that the intercept is 60 and the
coefficient of the student_population variable is 5.

So we can write y = 60 + 5 x1, where x1 is student population. Then we see that the p-value is less than
0.05, so this independent variable is significant. You see that r square is 0.903, which means 90.3% of
the variability of y is explained with the help of this regression model. Similarly, for the x coefficient
the standard error is 0.58. Now we are going to get the ANOVA table for this regression.
get the ANOVA table for this regression.
So this ANOVA table type this print anova underscore lm fit1 we are getting this anova table for
regression analysis what we are understanding here for independent variable is student
population, so the degrees of freedom is one sum of square is 14200 when you divide this 14200
by 1 we are getting the mean sum of square. Then for a error term the degrees of freedom is 8.
How it is 8 because there was a 10 data set so the degrees of freedom is 10 - 1 - number of
independent variables so 10 – 1, 9 - 1 one independent variable so 8.
The sum of squares for error is 1530, so when you divide 1530 by 8 you get the mean error sum of
squares, 191.25. The F value is 14200 divided by 191.25, and the p value is very low, so the model is
validated. Next we are going to draw the residual plot: on the x axis I have taken the student
population, the independent variable, and on the y axis it is not the sales but the residuals of sales.
Next we will see the studentized residual plot: you run this, and these are your standardized residuals.
We plot these standardized residuals, and the interpretation is that all the points are between +2 and
-2, so we can say this assumption is valid. Next we check the normality of the error term: when you
run this code we get the QQ plot.
The QQ plot shows that all the points are around this red line, so we can say the normality assumption
is satisfied. In every lecture you can follow this code and verify the output; I am also planning to share
this code with you when you register for this course. Now I will conclude what we have seen in this
lecture: we have tested various assumptions about the regression model, and these assumptions we
have tested with the help of different residual plots.
We have seen 4 types of residual plots: the first is the residuals against the independent variable, the
next is the residuals against the predicted values, the third is the standardized residual plot, and the
fourth is the normal probability plot. These different graphs helped us to test the assumptions of the
regression model. In the next class we will discuss multiple regression models with some other
examples. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 33
Multiple Regression Model - I
In the previous class we have studied about simple linear regression, in this class we are going to
discuss about multiple regression models.
(Refer Slide Time: 00:35)
The class agenda is: I am going to explain what a multiple regression model is, then the least squares
method, then the multiple coefficient of determination. Under the multiple coefficient of
determination I am also going to explain what adjusted r-square is. Then, the assumptions in multiple
linear regression. Then I am going to test the significance using the F test and t test.
(Refer Slide Time: 01:04)
What is a multiple regression model? When there is more than one independent variable, the model is
called a multiple linear regression model; if there is only one independent variable, it is a simple
linear regression model. When you take the expected value of the multiple regression model, we know
from the assumptions of any regression equation that the expected value of the error term is 0, so
when you take the expected value of y there is no error term: E(y) = beta 0 + beta 1 x1 + beta 2 x2 + ...
+ beta p xp is the multiple regression equation. Here beta 1 and beta 2 are the coefficients of x1 and
x2, and beta p is the coefficient of xp.
(Refer Slide Time: 01:42)
What is the estimation process for multiple regression? There is a multiple regression model
y = beta 0 + beta 1 x1 + beta 2 x2 + ... + beta p xp + e, where e is the error term; from this we get the
multiple regression equation, where beta 0, beta 1, beta 2, ..., beta p are unknown parameters. To find
these unknown population parameters, we collect sample data for x1, x2, up to xp, and sample data
for y, the dependent variable. With the help of the sample data we construct the estimated multiple
regression equation ŷ = b0 + b1 x1 + b2 x2 + ... + bp xp, where b0, b1, b2, ..., bp are sample statistics.

With the help of the sample statistics we estimate the population parameters beta 0, beta 1, beta 2, ...,
beta p. Then we do a significance test to see whether beta 1, beta 2, and so on are equal to 0 or not, and
after testing that we infer the values of beta 1, beta 2, and so on at the population level. This is the
process of building a multiple regression model. It is similar to the simple linear regression model; in
simple linear regression only x and y were taken, only one independent variable, but here there is
more than one independent variable. That is the only difference; all other concepts are the same.
(Refer Slide Time: 03:21)
So, simple versus multiple regression. In simple linear regression, b0 and b1 are the sample statistics
used to estimate the parameters beta 0 and beta 1; in multiple regression the statistical inference
process is parallel, with b0, b1, b2, ..., bp denoting the sample statistics used to estimate the
parameters beta 0, beta 1, beta 2, ..., beta p. The meaning is that with the help of the sample statistics
b0, b1, b2, ... we estimate the population parameters beta 0, beta 1, beta 2, and so on.

In simple regression there were only b0 and b1 and only one independent variable; in multiple
regression there is more than one independent variable. That is the only difference.
(Refer Slide Time: 04:18)
The least squares method: in simple linear regression also I derived the formulas for b0 and b1 under
the criterion that when we draw the line, the sum of the squares of the errors has to be minimized. In
simple linear regression ŷi was b0 + b1 xi, but in multiple regression ŷi = b0 + b1 x1i + b2 x2i + ... + bp
xpi, where p is the number of independent variables.

All the other procedure is the same; but here it is a multi-dimensional picture, we cannot draw it as a
two-dimensional picture, because with more than one independent variable it becomes
multi-dimensional and we cannot explain it with the help of a simple graph.
(Refer Slide Time: 05:17)
With the least squares estimates, ŷ = b0 + b1 x1 + b2 x2 + ... + bp xp; there is no error term here
because the expected value of the error term is 0. How do we interpret the values of b1, b2, b3? The
interpretation of the coefficient b1 is that, keeping the other variables constant, if x1 is increased by
one unit, ŷ will increase by b1 units. It is similar to simple linear regression, but here, when
interpreting one coefficient, we have to assume that the other independent variables are held
constant.
(Refer Slide Time: 06:03)
We will take an example; this example problem is taken from Statistics for Business and Economics by
Anderson, Sweeney and Williams. As an illustration of multiple regression analysis, we will consider a
problem faced by a trucking company. The major portion of the business involves deliveries
throughout the local area; to develop a better work schedule the managers want to estimate the total
daily travel time for their drivers. So total daily travel time is going to be our dependent variable.
(Refer Slide Time: 06:43)
There are 10 driving assignments; x1 is the miles traveled and y is the travel time. There is a
connection between x1 and y: when the distance traveled is higher, the travel time will also be higher.
So y is the dependent variable and x1 is the independent variable.
(Refer Slide Time: 07:03)
I have brought the screenshot; at the end of this lecture I will run this code and then you can
understand it better. I will explain the screenshot: import pandas as pd; from statsmodels.formula.api
import ols, that is, the ordinary least squares regression model; from statsmodels.stats.anova import
anova_lm, because this library will be used to see the ANOVA table for a regression model.

Then import matplotlib.pyplot as plt. The file where I have stored the data is named trucking, an Excel
file; I am going to store this data into an object called df1: df1 = pd.read_excel(filename). If you want
to know what the data set is, this is the data set.
(Refer Slide Time: 07:54)
In this data set, travel_time is the dependent variable, and there are 2 independent variables: one is x1
and the other is the number of deliveries. The meaning of x1 is miles traveled. Before going to
regression, first we ought to get an idea whether there is any connection between the independent
variable x1, miles traveled, and the dependent variable, travel time.
(Refer Slide Time: 08:19)
The first step is to draw the scatter plot: import matplotlib.pyplot as plt, then plot df1 with x1 on the
x-axis and travel_time on the y-axis in green, with the label travel time. What is happening is that
there seems to be some relationship between miles traveled and travel time; obviously, when the
miles traveled are more, the travel time will also be more. This is between one independent variable
and one dependent variable.
(Refer Slide Time: 08:55)
Similarly, we will take another variable, number of deliveries, as an independent variable and travel
time as the dependent variable; there also there seems to be a positive correlation. Why is this
required? Because if there is no correlation at all between an independent variable and the dependent
variable, we need not do the regression analysis.
(Refer Slide Time: 09:20)
Now in this graph both variables, distance traveled and number of deliveries, are taken together; this
is the code to show both variables in the same figure. What I am going to do: first I take one
independent variable, plot it, and construct the regression equation; then I take both independent
variables together and construct a regression equation. Taking the first independent variable alone,
the equation is ŷ = 1.27 + 0.0678 x1; I will show you in the next slide how we got this answer.
(Refer Slide Time: 10:01)
So I am going to do a regression analysis: reg1 = ols(formula = 'travel_time ~ x1', data = df1), where
travel time is taken as the dependent variable and x1, distance traveled, as the independent variable.
Then fit1 = reg1.fit() and print(fit1.summary()). What is happening here: we get the coefficients; the
intercept is 1.2739 and the coefficient of x1 is 0.0678, so we can write ŷ = 1.2739 + 0.0678 x1, where
x1 is the independent variable; you see we get the same answer here.

One more thing to understand here: the R square is 0.664. Next I am going to introduce another
variable, and after introducing the other variable I am going to see what happens to this R square. The
R square indicates the goodness of the model: the higher the R square, the better the model. What is
the meaning of 0.664? It means 66.4% of the variability of y can be explained with the help of this
model.
(Refer Slide Time: 11:25)
Now I am going to bring in another independent variable, the number of deliveries. When you bring in
another independent variable, the model becomes model = ols('travel_time ~ x1 + n_of_deliveries',
data = df1); this has two independent variables, and if there were three you could add that variable
with another plus sign. This is the way to do multiple regression in Python.
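A hedged, self-contained sketch of this two-variable fit; the ten rows below are assumed values taken from the textbook example this lecture follows, chosen because they reproduce the coefficients and R squares quoted next:

import pandas as pd
from statsmodels.formula.api import ols

# assumed data consistent with the quoted results
df1 = pd.DataFrame({
    'x1': [100, 50, 100, 100, 50, 80, 75, 65, 90, 90],          # miles traveled
    'n_of_deliveries': [4, 3, 4, 2, 2, 2, 3, 4, 3, 2],
    'travel_time': [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]})

fit1 = ols('travel_time ~ x1', data=df1).fit()                       # one predictor: R^2 about 0.66
fit2 = ols('travel_time ~ x1 + n_of_deliveries', data=df1).fit()     # two predictors: R^2 about 0.90
print(fit1.rsquared, fit2.rsquared)
print(fit2.params)    # intercept about -0.869, x1 about 0.0611, deliveries about 0.923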
(Refer Slide Time: 11:46)
Now what is happening here: look at the output, it is ŷ = -0.8687 + 0.0611 x1 + 0.9234 x2, where x2 is
the number of deliveries; this is important. We can verify that in the previous slide also we got the
same thing, -0.869 + 0.0611 x1 + 0.923 x2. Now compare the previous R square with this R square
after introducing the new variable.

After introducing the new variable, the R square, which was previously 0.6 something, has increased
to 0.90, so adding the new variable has helped to improve the explanatory power of this regression
model. Then there is one more term, adjusted r-square; in many previous lectures I said I would cover
it in the next lecture but was not able to; now in this lecture I will explain the meaning of adjusted
r-square.
The other point you have to understand: look at the p-value for each independent variable. What is
the null hypothesis here? For each coefficient, H0: beta 1 = 0 and H0: beta 2 = 0. Look at the p-values:
for x1 it is 0.000, so we have to reject the null hypothesis; when you reject the null hypothesis, beta 1
is not equal to 0, which means there is a relationship between x1 and y.

Similarly, look at the number of deliveries: the corresponding p-value is 0.04, which is also less than
0.05, so the hypothesis beta 2 = 0 is also rejected. That means the relationship is significant at the
population level; in other words, even at the population level there is a significant relationship
between x2 and y.
(Refer Slide Time: 14:04)
Relation among SST, SSR and SSE: we know that SST, the total sum of squares, equals the regression
sum of squares plus the error sum of squares. I explained SST in my previous lecture; for your
convenience I am drawing it one more time: this is ȳ, this is ŷ, and this is y; this distance is your SSR,
this distance is your SSE, and the total distance corresponds to SST.

So what is SST? SST = Ʃ(yi - ȳ)². What is SSR? SSR = Ʃ(ŷi - ȳ)². What is SSE? SSE = Ʃ(yi - ŷi)². When we
have only one independent variable, look at this: SST = 15.87 + 8.02, which comes to around 23.89.
You see the residual sum of squares: SSE is 8.02, and with only one independent variable SSR is
15.871. To get this output for the regression model you have to use print(anova_lm(fit1)) on the first
regression model.
(Refer Slide Time: 16:08)
In the next slide we bring in another ANOVA table, for the case of two independent variables; for that
purpose, anova_table = anova_lm(model, typ=1). Now you see that SST is the same, still around 23.9,
but SSE is 2.29, so the error has decreased. Look at SSR: SSR is the sum of these two components,
15.87 + 5.7, approximately 21.6. So what is happening is that when you introduce the new variable the
value of SSR increases to about 21.6, whereas with only one independent variable SSR was about
15.9.

So after introducing the new variable, about 5.7 more units of variance are explained. The other point
is that previously, with only one independent variable, the error sum of squares was 8.02; now the
error is reduced to 2.29. That is the advantage of using more independent variables: we can have a
more accurate model.
(Refer Slide Time: 17:20)
Now you will see what the multiple coefficient of determination is. In the simple linear regression
model we called it the coefficient of determination; now that there are multiple independent variables
we call it the multiple coefficient of determination, and it is SSR divided by SST. What is R square? SSR
is the sum of the two components, 15.87 + 5.7 = 21.6, and SST is the sum of all three components,
approximately 23.9, so R square is about 0.904.

So 90.4% of the variability of y can be explained with the help of these two independent variables. The
R square has increased, so it is a better model compared to the simple linear regression model.
(Refer Slide Time: 18:07)
Now we will go to another concept, adjusted R square. What is the purpose of adjusted R square?
Adding independent variables causes the prediction errors to become smaller. We know that SST =
SSR + SSE; when you add an independent variable the prediction error becomes smaller, so SSE
becomes smaller, and when SSE becomes smaller SSR becomes larger, because SSR = SST - SSE.

This causes R square to increase: whenever you add any independent variable, SSE decreases, due to
that SSR increases, and due to that R square increases. Many analysts prefer adjusting R square for
the number of independent variables to avoid overestimating the impact of adding an independent
variable on the amount of variability explained by the estimated regression equation.

So, instead of using R square we go for adjusted R square. The advantage of adjusted R square is that
it tells us whether the newly added variable is really an explanatory variable or a noise variable; in
other words, how much the newly added variable is helping to explain the variance over the existing
model.
(Refer Slide Time: 19:38)
So what is the formula for adjusted R square? Previously the formula for R square was R square = SSR divided by SST, explained variance divided by overall variance. This explained variance, the regression sum of squares, can be written as SST - SSE, because SSR represents the regression sum of squares for all independent variables taken together. When you add a new variable you cannot see the contribution of that new variable inside SSR, so we split SSR into the two terms SST - SSE, and the ratio becomes 1 - SSE divided by SST.
But then we have to bring in the degrees of freedom, because the meaning of "adjusted" is adjusted for degrees of freedom. The degrees of freedom for SSE is n - p - 1, where n is the total number of data points and p is the number of independent variables; with one independent variable it becomes n - 2. SST is divided by its own degrees of freedom, n - 1. When you simplify this you get the adjusted R square formula.
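Written out, the adjustment for degrees of freedom described above is:

R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}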
If you look at the slide, that is how the adjusted R square value of around 0.8 shown there is obtained. Another importance of this adjusted R square is the following. Compare R square and adjusted R square: whenever you introduce a new variable, the value of R square will increase, and adjusted R square may also increase. Let me explain the meaning of R square versus adjusted R square. Assume there is one dependent variable and several independent variables x1, x2, x3 and x4.
Now I am going to build a regression model. First I take y and write the regression equation in terms of x1 alone; R square will increase and so will adjusted R square. Then, keeping y as the dependent variable, I bring in a second independent variable: R square increases, and adjusted R square also increases if x2 really helps to explain the variance of y. But sometimes a variable, say x3, is a noise variable.
A noise variable means it does not help to explain the variability of y; it only disturbs the existing relationship. In that case R square will still increase, but adjusted R square will start decreasing. That is the hint for us that the variable we have added is not helping to explain the model; instead it is deteriorating the existing model. So x3 should not be added. That is the practical meaning of adjusted R square.
If the values of R square and adjusted R square are similar, it means there is no need to add any further variable into the model; you have reached a good model.
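The behaviour described above can be seen with a small simulation; this is only an illustrative sketch on made-up data, not the lecture's data set:

# Illustrative sketch: a pure-noise predictor raises R-squared but tends to
# lower adjusted R-squared.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                       # noise variable, unrelated to y
y = 2 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

fit_a = ols("y ~ x1 + x2", data=df).fit()
fit_b = ols("y ~ x1 + x2 + x3", data=df).fit()
print(fit_a.rsquared, fit_a.rsquared_adj)
print(fit_b.rsquared, fit_b.rsquared_adj)     # R-squared never falls; adjusted may fall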
(Refer Slide Time: 23:30)
690
If there is a gap, for example R square is 0.9 and adjusted R square is 0.3, then there is a possibility of adding more independent variables into the model. Now let us look at the adjusted multiple coefficient of determination. As I said on the previous slide, if a variable is added to the model, R square becomes larger even if the variable added is not statistically significant; this is very important.
The adjusted multiple coefficient of determination, that is, adjusted R square, compensates for the number of independent variables. "Adjusted" means adjusted for the number of independent variables, or equivalently for the degrees of freedom. If the value of R square is small and the model contains a large number of independent variables, the adjusted coefficient of determination can even take a negative value.
A very important point here is that the interpretation of R square and adjusted R square is not the same. R square tells how much of the variability of y is explained, but adjusted R square does not carry the same interpretation; many times adjusted R square can even become negative, so you should be careful about that. Next we will check the model assumptions. As I told you at the beginning of the class, the regression model contains an error term, but the regression equation does not, because when you take the expected value of y you get beta 0 + beta 1 x1 + beta 2 x2 and so on.
There is no error term there because the expected value of the error is 0. Now the assumptions. The first assumption is that the error term epsilon is a random variable with mean, or expected value, of 0. What is the implication? For given values of x1, x2, up to xp, the expected or average value of y is given by the regression equation; when you take the expected value of y there is no error term. This equation represents the average of all possible values of y that might occur for the given values of x1, x2, up to xp.
(Refer Slide Time: 25:58)
691
The second assumption: the variance of epsilon is denoted by sigma square and is the same for all values of the independent variables x1, x2, up to xp. The implication is that the variance of y about the regression line equals sigma square and is the same for all values of x1, x2, up to xp. If it is not the same, we say there is an effect of heteroscedasticity. This point is required because only when the error variance is the same across all values of x1, x2, up to xp is there a meaningful basis for comparison.
The third assumption is that the values of epsilon are independent. The implication is that the value of epsilon for a particular set of values of the independent variables is not related to the value of epsilon for any other set of values. In other words, the error terms are independent: when you plot the error terms there should not be any pattern, whether increasing or decreasing. That is the meaning of the third assumption. The fourth assumption is that the error term epsilon is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by beta 0 + beta 1 x1 + beta 2 x2 + ... + beta p xp.
The implication is that, because beta 0, beta 1, up to beta p are constants for given values of x1, x2, up to xp, the dependent variable y is also a normally distributed random variable. The error terms should be independent and should follow a normal distribution with equal variance; if the variance is not equal, the second assumption is also violated.
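A quick way to eyeball these assumptions, sketched here assuming a fitted statsmodels result named fit2, is to plot residuals against fitted values (constant variance, independence) and to draw a normal Q-Q plot (normality):

# Sketch only: residual diagnostics for an assumed fitted model `fit2`.
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = fit2.resid
fitted = fit2.fittedvalues

plt.scatter(fitted, resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")          # look for a random band: no funnel, no trend
plt.show()

sm.qqplot(resid, line="q")       # points close to the line suggest normal errors
plt.show()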
(Refer Slide Time: 27:39)
692
Now look at this graph of the regression equation for a multiple regression analysis with two independent variables: x1 is one independent variable and x2 is the other. The graph marks the mean value of x1 and the mean value of x2, and you can see the fitted values form a plane. So the multiple regression equation is represented by a surface; the fitted model is a plane, and the equation is no longer a line but a plane.
This is also why it is called RSM, the response surface model; another name for this kind of regression is the response surface model, because the fit is a surface.
(Refer Slide Time: 28:26)
693
Response variable and response surface: in regression analysis the term response variable is often used in place of the term dependent variable, so instead of saying dependent variable we may say response variable. Furthermore, since the multiple regression equation generates a plane or surface, the graph is called a response surface. In this lecture I have explained what a multiple regression model is, and the connection between the simple linear regression model and the multiple regression model.
Then I explained the least squares method, the meaning of R square and adjusted R square, and the various model assumptions. In the next lecture I am going to test the significance of beta 1, beta 2 and so on with the help of the F test and the t test, and we will also see a demo of how to do multiple regression in Python. Thank you very much.
694
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 34
Multiple Regression Model - II
In the previous lecture we started multiple regression models. There I explained how to build a multiple regression model, the meaning of beta 0, beta 1 and beta 2, and also what R square and adjusted R square are. In this lecture you are going to see how to do the significance tests; that means here also, as in simple regression, we frame hypotheses about the beta 1 coefficient, the beta 2 coefficient and so on.
We are going to test whether beta 1 is equal to 0 or not equal to 0. So in this lecture we will test the significance of the regression model with the help of the F test and the t test, and I will do a Python demo for multiple regression.
(Refer Slide Time: 01:13)
The F test is used to determine whether a significant relationship exists between the dependent variable and the set of independent variables; we refer to the F test as the test of overall significance. I will show you where this F test appears in our Python output. If the F test shows an overall significance, the t test is then used to determine whether each individual independent variable is significant or not. A separate t test is conducted for each of the independent variables in the model, so we refer to each of these t tests as a test of individual significance. In short, the F test is used for testing the overall regression model, and the t test is used to test the individual independent variables, whether they are significant or not.
(Refer Slide Time: 02:08)
For the F test, the null hypothesis is beta 1 = beta 2 = ... = beta p = 0. What does accepting the null hypothesis mean? For example, accepting beta 1 = 0 means there is no relation between x1 and the dependent variable y; accepting beta 2 = 0 means there is no relation between x2 and y. The alternative hypothesis, obviously, is that one or more of the parameters is not equal to 0.
(Refer Slide Time: 02:43)
696
How do we find the F statistic? The F statistic is MSR divided by MSE, that is, the mean regression sum of squares divided by the mean error sum of squares. We get the mean regression sum of squares by dividing SSR by its degrees of freedom, p, which gives MSR. MSE is SSE divided by n - p - 1, where p is the number of independent variables; that gives the mean error sum of squares.
This hypothesis test can be done in two ways: the p-value approach and the critical value approach. In the p-value approach, reject H0 if the p-value is less than or equal to alpha. The F distribution is right-skewed; for a chosen alpha you get the critical value F alpha from the table. In the p-value approach you compute the p-value, and if it falls in the rejection region you reject H0.
In the critical value approach, if the computed F value is beyond F alpha you reject H0, where F alpha is based on the F distribution with p degrees of freedom in the numerator and n - p - 1 degrees of freedom in the denominator.
(Refer Slide Time: 04:18)
So we get F = 10.8 divided by 0.328 = 32.9; I will show where this F value appears in the output on the next slide.
(Refer Slide Time: 04:34)
697
The F value in the output is 32.88, which is approximately the 32.9 we computed. Then we can see the p-value, Prob (F-statistic); this p-value is less than 0.05, so when alpha is 5% we have to reject the null hypothesis.
(Refer Slide Time: 05:00)
This is the ANOVA table for regression. What are the sources of variation? Regression, error, and total, with the regression sum of squares, the error sum of squares and the total sum of squares. Look at the degrees of freedom, which is the more important part. With one independent variable the degrees of freedom for the regression sum of squares is 1. For the total sum of squares it is n - 1, where n is the number of data points. When you subtract, (n - 1) - p gives n - p - 1 for the error. MSR is SSR divided by p, MSE is SSE divided by n - p - 1, and F = MSR divided by MSE; this is the F value you get in the ANOVA output. When we introduce both variables into the model, the corresponding ANOVA table for the two-independent-variable regression model is this one.
In this second table you first find the regression sum of squares, the error sum of squares and SST; p, the number of independent variables, gives the regression degrees of freedom, and n is the number of data points. For SST the degrees of freedom is n - 1, and for the error it is (n - 1) - p, that is, n - p - 1. MSR = SSR divided by p, MSE = SSE divided by n - p - 1, and finally we get the F value. This F value is nothing but the 32.88 we saw.
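To see where 32.88 comes from, here is a small sketch that rebuilds the F statistic from the sums of squares; n = 10 and p = 2 are taken from the degrees of freedom quoted in the lecture:

# Sketch only: reproduce the F statistic from the sums of squares.
from scipy import stats

n, p = 10, 2             # 10 observations, 2 independent variables (assumed)
ssr, sse = 21.6, 2.29    # approximate values quoted in the lecture

msr = ssr / p
mse = sse / (n - p - 1)
F = msr / mse
print(F)                                  # roughly 32.9
print(stats.f.sf(F, p, n - p - 1))        # p-value, well below 0.05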
(Refer Slide Time: 06:30)
Now we will go variable by variable and check whether each individual variable is significant or not. For any parameter beta i, H0: beta i = 0 and Ha: beta i not equal to 0. The test statistic is t = bi divided by Sbi; strictly it should be (bi - beta i) divided by Sbi, but because under the null hypothesis beta i = 0, what remains is just bi divided by Sbi. Here Sbi is the standard error of the coefficient of the ith independent variable.
Going back to the output, the standard error for x1 is about 0.010 and the standard error for the second independent variable is 0.221; those values can be read directly from the table. For the p-value approach, reject H0 if the p-value is less than or equal to alpha, as usual. For the critical value approach, reject H0 if the t value is below the lower-tail critical value or above the upper-tail critical value, where t alpha/2 is based on the t distribution with n - p - 1 degrees of freedom. We will do that now.
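In statsmodels these individual tests can be read straight off the fitted result (fit2 is again an assumed name), or recomputed by hand with the rounded numbers used below:

# Sketch only: individual t tests for each coefficient.
from scipy import stats

print(fit2.params)     # b0, b1, b2
print(fit2.bse)        # standard errors s_b
print(fit2.tvalues)    # t = b_i / s_bi
print(fit2.pvalues)    # compare each with alpha = 0.05

# The same calculation by hand for b1, with the lecture's rounded numbers:
t = 0.061135 / 0.00988
p_val = 2 * stats.t.sf(abs(t), df=10 - 2 - 1)   # n - p - 1 degrees of freedom
print(t, p_val)                                 # about 6.19 and well below 0.05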
(Refer Slide Time: 07:50)
Where do we get b1 = 0.0611? Going back to the t test for individual significance, from the Python model output we get b1 = 0.061135, b2 = 0.9234, sb1 = 0.00988 and sb2 = 0.2211. You can see b1 in the output: b1 is 0.0611 and Sb1 is about 0.0099, which rounds to 0.01. Similarly b2 is 0.92 and Sb2 is 0.221. The t statistic is b1 divided by Sb1, that is, 0.061135 divided by 0.00988, which is about 6.18; for the second coefficient, 0.9234 divided by 0.2211 gives about 4.18. You can verify this against the output.
So for the first variable t is 6.18 and for the second variable it is about 4.18. Look at the p-values: the p-value for the first variable is 0.000 and for the second variable it is also essentially 0.00, so by looking at the p-values alone we have to reject the null hypothesis. When we reject the null hypothesis, beta 1 is not equal to 0, which means there is a relation between x1 and the dependent variable y at the population level. Similarly, for the second independent variable we also reject the null hypothesis.
So beta 2 is not equal to 0, which means there is a relation between x2, the number of deliveries, and the dependent variable. This is the way to interpret the Python output of this multiple regression model. Now I am going to give a demo of the multiple regression model. Ok students, we have seen the theory behind the multiple regression model; now we come to the Python environment. I have already prepared the code, with the output shown. Suppose you want to give this demo in your class or to someone else taking this course.
(Video Start time: 10:15)
Go to the Kernel option and choose Restart and Clear Output. When you do this, only the code will remain; there will not be any output. Suppose you want to show others what the output of this code is going to be, you can do it this way. First we import the necessary libraries: import pandas as pd, and from statsmodels.formula.api import ols; this is used for doing the regression analysis.
As you know, pandas is used for reading and loading the files. Then from statsmodels.stats.anova import anova_lm, which is used to get the ANOVA table for the regression model, and import matplotlib.pyplot as plt, which is used for plotting figures. I have stored my data set in a file called tracking, and I load this data set into an object called df1. First we run the imports, then we look at the data set.
When you look at this data set, the first column is the index, the second is the driving assignment, the third is our first independent variable x1, which is the distance travelled, and the next independent variable is the number of deliveries, x2. The travel time is our dependent variable here.
What are we going to do? First we will see the relation between x1 and the dependent variable, then x2, the number of deliveries, versus travel time. Then we will take both independent variables together and see the effect on the dependent variable. First we do scatter plots so we can understand the trend between each independent variable and the dependent variable. In this plot x1 is taken as the independent variable and y is the dependent variable.
It seems there is some positive trend. Why is this scatter plot required? If there were no relationship at all, say the points lay in a horizontal band, there would be no need to do a regression analysis, because there would be no relation between that x and the dependent variable. Next we plot both variables: the green dots show one variable and the red dots show the other; this second plot is for the second independent variable alone. Now we build the first regression model, where we consider only one independent variable. The model is reg1 = ols(formula="travel_time ~ x1", data=df1): the formula puts the dependent variable travel_time first, then the tilde symbol, then x1, all in double quotes, with data=df1.
This tilde symbol will be familiar if you know R programming; R uses a similar syntax. When you run this you get the output of the regression model where only one independent variable is considered. The first task is to construct the regression equation, which here is y = 1.2739 + 0.0678 x1.
Second, where do we locate the R square? R square is 0.664; if R square is more than 50% the model is considered reasonably good, and at 66% this is acceptable. Then look at the F statistic: it is 15.81, and its probability is less than 0.05, so as a whole this regression model is acceptable. Remember that we are using only one independent variable; looking at its p-value, 0.004, which is less than 0.05, this individual variable is also a significant variable.
Next, what are we going to do? We introduce both independent variables together and see the impact on R square. The second model is reg2 = ols(formula="travel_time ~ x1 + x2", data=df1); I add the second independent variable with a plus sign, and if there were a third independent variable you would add it with another plus. Then fit2 = reg2.fit().
So print(fit2.summary()). Look at this: first comes the regression equation, y = -0.8687 + 0.0611 x1 + 0.9234 x2, where x2 is the number of deliveries. Compare the R square with the previous model: earlier R square was 0.664, and with two independent variables the R square has increased. So the goodness of fit of the model improves when we introduce more independent variables. There is another term, adjusted R square, which is adjusted for the number of independent variables.
702
When you keep on introducing more independent variables you have to monitor the values of R square and adjusted R square. When you introduce more independent variables R square will always increase, but adjusted R square will initially increase and after a certain point it starts decreasing; at that point you should stop adding more independent variables. When adjusted R square starts decreasing, the new variable which has been introduced into the model is not helping to explain the regression; instead it is disturbing the existing model, which means the new variable is a noise variable.
Look at the F value: F is 32.8, and the corresponding probability is less than 0.05, so for the model as a whole we reject the null hypothesis. What does this F statistic say? The F statistic tests the overall significance of the regression model: with both x1 and x2 in this regression equation, the model is valid. Then we go to the significance of each individual independent variable. For x1, the p-value is 0.000, which means we reject the null hypothesis that beta 1 equals 0.
When we reject it, beta 1 is not equal to 0, which means that even at the population level there is a relation between x1 and the y variable. Similarly the p-value for the second independent variable is also less than 0.05, so the second variable is also a significant variable in our regression model. Sometimes, among several independent variables, some independent variable may have a p-value greater than 0.05.
If it is more than 0.05, then when you write the regression model the corresponding independent variable has to be dropped. The meaning of dropping is this: with the help of sample data we can fit a regression equation considering all independent variables, but it cannot be generalized to the population level, because certain variables may not be significant at the population level. How do we know a variable is not significant? We look at the p-value.
If the p-value is more than 0.05 we accept the null hypothesis, and when we accept the null hypothesis, beta i = 0, so there is no relation between that independent variable and y at the population level. Among the other goodness-of-fit measures, we will see the meaning of Durbin-Watson in the coming classes. Similarly there are two more measures to check the goodness of the model, AIC and BIC; when we come to logistic regression I will explain the meaning and significance of AIC.
(Video End Time: 19:02)
(Refer Slide Time: 19:03)
So far we have studied regression analysis. Now, with the help of regression, I am going to show you how to solve an ANOVA problem. I have taken one sample problem; first I will solve it as an ANOVA problem with the help of Excel, and after that I will explain how to solve the same ANOVA problem with the help of regression analysis.
(Refer Slide Time: 19:34)
704
The problem is like this. There are three columns, A, B and C. A represents one type of assembly method, B another, and C a third. Under method A, the values 58, 64, 55, 66, 67 represent the number of products assembled per week; similarly under column B, 58, 69, 71, 64, 68 represent the number of products assembled per week. Here the product is a filtration system. As a manager I want to know which method produces the better result.
That means, if I follow method A, B or C, which one will help me assemble more products? This is a typical ANOVA problem. In ANOVA the null hypothesis is mu A = mu B = mu C, that is, the mean number of products assembled by method A equals the mean for method B equals the mean for method C.
Obviously the alternative hypothesis is that the means are not all equal. The purpose is to identify which assembly method is more productive. If I accept the null hypothesis, all three assembly methods give the same result and I cannot identify which method is better. If I reject the null hypothesis, I can go on to say which method is better, that is, which one gives a larger number of products assembled.
(Refer Slide Time: 21:44)
705
I am going to solve this with the help of Excel. Enter the data in three columns, A, B and C, then go to Data, Data Analysis, ANOVA. Choose ANOVA: Single Factor, because this is a one-way ANOVA. For the input range I select all these values, tick Labels in First Row, and when I click OK I get this output. When you look at the p-value here, it is 0.00382, which is less than 0.05, so I reject my null hypothesis.
When I reject the null hypothesis, the three assembly methods are not producing equal results; that is the output. To interpret it: the p-value is less than 0.05, so I reject my null hypothesis.
(Refer Slide Time: 22:48)
706
Now, what I obtained on the previous slide I am going to reproduce with the help of regression analysis. In the regression analysis I will use the concept of a dummy variable: because there are three assembly methods, 3 - 1 = 2 dummy variables are required. How do I create the dummy variables? I take one dummy variable for A and another for B.
A coding of 1,0 means the presence of a 1 in the A column, so the observation is associated with assembly method A; 0,1 means the presence of a 1 in the B column, so the observation is associated with assembly method B; and 0,0, the absence of a 1 in both columns, means the observation is associated with assembly method C. I am going to make this modification and then do the regression analysis. I will explain it first with the help of Excel, and after some time I will explain how to do it in Python.
(Refer Slide Time: 24:05)
This is the given data set after coding. For example, for the values 58 to 67, which belong to method A, I have entered 1 in the A dummy column and 0 in the B dummy column for each row; the presence of the 1 in the A column marks assembly method A. Similarly, for the B values I have entered 0 in the A column and 1 in the B column, so the presence of the 1 in the B column marks method B.
For the last portion I have entered 0,0 in both columns; the absence of a 1 in both columns marks assembly method C. Now for this data I do the regression analysis: go to Data Analysis, Regression. The y range is the observed values and the x range is the two dummy variable columns. When you run it you get this output, and when you look at the p-value you again get 0.0038, which means here also you reject the null hypothesis. I will explain how to interpret the coefficients 52, 10 and 14 in the coming slides.
(Refer Slide Time: 25:52)
The expected value of y is the expected number of units produced per week, and the regression equation is E(y) = beta 0 + beta 1 A + beta 2 B. If you are interested in the expected number of units assembled per week for an employee who uses method C, our procedure for assigning numerical values to the dummy variables means setting A = 0 and B = 0. When you substitute A = 0 and B = 0, the expected value of y is nothing but beta 0.
(Refer Slide Time: 26:35)
708
For method A, the dummy variable values are A = 1 and B = 0; when you substitute these into the regression equation you get beta 0 + beta 1, because A is 1 and B is 0. If I want the expected value for assembly method B, I set A = 0 and B = 1, and substituting into the regression equation gives beta 0 + beta 2.
(Refer Slide Time: 27:07)
Now, in the output, when you look at the coefficients the intercept is 52; that is the estimate of beta 0. The coefficient of A estimates beta 1 and the coefficient of B estimates beta 2.
(Refer Slide Time: 27:23)
709
So b0 is 52, b1 is 10 and b2 is 14. If you want to know the estimated value of y for assembly method A you compute b0 + b1, that is, 52 + 10 = 62. If you want the estimated value of y for assembly method B, it is b0 + b2 = 52 + 14 = 66. If you want the expected mean for method C you substitute A = 0 and B = 0 and get only beta 0, which is estimated by b0, and that value is 52.
(Refer Slide Time: 28:24)
Then we can do the significance test of beta 1 = beta 2 = 0, which we have seen already; we can use the t test or the F test. In the F test, if the p-value is less than 0.05 we can say the variables are jointly significant. Here, when you do it with the help of Excel, the p-value of the F statistic is less than 0.05, so we can say the regression coefficients of A and B are significant. So far I have done this with the help of Excel; now I am going to do the same problem in Python.
(Refer Slide Time: 29:01)
First we import the necessary libraries: import pandas as pd, import statsmodels.api as sm, from statsmodels.formula.api import ols, from scipy import stats, and import statsmodels.formula.api as smf. The data is stored in a file called chemitech. This is my data set; for this data set I am going to do an ANOVA, check the result, and then run the same problem with the help of regression analysis.
For the regression analysis I will use the concept called the dummy variable. First I run the ANOVA on the given data set. For that purpose I convert the data set into a long form, which I show in data_r: all the treatments are in one column and all the values are in another column. Then the model is fitted with ols, with value as the dependent variable, the tilde symbol, and treatment as the independent variable.
From that fit I obtain the ANOVA table, and the ANOVA table is this one. You see that the p-value is 0.0038; when I did the same problem with the help of Excel I got the same result. So what we conclude is that the means are not all equal, and we reject our null hypothesis. Now we are going to do this problem with the help of regression analysis using the concept called dummy variables: the treatment, that is, the assembly method, has three levels A, B and C, which I am going to convert into dummy variables.
Look at the dummy-coded columns A, B and C: the presence of a 1 in the A column represents A, a 1 in the B column represents B, and a 1 in the C column represents C. Because the treatment has three levels, we need only two dummy variables, so we drop the column C and use the two columns A and B. In Excel also, when I solved this, you saw this kind of data. So I drop the column C and add the dummy variables into the data frame called step_1.
You can see that the data frame has now changed: the value column is kept as it is and only the columns A and B are retained. For this data set we do the regression analysis: the results object is obtained by fitting an ordinary least squares model, with value as the dependent variable and a constant added (sm.add_constant) to the step_1 columns A and B as the independent variables; then I get the regression output. Look at this regression output: the probability, the p-value, is 0.00382, so here also we reject our null hypothesis.
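Putting the steps just described together, here is a compact sketch; the file name chemitech.xlsx, the column names A, B, C and the object names are assumptions for illustration:

# Sketch only: one-way ANOVA problem re-solved as a dummy-variable regression.
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("chemitech.xlsx")                 # columns A, B, C (assumed)
data_r = df.melt(var_name="treatment", value_name="value")   # long format

dummies = pd.get_dummies(data_r["treatment"], dtype=int)     # columns A, B, C
step_1 = pd.concat([data_r["value"], dummies.drop(columns="C")], axis=1)

results = sm.OLS(step_1["value"],
                 sm.add_constant(step_1[["A", "B"]])).fit()
print(results.summary())
# Expected pattern: intercept about 52 (method C mean), A about 10, B about 14,
# and an overall p-value of about 0.0038.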
Then look at the constant value: the constant is 52, which is b0; b1 is 10 and b2 is 14, and these values are used for interpreting the output. Now look at the variables A and B: both p-values are less than 0.05, so both variables are significant, and the values of b0, b1 and b2 can be interpreted as I explained on the slides. In this class we have seen how to do the significance tests for a multiple regression model.
We did those significance tests with the help of two tests, the F test and the t test. The F test is used to test the overall significance of the regression model, and the t test is used to check the individual significance of each independent variable. After that I took a sample problem, gave a demo of how to do multiple regression, and we interpreted the output of the multiple regression model.
712
In the next class we are going to do another regression model, one where an independent variable is categorical. So far, the dependent variable and the independent variables have both been continuous. There may be situations where an independent variable is a categorical variable; for example, gender is a categorical variable that can take only two options, male or female. In that case you have to make some adjustments to the existing regression model. How to do that, you will see in the next class. Thank you very much for listening.
713
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 35
Categorical Variable Regression
Dear students, in this lecture we will see how to handle a categorical variable in linear regression analysis. Whenever we do a linear regression analysis, the usual assumption is that the independent and dependent variables are continuous. Sometimes, however, we have to include a categorical variable among the independent variables. How to handle that kind of regression analysis is what we will see in this class.
(Refer Slide Time: 00:57)
The agenda of this lecture is to show how categorical variables are handled in regression analysis, to illustrate and interpret a regression with a categorical independent variable, and to do the same problem in Python, explaining how to code this categorical regression in Python programming.
(Refer Slide Time: 01:17)
714
Another name for a categorical variable in regression is a dummy variable, also called an indicator variable. It allows us to include categorical information in regression analysis. For example, gender is categorical data where only two levels are possible, male or female. A dummy variable takes only two values; for a gender category, for example, zero means the absence of the category and one means the presence of the category. Here zero is taken as the reference, and with respect to zero we compare what happens at the other level of the categorical variable.
(Refer Slide Time: 01:51)
We will take a problem, and with the help of that problem I will explain how to use a categorical variable in regression analysis and how to interpret it. This problem is taken from Statistics for Business and Economics by David Anderson, Sweeney and Williams, published by Cengage (editions from 2003 to 2013). Johnson Filtration, Inc. provides maintenance service for water filtration systems.
Customers contact Johnson with a request for maintenance service on their water filtration system. To estimate the service time and the service cost, Johnson's managers want to predict the repair time necessary for each maintenance request. Hence, the repair time in hours is the dependent variable. Repair time is believed to be related to two factors: one is the number of months since the last maintenance service was done; the second is the type of repair problem. Here the type of repair problem, mechanical or electrical, is the categorical variable.
(Refer Slide Time: 02:53)
This is the given data. Column 1 is the service call, column 2 is the number of months since the last service was done, column 3 is the type of repair, whether the repair concerns the electrical system or the mechanical system, and the last column is the repair time in hours, that is, how much time the repair took.
(Refer Slide Time: 03:18)
716
I have taken a screenshot of our Python code. First we import the necessary libraries: import pandas as pd, import matplotlib as mpl, import statsmodels.formula.api as smf, from sklearn.linear_model import LinearRegression, from scipy import stats, import seaborn as sns, import numpy as np, import matplotlib.pyplot as plt, and import statsmodels.api as sm. Then we load the data file, which I have saved under the name dummy.xlsx, into an object called tv1.
When you execute this, you can see the data file. At the end of the class I will give a demo of all the code done here, and there also you can follow the steps. Here we display the data.
(Refer Slide Time: 04:10)
717
First we do a scatter plot between months since last service and repair time in hours. When we look at this scatter plot, there seems to be a positive trend: when the months since the last service are more, the repair time in hours also gets larger. We then fit a simple linear regression considering only this one independent variable; here the independent variable is a continuous variable.
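A sketch of that scatter plot (tv1 and the column names are assumed to match the lecture's dummy.xlsx file):

# Sketch only: scatter plot of months since last service vs. repair time.
import matplotlib.pyplot as plt

plt.scatter(tv1["months_since_last_service"], tv1["repair_time_in_hours"])
plt.xlabel("Months since last service")
plt.ylabel("Repair time (hours)")
plt.show()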
(Refer Slide Time: 04:41)
When you do the regression analysis, this is the Python output. From statsmodels.formula.api import ols; ols is used for doing the regression analysis. Here the dependent variable is repair_time_in_hours, then the tilde sign, then the independent variable months_since_last_service. From the coefficients, the y-intercept and slope, I can write y = 2.1473 + 0.3041 x1, where x1 is the months since the last service was done. Look at the R square: it is 53.4%. Look at the p-value of this independent variable: it is significant because it is less than 0.05. Now we are going to do residual plots for this problem.
(Refer Slide Time: 05:38)
For this regression model, when you do the normal probability plot, all the points should align with the red line. What is happening is that many points are away from the red line. So we can say the residual plot is not satisfactory; the error is not following a normal distribution.
719
(Refer Slide Time: 06:10)
First we create a dummy variable for the categorical data. How do we create it? I call the new dummy variable just_dummies, created with pd.get_dummies, passing the data frame and the column that has to be converted into dummies. That column is the type of repair, where we have recorded whether the problem is related to mechanical or electrical.
When we display just_dummies, you can see that the one variable has been split into two columns: one for Electrical, where the presence of a 1 says electrical and the absence of a 1 says mechanical, and one for Mechanical. The two columns carry the same information, so we can use either of them in our new regression model: we can take electrical equal to 1, or equally mechanical can be taken as 1 and electrical as zero. There will be no problem in the interpretation either way.
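As a sketch of this step and the next (tv1 and the column names are assumed), the dummy coding and the two-variable model look like this:

# Sketch only: dummy-code the type of repair and fit the two-predictor model.
import pandas as pd
from statsmodels.formula.api import ols

just_dummies = pd.get_dummies(tv1["type_of_repair"], dtype=int)  # Electrical / Mechanical
step_1 = pd.concat([tv1, just_dummies["Electrical"]], axis=1)
step_1 = step_1.drop(columns=["type_of_repair"])

fit = ols("repair_time_in_hours ~ months_since_last_service + Electrical",
          data=step_1).fit()
print(fit.summary())
# Expected pattern: y_hat about 0.93 + 0.388*x1 + 1.26*Electrical, R-squared about 0.86.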
(Refer Slide Time: 08:06)
This is the data after converting to a dummy variable: months since last service, and a dummy column where 1 represents a problem related to electrical and 0 a problem related to mechanical. The repair time, Y, is our dependent variable.
(Refer Slide Time: 08:22)
For the regression analysis, just_dummies is created with pd.get_dummies on the type of repair column. What have I done here? I have dropped certain columns: I dropped the original type of repair column and added only the dummy variable corresponding to electrical repair. That is why this column appears; it is the last column, under the heading electrical, and it will be taken as an independent variable when we do the regression analysis.
(Refer Slide Time: 08:54)
Look at the R square: it is about 0.86. Previously, with only one independent variable, the R square was only 0.534; when we introduce the second variable the R square increases to 0.859. The probability corresponding to the F statistic is very low, about 0.005, so as a whole this regression model is significant. When we look at the individual independent variables, for example months_since_last_service, the first independent variable, its p-value is less than 0.01.
So we can say this variable is significant. Similarly, the second one, the type of repair, where electrical is coded as 1, is also below 0.05, so it is also a significant variable.
722
(Refer Slide Time: 10:42)
Now this is the regression equation: y hat = 0.93 + 0.388 x1 + 1.26 x2. If x2 equals 1 the problem is related to electrical; if x2 is 0 it is related to mechanical.
(Refer Slide Time: 11:15)
Now the most important part, interpreting the parameters. We know the expected value of y is beta 0 + beta 1 x1 + beta 2 x2. When you substitute x2 = 0 you get the equation for a mechanical problem: the beta 2 term drops out and only beta 0 + beta 1 x1 remains. When you substitute x2 = 1 you get the equation for an electrical problem.
So E(y | electrical) = beta 0 + beta 1 x1 + beta 2 (1); grouping beta 0 and beta 2 together, this is (beta 0 + beta 2) + beta 1 x1. Both equations have the same slope, beta 1; they differ only by the extra amount beta 2 in the y-intercept.
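Writing the two cases out side by side:

E(y \mid \text{mechanical}) = \beta_0 + \beta_1 x_1                 (x_2 = 0)
E(y \mid \text{electrical}) = (\beta_0 + \beta_2) + \beta_1 x_1      (x_2 = 1)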
(Refer Slide Time: 12:16)
Comparing equations 1 and 2, we see that the mean repair time is a linear function of x1 for both mechanical and electrical repairs. The slope of both equations is beta 1, but the y-intercepts differ: the y-intercept is beta 0 in equation 1 for mechanical repairs and beta 0 + beta 2 in equation 2 for electrical repairs.
(Refer Slide Time: 12:45)
724
The interpretation of beta 2 is that it indicates the difference between the mean repair time of an electrical repair and the mean repair time of a mechanical repair; the two mean times differ by beta 2 units. If beta 2 is positive, the mean repair time for an electrical repair is greater than that of a mechanical repair, and in our problem beta 2 is indeed positive. If beta 2 is negative, the mean repair time for an electrical repair is less than that of a mechanical repair. Finally, if beta 2 equals zero, there is no difference in mean repair time between electrical and mechanical repairs,
and the type of repair is not related to the repair time. This is most important, because after doing a dummy variable regression you have to interpret it. The first thing to look at is the sign of beta 2, whether it is positive or negative. And if beta 2 is 0, we can say that the time taken to repair the filter has nothing to do with the type of problem that occurred, whether it is a mechanical problem or an electrical problem.
(Refer Slide Time: 14:01)
725
In effect, using a dummy variable for the type of repair gives two estimated regression equations that can be used to predict the repair time, one corresponding to mechanical repairs and another corresponding to electrical repairs. In addition, b2 = 1.26, the value we saw earlier. From this 1.26 we learn that electrical repairs require, on average, 1.26 hours longer than mechanical repairs, because for electrical repairs we have taken x2 = 1 and for mechanical repairs x2 = 0.
So electrical repairs are coded as 1, and the meaning is that an electrical repair takes 1.26 time units longer than a mechanical repair. Look at this picture.
(Refer Slide Time: 14:55)
726
The green line is for mechanical repairs, obtained by substituting x2 = 0; the blue line is for electrical repairs. The intercepts are 0.93 for mechanical and 2.19 for electrical. Both slopes are the same, 0.388, for the two equations; only the intercepts differ.
(Refer Slide Time: 15:13)
The logic is that here we have seen only two levels; sometimes there may be more than two. A categorical variable with k levels must be modeled using k - 1 dummy variables. Previously there were two levels, so we took only one dummy variable, x2. If there are three levels you have to take 3 - 1, that is, two dummy variables. Care must be taken in defining and interpreting the dummy variables.
The care needed is in remembering which level has been assigned the value 1. For example, for electrical repair we take x2 equal to one, so that equation is interpreted with respect to x2 = 1.
(Refer Slide Time: 15:54)
727
We will take another problem. This problem is taken from Statistics for Management by Levin and Rubin. The manager of a small sales force wants to know whether the average monthly salary is different for males and females in the sales force. He obtained data on monthly salary and experience for each of 9 employees, as shown in the next slide.
(Refer Slide Time: 16:20)
Look at this: there are nine employees, with their salary, gender and experience. What we want to find out in this example is whether females are discriminated against. When can we say they are being discriminated against? If, even though they have experience equal to the males, they are getting a lower salary, then the females are being discriminated against.
728
(Refer Slide Time: 16:49)
First we import the data into an object called tbl2 using pd.read_excel; the Excel file where I have stored this problem is named dummy2. When I display it, you can see the employee salary, gender and experience. Next, what are we going to do?
(Refer Slide Time: 17:14)
We draw the scatter plot to see whether there is any trend between experience and salary. There seems to be a positive trend. Then we will look at the residual plot. But first, what is the regression equation?
729
(Refer Slide Time: 17:29)
See, R square is 0.926, and experience is the independent variable. Because its p-value is less than 0.05 we can say experience is significant. So we can write Y = 5.8 + 0.2332 experience; this is the regression equation. Now let us do the residual plot; for this we carry out the error analysis.
(Refer Slide Time: 17:58)
We do the residual analysis. These are the standardized residuals, with the zero line taken as the reference. The points should be randomly distributed around zero, but most of the points lie above the zero line. That means there is a problem with the assumptions; in other words, there might be some other variable, apart from experience, that affects the salary.
(Refer Slide Time: 18:25)
Look at the quantile plot. Here also most of the points are off the red line; they should sit on the red line, but they do not, so there is a problem with the assumption of equal variance. That means the errors do not have equal variance.
(Refer Slide Time: 18:42)
Now what do we do with this data? The categorical variable is included in the regression analysis by using a dummy variable: 0 for males and 1 for females, so male is taken as the reference category. With this coding a multiple regression model can be developed. We will do that now.
(Refer Slide Time: 19:08)
From the given data I have created one dummy variable for female, because there are two levels, female and male. The coding is that 0 is taken as male and 1 as female, and we will take this female column for our further analysis.
In creating the dummy variable for gender we follow the notation x2 = 0 for male and x2 = 1 for female. How do we create a dummy variable in Python? I use the variable just_dummies, created with pd.get_dummies; that is the command for making dummy variables. Then we take the female column for further analysis: zero means male, one means female.
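A sketch of this salary model (tbl2 and the column names salary, gender and experience are assumed to match the dummy2 file):

# Sketch only: dummy-code gender and regress salary on gender, then on both predictors.
import pandas as pd
from statsmodels.formula.api import ols

gender_dummies = pd.get_dummies(tbl2["gender"], dtype=int)   # columns Female / Male
data = pd.concat([tbl2, gender_dummies["Female"]], axis=1)

fit_gender = ols("salary ~ Female", data=data).fit()               # gender alone
fit_both   = ols("salary ~ experience + Female", data=data).fit()  # gender + experience
print(fit_gender.summary())   # Female coefficient about -1.175, not significant alone
print(fit_both.summary())     # both significant; Female coefficient about -0.789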
(Refer Slide Time: 20:10)
732
This is the Python output for that regression analysis. R square is 0.107. First I write the regression equation: Y = 9.7 - 1.1750 x1, where x1 here is the gender dummy. How do we interpret this result? You can see that x1 is not significant here. At the sample-data level, what does x1 mean? In this model the gender dummy is labelled x1 rather than x2.
When you substitute x1 = 1, the coefficient is negative. The meaning of the negative sign is that a female gets a lower salary compared to a male, by that many units. With that in mind, we go on to the interpretation.
(Refer Slide Time: 21:23)
733
The value of the intercept, 9.7, is the average salary for males, since gender has been coded 1 for female and 0 for male. The slope of -1.175 tells us that the average female salary is lower than the average male salary by 1.175. In other words, females are getting 1.175 units less salary compared to males. If the sign were positive we would interpret it as females getting more salary than males; because it is negative, we say that compared to males, females are getting less salary.
(Refer Slide Time: 22:09)
Now what are we going to do? Previously we considered only gender, using the female dummy. Now we introduce experience as well. When you introduce experience the regression equation becomes y = 6.2485 + 0.2271 experience - 0.7890 female. Now look at the p-values: they are now less than 0.05, so here gender is a significant variable.
On the previous slide, if you go back, the p-value was not significant, so we could not say there is gender discrimination: we can write a regression equation from the sample data, but at the population level there would be no connection between gender and salary, meaning both females and males are getting the same salary.
734
But when we introduce experience as one of the variables, gender also becomes significant; by considering experience and gender together, gender is now significant because its p-value is less than 0.05. Looking at the F statistic, its probability value is very low, so as a whole the model is significant, and individually also all the variables are significant.
(Refer Slide Time: 23:47)
What would happen if we used zero for females and one for males in our data? Would our results be any different? For that purpose we make a modification: the gender coding is simply reversed. You see that the intercept changes and the slope has the same magnitude but the opposite sign. What does that mean? Because male is now coded 1, it says that males get 1.175 units higher salary compared to females.
(Refer Slide Time: 24:27)
735
So is there any difference in the result? Not really. With the coding as above, the intercept changes to 8.525, and the slope for gender is still 1.175 but with a positive sign, reflecting that the average male salary is higher than the average female salary by 1.175. The predicted salaries from the model for males and females do not change, no matter how the dummy variable is coded.
(Refer Slide Time: 25:00)
Sometimes there may be more than one dummy variable. In our problem we had only two levels; sometimes there are three levels, and then we need two dummy variables, as we will see in an example. For gender we had only two categories, female and male, so we used a single 0/1 dummy variable. When there are more than two categories, the number of dummy variables to be used equals the number of categories minus 1; that is, the number of dummy variables is the number of levels minus 1.
(Refer Slide Time: 25:33)
Here is an example where the job grade has three levels: 1, 2 and 3. In this example the categorical variable job grade has three levels, where 1 means the lowest grade, 2 means medium and 3 means the highest grade. So our categorical data has three levels: level 1, level 2 and level 3.
(Refer Slide Time: 25:54)
There are 3 levels and we use only 2 dummy variables. Job grade 1 is coded 1,0; job grade 2 is coded 0,1; and job grade 3 is coded 0,0, so the 0,0 combination is taken as the reference. The coding 1,0 identifies category 1; 0,1 identifies category 2; and 0,0 identifies category 3. So there are 3 levels but only two dummy variables, dummy variable 1 and dummy variable 2.
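A sketch of that coding with pandas (the column name job_grade and the toy values are made up for illustration):

# Sketch only: a three-level factor needs k - 1 = 2 dummy variables.
import pandas as pd

jobs = pd.DataFrame({"job_grade": [1, 2, 3, 1, 2, 3]})
dummies = pd.get_dummies(jobs["job_grade"], prefix="job", dtype=int)
dummies = dummies.drop(columns=["job_3"])   # grade 3 becomes the 0,0 reference
print(dummies)                              # job_1 = (1,0), job_2 = (0,1), grade 3 = (0,0)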
(Refer Slide Time: 26:33)
Now, this is the new data set, showing how it can be used for doing a dummy variable regression. I have already explained the interpretation; now we will go for a demo of the code shown in this presentation.
(Refer Slide Time: 26:49)
I have already prepared the code for this. First I remove the output by clicking Kernel, Restart and Clear Output. Having cleared the output, I now run the cells: as you know, you press Shift+Enter, and again Shift+Enter, and this is the data. The data shows the service call, months since last service, and type of repair. Next we go for the scatter plot.
(Refer Slide Time: 27:17)
The scatter plot shows that there is a correlation between months since last service and repair time in hours. Next we go for a simple linear regression where only one independent variable is taken.
(Refer Slide Time: 27:35)
Looking at the output, the equation is y = 2.14 + 0.3041 x1, where x1 is the months since last service. Look at the p-value: it is less than 0.05, so this variable is significant. R square is also good, above 0.5. The probability of the F statistic is also less than 0.05, so as a whole the model is valid.
739
(Refer Slide Time: 28:04)
Now we plot the standardized residuals. Looking at the standardized residual plot, this is the pattern. How do we interpret it? All the points should lie between -2 and +2, but it seems some points go beyond -2, so the model assumptions are being violated. Now we go for the Q-Q plot.
(Refer Slide Time: 28:38)
Here also there is some pattern: several consecutive points lie below the line and many points lie above it. So there are problems with the variance of the error term as well.
740
(Refer Slide Time: 28:51)
Now we will convert the data into dummy variable. This is dummy variable electrical is taken as
one mechanical is taken as 0, now after converting into dummy variable to drop the column
dummy variable belongs to mechanical. After Dropping we can see this output, for duplicate this
now. There is no mechanical column only electrical column is there.
For this data set we will go for regression analysis with two independent variables: one is months_since_last_service and the other is the type of repair dummy, where electrical is coded 1 and mechanical (coded 0) acts as the reference. When we look at the p-values, both independent variables are below 0.05, so the model is significant. In this equation, when you substitute x2 = 1 you get the regression equation for electrical repairs; when you substitute x2 = 0 you get the regression equation for mechanical repairs.
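Here is a hedged sketch of what this dummy-variable regression step could look like in statsmodels; the file and column names are assumptions, not the exact code used in the demo:

# Sketch: OLS with one continuous predictor and one dummy (electrical = 1).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("repair_service.csv")               # hypothetical file name

# electrical -> 1, mechanical -> 0 (mechanical becomes the reference category)
df["Electrical"] = (df["Type_of_Repair"] == "electrical").astype(int)

X = sm.add_constant(df[["Months_Since_Last_Service", "Electrical"]])
y = df["Repair_Time"]

model = sm.OLS(y, X).fit()
print(model.summary())           # check the p-values of both predictors

# Substituting Electrical = 1 or 0 in the fitted equation gives the separate
# regression lines for electrical and mechanical repairs.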
Now we will go to another problem. This is our second problem, where salary is the dependent variable, and experience and gender are the independent variables. When we plot experience against salary, there is a positive relationship. First we regress salary on experience alone. You see that experience is significant because its p-value is less than 0.05, and R square is 0.26.
There is no problem with that. Now, when we look at the standardized residual plot, the points are not spread evenly; most of them are above zero, so there is no randomness in the distribution and there seems to be some pattern in the residuals. Next we check the normality of the error term: this also follows some kind of pattern and does not sit exactly on the diagonal line.
Now we will create a dummy variable for gender: one column for female and another for male, and we drop one of them, so that female is taken as 1 and male as 0. When you run the regression with gender alone, you see that y = 9.7 + (-1.175) female. So females appear to get less salary than males, but look at the p-value: when you consider only gender, the p-value is more than 0.05, so this gender variable is not significant.
In the previous step only gender was taken, and it was not significant because the p-value was 0.389. Now let us take gender and experience together and see what happens. When you take gender and experience together, the p-value for female is less than 0.05, and the p-value for experience is also less than 0.05. Both variables are significant, but females get less salary compared to males even when they have equal experience.
Now, what happens when you reverse the coding? So far we have taken female = 1 and male = 0. When you reverse the code so that male = 1 and female = 0, there is no change in the result; only the sign of the coefficient flips. Earlier it was -1.17; now female is taken as the reference, so we get the positive value +1.17.
The only difference is in the y intercept; otherwise all interpretations are the same. In this lecture I have taken two problems and, with the help of Python code, explained how to do dummy variable regression and interpreted the results. Gender is one example for dummy variable regression because there are two possibilities, male and female.
Similarly, the job category with category 1, category 2, and category 3 is handled through dummy variables, and we have learnt how to do regression analysis for it. In the next class we take a very important topic, logistic regression. Before seeing logistic regression, there is a principle called the maximum likelihood principle. I will explain what the maximum likelihood principle is with the help of some examples, and then we will go for logistic regression in the next class. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture - 36
Maximum Likelihood Estimation - I
In this lecture, we will go to a new way of estimating population parameters. That method is called maximum likelihood estimation. In our previous classes, we estimated population parameters with the help of least squares or with the help of the method of moments. The present method has a lot of advantages over those two methods, as we will see in this class.
(Refer Slide Time: 00:50)
The agenda for this class is to provide the intuition behind the maximum likelihood principle, along with theory and examples. Remember, in previous classes we predicted the population mean with the help of x bar and the population variance with the help of the sample variance, using the method of moments. In the regression model, we used the least squares estimate. What did we do there?
The sum of the squared errors is minimized when we draw the best regression equation. Instead of that, we are going to use another way of estimating population parameters: maximum likelihood estimation. This is very simple. With the help of the maximum likelihood estimate, you can estimate the parameter of any population, whatever the distribution may be. It may be binomial; it may be Poisson; it may be exponential.

What is the assumption in the least squares estimate? It is that the error term should follow a normal distribution. Whenever the error term does not follow a normal distribution, the maximum likelihood estimate is the better way. That we will see in this class.
(Refer Slide Time: 02:02)
What is maximum likelihood estimation? The method of maximum likelihood was first introduced by R. A. Fisher, a geneticist and statistician, in the 1920s. Most statisticians recommend this method, at least when the sample size is large, since the resulting estimators have certain desirable efficiency properties. Maximum likelihood estimation is a method to find the most likely density function that would have generated the data.

So with the help of MLE we can find out which distribution has generated the data; in other words, what kind of distribution this data set is suitable for. But one requirement of maximum likelihood estimation is that we have to make the distributional assumption first. So in advance, we have to assume which family of distributions has generated that set of data.
(Refer Slide Time: 03:01)
Let us see the intuitive view of likelihood. There is a data set shown at the bottom of the figure, and we want to know from which normal distribution it might have come. There are three possibilities: one is the green curve, whose mean is minus 2 and variance is 1; another is the blue curve, whose mean is 0 and variance is 1; the last one has mean 0 and variance 4. The most suitable one is the blue curve, because it covers the data set best. So the purpose of the maximum likelihood principle is this: given some data, from which distribution has it come? That kind of question can be answered with its help; equivalently, what kind of distribution is this data set most suited to? This is most useful for estimating many population parameters.
(Refer Slide Time: 04:01)
We will take one simple example, and with its help I will explain the application of maximum likelihood estimation. This problem is taken from the book Probability and Statistics for Engineering and the Sciences by Professor Jay L. Devore, 8th edition, Cengage publications. The problem says: a sample of 10 new bike helmets manufactured by a certain company is obtained.

Upon testing, it is found that the first, third, and 10th helmets are flawed, whereas the others are not. Let p be the probability of a flawed helmet, that is, p is the proportion of all such helmets that are flawed. Define Bernoulli random variables X1, X2, ..., X10 by X1 = 1 if the first helmet is flawed (there is a defect) and X1 = 0 if it is not defective; and so on, up to X10 = 1 if the 10th helmet is flawed and 0 if it is not.
(Refer Slide Time: 05:18)
Then, for the obtained sample, X1 = X3 = X10 = 1, because it is already given that only the first, third and 10th helmets have a defect, and the other seven Xi's are all 0. The probability mass function of any particular Xi is p^Xi (1 - p)^(1 - Xi), which becomes p if Xi = 1 and 1 - p if Xi = 0. Now, suppose the conditions of the various helmets are independent of one another; this assumption is very important. If they are independent, we can find their joint distribution: since the Xi's are independent, their joint probability mass function is the product of their individual probability mass functions.
(Refer Slide Time: 06:19)
Since it is a joint probability mass function, we multiply p^Xi (1 - p)^(1 - Xi) over all observations. When you simplify, you get p^3 (1 - p)^7. Call this equation 1. The value on the left hand side is called the likelihood; I will define it formally later.

Suppose p = 0.25; then the probability of observing the sample that we actually obtained is 0.002086. Like that, we can supply different p values. Suppose instead of 0.25 you supply p = 0.5; then the probability is about 0.00098. You see that at 0.25 it is about 0.002 and at 0.5 it has become very low. So somewhere between 0.25 and 0.50 we are going to get the value of p that maximizes our left hand side.

For what value of p is the obtained sample most likely to have occurred? That is the question: for what value of p is the joint pmf as large as it can be; in other words, what value of p maximizes equation 1? That p value is the maximum likelihood estimate.
(Refer Slide Time: 08:07)
So we can draw a graph of the likelihood by supplying different values of p in equation 1. This figure shows a graph of the likelihood as a function of p. It appears that the graph reaches its peak at about p = 0.3, which is equal to the proportion of flawed helmets in the sample. Now what are we going to do? We are going to take the log of this function.
(Refer Slide Time: 09:04)
I will explain; there is a reason for that. This figure shows a graph of the natural logarithm of equation 1. Since the logarithm is a strictly increasing function, finding the u that maximizes a function g(u) is the same as finding the u that maximizes ln g(u); the maximizer of g(u) and of ln g(u) is the same.

We can verify our visual impression by using calculus to find the value of p that maximizes equation 1. Working with the natural logarithm of the joint probability mass function is often easier than working with the joint pmf itself: since the joint pmf is typically a product, its logarithm is a sum. That is the advantage of taking the log. Previously, in equation 1, we got p^3 multiplied by (1 - p)^7. We are going to take the log of this. Because it is a multiplication, it becomes log of p^3 plus log of (1 - p)^7, which is 3 log(p) + 7 log(1 - p).
(Refer Slide Time: 11:12)
Next, we have to see where this function attains its maximum. From school calculus on maxima and minima: a point is a maximum if dy/dx = 0 and d²y/dx² is less than 0 there. This log likelihood is a function of p, so we differentiate it with respect to p. The derivative of 3 log(p) is 3/p, since the derivative of log(p) is 1/p. For the term 7 log(1 - p), the derivative of log(1 - p) is 1/(1 - p) times the derivative of (1 - p), which is -1, giving -7/(1 - p). So the derivative is (3/p) - 7/(1 - p). We equate this to 0, because d/dp should be 0, and solve for p; at that value the function is maximized.
(Refer Slide Time: 12:32)
Equating this derivative to 0 and solving for p gives 3(1 - p) = 7p, that is 3 = 10p, so p = 0.3. So what is happening? Previously we substituted different values by trial; now, using the concept of maxima, we have found that at p = 0.3 the function is maximized. It is called the maximum likelihood estimate, because it is the parameter value that maximizes the likelihood of the observed sample.

This p = 0.3 is an estimate of the population parameter. In general, you should check that the second derivative is negative to make sure a maximum has been obtained, but here it is obvious from the figure. Strictly speaking, we should differentiate one more time and verify the sign, but by looking at the figure it is clear that this point is the maximum. So this value p = 0.3 is called the maximum likelihood estimate. In other words, we have estimated the population parameter p of this binomial model to be 0.3. The advantage of the maximum likelihood approach is that it helps to estimate the parameters of any distribution.
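As a quick check (a sketch only, not part of the original demo), we can verify numerically that p = 0.3 maximizes the likelihood and reproduce the two values quoted above:

# Grid search over p for L(p) = p^3 (1 - p)^7 from the helmet example.
import numpy as np

p = np.linspace(0.01, 0.99, 9801)
L = p**3 * (1 - p)**7

print(p[np.argmax(L)])            # ~0.30, the maximum likelihood estimate
print(0.25**3 * 0.75**7)          # ~0.002086, the value quoted for p = 0.25
print(0.5**3 * 0.5**7)            # ~0.00098, the value for p = 0.5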
(Refer Slide Time: 14:01)
Suppose that, rather than being told the condition of every helmet, we had only been informed that three of the 10 were flawed. Then we would observe the value of the binomial random variable X = number of flawed helmets, whose pmf is C(10, x) p^x (1 - p)^(10 - x). When you substitute x = 3, this is C(10, 3) p^3 (1 - p)^7. We do not bother about the coefficient C(10, 3), because it does not involve p; it is just a constant. As the book says, the binomial coefficient is irrelevant to the maximization. So again, p = 0.3.
(Refer Slide Time: 14:44)
Next, we will define the likelihood function and the maximum likelihood estimate. There are two terms: one is the likelihood function, the other is the maximum likelihood estimate. First I will say what the likelihood function is, then we will go to the maximum likelihood estimate. Let X1, X2, ..., Xn have joint probability mass function or probability density function f(x1, x2, ..., xn; θ1, θ2, ..., θm), where the parameters θ1, θ2, ..., θm have unknown values. When x1, x2, ..., xn are the observed sample values and this function is regarded as a function of θ1, θ2, ..., θm, it is called the likelihood function. The maximum likelihood estimates θ1 hat, θ2 hat, ..., θm hat are those values of the θi's that maximize the likelihood function, so that

f(x1, x2, ..., xn; θ1 hat, θ2 hat, ..., θm hat) ≥ f(x1, x2, ..., xn; θ1, θ2, ..., θm) for all values of θ1, θ2, ..., θm.

When the Xi's are substituted in place of the xi's, the maximum likelihood estimators result. So what are we doing here? We are finding the joint probability mass function, and then, with the help of the sample values, we are estimating the population parameters.
(Refer Slide Time: 16:48)
How do we interpret this? The likelihood function tells us how likely the observed sample is as a function of the possible parameter values. So maximizing the likelihood gives the parameter values for which the observed data are most likely to have been generated, that is, the parameter values that agree most closely with the observed data. Put the other way, it tells us what kind of distribution or model this data set is most suitable for.
(Refer Slide Time: 17:22)
Now we will go for estimation of the Poisson parameter. Suppose we have data generated from a Poisson distribution and we want to estimate its parameter. The Poisson distribution has only one parameter; it is a one-parameter distribution in which the mean and the variance are the same. The probability of observing a particular value is P(X; u) = e^(-u) u^X / X!.

We obtain the joint likelihood by multiplying the individual probabilities together, so the first step is to find the joint probability function:

L(u) = (e^(-u) u^X1 / X1!) × (e^(-u) u^X2 / X2!) × ... × (e^(-u) u^Xn / Xn!).

Dropping the factorials (they do not involve u), this is the product over i of e^(-u) u^Xi; when you expand it, since there are n terms, it becomes e^(-nu) u^(n X bar). Next, we have to take the log of this, as we will see.
(Refer Slide Time: 18:45)
Note that in this likelihood function the factorials have disappeared. We do not bother about the factorials, because they provide a constant that does not influence the relative likelihood of different values of the parameter; whether we keep that constant or not will not affect the end result. It is usual to work with the log likelihood rather than the likelihood, because, as we have seen, differentiation is easier after taking the log, and maximizing the log likelihood is equivalent to maximizing the likelihood.

So take the log of the likelihood function: the log of e^(-nu) is -nu, and since the product becomes a sum under the log, we get log L = -nu + n X bar log(u). Now differentiate with respect to u: the derivative is -n + n X bar / u. When you equate it to 0, you get u = X bar. So the result is that the sample mean is the maximum likelihood estimate of the population mean of a Poisson distribution.
(Refer Slide Time: 20:04)
Now we will go to another distribution, the estimation of the exponential distribution parameter. Suppose X1, X2, ..., Xn is a random sample from an exponential distribution with parameter λ. Because of independence, the likelihood function is the product of the individual pdf's: λe^(-λx1) × λe^(-λx2) × ... × λe^(-λxn). When you simplify that, it becomes λ^n e^(-λΣxi). When you take the log of this, it becomes n ln(λ) - λΣxi. Then its derivative with respect to λ, which is n/λ - Σxi, has to be equated to 0.
(Refer Slide Time: 20:59)
So when you equate it to 0, λ = n / ΣXi, which is nothing but the reciprocal of the sample mean. That is the result: the maximum likelihood estimate of the exponential parameter λ is the inverse of the sample mean (equivalently, the sample mean estimates 1/λ, the mean of the exponential distribution).
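Similarly, a small sketch (with made-up data) confirming numerically that the reciprocal of the sample mean maximizes the exponential log likelihood:

# Minimize the negative exponential log likelihood over lambda.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.8, 1.5, 0.3, 2.2, 1.1])       # assumed sample
neg_loglik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())                     # both approximately 0.847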
(Refer Slide Time: 21:24)
Now we will go for estimation of the parameters of a normal distribution. This is very interesting, because we can say the normal distribution is the father of all distributions: many times, if you do not know the nature of the distribution, you can assume it follows a normal distribution. We know that its probability density function is f(x) = (1 / √(2πσ²)) e^(-(x - u)² / (2σ²)).
Like that, we write term 1 with x1, term 2 with x2, and so on up to the nth term with xn, and the joint probability density function is the product of these n terms. When you simplify, it is

L(u, σ²) = (1 / (2πσ²))^(n/2) e^(-Σ(xi - u)² / (2σ²)).

When you take the log, the first factor gives (n/2)(ln(1) - ln(2πσ²)); since ln(1) = 0, that is -(n/2) ln(2πσ²). The exponential term simply contributes its exponent. So the log likelihood is

ln L = -(n/2) ln(2πσ²) - Σ(xi - u)² / (2σ²).

Now what has to be done? This log likelihood has to be partially differentiated with respect to u and σ² and equated to 0.
(Refer Slide Time: 23:19)
Then you will get the parameters. To find the maximizing values of u and σ², we must take the partial derivatives of the previous function with respect to u and σ², equate them to 0, and solve the resulting two equations. Omitting the details, we get this result: the maximum likelihood estimate of the population mean is the sample mean, u hat = X bar, and the maximum likelihood estimate of the population variance is σ² hat = Σ(Xi - X bar)² / n. This agrees with what we saw earlier when studying sampling distributions, where the sample mean was used to estimate the population mean. But one point you should be very careful about: the maximum likelihood estimate of σ², which divides by n, is not an unbiased estimator.
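To see the bias point concretely, here is a quick numpy check (a sketch; the five values are the ones used in the next lecture's example):

# MLE variance divides by n (ddof=0); the unbiased sample variance divides by n-1.
import numpy as np

x = np.array([1, 4, 5, 6, 9])
print(np.var(x, ddof=0))      # MLE variance, divides by n      -> 6.8
print(np.var(x, ddof=1))      # unbiased variance, divides by n-1 -> 8.5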
Usually we look for unbiased estimators, but here the MLE of the variance is not unbiased. So two different principles of estimation, unbiasedness and maximum likelihood, yield two different estimators. In this class, I started with the intuitive meaning of the maximum likelihood principle. Then I explained how to find the population parameters of different distributions: first the parameter of the binomial distribution, next the parameter of the Poisson distribution, then the parameter of the exponential distribution, and finally the parameters of the normal distribution. In the next class, we will take one worked example of estimating the parameters of a normal distribution. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture - 37
Maximum Likelihood Estimation - II
In the previous class, we saw estimation of population parameters with the help of the maximum likelihood principle. In this class, we will take two examples. One is to estimate the population parameters of a normal distribution. The second one is to estimate the population parameters of a regression equation.
(Refer Slide Time: 00:48)
At the end, we will have a demo using Python. The agenda for this class is a Python demo of estimating the population parameters of a regression equation. Let us take one example.
(Refer Slide Time: 01:00)
Example 1 is estimation of the population parameters of a normal distribution. Let us explain the basic idea of maximum likelihood estimation using a simple problem. Assume that the variable X follows a normal distribution; the slide shows a small data set with an Id column, and the observed values of the variable are 1, 4, 5, 6, 9. The density function of a normal distribution with mean µ and variance σ² is (1 / √(2πσ²)) e^(-(x - µ)² / (2σ²)), where x ranges from minus infinity to plus infinity. We are going to substitute these x values into this equation, multiply the resulting terms, and take the log; then we partially differentiate with respect to µ and σ², equate to 0, and we will get the population parameters.
(Refer Slide Time: 02:05)
Suppose the data are plotted on a horizontal line, like this. Think about which distribution, A or B, is more likely to have generated the data. You can pause the video and think.
(Refer Slide Time: 02:21)
The answer to the question is A, because the data are clustered around the center of the
distribution A, but not around the center of the distribution B. This example illustrates that by
looking at the data, it is possible to find the distribution that is most likely to have generated the
data. Now, I will explain exactly how to find the distribution in practice.
(Refer Slide Time: 02:48)
Maximum likelihood estimation starts with computing the likelihood contribution of each observation. The likelihood contribution is the height of the density function at that observation; I will show you in the next slide. We use Li to denote the likelihood contribution of the ith observation.
(Refer Slide Time: 03:07)
Look at this picture. Observation 1 contributes a certain height of the density curve: the likelihood contribution of the first observation is that height, the likelihood contribution of the second observation is its height, and similarly for the third, fourth and fifth. For the first observation we substitute x = 1 in the density; for the second data point we substitute x = 4; for the third, 5; the next one, 6; and the last one, 9.
(Refer Slide Time: 03:40)
Then you multiply the likelihood contributions of all the observations. This is called the likelihood function, which we denote by L. So the likelihood function is the product of the Li; the notation means that you multiply from 1 to n. In our example n = 5, so you multiply five terms.
(Refer Slide Time: 04:03)
So the likelihood function, defined in terms of µ and σ, is the product over i = 1 to 5 of Li; when you expand it, it is L1 × L2 × L3 × L4 × L5. We have assumed the data come from a normal distribution, whose probability density function is (1/√(2πσ²)) e^(-(x - µ)²/(2σ²)). For the first data point the factor is (1/√(2πσ²)) e^(-(1 - µ)²/(2σ²)); for the second, (1/√(2πσ²)) e^(-(4 - µ)²/(2σ²)); and so on for all data points, the last one being (1/√(2πσ²)) e^(-(9 - µ)²/(2σ²)). So the likelihood function depends on the values of µ and σ²; it is a function of µ and σ².
(Refer Slide Time: 05:04)
In the previous slide, we found the joint probability density function of the normal distribution for the different values of x. What do we have to do? We take the log of that function, partially differentiate it with respect to µ and σ, and equate the derivatives to 0; then we get the population parameters µ and σ. The values of µ and σ that maximize the likelihood function can also be found with the help of Python; at the end of the class, I am going to show one example for a regression equation. The values of µ and σ obtained this way are called the maximum likelihood estimators of µ and σ. Many maximum likelihood problems cannot be solved by hand, so you need an iterative numerical procedure on a computer. I will take one demo at the end of the class.
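As a hedged illustration of such an iterative procedure (a sketch, not the exact code used later in the demo), one could maximize the normal log likelihood for the data 1, 4, 5, 6, 9 numerically with scipy:

# Numerical MLE for a normal distribution: minimize the negative log likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([1, 4, 5, 6, 9])

def neg_loglik(params):
    mu, sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize(neg_loglik, x0=np.array([0.0, 1.0]),
               method="L-BFGS-B", bounds=[(None, None), (1e-6, None)])
print(res.x)   # approximately [5.0, 2.608]: mu_hat = x bar, sigma_hat = sqrt(34/5)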
(Refer Slide Time: 06:01)
Now we will estimate the parameters of a simple regression equation. When I started regression, I explained that the parameters of the regression equation were obtained using the least squares method, where the sum of the squared errors is minimized. The fixed part of the model is the expectation: the expected value of Y is β0 + β1x. The residual is actual minus predicted, yi minus the expected value.

In the method of least squares, we found the values of the parameters β0 and β1 that make the sum of the squared residuals as small as possible. The usual least squares model assumes the error term is normal; that is, the residuals are assumed to be drawn from a normal distribution. Whenever this assumption is violated, we cannot go for the least squares method. We should go for some other method, namely the maximum likelihood estimate, because in the next class we are going to study logistic regression, where the error term does not follow a normal distribution.
(Refer Slide Time: 07:19)
So the maximum likelihood estimate can be applied to models with any probability distribution.
That was the advantage of this maximum likelihood method.
(Refer Slide Time: 07:27)
Now we will estimate the parameters of a regression equation. We are interested in estimating a model of the form y = β0 + β1x + u, where u is the error term. Estimating such a model can be done using maximum likelihood estimation.
(Refer Slide Time: 07:50)
Suppose that we have the following data, where x is the independent variable and y is the dependent variable. We are interested in estimating the population parameters β0 and β1. Let us assume that the error term follows a normal distribution with mean 0 and variance σ².
(Refer Slide Time: 08:13)
What is the error? The error is actual minus predicted. The actual value is y and the predicted value is β0 + β1x, so the error is y - (β0 + β1x) = y - β0 - β1x, and it follows a normal distribution with mean 0 and variance σ². The likelihood contribution of each data point is the height of the density function at the value y - β0 - β1x; we have simply brought the minus sign inside. This was the setup of the example.
(Refer Slide Time: 08:49)
When you look at this, for the first data point the y value is 2 and the x value is 1, so the error is 2 - β0 - β1. For the second data point the y value is 6 and the x value is 4, so it is 6 - β0 - 4β1, and so on. This height is the contribution of each observation to the likelihood function. The likelihood contribution of the second observation, with y = 6 and x = 4, is (1/√(2πσ²)) e^(-(6 - β0 - 4β1)² / (2σ²)). The density of the error term can be written this way; we will now go on to the other data points.
(Refer Slide Time: 09:46)
In the previous slide we did this only for one observation. Now we expand the function for the whole data set: the product goes from i = 1 to n, L1 × L2 × L3 × ... × L5. For the first data point, the factor is (1/√(2πσ²)) e^(-(2 - β0 - β1)² / (2σ²)). For the second data point it is the same expression with (6, 4); for the third, (7, 5); for the fourth, (9, 6); and for the fifth, (15, 9). This function is the likelihood function. What do we generally do? We take the log of it, partially differentiate with respect to β0, β1 and σ, and equate the derivatives to 0; then we get the estimates of β0, β1 and σ.
(Refer Slide Time: 10:46)
You choose the values of β0, β1 and σ that maximize the likelihood function. So what are we going to do? For the same problem, with the help of Python, we are going to estimate β0, β1 and σ from the data set we have considered. We will switch to Python.
(Refer Slide Time: 11:06)
Now we will see the application of maximum likelihood estimation to a regression equation. I have explained the theory; now we take one example, and I will explain how to estimate the parameters of a regression equation using the maximum likelihood principle. The file name I have taken is MLE. I start by importing the necessary libraries: import numpy as np; and, new this time, from scipy.optimize import minimize, which imports a function that minimizes another function; and import scipy.stats as stats. Then I have imported the data: Y is the dependent variable, X is the independent variable, and there are 5 data points.
(Refer Slide Time: 11:58)
For this X and Y, I have constructed a regression equation using our least squares method. The fitted regression equation is y = -0.2882 + 1.6176x. Let me explain this portion: -0.2882 is the y intercept, b0, and 1.6176 is the slope, b1. The sample b0 helps to estimate the population β0, and similarly the sample b1 can be used to estimate the population β1. So this is the answer when we use the least squares method. Now we are going to use the concept of maximum likelihood estimation and verify whether we get the same answer.
(Refer Slide Time: 13:24)
This is the answer: we got the y intercept as -0.2882 and the x1 coefficient b1 as 1.6176.
(Refer Slide Time: 13:32)
Another parameter required for the maximum likelihood estimate is the standard deviation of the error term. For that, you can type e = modl2.resid to get e, the residuals, and then find the standard deviation of this error term; we get about 0.60. This also we are going to predict: we are going to estimate b0, b1 and the standard deviation of the error term using maximum likelihood estimation.
(Refer Slide Time: 14:09)
This is the code for parameter estimation of a regression equation with the help of maximum likelihood estimation: import numpy as np, from scipy.optimize import minimize, import matplotlib.pyplot as plt. I am defining a function, lik(parameters), that computes the likelihood: m is the slope, b is the y-intercept, and sigma is the standard deviation of the error term. Then, for i in np.arange(0, len(x)), we compute the expected y value, which is m*x + b, and accumulate the log likelihood term. This term is nothing but the whole equation we saw earlier; that is why I used a for loop, so that for each observation i = 1 to 5 I compute the contribution and combine them. Finally the value l is returned. These are our x values, 1, 4, 5, 6, 9, and the y values, 2, 6, 7, 9, 15. Then lik_model = minimize(lik, np.array(...), ...); the array is just my initial guess, where the first entry is the slope, the second is the y intercept and the third is the standard deviation of the error term. I am going to use a particular optimization method; there are different methods.
I will show you the different minimization methods later; here the method used is 'L-BFGS-B'. When you run it, this is the answer: 1.61..., the slope, which is the x coefficient; then the y-intercept; then the standard deviation of the error term. When you go back, we can verify this. The coefficient of x was 1.6176; here also we get 1.6176. The y-intercept was -0.288; here also we get the same y-intercept. And the standard deviation of the error term is 0.604; here also that value is the same. So the point we are learning is that the same problem can be solved with the least squares estimation method and with the maximum likelihood estimation method, and both ways you get the same answer. Now we will take another example, which we saw when I was teaching simple linear regression.
(Refer Slide Time: 17:33)
You can recall the auto sales example. An auto company periodically has a special week-long sale and, as part of the advertising campaign, runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown in the next slide.
(Refer Slide Time: 17:54)
What we have seen is that the number of TV ads is taken as the independent variable and the number of cars sold as the dependent variable. For this data set, first we will fit a regression model with the help of least squares estimation; second, we will do it with the help of maximum likelihood estimation. We will compare the answers; both will be the same. So first, the least squares method.
(Refer Slide Time: 18:20)
So I have imported the data: TV ads and cars sold.
(Refer Slide Time: 18:26)
When I run it, you see OLS, the ordinary least squares method, and this is the answer: y = 10 + 5 × TVads, where x1 is TV ads. Now for this model we will find the error term.
(Refer Slide Time: 18:47)
For finding the error term: b0 is 10 and b1 is 5. To get the residuals I save them in an object, e = modl2.resid, and this is the error term. If I find the standard deviation of this error term, we get 1.67. So we are going to recover this standard deviation of the error term, the y intercept and the coefficient of x with the help of maximum likelihood estimation, and there also we will get the same answer.
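For completeness, a short sketch of this OLS step; the five (TV ads, cars sold) pairs are assumed from the standard textbook example and reproduce the quoted results:

# OLS fit and residual standard deviation for the auto sales example.
import numpy as np
import statsmodels.api as sm

tv_ads = np.array([1, 3, 2, 1, 3])          # assumed textbook values
cars_sold = np.array([14, 24, 18, 17, 27])

X = sm.add_constant(tv_ads)
modl2 = sm.OLS(cars_sold, X).fit()
print(modl2.params)                          # [10., 5.]  ->  y = 10 + 5 * TVads

e = modl2.resid
print(np.std(e))                             # ~1.673 (ddof=0, the MLE-style estimate)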
(Refer Slide Time: 19:20)
What have we done? The same thing: since I have already defined the function, it is easy; you just replace the data. Parameter index 0 is m, 1 is b, and 2 is sigma, and this is our likelihood function. These are my x values, these are my y values, and this is the function to minimize. I will run the program in a few minutes; this is just a screenshot of the Python program for explanation. You see that in the final result the slope is 5, the second value, the y intercept, is 10, and the standard deviation of the error term is 1.67; exactly what we obtained using the least squares method. Now I will go to the Python environment, run it and explain. One more thing to remember: 2, 2, 2 is just my guessed starting value. While running the program I am going to change these values, and we will still get the same answer; you can give any starting value and at the end you will get the same answer. Now we will go to the Python environment and do the program.
(Refer Slide Time: 20:45)
I have explained how to use maximum likelihood estimation to estimate the parameters of a regression equation, and I have shown the screenshot of the program in my presentation. Now, in the Python environment, we will run this code and I will explain how it works. I have imported the necessary libraries, stored my data in the file called MLE, and displayed it. In this data, y is the dependent variable and x is the independent variable. For this data set, we first construct a regression equation using the least squares method.
(Refer Slide Time: 21:27)
This was the output of least square method regression model. Here the y intercept is – 0.2882.
The coefficient of x is 1.6176. So how can we write the regression equation? Y = -0.2882 +
1.6176x.
(Refer Slide Time: 21:49)
Next: this is the y intercept and this is the coefficient of x1. Then we find the residuals using the command .resid, and for this error term we find the standard deviation, which is 0.60488. Now there are three things we have obtained: the coefficient, the y intercept, and the standard deviation of the error term. Next, using the concept of maximum likelihood estimation, we will verify this answer: whether we get the same standard deviation of the error, the same coefficient, and the same y intercept.

I have defined a function named lik. Inside it I unpack the slope, the y intercept and sigma, loop for i in np.arange(0, len(x)), and compute the expected y value as m*x + b. The quantity l that it builds up is the negative log likelihood; I explained this formula in my presentation. I run this over all values of x and the function returns the value l. Then I minimize this negative log likelihood, which is the same as maximizing the likelihood. The values 2, 2, 2 are my random guesses for the parameters m, b and the standard deviation of the error term. Then I display the fitted model, and it says that my slope is 1.617.
You see that with the OLS method also the slope is 1.6176. The constant here is -0.288; in the OLS method the constant is -0.2882. Next we estimated the standard deviation of the error term as 0.604; here also we got 0.604. Now I am going to change the starting values: for example, instead of 2 I give 3, and another I give 4. Let us see what we get. Again, there is no change in the answer. So the values in this np.array are only our initial guesses for the parameters, and at the end we get the same answer. This was example number 1. Now I will go to another example.
(Refer Slide Time: 24:33)
I solved this example earlier with simple linear regression using the least squares method when I was explaining linear regression. I will clear my output. Here also we are going to do the same thing as in the previous problem: fit the regression model using OLS, then check that answer with the help of our maximum likelihood estimation method. I import the library and call the data. In this data set, TV ads is the independent variable and cars sold is the dependent variable. Now I construct a regression equation using OLS, the ordinary least squares method. In this answer, 10 is the constant and 5 is the coefficient of TV ads, so we can write y = 10 + 5 × TVads. Next, I am going to find the error term.
(Refer Slide Time: 25:36)
This is the error term. Now I find the standard deviation of the error term, which is 1.67. These parameters, which I got with the least squares method, I am now going to recover with the maximum likelihood estimate. I call the same function. The important part of this function is the term len(x)/2 * np.log(2 * np.pi) in the negative log likelihood; I explained in my slides that when the errors are normal, this is the formula to use for finding the parameters. You can refer to the previous slides. The values 2, 2, 2 are the initial guesses; for example, I will change one of them to 5. Now let us run it. You see 4.99; our actual answer is 5, so we got 4.99, approximately correct.
The y-intercept is 10; here also we got a y intercept of 10. The standard deviation of the error term is 1.67; here also we get 1.67. So in this way we have verified, with the help of this Python program, that whatever answer you get with OLS is the same as with maximum likelihood estimation. The maximum likelihood estimation method for estimating population parameters is very generic, and most software packages use it. As I told you, there are different minimization methods; if you want to see them, simply type minimize? with a question mark and you will get the list of methods.
(Refer Slide Time: 27:44)
Some of the methods are Nelder-Mead, Powell, CG, BFGS, and Newton-CG. You can change the method argument to another value and you will still get the same answer. In this class, I have given examples of how to use the maximum likelihood estimation method for estimating the population parameters of a regression equation. I explained the theory and took two examples. For both examples, I concluded that you can use the OLS method, that is, ordinary least squares, or the maximum likelihood method, and both ways you get the same answer. I then explained how to do the coding, run it and get the answer using Python. This class is the basis for the next class, logistic regression, because in logistic regression the method used to estimate the population parameters is maximum likelihood estimation. In the next class, by applying this MLE principle, we will estimate the population parameters of logistic regression.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture 38
Logistic Regression - I
In this class, we are going to a new topic, logistic regression. I am going to explain when you should go for logistic regression instead of linear regression. Generally, when we do linear regression, both the independent and dependent variables are continuous. When there is categorical data in the independent variables, we used dummy variable regression. There is also a chance that the dependent variable is categorical; in that case, we should go for logistic regression. I will explain logistic regression with the help of an example, then I will interpret the Python output, and at the end I will explain the theory behind logistic regression.
(Refer Slide Time: 01:14)
The class agenda is that we will build a logistic regression model, then I will do a demo of the logistic regression model, and we will see applications of logistic regression.
(Refer Slide Time: 01:24)
In many regression applications, the dependent variable may assume only two discrete values. In linear regression, the dependent variable was continuous and the independent variables were also continuous, but sometimes the dependent variable takes discrete values, for example gender, good or bad, or success or failure. In a linear regression y = a + b1x1 + b2x2, where x1 and x2 are independent variables and y is the dependent variable, if there is categorical data among the x's we go for dummy variable regression. Sometimes the dependent variable itself is categorical, for example coded 0 or 1: it may be gender, the quality of a product (good or bad), or whether a person will buy the product or not. Whenever there are two options like this in the dependent variable, we should go for logistic regression.

For instance, a bank might like to develop an estimated regression equation for predicting whether a person will be approved for a credit card or not. Here the dependent variable y is whether the person will be approved for the credit card; there are two possibilities, so you should go for logistic regression. The dependent variable can be coded as y = 1 if the bank approves the request for a credit card and y = 0 if the bank rejects the request.
Using logistic regression, we can estimate the probability that the bank will approve the request for a credit card, given a particular set of values for the chosen independent variables. The same applies when someone applies for a loan: whether this person will repay the loan or not has two possibilities, so there also we can use logistic regression. Somebody applying for a job will either get the job or not get the job; for that purpose too you can use logistic regression. In your context, we can ask whether you will get a placement or not. There are only two possibilities, a person may get or may not get the placement, and we can have different independent variables. Which independent variables help you get the placement is the kind of problem that can be solved with the help of this logistic regression model.
(Refer Slide Time: 04:06)
Let us take one example. This example is taken from the book Statistics for Business and Economics, 11th Edition, by David Anderson, Dennis Sweeney and Thomas Williams. I suggest this book; it is an excellent book for understanding the concepts, and this problem is taken from it. Let us consider an application of logistic regression involving a direct mail promotion used by Simmons Stores.

Simmons owns and operates a national chain of women's apparel stores. 5000 copies of an expensive four-colour sales catalog have been printed, and each catalog includes a coupon that provides a $50 discount on purchases of $200 or more. The catalogs are expensive, and Simmons would like to send them only to those customers who have the highest probability of using the coupon. So we have to identify which kind of customers to target so that they will use the coupon.
(Refer Slide Time: 05:22)
What are the variables in this problem? Management thinks that annual spending at Simmons stores and whether a customer has a Simmons credit card are two variables that might be helpful in predicting whether a customer who receives the catalog will use the coupon. So there are two independent variables: one is annual spending; the second is whether the person has a Simmons credit card or not.

Simmons conducted a pilot study using a random sample of 50 Simmons credit card customers and 50 other customers who do not have the Simmons credit card. Simmons sent the catalog to each of the hundred customers selected. At the end of the test period, Simmons noted whether each customer used the coupon or not. Using this data set, they are going to construct a regression model so that they can decide to whom the catalog should be sent, so that the coupon will be used and sales will increase.
(Refer Slide Time: 06:29)
This is the data set. I have shown 10 customers, but there are 100 observations: 50 are customers who do not have the credit card, and the remaining 50 are those who have it. Spending is one independent variable. For the credit card variable, 1 means having the credit card and 0 means not having it. For the coupon, 0 means the customer did not use the coupon and 1 means they used it. The coupon variable is going to be our dependent variable; spending is one independent variable, x1, and the card is the other independent variable, x2. You see that x2 is a categorical variable. In general, if there are several levels, we would have to convert it into dummy variables and then run the analysis, but in this problem the coding is already given: whether the person has the credit card or not.
(Refer Slide Time: 07:30)
Now let us go through the explanation of the variables. The amount each customer spent last year at Simmons is shown in thousands of dollars, and the credit card information has been coded as 1 if the customer has the Simmons credit card and 0 if not. So there are two variables: how much was spent last year, and whether the person has the credit card (1 if yes, otherwise 0). In the coupon column, which is the dependent variable, 1 is recorded if the sampled customer used the coupon and 0 if not.
(Refer Slide Time: 08:10)
First let us see what the logistic regression equation is. If the two values of the dependent variable y are coded as 0 or 1, the expected value of y in the equation below provides the probability that y = 1, given a particular set of values for the independent variables x1, x2, ..., xp. The logistic regression equation is

E(y) = e^(β0 + β1x1 + β2x2 + ... + βpxp) / (1 + e^(β0 + β1x1 + β2x2 + ... + βpxp)),

where there are p independent variables. This gives us a predicted probability for y, which itself takes the value 0 or 1.
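A tiny sketch of this equation in Python, with purely illustrative coefficients (not estimates from the Simmons data):

# Turn a linear combination of the x's into a probability between 0 and 1.
import numpy as np

def logistic_probability(x, beta):
    # E(y) = exp(b0 + b1*x1 + ...) / (1 + exp(b0 + b1*x1 + ...))
    z = beta[0] + np.dot(beta[1:], x)
    return np.exp(z) / (1 + np.exp(z))

print(logistic_probability(np.array([2.0, 1.0]), np.array([-2.0, 0.3, 1.1])))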
(Refer Slide Time: 09:04)
You may ask the question: why not use the simple linear regression equation here? The simple linear regression equation cannot be used for this problem because the dependent variable has only two possibilities. One assumption of simple linear regression is that the error term should follow a normal distribution, but here the y variable takes only two values, so it follows a binomial distribution, and the error term of a logistic regression follows a binomial distribution as well. So you cannot use simple linear regression whenever the y value is a categorical variable. Because the interpretation of E(y) is a probability, the logistic regression equation is often written as E(y) = P(y = 1 | x1, x2, ..., xp). So we are going to find the expected value of y.
(Refer Slide Time: 10:08)
This example shows why we cannot use linear regression. There is no way to capture the relationship with a straight line here, because the x variable, spending, is continuous while the y variable is whether the person used the coupon or not. Customers with low spending have used the coupon, and customers with high spending have also used the coupon. So fitting a linear regression to this data set has no meaning. One way to fit a curve to this kind of data is the S-shaped curve that I will show you in the next slide.
(Refer Slide Time: 11:01)
Assume there is one independent variable whose range is, say, 0 to 5, and the expected value of y is a probability. You can see that between x = 3 and 4 the curve approaches its maximum: whenever the value of x is between 3 and 4, there is a higher chance that the expected value of y is close to 1. When the x value is below 2, there is a higher chance that the expected value is close to 0. Notice the rate of change as well: it is low between 1 and 2, high between 2 and 3, and low again between 3 and 4. This is the S-shaped curve of a logistic equation. So what we understand is that whenever the value of x is more than about 3, there is more chance that the expected value of y is 1; when it goes below about 2, there is more chance that the expected value of y is 0; and as you move to the right hand side, the chance that the expected value of y is 1 increases.
(Refer Slide Time: 12:24)
Now I will explain the previous curve. Note that the graph is S-shaped. The value of E(y) ranges from 0 to 1 on the vertical axis, with E(y) gradually approaching 1 as the value of x becomes larger and approaching 0 as the value of x becomes smaller. Note also that E(y), representing a probability, increases fairly rapidly as x increases from 2 to 3, after which it levels off. The fact that E(y) ranges from 0 to 1 and that the curve is S-shaped makes the equation on the previous slide ideally suited to model the probability that the dependent variable is equal to 1.
(Refer Slide Time: 13:31)
Now we will go for estimation of the logistic regression equation. In simple linear and multiple regression, the least squares method is used to compute b0, b1, up to bp as estimates of the model parameters. What are the model parameters? They are beta 0, beta 1, up to beta p. So what have we done? With the help of these sample estimates, we have estimated the population parameters beta 0, beta 1, up to beta p.
But the previous equation, the logistic equation, is non-linear. This non-linear form of the logistic regression equation makes the method of computing estimates more complex. So what are we going to do? That is why, as I explained in the previous class, whenever the equation has a non-linear form, instead of using the OLS method you have to use the maximum likelihood estimation method, MLE, to estimate the population parameters.
So all software packages follow the concept of maximum likelihood estimation, which I explained in the previous class, to estimate the population parameters with the help of the sample.
We will use computer software, Python, to provide the estimates; at the end of the class I will show you that. The estimated logistic regression equation is y hat = P(y = 1 | x1, x2, ..., xp) = e^(b0 + b1x1 + b2x2 + ... + bpxp) / (1 + e^(b0 + b1x1 + b2x2 + ... + bpxp)), where b0, b1, ..., bp are the sample estimates. Here y hat provides an estimate of the probability that y = 1 given a particular set of values of the independent variables. The y hat is that probability, and it tells us how much chance there is that y = 1: a higher y hat means a higher probability that y = 1, and a lower y hat means a lower probability.
(Refer Slide Time: 15:57)
I have brought a screenshot of the logistic regression. There are two independent variables, card and spending, and y is the dependent variable. So I am going to add a constant: x1 = sm.add_constant(x). Here you see that we are going to use the Logit model: logit_model = sm.Logit(y, x1), result = logit_model.fit(). Print result.summary2() and you will get this output. So look at this.
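A minimal sketch of that fit, using statsmodels; the file name Simmons.xls and the column names Spending, Card, and Coupon are assumptions about how the data set is stored, so adjust them to your copy of the data.

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel('Simmons.xls')    # assumed local copy of the Simmons data (100 customers)
y = df['Coupon']                     # dependent variable: 1 = used the coupon, 0 = did not
x = df[['Card', 'Spending']]         # independent variables
x1 = sm.add_constant(x)              # adds the intercept column

logit_model = sm.Logit(y, x1)        # note: the statsmodels class is Logit (capital L)
result = logit_model.fit()
print(result.summary2())             # coefficients, standard errors, z (Wald), p-values, LL and LL-null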
The constant is -2.14, the coefficient of card is 1.0987, and the coefficient of spending is 0.3416. There are 100 observations; in my previous slides I showed only 10 observations, just for understanding purposes. The model is a Logit model, the pseudo R square is 0.101, and there is the AIC. There is also the log likelihood, and the log likelihood when the variables are not there, which is the log likelihood underscore null.
This we will use to find the G statistic; I will tell you about it later. Then there is the standard error of each regression coefficient. The z value is called the Wald statistic; it is nothing but the coefficient 1.0987 divided by 0.4447. Next to it is the p-value. There are two things you have to understand before interpreting the answer. One is that we have to look at the G statistic; in the coming slides I will explain what the G statistic is.
The G statistic is equivalent to the F statistic of our linear regression. What have we done there? The F statistic in linear regression helps to test the overall model, and the t statistic in linear regression is used to check the significance of an individual independent variable. In the same way, here the G statistic is used to test the significance of the overall logistic regression model, and the z value, that is the Wald statistic, together with its corresponding p-value, is used to test the significance of each individual independent variable. I will go further and then explain the meaning of that.
(Refer Slide Time: 18:50)
So what are the variables? We have taken y, which has two possibilities: 0 if the customer did not use the coupon, 1 if the customer used the coupon. x1 is the annual spending at Simmons stores, in thousands of dollars, and x2 is a categorical variable: 0 if the customer does not have the Simmons credit card, 1 if the customer has the Simmons credit card. So we know that the expected value of y = e^(beta0 + beta1 x1 + beta2 x2) / (1 + e^(beta0 + beta1 x1 + beta2 x2)).
That was for the population, but it can be estimated with the help of y hat, that is, e^(b0 + b1x1 + b2x2) divided by (1 + e^(b0 + b1x1 + b2x2)). So this is the sample estimate. From the previous output, what is b0 here? b0 is -2.1464. Now in our problem, x1 is spending, how much the customer spent last year, and x2 is whether the person possesses the card or not.
So the constant is -2.1464, the coefficient of x1 is 0.3416, and the coefficient of x2 is 1.0987. This goes in the numerator, and then 1 + e to the power of the same value goes in the denominator.
(Refer Slide Time: 20:43)
We have got the output of the logistic regression model. We will look at how to interpret it and see its managerial use. How do we interpret it? For example, take x1 = 2 and x2 = 0. What is the meaning? Suppose the person's spending is $2,000 and the person does not have the credit card. When you substitute x1 = 2 and x2 = 0 into our estimated regression equation, in both the numerator and the denominator, and simplify, we get 0.1880.
The meaning is that for a person who does not have the credit card and has spending of $2,000, the probability that this person will use the coupon is 0.1880. For the same case, instead of x2 = 0 I am now going to look at the interpretation for x2 = 1. What does that mean? A person having the credit card and spending $2,000: what is the probability that that person will use the coupon? So you substitute x1 = 2; previously we substituted x2 = 0, now we substitute x2 = 1.
When you simplify this, we get 0.4099. So what has happened? For a person having the credit card, the probability that y = 1 becomes higher. That means there is a greater chance that a person having the credit card will use the coupon. So the probabilities indicate that for customers with annual spending of $2,000, the presence of a Simmons credit card increases the probability of using the coupon.
By how much is it increasing? You can see it is by this much: 0.1880 is the probability for the same spending without the credit card, and 0.4099 is the probability with the credit card. So the probability is increased if the person possesses the credit card.
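As a check on these numbers, a small sketch like the following evaluates the estimated equation at x1 = 2 with x2 = 0 and x2 = 1; with the coefficients reported above it reproduces approximately 0.1880 and 0.4099.

import numpy as np

b0, b1, b2 = -2.1464, 0.3416, 1.0987   # constant, spending, card (from the output above)

def p_hat(x1, x2):
    # estimated probability that the customer uses the coupon
    z = b0 + b1 * x1 + b2 * x2
    return np.exp(z) / (1 + np.exp(z))

print(p_hat(2, 0))   # spends $2,000, no credit card  -> about 0.188
print(p_hat(2, 1))   # spends $2,000, has credit card -> about 0.410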
(Refer Slide Time: 22:45)
Like that, we can work out different possibilities; previously we explained only this portion. Now for x1 = 1, that is $1,000, and x2 = 1, you get this probability; then for x1 = 1 and x2 = 0, you get this probability. Like that, we have extended the table up to $7,000. When you look at this figure and compare the probabilities, you see that for a person having the credit card there is a greater chance that they will use the coupon, and for a person not having the credit card there is a lesser chance of using the coupon. That is one interpretation.
(Refer Slide Time: 23:34)
Before interpreting this, we have to test whether the coefficients are significant or not, because the equation we have constructed is estimated from a sample and we want to draw conclusions about the population. We did the same thing for the linear regression equation, where we used a t-test and an F test to test the significance of the independent variables. So what is the null hypothesis? Beta 1 = beta 2 = 0; the alternative is that one or both of the parameters are not equal to 0.
(Refer Slide Time: 24:07)
Now we will go to the G statistic. The test for overall significance is based on the value of the G test statistic. This is equivalent to the F test statistic in our linear regression model. If the null hypothesis is true, the sampling distribution of G follows a chi-square distribution with degrees of freedom equal to the number of independent variables in the model. In our problem, the number of independent variables is 2, so the degrees of freedom are 2. If there were only one independent variable, the degrees of freedom for the G statistic would be 1.
(Refer Slide Time: 24:43)
So this was the output, and I am going to explain how we got this G statistic. Look at the values which I have coloured in blue: the log likelihood is -60.487, and the log likelihood when the variables are not there, that is, the log likelihood underscore null, is -67.301.
(Refer Slide Time: 25:05)
The formula to find the G statistic is G = -2 ln(L_null / L_model), that is, -2 times the difference between the log likelihood without the variables and the log likelihood with the variables. The log likelihood with the variables is -60.487, and the null value, without the variables, is -67.301. When you take the difference and multiply by 2, G = 2 x [(-60.487) - (-67.301)] = 13.628. So the value of G is 13.628, the same answer as in the output.
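A small sketch of this computation and its p-value, assuming the fitted statsmodels result from the earlier sketch is available as result (its llf and llnull attributes hold the two log likelihoods):

from scipy import stats

ll_model = result.llf      # log likelihood with the variables, about -60.487
ll_null = result.llnull    # log likelihood of the null model, about -67.301

G = 2 * (ll_model - ll_null)        # G statistic, about 13.63
p_value = stats.chi2.sf(G, df=2)    # 2 degrees of freedom = number of independent variables
print(G, p_value)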
It has 2 degrees of freedom because there are 2 independent variables, and the corresponding p-value is 0.001. Thus, at any usual level of significance, since 0.001 is smaller than alpha, we would reject the null hypothesis and conclude that the overall model is significant. In this class, I have explained when to go for a logistic regression equation: whenever the dependent variable is a categorical variable, we should go for a logistic regression equation.
Then I took a sample problem, and with the help of that sample problem I used Python to estimate the various parameters of the logistic regression equation. Then I explained what the G statistic is. The G statistic is equivalent to the F statistic in our linear regression model: in the linear regression model, the F statistic is used to test the overall significance of the model.
In the same way, in the logistic regression model the G statistic is used to test the overall significance, but we still have to check the individual significance of each independent variable; that we will continue in the next class. It is equivalent to looking at the t value of our linear regression model: in the linear regression model, the t statistic is used to test the significance of an individual variable, and in the same way, in our logistic regression equation, the z statistic, or Wald statistic, is used to find the significance of each independent variable. That we will continue in the next class.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 39
Logistic Regression - II
In the previous class, we tested the overall significance of the logistic regression model; in this class, we will go for testing the individual significance of each independent variable.
(Refer Slide Time: 00:41)
So, the agenda for this class is: testing the significance of the logistic regression coefficients, and then a Python demo on logistic regression. In the previous class I stopped after finding the G statistic and its corresponding p-value; that p-value was less than 0.05, so we saw that the overall model is significant. This is the Python code to get the chi-square p-value for a given G value and degrees of freedom.
So the chi-square function is evaluated at (13.628, 2): 13.628 is our G value from the previous lecture, and 2 is the degrees of freedom because there were 2 independent variables. This gives the p-value, and since the p-value is very low, we can say that the model is significant.
(Refer Slide Time: 01:30)
Z test or Wald test: the z test can be used to determine whether each individual independent variable is making a significant contribution to the overall model or not. For example, how did we get the z value? If you divide 1.0987 by 0.4447, you get the z value for card. Similarly, when you divide 0.3416 by 0.1287, you get the z value for spending. Looking at the corresponding probabilities, both are less than 0.05. So we can say both independent variables are significant; the model as a whole is significant, and the independent variables in the logistic regression model are also significant.
(Refer Slide Time: 02:15)
Once we know both are significant, we go for the interpretation of the output: what kind of strategies should the company adopt so that they can improve their revenue by getting more customers to use the coupon? Suppose Simmons wants to send the promotional catalog only to customers who have a 0.40 or higher probability of using the coupon. So what will happen? Look at where the probability crosses 0.4, first for those who have the credit card.
For those who do not have the credit card, 0.4 is reached much later. So the interpretation from this table is: whoever has the credit card and whose spending is $2,000 and above, you can send them the coupon and they are likely to use it. For those who do not have the credit card but whose spending is above $6,000, you can also send the coupon, and they are likely to use it.
So this is the managerial interpretation: for customers who have a Simmons credit card, send the catalog to every customer who spent $2,000 or more last year, because 0.4 is the cut-off; for customers who do not have the Simmons credit card, send the catalog to every customer who spent $6,000 or more last year. That is the strategy for the promotion.
(Refer Slide Time: 03:49)
Now we will go for interpreting the logistic regression equation; for that, there is a close connection with the odds. What are the odds? The probability of success divided by the probability of not getting success; generally the odds are P divided by (1 - P). Here the probability of y = 1 is the success and the probability of y = 0 is the non-success. So the probability of y = 1, given the independent variables, is the numerator P, and 1 minus the probability of y = 1 is the denominator, 1 - P; this function is called the odds function.
(Refer Slide Time: 04:33)
From the odds function we go to the next term, the odds ratio, because this odds ratio is very helpful for explaining the coefficients of the logistic regression equation. The odds ratio is odds1 divided by odds0: what are the odds at the 0 level, and what are the odds at the 1 level? The 0 level means, for example, those who do not have the credit card, and that gives odds0.
When we jump from one level to another, odds1 corresponds to those who have the credit card. So the odds ratio measures the impact on the odds of a 1 unit increase in only one of the independent variables. This odds ratio will help us interpret the logistic regression equation: if an independent variable is increased by 1 unit, what is the effect of that on the dependent variable.
(Refer Slide Time: 05:33)
What is the interpretation? For example, suppose we want to compare the odds of using the coupon for customers who spent $2,000 annually and have a Simmons credit card, that is, x1 = 2 with the credit card, to the odds of using the coupon for customers who spent $2,000 annually and do not have the Simmons credit card. In the second case x2 becomes 0, so x1 = 2 and x2 = 0; that second category gives odds0, and the first one gives odds1.
The ratio of those two is called the odds ratio. We are interested in interpreting the effect of a 1 unit increase in the independent variable x2; here a 1 unit increase means going from a person not having the credit card to a person having the credit card, that is, from the 0 level to the 1 level. The 0 level is no credit card, the 1 level is credit card. So what is the impact of this on the estimated value of y?
(Refer Slide Time: 06:48)
First we will see odds1; odds1 is for a person having the credit card. The numerator is P(y = 1 | x1 = 2, x2 = 1), spending of $2,000 with the credit card, that is, the probability of success; the denominator is the probability of non-success, 1 - P(y = 1 | x1 = 2, x2 = 1). When you substitute x1 = 2 with the credit card, the probability is 0.4099.
So 0.4099 divided by (1 - 0.4099) gives 0.6946. Then we go to odds0, the 0 level: P(y = 1 | x1 = 2, x2 = 0). This case is similar to the previous one, the spending amount is the same, but the person does not have the credit card; again the numerator is P and the denominator is 1 - P. That probability is 0.1880, so 0.1880 divided by (1 - 0.1880) gives 0.2315. When you divide 0.6946 by 0.2315, you get 3, and this 3 is very useful for the interpretation.
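A quick sketch of this arithmetic, using the two probabilities obtained from the estimated equation above:

p_card = 0.4099      # P(coupon used | spending $2,000, has card)
p_no_card = 0.1880   # P(coupon used | spending $2,000, no card)

odds_1 = p_card / (1 - p_card)         # about 0.6946
odds_0 = p_no_card / (1 - p_no_card)   # about 0.2315
odds_ratio = odds_1 / odds_0           # about 3.0
print(odds_1, odds_0, odds_ratio)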
(Refer Slide Time: 08:21)
What is the meaning of 3? The estimated odds in favour of using the coupon for customers who spent $2,000 last year and have a Simmons credit card are 3 times greater than the estimated odds in favour of using the coupon for customers who spent $2,000 last year and do not have the Simmons credit card. That means the spending amount is the same, but when you send the coupon to a person who has the credit card, the odds that the person will use that coupon are 3 times greater. That is the meaning of this odds ratio.
(Refer Slide Time: 09:03)
The odds ratio for each independent variable is computed while holding all the other independent variables constant. For example, in the previous case the amount spent was held constant, and we interpreted only whether the person has the credit card or not; it does not matter which constant values are used for the other independent variables.
So we do not have to worry about the values held constant. For instance, if we computed the odds ratio for the Simmons credit card variable x2 using $3,000 instead of $2,000 as the value of the annual spending variable x1, we would still obtain the same value of the estimated odds ratio; the constant does not matter.
Thus we can conclude that the estimated odds of using the coupon for customers who have a Simmons credit card are 3 times greater, and that is the important interpretation, than the estimated odds of using the coupon for customers who do not have the Simmons credit card. So you have to target the people who have the credit card: when you target them, the odds are 3 times greater that they will use the coupon compared to those who do not have the credit card.
(Refer Slide Time: 10:27)
Now, another very useful relationship: instead of finding the odds ratio that way, there is a connection between the coefficients of the logistic regression equation and the odds ratio, which is e to the power beta i. That is what the title says: the relationship between the odds ratio and the coefficient of an independent variable, where beta i is the coefficient of that independent variable.
So if you want to know the estimated odds ratio for the x1 variable, the amount spent, it is e to the power b1; in our equation b1 is 0.3416. So e to the power 0.3416 gives you the odds ratio for this variable. Some software packages, for example Minitab, directly give the odds ratio for each independent variable.
But in Python we can calculate the odds ratio using that relationship, e to the power of the coefficient. For example, if we want to know the odds ratio for card, those who have the card versus those who do not, e to the power 1.0987 becomes 3; you can see that e to the power of b2 = 1.0987 is 3. So the odds ratio can be obtained directly from the coefficients of the logistic regression equation.
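In Python, a small sketch gives the same odds ratios directly from the fitted coefficients (entered here by hand; with the statsmodels result you could equally use np.exp(result.params)):

import numpy as np

print(np.exp(1.0987))   # odds ratio for card (x2), about 3.0
print(np.exp(0.3416))   # odds ratio for spending (x1), about 1.41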
(Refer Slide Time: 12:05)
So far we have considered a 1 unit change and seen the corresponding effect on the dependent variable. What does a 1 unit change mean? For x2 we took the values 0 and 1, so we saw the effect of a 1 unit change on the dependent variable.
That was a discrete variable. If the independent variable is a continuous variable, for example x1, the amount spent, suppose somebody is spending $2,000 and we ask what is the probability that they will use the coupon; we can consider going from $2,000 to $3,000, that is, x1 going from 2 to 3. Instead of a 1 unit jump, we can also go for a jump of several units at a time and find the corresponding interpretation. That is the meaning of a change of more than 1 unit in the odds ratio.
The odds ratio for an independent variable represents the change in the odds for a 1 unit change in that independent variable, holding all other independent variables constant. Suppose we want to consider the effect of a change of more than 1 unit, say c units: instead of 2 to 3, say 2 to 5. For instance, in the Simmons example suppose we want to compare the odds of using the coupon for customers who spent $5,000 annually to the odds of using the coupon for customers who spent $2,000; the increment is not 1 but 3, so in this case c = 5 - 2 = 3, and the corresponding estimated odds ratio is very useful.
(Refer Slide Time: 14:05)
So the odds ratio is e to the power of c times the coefficient, where c is how much we are increasing. When you multiply the coefficient by 3 and exponentiate, you get an odds ratio of 2.79. This result indicates that the estimated odds of using the coupon for customers who spend $5,000 annually are 2.79 times greater than the estimated odds of using the coupon for customers who spend only $2,000 annually.
You see that here the increment is not a unit increment, it is an increase of 3 units; in other words, the estimated odds ratio for an increase of $3,000 in annual spending, that is, 3 units, is 2.79.
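A sketch of this c-unit calculation:

import numpy as np

b1 = 0.3416             # coefficient of annual spending (in $1,000s)
c = 5 - 2               # change of 3 units, i.e. $3,000
print(np.exp(c * b1))   # estimated odds ratio, about 2.79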
(Refer Slide Time: 14:49)
Now we come to some theory portions of the logistic regression equation; first we will see what the logit transformation is. An interesting relationship can be observed between the odds in favour of y = 1 and the exponent of e in the logistic regression equation: when you take the log of the odds, the probability of success over the probability of non-success, that is nothing but the logit function.
So log of odds = beta 0 + beta 1 x1 + beta 2 x2 + ... + beta p xp. This equation shows that the natural logarithm of the odds in favour of y = 1 is a linear function of the independent variables. Why do we take the log? So that it becomes a linear function. This linear function is called the logit, and the custom is to write g(x1, x2, ..., xp) to denote the logit function. When you take the logit it becomes linear, so the interpretation is easy.
(Refer Slide Time: 16:04)
So how do we estimate the logistic regression equation? We know the logit function, and the expected value of y is nothing but e to the power g(x1, x2, ..., xp) divided by (1 + e to the power g(x1, x2, ..., xp)). You can estimate this with the help of the sample data, b0, b1, b2, ..., bp; these are your sample values, and with the help of the sample data we estimate the population parameters.
Generally, the name parameter is used only for the population, not for the sample. When I write the hat symbol it is the estimated value, y hat or g hat. So the estimated equation can be written as y hat = e to the power g hat(x1, x2, ..., xp) divided by (1 + e to the power g hat(x1, x2, ..., xp)).
(Refer Slide Time: 17:15)
Why do we exponentiate and take logs? To make it linear. In the problem we have discussed so far, the estimated logit equation is g hat = -2.1464 + 0.3416 x1 + 1.0987 x2, and if you want to predict y, which here is the probability value, y hat = e to the power of (-2.1464 + 0.3416 x1 + 1.0987 x2) divided by (1 + e to the power of the same quantity); that is how we got the earlier answers.
(Refer Slide Time: 17:40)
A very important point: we will compare the purpose of the G statistic and the z statistic. As I told you, because of the unique relationship between the estimated coefficients in the model and the corresponding odds ratios, the overall test of significance based on the G statistic is also a test of overall significance for the odds ratios, and the z test, or Wald test, for the individual significance of the model parameters also provides a statistical test of significance for the corresponding odds ratio.
So G is similar to the F test and z is similar to the t test; the right side is for linear regression and the left side is for logistic regression. Now we will go to Python: with the data I explained to you and showed in the screenshot, I will run the model and show you how to do logistic regression using Python.
(Refer Slide Time: 18:55)
Now we will go to our Python environment, and I will show you how to do the logistic regression. What are the libraries required? You need pandas, numpy, and matplotlib.pyplot; you can import sklearn for the linear model, statsmodels.api, and from sklearn.metrics import mean_squared_error. The file name is Simmons.xls, which, as I told you, was taken from the Anderson, Sweeney and Williams book.
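A sketch of that setup; the exact path to Simmons.xls and its column names are assumptions, so adjust them to your copy of the file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error   # imported in the demo, though not needed for the Logit fit itself

df = pd.read_excel('Simmons.xls')   # 100 customers: Spending, Card, Coupon
print(df.head())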
(Refer Slide Time: 19:26)
So this is the data; I am scrolling, and there are 100 records. What are the variables? There is the customer number, 1, 2, 3 up to 100; spending, how much they spent last year; then possession of the card, where 1 means they have the card and 0 means they do not; and then coupon, whether they have used the coupon, where 0 means they did not use the coupon and 1 means they used it.
(Refer Slide Time: 19:56)
First we will do a scatterplot between spending and coupon. When you look at this data, spending is a continuous variable and coupon is a categorical variable; coupon is our dependent variable.
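A sketch of that scatter plot (assuming the column names used above):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('Simmons.xls')
plt.scatter(df['Spending'], df['Coupon'])   # the dependent variable takes only the values 0 and 1
plt.xlabel('Spending (in $1,000s)')
plt.ylabel('Coupon used (0/1)')
plt.show()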
(Refer Slide Time: 20:13)
When you run this, you see the plot comes out this way: there are only 2 possibilities, 1 or 0, and nothing in between, so the linear regression model is not valid here and we should go for the logistic regression model; that is one point. The other point is the assumption about the error term: in a linear regression model the error term follows a normal distribution, but in a logistic regression model the error term follows a binomial distribution, so you cannot go for the linear regression model.
(Refer Slide Time: 20:47)
In x I have taken card and spending, and y, the dependent variable, is coupon. So x1 = sm.add_constant(x), logit_model = sm.Logit(y, x1), result = logit_model.fit(), and then I print the summary of the logistic regression.
(Refer Slide Time: 21:10)
In the summary of the logistic regression, we need not worry much about the constant; for card the coefficient is 1.0987 and for spending it is 0.3416. After getting this output, what do you have to check? You have to check the overall significance with the help of the G statistic. It is not printed directly, but you can compute it as 2 multiplied by [(-60.487) - (-67.301)]; that value is your G value.
For that G value, with 2 degrees of freedom, the G value is treated as a chi-square value, and you have to find the corresponding p-value; if the p-value is less than 0.05, we can say the overall model is significant. The next aspect is checking whether each independent variable is significant or not, and that is done with the help of the Wald test.
You can look here: the p-value is 0.01, less than 0.05, and for the second variable also the p-value is less than 0.05, so we can say both variables are significant. Suppose you want to interpret this model with the help of the odds ratio. What do you have to do? When you take e to the power of the coefficient, for example e to the power 1.0987, you get the corresponding odds ratio, which is used to explain a 1 unit increase.
Suppose the person goes from not having the card to having the card: the corresponding effect on the dependent variable can be found out that way. Spending is a continuous variable, and here also e to the power 0.3416 will give you the odds ratio, which helps you interpret it; suppose one person is spending $2,000 and another person is spending $3,000.
If there is a 1 unit jump, that odds ratio tells you the corresponding change in the odds that the person will use the coupon. If there is a c unit jump, you simply have to find e to the power of c times 0.3416, and you will get the c-unit odds ratio that you can directly interpret.
(Refer Slide Time: 23:33)
This is the code to check the G value. As I told you previously, the G value is 13.628, which is also mentioned in my PPT; so the chi-square value is 13.628 and the degrees of freedom is 2, because there are 2 independent variables. The corresponding p-value is 0.0054, which is less than 0.05, so the overall model is significant. In this class I have explained how to test the significance of each independent variable in a logistic regression equation, which I did with the help of the z statistic, otherwise called the Wald statistic.
Then I explained what the odds are, then what the odds ratio is, then how to use this odds ratio to interpret the coefficients of the logistic regression equation. At the end I used Python and showed how to use Python for running a logistic regression. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 40
Linear Regression Model Vs Logistic Regression Model
In this class, we are going to compare logistic regression versus linear regression because it is
very important to understand how this linear regression and logistic regressions are connected. If
you understand the relationship between this linear and logistic regression, it is easy to interpret
the meaning of logistic regression. So, agenda of this class is comparison of linear regression
model and logistic regression model.
(Refer Slide Time: 00:56)
We will see the first difference: estimating the relationship. In the linear regression model we write Y1 as a function of X1, X2, ..., Xn, where Y1 is continuous data, the dependent variable, and X1, X2, ... are the independent variables. The independent variables can be continuous, which we call metric, or they may be discrete, which we call non-metric.
In the linear regression model, if an independent variable is a discrete variable, we can use the concept of dummy variable regression. In the logistic regression model, the general model is again Y1 as a function of X1, X2, ..., Xn, but here Y1 is a binary, non-metric variable. The independent variables can be continuous or discrete, in other words metric or non-metric. That is the basic difference between the linear regression model and the logistic regression model.
(Refer Slide Time: 02:02)
Another difference: if you plot a single independent variable against the dependent variable for a linear regression, you may be able to connect the points with a straight line, but in logistic regression the value of Y has only 2 possibilities, 0 or 1, so you get a different kind of relationship. The meaning here is that you cannot form a linear relationship; you have to fit a kind of S-shaped curve, and that is another difference between linear and logistic regression. Look at the y-axis: for the linear case it runs, say, from 0 to 300, and the y values are actual values, whereas for the logistic case the y value is nothing but a probability.
(Refer Slide Time: 02:53)
Here is the correspondence of the primary elements of model fit between linear regression and logistic regression. In linear regression we write SST, the total sum of squares; the equivalent term for logistic regression is -2LL of the base model, that is, -2 times the log likelihood of the base model. In linear regression we write SSE, the error sum of squares; the equivalent term in logistic regression is -2LL of the proposed model. I explained the meaning of the log likelihood in my previous lectures.
In this lecture also, I will show you the software output where we can get this log likelihood value. We know what SST means in a simple linear regression: say this is y bar and this is an observed y, and suppose the fitted line y hat = b0 + b1x goes like this, with x on one axis and y on the other. The distance of the observed y from y bar contributes to SST; the equivalent quantity in logistic regression is -2LL.
Similarly, we have seen SSE: the unexplained length, the distance of the observed y from the fitted line, gives SSE, the error sum of squares, the unexplained variance portion. In the linear regression model, to test the overall model fit we used the F test: F = MSR / MSE, where MSR = SSR / k, with k the number of independent variables, and MSE = SSE / (n - k - 1).
Some books use k and some use p for the number of independent variables; that gives the F value. The equivalent test in logistic regression is a chi-square test of the -2LL difference, and that value is nothing but your G. In linear regression, to describe the model fit, the goodness of the model, the term used is the coefficient of determination, R square. So what is R square?
R square is SSR divided by SST, the regression sum of squares divided by the total sum of squares, in other words the explained variance divided by the total variance. The equivalent term in logistic regression is the pseudo R square; I will explain the formula for finding the pseudo R square in the coming slides. In linear regression we use SSR, the regression sum of squares; the equivalent term for logistic regression is the difference of -2LL for the base and proposed models.
I will explain the meaning of base and proposed models: base means that when there are no independent variables, the corresponding log likelihood value is called the base value; when you introduce the independent variables, the corresponding log likelihood value is called the model log likelihood. I will explain this in detail in the coming slides.
(Refer Slide Time: 06:54)
Next, with respect to the objective of logistic regression, how do linear and logistic regression differ? Logistic regression is identical to discriminant analysis in terms of the basic objectives it can address. There is a technique called discriminant analysis; the basic difference is that in logistic regression we have only 2 levels, 0 or 1, but in discriminant analysis the dependent variable can have more than 2 levels.
I did not cover discriminant analysis, but that is the concept behind it. So logistic regression is identical to discriminant analysis in terms of the basic objectives it can address, yet we still go for logistic regression. If there are 2 categories we could go for discriminant analysis as well, but we prefer logistic regression because it is best suited to address 2 research objectives: one is identifying the independent variables that impact group membership in the dependent variable, and the other is establishing a classification system based on the logistic model for determining group membership.
(Refer Slide Time: 08:18)
There are some more reasons. The fundamental difference is that logistic regression differs from linear regression in being specifically designed to predict the probability of an event occurring, so the y value is nothing but a probability, the probability of an observation belonging to the group coded 1 or not. Although the probability values are metric measures, there are fundamental differences between linear and logistic regression: even though the probability value is metric, the outcome itself has only 2 possibilities, 0 or 1, and we get a different probability for each observation when we go for logistic regression.
(Refer Slide Time: 09:02)
Then, the log likelihood: this is the measure used in logistic regression to represent lack of predictive fit, so the log likelihood measures how much lack of fit there is. Even though this method does not use the least squares procedure in model estimation as is done in linear regression, the likelihood value is similar to the sum of squared errors. If the -2LL value is smaller, it is better, just as in regression we try to make the sum of squared errors, SSE, smaller; similarly, in logistic regression, a smaller value of -2LL means a better model.
(Refer Slide Time: 09:52)
Now we will compare when we should go for logistic regression and when we should go for discriminant analysis; two slides earlier I also compared logistic regression and discriminant analysis. In discriminant analysis we can have more than 2 levels, 3 levels or 4 levels. The problems addressed by logistic regression can also be solved with the help of discriminant analysis when there are 2 levels.
We can say logistic regression is a special case of discriminant analysis, but we usually do not go for discriminant analysis; we go for logistic regression, and there are reasons for that. Logistic regression may be preferred for 2 reasons. First, discriminant analysis relies on strictly meeting the assumptions of multivariate normality and equal variance; that is the first assumption of discriminant analysis.
That means the data has to follow normality and has to have equal variance, and equal covariance matrices across groups are necessary when we go for discriminant analysis. These assumptions are not met in many situations; in real problems we often cannot satisfy them, so in such situations, whenever there are 2 levels in the dependent variable, instead of going for discriminant analysis we can go for logistic regression.
Because these assumptions are not required for logistic regression, we can say logistic regression is more robust than discriminant analysis when there are only 2 categories in the dependent variable. The next point: logistic regression does not face these strict assumptions and is much more robust when these assumptions are not met, making its application appropriate in many situations; that is why we go for logistic regression over discriminant analysis.
(Refer Slide Time: 11:56)
Another point: even when the assumptions are met, many researchers prefer logistic regression because it is similar to multiple regression. It has many possibilities for interpreting the results, straightforward statistical tests, similar approaches to incorporating metric and nonmetric variables and nonlinear effects, and a wide range of diagnostics. Logistic regression is equivalent to two-group discriminant analysis, and the point I am trying to make is that it may be more suitable in many situations.
(Refer Slide Time: 12:38)
With respect to sample size, which is better, logistic regression or discriminant analysis? One factor that distinguishes logistic regression from the other techniques is its use of maximum likelihood as the estimation technique. Maximum likelihood estimation requires larger samples, such that, all other things being equal, logistic regression will require a larger sample size than multiple regression; as for discriminant analysis, there are considerations of minimum group size as well.
So one point when we go for logistic regression is that you need to have a large sample size, because it uses maximum likelihood estimation, and the maximum likelihood estimate is sensitive to the sample size, or degrees of freedom.
(Refer Slide Time: 13:30)
The recommended sample size for each group is at least 10 observations per estimated parameter when we go for logistic regression; that means if you are estimating 1 variable you need to have 20 observations, and if you are estimating 3 variables you have to have 30 observations. This is the thumb rule, and it is much greater than for multiple regression, which had a minimum of 5 observations per parameter.
That recommendation for multiple regression was for the overall sample, not the sample size of each group as in logistic regression. So the point here is: in multiple regression, for 1 variable you can work with 5 respondents per parameter, but when you go for logistic regression, for 1 variable you need to have 10 respondents per parameter, per group.
(Refer Slide Time: 14:32)
Then the goodness of fit of both linear and logistic regression. As I told you, in linear regression the R square, the coefficient of determination, is measured by SSR divided by SST, the regression sum of squares divided by the total sum of squares. The equivalent term in logistic regression is the pseudo R square: R square logit = [-2LL of the null model - (-2LL of the model)] divided by -2LL of the null model.
I will show what the null LL and the model LL are; LL stands for log likelihood. -2LL of the null, or base, model is computed without any independent variables, and -2LL of the model refers to the proposed model: when you bring the independent variables into the logistic regression model, the corresponding log likelihood value is called the model log likelihood.
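A sketch of this calculation, assuming the fitted statsmodels Logit result from the earlier demo is available as result; for the Simmons data it matches the pseudo R square of about 0.101 shown in the output.

ll_null = result.llnull    # log likelihood of the null (base) model
ll_model = result.llf      # log likelihood of the proposed model

pseudo_r2 = (-2 * ll_null - (-2 * ll_model)) / (-2 * ll_null)   # equals 1 - ll_model / ll_null
print(pseudo_r2)           # about 0.101 for the Simmons data
print(result.prsquared)    # statsmodels reports the same McFadden pseudo R square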
(Refer Slide Time: 15:54)
As I told you, I ran this earlier: you can see here an R square of 0.783 in one output, and here the pseudo R square in the logistic output. And as I told you in the previous class, you can see the LL-null: when there are no independent variables, the log likelihood takes this value; that is the null, or base, model, and this other value is your model log likelihood. We take the difference of these two to get the pseudo R square.
(Refer Slide Time: 16:29)
Now we will see how to test the overall significance of linear regression and logistic regression. To test the overall significance of a linear regression, we know that F = MSR / MSE, the mean regression sum of squares divided by the mean error sum of squares. Then the F distribution looks like this: suppose the significance level is 0.05, we get the corresponding critical F value and compare it with our calculated F value.
If the calculated F value lies on the rejection side, we reject the null hypothesis; if it lies on the acceptance side, we accept it. For logistic regression, the formula is G = -2 ln(likelihood without the variables / likelihood with the variables). Since the log of a division is a subtraction, this becomes -2 times [log likelihood without the variables minus log likelihood with the variables]. For example, if the log likelihood without the variables is -5 and the log likelihood with the variables is -4, then -5 - (-4) = -1, and multiplying by -2 gives G = +2.
This G also follows a chi-square distribution; the chi-square is a right-skewed distribution, and this is your G value, here G = 2. The degrees of freedom are the number of independent variables in the logistic regression; that number of independent variables is nothing but the degrees of freedom for the G value. This value you can get from the Python output of the logistic regression and the linear regression; that was the comparison.
(Refer Slide Time: 18:40)
The other point here is testing the significance of each independent variable. In a linear regression model we use the t test: the calculated t is (b1 - beta 1) divided by the standard error of b1, where the null hypothesis is beta 1 = 0 and H1 is beta 1 not equal to 0. Then we find the t value, look it up with the appropriate degrees of freedom (n - 2 for simple regression, n - k - 1 in general), and compare whether it lies in the acceptance or the rejection region.
The equivalent test in logistic regression, to test the individual significance of each independent variable, is the Wald test. The Wald statistic is nothing but the estimated coefficient divided by its standard error. Going back to the output shown here, for example, for card the estimated coefficient is -0.0029 and its standard error is 1.4887; dividing one by the other gives the z value, and this z value is used to test whether the individual independent variable is significant or not. That ratio of the coefficient to its standard error is nothing but your Wald statistic.
(Refer Slide Time: 20:36)
Model estimation fit: the basic measure of how well the maximum likelihood estimation procedure fits is the likelihood value, similar to the sum of squares value used in multiple regression; it is equivalent to SSE. In multiple regression, if the value of SSE is low, it is a good model; in the same way, logistic regression measures model estimation fit with -2 times the log of the likelihood value, referred to as -2LL.
The minimum value of -2LL is 0, similar to SSE = 0, which corresponds to a perfect fit. So in linear regression it is SSE and in logistic regression it is -2LL, and in both cases lower is better.
(Refer Slide Time: 21:28)
The lower the -2LL value, the better the model fits, and the -2LL value can be used to compare equations for the change in fit. What happens is: first we run the model without any independent variables and get its -2LL; then we introduce the independent variables and compare how much unexplained variation remains. If it is less, then the variables we have included explain the model in a better way; that is the meaning of comparing equations for a change in fit.
(Refer Slide Time: 22:07)
As I told you, for between-model comparison the likelihood value can be compared between equations to assess the difference in predictive fit from one equation to another, with a statistical test for the significance of these differences. There are 3 steps to assess whether the model fit improves after introducing a new variable.
(Refer Slide Time: 22:32)
The first step is to estimate the null model. What is the null model? The first step is to calculate a null model, which acts as a baseline for comparing the improvement in model fit. The most common null model is one without any independent variables, which is similar to calculating the total sum of squares using only the mean in linear regression, that is, using y bar. The logic behind this form of null model is that it can act as a baseline against which any model containing independent variables can be compared.
(Refer Slide Time: 23:16)
Step 2 is to estimate the proposed model. This model contains the independent variables to be included in the logistic regression model. The model fit should improve over the null model and result in a lower -2LL value: if, after including a new independent variable, the value of -2LL is lower, then the model is a good model. Any number of proposed models can be estimated this way.
(Refer Slide Time: 23:47)
The third step is assessing the – 2LL difference, the final step is to assess the statistical
significance of the – 2LL value between 2 models that is null model versus proposed model, if
the statistical tests support significant differences, then we can state that the set of independent
variables in the proposed model is significant in improving the model estimation fit.
(Refer Slide Time: 24:20)
Another correspondence between logistic and linear regression is SSE, which I also explained on my previous slide: in linear regression we have SSE, and in logistic regression we have -2LL of the proposed model; the smaller it is, the better the model.
(Refer Slide Time: 24:40)
In linear regression we have SSR; in logistic regression the equivalent is the difference between -2LL of the null model and -2LL of the model after introducing the independent variables.
(Refer Slide Time: 24:54)
Another important difference between linear and logistic regression concerns the error term: in a linear regression model the error term follows a normal distribution, but in a logistic regression the error term follows a binomial distribution. Linear regression also assumes that the residual variance is approximately equal for all predicted values of the dependent variable; logistic regression does not need the residuals to be equal for each level of the predicted dependent variable.
(Refer Slide Time: 25:26)
Another important difference: linear regression is based on ordinary least squares (OLS) estimation, but logistic regression is based on maximum likelihood estimation; this should really be our first point. In OLS, the regression coefficients are chosen in such a way as to minimise the sum of the squared distances of each observed response from its fitted value; in other words, the error sum of squares has to be minimised.
But here, the coefficients are chosen in such a way as to maximise the probability of y given x. With maximum likelihood estimation, the computer uses different iterations in which it tries different solutions until it reaches the maximum likelihood estimates; that is why solving a logistic regression by hand is very difficult.
(Refer Slide Time: 26:25)
Another difference between logistic and linear regression is the way we interpret the coefficients; this is a very important difference. In linear regression, keeping all other independent variables constant, the coefficient tells us how much the dependent variable is expected to increase or decrease with a unit increase in the independent variable; that is how we interpret the coefficient of each independent variable in a linear regression model.
But in a logistic regression model, we interpret the effect of a 1 unit change in X on the predicted odds ratio, with the other variables in the model held constant. The point here is that a coefficient in logistic regression is explained with the help of the odds ratio: if the odds ratio is 3, then for a 1 unit increase in the independent variable the odds become 3 times greater.
That is the way to interpret the coefficients of a logistic regression. In this class I have explained the difference between the logistic regression and linear regression models; for many parameters I have compared the equivalent term for logistic regression, equivalent meaning with respect to linear regression. If you are thorough in interpreting the linear regression model, then after listening to this lecture you can interpret the logistic regression model in a very easy manner. Thank you very much.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 41
Confusion Matrix and ROC
In this class, we are going to talk about how to check the performance of a logistic regression model. There are 2 ways to do that: one is checking the confusion matrix, and the other is the ROC. We will explain what the confusion matrix and the ROC are, and then we will see how to use these 2 criteria to check whether the model we developed is good or not.
(Refer Slide Time: 00:54)
The agenda for this class is to see what the confusion matrix and the receiver operating characteristic curve are.
(Refer Slide Time: 01:01)
As we have seen in our previous example, there may be different methods to classify a data set. One of the methods is our logistic regression, which is used to classify records into 2 categories, the 0 category or the 1 category, but we want to see which method is the best one; multiple methods are available to classify or predict. For each method, multiple choices are available for the settings.
Here, multiple choices means the threshold value at which we say that beyond this probability you should go to 1 and below this probability you should come to 0; those are our multiple choices. So, to choose the best model, we have to assess each model's performance, and that is what we will see in this class.
(Refer Slide Time: 01:53)
In the classification context, how do we measure accuracy? One term is the misclassification error; first we will see what an error is. An error is classifying a record into one class when it belongs to another class. With 2 categories, 0 and 1, we make predictions, and sometimes we predict wrongly: instead of saying 1 we say 0, or instead of saying 0 we say 1. That is an error. The error rate is the percentage of misclassified records out of the total records in the validation data.
(Refer Slide Time: 02:42)
This is an example of a confusion matrix. In the rows you have the actual class, and in the columns the predicted class. In the rows we see 1 and 0, and in the columns 1 and 0. If the actual is 1 and the predicted value is also 1, that is a correct classification, and we count how many records fall there. The other correct possibility is that the actual is 0 and the predicted is also 0; these 2 cells hold the correct classifications.
Here the frequency of saying 1 when it is actually 1 is 201, and the frequency of saying 0 when it is actually 0 is 2689. So 201 ones are correctly classified as 1; the 85 are ones incorrectly classified as 0, where the actual is 1 but we predicted 0; the 25 are records incorrectly classified as 1, where the actual is 0 but we classified them as 1; and 2689 are classified as 0 when the actual is also 0. This is the set-up of the confusion matrix, and this matrix is useful to find the accuracy of our model.
(Refer Slide Time: 04:07)
Now, how do we find the error rate, and from the error rate the accuracy of our prediction model? The overall error rate: there are 2 error cells, the 25 and the 85. When you add 25 + 85 and divide by the overall count of 3000, the error rate is 3.67%, and the accuracy is 1 minus the error rate, which gives 96.33%. Here there are only 2 classes, 1 and 0; if there are multiple classes, the error rate is the sum of the misclassified records divided by the total records.
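A sketch of this calculation with the counts from the slide:

TP, FN = 201, 85      # actual 1: predicted 1, predicted 0
FP, TN = 25, 2689     # actual 0: predicted 1, predicted 0

total = TP + FN + FP + TN           # 3000 records
error_rate = (FP + FN) / total      # about 0.0367, i.e. 3.67%
accuracy = 1 - error_rate           # about 0.9633, i.e. 96.33%
print(error_rate, accuracy)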
(Refer Slide Time: 05:02)
Here we will see the cut-off for classification. We need a cut-off to say when a record is 1 and when it is 0; most algorithms classify via a 2-step process: for each record, compute the probability of belonging to class 1, then compare it with the cut-off value and classify accordingly. The default cut-off value is 0.5: if the predicted probability is greater than or equal to 0.5, we classify the record as 1; if it is less than 0.5, we classify it as 0.
So in the probability range, below 0.5 we say it is 0 and above 0.5 it is 1; this is the default. We can use different cut-off values, but typically the error rate is lowest when you take the cut-off value as 0.5.
For example, look at this picture: this is one example from our logistic regression model, showing the estimated y values. As I told you in previous classes, the estimated y value is a probability: 0.996, 0.988, and so on down the column. Suppose you keep the cut-off at 0.5; 0.5 and above we are going to call 1. When you keep the cut-off there, 11 records are classified as 1.
When the cut-off is 0.8, only 7 records are classified as 1. So the problem becomes: what should be the right cut-off for classifying a record as 1 or 0? That is in our hands. If you keep a very high cut-off, that is not good, and if you keep a very low cut-off, that also is not good. We will see what keeping a higher cut-off means and what having a lower cut-off means.
(Refer Slide Time: 07:19)
Assume that my cut-off value is 0.25 in our previous problem; the cut-off can be updated, and many software packages let us set different cut-offs. When you keep the cut-off at 0.25, then for a record like this one the actual class is 1 and the predicted class is also 1, this one is also 1, this one is 0, and this one is 0. So 1 predicted as 1 is a correct prediction and 0 predicted as 0 is a correct prediction when you keep the cut-off at 0.25. Suppose you increase the cut-off value to 0.75: then only 7 records are predicted as 1, whereas before 11 were. So when you update the cut-off, we get a different confusion matrix, and every confusion matrix gives its own overall accuracy.
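A sketch of how changing the cut-off changes the classification; the probability values below are illustrative placeholders, not the exact values from the slide.

import numpy as np

# illustrative estimated probabilities (not the actual slide values)
p_hat = np.array([0.996, 0.988, 0.984, 0.980, 0.948, 0.889,
                  0.848, 0.762, 0.707, 0.681, 0.656, 0.622])

for cutoff in (0.25, 0.5, 0.75):
    predicted = (p_hat >= cutoff).astype(int)   # classify as 1 when the probability >= cut-off
    print(cutoff, int(predicted.sum()), 'records classified as 1')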
(Refer Slide Time: 08:24)
In the confusion matrix, the custom is to write 0 first and then 1, for both the actual and the predicted classes. When the actual is 0 and the predicted value is also 0, it is a true negative; on the same diagonal, when the actual is 1 and the predicted value is also 1, it is a true positive. When we make the mistake where the actual is 0 but we predict 1, it is a false positive; I have examples in coming slides of what a false positive means, and intuitively you can understand it. Similarly, when the actual is 1 but we predict 0, that is a false negative. These 4 cells are used to compute different parameters that check the prediction power of our model.
The overall accuracy is the sum of the two correct cells, true negative plus true positive, divided by the sum of all the cell values. The second measure is sensitivity: sensitivity is true positive divided by true positive plus false negative. Why do we call it sensitivity? It measures how well we capture the cases where the actual is 1 and the prediction is also 1; the context of the sensitivity of a testing machine I will show you in the next slide. Then specificity: specificity is true negative divided by true negative plus false positive. If we are correctly predicting the 0's, that is specificity; if we are correctly predicting the 1's, that is sensitivity.
Then the next term is the overall error rate. What is the overall error rate? The false positives are one error and the false negatives are another; their sum divided by the total number of records gives the overall error rate. The false negative error rate is false negative divided by true positive plus false negative. The false positive error rate is false positive divided by true negative plus false positive. I will explain the meaning of false positive and false negative in the coming slides.
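These formulas can be sketched directly in Python; the counts below are hypothetical, just to illustrate the definitions, not values from the lecture's dataset:

# hypothetical cell counts of a confusion matrix
tp, tn, fp, fn = 40, 50, 5, 5
overall_accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity         = tp / (tp + fn)      # true positive rate
specificity         = tn / (tn + fp)      # true negative rate
overall_error_rate  = (fp + fn) / (tp + tn + fp + fn)
false_negative_rate = fn / (tp + fn)
false_positive_rate = fp / (tn + fp)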
(Refer Slide Time: 10:59)
Many times, the overall accuracy of the model is not what matters; sometimes one class is more important. For example, predicting 1 when it is actually 1 may be what matters most. In many cases, it is more important to identify the members of one class, whether that class is 0 or 1, but usually it is class 1: when the actual value is 1, we want the predicted value to also be 1. That is our more important class; we are not so bothered about whether a 0 is predicted as 0.
Examples are tax fraud, credit default, response to a promotional offer, detecting electrical network intrusion, and predicting delayed flights. In each case there are 2 possibilities: a person has committed tax fraud or not; a borrower will default or not; a customer will respond to the promotional offer or not. If the customer does not respond, no problem, but we are concerned mainly with whether he is going to take the promotional offer; between 0 and 1 we focus more on 1, and 0 is not important for us. The same holds for detecting network intrusion and predicting whether a flight will be delayed or on time; sometimes we care about only one of the outcomes. In such cases, we are willing to tolerate a greater overall error in return for better identifying the important class for further attention. So when you want to focus on only one class out of the two, accuracy alone is not the right criterion; something else is important, as we will see.
(Refer Slide Time: 12:52)
That is done with the help of a curve called the ROC curve; ROC stands for receiver operating characteristic. This curve is used to identify what our threshold value should be to decide whether a record belongs to class 1 or class 0. The idea started in electronic signal detection theory in the 1940s and 1950s, and it has become very popular in biomedical applications, particularly radiology and imaging.
If you want to predict whether a person has a disease or not, ROC analysis is well suited to comparing different tests. It is also used in machine learning applications to assess classifiers; in this class, the ROC curve is used to evaluate whether the classifier is classifying correctly. It can even be used to compare tests or procedures in the medical context, for example to decide what kind of procedure is more suitable for the patients.
(Refer Slide Time: 14:10)
We will see one simple example: consider a diagnostic test for a disease, say a medical test you are asked to take. The test has 2 possible outcomes: a positive result, which suggests the presence of the disease, and a negative result, which suggests the absence of the disease. An individual can test either positive or negative for the disease. There are 2 kinds of error: a person may have the disease but get a negative report, and a person may not have the disease but get a positive report; that is where the errors come from.
(Refer Slide Time: 14:59)
These terms you are going to use very often in the coming slides. First, what is a true positive? Pictorially I will show you in the next slide: the test states that you have the disease when you do have the disease; the person has the disease and the test correctly says yes, you have the disease. A true negative means the test states that you do not have the disease when you do not have the disease; again no problem, you do not have the disease and the test also says so.
The problem comes with the false positive: the test states that you have the disease when you do not have the disease. You actually do not have the disease but the test shows positive, saying there is a disease. This is dangerous, because the doctor may then start medication that is not needed. There is another category, the false negative: the test states that you do not have the disease when you do. This is also very dangerous; you actually have the disease but the test says you do not, so this person may not get the proper medication. So both false positives and false negatives are dangerous.
(Refer Slide Time: 16:34)
Look at this picture; the red colour shows patients without the disease and the blue colour shows patients with the disease. This is the test result, and there are 2 groups: persons without the disease and persons with the disease.
(Refer Slide Time: 16:55)
Now, if you keep a cut-off here like this, the x-axis shows something like a probability. To the right-hand side of this line, you call the patients positive, that is, having the disease; to the left-hand side of this line, the test shows negative, meaning the patient does not have the disease.
(Refer Slide Time: 17:21)
Now, you see that this blue portion represents the true positives. What is a true positive? The person actually has the disease and the test also says yes, you have the disease; like the 1 1 cell, that is a true positive.
(Refer Slide Time: 17:39)
Now look at this: because the negative and positive distributions overlap, this red portion actually belongs to the negatives, but just because it lies on the positive side of the cut-off we are going to call it false positive. False positive means the person does not have the disease, but because of the cut-off we have chosen he lies on the positive side, so we report him as positive; this is not good.
(Refer Slide Time: 18:26)
A person does not have the disease but you are going to say he has it. Then you see another category, the true negative. With this cut-off, the portion on the left-hand side is the true negatives: the person does not have the disease and the test also says he does not. It is like the 0 0 cell, where 0 is coded as no disease and 1 as disease; this also is no problem.
(Refer Slide Time: 18:50)
Now, what happens in this case? This is a false negative: this portion of the blue actually has the disease, but because it lies on the negative side of the cut-off we call it a false negative. A very common example is when people confuse a heart attack with gastric trouble; the person actually had a heart attack, but someone suggests it is only gas. That is a false negative, and it is also very dangerous. Now the question comes, what should the cut-off be? Suppose you increase the cut-off; what will happen?
(Refer Slide Time: 19:36)
That is the next case: suppose we increase the cut-off like this. When you increase the cut-off, almost everyone gets a negative report, because to the left of this point is negative and to the right is positive. So whoever goes to the pathology department will get a report saying you do not have any disease, because we have kept a very high cut-off; since the probability ranges from 0 to 1, nearly everyone falls below the cut-off and receives a negative report.
(Refer Slide Time: 20:21)
Then we will see another case: suppose we decrease the cut-off. When you decrease the cut-off, whoever goes to the pathology laboratory will get a positive report; a positive report means you may not have the disease, but the report is going to say you have the disease and you have to start treatment. So the cut-off value plays a very vital role in minimising these errors. In this class, we are going to look at the role of this cut-off value in the accuracy of our predicted model, the classification model.
(Refer Slide Time: 21:02)
So we have to choose the right threshold value; the threshold value is the vertical line, and we have to decide where it should be placed, towards the right-hand side or the left-hand side. The outcome of a logistic regression model is a probability, and often we want to make a binary prediction, 0 or 1. We can do this using a threshold value t: above this threshold we predict positive, and below it we predict negative. Now, what value should we pick for t; what should the cut-off value be?
(Refer Slide Time: 21:41)
This cut-off value is chosen based on which error matters more; we have seen 2 errors, false positive and false negative. If t is large, as I showed you previously, we rarely predict positive; if t is high we almost always report negative, so we make more errors where we say negative but it is actually positive. When you keep a high threshold value, this person is positive, he really has the disease, but because we have chosen a high cut-off we give a report saying negative; that is the danger of a high value of t. Similarly, with a low value of t, a person may not have the disease but we give a report saying he has it; so both are dangerous. Now see the second case: if t is small, we rarely predict negative, only when P(y = 1) is very small. We make more errors where we say positive but it is actually negative, because we have shifted the line to the extreme left-hand side; that person is reported positive but actually he is negative, he does not have the disease. Such a setting detects all patients who are positive, but whoever goes to the laboratory will get a report saying they have the disease.
With no preference between the errors, you can select t equal to 0.5. If you cannot say which error is costlier, false positive or false negative, keep t equal to 0.5; it predicts the more likely outcome and is a conservative choice.
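As a minimal sketch of this two-step rule, the thresholding can be written as follows; the probability values here are made up for illustration, not the lecture's data:

import numpy as np

probs = np.array([0.996, 0.988, 0.75, 0.48, 0.30, 0.05])  # predicted P(y = 1)
t = 0.5                                   # default cut-off with no cost preference
predicted_class = (probs >= t).astype(int)
print(predicted_class)                    # [1 1 1 0 0 0]

Raising t to 0.8 would label only the first two records as 1, which is exactly the trade-off between false positives and false negatives described above.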
(Refer Slide Time: 23:33)
Now I have brought this slide to summarise true negative, true positive and selecting the threshold value: we compare the actual outcomes to the predicted outcomes using the confusion matrix, as I have shown you. The 0 0 cell is a true negative; this cell is a false positive, where the person does not have the disease but we have given a report saying he has it; and this is a false negative, where the person actually has the disease.
(Refer Slide Time: 24:07)
But we have given a report saying it is negative. The false positive is nothing but alpha, your type I error; the false negative is nothing but beta, your type II error. The power of the test, which in hypothesis testing we called 1 - beta, corresponds to sensitivity. If the actual is 0 and the predicted is also 0, we call that specificity.
(Refer Slide Time: 24:34)
You can see this in the form of a matrix: the C0, C0 cell, n00, is the number of C0 cases classified correctly; on the same diagonal, the C1, C1 cell, n11, is the number of C1 cases classified correctly. The errors are the off-diagonal cells: one where the actual is 0 but we predicted 1, and the other where the actual is 1 but we predicted 0. This is the confusion matrix.
(Refer Slide Time: 25:08)
Now let us explain the terms sensitivity and specificity with respect to the previous table. If C1 is the important class, sensitivity is the percentage of C1 cases correctly classified: sensitivity = n11 / (n10 + n11). Specificity is the percentage of C0 cases correctly classified: specificity = n00 / (n00 + n01); you can look at my previous slide to see how these come about.
What is the false positive rate? It is the percentage of predicted C1's that were not actually C1. The false negative rate is the percentage of predicted C0's that were not actually C0. We are going to use these terms while constructing our ROC curve; that is why I am defining them here, along with the true positive rate and false positive rate.
(Refer Slide Time: 26:23)
Then, what is the true positive rate? In the receiver operating characteristic curve, the y-axis shows TPR, the true positive rate, which is the proportion of positive cases correctly identified; the x-axis shows 1 minus specificity, which is the false positive rate, FPR. The false positive rate is the shaded portion I showed, and it equals 1 minus specificity. Now the curve goes this way. When you keep a very low threshold value, you will have high sensitivity and low specificity: a low threshold means the cut-off is here on the left, so whoever comes to the laboratory we will say he has the disease, and the price of that is low specificity.
(Refer Slide Time: 28:44)
In the other case, when you keep a higher threshold, you will have high specificity and very low sensitivity: whoever goes there will get a report saying he does not have the disease. So there is a tension between keeping a higher cut-off value and a lower cut-off value, and we have to choose a trade-off between the two errors; that we will see in the next class.
In this lecture, we have seen how to check the quality of our classification model. There are 2 methods: one is the confusion matrix, the other is ROC analysis. I have explained, using the confusion matrix, what each cell means and the intuitive understanding of each cell, that is, what a false positive and a false negative are. Then I have explained some theory about ROC analysis, and what happens when the cut-off is high and what happens when the cut-off is very low.
In the next class, we are going to see how to choose the correct cut-off value; I will explain it pictorially, so that there is a proper trade-off between the false positive and false negative errors. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 42
Confusion Matrix and ROC - II
In this lecture, we will continue with our explanation of ROC analysis. In the previous lecture I gave you the theory of the confusion matrix and ROC analysis. In this lecture I will explain pictorially the different types of ROC curves and how the ROC curve is used to choose a correct classification threshold; that is, how ROC analysis can help assess the accuracy of our logistic regression model.
(Refer Slide Time: 00:56)
The agenda for this lecture: we will continue with the receiver operating characteristic curve, then see how to choose the optimal threshold value to classify a record as 1 or 0; here we will use a lot of pictures to give you a better understanding.
(Refer Slide Time: 01:12)
What is ROC analysis? As I told you, in the curve I showed previously, this portion is the abnormals, the persons having the disease; if you predict them correctly, that is a true positive. Sometimes a portion actually belongs to the normals category, but since it lies on the abnormal side, we give a report saying you have the disease; that is a false positive.
This is the decision threshold. On the left-hand side we say true negative, but on the negative side there are also some people who belong to the positive group; because they lie on the negative side, the report we give them is a false negative. What is a false negative? Even though they have the disease, we say they do not have the disease.
So what is the true positive fraction? The true positive fraction, TPF, is true positive divided by true positive plus false negative; it is also called sensitivity, the true abnormals called abnormal by the observer. This is the right behaviour, because if they are abnormal we say yes, they are abnormal. The false positive fraction, FPF, is false positive divided by false positive plus true negative. Then specificity is true negative divided by true negative plus false positive, the true normals called normal by the observer. The false positive rate is nothing but 1 minus specificity, so if you write 1 minus specificity you end up with false positive divided by true negative plus false positive.
(Refer Slide Time: 03:23)
This is the ROC curve: as I told you, on the y-axis we have the true positive fraction and on the x-axis we have the false positive fraction; since both are probabilities, each axis ranges from 0 to 1. Look at curve C: it corresponds to the case where there is only a small overlap between the true negative and true positive distributions. Curve B corresponds to a case with more overlap, and curve A to the case of complete overlap. If the true positive and true negative distributions are completely separated, you get a perfect ROC curve.
Now look at the different conditions when evaluating classifiers via ROC curves: classifier A cannot distinguish between normal and abnormal, because the normal and abnormal distributions completely overlap; B is better but makes some mistakes, since there is still some overlap; C makes very few mistakes, though the distributions are not completely separate; and the reverse-L shaped line is the perfect classifier. So this picture connects the ROC curve with the different amounts of overlap between the true positive and true negative distributions.
(Refer Slide Time: 05:16)
There may be a perfect classifier; perfect means no false positives and no false negatives, which corresponds to this line.
(Refer Slide Time: 05:32)
Look at the different situations in the ROC space for good and bad classifiers. This one is a good classifier. Why do we say it is a good classifier? It has a high true positive rate and a low false positive rate. This one we call a bad classifier because it has a low true positive rate but a high false positive rate, and this one on the diagonal is also poor because the true positive rate and the false positive rate are equal.
(Refer Slide Time: 06:10)
Next, there is one more term to describe the quality of a classification model: the area under the curve, AUC, which is nothing but the area under the ROC curve. For example, here the AUC is 0.775, the portion shaded in red; everything else, the true positive rate and false positive rate axes, is the same as before.
(Refer Slide Time: 06:36)
What is a good AUC? The maximum occurs with a perfect prediction: if all 0's are predicted as 0 and all 1's are predicted as 1, the whole area is covered, shown fully in red, and the AUC is 1. The minimum useful value is 0.5, which corresponds to just guessing; so the AUC ranges from 0.5 (random guessing) up to a maximum of 1 (perfect).
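A quick sketch of how AUC is obtained with scikit-learn; the labels and probabilities below are made up for illustration, not the lecture's data:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # actual classes
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]    # predicted probabilities of class 1
print(roc_auc_score(y_true, y_prob))        # 1.0 is perfect, 0.5 is random guessing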
(Refer Slide Time: 07:09)
Now we will look at selecting a threshold using the ROC curve; we have seen the different points of the ROC curve, and choosing the threshold value is an important objective of this lecture. We choose the threshold that gives the best trade-off between the cost of failing to detect a positive (a false negative) and the cost of raising a false alarm (a false positive); whichever of the two is more dangerous or more costly should be minimised.
(Refer Slide Time: 07:42)
Now we will explain each corner of the ROC plot. A typical ROC plot with a few points in it is shown in the following figure; there are 4 points, A, B, C and D. Note that the 4 corner points are 4 extreme cases of classifiers. There are also points above the diagonal and points below the diagonal; we will take each point and explain its significance and how to interpret it.
(Refer Slide Time: 08:17)
First we will look at point A. At this location the true positive rate (the y value) is 1 and the false positive rate is 0: this is the ideal model, the perfect classifier with no false results. Then we go to the second corner, B, where the true positive rate is 0 (the y value is 0) but the false positive rate (the x value) is 1: this is the worst classifier, not able to predict a single instance correctly. Then we go to C, where the true positive rate is 0 and the false positive rate is also 0: the model predicts every instance as the negative class; it is an ultraconservative classifier, which happens when t equals 1. If you keep the threshold at a very high level, everybody is called negative. Point D has true positive rate 1 and false positive rate 1: the model predicts every instance as the positive class. When you take the threshold to the extreme left-hand side, whoever comes to the pathology laboratory will be told they have the disease; it is an ultra-liberal classifier.
(Refer Slide Time: 09:48)
The problem is how to choose the right point on the ROC curve; now look at the different points inside the ROC plot. First we look at the points in the upper diagonal region: all points that reside in the upper diagonal region correspond to good classifiers, because their true positive rate is higher than their false positive rate.
There is one point X; when you compare X and Z, X is better than Z because X has a higher true positive rate and a lower false positive rate than Z. If you compare X and Y, neither classifier is superior, because there is a trade-off between TPR and FPR: as the TPR increases, the FPR also increases.
(Refer Slide Time: 10:55)
Now let us interpret the points in the lower diagonal region; previously I explained the upper diagonal region. The lower diagonal triangle corresponds to classifiers that are worse than the random classifier, because on this side the false positive rate is higher than the true positive rate. A classifier that performs worse than random guessing can be improved simply by reversing its predictions: look at the two points W' and W; W' at (0.2, 0.4) is the better version of W at (0.4, 0.2), because W' is the mirror reflection of W about the diagonal.
(Refer Slide Time: 11:41)
Now, tuning a classifier through the ROC plot: here I have two ROC curves; let us see which is better and why. Using the ROC plot, we can compare two or more classifiers by their TPR (true positive rate) and FPR (false positive rate) values, and the plot also depicts the trade-off between the true positive rate and the false positive rate of a classifier. Examining ROC curves can give insight into the best way of tuning the parameters of a classifier.
For example, in curve C2 the result degrades after point B: up to that point the true positive rate increases quickly, but beyond it the gain in true positive rate is small while the false positive rate keeps increasing, so beyond that point the settings do not give good classification. Similarly, for curve C1, beyond Q the settings are not acceptable, because the improvement in true positive rate is low compared with the increase in false positive rate.
(Refer Slide Time: 12:54)
Now let us compare different classifiers through the ROC plot. In this picture there are three classifiers, C1, C2 and C3; we can use the area under the curve as a better way to compare two or more classifiers, and each classifier can be evaluated across its different threshold values. If a model is perfect, its AUC is 1, as I have explained; if a model simply performs random guessing, its AUC is 0.5. A model that is strictly better than another has a larger AUC.
Out of these 3, C3 has the highest area under the curve, so that model is better at classifying 1 versus 0 than C2 and C1. Here C3 is the best and C2 is better than C1, since AUC(C3) > AUC(C2) > AUC(C1).
(Refer Slide Time: 14:17)
Now let us look at extreme cases of the ROC curve; this was our typical ROC curve, and here is how to compare two types of ROC curve. For the one on the left, the area under the curve is fairly close to 1, so it is a good test; for the other, the area under the ROC curve is smaller, so it is a poor test. The left-side one is the good test.
(Refer Slide Time: 14:55)
Now we will see the extreme cases. In the first case the 2 distributions do not overlap at all; in the previous lecture I showed you the two distributions, true negative on one side and true positive on the other, with false positives and false negatives in the overlap. If the 2 distributions do not overlap, we get the ideal case, the best test. If the distributions overlap completely, you get the diagonal line, which is the worst test.
(Refer Slide Time: 15:32)
You see that this case, true negative true positive, there is no overlap at all, so you will get this
kind of ROC curve.
(Refer Slide Time: 15:43)
Here there is some overlap, so you get a typical ROC curve; previously the area under the curve was 1, now it is 0.7.
(Refer Slide Time: 15:57)
Now, in this case both distributions overlap completely; whenever they overlap completely, the area under the curve is 0.5, and the corresponding ROC curve looks like this. So far we have understood the concepts of the confusion matrix and the ROC curve. Now I am going to take one example, which I have already discussed in my previous lectures; with the help of that example I am going to show you how to choose the correct threshold value to classify whether a record belongs to category 1 or category 0.
(Refer Slide Time: 16:38)
This example is taken from the book Statistics for Business and Economics by Anderson, Sweeney and Williams. Let us consider an application of logistic regression involving a direct mail promotion being used by Simmons Stores. Simmons owns and operates a national chain of women's apparel stores. 5000 copies of an expensive 4-colour sales catalog have been printed, and each catalog includes a coupon that provides a 50 dollar discount on purchases of 200 dollars or more. The catalogs are expensive, and Simmons would like to send them only to those customers who have the highest probability of using the coupon.
(Refer Slide Time: 17:28)
The variables involved in this problem are: annual spending, and whether the customer has a Simmons credit card or not. What we are going to predict is whether a customer who receives the catalog will use the coupon or not. Simmons conducted a pilot study using a random sample of 50 Simmons credit card customers and 50 customers who do not have a Simmons credit card. Simmons sent the catalog to each of the 100 customers selected; at the end of the test period, Simmons noted whether the customer used the coupon or not. This is the problem.
(Refer Slide Time: 18:11)
For this problem, here is the dataset. Spending, how much the customer spent last year, is one of the independent variables. Possession of a Simmons credit card (1 if the customer has one, 0 if not) is the other independent variable. Coupon, whether the customer used the coupon or not, is our dependent variable.
(Refer Slide Time: 18:33)
You see that the dependent variable has 2 categories, 0 or 1, so we have to go for logistic regression. The amount each customer spent last year at Simmons is shown in thousands of dollars, and the credit card information has been coded as 1 if the customer has the Simmons credit card and 0 if not. In the coupon column, 1 is recorded if the sampled customer used the coupon and 0 if not.
(Refer Slide Time: 19:04)
So we have imported the data. First we imported the necessary libraries, import pandas as pd and import matplotlib.pyplot as plt, and then loaded the dataset Simmons.xls; I am going to run this Python code and explain further. I have brought a screenshot of my Python output. From data.head() we can see the variables: spending and card are the independent variables and coupon is the dependent variable.
Next is data.describe(), which gives an idea about each variable. For the customer ID column the statistics have no meaning. For spending, there are 100 values, the mean is 3.3 and the standard deviation is shown. Since card is a categorical variable, the mean and standard deviation are not meaningful for it. The coupon column is also categorical, with 100 values; here too the mean and standard deviation have no meaning, because you cannot do arithmetic operations on a nominal or categorical variable.
(Refer Slide Time: 20:17)
We are going to use different inbuilt functions. DataFrame.describe() gives basic statistical details such as central tendency, dispersion and the shape of the dataset's distribution. numpy.unique gives the unique values in a particular column, and Series.value_counts() returns an object containing the counts of the unique values.
(Refer Slide Time: 21:01)
ravel() returns a one-dimensional array with all the input array elements, so I use that inbuilt function. For example, unique() on the coupon column gives 0 and 1, which tells us there are 2 categories in the coupon column, 0 and 1. Then value_counts() tells us how many of each: there are 60 zeros and 40 ones, that is, 60 people did not use the coupon and 40 people used the coupon.
Then, for running logistic regression, we split the data into 2 parts: some data for training and building the model, and after the model is built we use the test data to verify it. The imports are: from sklearn import linear_model, from sklearn.model_selection import train_test_split, from sklearn.linear_model import LogisticRegression.
The x value is the data for card and spending, the independent variables, and the y value is coupon. We split into x_train, x_test, y_train, y_test using train_test_split(x, y, test_size=0.25); you can take any test size, and I have also fixed a seed value so that when we repeat the code we get the same kind of output. You can see that the training dataset has 75 records for x and 75 for y, and the testing dataset has 25 records for x and 25 for y; that part is for testing purposes.
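A minimal sketch of this step is shown below. The column names 'Spending', 'Card' and 'Coupon' and the random seed are assumptions based on the lecture's description; the exact names in the file may differ.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_excel('Simmons.xls')
x = data[['Spending', 'Card']]             # independent variables
y = data['Coupon']                         # dependent variable (0/1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)  # fixed seed for repeatability
print(x_train.shape, x_test.shape)         # about 75 and 25 rows respectively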
(Refer Slide Time: 22:50)
First we will build the logistic model, and after building the model we will predict the values. In the logistic regression we use the lbfgs solver; there are different solvers, which I will show you when I run the code; this is one method for fitting the logistic regression model. We call LogisticRegression().fit(x_train, y_train.ravel()); ravel() returns a one-dimensional array with all the input array elements, and we get this output.
Now we predict the y values using the test dataset: we have constructed the model, then we use the test dataset to predict y, and this is our predicted y value. Then we also predict y using the training dataset; in training we have 75 records and in testing 25 records, so this is the prediction of the model on the training dataset.
(Refer Slide Time: 23:52)
Next is y_prob_train: these are the probability values for the training dataset, so for all 75 training records we have obtained the probability of class 1. Our problem remains what the cut-off value should be, to say that below it the record is classified as 0 and above it as 1. Similarly, y_prob is the predicted probability for the test dataset; there it is 25 records, here 75.
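A sketch of the fit-and-predict step, continuing from the split above; the variable names follow the lecture's convention but are assumptions about the exact code:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs')
model.fit(x_train, y_train.values.ravel())         # train on the 75 training records

y_predict = model.predict(x_test)                  # predicted classes (0/1) for the test set
y_prob = model.predict_proba(x_test)[:, 1]         # probability of class 1, test set
y_prob_train = model.predict_proba(x_train)[:, 1]  # probability of class 1, training set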
(Refer Slide Time: 24:34)
Our task is to decide what the cut-off or threshold value should be. First we construct the regression equation: fitting the logistic regression on the x data and y data, we get a constant of -2.1464, a coefficient of 0.3416 for the spending variable, and 1.089 for card. The pseudo R-square is good, the p-values are good, and the overall model is good; even looking at the Wald test, the p-values are all less than 0.05, so this model is good.
(Refer Slide Time: 25:20)
The point is how to set the threshold value. After getting this model, we have predicted some records as 1, so first we will check the accuracy. There are 4 possibilities: the actual is 0 and the predicted is 0, a true negative; the actual is 1 and the predicted is 1, a true positive; the actual is 0 and the predicted is 1, a false positive; and the actual is 1 and the predicted is 0, a false negative.
(Refer Slide Time: 25:44)
The previous table is the basis for the confusion matrix. The accuracy of the model is tested using the score function: score = accuracy_score(y_test, y_predict), and the accuracy of the model is 0.76. To get the confusion matrix we call confusion_matrix(y_test, y_predict). Here the true negatives are 15, the true positives are 4, the false positives are 1 and the false negatives are 5; this is how the confusion matrix is written.
(Refer Slide Time: 26:31)
Generating the classification report: when you use the classification_report function, you get an output with different columns: precision, recall, f1-score and support. Recall gives us an idea of how often the model predicts yes when the answer is actually yes; it is like our sensitivity. Precision tells us how often the model is correct when it predicts yes.
The recall rows cover both specificity and sensitivity: the recall for class 1 is the sensitivity, and the recall for class 0 is the specificity. In our problem the specificity is 0.94, which says that 94% of the time when the actual value is 0 we also predicted 0; the sensitivity is 0.44, so 44% of the time when the actual value is 1 we also predicted 1.
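A compact sketch of these evaluation calls, assuming y_test and y_predict from the earlier split and model:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(accuracy_score(y_test, y_predict))         # 0.76 in the lecture's run
print(confusion_matrix(y_test, y_predict))       # rows: actual 0,1; columns: predicted 0,1
print(classification_report(y_test, y_predict))  # precision, recall, f1-score, support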
(Refer Slide Time: 27:52)
The next column is precision: precision tells us how often the model is correct when it predicts yes; I will explain the meaning of precision in the next slide. Now let us interpret the classification report shown in the previous slide. Precision is true positive divided by true positive plus false positive. Accuracy uses the two correct cells: the sum of true positives and true negatives divided by the sum of all the cells. Recall is true positive divided by true positive plus false negative.
In this lecture, I have explained graphically what the ROC curve is and how to read it, with the help of some pictures. At the end, I took one problem and explained the confusion matrix and how to interpret each of its cells. In the next class, we will continue with the same example, and I will explain how to choose the threshold from the ROC curve. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 43
Performance of Logistic Model – III
In this lecture, we are going to test the performance of logistic regression model. We use
Python and I will show you a demo how to check the performance of a logistic regression
model.
(Refer Slide Time: 00:41)
The agenda for this lecture is Python demo for accuracy prediction in logistic regression
model using ROC curve.
(Refer Slide Time: 01:11)
There are two terms, sensitivity and specificity. For checking what type of error we are making, we use these 2 parameters. One is sensitivity; another name for sensitivity is the true positive rate, which, as I showed you in the previous lecture, is true positive divided by true positive plus false negative. Specificity is the true negative rate, that is, true negative divided by true negative plus false positive.
(Refer Slide Time: 01:20)
In this lecture, I am going to show the connection between sensitivity and specificity for different threshold values. In the first case, when the threshold value is low, say you put the threshold here, we increase the sensitivity but decrease the specificity. When the threshold value is higher, say you keep the threshold here, the specificity will increase and the sensitivity will decrease. So which threshold value should be chosen? That is the problem, and I will show you with the help of Python programming.
(Refer Slide Time: 02:04)
First, we will check what accuracy is. Accuracy is true positive plus true negative divided by
true positive plus true negative plus false positive plus false negative. So, the accuracy for our
problem is 0.76. Then specificity, true negative divided by true negative plus false positive.
For our problem, it was 0.94. Then sensitivity is true positive divided by true positive plus
false negative. For our problem, sensitivity is 0.44.
We got this specificity and sensitivity by taking the threshold as 0.5, the default value, but the question is what happens if it goes above or below 0.5, and what the correct value should be.
(Refer Slide Time: 02:59)
Now we will draw the ROC curve for the training dataset. We import from sklearn.metrics: roc_auc_score, roc_curve and auc. First we compute roc_auc1 = roc_auc_score(y_train, y_predict_train). Then fpr1, tpr1, threshold1 = roc_curve(y_train, y_prob_train) gives the false positive rates, the true positive rates and the corresponding thresholds, and from fpr1 and tpr1 we compute the area under the curve and plot the ROC curve.
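A sketch of that ROC computation is given below; the variable names (y_train, y_predict_train, y_prob_train) follow the lecture's convention but are assumptions about the exact code:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, auc

roc_auc1 = roc_auc_score(y_train, y_predict_train)     # AUC from the hard 0/1 predictions
fpr1, tpr1, threshold1 = roc_curve(y_train, y_prob_train)

plt.plot(fpr1, tpr1, label='ROC (AUC = %0.2f)' % auc(fpr1, tpr1))
plt.plot([0, 1], [0, 1], linestyle='--')               # random-guess diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()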
(Refer Slide Time: 04:29)
When you plot this, for the different combinations of fpr and tpr we get the ROC curve at the default setting; the area under the curve for this model on the training data is 0.64. Now let us draw the ROC curve for the test dataset; only the inputs change. Compared with the training curve, the AUC has increased to 0.9. What we infer is that this model is performing well on the test dataset, because the AUC is 0.9.
(Refer Slide Time: 05:03)
How do we select the threshold value? The outcome of a logistic regression model is a probability, and selecting a good threshold value is often challenging. On the ROC curve, if the threshold is 1, the true positive rate is 0 and the false positive rate is also 0; that is the point where t equals 1. When the threshold is 0, the true positive rate is 1 and the false positive rate is also 1. So threshold values are often selected based on which error matters more.
(Refer Slide Time: 05:51)
For now, we randomly choose a threshold value of 0.35; let us compare different threshold values and see which one is the right choice for us. For that, y_predict_class1 = binarize(y_prob.reshape(1, -1), 0.35). Taking 0.35, these were our predicted values; if we want integer values, we use y_predict_class1.astype(int), and from that we get the confusion matrix. It says that by keeping the threshold at 0.35, the true positives are 9 and the true negatives are 8. We can try different values and, for each, check our recall and precision. Taking the threshold as 0.5, the confusion matrix changes: the true negatives become 15 and the true positives 4.
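The same experiment can be sketched with a plain comparison instead of sklearn's binarize helper; the effect is the same, and y_test and y_prob are assumed from the earlier steps:

from sklearn.metrics import confusion_matrix

for t in (0.35, 0.5, 0.7):
    y_predict_class = (y_prob >= t).astype(int)   # re-classify at cut-off t
    print('threshold', t)
    print(confusion_matrix(y_test, y_predict_class))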
(Refer Slide Time: 07:14)
So what has happened: the recall for class 0, which was 0.5 previously, is now 0.94; that is the value at a threshold of 0.50. If you take the threshold as 0.7, the true negative count increases further but there are no true positives at all. So beyond a certain threshold value we are not able to get any true positives, even though the other values improve a little.
(Refer Slide Time: 07:45)
But the question remains how to choose the right threshold value; we have seen 0.35, 0.5 and 0.7. The appropriate t value can be obtained by looking at this table: the output of this program says that a threshold value of 0.457 is most appropriate for us. Now, the optimal threshold value on the ROC curve: here we plot the true positive rate and 1 minus the false positive rate against the threshold, and the point where the two curves intersect, where the true positive rate and the true negative rate meet, is taken as the optimal threshold value.
(Refer Slide Time: 08:36)
Here is the classification report using the optimal threshold value; this was our program output. I again use binarize on y_prob.reshape to get the predicted values. We get the confusion matrix, where the true negatives are 14 and the true positives are 8, and the corresponding classification report. It shows that specificity and sensitivity are both reasonably high, which indicates that this is the optimal threshold value.
(Refer Slide Time: 09:13)
Now, with the help of Python, I am going to run the code and explain how to choose the correct threshold value. First I import the necessary libraries, pandas and matplotlib.pyplot, and then load the dataset, which I have already discussed with you. There are two independent variables and one dependent variable, Coupon. We describe this dataset, which gives basic statistics of all the columns.
For spending there are 100 records; the mean is 3.3, the standard deviation is 1.74, the minimum is 1, and we get the 25th, 50th and 75th percentiles with a maximum of 7.07. For Card we can look only at the count value, because the mean and standard deviation have no meaning; similarly for Coupon, because both variables are binary. Now let us look at the values in the coupon column.
There are two values, 0 and 1: 0 means the customer did not use the coupon, and 1 means the customer used the coupon. Now let us see how many 0s and 1s there are, using the value_counts function: 60 people did not use the coupon and 40 people used it, so the baseline accuracy is 0.6. Now we will build the LogisticRegression model. I have imported linear_model and, from sklearn.model_selection, train_test_split; I have also imported LogisticRegression.
Then we split the dataset, keeping 75 percent of the records for training and the remaining 25 percent for testing. Let us see how many training and test records we have: for the x variables the training set has 75 records, and for y, 75; the test set has 25 records for x and 25 for y. Now we construct the LogisticRegression, using the lbfgs solver.
Then we evaluate our fitted model with the help of the test dataset: after substituting the x values into the model, this was our predicted y value. You can see that there are different solvers available for LogisticRegression; for example, from the help function you can see that the multinomial option is supported by the lbfgs solver, and there are other solvers such as sag and newton-cg that we are not using here; that is why we have used the lbfgs solver for this LogisticRegression output.
Now we predict with the test dataset; this was our predicted y value, where the input is the test dataset. Next we take the training dataset and predict with the model; the training dataset has 75 records, whereas the test set had only 25. This was our predicted output for the training dataset. Now we get the probability values for it.
This is the probability output for our training dataset; there are 75 records, each with its own probability. Our question is what the threshold or cut-off value should be, so that we can classify each record as 1 or 0. For the test dataset as well, with its 25 records, we can get the probabilities. Now we will run the regression model.
(Refer Slide Time: 13:16)
This is our fitted model, but now we have to decide what the threshold value should be. First we check the accuracy of the model; for that we import accuracy_score, and the accuracy is 0.76. Next we construct a confusion matrix; for that we import the confusion_matrix function. When you run this, you get the confusion matrix at the default cut-off of 0.5: 15 is the true negative count and 4 the true positive count, 1 is the false positive count and 5 the false negative count. So in our dataset the true negatives are 15, false positives 1, false negatives 5 and true positives 4.
Now we look at the classification report. In the classification report, the important things to remember are the recall, the precision and the support. Recall gives an idea of how often the model predicts yes when the answer is actually yes; the recall for class 1 is the sensitivity and the recall for class 0 is the specificity. Precision tells us how often the model is correct when it predicts yes; the precision for class 1 is true positive divided by true positive plus false positive. Accuracy uses the diagonal of the confusion matrix, the true positives and true negatives, divided by the sum of all the cell values.
The recall for class 1 is TP divided by TP plus FN. The f-measure gives the balance between precision and recall. Now we compute the accuracy, which is 0.76; then the specificity, which is 0.94 (with the default cut-off of 0.5); then the sensitivity, the true positive rate, which is 0.44. Next we plot the ROC curve.
For the default setting, the ROC curve is the blue one, and the area under the curve is 0.64. We can also inspect the values directly: first fpr, the list of false positive rates, then tpr, the list of true positive rates; these values form the x and y coordinates of our ROC curve, and for different thresholds we get different points on it. We then plot the ROC curve for the test dataset, using the test data values. Comparing the two curves: for the training dataset the area under the ROC curve is 0.64, while for the test dataset it is 0.9.
So our model performs very well on the test dataset. Now let us try different threshold values and see how the results change. Suppose we take the threshold as 0.35; I compute the predicted values, convert them to integer form, and then draw the confusion matrix. Here the true negatives are 8 and the true positives are 9; because the threshold of 0.35 is low, more records are predicted as positive, which gives the higher true positive count of 9. The classification report for this case shows a recall of 0.5 for class 0 and a recall of 1.00 for class 1.
Now let us take another threshold value, 0.5. I change the value to 0.5, compute the predicted y values, and draw the confusion matrix. The threshold has moved to the right-hand side, so we get more true negatives and fewer true positives. From the classification report, the recall for class 0, the specificity, is increasing and the sensitivity is decreasing. As the threshold moves to the right-hand side, the specificity increases and the sensitivity decreases. In the previous case, with the low threshold of 0.35, the sensitivity was exactly 1 while the specificity was only 0.5; when the threshold value increases, the specificity increases but the sensitivity decreases.
Let us go for 0.7. With the threshold at 0.7, these are our predicted values; in the confusion matrix, because the threshold is at the extreme right-hand side, there are more true negatives. It is like a pathology lab where whoever goes there gets a negative report.
That is the effect of changing the threshold value from the lower side to the upper side. The classification report for this case shows a specificity of 1 and a sensitivity of 0, both because we have chosen a high threshold value. Now the important task is to choose the optimal cut-off point, that is, the optimal threshold value.
So we import roc_curve, run it with y_test and y_prob, and print the area under the ROC curve; the AUC is 0.90, which is very good because it is near 1. But we still want to know the optimal threshold value. For that purpose we import numpy as np, build an index i = np.arange over the tpr values, and construct a DataFrame containing, for each threshold, the false positive rate, the true positive rate and 1 minus the false positive rate. When you run this, you get a table that shows the optimal threshold value: if you take t as 0.457, you get the best balance between the true positive rate and the false positive rate.
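One way to sketch this tabulation and pick the crossing point is shown below; the variable names are assumptions carried over from the earlier ROC step, and the exact code in the lecture may differ:

import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc = pd.DataFrame({'threshold': thresholds,
                    'tpr': tpr,
                    'one_minus_fpr': 1 - fpr})
roc['gap'] = (roc['tpr'] - roc['one_minus_fpr']).abs()
optimal = roc.loc[roc['gap'].idxmin()]     # threshold where tpr and 1-fpr are closest
print(optimal)                             # about 0.457 in the lecture's run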
(Refer Slide Time: 22:15)
When you run this, the blue line shows the true positive rate and the red one shows 1 minus the false positive rate. At their intersection, the true positive rate is high and 1 minus the false positive rate is also high, so that is our optimal value. For that optimal value, let us redo the classification: I have taken 0.45 and drawn the confusion matrix. Here the true negative count of 14 is high and the true positive count is also high; so when you take the threshold as 0.45, you get both a high true negative count and a high true positive count. Looking at the classification report, the specificity is 0.88 and the sensitivity is 0.89. In this lecture, I have taken a sample problem and, with its help, explained how to construct a confusion matrix and how to choose the correct t value, the correct threshold value.
We have tried different threshold values: 0.35, then 0.5, then a value above 0.5, namely 0.7, and compared how the confusion matrix and classification report change as the threshold increases. At the end, we chose the optimal threshold value, which in this problem came out as about 0.45, and for that setting the AUC, the area under the curve, was also very high. So this is the way to choose the correct threshold value for checking the quality of our classification or logistic regression model. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 44
Regression Analysis Model Building – 1
We have seen so far a simple linear regression model and multiple linear regression model. In
this class, we are going to see how to construct a regression model by considering different
independent variables, that is model building using regression analysis.
(Refer Slide Time: 00:41)
What are we going to do? Model building is the process of developing an estimated regression equation that describes the relationship between a dependent variable and one or more independent variables. What is the meaning of model building? There are several independent variables and one dependent variable. We are going to find out how to construct a regression model from these independent variables: whether we have to consider all of them, which variable has to be dropped, and which variable has to be added.
The major issues in model building are finding the proper functional form of the relationship and selecting the independent variables to include in the model. So there are two concepts: one is what kind of relationship is to be fitted, whether it is linear or nonlinear, and the other is how to select the appropriate independent variables.
895
(Refer Slide Time: 01:40)
Suppose we collect data for one dependent variable y and k independent variables x1, x2, ..., xk. The objective is to use these data to develop an estimated regression equation that provides the best relationship between the dependent and independent variables. The general form of the linear regression model is y = beta0 + beta1 z1 + beta2 z2 + ... + betap zp + error term. Here each zj, j = 1, 2, ..., p, is a function of x1, x2, ..., xk, the variables for which the data are collected. In some cases, zj may be a function of only one x variable, only one independent variable.
(Refer Slide Time: 02:37)
If there is only one independent variable, we get the simplest first-order model with one predictor variable: y = beta0 + beta1 x1 + error term. What is happening here is that z1 is taken as just x1, and there is an error term. This is the simplest linear regression model.
896
(Refer Slide Time: 02:52)
Now, we will go for modeling curvilinear relationships. This problem is taken from the book Statistics for Business and Economics, 11th edition, by David Anderson, Sweeney and Williams. To illustrate, let us consider the problem facing a company called Reynolds, a manufacturer of industrial scales and laboratory equipment. Managers at Reynolds want to investigate the relationship between the length of employment of their salespeople and the number of electronic laboratory scales sold.
So what they want to know is the relationship between the length of employment of the salespeople, that is, how long they have been working in the company, and the number of electronic laboratory scales they have sold. Generally, the assumption is that a person with a lot of experience will sell more products. The table in the next slide gives the number of scales sold by 15 randomly selected salespeople for the most recent sales period and the number of months each salesperson has been employed by the firm.
(Refer Slide Time: 04:09)
897
So, this slide shows the data: the scales sold and the months employed. Here, months employed is going to be our independent variable x, and scales sold is going to be our dependent variable y.
(Refer Slide Time: 04:27)
With the help of Python, we will first construct a simple linear regression equation; let us see what happens. I have brought a screenshot of the Python program. At the end of the class, I will run these codes; you can type these commands on your PC and verify the answers. We import pandas as pd, numpy as np, matplotlib.pyplot as plt, and statsmodels.api as sm. The data is stored in the file named Reynolds.xlsx, and I have read this data. This is the dataset: this column is y and this one is x.
(Refer Slide Time: 05:17)
898
Let us do a regression analysis. Before the regression itself, we will draw a scatter plot. For drawing the scatter plot, we call plt.scatter with months employed as the first variable on the x-axis and scales sold as the second variable on the y-axis. When it is plotted, there seems to be a positive trend. What is the meaning of a positive trend? When the number of months employed increases, the sales also increase. It says that an experienced salesperson can sell more products compared to an inexperienced person.
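A minimal sketch of these first steps, assuming the file is Reynolds.xlsx with columns named MonthsEmployed and ScalesSold as on the slide:

import pandas as pd
import matplotlib.pyplot as plt

tb1 = pd.read_excel("Reynolds.xlsx")   # months employed and scales sold for 15 salespeople

plt.scatter(tb1["MonthsEmployed"], tb1["ScalesSold"])
plt.xlabel("Months employed")
plt.ylabel("Scales sold")
plt.show()                             # the points should show a positive trend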
(Refer Slide Time: 06:01)
So, this is the code for running the regression. x = tb1['MonthsEmployed'] is the independent variable and y = tb1['ScalesSold'] is my dependent variable. I add the constant with x2 = sm.add_constant(x), so that my regression model will have an intercept. Then model = sm.OLS(y, x2); OLS is the ordinary least squares method, where y is the dependent variable and x2 is the independent
899
variable, because x2 contains the x variable along with the constant. Then Model = model.fit(), and print(Model.summary()). So, how do we interpret this output?
Look at the constants: the fitted equation is y = 111.22 + 2.37 × months employed. We will test the significance of the model. First, we look at the F statistic and the corresponding p value. Here the p value is 1.24 × 10^-5, which is very low, so the model as a whole is significant. Then look at the significance of the individual variables. Months employed is the independent variable, and its p value is also less than 0.05, so we can say months employed is a significant variable.
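A compact sketch of the fit just described, reusing the tb1 DataFrame above (the column names are assumptions based on the slide):

import statsmodels.api as sm

x = tb1["MonthsEmployed"]
y = tb1["ScalesSold"]
x2 = sm.add_constant(x)          # adds the intercept column

model = sm.OLS(y, x2).fit()      # ordinary least squares
print(model.summary())           # check the F-statistic p value and the coefficient p values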
(Refer Slide Time: 07:39)
This is my model. Here, sales is the y variable, the number of electronic laboratory scales sold, and months is the number of months the salesperson has been employed.
(Refer Slide Time: 07:51)
900
What happens next? First we will draw the residual plot, because as I told you in my previous classes, it is not only the R-square, the F-test p value and the individual significance values that matter; at the same time, we have to check the residuals of the regression model. First, we find the residuals with E = Model.resid_pearson. This is my residual. Then, for x2, my independent variable matrix, I have predicted the y hat values. This is my predicted y.
(Refer Slide Time: 08:34)
Now, I am going to make a plot with y hat on the x-axis and the error on the y-axis. When you look at this picture, the x-axis is our y hat and the y-axis is the standardized residual. When I look at these standardized residuals, they do not fall in a rectangular band; you see that there is a possibility of some kind of curvilinear relationship. You may not agree that it is exactly a curvilinear relationship because the number of data points is small. If there were more
901
data points, we could say for certain. So, what is happening is that the residuals do not form a rectangular band, which suggests that there may be a curvilinear relationship between x and y.
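A short sketch of the residual plot described above, using the fitted model from the previous step:

import matplotlib.pyplot as plt

e = model.resid_pearson          # standardized (Pearson) residuals
y_hat = model.fittedvalues       # predicted scales sold

plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted value (y hat)")
plt.ylabel("Standardized residual")
plt.show()                       # a curved, non-rectangular pattern hints at a curvilinear relationship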
(Refer Slide Time: 09:25)
A very important point: why do we have to go for a curvilinear relationship? Although the computer output shows that the relationship is significant, because the p value is less than 0.05, and the linear relationship explains a high percentage of the variability (the R-square is 78.1%), the standardized residual plot suggests that a curvilinear relationship is needed.
So, the point I want to make is that the R-square and the significance values are not enough; apart from them, we have to draw the residual plots to verify whether the model is correct or not. When we plot the standardized residuals, they suggest a nonlinear relationship.
(Refer Slide Time: 10:19)
902
So, what kind of nonlinear relationship are we going to have? z1 is x1, no problem. Then for z2, I am going to square the x value. This squared value, x1 squared, is taken as a new independent variable. Previously only x1 was there; now the square of x1 is an additional independent variable. So this is still a general linear model, even though one of the independent variables carries a nonlinear pattern, namely x1 squared.
(Refer Slide Time: 10:54)
So, what I am going to do is prepare a new variable that is the square of x. For that purpose, I name it x_sq and set x_sq = x ** 2, which squares x. This squared value is going to be taken as another independent variable.
(Refer Slide Time: 11:11)
903
You see that x_new is np.column_stack of x and x squared. I want to have the constant, so x_new2 = sm.add_constant(x_new). Then model2 = sm.OLS(y, x_new2), we call model.fit, and then print the summary. Now what has happened? Look at this point: there are two independent variables, x1 and x2, where x1 is the months and x2 is the squared value. Look at the R-square. It was previously 0.781; now it has improved.
So, our model is good. Look at the significance of each variable: one is x1, the other is the squared value of x1, and both are significant, since both p values are less than 0.05. So, when you introduce the squared term, the model remains significant. But significance alone is not enough to decide whether the model is good or not; we should also go for error analysis.
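A sketch of the second-order fit, continuing with the x and y defined earlier:

import numpy as np
import statsmodels.api as sm

x_sq = x ** 2                            # new independent variable: months employed squared
x_new = np.column_stack((x, x_sq))
x_new2 = sm.add_constant(x_new)

model2 = sm.OLS(y, x_new2).fit()
print(model2.summary())                  # R-square improves; both x1 and the squared term are significant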
(Refer Slide Time: 12:26)
904
So, what is the model we have created? Sales = 45.3 + 6.34 × months - 0.0345 × months squared. Now, look at the standardized residual plot for the new model.
(Refer Slide Time: 12:41)
Now, look at the standardized residual plot for the new model. Here, I have computed the error terms for the model we have just created, predicted y hat, and drawn the graph of the standardized residuals. What is happening is that the curved pattern is gone, and you can see that the points now fall roughly in a rectangular band, so we can say our model has improved.
(Refer Slide Time: 13:15)
How do we interpret the second-order model? The standardized residual plot shows that the previous curvilinear pattern has been removed. At the 0.05 level of significance, our Python output shows that the overall model is significant because the p
905
value for the F-test is 0.000. Note also that the p value corresponding to the t-ratio of MonthsSq is 0.002, which is also significant. Hence, we can conclude that adding months squared as a new variable to the model is worthwhile. With an adjusted R-square of 88.6%, we should be pleased with the fit provided by this estimated regression equation, which captures the nonlinear relationship.
(Refer Slide Time: 14:22)
What is the meaning of linearity in a general linear regression model? In multiple regression analysis, the word linear in the term general linear model refers to the fact that beta0, beta1, ..., betap all have exponents of 1. What does this mean? Suppose we have a model y = beta0 + beta1 x1 + beta2 x2. When I say linear model, it is the coefficients beta0, beta1 and beta2 that enter linearly; we are not talking about the relation between x and y. It does not imply that the relation between y and x is linear. Indeed, we have seen one example of how a general linear model can be used to model a curvilinear relationship. Previously, we built the model y = b0 + b1 x1 + b2 x1², where the x1² term is nonlinear in x1, but we still call it a linear model because b0, b1 and b2 appear linearly. Now, I will run the Python code for the model which I have explained.
(Refer Slide Time: 15:55)
906
Now, in the Python environment, I will show you how to fit the curvilinear relationship. First, I have imported the necessary libraries: pandas, numpy, matplotlib and statsmodels. After running this, I import the data, which is stored in the Excel file named Reynolds. This is the data: MonthsEmployed is the independent variable and ScalesSold is our dependent variable.
First, we draw a scatter plot between these two variables. The scatter plot shows that there is a positive relationship between MonthsEmployed and ScalesSold, which implies that a person with more experience can sell more products. Now, we go for the simple linear regression equation: x is the MonthsEmployed column of tbl1, y is our dependent variable, the ScalesSold column of tbl1, and x2 = sm.add_constant(x) because I need a constant. Then model = sm.OLS(y, x2), we call model.fit, and print the summary.
The output shows, first, that the R-squared is 0.781. The p value of the F statistic is about 1 × 10^-5, which is very low, so the overall model is significant. Now look at the independent variable, months employed; the corresponding p value is 0.00, so MonthsEmployed is also a significant independent variable. Now we can write the regression equation: y = 111.27 + 2.37 × months employed.
Now everything looks fine, but that alone is not enough. Apart from this, we have to draw the residual plots and look at their behaviour; that will tell us whether our model is correct or not. So I am computing the residuals, and this is my residual value. Then, for the residual
907
plot, the x-axis will carry the predicted y values and the y-axis will carry the standardized residual values. This is my y hat, the predicted y value.
Now, I draw the scatter plot between y hat and the standardized residuals. Look at this: there is a curvilinear pattern, not a straight-line or rectangular one. So the plot is suggesting that instead of a linear relationship, a nonlinear or curvilinear relationship may be the better model for the given data. So what are we going to do?
We have one independent variable, and we are going to square it. You may ask why we square it; you could also go for the cube, that is power 3, or power 4 or power 5, but at the beginning we start with power 2. So this is my squared value, and this variable is taken as another independent variable. Now we have two independent variables: one is MonthsEmployed and the other is the square of MonthsEmployed.
Look at this new variable: x_sq is taken as another independent variable, and now we run the regression. What is happening? When you look at the output, the R-square was previously 0.781 and now it is about 0.9, so the R-square value has improved and the model is a good model. The adjusted R-square is also 0.886, so the model is good. Now look at the p value of the F statistic.
The p value is very low, less than 0.05, which means the overall model is significant. Now look at the two independent variables, x1 and x2, where x2 is nothing but the square of the first independent variable. The p value for the first variable is 0.00 and for the second variable, the squared term, it is 0.002. Since both p values are less than 0.05, we can conclude that both independent variables are significant.
It is not enough to check only the individual significance, the overall significance and the R-square; apart from these, we have to go for the residual plot. So, for our second model, we compute the standardized residuals and the predicted values of the new model, y hat 2, and plot them. Now the plot shows there is no curvilinear pattern.
908
So, we can say that for this data, instead of the simple linear relationship, the curvilinear relationship is the better model.
In this lecture, we have seen how to do a curvilinear regression model. In our previous
lecture, I have explained how to do simple linear regression and multiple linear regression
model, but in this lecture I have given you an example when we should go for curvilinear
relationship between x and y. We have taken one example, in that example, we have taken
our first simple linear relationship between x and y.
Then we looked at the residual plot, which suggested that we should go for a nonlinear relationship, so we squared the independent variable and constructed a new regression model. When we looked at the residual plot of the new model, we realized that the curvilinear model is the better model for the given data compared to the simple linear relationship.
In the next class, we will go for interaction: if there is interaction between two independent variables x1 and x2, how to build that kind of regression model. That we will see in the next class.
909
Data Analytics with Python
Prof. Ramesh Anbanandam.
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 45
Regression Analysis Model Building (Interaction) – II
In this lecture, we are going to see, when there are two independent variables that have some interaction, how to incorporate the effect of this interaction on the dependent variable. Before that, I will explain with an example what interaction is, then I will construct a regression model that incorporates this interaction. At the end, I will use Python to run this interaction regression model.
(Refer Slide Time: 00:55)
The agenda for this lecture is incorporating interaction among independent variables to the
regression model and Python demo.
(Refer Slide Time: 01:07)
910
First, we will see what interaction is. If the original dataset consists of observations for y and two independent variables x1 and x2, we can develop a second-order model with two predictor variables by setting z1 = x1, z2 = x2, z3 = x1², z4 = x2², and z5 = x1x2 in the general linear model equation. When you bring in this interaction, our regression equation becomes y = beta0 + beta1 x1 + beta2 x2 + beta3 x1² + beta4 x2² + beta5 x1x2 + error term.

In this second-order model, the variable z5, that is x1x2, is added to account for the potential effect of the two variables acting together. This type of effect is called interaction, so this term is called the interaction term.
(Refer Slide Time: 02:19)
911
We take one example problem to understand how to bring interaction into the regression model. This problem is taken from the book Statistics for Business and Economics, 11th edition. A company produces a new shampoo product, and the two factors believed to have the most influence on sales are unit selling price and advertising expenditure. So there are two variables that affect sales: one is unit selling price, the other is advertising expenditure.

These two variables are the independent variables, and sales is the dependent variable. So, in this problem setting, there is one dependent variable and two independent variables. To investigate the effect of these two variables on sales, prices of $2, $2.50 and $3 were paired with advertising expenditures of $50,000 and $100,000 in 24 test markets. I will show you this dataset.
(Refer Slide Time: 03:24)
In this dataset, you see that there are 3 levels of price: 2, 2.5, and 3. There are 2 levels of advertising expenditure: one is 50, the other is 100. Across the test markets there are 24 observations. The last column is sales.
(Refer Slide Time: 03:44)
912
Now, we have made a summary of the previous table. What the summary says is that when the price is $2 and the advertising expenditure is $50,000, the mean sales is 461. For another cell, look at this one: when the price of the shampoo is $2 and the expenditure is $100,000, the mean of all those combinations is 808, that is, mean sales of 808,000 units when the price is $2 and the advertising expenditure is $100,000.

How was it calculated? Wherever the price is 2 and the expenditure is 100, you look at the corresponding sales values; the average of those four elements is our 808. Every cell is simply the mean of the observations for that combination of levels. Similarly, we got 461 as the average of the sales values where the price is 2 and the advertising expenditure is 50. Now look at this table.
What it says is that, keeping the selling price at $2, when you increase the advertising expenditure, the mean value of sales increases. In the second case, keeping the price at $2.50, when you increase the expenditure from $50,000 to $100,000, the sales again increase, although by a smaller amount. That is one way of looking at it. Another way is to look at the difference in mean sales between the $50,000 and $100,000 levels, which we will show in the next slide; as the price increases, that difference, instead of increasing, starts decreasing.
(Refer Slide Time: 06:18)
913
This is the explanation for the previous slide. When the price of the product is $2.50, the difference in mean sales is 646,000 units minus 364,000 units, which is 282,000 units; for $3, the difference in mean sales is only 43,000 units. So what is happening is that the difference in mean sales is decreasing. Clearly, the difference in mean sales between advertising expenditures of $50,000 and $100,000 depends on the price of the product.
In other words, at higher selling prices, the effect of increased advertising expenditure diminishes. What we might expect is that, even when the price of the product increases, increasing the advertising expenditure should raise sales by about the same amount, but that is not what happens. As the selling price increases, the effect of advertising expenditure on sales diminishes. These observations provide evidence of interaction between the price and the advertising expenditure variables.
(Refer Slide Time: 07:15)
914
Let me interpret the mean unit sales against advertising expenditure. Note that the sample mean sales corresponding to a price of $2 and an advertising expenditure of $50,000 is 461,000 units, and the sample mean sales corresponding to a price of $2 and an advertising expenditure of $100,000 is 808,000 units. I am referring to the 461 and the 808. Hence, with the price held constant at $2, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 808,000 units minus 461,000 units, that is, 347,000 units. We will go to the next column.
go to the next column.
(Refer Slide Time: 08:03)
When the price of the product is kept 2.50 dollars, the difference in mean sale is 282,000
units. Finally, when the price is 3 dollars, the difference in mean sale is 43,000 units. Clearly,
915
the difference in mean sales between the advertising expenditure of 50,000 dollars and
100,000 dollars depends on the price of the product. In other words, at higher selling prices,
the effect of increased advertising expenditure diminishes.
What happens is that, as the price increases, we would expect the extra advertising expenditure to keep raising sales by the same amount, but instead that gain starts decreasing; the effect of the expenditure diminishes. These observations provide evidence of interaction between the price and advertising expenditure variables.
(Refer Slide Time: 08:57)
First, we will look at the Python code for this. I have imported the data: the price has 3 levels, 2, 2.5 and 3, the advertising expenditure has 2 levels, 50 and 100, and the sales are in units, 478 and so on. When you plot the scatterplot, you see the three price levels. What it says is that as the price of the product increases, the sales do not increase; you see a decreasing trend. If the advertising spend alone drove sales, they should have kept increasing, so the effect of the amount spent on advertising weakens as x1, the price, increases.
(Refer Slide Time: 09:47)
916
So, this graph shows that there is an interaction effect. This scatterplot is against advertising expenditure, which has two levels, $50,000 and $100,000, and the y-axis is the number of units sold.
(Refer Slide Time: 10:00)
From our summary table and our scatterplots, we have realized that there is interaction between x1 and x2. When interaction between two variables is present, we cannot study the effect of one variable on the response variable y independently of the other variable. In other words, a meaningful conclusion can be developed only if we consider the joint effect that both variables have on the response.
917
So, the joint effect involves both x1 and x2. We realized from the summary table that there is an interaction between the two variables x1 and x2. Here y is the sales in units, x1 is the price with three levels, and x2 is the advertising expenditure with two levels.
(Refer Slide Time: 10:54)
Now, for the estimated regression equation, we use a general linear model involving 3 independent variables z1, z2 and z3, where z1 is x1, z2 is x2, and z3 is the interaction variable, x1 multiplied by x2. So, apart from x1 and x2, we have to introduce another variable, the product of the two variables x1 and x2.
(Refer Slide Time: 11:21)
Now, we will create a new variable z3, the product of z1 and z2. The data for the PriceAdv independent variable is obtained by multiplying each value of the
918
price by the corresponding value of advertising expenditure. So the two variables z1 and z2 have to be multiplied, and that product is our new variable.
(Refer Slide Time: 11:43)
After multiplying, this is the output of our interaction model. Look at the R-square: it is 0.978. Here x1 is one independent variable, x2 is the other independent variable, and x3 is the interaction term.
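A possible sketch of this interaction model; the file name Tyler.xlsx and the column names Price, AdvExp and Sales are assumptions based on the narration:

import numpy as np
import pandas as pd
import statsmodels.api as sm

tbl = pd.read_excel("Tyler.xlsx")        # hypothetical file name taken from the demo
z1 = tbl["Price"]
z2 = tbl["AdvExp"]
z3 = z1 * z2                             # interaction term: price times advertising expenditure

X = sm.add_constant(np.column_stack((z1, z2, z3)))
model_int = sm.OLS(tbl["Sales"], X).fit()
print(model_int.summary())               # x1, x2 and the interaction x3 should all be significant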
(Refer Slide Time: 12:53)
So how can we write the regression equation for this? Sales = -276 + 175 × price + 19.7 × advertising expenditure - 6.08 × (price × advertising expenditure), the last term being the interaction variable x3. Look at the p value of the F statistic: it is very low, so the overall model is significant. For all the variables, x1, x2 and the interaction variable, the p values are all less than 0.05, so each independent variable is significant.
919
(Refer Slide Time: 12:46)
So what is the new model now? Sales = -276 + 175 Price + 19.7 AdvExp - 6.08 PriceAdv, where PriceAdv is our interaction term. How do we interpret this?
(Refer Slide Time: 13:05)
Because the model is significant, the p-value for the F test is 0.0000 and the p value
corresponding to the t test PriceAdv is 0.00, we conclude that interaction is significant given
the linear effect of the price of the product and the advertising expenditure. Thus, this
regression results shows that the effect of advertising expenditure on sales depends on the
price.
(Refer Slide Time: 13:18)
920
So far, we have done transformations only on the independent variables. For example, in y = b0 + b1x1 + b2x2, suppose x2 is a categorical variable that can take only two values, say 0 and 1 for gender. We handled that by introducing a dummy variable and fitting the model. Now, there may be situations where your y variable also has to be transformed.
(Refer Slide Time: 14:23)
Suppose miles per gallon is your y variable and weight is your independent variable. If you do a regression analysis for this dataset, you see that there is a negative relationship: when the weight increases, the miles per gallon decreases, and the scatterplot shows this negative relationship. When you look at the regression model, it is y = 56.0957 - 0.0116 × weight, and it is significant.
(Refer Slide Time: 14:40)
921
Now, look at the standardized residual plot. First we compute the residuals and standardize them, then we compute the predicted y values, and we draw a graph of the standardized residuals against the predicted y hat. When we look at this, you see a cone (funnel) shape. What is happening is that as the value of x increases, the variance is not constant, and this violates our model assumptions.
What is the assumption? The variance of the error term should be the same for all values of x, but here, as the value of x increases, the variance also increases, so the data do not fit the assumptions of the regression equation. We are therefore going to take the log of y and fit log(y) = b0 + b1x1. The independent variable is left undisturbed.

Only the dependent variable is transformed by taking its log. The purpose of taking the log is that the residuals, instead of showing this conical shape, may fall in a roughly rectangular band, which means the variance of the error terms becomes the same.
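A minimal sketch of the log transformation; the file name and the column names (Weight, MPG) are assumptions, since only a screenshot is shown in the lecture:

import numpy as np
import pandas as pd
import statsmodels.api as sm

cars = pd.read_excel("mpg_data.xlsx")    # hypothetical file name
xw = sm.add_constant(cars["Weight"])     # the independent variable is left as it is
log_y = np.log(cars["MPG"])              # log of the dependent variable

log_model = sm.OLS(log_y, xw).fit()
print(log_model.summary())               # residuals of this fit should no longer fan out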
(Refer Slide Time: 16:02)
922
First, I have taken the log of all the dependent variable values; call this log Y. This log of Y is taken as the new dependent variable. After substituting it and refitting with weight as the independent variable, the R-square has increased, the F statistic is good, and the model is okay. Now, we will look at the residual plot for this model.
(Refer Slide Time: 16:24)
When you go for residual plot against y hat, this is our standardize residual, so what is
happening.
(Refer Slide Time: 16:34)
923
Now there is no conical shape; a rectangular band appears instead. But you should be very careful while interpreting the answer, because the fitted value is not the actual y, it is the log of y. When you substitute values into the equation, the miles per gallon estimate is obtained by finding the number whose natural logarithm equals the fitted value. For example, if you substitute a weight of 2500, you get a fitted log of y, that is a log of miles per gallon, of about 3.2675.
If you want to know the actual value, you take e to the power 3.2675. To bring it back to the original scale, you use the exponential function on a calculator or in Python; raising e to the power 3.2675 gives about 26.2 miles per gallon, which is the value on the original y scale.
(Refer Slide Time: 17:37)
924
There are some more nonlinear models; let me explain how to handle one of them. Suppose there is an exponential relationship E(y) = beta0 × beta1^x; for example, with beta0 = 500 and beta1 = 1.2, E(y) = 500 × (1.2)^x. For this kind of model, you take the log of both sides, which gives log E(y) = log beta0 + x log beta1. This can be written as y' = beta0' + beta1' x, where y' is log E(y), beta0' is log beta0, and beta1' is log beta1. This equation can be estimated as an ordinary regression equation, but we should be very careful while interpreting it; remember that the results have to be brought back to the original scale.
(Refer Slide Time: 18:37)
Now, we are going to model the interaction among the independent variables with the help of this Python code. I have imported the data; the file name is Tyler, and this is a portion of the file. First, we draw the scatterplot. What is happening here is that as the price of the product increases, the y variable, the number of units sold, decreases. So, this plot suggests that the price has an effect on sales.

Now look at the other independent variable, advertising expenditure. This plot shows that as the advertising expenditure increases, the units sold also increase, but not in a clean linear way, because there seems to be some other variable affecting the effect of advertising expenditure, and that variable is nothing but the price. From scatterplot number one and scatterplot number two, we realize that there is an interaction effect.
925
So the two variables z1 and z2 are multiplied. Our z1 variable is the advertising expenditure and our z2 variable is the price, so the new variable is z1 multiplied by z2. We compute this, and the new variable is taken as another independent variable. Now there are 3 variables: one for advertising expenditure, another for price, and the third is the interaction between these two.

When you run this model, we get all three variables, x1, x2 and x3, where x3 is our interaction variable, and all are significant. So we can say that there is an interaction effect between x1 and x2. Our R-square is better, 0.978, the p value of the F statistic is also very low, and the individual significance of each independent variable is below 0.05, so all the variables are significant.
Now, in this class I explained one more problem, that is, how to transform our dependent variable. I have imported the necessary libraries along with this data file. Here the weight is the independent variable and the miles per gallon is the dependent variable. When you draw the scatterplot between these two, there seems to be a negative relationship. When you fit a simple linear regression with weight as the independent variable x and miles per gallon as y, we get this output.
Even though the model is significant, when you go for the residual plot of standardized residuals against predicted values, there is a conical shape. This implies that as the value of x increases, the variance of the error term is not the same; it keeps increasing, which is a violation of the regression assumptions. To compensate for this, we do a log transformation of our dependent variable and then fit the regression equation again.
Look at this third output: the new dependent variable is the log of y, while the independent variable is unchanged. For this model, we again draw the standardized residual plot, and now there is no conical pattern. So we can say that the log transformation of the dependent variable is appropriate, and we should go for the log transformation of our dependent variable.
926
In this lecture, we have seen how to incorporate interaction among variables into our regression model. We took one sample example; when we prepared the summary table, we realized that there is an interaction between the two variables, so we took the product of the two variables, which introduces a third variable, built a multiple regression model, and found that the interaction is significant.
In another problem in this class, we saw that while we usually transform the independent variables, sometimes we need to transform the dependent variable as well. The transformation we used was the log of the y values. Before taking the log of y, we saw that the variance of the error term was not constant; after the log transformation, the variance of the error term became constant, so we accepted that taking the log of our dependent variable is correct. Thank you.
927
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 46
Chi-Square Test of Independence-I
Welcome students. Today we are going to see a new topic that is chi-square test. Chi-square
test has 2 applications, one is to test the independence, second one is to test the goodness of
fit. In this class, we are going to see the test of independence.
(Refer Slide Time: 00:44)
The agenda for this class is to understand the chi-square test of independence. Before going to the topic, let us see when you would go for a chi-square test. At the beginning of the course I explained the different types of data: nominal, ordinal, interval and ratio. Whenever the data is nominal or ordinal, you have to go for the chi-square test, because for data which are nominal in nature you cannot use a Z test, a t test, ANOVA, or even regression.
Now I will explain how this test is connected with the other tests. For example, we have studied the one-sample t test and the one-sample Z test: there was one sample, and with the help of x bar we made an inference about mu. After that we saw the two-sample tests: whenever there are 2 populations, population 1 and population 2, we have x1 bar and x2 bar.
928
Here, what did we compare? The null hypothesis was H0: mu1 = mu2 against H1: mu1 ≠ mu2; we are comparing 2 populations at a time. For that we used the two-sample Z test and, similarly, the two-sample t test. Now suppose there are 3 populations, population 1, population 2 and population 3, and we want to compare the means of these populations.
There may also be a situation where you have to compare the proportions of more than 2 samples; in that case we should go for the chi-square test. If you are comparing means across more than 2 populations, you should go for ANOVA; if you are comparing proportions across more than 2 populations, you should go for the chi-square test. That is the logic of using the chi-square test.
Even if there are only 2 sample means, instead of using the two-sample Z test you can use ANOVA and you will get the same result. Similarly, if you want to compare 2 sample proportions, instead of using the two-sample proportion test you can do the chi-square test and you will get the same answer, because ANOVA and the chi-square test are the generalized versions. So even when comparing only 2 groups, you can use ANOVA.
(Refer Slide Time: 04:31)
929
Now we will go to today's topic, the test of independence. This test is used to analyze the frequencies of two variables with multiple categories to determine whether the two variables are independent or not. So whenever there are qualitative variables, whenever the data is nominal, we should go for the test of independence. Here we are going to have two categorical variables arranged in what is called a contingency table: one variable in the rows and one in the columns. We are going to see whether these are dependent or independent, and I will explain this next with the help of an example.
(Refer Slide Time: 05:19)
For example, suppose you have conducted a questionnaire; this is an investment example. You asked, in which region of the country do you reside, and there were 4 options: Northeast, Midwest, South and West. You also asked another
930
question: which type of financial investment are you most likely to make today? The options were stocks, bonds and treasury bills.
This dataset I have captured in the form of a table, called a contingency table. In the rows I have captured the geographic regions, Northeast, Midwest, South and West, and in the columns I have captured the type of financial investment they are going to make, labelled E, F and G. Suppose I want to test whether there is any connection, any dependency, between the geographic region where they reside and the type of investment they are willing to make.
So there are 2 variables, geographic region and type of financial investment, and we ask whether they are dependent or independent. This kind of problem can be solved with the help of the chi-square test. The null hypothesis is that the geographic region and the type of investment they make are independent, that there is no connection. The alternative hypothesis is that they are not independent, that there is a dependency.
(Refer Slide Time: 07:11)
This is the theory behind the test of independence. Suppose A and F are independent. What is A here? A means the first option, that the respondent belongs to the Northeast region, and F means he is willing to invest in bonds. If A and F are independent events, we can write P(A ∩ F) = P(A) · P(F). Then we find P(A): P(A) = nA / N, where capital N is the sum of all the elements.
931
Similarly, P(F) = nF / N. So the probability of the intersection is P(A ∩ F) = (nA / N) × (nF / N). This is in terms of probability, but if you want it in terms of frequencies, it has to be multiplied by capital N. So the expected frequency of the cell (A, F) is e_AF = N × P(A ∩ F) = (nA × nF) / N.

In other words, the row sum multiplied by the column sum, divided by the total number of elements, gives the expected value of each cell. One more thing: we multiply by N because P(A ∩ F) alone gives only a probability; to get the answer in terms of frequency it has to be multiplied by N.
(Refer Slide Time: 09:21)
So the expected frequency is e_ij = (n_i × n_j) / N, where i represents the row, j represents the column, n_i is the total of row i, n_j is the total of column j, and N is the total of all frequencies. This is our expected frequency. Obviously there is one more frequency, the observed frequency, and that is given in the problem itself.
(Refer Slide Time: 09:52)
932
Then how do we find the test statistic? The calculated chi-square value is chi-square = Σ Σ (fo - fe)² / fe, where fo is the observed frequency and fe is the expected frequency. Note that the square applies only to the numerator, not to the expected frequency in the denominator. Another important thing is the degrees of freedom.

The degrees of freedom is (r - 1) × (c - 1), which is the number of cells whose values can be chosen independently; r represents the number of rows and c the number of columns.
(Refer Slide Time: 10:40)
We will take one example using the theory which I have taught you so far. We will solve a
problem of test of independence.
933
(Refer Slide Time: 10:49)
Before starting any hypothesis testing problem, the first step is to write the null hypothesis. The null hypothesis is that the type of gasoline is independent of income, and the alternative hypothesis is that the type of gasoline is not independent of income. Generally, there are different types of gasoline, and our assumption is that people with higher income will go for better-quality fuel. So we are going to test whether there is any connection, any dependency, between their level of income and the type of gasoline they prefer.
(Refer Slide Time: 11:33)
This is our problem setup. In the rows I have captured their level of income: less than $30,000, then $30,000 to $49,999, then $50,000 to $99,999, and finally $100,000 or more. In the columns I have captured what type of gasoline, what type of fuel, they are using:
934
regular, premium, or extra premium. Generally, the assumption is that as the income level increases, people may go for better-quality fuel.

So our assumption is that there may be a dependency between their level of income and the type of fuel they use. Here there are 4 rows and 3 columns, so r = 4 and c = 3.
(Refer Slide Time: 12:40)
First we will find the degrees of freedom. Assume that alpha is 1 percent. The degrees of freedom is (rows - 1) × (columns - 1); there are 4 rows, so 4 - 1 = 3, and 3 columns, so 3 - 1 = 2, and 3 × 2 = 6. The chi-square distribution is a right-skewed distribution, and for 6 degrees of freedom, when the right-side area is 1 percent, the value obtained from the table is 16.812. In the next slide I will show you how this value can be found with the help of Python.

That was the value we got from the table. Next, we have to calculate the chi-square value using the formula: the sum of (observed frequency - expected frequency) squared divided by expected frequency. Using this formula, we find the calculated chi-square. If that value is greater than 16.812 we will reject our null hypothesis; if it is less than 16.812 we will accept our null hypothesis.
(Refer Slide Time: 14:06)
935
As I told you, with the help of Python we import pandas, import numpy, and from scipy import stats; then stats.chi2.ppf is called with 0.99, because for a 1 percent right-tail area the cumulative probability is 0.99, and with 6 degrees of freedom. This gives 16.811, essentially the same value we got from the table.
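The critical value can be reproduced with a couple of lines:

from scipy import stats

print(stats.chi2.ppf(0.99, 6))   # about 16.81: alpha = 0.01 (right tail), 6 degrees of freedom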
(Refer Slide Time: 14:27)
This is the data which is given; the values inside the cells are called the observed frequencies. What is the meaning of this 85? Among those having income less than $30,000, 85 have gone for the regular type of gasoline. And what is the interpretation of this 16? Among those having income less than $30,000, only 16 people have gone for the premium type.
You see that as the level of income increases, the number of people who have gone for extra premium also increases. It seems that there is a dependency
936
between their level of income and the type of gasoline they choose. So this is the given data which we have captured. The first step is to find the row totals: the first row total is 107, the second is 142, the third is 73, and the fourth is 63. Then we find the column totals: the first column total is 238, the second is 88, and the third is 59. The values that are given are called the observed frequencies.
(Refer Slide Time: 15:47)
Now we go for the expected frequencies; here the expected frequency values are given in brackets. For example, how did we get this 66.15? This 66.15 is nothing but the row total 107 multiplied by the column total 238, divided by 385. That value, 66.15, is given in the bracket, so the values given in brackets are the expected frequencies.

The values not in brackets are the observed frequencies, the data given to us. For the second cell, how did we get 24.46? The row total 107 multiplied by the column total 88, divided by 385, gives 24.46. For the third cell, the row total 107 is multiplied by the column total 59 and divided by 385. In this way we fill in all the cells shown in brackets. Now we will go for the calculated chi-square value.
(Refer Slide Time: 17:01)
937
Now we will find the calculated chi-square value. As we know, the formula is the sum of (observed frequency - expected frequency)² divided by expected frequency. For the first cell of the first row this is (85 - 66.15)² / 66.15. For the second cell of the first row, 16 is the observed value and 24.46 is the expected value, so the term is (16 - 24.46)² / 24.46. If you keep extending this up to the last cell and sum all the terms, it comes to 70.75.
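The expected frequencies, and hence each cell's contribution, can be reproduced from the row and column totals given above; only the first observed cell (85) is checked here, since the full observed table is on the slide:

import numpy as np

row_totals = np.array([107, 142, 73, 63])   # income categories
col_totals = np.array([238, 88, 59])        # regular, premium, extra premium
N = row_totals.sum()                        # 385

expected = np.outer(row_totals, col_totals) / N
print(expected[0, 0])                               # about 66.15, matching the slide
print((85 - expected[0, 0]) ** 2 / expected[0, 0])  # first cell's contribution to chi-square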
(Refer Slide Time: 17:44)
Previously we saw that the critical value for alpha = 0.01 is 16.812. Our calculated value is 70.75, which lies to the right of the critical value, so we reject our null hypothesis. What was the null hypothesis? That their level of income and the type of fuel they choose are independent. So when we reject it, we conclude that there is a dependency between their level of income and the type of fuel they choose. Generally, the assumption is that
938
as the level of income increases, people will go for a higher quality of fuel; that was the conclusion.
(Refer Slide Time: 18:28)
The table which we saw previously is called a contingency table. It is useful in situations involving multiple population proportions and is used to classify sample observations according to 2 or more characteristics. Another name for the contingency table is the cross-classification table.
(Refer Slide Time: 18:52)
We will solve one more example here, similar to our previous problem. Suppose we are going to compare hand preference versus gender. The dominant hand may be the left hand for some people and the right hand for others, and the gender is male versus female. The hypothesis we are going to test is whether there is any connection, any
939
dependency, between gender and the dominant hand. We have 2 categories for each variable, so this is called a 2 × 2 table. We examine a sample of 300 college students, and this was the outcome.
(Refer Slide Time: 19:35)
In the rows we have captured whether the left hand or the right hand is dominant, and in the columns we have captured the gender, female and male. There are 300 observations: out of 300, 120 are female and 180 are male; out of 300, 36 are left-handed and 264 are right-handed. So the sample results are organized in a contingency table. Of the 120 females, 12 were left-handed, and of the 180 males, 24 were left-handed.
(Refer Slide Time: 20:20)
So what is our hypothesis? H0: π1 = π2, that is, the proportion of females who are left-handed is equal to the proportion of males who are left-handed. The left hand is taken as
940
the reference, and we are going to compare the left-handed people with respect to their gender. Taking the left hand as the dominant hand, we want to find out whether there is any connection between hand dominance and gender; that is the null hypothesis.
The null hypothesis is that the proportion of females who are left-handed is equal to the proportion of males who are left-handed. If you accept the null hypothesis, there is no connection between left-hand dominance and gender. What is the alternative hypothesis? The two proportions are not the same, that is, hand preference is not independent of gender. So if H0 is true, the proportion of left-handed females should be the
same as the proportion of left-handed males, and we can say there is no dependency. Both proportions should then equal the proportion of left-handed people overall. Instead of taking the left hand as the reference you can take the right hand as well; the result will be the same.
(Refer Slide Time: 21:47)
We have seen this formula: the chi-square test statistic is the sum of (observed frequency - expected frequency)² divided by expected frequency, where fo is the observed frequency and fe is the expected frequency. Here we have a 2 × 2 table, so the degrees of freedom is (2 - 1) × (2 - 1) = 1. There is an important assumption: each cell in the contingency table should have an expected frequency of at least 5. We have to make sure of that; if it is not satisfied, we have to collapse two or three categories so that the expected frequency becomes at least 5.
(Refer Slide Time: 22:36)
941
The chi-square test statistic approximately follows the chi-square distribution with one degree of freedom. The decision rule is: if the chi-square value is greater than the critical value, we reject the null hypothesis; otherwise we do not reject it.
(Refer Slide Time: 22:54)
So these are the observed frequencies. Now we have to find the expected frequency for each cell. The first one is 36 multiplied by 120 divided by 300, which is 14.4. The next is 36 multiplied by 180 divided by 300, which is 21.6. In the same way we get 264 multiplied by 120 divided by 300, which is 105.6, and 264 multiplied by 180 divided by 300, which is 158.4. So we get the expected value for every cell.
(Refer Slide Time: 23:50)
942
Next, we sum (observed frequency - expected frequency)² / expected frequency over all the cells: (12 - 14.4)² / 14.4 + (108 - 105.6)² / 105.6 + (24 - 21.6)² / 21.6 + (156 - 158.4)² / 158.4, where the square applies only to the numerator. This gives 0.7576.
(Refer Slide Time: 24:28)
Now we compare the calculated chi-square value with the table value. The table value for one degree of freedom at alpha = 0.05 is 3.841, and our calculated value lies on the acceptance side, so we do not reject the null hypothesis. The rule is: if the chi-square value is greater than 3.841, reject H0; otherwise do not reject it. Here the chi-square value, 0.7576, is less than 3.841.
943
We do not reject H0 and conclude that there is insufficient evidence that the 2 proportions are different; that is, we treat them as the same, π1 = π2.
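The same 2 × 2 test can be reproduced with scipy; correction=False gives the uncorrected statistic that matches the hand calculation (the counts are taken from the example, arranged here with the genders in the rows):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[12, 108],    # female: left-handed, right-handed
                     [24, 156]])   # male:   left-handed, right-handed
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)          # chi2 is about 0.7576 with 1 degree of freedom
print(expected)                    # expected counts 14.4, 105.6, 21.6, 158.4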
(Refer Slide Time: 25:17)
So far we have compared only 2 proportions, but there may be situations where we have to compare more than 2 proportions, for example 3; that case we will see in this problem. We extend the chi-square test to the case of more than 2 independent populations: the null hypothesis is π1 = π2 = π3, and the alternative hypothesis is that not all of the proportions are equal.
(Refer Slide Time: 25:44)
The other formulas are the same as usual: fo is the observed frequency, fe is the expected frequency, and the chi-square statistic is the sum of (fo - fe)² / fe. The degrees of freedom is (number of rows - 1) multiplied by (number of columns - 1).
944
(Refer Slide Time: 25:58)
We will see one example the sharing of patients records is a controversial issue in health care.
A survey of 500 respondents asked whether they objected to their record being shared by
insurance companies, pharmacies and by medical researchers. The results are summarized on
the following table. So there are 3 category now one is whether they are objected to share the
data for insurance companies, pharmacies and medical researcher.
(Refer Slide Time: 26:29)
The table shows the following: 410 respondents objected to sharing their data with insurance companies, 295 objected to sharing their data with pharmacies, and 335 objected to sharing their data with medical researchers. Now we have to find out whether the proportion of objections to sharing data is the same across all 3 categories or not. So we can call these π1, π2 and π3.
945
The null hypothesis will be π1 = π2 = π3, which means that people object to sharing their data in the same proportion irrespective of what kind of organization asks for it; that is our null hypothesis. In other words, there is independence between the objection to sharing their data and the type of organization asking for the data. Here, I have computed the row sums, the first row sum and the second row sum, and then the column sums for the 3 columns; each column total is 500. The observed frequencies for the objections are 410, 295 and 335.
(Refer Slide Time: 27:54)
Next, from the observed frequencies I have to find the expected frequencies. For example, for the first 'object' cell it is the row total 1,040 multiplied by the column total 500 divided by 1,500, which is 346.667. Similarly, for the next 'object' cell it is again 1,040 multiplied by 500 divided by 1,500, about 346.667, while for the 'do not object' cells the row total 460 is used in the same way. This is how you find all the expected frequencies.
(Refer Slide Time: 28:40)
946
Now I have given the final answers. For each cell we take (observed frequency - expected frequency)² divided by expected frequency: for the first cell it is about 11.57, then 7.77, 0.392, 26.159, 17.409, and 0.88. When you add them, the value is about 64.12.
(Refer Slide Time: 29:05)
So, the null hypothesis, as I told you, is π1 = π2 = π3, and the alternative hypothesis is that not all of the πj are equal, j = 1, 2, 3. The decision rule is: if the calculated chi-square value is greater than the table value, reject H0; otherwise do not reject it. The value we get from the table is 5.991. For the degrees of freedom, note that there are 2 rows, so 2 - 1, and 3 columns, so 3 - 1.
947
So 1 multiplied by 2 gives 2 degrees of freedom. For 2 degrees of freedom and the given alpha value, the chi-square value from the table is 5.991, but our calculated value is about 64.12. Since it is bigger than the table value, we reject H0 and conclude that the proportion of respondents who object to their records being shared differs for at least one of the 3 organizations.
When we reject the null hypothesis, we conclude that the proportions are not all equal; somewhere they differ, so not all of the πj are equal. In this lecture we started a new topic, the chi-square test. The chi-square test has two applications: the test of independence and the goodness-of-fit test. Today we started with the test of independence, and I have taken a small example.
I have explained, with the help of an example, how to carry out the test of independence. In the next class we will take a small problem and, with the help of Python, I will explain how to construct the contingency table and then how to do the chi-square test of independence using Python. Thank you.
948
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 47
Chi-Square Test of Independence-II
In the last class we started the chi-square distribution. As I told you, the chi-square distribution has two applications: one is to test independence, the second is goodness of fit. In the previous class we saw one example of how to test the independence of two variables. In this class we will continue with that: we will take another example and solve it with the help of Python, and then we will go to the other application of the chi-square test, the goodness of fit.
(Refer Slide Time: 00:56)
So the agenda for this lecture is using Python to test the independence of variables. We will then start another topic, the application of the chi-square distribution to testing goodness of fit. First we will test for the Poisson distribution.
(Refer Slide Time: 01:13)
949
This example is a record of 50 students of, say, ABN school, taken at random; the first 10 entries are shown. Why have I taken this example? In the previous lecture we got the contingency table directly. Once the contingency table is given, finding the observed and expected frequencies and then the calculated chi-square value is very simple.
But in practice the contingency table will not be given to you directly; only an Excel data file or some file from a database will be given. You have to form the contingency table, and only after forming it should you go for the chi-square test. So in this example I have taken one hypothetical problem with seven variables. The first variable is academic ability, aa.
The second variable is parent education, the third is student motivation, the fourth is advisory evaluation, the next is religion, then gender, and the last is community type. This dataset is meant to identify which variables affect the academic performance of a candidate. Here academic ability is nothing but a test conducted out of 100 marks.
So this is the mark obtained by each candidate; I have shown only 10 records here, and all 50 records will appear in the Python demo. The first variable is the marks obtained by a student, called academic ability: the higher the marks, the higher the academic ability. The next variable is parent education; the question asked to the parent is how many years of schooling they had.
950
Generally parent education is a categorical variable, but instead of capturing it categorically we have recorded it as an interval, continuous-type variable: we asked the parent how many years they spent in schooling, say 5 years, 6 years, and so on. So parent education becomes a continuous variable. Then comes student motivation.
We asked the student: if you want to get more marks, are you willing to spend extra time on study? Here 0 means no, 1 means not decided, and 2 means yes. Then comes advisory evaluation. The advisor is a kind of faculty advisor who judges whether the student will pass or perform well in the examination.
For advisory evaluation, 1 means the advisor expects the student will not do well, 0 means not decided, and 2 means the advisor expects a good performance in the examination. Then r is religion; there are three categories of religion, coded 0, 1, and 2. This is the explanation of the variables.
(Refer Slide Time: 04:16)
For example, as I told you, the first column is respondent_number, a kind of registration number; aa is academic ability, pe is parent education, sm is student motivation, ae is advisory evaluation, r is religion, and g is gender.
(Refer Slide Time: 04:35)
951
I have brought a screenshot of Python. We imported pandas and numpy and then imported the dataset; this was the dataset. You can see the different variables: academic ability, parent education, student motivation, advisory evaluation, religion, and gender. The data are stored in the academic ability csv file, and this was the output after reading it.
(Refer Slide Time: 05:05)
Now we set up the hypothesis: test whether gender and student motivation are independent. We are going to see whether there is any connection between the level of motivation and gender. In many cases we presume that girl students are more highly motivated than boy students; that is a perception, and we will test whether there is actually any connection between gender and motivation.
952
So the null hypothesis is that gender and student motivation are independent, and the alternative hypothesis is that they are not independent. The hypothesis we are going to test, then, is that gender and student motivation are independent; that is our null hypothesis.
(Refer Slide Time: 05:54)
This is a very important command for forming our contingency table, otherwise called a cross table, between gender and student motivation. You see pd.pivot_table with the data frame acad: g is gender, sm is student motivation, index=g means g appears in the rows, columns=sm means sm appears in the columns, and the aggregate function is len. After you run this we get a contingency table: gender in the rows and student motivation in the columns.
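A minimal sketch of that step, assuming the 50-record file is available as acad.xlsx with columns named g, sm, and respondent_number (the file name and column names are assumptions based on the transcript):

    import pandas as pd

    # load the student records (file name assumed for illustration)
    acad = pd.read_excel("acad.xlsx")

    # cross-tabulate gender (g) against student motivation (sm);
    # counting respondents in each cell gives the observed frequencies
    obs = pd.pivot_table(acad, values="respondent_number",
                         index="g", columns="sm", aggfunc=len)
    # pd.crosstab(acad["g"], acad["sm"]) would produce the same table
    print(obs)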
(Refer Slide Time: 06:35)
953
The previous table, shown here, I have simply written out in the presentation. In the rows, 0 means male and 1 means female; that is the coding I used for gender. Student motivation has three levels; the question asked is whether you are willing to spend extra time to get more marks, with 0 = disagree, 1 = not decided, 2 = agree. From the row sums there are 29 male students and 21 female students.
From the column sums, 14 students said they will not spend extra time, 22 said they have not decided whether to spend extra time, and 14 agreed that they will spend extra time to study. So there is one variable in the rows and another variable in the columns. The null hypothesis is that gender and student motivation are independent. The cell counts, such as 10, 13, and 6, are the observed frequencies. Now we have to find the expected frequency for each cell.
(Refer Slide Time: 07:54)
The expected frequency, as I told you in the previous class, is the row total multiplied by the column total divided by the grand total. For the first cell it is 29 multiplied by 14 divided by 50, so 8.12 is the expected frequency. Similarly for the other cells we get 12.76, 8.12, 5.88, 9.24, and 5.88. One important assumption is that every expected frequency should be more than 5; since all the cells satisfy this, we can continue with the calculation.
(Refer Slide Time: 08:39)
954
So I have written fo for the observed frequency and fe for the expected frequency; for each cell we now know both.
(Refer Slide Time: 08:51)
Now we go for the chi-square calculation. We have seen the formula: the calculated chi-square value is the sum over all cells of (observed frequency – expected frequency)² divided by the expected frequency. For the first cell it is (10 – 8.12)² / 8.12 = 0.435. Doing this for all the cells and summing, the calculated chi-square value is 2.365.
(Refer Slide Time: 09:26)
955
The Python code computes the chi-square value; for that purpose we import chi2_contingency. You see chi2, p, dof, tbl = chi2_contingency(obs) with the observed values; when you type chi2 you get 2.364, and the p value is 0.30. Since the p value is more than 0.05, we accept the null hypothesis.
When we accept the null hypothesis we conclude that gender and the level of motivation are independent. You can get the degrees of freedom as well: as I told you, they are (number of rows – 1) multiplied by (number of columns – 1), that is (2 – 1)(3 – 1) = 2. This is the degrees of freedom.
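A minimal sketch of this step; the second row of the table (4, 9, 8) is derived from the row and column totals quoted above rather than read off a slide.

    from scipy.stats import chi2_contingency

    # observed contingency table: rows = gender (0 male, 1 female),
    # columns = student motivation (0, 1, 2)
    obs = [[10, 13, 6],
           [ 4,  9, 8]]

    chi2_val, p, dof, expected = chi2_contingency(obs)
    print(chi2_val)   # about 2.36
    print(p)          # about 0.30, more than 0.05, so independence is not rejected
    print(dof)        # (2 - 1) * (3 - 1) = 2
    print(expected)   # 8.12, 12.76, 8.12 and 5.88, 9.24, 5.88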
(Refer Slide Time: 10:20)
956
Then, when you type tbl, you get the table of expected frequencies: 8.12, 12.76, 8.12, and so on, the same values we computed manually, obtained directly with the help of Python. So far I have shown screenshots of the Python program and explained how to form a contingency table and then how to do the chi-square test. Now I will go to the Python prompt and explain how to input the data and carry out the chi-square test.
(Refer Slide Time: 11:05)
I am going to explain how to form a contingency table from the Excel file and then do the chi-square test. I have imported pandas as pd and numpy as np, and I have stored my dataset in the file named acad.xlsx. First I have to run this.
(Refer Slide Time: 11:25)
957
Now let us see the dataset. The index runs from 0 to 49, so there are 50 records, with respondent number, academic ability, parent education, student motivation, advisory evaluation, religion, gender, and community. We are not going to consider all the variables for our calculation; we will consider only gender and student motivation.
(Refer Slide Time: 11:53)
Then we form the contingency table. This is obs = pd.pivot_table(acad, ...): acad is the data frame, the index is g, the gender value I want in the rows, and the column is sm, the student motivation value. When I run this, the output directly gives the contingency table.
958
When I explain the theory, the contingency table is given to you, but many times that is not the case: the data may be in some other format and you have to create the contingency table before doing the chi-square test. This Python command helps us form the contingency table and saves a lot of time. So this was the contingency table.
The value in each cell represents an observed frequency. For example, the 10 is the count of male students (gender 0) whose level of motivation is 0; that 10 is an observed value. Then we import chi2_contingency and write chi2, p, dof, tbl = chi2_contingency(obs), where obs is the variable in which the contingency table is stored.
When you run this, the chi-square value is 2.36; that is the calculated chi-square value. Then we can check the p value, which is 0.30. Since the p value is more than 0.05, we accept our null hypothesis, and accepting the null hypothesis means concluding that gender and student motivation are independent.
There is no connection between gender and the student's level of motivation. We can also get the degrees of freedom directly from dof, and tbl gives the expected frequencies; if you are doing it manually you can compare with these expected frequencies. This matches the answer I showed in my presentation.
(Refer Slide Time: 14:13)
959
So far we have seen the first application of the chi-square distribution, the test of independence. Now we move to the other application, testing goodness of fit. What is the meaning of goodness of fit? Many times when we collect data we need to know what distribution the data follow, and the chi-square test helps us find that out. First we take an example of Poisson data and check whether the data follow a Poisson distribution or not.
(Refer Slide Time: 14:48)
What is the chi-square goodness-of-fit test? It compares the expected frequencies of categories under a hypothesized population distribution with the observed frequencies, to determine whether there is a difference between what was expected and what was observed. So, as usual, we are going to work with expected and observed frequencies and see whether there is any difference.
(Refer Slide Time: 15:16)
960
The formula is the same: the chi-square value is the sum of (observed frequency – expected frequency)² divided by the expected frequency. The degrees of freedom, however, differ from what we saw before. For a contingency table the degrees of freedom are (number of rows – 1)(number of columns – 1); here the degrees of freedom are k – 1 – p, where k is the number of categories and p is the number of parameters estimated from the data.
961
Now let us follow the steps to test the goodness of fit of a given dataset. Assume that some dataset is given to you and we want to test whether it follows a Poisson distribution. The first step is to set up the null and alternative hypotheses. The null hypothesis is that the population has a Poisson probability distribution; the alternative hypothesis is that the population does not have a Poisson distribution.
One important point to notice: so far, whenever we stated a null hypothesis, the word "not" (as in "no difference") appeared in the null hypothesis, but in the goodness-of-fit test it is the reverse. Here "the given data follow a Poisson distribution" is the null hypothesis, and the alternative hypothesis is that the population does not follow a Poisson distribution, just the reverse.
For most hypothesis tests the word "not" appears in the null hypothesis; only for the goodness-of-fit test does it appear in the alternative hypothesis. This is one important difference to remember. Next, select a random sample and record the observed frequency fi for each value of the Poisson random variable. Compute the mean number of occurrences, mu, because we need the parameter of the Poisson distribution, and then compute the expected frequency of occurrences, ei, for each value of the Poisson random variable.
(Refer Slide Time: 17:58)
Then compute the value of the test statistic: as usual, the sum of (observed frequency – expected frequency)² divided by the expected frequency. Remember that here, too, each expected frequency should be 5 or more. If an expected frequency is not 5 or more, you have
962
to collapse certain intervals so that the expected frequency becomes 5 or more. We will see an example of that here as well.
(Refer Slide Time: 18:27)
There are two ways to state the rejection rule. With the p-value approach, reject H0 if the p value is less than or equal to alpha. With the critical value approach, reject H0 if the calculated chi-square value is greater than the chi-square critical value from the table, where alpha is the significance level and there are k – 2 degrees of freedom. Remember how this k – 2 arises: the general formula is k – 1 – p, and since the Poisson distribution has only one parameter, its mean, p = 1 and the degrees of freedom become k – 2.
(Refer Slide Time: 19:11)
So we will take an example, the parking garage example. In studying the need for an additional entrance to a city parking garage, a consultant has recommended an analysis; the consultant has
963
proposed a solution whose approach is applicable only in situations where the number of cars entering during a specified time period follows a Poisson distribution. So the consultant's solution can be implemented only if the arrivals follow a Poisson distribution.
(Refer Slide Time: 19:48)
A random sample of 100 one-minute time intervals resulted in the customer arrivals listed below; a statistical test must be conducted to see whether the assumption of a Poisson distribution is reasonable. What is given is the number of arrivals and its frequency: 0 arrivals has frequency 0, 1 arrival has frequency 1, 2 arrivals has frequency 4, 3 arrivals has frequency 10, and so on up to 12 arrivals.
(Refer Slide Time: 20:20)
We will form the hypotheses. The null hypothesis is that the number of cars entering the garage during a one-minute interval is Poisson distributed. The alternative hypothesis is that the number of cars
964
entering the garage during a one-minute interval is not Poisson distributed. Notice again that this differs from the traditional way of forming a null hypothesis: generally the word "not" appears in the null hypothesis, but in the goodness-of-fit test it appears in the alternative hypothesis.
(Refer Slide Time: 20:57)
This is the Python code, shown as a screenshot: import scipy, import chisquare, import poisson. The dataset is kept in the file named P_distribution. This column is the arrivals and this is the actual frequency, which we can also call the observed frequency.
(Refer Slide Time: 21:21)
The next quantity we need is the mean of the dataset, the estimate of the Poisson parameter. The mean formula is the usual simple one, mu = (sigma f·n) / (sigma f), where f is the frequency and n is the number of arrivals, so 0 × 0 + 1 × 1
965
and so on; that sum comes to 600. And sigma f, the sum of all the frequencies, is 100.
So the mean is 600 / 100 = 6. The Poisson probability function is the usual formula f(x) = (e^(–mu) · mu^x) / x! (some people write lambda instead of mu); with mu = 6 it becomes 6^x · e^(–6) / x!.
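A hedged sketch of how these expected counts could be computed with scipy. Only the first few frequencies (0, 1, 4, 10) and the totals (100 intervals, 600 arrivals) are quoted in the transcript, so the remaining counts below are illustrative placeholders consistent with those totals, not the actual data file.

    import numpy as np
    from scipy.stats import poisson

    arrivals = np.arange(13)                                            # 0, 1, ..., 12 per minute
    observed = np.array([0, 1, 4, 10, 12, 16, 17, 14, 12, 7, 4, 2, 1])  # placeholders, see above
    n = observed.sum()                                                  # 100 one-minute intervals

    mu = (arrivals * observed).sum() / n      # 600 / 100 = 6

    expected = n * poisson.pmf(arrivals, mu)  # theoretical (expected) Poisson frequencies
    print(mu)
    print(expected.round(2))                  # starts 0.25, 1.49, 4.46, ... as on the slide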
(Refer Slide Time: 22:44)
Now we have to substitute these x values; when you substitute them you get the theoretical frequencies. When x = 0, f(0) = 6^0 · e^(–6) / 0! = e^(–6), which is about 0.0025, a probability. We want it in terms of frequency, so it has to be multiplied by n, where n is 100.
When you substitute x = 1, 6^1 · e^(–6) / 1! gives 0.0149, which is then multiplied by n. This gives the theoretical frequency of the Poisson distribution, and we continue like this up to x = 12. Looking at the theoretical frequencies for 0, 1, and 2, each is less than 5, so these three groups have to be combined.
In the same way, at the other end of the table the expected frequencies drop below 5, so those values have to be clubbed together so that each expected frequency is at least 5; that is what we have done.
(Refer Slide Time: 24:15)
966
So we have the observed frequencies and the mean value, and for each x value we computed the expected frequency using a for loop.
(Refer Slide Time: 24:28)
Then we round off; this was the rounded value using Python. When you look at 0.25 and 1.49 and go back, you get exactly the same values. Now we keep only two columns: the observed frequency and the expected frequency.
(Refer Slide Time: 24:48)
967
You see that 0, 1, and 2 are clubbed together so that the expected frequency is more than 5. Similarly, 10, 11, and 12 are grouped together so that the expected frequency is 8.39; otherwise each would be less than 5. Counting the intervals now: 1, 2, 3, 4, 5, 6, 7, 8, 9, there are 9 intervals.
(Refer Slide Time: 25:15)
This is the observed frequency and the expected frequency. Now you can directly run the command scipy.stats.chisquare() with the observed and expected frequencies; we get 3.27 with a p value of 0.911, which is more than 0.05. So we accept the null hypothesis, and accepting the null hypothesis means concluding that the arrival pattern follows a Poisson distribution.
(Refer Slide Time: 25:45)
968
For the rejection rule, when alpha = 0.05 we have k = 9 because, as I showed, there are 9 intervals. p is the number of parameters, and the Poisson distribution has only one parameter, so the degrees of freedom are 9 – 1 – 1 = 7. For 7 degrees of freedom at alpha = 0.05 the chi-square table value is 14.067.
When you look at the calculated chi-square value, it is 3.268, which lies on the acceptance side: on the chi-square distribution the critical value is 14.067 and our 3.268 lies well below it, so we accept the null hypothesis.
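A small sketch of that table lookup using scipy (the 0.95 quantile of the chi-square distribution with 7 degrees of freedom):

    from scipy.stats import chi2

    # critical value for alpha = 0.05 with k - 1 - p = 9 - 1 - 1 = 7 degrees of freedom
    critical = chi2.ppf(0.95, 7)
    print(critical)   # about 14.067; the calculated statistic 3.268 is below this,
                      # so we do not reject the Poisson hypothesis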
(Refer Slide Time: 26:44)
969
(Refer Slide Time: 26:52)
The 14.067 is the table value, not the calculated value. The chi-square table value is 14.067, but our calculated value is 3.268, so it lies on the acceptance side; we do not reject the null hypothesis and conclude that the arrival pattern follows a Poisson distribution. In this class I have explained how to form a contingency table and, after forming it, how to do the chi-square test with the help of Python.
The next topic I started is testing goodness of fit: suppose some dataset is given and you want to test what distribution it follows. I took a dataset and tested whether it follows a Poisson distribution, explaining it with Python screenshots. In the next class I will run the Python code for testing the Poisson distribution on the given dataset, and I will explain how to test goodness of fit for the uniform and normal distributions. Thank you.
970
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 48
Chi Square Goodness of Fit Test
Welcome, students. In the previous lecture I explained how to do the goodness-of-fit test for the Poisson distribution; I covered the theory portion. In this class I am going to give you a Python demo for that, and after the demo I am going to explain how to do the goodness-of-fit test for the uniform and normal distributions.
(Refer Slide Time: 00:53)
The agenda for this lecture is a Python demo for testing goodness of fit for the Poisson distribution, then the theory of how to test goodness of fit for the uniform and normal distributions, and after that a Python demo for those as well. Now we will go to the Python prompt and see how to do the goodness-of-fit test for the Poisson distribution.
Now we will see the Python demo: some dataset is given and we are going to test whether it follows a Poisson distribution. I import the necessary libraries: scipy, chi2 from scipy.stats, poisson from scipy.stats, and then pandas and numpy.
(Refer Slide Time: 01:43)
971
For checking goodness of fit for the Poisson distribution, let us look at the dataset. The number of arrivals per one-minute interval is given along with its frequency: 0 arrivals has frequency 0, 1 arrival has frequency 1, 2 arrivals has frequency 4, and so on.
(Refer Slide Time: 02:11)
The data that are given are the observed frequencies, which I store separately. Next we have to find the expected frequency for each x value. To know the expected frequency we need the mean of the Poisson distribution, which is (sigma f·n) / (sigma f). The total number of arrivals is 600 and the total number of intervals is 100, so 600 divided by 100 gives mu = 6.
972
(Refer Slide Time: 02:58)
We know mu and the x values, so now we find the expected frequencies. I create expected_frequency, and for i in range(len(observed_frequency)) I compute E_frequency = 100 (since n is 100) multiplied by poisson.pmf(i, mu), which gives the expected frequency. I use a for loop so that it saves time.
(Refer Slide Time: 03:41)
This was our expected frequency. The expected frequencies have many decimal places; if I want to round them to 2 decimals I use the list comprehension [round(element, 2) for element in expected_frequency]. Let us see the rounded values.
(Refer Slide Time: 04:05)
973
These are the rounded expected frequencies. Now we combine the two variables using the zip command into an object called df.
(Refer Slide Time: 04:19)
So df shows the observed and expected frequencies. Once we know both, we can get the chi-square value directly, but look here: the first expected frequency is 0.25, which is less than 5. Only when you add the first three categories together does the expected frequency exceed 5, so we have combined them: the combined observed frequency is 0 + 1 + 4 = 5 and the combined expected frequency is 6.20.
We have done this manually; when you are running a large program with a huge dataset you can write the program so that every expected frequency is more than 5, but here we have not
974
done it that way. We have added the categories manually and checked whether each expected frequency is more than 5.
(Refer Slide Time: 05:13)
So these are the observed and expected frequencies. Similarly, for 10, 11, and 12 the expected frequency is not more than 5, so we collapsed those three intervals into one interval called "10 or more", with expected frequency 8.39. Now that we have the observed and expected frequencies, we simply pass them to scipy.stats.chisquare(observed, expected). Running this gives a chi-square statistic of 3.27 with a p value of 0.91, which is above 0.05. So we accept the null hypothesis, as I also showed in the presentation, and conclude that the given dataset follows a Poisson distribution.
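A toy illustration of that final call; the counts below are made up just to show the call pattern (in the demo the grouped parking-garage columns are passed instead), and the observed and expected totals are kept equal as the test requires.

    import scipy.stats

    observed = [18, 22, 16, 14, 12, 10, 8]    # made-up grouped counts
    expected = [20, 20, 15, 15, 10, 10, 10]   # made-up expected counts (same total)

    stat, p = scipy.stats.chisquare(observed, f_exp=expected)
    print(stat, p)   # reject the fitted distribution only if p is below alpha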
(Refer Slide Time: 06:22)
975
We have tested whether the given dataset follows a Poisson distribution, which is a discrete distribution. Now we will see another example, testing whether a given dataset follows a uniform distribution. Milk sales are given for the 12 months, January through December, in litres. We are going to test whether the sales follow a uniform distribution; the assumption under the null hypothesis is that they do.
(Refer Slide Time: 06:56)
The null hypothesis is that the monthly milk sales figures are uniformly distributed; the alternative hypothesis is that they are not uniformly distributed. Again the "not" appears in the alternative hypothesis, which is the reverse of our traditional way of forming hypotheses. We take alpha = 0.01, and k = 12 because there are 12 monthly values.
976
Here p, the number of estimated parameters, is 0, because for a uniform distribution, once you know the lower and upper limits, you can construct it without estimating anything from the data. So with 0 parameters the degrees of freedom simplify to 12 – 1 – 0 = 11, and for alpha = 0.01 with 11 degrees of freedom the chi-square table value is 24.725.
If we use the critical value method: 24.725 is the table value from the chi-square table, and if the calculated chi-square value is greater than 24.725 we reject the null hypothesis; otherwise we accept it.
(Refer Slide Time: 08:24)
We can find the chi-square table value using chi2.ppf(0.99, 11), which gives 24.72, the same value.
(Refer Slide Time: 08:38)
977
This is the given dataset: the month and the observed frequency. To find the expected frequency we sum the dataset, which gives 18,447, and because the sales are assumed to be uniformly distributed we divide by 12, giving 1,537.25. That value is the expected frequency for every month. Then we compute (observed frequency – expected frequency)² divided by the expected frequency for each month and sum; we get 74.38, the calculated chi-square value.
(Refer Slide Time: 09:32)
Let us see the Python code for this. x is given and we find its mean, 1,537.25; the expected frequency for each month is just this same value, which we have also entered manually. Then from scipy.stats we import chisquare and call chisquare(x, expected_frequency). We get the calculated chi-
978
square value 74.37 and a p value of 1.7 × 10^(–11). So the Python output and our manual calculation agree.
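A minimal sketch of that uniform-fit test. The monthly litre figures are not listed in the transcript, so the numbers below are made-up placeholders (they only match the quoted total of 18,447); they illustrate the call pattern rather than reproduce the slide.

    import numpy as np
    from scipy.stats import chisquare, chi2

    # placeholder monthly sales (litres); replace with the 12 values from the slide
    x = np.array([1600, 1500, 1550, 1450, 1700, 1520,
                  1480, 1530, 1490, 1560, 1510, 1557])

    expected = np.full(12, x.mean())          # uniform => same expected value each month
    stat, p = chisquare(x, f_exp=expected)

    critical = chi2.ppf(0.99, df=len(x) - 1)  # k - 1 - 0 = 11 degrees of freedom, alpha = 0.01
    print(stat, p, critical)                  # reject uniformity if stat > critical (or p < 0.01)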
(Refer Slide Time: 10:19)
So 24.725 is the table value and our calculated value is 74.37, which is larger, so we reject the null hypothesis. Rejecting the null hypothesis means concluding that the dataset does not follow a uniform distribution.
(Refer Slide Time: 10:41)
Now we go to another interesting example: taking a dataset and checking whether it follows a normal distribution. What are the steps? The first step is to set up the null and alternative hypotheses, then select a random sample and compute its mean and standard deviation. Next, define intervals of values so that the expected frequency is
979
at least 5 for each interval. For each interval record the observed frequency and then compute the expected frequency.
(Refer Slide Time: 11:19)
Then compute the value of the test statistic and reject H0 if the chi-square value is greater than the table value, where alpha is the significance level and the degrees of freedom are k – 3. This follows from k – 1 – p: the normal distribution has two parameters (mean and standard deviation), so p = 2 and the degrees of freedom become k – 3.
(Refer Slide Time: 11:53)
We will take an example: a company manufactures and sells a general-purpose microcomputer. As part of a study to evaluate the sales personnel, management wants to determine, at the alpha = 5% significance level, whether the annual sales volume, that is, the number of units sold by a salesperson, follows a normal probability distribution. So there is a
980
dataset which is nothing but the number of units sold by the salespeople, and they want to test whether those sales follow a normal distribution.
(Refer Slide Time: 12:30)
A simple random sample of 30 salespeople was taken and their numbers of units sold are given below: 33, 43, 44, and so on. For this dataset the mean is 71 and the standard deviation is 18.23.
(Refer Slide Time: 12:49)
So we have imported the data, and we find that the mean is 71 and the standard deviation is 18.22.
(Refer Slide Time: 12:57)
981
What is the null hypothesis for this dataset? The population of the number of units sold has a normal distribution with mean 71 and standard deviation 18.23. The alternative hypothesis is that the population of the number of units sold does not have a normal distribution with mean 71 and standard deviation 18.23. Many times, whenever we collect data, we have to test what distribution it follows.
For purposes such as simulation, knowing the exact distribution of a given dataset is important, which is why testing for a particular distribution is very helpful for further analysis of the data.
(Refer Slide Time: 13:47)
982
First we form intervals to satisfy the requirement of an expected frequency of at least 5 in each interval; this is very important. We divide the 30 observations by 5, so we will use 6 equal-probability intervals.
(Refer Slide Time: 14:06)
You see the six intervals: 1, 2, 3, 4, 5, 6. Dividing 30 by 5 gives 6 intervals, and dividing this way ensures that every interval has an expected frequency of at least 5. The total area is 1, so 1 divided by 6 gives 0.1667; each interval has area 0.1667. For a given cumulative area you can find the z value, and the interval boundary is X bar – z·sigma (or X bar + z·sigma on the right side).
X bar is given, the z value can be obtained, and sigma is given, so you can find each boundary; the first boundary is 53.367. For the second point: X bar is 71 (the mean) and sigma is 18.23, and when the cumulative area is 0.1667 + 0.1667 the corresponding z value is 0.43, so 71 – 0.43 × 18.23 gives about 63.15. Similarly, on the right-hand side, let us see how we get 88.63.
The mean is given, and when the right-tail area is 0.1667 the corresponding z value is 0.97. So 71 + 0.97 × 18.23, using the sample standard deviation as sigma, gives 88.63. In this way we obtain the different interval boundaries.
(Refer Slide Time: 16:01)
983
What are those intervals? Dividing 1 by 6 gives six equal-probability intervals. For j in range(1, 6) the probability_interval is scipy.stats.norm.ppf(j/6, mean, standard deviation), substituting each j value. When you print the probability intervals you get the boundaries we obtained by solving manually: 53.367, then 63.149, and so on; these are the interval boundaries of the normal distribution.
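A minimal sketch of that boundary computation, using the sample mean 71 and standard deviation 18.23 quoted in the lecture; the variable names are assumptions.

    from scipy.stats import norm

    mean, sd, k = 71, 18.23, 6

    # boundaries that cut the fitted normal into k equal-probability intervals
    boundaries = [norm.ppf(j / k, loc=mean, scale=sd) for j in range(1, k)]
    print([round(b, 2) for b in boundaries])
    # roughly 53.36, 63.15, 71.0, 78.85, 88.64, matching the slide up to rounding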
(Refer Slide Time: 16:43)
The first interval is below 53.36, and the next boundaries are 63.14, 71, 78.85, and 88.63, the largest boundary. Now we go to the data and count how many values fall in each interval. For example, in the interval below 53.36 the observed
frequency is 6.
984
How did we get the 6? From the given dataset you count how many values lie below 53.36; the count is 6. Similarly, in the interval from 53.36 to 63.14 you count how many values fall in that range; there are 3, and that is the observed frequency. The expected frequency is 5 in every interval, because there are 30 observations divided over 6 equal-probability intervals.
Why did we divide by 5? Because we need a minimum expected frequency of 5, which is why we used 6 intervals: we choose the number of intervals so that the expected frequency in each is at least 5. Having the observed and expected frequencies, we then find the squared differences.
(Refer Slide Time: 18:26)
These are the expected and observed frequencies. Calling scipy.stats.chisquare(observed frequency, expected frequency) we get about 1.5, with a p value of 0.90, which is more than the 5% level. So we accept the null hypothesis: the given dataset follows a normal distribution.
(Refer Slide Time: 18:50)
985
Looking at it with alpha = 0.05 and k = 6: there are 6 intervals, which is why k = 6. p = 2 parameters, so 6 – 1 – 2 = 3 degrees of freedom. For 3 degrees of freedom at alpha = 5% the table value is 7.815. Our test statistic, the sum of (observed frequency – expected frequency)² divided by the expected frequency, is about 1.6.
On the chi-square distribution the critical value is 7.815 and our calculated value is 1.6, so we accept the null hypothesis and conclude that the given dataset follows a normal distribution. Now I will explain the Python code for checking whether a given dataset follows a uniform or a normal distribution.
(Refer Slide Time: 20:01)
986
Suppose we want the chi-square table value when alpha is 1 percent: we write chi2.ppf(0.99, 11), with 11 degrees of freedom, and we get the table value 24.72. Next, this is our x, the monthly data, and we find its mean; the expected frequency is nothing but that mean.
Now we have the observed frequencies, the x values (such as 1,610), and the expected frequency exp_f. When you call chisquare(x, exp_f), look at the p value: it is 1.7 × 10^(–11), a very low value, so we have to reject our null hypothesis. The calculated chi-square value is 74.
The chi-square table value is 24.72, and since 74 is larger than 24.72 we reject the null hypothesis and conclude that the dataset does not follow a uniform distribution.
(Refer Slide Time: 21:26)
Next comes the second example: a dataset is given and we test whether it follows a normal distribution. I run this dataset and first find the mean, which is 71, and the standard deviation, which is 18.23. The null hypothesis is that the given dataset, with mean 71 and standard deviation 18.23, follows a normal distribution.
987
The alternative hypothesis is that it does not follow a normal distribution. Then the range of x values has to be divided into 6 intervals, because we need a minimum expected frequency of 5 (expected, not observed) in each interval; 30 divided by 6 gives an expected frequency of 5 per interval. For j in range(1, 6) I obtain the interval boundaries. So one interval is below 53.36, the next is 53.36 to 63.14, then 63.14 to 71, then 71 to 78.85, then 78.85 to 88.63, and the last is 88.63 and above. These are our intervals.
(Refer Slide Time: 22:54)
That is how we got the intervals. The expected frequency is 5 in each interval, because we divided into 6 equal-probability intervals, and now we need the observed frequencies. From the given dataset we count how many observations fall below 53.36; there are 6. In the interval 53.36 to 63.14 there are 3, in the interval 63.14 to 71 there are 6, and so on. We count manually how many observations fall in each interval; those are the observed frequencies. Now we have both the expected and the observed frequencies.
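A hedged sketch of doing that counting programmatically instead of by hand. Only the first three sales figures (33, 43, 44) and the summary statistics are quoted in the transcript, so the remaining values below are illustrative placeholders.

    import numpy as np
    from scipy.stats import norm, chisquare

    # stand-in for the 30 sales figures from the data file
    units_sold = np.array([33, 43, 44, 45, 52, 56, 58, 63, 64, 66,
                           67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
                           79, 81, 83, 85, 86, 90, 92, 94, 97, 104])

    mean, sd, k = units_sold.mean(), units_sold.std(ddof=1), 6
    inner = [norm.ppf(j / k, mean, sd) for j in range(1, k)]
    edges = [units_sold.min() - 1] + inner + [units_sold.max() + 1]

    observed, _ = np.histogram(units_sold, bins=edges)   # counts per interval
    expected = np.full(k, len(units_sold) / k)           # 30 / 6 = 5 in each interval

    print(observed, chisquare(observed, f_exp=expected))
    # (passing ddof=2 to chisquare would make the p value use k - 3 degrees of freedom,
    #  matching the theory above)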
(Refer Slide Time: 23:39)
988
Then, when we do the chi-square test, we get a p value of 0.90, which is bigger than our alpha value, so we accept the null hypothesis and conclude that the given dataset follows a normal distribution. In this lecture I first gave a demo for testing goodness of fit for a Poisson distribution, and then I explained the theory behind testing goodness of fit for the uniform and normal distributions when some dataset is given.
With the help of the Python demo I have explained how to test a given dataset for uniform and normal goodness of fit. One more important point to remember: random numbers follow a uniform distribution. Sometimes you may be asked to generate some random numbers.
And you may have to test whether the numbers are really random. For that purpose we test whether the given numbers follow a uniform distribution; if they do, we can conclude that they are random numbers.
989
Data Analytics with Python
Prof. Ramesh Anbanandam
Lecture – 49
Cluster Analysis: Introduction - 1
Dear students, today we are entering a new topic, cluster analysis. Cluster analysis is one of the most widely used data mining techniques and a very important topic. The agenda for today's class is understanding cluster analysis and its purpose, then an introduction to types of data and how to handle them, because clustering techniques vary with the nature of the data: if the data are continuous or interval data there is one kind of algorithm, and if the data are nominal there is a different algorithm for clustering.
(Refer Slide Time: 01:06)
990
First, what is cluster analysis? Cluster analysis is the art of finding groups in data. Basically, one wants to form groups in such a way that objects in the same group are as similar to each other as possible, whereas objects in different groups are as dissimilar as possible. When you look at this picture there are different clusters: a group of dogs, a group of cats, a group of chairs, a group of tables. This is an example of cluster analysis.
The classification of similar objects into groups is an important human activity and part of the learning process. A child learns to distinguish between cats and dogs, between tables and chairs, between men and women by means of continuously improving subconscious classification schemes. The point is that a child unknowingly is able to classify different objects and to group together objects that are similar in nature.
(Refer Slide Time: 02:17)
991
We will explain the concept of cluster analysis with the help of an example. This is a plot of 12 objects on which two variables were measured; for instance, the weight of an object might be displayed on the vertical axis and its height on the horizontal axis. When you plot the data it is clearly visible that two clusters form with respect to height: roughly one group around a height of 60 and another around 80.
(Refer Slide Time: 02:56)
Because the data contain only two variables, we can investigate them by merely looking at the plot. In this small dataset there are clearly two distinct groups of objects. Such groups are called clusters, and discovering them is the aim of cluster analysis; that is the purpose of this topic. What is
992
going to come in the next classes? We will work with different types of data and cluster them into different groups.
(Refer Slide Time: 03:32)
Many times students have a doubt: what is the difference between cluster analysis and discriminant analysis? Cluster analysis is an unsupervised classification technique in the sense that it is applied to a dataset in which patterns are to be discovered, that is, groups of individuals or variables are to be found. Why do we call it unsupervised learning? Because we do not know in advance which object will go to which cluster, nor how many clusters we are going to form.
The second point: in cluster analysis no prior knowledge is needed for the grouping, but the result is sensitive to several decisions that have to be taken, such as the choice of similarity or dissimilarity measures and the clustering method. Discriminant analysis, in contrast, is a statistical technique used to build a prediction model that classifies objects in your dataset depending on the features observed on them.
In that case the dependent variable is a grouping variable which identifies to which group an object belongs. This grouping variable must be known at the start for the function to be built, which is why discriminant analysis is considered a supervised tool: there is a previously known classification for the elements of the dataset.
993
(Refer Slide Time: 05:03)
Continuing with the difference between cluster analysis and discriminant analysis: cluster analysis can be used not only to identify structure already present in the data, but also to impose structure on a more or less homogeneous dataset that has to be split up in a fair way, for instance when dividing a country into telephone areas. In this example the country is divided into different telephone areas.
Cluster analysis is quite different from discriminant analysis in that it actually establishes the groups, whereas discriminant analysis assigns objects to groups that were defined in advance; that is the major difference. In cluster analysis there are no predefined groups, and, as I told you at the beginning, the clustering algorithm to use depends on what kind of data you have.
(Refer Slide Time: 06:07)
994
Types of data and how to handle them for cluster analysis: as I told you at the beginning of the class, the type of data is an important point to take care of, because different types of data require different clustering algorithms. Suppose there are n objects to be clustered, which may be persons, flowers, birds, countries, or anything else.
Clustering algorithms typically operate on either of two input structures. The first represents the objects by means of p measurements or attributes such as height, weight, sex, colour, and so on. These measurements can be arranged in an n-by-p matrix where the rows correspond to the objects and the columns to the attributes.
(Refer Slide Time: 07:01)
995
In this case the objects (Like, Intermediate, Need) are in the rows and the attributes (Price, Quality, Times) are in the columns; this is one kind of input.
(Refer Slide Time: 07:11)
The second structure is a collection of proximities that must be available for all pairs of objects. These proximities make up an n-by-n table, called a one-mode matrix because the row and column entities are the same set of objects. We consider two types of proximities: dissimilarities, which measure how far apart two objects are, and similarities, which measure how much they resemble each other.
996
Now assume there are objects A, B, and C, written this way. The entries for A with A, B with B, and C with C are 1 because each object is identical to itself; the other entries record how similar (or, alternatively, how dissimilar) A and B, A and C, and B and C are.
(Refer Slide Time: 08:18)
Now let us see what interval-scaled variables are. In this situation the n objects are characterized by p continuous measurements. These values are positive or negative real numbers, such as height, weight, temperature, age, or cost, which follow a linear scale. For instance, the time interval between 1900 and 1910 was equal in length to that between 1960 and 1970. This is an example of the interval scale we studied at the beginning of the course.
We classified data into four categories: nominal, ordinal, interval, and ratio. An example of interval data is the year. With years we can add and subtract: the interval from 1900 to 1910 is the same as from 1960 to 1970, so differences are meaningful, but we cannot meaningfully multiply.
(Refer Slide Time: 09:18)
997
Another example of interval data is temperature: raising the temperature of an object by one degree Celsius takes the same amount of energy whether the object is at –14.4 degrees Celsius or at 35.2 degrees Celsius. The Celsius and Fahrenheit scales are examples of interval scales because they have no absolute zero, yet we can add and subtract differences. In general it is required that intervals keep the same importance throughout the scale.
(Refer Slide Time: 09:55)
Interval-scaled variables: these measurements can be organized in an n-by-p matrix where the rows correspond to the objects and the columns to the variables, and the fth measurement of the ith object is denoted by xif, with i = 1, ..., n and f = 1, ..., p. So in the rows
998
we have the objects and in the columns the variables. Another name for an object is a case.
(Refer Slide Time: 10:33)
For example, take 8 people: in the rows we have the n objects and in the columns we have 2 variables, weight and height. The weight in kilograms and the height in centimetres are given in the table. In this situation n = 8 because there are 8 rows, and p = 2 because there are 2 variables.
(Refer Slide Time: 11:01)
If I plot that, with weight in kg and height in centimetres, the points form two groups of similar objects. We can group the objects into two categories,
999
cluster 1 and cluster 2. In cluster 1, for example, rows C, H, A, and G occur, and in cluster 2 rows D, B, F, and E occur. This is one way of clustering.
(Refer Slide Time: 11:35)
The units on the vertical axis are drawn to the same size as those on the horizontal axis even though they represent different physical concepts. The plot contains two obvious clusters which in this case can be interpreted easily: one consists of small children, the other of adults. However, other variables might have led to a completely different clustering; for instance, measuring the concentration of certain natural hormones might have yielded a clear-cut partition into male and female persons. Here we have taken two variables, weight and height; taking other variables instead may produce a different type of clustering.
(Refer Slide Time: 12:21)
1000
Let us now consider the effect of changing the measurement units. Previously the weight was in kilograms and the height in centimetres. Now we change the units: weight will be in pounds and height in inches. For this dataset, let us see how the units affect our clustering.
If the weight and height of the subjects had been expressed in pounds and inches, the result would have looked quite different. A pound equals 0.453 kg and an inch is 2.54 centimetres. Therefore table 2 contains larger numbers in the weight column, because we converted to pounds, and smaller numbers in the height column.
(Refer Slide Time: 13:18)
1001
Let us see the new plot. The weight values have become larger and the height values smaller, and the clustering pattern has changed completely. The point is that the units of the data can bring out different clusters.
(Refer Slide Time: 13:36)
What is the interpretation? Although it plots essentially the same data as figure 1, figure 2 looks much flatter. In this figure the relative importance of the variable weight is much larger than in figure 1. As a consequence, the two clusters are not as nicely separated as in figure 1, because in this particular example the height of a person gives a better indication of adulthood than his or her weight.
1002
If height had been expressed in feet (1 foot = 30.48 centimetres) the plot would become flatter still and the variable weight would become rather dominant. In some applications, and this is an important point, changing the measurement units may even lead one to see a very different clustering structure. The point is that changing the measurement units may produce a different clustering structure.
(Refer Slide Time: 14:37)
Because different units give different clustering structures, one way to avoid this problem is to standardize the data. To avoid the dependence on the choice of measurement units one has the option of standardizing the data, which converts the original measurements to unitless variables. First one calculates the mean value of variable f, mf = (1/n)(x1f + x2f + ... + xnf), the average of that variable over all n objects.
(Refer Slide Time: 15:18)
1003
Then one computes a measure of dispersion or spread of this fth variable. Generally we use the standard deviation, sf = sqrt( ((x1f – mf)² + (x2f – mf)² + ... + (xnf – mf)²) / (n – 1) ). This is one way of standardizing the data.
(Refer Slide Time: 15:44)
However, this measure is affected very much by the presence of outlying values. The problem with this standardization is that extremely large or extremely small values distort it. For instance, suppose one of the xif has been wrongly recorded so that it is much too large; in that case the standard deviation will be unduly inflated, because the deviation xif – mf is squared.
1004
Hartigan (1975) notes that one needs a dispersion measure that is not too sensitive to outliers. Therefore we use the mean absolute deviation, generally called MAD, in which the contribution of each measurement xif is proportional to the absolute value |xif – mf|. So instead of squaring, for standardization we take only the mean absolute deviation.
The advantage of the mean absolute deviation is that the effect of any outlier is dampened; that is why, instead of the standard deviation, we use the mean absolute deviation: sf = (1/n)(|x1f – mf| + |x2f – mf| + ... + |xnf – mf|).
(Refer Slide Time: 17:21)
Let us assume that sf is nonzero; otherwise the variable f is constant over all objects and must be removed. Then the standardized measurements, sometimes called z-scores, are defined by zif = (xif – mf) / sf. They are unitless because the numerator and the denominator are expressed in the same units.
By construction the zif have mean 0 and mean absolute deviation equal to 1; had the standard deviation been used in the denominator, the standardized data would instead have mean 0 and standard deviation 1.
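A minimal sketch of this MAD-based standardization, applied to a small made-up weight/height table in the spirit of the 8-person example; the numbers are illustrative placeholders, not the book's data.

    import numpy as np

    # illustrative 8 x 2 data matrix: weight (kg) and height (cm)
    X = np.array([[15.0,  95.0], [17.0, 100.0], [14.0,  92.0], [70.0, 175.0],
                  [16.0,  97.0], [72.0, 180.0], [68.0, 172.0], [18.0, 102.0]])

    m = X.mean(axis=0)                 # mean of each variable f
    s = np.abs(X - m).mean(axis=0)     # mean absolute deviation of each variable

    Z = (X - m) / s                    # unitless z-scores based on the MAD

    print(Z.mean(axis=0).round(10))    # each column has mean 0
    print(np.abs(Z).mean(axis=0))      # and mean absolute deviation 1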
1005
(Refer Slide Time: 18:10)
When applying standardization one forgets about the original data and uses the new data matrix in all subsequent computations. Initially the rows held the objects and the columns held the variables xif; after standardization these become z11, z12, ..., z1p and so on, and it is this standardized data matrix that is taken for further analysis.
(Refer Slide Time: 18:40)
Detecting outliers: the advantage of using sf (the mean absolute deviation) rather than the standard deviation in the denominator of the z-score formula is that sf will not be blown up so much by an outlying xif, and hence the corresponding zif will still be noticeable. So the ith object can be recognized as an
1006
outlier by the clustering algorithm, which will typically put it in a separate cluster. So the purpose of using this z-score is that the denominator is not blown up so much when there is an outlier in the dataset.
(Refer Slide Time: 19:18)
Standardizing the data: the preceding description might convey the impression that standardization would be beneficial in all situations. However, it is merely an option that may or may not be useful in a given application. Sometimes the variables have an absolute meaning and should not be standardized; the point is that when a variable is already meaningful in absolute terms, it should not be standardized. For instance, it may happen that several variables are expressed in the same units, so they should not be divided by different sf; because all the variables are in the same units, you need not go for standardization. Often standardization dampens a clustering structure by reducing the large effects, because the variables with the big contributions are divided by a large sf, where sf is the dispersion measure used for standardization.
In this lecture we have covered the purpose of clustering analysis. Then I have explained the
difference between clustering analysis and discriminant analysis. Then I have explained how the
different types of data will affect our clustering structure. In the different types of data we have
taken only the interval data and how to handle them for doing cluster analysis.
1007
Then I explained why we have to standardize: if the different variables are in different units, you may get different clustering structures, and to overcome that we standardize. In the next class I will explain that standardization is also not appropriate for all kinds of data; sometimes it can mislead and produce a different type of clustering, which I will illustrate with an example. Thank you.
1008
Data Analytics with Python
Prof. Ramesh Anbanandam
Lecture – 50
Clustering Analysis: Part II
In my previous lecture, we have started about introduction to cluster analysis, and I have
explained how to handle interval types of data. Then I have started about the importance of
standardization. In this lecture we will see that what is the effect of standardization because
sometime standardization may mislead your clustering structure and I will explain different types
of distances computation between the objects.
(Refer Slide Time: 00:56)
Because there are different ways to compute distances for different types of data sets, I will explain those as well. Many times when we collect data it is not possible to collect everything; there may be missing data. How to handle missing data will also be covered in this lecture.
(Refer Slide Time: 01:13)
1009
Now let us see the effect of standardization; I have taken one simple numerical example. This problem is taken from the book Finding Groups in Data: An Introduction to Cluster Analysis by Leonard Kaufman and Peter Rousseeuw, published by John Wiley. There are 4 persons, and their age in years and height in centimetres are given. Suppose you take age on the horizontal axis and height on the vertical axis; you can then plot all four persons A, B, C, D. What you are able to see is that there are distinct clusters: A and B form one cluster and C and D form another cluster. Now let us standardize the same data, cluster again, and see how it appears.
(Refer Slide Time: 02:08)
1010
In figure 1 we can see the distinct clusters; now let us standardize the data of table 1. For standardizing we need the mean and, instead of the standard deviation, the mean absolute deviation. The mean of age is m1 = 37.5, obtained by adding all the ages and dividing by the number of persons, and the mean absolute deviation of the first variable works out to be s1 = 2.5. How do we find the mean absolute deviation? Take each value minus the mean, for example 35 - 37.5 and 40 - 37.5, keep only the positive (absolute) values, and average them over the four persons, giving 2.5. Therefore the standardization converts an age of 40 to +1: using z = (x - m) / s with x = 40, m = 37.5 and s = 2.5 we get +1. In the same way an age of 35 is standardized to -1, since (35 - 37.5) / 2.5 = -1. For the second variable the mean is m2 = 175 and the mean absolute deviation is s2 = 15, so each value in the second column is also standardized; for example 190 centimetres is standardized to +1 and 160 centimetres is standardized to -1.
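Here is a small Python sketch of this computation. The four rows are assumed to be A(35, 190), B(40, 190), C(35, 160), D(40, 160), which is consistent with the quoted means (37.5, 175) and mean absolute deviations (2.5, 15); the slide's exact table is not reproduced in the transcript.

import numpy as np

data = np.array([[35, 190],   # A: age (years), height (cm)  -- assumed values
                 [40, 190],   # B
                 [35, 160],   # C
                 [40, 160]])  # D

m = data.mean(axis=0)                    # column means: [37.5, 175.0]
s = np.mean(np.abs(data - m), axis=0)    # mean absolute deviations: [2.5, 15.0]
z = (data - m) / s                       # standardized, unitless values

print(z)   # each column now has mean 0 and mean absolute deviation 1

The standardized rows come out as (+-1, +-1), which is exactly the square of four points discussed on the next slide.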
(Refer Slide Time: 03:53)
1011
The resulting data matrix, which is unitless because the values are standardized, is given in table 2. Note that the new averages are 0 and the mean absolute deviations are equal to 1. Table 2 shows these standardized values for variable 1 and variable 2. Even when the data are recorded in various strange units, standardization will always yield the same numbers; that is the advantage of standardization.
(Refer Slide Time: 04:25)
Now plotting the values of table 2 in figure 2 does not give any very exciting result. What have we done? In the previous table we have the standardized values for both variables, and when you plot them four points appear. These points do not reveal anything useful: figure 2 shows no clustering structure, because the 4 points lie at the vertices of a square. One
1012
could say that there are 4 clusters, each consisting of a single point, or that there is only one big cluster containing all 4 points. Here standardization is no solution. So what we have seen is that, in many cases, standardization may not give a useful result; that is what this example shows.
(Refer Slide Time: 05:16)
Now let us look at the choice of measurement units; here the measurement means the units of the variable. What are the merits and demerits? The choice of measurement units gives rise to the relative weights of the variables: expressing a variable in smaller units leads to a larger range for that variable, which will then have a larger effect on the resulting structure. So if a variable is recorded in smaller units, it will have a larger effect on the clustering result. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such it may be used by a practitioner who possesses no prior knowledge. So the benefit of standardization is that even someone without prior knowledge of the problem can carry out the cluster analysis using the standardized, unitless variables.
(Refer Slide Time: 06:18)
1013
However, it may well be that some variables are intrinsically more important than others in a
particular application and then the assignment of weight should be based on the subject matter
knowledge. Standardization gives every variable an equal weight, but sometimes certain variables are more important; for such variables we can, with the help of subject-matter experts, assign a higher weight. On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables, and people keep trying to come up with different clustering models of this kind.
(Refer Slide Time: 06:55)
1014
Distance computation between objects. The next step is to compute distances between the objects in order to quantify their degree of dissimilarity. It is necessary to have a distance for each pair of objects i and j. The most popular choice is the Euclidean distance: d(i, j) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xip - xjp)^2). When the data have been standardized, one replaces all x by z in this expression; if you are standardizing, instead of x you use z. This formula corresponds to the true geometrical distance between the points with coordinates (xi1, ..., xip) and (xj1, ..., xjp).
(Refer Slide Time: 07:55)
Let us see the idea behind the Euclidean distance. Suppose you want to move from point A to point B: you can go directly, the way a bird would fly from A to B, and that straight-line distance is the Euclidean distance. Let us consider the special case with p = 2, where there are only two variables. The figure shows two points with coordinates (xi1, xi2) and (xj1, xj2). It is clear that the actual distance between objects i and j is given by the length of the hypotenuse of the triangle, yielding the expression in the previous slide by virtue of the Pythagoras theorem. So this formula is nothing but the
1015
hypotenuse: by the Pythagoras theorem, the square of the adjacent side plus the square of the opposite side equals the square of the hypotenuse.
(Refer Slide Time: 09:02)
Let us go to the next distance measure, the Manhattan distance. Another well-known metric is the city block or Manhattan distance. It is given by d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|, taking only the absolute (positive) values. Suppose this is a city map and you want to move from point A to B. There are two ways: if you were a bird you could fly directly from A to B; but a fire engine, say, has to follow the rectangular street grid, and that rectangular route is the Manhattan distance. You can see that the Manhattan distance will be larger than the Euclidean distance: in the figure the green line is the Euclidean distance and the blue path is the Manhattan distance.
(Refer Slide Time: 10:10)
1016
Let us interpret the meaning of Manhattan distance. Suppose you live in a city where the streets
are all north-south or east-west and hence perpendicular to each other. Let figure 3 be the part of
street map of such a city where the streets are portrayed as a vertical and horizontal lines. So if
you want to move from point A to point B, you cannot go directly along the shortest straight-line path; you have to follow a rectangular route along the streets. Another name for the Manhattan distance is the rectilinear distance.
(Refer Slide Time: 10:41)
Then the actual distance you would have to travel by car or fire engine to get from location i to location j would be |xi1 - xj1| + |xi2 - xj2|. This is the shortest length among all possible street paths from i to j. Only a bird could fly straight from point i to j,
1017
thereby covering Euclidean distance between these points. So the example of the bird, which
covers point A to B is the example for your Euclidean distance.
(Refer Slide Time: 11:21)
The mathematical requirements of a distance function: both the Euclidean metric and the Manhattan metric satisfy the following requirements for all objects i, j and h. (D1) d(i, j) >= 0; (D2) d(i, i) = 0; (D3) d(i, j) = d(j, i); (D4) d(i, j) <= d(i, h) + d(h, j). Condition D1 merely states that distances are non-negative numbers, and D2 says that the distance of an object to itself is 0. Axiom D3 is the symmetry of the distance function. The triangle inequality, axiom D4, looks a little more complicated, but it is necessary to allow a geometrical interpretation. It says essentially that going directly from i to j is shorter than making a detour over object h. For example, if these are points i, j and h, then moving from i to j directly is shorter than moving from i to h and then from h to j; that is the triangle inequality.
(Refer Slide Time: 12:59)
1018
Distance computation between the objects: d(i, j) = 0 does not necessarily imply that i = j, because it can very well happen that two different objects have the same measurements for the variables under study. In other words, if the distance between objects i and j is 0, it need not be that i and j are the same object; two distinct objects can also be at distance 0. However, the triangle inequality implies that i and j will then have the same distance to any other object h, because d(i, h) <= d(i, j) + d(j, h) = d(j, h) and at the same time d(j, h) <= d(j, i) + d(i, h) = d(i, h), which together imply that d(i, h) = d(j, h).
(Refer Slide Time: 14:13)
1019
The next distance measure is the Minkowski distance. A generalization of both the Euclidean and the Manhattan metric is the Minkowski distance, given by d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q), where the exponent q is any real number greater than or equal to 1 (written here as q to avoid confusion with p, the number of variables). This is also called the Lq metric; the Euclidean distance is the special case q = 2 and the Manhattan distance the special case q = 1.
(Refer Slide Time: 14:58)
Now let us take an example and calculate the Euclidean distance, the Manhattan distance and the Minkowski distance. Let x1 = (1, 2) and x2 = (3, 5) represent two objects; call the first point x1 and the second x2. The Euclidean distance between these two points is sqrt(2^2 + 3^2) = 3.61. The Manhattan distance between the two points is 2 + 3 = 5. You see that the Euclidean distance is smaller than the Manhattan distance, because with the Manhattan distance you cannot take the direct route; you have to take the rectangular route, which is longer. The straight line represents the Euclidean distance, while moving along one side and then the other and adding the two legs gives the Manhattan distance.
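A minimal Python sketch of these distances for the same pair of points; the q = 3 value is just an extra illustration, not from the slides.

import numpy as np

x1 = np.array([1, 2])
x2 = np.array([3, 5])

def minkowski(a, b, q):
    # Lq metric: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance
    return np.sum(np.abs(a - b) ** q) ** (1.0 / q)

print(round(minkowski(x1, x2, 2), 2))   # Euclidean : sqrt(2^2 + 3^2) = 3.61
print(round(minkowski(x1, x2, 1), 2))   # Manhattan : 2 + 3 = 5.0
print(round(minkowski(x1, x2, 3), 2))   # Minkowski with q = 3, about 3.27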
(Refer Slide Time: 16:13)
1020
Let us take another example, the n-by-n distance matrix, which is one of the inputs for a cluster analysis. The Euclidean distances between the objects of the following table are computed in the next slides. There are 8 persons, and their weights and heights are given. Let us see how to build the n-by-n matrix by calculating the distance between each pair of persons (objects). For example, the Euclidean distance between B and E is sqrt((49 - 85)^2 + (156 - 178)^2); the order of subtraction does not matter because we are squaring. Taking the square root gives 42.2, so the distance between B and E is 42.2. In the same way we can find the distance between A and B, A and C, and so on.
(Refer Slide Time: 17:17)
1021
In the n-by-n matrix you can see that the distance between A and A is 0; indeed, all diagonal entries are 0. The distance between A and B is 69.8 and the distance between A and C is 2.0. In my previous slide I explained that the distance between B and E is 42.2. You can also see that the matrix is symmetric: each upper-triangle value equals the corresponding lower-triangle value; one is the mirror image of the other.
(Refer Slide Time: 17:54)
Let us interpret this distance matrix. The distance between objects B and E can be located at the intersection of the fifth
1022
row and the second column, yielding 42.2. Going back to the previous slide and counting down to the fifth row and across to the second column, that entry is indeed 42.2.
The same number can be found at the intersection of 2nd row and 5th column because the
distance between B and E is equal to the distance between E and B, therefore the distance matrix
is always symmetric. Moreover, note that the entries of the main diagonal are always 0 because
the distance of an object to itself has to be 0.
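Here is a sketch of building such a matrix in Python with SciPy. Only persons B (49, 156) and E (85, 178) are quoted with values in the lecture, so the other rows below are placeholder values, not the slide's actual table.

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([
    [60.0, 170.0],   # A (placeholder)
    [49.0, 156.0],   # B (values quoted in the lecture)
    [62.0, 171.0],   # C (placeholder)
    [70.0, 180.0],   # D (placeholder)
    [85.0, 178.0],   # E (values quoted in the lecture)
])

D = squareform(pdist(points, metric="euclidean"))   # full n-by-n matrix
print(np.round(D, 1))
print(round(D[4, 1], 1))    # distance between B and E, about 42.2
# D is symmetric with an all-zero diagonal, so storing only the lower
# triangle would be enough.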
(Refer Slide Time: 18:54)
Since the matrix is symmetric, it suffices to write down only the lower triangular half of the distance matrix, and that is what is shown here.
(Refer Slide Time: 19:04)
1023
Now let us look at the selection of variables, because before doing cluster analysis we have to decide whether to use all the variables or only those relevant to our problem. It should be noted that a variable not containing any relevant information, say the telephone number of each person, is worse than useless, because it makes the clustering less apparent. The occurrence of several such trash variables will kill the whole clustering, because they contribute a lot of random terms to the distances, thereby hiding the useful information provided by the other variables. Therefore such non-informative variables must be given zero weight in the analysis, which amounts to deleting them. So any unimportant variable can be given zero weight so that it does not enter the calculation.
(Refer Slide Time: 20:02)
1024
So the selection of good variable is a non-trivial task and may involve quite some trial and error
in addition to subject matter knowledge and common sense. In this respect so a cluster analysis
may be considered as an exploratory technique. In this lecture we have seen the effect of
standardization then calculation of different types of distances with the help of example. I have
explained how to find out Euclidean distance, Manhattan Distance and Minkowski distance.
Then we looked at the formulation and interpretation of the n-by-n distance matrix, which is one of the inputs for cluster analysis: there are n objects described by several variables, and we find the distance between every pair of objects. Finally, I explained how to select relevant variables for the cluster analysis.
1025
Data Analytics with Python
Prof. Ramesh Anbanandam
Lecture – 51
Clustering analysis: Part III
In our previous class we have seen effect of standardization and how to find out different
distances like Manhattan distance, Euclidean distance and Minkowski distance. Then I have
explained how to select the variables. In this lecture we are going to see how to handle the situation where, while collecting the data, some values are missing or not available. Then we will take up the very important concepts of the similarity and dissimilarity matrix. That is our agenda for this lecture.
(Refer Slide Time: 01:00)
1026
First let us see how to handle missing data. It often happens that not all measurements are actually available, so there are some holes, that is, missing values, in the data matrix. Such an absent measurement is called a missing value, and it may have several causes. The value of the measurement may have been lost, or it may not have been recorded at all by oversight or lack of time. Sometimes the information is simply not available: for example the birth date of a foundling, or a patient may not remember whether he or she ever had measles, or it may be impossible to measure the desired quantity due to the malfunctioning of some instrument. In certain instances the question does not apply, or there may be more than one possible answer when repeated experiments give very different results.
(Refer Slide Time: 02:07)
1027
So how can we handle a data set with missing values? That is an important question. In the matrix we indicate an absent measurement by means of some code. If there exists an object in the dataset for which all measurements are missing, there is really no information on this object, so it has to be deleted. Analogously, a variable consisting exclusively of missing values has to be removed too.
(Refer Slide Time: 02:37)
If the data are standardized, the mean value mf of the fth variable is calculated using the present values only. The same goes for the mean absolute deviation, sf = (1/n)(|x1f - mf| + ... + |xnf - mf|), where only the terms with non-missing xif are included. In the denominator we must
1028
replace n by the number of non-missing values for that variable.
(Refer Slide Time: 03:20)
In the computation of distances based on either the xif or the zif, similar precautions must be taken: when calculating the distance d(i, j), only those variables are considered in the sum for which the measurements of both objects are present. Subsequently the sum is multiplied by p and divided by the actual number of terms; in the case of Euclidean distances this is done before taking the square root. Such a procedure only makes sense when the variables are thought of as having the same weight; for instance, this can be done after standardization.
(Refer Slide Time: 04:03)
1029
When computing these distances, one might come across a pair of objects that do not have any commonly measured variables, so their distance cannot be computed by means of the above-mentioned approach. Several remedies are possible: one could remove either object, or one could fill in some average distance value based on the rest of the data, or one could replace every missing xif by the mean mf of that variable, after which all distances can be computed. Applying any of these methods, one finally possesses a full set of distances.
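A minimal Python sketch of this rule, with NaN standing for a missing value; the two sample rows are made up for illustration.

import numpy as np

def euclidean_with_missing(xi, xj):
    # Use only variables measured for both objects, then rescale the sum of
    # squares by p / (number of terms used) before taking the square root.
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    p = len(xi)
    mask = ~np.isnan(xi) & ~np.isnan(xj)
    k = mask.sum()
    if k == 0:
        raise ValueError("no common measured variables; use a fallback rule")
    sq = np.sum((xi[mask] - xj[mask]) ** 2)
    return np.sqrt(sq * p / k)

print(euclidean_with_missing([1.0, np.nan, 3.0, 4.0],
                             [2.0, 5.0, np.nan, 6.0]))   # uses variables 1 and 4 only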
(Refer Slide Time: 04:44)
Now we go to another topic, dissimilarities. The entries of the n-by-n matrix may be Euclidean or Manhattan distances; however, there are many other possibilities, so we no longer speak of distances but of dissimilarities or dissimilarity coefficients. Basically, dissimilarities are non-negative numbers d(i, j) that are small (close to 0) when i and j are near each other and become large when i and j are very different. We shall usually assume that dissimilarities are symmetric and that the dissimilarity of an object to itself is 0, but in general the triangle inequality does not hold.
(Refer Slide Time: 05:40)
1030
Dissimilarities can be obtained in several ways. Often, they can be computed from variables that
are binary, nominal, ordinal, interval or combination of these. Also dissimilarities can be simple
subjective rating of how much certain objects differ from each other from the point of view of
one or more observers. This kind of data is typical in the social science or in the marketing.
(Refer Slide Time: 06:11)
Let us take an example to explain the concept of dissimilarities. Fourteen postgraduate economics students coming from different parts of the world were asked to indicate the subjective dissimilarities between 11 scientific disciplines. All of them had to fill in a matrix like table 4 in the next slide, where the dissimilarities had to be given as integers on a scale of 0 to
1031
10, where 0 represents identical and 10 represents very different. The actual entries of the table in the next slide are the averages of the values given by the students.
(Refer Slide Time: 06:54)
It appears that the smallest dissimilarity is perceived between mathematics and computer science, with a value of 1.43, whereas the most remote fields are psychology and astronomy. So this table is a dissimilarity matrix, and from it we can directly read which pairs have lower dissimilarity and which have higher dissimilarity.
(Refer Slide Time: 07:35)
1032
If one wants to perform a cluster analysis on a set of variables that have been observed in some population, there are other measures of dissimilarity. For instance, one can compute the parametric Pearson product-moment correlation between the variables f and g, or alternatively the non-parametric Spearman correlation. Here the dissimilarity can be found with the help of the Pearson correlation or the Spearman correlation. We know that the Pearson correlation is a parametric method and the Spearman correlation is a non-parametric, rank-based method, so the Spearman correlation can also be applied to ordinal data.
(Refer Slide Time: 08:18)
Both coefficients, the Pearson and the Spearman correlation, lie between -1 and +1 and do not depend on the choice of measurement units. We need not bother about the units, because the range of the correlation coefficient is -1 to +1, and likewise the Spearman correlation lies between -1 and +1; the value does not depend on the units of the data. The main difference is that the Pearson coefficient looks for a linear relationship between variables f and g, whereas the Spearman coefficient looks for a monotone relation. The formula for the correlation coefficient, which we have studied already, is the covariance divided by the product of the standard deviations: R(f, g) = cov(f, g) / (s(f) s(g)); it is the same formula we wrote earlier for x and y, only here we call the variables f and g.
1033
(Refer Slide Time: 09:40)
Correlation coefficients are useful for clustering purposes because they measure the extent to which two variables are related. Correlation coefficients, whether parametric or non-parametric, can be converted into dissimilarities d(f, g), for instance by setting d(f, g) = (1 - R(f, g)) / 2. With this formula, variables with a high positive correlation receive a dissimilarity coefficient close to zero, whereas variables with a strongly negative correlation will be considered very dissimilar. Why is this kind of conversion required? The range of a dissimilarity is 0 to 1, but the correlation coefficient ranges from -1 to +1, so to convert to the 0-to-1 scale we use this transformation.
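A short Python sketch of this conversion; the two series are made-up, strongly positively correlated data.

import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

R = np.corrcoef(f, g)[0, 1]      # Pearson correlation coefficient
d = (1 - R) / 2                  # dissimilarity d(f, g) = (1 - R) / 2
print(round(R, 3), round(d, 3))  # R close to +1 gives a dissimilarity close to 0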
(Refer Slide Time: 10:46)
1034
Now we come to another concept, similarities. Previously we discussed dissimilarities; now let us see what similarities are. The more alike objects i and j are, the larger the similarity s(i, j) becomes. Such a similarity s(i, j) typically takes values between 0 and 1, where 0 means that i and j are not similar at all and 1 reflects maximum similarity; values between 0 and 1 indicate various degrees of resemblance. Often it is assumed that the following conditions hold: S1: 0 <= s(i, j) <= 1, because the range of a similarity is 0 to 1; S2: s(i, i) = 1; S3: s(i, j) = s(j, i).
(Refer Slide Time: 11:48)
1035
We will continue the concept of similarities for all objects i and j the numbers s of (i, j) can be
arranged in an n-by-n matrix which is then called similarity matrix. Both similarity and
dissimilarity matrices are generally referred to as proximity matrices sometimes as a
resemblance. In order to define similarities between variables, we can again resort to a Pearson
or Spearman correlation coefficient.
However, neither correlation measure can be used directly as a similarity coefficient, because they also take negative values: we cannot take the Pearson or Spearman correlation as it is, since they range between -1 and +1 whereas similarity values lie between 0 and 1.
(Refer Slide Time: 12:41)
So in that case we have to apply some transformation in order to bring the coefficients into the zero-to-one range. There are essentially two ways to do this, depending on the meaning of the data and the purpose of the application. If variables with a strong negative correlation are considered to be very different because they are oriented in opposite directions, like the mileage and the weight of a set of cars, then it is best to take something like the transformation s(f, g) = (1 + R(f, g)) / 2. What happens here is that the added constant nullifies the negative range, which yields a
1036
similarity between f and g equal to 0 whenever the correlation coefficient is -1, because -1 and +1 cancel to give 0. This ensures that the similarity value lies between 0 and 1.
(Refer Slide Time: 13:49)
There are situations in which variables with a strong negative correlation should be grouped, because they essentially measure the same thing. For instance, this happens if one wants to reduce the number of variables in a regression dataset by selecting one variable from each cluster. In that case it is better to use a formula like s(f, g) = |R(f, g)|, the absolute value of the correlation coefficient, which yields a similarity of 1 when the correlation coefficient is -1; we take only the positive values.
(Refer Slide Time: 14:31)
1037
Suppose the data consist of a similarity matrix, but one wants to apply a clustering algorithm designed for dissimilarities. Then it is necessary to transform the similarities into dissimilarities. The larger the similarity s(i, j) between i and j, the smaller their dissimilarity d(i, j) should be; therefore we need a decreasing transformation. This is a very important result: the dissimilarity between two objects i and j is simply d(i, j) = 1 - s(i, j).
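A small Python sketch of the three transformations just discussed, applied to a few correlation values in [-1, +1]:

def sim_signed(R):
    # opposite-direction variables treated as very different
    return (1 + R) / 2

def sim_absolute(R):
    # strongly negative correlation treated as very similar
    return abs(R)

def dissim_from_sim(s):
    # decreasing transformation d = 1 - s
    return 1 - s

for R in (-1.0, -0.5, 0.0, 0.8, 1.0):
    print(R, sim_signed(R), sim_absolute(R), dissim_from_sim(sim_signed(R)))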
(Refer Slide Time: 15:13)
Let us take binary variables and find the similarity and dissimilarity for them. Suppose a contingency table for two binary objects is given: object i can take the value 1 or 0, and object j can take the value 1 or 0. Here q is the number of variables where object i is 1 and object j is also 1; r is the number where i = 1 and j = 0; s is the number where i = 0 and j = 1; and t is the number where both i and j are 0. The layout is:

                 object j = 1   object j = 0   row sum
object i = 1          q              r          q + r
object i = 0          s              t          s + t
column sum          q + s          r + t          p

So the row sums are q + r (for i = 1) and s + t (for i = 0), the column sums are q + s (for j = 1) and r + t (for j = 0), and the total q + r + s + t is p.
(Refer Slide Time: 16:30)
1038
What is the meaning of q, r, s and t? q is the number of variables that equal 1 for both objects i and j; r is the number of variables that equal 1 for object i but 0 for object j; s is the number of variables that equal 0 for object i but 1 for object j; and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.
(Refer Slide Time: 17:11)
Now, the dissimilarity between symmetric binary variables, using the previous table. What is a symmetric binary variable? An example is gender, coded say 0 for male and 1 for female; you can reverse the coding and there would be no problem. For symmetric binary variables, how do we find the dissimilarity? The dissimilarity
1039
between i and j is d(i, j) = (r + s) / (q + r + s + t), that is, the number of mismatches divided by the sum of all four counts.
(Refer Slide Time: 18:02)
Now let us see what an asymmetric binary variable means. A binary variable is asymmetric if the outcomes of its states are not equally important, such as the positive and negative outcomes of a disease test. By convention we code the most important outcome, which is usually the rarest one, by 1 (for example, HIV positive) and the other by 0 (HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is considered more significant than that of two 0s (a negative match). Therefore such binary variables are often considered "monary", as if having only one state, because we need not bother about the zero state: the zero state is the non-presence of the condition, and we are more concerned about its presence, where the state 1 is more important.
(Refer Slide Time: 19:07)
1040
Now let us see how to find the dissimilarity between objects described by asymmetric binary variables, using the same contingency table as before. The asymmetric binary dissimilarity is d(i, j) = (r + s) / (q + r + s); we consider only r and s in the numerator, and there is no t in the denominator because we are not counting the negative matches. In the symmetric binary dissimilarity the t term was present in the denominator, but in the asymmetric binary dissimilarity formula there is no t term.
(Refer Slide Time: 19:51)
We have also seen the related quantity called the Jaccard coefficient: the similarity s(i, j) = 1 - d(i, j), which works out to q / (q + r + s). Recall which cell q is, namely this
1041
one: we are concerned only with the presence of 1s, so q / (q + r + s) is the similarity between i and j for an asymmetric binary variable. If we want the dissimilarity, it is simply d(i, j) = 1 - s(i, j).
(Refer Slide Time: 20:32)
Now let us take an example and find the dissimilarity matrix for asymmetric binary variables. The table lists different patients: Jack, Mary and Jim. There is a gender column, which is a symmetric binary variable; we are not going to consider it, because the other columns are of a different kind: fever, cough, test-1, test-2, test-3 and test-4. These are asymmetric binary variables, where the presence (value 1) is more important; Y and P represent 1, and N represents 0.
(Refer Slide Time: 21:19)
1042
For this table, let us find the dissimilarity between Jack and Mary. In the contingency table there are two possibilities (1 and 0) for Mary and two possibilities (1 and 0) for Jack. Counting the variables for which both Mary and Jack take the value 1 gives two cases, so that count is 2. The count where Mary is 1 and Jack is 0 is 1, and the count where Mary is 0 and Jack is 1 is 0, since there is no such variable. Finally, the count where both Mary and Jack are 0 is 3. So the dissimilarity between Jack and Mary, using the formula, is d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33.
(Refer Slide Time: 22:50)
1043
Similarly, let us find the dissimilarity between Jack and Jim, with Jack in the rows and Jim in the columns. First, the count where Jack is 1 and Jim is also 1: there is only one such variable, so that count is 1. The count where Jack is 1 and Jim is 0 is also 1, and the count where Jack is 0 and Jim is 1 is 1 (remember N means 0 and Y means 1). Finally, the count where both Jack and Jim are 0 is 3. So the dissimilarity between Jack and Jim is d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 2/3 = 0.67.
(Refer Slide Time: 24:16)
1044
Let us take the last pair, the dissimilarity between Jim and Mary. Again there are two possibilities (1 and 0) for Mary and two for Jim. The count where Mary is 1 and Jim is also 1 is 1. The count where Mary is 1 and Jim is 0 is 2, since there are two such variables. The count where Mary is 0 and Jim is 1 is 1, and the count where both Mary and Jim are 0 is 2. So the asymmetric binary dissimilarity between Jim and Mary is (1 + 2) / (1 + 1 + 2) = 3/4 = 0.75.
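The three values can be checked with a short Python sketch. The 0/1 codings below (1 for Y or P, 0 for N) are assumed so as to be consistent with the counts worked out above; they are not copied verbatim from the slide.

import numpy as np

# order of variables: fever, cough, test-1, test-2, test-3, test-4
patients = {
    "Jack": np.array([1, 0, 1, 0, 0, 0]),
    "Mary": np.array([1, 0, 1, 0, 1, 0]),
    "Jim":  np.array([1, 1, 0, 0, 0, 0]),
}

def asym_binary_dissimilarity(a, b):
    q = np.sum((a == 1) & (b == 1))   # positive matches
    r = np.sum((a == 1) & (b == 0))   # mismatches
    s = np.sum((a == 0) & (b == 1))   # mismatches
    return (r + s) / (q + r + s)      # t (negative matches) is ignored

print(round(asym_binary_dissimilarity(patients["Jack"], patients["Mary"]), 2))  # 0.33
print(round(asym_binary_dissimilarity(patients["Jack"], patients["Jim"]), 2))   # 0.67
print(round(asym_binary_dissimilarity(patients["Jim"],  patients["Mary"]), 2))  # 0.75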
This is how to find the asymmetric binary dissimilarity between objects. In this class we have seen how to handle missing data for cluster analysis. Then I explained the concepts of the similarity and dissimilarity matrix. Then we studied symmetric and asymmetric binary variables and how to find the dissimilarity between symmetric binary variables and between asymmetric binary variables. Thank you.
1045
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 52
Cluster Analysis - IV
In my previous lecture, I explained how to handle missing data while doing clustering analysis. Second, I explained how to find the dissimilarity and similarity matrices. Third, I explained how to find the dissimilarity and similarity matrices when the variables are binary.
(Refer Slide Time: 00:46)
Now we will see how to handle the other variable types. How to use interval-scaled variables and binary variables for cluster analysis was covered in the previous lecture. In this class we will see how to use categorical variables for clustering analysis, and next how to use ordinal variables. Then we will see how to handle ratio-scaled variables, and finally data of mixed type, a combination of the above, and how to use such a dataset for our clustering analysis. So the agenda for this lecture is how to handle the following
1046
data types: interval-scaled variables and binary variables, which I covered in my previous lectures, and, in this lecture, variables that are categorical, ordinal, ratio-scaled or a combination of the above, that is, of mixed type. Let us see what these kinds of variables are and how to use them for our clustering analysis.
(Refer Slide Time: 02:04)
1047
Let the number of states of a categorical variable be capital M. The states can be denoted by
letters, symbols, or a set of integers such as 1, 2 up to M. Notice that such integers are just used
for data handling and do not represent any specific ordering.
(Refer Slide Time: 02:57)
The question which we are going to answer in this lecture is how dissimilarity computed
between objects described by categorical variables.
(Refer Slide Time: 03:05)
1048
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches. The formula for the dissimilarity of a categorical variable is d(i, j) = (p - m) / p; notice that it is a small m, where m is the number of matches, that is, the number of variables for which i and j are in the same state, and p is the total number of variables. Weights can be assigned to increase the effect of m, or to give greater weight to matches in variables having a larger number of states; so we can also weight the different states.
(Refer Slide Time: 03:48)
Now we will take an example, and with its help I will explain how to find the dissimilarity between objects described by categorical variables. The source for this example is the book Finding Groups in Data: An Introduction to Cluster Analysis by Kaufman and Rousseeuw, the publisher is
1049
John Wiley. Suppose that we have the sample data shown in the table, which has 4 columns: the first is the object identifier, the second is test-1 (categorical data), the third is test-2 (ordinal) and the fourth is test-3 (ratio-scaled). Suppose only the object identifier and the variable test-1, which is categorical, are available; so for this example we consider only two columns, the object identifier and test-1.
(Refer Slide Time: 04:47)
In the dissimilarity matrix we show only the lower triangle; 1, 2, 3, 4 are the object identifiers, and the diagonal is 0 because the dissimilarity is 0 when i = j. The off-diagonal locations are d(2, 1) in the second row, first column; d(3, 1) and d(3, 2) in the third row; and d(4, 1), d(4, 2) and d(4, 3) in the fourth row.
(Refer Slide Time: 5:18)
1050
Since here we have only one categorical variable, test-1, we set p = 1 in the equation, because p is the number of variables. Then d(i, j) evaluates to 0 if objects i and j match and to 1 if they differ, and thus we get the matrix 0; 1 0; 1 1 0; 0 1 1 0. I will tell you how we got this matrix. Consider d(2, 1): we use (p - m) / p with p = 1, and when you compare objects 1 and 2, code A and code B do not match, so m = 0 and d(2, 1) = (1 - 0) / 1 = 1. Now consider d(4, 1): using the same formula with p = 1, the code for objects 1 and 4 is the same (code A), so m = 1 and d(4, 1) = (1 - 1) / 1 = 0. In this way all the other values are found.
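A minimal Python sketch of this computation. The code for object 3 is written as "C" here by assumption; the lecture only confirms that objects 1 and 4 share code A and that object 2 has code B, which is consistent with the matrix shown.

import numpy as np

test1 = ["A", "B", "C", "A"]       # test-1 codes for objects 1..4 (object 3 assumed)
n, p = len(test1), 1               # one categorical variable

D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        m = 1 if test1[i] == test1[j] else 0   # number of matching variables
        D[i, j] = (p - m) / p

print(D)   # rows 2-4 of the lower triangle read 1 / 1 1 / 0 1 1, as on the slide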
(Refer Slide Time: 06:52)
1051
Now let us go to the next type of variable, the ordinal variable. An ordinal variable is similar to a categorical variable, but the order matters. When I explained categorical variables, one example was the PIN code: in India the PIN code is a categorical value, and the number itself does not carry any quantitative meaning. Now we start with ordinal variables. A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal variable are ordered in a meaningful sequence; this is the key point, because there is a ranking, an order, among the values. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively: for example very good, good, bad, which we can rank as 1, 2, 3.
So here 1, 2, 3 are the ranks. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate and full professor; the order is what matters. A continuous ordinal variable looks like a set of continuous data of an unknown scale: the relative ordering of the values is essential, but their actual magnitude is not. Consider the problem with the ordinal scale in your class. Suppose you assign ranks: the student who got 99 marks is rank 1 and the student who got 50 marks is rank 2. The 1 and 2 signify the ranks but not the actual values,
1052
because one student got 99 marks and the other got 50 marks. So we lose some information when we go to an ordinal dataset: for us the ranks 1 and 2 are what matter, and how many marks they got is not important.
(Refer Slide Time: 08:58)
For example, the relative ranking in a particular sport (gold, silver, bronze) is often more essential than the actual value of a particular measure: there may be three different scores, but what matters is rank 1, rank 2, rank 3, that is, who gets gold, silver and bronze, not the actual measurements. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. In other words, an interval-scaled dataset can sometimes be converted into an ordinal variable. The values of an ordinal variable can then be mapped onto ranks, so after converting to ordinal we can assign the different ranks.
(Refer Slide Time: 09:52)
1053
Let us see how to find the dissimilarity matrix for an ordinal dataset. The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps. First, the value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1 to Mf; so Mf is the maximum rank and xif is the value of the particular variable. What we have to do is replace each xif by its corresponding rank rif, where rif is the current rank and Mf is the maximum rank.
(Refer Slide Time: 10:51)
1054
Look at this data; we saw this table in the previous lecture, where the entries are Euclidean distances, so it is an example of interval-scaled data. From this interval-scaled data we can convert the table into an ordinal dataset. What we do is assign rank 1 to the lowest value: the lowest entry gets rank 1, the next smallest (5.) gets rank 2, the next (5.7) gets rank 3, and so on. For each variable of the interval dataset, you can convert to an ordinal dataset by assigning ranks 1, 2, 3 and so on, so that the highest value receives the highest rank.
(Refer Slide Time: 11:52)
1055
Standardization of an ordinal variable: since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto the 0-to-1 scale, so that every variable gets equal weight. This is achieved by replacing the rank rif of the ith object on the fth variable by zif = (rif - 1) / (Mf - 1), where rif is the current rank and Mf is the maximum rank.
(Refer Slide Time: 12:31)
Now let us see how to find out the dissimilarity computation. The dissimilarity can be computed
using any of the distance measures described by earlier like that interval data.
(Refer Slide Time: 12:44)
1056
Now let us take ordinal data and explain how to find the dissimilarity matrix. Suppose we have the sample data of the same table as before, with columns object identifier, test-1 (categorical), test-2 (ordinal) and test-3. This time we consider two columns, the object identifier and test-2, the continuous ordinal variable; only these are available to us. There are 3 states of test-2, namely fair, good and excellent, so Mf = 3, the maximum number of states.
(Refer Slide Time: 13:41)
Step 1: replace each value of test-2 by its rank; the four objects are assigned the ranks 3, 1, 2, 3 respectively.
1057
How do we get 3, 1, 2, 3? We rank the states of test-2 as 1 for fair, 2 for good and 3 for excellent, and reading the test-2 column for objects 1 to 4 gives 3, 1, 2, 3. Step 2: normalize the ranking by mapping rank 1 to 0, rank 2 to 0.5 and rank 3 to 1, using zif = (rif - 1) / (Mf - 1); so the ranks 3, 1, 2, 3 become the standardized z values 1, 0, 0.5, 1. Step 3: compute the Euclidean distance on these z values. For example, the distance between objects 2 and 1 is sqrt((0 - 1)^2) = 1. The distance between objects 3 and 1 is sqrt((0.5 - 1)^2) = 0.5, and between 3 and 2 it is sqrt((0.5 - 0)^2) = 0.5. For objects 4 and 1,
1058
the z values are both 1, so sqrt((1 - 1)^2) = 0. The distance between 4 and 2 is sqrt((1 - 0)^2) = 1, and between 4 and 3 it is sqrt((1 - 0.5)^2) = 0.5. These are the Euclidean distances computed on the standardized z values.
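A minimal Python sketch of these three steps; the state values excellent, fair, good, excellent for objects 1 to 4 are read off from the worked ranks 3, 1, 2, 3.

import numpy as np

states = ["excellent", "fair", "good", "excellent"]    # test-2 for objects 1..4
rank_of = {"fair": 1, "good": 2, "excellent": 3}
M_f = 3

r = np.array([rank_of[s] for s in states])             # ranks 3, 1, 2, 3
z = (r - 1) / (M_f - 1)                                 # 1.0, 0.0, 0.5, 1.0

D = np.abs(z[:, None] - z[None, :])                     # Euclidean distance in one dimension
print(np.round(D, 2))    # d(2,1)=1.0, d(3,1)=0.5, d(3,2)=0.5, d(4,1)=0.0, ...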
(Refer Slide Time: 17:18)
Now let us go to the next type of variable, the ratio-scaled variable. A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula A e^(Bt) or A e^(-Bt), where A and B are positive constants and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element. The idea of a ratio scale is that there is a meaningful absolute zero; when there is an absolute zero, you can perform all kinds of arithmetic operations on ratio-scaled data.
(Refer Slide Time: 18:08)
1059
Computing the dissimilarity between the objects: there are three methods to handle ratio-scaled variables. The first method is to treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice, since it is likely that the scale will be distorted; still, in most marketing examples we do not differentiate between the ratio scale and the interval scale, and we use the data for all kinds of statistical tests. The second method is to apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i, using the formula yif = log(xif); that is, if a variable is ratio-scaled, just take its logarithm and use the result for the further analysis of finding the dissimilarity matrix. The yif values can be treated as interval-valued. Notice that for some ratio-scaled variables a log-log or other transformation may be applied, depending on the variable's definition and the application; we can use different transformations as appropriate.
(Refer Slide Time: 19:31)
1060
The third method is to treat xif as continuous ordinal data and to treat the resulting ranks as interval-scaled values. The latter two methods are the most effective, although the choice of method may depend on the given application.
(Refer Slide Time: 19:49)
Now let us take an example with a ratio-scaled variable and explain how to find the dissimilarity matrix. This time we have the sample data of the same table, but only the object identifier and the ratio-scaled column are considered; so we use only these two columns for finding the dissimilarity matrix. The third column, test-3 (ratio-scaled), contains the values 445, 22, 164 and 1210.
(Refer Slide Time: 20:23)
1061
Let us try a logarithmic transformation: we simply take the log of the test-3 values. Taking the log of test-3 gives the values 2.65, 1.34, 2.21 and 3.08 for objects 1 to 4 respectively. After taking the log transformation the values are compressed, scaled down to a smaller range; that is the purpose of the transformation. So instead of using 445 you can use 2.65, and instead of 22 you can use 1.34, which is easier to handle in a cluster analysis application. Using the Euclidean distance on the transformed values, we obtain the following dissimilarity matrix for object identifiers 1, 2, 3, 4. For example, how do we get d(2, 1)? Take the difference between objects 1 and 2, that is sqrt((2.65 - 1.34)^2) = 1.31. For objects 1 and 3, the difference is sqrt((2.65 - 2.21)^2), about 0.44. For objects 4 and 1, the difference is sqrt((3.08 - 2.65)^2) = 0.43. In the same way we can fill in the other
cells.
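A short Python sketch of this transformation and of the resulting distances; log base 10 is assumed, since it reproduces the quoted values 2.65, 1.34, 2.21, 3.08.

import numpy as np

test3 = np.array([445.0, 22.0, 164.0, 1210.0])   # objects 1..4
y = np.log10(test3)                               # about 2.65, 1.34, 2.21, 3.08

D = np.abs(y[:, None] - y[None, :])               # Euclidean distance in one dimension
print(np.round(y, 2))
print(np.round(D, 2))    # d(2,1) = 1.31, d(4,1) = 0.43, matching the slide up to rounding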
(Refer Slide Time: 22:05)
1062
Now we come to the case where, when we do cluster analysis, variables of several kinds (categorical, interval-scaled, binary and so on) occur together; such data we call variables of mixed type. So far we have discussed how to compute the dissimilarity between objects described by variables of the same type, where the type may be interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal or ratio-scaled. However, in many real databases, objects are described by a mixture of variable types. Whenever such a mixture occurs, how do we use and standardize the dataset for the further steps of our cluster analysis? That is what we will see now.
(Refer Slide Time: 23:07)
1063
In general, a database can contain all six of the variable types listed above. So how can we compute the dissimilarity between objects of mixed variable types? One approach is to group each kind of variable together and perform a separate cluster analysis for each variable type. This is feasible if the analyses derive compatible results. However, in real applications it is unlikely that a separate cluster analysis per variable type will generate compatible results. So although we could group like variables and cluster each group separately, the results will often not be compatible, and we cannot follow that approach.
(Refer Slide Time: 23:52)
A preferable approach is to process all variable types together, performing a single cluster analysis. In general, then, we group all the variables together and
1064
do a single cluster analysis, which gives a meaningful result. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale, the interval 0 to 1. So one way to bring different types of variables onto a common scale is to convert all the variables onto the 0-to-1 scale, which is nothing but standardization.
(Refer Slide Time: 24:36)
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as d(i, j) = (sum over f = 1 to p of delta_ij^(f) d_ij^(f)) / (sum over f = 1 to p of delta_ij^(f)), where the indicator delta_ij^(f) = 0 if either (1) xif or xjf is missing, that is, there is no measurement of variable f for object i or object j, or (2) xif = xjf = 0 and variable f is asymmetric binary; otherwise the indicator delta_ij^(f) = 1.
(Refer Slide Time: 25:33)
1065
The contribution of variable f to the dissimilarity between i and j, that is d_ij^(f), is computed depending on its type. If f is interval-based, d_ij^(f) = |xif - xjf| / (max_h xhf - min_h xhf), where h runs over all non-missing objects for variable f. If f is binary or categorical, d_ij^(f) = 0 if xif = xjf, and otherwise d_ij^(f) = 1.
(Refer Slide Time: 26:15)
1066
(Refer Slide Time: 26:58)
The only difference from before is that for interval-based variables we normalize here, so that the values map to the interval 0 to 1. Thus the dissimilarity between objects can be computed even when the variables describing the objects are of different types. To summarize this lecture: I explained how to handle the different types of variables, namely categorical, ordinal, ratio-scaled and mixed-type variables, and how to find the dissimilarity matrix for each.
What I did in this lecture was take one example and, using that example, explain how to find the dissimilarity matrix; only for the last case, variables of mixed type, did I explain the theory portion alone. In the next class I will take another example that is mixed in nature and show how to find its dissimilarity matrix. Along with that, we will start a new topic in the next lecture, the K-means algorithm.
1067
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 53
Cluster Analysis - V
In our previous class, I explained how to handle different types of data for doing cluster analysis, and I started some theory on mixed data types and how to handle that kind of data.
(Refer Slide Time: 00:38)
The agenda for today's lecture is how to find the dissimilarity matrix for mixed-type variables. We will also do a Python demo for computing the different types of distances whose theory I explained in my previous lectures, and a Python demo for computing the distance matrix for interval-scaled data.
(Refer Slide Time: 01:01)
1068
Now let us take an example. This is a mixed-type dataset. Consider the data given in the following table and compute a dissimilarity matrix for the objects of the table. We will consider all of the variables, which are of different types. There are three types of variables here: categorical, ordinal and ratio-scaled. When this kind of mixed data is present, how do we use such a dataset for finding the dissimilarity matrix and giving it as an input for the cluster analysis?
(Refer Slide Time: 01:34)
The procedures we followed for the test-1 which is categorical data and test-2 which is ordinal
data are the same as outlined above for processing variables of mixed type. So what we have
done is, if it is a categorical variable type, we have used (p – m) / p, where p is the number of variables and m is the number of matches, as we discussed in our previous class.
1069
If it is ordinal data, we have to standardize it into 0 to 1 by using the formula z_if = (r_if – 1) / (M_f – 1), where r_if is the current rank and M_f is the maximum rank. After converting into the 0-to-1 scale, you can use our simple Euclidean method for finding the dissimilarity matrix. For an interval-scaled variable, d(f)_ij = |x_if – x_jf| / (max_h x_hf – min_h x_hf).
(Refer Slide Time: 02:33)
First, we normalize the interval-scaled data. However, we first need to complete some work for test-3, which is ratio-scaled. We have already applied a logarithmic transformation to its values; based on the transformed values we got 2.65, 1.34, 2.21 and 3.08 for objects 1 to 4, respectively. Among these values the maximum is 3.08 and the minimum is 1.34. We normalize the values in the dissimilarity matrix for the ratio data by dividing each difference by (3.08 – 1.34), that is 1.74.
(Refer Slide Time: 03:21)
1070
This results in the following dissimilarity matrix for test-3. So what happened is: there was an object identifier and a ratio-scaled variable, which I am calling x, and we have taken the log of its values, 2.65, 1.34, 2.21 and so on. We then standardize these; for example, for the cell (2, 1) we take the difference between 1.34 and 2.65 and divide its absolute value by the range, the maximum value minus the minimum value, which gives 0.75. Similarly, for the cell (4, 1) the difference is |2.65 – 3.08| divided by 1.74, which gives this value, 0.25. The same standardization is done for all the cells.
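This normalisation is easy to verify with NumPy; the sketch below starts from the log-transformed test-3 values quoted above.

import numpy as np

logged = np.array([2.65, 1.34, 2.21, 3.08])        # log-transformed test-3 values, objects 1..4
rng = logged.max() - logged.min()                  # 3.08 - 1.34 = 1.74

# dissimilarity for every pair of objects: |x_i - x_j| / (max - min)
diss = np.abs(logged[:, None] - logged[None, :]) / rng
print(np.round(diss, 2))                           # cell (2, 1) is 0.75, cell (4, 1) is 0.25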
(Refer Slide Time: 04:23)
1071
Now let us consider the dissimilarity matrices for all three variables. What are the three variables? They are the categorical, ordinal and ratio data. We can now use the dissimilarity matrices for the three variables in our computation of the combined equation, d(i, j) = [sum over f of δ(f)_ij · d(f)_ij] / [sum over f of δ(f)_ij].
(Refer Slide Time: 04:51)
For example, consider d(2, 1), the entry at location (2, 1); I will explain how we got 0.92. There are three variables: in the dissimilarity matrix for the categorical variable the (2, 1) entry is 1, for the ordinal variable it is 1, and for the ratio data it is 0.75. We find their weighted mean, giving equal weight to every kind of data. So [(1 × 1) + (1 × 1) + (1 × 0.75)] divided by the total weight of 3 gives 0.92; that is why we got this value.
Similarly, you can compute each element; for example, here (1 × 1 + 1 × 0.5 + 1 × 0.25) divided by 3 gives this value. This matrix is the resulting dissimilarity matrix obtained for data described by the three variables of mixed type, and it is given as input for doing the cluster analysis. So this is a combined dissimilarity matrix for all three kinds of variables: the categorical, ordinal and ratio data.
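The averaging itself is a one-line computation; a minimal check in Python using only the per-variable entries quoted in the lecture (equal weights of 1 are assumed):

# entry (2, 1) of the categorical, ordinal and ratio dissimilarity matrices
d_cat, d_ord, d_ratio = 1.0, 1.0, 0.75
print(round((d_cat + d_ord + d_ratio) / 3, 2))     # 0.92, the combined d(2, 1)

# entry (4, 1): the most similar pair of objects
d_cat, d_ord, d_ratio = 0.0, 0.0, 0.25
print(round((d_cat + d_ord + d_ratio) / 3, 2))     # 0.08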
(Refer Slide Time: 06:12)
1072
If you go back and look at the table of the given data, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. If you look at the (4, 1) entries, for the categorical data it is 0 and for the ordinal data it is 0, so for both of these variables objects 4 and 1 appear very similar; for the third one, the ratio data, it is 0.25. Although 0.25 is not very close to 0, when you take the equally weighted average of the three, the (4, 1) entry of the combined dissimilarity matrix, (0 + 0 + 0.25) / 3 ≈ 0.08, is the value closest to 0 among all pairs, so objects 4 and 1 are the most similar to each other; that is the point.
This is confirmed by the dissimilarity matrix, where the (4, 1) entry is the lowest value for any pair of objects. Similarly, the matrix indicates that objects 2 and 4 are the least similar: for the (2, 4) entry, each of the three individual matrices has the highest value, 1, so in the resulting combined dissimilarity matrix it is also 1, meaning highly dissimilar.
(Refer Slide Time: 07:55)
1073
Now, we will go to the distance measurement using python. In my previous class, I have
explained how to find out Euclidean distance, Manhattan distance, Minkowski distance. So this
is an example for the Euclidean distance. The formula for finding the Euclidean distance is d(i, j) = sqrt[(x_i1 – x_j1)² + (x_i2 – x_j2)² + … + (x_ip – x_jp)²]. If there are only two variables, the distance formula is sqrt[(x_i1 – x_j1)² + (x_i2 – x_j2)²].
(Refer Slide Time: 08:30)
But how do we use the Python commands? I have brought a screenshot of that. For that, import scipy and, from scipy.spatial, import distance. So how to find the Euclidean distance? We import numpy as np, and there are two points: a is one point at (1, 2, 3) and b is another point at (4, 5, 6). If you want to
1074
know the Euclidean distance between a and b, we have to write dst = distance.euclidean(a, b). When you type dst, this is our Euclidean distance between points a and b.
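Collected into one runnable cell, the commands described above look roughly like this (the actual screenshot may differ in small details):

from scipy.spatial import distance

a = (1, 2, 3)                       # point a
b = (4, 5, 6)                       # point b
dst = distance.euclidean(a, b)      # straight-line (Euclidean) distance between a and b
print(dst)                          # 5.196..., quoted as 5.19 in the lecture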
(Refer Slide Time: 09:07)
The next distance is the Minkowski distance, which is a generalization of both the Manhattan distance and the Euclidean distance: d(i, j) = [|x_i1 – x_j1|^p + |x_i2 – x_j2|^p + … + |x_in – x_jn|^p]^(1/p). Here p is a parameter: if p = 1, it is the Manhattan distance; if p = 2, it is the Euclidean distance. Let us see how to find this Minkowski distance using Python.
(Refer Slide Time: 09:46)
1075
So I have taken two points for the Minkowski distance: (1, 0, 0) and (0, 1, 0). The trailing 1 in the call indicates that we are finding the Manhattan distance, and when you run it the Manhattan distance is 2. For the same data, if you type 2 instead, you get the Euclidean distance, because p = 2 gives the Euclidean formula; that is 1.41. The next example, distance.minkowski((1, 2, 3), (4, 5, 6), 2), again represents the Euclidean distance. If you take (1, 2, 3) and (4, 5, 6), the p value can also be 3; if it is 3, the Minkowski distance is 4.32. Previously we found the distance between these same points using distance.euclidean(a, b) and got 5.19; by using the Minkowski formula on the same data with p = 2, you get the same answer. Finally, distance.minkowski between the two points (1, 2, 3) and (4, 5, 6) with 3, where p = 3, gives our Minkowski distance.
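A runnable version of these Minkowski calls, as a sketch of what the screenshot shows:

from scipy.spatial import distance

print(distance.minkowski([1, 0, 0], [0, 1, 0], 1))   # p = 1: Manhattan distance, 2.0
print(distance.minkowski([1, 0, 0], [0, 1, 0], 2))   # p = 2: Euclidean distance, about 1.41
print(distance.minkowski([1, 2, 3], [4, 5, 6], 2))   # same as distance.euclidean, about 5.196
print(distance.minkowski([1, 2, 3], [4, 5, 6], 3))   # p = 3: about 4.33 (quoted as 4.32 in the lecture)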
(Refer Slide Time: 11:07)
Now let me explain how to find the dissimilarity matrix. For that you have to import pandas as pd and, from scipy.spatial, import distance_matrix. There are three points, (1, 4), (2, 5) and (3, 6), in two columns a and b. With pd.DataFrame(data, columns=...) you get this kind of output: the row identifiers are 0, 1 and 2 and the two variables are a and b. If you want to know the distance matrix between the different identifiers, the distance between 0 and 0 is 0, between 1 and 0 it is 1.41, and between 2 and 0 it is 2.83, obtained by using the command distance_matrix(df.values, df.values).
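The same three-point example as a runnable sketch:

import pandas as pd
from scipy.spatial import distance_matrix

data = [[1, 4], [2, 5], [3, 6]]                         # three objects, two variables a and b
df = pd.DataFrame(data, columns=['a', 'b'])

dm = pd.DataFrame(distance_matrix(df.values, df.values),
                  index=df.index, columns=df.index)
print(dm)                                               # 0 on the diagonal, 1.41 and 2.83 off it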
1076
(Refer Slide Time: 12:05)
Now, the distance matrix calculation for interval-scaled variables. For example, there are 8 persons, identified as objects 1 to 8; the weight is given in kilograms and the height in centimeters, so n = 8 and p = 2. Now how do we find the distance matrix?
(Refer Slide Time: 12:30)
So I have taken the data: (15, 95), (49, 156) and so on, with the labels A, B, C, D, E, F, G, H. The data frame has the columns Weight and Height, so we got this table; weight and height are the variables and the labels are the different identifiers.
(Refer Slide Time: 12:54)
1077
Suppose you want to know the distance matrix; for that you apply pd.DataFrame(distance_matrix(...)) to these values. We get the distance matrix: between A and A it is 0, and every diagonal value is 0, because the distance of an object from itself is 0; F and F is 0, G and G is 0, H and H is 0. Between B and A it is 69.83, and between A and B it is also 69.83, just a mirror value, since the matrix is symmetric.
(Refer Slide Time: 13:29)
1078
Now let us see how to use Python to find the distance between points. So, import scipy and, from scipy.spatial, import distance; I will run that. Now let us see how to find the Euclidean distance. Import numpy as np; there are two points a and b, the position of point a is (1, 2, 3) and the position of point b is (4, 5, 6). If you want to know the distance, write dst = distance.euclidean(a, b); then let us display it, and the distance is 5.19. That is, the Euclidean distance between points a and b is 5.19.
(Refer Slide Time: 14:27)
Now let us see how to find the Minkowski distance. The Minkowski distance is calculated using the function distance.minkowski, which takes the position of point a, the position of point b and the value of p. If you use p = 1 you get the Manhattan distance; so this is the Manhattan distance.
1079
Instead of 1, if you use 2 you get the Euclidean distance; for these points it is 1.41. We already got 5.19 as the Euclidean distance between a and b, and even with the Minkowski function you can verify that answer. What I did here is call distance.minkowski with the same points, (1, 2, 3) and (4, 5, 6), and use 2, since the number 2 gives the Euclidean distance and the number 1 gives the Manhattan distance. You can see that this is also 5.19 when p = 2 is used in the Minkowski function; by using the distance.euclidean function we also got 5.19, so both are the same.
(Refer Slide Time: 15:43)
Now I will explain how to find the dissimilarity or distance matrix: import pandas as pd and, from scipy.spatial, import distance_matrix. The data is (1, 4), (2, 5), (3, 6), so there are three data points for two variables. If you want to know the distance matrix, use pd.DataFrame(distance_matrix(df.values, df.values)) and you get the distance matrix for the 3 data points and 2 variables. The distance between 0 and 0 is 0, between 1 and 0 it is 1.41, and between 2 and 0 it is 2.83.
(Refer Slide Time: 16:42)
1080
Now we will go for the distance calculation; for that, import pandas as pd and import numpy as np, and I will run it. The data matrix is given like this: there are two variables and 8 persons, A, B, C, D, E, F, G, H. If you want to know the distance matrix, follow the pd.DataFrame(distance_matrix(...)) command; the distance matrix is this one, where between A and A it is 0, between B and A it is 69.83, and between C and A it is 2.00.
(Refer Slide Time: 17:22)
So if you round to 1 decimal, we get this distance matrix, and this distance matrix is given as input for cluster analysis. In this lecture I have explained how to find out the dissimilarity matrix for mixed types of variables. What is the meaning of mixed types
1081
of variables? When we go for a cluster analysis, the dataset may be a combination of ordinal, interval and ratio data.
When these three types of data come together how to find out the dissimilarity matrix? That I
have explained. Then, by using python I have explained how to find out the distance. So I have
explained how to find out the Euclidean distance. Then I have explained how to find out the
Manhattan distance and then I have explained how to find the Minkowski distance. At the end I
have explained, with the help of Python, how to find out the distance matrix for interval-scaled data, because that distance matrix can be used as an input for our cluster analysis. Thank you.
1082
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 54
K-Means Clustering
In this lecture, we will talk about K-means clustering. Before that, I will explain the classifications of clustering methods; there are two types of classification that I will explain in this class. Under one of these classifications comes K-means clustering, for which I will solve one problem numerically with the help of an example. After solving the problem numerically, I will go to Python and explain how to use Python for doing K-means clustering.
(Refer Slide Time: 00:59)
So the agenda for this lecture is the classification of clustering methods, under which comes the partitioning method K-means clustering. That is what we will see in this class.
(Refer Slide Time: 01:03)
1083
This picture shows the classification of clustering methods. Clustering methods are generally classified into two categories: one is the partitioning method and the other is the hierarchical method. Within partitioning there is a further classification: one is K-means and the other is K-medoids. Within the hierarchical category there are two methods, the agglomerative method and the divisive method, which I will explain when I cover hierarchical clustering. In this lecture we are going to discuss K-means clustering.
(Refer Slide Time: 01:39)
So which clustering algorithm should we choose? Previously I mentioned two methods, one is partitioning and the other is hierarchical. The K-means algorithm is generally used when you know in advance how many clusters are required; in that case you can go for this partitioning
1084
method. If you have no idea how many clusters you need, then you can go for hierarchical clustering. Another point: the choice of clustering algorithm depends upon the type of data available and the particular purpose.
By particular purpose, I mean whether you want to fix in advance how many clusters are required, or whether you build the full set of groupings and later let the user choose the right number of clusters. It is permissible to try several algorithms on the same data, because cluster analysis is mostly used as a descriptive or exploratory tool.
(Refer Slide Time: 02:37)
First we will talk about the partitioning method. In the partitioning method, the data given are a data set of n objects and k, a user-defined number of clusters; in advance we know how many clusters we are going to have. A partitioning algorithm organizes the objects into k partitions, where k <= n, and each partition represents a cluster. The clusters are formed so as to optimize an objective partitioning criterion. I will explain what the partitioning criterion is in the next slide; it is, for example, a dissimilarity function based on distance. Within a cluster the dissimilarity should be very low, and between clusters the dissimilarity should be high. Therefore, the objects within a cluster are similar, whereas the objects of different clusters are dissimilar in terms of the dataset attributes.
1085
(Refer Slide Time: 03:40)
So partitioning methods are applied if one wants to classify the objects into k clusters, where k is
fixed.
(Refer Slide Time: 03:49)
It is a centroid-based technique; the centroid is nothing but the mean, a kind of center of gravity. The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters because, as I told you, k is one of the input parameters. The partitioning is done so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low: suppose there are cluster 1 and cluster 2; within each cluster the objects are highly homogeneous,
1086
but between the clusters there should be low similarity, which means the dissimilarity is high. Cluster similarity is measured with respect to the mean value of the objects in the cluster, which can be viewed as the cluster's centroid or center of gravity, as I told you.
(Refer Slide Time: 04:45)
The working principle of the K-means algorithm is shown in this flowchart. At the start, you should know the number of clusters in advance. Then you form the centroids: you can randomly choose certain points as the initial centroids. Then you compute the distances of the objects to the centroids, that is, for each object, how far away it is from each centroid. Then you group based on the minimum distance: each point is assigned to the cluster whose centroid is closest to it. You continue this for all points; if no object moves to a different group you can stop, otherwise you continue the cycle.
(Refer Slide Time: 05:32)
1087
Working principle of the K-means algorithm: first, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster; here, by mean I mean the centroid. This process iterates until the criterion function converges.
(Refer Slide Time: 06:05)
Let us see what this criterion function is. It is E = Σ_{i=1..k} Σ_{p ∈ C_i} |p – m_i|², where the outer sum runs over all k clusters and the inner sum runs over all points p in cluster C_i. Here p is the point in space representing a given object and m_i is the mean of cluster C_i. It is a kind of squared deviation from the cluster mean: we square each deviation and then we are
1088
summing over all clusters. For each object in each cluster, the distance from the object to its cluster center is squared, and these squared distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
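A small sketch of this squared-error criterion in Python; the four points and their cluster assignment below are invented purely to show the computation.

import numpy as np

def sse(points, labels, centroids):
    # sum over clusters of the squared distances from each point to its cluster centroid
    return sum(np.sum((points[labels == k] - c) ** 2) for k, c in enumerate(centroids))

pts = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])   # hypothetical 2-D objects
lab = np.array([0, 0, 1, 1])                                       # assumed cluster membership
cen = np.array([pts[lab == 0].mean(axis=0), pts[lab == 1].mean(axis=0)])
print(sse(pts, lab, cen))                                          # the criterion value E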
(Refer Slide Time: 06:59)
For example, take k = 3 and suppose there are n data points. Randomly, I form 3 clusters. After forming the 3 clusters, I find the centroid of each cluster; then, from each centroid, I look at all the other objects and their distances, and if a point is closer to a particular centroid I bring that point into that cluster. Then I update the clusters, and so on, continuing until all the objects are grouped into 3 clusters such that the intra-cluster distances are small and the inter-cluster distances are large. I will explain this with the help of a numerical example.
(Refer Slide Time: 07:45)
1089
Algorithm: k-means, the k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster. What are the inputs for the k-means algorithm? The number of clusters k and a data set containing n objects. What is the output? A set of k clusters.
(Refer Slide Time: 08:05)
The K-means clustering method, as I have also explained in the flowchart, is: arbitrarily choose k objects from the dataset D as the initial cluster centers. Then repeat: assign each object to the cluster to which it is most similar, that is, the cluster whose mean it is closest to, based on the mean value of the objects in the cluster; then update the cluster means, i.e., calculate the mean value of the objects for each cluster. Repeat until there is no change.
1090
(Refer Slide Time: 08:40)
Now let us take a numerical example. There are 7 individuals, 1 to 7, and two variables, variable 1 and variable 2. As I told you, there is a small difference between clustering and factor analysis: in factor analysis we group the variables into different categories, but in cluster analysis it is the respondents, the individuals, that have to be grouped. That is the difference. Now there are 7 people, and we are going to cluster these 7 people into some number of groups. Let us see what that number is.
(Refer Slide Time: 09:17)
Suppose here k = 2; initially assume that k is given as 2, so I want to make 2 clusters. In the plot, variable 1 is on the x-axis and variable 2 is on the y-axis. So
1091
points 1 and 3 are randomly taken as the initial centers; since k = 2, there are 2 clusters.
(Refer Slide Time: 10:05)
So, for points 1 and 3: the position of point 1 is (1, 1) and the position of point 3 is (3, 4). What is the initialization? We randomly choose the following two centroids for k = 2, that is, for two clusters. In this case the two centroids are the points themselves: K1 = (1, 1) and K2 = (3, 4). We then calculate the Euclidean distance, using the given equation, between all the points and the cluster centers; the formula for the Euclidean distance is sqrt[(x2 – x1)² + (y2 – y1)²].
(Refer Slide Time: 10:47)
1092
Since there are 2 centers, we compute the following distances. The distance of K1 from itself is 0. Between the two centers, that is points 1 and 3, i.e., K1 and K2: the position of K1 is (1, 1) and the position of K2 is (3, 4), so the distance is sqrt[(3 – 1)² + (4 – 1)²] = 3.61. The distance of K2 from itself is, obviously, also 0. So for clusters K1 and K2, the distance between the two centers is 3.61, which is the value shown here. Now we will update this.
(Refer Slide Time: 11:39)
1093
Now what we are going to do is take the next individual, that is individual 2, at (1.5, 2). From cluster 1, I find how far away this point is; similarly, from cluster 2, that is individual 3, I find its distance. The distance from cluster 1 is sqrt[(1.5 – 1)² + (2 – 1)²], and the distance from cluster 2 (remember, points 1 and 3 are the initial cluster centers) is sqrt[(1.5 – 3)² + (2 – 4)²]. So for the data point (1.5, 2), the distance to cluster 1 is 1.12 and the distance to cluster 2 is 2.5. This point is closer to cluster 1 because its distance, 1.12, is smaller, so we assign this individual to cluster 1, that is, K1. So this point is assigned to K1.
(Refer Slide Time: 13:05)
So what happened is that this point, (1.5, 2), has been added, and now individuals 1 and 2 are in the same cluster. After assigning, we are going to update the centroid of this cluster, that is, of points 1 and 2.
(Refer Slide Time: 13:24)
1094
Now we update the centroid of cluster K1. Why are we updating K1? Initially it had one individual and now one more individual has entered the cluster, so we update its attributes: the centroid of K1 for variable 1 is (1 + 1.5) / 2 = 1.25, and for variable 2 it is (1 + 2) / 2 = 1.5. K2 remains as it is.
(Refer Slide Time: 13:57)
Now we look at individual 4, whose attributes are (5, 7). Let us find its distance from cluster 1 and, similarly, from cluster 2. In cluster 1 there are already two points, so its centroid has already been updated. The distance from cluster 1 therefore uses the updated centroid value 1.25
1095
for variable 1: it is sqrt[(5 – 1.25)² + (7 – 1.5)²], where 1.5 is the new centroid of cluster 1 for variable 2, and that gives 6.66. From cluster 2 the distance is sqrt[(5 – 3)² + (7 – 4)²] = 3.61. We bring these values into the table: for the data point (5, 7), the distance from cluster 1 is 6.66 and from cluster 2 it is 3.61. This point is closer to cluster 2, so we assign the point (5, 7) to cluster 2.
(Refer Slide Time: 15:16)
Now what happened this point is very close to this one. So we assigned this into this cluster.
(Refer Slide Time: 15:25)
1096
After assigning, as I told you, we have to update the centroid of cluster 2, because cluster 2 initially had only one point and now one more point has entered. The new centroid is (3 + 5) / 2 = 4 for variable 1, and for variable 2 it is (4 + 7) / 2 = 5.5. So (4, 5.5) is the new centroid of K2.
(Refer Slide Time: 15:54)
Now we take individual 5, at (3.5, 5), and find how far this individual is from cluster 1 and cluster 2. From cluster 1 it is sqrt[(3.5 – 1.25)² + (5 – 1.5)²], using the centroid of cluster 1, which gives 4.16. From cluster 2 it is sqrt[(3.5 – 4)² + (5 – 5.5)²], using the updated centroid (4, 5.5) of cluster 2, which gives 0.71. Let us bring these values into the table: the distance between this point and cluster 1 is 4.16 and the distance between (3.5, 5) and cluster 2 is 0.71. So this data point is closer to cluster 2, and we assign this point also to cluster 2. After assigning, this point belongs to cluster 2.
(Refer Slide Time: 17:07)
1097
Now we go to the next individual. Before that, we update the centroid of cluster 2, because cluster 2 now has three points: (3, 4), (5, 7) and (3.5, 5). For variable 1 the centroid is nothing but the average, (3 + 5 + 3.5) / 3 = 3.83, and for variable 2 the centroid of K2 is (4 + 7 + 5) / 3 = 5.33. This is the new centroid of K2.
(Refer Slide Time: 17:40)
Now we take the next individual, whose data point is (4.5, 5), and see how far it is from cluster 1 and cluster 2. From cluster 1 it is sqrt[(4.5 – 1.25)² + (5 – 1.5)²] = 4.78, using the centroid of cluster 1. From cluster 2 it is sqrt[(4.5 – 3.83)² + (5 – 5.33)²], where 3.83 and 5.33 come from the updated centroid of cluster 2, so the distance is
1098
this value. By looking at the table, this data point is closer to cluster 2, so we assign it also to K2.
(Refer Slide Time: 18:35)
So after assigning, this point belongs to cluster 2. Again we will update.
(Refer Slide Time: 18:41)
Now cluster 2 has 4 data points: (3, 4), (5, 7), (3.5, 5) and (4.5, 5). We find the centroid: for variable 1, add the four values and divide by 4, which gives 4; for variable 2, (4.0 + 7.0 + 5.0 + 5.0) / 4 = 5.25. This is the updated centroid of K2.
(Refer Slide Time: 19:11)
1099
Now we take the last point, (3.5, 4.5), and see its distance from cluster 1 and cluster 2. From cluster 1 it is sqrt[(3.5 – 1.25)² + (4.5 – 1.5)²] = 3.75. From cluster 2, using the updated centroid (4, 5.25), it is sqrt[(3.5 – 4)² + (4.5 – 5.25)²], which is about 0.90. Again, this point is closer to cluster 2, so we assign it to cluster 2.
(Refer Slide Time: 19:50)
So now we have assigned everything. One cluster contains individuals 3, 5, 7, 6 and 4; the other cluster contains individuals 1 and 2.
(Refer Slide Time: 20:06)
1100
After that we again find the centroid of cluster 2. There are 5 data points: adding all the variable-1 values and dividing by 5 gives 3.9, and adding all the variable-2 values and dividing by 5 gives 5.1. Now there are two clusters: the centroid of cluster 1 is (1.25, 1.5) and the centroid of cluster 2 is (3.9, 5.1). We will verify these values when I show the Python demo.
(Refer Slide Time: 20:38)
Now this is the summary of our result. These individuals form one group, cluster 1, and these people form cluster 2. The property is that the people within each cluster are similar to one another, while the two clusters are far apart from each other. I have now solved this problem manually; next we will go to the Python environment and I will explain the same problem there.
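Before going through the screenshots, here is a compact sketch of the same example with scikit-learn, assuming it is installed; the seven points are the ones used in the hand calculation above, while the random_state and n_init settings are my own choices.

import numpy as np
from sklearn.cluster import KMeans

# the seven individuals from the worked example (variable 1, variable 2)
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster membership of each individual (label numbering may differ)
print(kmeans.cluster_centers_)   # approximately (1.25, 1.5) and (3.9, 5.1)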
1101
(Refer Slide Time: 21:23)
So I have brought the screenshot for the K-means clustering. We imported the required libraries: import pandas as pd, import numpy as np, import matplotlib.pyplot as plt; and the data is this one. So this was our data.
(Refer Slide Time: 21:41)
First we have plotted the scatter plot, a scatter plot with labels.
(Refer Slide Time: 21:48)
1102
Now, this is the final result of the cluster analysis. What happened? This is one group and this is another group; the blue point represents the centroid of cluster 1 and the red point represents the centroid of cluster 2. Now we will go to the Python environment and I will tell you how to do this K-means clustering in Python.
(Refer Slide Time: 22:12)
Now I am going to explain how to use python for doing k-means algorithm. I have taken two
examples; one example is what I have explained in my presentation. First we will import
necessary libraries pandas, numpy, matplotlib.pyplot and so on. Next we will import the data.
(Refer Slide Time: 22:35)
1103
When you look at the data, there are 7 individuals with variable 1 and variable 2. After that we will plot these individuals in a two-dimensional graph.
(Refer Slide Time: 22:50)
Now we are able to see all 7 individuals and their positions; for example, the position of individual 1 is (1, 1), the position of individual 2 is (1.5, 2) and so on. For running the k-means clustering algorithm we have to import the library: from sklearn.cluster import KMeans, then kmeans = KMeans(n_clusters = 2). So this is k = 2; if you want 3 clusters, which is our next example, you substitute n_clusters = 3. We will run this.
(Refer Slide Time: 23:27)
1104
After running, we verify that the two clusters have been formed. Now we check the centroids of the two clusters: one centroid is (3.9, 5.1), which was my cluster 2 centroid, and for cluster 1 the centroid is (1.25, 1.5).
(Refer Slide Time: 23:50)
Let us see it in picture form; this is the final output. The blue point is the centroid of cluster 1 and the red point is the centroid of cluster 2. So now two clusters are formed: cluster 1 contains individuals 1 and 2, and cluster 2 contains individuals 3, 5, 7, 6 and 4. This is exactly the result which I obtained in the presentation.
(Refer Slide Time: 24:21)
1105
I will take another example where, instead of k = 2, we go for 3 clusters; this is a different data set. In it there are 8 individuals, with an x variable and a y variable.
(Refer Slide Time: 24:37)
In the same way, we plot them in a two-dimensional plot, as shown here.
(Refer Slide Time: 24:41)
1106
So there are 8 individuals and their positions.
(Refer Slide Time: 24:47)
We will go for the k-means algorithm. You see that k = 3, so we are going to have 3 clusters. After running it we can find the centroids; I display the centroid values of these three clusters and run it. Now there are three clusters, and their centroids are (7, 4.3), (3.6, 9) and (1.5, 3.5).
(Refer Slide Time: 25:23)
1107
Now let me show the picture of the final output. There are three clusters, shown in different colors: the blue point marks the centroid of cluster 1, the red point the centroid of cluster 2, and the green point the centroid of cluster 3. In this lecture I have explained the classification of clustering methods. We know that there are two types of classification: one is the partitioning method and the other is the hierarchical method.
Under the partitioning method there are two further classifications, the K-means clustering algorithm and K-medoids; under the hierarchical method there are also two, the agglomerative and the divisive method. In this lecture I have covered only the k-means algorithm: I took one numerical problem and, with its help, explained the step-by-step procedure of the k-means algorithm. After that I explained the same problem in Python, showing how to run k-means with k = 2. Apart from that, I took one more example in the Python environment and explained there how to make three clusters by taking k = 3. In the next lecture I will explain the agglomerative method of clustering with the help of an example. Thank you.
1108
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 55
Hierarchical Method of Clustering - I
In our previous lecture, I explained the K-means algorithm, which is one type of clustering technique.
(Refer Slide Time: 00:32)
There is another type of technique, the hierarchical method of clustering; let us see in this lecture what the hierarchical method of clustering is, and let us compare the partitioning method versus hierarchical clustering methods. So the agenda for this lecture is an introduction to hierarchical clustering, followed by a comparison of partitioning versus hierarchical clustering methods.
(Refer Slide Time: 00:53)
1109
A hierarchical method creates a hierarchical decomposition of the given set of data objects; a hierarchical clustering method works by grouping data objects into a tree of clusters. I will explain what this tree of clusters is in the next slides. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. So there are two ways, we can say, in the hierarchical method: one is agglomerative and the second one is divisive. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group.
(Refer Slide Time: 01:42)
1110
It successively merges the objects or groups that are close to one another, until all of the groups are merged into the topmost level of the hierarchy or until a termination condition holds. On the other hand, the other classification in the hierarchical method, the divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in its own cluster
(Refer Slide Time: 02:27)
or until a termination condition holds. Hierarchical methods suffer from the fact that once a step is done, it can never be undone; the problem with hierarchical clustering is that once a step is done, you cannot go back and correct a mistake. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices; however, such techniques cannot correct erroneous decisions, and that is the main drawback of hierarchical methods.
(Refer Slide Time: 03:04)
1111
Let us compare the agglomerative method versus the divisive method; both are hierarchical methods, so let us see how they differ from each other. The agglomerative method is a bottom-up strategy: it starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all of the objects are in a single cluster or until a certain termination condition is satisfied. Most hierarchical clustering methods belong to this category. On the other hand, the divisive method is a top-down strategy that does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. So in divisive methods you start from one big cluster and then make smaller ones, like cutting a big cake into small pieces, whereas in the agglomerative method each object starts as a separate cluster, and from these you form larger and larger clusters in a kind of tree; it is up to you to decide where to apply the termination condition. The divisive method subdivides the cluster into smaller and smaller pieces until each object forms a cluster on its own, or until it satisfies a certain termination condition, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.
(Refer Slide Time: 04:47)
1112
This picture explains the difference between the agglomerative and the divisive method. You see that the arrow for the agglomerative method goes from left to right. There are 5 objects, a, b, c, d, e, and we start with each object in a separate cluster: in step 0 there are 5 clusters, each containing only one object. In step 1, a and b are clubbed into ab. In step 2, d and e are clubbed; in step 3, c, d and e are clubbed; and in step 4 all of a, b, c, d, e are clubbed together. So the process goes from left to right; that is the agglomerative method. In the divisive method the arrow goes the other way: you start from abcde, the big cluster containing all the elements. Look at step 0: there is only one cluster. In step 1, cde is separated out; in step 2, de becomes another cluster; and the splitting continues until, in the final step, all the individual elements are separate clusters. That is the basic difference between agglomerative and divisive hierarchical clustering.
(Refer Slide Time: 06:20)
1113
What is the interpretation of the previous slide? Figure 1 shows the application of AGNES (agglomerative nesting), an agglomerative hierarchical clustering method, and DIANA (divisive analysis), a divisive hierarchical clustering method, to a data set of 5 objects: a, b, c, d, e. Initially, the agglomerative method places each object into a cluster of its own, containing only one item. The clusters are then merged step by step according to some criterion; for example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any 2 objects from different clusters.
(Refer Slide Time: 07:13)
This is the single linkage approach, in which each cluster is represented by all of the objects in the cluster and the similarity between 2 clusters is measured by the similarity of the closest pair of
1114
data points belonging to different clusters. The cluster merging process repeats until all of the objects are eventually merged to form 1 cluster. So what is happening here is that the process starts from step 0 and goes up to step 4; you see that in step 0 there are 5 clusters.
(Refer Slide Time: 07:56)
By step 4, all are merged into only 1 cluster, that is, abcde. In DIANA, that is the divisive method, all of the objects are used to form 1 initial cluster; this cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighbouring objects in the cluster. The cluster splitting process repeats until eventually each new cluster contains only 1 object, a single object. In either agglomerative or divisive hierarchical clustering, the user can specify the desired number of clusters as a termination condition. So here the explanation of the divisive method is that it starts with only 1 cluster in step 0, and in step 4 there are 5 clusters; that is the difference.
(Refer Slide Time: 08:43)
1115
In hierarchical clustering, another important piece of terminology you have to understand is the dendrogram. What is a dendrogram? It is a kind of tree structure with different levels, level 0, 1, 2, 3, 4, shown on the left-hand side, and a similarity scale on the right-hand side. At level 0 there are 5 clusters: a, b, c, d and e each form their own cluster. At level 1, ab forms one cluster. At level 2, d and e are merged; the position of c is then considered, and we find the distance between c and the cluster ab and the cluster de. Since it is closer to d and e, c, d and e form another cluster; that is level 3. At level 4 there is only one cluster, with all 5 elements present in it. This kind of picture is called a dendrogram.
(Refer Slide Time: 09:45)
1116
Dendrogram: a tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering; it shows how objects are grouped together step by step. Figure 2 shows a dendrogram for the 5 objects presented in Figure 1, where l = 0 (level 0) shows the 5 objects as singleton clusters; there is only 1 element in each cluster at level 0. At level 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels. This hierarchical structure can be understood with the help of the dendrogram.
(Refer Slide Time: 10:27)
We can also use a vertical axis to show the similarity scale between the clusters; here it is given on the right-hand side of the picture. For example, when the similarity of the two groups of objects {a, b} and {c, d, e} is roughly 0.16, they are merged together to form a single cluster.
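A dendrogram of this kind can be drawn with SciPy; the sketch below is only an illustration, and the five 2-D points standing in for a, b, c, d, e are made up.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# five hypothetical 2-D objects playing the roles of a, b, c, d, e
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 5.3], [4.6, 5.0]])

Z = linkage(X, method='single')                      # single-linkage agglomerative clustering
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])      # the tree of merges, level by level
plt.ylabel('distance')
plt.show()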
1117
(Refer Slide Time: 10:49)
Now let us go to another important idea, measures of distance between clusters. There are 4 widely used measures of distance between clusters, where |p – p'| is the distance between two objects or points p and p', m_i is the mean of cluster C_i, and n_i is the number of objects in C_i. The first measure is the minimum distance, d_min(C_i, C_j) = min over p in C_i and p' in C_j of |p – p'|. The maximum distance is d_max(C_i, C_j) = max over p in C_i and p' in C_j of |p – p'|. Another measure is the mean distance, d_mean(C_i, C_j) = |m_i – m_j|, the distance between the means of the two clusters. The average distance is d_avg(C_i, C_j) = [1 / (n_i n_j)] times the sum over p in C_i and p' in C_j of |p – p'|, where n_i and n_j are the numbers of objects in the two clusters.
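These four measures can be computed directly; a minimal sketch with two hypothetical clusters (the points are invented for illustration):

import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[1.0, 1.0], [1.5, 2.0]])              # hypothetical cluster C_i
Cj = np.array([[5.0, 7.0], [4.5, 5.0], [3.5, 5.0]])  # hypothetical cluster C_j

D = cdist(Ci, Cj)                                    # all pairwise distances |p - p'|
d_min = D.min()                                      # minimum distance (single linkage)
d_max = D.max()                                      # maximum distance (complete linkage)
d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # distance between the two means
d_avg = D.mean()                                     # average of all n_i * n_j pairwise distances
print(d_min, d_max, d_mean, d_avg)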
(Refer Slide Time: 12:08)
1118
When an algorithm uses the minimum distance d_min(C_i, C_j) to measure the distance between clusters, it is sometimes called a nearest-neighbour clustering algorithm; I will show this in a picture in the coming slides. Moreover, if the clustering process is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. If we view the data points as nodes of a graph, with edges forming a path between the nodes in a cluster, then the merging of two clusters C_i and C_j corresponds to adding an edge between the nearest pair of nodes in C_i and C_j.
(Refer Slide Time: 12:55)
1119
Because edges linking clusters always go between distinct clusters, the resulting graph will form a tree; thus an agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm. Even in the subject of operations research, under the topic of network problems, you might have studied the minimal spanning tree algorithm. When an algorithm uses the maximum distance d_max(C_i, C_j) to measure the distance between clusters, it is sometimes called a farthest-neighbour clustering algorithm. If the clustering process is terminated when the maximum distance between the nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
(Refer Slide Time: 13:57)
By viewing the data points as nodes of a graph with edges linking the nodes, we can think of each cluster as a complete subgraph, that is, with edges connecting all of the nodes in the cluster. The distance between two clusters is determined by the most distant nodes in the two clusters. Farthest-neighbour algorithms tend to keep the increase in the diameter of the clusters at each iteration as small as possible. If the true clusters are rather compact and approximately equal in size, the method will produce high-quality clusters; otherwise the clusters produced can be meaningless.
(Refer Slide Time: 14:40)
1120
Let us go to the choice of measurement. The above minimum and maximum measures represent two extremes in measuring the distance between clusters, and they tend to be overly sensitive to outliers or noisy data. The use of the mean or average distance is a compromise between the minimum and maximum distances and overcomes the outlier sensitivity problem. Whereas the mean distance is the simplest to compute, the average distance is advantageous in that it can handle categorical as well as numeric data.
(Refer Slide Time: 15:24)
This picture shows the distance measures. The first one represents the group average: you see that every point is connected with all the points of the other cluster; R is one cluster and Q is one cluster, and we are finding the average, that is, the group average. This is a
1121
representation of one definition of inter-cluster dissimilarity. The second one is the nearest neighbour, and the third one is the farthest neighbour, which I explained in my previous slides; these are the different types of distance measures.
(Refer Slide Time: 16:08)
This picture shows some types of clusters: the first one here is ball-shaped, the second one is elongated, and the last one is compact but not well separated. What will happen is this: if we follow the group average, the final clusters may take the ball-shaped form. If we follow the nearest-neighbour distance measure, the final clusters may look elongated, since at any time a chain of nearby points can pull us across to the other cluster. If you follow the farthest-neighbour distance measure, the final clusters may be compact but not well separated. That is why choosing the correct distance is important: based on your distance measure, the shape of the final clusters will also vary.
(Refer Slide Time: 17:15)
1122
Now, the difficulties with hierarchical clustering: the hierarchical clustering method, though simple, often encounters difficulties regarding the selection of merge or split points. Such a decision is critical because, once a group of objects is merged or split, the process at the next step operates on the newly generated clusters. As I told you, this is also one of the drawbacks: once a cluster is formed, any mistake that has happened cannot be rectified if you follow hierarchical clustering methods. It will neither undo what was done previously nor perform object swapping between clusters; these are some of the disadvantages of hierarchical clustering.
(Refer Slide Time: 18:02)
1123
Thus, merge or split decisions, if not well chosen at some step, may lead to low-quality clusters; moreover, the method does not scale well, because each decision to merge or split requires the examination and evaluation of a good number of objects or clusters. One way of improving the cluster quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase clustering. So, if you want to improve the quality of hierarchical clustering, it can be combined with other clustering algorithms so that you can improve the quality of the clustering.
(Refer Slide Time: 18:49)
Now let us compare the partitioning clustering algorithm versus the hierarchical clustering algorithm. First, what are the general characteristics of partitioning methods, of which the K-means algorithm is one? They find mutually exclusive clusters of spherical shape, they are distance-based, they may use the mean or a medoid to represent the cluster center, and they are effective for small to medium-sized datasets.
1124
Let us compare K means versus hierarchical clustering.
(Refer Slide Time: 19:43)
K-means clustering is a non-hierarchical method because it uses a pre-specified number of clusters, K. So when we do K-means clustering, we know in advance how many clusters we are going to have. This method assigns records to clusters so as to find mutually exclusive clusters of spherical shape based on distance. In K-means clustering one can use the mean or the median as a cluster centre to represent each cluster.
1125
Hierarchical clustering, on the other hand, can be either agglomerative or divisive. The agglomerative method begins with n clusters and sequentially merges similar clusters until a single cluster is obtained.
(Refer Slide Time: 20:32)
K-means clustering methods are generally less computationally intensive and are therefore preferred for very large datasets. In hierarchical clustering, divisive methods work in the opposite direction, starting with one cluster that includes all the records. Hierarchical methods are especially useful when the goal is to arrange the clusters into a natural hierarchy.
(Refer Slide Time: 21:01)
1126
A partitioning, which is what K-means clustering produces, is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset; a hierarchical clustering is a set of nested clusters that are organised as a tree.
(Refer Slide Time: 21:23)
When you look at this picture, the left-hand side shows un-nested clusters, which we can say are K-means clusters. The right-hand side is called a nested clustering, which is nothing but agglomerative or hierarchical clustering.
(Refer Slide Time: 21:41)
Hierarchical clustering does not assume a particular value of K, as is needed by K-means clustering; the generated tree may correspond to a meaningful taxonomy, and only a distance or proximity
1127
matrix is needed to compute the hierarchical clustering. This is an example of a proximity matrix: between a and a the distance is 0, and between b and a the distance is 184.
(Refer Slide Time: 22:12)
In K-means clustering, since one starts with a random choice of cluster centers, the results produced by running the algorithm multiple times might differ. K-means is found to work well when the shape of the clusters is hyper-spherical, like a circle in 2 dimensions or a sphere in 3 dimensions. In hierarchical clustering the results are reproducible, but hierarchical clustering does not work as well as K-means when the shape of the clusters is hyper-spherical.
(Refer Slide Time: 22:51)
1128
So K-means clustering is suitable for hyper-spherical clusters, and it requires prior knowledge of K, that is, the number of clusters one wants to divide the data into. In hierarchical clustering, one can stop at any number of clusters one finds appropriate by interpreting the dendrogram.
(Refer Slide Time: 23:04)
There are 2 pictures: the top one is an example of K-means clustering where K = 3, and the bottom one is hierarchical clustering; you see that there is a hierarchy there, so this is an example of hierarchical clustering.
(Refer Slide Time: 23:18)
1129
The advantage of hierarchical clustering is the ease of handling any form of similarity or distance; consequently it is applicable to any attribute type. Here an attribute is a variable type: it may be interval, ratio, binary or categorical.
(Refer Slide Time: 23:38)
1130
Limitations of hierarchical clustering: with respect to the choice of distance between clusters, single and complete linkage are robust to changes in the distance metric as long as the relative order is kept. As an example, when you look at this picture, there is a cluster 1 and a cluster 2; the minimum distance between them is called single linkage, and the distance between their farthest points is called complete linkage. If you use these distance measures, the clusters you get are quite robust. In contrast, average linkage is more influenced by the choice of distance metric and might lead to completely different clusters when the metric is changed. Hierarchical clustering is also sensitive to outliers: if there are any extreme data points, they may produce quite different clusters.
(Refer Slide Time: 25:14)
1131
Then we go to average linkage clustering. What is an example of average linkage clustering? This one: you see that all the pairwise distances are taken and then we find their average. It is a compromise between single and complete linkage. The strength of average linkage clustering is that it is less susceptible to noise and outliers; the limitation is that it is biased towards globular clusters. So when you use average linkage clustering, the clusters will often be roughly spherical in shape.
(Refer Slide Time: 25:58)
Now let us see the advantages of K-means clustering; previously we saw the advantages of hierarchical clustering. The advantage of K-means clustering is that the centre of mass can be found efficiently by taking the mean value of each coordinate; this leads to an efficient algorithm that computes the new centroids with a single scan of the data. The disadvantages are that K-means has
1132
problems when the clusters are of different sizes, densities or non-globular shapes, and when the data contains outliers.
(Refer Slide Time: 26:34)
What is the similarity between hierarchical clustering and K-means clustering? The two most popular methods are hierarchical agglomerative clustering and K-means clustering; in both cases we need to define two types of distance, the distance between two records and the distance between two clusters, and in both cases there is a variety of metrics that can be used. In this lecture I have given an introduction to hierarchical clustering, compared K-means clustering techniques with hierarchical clustering techniques, and explained their advantages and disadvantages. In the next lecture we are going to take one numerical example and, with its help, I am going to explain how to do hierarchical clustering. Thank you very much.
1133
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 56
Hierarchical Method of Clustering - II
In the previous lecture, I gave an introduction to hierarchical clustering and the different types of distance measures. In this lecture I have taken a numerical example, and with its help I am going to explain how to do the hierarchical clustering method; for that same problem, I will also explain how to use Python for doing hierarchical clustering.
(Refer Slide Time: 00:53)
So the agenda for this lecture is, first, the agglomerative hierarchical algorithm and, second, a Python demo.
(Refer Slide Time: 00:57)
1134
So, the example of hierarchical agglomerative clustering, HAC: a data set consists of 7 objects for which 2 variables were measured. There are 7 objects, 1 to 7, with variable 1 and variable 2. Variable 1 takes the values 2, 5.5, 5, 1.5, 1, 7, 5.75 and variable 2 takes the values 2, 4, 5, 2.5, 1, 5, 6.5.
(Refer Slide Time: 01:24)
When you plot the scatterplot, you can see the 7 data points, with variable 1 on the x axis and variable 2 on the y axis. Now, we are going to do hierarchical clustering for this data set.
(Refer Slide Time: 01:37)
1135
What is the first step? Calculate the Euclidean distances and create a distance matrix. So, first we are going to create a distance matrix; for that we find the distance between 1 and 2, 1 and 3, 1 and 4, 1 and 5, 1 and 6, 1 and 7, then 2 and 3, 2 and 4, 2 and 5, 2 and 6 and 2 and 7, then 3 and 4, 3 and 5, 3 and 6 and 3 and 7, then 4 and 5, 4 and 6 and 4 and 7, then 5 and 6 and 5 and 7, and finally 6 and 7.
First we will find the distance between objects 1 and 2; object 1 is (2, 2) and object 2 is (5.5, 4). For the Euclidean distance we know the formula is d = sqrt((x2 - x1)^2 + (y2 - y1)^2), so sqrt((5.5 - 2)^2 + (4 - 2)^2), which is 4.03. Then let us find the distance between 1 and 3: the position of 1 is (2, 2) and the position of 3 is (5, 5), so it is sqrt((5 - 2)^2 + (5 - 2)^2) = 4.24. Now let us find the distance between 1 and 4, that is (2, 2) and (1.5, 2.5), so the distance is sqrt((1.5 - 2)^2 + (2.5 - 2)^2), that is 0.71.
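Instead of working out every pair by hand, the same distance matrix can be reproduced in a few lines. The sketch below is only illustrative; it assumes the 7 coordinate pairs read off the lecture slide and uses scipy's pdist/squareform utilities.

```python
# A minimal sketch: Euclidean distance matrix for the 7 objects of the worked example.
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.array([
    [2.0, 2.0],    # object 1
    [5.5, 4.0],    # object 2
    [5.0, 5.0],    # object 3
    [1.5, 2.5],    # object 4
    [1.0, 1.0],    # object 5
    [7.0, 5.0],    # object 6
    [5.75, 6.5],   # object 7
])

# pdist gives the pairwise Euclidean distances; squareform turns them into a 7 x 7 matrix
dist_matrix = squareform(pdist(data, metric="euclidean"))
print(np.round(dist_matrix, 2))   # e.g. d(1,2) = 4.03, d(1,4) = 0.71, d(2,3) = 1.12
```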
(Refer Slide Time: 03:35)
1136
Now, we will find the distance between 1 and 5: 1 is (2, 2) and the position of the 5th object is (1, 1), so it is sqrt((1 - 2)^2 + (1 - 2)^2) = 1.41. Next, the distance between 1 and 6, that is (2, 2) and (7, 5): sqrt((7 - 2)^2 + (5 - 2)^2) = 5.83. Similarly, we can find the distance between 1 and 7, that is (2, 2) versus (5.75, 6.5).
(Refer Slide Time: 04:30)
So, it is sqrt((5.75 - 2)^2 + (6.5 - 2)^2), and we get 5.86. Now we have compared object 1 with all the other points; next we go to 2 versus 3. The position of 2 is (5.5, 4) and 3 is (5, 5), so the distance is sqrt((5 - 5.5)^2 + (5 - 4)^2) = 1.12. Now, 2 and 4: (5.5, 4) is the position of object 2 and the position of object 4 is (1.5, 2.5), so the distance is sqrt((1.5 - 5.5)^2 + (2.5 - 4)^2), that is 4.27.
Next we will find the distance between 2 and 5; the position of 2 is (5.5, 4) and the position of object 5 is (1, 1), so the distance is sqrt((1 - 5.5)^2 + (1 - 4)^2), that is 5.41. Now we can find the distance between 2 and 6; the position of object 6 is (7, 5), so the distance is sqrt((7 - 5.5)^2 + (5 - 4)^2), that is 1.80.
(Refer Slide Time: 06:04)
Next, we are going to find the distance between 2 and 7. The position of object 7 is (5.75, 6.5), so the distance is sqrt((5.75 - 5.5)^2 + (6.5 - 4)^2), that is 2.51. Object 2 has now been compared with all the remaining objects; next we compare 3 with 4, 5, 6 and 7.
For 3 and 4, the position of point 3 is (5, 5) and 4 is (1.5, 2.5), so the distance is sqrt((1.5 - 5)^2 + (2.5 - 5)^2), that is 4.30. Next, 3 and 5: (5, 5) and (1, 1), so we get sqrt((1 - 5)^2 + (1 - 5)^2) = 5.66. Now, 3 and 6: 3 is (5, 5) and 6 is (7, 5), so the distance is sqrt((7 - 5)^2 + (5 - 5)^2) = 2.
(Refer Slide Time: 07:49)
1138
Now, we are going to find the next one, 3 and 7. We know the position of 3 is (5, 5) and 7 is (5.75, 6.50), so the distance is sqrt((5.75 - 5)^2 + (6.5 - 5)^2), that is 1.68. Now we have compared 3 with all the points; next we take 4 and 5, 4 and 6 and 4 and 7. The position of object 4 is (1.5, 2.5) and the 5th one is (1, 1); when you work out the distance, it is 1.58. Then we go for 4 and 6: the 4th is (1.5, 2.5) and the 6th position is (7, 5), so the distance is sqrt((7 - 1.5)^2 + (5 - 2.5)^2) = 6.04. Now we will find 4 versus 7: the position of 4 is (1.5, 2.5) and the position of object 7 is (5.75, 6.50), so it is this difference squared plus that difference squared.
(Refer Slide Time: 09:25)
1139
So, sqrt((5.75 - 1.5)^2 + (6.5 - 2.5)^2), that is 5.84. Object 4 is now done; next we go to 5 and 6 and 5 and 7. The distance between 5 and 6: (1, 1) is the position of object 5 and the 6th one is (7, 5), so it is sqrt((7 - 1)^2 + (5 - 1)^2), that is 7.21. Now the distance between 5 and 7, that is (1, 1) versus (5.75, 6.5): sqrt((5.75 - 1)^2 + (6.5 - 1)^2) = 7.27.
Finally, for 6 versus 7, the position of object 6 is (7, 5) and the position of object 7 is (5.75, 6.5), so the distance is sqrt((5.75 - 7)^2 + (6.5 - 5)^2), that is 1.95.
(Refer Slide Time: 10:24)
1140
Now, we have computed all the pairwise distances; as I told you, this is the distance matrix. See that the diagonal is 0, because the distance of an object from itself is 0. We have the distances between 2 and 1, 3 and 1, 4 and 1, 5 and 1, 6 and 1, 7 and 1 and so on, the values we got from the previous slides, so we have the distance matrix.
(Refer Slide Time: 10:47)
After getting the distance matrix, for the hierarchical agglomerative method, select the minimum element to build the first cluster. So, among these entries, find out where the minimum distance is; the minimum distance is this one, 0.71, between objects 4 and 1. So what are we going to do? We are going to form a cluster containing 2 elements, 4 and 1.
(Refer Slide Time: 11:23)
1141
So, the 4 and 1 will form a cluster, so we got this one, 4 and 1, this is our first cluster.
(Refer Slide Time: 11:29)
We are going to find the distance between the cluster {1, 4} and object 2. What are we going to do? We find the distance between 1 and 2 and between 4 and 2 and, whichever is minimum, we take that distance, because the cluster already contains 2 objects and in single linkage we consider the minimum distance. So, the distance between 1 and 2 is 4.0 and between 4 and 2 it is 4.3; these values we take from the table.
So, the minimum distance is 4. The next one is {1, 4} versus 3, similarly {1, 4} versus 5, {1, 4} versus 6 and {1, 4} versus 7. If you want to know this cluster versus 3, take the minimum of the distances 1-3 and 4-3; 1-3 is 4.2 and 4-3 is 4.3, so the minimum is 4.2. We cannot take the 4th object because it has already gone into the cluster, so we go to the 5th: for {1, 4} versus the 5th object, we find the minimum of the distances 1-5 and 4-5.
So, 1-5 is 1.4 and 4-5 is 1.6, so the minimum is 1.4. Then {1, 4} with 6: we find the minimum of 1-6 and 4-6; 1-6 is 5.8 and 4-6 is 6.0, so the minimum is 5.8. Why are we taking the minimum? Because we have formed one cluster and that cluster contains 2 objects, 1 and 4. From 1 and 4, how far away is object 7?
So, 2 distances have to be checked: the distance between 1 and 7 and the distance between 4 and 7, and whichever is minimum is kept. The distance 1-7 is 5.9 and 4-7 is 5.8, so the minimum value is 5.8.
(Refer Slide Time: 14:26)
Now, we are going to update the matrix. What update have we done? Since 1 and 4 form a new cluster, we fill in this value; how did we get it? The distance between 2 and {1, 4} is 4, so we updated it. This updated distance matrix is going to be used for the further steps.
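The update rule being applied here can be written very compactly. The sketch below is only an illustration of the single-linkage rule using a few of the distances computed above; the helper name and the dictionary layout are mine, not from the lecture.

```python
# Single-linkage update: the distance from the merged cluster {1, 4} to any other
# object (or cluster) is the minimum pairwise distance between their members.
def single_linkage_distance(dist, cluster_a, cluster_b):
    """Minimum pairwise distance between two clusters given as lists of object labels."""
    return min(dist[(min(i, j), max(i, j))] for i in cluster_a for j in cluster_b)

# a few of the pairwise distances computed earlier, keyed by (smaller, larger) label
dist = {(1, 2): 4.03, (2, 4): 4.27, (1, 3): 4.24, (3, 4): 4.30,
        (1, 5): 1.41, (4, 5): 1.58}

print(single_linkage_distance(dist, [1, 4], [2]))  # 4.03 -> shown as 4 on the slide
print(single_linkage_distance(dist, [1, 4], [3]))  # 4.24 -> the 4.2 in the text
print(single_linkage_distance(dist, [1, 4], [5]))  # 1.41 -> the 1.4 used in the next step
```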
(Refer Slide Time: 14:50)
1143
So, now in the updated matrix, again find out which entry is minimum; this 1.1 is the minimum, between 3 and 2. So, now 3 and 2 are going to form one cluster.
(Refer Slide Time: 15:06)
Now, since 2 and 3 have formed a cluster and there is already one cluster {1, 4}, we find the distance between 2 and {1, 4} and between 3 and {1, 4}. The distance between 2 and {1, 4} is 4, because 1 and 4 formed a cluster, and the distance between 3 and {1, 4} is 4.2; the minimum of these is 4. Similarly, for the distance between this newly formed cluster {2, 3} and the 5th point (we do not go to the 4th one because 4 has already joined a cluster), we take the minimum of the distances 2-5 and 3-5.
The distance 2-5 is 5.4 and 3-5 is 5.7; between 5.4 and 5.7 the minimum is 5.4. Now, {2, 3} versus 6: we take the minimum of the distances 2-6 and 3-6; 2-6 is 1.8 and 3-6 is 2, so the minimum is 1.8. The last point is {2, 3} versus 7: we take the minimum of the distances 2-7 and 3-7; 2-7 is 2.5 and 3-7 is 1.7, so out of these the minimum is 1.7. Now we are going to update the distance matrix with these new values.
(Refer Slide Time: 17:03)
So, this is the updated distance matrix: you see that 2 and 3 have formed one cluster, and previously 1 and 4 had formed a cluster. In the next slides, what we are going to do is find the lowest value in this new updated distance matrix.
(Refer Slide Time: 17:17)
1145
So, select the minimum element to build the next cluster formation, in that the minimum point is
this 1.4, so what is going to happen; this object 5 is going to join with this cluster 1, 4.
(Refer Slide Time: 17:35)
1146
Now, what are we going to do? You see that this is not a completely new cluster; {1, 4} was already there and object 5 has been added to it. So we need the distance between this cluster {1, 4, 5} and the other cluster, which contains the 2 objects 2 and 3. What we take is the minimum of the distance between {1, 4} and {2, 3} and the distance between 5 and {2, 3}. The distance between {1, 4} and {2, 3} is 4, a value we already have.
The distance between 5 and {2, 3} is 5.4, which we also obtained earlier, so the minimum is 4. Similarly, for the minimum distance between {1, 4, 5} and 6, we take {1, 4} versus 6 and 5 versus 6: {1, 4} versus 6 is 5.8 and 5 versus 6 is 7.2, so the minimum is 5.8. Now there is also the 7th object; let us see how far away object 7 is. We take the minimum of the distance between {1, 4} and 7 and the distance between 5 and 7: {1, 4} versus 7 is 5.8 and 5-7 is 7.3, so the minimum is 5.8.
(Refer Slide Time: 19:27)
1147
Again, we will update our distance matrix; what has happened now is that 5 has entered this cluster, because 1 and 4 were already there, so this is our updated distance matrix.
(Refer Slide Time: 19:43)
The next step: in the updated distance matrix, look at where the minimum distance is; 1.7 is the minimum distance. So, object 7 is going to be added to the cluster {2, 3}, and then we will update the distances.
(Refer Slide Time: 20:05)
1148
What happens? See that this 7 joins that cluster.
(Refer Slide Time: 20:09)
Now, we already have the cluster {2, 3} and 7 has also joined it, so we need the distance between this cluster {2, 3, 7} and the other cluster {1, 4, 5}. What we take is the minimum of the distance between {2, 3} and {1, 4, 5} and the distance between 7 and {1, 4, 5}. The distance between {2, 3} and {1, 4, 5} is 4, which comes from the previous matrix, and the distance between 7 and {1, 4, 5} is 5.8, which also comes from there; out of these the minimum is 4.
Now, this newly formed cluster versus 6: here we take {2, 3} versus 6 and 7 versus 6. The distance {2, 3} versus 6 is 1.8 and the distance between 7 and 6 is 2, so the minimum is 1.8. We cannot go to the next one because 7 has already gone into that cluster.
(Refer Slide Time: 21:38)
Now, again we update the matrix; after updating, you see that we have formed 2 clusters, {1, 4, 5} is one group and {2, 3, 7} is another group. This is the updated distance matrix: this value 4 we got from here, and this 1.8 we got from here.
(Refer Slide Time: 22:01)
Now, in the next stage, select the minimum element to build the next cluster; after this, the minimum value is 1.8, which is the minimum distance between 6 and the cluster {2, 3, 7}. So, what is going to happen now is that object 6 joins the cluster {2, 3, 7}.
1150
(Refer Slide Time: 22:24)
Now, recalculate the distances to update the distance matrix. We need the distance between {2, 3, 7, 6} and {1, 4, 5}: the distance between {2, 3, 7} and {1, 4, 5} is 4, which we bring forward, and the distance between 6 and {1, 4, 5} is 5.8, so the minimum distance is 4.
(Refer Slide Time: 23:04)
1151
This is our updated matrix; now the minimum value in it is 4, so the cluster {2, 3, 7, 6} joins with {1, 4, 5}.
(Refer Slide Time: 23:17)
1152
Now, let us see how to do this agglomerative hierarchical clustering method with the help of Python. So, we have imported the libraries: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt, import scipy, from scipy.cluster.hierarchy import fcluster, from scipy.cluster.hierarchy import cophenet, from scipy.spatial.distance import pdist.
(Refer Slide Time: 24:07)
So, I have the data in the file hierarchical clustering; this was our data set, and for this data set we have plotted the 2 dimensional picture in which all the objects are displayed.
(Refer Slide Time: 24:14)
1153
Now, when you run this command, that is from scipy.cluster.hierarchy import dendrogram, linkage, you will get this kind of picture. What is happening is that 1 and 4 join together at this stage, and after that 5 joins 1 and 4; initially 2 and 3 are joined, then 7 joins 2 and 3, and after some time 6 also joins the cluster 2, 3, 7.
At the end, see that the blue line shows all the objects forming one cluster; this picture is the dendrogram. For that, from scipy.cluster.hierarchy import dendrogram, linkage; then linked = linkage(data, 'single'), since we are going to use single linkage, the labels are the range 1 to 8, and that is the figure size. So, when you run that, you get the dendrogram.
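A compact version of what is run at this point might look like the sketch below; the data file name, column layout and figure size are assumptions on my part, but the linkage and dendrogram calls are the standard scipy ones the lecture refers to.

```python
# A minimal sketch of the single-linkage dendrogram step described above.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

data = pd.read_csv("hierarchical_clustering.csv")   # two columns of coordinates (assumed name)

# build the single-linkage merge tree from the raw observations
linked = linkage(data.values, method="single", metric="euclidean")

plt.figure(figsize=(10, 7))
dendrogram(linked, labels=list(range(1, 8)))        # label the 7 objects 1..7
plt.title("Single-linkage dendrogram")
plt.show()
```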
(Refer Slide Time: 25:17)
1154
Now, here we have set k equal to 2: from sklearn.cluster import AgglomerativeClustering, and when you put k equal to 2 with Euclidean distance and single linkage, the clusters are named in 2 categories, 0 is one group and 1 is another group; so these are the labels, 1 and 0.
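The corresponding scikit-learn call might look like the following sketch; the variable names are mine, and I rely on the Euclidean default rather than passing the distance explicitly, since the keyword for it differs across sklearn versions.

```python
# A hedged sketch: cut the hierarchy into k = 2 clusters with single linkage.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# the 7 objects of the worked example
X = np.array([[2, 2], [5.5, 4], [5, 5], [1.5, 2.5], [1, 1], [7, 5], [5.75, 6.5]])

model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)   # one 0/1 label per object, separating {1, 4, 5} from {2, 3, 6, 7}
```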
(Refer Slide Time: 25:46)
So, when you run this, you see that there are 2 clusters; this forms one cluster and this forms another cluster. Suppose you write k equal to 3 here, you may get 3 clusters. Now, I am going to the Python demo for doing this agglomerative hierarchical clustering.
(Refer Slide Time: 26:11)
1155
Now, I am going to show how to do agglomerative hierarchical clustering in Python, so import the necessary libraries; I am running this, and I have stored the data in the file named hierarchical_clustering.
(Refer Slide Time: 26:26)
So, this is our data, for this I am going to do this hierarchical clustering.
(Refer Slide Time: 26:33)
1156
So, first I am going to show the scatterplot, in which all the objects appear; looking at the objects themselves, you can see that going for 2 clusters would be good.
(Refer Slide Time: 26:46)
So, what we are going to do; we are going to do the single linkage, here I am going to show the
figure size here also, so this picture you know we are going to get the dendrogram.
(Refer Slide Time: 26:56)
1157
So, this is the dendrogram: 2 and 3 form one cluster, then 7 joins it, then 6 joins it. On the other side, at the first level 1 and 4 join, then 5 joins them, and at the end all of the objects end up in the same cluster.
(Refer Slide Time: 27:12)
Now, from sklearn.cluster import AgglomerativeClustering; suppose we start with k equal to 2 and run this. Let us see how the hierarchical clustering comes out: there are 2 clusters, one is named 1 and the other is named 0.
(Refer Slide Time: 27:38)
1158
So, let us see what are the labels so, this is labels; 1, 0, 0, 1, 1, 0, 0.
(Refer Slide Time: 27:48)
So, now if you run this, what you get is 2 clusters: the red coloured points form one cluster and the purple points form another cluster. Suppose we go for k equal to 3; let us see what kind of answer we get.
(Refer Slide Time: 28:04)
1159
Suppose we go for k equal to 3 and run it again; you see that the labels are now 0, 1, 2.
(Refer Slide Time: 28:18)
Now, if you run this, you see that there are 3 clusters, green, purple and red, but the red cluster contains only one element. So, the optimal number of clusters for this kind of data set is k equal to 2; that is the purpose of visualising how the clusters are formed and also their quality. In this lecture, I started the agglomerative hierarchical algorithm with the help of a numerical example.
1160
In that example, I first found the distance matrix; after finding the distance matrix, I formed a cluster wherever the minimum value was, connecting those 2 objects, then I updated the distance matrix, went again to where the minimum entry was, and clubbed those objects together. At the end, for the same data set, I explained how to do the Python programming.
I also showed how the result appears; here I initially started with k equal to 2 clusters, then changed to k equal to 3, but with k equal to 3 I got a result that does not look good, so I kept k equal to 2 as the right, that is the optimal, number of clusters.
1161
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 57
Classification and Regression Trees (CART) - I
In our previous lectures, we studied different clustering techniques; in this class we will start a new topic, classification and regression trees, shortly called CART models.
(Refer Slide Time: 00:41)
The agenda for this lecture is an introduction to classification and regression trees and to attribute selection measures. There are different measures for selecting attributes (attributes means variables); we will study these different attribute selection measures in this class.
(Refer Slide Time: 00:55)
1162
An introduction to this topic, the CART model: classification is one form of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification predicts categorical, that is discrete, unordered, labels, whereas regression analysis is a statistical methodology that is most often used for numeric prediction. There is a difference between classification techniques and regression.
In classification techniques the dependent variable is most of the time a categorical variable, but in regression analysis the dependent variable is most of the time a continuous variable. For example, we can build a classification model to categorise bank loan applications as either safe or risky; you see, this is categorical. A regression model could be used to predict the expenditure, in dollars, of potential customers on computer equipment, given their income and occupation. Most of the time regression analysis is used to predict a continuous variable, while classification analysis is used to predict a categorical variable.
(Refer Slide Time: 02:16)
1163
We are going to take one problem; this problem is taken from the book by Han, Kamber and Pei, whose title is Data Mining: Concepts and Techniques. The problem has 5 columns: age, income, student, credit rating, and the dependent variable buys_computer; this is a database, and one portion of the database is shown.
So, the dependent variable is buys computer and there are 4 independent variables: age, income, student and credit rating. By taking this example, we are going to explain how to use the CART model, and in coming lectures also we will use this data.
(Refer Slide Time: 03:04)
1164
Now, let us understand certain terminology in the CART model, for example root node, internal node and leaf node. When you look at this picture, there is age, and age has 3 levels, youth, middle aged, senior; then there is student with 2 levels, no or yes; and credit rating with 2 levels, fair and excellent. So, a decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.
The decision tree consists of a root node, internal nodes and leaf nodes. Look at the first one, age: because the whole problem starts with the variable age, this age is called the root node or parent node, shown in the rectangular box. The topmost node in a tree is called the root node or parent node, and it represents the entire sample population. The next term is internal node; for example, here student is an internal node or non-leaf node, which denotes a test on an attribute, and each branch represents an outcome of the test. The next node is the leaf node or child node; see this yes or no in the elliptical shape, that is called a leaf node, and it cannot be split further.
(Refer Slide Time: 04:46)
A decision tree for the concept buys computer indicates whether a customer at AllElectronics, the database from which this example was taken, is likely to purchase a computer. Each internal node represents a test on an attribute and each leaf node represents a class; a class means whether the person buys the computer, yes or no, and this yes or no is nothing but the last column, which is what goes into the leaf (child) nodes.
(Refer Slide Time: 05:21)
Now, CART comes under supervised learning techniques. We know that machine learning techniques are classified into 2 categories, one is supervised and the other is unsupervised. The meaning of supervised learning is that there is a label; label in the sense that we know in advance which is going to be the independent variable and which is going to be the dependent variable. In this problem also, it is supervised learning, because we know in advance that buys_computer is going to be our dependent variable. CART adopts a greedy, that is non-backtracking, approach in which decision trees are constructed in a top down, recursive, divide and conquer manner, and it is a very interpretable model: a person without any statistical background can also easily interpret the CART model.
(Refer Slide Time: 06:14)
1166
Now, I will explain the decision tree algorithm. What are the inputs? The data partition D, the whole data set, which is a set of training tuples and their associated class labels (the last column holds the class labels). The attribute list is the set of candidate attributes; for example, in this problem the attributes are age, income, student and credit rating. Before starting the problem, out of these 4 variables we have to decide from which variable to start the classification.
For that we need an attribute selection method: a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion consists of a splitting attribute and possibly either a split point or a splitting subset. The output of this algorithm is the decision tree.
(Refer Slide Time: 07:16)
1167
Decision tree algorithm: the algorithm is called with 3 parameters, D, the data set, the attribute list (the independent variables) and the attribute selection method. D is defined as a data partition; initially it is the complete set of training tuples and their associated class labels. In our problem this whole table represents D, and initially it covers all the independent variables and the dependent variable.
The parameter attribute list is the list of attributes, or independent variables, describing the tuples. The attribute selection method specifies a heuristic procedure for selecting the attribute that best discriminates the given tuples according to class.
(Refer Slide Time: 08:12)
1168
This procedure employs an attribute selection measure such as information gain, which is one method for selecting the attribute; the second method is called gain ratio and the third is called the Gini index. Whether the tree is strictly binary is generally driven by the attribute selection measure: sometimes we go for a binary split on a variable, and sometimes we can go for more than 2 branches.
For example, if you use the Gini method, which we will cover in a coming class, you always go for a binary split: some attribute selection measures, such as the Gini index, enforce the resulting tree to be binary, while others, like information gain, do not, thereby allowing multiway splits. So, if you follow the Gini index there is only a binary split; if you follow something other than the Gini index, for example information gain, you can have more than 2 splits as well.
(Refer Slide Time: 09:11)
1169
Decision tree method: I am going to explain the different steps of the decision tree method; there are 15 steps, and in the coming slides I will explain each of them. First, an overview of all 15 steps. Create a node N; if the tuples in D are all of the same class C, then return N as a leaf node labelled with the class C; if the attribute list is empty, then return N as a leaf node labelled with the majority class in D, using the concept called majority voting.
Then apply the attribute selection method to D to find the best splitting criterion, and label node N with the splitting criterion. Step 8 says: if the splitting attribute is discrete valued and multiway splits are allowed, then remove the splitting attribute from the attribute list, since it has already been used.
For each outcome j of the splitting criterion, partition the tuples and grow a subtree for each partition. Let Dj be the set of data tuples in D satisfying outcome j; if Dj is empty, then attach a leaf labelled with the majority class in D to node N, else attach the node returned by recursively calling generate decision tree on Dj to node N, and finally return N. I am going to explain each of these steps in detail in the coming slides; a minimal Python sketch of this recursion is given below.
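To make the recursion concrete, here is a small, self-contained sketch of the procedure just outlined. It is not the exact pseudocode from the book: the function and variable names are mine, and the attribute selection step is passed in as a parameter so that information gain, gain ratio or the Gini index could be plugged in.

```python
# A simplified sketch of the recursive decision-tree induction described above.
# D is a list of (attribute_dict, class_label) pairs; attribute_list is a list of
# attribute names; select_attribute is any measure that returns the best
# splitting attribute for the current partition.
from collections import Counter

def majority_class(D):
    """Most common class label in the partition (majority voting)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_tree(D, attribute_list, select_attribute):
    labels = [label for _, label in D]
    # Steps 2-3: all tuples belong to the same class -> leaf labelled with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Steps 4-5: no attributes left -> leaf labelled with the majority class
    if not attribute_list:
        return majority_class(D)
    # Step 6: choose the best splitting attribute with the given measure
    split_attr = select_attribute(D, attribute_list)
    remaining = [a for a in attribute_list if a != split_attr]  # step 8
    tree = {split_attr: {}}
    # Steps 9-14: grow one subtree per outcome of the splitting attribute
    for value in set(row[split_attr] for row, _ in D):
        Dj = [(row, label) for row, label in D if row[split_attr] == value]
        if not Dj:
            tree[split_attr][value] = majority_class(D)
        else:
            tree[split_attr][value] = generate_tree(Dj, remaining, select_attribute)
    return tree  # step 15
```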
(Refer Slide Time: 11:05)
1170
The tree starts as a single node N representing the training tuples in D; that was step 1. If the tuples in D are all of the same class, then N becomes a leaf and is labelled with that class; that is steps 2 and 3. The meaning is: suppose age is the chosen variable and there are 3 splits, youth, middle-aged and senior. If, for example, all the middle-aged tuples have answered yes on the dependent variable, we need not go for further classification on that branch; the age attribute is then dropped, and we continue with the remaining attributes like student, credit rating and income. Steps 4 and 5 are terminating conditions: if the attribute list is empty, the node is labelled with the majority class; otherwise, the algorithm calls the attribute selection method to determine the splitting criterion.
If there are remaining attributes, then to choose among them you use the attribute selection method to determine the splitting criterion. The splitting criterion, for example Gini, tells us which attribute to test at node N by determining the best way to separate or partition the tuples in D into individual classes.
(Refer Slide Time: 12:44)
1171
The splitting criterion indicates the splitting attribute and may also indicate either a split point or a splitting subset; I will explain the meaning of split point and splitting subset in the next slides. The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as pure as possible; a partition is pure if all of the tuples in it belong to the same class. The node N is labelled with the splitting criterion, which serves as a test at the node; that is step 7. A branch is grown from node N for each of the outcomes of the splitting criterion and the tuples in D are partitioned accordingly; that is step 11.
(Refer Slide Time: 13:30)
So, there are 3 possibilities for partitioning tuples based on the splitting criterion, as illustrated in figures a, b and c. Let A be the splitting attribute; A has v distinct values a1, a2, ..., av based on the training data. If A is discrete valued, as in figure a, then one branch is grown for each known value of A. For example, colour may be red, green, blue, purple or orange; if it is income, there may be 3 branches: low, medium, high.
(Refer Slide Time: 14:13)
If A is continuous valued, as in figure b, then 2 branches are grown, corresponding to A less than or equal to the split point and A greater than the split point. So, A less than or equal to the split point is one branch and A greater than the split point is the other branch, where the split point is the value returned by the attribute selection method as part of the splitting criterion.
For example, if income is the attribute, we can group it into 2 categories: those whose incomes are below 42,000 and those whose incomes are above 42,000; this 42,000 is generally nothing but the average value.
(Refer Slide Time: 14:54)
1173
If A is discrete valued and a binary tree must be produced, then the test is of the form 'does A belong to SA?', where SA is the splitting subset of A; if the answer is yes, that is one branch, and if it is no, that is the other branch. For example, if A is colour, the test might be whether the colour is red or green, and the outcome again is yes or no.
(Refer Slide Time: 15:18)
Then we come to the termination conditions. The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition Dj of D. What does recursion mean here? Once one attribute is dealt with, the process is repeated for the second attribute, the third attribute and so on, until all the attributes are exhausted. The recursive partitioning stops only when any one of the following terminating conditions is true. The first condition is that all of the tuples in the partition D represented at node N belong to the same class; that was steps 2 and 3.
(Refer Slide Time: 15:58)
Or, there are no remaining attributes on which the tuples may be further partitioned; that is step 4. In this case majority voting is employed, which involves converting the node into a leaf and labelling it with the most common class in D; alternatively, the class distribution of the node's tuples may be stored. The third condition is that there are no tuples for a given branch, that is, the partition Dj is empty, which is handled in step 12; in this case a leaf is created with the majority class in D, that is step 13. The resulting decision tree is returned in step 15.
(Refer Slide Time: 16:42)
1175
Now, the second part of this lecture is the different attribute selection measures. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition D of class labelled training tuples into individual classes. The attribute selection measures provide a ranking for each attribute describing the given training tuples, and the attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
If the splitting attribute is continuous valued, or if we are restricted to binary trees, then respectively either a split point or a splitting subset must also be determined as part of the splitting criterion. There are 3 popular attribute selection measures: information gain, gain ratio and the Gini index. In this class I am going to explain the theory behind these 3 measures; in coming classes, using the same example which I have discussed, I am going to find out the value of the information gain.
I am going to explain the selection procedure using the criteria information gain, gain ratio and Gini index; so, in this lecture we are going to see the theoretical side of all these 3 selection methods. The CART algorithm uses the Gini index measure for attribute selection.
(Refer Slide Time: 18:30)
For the attribute selection measures, let us first fix some notation. Let D, the data partition, be a training set of class labelled tuples, for example this dataset. Suppose the class label attribute has m distinct values; here m = 2, because yes is one category and no is another, and these define the distinct classes Ci, i = 1, ..., m: class C1 may be yes and class C2 may be no.
Let Ci,D be the set of tuples of class Ci in D. What does Ci,D mean? For example, take the level high of the income variable: how many no answers are there for high income? There are 2, and that count of class members is what |Ci,D| gives. The notation |D| represents the total number of tuples, which is 14 here, while |Ci,D| represents the number of tuples of a given class; for example, for the income level low, how many yes answers are there? 1, 2, 3, so that |Ci,D| is 3. These values I will use in the coming lectures with an example.
(Refer Slide Time: 20:11)
Then, we will go to the first criterion for selecting attributes, information gain. This measure is based on the value, or information content, of messages. The attribute with the highest information gain is chosen as the splitting attribute for the node; this attribute minimises the information needed to classify the tuples in the resulting partitions and reflects the least randomness, or impurity, in these partitions.
(Refer Slide Time: 20:49)
1177
So, this approach minimises the expected number of tests needed to classify a given tuple. The information gain measure is nothing but entropy: the expected information needed to classify a tuple in D is given by Info(D) = - Σ (i = 1 to m) pi log2(pi), where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.
For example, for this dataset, what is pi? For the class yes, count the yes answers: 1, 2, ..., 9, so it is 9, and that 9 is nothing but |Ci,D|, while |D| is 14 in total; that is for level 1. For level 2, how many no answers are there? 5. So Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits; this is the meaning of Info(D). A log function to the base 2 is used because the information is encoded in bits.
So, Info(D), or entropy, is just the average amount of information needed to identify the class label of a tuple in D. Generally, the smaller the value of entropy, the less information we need to identify the class label of a tuple in D; so the attribute that leaves less entropy will be chosen for classification.
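As a quick check of the 0.940 figure, the entropy can be computed directly; the sketch below is just that arithmetic, with only the class counts (9 yes, 5 no) taken from the lecture's table, and the helper name is mine.

```python
# Entropy (Info(D)) of the buys_computer column: 9 "yes" and 5 "no" out of 14 tuples.
from math import log2

def entropy(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94 bits
```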
(Refer Slide Time: 22:44)
1178
It is quite likely that the partitions will be impure, where a partition may contain a collection of tuples from different classes rather than from a single class; so how much more information would we still need, after the partitioning, in order to arrive at an exact classification? This amount is measured by Info_A(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj), summing over all the splits, where Info(Dj) is again the entropy.
The term |Dj| / |D| acts as the weight of the jth partition, and Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
(Refer Slide Time: 23:51)
1179
So, to the information gain: the smaller the expected information still required, the greater the purity of the partitions. Information gain is defined as the difference between the original information requirement, based on just the proportion of classes, and the new requirement obtained after partitioning on A: Gain(A) = Info(D) - Info_A(D). I will use this in my coming classes and, with the help of a numerical example, explain how to find Gain(A).
The attribute A with the highest information gain is chosen as the splitting attribute at node N; you see that the entropy after the split should be smaller, so the information gain should be higher, for an attribute to be chosen.
(Refer Slide Time: 24:46)
Next we will go to the next concept, the Gini index. The Gini index is used to measure the impurity of D, the data partition or set of training tuples. The formula is Gini(D) = 1 - Σ (i = 1 to m) pi^2, where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. For example, let us find the Gini index for this dataset.
How many yes answers are there? 9, so we take (9/14) squared; how many no answers? 5, so (5/14) squared; thus Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459. The sum is computed over the m classes, here m = 1 and m = 2. The Gini index considers a binary split for each attribute, so we get only binary splits if we use the Gini index.
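The same arithmetic in code, again only as a sketch that assumes nothing beyond the 9/5 class counts from the table:

```python
# Gini index of D: 1 - sum(p_i^2) over the class proportions (9 yes, 5 no out of 14).
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # 0.459
```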
(Refer Slide Time: 26:02)
When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into 2 parts, D1 and D2, then the Gini index of D given that partitioning is Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2). In my coming lectures I will use this formula as well, with the help of a numerical example, so you can get a very clear understanding of how it is computed.
For each attribute, each of the possible binary splits is considered; for a discrete valued attribute, the subset that gives the minimum Gini index (you have to remember this: the minimum) for that attribute is selected as its splitting subset.
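A small sketch of that weighted form follows; the split counts used in the example line are purely illustrative, not a particular split taken from the lecture's table.

```python
# Gini index of a binary split: weighted sum of the Gini of the two partitions,
# Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2).
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(counts_d1, counts_d2):
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

# illustrative (yes, no) counts for the two sides of a hypothetical binary split
print(round(gini_split([6, 1], [3, 4]), 3))
```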
(Refer Slide Time: 27:02)
1181
For continuous valued attributes, each possible split point must be considered; the strategy is similar, where the midpoint between each pair of adjacent values is taken as a possible split point. If there is a continuous variable, the data are sorted and the midpoints are taken as the candidate splitting criteria. For a possible split point of A, D1 is the set of tuples in D satisfying A less than or equal to the split point.
And D2 is the set of tuples in D satisfying A greater than the split point. The reduction in impurity that would be incurred by a binary split on a discrete or continuous valued attribute A is ΔGini(A) = Gini(D) - Gini_A(D), where Gini(D) is for the class variable and Gini_A(D) is for the particular attribute. The attribute that maximises the reduction in impurity, equivalently the one having the minimum Gini index, is selected as the splitting attribute; this reduction is maximum only when the Gini index of the split is minimum.
(Refer Slide Time: 28:16)
1182
So, we have seen 2 methods, information gain and the Gini index; there is one more method, called the gain ratio, that I will take up in my coming classes. So, how to choose which attribute selection method to use? All measures have some bias; for example, this technique, information gain, also has some bias, which I will explain in a coming class. The time complexity of decision tree induction generally increases exponentially with the tree height.
Hence, measures that tend to produce shallower trees, that is with multiway rather than binary splits, and that favour more balanced splits may be preferred; that is why most of the time the Gini index is chosen, because it gives a balanced split. However, some studies have found that shallow trees tend to have a large number of leaves and higher error rates, and several comparative studies suggest that no one attribute selection measure has been found to be significantly superior to the others.
(Refer Slide Time: 29:24)
1183
Next, we will go to the concept called tree pruning. When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning uses statistical measures to remove the least reliable branches; pruned trees tend to be smaller and less complex and thus easier to comprehend, and they are usually faster and better at correctly classifying independent test data than unpruned trees.
(Refer Slide Time: 30:01)
How does tree pruning work? There are 2 common approaches to tree pruning: one is pre-pruning, the other is post-pruning. In the pre-pruning approach, the tree is pruned by halting its construction early, that is, by deciding not to further split or partition the subset of training tuples at a given node; when constructing the tree, measures such as statistical significance, information gain or the Gini index can be used to assess the goodness of a split.
(Refer Slide Time: 30:34)
Now, let us talk about post-pruning: the post-pruning approach removes subtrees from a fully grown tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf, and the leaf is labelled with the most frequent class among the subtree being replaced.
(Refer Slide Time: 30:55)
Look at the picture given in the next slide and assume that we are going to remove this portion of the tree. What happens? If you remove this portion, class B is the class occurring most frequently in it, so we replace the subtree with a leaf node, and in that leaf node class B is retained. For example, the subtree at node A3 in the unpruned tree is shown in figure 1.2.
The most common class within the subtree is class B: look at this, there is class B, again class B and one class A, so class A appears once and class B appears twice, and the most common class within this subtree is class B. In the pruned version of the tree, the subtree in question is pruned by replacing it with the leaf class B; you see that this is the pruned version, in which we have retained class B. This figure explains the unpruned decision tree and the post-pruned decision tree.
What have we done in this lecture? I have introduced what the classification and regression tree, CART, model is, then I have explained different terminology frequently used in the CART model, then I have explained the theory behind different attribute selection measures like information gain and the Gini index. At the end, I have explained how to do pruning of the tree. In the next class, with the help of a numerical example, I will explain how to select different attributes; for example, with the help of information gain, how to choose the correct attributes, thank you very much.
1186
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 58
Measures of Attribute Selection
In our previous class, I have given introduction to classification regression tree, in this lecture
I am going to take some example, some numerical examples; with the help of numerical
examples, I am going to explain how to select attribute for the CART model.
(Refer Slide Time: 00:47)
The agenda for this lecture is measures of attribute selection; there are 3 measures for selecting attributes. Here, the meaning of attribute selection is choosing an independent variable for making the classification. There are 3 criteria: we can choose an attribute with the help of information gain, another measure is the gain ratio, and the third one is the Gini index. In this lecture, I am going to take the first criterion, that is information gain.
(Refer Slide Time: 01:24)
1187
With the help of this criterion, I am going to explain how to choose the attribute. I have taken one sample example; this example is taken from the book by Han and Kamber, whose title is Data Mining: Concepts and Techniques. In the problem there are 5 columns; these columns are called attributes, and the last column is the class, that is buys computer.
What this database, the AllElectronics customer database, says is that the company wants to see what kind of customers are going to buy the computer. They have one attribute, or variable, called age with different levels: youth, middle aged, senior. In income there are 3 levels: high, medium, low; in student, yes or no; and in credit rating, whether the rating is fair or excellent.
So, the final objective is to make a decision tree, or classification tree, for the dependent variable, buys computer; for that, out of these 4 variables, we want to know from which variable we have to start. For that purpose, the information gain criterion is taken as the measure; let us see how it works. The following table represents a training set, the data set called D, of class labelled tuples randomly selected from the AllElectronics customer database.
(Refer Slide Time: 03:13)
1188
The rows of the full database are called tuples. In this example each attribute is discrete valued: age is an attribute and it is discrete because there is no continuous value, income is another attribute, student is another attribute, credit rating is another attribute, and buys computer is also an attribute; all are categorical variables, there is no continuous variable here.
The class label attribute, that is the last column, the variable called buys computer, has 2 distinct values, namely yes and no; therefore there are 2 distinct classes, so m equals 2. I am going to use this value m = 2 in the coming slides, so please remember it; there are 2 levels because the person is either going to buy the computer or not. Let class C1 correspond to yes and class C2 correspond to no.
There are 9 tuples of class yes and 5 tuples of class no. A root node N is created for the tuples in D; to make this root node, we have to find out from which variable to start, that is, out of the 4 variables age, income, student and credit rating, we are going to find out which variable is going to be the root node.
(Refer Slide Time: 04:36)
1189
Here is what we are going to do: find the expected information needed to classify a tuple in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. Let us consider the class, buys computer (actually it is the variable buys_computer), as the decision criterion for D, and calculate the information as Info(D) = -p_yes log2(p_yes) - p_no log2(p_no), where p_yes is the probability of yes and p_no is the probability of no.
Another name for this equation is entropy. What is p_yes? In the last column, count how many yes answers there are: 1, 2, ..., 9, so there are 9 yes out of 14, giving the first term -(9/14) log2(9/14). Then, how many no answers are there? Out of 14, 9 are yes, so the remaining 5 are no, giving -(5/14) log2(5/14).
(Refer Slide Time: 06:10)
1190
So, this is 0.940 bits for our dependent variable. Now let us calculate the entropy for the variable age; there are 3 levels in age, one is youth, the second is middle-aged and the third is senior. If you take youth, how many people have given yes as their answer? There are 2. And how many say no? 1, 2, 3; so out of 5 youth, 2 said yes for buying a computer and 3 said no for buying a computer.
(Refer Slide Time: 07:08)
Let us calculate the entropy for youth: out of 5, the entropy for youth is -(2/5) log2(2/5) - (3/5) log2(3/5), since 3 people said no. Then we go to the next level, middle aged: in middle aged, how many people said yes? There are 4; so 4 people said yes and the number of no is 0.
(Refer Slide Time: 08:07)
1191
Because there are only 4 middle aged people, the entropy for middle aged is -(4/4) log2(4/4) - (0/4) log2(0/4), taking 0 log2 0 as 0, so here the entropy is 0. Similarly, the next level is senior: for senior, how many people said yes? There are 3, and how many seniors said no? There are 2; so out of 5 seniors, 3 said yes for buying a computer and 2 said no for buying a computer.
(Refer Slide Time: 08:59)
Let us find the entropy for senior: -(3/5) log2(3/5) - (2/5) log2(2/5), since the remaining 2 people said no. Now we are going to see the expected information needed to classify a tuple in D if the tuples are partitioned according to age. Where do the weights come from? Youth has 5 tuples, so it is 5/14, middle aged has 4 out of 14, and senior has 5 out of 14; putting these together, the expected information needed is 0.694 bits.
(Refer Slide Time: 10:05)
Calculation of entropy for the senior category: as we saw in the previous slide, out of the 5 seniors, 3 people answered yes and 2 people answered no, so the entropy is -(3/5) log2(3/5) - (2/5) log2(2/5). So, now we have the entropy for all the levels.
(Refer Slide Time: 10:36)
1193
Now, let us find the expected information needed to classify a tuple in D if the tuples are partitioned according to age; this is nothing but a weighted sum. Because there are 5 youth out of 14, 4 middle-aged people out of 14 and 5 seniors out of 14, we take 5/14 times the corresponding entropy, plus 4/14 times the corresponding entropy, plus 5/14 times the corresponding entropy, and the result is 0.694 bits.
(Refer Slide Time: 11:24)
Now, the calculation of the information gain of age: Gain(age) = Info(D) - Info_age(D), where Info(D) = 0.940, which we found for the class attribute, that is our dependent variable, and Info_age(D) = 0.694 is for the variable age alone. So, the difference is 0.246 bits.
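The whole chain for age can be verified in a few lines; the sketch below reuses the entropy idea from the previous lecture, and the only inputs are the per-level (yes, no) counts just read off the table: youth 2/3, middle aged 4/0, senior 3/2.

```python
# Information gain of "age": Info(D) minus the weighted entropy of its three levels.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

info_D = entropy([9, 5])                       # 0.940 bits for buys_computer
age_levels = [[2, 3], [4, 0], [3, 2]]          # (yes, no) counts: youth, middle_aged, senior
n = sum(sum(level) for level in age_levels)    # 14 tuples in total
info_age = sum(sum(level) / n * entropy(level) for level in age_levels)

print(round(info_age, 3))   # ~0.694 bits
print(info_D - info_age)    # ~0.2467, i.e. the 0.246 bits quoted on the slide
```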
(Refer Slide Time: 11:58)
1194
Now, we will go to the next variable; we have seen age, now we go to income. For income we repeat the same procedure: income has 3 levels, high, medium and low, so we find the entropy for high, medium and low, then we find the expected information required, and then we find the information gain. How do we find the information gain? It is the information for the dependent variable minus the expected information for this income variable; let us work that out.
(Refer Slide Time: 12:31)
So, for high income, how many people said yes? There are 2, and how many said no? There are 2; so of the 4 tuples at the high income level, 2 answered yes for buying a computer and 2 answered no. So, the entropy for high is -(2/4) log2(2/4) - (2/4) log2(2/4).
(Refer Slide Time: 13:20)
1195
So, we have got this value; now we go to the next level, medium. In medium, count how many people answered yes: 1, 2, 3, 4, so 4 people answered yes for buying a computer. And how many answered no in medium? 2 people answered no. So, the entropy for medium is -(4/6) log2(4/6) - (2/6) log2(2/6).
(Refer Slide Time: 14:11)
Then we find the entropy for low: in low, how many yes answers are there? 3 people answered yes for buying a computer. And how many answered no? Only 1 person. So, out of 4, 3 answered yes for buying a computer and 1 answered no. Now the entropy is -(1/4) log2(1/4) - (3/4) log2(3/4).
(Refer Slide Time: 15:03)
The expected information needed to classify a tuple in D if the tuples are partitioned according to income is again a weighted mean of the entropies: 4/14 for high, 6/14 for medium (if you go back and count, there are 6 medium tuples) and 4/14 for low, each multiplied by the corresponding entropy. This works out to about 0.911 bits, so the gain for income is 0.940 - 0.911, roughly 0.029 bits.
1197
We will now go to the next variable, student; student has 2 levels, yes and no. How many yes are there? 1, 2, 3, 4, 5, 6, 7, so 7 yes. How many no? Also 7. Now we find the entropy when it is yes and the entropy when it is no.
(Refer Slide Time: 17:08)
When student is no, how many answered yes for buying a computer? Out of the 7, 3 people answered yes. And how many answered no for buying a computer when student is no? There are 4. So, the entropy for no is -(3/7) log2(3/7) - (4/7) log2(4/7).
(Refer Slide Time: 18:03)
1198
Now we find the entropy for yes: when student is yes, how many people answered yes for buys computer? 1, 2, 3, 4, 5, 6, so there are 6 yes. And how many answered no when student is yes? There is only 1 no. So, the entropy for yes is -(6/7) log2(6/7) - (1/7) log2(1/7).
(Refer Slide Time: 18:44)
Now we find the expected information, which is again the weighted entropy. We have already seen that for student there are 7 yes and 7 no out of 14, so it is (7/14) times the entropy for yes plus (7/14) times the entropy for no.
This gives the expected information needed. The gain is then 0.940 for our dependent variable minus 0.789 for the student variable, so the gain is 0.151.
(Refer Slide Time: 19:41)
Next we go to the variable credit rating. How many levels does it have? Two: fair and excellent. How many fair are there? 1, 2, 3, 4, 5, 6, 7, 8. How many excellent? 1, 2, 3, 4, 5, 6, so 6 out of 14. Now we find the entropy for fair and the entropy for excellent.
(Refer Slide Time: 20:14)
For fair, how many people answered yes? Counting the fair-yes rows: 1, 2, 3, 4, 5, 6. How many are fair and at the same time no? There are 2. So the entropy for fair is −(6/8)·log2(6/8) − (2/8)·log2(2/8).
(Refer Slide Time: 21:00)
Now let us find the entropy for excellent. How many people answered yes for buying a computer when the rating is excellent? 1, 2, 3, so 3 people. How many said no? Also 3. So 3 people answered yes and 3 answered no when the rating is excellent, and the entropy for excellent is −(3/6)·log2(3/6) − (3/6)·log2(3/6).
(Refer Slide Time: 21:44)
Now we find the expected information needed to classify a tuple in D if the tuples are partitioned according to credit rating. As I told you, credit rating has 2 levels, fair and excellent. For fair there are 8 items, so (8/14) times its entropy, and for excellent the remaining (6/14) times its entropy.
So the expected information needed is 0.892. Now we find the gain for credit rating. What does it mean? It tells us how much information we gain if we use credit rating as the root variable. We had 0.940 for the dependent variable and 0.892 for the credit rating variable, so the difference is 0.048.
(Refer Slide Time: 22:52)
Now I have summarized the results: if we use age as the classifier, the information gain is 0.246; for income it is 0.029; for student it is 0.151; for credit rating it is 0.048. Out of these 4, the highest value is 0.246, so we start by keeping age as the classifier variable. That is the application of this measure.
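As a rough sketch of how these four gains could be reproduced in Python, the function below computes Info(D) minus the weighted entropy after grouping on one attribute. It assumes the 14-row table has been loaded into a pandas DataFrame df with columns named age, income, student, credit_rating, and buys_computer; the file name and column names are my assumptions, not fixed by the lecture.

import math
import pandas as pd

# df = pd.read_excel('buys_computer.xlsx')   # hypothetical file holding the 14-row table

def entropy_of(series):
    # Entropy of the class distribution in a pandas Series
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, attribute, target='buys_computer'):
    # Info(D) minus the expected (weighted) information after splitting on `attribute`
    info_d = entropy_of(df[target])
    info_attr = sum(len(part) / len(df) * entropy_of(part[target])
                    for _, part in df.groupby(attribute))
    return info_d - info_attr

# Should reproduce roughly 0.246 (age), 0.029 (income), 0.151 (student), 0.048 (credit_rating)
for col in ['age', 'income', 'student', 'credit_rating']:
    print(col, round(information_gain(df, col), 3))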
(Refer Slide Time: 23:36)
So, because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labelled with age, branches are grown for each of its values, and the tuples are then partitioned accordingly. Notice that the tuples falling into the partition for age = middle aged all belong to the same class.

We need not classify further, because all middle-aged people answered yes for buying a computer; they all belong to class yes. A leaf should therefore be created at the end of this branch and labelled yes.
(Refer Slide Time: 24:23)
This is the decision tree. Age is the classifier, and age has 3 levels: youth, middle aged, and senior. For middle aged, everyone answers yes, so no further classification is required. For youth there are other variables left, income, student, and credit rating, and the same information-gain methodology we have used can be applied to this subgroup as well.

Again, we can find out which of these 3 variables, income, student, or credit rating, should appear at that node. In the same way, for the senior branch there are again 3 candidate variables, and applying the information gain criterion tells us which variable should come into that node. This is how we continue the classification procedure.
(Refer Slide Time: 25:25)
This is the final decision tree returned by the algorithm, shown in the figure. We started with age; age had 3 levels, youth, middle aged, and senior, and for middle aged we stop because all the answers are yes and no further classification is required. Then, as I told you, there were 3 options, income, student, and credit rating, and out of these, student appeared as the classifier on the left-hand (youth) side.

For student there are 2 possibilities, yes or no. On the right-hand side, for senior, there are again 3 candidates, income, student, and credit rating; student has already been used, so here credit rating is chosen, again by using the information gain measure. That is our final decision tree.
We have seen different measures for selecting attributes; there are 3 measures: information gain, gain ratio, and the Gini index. In this lecture, using information gain as the measure and taking one numerical example, I have explained how to choose an attribute. In the next class, I will take the other measures for choosing an attribute, the gain ratio and the Gini index, with the same example. We will continue in my next lecture, thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee
Lecture – 59
Attribute Selection Measures in CART – II
In this lecture, we are going to see how to select attributes in the CART model. In our previous lectures, we saw how to choose attributes using the information gain value. In this lecture, there are another two methods for choosing attributes: one is the gain ratio, and the other is the Gini index. These are the two criteria we will look at now.
(Refer Slide Time: 00:51)
First, we will look at the gain ratio. The information gain measure is biased towards tests with many outcomes; that is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier, such as product ID.
(Refer Slide Time: 01:41)
A split on product ID would result in a large number of partitions, as many as there are values, each containing just one tuple. Because each partition is pure, the information required to classify dataset D based on this partitioning, Info_productID(D), would be 0. We know the formula for information gain: Gain(product ID) = Info(D) − Info_productID(D); because the second term is 0, the information gain is maximal.

Therefore, the information gained by partitioning on this attribute is maximal, yet such a partitioning is clearly useless for classification, since there is only one element in each partition. The gain ratio is an extension of information gain that attempts to overcome this bias.
(Refer Slide Time: 02:14)
Let us see what split information is. The gain ratio applies a kind of normalization to information gain using a split information value, defined analogously to Info(D). The split information for attribute A on dataset D is

SplitInfo_A(D) = − Σ (|Dj| / |D|) · log2(|Dj| / |D|), summed over the partitions j = 1, ..., v,

where v is the number of levels of A and Dj is a single partition; I will explain Dj in the next example. D is the dataset. So this split information represents the potential information generated by splitting the training dataset D into v partitions, corresponding to the v outcomes of a test on attribute A.
(Refer Slide Time: 03:11)
Let us see the formula for the gain ratio. The gain ratio differs from the information gain, which measures the information with respect to classification acquired based on the same partitioning. The gain ratio is GainRatio(A) = Gain(A) / SplitInfo_A(D). The attribute with the maximum gain ratio is selected as the splitting attribute. I have an example with which I can explain how to use the gain ratio for choosing the attribute.
(Refer Slide Time: 03:43)
Look at this example. In my previous lecture, I was showing this dataset. Consider, from our previous example, the computation of the gain ratio for the attribute income. We take this variable first and find its gain ratio. I am going to show the procedure for only one attribute; in the same way you have to try it for the age, student, and credit rating attributes.

Whichever is highest, that variable should be chosen for classification. A test on income splits the data in the table into three partitions. When you look at the income column, there are three levels, low, medium, and high, containing 4, 6, and 4 elements respectively. This example is taken from the book Data Mining: Concepts and Techniques.
(Refer Slide Time: 05:04)
Now, let us calculate the entropy for the level low. In level low, 3 people have answered yes to buying a computer and 1 has answered no, so the entropy is −(3/4)·log2(3/4) − (1/4)·log2(1/4).
(Refer Slide Time: 05:44)
For finding the information gain for the attribute income, first we need the entropy for the class D. We know how to find it: Info(D) = −p_yes·log2(p_yes) − p_no·log2(p_no), where p_yes is the proportion of people who answered yes. Looking at the table, there are 9 yes and 5 no, so Info(D) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940 bits.
(Refer Slide Time: 06:43)
Now let us find the information gain for the attribute income, that is, how much information we gain if we use income as the attribute for classification. The expected information needed to classify a tuple in D if the tuples are partitioned according to income is Info_income(D) = (4/14)·Entropy(low) + (6/14)·Entropy(medium) + (4/14)·Entropy(high); the weights come from the three levels of income, with 4 values in low, 6 in medium, and 4 in high.

It is a kind of weighted entropy, because that is exactly what the expected information needed amounts to. This weighted entropy works out to 0.911 bits. So the gain of income is Info(D) − Info_income(D) = 0.940 − 0.911 = 0.029.
(Refer Slide Time: 8:08)
Now we go for the split information. SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557. Therefore, the gain ratio for the attribute income is 0.029 / 1.557 = 0.019.
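Continuing the sketch from the earlier information-gain function (same assumptions about the DataFrame df and its column names), the split information and gain ratio could be computed like this:

import math

def split_info(df, attribute):
    # SplitInfo_A(D): entropy of the partition sizes produced by `attribute`
    props = df[attribute].value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in props if p > 0)

def gain_ratio(df, attribute, target='buys_computer'):
    # Gain(A) divided by SplitInfo_A(D), reusing information_gain() from the earlier sketch
    return information_gain(df, attribute, target) / split_info(df, attribute)

# For income the partition sizes are 4, 6 and 4 out of 14, so split_info(df, 'income')
# is about 1.56 and gain_ratio(df, 'income') is about 0.019.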
(Refer Slide Time: 09:05)
Further, we calculate the same for the remaining 3 attributes: age, student, and credit rating. The one with the maximum gain ratio value will result in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion. So, having found the gain ratio for one attribute, income, we have to find the gain ratio for the other attributes, age, student, and credit rating, in the same way; the attribute with the maximum gain ratio is then taken as the splitting criterion.
(Refer Slide Time: 10:09)
For example, assume the attribute age has the maximum gain ratio; then age is taken as the splitting variable. From there, the remaining variables, such as student, credit rating, and so on, are considered: out of these, you again have to find the gain ratio, and whichever gives the maximum gain ratio is taken as the splitting criterion for the next node.
(Refer Slide Time: 10:50)
Next we go to another criterion for choosing the attribute, the Gini index. Let us look at the construction of a decision tree using the Gini index. Let D be the training data in the following table. This data is also taken from the book Data Mining: Concepts and Techniques; the source is given here. Now we are going to see how to find the Gini index.
(Refer Slide Time: 11:12)
In this example, each attribute is discrete-valued because all values fall into categories; continuous-valued attributes have been generalized, and we do not consider continuous values here. The class label attribute, buys computer, has two distinct values, yes and no, so there are two distinct classes and m = 2. Let class C1 correspond to yes and class C2 to no. There are nine tuples of class yes and five tuples of class no. A root node N is created for the tuples of D.
(Refer Slide Time: 12:07)
Now we go for the calculation of the Gini index. We first use the following equation to compute the impurity of D: Gini(D) = 1 − Σ pi², with the sum over the m classes, where pi is the proportion of class i. How many yes are there? 9 out of 14; and there are 5 no. So Gini(D) = 1 − (9/14)² − (5/14)² = 0.459.
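As a quick sketch, the Gini impurity of a class distribution can be computed directly from the class counts; the 9 and 5 below are simply the yes and no counts from the table.

def gini(counts):
    # Gini impurity for a class distribution given as a list of class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # about 0.459 for the whole dataset D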
(Refer Slide Time: 12:54)
Having found Gini(D), let us now calculate the Gini index for the income attribute. To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute; take income as the example. We start with income and consider each of its possible splitting subsets. Income has three possible values, namely low, medium, and high.

The possible subsets are the full set {low, medium, high}; the two-value combinations {low, medium}, {low, high}, {medium, high}; the single-value sets {low}, {medium}, {high}; and the null set. The full set and the empty set will not be used for splitting: the full set contains all the elements, for example {low, medium, high}, and the null set is nothing but the empty set, so neither represents a real split.
(Refer Slide Time: 14:00)
So these will not be used for splitting. Now we split into two categories, because this is a binary classification. One subset is {low, medium}: suppose there are two groups, D1 and D2; in D1 we take low and medium, so the other group, D2, is obviously high. This results in 10 tuples in partition D1, which we get by counting how many low and medium records there are; that is why the condition is income in the set {low, medium}. The remaining four tuples of D, the high ones, are assigned to partition D2. So we have made two subsets: one is {low, medium} and the remaining one is {high}.
(Refer Slide Time: 15:09)
For the {low, medium} subset, looking at the class variable buys computer, we count the yes answers. Taking medium and low together, there are 7 yes and 3 no.
(Refer Slide Time: 15:28)
For high, which is group D2, how many yes are there? 2. How many no? Also 2. So there are 2 yes and 2 no.
(Refer Slide Time: 15:38)
Now the Gini index for the income attribute. The Gini index value computed based on this partitioning is Gini_income∈{low,medium}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2). The 10/14 comes from the 10 low-and-medium values out of 14; the other set, D2, contains only the remaining 4, hence 4/14. Gini(D1), as we saw in the previous lecture, is 1 − (7/10)² − (3/10)².

How do we get the 7? There are 7 yes and 3 no in D1, which is why it is (7/10)² and (3/10)². For D2 there are two yes and two no, so the second term is (4/14)·(1 − (2/4)² − (2/4)²). Altogether this gives 0.443, which is the Gini value for this split on income. Note that finding the Gini value for {low, medium} is equivalent to finding the Gini value for the complementary group, high.
(Refer Slide Time: 17:00)
Consider the next subset, {high, medium}. With high and medium together there are 10 tuples in partition D1; here also we have two groups, D1 and D2. High and medium form one set, D1, so the other set will obviously be low, which goes into D2. So there are 10 tuples satisfying the condition income in {high, medium}, and the remaining 4 tuples of D are assigned to partition D2.
(Refer Slide Time: 17:57)
Looking at high and medium together, there are 6 yes for buying a computer and 4 no.
(Refer Slide Time: 18:10)
Then for the other group, the tuples in partition D2 (low): one person has answered no and 3 people have answered yes for buys computer.
(Refer Slide Time: 18:22)
Now we find the Gini index value computed based on this partitioning. For income we have two categories: D1 with high and medium, and D2 with low. For high and medium there are 10 tuples out of 14, so the first term is (10/14)·Gini(D1); for the 4 low tuples the second term is (4/14)·Gini(D2).

What is Gini(D1)? In D1, for high and medium together, there were 6 yes and 4 no, so Gini(D1) = 1 − (6/10)² − (4/10)². For D2, with 3 yes and 1 no, Gini(D2) = 1 − (3/4)² − (1/4)², weighted by 4/14. The Gini index for the {high, medium} split is therefore 0.45, which is also the Gini index for the complementary group, low.
(Refer Slide Time: 19:54)
Now we go for the last subset, {high, low}. Again we have two groups, because it is a binary classification: high and low form one group, D1, and the remaining level, medium, is D2. This results in 8 tuples in partition D1 satisfying income in {high, low}; the remaining 6 tuples of D are assigned to partition D2.
(Refer Slide Time: 20:27)
For high and low together, there are 5 yes and 3 no. The other group, D2, contains medium, and in medium there are 2 no and 4 yes.
(Refer Slide Time: 20:41)
We now find the Gini index value computed based on this partitioning, for the group {high, low}. In high and low there are 8 values in total, so the weight is 8/14; with 5 yes and 3 no out of 8, the first term is (8/14)·(1 − (5/8)² − (3/8)²). For group D2, medium, there are 6 elements out of 14, with 4 yes and 2 no, so the second term is (6/14)·(1 − (4/6)² − (2/6)²). The Gini index for the group {high, low}, which equals that of the complementary medium group, is 0.458.
(Refer Slide Time: 21:40)
We have now completed all possible binary splits: for {high, low} the Gini index value is 0.458, for {high, medium} it is 0.45, and for {medium, low} it is 0.443. How do we interpret this table? The value 0.443 is the minimum Gini index value.
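A minimal sketch of how these three candidate splits could be compared in Python, again assuming the 14-row table sits in a pandas DataFrame df with columns income and buys_computer (the column names are my assumption):

def gini_of(series):
    # Gini impurity of the class distribution in a pandas Series
    probs = series.value_counts(normalize=True)
    return 1 - sum(p ** 2 for p in probs)

def gini_binary_split(df, attribute, subset, target='buys_computer'):
    # Weighted Gini impurity when `attribute` is split into `subset` versus the rest
    d1 = df[df[attribute].isin(subset)]
    d2 = df[~df[attribute].isin(subset)]
    n = len(df)
    return len(d1) / n * gini_of(d1[target]) + len(d2) / n * gini_of(d2[target])

# Expected to give roughly 0.443, 0.450 and 0.458 for the three income splits
for subset in (['low', 'medium'], ['high', 'medium'], ['high', 'low']):
    print(subset, round(gini_binary_split(df, 'income', subset), 3))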
(Refer Slide Time: 22:11)
The best binary split for the attribute income is therefore {medium, low} versus {high}, because it minimizes the Gini index; looking at the previous table, it has the minimum Gini index value. The splitting subset {medium, low} therefore gives the minimum Gini index for attribute income, and the reduction in impurity is 0.459 (the Gini index for the class D, which we obtained earlier) minus 0.443 (the Gini index of the best subset), a difference of 0.016.

Further, we calculate the same for the remaining 3 attributes: age, student, and credit rating. For each attribute we have to find the reduction in impurity, and the one with the minimum Gini index value will result in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion.
(Refer Slide Time: 23:28)
Now, how does the classification proceed? Suppose income is the splitting variable, with the binary split {low, medium} on one side and {high} on the other. Within the high branch you will again have a smaller table, and for each attribute in it we have to find the Gini index; for the {low, medium} branch we will likewise have another table, and there too we have to find the Gini index.

After finding the Gini index, the attribute giving the highest reduction in impurity, in other words the lowest Gini index value, should be chosen as the splitting criterion for further classification. In this lecture, I have explained how to choose an attribute for the decision tree model using two criteria, the gain ratio and the Gini index. For both methods I have taken a numerical example and, with its help, explained how to choose an attribute. In the next lecture, we are going to use Python for building a CART model. Thank you.
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology - Roorkee
Lecture – 60
Classification and Regression Trees (CART) - III
In my previous lecture I explained how to choose an attribute for the decision tree model. We know that there are three methods: the information gain method, the gain ratio method, and the Gini index. In the previous lecture I explained, with all the procedures, how to choose an attribute using the gain ratio and the Gini index.
(Refer Slide Time: 00:49)
In this lecture, we are going to construct the CART model with the help of Python, then I am going to explain the decision tree and interpret the output of the CART model.
(Refer Slide Time: 01:07)
This is the sample data for the example. The attributes age, income, student, and credit rating are the independent variables, and the dependent variable is buys_computer. This example is taken from Han, Kamber, and Pei, Data Mining: Concepts and Techniques.
(Refer Slide Time: 01:30)
This is a screenshot of the first step: importing the relevant libraries and loading the data file. I copied this data into Excel, so I import pandas as pd, numpy as np, and matplotlib.pyplot as plt, then read the data and store it in an object called data.
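A minimal version of this first step might look as follows; the Excel file name is a placeholder, since the lecture only says the table was copied into Excel.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file name holding the 14-row buys_computer table
data = pd.read_excel('buys_computer.xlsx')
print(data.head())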
(Refer Slide Time: 01:50)
Then there are different methods for encoding the data. The first is LabelEncoder: this method is used to normalize labels, and it can also transform non-numerical labels into numerical labels. In our example the data are non-numerical, and we are going to convert them into numerical labels. The other function is fit_transform, which fits the label encoder and returns the encoded labels.
(Refer Slide Time: 02:23)
This is the data-encoding step. We import LabelEncoder from sklearn.preprocessing, then create encoders such as le_age = LabelEncoder(), le_income = LabelEncoder(), and so on for all the variables. The attributes age, income, student, credit rating, and buys computer are in text form, so I convert them into numerical form; the new variable age_n is obtained using le_age.fit_transform. Wherever fit_transform is applied, the text data is converted into numerical form.
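A sketch of this encoding step, assuming the text columns are named age, income, student, credit_rating, and buys_computer (the exact names in the lecture's file may differ):

from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in ['age', 'income', 'student', 'credit_rating', 'buys_computer']:
    encoders[col] = LabelEncoder()
    # fit_transform learns the label mapping and returns the encoded column
    data[col + '_n'] = encoders[col].fit_transform(data[col])

# The original text columns can later be removed with data.drop([...], axis=1)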
(Refer Slide Time: 03:10)
What happens now is that this text portion has been converted into numerical form by using the transformations.
(Refer Slide Time: 03:20)
Then we restructure the data frame using the drop function, which removes rows or columns by specifying label names along the corresponding axis, or by specifying index or column names directly. As I told you, the frame currently contains both the text data and the numerical data; since we have already transformed everything into numerical form, I drop the text columns using drop. After that, the dataset contains no text, only numerical values, and this dataset is taken for building the CART model.
(Refer Slide Time: 03:58)
For building the CART model we have to specify the independent and dependent variables. The dependent variable is buys_computer, and the independent variables are age, income, student, and credit rating; notice that I am using age_n and the other _n columns, which are in numerical form. The dependent variable has only two options, yes or no, so buys_computer_n has only two levels, 0 and 1.
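In code, this declaration could look like the following, using the encoded _n column names assumed above:

feature_cols = ['age_n', 'income_n', 'student_n', 'credit_rating_n']
X = data[feature_cols]          # independent variables in numerical form
y = data['buys_computer_n']     # dependent variable: 0 = no, 1 = yes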
(Refer Slide Time: 04:31)
Now we are going to build the decision tree model without splitting. What does "without splitting" mean? In data mining, when a huge amount of data is available, part of the data is used for training the model and the remaining data is used for testing it. Here we are not going to do that; we take the whole dataset to build the model and do not test it. So: from sklearn.tree import DecisionTreeClassifier, then clf = DecisionTreeClassifier() and dt = clf.fit(X, y); dt is the fitted model shown in the output.
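A minimal sketch of fitting the tree on the full dataset, without any train/test split:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion='gini')  # CART-style tree; 'gini' is also the default
dt = clf.fit(X, y)                              # fit on all 14 records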
(Refer Slide Time: 05:10)
Now we are going to visualize the decision tree: from sklearn.tree import export_graphviz, from sklearn.externals.six import StringIO, from IPython.display import Image, and import pydotplus. Then dot_data = StringIO() and export_graphviz(...) are the commands for getting the graphical output of our CART model; in that call I have also specified the class names for the dependent variable.
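A sketch of the visualization step; note that sklearn.externals.six exists only in older scikit-learn versions, so io.StringIO is used here instead, and graphviz plus pydotplus must be installed for this to run.

from io import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, filled=True, rounded=True,
                feature_names=['age_n', 'income_n', 'student_n', 'credit_rating_n'],
                class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())   # displays the tree in a notebook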
(Refer Slide Time: 05:48)
This is the output of our CART model, the decision tree. Let us understand how to interpret it. At the root we test age_n <= 0.5, and there are two possibilities, true and false; I will explain the meaning of this 0.5 in the next slide. A blue box represents a favourable decision for us, while an orange box represents an unfavourable one.

What is unfavourable? That person will not buy the computer, which is class 0. Class 1 means the person will buy the computer. To interpret the blue box, first look at age: if the root condition is true, that person will surely buy the computer. If it is false, look at student, which again has two possibilities, yes or no, giving a true branch and a false branch.

Going down the student branch, when that condition fails we follow the false branch and look at the next attribute, credit rating, which again has a true and a false branch. When the credit rating condition is false, we reach a favourable decision; if it is true, we look at another attribute, income, and if the income condition is false it is again a favourable decision for us. I will explain each value in the boxes and the meaning of the colour coding in the coming slides.
(Refer Slide Time: 07:31)
Let us interpret the CART output.
(Refer Slide Time: 07:35)
This is the data we have taken. We first use the following equation for the Gini index to compute the impurity of D, which I have also explained earlier: Gini(D) = 1 − Σ pi², summed over the m classes, where m is the number of levels of our dependent variable; here m = 2 because the answers are yes or no. The 9 represents the number of yes answers, counted 1 through 9 out of 14, and the 5 represents the number of no answers, so Gini(D) = 1 − (9/14)² − (5/14)² = 0.459. This is the Gini index for our dependent variable; D here is the dataset whose impurity is measured on that variable.
(Refer Slide Time: 08:26)
Now let us take one attribute; suppose I take income. We are going to use the Gini index for this attribute, and we know the Gini index always goes for a binary split. Income has three levels, low, medium, and high, so we can group them two at a time: option 1 is low and medium in one group with high in the other; next, high and medium in one group with low in the other; and the last choice is high and low in one group with medium in the other. For all three combinations, let us find the Gini index.
(Refer Slide Time: 09:11)
First we take low and medium. For low, we look at how many people answered yes: counting them gives 3. For medium, how many answered yes for our dependent variable? 1, 2, 3, 4, so 4; together that is 3 + 4 = 7. In the same way, when it is low, how many answered no? Only 1.

Then, for medium, how many answered no? There are 2, so altogether 7 yes and 3 no; that is how we get this table for one group, D1. What is happening here is that for the income variable we go for a binary split, with one part D1 and the other D2. In D1 we consider the two levels low and medium, and in the other group we consider only high, so D1 is {low, medium} and D2 is {high}.
(Refer Slide Time: 10:51)
Now D2, which is high: how many people answered yes when income is high? 1, 2, so 2 yes. And how many answered no when it is high? Also 2.
(Refer Slide Time: 11:13)
Now we compute the Gini index for the income attribute. The Gini index value based on this partitioning is Gini_income∈{low,medium}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2). How do we get the 10? Counting the low and medium rows there are 10, 1 through 10, divided by 14; and Gini(D1) = 1 − (7/10)² − (3/10)².

How do we get the 4 for D2? Counting how many high rows there are in income: 1, 2, 3, 4, so 4/14; and since the numbers of yes and no are the same, Gini(D2) = 1 − (2/4)² − (2/4)². So the Gini index for {low, medium} is 0.443, which is equivalent to the Gini index for the other level, high.

The next option is D1 = {high, medium} and D2 = {low}. With this split, the Gini index for the income attribute belonging to {high, medium} comes out to 0.45.
(Refer Slide Time: 13:02)
The third option is high and low as one group and medium as the other. Continuing in the same way, the Gini index is 0.458.
(Refer Slide Time: 13:18)
Now, comparing all the combinations, the lowest Gini index value is 0.443, corresponding to the split that separates {low, medium} from high.
(Refer Slide Time: 13:31)
Now the Gini index for the age attribute: for the split on senior it is 0.457, for middle age it is 0.357, and for youth it is 0.393. The lowest Gini value is 0.357.
(Refer Slide Time: 13:51)
Similarly, we take another attribute, student; for the student attribute the Gini index works out to 0.367.
(Refer Slide Time: 14:01)
The next attribute is the credit_rating. For this, the Gini index is 0.428.
(Refer Slide Time: 14:09)
Now, when you bring together the Gini indices for the different attributes, age has the least Gini score, 0.357, so this attribute should be chosen for the classification; the preference goes to age. The attribute with the minimum Gini score is taken, and that is age, because its Gini index is 0.357. Age has 3 levels: middle age, youth, and senior. Looking at middle age, all the people have answered yes for buying a computer, so for that branch we can stop. For youth and senior we have to continue and decide which attribute should be chosen there.
(Refer Slide Time: 14:54)
Since all the middle-aged people have answered yes (you can see that every middle-age row is yes), for the next calculations we are going to drop these rows. After dropping them we find the Gini index again; after separating these 4 samples belonging to middle age, 10 rows remain out of 14.
(Refer Slide Time: 15:22)
For those 10 rows we find the Gini index for our dependent variable, 0.5; the Gini index for age is 0.48, for credit rating 0.41, for student 0.32, and for income 0.375. Again we look at which attribute has the lowest Gini value, and we take student as the node since it has the minimum Gini score.
(Refer Slide Time: 15:53)
So after the first split on age, the next classifier is student. Student has 2 levels, yes and no. If student is yes, which attribute should be chosen next? If student is no, which attribute should be chosen? We will see that now.
(Refer Slide Time: 16:10)
Now we omit the marked rows, those with age equal to middle-aged or student equal to yes, because we are deciding which attribute should be chosen on the student = no branch. So, if a row is middle-aged it has to be dropped, and wherever the student level is yes that row is also dropped. Counting the dropped rows, 1 through 9, we drop 9 rows out of 14, so 5 rows remain for the next iteration.
(Refer Slide Time: 17:03)
For those 5 rows we again find the Gini index: for the dependent variable it is 0.32, for age 0.2, for credit rating 0.267, for student 0.32, and for income 0.267. Again we look at which attribute has the least Gini score, and it is age. So when the student level is no, we have identified the next attribute. Similarly, when the student level is yes, we have to find the next attribute for further classification; for that we omit the marked rows, those with age equal to middle aged or student equal to no.

When we do that, 5 rows remain. Using these 5 rows we again find the Gini index: for our dependent variable 0.32, for age 0.267, for credit rating 0.2, for student 0.32, and for income 0.267. Again we look for the attribute with the minimum Gini index, and here credit rating has the minimum.
(Refer Slide Time: 18:20)
So the credit rating attribute has to be brought in here for further classification. Like this, we continue until all the conditions are satisfied.
(Refer Slide Time: 18:32)
Now I am going to explain the coding scheme, because it is important for interpreting the Python output of our CART model. We have the attributes age, student, credit rating, and income, plus buys computer; that is, 4 independent variables and one dependent variable. For age, youth is coded as 2, middle-age as 0, and senior as 1. This coding is important for interpreting the output.
(Refer Slide Time: 19:00)
This was our Python output. The first variable is age_n; when it is less than or equal to 0.5 the Gini index is 0.459, which matches the value we solved manually, so you can feel more confident about the values appearing in the Python output. Now, what is the meaning of this 0.5? Look at the coding: for age, youth is coded 2, middle-age 0, and senior 1.

So age_n <= 0.5 represents middle age; when this condition is true we are in the middle-age group, and when it is false the records belong to youth or senior. What is the meaning of the values 5, 9? The 5 represents no and the 9 represents yes. In the middle-aged group the values are 0, 4: the first number is the no count and the second the yes count, so all have answered yes, and we do not go for further classification.
Then we take student_n as the next attribute for further classification, with the test student_n <= 0.5. What does that mean? For student we coded yes = 1 and no = 0, so student_n <= 0.5 represents no; when the condition is true the student answer is no, giving one branch, and when it is yes there is another branch. Next we test age_n <= 1.5.
Because at this point only two age groups remain, youth and senior, age_n <= 1.5 represents senior: when the condition is true it is senior, and when it is false it is youth. Next we come to credit_rating_n <= 0.5. There are two options; when the condition is true we get excellent. How do we get excellent? Go to the credit rating coding: fair is 1 and excellent is 0.

So when the value is less than or equal to 0.5 it is excellent: the true branch is excellent and the false branch is fair. Then we go for income_n <= 1.5. With the income coding high = 0, low = 1, medium = 2, income_n <= 1.5 represents high and low; to see what that means, look at this table.
So when income_n is less than or equal to 1.5 we get the group high and low, with medium as the other group. As a manager, how do you interpret this? The first classifier is age: if the condition is true we are in middle age, and middle-aged people have a positive response for buying the computer, which is why the class is 1. If this condition fails, go to student: if they answer yes, look at the next attribute, credit rating; if it is fair there is a favourable response, and if it is excellent we go for further classification.

When it is excellent, the next classifier is income: when that condition is true they belong to high and low and they will not buy the computer, and when it is medium they will buy the computer. Looking at the left-hand side, when student_n <= 0.5 the answer is no, and we go to the next attribute, age; since middle age has already been dropped, the remaining levels are youth and senior: if the condition is true it is senior, and if it is false it is youth.
If it is senior you then look at credit rating; if it is youth, the response is not favourable. The 4 represents the number of no answers and the 1 the number of yes answers for our dependent variable. Another thing to notice: wherever the class is 1, the yes count dominates; see the blue boxes with values 0, 4; 1, 4; 0, 3; 0, 1; and again 0, 1, where the second number, the yes count, carries the class.
So the blue boxes are the ones that give a favourable decision for us. Now look at the orange boxes: here you see 1, 0 with only no, here also 1, 0 with only no, and here 3, 0 with only no. So an orange box represents a node that will not give a favourable decision. A white box is intermediate, in the sense that we have to choose some more attributes before reaching a conclusion.
What does the 14 represent? It is the number of values of the dependent variable; "samples" gives the sample size at the node, and so on. We repeat the splitting process until we obtain all leaf nodes and the final output; the leaf nodes are the terminal boxes shown here.
(Refer Slide Time: 25:07)
So far we have built the model on the data without splitting, but in data mining we generally split the dataset. So now we split it: 25% of the data is going to be used for testing and the remaining data for training.
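A sketch of this split-and-evaluate step; the random_state value is arbitrary, and with only 14 records the reported accuracy (0.75 in the lecture) can change from run to run.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))   # accuracy on the held-out 25%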
(Refer Slide Time: 25:22)
Running the Python code as before, we get an accuracy of 0.75. What does this 0.75 mean? Our classifier, this decision tree model, is able to classify whether a person is going to buy the computer or not with 75% accuracy. Then we visualize the decision tree.
(Refer Slide Time: 25:46)
What is happening? You can see that when you split the dataset, the root node variable changes: the node is now student, whereas previously it was age. Since our dataset has only 14 records, we get this different result.
(Refer Slide Time: 26:03)
Now I am going to walk through the Python code for the problem I have explained. First I import pandas as pd and the other libraries such as numpy and matplotlib. Then I import the data; as you know, the columns are age, income, student, credit rating, and buys_computer. This is in text form, so I have to encode the data, because the model needs it in numerical form.
(Refer Slide Time: 26:46)
This shows that after encoding, on the right-hand side we can see the equivalent numerical values. Now we drop the text columns, because we are going to use only the numerical values for building the CART model; what remains is purely numerical, with the 14 records. Next we declare the independent variables and the dependent variable.
(Refer Slide Time: 27:14)
So X holds the independent variables: age, income, student, and credit rating. What is the dependent variable? It is buys_computer, which has two levels, 0 and 1. Then I build the decision tree model; to get the graphical output you need to install the packages mentioned earlier.
(Refer Slide Time: 27:54)
This shows our CART output. Age is the first attribute: when the age condition is true we reach a leaf node, and when it is false we choose further attributes, age again and credit rating, and after credit rating we go to income; when that is false, we stop. What you need to understand here is that the blue rectangles represent a favourable decision for us, the orange ones represent an unfavourable decision, and the white ones are in between, meaning further analysis is needed. Now we split the dataset in the ratio 75:25, so 75% of the data is for training and the remaining 25% is for testing.
(Refer Slide Time: 28:59)
After splitting the dataset, we run the CART model and then evaluate the accuracy of the model. The accuracy is 0.75. Now we visualize the CART model.
(Refer Slide Time: 29:40)
This is the output for the dataset where we do the splitting. Now student is taken as the primary node for splitting: if it is true, we go to credit rating, and if it is false we go to age as the next classifier. For age we again use the condition age_n <= 1.5, whose meaning I explained earlier. In most places there is a higher possibility of a favourable decision, because class = 1.

There are also orange rectangles, which represent unfavourable decisions. What this means is that when we split the dataset, our decision making becomes very simple, because the tree takes a very simple form that is easy to interpret. In this lecture I have explained how to build the CART model with the help of Python. I took an example problem and first obtained the CART model without splitting the dataset.

After that, I split the dataset, obtained the output, compared the two, and explained the output of the CART model in detail. With that we are concluding this course, Data Analytics with Python. Thank you very much for attending this course. Thank you.
THIS BOOK IS NOT FOR SALE
NOR COMMERCIAL USE