Lecture1 1
Lecture - 03
Introduction and Types of data – Part 1
In week 1, the learning objectives are as follows. First, you should understand why you
are learning this course.
What is statistics? We are going to talk briefly about the two main branches of
statistics that are relevant to you at this point of time. We will tell you what is
understood by descriptive statistics and inferential statistics.
The minute I talk about inferential statistics, I need to introduce the notion of a
sample and a population. So, that is what I am going to introduce. Then we move on to
understand why we need data; we will understand a bit about how data is collected and
we will talk about how to organize data in the form of what we call a data set.
Once we have a data set, we will understand more about data by classifying it as
categorical or numerical, and as cross-sectional or time-series, and we will discuss a
bit about measurement scales.
Finally, I think in any statistical analysis the key is to understand your data and frame
questions based on data. So, we will spend some time trying to understand and train
ourselves to frame questions based on data. So, these are the learning objectives for
week 1.
What is statistics? If you go through the definitions of statistics over the years, you can
see that the definition has been changing over time. What started as just summarizing data
gradually extended to inference from data, and now, with a lot of data available,
statistics is being redefined as the art of learning from data.
Now, the minute I say learning from data, it implies that you want to seek some
information from data. So, Sheldon Ross defined statistics as the art of learning from
data: it is concerned with the collection of data, their subsequent description and their
analysis, which often leads to the drawing of conclusions. So, the main idea of statistics
and statistical analysis is to actually draw conclusions based on data.
(Refer Slide Time: 02:44)
So, if you look at the classification of statistics, even though there are newer branches of
statistics and new titles given, you may broadly take the main branches of statistics to be
two. The part of statistics which is concerned with the description and summarization of
data is more popularly referred to as the descriptive statistics branch.
The part of statistics which is concerned with drawing conclusions from data is called the
inferential statistics branch; that is, you want to infer from data. Now, when you want to
infer from data, there is one very important thing, which is the possibility of chance:
when you are inferring from data there is an element of chance, because you do not know
everything exactly.
And hence, in this foundation course, we are preparing you with an introduction to
probability, to help you prepare for the next level, the next course, where you will be
learning about inferential statistics.
(Refer Slide Time: 04:11)
So, primarily, when we talk about inferential statistics, we are talking about drawing
conclusions from data. Now, in inferential statistics, one important thing is that many a
time you are interested perhaps in knowing about the percentage of all students in India
who have passed their Class 12 exams and study engineering; the prices of all houses in
Tamil Nadu; the total sales of all cars in India in the year 2019; the age distribution of
people who visit a city mall in a particular month.
So, one way of answering all these questions is through a complete enumeration: you go
and collect data on everybody or everything you are interested in. For example, in the
first question you are interested in knowing about the percentage of all students in India,
but very quickly you understand that getting this kind of data might not be very easy.
So, many a time what we are interested in knowing is the percentage of all students in
India. Now, if I want to construct a database, I would want the actual data of all the
students who have passed Class 12. But if my intention is just to get an overall feel for
the kind of people who finally end up taking engineering, then one thing I would want to
do is work with a smaller subset of all the students in India. The set of all students in
India is what we refer to as a population. A smaller subset of this is referred to as a
sample. It is a subset, so I am marking it as a sample.
Now, many a time you might want to know about the prices of all houses. Again, you need
not go and find out about all the houses that have been sold in a particular year; you
might want to know about a smaller subset of the entire population. One thing you want
about the sample is that it should be as representative as possible.
Now, what do we mean by a representative sample? First, let me define the population: it
is the collection of all elements that we are interested in. If this is the population, let
me draw it in different colours here. What is the tool I use?
So, suppose this is a population and I take a subset here; this is a subset. The smaller
set is actually a subset of the larger set, but we very quickly notice that the smaller set
does not have any yellow elements in it. So, I cannot say this smaller set is a good
representative sample of the larger set.
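The picture just described can also be sketched in a few lines of Python. This is only an
illustrative sketch: the colours, the counts and the helper name `colour_shares` are made
up for the example and are not taken from the slide.

```python
from collections import Counter

# Hypothetical population: 100 elements in three colours (invented counts).
population = ["red"] * 50 + ["blue"] * 30 + ["yellow"] * 20


def colour_shares(elements):
    """Return the share of each colour present in a collection of elements."""
    counts = Counter(elements)
    total = len(elements)
    return {colour: counts[colour] / total for colour in counts}


# A subset that happens to contain no yellow elements at all
# (the first 60 elements are 50 red and 10 blue).
poor_sample = population[:60]

# A subset that keeps every fifth element, so each colour appears
# in about the same share as in the population.
better_sample = population[::5]

print("population:   ", colour_shares(population))
print("poor sample:  ", colour_shares(poor_sample))
print("better sample:", colour_shares(better_sample))
```

Both subsets are samples of the population, but only the second one reflects its colour
mix; the first one has no yellow elements, so it would mislead us about the population.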
(Refer Slide Time: 07:39)
So, a sample is basically a subgroup of the population that will be studied in detail.
You will be introduced to this concept of population and sample in greater detail when
you do your inferential statistics course. Nevertheless, the reason we need the concept of
a population and a sample in this course is that, eventually, when we come up with summary
statistics, we always need to understand whether a summary statistic is for a population
or for a sample, and this is something which we will see in due course.
Now, what do we mean by that? Let me demonstrate it to you through a data set.
This is again another hypothetical data set, which shows the names of cricket players.
All of us are very well aware of these cricket players: Tendulkar, Kohli, Dhoni. It shows
the matches they have played, in what role, the total runs, the batting average, the
highest score, wickets, bowling average and best bowling.
Now, suppose the purpose is just to understand what the total runs scored are, who has the
highest batting average, who has scored the most runs, who has played the most matches,
and who has taken the highest number of wickets. If these are the questions of interest,
then all of them can be answered directly from the data set.
I might also want to order the batsmen by the number of runs they have scored; I might
also want to know how the runs are spread among the batsmen. For all of this I only have
to describe the data. I do not have to do anything more with this data. So, in this case,
the purpose I have here is just to examine and explore the information that is given. I am
not asking anything more. I just want to describe the data set that is given here, and so
this study is descriptive.
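A descriptive study of this kind can be sketched as follows. This is a minimal sketch under
stated assumptions: the player names, the numbers and the column names are invented for
illustration and are not the values shown on the slide.

```python
# Hypothetical data set: names and numbers are invented for illustration.
players = [
    {"name": "Player A", "matches": 200, "runs": 8000, "wickets": 10},
    {"name": "Player B", "matches": 150, "runs": 6000, "wickets": 120},
    {"name": "Player C", "matches": 100, "runs": 4500, "wickets": 0},
]

# Each question is answered by reading the data set directly;
# nothing is being inferred beyond the rows we have.
total_runs = sum(p["runs"] for p in players)
most_matches = max(players, key=lambda p: p["matches"])["name"]
most_wickets = max(players, key=lambda p: p["wickets"])["name"]

# Order the players by runs scored, highest first.
ranked_by_runs = [
    p["name"] for p in sorted(players, key=lambda p: p["runs"], reverse=True)
]

print("Total runs:", total_runs)
print("Most matches:", most_matches)
print("Most wickets:", most_wickets)
print("Ranked by runs:", ranked_by_runs)
```

Every answer here is a statement about the rows in the data set and nothing else, which is
what makes the study descriptive.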
But one thing which we notice again in this data is the following. If you look at this
data, it is not the entire cricketing data about all the cricketers from all the
countries.
It is a sample from an entire population of data. It is just a small sample. I can say it
is at best a representative sample of the Indian cricketing data over the last 5 or 10
years, or perhaps the last decade. It is definitely a sample of the Indian cricketing
data.
But it is again not the entire population, which includes all batsmen and all cricketers.
However, if my inherent interest is just in summarizing this data, then my study is only
descriptive in nature, and for that descriptive statistics is sufficient.
But now suppose I am going to use this to draw further conclusions. For example, if I want
to know how the role a batsman plays relates to the batting average, I would need more
information. Or I may want to pick a team for the future; for example, we all know about
the IPL auctions and how people are chosen. So, there is a further role. I am not
interested just in describing this data.
The bigger interest for me is to use this data to gather or infer some information which I
am going to use in my decision-making process. There I am going to have an element of
chance, and what I need in that case is an inferential study. So, very often, we need to
understand whether the nature of our study is only going to be descriptive or whether we
want to do an inferential study.
A descriptive study might be performed either on a sample or on a population. Since in the
classes to come we will be talking about descriptive statistics in detail, we need to
understand whether a descriptive study is performed on a sample or on an entire
population. That is the reason why we introduced the notion of a sample and a population
at this stage.
However, if an inference is to be made about a population based on a sample, then the
study becomes inferential. Inferential statistics is not within the scope of this course;
however, you will be introduced to the concept of probability, which will help you
develop the methodology towards inferential statistics.
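To get a feel for the element of chance when a sample is used to say something about a
population, here is a minimal sketch. Everything in it is an assumption made for
illustration: the population size, the 30 percent share, the sample size of 200 and the
helper name `sample_share` are all invented, not real figures.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 students: 1 = studies engineering, 0 = does not.
population = [1] * 3000 + [0] * 7000
true_share = sum(population) / len(population)


def sample_share(size):
    """Draw a simple random sample without replacement and return the share of 1s."""
    sample = rng.sample(population, size)
    return sum(sample) / size


# Five different samples give five different estimates of the same population share.
estimates = [sample_share(200) for _ in range(5)]

print("Population share:", true_share)
print("Sample estimates:", estimates)
```

The estimates typically land near the population share but differ from sample to sample.
That variation is the element of chance, and probability is the tool used to reason
about it.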
So, in summary, you should know that the two main branches are descriptive statistics and
inferential statistics. Whether you do a descriptive study or an inferential study depends
on the purpose of your study.
If your intrinsic purpose is just to summarize your data, you would go for descriptive
statistics. But if your purpose is to infer into the future, or to infer about a larger
population using a smaller subset, you would go for inferential statistics. To understand
inferential statistics, you need to understand the concept of a population and a sample.