Introduction To Statistics: "There Are Three Kinds of Lies: Lies, Damned Lies, and Statistics." (B.Disraeli)


Introduction To Statistics

“THERE ARE THREE KINDS OF LIES: LIES, DAMNED LIES, AND STATISTICS.”
(B. DISRAELI)
Why study statistics?
1. Data is everywhere
2. Statistical techniques are used to make many decisions that affect our lives
3. No matter what your career, you will make professional decisions that involve
data. An understanding of statistical methods will help you make these
decisions effectively
Statistics
Methodology for:
collecting, classifying, summarizing, organizing, presenting, analyzing and
interpreting numerical information

Think card games (vs Chess).


Statistical Processes
1. Descriptive: numerical and graphical methods to find patterns in the data.
Summarizes the information the data reveal and presents it in a meaningful way

2. Inferential: uses data to make inferences about features of the environment
from which it was selected or about the underlying mechanism that generated it.
Uses data to make estimates and predictions
DEDUCTION VS INDUCTION
Why is Virat Kohli a modern great?
Deduction: He has strong wrists, hand-eye coordination, dedication, etc. Greats tend to
have all of these qualities. Therefore, Virat is a modern great.
Start with a set of assumptions; if those assumptions are correct and the argument is
valid, then the conclusion is correct (true). Provides certainty.

Induction: I have seen him bat wonderfully. He performs better than other batsmen
around him. Look at his statistics. Based on observations.
But even if 10,000,000 observations support the claim, that does not necessarily mean that the
next one will.
India’s banking situation
Population vs Sample

Population refers to the entirety of the things/members/objects being investigated


◦ Number of fish eaters in India.
◦ Population: all Indians
◦ Census – Difficult, Expensive

Alternative: Sample – a selected part of the population used to estimate population
characteristics
◦ National Sample Survey – subset of all Indians
◦ How to decide on a sample?
◦ How to extrapolate?
Sampling
A sample should have the same characteristics as the population it is representing.
Sampling can be:
with replacement: a member of the population may be chosen more than once (picking a candy
from the bowl and putting it back before the next pick)
without replacement: a member of the population may be chosen only once (lottery ticket)
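The difference between the two schemes can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard `random` module; the ten-member population and the fixed seed are assumptions made for the example, not part of the slides.

```python
import random

# Hypothetical population: ten numbered members (illustrative only).
population = list(range(1, 11))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# With replacement: every draw is made from the full population,
# so the same member can appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a member that has been drawn cannot be drawn again,
# so all sampled members are distinct.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Repeats are possible only in the first sample; the second always contains five different members.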
PRIMARY VS SECONDARY DATA

Primary – collected for a specific purpose directly from the field


◦ Census, RBI bulletin
◦ Likely to be more reliable
◦ Debate about Indian GDP numbers?

Secondary – data compiled from primary sources


◦ Limitation: collected for a different purpose (missing data)
◦ Demonetization: Good or Bad - think about the kind of data you would need
Inferential Statistics
Estimation
◦ e.g., Estimate the population mean weight using the sample mean weight

Hypothesis testing
◦ e.g., Test the claim that the population mean weight is 70 kg

Inference is the process of drawing conclusions or making decisions about a
population based on sample results
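Both ideas can be sketched with the weight example. The ten weights below are made-up values used only for illustration, and the one-sample t statistic is one common way to test a claim about a mean; the slides do not say which test is intended.

```python
import math
import statistics

# Hypothetical sample of body weights in kg (made-up values for illustration).
sample = [68.2, 71.5, 69.8, 74.1, 66.9, 70.3, 72.6, 67.4, 73.0, 69.1]

n = len(sample)
sample_mean = statistics.mean(sample)    # point estimate of the population mean
sample_sd = statistics.stdev(sample)     # sample standard deviation (n - 1 divisor)
standard_error = sample_sd / math.sqrt(n)

# Hypothesis test: the claim (null hypothesis) is a population mean of 70 kg.
claimed_mean = 70.0
t_statistic = (sample_mean - claimed_mean) / standard_error

print(f"estimated mean weight: {sample_mean:.2f} kg")
print(f"t statistic against 70 kg: {t_statistic:.2f}")
```

A t statistic close to zero means the sample is consistent with the claim. Deciding how far from zero is too far requires a t table or statistical software, using n − 1 degrees of freedom.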
Random sampling methods
simple random sample (each sample of the same size has an equal chance of being
selected)
stratified sample (divide the population into groups called strata and then take a
sample from each stratum)
cluster sample (divide the population into groups called clusters and then randomly select some of the
clusters. All the members from these clusters are in the cluster sample.)
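The three methods can be compared side by side. This is a minimal sketch on a hypothetical population of 12 people grouped by region; the regions, the sample sizes and the seed are all assumptions chosen for the example.

```python
import random

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical population: 12 people tagged with a region (illustrative only).
population = [
    {"id": i, "region": region}
    for i, region in enumerate(["North", "South", "East"] * 4)
]

# Simple random sample: every subset of size 4 is equally likely.
simple = rng.sample(population, k=4)

# Group members by region; the groups serve as strata or as clusters below.
groups = {}
for person in population:
    groups.setdefault(person["region"], []).append(person)

# Stratified sample: take a random sample (here 2 members) from EVERY group.
stratified = [p for members in groups.values() for p in rng.sample(members, k=2)]

# Cluster sample: randomly pick SOME groups, then keep ALL of their members.
chosen_regions = rng.sample(sorted(groups), k=2)
cluster = [p for region in chosen_regions for p in groups[region]]

print(len(simple), len(stratified), len(cluster))
```

The stratified sample is guaranteed to contain every region, while the cluster sample contains only the chosen regions but includes all of their members.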
Data
Statistical data are usually obtained by counting or measuring items. Most data
can be put into the following categories:
Qualitative - data are measurements that each fall into one of several
categories (hair color, ethnic groups and other attributes of the population)
Quantitative - data are observations that are measured on a numerical scale
(distance traveled to college, number of children in a family, etc.)
◦ Discrete vs Continuous
Scales of measurement:
Nominal – consists of categories, for each of which the number of observations is
recorded. The categories have no logical order and no particular relationship to one
another. They are mutually exclusive, since an individual, object, or measurement can
be included in only one of them.
Ordinal – the order of the values is what is important and significant, but the
differences between them are not really known.
Interval – numeric scales in which we know both the order and the exact differences
between the values.
Ratio – consists of numerical measurements where the distance between numbers is
of a known, constant size; in addition, there is a nonarbitrary zero point.
Frequency distributions – numerical presentation of quantitative data

Frequency distribution – shows the frequency, or number of occurrences, in each
of several categories. Frequency distributions are used to summarize large
volumes of data values.

When the raw data are measured on a quantitative scale, either interval or
ratio, categories or classes must be designed for the data values before a
frequency distribution can be formulated.
Steps for constructing a frequency distribution
1. Determine the number of classes: $m = \sqrt{n}$
2. Determine the size of each class
3. Determine the starting point for the first class
4. Tally the number of values that occur in each class
5. Prepare a table of the distribution using actual counts and/ or percentages
(relative frequencies)

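Class size: $h = \dfrac{\max - \min}{m}$, where max and min are the largest and smallest data values and m is the number of classes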
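The five steps can be sketched in Python as follows. The data values are made up for illustration. Taking the number of classes as the rounded square root of the number of observations is a common rule of thumb and is an assumption here, as is starting the first class at the smallest value.

```python
import math

# Hypothetical raw data: distance traveled to college in km (made-up values).
data = [2, 5, 7, 3, 12, 9, 4, 15, 8, 6, 11, 1, 10, 14, 5, 7]

n = len(data)
m = round(math.sqrt(n))      # step 1: number of classes (assumed square-root rule)
low, high = min(data), max(data)
h = (high - low) / m         # step 2: class size, (max - min) / m
start = low                  # step 3: first class starts at the minimum (assumed)

# Step 4: tally the values that fall in each class [lower, upper).
counts = [0] * m
for x in data:
    k = min(int((x - start) / h), m - 1)  # the maximum goes into the last class
    counts[k] += 1

# Step 5: table of the distribution with counts and relative frequencies.
for k, count in enumerate(counts):
    lower = start + k * h
    upper = lower + h
    print(f"{lower:5.1f} to {upper:5.1f}  {count:3d}  {count / n:6.1%}")
```

Each class includes its lower limit and excludes its upper limit, except the last class, which also takes the maximum so that no value is left out.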
Charts and graphs
Frequency distributions are a good way to present the essential aspects of data
collections in concise and understandable terms
Pictures are always more effective in displaying large data collections
Histogram
Frequently used to graphically present interval and ratio data
The adjacent bars indicate that a numerical range is being summarized by
showing the frequencies in arbitrarily chosen classes
Frequency polygon
Another common method for graphically presenting interval and ratio data
To construct a frequency polygon mark the frequencies on the vertical axis and
the values of the variable being measured on the horizontal axis, as with the
histogram.
If the purpose of the presentation is comparison with other distributions, the
frequency polygon provides a good summary of the data
Ogive
A graph of a cumulative frequency distribution
An ogive is used when one wants to determine how many observations lie above or below a certain
value in a distribution.
First, a cumulative frequency distribution is constructed
Cumulative frequencies are plotted at the upper class limit of each category
An ogive can also be constructed for a relative frequency distribution.
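The points of an ogive can be computed before any plotting is done. In this minimal sketch the class limits and frequencies are hypothetical, and each class is assumed to include its upper limit.

```python
from itertools import accumulate

# Hypothetical frequency distribution: upper class limits and class frequencies.
upper_limits = [10, 20, 30, 40, 50]
frequencies = [3, 7, 12, 6, 2]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))             # running totals
cumulative_relative = [c / total for c in cumulative]  # for a relative ogive

# Each (upper class limit, cumulative frequency) pair is one point of the ogive.
ogive_points = list(zip(upper_limits, cumulative))

# Reading the ogive at the value 30: observations up to 30 and above 30.
below_30 = dict(ogive_points)[30]
above_30 = total - below_30

print(ogive_points)
print(below_30, above_30)
```

Joining the points with line segments gives the ogive. Using `cumulative_relative` in place of `cumulative` gives the version for a relative frequency distribution.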
Pie Chart
The pie chart is an effective way of displaying the percentage breakdown of data
by category.
Useful if the relative sizes of the data components are to be emphasized
Pie charts also provide an effective way of presenting ratio- or interval-scaled
data after they have been organized into categories
Bar chart
Another common method for graphically presenting nominal and ordinal scaled
data
One bar is used to represent the frequency for each category
The bars are usually positioned vertically with their bases located on the
horizontal axis of the graph
The bars are separated, and this is why such a graph is frequently used for
nominal and ordinal data – the separation emphasizes the plotting of frequencies
for distinct categories
Time Series Graph
The time series graph is a graph of data that have been
measured over time.
The horizontal axis of this graph represents time periods
and the vertical axis shows the numerical values
corresponding to these time periods
Happy Holi
