Module 4
7. Applications of Statistics
What is Data?
➢Data is often viewed as the lowest level of abstraction from which information and knowledge are derived.
➢Data can be numbers, words, measurements, observations or even just descriptions of things. Also, data is a
representation of a fact, figure and idea.
➢Data on its own carries no meaning. In order for data to become information, it must be interpreted and take
on a meaning.
An example of a raw data table: by itself it is just a collection of uninterpreted values.
Exploring Data
Generally one of the first things to do with new data is to get to know it by asking some general questions like
but not limited to the following:
• What variables are included? What information are we getting?
• What is the format of the variables: string, numeric, etc.?
• What type of variables: categorical, continuous, or discrete?
• Is this sample or population data?
Inferential Statistics
• Studies patterns, randomness and uncertainty in the data.
• Used to draw inferences about the process or population being studied.
• Used to draw conclusions and make future predictions by analysing numeric data.
Descriptive Statistics
Descriptive Statistics: Methods of organizing, summarizing, and presenting data in an informative way.
A Population is a collection of all possible individuals, objects, or measurements of interest.
A Sample is a portion, or part, of the population of interest.
Examples of inferential statistics
One of the primary purposes of classifying variables according to their level or scale of measurement is
to facilitate the choice of a statistical analysis used to analyze the data.
There are certain statistical analyses which are only meaningful for data which are measured at certain
measurement scales.
Statistical representation of data
Measures of Central Tendency & Dispersion
Descriptive Statistics: Tools for summarizing, organizing & simplifying data
• Graphic tools: Tables & Graphs
• Summary tools: Measures of Central Tendency, Measures of Variability
Inferential Statistics: Tools for generalizing beyond actual observations (generalize from a sample to a population)
• Estimation
• Hypothesis Testing
Mean, median and mode are different measures of central tendency
[Figure: Histogram of weekly returns of xyz equity prices]
Other Measures of Central Tendency…
Harmonic Mean:
If you travel from Gurgaon to Delhi at 40 miles per hour and from Delhi to Faridabad at 60 miles per hour, and the two
legs are the same distance, then your average speed is given by the Harmonic Mean of 40 and 60, which is 48 miles per
hour; that is, the total amount of time for the trip is the same as if you travelled the entire trip at 48 miles per hour.
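The harmonic mean example can be checked with a short Python sketch. It assumes both legs of the trip are the same length; the 120-mile leg used in the cross-check is a hypothetical figure chosen only to make the arithmetic easy.

```python
from statistics import harmonic_mean

# Speed on each leg of the trip; both legs are assumed to be equally long.
speeds_mph = [40, 60]

# Harmonic mean: number of values divided by the sum of their reciprocals.
average_speed = harmonic_mean(speeds_mph)

# Cross-check using total distance / total time.
# leg_miles is a hypothetical distance, not a figure from the slides.
leg_miles = 120
total_time_hours = sum(leg_miles / speed for speed in speeds_mph)
check_speed = (leg_miles * len(speeds_mph)) / total_time_hours

print(average_speed, check_speed)
```

The arithmetic mean of 40 and 60 would give 50, which overstates the average speed because more time is spent on the slower leg.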
The Central Tendencies Summary
Mean:
It's just the average of the data, computed as the sum of the data points divided by the number of points
Mode:
Mode is the most common value in the data set.
Tricky circumstances:
If no value occurs more than once, then there is no mode
If two values occur as frequently as each other and more frequently than any other, then there are two modes (in
the same way, there could also be more than two modes).
Median:
Median is the value in the middle of the data set, when the data points are arranged from smallest to largest.
If there is an odd number of data points, then just arrange them and look for the middle value
Tricky circumstances:
If there is an even number of data points, you will need to take the average of the two middle values.
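The three definitions, including the tricky cases, can be sketched with Python's standard `statistics` module. The helper name `modes` and the small data sets are illustrative assumptions, not part of the slides.

```python
from statistics import mean, median, multimode

def modes(values):
    """Return all modes; an empty list means no value occurs more than once."""
    if len(set(values)) == len(values):
        return []  # every value is unique, so there is no mode
    return multimode(values)  # one or more most-frequent values

# Hypothetical data sets chosen to hit each case.
odd_count = [3, 1, 4, 1, 5]    # odd number of points, single mode
even_count = [2, 4, 6, 8]      # even number of points, no mode
two_modes = [1, 1, 2, 2, 3]    # two values tie for most frequent

print(mean(odd_count), median(odd_count), modes(odd_count))
print(mean(even_count), median(even_count), modes(even_count))
print(modes(two_modes))
```

For `even_count` the median is the average of the two middle values, 4 and 6.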
Appropriate Measures of Central Tendency
The selection should be based on the level of measurement.
A parent wanting to know whether their child is better or worse than the typical child at his grade level - Mode
Are these sufficient?
• There is the man who drowned crossing a stream with an average depth of six inches. ~W.I.E. Gates
• Say you were standing with one foot in the oven and one foot in an ice bucket. According to the averages, you
should be perfectly comfortable.
e.g. x1, x2, x3 are the times taken to get to Delhi in different modes of transport:
          Auto   Office Transport   Own Car
          7      9                  1
          6      9                  3
          3      9                  5
          8      9                  7
          12     9                  9
          9      9                  9
          9      9                  9
          13     9                  11
          13     9                  13
          9      9                  15
          10     9                  17
Mean      9      9                  9
Median    9      9                  9
Mode      9      9                  9
NO!!!
Measures of Dispersion (Variance)
Distributions with different dispersions
Dispersion refers to the spread or
variability in the data.
▪ Standard Deviation
▪ Variance
▪ Percentiles/Quartiles
Variance and Standard Deviation
Standard deviation: The square root of the variance, $\sigma = \sqrt{\sigma^2}$.
Now try this…
          Auto   Office Transport   Own Car
          7      9                  1
          6      9                  3
          3      9                  5
          8      9                  7
          12     9                  9
          9      9                  9
          9      9                  9
          13     9                  11
          13     9                  13
          9      9                  15
          10     9                  17
Mean      9      9                  9
Median    9      9                  9
Mode      9      9                  9
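One way to work the exercise is sketched below in Python. The slide's variance formula did not survive extraction, so the population formulas (`pvariance`, `pstdev`, dividing by n) are an assumption; the sample versions (`variance`, `stdev`) divide by n - 1 and give slightly larger values.

```python
from statistics import mean, pstdev, pvariance

# Travel times from the table, one list per mode of transport.
travel_times = {
    "Auto": [7, 6, 3, 8, 12, 9, 9, 13, 13, 9, 10],
    "Office Transport": [9] * 11,
    "Own Car": [1, 3, 5, 7, 9, 9, 9, 11, 13, 15, 17],
}

summary = {}
for transport, times in travel_times.items():
    summary[transport] = {
        "mean": mean(times),
        "variance": pvariance(times),  # population variance (assumed convention)
        "std_dev": pstdev(times),      # square root of the variance
    }
    print(transport, summary[transport])
```

All three columns share a mean of 9, yet their spreads differ: Office Transport has zero variance, while Own Car varies the most.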
Dravid: 150 150 130 125 145 110 100 152 120 50 128
Sehwag: 230 240 150 50 173 23 20 300 45 1 128
          Dravid    Sehwag
Mean      123.636   123.636
Median    128       128
CV        24%       84%
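The CV (coefficient of variation) figures can be reproduced with the sketch below, which takes CV as standard deviation divided by mean. Using the sample standard deviation (`stdev`, dividing by n - 1) is an assumption; it is the convention that matches the 24% and 84% on the slide after rounding to whole percent.

```python
from statistics import mean, median, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(values) / mean(values)

dravid = [150, 150, 130, 125, 145, 110, 100, 152, 120, 50, 128]
sehwag = [230, 240, 150, 50, 173, 23, 20, 300, 45, 1, 128]

cv_dravid = coefficient_of_variation(dravid)
cv_sehwag = coefficient_of_variation(sehwag)

for name, scores, cv in [("Dravid", dravid, cv_dravid), ("Sehwag", sehwag, cv_sehwag)]:
    print(f"{name}: mean={mean(scores):.3f} median={median(scores)} CV={cv:.0%}")
```

Both batsmen have the same mean and median, so only a dispersion measure such as CV separates the consistent scorer from the erratic one.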
Skewness
Lack of Symmetry
• A distribution is skewed if one of its tails is longer than the other.
• If the distribution of the data is symmetric then skewness is zero
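A minimal Python sketch of a skewness calculation follows. The slides do not give a formula, so the moment-based definition used here (the average cubed z-score) is an assumption, and the `right_tailed` data set is hypothetical.

```python
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: the mean of the cubed standardized deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

# Symmetric around 9 (the Own Car travel times from the earlier table).
symmetric = [1, 3, 5, 7, 9, 9, 9, 11, 13, 15, 17]
# Hypothetical data with one large value stretching the right tail.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]

print(skewness(symmetric))
print(skewness(right_tailed))
```

A longer right tail gives a positive value, a longer left tail a negative one, and symmetric data gives zero up to floating-point rounding.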
Kurtosis
• Leptokurtic
• Mesokurtic
• Platykurtic
Descriptive statistics (using Excel’s data analysis tool)
Let’s get some descriptive statistics for this data. In Excel go to Tools – Data
Analysis. If you do not see the “Data Analysis” option you need to install it: go to
Tools – Add-Ins, and in the window that pops up check the “Analysis ToolPak” option,
then press OK. Try running data analysis again.
Descriptive statistics
✓Notice how each of the following examples is used to illustrate the data.
✓Choose the best graph form to express your results.
Graphical Representation of variables
Bar Graph
• A bar graph is used to show relationships between groups.
• The two items being compared do not need to affect each other.
• It's a fast way to show big differences.
Pie Graph
• A circle graph is used to show how a part of something relates to the whole.
• This kind of graph is needed to show percentages effectively.
Line Graph
• A line graph is used to show continuing data; how one thing is affected by another.
• The rises and falls of a line graph show how things are going.
Notice how easy it is to read a bar graph.
[Figure: "Chocolate Milk Sold" drawn as a bar graph, a pie graph and a line graph. Axes: Day (Monday to Friday) against
Amount Sold (0 to 120). Data labels: 112, 76, 72, 53, 33.]
On what day did they sell the most chocolate milk?
a. Tuesday b. Friday c. Wednesday
On what day was the least amount of chocolate milk sold?
a. Monday b. Tuesday c. Thursday
On what day did they have a drop in chocolate milk sales?
a. Thursday b. Tuesday c. Monday
Graphical Representation of variables
Histogram
▪ A histogram is a special kind of bar chart which allows us to visualize the distribution of values of an
ordinal/continuous variable
▪ Can be developed in Excel 2007 through Data>>>data analysis>>>histogram
Line charts
▪ A representation of data varying over time, e.g. commodity prices
▪ It provides insights like trend of the data, seasonality or presence of outliers
[Figure: Brent – 1 month forwards, $/barrel]
Ogives
▪ In statistics, an ogive is a graph showing the curve of a cumulative distribution function.
▪ It provides insights like distribution of population within a given range
SOURCE: Wikipedia
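The link between a histogram and an ogive can be sketched in plain Python: the histogram counts values per bin, and the ogive plots the running total of those counts at each bin's upper edge. The scores are Dravid's from the CV example; the bin edges and the helper name `histogram_counts` are illustrative assumptions.

```python
from itertools import accumulate

def histogram_counts(values, bin_edges):
    """Count values per bin [lo, hi); the last bin also includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for value in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            if lo <= value < hi or (i == last and value == hi):
                counts[i] += 1
                break
    return counts

scores = [150, 150, 130, 125, 145, 110, 100, 152, 120, 50, 128]
bin_edges = [50, 80, 110, 140, 170]  # hypothetical bins of width 30

counts = histogram_counts(scores, bin_edges)   # bar heights of the histogram
cumulative = list(accumulate(counts))          # running totals
ogive_points = list(zip(bin_edges[1:], cumulative))  # (upper edge, cumulative count)

print(counts)
print(ogive_points)
```

The last ogive point always equals the number of observations, which is a quick sanity check on the binning.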
Choosing the Right Graph
• Use a bar graph if you are not looking for trends (or patterns) over time; and the items (or
categories) are not parts of a whole.
• Use a pie chart if you need to compare different parts of a whole, there is no time involved and
there are not too many items (or categories).
• Use a line graph if you need to see how a quantity has changed over time. Line graphs enable
us to find trends (or patterns) over time.
Common Chart Types
Outliers
• An outlier is an observation that is numerically distant from the rest of the data.
• An outlying observation, or outlier, is one that appears to deviate markedly from other
members of the sample in which it occurs.
• Outliers can occur by chance in any distribution, but they are often indicative either of
measurement error or that the population has a heavy-tailed distribution.
Bill Gates makes $500 million a year. He’s in a room with 9 teachers, 4 of whom
make $40k, 3 make $45k, and 2 make $55k a year. What is the mean salary of
everyone in the room? What would be the mean salary if Gates wasn’t included?
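The two means asked for can be worked out with a short Python sketch. The median is added only as a comparison, to show a measure that the outlier barely moves; it is not part of the original question.

```python
from statistics import mean, median

# Salaries from the example: 4 teachers at $40k, 3 at $45k, 2 at $55k.
teacher_salaries = [40_000] * 4 + [45_000] * 3 + [55_000] * 2
gates_salary = 500_000_000

everyone = teacher_salaries + [gates_salary]

mean_with_gates = mean(everyone)
mean_without_gates = mean(teacher_salaries)
median_with_gates = median(everyone)  # comparison only, not asked in the slide

print(mean_with_gates, mean_without_gates, median_with_gates)
```

A single outlier lifts the mean from $45,000 to $50,040,500, while the median of the full room stays at $45,000.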
b. As a rule of thumb, if a variable has more than 5% missing values, cases are not deleted.
b. The imputed dataset is a "what-if" hypothetical dataset which relies on estimation, though it is a "best guess" attempt
to present what choices respondents are likely to have made, given their responses on other items.
c. It is preferable to run all analyses on both the original and imputed datasets, and discuss in the report where
imputation would make a difference for the substantive interpretations.
Normal Distributions
• The normal distribution is a pattern for the distribution of a set of data which follows a bell-shaped
curve. This is also called the Gaussian distribution.
• Normal Distribution has the mean, the median, and the mode all coinciding at its peak and with
frequencies gradually decreasing at both ends of the curve.
• The normal distribution is a theoretical ideal distribution. Real-life empirical distributions never
match this model perfectly. However, many things in life do approximate the normal distribution,
and are said to be “normally distributed.”
The Bell Shaped Curve
• The bell shaped curve has the following characteristics:
• The curve is concentrated in the center and decreases on either side.
68-95-99.7 Rule
[Figure: about 68% of the data lies within one standard deviation of the mean]
$$Z = \frac{X - \mu}{\sigma}$$
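The 68-95-99.7 rule and the Z formula can be checked numerically with `statistics.NormalDist` from the Python standard library. The helper names and the example values passed to `z_score` are illustrative assumptions.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

# Hypothetical example: a value of 130 when the mean is 100 and sigma is 15.
print(z_score(130, mu=100, sigma=15))
```

The three shares come out close to 68%, 95% and 99.7%, which is where the rule gets its name.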
If birth weights in a population are normally distributed with a mean of 109 oz and a standard
deviation of 13 oz:
a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling
birth records at random?
b. What is the chance of obtaining a birth weight of 120 oz or lighter?
Answer
a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling birth records at random?
From the chart or SAS → Z of 2.46 corresponds to a right tail (greater than) area of: 0.0069
b. What is the chance of obtaining a birth weight of 120 oz or lighter?
From the chart → Z of .85 corresponds to a left tail area of: 0.8023
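The same tail areas can be computed without a Z chart, as in the Python sketch below. It uses the unrounded Z values, so the results differ slightly in the later decimals from a lookup that rounds Z to two places first.

```python
from statistics import NormalDist

# Birth weights: normal with mean 109 oz and standard deviation 13 oz.
birth_weights = NormalDist(mu=109, sigma=13)

# (a) 141 oz or heavier: right-tail area above 141.
z_a = (141 - 109) / 13
p_heavier = 1 - birth_weights.cdf(141)

# (b) 120 oz or lighter: left-tail area below 120.
z_b = (120 - 109) / 13
p_lighter = birth_weights.cdf(120)

print(f"a. Z = {z_a:.2f}, P(X >= 141) = {p_heavier:.4f}")
print(f"b. Z = {z_b:.2f}, P(X <= 120) = {p_lighter:.4f}")
```

A weight of 141 oz or more is rare (under 1% of births), while roughly 80% of births weigh 120 oz or less.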
• Modern portfolio theory assumes that the returns of a diversified asset portfolio follow a normal
distribution.
Correlation: The relationship between two variables over a period, especially one that shows a
close match between the variables' movements.
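How closely two variables move together can be quantified with a correlation coefficient. The slides do not name a specific measure, so Pearson's r is used in this Python sketch as an assumption, and the height and weight figures are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of products of deviations, and sums of squared deviations.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)

# Hypothetical measurements for five children.
heights_cm = [100, 110, 120, 130, 140]
weights_kg = [16, 19, 23, 27, 32]

r_height_weight = pearson_r(heights_cm, weights_kg)
r_opposite = pearson_r([1, 2, 3], [6, 4, 2])  # perfectly opposite movement

print(round(r_height_weight, 3), round(r_opposite, 3))
```

Values near +1 mean the variables rise together, values near -1 mean one falls as the other rises, and values near 0 mean little linear relationship.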
Correlation measures may be misleading in certain scenarios
Correlation and independence
Spurious correlation
A spurious relationship is a mathematical relationship in
which two events or variables have no direct causal
connection, yet it may be wrongly inferred that they do, due to
either coincidence or the presence of a certain third, unseen
factor (referred to as a "confounding factor" or "lurking
variable")
Examples
• An increase in height results in a weight increase for children
• Attending lessons leads to improved grades
• The age of a car impacts its stopping distance
• The more years of education, the higher the income
Business Examples
• Rising unemployment leads to a decrease in sales of "Taste the Difference" products
• An increase in demand for a product leads to an increase in supply
• The more efficient the workers, the higher the productivity