Module 4

The document discusses topics related to statistics including types of data, levels of measurement, descriptive statistics, inferential statistics, analyzing individual variables through measures of central tendency and dispersion, exploring relationships among variables through correlation and hypothesis testing, and applications of statistics. Key points covered include defining categorical and quantitative data, explaining descriptive and inferential statistics, and analyzing individual and relationships among variables.

Uploaded by

jalaj.joshi2020
Copyright
© © All Rights Reserved

Topics

1. Introduction to Data, Types of Data


2. Levels of Measurement
3. Definition and Uses of Statistics
4. Types of Statistics – Descriptive, Inferential
5. Analyzing Individual Variables-
• Measures of Central Tendency and Dispersion

• Using graphs to Explore data

• Preliminary Analysis: Outlier detection, Missing value treatment


• Normal Distribution – Bell curve, Z score
• Descriptive statistics using Excel
6. Analyzing Relationship among Variables
• Correlation: Correlation coefficient, Correlation Matrix, 2D Scatter plot

• Inferential Statistics- Testing of Hypothesis, P-Value Concept

• Frequently used test- T-Test, F-Test etc.

7. Applications of Statistics
What is Data?
➢Data is often viewed as the lowest level of abstraction from which information and knowledge are derived.

➢Data can be numbers, words, measurements, observations or even just descriptions of things. Also, data is a
representation of a fact, figure and idea.

➢Data on its own carries no meaning. In order for data to become information, it must be interpreted and take
on a meaning.
An example of raw data table. It is just a collection of random info and data.
Exploring Data
Generally one of the first things to do with new data is to get to know it by asking some general questions like
but not limited to the following:
• What variables are included? What information are we getting?
• What is the format of the variables: string, numeric, etc.?
• What type of variables: categorical, continuous, and discrete?
• Is this sample or population data?

After looking at the data you may want to know


• How many males/females?
• What is the average age?
• How many undergraduate/graduate students?
• What is the average SAT score? Is it the same for graduates and undergraduates?
• Who reads the newspaper more frequently: men or women?
Types of Data / Variables

Variable
A variable is a value that may change within the scope of a given problem or set of operations.

Categorical Data is data that is non-numeric,
e.g. favorite color, place of birth, types of car.

Nominal
▪ Values do not have ordering
▪ Example: categorical variables like color, nationality and so on
Ordinal
▪ Values are ordered
▪ Example: RSS scores

Quantitative Data is numerical. There are 2 types of quantitative data.
1. Discrete data can only take specific values;
e.g. shoe size, number of brothers, number of cars in a car park.
2. Continuous data can take any numerical value;
e.g. height, mass, length.

Discrete
Random variable which takes only isolated values in its range of variation. For example, number of heads in 10 tosses of a coin.
Continuous
Random variable which takes any value in its range of variation. For example, height of a person.
Examine the differences between Categorical and Quantitative data.

Categorical Data
• Deals with descriptions.
• Data can be observed but not measured.
• Colors, textures, smells, tastes, appearance, beauty, etc.
• Categorical → Category

Quantitative Data
• Deals with numbers.
• Data which can be measured.
• Length, height, area, volume, weight, speed, time, temperature, humidity, sound levels, cost, members, ages, etc.
• Quantitative → Quantity

Ex: Oil Painting

Categorical
1. blue/green color, gold frame
2. smells old and musty
3. texture shows brush strokes of oil paint
4. peaceful scene of the country
5. masterful brush strokes

Quantitative
1. picture is 10" by 14"
2. with frame 14" by 18"
3. weighs 8.5 pounds
4. surface area of painting is 140 sq. in.
5. cost $300
What is Statistics?

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions.
Types of Statistics
Descriptive Statistics
Study the basic features of the data that describe what is or what the data shows.
Statistical methods can be used to summarize or describe a collection of data.
Involves the analysis of numeric data, pictures, graphs and figures.

Inferential Statistics
Study patterns, randomness and uncertainty in the data.
Used to draw inferences about the process or population being studied.
Used to make conclusions and future predictions by analysing numeric data.
Descriptive Statistics
Descriptive Statistics: Methods of organizing, summarizing, and presenting data in an informative way.

EX 1: A Gallup poll found that 49% of the people in a survey knew the name of the first book of the Bible. The statistic 49 describes the number out of every 100 persons who knew the answer.

EX 2: According to Consumer Reports, General Electric washing machine owners reported 9 problems per 100 machines during 2001. The statistic 9 describes the number of problems out of every 100 machines.
Inferential Statistics
Inferential Statistics: A decision, estimate, prediction, or generalization about a population, based on a
sample.

A Population is a collection of all possible individuals, objects, or measurements of interest.

A Sample is a portion, or part, of the population of interest.
Examples of inferential statistics

Example 1: TV networks constantly monitor the popularity of their programs by hiring Nielsen and other organizations to sample the preferences of TV viewers.

Example 2: Wine tasters sip a few drops of wine to make a decision with respect to all the wine waiting to be released for sale.

Example 3: The accounting department of a large firm will select a sample of the invoices to check for accuracy for all the invoices of the company.
The Data Type and Statistical Analysis
Who Cares?

The type(s) of data collected in a study determine the type of statistical analysis used.

One of the primary purposes of classifying variables according to their level or scale of measurement is to facilitate the choice of a statistical analysis used to analyze the data.
Certain statistical analyses are only meaningful for data measured at certain measurement scales.
Statistical representation of data

For example ...

Categorical data are commonly summarized using "frequencies/percentages" (or "proportions").
• 11% of students have a tattoo
• 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen, sophomores, juniors, and seniors.

And for example ...

Measurement data are typically summarized using "averages" (or "means").
• Average number of siblings Fall 1998 Stat 250 students have is 1.9.
• Average weight of male Fall 1998 Stat 250 students is 173 pounds.
• Average weight of female Fall 1998 Stat 250 students is 138 pounds.
Analyzing Individual Variables- Univariate Independent Analysis
- Measures of Central Tendency
- Measures of Dispersion/Variability
- Using graphs to Explore data
Preliminary Analysis:
- Missing data
- Outlier detection
- Normal Distribution
Recall the Branches of Statistics
Statistics

Descriptive Statistics
Descriptive statistics describes the data set that's being analyzed, but doesn't allow us to draw any conclusions or make any inferences about the data.
Tools for summarizing, organizing & simplifying data (summary tools and graphic tools):
• Tables & Graphs
• Measures of Central Tendency
• Measures of Variability (Dispersion)

Inferential Statistics
Inferential statistics is a set of methods that is used to draw conclusions or inferences about the characteristics of populations based on data from a sample.
Tools for generalizing beyond actual observations (generalize from a sample to a population):
• Estimation
• Hypothesis Testing
Mean, median and mode are different measures of central tendency
[Figure: Histogram of weekly returns of xyz equity prices. x-axis: weekly returns; y-axis: percent. Median = -1, Mean = 202.]

Measure
Mean
• It is the easiest metric to understand and communicate
• Mean is prone to presence of outliers
Median
• Median is more "robust" to presence of outliers
• It is more complicated to communicate
Mode
• Not very practical since it is affected by skewness
• Most real-life distributions are multimodal
Other Measures of Central Tendency…

Weighted Mean: the sum of each score multiplied by its weight, divided by the sum of the weights. The slide's worked example gives 588/28 = 21.

Geometric Mean: The geometric mean is the nth root of the product of the scores (used for logarithmic distributions).

Harmonic Mean:

If you travel from Gurgaon to Delhi at 40 miles per hour and from Delhi to Faridhabad at 60 miles per hour, and the two legs cover the same distance, then your average speed is given by the Harmonic Mean of 40 and 60, which is 48 miles per hour; that is, the total amount of time for the trip is the same as if you travelled the entire trip at 48 miles per hour.
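
The three means above can be checked with a short Python sketch. It relies on the standard-library `statistics` module; the `weighted_mean` helper and the sample scores, weights and growth factors are hypothetical illustration values, since the slide's own data behind 588/28 is not shown.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Harmonic mean of the two leg speeds (assumes both legs are the same distance).
avg_speed = statistics.harmonic_mean([40, 60])

# Geometric mean: nth root of the product of the scores.
# These growth factors are made-up illustration data.
growth_factors = [1.10, 1.20, 0.90]
geo_mean = statistics.geometric_mean(growth_factors)

# Hypothetical scores and weights, not the slide's data.
wm = weighted_mean([10, 20, 30], [3, 2, 1])

print(avg_speed, geo_mean, wm)
```

The harmonic mean is the right average here only because the distances are equal; if the times were equal instead, the ordinary arithmetic mean of the speeds would apply.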
The Central Tendencies Summary
Mean:
It's just the average of the data, computed as the sum of the data points divided by the number of points

Mode:
Mode is the most common value in the data set.
Tricky circumstances:
If no value occurs more than once, then there is no mode
If two values occur as frequently as each other and more frequently than any other, then there are two modes (in
the same way, there could also be more than two modes).

Median:
Median is the value in the middle of the data set, when the data points are arranged from smallest to largest.
If there is an odd number of data points, then just arrange them and look for the middle value
Tricky circumstances:
If there is an even number of data points, you will need to take the average of the two middle values.
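
A minimal sketch of the tricky cases above, using Python's standard `statistics` module. The small data sets are hypothetical, and the `modes` helper is an assumption added here so that "no value occurs more than once" returns an empty list, which is the slide's convention (Python's own `multimode` would return every value in that case).

```python
import statistics
from collections import Counter

def modes(data):
    """Return all most-common values, or [] if no value occurs more than once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)

# Odd number of points: the median is the middle value after sorting.
median_odd = statistics.median([9, 3, 7, 6, 8])

# Even number of points: the median is the average of the two middle values.
median_even = statistics.median([8, 3, 7, 6])

no_mode = modes([1, 2, 3])            # no value repeats
two_modes = modes([1, 1, 2, 2, 3])    # two values tie for most frequent

print(median_odd, median_even, no_mode, two_modes)
```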
Appropriate Measures of Central Tendency
The selection should be based on level-of-measurement.

Tips for selecting


use the mode when...
variables are measured at the nominal level
you want a quick and easy measure for ordinal and interval-ratio variables
you want to report the most common score
use the median when...
variables are measured at the ordinal level
variables measured at the interval-ratio level have highly skewed distributions
you want to report the central score. The median always lies at the exact center of a distribution.
use the mean when…
variables are measured at the interval-ratio level
you want to report the typical score. The mean is "the fulcrum that exactly balances all of the scores."
you anticipate additional statistical analysis.
Examples
What is a typical student in the class doing? - Mean

To compare performance of any single student against group - Median

A parent wanting to know whether their child is better or worse than a typical child at his grade level - Mode
Are these sufficient?
• There is the man who drowned crossing a stream with an average depth of six inches. ~W.I.E. Gates
• Say you were standing with one foot in the oven and one foot in an ice bucket. According to the averages, you should be perfectly comfortable.

e.g. x1, x2, x3 are the times taken to get to Delhi in different modes of transport

          Auto   Office Transport   Own Car
          7      9                  1
          6      9                  3
          3      9                  5
          8      9                  7
          12     9                  9
          9      9                  9
          9      9                  9
          13     9                  11
          13     9                  13
          9      9                  15
          10     9                  17
Mean      9      9                  9
Median    9      9                  9
Mode      9      9                  9

NO!!!
Measures of Dispersion (Variance)
Distributions with different dispersions
Dispersion refers to the spread or variability in the data.

It determines how spread out the scores are around the mean.

The basic question being asked is: how much do the scores deviate around the mean? The more "bunched up" the scores are around the mean, the better your ability to make accurate predictions.
Measures of Dispersion
▪ Range
▪ Inter-Quartile Range
▪ Mean Deviation
▪ Standard Deviation
▪ Variance
▪ Percentiles/Quartiles

▪ Box-plot
• Reveals the spread of the data
• Outliers defined using the fences Q1 - 1.5(Q3-Q1) and Q3 + 1.5(Q3-Q1)
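
The box-plot outlier rule can be sketched in Python as follows. This is an illustration under stated assumptions: quartiles are taken from `statistics.quantiles` with its default method (other quartile conventions give slightly different fences), and the data set is a set of commute times with one hypothetical extreme value of 40 appended.

```python
import statistics

def iqr_fences(data):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Commute times plus one made-up extreme value (40) for illustration.
times = [7, 6, 3, 8, 12, 9, 9, 13, 13, 9, 10, 40]

lower, upper = iqr_fences(times)
outliers = [t for t in times if t < lower or t > upper]

print(lower, upper, outliers)
```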
Variance and Standard Deviation

Variance: the arithmetic mean of the squared deviations from the mean.

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$

X is the value of an observation in the population
$\mu$ is the arithmetic mean of the population
N is the number of observations in the population

Standard deviation: the square root of the variance.

$$\sigma = \sqrt{\sigma^2}$$
Now try this…

           Auto   Office Transport   Own Car
           7      9                  1
           6      9                  3
           3      9                  5
           8      9                  7
           12     9                  9
           9      9                  9
           9      9                  9
           13     9                  11
           13     9                  13
           9      9                  15
           10     9                  17
Mean       9      9                  9
Median     9      9                  9
Mode       9      9                  9

Std Dev    3.0    0.0                4.9

Variance   9.2    0.0                24.0
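
The table can be reproduced with Python's standard `statistics` module. One observation from doing so: the table's figures (9.2 and 24.0) match the sample formula, which divides by n - 1, rather than the population formula with N in the denominator. The sketch below computes both so the difference is visible; the dictionary layout is just an illustrative choice.

```python
import statistics

times = {
    "Auto": [7, 6, 3, 8, 12, 9, 9, 13, 13, 9, 10],
    "Office Transport": [9] * 11,
    "Own Car": [1, 3, 5, 7, 9, 9, 9, 11, 13, 15, 17],
}

summary = {}
for name, xs in times.items():
    summary[name] = {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "sample_var": statistics.variance(xs),  # divides by n - 1
        "pop_var": statistics.pvariance(xs),    # divides by N
        "sample_sd": statistics.stdev(xs),      # square root of sample_var
    }

for name, row in summary.items():
    print(name, row)
```

All three columns share the same mean, median and mode, yet their spreads differ sharply, which is why a measure of dispersion is needed alongside the average.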


Coefficient of Variation
The coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution. It is defined as the ratio of the standard deviation to the mean:

$$CV_x = \frac{\sigma}{\bar{x}} \times 100$$

• Measure of relative dispersion
• Always a %
• Shows variation relative to mean
• Used to compare 2 or more groups

Which Cricketer do you like? Who is more consistent?

Dravid: 150 150 130 125 145 110 100 152 120 50 128
Sehwag: 230 240 150 50 173 23 20 300 45 1 128

          Dravid    Sehwag
Mean      123.636   123.636
Median    128       128
CV        24%       84%
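
A short Python sketch of the comparison, using the scores from the slide. It assumes the sample standard deviation (n - 1 denominator), which is the choice that reproduces the 24% and 84% figures; the function name is hypothetical.

```python
import statistics

def coefficient_of_variation(xs):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(xs) / statistics.mean(xs) * 100

dravid = [150, 150, 130, 125, 145, 110, 100, 152, 120, 50, 128]
sehwag = [230, 240, 150, 50, 173, 23, 20, 300, 45, 1, 128]

cv_dravid = coefficient_of_variation(dravid)
cv_sehwag = coefficient_of_variation(sehwag)

print(round(cv_dravid), round(cv_sehwag))
```

Both players have the same mean and median, so the lower CV is what marks Dravid as the more consistent scorer.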
Skewness
Lack of Symmetry
• A distribution is skewed if one of its tails is longer than the other.
• If the distribution of the data is symmetric then skewness is zero

Positive Skew
• This means that the distribution has a long tail to the right
• Mean > Median > Mode

Negative Skew
• This means that the distribution has a long tail to the left
• Mean < Median < Mode

Measure: Mean – Median or Mean – Mode


Kurtosis
• Kurtosis measures the "peakedness" of a distribution.
• Higher Kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations
• The Kurtosis of the Normal Distribution is 3.

[Figure: Leptokurtic, Mesokurtic and Platykurtic curves]
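
Skewness and kurtosis can be computed from central moments, as in the sketch below. This uses the simple population-moment definitions (third moment over the 1.5 power of the second, fourth moment over the square of the second), which is an assumption: spreadsheet and statistics packages often apply small-sample corrections and may report "excess" kurtosis (kurtosis minus 3). The right-skewed data set is hypothetical.

```python
def central_moment(xs, k):
    """k-th central moment: mean of (x - mean)**k."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness; 0 for a symmetric data set."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Moment-based kurtosis; 3 for a normal distribution."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

symmetric = [1, 3, 5, 7, 9, 9, 9, 11, 13, 15, 17]  # symmetric about 9
right_skewed = [1, 2, 2, 3, 3, 3, 20]              # made-up data, long right tail

print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```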
Descriptive statistics (using Excel's Data Analysis tool)
Let's get some descriptive statistics for this data. In Excel go to Tools – Data Analysis. If you do not see the "Data Analysis" option you need to install it: go to Tools – Add-Ins, and in the window that pops up check the "Analysis ToolPak" option, then press OK. Try running Data Analysis again.
Descriptive statistics

Now we know something about our data


Data analysis using Graphs

Tables, charts and graphs are convenient ways to clearly


show your data.
Sample data
The cafeteria wanted to collect data on how much milk was sold in 1 week. The table below shows
the results. We are going to take this data and display it in 3 different types of graphs.

Day Chocolate Strawberry White


Monday 53 78 126
Tuesday 72 97 87
Wednesday 112 73 86
Thursday 33 78 143
Friday 76 47 162

✓ Notice how each of the following examples is used to illustrate the data.
✓ Choose the best graph form to express your results.
Graphical Representation of variables

Bar Graph
• A bar graph is used to show relationships between groups.
• The two items being compared do not need to affect each other.
• It's a fast way to show big differences.

Pie Graph
• A circle graph is used to show how a part of something relates to the whole.
• This kind of graph is needed to show percentages effectively.

Line Graph
• A line graph is used to show continuing data; how one thing is affected by another.
• The rises and falls of a line graph show how things are going.
Notice how easy it is to read a bar graph.
[Figures: bar graph, pie graph and line graph of "Chocolate Milk Sold". Axes: Day (Monday to Friday) and Amount Sold. Values: Monday 53, Tuesday 72, Wednesday 112, Thursday 33, Friday 76.]

On what day did they sell the most chocolate milk?
a. Tuesday b. Friday c. Wednesday

On what day was the least amount of chocolate milk sold?
a. Monday b. Tuesday c. Thursday

On what day did they have a drop in chocolate milk sales?
a. Thursday b. Tuesday c. Monday
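
The three questions can also be answered directly from the chocolate column of the sample table. The sketch below is only an illustration of reading the data programmatically; the variable names are hypothetical.

```python
# Chocolate milk sold per day, taken from the sample data table.
chocolate = {
    "Monday": 53,
    "Tuesday": 72,
    "Wednesday": 112,
    "Thursday": 33,
    "Friday": 76,
}

most = max(chocolate, key=chocolate.get)    # day with the highest sales
least = min(chocolate, key=chocolate.get)   # day with the lowest sales

# A "drop" is a day that sold less than the day before it.
days = list(chocolate)
drops = [day for prev, day in zip(days, days[1:]) if chocolate[day] < chocolate[prev]]

print(most, least, drops)
```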
Graphical Representation of variables

Histogram
▪ A histogram is a special kind of bar chart which allows us to visualize the distribution of values of an ordinal/continuous variable
▪ Can be developed in Excel 2007 through Data>>>data analysis>>>histogram

Line charts
▪ A representation of data varying over time, e.g. commodity prices
▪ It provides insights like trend of the data, seasonality or presence of outliers
[Figure: Brent – 1 month forwards, $/barrel, Jul-08 to Jan-10. SOURCE: Wikipedia]

Ogives
▪ In statistics, an ogive is a graph showing the curve of a cumulative distribution function.
▪ It provides insights like distribution of population within a given range
Choosing the Right Graph

• Use a bar graph if you are not looking for trends (or patterns) over time; and the items (or
categories) are not parts of a whole.

•Use a pie chart if you need to compare different parts of a whole, there is no time involved and
there are not too many items (or categories).

•Use a line graph if you need to see how a quantity has changed over time. Line graphs enable
us to find trends (or patterns) over time.
Common Chart Types
Outliers
•An outlier is an observation that is numerically distant from the rest of the data.

•An outlying observation, or outlier, is one that appears to deviate markedly from other
members of the sample in which it occurs.

•Outliers can occur by chance in any distribution, but they are often indicative either of
measurement error or that the population has a heavy-tailed distribution.
Bill Gates makes $500 million a year. He’s in a room with 9 teachers, 4 of whom
make $40k, 3 make $45k, and 2 make $55k a year. What is the mean salary of
everyone in the room? What would be the mean salary if Gates wasn’t included?

Mean With Gates: $50,040,500
Mean Without Gates: $45,000
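
The arithmetic behind these two figures, together with the median for comparison, can be sketched with Python's `statistics` module. The median lines are an addition for illustration; the salaries are the ones stated in the example.

```python
import statistics

teachers = [40_000] * 4 + [45_000] * 3 + [55_000] * 2
room = teachers + [500_000_000]  # add Bill Gates

mean_with = statistics.mean(room)
mean_without = statistics.mean(teachers)

# The median barely moves when the outlier is added.
median_with = statistics.median(room)
median_without = statistics.median(teachers)

print(mean_with, mean_without, median_with, median_without)
```

One extreme value moves the mean by three orders of magnitude while the median stays at $45,000, which is the sense in which the median is "robust" to outliers.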
Plots for analyzing outliers
A Scatterplot is useful for "eyeballing" the presence of outliers.
[Figure: scatterplot]

In a Box plot, a point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier.
[Figure: box plot of "value", marking min, Q1, median, Q3 and max]

[Figures: "Stock Price of Peach Inc." and "Hourly power prices", Jul-09 to Apr-10]
Missing Values
• In the ideal data collection project, complete data would exist for all variables across all experimental units (also called subjects, cases, or observations).
• Unfortunately, for a number of reasons it is inevitable that some values won't be collected, will become lost, or will be unusable.

There are a number of reasons why data become missing.
• sensor failures
• omitted entries in databases
• non-response in questionnaires
• loss to follow up
• lack of overlap between linked data sets
• dropping out of school, graduation, etc.
• survey design: "skip patterns" between respondents

Imputation Methods
Some common imputation methods
• Mean (median, mode) imputation
• Pairwise deletion a.k.a. available case analysis
• Dummy variable adjustment
• List wise deletion a.k.a. complete case analysis
• Multiple imputation (MI)
Some facts about missing data
Why not just delete cases with missing values rather than impute values at all?
a. Deletion can introduce substantial bias into the study, and the loss in sample size can appreciably diminish the statistical power of the analysis.

b. As a rule of thumb, if a variable has more than 5% missing values, cases are not deleted.

Should I use original data or imputed data when reporting results?

a. The original dataset may be biased by a large number of non-random missing values.

b. The imputed dataset is a "what-if" hypothetical dataset which relies on estimation, though it is a "best guess" attempt to present what choices respondents are likely to have made, given their responses on other items.

c. It is preferable to run all analyses on both the original and imputed datasets, and discuss in the report where imputation would make a difference for the substantive interpretations.
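
A minimal sketch contrasting deletion with mean imputation for a single variable, in plain Python. The ages are hypothetical, `None` stands in for a missing value, and this is only an illustration of the trade-off, not a recommended workflow.

```python
import statistics

# Hypothetical survey ages; None marks a missing response.
ages = [23, None, 31, 27, None, 35]

# Deletion: keep only the cases where the value is present.
complete = [a for a in ages if a is not None]

# Mean imputation: fill each gap with the mean of the observed values.
fill_value = statistics.mean(complete)
imputed = [a if a is not None else fill_value for a in ages]

print(len(complete), imputed)
print(statistics.variance(complete), statistics.variance(imputed))
```

Deletion shrinks the sample from six cases to four, while mean imputation keeps all six but pulls the variance down because every filled value sits exactly at the mean. This is one reason to report results on both the original and the imputed data.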
Normal Distributions

• The normal distribution is a pattern for the distribution of a set of data which follows a bell shaped curve. This is also called the Gaussian distribution.

• Normal Distribution has the mean, the median, and the mode all coinciding at its peak and with
frequencies gradually decreasing at both ends of the curve.

• The normal distribution is a theoretical ideal distribution. Real-life empirical distributions never
match this model perfectly. However, many things in life do approximate the normal distribution,
and are said to be “normally distributed.”
The Bell Shaped Curve
• The bell shaped curve has the following characteristics:
• The curve is concentrated in the center and decreases on either side.
• The bell shaped curve is symmetric and unimodal
• The curve extends to +/- infinity
• Area under the curve = 1

68-95-99.7 Rule
The empirical rule states that for a normal distribution:
• 68% of the data will fall within 1 SD of the mean
• 95% of the data will fall within 2 SDs of the mean
• Almost all (99.7%) of the data will fall within 3 SDs of the mean
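
The 68-95-99.7 percentages can be checked against the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later).

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```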
Are my data “normal”?
• Not all continuous random variables are normally distributed!!
• It is important to evaluate how well the data are approximated by a normal distribution

Are my data normally distributed?

1. Look at the histogram! Does it appear bell shaped?


2. Compute descriptive summary measures—are mean, median, and mode similar?
3. Do 2/3 of observations lie within 1 std dev of the mean? Do 95% of observations lie within
2 std dev of the mean?
4. Look at a normal probability plot—is it approximately linear?
5. Run tests of normality (such as Kolmogorov-Smirnov). But be cautious: these tests are highly influenced by sample size!
Standard (Z) Scores – standard normal variable
• A standard score (also called Z score) is the number of standard deviations that a given raw score is above or below the mean.

All normal distributions can be converted into the standard normal curve by subtracting the mean and dividing by the standard deviation:

$$Z = \frac{X - \mu}{\sigma}$$

How good is the rule for real data?
Check some example data:
The mean of the weight of the women = 127.8
The standard deviation (SD) = 15.5
Practice problem

If birth weights in a population are normally distributed with a mean of 109 oz and a standard
deviation of 13 oz
a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling
birth records at random?
b. What is the chance of obtaining a birth weight of 120 oz or lighter?
Answer
a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling birth records at random?

$$Z = \frac{141 - 109}{13} = 2.46$$

From the chart or SAS → Z of 2.46 corresponds to a right tail (greater than) area of:
P(Z≥2.46) = 1-(.9931) = .0069 or .69%

b. What is the chance of obtaining a birth weight of 120 oz or lighter?

$$Z = \frac{120 - 109}{13} = .85$$

From the chart → Z of .85 corresponds to a left tail area of:
P(Z≤.85) = .8023 = 80.23%
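
The same answers can be obtained without a Z table, for example with `statistics.NormalDist` in Python. The results differ from the chart values in the third or fourth decimal place because the chart lookup rounds Z to two decimals first.

```python
from statistics import NormalDist

# Birth weights: mean 109 oz, standard deviation 13 oz.
birth_weight = NormalDist(mu=109, sigma=13)

# a. Chance of 141 oz or heavier (right tail).
z_a = (141 - 109) / 13
p_heavier = 1 - birth_weight.cdf(141)

# b. Chance of 120 oz or lighter (left tail).
z_b = (120 - 109) / 13
p_lighter = birth_weight.cdf(120)

print(f"a. Z = {z_a:.2f}, P = {p_heavier:.4f}")
print(f"b. Z = {z_b:.2f}, P = {p_lighter:.4f}")
```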


Applications of Normal Distribution to Business Administration

• Modern portfolio theory assumes that the returns of a diversified asset portfolio follow a normal distribution.

• In operations management, process variations are often normally distributed.

• In human resource management, employee performance is sometimes considered to be normally distributed.
Correlation
What is the relationship between two variables?

Relationship between hours studying (X) and grades on a midterm (Y)?

Relationship between self-esteem (X) and depression (Y)?

The relationship between two variables over a period, especially one that shows a
close match between the variables' movements

Direction and strength of relationship between two variables


Graphical representation of data in a bivariate setup
[Figure: scatter plot panels titled No association, Strong linear relationship, Strong linear relationship, Exact linear relationship, Quadratic relationship, Sinusoidal relationship (damped)]
Correlation measures may be misleading in certain scenarios

Correlation and independence
No correlation does not imply no association.

Spurious correlation
A spurious relationship is a mathematical relationship in which two events or variables have no direct causal connection, yet it may be wrongly inferred that they do, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "confounding factor" or "lurking variable").

Another popular example is a series of Dutch statistics showing a positive correlation between the number of storks nesting in a series of springs and the number of human babies born at that time. Of course there was no causal connection; they were correlated with each other only because they were correlated with the weather nine months before the observations.
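
The point that zero correlation does not imply no association can be illustrated with a small Python sketch. The `pearson_r` function below implements the usual Pearson correlation coefficient by hand, and both data sets are hypothetical: one is exactly linear, the other exactly quadratic.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]    # exact linear relationship
quadratic = [v * v for v in x]     # exact quadratic relationship

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)

print(r_linear, r_quadratic)
```

Here y is completely determined by x in both cases, yet the quadratic case has a correlation of zero, because the correlation coefficient only measures linear association.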
Examples
• An increase in height results in a weight increase for children
• Attending lessons leads to improved grades
• The age of a car impacts its stopping distance
• The more years of education, the higher the income

Business Examples
• Rising unemployment leads to a decrease in sales of "Taste the Difference" products
• An increase in demand for a product leads to an increase in supply
• The more efficient the workers, the higher the productivity
