
Business Statistics: Prof. Lancelot JAMES

This document provides an outline for a business statistics course taught by Professor Lancelot James at Hong Kong University of Science and Technology. It discusses prerequisites, grading, textbooks, and an introduction to descriptive statistics including populations and samples, types of variables, and methods for presenting data in tables and charts. Key topics covered are descriptive statistics, inferential statistics, and how statistics can be used in business contexts such as decision making.


Business Statistics

Prof. Lancelot JAMES


Hong Kong University of Science and Technology,
Information and Systems Management

ISMT-551, Fall 2006

Outline of the Course

Prerequisites: Good STAMINA
Class participation is encouraged
Grading: Homework/Projects and Final Exam
Textbook: Bowerman, O'Connell, Orris (2004), Essentials
of Business Statistics. McGraw-Hill.
Use the online tutorials
(http://highered.mcgraw-hill.com/sites/0072827823/student_view0/electronic_tutorials.html)

Salutations

Prof/Dr. James

Introduction

Descriptive Statistics

What is statistics?

The formal definition is simply the study or analysis of data.


Statistics is a tool for studying a characteristic or a behavior in
the real world based on a sample from the entire population.

DATA IS EVERYWHERE

How might statistics be used in a business context?

To know how to present and describe information

To know how to draw conclusions about large populations
based only on information obtained in samples

To know how to improve processes

To know how to obtain forecasts

Making DECISIONS

Populations and samples

Parameter: A summary measure that describes a
characteristic of an entire population. For example, the
average height of people in the US.

Sample: A portion of the population that is selected for
analysis.

Statistic: A summary measure computed from sample
data that is used to describe or estimate a characteristic of
the entire population.

Key Definitions

A Population is a set of existing units (people, objects,
events, ...).
A Variable is any characteristic of a Population.
All the population measurements may be collected in a
Census.
A Sample is a subset of the units in the population.
A sample of measurements can be

Described: Descriptive Statistics

Used to make generalizations about important aspects of
the Population: Statistical Inference

[Figure: Population vs. Sample]

[Figure: Types of Data]

1. Population: All items or subjects under consideration.

Some Examples of a Population

The stars in the galaxy

red cars, cars produced by Toyota in a given year

People in ISMT551, HKUST, Hong Kong, Asia, World

Potential voters in an election

The Moon

Fish in the Ocean, Abalone population

Players in the World cup

Question: Why do we need to perform statistical
analysis on samples? Why not just look at the
whole population?

Some reasons: Many populations are too big, for example the
stars in the galaxy, populations of people, or fish in the ocean.

That is, in many cases gathering information from an entire
population is practically impossible.

Other populations are costly to sample in terms of time or
money, or it simply may be foolish to test everything.

Example: Destructive Sampling. Car manufacturers often
need to test the safety of their cars. The methods to do so
rely on the destruction of a small set of vehicles. The
manufacturer then uses the information gathered from
these tests to decide whether the rest of the cars
manufactured are suitable for use.

Example: Rare Samples. Some objects are rarely found or
difficult to obtain, such as Moon rock or giant squid.

Describing sets of Data

Terms: Variable, Experimental Units, Datasets

Variable: Stores one particular kind of information contained
in an item or a subject (either from a sample or from a
population). A characteristic which changes over
individuals or time.

Examples: Company type, Company size, Company Sales,
Hot dog brand name, product type.

Experimental units: The individuals or objects on which the
variable(s) is (are) measured. If a sample of size n is taken,
then there are n experimental units.

Suppose that the variable of interest is the number of goals
scored by a professional soccer player. Then, for instance, the
soccer player David Beckham would be considered a possible
experimental unit, where the data collected would consist of the
number of goals he scores. In mathematical shorthand we
might write:
Let X denote the variable corresponding to goals scored by an
individual. Then X_Beckham denotes the number of goals scored
by Beckham. Note that if Beckham has not played the season
yet, the variable X_Beckham is random, as it may take values from 0
to, say, a number less than 100.

Dataset: Includes information based on one or more variables. For
instance, let X be the number of goals scored, let Y be the
height of a player, and let Z be the weight of a player. Then
each experimental unit, say i, is associated with the vector
(X_i, Y_i, Z_i) = (goals scored by i, height of i, weight of i).
This can be put into a table format:

Types of Variables: Qualitative, Quantitative

Qualitative (Categorical)

Categories. Examples: {Red, Blue, Green}, {Male,
Female}, {Yes, No}.

Ordered or ranked data. Examples: {High, Medium, Low},
{Very good, Good, Bad}.

Quantitative (Numerical)

Values are numerical and fall essentially into two types:

Discrete (finite or countable values): {0, 1, 2, 3} or {1, 2, 3, ...}.
Examples: Number of goals scored, number of classes,
number of phone calls in one hour, number of tosses before
a head appears on a fair coin, number of customers in a
restaurant.

Continuous: All values on an interval or on the real
line (uncountable quantities). An example is time.

Questions
1. For each of the following random variables, determine
whether the variable is categorical or numerical (quantitative).
a) Number of telephones per household
b) Type of telephone primarily used
c) Number of long-distance calls made per month
d) Length (in minutes) of longest long-distance call made per
month
e) Color of telephone primarily used
f) monthly charge (in dollars and cents) for long distance
calls made
g) ownership of a cellular phone

2. Which of the following is NOT a reason for sampling?
a) It is usually too costly to study the whole population.
b) It is usually too time consuming to look at the whole
population.
c) It is sometimes destructive to observe the entire population.
d) It is always more informative to investigate a sample
than the entire population.

3. To monitor campus security, the campus police office is
taking a survey of the number of students in a parking lot every
30 minutes of a 24-hour period, with the goal of determining
when patrols of the lot would serve the most students. If X is
the number of students in the lot in each period of time, then X is
an example of
a) a categorical random variable
b) a discrete random variable
c) a continuous random variable
d) a statistic

4. The Chancellor of a major university was concerned about
alcohol abuse on campus and wanted to find out the
proportion (percentage) of students at her university who visited
campus bars every weekend. Her advisor took a random
sample of 250 students and computed the proportion of students
in the sample who visited campus bars every weekend.
Consider the following possibilities and answer the questions
below:
(i) The total number of students in the sample who visited
campus bars every weekend is an example of (which of the
following)
(ii) The proportion of students at her university who visited campus
bars every weekend is an example of (which of the following)
(iii) The proportion of students in the sample who visited campus
bars every weekend is an example of (which of the following)
a) a categorical random variable
b) a discrete random variable
c) a parameter
d) a statistic

Presenting Data in Tables and Charts

Tables and Graphs for Numerical Data

This section discusses how to take possibly large amounts of
information and present them in such a way that they can be
easily interpreted by visual means. Topics to be discussed:

Basic methods to organize data: Ordered Array, Stem and
Leaf Plots

How to use and construct Tables (Frequency distributions,
Cumulative distributions) and Graphs (Histogram, Polygon,
Ogive)

Bivariate Numerical Data: Scatter Diagram

Tables and Graphs for Categorical Data

Summary Table, Bar Chart, Pie Chart, Pareto Diagram

Bivariate Categorical Data: Contingency Table, Side-by-side
chart

Ordered Array

An ordered array is created by taking a set of data and
displaying the items in ranked order from lowest to highest.
Example: Data from 3-year percentage of high-risk funds, n = 47
-22.82 -12.57 -10.55 -5.32 -2.89 -.33 -.14 4.00
...
49.02 49.67 54.43 58.71 63.79 68.58 86.13

Stem and Leaf Displays

A stem and leaf plot separates data into

Stems: leading digits

Leaves: trailing digits

Note that often numbers are rounded off, and there can be
many different stem and leaf plots for the same data. Stem and
leaf plots display how values are clustered or grouped together.

Example: Consider the data relating to the previous ordered
array. The numbers range from -22% to 86% (after roundoff).
The stems -2, -1, -0, 0, 1, 2, 3, 4, 5, 6, 7, 8 form the categories.
Consider the first 4 values: -22.82, -12.57, -10.55, -5.32. Note:
write -5.32 = -05.32. These would be displayed as

-2 | 2
-1 | 20
-0 | 5

For the full data set one has

-2 | 2
-1 | 20
-0 | 5320
 0 | 011146688
 1 | 3357
 2 | 23346889999
 3 | 056789
 4 | 235799
 5 | 48
 6 | 38
 7 |
 8 | 6

Stem-and-leaf display
Building a Stem-and-leaf display

Data in raw form: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Order the data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Choose the stem unit and leaf unit: 10s digit for the stem, 1s digit
for the leaf
For each measurement: list the leaves of each stem.
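The steps above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and its `stem_unit` parameter are hypothetical, and the sketch assumes non-negative whole numbers with the 10s digit as the stem and the 1s digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group non-negative integers into stems (10s digit) and leaves (1s digit)."""
    groups = defaultdict(list)
    for v in sorted(values):              # step 2: order the data
        groups[v // stem_unit].append(v % stem_unit)
    lo, hi = min(groups), max(groups)
    # Keep empty stems so that gaps in the data stay visible.
    return {stem: groups.get(stem, []) for stem in range(lo, hi + 1)}

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
display = stem_and_leaf(raw)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
# 2 | 144677
# 3 | 028
# 4 | 1
```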

Question: Interpreting a Stem and leaf plot

A survey was conducted to determine how people rated the
quality of programming available on television. Respondents
were asked to rate the overall quality from 0 (no quality at all) to
100 (extremely good quality). The stem and leaf display is
shown below.

3 | 24
4 | 03478999
5 | 0112345
6 | 12566
7 | 01
8 |
9 | 2

Q1: What percentage of the respondents rated overall
television quality with a rating of 80 or above?
a) 0.00 b) 0.04 c) 0.96 d) 1.00

Q2: What percentage of the respondents rated overall
television quality with a rating of 50 or below?
a) 0.11 b) 0.40 c) 0.44 d) 0.56

Tables and Graphs for Numerical Data

Frequency Distribution

A frequency distribution is a summary table in which the data
are arranged into conveniently established, numerically ordered
class groupings or categories. Classes usually consist of
intervals of values.

How to determine the number of classes and class intervals

(1) Determine the number of classes based on the total number of
observations n. One simple rule is

$n < 25$: 5 categories
$25 \le n \le 400$: $\sqrt{n}$ categories
$n > 400$: 20 categories

(2) Determine class width by

$$\text{Class width} = \frac{\text{range}}{\text{number of classes}}$$

Note: be sure that class boundaries are well differentiated.
That is, the class intervals should not overlap. The class
midpoint is the point halfway between the boundaries of each
class and is representative of the data within that class.

Frequency or class frequency is the number of
observations in each class. Some notation: let $f_i$ denote
the frequency of class i for i = 1, ..., k classes. That is,
$f_i$ = number of observations in class i

Relative Frequency/percentage distribution: A relative
frequency with respect to a particular class is defined as

$$\text{relative frequency of class } i := \frac{f_i}{n}$$

A percentage distribution is formed by multiplying each
relative frequency by 100%.

Cumulative percentage distribution: Describes the
percentage of values which are in or below a class interval.
Thus the cumulative percentage value associated with
class 3 is calculated as

$$\frac{f_1 + f_2 + f_3}{n} \times 100\%$$


Example: Data for Utility Charges

96, 171, 202, 178, 147, 102, 153, 197, 127, 82, ....158

1. Form a frequency distribution that has 5, 6, or 7 class intervals.

First determine the minimum and maximum values. The
minimum is 82 and the maximum is 213, hence the class width
formula gives the class interval size as follows:

5 classes: $\text{Class width} = \frac{213 - 82}{5} = 26.2$, taken as 30
6 classes: $\text{Class width} = \frac{213 - 82}{6} = 21.83$, taken as 25
7 classes: $\text{Class width} = \frac{213 - 82}{7} = 18.71$, taken as 20
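As a rough check of the arithmetic, the rule for the number of classes and the class width formula can be written as two small Python functions. The names `number_of_classes` and `class_width` are hypothetical, and rounding the square root of n to the nearest integer is an assumption, since the notes do not say how to round it.

```python
import math

def number_of_classes(n):
    """Rule of thumb from the notes: 5, sqrt(n), or 20 classes."""
    if n < 25:
        return 5
    if n <= 400:
        # Assumption: round sqrt(n) to the nearest integer.
        return round(math.sqrt(n))
    return 20

def class_width(minimum, maximum, k):
    """Raw class width: the range divided by the number of classes."""
    return (maximum - minimum) / k

# Utility charges example: minimum 82, maximum 213.
widths = {k: class_width(82, 213, k) for k in (5, 6, 7)}
for k, w in widths.items():
    print(f"{k} classes: raw width {w:.2f}")
# 5 classes: raw width 26.20
# 6 classes: raw width 21.83
# 7 classes: raw width 18.71
```

The raw widths are then rounded up to convenient values (30, 25, 20) by judgment, which the sketch deliberately leaves to the reader.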

A chart for 7 intervals

EC        Midpoint   Freq   Per
80<100    90         4      8%
100<120   110        7      14%
120<140   130        9      18%
140<160   150        13     26%
160<180   170        9      18%
180<200   190        5      10%
200<220   210        3      6%
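The percentage column, and the cumulative percentages used later for the Ogive, follow directly from the frequencies. The sketch below recomputes them in Python from the Freq column of the 7-class table; the variable names are illustrative only.

```python
from itertools import accumulate

frequencies = [4, 7, 9, 13, 9, 5, 3]   # Freq column of the 7-class table
n = sum(frequencies)                    # total number of observations

# Relative frequency f_i / n, expressed as a percentage.
percentages = [100 * f / n for f in frequencies]
# Cumulative percentage: running total of the class percentages.
cumulative = list(accumulate(percentages))

for f, p, c in zip(frequencies, percentages, cumulative):
    print(f"freq={f:2d}  percent={p:3.0f}%  cumulative={c:3.0f}%")
```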

Histograms/Polygon graphs

Histogram: A chart in which the rectangular bars are
constructed at the boundaries of each class. When plotting
a histogram, the variable of interest is displayed along the
horizontal (X) axis, differentiated by the class boundaries. The
vertical (Y) axis represents the number, proportion, or
percentage of observations per class interval.

Polygon: A percentage polygon is a line chart, which is
useful for comparing two or more groups. The advantage
over a histogram is that it is easier to see visually the
difference between two groups. The percentage polygon is
constructed by connecting lines between the midpoints of
each interval at their respective class percentages.

Cumulative percentage polygons or Ogives: A cumulative
percentage polygon, otherwise known as an Ogive, is a
line graph of the cumulative percentage distribution.
Similar to the polygon line chart, it is constructed by
connecting lines between the midpoints of each interval at
their respective cumulative percentages.
For example, in the case of 7 classes one would construct a
plot based on the pairs (70, 0), (90, 8), (110, 22), (130, 40),
(150, 66), (170, 84), (190, 94), (210, 100), where the first
value represents the class midpoint and the second value
is the cumulative percentage up to and including that class.

Scatter Diagram

The scatter diagram is a graphical method used to examine the
possible relationships between two variables of interest.
Variable 1 is plotted on the X-axis and variable 2 on the
Y-axis. If there are n associated pairs of data, then these can be
represented as (X_1, Y_1), ..., (X_n, Y_n). A question to ask is whether
the graph seems to visually imply a solid relationship between the two
variables. A common use of scatter diagrams is to determine if
there is a linear relationship between the X variable and the Y
variable.

[Figure: Scatter Diagram, Old Faithful data]

Tables and Charts for Categorical Data:

Tables and charts are often used for categorical data. There
are many similarities between the methods for numerical data
and categorical data. One main distinction is that the terms
classes or class intervals, which are based on a range of
numerical values, are replaced by types of objects or
categories.
The idea of frequencies or percentages is then taken with
respect to these categories.

Summary Table

The Summary Table is quite similar to the frequency distribution
table for numerical values, except that now frequencies and
percentages are calculated with respect to types or categories
of objects.

Suppose that there are 4 categories labeled {A, B, C, D}. Then
one would calculate the number of objects of type A, B, C, D to
obtain the frequencies and organize this in a chart.
That is, the frequency of A can be represented as
$f_A$ = number of observations in class A

Bar Chart

A Bar Chart is very similar to a histogram. A frequency bar
chart is constructed by representing each category as a bar,
where the length or height of the bar represents the frequency
or percentage of observations falling in that category.

Example: One plots (A, $f_A$), (B, $f_B$), (C, $f_C$), etc., where A, B, C
play a similar role to the class intervals in a histogram. Note
that unlike histograms there is no real concept of a midpoint.
However, because the types are represented by equally wide
bars, one still has a visual midpoint.

Pie Chart

This is a popular graphical device which simply represents the
percentage in each category as pieces of a pie. Hence the
category with the largest piece of pie represents the category
with the largest percentage, and so on. In order to properly
construct the pie chart, the angle of each piece is given by the
formula
$360^\circ \times$ (percentage in category),
with the percentage expressed as a fraction of the total.

Pareto Diagram

In many respects a Pareto Diagram is quite similar to an Ogive
for numerical data combined with a bar chart. To construct one,
display a ranked bar chart in decreasing order. That is, the bar
chart starts with the category with the highest frequency or
percentage, then the next highest, etc. A cumulative frequency
polygon or Ogive can then be constructed using the visual
midpoints of the bar chart. The ranked bar chart combined with
the overlaid Ogive defines the Pareto Diagram.
The Pareto Diagram is preferred to the bar chart and pie chart
when there are many categories.
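The numbers behind a summary table, a pie chart, and a Pareto Diagram can be computed together. The sketch below uses made-up frequencies for four categories {A, B, C, D}; the counts are purely hypothetical and are not taken from the course data.

```python
# Hypothetical category frequencies, for illustration only.
counts = {"A": 12, "B": 30, "C": 5, "D": 3}
total = sum(counts.values())

# Pie chart: angle of each piece = 360 degrees x fraction in the category.
pie_angles = {category: 360 * f / total for category, f in counts.items()}

# Pareto Diagram: rank categories by decreasing frequency, then accumulate.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
pareto = []
running = 0.0
for category, f in ranked:
    percent = 100 * f / total
    running += percent
    pareto.append((category, f, percent, running))

for category, f, percent, cumulative in pareto:
    print(f"{category}: freq={f:2d}  percent={percent:5.1f}%  cumulative={cumulative:5.1f}%")
```

The bars of the Pareto Diagram would be drawn in the order of `pareto`, with the cumulative column plotted as the overlaid Ogive.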

Tabulating Bivariate Categorical Data

Contingency table: A contingency table or
cross-classification table is used when a comparison is
necessary between two categorical variables. The table
takes a matrix (table) form as follows. Suppose that
variable 1 consists of the categories (V_1, V_2) and variable 2
has possible categories (A, B, C):

                var 2
var 1      A      B      C      Total
V1
V2
Total

The elements of the rows and columns can contain counts,
or percentages relative to row totals, column totals, or
overall totals.

[Figure: Dot plot, example for bivariate data]

Numerical Descriptive Measures

Measures of central tendency

Measures of central tendency or location of the data are used
to identify a typical value that can be used to describe the entire
set. Three common measurements are the arithmetic mean, the
median, and the mode. Respectively, these measure the
average value, the middlemost value, and the most frequently
occurring value in a dataset.

The Arithmetic Mean

Certainly the most commonly recognized and used measure of
central tendency is the arithmetic mean or the (common)
average. If a dataset consists of n observations, X_1, X_2, ..., X_n,
then the arithmetic mean of the sample is written as

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

The sum or total can be expressed in shorthand as

$$\sum_{i=1}^{n} X_i := X_1 + X_2 + \cdots + X_n$$

The mean is often a fine measurement of central tendency.
However, its main drawback is that it is greatly affected by
extreme values in the data, that is, values which are much
smaller or much larger than most of the values in the data set.

Example: Imagine that one wants to find the average income in
Hong Kong. Average in this sense means one wants to be able
to pinpoint what salary per month the average person makes.
In order to do this a survey is taken, say based on 10 people at
random. The people report monthly incomes of
4000, 10,000, 10,000, 15,000, 20,000, 25,000, 30,000, 60,000,
60,000, 5,000,000 (a BIG TYCOON)
The total is 5,234,000 and the average income is
$\bar{X}$ = 523,400.

Naturally, as n increases this number will most likely decrease,
but this example illustrates how sensitive mean calculations can
be.

Median
The Median is the middle value in an ordered array of data. The
median is not affected by extreme values and may be
preferable to the mean in this situation.
There are two methods of computing the median of the set of
data, depending on whether the sample size is even or odd.
First one needs to remember to order the data from the
minimum to the maximum value.

When n is odd:

$$\text{Median} = \frac{n+1}{2}\text{-th ranked observation}$$

Example: consider the data 12, 7, 7, 9, 0, 7, 3. Here n = 7 is odd and
the ordered values are
0, 3, 7, 7, 7, 9, 12.
The median is then 7, the 4th value in the ordered sample.

When n is even, the median is defined to be the average of the
two middlemost values.
Example: recall the income example. 4000, 10,000, 10,000,
15,000, 20,000, 25,000, 30,000, 60,000, 60,000, 5,000,000.
The two middlemost values are 20,000 and 25,000, which yields
a median of

$$\frac{20{,}000 + 25{,}000}{2} = 22{,}500$$

The Mode: The mode corresponds to the value in the data
set which occurs most often.
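The three measures can be checked with Python's standard `statistics` module. The sketch below reuses the income survey and the small odd-sized data set from the examples; the variable names are illustrative only.

```python
import statistics

incomes = [4000, 10_000, 10_000, 15_000, 20_000,
           25_000, 30_000, 60_000, 60_000, 5_000_000]
small = [12, 7, 7, 9, 0, 7, 3]

total_income = sum(incomes)                  # 5234000
mean_income = statistics.mean(incomes)       # 523400, pulled up by one extreme value
median_income = statistics.median(incomes)   # 22500.0, average of the two middle values
median_small = statistics.median(small)      # 7, the 4th ordered value (n = 7 is odd)
mode_small = statistics.mode(small)          # 7, the most frequent value

print(total_income, mean_income, median_income, median_small, mode_small)
```

The gap between the mean and the median of the incomes shows how strongly a single extreme observation moves the mean while leaving the median unchanged.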

Geometric Mean: Investments
The Geometric Mean and the Geometric Mean Rate of Return are
used to measure the status of an investment over time. They
measure the rate of change of a variable over time.

The formula for the geometric mean of variables X_1, ..., X_n
is

$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$

The formula for the geometric mean rate of return is

$$\bar{R}_G = [(1 + R_1) \times \cdots \times (1 + R_n)]^{1/n} - 1$$

where R_i is the rate of return in time period i. The rate of
return is defined to be the loss or gain in period i divided by
the starting value in the period and then multiplied by
100%.

Example: to illustrate this, let's look at the example on p. 104 in
the text. An initial investment of 100,000 is made. At the end of
year one the fund declined to 50,000, and it then rebounded to its
original 100,000 value at the end of year two. Hence

$$R_1 = \left(\frac{50{,}000 - 100{,}000}{100{,}000}\right) \times 100\% = -50\%$$

$$R_2 = \left(\frac{100{,}000 - 50{,}000}{50{,}000}\right) \times 100\% = 100\%$$

The average return is calculated to be 25,000,

$$\frac{R_1 + R_2}{2} \times 100{,}000 = 0.25 \times 100{,}000,$$

while the geometric mean rate of return is calculated to be

$$(0.5 \times 2)^{1/2} - 1 = 0,$$

which more accurately reflects the fact that at the end of the 2
year period there was no gain or loss.
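The comparison can be reproduced numerically. In this sketch the rates are written as decimals (-0.50 and 1.00 rather than -50% and 100%), and the function name `geometric_mean_rate_of_return` is a hypothetical helper, not something defined in the text.

```python
import math

def geometric_mean_rate_of_return(rates):
    """[(1 + R_1) x ... x (1 + R_n)]^(1/n) - 1, with rates given as decimals."""
    growth = math.prod(1 + r for r in rates)
    return growth ** (1 / len(rates)) - 1

rates = [-0.50, 1.00]   # year one: -50%, year two: +100%

arithmetic = sum(rates) / len(rates)              # 0.25
geometric = geometric_mean_rate_of_return(rates)  # 0.0

print(f"arithmetic mean rate: {arithmetic:.2%}")
print(f"geometric mean rate:  {geometric:.2%}")
```

The arithmetic mean suggests a 25% average gain, while the geometric mean gives 0%, matching the fact that the investment ends where it started.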

Measures of noncentral tendency

Quartiles
The Quartiles divide the ranked data into four quarters.

The value of the data where 25% of the data are below and
75% are above it is called the 1st quartile, denoted as Q_1.
A formula for Q_1 is given as

$$Q_1 = \frac{n+1}{4}\text{-th ordered observation}$$

The value of the data where 75% of the data are below and
25% are above it is called the 3rd quartile, denoted as Q_3.
A formula for Q_3 is given as

$$Q_3 = \frac{3(n+1)}{4}\text{-th ordered observation}$$

Example: for the dataset 4000, 10,000, 10,000, 15,000, 20,000,
25,000, 30,000, 60,000, 60,000, 5,000,000, using the formulas,
Q_1 corresponds to the 2.75th ordered value and Q_3
corresponds to the 8.25th ordered value. It follows that rounding
up 2.75 to 3 yields
Q_1 = 10,000
Rounding down 8.25 to 8,
Q_3 = 60,000

Measures of Variation

In addition to measurements of central tendency it is important
to identify the amount of Variability or Spread in data. Three
such measurements are the Range, Interquartile Range, and
Variance.
Example: As a simple motivating example consider the case of
two data sets {2, 2, 2, 2, 2} and {0, 1, 2, 3, 4}.

The arithmetic mean of both sets is 2 and the median of
both sets is also 2.

However, the data sets are quite different. For the first data
set all measures of variation would yield the value 0, while
this will not be the case for the second data set.

Range

The Range is simply the difference between the maximum
value and the minimum value of the data.

That is, the formula is

$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$

It is a measure of the total spread in the data.

One drawback is that it does not take into account the
other data points besides the minimum and the maximum.

Another problem is that it is highly sensitive to extreme values.

For the simple example above, the Range of the second
data set {0, 1, 2, 3, 4} is
4 - 0 = 4
as compared to 2 - 2 = 0 for the first set {2, 2, 2, 2, 2}.

Interquartile Range

The interquartile range considers the spread in the middle
50% of the data and is therefore not influenced by extreme
values.

The formula is

$$\text{Interquartile range} = Q_3 - Q_1$$

Example: Calculate the Interquartile range for the data sets
{2, 2, 2, 2, 2} and {0, 1, 2, 3, 4}

Solution: first note that n = 5 and compute the positions for
Q_1 and Q_3.

Note that

$$\frac{n+1}{4} = \frac{6}{4} = 1.5$$

It follows that Q_1 corresponds to the second number and
Q_3 corresponds to position 3(1.5) = 4.5, or the 5th number.

Hence for {2, 2, 2, 2, 2}, Q_1 = Q_3 = 2 and the Interquartile
Range is 0.

For {0, 1, 2, 3, 4}, Q_3 = 4 and Q_1 = 1, thus the Interquartile
Range is 4 - 1 = 3.
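The positional rule for the quartiles can be sketched in Python. The rounding convention used here (round the position to the nearest whole rank, with halves rounded up) is an assumption inferred from the worked examples, where 2.75 goes to 3, 8.25 goes to 8, 1.5 goes to 2, and 4.5 goes to 5; other textbooks and software interpolate instead, so results may differ from library functions. The helper names are hypothetical.

```python
import math

def ranked_value(data, position):
    """Value at a 1-based rank, rounding the rank to the nearest integer (halves up)."""
    rank = math.floor(position + 0.5)
    return sorted(data)[rank - 1]

def quartiles(data):
    """Q1 and Q3 from the (n+1)/4 and 3(n+1)/4 ordered positions."""
    n = len(data)
    q1 = ranked_value(data, (n + 1) / 4)
    q3 = ranked_value(data, 3 * (n + 1) / 4)
    return q1, q3

incomes = [4000, 10_000, 10_000, 15_000, 20_000,
           25_000, 30_000, 60_000, 60_000, 5_000_000]

for data in ([2, 2, 2, 2, 2], [0, 1, 2, 3, 4], incomes):
    q1, q3 = quartiles(data)
    print(f"Q1={q1}  Q3={q3}  IQR={q3 - q1}")
```

Note that Python's built-in `round` rounds halves to the nearest even number, which is why the sketch uses `math.floor(position + 0.5)` instead.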

Variance and Standard Deviation

The Sample Variance of a data set is defined as

$$S^2 = \frac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}$$

Example: Calculate the variance of the data sets
{2, 2, 2, 2, 2} and {0, 1, 2, 3, 4}

Solution: first calculate the arithmetic mean $\bar{X}$, where n = 5.

$$\frac{2+2+2+2+2}{5} = 2 \quad \text{and} \quad \frac{0+1+2+3+4}{5} = 2$$

Now calculate the squared differences from the arithmetic
mean. For the data set {2, 2, 2, 2, 2} we see that all the
differences are 0. For {0, 1, 2, 3, 4}, we have
$(0-2)^2 = 4$, $(1-2)^2 = 1$, $(2-2)^2 = 0$, $(3-2)^2 = 1$, $(4-2)^2 = 4$.

The variance of {2, 2, 2, 2, 2} is 0, and for {0, 1, 2, 3, 4},

$$S^2 = \frac{4+1+0+1+4}{4} = 2.5$$

Computationally easier formula:

$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}.$$

The Variance measures the average squared distance of
the individual observations from the mean.

A low variance corresponds to small spread in the data.
In other words, this suggests that most values are quite near
the mean.

A high variance translates into a dataset which has values
which are more widely spread out.
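As a quick check that the definition and the computational formula agree, both can be evaluated for the data set {0, 1, 2, 3, 4} and compared with `statistics.variance` from the Python standard library, which also divides by n - 1. The variable names are illustrative only.

```python
import math
import statistics

data = [0, 1, 2, 3, 4]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
definition = sum((x - mean) ** 2 for x in data) / (n - 1)
# Computational formula: (sum of squares - n * mean^2) / (n - 1).
shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
# Standard library sample variance, for comparison.
library = statistics.variance(data)

# The square root of the variance is back in the original units.
std_dev = math.sqrt(definition)

print(definition, shortcut, library, round(std_dev, 4))
```

All three variance values come out to 2.5, and the square root is roughly 1.58.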

Standard Deviation

A small drawback of the Sample Variance is that it is
measured in squared units relative to the measurements
for X.

That is, if X_1 is expressed in terms of dollars, the variance is
expressed in terms of dollars squared.

For this reason the Sample Standard Deviation is often
preferred.

The Sample Standard Deviation is defined to be the positive
square root of the sample variance and is denoted as S.

For instance, the standard deviation is reported in dollars,
not squared dollars.

The standard deviation of the data set {0, 1, 2, 3, 4} is
$\sqrt{2.5} \approx 1.58$.

Understanding Variation in Data

The more spread out, or dispersed, the data are, the larger
will be the range, the interquartile range, the variance,
and the standard deviation.

The more concentrated, or similar, the data are, the
smaller will be the range, the interquartile range, the variance,
and the standard deviation.

If the observations are all the same, the range, the
interquartile range, the variance, and the standard deviation are all
zero.

None of the measures of variation considered here can be
negative.

Coefficient of Variation

The Coefficient of Variation measures the scatter in the data relative to the mean.

It is expressed as a percentage rather than in the units of the data, and is calculated as

$$CV = \frac{S}{\bar{X}} \cdot 100\%$$

An advantage of the coefficient of variation is that one can compare the relative variability of two or more variables even when the variables are based on different units of measurement.
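
The comparison across units can be illustrated with a small Python sketch. The data below are hypothetical values chosen only for illustration, and `coefficient_of_variation` is an assumed helper name.

```python
import statistics


def coefficient_of_variation(data):
    """CV = (S / mean) * 100, with S the sample standard deviation."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(data) / mean * 100


# Hypothetical measurements in different units
weights_kg = [60, 70, 80]
prices_usd = [900, 1000, 1100]

print(round(coefficient_of_variation(weights_kg), 1))  # 14.3
print(round(coefficient_of_variation(prices_usd), 1))  # 10.0
```

Although the prices have the larger standard deviation in absolute terms (100 versus 10), the weights vary more relative to their mean.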

Shape of a data set

Shape of Data

The third property of a data set is related to the way the data are distributed. All descriptions of shape are taken relative to how symmetric the data set is. A data set which is not symmetric is said to be asymmetrical or skewed.

Symmetrical data set

A data set is considered symmetric if the mean and median are equal. That is, $\bar{X} = \text{Median}$.

A data set is right-skewed if the mean is greater than the median, $\bar{X} > \text{Median}$. In other words, there are some extremely large values in the data.

A data set is left-skewed if the mean is less than the median, $\bar{X} < \text{Median}$. That is, there are some extremely small values in the data.
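
The mean-versus-median comparison can be sketched in a few lines of Python. The function name `describe_shape` and the tolerance `tol` are assumptions made for this illustration; real data rarely have a mean and median that agree exactly, so in practice the comparison is a rough guide only.

```python
import math
import statistics


def describe_shape(data, tol=1e-9):
    """Classify shape by comparing the mean with the median."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if math.isclose(mean, median, abs_tol=tol):
        return "symmetric"
    return "right-skewed" if mean > median else "left-skewed"


print(describe_shape([0, 1, 2, 3, 4]))     # symmetric
print(describe_shape([1, 2, 3, 4, 100]))   # right-skewed
print(describe_shape([-100, 1, 2, 3, 4]))  # left-skewed
```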

Question 1

Which of the following is sensitive to extreme values?

1. The median
2. The interquartile range
3. The arithmetic mean
4. The 1st quartile, Q1

Question 2
A sociologist recently conducted a survey of citizens over 60 years of age whose net worth is too high to qualify for subsidized medical care and who have no private health insurance. A summary of the ages of the 25 uninsured senior citizens was as follows:

The average age is 74.04, the median age is 73, the first quartile is 65, and the third quartile is 81.

Identify which of the statements is correct.
1. One fourth of the senior citizens sampled are below 64 years of age
2. The middle 50% of the senior citizens sampled are between 65 and 73 years of age
3. 25% of the senior citizens sampled are older than 81 years of age
4. All of the above are correct

Example: Sample: {5, 7, 1, 2, 4}

        X_i    (X_i - \bar{X})    (X_i - \bar{X})^2    X_i^2
          5          1.2                1.44             25
          7          3.2               10.24             49
          1         -2.8                7.84              1
          2         -1.8                3.24              4
          4          0.2                0.04             16
sum      19          0                 22.8              95

$$\bar{X} = \frac{19}{5} = 3.8$$

$$S^2 = \frac{22.8}{4} = 5.7 \quad\text{and}\quad S = 2.387$$

$$CV = \frac{S}{\bar{X}} \cdot 100\% = 62.8\%$$
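
The figures in this example can be checked with a few lines of Python. This is only a verification sketch; the variable names are illustrative, and the shortcut formula based on the sum of squares (95) is included as a cross-check.

```python
sample = [5, 7, 1, 2, 4]
n = len(sample)

mean = sum(sample) / n                     # 19 / 5
deviations = [x - mean for x in sample]    # these sum to zero (up to rounding)
squared = [d ** 2 for d in deviations]
variance = sum(squared) / (n - 1)          # 22.8 / 4
std_dev = variance ** 0.5
cv = std_dev / mean * 100

# Shortcut formula: (sum of squares - n * mean^2) / (n - 1)
shortcut = (sum(x * x for x in sample) - n * mean ** 2) / (n - 1)

print(round(mean, 1), round(variance, 1), round(std_dev, 3), round(cv, 1))
# 3.8 5.7 2.387 62.8
```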

Describing Central Tendency

Some measures of central tendency for numerical data

Population (N objects) vs. Sample (n objects) // Parameter vs. Statistic

The sample mean $\bar{x}$ is a point estimate of the population mean $\mu$:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The Median $M_d$ is the middlemost measurement in the ordering:
if n = 2p + 1 is odd, it is the (p + 1)th measurement;
if n = 2p is even, it is the average of the pth and (p + 1)th measurements.

The Mode $M_0$ is the measurement that occurs most frequently.
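
The odd/even rule for the median and the definition of the mode can be written out directly. The sketch below is illustrative: the function names are assumptions, and `modes` returns every most-frequent value because a data set may have more than one mode.

```python
from collections import Counter


def median(data):
    """Middlemost value of the ordered data, following the odd/even rule."""
    ordered = sorted(data)
    n = len(ordered)
    p = n // 2
    if n % 2 == 1:
        # n = 2p + 1: the (p + 1)th value, which is index p when counting from 0
        return ordered[p]
    # n = 2p: average of the pth and (p + 1)th values
    return (ordered[p - 1] + ordered[p]) / 2


def modes(data):
    """All values that occur most frequently (ties are possible)."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


print(median([5, 7, 1, 2, 4]))     # 4
print(median([5, 7, 1, 2, 4, 9]))  # 4.5
print(modes([1, 2, 2, 3, 3, 4]))   # [2, 3]
```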

Measure of Variation

Some measures of variation for numerical data

The Range is the largest measurement minus the smallest measurement
The sample Variance
The Standard Deviation

Percentiles, Quartiles, and Box-and-Whiskers Displays

Questions
1. For each of the following random variables, determine whether the variable is categorical or numerical (quantitative).
a) Number of telephones per household
b) Type of telephone primarily used
c) Number of long-distance calls made per month
d) Length (in minutes) of longest long-distance call made per month
e) Color of telephone primarily used
f) Monthly charge (in dollars and cents) for long-distance calls made
g) Ownership of a cellular phone

Summarizing Data/Exploratory data Analysis

The 5-number summary provides a way of determining the shape of data based on the quantities

Xsmallest, Q1, median, Q3, Xlargest

Box-and-whisker plots use the 5-number summary to graphically represent the data.
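
A 5-number summary can be computed with Python's standard library, as in the sketch below. Quartile conventions differ between textbooks and software, so the `"inclusive"` method of `statistics.quantiles` used here is an assumption and may not match the rule used elsewhere in this course; `five_number_summary` is an illustrative name.

```python
import statistics


def five_number_summary(data):
    """Return (Xsmallest, Q1, median, Q3, Xlargest)."""
    ordered = sorted(data)
    # One common quartile convention, assumed here for illustration
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


print(five_number_summary([4, 0, 3, 1, 2]))  # (0, 1.0, 2.0, 3.0, 4)
```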

[Figure: box-and-whisker plots for RIGHT-SKEWED DATA, SYMMETRICAL DATA, and LEFT-SKEWED DATA, each marked with Xsmallest, Q1, median, Q3, and Xlargest]

Skewness

[Figure: Symmetry/Skewness. Three distribution shapes: skewed to the right, symmetrical, skewed to the left]

Descriptive measures of the Population

Suppose that $X_1, \ldots, X_N$ now represents all possible values from a population of size N.

1. Note that in some cases the population size N may be unknown or infinite. An example of a finite population is the collection of students at UST.

2. Similar to the arithmetic mean of the sample, we can calculate the true mean of the population, denoted as $\mu$:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

3. The population variance $\sigma^2$ and the population standard deviation $\sigma$ of the data $X_1, X_2, \ldots, X_N$ are calculated as

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2 \quad\text{and}\quad \sigma = \sqrt{\sigma^2}$$
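
The difference between the population and sample formulas is only the divisor (N versus n - 1). A minimal sketch with Python's `statistics` module, treating the illustrative values {0, 1, 2, 3, 4} first as a complete population and then as a sample:

```python
import statistics

values = [0, 1, 2, 3, 4]

# Treating the values as a whole population: divide by N
mu = statistics.mean(values)
sigma_sq = statistics.pvariance(values)
sigma = statistics.pstdev(values)

# Treating the same values as a sample: divide by n - 1
s_sq = statistics.variance(values)

print(float(mu), float(sigma_sq), round(sigma, 4), float(s_sq))
# 2.0 2.0 1.4142 2.5
```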

The phrase "within k standard deviations from the mean" refers to data which are in the interval

$$[\mu - k\sigma,\ \mu + k\sigma]$$

Based on a known standard deviation of the population, there are at least two rules which can tell us more about the clustering and distribution of the data.

(1) The Empirical Rule requires that the data histogram be symmetrical and bell-shaped.

k    Approximate % of data within k SD each way from the mean
1    68%
2    95%
3    99.7%
4    nearly all

(2) The Chebyshev Rule states that at least $1 - \frac{1}{k^2}$ of the data lie within k standard deviations of their mean (regardless of how skewed the data are).

k    1 - 1/k^2    (in %)
1    0            NA (the bound is 0%, so the rule gives no information)
2    3/4          (75%)
3    8/9          (89%)
4    15/16        (94%)
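
The Chebyshev bound is a one-line computation. In the sketch below, `chebyshev_lower_bound` is an illustrative name, and rejecting k <= 1 is a design choice for this example because the bound carries no information there.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(k, f"{chebyshev_lower_bound(k):.0%}")
# 2 75%
# 3 89%
# 4 94%
```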

Example
The mean is $\mu = 28.2$ and the standard deviation is $\sigma = 6.75$.
a. Within 1 standard deviation: between 21.45 and 34.95. Ans: Empirical Rule 68%
b. Within 2 standard deviations: between 14.7 and 41.7. Ans: Empirical Rule 95%
c. Between 21.45 and 34.95 using the Chebyshev Rule. Ans: NA
d. Between 14.7 and 41.7 using the Chebyshev Rule. Ans: 75%
e. Between 7.95 and 48.45 using the Chebyshev Rule. Ans: 89%
f. According to the Chebyshev Rule, at least 94% should have values within 4 standard deviations from the mean, that is, between 1.2 and 55.2
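
The interval endpoints in this example follow from mean ± k standard deviations. A small Python sketch reproduces them; the helper name `k_sigma_interval` is an assumption, and the rounding to two decimals is only there to hide floating-point noise.

```python
def k_sigma_interval(mu, sigma, k):
    """Interval [mu - k*sigma, mu + k*sigma], rounded to two decimals."""
    return round(mu - k * sigma, 2), round(mu + k * sigma, 2)


mu, sigma = 28.2, 6.75

for k in (1, 2, 3, 4):
    print(k, k_sigma_interval(mu, sigma, k))
# 1 (21.45, 34.95)
# 2 (14.7, 41.7)
# 3 (7.95, 48.45)
# 4 (1.2, 55.2)
```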

Coefficient of Correlation
The coefficient of correlation $\rho$ measures the strength of the linear relationship between two variables X and Y.

The correlation coefficient always satisfies

$$-1 \le \rho \le 1$$

If $\rho = 1$ then there is a perfect positive linear relationship. That is,
$$Y = a + bX$$
where b is a positive number.
If $\rho = -1$ then there is a perfect negative linear relationship. That is,
$$Y = a - bX$$
where b is a positive number.
If $\rho = 0$ then there is no correlation (no linear relationship) between X and Y.

Sample Coefficient of Correlation

The quantity $\rho$ measures the correlation for the entire population. One can use a scatter diagram.
In a sample one should not expect to see perfect correlations. In place of $\rho$ we can calculate the sample coefficient of correlation, denoted as r.
It measures the linear relationship between two numerical variables X and Y in a data set of size n:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}},$$

where $-1 \le r \le 1$.
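
The formula for r translates directly into code. The sketch below is illustrative: `sample_correlation` is an assumed name, and the two test data sets are made-up exact linear relationships, chosen so that the result is +1 or -1. If either variable is constant, the denominator is zero and r is undefined.

```python
import math


def sample_correlation(xs, ys):
    """Sample coefficient of correlation r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4]
print(round(sample_correlation(xs, [3 + 2 * x for x in xs]), 4))  # 1.0
print(round(sample_correlation(xs, [3 - 2 * x for x in xs]), 4))  # -1.0
```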

Example. X represents Energy cost and Y represents Price of refrigerator.

        X_i     Y_i    (X_i - \bar{X})    (Y_i - \bar{Y})    (X_i - \bar{X})(Y_i - \bar{Y})
         48     850         -19                 50                  -950
         54     760         -13                -40                   520
         58     900          -9                100                  -900
         66     870          -1                 70                   -70
         77    1100          10                300                  3000
         66     800          -1                  0                     0
         70     650           3               -150                  -450
         81     750          14                -50                  -700
         72     750           5                -50                  -250
         78     570          11               -230                 -2530
sum     670    8000           0                  0                 -2330

$\bar{X} = 67$ and $\bar{Y} = 800$

$$\sum_{i=1}^{10} (X_i - \bar{X})^2 = 1064$$

$$\sum_{i=1}^{10} (Y_i - \bar{Y})^2 = 189{,}400$$

$$r = \frac{-2330}{\sqrt{1064 \times 189{,}400}} = -0.1641$$

The value of r indicates a very weak negative relationship between price and energy cost.
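
The value of r for this example can be recomputed from the raw Xi and Yi columns. The sketch below is a verification aid only; the variable names are illustrative.

```python
import math

energy_cost = [48, 54, 58, 66, 77, 66, 70, 81, 72, 78]       # X
price = [850, 760, 900, 870, 1100, 800, 650, 750, 750, 570]  # Y

n = len(energy_cost)
x_bar = sum(energy_cost) / n  # 67.0
y_bar = sum(price) / n        # 800.0

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(energy_cost, price))
s_xx = sum((x - x_bar) ** 2 for x in energy_cost)
s_yy = sum((y - y_bar) ** 2 for y in price)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))  # -0.1641
```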
