CHAPTER+ONE+Descriptive+Statistics+ +univariate

Uploaded by

nila.vishwas
CHAPTER ONE: Introduction

What is statistics? There are two meanings, or uses, of the word:

I) Descriptive Statistics
methods for organizing and summarizing (often large amounts of) information.
The purpose is to represent a large data set (i.e. a bunch of numbers) in a clear and efficient manner.
Examples of descriptive statistics:
o your SAT score(s)
o the Dow Jones Industrial Average
o a baseball box score
o pie charts showing political polling results

II) Inferential Statistics


methods for drawing and measuring the reliability of conclusions about a population based on information
from a sample
Examples of inferential statistics:
o confidence intervals
o hypothesis tests
o regression analysis

Definitions
The entire body of people or things that you wish to investigate is called the population. The subset of the population
that you directly observe and examine (from which information is recorded) is called the sample.

The first step in a statistical study is that of identifying the population: it all depends on the question you’d like to
answer. (Information obtained from the entire population is called a census.) It is often the case that only some
members of the population can be examined. (Why?)

Examples
1. Suppose that you would like to know the average SAT score of all college freshmen in the United States.
The population: All college freshmen in the United States
A sample: 500 college freshmen, picked across various campuses

2. You would like to know who is favored in an upcoming congressional election in New Jersey.
The population: All registered voters (perhaps only those who “intend to vote”) in that district.
A sample: 1000 computer-selected voters

3. Does a newly developed drug work?


The population: All people who would ever take that new drug.
A sample: 200 people who agree to participate in a study to test the effects of the drug.

Why take samples? It is often the case that the population is much too big for you to gather information from every
single member. When using samples we must understand that there is no guarantee that our conclusions are 100%
correct. Ideally, you would like to use a sample that is “perfectly representative” of the population from which it
came, but there is no such guarantee and one cannot tell how representative it is simply by looking at the sample.
Instead, we must focus on the method of sampling.

Most sampling procedures involve random selection in some way, where the people or things are selected blindly,
or by an outside process that cannot be predicted.
Most common sampling methods:
1. Simple random sampling
2. Stratified random sampling
3. Cluster sampling

Definition
Simple random sampling is the sampling procedure that assigns equal likelihood to every possible sample of a given
size from the population.

Example
If two different students are to be selected from a class of 25, then a simple random sampling procedure would assign
equal likelihood to each of the 300 possible pairs of students. (We’ll see where “300” comes from later.)
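
As a preview of where "300" comes from, the count of unordered pairs can be checked directly. This is only a sketch; the student labels 1 to 25 are made up for illustration.

```python
import math
from itertools import combinations

class_size = 25

# Number of ways to choose 2 students out of 25 when order does not matter.
pairs = math.comb(class_size, 2)   # 25 * 24 / 2
print(pairs)

# Listing every pair explicitly gives the same count.
students = range(1, class_size + 1)   # hypothetical labels 1..25
all_pairs = list(combinations(students, 2))
print(len(all_pairs))
```

Under simple random sampling, each of these pairs would be chosen with probability 1/300.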

If the sample is to be obtained one-by-one, random sampling requires that all individual members of the population
have the same chance of being included in the sample at all times in the selection process. Two main types of sampling:

Sampling with replacement


each chosen person or thing is returned to the population to possibly be chosen again

Sampling without replacement


each chosen person or thing is removed from the population and cannot be chosen again

While most studies involve sampling without replacement, probabilities are more easily derived from samples chosen
with replacement. Fortunately, for most problems of interest, there isn’t much of a difference between “with
replacement” and “without replacement” as far as actual probabilities. (More on this later.)
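
The two sampling schemes can be sketched with Python's standard `random` module. The class of 25 labeled students, the seed, and the helper `second_draw_chance` are assumptions made for illustration, not part of the notes.

```python
import random

population = list(range(1, 26))   # 25 students, hypothetical labels 1..25
rng = random.Random(2023)         # fixed seed only so the run is repeatable

without_replacement = rng.sample(population, k=5)   # no student can repeat
with_replacement = rng.choices(population, k=5)     # a student may repeat
print(without_replacement, with_replacement)

def second_draw_chance(population_size, replace):
    """Chance that one particular not-yet-chosen member is picked on the second draw."""
    remaining = population_size if replace else population_size - 1
    return 1 / remaining

# The two schemes differ noticeably for a tiny population, barely for a large one.
for size in (25, 10_000):
    print(size, second_draw_chance(size, True), second_draw_chance(size, False))
```

For a population of 10,000 the two probabilities agree to about seven decimal places, which is the sense in which the choice of scheme rarely matters in practice.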

Before we learn how to summarize data, we first need to know what data is and how to classify it.

Definition
A variable is a characteristic that varies from one person or thing to another.

There are two fundamental types of variables:

o quantitative (numerical)
o qualitative (non-numerical, or categorical)

Examples of quantitative variables:
o the height of a building
o the number of pages in a newspaper
o the volume of a box
o the temperature in a room
o the dissolving time of a pill
o the number of floors in a building

Examples of qualitative variables:
o a person’s middle name
o the shape of a pill
o the company that makes a new drug
o the gender of a dog
o the state in which a US resident lives
o the presence (or lack of) a characteristic

Quantitative variables can be subdivided into two groups: discrete (if possible values are a countable set) and
continuous (if possible values are uncountable).

Definition
Observed values of a variable or variables are called data.

Example
The number of brothers a person has is a variable.
The statement “Greg has one brother” is data.

Definition
A data set is a collection of multiple observations of one or more variables.
We can summarize data sets with
• numbers
• tables
• pictures

Numerical summaries

The two most important characteristics of a quantitative data set:


• the "center" of the data set, or the most typical value
• the extent to which the data values are scattered

We can identify/quantify both with a pair of numbers


• measure of location
• measure of dispersion

In other words, we can use a very small set of (two) numbers to summarize a much larger set of numbers. By
themselves, these two measures can explain quite a bit about the data set from which they are computed.

The ordered values of a raw data set (from smallest to largest) will be denoted by

$x_1, x_2, x_3, \ldots, x_n$
where n = the total number of pieces of data.

Measure of Location
- a single number that identifies the “center”, or most typical value of a data set.

We will study these, in particular:


1. mean
2. median
3. midrange
4. trimmed mean
5. midquartile (later in the chapter)

Definitions
The mean of a data set is the average of all of the values.

Mean = $\dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$

The median of a data set is the value that divides the ordered data values into two equal halves – a lower half and
an upper half. It is defined separately for two cases, depending on the sample size:

if n is odd, then the median is the single value in the middle.

Median = $x_{(n+1)/2}$

if n is even, then the median is the average of the two middle values.

Median = $\dfrac{x_{n/2} + x_{n/2+1}}{2}$

NOTE: When finding the median, do not forget to order the data values first!

Examples
n is odd:
For a data set with n=173 values, the median is the ((173+1)/2) = 87th smallest value.

n is even:
For a data set with n=256 values, the median is the average of the (256/2) = 128th and the ((256/2)+1) = 129th values.
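
The two cases of the definition can be sketched as a short function. The name `median_by_position` and the small test lists are made up for illustration; Python lists are indexed from 0, so the k-th smallest value sits at index k - 1.

```python
def median_by_position(values):
    """Median following the odd/even rule above; the data are ordered first."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2            # e.g. n = 173 -> 87th smallest
        return ordered[position - 1]
    lower_position = n // 2                # e.g. n = 256 -> 128th smallest
    upper_position = n // 2 + 1            # e.g. n = 256 -> 129th smallest
    return (ordered[lower_position - 1] + ordered[upper_position - 1]) / 2

print(median_by_position([7, 3, 9]))       # odd n: the single middle value
print(median_by_position([7, 3, 9, 1]))    # even n: average of the two middle values
```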

PROBLEM 1.1
The following data set consists of exam scores for ten students:

74 88 90 70 74 82 12 84 62 84

(a) What is the mean score?


(b) What is the median score?

PROBLEM 1.2
Refer to the data set in Problem 1.1.
(a) Suppose that the instructor “curved” the scores by adding nine points to each one. What are the mean and
median of the curved scores?
(b) Can you explain what causes the difference between the mean and the median? (See the graphic below.)

The dotplot pictured below displays the exam scores from Problem 1.1.

Definition
A resistant measure is one which isn’t affected much (if at all) by outliers or extreme values in a data set.

If you can make a measure equal to whatever you want it to be by changing just one data value, then it is
not resistant. (Otherwise it is.)

How to determine if a measure is resistant:


1) Create, or at least imagine, a small set of numbers.
2) Compute the measure for these numbers.
3) Change the largest data value to something very large (like, say, 100,000,000)
4) Re-compute the measure
- if the measure “blows up”, it is not resistant
- if the measure “stays put”, it is resistant
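
The four steps above can be sketched in a few lines. The five-value data set is made up for illustration, and the mean and median are used only as two sample measures to try the check on.

```python
import statistics

data = [2, 4, 6, 8, 10]                  # step 1: a small, made-up set of numbers
changed = data[:-1] + [100_000_000]      # step 3: replace the largest value

# Steps 2 and 4: compute each measure before and after the change.
mean_before, mean_after = statistics.mean(data), statistics.mean(changed)
median_before, median_after = statistics.median(data), statistics.median(changed)

print("mean:  ", mean_before, "->", mean_after)      # blows up
print("median:", median_before, "->", median_after)  # stays put
```

The same check can be repeated for any other measure in this chapter by swapping in its formula.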

Definition
The midrange of a data set is the average of the smallest and largest data values.

Midrange = $\dfrac{x_1 + x_n}{2}$

Question: Is the midrange a resistant measure?

Definition
A trimmed mean of a data set is the mean of a specified fraction of data values in the middle; i.e. the same number
of values on the low and high ends are “trimmed” off and the remaining values are averaged.

Trimmed means are identified by the percentage of data values that are trimmed off.

Example
For a data set with n = 50 values, the 20% trimmed mean drops 50 × 0.20 = 10 data values. Since half of 10 is 5:

the five smallest ($x_1, \ldots, x_5$) are trimmed off

the five largest ($x_{46}, \ldots, x_{50}$) are trimmed off

and the remaining 40 values in the “middle” ($x_6, \ldots, x_{45}$) are averaged.
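
A minimal sketch of this recipe follows. The function name `trimmed_mean`, the whole-number `percent` argument, and the rule of rounding down when the count per end is not a whole number are assumptions made for illustration; the data set 1, 2, ..., 50 is made up.

```python
def trimmed_mean(values, percent):
    """Mean after trimming `percent`% of the values in total, half from each end."""
    ordered = sorted(values)
    n = len(ordered)
    per_end = (n * percent) // 200       # half of the trimmed count, rounded down (assumption)
    kept = ordered[per_end : n - per_end]
    if not kept:
        raise ValueError("trimming removed every data value")
    return sum(kept) / len(kept)

data = list(range(1, 51))                # made-up data set with n = 50
print(trimmed_mean(data, 20))            # averages the 6th through 45th values
```

Note that some software specifies the fraction trimmed from each end rather than the total, so the percentage convention should be checked before comparing results.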

PROBLEM 1.3
Recall once again the ten exam scores from Problem 1.1:

74 88 90 70 74 82 12 84 62 84

Find:
(a) the 20% trimmed mean
(b) the 40% trimmed mean
(c) the 60% trimmed mean
(d) the 80% trimmed mean

By itself, a measure of location does not provide an adequate summary. It can tell you where the center of data
values is on the number line, but nothing else.

Example
Suppose that two brothers, Jimmy and Billy, each have six children of their own. The ages of their children:

Jimmy’s children: 10 11 12 13 14 15
Billy’s children: 3 8 9 16 17 22

Dotplots for each data set are given below.

Jimmy’s children:

Billy’s children:

Both data sets have the same mean, the same median, the same midrange, etc. Yet they are very different. How?

Measure of dispersion
- a single number that measures the extent to which the values of a data set are scattered

We will concentrate on these:


1. range
2. standard deviation
3. median absolute deviation
4. interquartile range (later)

Definition
The range of a data set is the difference between the largest and smallest data values.

Range = $x_n - x_1$

PROBLEM 1.4
For the exam scores in Problem 1.1, compute the range. Is the range a resistant measure?

*The range is easy to compute, but it is very wasteful: it uses only two of the data values. A better measure of dispersion is desired.

Before we proceed, we need to define some special sums:

Data values: $x_1, x_2, x_3, \ldots, x_n$

$\sum x_i = x_1 + x_2 + x_3 + \cdots + x_n$
(sum of data)

Now let $\bar{x}$ denote the mean of the data values.

$\sum (x_i - \bar{x})^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$
(sum of squared deviations)

So you have a bunch of data values: $x_1, x_2, x_3, \ldots, x_n$. You would like to quantify the amount of scatter.

Compute the mean as a measure of location and ask yourself: are the data values “close to” or “far from” $\bar{x}$?
In other words, consider these values:

$(x_1 - \bar{x}), (x_2 - \bar{x}), (x_3 - \bar{x}), \ldots, (x_n - \bar{x})$

These are called the deviations from the mean – they are the differences between each data value and the overall
mean. Ask yourself what will these deviations look like if
- the data values are close together?
- the data values are widely scattered?

Clearly, these deviations contain information about the degree to which the original data values are scattered.
We would like to have a single number that measures scatter, so should we simply add up these deviations? NO!

$\sum (x_i - \bar{x}) = 0$ ALWAYS!!! (i.e. for every data set)

Consider squaring each of the deviations first …..

$(x_1 - \bar{x})^2, (x_2 - \bar{x})^2, (x_3 - \bar{x})^2, \ldots, (x_n - \bar{x})^2$
….. before adding them up.

$\sum (x_i - \bar{x})^2$ WORKS!

The sum of squared deviations will successfully measure the degree of scatter in a data set. This “special” sum is the
main component in the formula for standard deviation.
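
The earlier claim that the unsquared deviations always add up to zero follows in one line from the definition of the mean, $\bar{x} = \frac{1}{n}\sum x_i$:

```latex
\sum (x_i - \bar{x})
  = \sum x_i - n\bar{x}
  = \sum x_i - n \cdot \frac{\sum x_i}{n}
  = 0
```

The positive and negative deviations cancel exactly, which is why they are squared before being added.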

Definition
The standard deviation of a data set is given by:

Standard Deviation = $\sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Example
For the exam scores from Problem 1.1, the standard deviation is computed with the help of a three-column chart:
We need $n$, $\bar{x}$, and $\sum (x_i - \bar{x})^2$.

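
The three-column chart for the Problem 1.1 exam scores can be rebuilt with a short script. This is a sketch of the calculation described in these notes, not a reproduction of the original chart; the column headings and variable names are chosen here for illustration.

```python
import math

scores = [74, 88, 90, 70, 74, 82, 12, 84, 62, 84]   # exam scores from Problem 1.1
n = len(scores)
mean = sum(scores) / n

# Column 1: data value, column 2: deviation from the mean, column 3: squared deviation.
rows = [(x, x - mean, (x - mean) ** 2) for x in sorted(scores)]

print(f"{'x':>4} {'x - mean':>10} {'(x - mean)^2':>14}")
for x, deviation, squared in rows:
    print(f"{x:>4} {deviation:>10.1f} {squared:>14.1f}")

sum_of_squares = sum(squared for _, _, squared in rows)
std_dev = math.sqrt(sum_of_squares / (n - 1))
print("n =", n, " mean =", mean, " sum of squared deviations =", sum_of_squares)
print("standard deviation =", round(std_dev, 2))
```

The second column sums to zero, as it must, while the third column supplies the numerator of the standard deviation formula.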
For the children’s ages in the earlier example of Jimmy and Billy, we see how the standard deviations reflect the difference in variability:

Jimmy’s children: Billy’s children:
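
The two standard deviations can be computed from the ages listed earlier. The sketch below uses `statistics.stdev` from Python's standard library, which divides by n - 1 as in the definition above.

```python
import statistics

jimmy = [10, 11, 12, 13, 14, 15]   # ages of Jimmy's children
billy = [3, 8, 9, 16, 17, 22]      # ages of Billy's children

for name, ages in (("Jimmy", jimmy), ("Billy", billy)):
    print(name, "mean =", statistics.mean(ages),
          "standard deviation =", round(statistics.stdev(ages), 2))
```

Both groups have mean 12.5, but the standard deviation for Billy's children comes out several times larger, reflecting how much more widely their ages are scattered.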

NOTE: Standard deviation can also be expressed as:

Standard Deviation = $\sqrt{\dfrac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1}}$

For the exam scores in Problem 1.1:
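
As a check, the shortcut form and the defining form of the standard deviation can be evaluated side by side on the Problem 1.1 exam scores. This is a sketch; the variable names are chosen here for illustration.

```python
import math

scores = [74, 88, 90, 70, 74, 82, 12, 84, 62, 84]   # exam scores from Problem 1.1
n = len(scores)

sum_x = sum(scores)                      # sum of the data
sum_x2 = sum(x * x for x in scores)      # sum of the squared data

# Shortcut form: no deviations from the mean are needed.
shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Defining form: based on squared deviations from the mean.
mean = sum_x / n
definition = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

print(sum_x, sum_x2, round(shortcut, 2), round(definition, 2))
```

The two forms agree apart from floating-point rounding; the shortcut is convenient by hand because it needs only the two running totals.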

PROBLEM 1.5
Suppose that the instructor of a college class wishes to curve an exam by adding 9 points to the raw scores (as in
Problem 1.2). What will happen to the standard deviation of these new “curved” scores?

Consider the next measure of dispersion which uses the median as its initial measure of location, instead of the
mean.

The Median Absolute Deviation (MAD):

The MAD works like standard deviation except that it relies on the median as its initial measure of location.
Given a set of data: $x_1, x_2, x_3, \ldots, x_n$ (in column 1 of a table)

Step 1 Find the median of the data set. Call it $\tilde{x}$.
Step 2 Subtract $\tilde{x}$ from each data value. This yields the deviations from the median. (column 2)
Step 3 Take absolute values of the deviations from the median. (column 3)
Step 4 Find the median of these absolute values. This is the MAD.

The MAD can be found – and more importantly, understood – with the help of another three-column chart.

Example:
Consider the following small data set, consisting of the ages of ten night-school students:

32 34 22 60 25 38 32 44 28 42

First create a three column chart with the ordered data in column 1, and find the median. The deviations from the
median and their absolute values fill out the chart.

$x$     $x - \tilde{x}$     $|x - \tilde{x}|$
22      -11      11
25       -8       8
28       -5       5
32       -1       1
32       -1       1
34        1       1
38        5       5
42        9       9
44       11      11
60       27      27

Median = $\tilde{x}$ = 33

Finally: order the values in the third column and find the median again ….

1 1 1 5 5 8 9 11 11 27

MAD = (5+8)/2 = 6.5
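
The four steps can also be sketched in code, using the same ten ages. Only `statistics.median` from the standard library is used; the variable names are chosen here for illustration.

```python
import statistics

ages = [32, 34, 22, 60, 25, 38, 32, 44, 28, 42]   # ages of the ten night-school students

median_age = statistics.median(ages)                    # Step 1
deviations = [x - median_age for x in sorted(ages)]     # Step 2 (column 2)
absolute_deviations = [abs(d) for d in deviations]      # Step 3 (column 3)
mad = statistics.median(absolute_deviations)            # Step 4

print("median =", median_age)
print("MAD =", mad)
```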

Question: Is the MAD resistant?

Some notation

It is either the case that our data set constitutes a population in itself (if those values are all that interest us) or it is a
sample from some larger population. (It all depends on what our objective is!)

To distinguish between the two cases, we must introduce some notation.

If we are summarizing a population:


population size: N
population mean: μ
population standard deviation: σ

If we are summarizing a sample:


sample size: n
sample mean: $\bar{x}$
sample standard deviation: s

Definitions
A parameter is a descriptive measure for a population; a statistic is a descriptive measure for a sample.

Statistics are used as estimates of unknown parameters! (In other words, $\bar{x}$ is used to estimate μ and s is used to
estimate σ.)

NOTES:
1. The variance of a data set, denoted by $s^2$ or $\sigma^2$, is simply the standard deviation “squared”.

2. The standard deviation of a population is defined by

$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$

The “n-1” used for the sample standard deviation is used primarily for estimation purposes. Observe that both
formulas adequately serve the same descriptive purpose.
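
Python's standard `statistics` module provides both versions: `pstdev` divides by N and `stdev` divides by n - 1. The sketch below applies both to the Problem 1.1 exam scores purely to compare the two formulas; whether those scores are treated as a population or as a sample depends on the objective.

```python
import statistics

scores = [74, 88, 90, 70, 74, 82, 12, 84, 62, 84]   # exam scores from Problem 1.1

sigma = statistics.pstdev(scores)   # population formula: divide by N
s = statistics.stdev(scores)        # sample formula: divide by n - 1

print("population formula:", round(sigma, 2))
print("sample formula:    ", round(s, 2))
```

The sample version is always slightly larger, and the gap shrinks as the number of data values grows.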

Quartiles

For a given data set, suppose the median is found. Divide the data values into two equal halves: a lower half and
an upper half.

• if n is odd, then include the median in both halves


• this is not done if n is even

Now find the median of each half. This yields the three quartiles of the data set: Q1, Q2, Q3

• 1st quartile (Q1) = median of lower half of data set


• 2nd quartile (Q2) = median of the full data set (reminder: you find this first)
• 3rd quartile (Q3) = median of upper half of data set

Quartiles give us another measure of location and another measure of dispersion:

Midquartile = $\dfrac{Q_1 + Q_3}{2}$ (location)

Interquartile Range = $Q_3 - Q_1$ (dispersion)
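
The quartile rule in these notes (split at the median, and include the median in both halves when n is odd) can be sketched as below. The function name `quartiles` is made up for illustration, and the data are the ten night-school ages from the MAD example. Other textbooks and software packages use slightly different quartile conventions, so results may not match exactly.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from these notes."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[: half + 1]   # odd n: the median belongs to both halves
    else:
        lower = ordered[:half]        # even n: clean split
    upper = ordered[half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)

ages = [32, 34, 22, 60, 25, 38, 32, 44, 28, 42]
q1, q2, q3 = quartiles(ages)
midquartile = (q1 + q3) / 2
iqr = q3 - q1
print(q1, q2, q3, midquartile, iqr)
```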

PROBLEM 1.6
Recall the data set consisting of the exam scores for the ten students in Problem 1.1:

74 88 90 70 74 82 12 84 62 84

Find the quartiles and compute the midquartile (MQ) and the interquartile range (IQR).

It is still the case that the mean and standard deviation are the most popular choices for measuring location and
dispersion, respectively. The following theorem illustrates the significance of both.

Chebyshev’s Theorem

Suppose that, for a particular data set, you are only told its mean and its standard deviation. Nothing else. These two
numbers give you a pretty good idea where “most” of the values fall on the number line. You know for certain:

- at least 75% of the data values fall within 2 standard deviations of the mean
- at least 88.9% (8/9) of the data values fall within 3 standard deviations of the mean
- at least 93.75% (15/16) of the data values fall within 4 standard deviations of the mean

Generally speaking, for any K > 1, at least $\left(1 - \dfrac{1}{K^2}\right) \times 100\%$ of the data values fall within K standard deviations of the mean.

Example
You read in the newspaper that the population of seniors at City High School averaged 1030 on the SATs last year
with a standard deviation of 80. Given no other information, you know that:

- at least 75% of the seniors scored within $\mu \pm 2\sigma = 1030 \pm 2(80)$, i.e. in the interval (870, 1190)

- at least 88.9% of the seniors scored within $\mu \pm 3\sigma = 1030 \pm 3(80)$, i.e. in the interval (790, 1270)
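
The bound and the intervals in this example can be reproduced with a few lines. The function names `chebyshev_bound` and `interval` are made up for illustration.

```python
def chebyshev_bound(k):
    """Minimum fraction of data values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """Endpoints of the interval mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd

mean, sd = 1030, 80   # City High School SAT example
for k in (2, 3, 4):
    low, high = interval(mean, sd, k)
    print(f"at least {chebyshev_bound(k):.2%} of scores lie in ({low}, {high})")
```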

Graphical summaries: tables and pictures

We will now look at:


o frequency/relative frequency distributions
o frequency/relative frequency histograms

One way to summarize quantitative data is to put them into groups. Divide the number line into non-overlapping
intervals of equal length (i.e. count by 1s, or by 5s, or by 100s, or ….). Then tally up the data by counting the
number of values that fall into each group. This gives you the frequencies for each group. Dividing the frequencies
by the total number yields the relative frequencies.
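
The tallying procedure can be sketched as follows. The twelve scores and the choice of counting by 10s are made up for illustration; they are not the midterm scores discussed in these notes.

```python
from collections import Counter

scores = [52, 67, 71, 88, 93, 75, 64, 79, 81, 58, 66, 73]   # made-up data
width = 10                                                   # count by 10s

# Label each value by the lower end of its group, e.g. 67 -> 60.
frequencies = Counter((score // width) * width for score in scores)
total = len(scores)
relative = {start: count / total for start, count in frequencies.items()}

print("group    freq  rel. freq")
for start in sorted(frequencies):
    print(f"{start}-{start + width - 1}  {frequencies[start]:>5}  {relative[start]:>9.3f}")
```

The frequencies add up to the number of data values, and the relative frequencies add up to 1.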

Example
The raw scores on an old statistics midterm (40 scores total):

Mean = 66.8
Median = 68.5

Standard deviation = 15.6

The scores are also summarized in a frequency distribution (left) and a relative frequency distribution (right):

Definition
A histogram is a graphical display of the information contained in a frequency distribution. Groups are marked off
on the horizontal axis and frequencies of each are determined by the heights of bars.

Histogram for the above exam scores:

Histograms further summarize a data set by describing its shape. Here are three common shapes of data sets that
have only one “peak”:
Question: In each of these examples, how do the mean and median compare to each other? (Is one bigger than the
other, and if so, which one?)

Stem and Leaf Diagrams

We can obtain the same graphical summary as a histogram while retaining the information of individual data
values. A Stem and Leaf diagram replaces the bars of a histogram with digits.

For the midterm exam scores from the histogram example above:

Digits to the left of the bar are the stems and digits to the right are the leaves. Each data value is split into a
stem and a leaf. Observe that the entire data set can be reconstructed from a stem and leaf diagram; therefore you
can still find the mean, median, etc.

If the stem and leaf diagram is rotated counter-clockwise by 90 degrees, we obtain the same visual summary as the
histogram, i.e. we can observe that the data set is left-skewed.
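
A stem and leaf diagram can be sketched in a few lines of code. The function name `stem_and_leaf` is made up for illustration, it assumes non-negative whole numbers with the tens as stems and the units as leaves, and the data are the ten night-school ages from the MAD example rather than the midterm scores.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the diagram as a list of text lines, one line per stem."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        leaves_by_stem[value // 10].append(value % 10)   # tens digit(s) | units digit
    lines = []
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())      # empty stems are kept to show gaps
    return lines

ages = [32, 34, 22, 60, 25, 38, 32, 44, 28, 42]
for line in stem_and_leaf(ages):
    print(line)
```

Every original value can be read back from the output, for example stem 3 with leaf 8 is the age 38.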

Example
A stem and leaf diagram for the raw scores from the first exam in Intro to Business Statistics, Fall 2023:
