
Chapter 1-Overview & Descriptive Statistics - Classroom Upload

The document provides an overview of data types, including univariate, bivariate, and multivariate data, as well as classifications of variables into quantitative and categorical types. It discusses methods of data collection, including census and sampling techniques, and introduces concepts such as enumerative and analytic studies. Additionally, it covers statistical visualization techniques like stem-and-leaf displays, histograms, and boxplots, along with measures of central tendency and variability.

Uploaded by

snehahussain6

Chapter 1

Overview &
Descriptive Statistics
Dr. Harpreet Kaur
2022-23
Populations, Samples, and Processes
Data is a collection of facts.
Univariate data records the value of only one variable for each
observation.
Multivariate data records the value of multiple variables for each
observation.
Bivariate data is a special case of multivariate data; there are two
variables quantified.
A variable is any characteristic whose value may change from one object
to another in the population.
Variables can be quantitative or categorical.
Categorical variables take values from a finite number of possibilities.
Quantitative variables, however, take numerical values.
Data can be classified into nominal, ordinal, interval, and ratio types, the first
two breaking up the “categorical” data type and the second two breaking up
the “quantitative” data type.

A population is a group of interest.


If we collect data for the entire population, we
have conducted a census.
Usually, though, we collect data for a subset of a
population, called a sample. Our objective is to
use the data in the sample to reach conclusions
about the population as a whole.
Enumerative Versus Analytic Studies
In enumerative studies, the population is a fixed, finite, tangible group that
presently exists.
In analytic studies the population may not presently exist.
Statistical conclusions depend crucially on how data is collected in surveys
and observational studies. If data is collected poorly, the results of the
analysis cannot be trusted.
Data can be collected using a simple random sample, in which each member
of the population of interest has the same chance of being included
in the sample. Alternatively, stratified sampling can be employed: the
population is divided into observable strata, and a simple random sample is
then selected from the individuals in each stratum. A third approach is
convenience sampling, which selects individuals in a way that is not
completely random.
An enumerative study is focused on obtaining information about, and taking
action on, specific items contained in a frame, which is a well-defined group of
physical items (for instance, sampling from a batch of product to answer the
question "Should the batch be rejected or accepted?"). Statistical inference made
from the data is applied to the remaining units in the frame. The goal is not to
characterise the process that produces the frame but to describe and act on
the frame.
An analytic study is focused on obtaining information from the system or
process under study and taking action on the cause system to improve
performance in the future (for instance, sampling from a batch of product to
answer the questions "Has the process or system changed as a result of our
actions?" or "Is the process consistently producing acceptable product?"). The
statistical inference made from the data is applied to the process. The goal is to
characterise the process that produces the frame, not to describe and act on
the frame.
A distribution describes what values a variable takes and
how frequently it takes them.
Visualization is an important first step in a statistical
project, as it reveals patterns that are difficult to describe
using numbers only, and could suggest what statistical
procedures are appropriate.
Constructing a Stem-and-Leaf Display
1. Select one or more leading digits to be the stem
values. The remaining digits are the leaf values.
2. Draw a vertical line and list the stem values to the left of
this line, in order.
3. Record the leaf of each observation in the row
corresponding to its stem value.
4. Somewhere in the display, indicate the units of the stem
and leaves. (For example, the stems start at the tens place,
and the leaves start at the ones place.)
A stem-and-leaf display conveys information about the
following aspects of the data:
➢ identification of a typical or representative value
➢ extent of spread about the typical value
➢ presence of any gaps in the data
➢ extent of symmetry in the distribution of values
➢ number and location of peaks
➢ presence of any outlying values
A teacher asked 10 of her students how many books they had read in the
last 12 months. Their answers were as follows:
12, 23, 19, 6, 10, 7, 15, 25, 21, 12
Prepare a stem and leaf plot for these data.
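The construction steps can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the function name `stem_and_leaf` is hypothetical, and it assumes non-negative whole-number data with the tens digit(s) as the stem and the ones digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of an ordered stem-and-leaf display.

    Assumption: non-negative integers, stem = tens, leaf = ones.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # split into tens and ones
        rows[stem].append(leaf)
    lines = []
    # List every stem in order, including empty ones, so gaps stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


books = [12, 23, 19, 6, 10, 7, 15, 25, 21, 12]
for line in stem_and_leaf(books):
    print(line)
# 0 | 6 7
# 1 | 0 2 2 5 9
# 2 | 1 3 5
# Units: stem = tens, leaf = ones
```

Read from the side, the display shows a single peak in the 10s row for the books data.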
The results of 41 students' math tests (with a best possible score of 70) are
recorded below:
31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50, 55,
18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42

1. Is the variable discrete or continuous? Explain.
2. Prepare an ordered stem and leaf plot for the data and briefly describe what it
shows.
3. Are there any outliers? If so, which scores?
4. Look at the stem and leaf plot from the side. Describe the distribution's main
features such as:
a. number of peaks
b. symmetry
c. value at the centre of the distribution
➢ The display suggests that a typical or representative
value is in the stem 4 row, perhaps in the mid-40%
range.
➢ The observations are not highly concentrated about this
typical value, as would be the case if all values were
between 20% and 49%.
➢ The display rises to a single peak as we move downward,
and then declines; there are no gaps in the display.
➢ The shape of the display is not perfectly symmetric, but
instead appears to stretch out a bit more in the direction
of low leaves than in the direction of high leaves.
➢ Lastly, there are no observations that are unusually far
from the bulk of the data (no outliers), as would be the
case if one of the 26% values had instead been 86%.
➢ At most colleges in the sample, at least one-quarter of
the students are binge drinkers. The problem of heavy
drinking on campuses is much more pervasive than
many had suspected.
Find the range of the data
represented in the given
stem & leaf Plot
For the given stem & leaf plot, what
is the median value?
What is the mode for the given stem & leaf plot?
A dotplot represents each data point as a dot along a real
number line, putting the point on the line according to its
value. If two points would be almost overlapping, they
would instead be stacked.
A dotplot gives information about location, spread, extremes,
and gaps.
A quantitative variable can be discrete (its possible values are countable)
or continuous (its possible values consist of entire intervals of
the real number line). Generally, discrete variables arise from counting,
while continuous variables arise from measurements.
The frequency of a value is the number of times that value was seen
in a dataset. For discrete variables it's reasonable to list the frequency
of each observed value, but for continuous variables this is not
reasonable. Instead, for continuous variables, we list the frequency of each
range (class interval) in which datapoints lie.
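The frequency of any particular 𝑥 value is the number of times that
value occurs in the data set. The relative frequency of a value is the
fraction or proportion of times the value occurs:
relative frequency of a value = (number of times the value occurs) / (number of observations in the data set)
Suppose, for example, that our data set consists of 200 observations on
𝑥 = the number of courses a college student is taking this term. If 70 of these 𝑥 values
are 3, then the frequency of the 𝑥 value 3 is 70 and its relative frequency is
70/200 = 0.35.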
A frequency distribution is a tabulation of frequencies or relative frequencies.
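A frequency distribution for discrete data can be tabulated as in the sketch below. The helper name `frequency_table` is hypothetical, and the data are made up for illustration: only the 70 threes out of 200 observations come from the example above, while the split of the remaining 130 values is an assumption.

```python
from collections import Counter


def frequency_table(values):
    """Map each observed value to (frequency, relative frequency)."""
    n = len(values)
    counts = Counter(values)
    return {x: (counts[x], counts[x] / n) for x in sorted(counts)}


# Hypothetical course counts: 70 threes (from the text), the rest assumed.
courses = [3] * 70 + [4] * 80 + [5] * 50

for x, (freq, rel_freq) in frequency_table(courses).items():
    print(x, freq, rel_freq)
```

The relative frequencies should sum to 1 (up to rounding), which is a quick check on any such table.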
Constructing a Histogram for Discrete Data
First, determine the frequency and relative frequency of each x value. Then mark
possible x values on a horizontal scale. Above each value, draw a rectangle whose
height is the relative frequency (or alternatively, the frequency) of that value.
How unusual is a no-hitter or a
one-hitter in a major league
baseball game, and how
frequently does a team get more
than 10, 15, or even 20 hits? The
given table is a frequency
distribution for the number of hits
per team per game for all nine-
inning games that were played
between 1989 and 1993.
Constructing a Histogram for Continuous Data
Determine the frequency and relative frequency for each class. Mark the class
boundaries on a horizontal measurement axis. Above each class interval, draw a
rectangle whose height is the corresponding relative frequency (or frequency).
Constructing a Histogram for Continuous Data: Unequal Class Widths
After determining frequencies and relative frequencies, calculate the height of
each rectangle using the formula
rectangle height = (relative frequency of the class) / (class width)
The resulting rectangle heights are usually called densities, and the vertical scale is
the density scale. This prescription will also work when class widths are equal.
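The density calculation can be sketched as follows. The function name `density_heights`, the sample values, and the class boundaries are all hypothetical, and the sketch assumes that an observation falling exactly on a boundary is counted in the class to its right.

```python
def density_heights(data, boundaries):
    """Rectangle heights on the density scale for the given class boundaries."""
    n = len(data)
    heights = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Assumed convention: each class includes its left boundary only.
        freq = sum(1 for x in data if left <= x < right)
        rel_freq = freq / n
        heights.append(rel_freq / (right - left))
    return heights


# Hypothetical data with unequal class widths 4, 6, and 10.
data = [1, 2, 2, 3, 5, 6, 8, 12, 15, 19]
boundaries = [0, 4, 10, 20]
heights = density_heights(data, boundaries)

# On the density scale the total area of the rectangles is 1.
widths = [b - a for a, b in zip(boundaries, boundaries[1:])]
total_area = sum(h * w for h, w in zip(heights, widths))
print(heights, total_area)
```

Because each rectangle's area equals its relative frequency, wide classes are not visually exaggerated the way they would be if height alone were the frequency.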
Histograms come in a variety of shapes.
A unimodal histogram is one that rises to a single peak and then declines.
A bimodal histogram has two different peaks. Bimodality can occur when the
data set consists of observations on two quite different kinds of individuals or
objects.
A histogram with more than two peaks is said to be multimodal. Of course, the
number of peaks may well depend on the choice of class intervals, particularly
with a small number of observations. The larger the number of classes, the more
likely it is that bimodality or multimodality will manifest itself.
A histogram is symmetric if the left half is a mirror image of the right half.
A unimodal histogram is positively skewed if the right or upper tail is stretched
out compared with the left or lower tail and negatively skewed if the stretching
is to the left.
Both a frequency distribution and a histogram can be constructed when the data
set is qualitative (categorical) in nature.
A Pareto diagram is a variation of a histogram for
categorical data resulting from a quality control
study. Each category represents a different type of
product nonconformity or production problem. The
categories are ordered so that the one with the
largest frequency appears on the far left, then the
category with the second largest frequency, and so
on. Suppose the following information on
nonconformities in circuit packs is obtained: failed
component, 126; incorrect component, 210;
insufficient solder, 67; excess solder, 54; missing
component, 131. Construct a Pareto diagram.
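The ordering step behind a Pareto diagram can be sketched with the circuit-pack counts given above. The function name `pareto_table` is hypothetical, and the cumulative percentage column is an optional extra that the exercise itself does not ask for.

```python
def pareto_table(counts):
    """Rows of (category, frequency, cumulative %) in decreasing frequency."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    cumulative = 0
    for category, freq in ordered:
        cumulative += freq
        rows.append((category, freq, round(100 * cumulative / total, 1)))
    return rows


nonconformities = {
    "failed component": 126,
    "incorrect component": 210,
    "insufficient solder": 67,
    "excess solder": 54,
    "missing component": 131,
}

for row in pareto_table(nonconformities):
    print(row)
```

The bars of the diagram are then drawn left to right in the order of these rows, starting with incorrect component (210).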
Quartiles:
Quartiles divide the data set into four equal parts, with the
observations above the third quartile constituting the upper
quarter of the data set, the second quartile being identical to
the median, and the first quartile separating the lower
quarter from the upper three-quarters.
Percentiles:
A data set (sample or population) can be even more finely
divided using percentiles; the 99th percentile separates the
highest 1% from the bottom 99%, and so on.
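Quartile conventions differ slightly between textbooks and software, so the following is only one possible sketch. It applies Python's standard `statistics.quantiles` with its default interpolation method (an assumption here, not necessarily the rule used in this chapter) to the 41 math test scores listed earlier.

```python
from statistics import median, quantiles

scores = [
    31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57,
    37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52,
    45, 35, 51, 63, 42,
]

# n=4 asks for the three cut points that split the data into four parts.
quartiles = quantiles(scores, n=4)
q1, q2, q3 = quartiles
print(q1, q2, q3)

# The second quartile coincides with the median.
print(q2 == median(scores))
```

Calling `quantiles(scores, n=100)` in the same way would return the 99 percentile cut points.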
Trimmed Mean:
A trimmed mean is a compromise between x̄ and x̃. A 10%
trimmed mean, for example, would be computed by
eliminating the smallest 10% and the largest 10% of the
sample and then averaging what remains.
If the desired trimming percentage is 100𝛼% and 𝑛𝛼 is not
an integer, the trimmed mean must be calculated by
interpolation.
x̄ = 3.65, x̃ = 3.35, x̄tr(7.7) = 3.42
Sample (n = 10): 350, 408, 540, 555, 575, 590, 608, 679, 815, 1285

Median: 582.5
Trimmed mean (20%): 591.1667
Trimmed mean (10%): 596.25
Trimmed mean (15%): 593.7083
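The trimmed means for the ten observations above can be reproduced with the sketch below. The function name `trimmed_mean` is hypothetical. When 𝑛𝛼 is not an integer, the sketch interpolates linearly between the two neighbouring trimmed means, which is one common reading of "interpolation" and agrees with the 15% value shown above.

```python
import math


def trimmed_mean(data, alpha):
    """100*alpha% trimmed mean, interpolating when n*alpha is not an integer."""
    xs = sorted(data)
    n = len(xs)

    def trim(k):
        kept = xs[k:n - k]  # drop the k smallest and k largest values
        return sum(kept) / len(kept)

    k_exact = round(n * alpha, 9)  # rounding guards against float noise
    lo, hi = math.floor(k_exact), math.ceil(k_exact)
    if 2 * hi >= n:
        raise ValueError("trimming proportion too large for this sample")
    if lo == hi:
        return trim(lo)
    weight = k_exact - lo  # how far n*alpha lies between lo and hi
    return (1 - weight) * trim(lo) + weight * trim(hi)


sample = [350, 408, 540, 555, 575, 590, 608, 679, 815, 1285]
print(round(trimmed_mean(sample, 0.10), 4))  # 596.25
print(round(trimmed_mean(sample, 0.20), 4))  # 591.1667
print(round(trimmed_mean(sample, 0.15), 4))  # 593.7083
```

Here 𝑛𝛼 = 1.5 for the 15% case, so the result is the average of the 10% and 20% trimmed means.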
When the data is categorical, a frequency distribution or
relative frequency distribution provides an effective tabular
summary of the data. The natural numerical summary
quantities in this situation are the individual frequencies and
the relative frequencies.
The range is the difference between the largest and smallest
sample values.
Deviations from Mean
Sum of Deviations from Mean
Squared Deviations from Mean
The Web site www.fueleconomy.gov contains a wealth of information about fuel
characteristics of various vehicles. In addition to EPA mileage ratings, there are many
vehicles for which users have reported their own values of fuel efficiency (mpg).
Consider the following sample of n=11 efficiencies for the 2009 Ford Focus equipped
with an automatic transmission (for this model, EPA reports an overall rating of 27
mpg–24 mpg for city driving and 33 mpg for highway driving):
The 𝑥𝑖's tend to be closer to their average x̄ than to the population average
𝜇, so to compensate for this the divisor 𝑛 − 1 is used rather than 𝑛.
In other words, if we used a divisor 𝑛 in the sample variance, then the
resulting quantity would tend to underestimate 𝜎² (produce estimated
values that are too small on the average), whereas dividing by the slightly
smaller 𝑛 − 1 corrects this underestimating.

𝑠² is based on 𝑛 − 1 degrees of freedom (df).
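A small numerical sketch may help connect deviations, their sum, and the two possible divisors. The six-value sample is hypothetical (it is not the fuel-efficiency data), chosen so that the arithmetic is easy to follow by hand.

```python
from statistics import mean, variance

sample = [2, 4, 4, 5, 7, 8]  # hypothetical data, mean is 5
n = len(sample)
xbar = mean(sample)

# Deviations from the mean always sum to zero.
deviations = [x - xbar for x in sample]
sum_dev = sum(deviations)

# Sum of squared deviations, then the two candidate divisors.
sum_sq = sum(d * d for d in deviations)
s2 = sum_sq / (n - 1)    # sample variance, divisor n - 1
divisor_n = sum_sq / n   # smaller value obtained with divisor n

print(sum_dev, sum_sq, s2, divisor_n)
print(variance(sample))  # the standard library also uses divisor n - 1
```

Because the deviations must sum to zero, only 𝑛 − 1 of them are free to vary, which is the sense in which 𝑠² has 𝑛 − 1 degrees of freedom.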


Boxplots can be used to describe several of a data set’s most prominent
features such as
(1) Center
(2) Spread
(3) the extent and nature of any departure from symmetry
(4) identification of “outliers”

Because even a single outlier can drastically affect the values of x̄ and s, a
boxplot is based on measures that are “resistant” to the presence of a few
outliers: the median and a measure of variability called the fourth spread
(the upper fourth minus the lower fourth).
Roughly speaking, the fourth spread is unaffected by the positions of those observations
in the smallest 25% or the largest 25% of the data. Hence it is resistant to outliers.

The simplest boxplot is based on the following five-number summary:
smallest 𝑥𝑖, lower fourth, median, upper fourth, largest 𝑥𝑖
➢ First, draw a horizontal measurement scale.
➢ Then place a rectangle above this axis; the left edge of the rectangle is at the lower
fourth, and the right edge is at the upper fourth .
➢ Place a vertical line segment or some other symbol inside the rectangle at the location
of the median; the position of the median symbol relative to the two edges conveys
information about skewness in the middle 50% of the data.
➢ Finally, draw “whiskers” out from either end of the rectangle to the smallest and
largest observations.
A boxplot with a vertical orientation can also be drawn by making obvious modifications
in the construction process.
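The five-number summary behind a boxplot can be sketched as follows, using the 19 pit-depth observations from the corrosion example in this chapter. The function name `five_number_summary` is hypothetical, and the sketch assumes the convention that the fourths are the medians of the lower and upper halves of the ordered data, with the median included in both halves when 𝑛 is odd.

```python
from statistics import median


def five_number_summary(data):
    """Return (smallest, lower fourth, median, upper fourth, largest)."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # the median joins both halves when n is odd
    lower_fourth = median(xs[:half])
    upper_fourth = median(xs[n - half:])
    return xs[0], lower_fourth, median(xs), upper_fourth, xs[-1]


pit_depths = [40, 52, 55, 60, 70, 75, 85, 85, 90, 90,
              92, 94, 94, 95, 98, 100, 115, 125, 125]

summary = five_number_summary(pit_depths)
fourth_spread = summary[3] - summary[1]
print(summary)        # (40, 72.5, 90, 96.5, 125)
print(fourth_spread)  # 24.0
```

The box would run from 72.5 to 96.5 with the median line at 90, and the whiskers would extend to 40 and 125.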
Ultrasound was used to gather the accompanying corrosion data on the thickness
of the floor plate of an aboveground tank used to store crude oil (“Statistical
Analysis of UT Corrosion Data from Floor Plates of a Crude Oil Aboveground
Storage Tank,” Materials Eval., 1994: 846–849); each observation is the largest pit
depth in the plate, expressed in milli-in.

40 52 55 60 70 75 85 85 90 90 92 94 94 95 98 100 115 125 125


x̃ = 92.17, lower fourth = 45.64, upper fourth = 167.79
fs = 122.15
1.5fs = 183.225
3fs = 366.45
325 325 334 339 356 356 359 359 363 364 364 366 369
370 373 373 374 375 389 392 393 394 397 402 403 424
