Chapter 1-Overview & Descriptive Statistics - Classroom Upload
Overview &
Descriptive Statistics
Dr. Harpreet Kaur
2022-23
Populations, Samples, and Processes
Data is a collection of facts.
Univariate data records the value of only one variable for each
observation.
Multivariate data records the value of multiple variables for each
observation.
Bivariate data is a special case of multivariate data in which exactly two
variables are recorded for each observation.
A variable is any characteristic whose value may change from one object
to another in the population.
Variables can be quantitative or categorical.
Categorical variables take values from a finite number of possibilities.
Quantitative variables, however, take numerical values.
Data can be classified into nominal, ordinal, interval, and ratio types, the first
two breaking up the “categorical” data type and the second two breaking up
the “quantitative” data type.
Suppose, for example, that our data set consists of 200 observations on
𝑥 = the number of courses a college student is taking this term. If 70 of
these 𝑥 values are 3, then the frequency of the 𝑥 value 3 is 70 and its
relative frequency is 70/200 = 0.35.
A frequency distribution is a tabulation of frequencies or relative frequencies.
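The tabulation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `frequency_distribution` and the small sample of course counts are hypothetical.

```python
from collections import Counter

def frequency_distribution(values):
    """Return {x: (frequency, relative frequency)} with keys in increasing order."""
    counts = Counter(values)
    n = len(values)
    return {x: (f, f / n) for x, f in sorted(counts.items())}

# Hypothetical sample: number of courses taken by 10 students (made-up data).
courses = [3, 4, 5, 3, 4, 4, 6, 3, 5, 4]
for x, (freq, rel) in frequency_distribution(courses).items():
    print(f"x = {x}: frequency = {freq}, relative frequency = {rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on any hand-built table.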
Constructing a Histogram for Discrete Data
First, determine the frequency and relative frequency of each x value. Then mark
possible x values on a horizontal scale. Above each value, draw a rectangle whose
height is the relative frequency (or alternatively, the frequency) of that value.
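As a rough sketch of this construction, the following Python function draws a text version of the histogram, turned on its side so that each possible x value gets one row and the bar length is proportional to the relative frequency. The function name `text_histogram`, the `width` parameter, and the sample data are assumptions made for illustration only.

```python
from collections import Counter

def text_histogram(values, width=40):
    """Return one text row per integer x value between the smallest and largest
    observed values; bar length is proportional to the relative frequency."""
    counts = Counter(values)
    n = len(values)
    rows = []
    for x in range(min(counts), max(counts) + 1):  # keep x values with zero frequency
        rel = counts[x] / n
        rows.append(f"{x:>3} | {'#' * round(rel * width)} {rel:.3f}")
    return rows

# Hypothetical discrete data (made up, not the table from the slides).
hits = [0, 1, 2, 2, 3, 3, 3, 5]
print("\n".join(text_histogram(hits)))
```

Values that never occur still get a row (with an empty bar), so gaps in the distribution remain visible.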
How unusual is a no-hitter or a
one-hitter in a major league
baseball game, and how
frequently does a team get more
than 10, 15, or even 20 hits? The
given table is a frequency
distribution for the number of hits
per team per game for all nine-
inning games that were played
between 1989 and 1993.
Constructing a Histogram for Continuous Data
Determine the frequency and relative frequency for each class. Mark the class
boundaries on a horizontal measurement axis. Above each class interval, draw a
rectangle whose height is the corresponding relative frequency (or frequency).
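The class counts behind such a histogram can be sketched as follows. The code assumes a convention the slides do not state: an observation that falls exactly on a class boundary is placed in the class to its right, so each class is of the form [lower, upper). The function name `class_frequencies` and the sample values are hypothetical.

```python
from bisect import bisect_right

def class_frequencies(values, boundaries):
    """Count observations in classes [b0, b1), [b1, b2), ...

    Returns (frequencies, relative frequencies). Assumes `boundaries` is sorted
    and that a value on a boundary belongs to the class on its right.
    """
    k = len(boundaries) - 1
    freqs = [0] * k
    for v in values:
        i = bisect_right(boundaries, v) - 1  # index of the class containing v
        if not 0 <= i < k:
            raise ValueError(f"{v} lies outside the class boundaries")
        freqs[i] += 1
    n = len(values)
    return freqs, [f / n for f in freqs]

# Made-up measurements with classes [1, 2), [2, 3), [3, 4).
freqs, rel = class_frequencies([1.0, 2.5, 2.0, 3.9], [1, 2, 3, 4])
print(freqs, rel)
```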
Constructing a Histogram for Continuous Data: Unequal Class Widths
After determining frequencies and relative frequencies, calculate the height of
each rectangle using the formula
rectangle height = (relative frequency of the class) / (class width)
The resulting rectangle heights are usually called densities, and the vertical scale is
the density scale. This prescription will also work when class widths are equal.
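A small Python sketch of the density calculation, taking each rectangle height as the class's relative frequency divided by its class width (the usual definition of density for a histogram). The function name `density_heights` and the class counts are made up for illustration.

```python
def density_heights(freqs, boundaries):
    """Return one rectangle height per class: relative frequency / class width."""
    n = sum(freqs)
    heights = []
    for f, lower, upper in zip(freqs, boundaries, boundaries[1:]):
        heights.append((f / n) / (upper - lower))
    return heights

# Hypothetical classes [0, 2), [2, 4), [4, 10) with unequal widths.
freqs = [10, 30, 10]
boundaries = [0, 2, 4, 10]
heights = density_heights(freqs, boundaries)

# On the density scale each rectangle's area equals its relative frequency,
# so the total area is 1.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, boundaries, boundaries[1:])]
print(heights, sum(areas))
```

Using area rather than height to represent relative frequency keeps a wide class from looking more important than it is.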
Histograms come in a variety of shapes.
A unimodal histogram is one that rises to a single peak and then declines.
A bimodal histogram has two different peaks. Bimodality can occur when the
data set consists of observations on two quite different kinds of individuals or
objects.
A histogram with more than two peaks is said to be multimodal. Of course, the
number of peaks may well depend on the choice of class intervals, particularly
with a small number of observations. The larger the number of classes, the more
likely it is that bimodality or multimodality will manifest itself.
A histogram is symmetric if the left half is a mirror image of the right half.
A unimodal histogram is positively skewed if the right or upper tail is stretched
out compared with the left or lower tail and negatively skewed if the stretching
is to the left.
Both a frequency distribution and a histogram can be constructed when the data
set is qualitative (categorical) in nature.
A Pareto diagram is a variation of a histogram for
categorical data resulting from a quality control
study. Each category represents a different type of
product nonconformity or production problem. The
categories are ordered so that the one with the
largest frequency appears on the far left, then the
category with the second largest frequency, and so
on. Suppose the following information on
nonconformities in circuit packs is obtained: failed
component, 126; incorrect component, 210;
insufficient solder, 67; excess solder, 54; missing
component, 131. Construct a Pareto diagram.
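The ordering step for this exercise can be sketched in Python using the counts given above. The function name `pareto_table` is hypothetical, and the cumulative proportion column is an optional addition that Pareto diagrams often carry; only the ordering by frequency is required by the definition.

```python
def pareto_table(counts):
    """Return rows (category, frequency, cumulative proportion),
    ordered from the largest frequency to the smallest."""
    total = sum(counts.values())
    rows, cumulative = [], 0
    for category, freq in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += freq
        rows.append((category, freq, cumulative / total))
    return rows

# Nonconformity counts from the circuit pack example.
nonconformities = {
    "failed component": 126,
    "incorrect component": 210,
    "insufficient solder": 67,
    "excess solder": 54,
    "missing component": 131,
}
for category, freq, cum in pareto_table(nonconformities):
    print(f"{category:<20} {freq:>4} {cum:6.1%}")
```

The bars of the diagram are then drawn left to right in the order of these rows.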
Quartiles:
Quartiles divide the data set into four equal parts: the first
quartile separates the lower quarter from the upper
three-quarters, the second quartile is identical to the median,
and the observations above the third quartile constitute the
upper quarter of the data set.
Percentiles:
A data set (sample or population) can be even more finely
divided using percentiles; the 99th percentile separates the
highest 1% from the bottom 99%, and so on.
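Quartiles and percentiles can be computed with Python's standard `statistics.quantiles` function, as sketched below. Several conventions exist for interpolating between observations, so the `method="inclusive"` choice here is an assumption and other software or textbooks may give slightly different values. The helper names `quartiles` and `percentile` are hypothetical.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3), interpolating linearly between ordered observations."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q1, q2, q3

def percentile(data, p):
    """Return the p-th percentile for an integer p from 1 to 99."""
    cuts = statistics.quantiles(data, n=100, method="inclusive")  # 99 cut points
    return cuts[p - 1]

# Made-up sample for illustration.
sample = [1, 2, 3, 4, 5]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)                       # the middle value is the median
print(q2 == statistics.median(sample))
```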
Trimmed Mean:
A trimmed mean is a compromise between the sample mean x̄ and the sample median x̃. A 10%
trimmed mean, for example, would be computed by
eliminating the smallest 10% and the largest 10% of the
sample and then averaging what remains.
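A minimal Python sketch of this calculation follows. It assumes the trimming proportion times the sample size is a whole number; when it is not, this version simply rounds the number of dropped observations down, which is one of several possible conventions. The function name `trimmed_mean` and the data are made up.

```python
def trimmed_mean(data, proportion):
    """Average what remains after dropping the smallest and largest
    `proportion` of the observations (e.g. 0.10 for a 10% trimmed mean)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # observations dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one large outlier.
sample = [2, 4, 5, 6, 7, 8, 9, 10, 12, 90]
print(sum(sample) / len(sample))   # mean, pulled up by the outlier
print(trimmed_mean(sample, 0.10))  # 10% trimmed mean
```

For this sample the mean is 15.3 and the median is 7.5, and the 10% trimmed mean falls between the two.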
Because even a single outlier can drastically affect the values of x̄ and s, a
boxplot is based on measures that are “resistant” to the presence of a few
outliers: the median and a measure of variability called the fourth spread.
Roughly speaking, the fourth spread is unaffected by the positions of those observations
in the smallest 25% or the largest 25% of the data. Hence it is resistant to outliers.
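The resistance property can be checked with a short Python sketch. The slides do not spell out how the fourth spread is computed, so the code assumes the common textbook definition: split the ordered data into a lower and an upper half (including the median in both halves when the sample size is odd), take the median of each half as the lower and upper fourth, and subtract. The function name `fourth_spread` and the data are hypothetical.

```python
import statistics

def fourth_spread(data):
    """Upper fourth minus lower fourth, where the fourths are the medians of the
    upper and lower halves (the median belongs to both halves when n is odd)."""
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2  # size of each half
    lower_fourth = statistics.median(ordered[:half])
    upper_fourth = statistics.median(ordered[n - half:])
    return upper_fourth - lower_fourth

# Replacing the largest value with an extreme outlier leaves the fourth spread
# unchanged, while the sample standard deviation changes a great deal.
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]
print(fourth_spread(clean), fourth_spread(with_outlier))
print(statistics.stdev(clean), statistics.stdev(with_outlier))
```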