Quant Descriptive Statistics
Skewness or symmetry:
Histogram
Common graph for quantitative data. The horizontal axis is a number
line broken into ranges and the vertical axis is the count or frequency.
Since the median is the midpoint of the data, 50% of the values are
below it. Hence, it is also the 50th percentile.
Variation or dispersion
Dispersion refers to the degree of variation in the data; that is, the
numerical spread (or compactness) of the data. How spread out is the
data?
As variation in the data increases, all measures of variation take larger
values.
Range
The range is the simplest measure of variation. The range is the
difference between the maximum value and the minimum value in the
data set.
Range = max value – min value
The range is affected by outliers, and is often used only for very small
data sets.
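A minimal sketch of the range calculation and its sensitivity to a single outlier. The data values and the helper name value_range are made up for illustration.

```python
# Hypothetical data; the extra value 40 plays the role of an outlier.
data = [12, 15, 11, 14, 13]
data_with_outlier = data + [40]


def value_range(values):
    """Range = max value - min value."""
    return max(values) - min(values)


range_clean = value_range(data)                 # 15 - 11
range_outlier = value_range(data_with_outlier)  # 40 - 11
print(range_clean, range_outlier)
```

One extreme value is enough to change the range from 4 to 29, which is why the range is rarely used alone for larger data sets.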
Variance
The variance is roughly the average squared deviation from the mean.
Here is the formula for the sample variance:
s² = Σ (xᵢ – x̄)² / (n – 1)
where x̄ is the sample mean and n is the number of observations.
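As a sketch, the sample variance can be computed directly from its usual definition (the sum of squared deviations from the mean, divided by n – 1) and checked against Python's statistics.variance, which uses the same divisor. The data and the helper name sample_variance are hypothetical.

```python
import statistics

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


s2 = sample_variance(data)
print(s2, statistics.variance(data))
```

The squared deviations here sum to 32, so the sample variance is 32/7.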
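Interquartile range (IQR)
The interquartile range is the difference between the third and first
quartiles: IQR = Q3 – Q1.
This includes only the middle 50% of the data and, therefore, is not
influenced by extreme values.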
Boxplots
Boxplots (or box-and-whisker plots) are graphical displays built from
the five-number summary. The five-number summary consists of the
min, Q1, median, Q3, and max.
The box extends from Q1 to Q3. A line is drawn inside the box at the
value of the median. The “whiskers” extend to the values of the min
and max.
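The five-number summary behind a boxplot can be sketched with the standard library. Quartile conventions differ between textbooks and software; the "inclusive" method of statistics.quantiles used here is one assumption among several, and the data are hypothetical.

```python
import statistics

# Hypothetical data set.
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using one quartile convention."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


summary = five_number_summary(data)
print(summary)
```

The box would be drawn from the second to the fourth value of the tuple, with a line at the third, and the whiskers would reach the first and last values.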
In addition, boxplots are often modified to incorporate outlier
detection rules based on distances beyond the quartiles and either
1.5×IQR, for potential outliers, or 3×IQR for probable outliers.
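[Figure: modified boxplot showing Q1, the median, and Q3, with inner fences 1.5 × IQR and outer fences 3.0 × IQR beyond the quartiles. Points between the inner and outer fences are marked as possible outliers; points beyond the outer fences are marked as probable outliers.]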
Z-scores
A standardized value, commonly called a z-score, provides a relative
measure of the distance an observation is from the mean, which is
independent of the units of measurement.
Subtracting the mean from all data values centers the data set at 0.
Dividing all of the centered values by the standard deviation scales the
values to a new standard deviation of 1.
The process of standardizing data with z-scores is often called
“centering and scaling” the data.
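The centering and scaling steps can be sketched as follows: z = (x – mean) / standard deviation. This sketch assumes the sample standard deviation is used; the data and the helper name z_scores are hypothetical.

```python
import statistics

# Hypothetical data set with mean 70.
data = [50, 60, 70, 80, 90]


def z_scores(values):
    """Center at the mean, then scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample (n - 1) version
    return [(x - mean) / sd for x in values]


z = z_scores(data)
print(z)
print(statistics.mean(z), statistics.stdev(z))
```

After standardizing, the values have mean 0 and standard deviation 1 (up to floating-point rounding), whatever the original units were.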
Impact of outliers
Outliers pull the mean toward them.
Outliers inflate the value of the range, variance, and standard
deviation.
(Note: non-resistant measures of variation, such as these, get larger
when outliers are present.)
Outliers also impact other statistics, such as the correlation,
coefficients of regression models, etc.
Some statistics are resistant to the effects of outliers, like the median
and IQR.
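A small sketch comparing the same summaries with and without one outlier. The data are hypothetical, and the quartiles use the "inclusive" method of statistics.quantiles, which is only one of several conventions.

```python
import statistics

# Hypothetical data; 60 is added as a single outlier.
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [60]


def summarize(values):
    """Collect non-resistant (mean, stdev) and resistant (median, IQR) measures."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,
    }


before = summarize(clean)
after = summarize(with_outlier)
for name in before:
    print(name, before[name], after[name])
```

In this example the mean and standard deviation move substantially, while the median and IQR shift only slightly.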
Identifying outliers
There are several common rules of thumb for identifying outliers.
1) Values above Q3 + 1.5×IQR or below Q1 – 1.5×IQR, which are called
the “inner fences,” are potential outliers.
2) Values above Q3 + 3×IQR or below Q1 – 3×IQR, which are called the
“outer fences,” are probable outliers/extreme values.
3) Values with z-scores above +3 or below –3 are potential outliers.
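The three rules of thumb can be sketched as below. The data and the function names are hypothetical, the quartiles use the "inclusive" convention of statistics.quantiles, and the z-scores assume the sample standard deviation.

```python
import statistics

# Hypothetical data with one moderate and one extreme value.
data = [10, 12, 13, 14, 15, 16, 18, 28, 60]


def classify_by_fences(values):
    """Split flagged values into potential (beyond inner fences only)
    and probable (beyond outer fences) outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    probable = [x for x in values if x < outer_low or x > outer_high]
    potential = [
        x for x in values
        if (x < inner_low or x > inner_high) and x not in probable
    ]
    return potential, probable


def flag_by_z_score(values, threshold=3.0):
    """Return values whose z-score is above +threshold or below -threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]


potential, probable = classify_by_fences(data)
z_flagged = flag_by_z_score(data)
print(potential, probable, z_flagged)
```

Here the fences flag 28 as a potential outlier and 60 as a probable one, while the z-score rule flags nothing: in a sample this small, the outlier inflates the standard deviation enough to keep its own z-score under 3. The rules can disagree, so they are best treated as screening tools.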
Choosing appropriate measures
The mean and standard deviation are the most popular measures of
center and variation. If the data is roughly symmetric in shape and
contains no obvious outliers, these measures are acceptable.
The median and IQR, which are both resistant to the impact of outliers,
should be strongly considered when the data contains outliers or is
strongly skewed in shape.
Categorical data
Categorical data is fundamentally different from quantitative/numeric
data.
Averages, standard deviations, and other summary statistics often
make no sense for categorical data.
Sample proportion
The sample proportion, denoted by p or p̂, is the fraction of data that
have a certain characteristic or that belong to a certain category.
Proportions are key descriptive statistics for categorical data, such as
defects or errors in quality control applications or consumer
preferences in market research.
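A minimal sketch of a sample proportion for a quality control setting. The inspection results and the helper name sample_proportion are made up for illustration.

```python
# Hypothetical inspection results for 10 items.
inspections = ["ok", "defect", "ok", "ok", "defect",
               "ok", "ok", "ok", "ok", "ok"]


def sample_proportion(values, category):
    """Fraction of observations that belong to the given category."""
    return sum(1 for v in values if v == category) / len(values)


p_hat = sample_proportion(inspections, "defect")
print(p_hat)
```

Two of the ten items are defective, so the sample proportion of defects is 0.2.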
Frequency distribution
A frequency distribution displays the values of a categorical variable
and one or more measures derived from the count of how often each
category occurs in the data.
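A frequency distribution with counts and relative frequencies can be sketched with collections.Counter. The survey responses and the helper name frequency_distribution are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses (preferred brand).
responses = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]


def frequency_distribution(values):
    """Map each category to (count, relative frequency), most common first."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in counts.most_common()}


table = frequency_distribution(responses)
for category, (count, rel_freq) in table.items():
    print(category, count, rel_freq)
```

The same counts are what a pie chart or bar chart of this variable would display.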
Pie chart
Pie charts show the whole group of cases as a circle sliced into pieces
with sizes proportional to the fraction of the whole in each category.
Bar chart
A bar chart displays the distribution of a categorical variable, showing
the counts for each category next to each other for easy comparison.
Contingency table
The frequencies of two categorical variables can be summarized and
displayed simultaneously using a contingency table (or
crosstabulation):
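As an illustration, a contingency table can be built by counting pairs of categories. The records (region, answer) and the helper name contingency_table are hypothetical; a library such as pandas offers a crosstab function for the same purpose, but only the standard library is used here.

```python
from collections import Counter

# Hypothetical records: (region, would recommend the product?).
records = [
    ("north", "yes"), ("north", "no"), ("north", "yes"),
    ("south", "no"), ("south", "no"), ("south", "yes"),
    ("north", "yes"), ("south", "no"),
]


def contingency_table(pairs):
    """Nested dict: row category -> column category -> count."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```

Each row of the table gives the counts needed for one cluster (or one stacked bar) in a clustered or stacked bar chart.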
Other bar charts
More elaborate bar charts, such as clustered or stacked bar charts, can
be created from contingency tables.
Practice
A) Which is most likely true for the distribution of “percentage of time
actually spent taking notes in class,” which is displayed in the
histogram?
(a) mean > median
(b) mean ~ median
(c) mean < median
(d) impossible to tell
Practice
B) Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)
Practice
C) If someone's gross annual income has a z-score of +2.3, what can be
concluded?
o Their income is 2.3 standard deviations below the mean income.
o Their income is 2.3 standard deviations above the mean income.
o Their income is 2.3 times the mean income.
o Their income is 2.3 standard deviations above the median income.
Practice
D) A community college school board is negotiating a new contract with the college
faculty. The distribution of faculty salaries is skewed right by several faculty
members who make over $100,000 per year. If the school board wants to give the
community the impression that the faculty are already overpaid, should they
advertise the mean or median of the faculty salaries?
o The school board should use the mean to make their argument. The mean will be
higher than the median since it will be influenced by the few high salaries.
o The school board should use the median to make their argument. The median
will be lower than the mean since the mean is influenced by the few high salaries.
o The school board should use the mean to make their argument. The mean will be
lower than the median since the median is influenced by the few high salaries.
Practice
E) A company advertises a mean lifespan of 1000 hours for a particular type
of light bulb. If you were in charge of quality control at the factory, would
you prefer that the standard deviation of the lifespans for the light bulbs be 5
hours or 50 hours? Why?
o 50 hours would be preferable since a larger standard deviation indicates a
longer average lifespan for the light bulbs.
o 5 hours would be preferable since a smaller standard deviation indicates
more consistency.
o 50 hours would be preferable since a larger standard deviation indicates
more consistency.
o 5 hours would be preferable since a smaller standard deviation indicates a
longer average lifespan for the light bulbs.
Practice solution
A) Which is most likely true for the distribution of “percentage of time
actually spent taking notes in class,” which is displayed in the
histogram?
(c) mean < median
median: 80%, mean: 76%
The distribution is skewed to the left and the data values in the left
tail pull the mean toward them, but the median is unaffected.
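The same pattern can be reproduced with a small left-skewed data set. These values are hypothetical and are not the data behind the histogram; they only illustrate how a left tail pulls the mean below the median.

```python
import statistics

# Hypothetical percentages with a long left tail (not the histogram's data).
note_taking_pct = [20, 45, 60, 70, 75, 80, 80, 85, 90, 90, 95]

mean_pct = statistics.mean(note_taking_pct)
median_pct = statistics.median(note_taking_pct)
print(mean_pct, median_pct)
```

The few low values drag the mean to about 71.8, while the median stays at 80.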
Practice solution
B) Which of these variables do you expect to be uniformly distributed?
(d) birthdays of classmates (day of the month)