R22 Unit2 CH2

Uploaded by

227r1a67a3

Basic Statistical Descriptions of Data

Measuring the Central Tendency: Mean, Median, Mode


We look at various ways to measure the central tendency of data. Suppose that we
have some attribute X, like salary, which has been recorded for a set of objects. Let
x1, x2, . . . , xN be the set of N observed values or observations for X. If we were to
plot the observations for salary, where would most of the values fall?
Measures of central tendency include the mean, median, mode, and midrange.
Median:
In probability and statistics, the median generally applies to numeric data;
however, we may extend the concept to ordinal data. Suppose that a given
data set of N values for an attribute X is sorted in increasing order. If N is odd,
then the median is the middle value of the ordered set. If N is even, then the
median is not unique; it is the two middlemost values and any value in between.
Let us find the median of the salary data (in thousands of dollars):
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The data are already sorted in increasing order. There is an even number of
observations (12); therefore, the median is not unique. It can be any value within
the two middlemost values of 52 and 56 (that is, within the 6th and 7th values
in the list). By convention, we assign the average of the two middlemost values
as the median. That is, (52 + 56)/2 = 108/2 = 54. Thus, the median is $54K.
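The arithmetic above can be checked with a short sketch. It assumes the 12 salary values listed in the interquartile range example of this chapter; the variable names are illustrative only.

```python
from statistics import median

# Salary data in thousands of dollars, already sorted in increasing order.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(salaries)
# For even N, the two middlemost values sit at 0-based positions n//2 - 1 and n//2.
lower_middle = salaries[n // 2 - 1]
upper_middle = salaries[n // 2]

# By convention, the median is the average of the two middlemost values.
by_hand = (lower_middle + upper_middle) / 2
print(lower_middle, upper_middle, by_hand)  # 52 56 54.0

# The standard library follows the same convention for even N.
print(median(salaries))  # 54.0
```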
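• The median is expensive to compute when we have a large number of
observations. Assume that data are grouped in intervals according to their
xi data values and that the frequency of each interval is known. For
example, employees may be grouped according to their annual salary in
intervals such as 10–20K, 20–30K, and so on.
• Let the interval that contains the median frequency be the median
interval. We can approximate the median of the entire data set (e.g., the
median salary) by interpolation using the formula:
median = L1 + ((N/2 − (Σ freq)l) / freq_median) × width,
where L1 is the lower boundary of the median interval, N is the number of
values in the entire data set, (Σ freq)l is the sum of the frequencies of all
intervals lower than the median interval, freq_median is the frequency of the
median interval, and width is the width of the median interval.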
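As a rough illustration of the interpolation, the sketch below assumes the usual grouped-data form: lower boundary of the median interval, plus the fraction of that interval needed to reach N/2, times the interval width. The function name `approximate_median` and the interval frequencies are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def approximate_median(intervals):
    """Approximate the median from grouped data.

    intervals: list of (lower, upper, frequency) tuples sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("unreachable for non-empty data")


# Hypothetical salary intervals in thousands of dollars: (lower, upper, count).
grouped = [(10, 20, 200), (20, 30, 450), (30, 40, 300), (40, 50, 50)]
estimate = approximate_median(grouped)
print(round(estimate, 2))  # 26.67
```

Here N = 1000, so N/2 = 500 falls in the 20–30K interval, and the estimate is 20 + (500 − 200)/450 × 10.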
Mode:
The mode for a set of data is the value that occurs most frequently in the
set. Therefore, it can be determined for qualitative and quantitative
attributes. It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode. Data sets with one,
two, or three modes are respectively called unimodal, bimodal, and
trimodal.
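A minimal sketch of finding the mode or modes, assuming the 12 salary values from the interquartile range example and a made-up list of colors to show a qualitative attribute:

```python
from statistics import multimode

# Salary data in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns every value that shares the greatest frequency.
modes = multimode(salaries)
print(modes)  # [52, 70]

labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
kind = labels.get(len(modes), "multimodal")
print(kind)  # bimodal

# The mode is also defined for qualitative data (hypothetical values).
colors = ["red", "blue", "red", "green"]
print(multimode(colors))  # ['red']
```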
Measuring the Dispersion of Data:

• Range, Quartiles, and the Interquartile Range (IQR):


Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. The range of the set is the
difference between the largest (max()) and smallest (min()) values. Suppose that the data for attribute
X are sorted in increasing numeric order.
Imagine that we can pick certain data points so as to split the data distribution into equal-sized
consecutive sets. These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-sized consecutive sets.
• The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q − k)/q of the data values are more than x,
where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
The quartiles give an indication of the center, spread, and shape of a distribution. The
first quartile, denoted by Q1, is the 25th percentile. The third quartile, denoted by Q3,
is the 75th percentile. The second quartile is the 50th percentile. As the median, it gives
the center of the data distribution. The distance between the first and third quartiles is a
simple measure of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.
• Example 2.10 Interquartile range.
• The quartiles are the three values that split the sorted data set into four
equal parts. The data of Example 2.2.1 contain 12 observations
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110), already sorted in
increasing order. Thus, the quartiles for this data are the 3rd, 6th, and
9th values, respectively, in the sorted list. Therefore, Q1 = $47K and
Q3 = $63K.
• Thus, the interquartile range is IQR = 63 − 47 = $16K. (Note that the 6th
value is a median, $52K, although this data set has two medians since
the number of data values is even.)
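The example can be reproduced with a short sketch that follows the same convention, taking the 3rd, 6th, and 9th of the 12 sorted values. The helper `quartile` is hypothetical and assumes N is a multiple of 4. Library routines such as `statistics.quantiles` interpolate between values by default, so they may report slightly different quartiles.

```python
# Salary data in thousands of dollars, sorted in increasing order.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def quartile(sorted_values, k):
    """Return the k-th quartile as the (k * N / 4)-th value (1-based).

    Assumption: N is divisible by 4, as in the example (N = 12).
    """
    position = k * len(sorted_values) // 4
    return sorted_values[position - 1]


q1, q2, q3 = (quartile(salaries, k) for k in (1, 2, 3))
iqr = q3 - q1
value_range = max(salaries) - min(salaries)

print(q1, q2, q3)   # 47 52 63
print(iqr)          # 16
print(value_range)  # 80
```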
Five-Number Summary, Boxplots, and Outliers
No single numeric measure of spread (e.g., IQR) is very useful for describing
skewed distributions. Have a look at the symmetric and skewed data distributions.
In the symmetric distribution, the median splits the data into equal-size
halves. This does not occur for skewed distributions.
Therefore, it is more informative to also provide the two quartiles Q1 and Q3,
along with the median.
A common rule of thumb for identifying suspected outliers is to single out
values falling at least 1.5 × IQR above the third quartile or below the first
quartile.
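The rule of thumb can be sketched as follows, assuming Q1 = 47 and Q3 = 63 from the interquartile range example. The function name `suspected_outliers` is illustrative.

```python
def suspected_outliers(values, q1, q3):
    """Return values at least 1.5 * IQR above Q3 or below Q1."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v <= lower_fence or v >= upper_fence]


# Salary data in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# IQR = 16, so the fences are 47 - 24 = 23 and 63 + 24 = 87.
outliers = suspected_outliers(salaries, q1=47, q3=63)
print(outliers)  # [110]
```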
• Because Q1, the median, and Q3 together contain no information about
the endpoints (e.g., tails) of the data, a fuller summary of the shape of a
distribution can be obtained by providing the lowest and highest data
values as well. This is known as the five-number summary.
• The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order of Minimum, Q1, Median, Q3, Maximum.
• Boxplots are a popular way of visualizing a distribution.
A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles, so that the box length is
the interquartile range, IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.
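The five values that a boxplot draws can be computed before any plotting. The sketch below is one possible approach: it takes quartiles by position as in the interquartile range example (an assumption that only holds when N is a multiple of 4) and uses the averaged-middle convention for the median.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: len(values) is a non-zero multiple of 4, so Q1 and Q3
    are simply the (N/4)-th and (3N/4)-th sorted values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a non-empty data set whose size is a multiple of 4")
    q1 = ordered[n // 4 - 1]
    q3 = ordered[3 * n // 4 - 1]
    return ordered[0], q1, median(ordered), q3, ordered[-1]


# Salary data in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(salaries)
print(summary)  # (30, 47, 54.0, 63, 110)
```

The box would span 47 to 63 with the median line at 54. Many plotting tools stop the whiskers at the most extreme values that are not flagged as outliers and plot the outliers individually.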
Figure 2.3 shows boxplots for unit price data for items sold at four branches of
All Electronics during a given time period. For branch 1, we see that the median
price of items sold is $80, Q1 is $60, and Q3 is $100. Notice that two outlying
observations for this branch were plotted individually, as their values of 175 and
202 are more than 1.5 times the IQR (here, 40) beyond the third quartile.
Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data
observations tend to be very close to the mean, while a high standard deviation indicates
that the data are spread out over a large range of values.
The basic properties of the standard deviation, σ, as a measure of spread
are:
• σ measures spread about the mean and should be considered only when
the mean is chosen as the measure of center.
• σ = 0 only when there is no spread, that is, when all observations have
the same value. Otherwise, σ > 0.
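The section gives no formula, so the sketch below assumes the population form: the variance is the average squared deviation from the mean, σ² = (1/N) Σ (xi − mean)², and σ is its square root. It uses the 12 salary values from the interquartile range example.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Salary data in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(salaries)
mean = sum(salaries) / n  # 58.0

# Population variance: average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in salaries) / n
sigma = sqrt(variance)

print(round(variance, 2))  # 379.17
print(round(sigma, 2))     # 19.47

# The standard library's population versions agree with the hand computation.
print(round(pvariance(salaries), 2), round(pstdev(salaries), 2))  # 379.17 19.47

# sigma is 0 only when every observation has the same value.
print(pstdev([5, 5, 5]))  # 0.0
```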
Graphic Displays of Basic Statistical Descriptions of Data
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution. First, it displays all of the data for the given attribute. Second, it plots
quantile information.
Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest for some ordinal or numeric attribute X.
Each observation, xi, is paired with a percentage, fi, which indicates that approximately
fi × 100% of the data are below the value, xi. Here fi = (i − 0.5)/N.
• These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is
slightly above zero) to 1 − 1/(2N) (which is slightly below one). On a quantile
plot, xi is graphed against fi.
For example, given the quantile plots of sales data for two different time
periods, we can compare their Q1, median, Q3, and other fi values at a glance.
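The points of a quantile plot can be computed as below, assuming fi = (i − 0.5)/N, which is consistent with the range 1/(2N) to 1 − 1/(2N) given above. The function name is illustrative, and the resulting pairs could be handed to any plotting library.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N for sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Salary data in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
points = quantile_plot_points(salaries)

first_f, first_x = points[0]
last_f, last_x = points[-1]
print(first_x, last_x)  # 30 110
# first_f is 1/(2N) and last_f is 1 - 1/(2N); successive f values differ by 1/N.
```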
• A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• It is a powerful visualization tool in that it allows the user to view whether there
is a shift in going from one distribution to another.
• Suppose that we have two sets of observations for the attribute or variable unit
price, taken from two different branch locations.
• Let x1, . . . , xN be the data from the first branch, and y1, . . . , yM be the data
from the second, where each data set is sorted in increasing order.
• If M = N (i.e., the number of points in each set is the same), then we simply
plot yi against xi, where yi and xi are both (i − 0.5)/N quantiles of their respective
data sets.
• If M < N (i.e., the second branch has fewer observations than the first), there
can be only M points on the q-q plot. Here, yi is the (i − 0.5)/M quantile of the y
data, which is plotted against the (i − 0.5)/M quantile of the x data.
This computation typically involves interpolation.
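A minimal sketch of building the q-q points for M ≤ N is given below. It assumes linear interpolation between neighboring sorted values, which is one common choice and not necessarily the one intended here. The function names and the unit-price data are hypothetical.

```python
import math


def quantile_at(sorted_values, f):
    """Value at fraction f, treating sorted_values[i - 1] as the (i - 0.5)/N quantile.

    Uses linear interpolation between the two neighboring observations.
    """
    n = len(sorted_values)
    position = f * n + 0.5                        # fractional 1-based position
    position = min(max(position, 1.0), float(n))  # clamp to the data range
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return sorted_values[lower - 1] * (1 - weight) + sorted_values[upper - 1] * weight


def qq_points(x_sorted, y_sorted):
    """Return M pairs (x quantile, y_i), where M = len(y_sorted) <= len(x_sorted)."""
    m = len(y_sorted)
    return [
        (quantile_at(x_sorted, (i - 0.5) / m), y_sorted[i - 1])
        for i in range(1, m + 1)
    ]


# Hypothetical unit prices from two branches, each sorted in increasing order.
branch_1 = [40, 50, 60, 70, 80, 90]  # N = 6
branch_2 = [45, 65, 85]              # M = 3
pairs = qq_points(branch_1, branch_2)
for x_q, y_q in pairs:
    print(round(x_q, 2), y_q)
```

With these made-up values every point lies on the line y = x, which would indicate no shift between the two branches.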
• Histograms: “Histos” means pole or mast, and “gram” means chart, so a
histogram is a chart of poles.
• Plotting histograms is a graphical method for summarizing the distribution of
a given attribute, X.
• If X is nominal, such as item type, then a pole or vertical bar is drawn for
each known value of X. The height of the bar indicates the frequency (i.e.,
count) of that X value. The resulting graph is more commonly known as a bar
chart.
• If X is numeric, the term histogram is preferred. The range of values for X
is partitioned into disjoint consecutive subranges.
• The subranges, referred to as buckets, are disjoint subsets of the data
distribution for X.
• The range of a bucket is known as the width. Typically, the buckets are
equal-width.
• For example, a price attribute with a value range of $1 to $200 (rounded
up to the nearest dollar) can be partitioned into subranges 1 to 20, 21 to
40, 41 to 60, and so on. For each subrange, a bar is drawn whose height
represents the total count of items observed within the subrange.
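The bucket counts behind such a histogram can be sketched as follows, assuming whole-dollar prices and equal-width buckets of 20 starting at $1. The price list and the function name are hypothetical.

```python
from collections import Counter


def equal_width_counts(values, low, width):
    """Count integer values per bucket.

    Bucket k covers low + k*width through low + (k + 1)*width - 1.
    Empty buckets are omitted from the result.
    """
    counts = Counter((v - low) // width for v in values)
    return {
        (low + k * width, low + (k + 1) * width - 1): counts[k]
        for k in sorted(counts)
    }


# Hypothetical unit prices, rounded up to the nearest dollar.
prices = [5, 12, 20, 21, 38, 40, 41, 59, 150, 200]
buckets = equal_width_counts(prices, low=1, width=20)
for (start, end), count in buckets.items():
    print(f"{start}-{end}: {count}")
# 1-20: 3
# 21-40: 3
# 41-60: 2
# 141-160: 1
# 181-200: 1
```

Each count would be the height of the bar drawn over that subrange.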
• A scatter plot is one of the most effective graphical methods for
determining if there appears to be a relationship, pattern, or trend
between two numeric attributes.
• The scatter plot is a useful method for providing a first look at bivariate
data to see clusters of points and outliers, or to explore the possibility of
correlation relationships.
• Two attributes, X and Y, are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated). Figure 2.8
shows examples of positive and negative correlations between two
attributes.
• If the pattern of plotted points slopes from lower left to upper right, this
means that the values of X increase as the values of Y increase, which
suggests a positive correlation (Figure 2.8(a)).
• If the pattern of plotted points slopes from upper left to lower right, then
the values of X increase as the values of Y decrease, suggesting a negative
correlation (Figure 2.8(b)). A line of best fit can be drawn in order to study
the correlation between the variables.
• Statistical tests for correlation are given in the discussion of data
integration. Figure 2.9 shows three cases for which there is no correlation
relationship between the two attributes in each of the given data sets.
Section 2.3.2 shows how scatter plots can be extended to n attributes,
resulting in a scatter plot matrix.
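To make the line of best fit concrete, the sketch below fits a least-squares line to two small made-up data sets and reads the direction of the relationship from the sign of the slope. Ordinary least squares is an assumption here, since the text does not say how the line is obtained, and the function name is illustrative.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical attribute values.
x = [1, 2, 3, 4, 5]
y_rising = [3, 5, 7, 9, 11]   # points slope from lower left to upper right
y_falling = [10, 8, 6, 4, 2]  # points slope from upper left to lower right

slope_up, intercept_up = best_fit_line(x, y_rising)
slope_down, intercept_down = best_fit_line(x, y_falling)

print(slope_up > 0)    # True: suggests a positive correlation
print(slope_down < 0)  # True: suggests a negative correlation
```

A slope close to zero would be consistent with the uncorrelated cases, although a scatter plot is still needed to spot nonlinear patterns.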
