Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, Mode
We now look at various ways to measure the central tendency of data. Suppose that we have some attribute X, like salary, which has been recorded for a set of objects. Let x1, x2, . . . , xN be the set of N observed values or observations for X. If we were to plot the observations for salary, where would most of the values fall? Measures of central tendency include the mean, median, mode, and midrange.

Median: In probability and statistics, the median generally applies to numeric data; however, we may extend the concept to ordinal data. Suppose that a given data set of N values for an attribute X is sorted in increasing order. If N is odd, then the median is the middle value of the ordered set. If N is even, then the median is not unique; it can be either of the two middlemost values or any value in between.

Let us find the median of the salary data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of dollars). The data are already sorted in increasing order. There is an even number of observations (12); therefore, the median is not unique. It can be any value within the two middlemost values of 52 and 56 (that is, within the 6th and 7th values in the list). By convention, we assign the average of the two middlemost values as the median. That is, (52 + 56)/2 = 108/2 = 54. Thus, the median is $54K.
• The median is expensive to compute when we have a large number of observations. Assume that data are grouped in intervals according to their xi data values and that the frequency of each interval is known. For example, employees may be grouped according to their annual salary in intervals such as 10–20K, 20–30K, and so on.
• Let the interval that contains the median frequency be the median interval. We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula:
median ≈ L1 + ((N/2 − (Σ freq)l) / freq_median) × width,
where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.

Mode: The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode.
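The median and mode rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the salary list is the twelve-value data set (in $K) used in this section's examples, and the grouped intervals passed to `grouped_median` are hypothetical. The grouped version takes the lower boundary of the median interval and adds the interpolated fraction of its width.

```python
from collections import Counter


def median(values):
    """Median of numeric data; averages the two middle values when N is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values sharing the greatest frequency (there may be several)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def grouped_median(intervals):
    """Approximate median from (lower, upper, frequency) tuples.

    Assumes the intervals are listed in increasing order and do not overlap.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # total frequency of the intervals lower than the current one
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= n / 2:
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no interval contains the median")


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(median(salaries))  # 54.0
print(modes(salaries))   # [52, 70]

# Hypothetical salary intervals (in $K) with made-up frequencies.
print(grouped_median([(10, 20, 5), (20, 30, 10), (30, 40, 5)]))  # 25.0
```

Because 52 and 70 each occur twice, this data set has two modes.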
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
Measuring the Dispersion of Data:
• Range, Quartiles, and the Interquartile Range (IQR):
Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. The range of the set is the difference between the largest (max()) and smallest (min()) values. Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that we can pick certain data points so as to split the data distribution into equal-sized consecutive sets. These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-sized consecutive sets.
• The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q − k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.

The quartiles give an indication of the center, spread, and shape of a distribution. The first quartile, denoted by Q1, is the 25th percentile. The third quartile, denoted by Q3, is the 75th percentile. The second quartile is the 50th percentile. As the median, it gives the center of the data distribution. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.
• Example 2.10 Interquartile range.
• The quartiles are the three values that split the sorted data set into four equal parts. The data of Example 2.2.1 contain 12 observations (30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110), already sorted in increasing order. Thus, the quartiles for this data are the 3rd, 6th, and 9th values, respectively, in the sorted list. Therefore, Q1 = $47K and Q3 = $63K.
• Thus, the interquartile range is IQR = 63 − 47 = $16K. (Note that the 6th value is a median, $52K, although this data set has two medians since the number of data values is even.)

Five-Number Summary, Boxplots, and Outliers
No single numeric measure of spread, such as the IQR, is very useful for describing skewed distributions.
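The interquartile range example above can be reproduced with a short Python sketch. It follows the positional convention used in that example (for 12 sorted values, the quartiles are the 3rd, 6th, and 9th values) and, as a simplifying assumption, only accepts data whose size is a multiple of four. The function name is hypothetical. Statistical libraries usually apply interpolation rules instead and may return slightly different quartiles for the same data.

```python
def quartiles_by_position(values):
    """Q1, Q2, Q3 taken as the values at positions N/4, N/2, 3N/4 (1-based).

    Assumption for this sketch: N is a positive multiple of 4, so the
    positions are whole numbers, as in the 12-observation example.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a non-empty data set with N divisible by 4")
    step = n // 4
    return ordered[step - 1], ordered[2 * step - 1], ordered[3 * step - 1]


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K

data_range = max(salaries) - min(salaries)
q1, q2, q3 = quartiles_by_position(salaries)
iqr = q3 - q1

print(data_range)   # 80
print(q1, q2, q3)   # 47 52 63
print(iqr)          # 16
```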
Have a look at the symmetric and skewed data distributions. In the symmetric distribution, the median splits the data into equal-size halves. This does not occur for skewed distributions; therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median. A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
• Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary.
• The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
• Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR. The median is marked by a line within the box. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.

Figure 2.3 shows boxplots for unit price data for items sold at four branches of AllElectronics during a given time period. For branch 1, we see that the median price of items sold is $80, Q1 is $60, and Q3 is $100, so the IQR is $40. Notice that two outlying observations for this branch were plotted individually, as their values of 175 and 202 lie more than 1.5 × IQR beyond the third quartile (that is, above Q3 + 1.5 × IQR = 100 + 60 = $160).

Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They indicate how spread out a data distribution is.
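As a rough illustration of these two measures, the sketch below computes them for the twelve salary observations (in $K) used in this section's examples. It assumes the population form, in which the variance is the average squared deviation from the mean (dividing by N) and the standard deviation is its square root; the sample form divides by N − 1 instead. The function names are made up for this example.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of empty data is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(round(population_variance(salaries), 2))  # 379.17
print(round(population_std(salaries), 2))       # 19.47
print(population_std([5, 5, 5]))                # 0.0, no spread at all
```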
A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values. The basic properties of the standard deviation, σ, as a measure of spread are:
• σ measures spread about the mean and should be considered only when the mean is chosen as the measure of center.
• σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.

Graphic Displays of Basic Statistical Descriptions of Data
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information. Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is the largest for some ordinal or numeric attribute X. Each observation, xi, is paired with a percentage, fi, which indicates that approximately fi × 100% of the data are below the value, xi.
• Let fi = (i − 0.5)/N. These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above zero) to 1 − 1/(2N) (which is slightly below one). On a quantile plot, xi is graphed against fi. For example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3, and other fi values at a glance.
• A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.
• Suppose that we have two sets of observations for the attribute or variable unit price, taken from two different branch locations.
• Let x1, . . . , xN be the data from the first branch, and y1, . . . , yM be the data from the second, where each data set is sorted in increasing order.
• If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi, where yi and xi are both (i − 0.5)/N quantiles of their respective data sets.
• If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot. Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the (i − 0.5)/M quantile of the x data. This computation typically involves interpolation.

Histograms
• “Histos” means pole and “gram” means chart, so a histogram is a chart of poles.
• Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.
• If X is nominal, such as item type, then a pole or vertical bar is drawn for each known value of X. The height of the bar indicates the frequency (i.e., count) of that X value. The resulting graph is more commonly known as a bar chart.
• If X is numeric, the term histogram is preferred. The range of values for X is partitioned into disjoint consecutive subranges.
• The subranges, referred to as buckets, are disjoint subsets of the data distribution for X.
• The range of a bucket is known as the width. Typically, the buckets are equal-width.
• For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest dollar) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn whose height represents the total count of items observed within the subrange.

Scatter Plots
• A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numeric attributes.
• The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relationships.
• Two attributes, X and Y, are correlated if one attribute implies the other. Correlations can be positive, negative, or null (uncorrelated).
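Returning to the q-q plot construction described earlier in this section, the pairing of quantiles can be sketched as follows. The sketch treats the i-th sorted value of a data set of size N as its (i − 0.5)/N quantile and, as an assumption of this example, uses linear interpolation between neighbouring observations when the requested quantile falls between two of them (the text only says that interpolation is typically involved). The function names and sample values are hypothetical.

```python
def quantile_at(sorted_data, f):
    """Value at fraction f, where the i-th sorted value (1-based) is the
    (i - 0.5)/N quantile; interpolates linearly between neighbours."""
    n = len(sorted_data)
    if n == 0:
        raise ValueError("need at least one observation")
    pos = f * n + 0.5  # 1-based fractional position of the f-quantile
    if pos <= 1:
        return sorted_data[0]
    if pos >= n:
        return sorted_data[-1]
    lower = int(pos)  # 1-based index of the observation just below pos
    frac = pos - lower
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)


def qq_points(x, y):
    """(x-quantile, y-quantile) pairs, one per point of the smaller data set."""
    xs, ys = sorted(x), sorted(y)
    if not xs or not ys:
        raise ValueError("both data sets must be non-empty")
    m = min(len(xs), len(ys))
    fractions = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile_at(xs, f), quantile_at(ys, f)) for f in fractions]


# Made-up unit prices: four observations at one branch, two at the other.
print(qq_points([10, 20, 30, 40], [1, 2]))  # [(15.0, 1), (35.0, 2)]
```

With only two observations in the second data set, the plot has two points, and the matching quantiles of the larger set fall halfway between neighbouring observations.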
Figure 2.8 shows examples of positive and negative correlations between two attributes.
• If the pattern of plotted points slopes from lower left to upper right, this means that the values of X increase as the values of Y increase, which suggests a positive correlation (Figure 2.8(a)).
• If the pattern of plotted points slopes from upper left to lower right, then the values of X increase as the values of Y decrease, suggesting a negative correlation (Figure 2.8(b)). A line of best fit can be drawn in order to study the correlation between the variables.
• Statistical tests for correlation are given in the discussion of data integration.

Figure 2.9 shows three cases for which there is no correlation relationship between the two attributes in each of the given data sets. Section 2.3.2 shows how scatter plots can be extended to n attributes, resulting in a scatter plot matrix.
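The line of best fit mentioned above can be illustrated with a small Python sketch. The text does not say how the line is fitted; this example assumes ordinary least squares, and the function name and data points are made up. The sign of the slope only hints at the direction of the relationship and is not a statistical test for correlation.

```python
def best_fit_line(xs, ys):
    """Slope and intercept of the least-squares line through (x, y) points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points sloping from lower left to upper right.
print(best_fit_line([1, 2, 3, 4], [2, 4, 6, 8]))  # (2.0, 0.0)

# Hypothetical points sloping from upper left to lower right.
print(best_fit_line([1, 2, 3, 4], [8, 6, 4, 2]))  # (-2.0, 10.0)
```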