Mining Data Dispersion Characteristics
Motivation: To better understand the data: central tendency, variation and spread
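1. MEAN:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
2. WEIGHTED MEAN: each value x_i carries a weight w_i
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$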
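As a minimal sketch of the two formulas, the following Python functions compute the arithmetic mean and the weighted arithmetic mean. The function names and the sample numbers are illustrative assumptions, not part of these notes.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of empty data is undefined")
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Illustrative data (hypothetical, not taken from the notes).
print(mean([30, 36, 47, 50, 52]))        # 43.0
print(weighted_mean([70, 90], [1, 3]))   # 85.0
```

With equal weights the weighted mean reduces to the ordinary mean.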
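3. MEDIAN:
A holistic measure. The median is the middle value if the number of values is odd, or the average of the two middle values otherwise. For grouped data, it is estimated by interpolation:
$$\text{median} = L_1 + \left( \frac{n/2 - \left(\sum f\right)_l}{f_{\text{median}}} \right) c$$
where L_1 is the lower boundary of the median interval, n is the number of values, (\sum f)_l is the sum of the frequencies of all intervals below the median interval, f_median is the frequency of the median interval, and c is the width of the median interval.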
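The interpolation formula for grouped data can be sketched as follows. The representation of each interval as a (lower boundary, width, frequency) triple, the function name, and the example frequencies are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Estimate the median from (lower_boundary, width, frequency) triples.

    The intervals are assumed to be contiguous and sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # (sum f)_l: total frequency of the intervals below
    for lower, width, freq in intervals:
        # The median interval is the first one whose cumulative
        # frequency reaches n / 2.
        if freq > 0 and cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


# Hypothetical grouped data: [0,10), [10,20), [20,30), [30,40).
groups = [(0, 10, 5), (10, 10, 15), (20, 10, 20), (30, 10, 10)]
print(grouped_median(groups))  # 22.5
```

Here n = 50, so the median interval is [20, 30) with 20 values below it, giving 20 + (25 - 20) / 20 × 10 = 22.5.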
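4. MODE:
The value that occurs most frequently in the data. A distribution may be unimodal, bimodal, or trimodal.
Empirical formula (for moderately skewed unimodal data):
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$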
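A small sketch of finding the mode or modes of a data set is given below. The function name is hypothetical, and the choice to return every value tied for the highest count (so that a bimodal set yields two values) is an assumption of this example.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Note: if every value occurs once, all of them are returned,
    # although such data is often described as having no mode.
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([5, 5, 1]))            # [5]     (unimodal)
print(modes([1, 2, 2, 3, 3, 4]))   # [2, 3]  (bimodal)
```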
5. MIDRANGE:
The midrange is the average of the largest and smallest values in a data set.
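The midrange needs only the two extreme values, as this short sketch shows (the function name and data are illustrative):

```python
def midrange(values):
    """Average of the smallest and largest values."""
    if not values:
        raise ValueError("midrange of empty data is undefined")
    return (min(values) + max(values)) / 2


print(midrange([30, 36, 47, 50, 52]))  # 41.0
```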
Quartiles:
The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile.
Inter-quartile range:
The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
IQR = Q3 − Q1
No single numerical measure of spread, such as the IQR, is very useful for describing skewed distributions, because the spreads of the two sides of a skewed distribution are unequal. Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median, M.
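The quartiles and the IQR can be sketched with Python's standard `statistics.quantiles`. Several conventions exist for computing quartiles; the `method="inclusive"` setting used here is one of them and is an assumption of this example, as are the function name and the sample data.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, median, Q3, IQR) using the 'inclusive' quantile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3, q3 - q1


# Hypothetical right-skewed data.
data = [1, 2, 3, 4, 5, 8, 13, 21, 34]
q1, median, q3, iqr = quartiles_and_iqr(data)
print(q1, median, q3, iqr)        # 3.0 5.0 13.0 10.0
print(median - q1, q3 - median)   # 2.0 8.0
```

The last line shows why Q1, M, and Q3 are reported together: the upper half of the box (8.0) is much wider than the lower half (2.0), which the IQR alone would hide.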
OUTLIERS:
A common rule of thumb singles out as suspected outliers the values falling at least 1.5 × IQR above the third quartile or below the first quartile.
FIVE NUMBER SUMMARY:
Because Q1, M, and Q3 contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the highest and lowest data values as well. This is known as the five-number summary. The five-number summary of a distribution consists of the median M, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, M, Q3, Maximum.
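The five-number summary and the 1.5 × IQR rule can be sketched together as follows. The quartile convention (`method="inclusive"`), the function names, and the data are assumptions of this example; other quartile conventions may flag slightly different points.

```python
import statistics


def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values):
    """Values at least 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x <= low_fence or x >= high_fence]


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 50)
print(iqr_outliers(data))         # [50]
```

In this example the IQR is 4, so the fences lie at −3 and 13, and only the value 50 falls outside them.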
Boxplot
Data is represented with a box.
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
The median is marked by a line within the box.
Whiskers: two lines outside the box extend to the Minimum and Maximum.
[Figure: an example data set and its corresponding boxplot]
The standard deviation, s, measures spread about the mean and should be used only when the mean is chosen as the measure of center.
s = 0 only when there is no spread, that is, when all observations have the same value. Otherwise s > 0.
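A minimal sketch of computing s is shown below. These notes do not state which divisor is used; the sample form with n − 1 in the denominator is an assumption of this example, as are the function name and the data.

```python
import math


def sample_std(values):
    """Sample standard deviation (divisor n - 1, assumed here)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


print(sample_std([4, 4, 4, 4]))  # 0.0  (no spread)
print(sample_std([2, 4, 6]))     # 2.0
```

The first call illustrates the property above: s is zero exactly when all observations are equal.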
Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
Plots quantile information: for data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i.
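The points of a quantile plot can be computed as in the sketch below. The notes do not give a formula for f_i; the common convention f_i = (i − 0.5) / n is assumed here, and the function name and data are illustrative.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes the convention f_i = (i - 0.5) / n for the i-th smallest value.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


for f, x in quantile_points([40, 10, 30, 20]):
    print(f, x)
# 0.125 10
# 0.375 20
# 0.625 30
# 0.875 40
```

Plotting x_i against f_i with any charting tool then gives the quantile plot.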
Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.
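To make the two parameters concrete, here is a simplified sketch of local regression in plain Python. It is not a full loess implementation: the polynomial degree is fixed at 1 (local straight lines), the smoothing parameter `span` is taken to be the fraction of points used in each local fit, tricube weights are assumed, and the robustness iterations used by some loess variants are omitted. All names are hypothetical.

```python
import math


def loess_linear(xs, ys, span=0.5):
    """Smoothed y value at each x, from locally weighted linear fits."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have equal length of at least 2")
    k = max(2, min(n, math.ceil(span * n)))  # neighbours per local fit
    fitted = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the local bandwidth.
        distances = sorted(abs(x - x0) for x in xs)
        bandwidth = distances[k - 1] or 1.0
        # Tricube weights: near points count most, far points count zero.
        weights = [(1 - min(abs(x - x0) / bandwidth, 1.0) ** 3) ** 3
                   for x in xs]
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, xs)) / total
        mean_y = sum(w * y for w, y in zip(weights, ys)) / total
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted


xs = list(range(10))
ys = [2 * x + 1 for x in xs]          # exactly linear, hypothetical data
smooth = loess_linear(xs, ys, span=0.5)
```

A larger `span` averages over more neighbours and gives a smoother curve; a smaller one follows the data more closely. On exactly linear data, as above, the local fits reproduce the line.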