
Mining Data Dispersion Characteristics

The document discusses various methods for measuring and visualizing the central tendency, variation, and dispersion of data, including measures like the mean, median, mode, weighted average, midrange, percentiles, quartiles, interquartile range, variance, standard deviation, histograms, quantile plots, quantile-quantile plots, scatter plots, and loess curves. Key metrics include the five number summary (minimum, first quartile, median, third quartile, maximum) and identifying outliers that fall more than 1.5 times the interquartile range above the third quartile or below the first quartile.

Uploaded by

9696379353
Copyright
© Attribution Non-Commercial (BY-NC)

Mining Data Dispersion Characteristics

Motivation: To better understand the data: central tendency, variation and spread

Measuring the Central Tendency

1. MEAN:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

2. WEIGHTED MEAN (WEIGHTED AVERAGE):

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
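The two formulas above can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the sample scores and credit weights are hypothetical and do not come from the original notes.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical example data (not from the original document).
scores = [70, 80, 90]
credits = [1, 1, 2]

plain = mean(scores)                       # (70 + 80 + 90) / 3
weighted = weighted_mean(scores, credits)  # (70 + 80 + 2 * 90) / 4
```

With equal weights the weighted mean reduces to the ordinary mean; here the doubled weight on the last score pulls the result upward.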

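3. MEDIAN: A holistic measure. The median is the middle value if there is an odd number of values, or the average of the middle two values otherwise. For grouped data, it is estimated by interpolation:

$\text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c$

Here $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $(\sum f)_l$ is the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is the interval width.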
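As a rough sketch of the interpolation formula for grouped data, the function below walks the intervals until it reaches the one containing the n/2-th value. The tuple layout `(lower_boundary, width, frequency)` and the example intervals are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower_boundary, width, frequency) tuples
    sorted by lower boundary (a hypothetical layout for this sketch).
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # sum of frequencies of all intervals below the current one
    for lower, width, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            # L1 + ((n/2 - cumulative) / f_median) * c
            return lower + (half - cumulative) / freq * width
        cumulative += freq


# Hypothetical grouped data: [0, 10), [10, 20), [20, 30)
groups = [(0, 10, 5), (10, 10, 10), (20, 10, 5)]
approx_median = grouped_median(groups)
```

Twenty values are summarized here, so the median falls in the second interval, halfway through it.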

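4. MODE: The value that occurs most frequently in the data. A data set may be unimodal, bimodal, or trimodal. Empirical formula (for moderately skewed unimodal distributions):

$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$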

5. MIDRANGE:

The midrange is the average of the largest and smallest values in a data set.
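The mode and midrange can be illustrated with the short sketch below. The helper names and the sample data are hypothetical; returning every tied value is one possible way to expose bimodal or trimodal data.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


# Hypothetical example data (not from the original document).
data = [1, 2, 2, 3, 3, 9]

data_modes = modes(data)    # two values tie, so the data is bimodal
data_mid = midrange(data)   # (1 + 9) / 2
```

Note that if every value occurs exactly once, this helper returns all of them, which is usually read as "no mode".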

Measuring the Dispersion of Data


K-th percentile: The k-th percentile of a set of data in numerical order is the value x having the property that k percent of the data entries lie at or below x. The median M (discussed in the previous subsection) is the 50-th percentile.

Quartiles:
The first quartile, denoted by Q1, is the 25-th percentile; the third quartile, denoted by Q3, is the 75-th percentile.

Inter-quartile range:
The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
IQR = Q3 − Q1
No single numerical measure of spread, such as the IQR, is very useful for describing skewed distributions, because the spreads of the two sides of a skewed distribution are unequal. Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median, M.
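The percentile definition above (the value with k percent of the entries at or below it) corresponds to what is often called the nearest-rank method. The sketch below uses that convention as an assumption; other conventions interpolate between values and can give slightly different quartiles. The data set is hypothetical.

```python
import math


def percentile(values, k):
    """Nearest-rank k-th percentile: the smallest data value such that
    at least k percent of the entries lie at or below it."""
    if not values:
        raise ValueError("no data")
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(k / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical example data (not from the original document).
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18, 20]

q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1
```

Under this convention the 50-th percentile of the sample is 8, whereas the "average of the middle two values" rule for the median gives 8.5; the difference is a known consequence of choosing one percentile convention over another.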

OUTLIERS:
Values falling at least 1.5 × IQR above the third quartile or below the first quartile.

FIVE NUMBER SUMMARY:
Because Q1, M, and Q3 contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the highest and lowest data values as well. This is known as the five-number summary. The five-number summary of a distribution consists of the median M, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, M, Q3, Maximum.

Boxplot
Data is represented with a box.
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
The median is marked by a line within the box.
Whiskers: two lines outside the box extend to the Minimum and Maximum.

[Example data set and corresponding boxplot: figures not recovered]

Variance and standard deviation


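Variance $s^2$: (algebraic, scalable computation)

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$

Standard deviation s is the square root of the variance $s^2$.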
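The two equivalent forms of the variance can be compared directly. The sketch below is illustrative only: the first function follows the definition, and the second uses running sums, which is presumably what "scalable computation" refers to since it needs a single pass over the data. Function names and data are hypothetical.

```python
import math


def variance_two_pass(values):
    """Sample variance from the definition: sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_one_pass(values):
    """Sample variance from running sums: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # a single pass; works on any iterable or stream
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("need at least two values")
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical example data (not from the original document).
data = [2, 4, 4, 4, 5, 5, 7, 9]

var_a = variance_two_pass(data)
var_b = variance_one_pass(data)
std_dev = math.sqrt(var_a)
```

The one-pass form subtracts two large, nearly equal numbers when the mean is large relative to the spread, so it can lose precision in floating point; the two-pass form is the safer choice when the data fits in memory.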


The basic properties of the standard deviation s as a measure of spread are:

s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
s = 0 only when there is no spread, that is, when all observations have the same value. Otherwise s > 0.

Graph Displays of Basic Statistical Class Descriptions

Histogram Analysis
Frequency histograms are a univariate graphical method. A histogram consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
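The counts behind a frequency histogram can be computed as in the sketch below, which assumes equal-width classes over the range of the data. The function name, bin count, and sample data are hypothetical; the text rendering simply stands in for the rectangles.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width classes
    spanning [min(values), max(values)]. Returns (counts, class_width)."""
    low, high = min(values), max(values)
    if bins < 1 or high <= low:
        raise ValueError("need bins >= 1 and at least two distinct values")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last class rather than a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts, width


# Hypothetical example data (not from the original document).
data = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
counts, width = histogram_counts(data, bins=3)

# One row of '#' per class; the row length plays the role of the rectangle height.
for i, count in enumerate(counts):
    start = min(data) + i * width
    print(f"[{start:4.1f}, {start + width:4.1f}) {'#' * count}")
```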

Quantile Plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences). Plots quantile information: for data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i.
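The points of a quantile plot can be generated as sketched below. The text does not say how f_i is computed; this example assumes the common convention f_i = (i − 0.5) / n, and the function name and data are hypothetical.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs for a quantile plot, with the data sorted in
    increasing order and f_i = (i - 0.5) / n (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical example data (not from the original document).
points = quantile_points([40, 10, 30, 20])
```

Plotting x_i against f_i then shows every observation, so both the overall shape and any unusual values remain visible.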

Quantile-Quantile (Q-Q) Plot


Graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Allows the user to view whether there is a shift in going from one distribution to another.
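A rough sketch of how the point pairs for a Q-Q plot might be computed is shown below. It assumes the quantile fractions f_i = (i − 0.5) / m, where m is the size of the smaller sample, and linear interpolation between sorted values; both choices, along with the function names and data, are assumptions for illustration.

```python
def quantile_at(ordered, f):
    """Interpolate the value at fraction f in already sorted data,
    where the i-th sorted value sits at fraction (i - 0.5) / n."""
    n = len(ordered)
    position = f * n - 0.5  # fractional 0-based index
    if position <= 0:
        return ordered[0]
    if position >= n - 1:
        return ordered[-1]
    lower = int(position)
    frac = position - lower
    return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])


def qq_pairs(sample_a, sample_b):
    """Pair the corresponding quantiles of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    m = min(len(a), len(b))
    fractions = [(i - 0.5) / m for i in range(1, m + 1)]
    return [(quantile_at(a, f), quantile_at(b, f)) for f in fractions]


# Hypothetical example data: the second sample is the first shifted by 10.
pairs = qq_pairs([1, 2, 3, 4], [14, 11, 13, 12])
```

If the two distributions were identical, the pairs would lie on the line y = x; here every pair sits 10 units above it, which is the kind of shift a Q-Q plot makes visible.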

Scatter Plot
Provides a first look at bivariate data to see clusters of points, outliers, etc. Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.

Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence. A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.
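To make the two parameters concrete, here is a simplified loess-style smoother. It is a sketch under stated assumptions, not a full loess implementation: the polynomial degree is fixed at 1 (local straight lines), the smoothing parameter `span` is the fraction of points used in each local fit, tricube weights are assumed, and no robustness iterations are performed. All names and data are hypothetical.

```python
import math


def loess_linear(xs, ys, span=0.5):
    """Smooth ys over xs with locally weighted linear fits (degree fixed at 1).

    `span` is the smoothing parameter: the fraction of points used in each
    local fit. Returns one fitted value per input x.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    if not 0 < span <= 1:
        raise ValueError("span must be in (0, 1]")
    k = max(2, math.ceil(span * n))  # number of neighbours per local fit

    fitted = []
    for x0 in xs:
        # Bandwidth: distance to the k-th nearest point (x0 itself included).
        bandwidth = sorted(abs(x - x0) for x in xs)[k - 1]
        weights = []
        for x in xs:
            if bandwidth == 0:
                weights.append(1.0 if x == x0 else 0.0)
            else:
                u = abs(x - x0) / bandwidth
                weights.append((1 - u ** 3) ** 3 if u < 1 else 0.0)  # tricube

        # Weighted least-squares line through the neighbourhood of x0.
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, xs)) / total
        mean_y = sum(w * y for w, y in zip(weights, ys)) / total
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a weighted mean
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted


# Hypothetical example data: points that already lie on a straight line.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
smooth = loess_linear(xs, ys, span=0.5)
```

A larger `span` averages over more neighbours and gives a smoother curve; a smaller one follows local wiggles more closely. Because each local fit is a straight line, data that is already linear is reproduced unchanged.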
