R22 Unit2 CH2

Uploaded by

227r1a67a3

Basic Statistical Descriptions of Data

Measuring the Central Tendency: Mean, Median, Mode

We look at various ways to measure the central tendency of data. Suppose that we
have some attribute X, like salary, which has been recorded for a set of objects. Let
x1, x2, . . . , xN be the set of N observed values or observations for X. If we were to
plot the observations for salary, where would most of the values fall?
Measures of central tendency include the mean, median, mode, and midrange.
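As a rough illustration of two of these measures, the sketch below computes the mean and the midrange for the twelve salary values (in $K) used in the examples of this section. The function names are chosen here for illustration only.

```python
# Salary observations in $K, as listed in the examples of this section.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def mean(values):
    """Arithmetic mean: the sum of the observations divided by N."""
    return sum(values) / len(values)

def midrange(values):
    """Midrange: the average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2

print(mean(salaries))      # 58.0
print(midrange(salaries))  # 70.0
```

Note how the single large value of 110 pulls the midrange well above the mean, which is one reason the median is often reported alongside them.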
Median:
In probability and statistics, the median generally applies to numeric data;
however, we may extend the concept to ordinal data. Suppose that a given
data set of N values for an attribute X is sorted in increasing order. If N is odd,
then the median is the middle value of the ordered set. If N is even, then the
median is not unique; it is the two middlemost values and any value in between.
Let us find the median of the salary data (in $K): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The data are already sorted in increasing order. There is an even number of
observations (12); therefore, the median is not unique. It can be any value within
the two middlemost values of 52 and 56 (that is, within the 6th and 7th values
in the list). By convention, we assign the average of the two middlemost values
as the median. That is, (52 + 56)/2 = 108/2 = 54. Thus, the median is $54K.
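The convention described above (middle value for odd N, average of the two middlemost values for even N) can be sketched as follows. This is a minimal illustration, and the function name is an assumption made here.

```python
def median(values):
    """Median by the usual convention: middle value if N is odd,
    average of the two middlemost values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K
print(median(salaries))  # 54.0
```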
• The median is expensive to compute when we have a large number of
observations. Assume that data are grouped in intervals according to their
xi data values and that the frequency of each interval is known. For
example, employees may be grouped according to their annual salary in
intervals such as 10–20K, 20–30K, and so on.
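• Let the interval that contains the median frequency be the median
interval. We can approximate the median of the entire data set (e.g., the
median salary) by interpolation using the formula:
$$\text{median} \approx L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
where L1 is the lower boundary of the median interval, N is the number of
values in the entire data set, (Σ freq)_l is the sum of the frequencies of all
intervals lower than the median interval, freq_median is the frequency of the
median interval, and width is the width of the median interval.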
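A minimal sketch of this interpolation is given below. It locates the median interval by accumulating frequencies, then interpolates linearly inside that interval. The interval boundaries and frequencies are hypothetical values invented for illustration; they do not come from the slides.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data.

    intervals: list of (lower, upper, frequency) tuples sorted by lower bound.
    """
    total = sum(freq for _, _, freq in intervals)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    below = 0  # sum of frequencies of intervals lower than the current one
    for lower, upper, freq in intervals:
        if below + freq >= half:
            # This is the median interval: interpolate within it.
            return lower + (half - below) / freq * (upper - lower)
        below += freq

# Hypothetical salary intervals in $K with made-up frequencies.
grouped = [(10, 20, 200), (20, 30, 450), (30, 40, 300), (40, 50, 50)]
print(round(approximate_median(grouped), 2))  # 26.67
```

Here N = 1000, so N/2 = 500 falls in the 20–30K interval, and the estimate is 20 + (500 − 200)/450 × 10.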
Mode:
The mode for a set of data is the value that occurs most frequently in the
set. Therefore, it can be determined for qualitative and quantitative
attributes. It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode. Data sets with one,
two, or three modes are respectively called unimodal, bimodal, and
trimodal.
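As a small illustration, the sketch below returns every value that attains the greatest frequency, so it handles unimodal and multimodal data alike. The function name and the nominal example are assumptions made here.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K
print(modes(salaries))                 # [52, 70]
print(modes(["red", "blue", "red"]))   # ['red']
```

The salary data are bimodal (52 and 70 each occur twice), and the second call shows that the same idea applies to a qualitative attribute.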
Measuring the Dispersion of Data:

• Range, Quartiles, and the Interquartile Range (IQR) :

Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. The range of the set is the
difference between the largest (max()) and smallest (min()) values. Suppose that the data for attribute
X are sorted in increasing numeric order.
Imagine that we can pick certain data points so as to split the data distribution into equal-sized
consecutive sets. These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-sized consecutive sets.
• The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q − k)/q of the data values are more than x,
where k is an integer such that 0 < k < q. There are q − 1 q-quantiles.
The quartiles give an indication of the center, spread, and shape of a distribution. The
first quartile, denoted by Q1, is the 25th percentile. The third quartile, denoted by Q3,
is the 75th percentile. The second quartile is the 50th percentile. As the median, it gives
the center of the data distribution. The distance between the first and third quartiles is a
simple measure of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.
• Example 2.10 Interquartile range.
• The quartiles are the three values that split the sorted data set into four
equal parts. The data of Example 2.2.1 contain 12 observations
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110), already sorted in
increasing order. Thus, the quartiles for these data are the 3rd, 6th, and
9th values, respectively, in the sorted list. Therefore, Q1 = $47K and
Q3 = $63K.
• Thus, the interquartile range is IQR = 63 − 47 = $16K. (Note that the 6th
value is a median, $52K, although this data set has two medians since
the number of data values is even.)
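The positional rule used in this example (take the 3rd, 6th, and 9th of 12 sorted values) can be sketched as below. This is only one of several quartile conventions; library routines that interpolate between observations may return slightly different values. The function name is an assumption made here.

```python
def quartiles_by_position(sorted_values):
    """Pick the values at positions N/4, N/2, and 3N/4 (1-based).

    Assumes the input is sorted and N is a multiple of 4,
    as in the 12-value example above.
    """
    n = len(sorted_values)
    if n % 4 != 0:
        raise ValueError("this simple positional rule assumes N is a multiple of 4")
    q1 = sorted_values[n // 4 - 1]
    q2 = sorted_values[n // 2 - 1]
    q3 = sorted_values[3 * n // 4 - 1]
    return q1, q2, q3

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K
q1, q2, q3 = quartiles_by_position(salaries)
print(q1, q2, q3)  # 47 52 63
print(q3 - q1)     # 16  (the IQR)
```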
Five-Number Summary, Boxplots, and Outliers
No single numeric measure of spread (e.g., IQR) is very useful for describing
skewed distributions. Have a look at the symmetric and skewed data distributions.
In the symmetric distribution, the median splits the data into equal-size
halves. This does not occur for skewed distributions.
Therefore, it is more informative to also provide the two quartiles Q1 and Q3,
along with the median.
A common rule of thumb for identifying suspected outliers is to single out
values falling at least 1.5 × IQR above the third quartile or below the first
quartile.
• Because Q1, the median, and Q3 together contain no information about
the endpoints (e.g., tails) of the data, a fuller summary of the shape of a
distribution can be obtained by providing the lowest and highest data
values as well. This is known as the five-number summary.
• The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order of Minimum, Q1, Median, Q3, Maximum.
• Boxplots are a popular way of visualizing a distribution.
A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles, so that the box length is
the interquartile range, IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.
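The five values that a boxplot draws can be computed with a short sketch such as the one below. It assumes the positional quartile rule from the interquartile range example (valid when N is a multiple of 4) and the averaged median for even N; other conventions exist, and the function name is chosen here for illustration.

```python
def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum).

    Assumes N is a multiple of 4 so that Q1 and Q3 can be read off
    at positions N/4 and 3N/4 (1-based) in the sorted data.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n % 4 != 0:
        raise ValueError("this simple positional rule assumes N is a multiple of 4")
    q1 = ordered[n // 4 - 1]
    q3 = ordered[3 * n // 4 - 1]
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return ordered[0], q1, median, q3, ordered[-1]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K
print(five_number_summary(salaries))  # (30, 47, 54.0, 63, 110)
```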
Figure 2.3 shows boxplots for unit price data for items sold at four
branches of All Electronics during a given time period.
For branch 1, we see that the median price of items sold is $80,
Q1 is $60, and Q3 is $100. Notice that two outlying observations for this
branch were plotted individually, as their values of 175 and 202 lie more
than 1.5 × IQR above Q3 (here IQR = 40, so the cutoff is 100 + 1.5 × 40 = 160).
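The 1.5 × IQR rule of thumb can be checked for branch 1 with a few lines of code. The sketch below uses the quartiles quoted above (Q1 = 60, Q3 = 100); the function names are assumptions made here.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def suspected_outliers(values, q1, q3, k=1.5):
    """Values at least k*IQR below Q1 or above Q3."""
    low, high = outlier_fences(q1, q3, k)
    return [v for v in values if v <= low or v >= high]

print(outlier_fences(60, 100))                           # (0.0, 160.0)
print(suspected_outliers([80, 120, 175, 202], 60, 100))  # [175, 202]
```

The values 80 and 120 in the second call are made-up inliers added only to show that ordinary prices are not flagged.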
Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data
observations tend to be very close to the mean, while a high standard deviation indicates
that the data are spread out over a large range of values.
The basic properties of the standard deviation, σ, as a measure of spread
are:
• σ measures spread about the mean and should be considered only when
the mean is chosen as the measure of center.
• σ = 0 only when there is no spread, that is, when all observations have
the same value. Otherwise, σ > 0.
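The slides do not reproduce the formula, so the sketch below assumes the population form, in which the variance is the average squared deviation from the mean and σ is its square root. It is applied to the salary data from the earlier examples; the function names are illustrative.

```python
import math

def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def std_dev(values):
    """Standard deviation: square root of the population variance."""
    return math.sqrt(variance(values))

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K
print(round(variance(salaries), 2))  # 379.17
print(round(std_dev(salaries), 2))   # 19.47
print(std_dev([5, 5, 5]))            # 0.0, since there is no spread
```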
Graphic Displays of Basic Statistical Descriptions of Data
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution. First, it displays all of the data for the given attribute. Second, it plots
quantile information.
Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest for some ordinal or numeric attribute X.
Each observation, xi, is paired with a percentage, fi, which indicates that approximately
fi × 100% of the data are below the value, xi. Here fi = (i − 0.5)/N.
• These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is
slightly above zero) to 1 − 1/(2N) (which is slightly below one). On a quantile
plot, xi is graphed against fi.
For example, given the quantile plots of sales data for two different time
periods, we can compare their Q1, median, Q3, and other fi values at a glance.
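The pairing of each sorted observation with its fraction fi = (i − 0.5)/N can be sketched as below. The code only produces the (fi, xi) pairs that a quantile plot would draw; the actual plotting is left out, and the function name is an assumption made here.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $K
points = quantile_plot_points(salaries)
for f, x in points[:3]:
    print(round(f, 4), x)
```

With N = 12 the fractions run from 1/24 to 23/24 in steps of 1/12, matching the range 1/(2N) to 1 − 1/(2N).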
• A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• It is a powerful visualization tool in that it allows the user to view whether there
is a shift in going from one distribution to another.
• Suppose that we have two sets of observations for the attribute or variable unit
price, taken from two different branch locations.
• Let x1, . . . , xN be the data from the first branch, and y1, . . . , yM be the data
from the second, where each data set is sorted in increasing order.
• If M = N (i.e., the number of points in each set is the same), then we simply
plot yi against xi, where yi and xi are both (i − 0.5)/N quantiles of their respective
data sets.
• If M < N (i.e., the second branch has fewer observations than the first), there
can be only M points on the q-q plot. Here, yi is the (i − 0.5)/M quantile of the y
data, which is plotted against the (i − 0.5)/M quantile of the x data.
This computation typically involves interpolation.
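A minimal sketch of the M ≤ N case is shown below. It treats the i-th sorted x value as the (i − 0.5)/N quantile and uses linear interpolation between neighbouring observations to estimate the x quantile at each (i − 0.5)/M. Linear interpolation is an assumption made here, since the slides do not say which interpolation to use, and the sample values are hypothetical.

```python
def quantile(sorted_values, f):
    """Estimate the f quantile, treating the i-th sorted value (1-based)
    as the (i - 0.5)/N quantile and interpolating linearly in between."""
    n = len(sorted_values)
    pos = f * n - 0.5                       # fractional 0-based index
    pos = min(max(pos, 0.0), n - 1.0)       # clamp to the observed range
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def qq_points(x_values, y_values):
    """Return (x quantile, y quantile) pairs for a q-q plot, assuming M <= N."""
    xs, ys = sorted(x_values), sorted(y_values)
    m = len(ys)
    if m > len(xs):
        raise ValueError("this sketch assumes the y data set is the smaller one")
    return [(quantile(xs, (i - 0.5) / m), y) for i, y in enumerate(ys, start=1)]

# Hypothetical unit prices: four from the first branch, two from the second.
print(qq_points([10, 20, 30, 40], [1, 2]))  # [(15.0, 1), (35.0, 2)]
```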
• Histograms: “Histos” means pole or mast, and “gram” means chart, so a
histogram is a chart of poles.
• Plotting histograms is a graphical method for summarizing the distribution of
a given attribute, X.
• If X is nominal, such as item type, then a pole or vertical bar is drawn for
each known value of X. The height of the bar indicates the frequency (i.e.,
count) of that X value. The resulting graph is more commonly known as a bar
chart.
• If X is numeric, the term histogram is preferred. The range of values for X
is partitioned into disjoint consecutive subranges.
• The subranges, referred to as buckets, are disjoint subsets of the data
distribution for X.
• The range of a bucket is known as the width. Typically, the buckets are
equal-width.
• For example, a price attribute with a value range of $1 to $200 (rounded
up to the nearest dollar) can be partitioned into subranges 1 to 20, 21 to
40, 41 to 60, and so on. For each subrange, a bar is drawn whose height
represents the total count of items observed within the subrange.
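The equal-width bucketing in this example can be sketched as follows. The code assumes whole-dollar prices between 1 and 200 and returns one count per bucket (1 to 20, 21 to 40, and so on); the sample prices are hypothetical and the function name is chosen here for illustration.

```python
def bucket_counts(prices, low=1, high=200, width=20):
    """Count whole-dollar prices in equal-width buckets
    [low, low+width-1], [low+width, low+2*width-1], ..."""
    n_buckets = -(-(high - low + 1) // width)   # ceiling division
    counts = [0] * n_buckets
    for p in prices:
        if not low <= p <= high:
            raise ValueError(f"price {p} is outside {low}..{high}")
        counts[(p - low) // width] += 1
    return counts

# Hypothetical prices, chosen to sit on bucket boundaries.
print(bucket_counts([5, 20, 21, 40, 41, 200]))
# [2, 2, 1, 0, 0, 0, 0, 0, 0, 1]
```

Each entry of the result is the height of one histogram bar.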
• A scatter plot is one of the most effective graphical methods for
determining if there appears to be a relationship, pattern, or trend
between two numeric attributes.
• The scatter plot is a useful method for providing a first look at bivariate
data to see clusters of points and outliers, or to explore the possibility of
correlation relationships.
• Two attributes, X and Y, are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated). Figure 2.8
shows examples of positive and negative correlations between two
attributes.
• If the pattern of plotted points slopes from lower left to upper right, this
means that the values of X increase as the values of Y increase, which
suggests a positive correlation (Figure 2.8(a)).
• If the pattern of plotted points slopes from upper left to lower right, then
the values of X increase as the values of Y decrease, suggesting a negative
correlation (Figure 2.8(b)). A line of best fit can be drawn in order to study
the correlation between the variables.
• Statistical tests for correlation are covered in the discussion of data
integration. Figure 2.9 shows three cases for which there is no correlation
relationship between the two attributes in each of the given data sets.
Section 2.3.2 shows how scatter plots can be extended to n attributes,
resulting in a scatter plot matrix.
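One way to obtain a line of best fit is ordinary least squares, which is assumed here because the slides do not name a fitting method. In the sketch below, the sign of the fitted slope indicates whether the points trend from lower left to upper right (positive) or from upper left to lower right (negative). The data points are made up for illustration.

```python
def best_fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: y rises with x, then y falls with x.
print(best_fit_line([1, 2, 3, 4], [2, 4, 6, 8]))  # (2.0, 0.0)
print(best_fit_line([1, 2, 3, 4], [8, 6, 4, 2]))  # (-2.0, 10.0)
```

A slope near zero, as in the uncorrelated cases of Figure 2.9, would suggest no linear trend, although a fitted slope alone is not a statistical test for correlation.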
