Quant Descriptive Statistics

The document provides a refresher on descriptive statistics, covering key topics such as measures of central tendency, dispersion, and the impact of outliers. It introduces concepts like SOCS (Shape, Outliers, Center, Spread), various measures of variation, and graphical representations like histograms and boxplots. Additionally, it discusses the importance of choosing appropriate statistical measures based on data characteristics and includes practice questions to reinforce understanding.

Uploaded by

bellojoshuakehinde55

Descriptive Statistics Refresher

MFT Topics discussed in these review slides are in bold:

1. Measure of set operations
2. Conditional/joint probabilities
3. Counting rules
4. Measures of central tendency and dispersion
5. Distributions (including normal and binomial)
6. Sampling and estimation
7. Hypothesis testing
8. Correlation and regression
9. Time-series forecasting
10. Statistical concepts in quality control
“SOCS”
When you analyze a set of data, remember “SOCS”
Shape – what is the shape of the distribution?
Outliers – Are there any unusual data values in the distribution?
Center – How can I best describe the typical value or center of the
distribution?
Spread – How can I describe the variation or dispersion in the data?
Commonly observed shapes
Modality:

Skewness or symmetry:
Histogram
Common graph for quantitative data. The horizontal axis is a number
line broken into ranges and the vertical axis is the count or frequency.

Shape: skewed left

Outliers
An outlier is a data value that appears extreme relative to the rest of
the data.
Outliers can often be identified by examining an appropriate graph of
the data.
Some outliers are data entry or data collection errors that can be
corrected after they are identified. Other outliers are natural features
of the data.
Since outliers impact many calculations, you should inspect your data
for outliers near the beginning of your analysis.
Center or central tendency
Measures of central tendency describe the center, middle, or typical value of
a distribution.
Common measures of center: mean and median

Mean: The sample mean, denoted as x̄, can be calculated as

x̄ = (x1 + x2 + ... + xn) / n

where x1, x2, ..., xn represent the n observed values.

Mean
The population mean is also computed the same way but is denoted as
µ. It is often not possible to calculate µ since population data are rarely
available.
The sample mean is a sample statistic, and serves as a
point estimate of the population mean. This estimate may not be
perfect, but if the sample is good (representative of the population),
it is usually a pretty good estimate.
The sample mean is impacted by the presence of outliers. The mean is
pulled toward the side of the distribution containing outliers.
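
As a small illustration of the point above, the sketch below computes the sample mean with and without one extreme value. The salary figures are made up for this example and do not come from the slides.

```python
def sample_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)

# Hypothetical salaries in thousands of dollars (illustration data only).
salaries = [40, 42, 45, 47, 51]
with_outlier = salaries + [300]   # add one unusually large value

mean_without = sample_mean(salaries)      # 225 / 5 = 45.0
mean_with = sample_mean(with_outlier)     # 525 / 6 = 87.5

print(mean_without, mean_with)
```

A single extreme value moves the mean from 45.0 to 87.5, well above five of the six observations.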
Median
The median is the value that splits the data in half when ordered in
ascending order.

If there are an even number of observations, then the median is the
average of the two values in the middle.

Since the median is the midpoint of the data, 50% of the values are
below it. Hence, it is also the 50th percentile.
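
The odd and even cases described above can be sketched as follows. The function name and the data are assumptions chosen for illustration.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of observations, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([7, 1, 5])        # ordered: 1, 5, 7 -> 5
even_example = median([7, 1, 5, 3])    # ordered: 1, 3, 5, 7 -> (3 + 5) / 2 = 4.0

# Hypothetical salaries with one extreme value: the median barely reacts.
resistant_example = median([40, 42, 45, 47, 51, 300])   # (45 + 47) / 2 = 46.0

print(odd_example, even_example, resistant_example)
```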
Variation or dispersion
Dispersion refers to the degree of variation in the data; that is, the
numerical spread (or compactness) of the data. How spread out is the
data?
As variation in the data increases, all measures of variation take larger
values.
Range
The range is the simplest measure of variation. The range is the
difference between the maximum value and the minimum value in the
data set.
Range = max value – min value

The range is affected by outliers, and is often used only for very small
data sets.
Variance
The variance is roughly the average squared deviation from the mean.
Here is the formula for the sample variance:

s² = Σ (xi – x̄)² / (n – 1)

The sample variance is a point estimate of the corresponding
population variance. The population variance has a slightly different
formula:

σ² = Σ (xi – µ)² / N

where N is the number of values in the population.
Standard deviation
The standard deviation is the square root of the variance. The unit of
measure of the standard deviation is the same as the data. This makes
the standard deviation more practical than the variance to use in
applications. The SD is interpreted as roughly the average distance
between observations in the data set and the mean of the data.
Formula (sample standard deviation): s = √( Σ (xi – x̄)² / (n – 1) )
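
As a rough illustration of the variance and standard deviation described above, the sketch below computes both the sample version (divisor n – 1) and the population version (divisor n) by hand. The data set is made up for this example.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustration data; the mean is 5

def sum_squared_deviations(values):
    """Sum of squared distances between each value and the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

ss = sum_squared_deviations(data)          # 9+1+1+1+0+0+4+16 = 32.0

sample_var = ss / (len(data) - 1)          # divisor n - 1
population_var = ss / len(data)            # divisor n -> 4.0

sample_sd = math.sqrt(sample_var)
population_sd = math.sqrt(population_var)  # 2.0, in the same units as the data

print(sample_var, population_var, sample_sd, population_sd)
```

The hand computation should agree with `statistics.variance` and `statistics.stdev` from the Python standard library, which also use the n – 1 divisor.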
Empirical rule (or 68-95-99.7 rule)
In a unimodal, symmetric, bell-shaped (approximately normal) distribution,
about 68% of the values fall within one standard deviation of the mean,
about 95% of the values fall within two standard deviations of the mean,
and about 99.7% of the values fall within three standard deviations of
the mean.
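
The three percentages can be checked against an exact normal distribution. This is a minimal sketch that assumes the data follow a normal curve; it uses `NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Shares for 1, 2, and 3 standard deviations, rounded to four decimals.
shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
print(shares)
```

The results are close to 0.68, 0.95, and 0.997, which is where the name of the rule comes from. Real data that are only roughly bell-shaped will match these figures only approximately.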
Quartiles
Quartiles are special percentiles that divide a data set into four
sections, each containing 25% of the data set.
Q1 = first quartile = 25th percentile
Q2 = second quartile = 50th percentile
Q3 = third quartile = 75th percentile

Lowest 25% of data | Q1 | Middle 25% of data | Q2 | Middle 25% of data | Q3 | Highest 25% of data
Inter-quartile range
The interquartile range (IQR), or midspread, is the difference
between the third and first quartiles, Q3 – Q1.

This includes only the middle 50% of the data and, therefore, is not
influenced by extreme values.
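
A short sketch of computing quartiles and the IQR follows. Software packages use several slightly different quartile conventions, so the values can differ a little between tools. The choice of `method="inclusive"` in `statistics.quantiles` is an assumption made here, not something the slides specify, and the data are made up.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]   # illustration data, nine values

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)   # 5.0 9.0 13.0 8.0
```

With the default `method="exclusive"` the same call gives a first quartile of 4.0 instead of 5.0, which shows how the convention matters for small data sets.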
Boxplots
Boxplots (or box-and-whisker plots) are graphical displays built from
the five-number summary. The five-number summary consists of the
min, Q1, median, Q3, and max.
The box extends from Q1 to Q3. A line is drawn inside the box at the
value of the median. The “whiskers” extend to the values of the min
and max.
In addition, boxplots are often modified to incorporate outlier
detection rules based on distance beyond the quartiles: more than
1.5×IQR for potential outliers, or more than 3×IQR for probable outliers.
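[Figure: modified boxplot. The box spans Q1 to Q3 with a line at the median. Inner fences lie 1.5 × IQR beyond the quartiles and outer fences lie 3.0 × IQR beyond them. Points between the inner and outer fences are marked as possible outliers; points beyond the outer fences are marked as probable outliers.]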
Z-scores
A standardized value, commonly called a z-score, provides a relative
measure of the distance an observation is from the mean, which is
independent of the units of measurement.

z = (x – x̄) / s

where x̄ is the mean and s is the standard deviation of the data.

Subtracting the mean from all data values centers the data set at 0.
Dividing all of the centered values by the standard deviation scales the
values to a new standard deviation of 1.
The process of standardizing data with z-scores is often called
“centering and scaling” the data.
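
The sketch below standardizes a small made-up data set and checks the two properties described above: the z-scores have mean 0 and standard deviation 1. It assumes the sample standard deviation (n – 1 divisor) is used for scaling. It also repeats the calculation in different units to show that z-scores do not depend on the unit of measurement.

```python
import statistics

def z_scores(values):
    """Center each value at the mean, then scale by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1 divisor)
    return [(x - m) / s for x in values]

heights_cm = [150, 160, 165, 170, 180]       # hypothetical heights
heights_m = [h / 100 for h in heights_cm]    # same heights in meters

z_cm = z_scores(heights_cm)
z_m = z_scores(heights_m)

print(z_cm)
print(statistics.mean(z_cm), statistics.stdev(z_cm))
```

The value 165 equals the mean, so its z-score is 0, and the two lists of z-scores match apart from floating-point rounding.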
Impact of outliers
Outliers pull the mean toward them.
Outliers inflate the value of the range, variance, and standard
deviation.
(Note: non-resistant measures of variation, such as the range, variance,
and standard deviation, get larger when outliers are present.)
Outliers also impact other statistics, such as the correlation,
coefficients of regression models, etc.
Some statistics are resistant to the effects of outliers, like the median
and IQR.
Identifying outliers
There are several common rules of thumb for identifying outliers.
1) Values above Q3 + 1.5×IQR or below Q1 – 1.5×IQR, which are called
the “inner fences,” are potential outliers.
2) Values above Q3 + 3×IQR or below Q1 – 3×IQR, which are called the
“outer fences,” are probable outliers/extreme values.
3) Values with z-scores above +3 or below –3 are potential outliers.
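
The three rules of thumb above can be sketched as follows. The function names and data are hypothetical, and the quartiles use the `method="inclusive"` convention of `statistics.quantiles`, which is an assumption (other quartile conventions move the fences slightly). For readability the sketch reports values beyond the outer fences separately from values that are only beyond the inner fences.

```python
import statistics

def classify_outliers(values):
    """Split flagged values into potential (inner fence) and probable (outer fence)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    potential, probable = [], []
    for x in values:
        if x < outer_low or x > outer_high:
            probable.append(x)
        elif x < inner_low or x > inner_high:
            potential.append(x)
    return potential, probable

def z_score_outliers(values, threshold=3.0):
    """Return values whose z-score is above +threshold or below -threshold."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [x for x in values if abs(x - m) / s > threshold]

data = [10, 12, 13, 14, 15, 16, 17, 25, 40]   # illustration data
# Q1 = 13, Q3 = 17, IQR = 4 -> inner fences 7 and 23, outer fences 1 and 29
potential, probable = classify_outliers(data)
z_flagged = z_score_outliers(data)

print(potential, probable, z_flagged)   # [25] [40] []
```

In this example the z-score rule flags nothing even though 40 lies beyond the outer fence. With only nine observations, no value can reach a z-score of 3, so the fence rules tend to be more useful for small data sets.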
Choosing appropriate measures
The mean and standard deviation are the most popular measures of
center and variation. If the data is roughly symmetric in shape and
contains no obvious outliers, these measures are acceptable.

The median and IQR, which are both resistant to the impact of outliers,
should be strongly considered when the data contains outliers or is
strongly skewed in shape.
Categorical data
Categorical data is fundamentally different from quantitative/numeric
data.
Averages, standard deviations, and other summary statistics often
make no sense for categorical data.
Sample proportion
The sample proportion, denoted by p or p̂, is the fraction of data that
have a certain characteristic or that belong to a certain category.
Proportions are key descriptive statistics for categorical data, such as
defects or errors in quality control applications or consumer
preferences in market research.
Frequency distribution
A frequency distribution displays the values of a categorical variable
and one or more measures derived from the count of how often each
category occurs in the data.
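
As an illustration, the sketch below builds a simple frequency distribution (counts and relative frequencies) and a sample proportion for one category. The inspection results are invented for this example.

```python
from collections import Counter

# Hypothetical quality-control inspection results (illustration data only).
results = ["ok", "defect", "ok", "ok", "scratch",
           "ok", "defect", "ok", "ok", "ok"]

counts = Counter(results)
n = len(results)

# Frequency distribution: category -> (count, relative frequency)
frequency_table = {
    category: (count, count / n)
    for category, count in counts.most_common()
}

# Sample proportion of items in the "defect" category.
p_hat_defect = counts["defect"] / n

for category, (count, share) in frequency_table.items():
    print(f"{category:8s} {count:3d} {share:6.1%}")
print("sample proportion of defects:", p_hat_defect)
```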
Pie chart
Pie charts show the whole group of cases as a circle sliced into pieces
with sizes proportional to the fraction of the whole in each category.
Bar chart
A bar chart displays the distribution of a categorical variable, showing
the counts for each category next to each other for easy comparison.
Contingency table
The frequencies of two categorical variables can be summarized and
displayed simultaneously using a contingency table (or
crosstabulation):
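
A contingency table can be sketched with plain counting, as below. The survey responses, the region names, and the product labels are all hypothetical.

```python
from collections import Counter

# Hypothetical survey responses: (region, preferred product).
responses = [
    ("North", "A"), ("North", "B"), ("South", "A"),
    ("South", "A"), ("North", "A"), ("South", "B"),
]

cell_counts = Counter(responses)
rows = sorted({region for region, _ in responses})
cols = sorted({product for _, product in responses})

# table[region][product] = number of responses in that cell
table = {r: {c: cell_counts[(r, c)] for c in cols} for r in rows}
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

print("       " + "  ".join(cols) + "  Total")
for r in rows:
    cells = "  ".join(str(table[r][c]) for c in cols)
    print(f"{r:6s} {cells}  {row_totals[r]}")
```

Each cell holds the count for one combination of the two variables, and the row and column totals recover the frequency distribution of each variable on its own.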
Other bar charts
More elaborate bar charts, such as
clustered or stacked bar charts can
be created from contingency tables:
Practice
A) Which is most likely true for the distribution of “percentage of time
actually spent taking notes in class,” which is displayed in the
histogram?
(a) mean > median
(b) mean ~ median
(c) mean < median
(d) impossible to tell
Practice
B) Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)
Practice
C) If someone's gross annual income has a z-score of +2.3, what can be
concluded?
o Their income is 2.3 standard deviations below the mean income.
o Their income is 2.3 standard deviations above the mean income.
o Their income is 2.3 times the mean income.
o Their income is 2.3 standard deviations above the median income.
Practice
D) A community college school board is negotiating a new contract with the college
faculty. The distribution of faculty salaries is skewed right by several faculty
members who make over $100,000 per year. If the school board wants to give the
community the impression that the faculty are already overpaid, should they
advertise the mean or median of the faculty salaries?
o The school board should use the mean to make their argument. The mean will be
higher than the median since it will be influenced by the few high salaries.
o The school board should use the median to make their argument. The median
will be lower than the mean since the mean is influenced by the few high salaries.
o The school board should use the mean to make their argument. The mean will be
lower than the median since the median is influenced by the few high salaries.
Practice
E) A company advertises a mean lifespan of 1000 hours for a particular type
of light bulb. If you were in charge of quality control at the factory, would
you prefer that the standard deviation of the lifespans for the light bulbs be 5
hours or 50 hours? Why?
o 50 hours would be preferable since a larger standard deviation indicates a
longer average lifespan for the light bulbs.
o 5 hours would be preferable since a smaller standard deviation indicates
more consistency.
o 50 hours would be preferable since a larger standard deviation indicates
more consistency.
o 5 hours would be preferable since a smaller standard deviation indicates a
longer average lifespan for the light bulbs.
Practice solution
A) Which is most likely true for the distribution of “percentage of time
actually spent taking notes in class,” which is displayed in the
histogram?
(c) mean < median (median: 80%, mean: 76%)
The distribution is skewed to the left, and the data values in the left
tail pull the mean toward them, but the median is unaffected.
Practice solution
B) Which of these variables do you expect to be uniformly distributed?
(d) birthdays of classmates (day of the month)

C) If someone's gross annual income has a z-score of +2.3, what can be
concluded?
o Their income is 2.3 standard deviations above the mean income.
Practice solution
D) A community college school board is negotiating a new contract with
the college faculty. The distribution of faculty salaries is skewed right by
several faculty members who make over $100,000 per year. If the
school board wants to give the community the impression that the
faculty are already overpaid, should they advertise the mean or median
of the faculty salaries?
o The school board should use the mean to make their argument. The
mean will be higher than the median since it will be influenced by
the few high salaries.
Practice solution
E) A company advertises a mean lifespan of 1000 hours for a particular
type of light bulb. If you were in charge of quality control at the
factory, would you prefer that the standard deviation of the lifespans
for the light bulbs be 5 hours or 50 hours? Why?
o 5 hours would be preferable since a smaller standard deviation
indicates more consistency.
