3 DescriptiveStatistics

The document discusses common descriptive statistics used to summarize and explore datasets, including measures of center (mean, median), spread (range, variance, standard deviation), and relationships between variables (correlation). It covers functions in R like mean(), median(), range(), var(), sd(), summary(), cor(), and how to handle missing data. Examples using the diamonds dataset demonstrate calculating descriptive statistics and exploring relationships between variables.

Uploaded by

DevendraReddyPoreddy

Intro to

3. Descriptive Statistics
Descriptive Statistics

Explore a dataset:
● What's in the dataset?
● What does it mean?
● What if there's a lot of it?
Basic statistical functions in R

Wanted: measures of the center and the spread of our numeric data.
● mean()
● median()
● range()
● var() and sd() # variance, standard deviation
● summary() # combination of measures
mean()

A measure of the data's “most typical” value.

● Arithmetic mean == average
● Divide sum of values by number of values

> f <- c(3, 2, 4, 1)

> mean(f) # == sum(f)/length(f) == (3+2+4+1)/4
[1] 2.5
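
As a minimal sketch, the definition of the arithmetic mean can be checked by hand against R's built-in mean(). The helper name manual_mean is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: arithmetic mean computed straight from its definition.
manual_mean <- function(x) {
  sum(x) / length(x)   # sum of values divided by number of values
}

f <- c(3, 2, 4, 1)
manual_mean(f)                               # 2.5
isTRUE(all.equal(manual_mean(f), mean(f)))   # TRUE: agrees with mean()
```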
median()

A measure of the data's center value. To find it:

● Sort the contents of the data structure
● Compute the value at the center of the data:
– For an odd number of elements, take the center element's value.
– For an even number of elements, take the mean of the two center elements.
median()
Odd number of values:
h = (3, 1, 2) → sort → h' = (1, 2, 3) → find center → 2

median(h) = 2
> h <- c(3, 1, 2)
> median(h)
[1] 2
median()
Even number of values: need to find mean()
f = (3, 2, 4, 1) → sort → f' = (1, 2, 3, 4) → find center → 2 and 3
median(f) = mean(c(2,3)) = 2.5

> f <- c(3, 2, 4, 1)

> median(f)
[1] 2.5
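
The sort-then-take-the-center procedure can be sketched as a small function. This is an illustration only, assuming a non-empty numeric vector without NAs; the name manual_median is hypothetical, and in practice median() is the function to call.

```r
# Hypothetical helper: median via sorting (assumes non-empty input, no NAs).
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd length: the center element
  } else {
    mean(s[c(n / 2, n / 2 + 1)])   # even length: mean of the two center elements
  }
}

h <- c(3, 1, 2)
f <- c(3, 2, 4, 1)
manual_median(h)   # 2
manual_median(f)   # 2.5
```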
range(): min() and max()

range() reports the minimum and maximum values found in the data structure.

> f <- c(3, 2, 4, 1)

> range(f) # reports min(f) and max(f)
[1] 1 4
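
A short sketch of how range() relates to min() and max(), using the same vector f. Taking diff() of the result to get the width of the range is an addition for illustration, not something the slides cover.

```r
f <- c(3, 2, 4, 1)
r <- range(f)                     # two values: minimum, then maximum
identical(r, c(min(f), max(f)))   # TRUE
diff(r)                           # width of the range: 4 - 1 = 3
```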
var() and sd()

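● Variance: a measure of the spread of the values relative to their mean.
Sample variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

● Standard deviation: square root of the variance.
Sample standard deviation: s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Here n is the number of values and \bar{x} is their mean.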
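
As a hedged sketch, the sample variance and standard deviation can be computed from their definitions (with the n - 1 denominator that var() and sd() use) and compared with the built-in functions. The variable names manual_var and manual_sd are hypothetical.

```r
f <- c(3, 2, 4, 1)
n <- length(f)

# Sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((f - mean(f))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

manual_var                                # 1.666667
manual_sd                                 # 1.290994
isTRUE(all.equal(manual_var, var(f)))     # TRUE
isTRUE(all.equal(manual_sd, sd(f)))       # TRUE
```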
R's summary() function

Provides several useful descriptive statistics about the data:

> g <- c(3, NA, 2, NA, 4, 1)

> summary(g)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00    1.75    2.50    2.50    3.25    4.00       2

Quartiles: Sort the data set and divide it up into quarters...

Quartiles
Quartiles are the three points that divide ordered data into four equal-sized groups:
● Q1 marks the boundary just above the lowest 25% of the data
● Q2 (the median) cuts the data set in half
● Q3 marks the boundary just below the highest 25% of the data
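
A minimal sketch of computing the three quartiles directly with quantile(). It uses R's default interpolation method, so other quartile definitions can give slightly different values on small samples. The use of IQR() for the distance between Q1 and Q3 is an addition for illustration.

```r
g <- c(3, 2, 4, 1)

# Q1, Q2 (median), Q3 using R's default quantile method.
q <- quantile(g, probs = c(0.25, 0.5, 0.75))
q        # 25%: 1.75, 50%: 2.50, 75%: 3.25
IQR(g)   # interquartile range, Q3 - Q1 = 1.5
```

These match the 1st Qu., Median, and 3rd Qu. values that summary() reports for the same numbers.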
Quartiles

Figure: boxplot and probability density function of a Normal N(0, σ²) population

Summary: basic statistical functions

● Characterize the center and the spread of our numeric data.
● Comparing these measures can give us a good sense of our dataset.
Statistics and Missing Data

If NAs are present, specify na.rm=TRUE when calling:

● mean()
● median()
● range()
● sum()
● ...and some other functions
With na.rm=TRUE, R disregards the NAs, then proceeds with the calculation; without it, these functions return NA.
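
A short sketch of the effect of na.rm, using a vector with two missing values. Counting the missing values with is.na() is an addition for illustration.

```r
g <- c(3, NA, 2, NA, 4, 1)

mean(g)                    # NA: missing values propagate by default
mean(g, na.rm = TRUE)      # 2.5
median(g, na.rm = TRUE)    # 2.5
range(g, na.rm = TRUE)     # 1 4
sum(g, na.rm = TRUE)       # 10
sum(is.na(g))              # 2 missing values
```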
diamonds data

About 54,000 diamonds (53,940 rows), for example:

  carat     cut color clarity depth table price    x    y    z
1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31

What can we learn about these data?

diamonds data
summary()
Information provided by summary() depends on the type of data, by column:

     carat              cut          color       price
 Min.   :0.2000   Fair     : 1610   D: 6775   Min.   :  326
 1st Qu.:0.4000   Good     : 4906   E: 9797   1st Qu.:  950
 Median :0.7000   Very Good:12082   F: 9542   Median : 2401
 Mean   :0.7979   Premium  :13791   G:11292   Mean   : 3933
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   3rd Qu.: 5324
 Max.   :5.0100                     I: 5422   Max.   :18823
                                    J: 2808

Numeric data (carat, price): statistical summary
Categorical (factor) data (cut, color): counts
Diamond Price with Size: Scatter Plot

[Figure: scatter plot. y-axis: Price (dependent variable); x-axis: Carats (independent variable)]

table() function

Contingency table: counts of categorical values for selected columns

> table(diamonds$cut, diamonds$color)

               D    E    F    G    H    I    J
  Fair       163  224  312  314  303  175  119
  Good       662  933  909  871  702  522  307
  Very Good 1513 2400 2164 2299 1824 1204  678
  Premium   1603 2337 2331 2924 2360 1428  808
  Ideal     2834 3903 3826 4884 3115 2093  896
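
To see how table() builds such counts, here is a minimal sketch on a tiny made-up data frame. The data frame toy and its five rows are hypothetical and are not taken from the diamonds data.

```r
# Hypothetical toy data, small enough to count by eye.
toy <- data.frame(
  cut   = c("Good", "Ideal", "Ideal", "Good", "Ideal"),
  color = c("E", "E", "D", "E", "D")
)

# Rows are cut values, columns are color values, cells are counts.
counts <- table(toy$cut, toy$color)
counts
counts["Ideal", "D"]   # 2 rows have cut "Ideal" and color "D"
counts["Good", "D"]    # 0: that combination never occurs
```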
Diamond Color and Cut

Bar Plot: Counts of categorical values

Correlation

Do the two quantities X and Y vary together?

– Positively: as X increases, Y tends to increase
– Or negatively: as X increases, Y tends to decrease

A pairwise, statistical relationship between quantities

Correlation

NOTE: Correlation does not imply causation...

Looking for correlations

diamonds data frame: about 54,000 diamonds (53,940 rows)

● carat: weight of the diamond (0.2–5.01)
● table: width of top of diamond relative to widest point (43–95)
● price: price in US dollars
● x: length in mm (0–10.74)
● y: width in mm (0–58.9)
● z: depth in mm (0–31.8)
cor() function

Look at pairwise, statistical relationships between numeric data:

> cor(diamonds[c(1,6:10)])
          carat     table     price         x         y         z
carat 1.0000000 0.1816175 0.9215913 0.9750942 0.9517222 0.9533874
table 0.1816175 1.0000000 0.1271339 0.1953443 0.1837601 0.1509287
price 0.9215913 0.1271339 1.0000000 0.8844352 0.8654209 0.8612494
x     0.9750942 0.1953443 0.8844352 1.0000000 0.9747015 0.9707718
y     0.9517222 0.1837601 0.8654209 0.9747015 1.0000000 0.9520057
z     0.9533874 0.1509287 0.8612494 0.9707718 0.9520057 1.0000000

-1.0: perfectly anticorrelated

↕
0 : uncorrelated
↕
1.0: perfectly correlated
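
The ends of this scale can be illustrated with small made-up vectors. This is a sketch under the assumption that the default (Pearson) method of cor() is the one intended; the vectors and the helper manual_cor are hypothetical.

```r
x      <- c(1, 2, 3, 4)
y_up   <- 2 * x + 1   # rises exactly in step with x
y_down <- -x          # falls exactly in step with x

cor(x, y_up)     # 1: perfectly correlated
cor(x, y_down)   # -1: perfectly anticorrelated

# Hypothetical helper: Pearson's r computed from its definition.
manual_cor <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

w <- c(2, 1, 4, 3)
manual_cor(x, w)                                  # 0.6
isTRUE(all.equal(manual_cor(x, w), cor(x, w)))    # TRUE
```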
Interlude
Complete descriptive statistics exercises.

Open in the RStudio source editor:

<workshop>/exercises/exercises-descriptive-statistics.R
