Data Science Summary Notes

Here are the key points about numerical and graphical summaries:

Numerical summaries:
- Advantages: Easy to communicate and compare data quantitatively. Reduces complexity.
- Disadvantages: Loses a lot of information about the shape and outliers of the data. Single values don't show distribution.

Graphical summaries:
- Advantages: Show the overall distribution and outliers visually. Easier to identify patterns.
- Disadvantages: Harder to directly compare values. More complex than single numbers.

Both are useful together to get a full picture of the data. Numerical summaries quantify aspects while graphical summaries show the overall shape and outliers visually. Skills include calculating summaries by hand and with

Uploaded by

lira shrestha

Misleading ways to present data

Advantages of Numerical Summaries

Summaries of Spread

Reporting Center and Spread

Formulas

Base R Codes

GGplot Codes

Misleading ways to present data

Real estate agents can quote the average (mean) sale price of a suburb. A few houses sold at
very high prices (outliers) can pull the average up even though the median house price is much
lower, so less desirable suburbs are made to look better than they are.

Quoting the price per square foot instead of per square metre, the predominant unit in
Australia, can make the price look lower: one square metre is about 10.76 square feet, so the
per-square-foot figure is roughly a tenth of the per-square-metre figure.
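A minimal sketch of both effects in R. The sale prices and the per-square-metre price are made-up numbers for illustration, not data from these notes.

```r
# Hypothetical sale prices (in $1000s) for one suburb: most homes sell for
# under $1m, but two luxury sales sit far above the rest (outliers).
prices <- c(650, 700, 720, 750, 780, 800, 820, 4500, 6000)

avg <- mean(prices)    # pulled upward by the two outliers
mid <- median(prices)  # the middle sale, barely affected by the outliers

c(mean = avg, median = mid)

# The same (hypothetical) price quoted per square metre and per square foot.
sqft_per_sqm   <- 10.7639                 # square feet in one square metre
price_per_sqm  <- 10000
price_per_sqft <- price_per_sqm / sqft_per_sqm

c(per_sqm = price_per_sqm, per_sqft = price_per_sqft)
```

Here the mean is more than double the median, although seven of the nine sales are at or below 820.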

Advantages of Numerical Summaries

A numerical summary reduces all the data to 1 simple number. This loses a lot of information,
but it allows easy communication and comparison.

Major features that can be summarised numerically:

● Maximum
● Minimum
● Center (mean, median)
● Spread (standard deviation, range, IQR)

Summaries of Spread

IQR
Range
Variance
Standard deviation
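The four spread summaries listed above can be computed in base R as follows. This is a small sketch on a made-up vector `x`; note that `var()` and `sd()` give the sample versions (dividing by n - 1).

```r
x <- c(2, 4, 4, 5, 7, 9, 11)    # hypothetical data

iqr_x   <- IQR(x)               # Q3 - Q1
range_x <- diff(range(x))       # range(x) returns c(min, max), so take the difference
var_x   <- var(x)               # sample variance (divides by n - 1)
sd_x    <- sd(x)                # sample standard deviation = sqrt(var(x))

c(IQR = iqr_x, range = range_x, variance = var_x, sd = sd_x)
```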

Reporting center and spread

Correct pairs to present: (mean, SD) OR (median, IQR)

INCORRECT pairs to present: (mean, IQR) OR (median, SD)

Note: (mean, SD) pair is not robust; (median, IQR) IS robust

Characteristics of Summaries

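| Summary | Robust? | Center or spread | Compares 2 or more data sets? | Effect of shift (adding/subtracting n from every data value) | Effect of scaling (multiplying/dividing every data value by n > 0) |
|---|---|---|---|---|---|
| IQR | Yes | Spread | | No change | Scales: IQR × n |
| Range | No | Spread | | No change | Scales: range × n |
| Median | Yes | Center | | Shifts | Scales: median × n |
| Mean | No | Center | | Shifts | Scales: mean × n |
| Variance | No | Spread | | No change | Scales: variance × n² |
| Standard deviation | No | Spread | | No change | Scales: SD × n |
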
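The shift and scaling behaviour in the table can be checked numerically. The sketch below uses a made-up vector and the constant `n = 5`; the helper `summaries()` is a name introduced here for illustration only.

```r
x <- c(1, 3, 4, 8, 9)   # hypothetical data
n <- 5                  # constant added to, or multiplied with, every value

shifted <- x + n
scaled  <- x * n

# Collect the center and spread summaries of a vector in one named vector.
summaries <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v),
    IQR = IQR(v), range = diff(range(v)), variance = var(v))
}

# One row per version of the data, one column per summary.
rbind(original = summaries(x),
      shifted  = summaries(shifted),
      scaled   = summaries(scaled))
```

In the output, the center columns move with the shift while the spread columns do not, and every column except the variance is multiplied by `n` under scaling (the variance is multiplied by `n^2`).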
Formulas

| Function | Formula |
|---|---|
| Coefficient of Variation (CV): combines mean and standard deviation into 1 summary. Uses: analytical chemistry to express precision and repeatability of an assay; engineering and physics for quality assurance studies; economics for determining the volatility of a security. | CV = SD/mean |
| Lower threshold | Q1 - 1.5 × IQR |
| Upper threshold | Q3 + 1.5 × IQR |
| Standard units: how many standard deviations a data point is above or below the mean | (data point - mean)/SD OR gap/SD OR data point = mean + SD × standard units |
| Standard deviation (population) | SDpop = RMS of (gaps from the mean) |
| Area under the normal curve to the left of a number. Note: default mean = 0, SD = 1 | pnorm(number you want to find the area below, mean = enter, sd = enter) |
| Finding area between -2.5 and 2.5 SD | pnorm(2.5) - pnorm(-2.5) |
| Height of the normal curve (density) | dnorm( |
| Random draws from a normal distribution | rnorm( |
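A short sketch applying the CV, threshold and standard-unit formulas to a made-up vector. It assumes the sample SD from `sd()`; the population SD could be substituted where the course requires it.

```r
x <- c(10, 12, 13, 15, 16, 18, 40)   # hypothetical measurements; 40 looks unusually large

# Coefficient of variation: SD relative to the mean.
cv <- sd(x) / mean(x)

# Outlier thresholds from the quartiles.
q     <- unname(quantile(x, c(0.25, 0.75)))   # Q1 and Q3
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr                     # lower threshold: Q1 - 1.5 * IQR
upper <- q[2] + 1.5 * iqr                     # upper threshold: Q3 + 1.5 * IQR
outliers <- x[x < lower | x > upper]          # points outside the thresholds

# Standard units: gap from the mean, measured in SDs.
z    <- (x - mean(x)) / sd(x)
back <- mean(x) + sd(x) * z                   # converting back recovers the data

list(cv = cv, thresholds = c(lower, upper), outliers = outliers)
```

For this vector the thresholds are 5.75 and 23.75, so only the value 40 is flagged.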

To hide messages and warnings in the knitted R Markdown output (this does not remove errors):

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
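The normal-curve functions named in the Formulas section (`pnorm`, `dnorm`, `rnorm`) can be tried out as below. This is a sketch: the mean of 60 and SD of 10 are made-up values, and the arguments shown for `dnorm` and `rnorm` are one common way to call them, not necessarily what the notes intended.

```r
# pnorm(q) gives the area under the normal curve to the LEFT of q.
left_of_1 <- pnorm(1)                        # standard normal: mean = 0, sd = 1
same_area <- pnorm(70, mean = 60, sd = 10)   # 70 is 1 SD above a mean of 60

# Area between -2.5 and 2.5 SD: subtract the two left-hand areas.
area_within <- pnorm(2.5) - pnorm(-2.5)

# dnorm(x) gives the height of the curve at x (not an area).
peak_height <- dnorm(0)                      # equals 1 / sqrt(2 * pi)

# rnorm(n, mean, sd) simulates n random draws; set.seed() makes them repeatable.
set.seed(1)
draws <- rnorm(5, mean = 60, sd = 10)

c(left_of_1 = left_of_1, area_within = area_within, peak_height = peak_height)
```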

Base R Codes

| Function | Code |
|---|---|
| Histogram with one variable | hist(iris$Sepal.Length) |
| Boxplot with one variable | boxplot(iris$Petal.Length) |
| Average/mean | mean(datasetname$variable) Eg. mean(Newtown2017$Sold) |
| Importing data into RStudio | Import Dataset → (choose data) → Import |
| Show dataset: displays data size, variable names, variable classifications | str(datasetname) |
| Tidyverse | library(tidyverse) |
| Show the first rows of a dataset | head(datasetname, rows) Eg. head(Newtown2017, 2) |
| Filter to find further subset of data | mean(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"]) Note: this gives the mean price of houses with 4 bedrooms (large) |
| Median | median(datasetname$variable) |
| Median focusing on variable | median(datasetname$variable[datasetname$Type == "variable" & datasetname$variable2 == "number"]) Eg. median(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"]) |
| Mean and median together | c(mean(datasetname$variable), median(datasetname$variable)) |
| Gaps | gaps = datasetname$variable - mean(datasetname$variable) |
| Maximum gap | max(gaps) |
| Standard deviation for sample | sd(datasetname$variable) |
| Standard deviation for population | Sample SD × sqrt((n-1)/n): sd(datasetname$variable) * sqrt((n-1)/n), where n = length(datasetname$variable) OR rafalib package + popsd(datasetname$variable) |
| Make barplot | |
| Quantile | quantile(datasetname$variable) |
| IQR | quantile(datasetname$variable)[4] - quantile(datasetname$variable)[2] |
| Moving data up | Eg. A = c(1:20) then B = A + 5 NOTE: mean(B) = mean(A) + 5 and sd(A) = sd(B) |
| Boxplot values | summary(datasetname) |
| Ordering | sort(datasetname) |
| Population standard deviation | library(rafalib) then popsd(datasetname) NOTE: without the rafalib package, sd(datasetname) outputs the sample SD |
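The subsetting pattern used for the Newtown2017 examples can be run on the built-in `iris` data. This is a sketch only: `Newtown2017` is not available here, so `iris` columns stand in for `Sold`, `Type` and `Bedrooms`.

```r
# Sepal lengths of setosa flowers whose petals are wider than 0.2 cm.
# The logical test inside [ ] keeps only the rows where both conditions hold.
setosa_wide <- iris$Sepal.Length[iris$Species == "setosa" & iris$Petal.Width > 0.2]

# Mean and median of that subset, reported together.
c(mean = mean(setosa_wide), median = median(setosa_wide))

# IQR from the quantiles (4th element is Q3, 2nd is Q1) and with the built-in function.
q           <- quantile(iris$Sepal.Length)
iqr_manual  <- unname(q[4] - q[2])
iqr_builtin <- IQR(iris$Sepal.Length)

c(manual = iqr_manual, builtin = iqr_builtin)
```

Always repeat the full dataset name on both sides of each condition; a different or misspelled name inside the brackets either fails or silently subsets with the wrong data.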

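GGplot Codes

NOTE: NO $ IN GGPLOT. Inside aes(), refer to a column by its bare name (Petal.Length, not iris$Petal.Length). Column names are case-sensitive.

| Function | Code |
|---|---|
| Histogram: 1 variable | ggplot(iris, aes(Petal.Length)) + geom_histogram() |
| Histogram: 1 variable + coloured | ggplot(iris, aes(Petal.Length)) + geom_histogram(aes(fill = Species)) |
| Boxplot: 1 variable | ggplot(iris, aes(Petal.Length)) + geom_boxplot() |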

Calculating popsd without the function
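A minimal sketch of the calculation in base R, using the RMS-of-gaps definition from the Formulas section on a made-up vector. It also shows why the plain mean of the gaps is useless as a measure of spread.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # hypothetical data
n <- length(x)

gaps <- x - mean(x)            # gap = data point - mean

# The positive and negative gaps cancel, so their mean is always 0
# (up to rounding error).
mean_gap <- mean(gaps)

# Population SD: Root of the Mean of the Squared gaps (RMS).
pop_sd <- sqrt(mean(gaps^2))

# The same value from the sample SD, using the sqrt((n - 1) / n) adjustment.
pop_sd_from_sample <- sd(x) * sqrt((n - 1) / n)

c(mean_gap = mean_gap, pop_sd = pop_sd, pop_sd_from_sample = pop_sd_from_sample)
```

Because the gaps are squared before averaging, the result can never be negative.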

Quiz

| Question | Answer |
|---|---|
| What feature of data can be easily communicated by a single numerical summary? | The maximum of quantitative data |
| The mean is the unique point at which the data is ____. | Balanced |
| For measuring the spread of data, what is wrong with calculating the mean of the "gaps", where "gap" = data - mean? | It will always be 0 |
| Can standard deviation be negative? | No. It involves the RMS, which cannot be negative. |
| In R, does the sd() command work out the population SD, or does the rafalib package need to be installed? | No. The sd() command gives the sample SD. The rafalib package + popsd() command gives the population SD. |
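
# Read the project data (the path is relative to the working directory)
Project1 <- read.csv("Downloads/Project1Data.csv")
View(Project1)   # open the data frame in the RStudio viewer
str(Project1)    # data size, variable names, variable classifications
library(tidyverse)

# Bar chart of breakfast days per week, coloured by employment status.
# geom_bar() counts rows by default and takes no bins argument.
ggplot(Project1, aes(Breakfast, fill = Employment)) +
  geom_bar() +
  labs(x = "Number of days per week breakfast is consumed", y = "Frequency",
       title = "Breakfast Habits vs Employment Status")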

Potential contents:

Advantages and disadvantages of numerical and graphical summaries

Skills:
- Find the mean, median, IQR, range etc. from a boxplot, a histogram and a dataset, both by
hand and in R (base R in RStudio and ggplot)
