Data Science Summary Notes

Here are the key points about numerical and graphical summaries:

Numerical summaries:
- Advantages: Easy to communicate and compare data quantitatively. Reduces complexity.
- Disadvantages: Loses a lot of information about the shape and outliers of the data. Single values don't show distribution.

Graphical summaries:
- Advantages: Show the overall distribution and outliers visually. Easier to identify patterns.
- Disadvantages: Harder to directly compare values. More complex than single numbers.

Both are useful together to get a full picture of the data. Numerical summaries quantify aspects while graphical summaries show the overall shape and outliers visually. Skills include calculating summaries by hand and with

Uploaded by

lira shrestha
Contents

Misleading ways to present data

Advantages of Numerical Summaries

Summaries of Spread

Reporting Center and Spread

Formulas

Base R Codes

GGplot Codes

Misleading ways to present data


Real estate agents can quote the average (mean) sale price of a suburb. A few houses sold at a very high price
(outliers) pull the mean up, even though the median house price is much lower.
Less desirable suburbs are presented as better than they are.
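
A small sketch of the effect described above. The sale prices are invented for illustration (in $1000s) and are not taken from any real suburb:

```r
# Hypothetical sale prices (in $1000s) for one suburb; not real data.
prices <- c(620, 650, 680, 700, 710, 730, 750, 4200)

mean_price   <- mean(prices)    # pulled up by the single 4200 sale
median_price <- median(prices)  # middle of the sorted prices

c(mean = mean_price, median = median_price)
```

With these numbers the mean is higher than every sale except the outlier, while the median stays near the typical price.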

Quoting the price per square foot instead of per square metre, the predominant unit in Australia, can
make the price look lower, because a square foot is a much smaller area (1 m² ≈ 10.76 sq ft).

Advantages of Numerical Summaries

A numerical summary reduces all the data to 1 simple number. This loses a lot of information, but it allows easy
communication and comparisons.

Major features that can be summarised numerically:

● Maximum
● Minimum
● Center (mean, median)
● Spread (standard deviation, range, IQR)

Summaries of Spread

IQR
Range
Variance
Standard deviation
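
The four spread summaries listed above can each be computed in base R. This is a minimal sketch on a made-up vector `x`; note that `var()` and `sd()` give the sample versions, and `IQR()` uses R's default quantile method:

```r
# Made-up sample, used only to illustrate the four spread summaries.
x <- c(2, 4, 4, 5, 7, 9, 11)

spread_iqr   <- IQR(x)            # Q3 - Q1 (R's default quantile method)
spread_range <- max(x) - min(x)   # largest value minus smallest value
spread_var   <- var(x)            # sample variance (divides by n - 1)
spread_sd    <- sd(x)             # sample SD = sqrt(sample variance)

c(IQR = spread_iqr, range = spread_range, variance = spread_var, SD = spread_sd)
```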

Reporting center and spread

Correct pairs to present: (mean, SD) OR (median, IQR)


INCORRECT pairs to present: (mean, IQR) OR (median, SD)

Note: the (mean, SD) pair is not robust to outliers; the (median, IQR) pair IS robust
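
One way to see the robustness difference is to add a single extreme value to a sample and check how far each summary moves. The data and the helper name `report_pairs` are made up for this sketch:

```r
# Made-up data: the same sample with and without one extreme value.
clean        <- c(10, 12, 13, 14, 15, 16, 18)
with_outlier <- c(clean, 100)

# Helper (hypothetical name) returning both reporting pairs for a vector.
report_pairs <- function(v) {
  c(mean = mean(v), SD = sd(v), median = median(v), IQR = IQR(v))
}

before <- report_pairs(clean)
after  <- report_pairs(with_outlier)
change <- abs(after - before)   # how far each summary moved

round(rbind(before, after, change), 2)
```

The mean and SD move much further than the median and IQR, which is why the pairs are kept together.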


Characteristics of Summaries

Summary | Robust | Center or spread | Compares 2 or more data sets? | Shift invariance (effect of adding/subtracting n from every data value) | Effect of scaling (multiplying/dividing every data value by n)
IQR | Yes | Spread | | |
Range | | Spread | | |
Median | Yes | Center | | |
Mean | No | Center | | Shifts | Scales
Variance | | Spread | | |
Standard deviation | No | Spread | | No change | Scales: SD × n (for n > 0)

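
The shift and scaling behaviour of the mean and standard deviation can be checked directly. This is a sketch with arbitrary constants (5 for the shift and scaling, -2 for a negative scaling):

```r
A <- 1:20            # original data
n <- 5               # constant for the shift and the scaling (arbitrary choice)

shifted <- A + n     # add n to every value
scaled  <- A * n     # multiply every value by n
flipped <- A * -2    # scaling by a negative number

centers <- c(A = mean(A), shifted = mean(shifted),
             scaled = mean(scaled), flipped = mean(flipped))
spreads <- c(A = sd(A), shifted = sd(shifted),
             scaled = sd(scaled), flipped = sd(flipped))

rbind(mean = centers, SD = spreads)
```

The mean shifts and scales with the data. The SD ignores the shift and scales by the size of the multiplier, so it stays positive even when the multiplier is negative.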
Formulas

Function Formula

Coefficient of Variation (CV): combines mean and standard deviation into 1 summary. CV = SD/mean

Uses: analytical chemistry to express the precision and repeatability of an assay; engineering and physics for
quality assurance studies; economics for determining the volatility of a security.
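
A minimal sketch of the CV calculation. The measurements are invented, and `cv` is a helper name chosen here (it is not a base R function):

```r
# Hypothetical repeated assay measurements (arbitrary units).
assay <- c(9.8, 10.1, 10.0, 10.3, 9.9, 10.2)

# Helper (name chosen here, not a base R function): CV = SD / mean.
cv <- function(x) sd(x) / mean(x)

cv_assay <- cv(assay)
cv_assay          # unitless ratio
cv_assay * 100    # often quoted as a percentage
```

Because the SD and the mean scale together, the CV does not change when every value is multiplied by the same positive number, for example after a change of units.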

Upper threshold Q3 + 1.5IQR

Lower threshold Q1 - 1.5IQR
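
A sketch of the threshold calculation, using the usual boxplot convention that the lower threshold is Q1 - 1.5 × IQR and the upper threshold is Q3 + 1.5 × IQR. The data are made up, and the quartiles come from R's default `quantile()` method:

```r
x <- c(3, 5, 6, 7, 8, 9, 10, 12, 40)   # made-up data; 40 is far from the rest

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

lower_threshold <- q1 - 1.5 * iqr
upper_threshold <- q3 + 1.5 * iqr

# Values outside the two thresholds
outliers <- x[x < lower_threshold | x > upper_threshold]
outliers
```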

Standard units: how many standard deviations a data point is above or below the mean
standard units = (data point - mean)/SD OR gap/SD
OR data point = mean + SD * standard units
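
Both directions of the standard-units formula can be sketched as follows. The scores are made up, and the sample SD from `sd()` is used here as an assumption (swap in the population SD if that is the convention in use):

```r
scores <- c(55, 60, 65, 70, 75, 80, 85)  # made-up data

m <- mean(scores)
s <- sd(scores)          # sample SD (assumption for this sketch)

z    <- (scores - m) / s # data points converted to standard units
back <- m + s * z        # standard units converted back to data points

round(z, 2)
```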

Population standard deviation SDpop = RMS of (gaps from the mean), i.e. sqrt(mean(gaps^2))

Area under the normal curve to the left of a value pnorm(value, mean = enter, sd = enter)
Note: default mean = 0, sd = 1

Finding area between -2.5 and 2.5 SD pnorm(2.5)-pnorm(-2.5)
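
A short sketch of these `pnorm()` calls. The mean of 170 and SD of 8 in the last calculation are hypothetical values picked only to show the `mean` and `sd` arguments:

```r
# Area to the left of 1 under the standard normal curve (mean = 0, sd = 1)
left_of_1 <- pnorm(1)

# Area between -2.5 and 2.5 standard deviations
middle <- pnorm(2.5) - pnorm(-2.5)

# Same calculation for a non-standard curve: hypothetical mean 170 and sd 8
middle_scaled <- pnorm(170 + 2.5 * 8, mean = 170, sd = 8) -
  pnorm(170 - 2.5 * 8, mean = 170, sd = 8)

c(left_of_1 = left_of_1, middle = middle, middle_scaled = middle_scaled)
```

The last two areas agree, since the area within 2.5 SDs of the mean does not depend on which mean and SD the curve has.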

dnorm(

rnorm(

To hide messages and warnings in the knitted document
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

Base R Codes

Function Code
Histogram with one variable hist(iris$Sepal.Length)

Boxplot with one variable boxplot(iris$Petal.Length)

Average/mean mean(datasetname$variable)

Eg. mean(Newtown2017$Sold)

Importing Data into R Studio Import dataset → (choose data) → import

Show dataset: displays data size, variable names, variable classifications str(datasetname)

Tidyverse library(tidyverse)

First rows of dataset head(datasetname, rows)

Eg. head(Newtown2017, 2)

Filter to find further subset of data mean(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"])

Note: this gives the mean price of houses with 4 bedrooms (large)

Median median(datasetname$variable)

Median focusing on variable median(datasetname$variable[datasetname$Type == "variable" & datasetname$variable2 == "number"])

Eg. median(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"])

Mean and median together c(mean(datasetname$variable), median(datasetname$variable))

Gaps gaps = datasetname$variable - mean(datasetname$variable)

Maximum gap max(gaps)

Standard Deviation for sample sd(datasetname$variable)

Standard deviation for population Sample sd * sqrt((n-1)/n): sd(datasetname$variable) * sqrt((n-1)/n), where n = length(datasetname$variable)
OR
Rafalib package + popsd(datasetname$variable)


Make barplot

Quantile quantile(datasetname$variable)

IQR quantile(datasetname$variable)[4] - quantile(datasetname$variable)[2]
OR IQR(datasetname$variable)

Moving data up Eg. A = c(1:20)
B = A + 5

NOTE: mean(B) = mean(A) + 5
sd(A) = sd(B)

Boxplot values summary(datasetname)

Ordering sort(datasetname)

Population standard deviation library(rafalib)


popsd(datasetname)

NOTE: without rafalib package, sd(datasetname) outputs sample sd
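
The filtering and quantile entries in this section can be tried on a small data frame. The `sales` data frame below is a made-up stand-in for a dataset such as Newtown2017; the column names follow the examples above, but the values are invented:

```r
# Made-up stand-in for a dataset such as Newtown2017 (values are invented).
sales <- data.frame(
  Sold     = c(900, 1200, 1500, 1650, 700, 2100, 1800),
  Type     = c("Unit", "House", "House", "House", "Unit", "House", "House"),
  Bedrooms = c("2", "3", "4", "4", "1", "4", "5")
)

# Logical vector: TRUE for rows that are houses with 4 bedrooms
is_4bed_house <- sales$Type == "House" & sales$Bedrooms == "4"

mean_4bed   <- mean(sales$Sold[is_4bed_house])
median_4bed <- median(sales$Sold[is_4bed_house])

# IQR two ways: from the quartiles, and with the built-in function
q <- quantile(sales$Sold)                 # 0%, 25%, 50%, 75%, 100%
iqr_from_quartiles <- unname(q[4] - q[2])
iqr_builtin        <- IQR(sales$Sold)
```

Storing the condition in its own variable keeps the subsetting readable and makes sure the same data frame name is used on both sides of `&`.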

GGplot Codes

NOTE:
NO $ IN GGPLOT. Give the dataset once in ggplot() and use the variable names on their own inside aes().
Function Code
Histogram: 1 variable ggplot(iris, aes(Petal.Length)) + geom_histogram()

Histogram: 1 variable + coloured ggplot(iris, aes(Petal.Length)) + geom_histogram(aes(fill = Species))

Boxplot: 1 variable ggplot(iris, aes(Petal.Length)) + geom_boxplot()

Calculating popsd without the function
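
One possible way to do this in base R, following the definition of the population SD as the RMS of the gaps from the mean. The vector `x` is made up, and the last step is only a cross-check against the rescaled sample SD:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data

gaps     <- x - mean(x)           # gap of each value from the mean
mean_gap <- mean(gaps)            # always 0 (up to rounding), so no use as a spread

sd_pop <- sqrt(mean(gaps^2))      # root of the mean of the squared gaps

# Cross-check against the sample SD rescaled by sqrt((n - 1) / n)
n <- length(x)
sd_pop_check <- sd(x) * sqrt((n - 1) / n)

c(sd_pop = sd_pop, sd_pop_check = sd_pop_check)
```

Squaring the gaps before averaging is what stops the positive and negative gaps from cancelling out.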


Quiz

Question: What feature of data can be easily communicated by a single numerical summary?
Answer: The maximum of quantitative data

Question: The mean is the unique point at which the data is ____.
Answer: Balanced

Question: For measuring the spread of data, what is wrong with calculating the mean of the “gaps”, where “gap” = data - mean?
Answer: It will always be 0

Question: Can standard deviation be negative?
Answer: No. It involves RMS, which cannot be negative.

Question: True or false: in R, the sd() command works out the population SD, and the rafalib package does not need to be installed.
Answer: False. The sd() command gives the sample SD. The rafalib package + popsd() command gives the population SD.

library(tidyverse)

# Read the project data (path is relative to the working directory)
Project1 <- read.csv("Downloads/Project1Data.csv")
View(Project1)
str(Project1)

# Bar chart of breakfast frequency, coloured by employment status
ggplot(Project1, aes(Breakfast, fill = Employment)) +
  geom_bar() +
  labs(x = "Number of days per week breakfast is consumed", y = "Frequency",
       title = "Breakfast Habits vs Employment Status")

Potential contents:

Advantages and disadvantages of numerical and graphical summaries

Skills:
- Find mean, median, IQR, range etc. from a boxplot, histogram and dataset, both by hand and using
RStudio and ggplot
