Data Science Summary Notes
Summaries of Spread
Formulas
Base R Codes
GGplot Codes
Quoting a price per square foot, instead of per square metre (the predominant Australian
metric unit), can make the price look lower.
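The size of this effect can be sketched with a quick conversion, using 1 square metre ≈ 10.7639 square feet. The price below is a made-up figure for illustration only, not taken from any dataset in these notes.

```r
# Hypothetical price per square metre (made-up figure)
price_per_sqm <- 10000

# 1 square metre is about 10.7639 square feet
sqft_per_sqm <- 10.7639

# The same price expressed per square foot looks much smaller
price_per_sqft <- price_per_sqm / sqft_per_sqm
round(price_per_sqft)   # about 929
```

The property and its price are unchanged; only the unit makes the number look roughly ten times smaller.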
A numerical summary reduces all the data to one simple number. This loses a lot of information,
but it allows easy communication and comparison.
● Maximum
● Minimum
● Center (mean, median)
● Spread (standard deviation, range, IQR)
Summaries of Spread
IQR
Range
Variance
Standard deviation
IQR
Range
Median
Variance
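The spread summaries listed above can be computed in R as in the rough sketch below. The vector `x` is made-up data, and `quantile()` and `IQR()` use R's default quantile method, which may differ slightly from a by-hand rule.

```r
# Made-up data for illustration
x <- c(1, 3, 5, 7, 9, 11)

# Range as a single number: maximum minus minimum
data_range <- max(x) - min(x)    # 10

# IQR: third quartile minus first quartile (R's default quantile method)
data_iqr <- IQR(x)               # 5

# Variance and standard deviation (sample versions, dividing by n - 1)
data_var <- var(x)               # 14
data_sd  <- sd(x)                # square root of the variance
```

Note that `range(x)` on its own returns the minimum and maximum as a pair, so the subtraction is done explicitly here.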
Function Formula
Area under the normal curve pnorm(number you want to find area under, mean = enter, sd = enter)
Note: default mean = 0, sd = 1. pnorm gives the area to the left of the number.
dnorm(
rnorm(
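A minimal sketch of `pnorm()` in use is given below. The mean of 60 and SD of 10 are hypothetical values chosen only to show the arguments.

```r
# Area to the left of 1.96 under the standard normal curve (mean = 0, sd = 1)
left_area <- pnorm(1.96)                        # about 0.975

# Explicit mean and sd (hypothetical values): area to the left of 70
left_area_shifted <- pnorm(70, mean = 60, sd = 10)

# Area to the right: subtract from 1, or ask for the upper tail directly
right_area   <- 1 - pnorm(1.96)
right_area_2 <- pnorm(1.96, lower.tail = FALSE)
```

Since 70 is one SD above a mean of 60, `left_area_shifted` equals `pnorm(1)`.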
Base R Codes
Function Code
Histogram with one variable hist(iris$Sepal.Length)
Average/mean mean(datasetname$variable)
Eg. mean(Newtown2017$Sold)
Tidyverse library(tidyverse)
head(datasetname,rows)
Eg. head(Newtown2017,2)
Note: this focuses on houses with 4 bedrooms (large), the mean price
Median median(datasetname$variable)
Quantile quantile(datasetname$variable)
IQR quantile(datasetname$variable)[4] - quantile(datasetname$variable)[2]
or IQR(datasetname$variable)
Ordering sort(datasetname$variable)
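The commands in this table can be tried together on the built-in `iris` dataset, as in the sketch below. Using `iris` is an assumption made for convenience; the same calls should work on any numeric column, such as `Newtown2017$Sold`.

```r
# iris ships with R, so no file needs to be loaded
x <- iris$Sepal.Length

hist(x)                        # histogram of one variable
x_mean   <- mean(x)            # average
x_median <- median(x)          # middle value

q <- quantile(x)               # 0%, 25%, 50%, 75%, 100%
iqr_by_hand <- q[4] - q[2]     # 75% quantile minus 25% quantile
iqr_builtin <- IQR(x)          # same value in one call

x_sorted <- sort(x)            # values in increasing order
head(iris, 2)                  # first 2 rows of the dataset
```

`sort()` is applied to a single column here because it works on vectors rather than on a whole data frame.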
GGplot Codes
NOTE:
Do not use $ in ggplot. Give the dataset first, then the column names inside aes().
Function Code
Histogram: 1 variable ggplot(iris, aes(Petal.Length)) + geom_histogram()
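A slightly fuller sketch of the histogram call is shown below. It assumes the ggplot2 package is installed (it is also loaded by `library(tidyverse)`); the choice of `bins = 30` is only an illustrative setting.

```r
library(ggplot2)   # also loaded by library(tidyverse)

# Dataset first, then the column name inside aes(), with no $
p <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(bins = 30)

p   # printing the object draws the plot
```

Column names are case sensitive, so `Petal.Length` must be spelled exactly as it appears in the dataset.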
Question Answer
Q: What feature of data can be easily communicated by a single numerical summary?
A: The maximum of quantitative data.
Q: For measuring the spread of data, what is wrong with calculating the mean of the "gaps", where "gap" = data - mean?
A: It will always be 0.
Q: Can standard deviation be negative?
A: No. It involves the RMS, which cannot be negative.
Q: In R, does the sd() command work out the population SD, and does the rafalib package need to be installed?
A: No, sd() gives the sample SD. The rafalib package must be installed, and its popsd() command gives the population SD.
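The answers about gaps and SD can be checked with a short sketch. The data are made up, and the population SD is computed by hand as the RMS of the gaps so that no extra package is needed; this is an illustration, not the rafalib implementation.

```r
# Made-up data for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Gaps from the mean: positive and negative gaps cancel out
gaps <- x - mean(x)
mean_gap <- mean(gaps)            # 0

# Population SD: root mean square (RMS) of the gaps, never negative
pop_sd <- sqrt(mean(gaps^2))      # 2 for this data

# sd() divides by n - 1 (sample SD), so it is slightly larger
sample_sd <- sd(x)

# Rescaling the sample SD recovers the population SD
pop_sd_from_sample <- sample_sd * sqrt((n - 1) / n)
```

Squaring the gaps before averaging is what stops them from cancelling, which is why the RMS gives a usable measure of spread.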
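# Read the project data (the path is relative to the working directory)
Project1 <- read.csv("Downloads/Project1Data.csv")
View(Project1)   # open the data frame in the data viewer
str(Project1)    # show each variable and its type
library(tidyverse)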
Potential contents:
Skills:
- Find the mean, median, IQR, range etc. from a boxplot, a histogram and a dataset, by hand, in
RStudio (base R) and with ggplot