Lab 2
Contents
Objectives
Centrality
Spread
Graphing
Importing Data
Exercises
References
Objectives
• Learn how to calculate measures of centrality
• Learn how to calculate spread
• Learn how to graph data
• Learn how to import and process CSV files
Centrality
The primary measures of centrality we will use are the mean and the median. To calculate the mean, use
the function mean(data).
x <- sample(1:100, 10)
y <- sample(1:100, 10)
print(mean(x))
## [1] 44.4
print(mean(y))
## [1] 46
Another measure of centrality is the median. To calculate the median, use the function
median(data).
print(median(x))
## [1] 56.5
print(median(y))
## [1] 45
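To make clear what mean() and median() compute, here is a minimal sketch that reproduces both by hand. The vector v holds made-up values chosen for illustration; it is not the lab's random sample.

```r
# Hypothetical values chosen for illustration (not the lab's sample)
v <- c(2, 4, 4, 5, 100)

# Mean: sum of the values divided by how many there are
manual_mean <- sum(v) / length(v)

# Median: middle value of the sorted data (v has an odd length)
sorted_v <- sort(v)
manual_median <- sorted_v[(length(v) + 1) / 2]

print(manual_mean)    # matches mean(v)
print(manual_median)  # matches median(v)
```

In this sketch the single large value 100 pulls the mean far above most of the data, while the median stays at the middle value.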
Spread
The primary measure of spread we will use is the standard deviation. To calculate the standard deviation,
use the function sd(data).
print(sd(x))
## [1] 25.94096
print(sd(y))
## [1] 28.40188
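As a rough check on what sd() returns, the sketch below computes the sample standard deviation from its formula, using the n - 1 denominator that sd() uses. The vector v is a made-up example, not the lab data.

```r
# Hypothetical values chosen for illustration (not the lab's sample)
v <- c(2, 4, 4, 5, 100)
n <- length(v)

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((v - mean(v))^2) / (n - 1))

print(manual_sd)
print(sd(v))  # should agree with manual_sd
```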
Alternatively, you may want to find the interquartile range (IQR) of the data. One way to calculate the IQR is to find
the difference between the 75th percentile and the 25th percentile.
# find the 25th percentile
q25 <- quantile(x, 0.25)
# find the 75th percentile
q75 <- quantile(x, 0.75)
# find the IQR
iqr <- q75 - q25
print(iqr)
## 75%
## 42.25
You can also find the IQR directly using the function IQR(data).
print(IQR(x))
## [1] 42.25
print(IQR(y))
## [1] 41.25
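The subtraction of two quantile() results keeps the name of the first operand, which is why the manual IQR is printed under a "75%" label. A small sketch, on a made-up vector v, of requesting both quartiles in one call and dropping that label with unname():

```r
# Hypothetical values chosen for illustration (not the lab's sample)
v <- c(2, 4, 4, 5, 100)

# quantile() accepts several probabilities at once
q <- quantile(v, c(0.25, 0.75))
print(q)

# Subtract the quartiles and drop the leftover "75%" name
manual_iqr <- unname(q[2] - q[1])

print(manual_iqr)
print(IQR(v))  # should agree with manual_iqr
```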
Graphing
To draw a scatter plot, use the function plot(x, y).
plot(x, y)
[Figure: scatter plot with x on the horizontal axis and y on the vertical axis]
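A scatter plot is usually easier to read with a title and axis labels. The sketch below uses the standard main, xlab, ylab and pch arguments of plot(); the vectors a and b and the fixed seed are assumptions made only so the example is self-contained and reproducible.

```r
# Assumption: a fixed seed so the same points are drawn on every run
set.seed(1)
a <- sample(1:100, 10)
b <- sample(1:100, 10)

# main sets the title, xlab/ylab label the axes, pch = 19 draws solid dots
plot(a, b,
     main = "Scatter plot of b against a",
     xlab = "a",
     ylab = "b",
     pch = 19)
```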
You can make a histogram using the function hist(data).
hist(x)
[Figure: "Histogram of x", with x on the horizontal axis and Frequency on the vertical axis]
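hist() also returns the bins it used, which helps when checking a histogram against the data. A minimal sketch, assuming a made-up sample v and bin edges every 20 units (both chosen for illustration):

```r
# Assumption: a fixed seed so the sketch is reproducible
set.seed(1)
v <- sample(1:100, 10)

# breaks sets the bin edges explicitly; they must cover the range of v
h <- hist(v,
          breaks = seq(0, 100, by = 20),
          main = "Histogram of v",
          xlab = "v")

print(h$breaks)  # bin edges
print(h$counts)  # number of values falling in each bin
```

The counts always add up to the number of observations, which is a quick sanity check on the chosen breaks.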
You can make a boxplot using the function boxplot(data). You can also draw boxplots of several data
sets side by side by passing them as separate arguments (or as a single list).
# draw both data sets, label each box, and add a title
boxplot(x, y, names=c("X", "Y"), main="Boxplot of X and Y")
[Figure: "Boxplot of X and Y", with one box for X and one for Y]
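When the data sets are passed as a named list, boxplot() takes the box labels from the list names. The sketch below also keeps the value boxplot() returns, which holds the five numbers drawn for each box. The names A and B, the vectors, and the seed are illustrative assumptions.

```r
# Assumption: a fixed seed so the sketch is reproducible
set.seed(1)
a <- sample(1:100, 10)
b <- sample(1:100, 10)

# A named list: the names become the labels under each box
groups <- list(A = a, B = b)
bp <- boxplot(groups, main = "Boxplot of A and B")

# One column per box; row 3 is the median of each data set
print(bp$stats)
print(bp$names)
```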
Importing Data
In order to import data from a CSV file, you can use the read.csv() function.
data <- read.csv("organizations-100.csv", header=TRUE, sep=",")
## [1] 38
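After importing, it helps to check the column names and types before going further. The sketch below writes a tiny stand-in CSV to a temporary file so that it runs without organizations-100.csv; the company names and the rows are invented for illustration. Note that read.csv() turns spaces in header names into dots by default, which is why the column is later referred to as Number.of.employees.

```r
# Hypothetical miniature file standing in for organizations-100.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Industry,Number of employees",
  "Acme,Automotive,921",
  "Globex,Transportation,7870",
  "Initech,Plastics,58"
), csv_path)

orgs <- read.csv(csv_path, header = TRUE, sep = ",")

print(names(orgs))  # "Number of employees" becomes "Number.of.employees"
print(nrow(orgs))   # number of data rows (the header is not counted)
str(orgs)           # column types at a glance
```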
You can also filter the data based on a condition. For instance, you may want to keep only the rows for
companies with more than 100 employees. The condition goes before the comma because it selects rows.
filtered_data <- data[data$Number.of.employees > 100,]
print(head(filtered_data))
## 4 Automotive 921
## 5 Transportation 7870
## 6 Primary / Secondary Education 4914
# print the number of companies with more than 100 employees
print(nrow(filtered_data))
## [1] 100
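The same row filter can be written with subset(), and conditions can be combined with &. A minimal sketch on a small invented data frame (the names and employee counts are assumptions, not rows from the lab's file):

```r
# Hypothetical data frame standing in for the imported CSV
orgs <- data.frame(
  Name = c("Acme", "Globex", "Initech"),
  Number.of.employees = c(921, 7870, 58)
)

# Bracket indexing: a logical condition before the comma selects rows
big <- orgs[orgs$Number.of.employees > 100, ]

# subset() expresses the same filter without repeating orgs$
big2 <- subset(orgs, Number.of.employees > 100)

# Combine conditions with & to keep rows that satisfy both
mid_sized <- orgs[orgs$Number.of.employees > 100 &
                  orgs$Number.of.employees < 5000, ]

print(nrow(big))
print(mid_sized)
```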
Exercises
Part 1
1. Generate a sample of 1000 numbers from 1 to 100 (sampling with replacement, since there are only 100
distinct values) and calculate the mean, median, standard deviation, and IQR of the sample
2. Create a histogram of the sample
3. Create a boxplot of the sample
Part 2
1. Calculate the mean, median, standard deviation, and IQR of the number of employees in the data set
organizations-100.csv
2. Create a scatter plot of the number of employees and the year founded
3. Create a histogram of the number of employees
4. Create a boxplot of the number of employees
5. Filter the data to only include companies with more than 100 employees and calculate the mean, median,
standard deviation, and IQR of the number of employees in the filtered data set
References
Source of data: https://github.com/datablist/sample-csv-files?tab=readme-ov-file