STA 272 Chapter 02 Notes and Codes Data Frames in R
A data frame is one of the most common data structures in R. It is similar to a table in a
relational database or a spreadsheet in Excel. A data frame is a collection of columns, and
different columns can hold different data types (numeric, character, factor, etc.), although
each individual column holds a single type. Each column is a vector, and all columns in a
data frame have the same length.
```r
# Creating a simple data frame
employee_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 28),
  Department = factor(c("HR", "Finance", "IT", "HR")),
  Salary = c(50000, 60000, 70000, 55000)
)

# Printing the data frame to display its contents
print(employee_data)
```
Output:
```
     Name Age Department Salary
1   Alice  25         HR  50000
2     Bob  30    Finance  60000
3 Charlie  35         IT  70000
4   David  28         HR  55000
```
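The description above (columns are vectors of equal length, each with its own type) can be checked directly. The following is a minimal sketch, not part of the original notes; the name staff is a hypothetical stand-in so the example runs on its own.

```r
# Hypothetical data frame with the same structure as the one in these notes
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 28),
  Department = factor(c("HR", "Finance", "IT", "HR")),
  Salary = c(50000, 60000, 70000, 55000)
)

# Number of rows and columns
dims <- dim(staff)

# Class of each column (Name is character in R 4.0 and later)
col_classes <- sapply(staff, class)
print(col_classes)

# Compact overview of the column types and first values
str(staff)

# Columns whose lengths cannot be recycled evenly raise an error;
# tryCatch() captures the error message instead of stopping the script
length_error <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)
print(length_error)
```

Note that data.frame() recycles shorter columns when the lengths divide evenly (for example, lengths 4 and 2), so only mismatches such as 3 and 2 produce the error.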
You can access specific rows, columns, or individual elements of a data frame: the $ operator
extracts a column by name, and [row, column] indexing selects rows or single elements.
```r
# Accessing a single column
employee_data$Name
```
```r
# Accessing the first row
employee_data[1, ]
```
```r
# Accessing the element in row 2, column 4
employee_data[2, 4]
```
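A few other access patterns are worth knowing alongside the ones above. This is a small sketch on a hypothetical data frame called staff (not from the original notes), using only base R indexing.

```r
# Hypothetical data frame used only for this illustration
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Salary = c(50000, 60000, 70000)
)

# [[ ]] returns the column as a plain vector, like $
ages <- staff[["Age"]]

# Selecting several columns by name returns a smaller data frame
name_salary <- staff[, c("Name", "Salary")]

# A row index combined with a column name returns a single value
bob_salary <- staff[2, "Salary"]

# Selecting one column with [ , ] gives a vector unless drop = FALSE is set
salary_df <- staff[, "Salary", drop = FALSE]
```

Referring to columns by name rather than by position tends to be more robust, because the code keeps working if the column order changes.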
```r
# Adding a new column "Experience" to the data frame
employee_data$Experience <- c(3, 5, 10, 4)
print(employee_data)
```
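```r
# Adding a new row: build a one-row data frame with the same columns,
# then append it to the existing data frame with rbind()
new_employee <- data.frame(
  Name = "Eve",
  Age = 27,
  Department = factor("IT", levels = levels(employee_data$Department)),
  Salary = 58000,
  Experience = 2
)
employee_data <- rbind(employee_data, new_employee)
print(employee_data)
```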
```r
# Removing the "Experience" column
employee_data$Experience <- NULL
print(employee_data)
```
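Columns can also be computed from existing ones, and several columns can be dropped in one step. The sketch below assumes a hypothetical data frame staff and a made-up column name SalaryThousands; neither appears in the original notes.

```r
# Hypothetical data frame used only for this illustration
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Salary = c(50000, 60000, 70000),
  Experience = c(3, 5, 10)
)

# A derived column: arithmetic on a column is applied element by element
staff$SalaryThousands <- staff$Salary / 1000

# Dropping several columns at once by keeping only the names not listed
drop_cols <- c("Experience", "SalaryThousands")
trimmed <- staff[, !(names(staff) %in% drop_cols), drop = FALSE]
print(trimmed)
```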
4. Built-in Datasets in R
R comes with several built-in datasets that are useful for learning and testing. Here are a few
commonly used ones:
The mtcars dataset contains data about car models, including variables like miles per gallon
(mpg), number of cylinders (cyl), horsepower (hp), etc.
```r
# Loading the mtcars dataset
data(mtcars)
```
Example Analysis:
```r
# Calculating the average miles per gallon
mean_mpg <- mean(mtcars$mpg)
mean_mpg
```
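Before computing statistics, it usually helps to look at the shape and first rows of the dataset. The following sketch uses only base R functions on mtcars; the variable names are chosen here for illustration.

```r
data(mtcars)

# Size of the dataset: number of cars (rows) and variables (columns)
n_cars <- nrow(mtcars)
n_vars <- ncol(mtcars)

# First three rows, to see what the variables look like
head(mtcars, 3)

# Five-number summary plus the mean for miles per gallon
summary(mtcars$mpg)

# Average miles per gallon within each cylinder count
mean_mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(mean_mpg_by_cyl)
```

Grouping by cyl shows how an overall average can hide differences between groups.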
The iris dataset is famous in data science and machine learning. It contains measurements
for different flower species (setosa, versicolor, virginica) and includes variables like sepal
length, sepal width, petal length, and petal width.
```r
# Loading the iris dataset
data(iris)
# Summary statistics
summary(iris)
```
Example Analysis:
```r
# Boxplot of Sepal Length by Species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length by Species",
        xlab = "Species", ylab = "Sepal Length",
        col = c("red", "green", "blue"))
```
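The boxplot groups observations by Species, which is stored as a factor. A short sketch like the one below, using only base R, can confirm how the grouping column is stored and how many flowers fall into each group.

```r
data(iris)

# Number of observations and variables
iris_dims <- dim(iris)

# Species is a factor; its levels are the three group labels
species_levels <- levels(iris$Species)
print(species_levels)

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
print(species_counts)
```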
```r
# Filter rows where Salary is greater than 55000
high_salary <- employee_data[employee_data$Salary > 55000, ]
print(high_salary)
```
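Filters can combine several conditions. The sketch below uses a hypothetical data frame staff and an arbitrary threshold of 52000 to show the & operator and the equivalent subset() call from base R.

```r
# Hypothetical data frame used only for this illustration
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("HR", "Finance", "IT", "HR"),
  Salary = c(50000, 60000, 70000, 55000)
)

# Both conditions must hold: HR department AND salary above 52000
hr_well_paid <- staff[staff$Department == "HR" & staff$Salary > 52000, ]
print(hr_well_paid)

# subset() expresses the same filter without repeating the data frame name
same_with_subset <- subset(staff, Department == "HR" & Salary > 52000)
print(same_with_subset)
```

Use & for "and" and | for "or" when combining row conditions; both work element by element across the column.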
```r
# Sort the employee data by Salary in descending order
sorted_employee_data <- employee_data[order(employee_data$Salary,
                                            decreasing = TRUE), ]
print(sorted_employee_data)
```
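order() also accepts more than one sort key. As a sketch on a hypothetical data frame staff, the example below sorts by Department alphabetically and then by Salary from highest to lowest within each department.

```r
# Hypothetical data frame used only for this illustration
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("HR", "Finance", "IT", "HR"),
  Salary = c(50000, 60000, 70000, 55000)
)

# First key: Department ascending; second key: Salary descending
# (negating a numeric column reverses its sort direction)
by_dept_then_salary <- staff[order(staff$Department, -staff$Salary), ]
print(by_dept_then_salary)
```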
```r
# Calculate the average Salary by Department
avg_salary_by_dept <- aggregate(Salary ~ Department, data = employee_data,
                                mean)
print(avg_salary_by_dept)
```
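aggregate() returns a data frame; when a named vector is enough, tapply() is a common base R alternative. The sketch below uses a hypothetical data frame staff and also counts the rows per group with table().

```r
# Hypothetical data frame used only for this illustration
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("HR", "Finance", "IT", "HR"),
  Salary = c(50000, 60000, 70000, 55000)
)

# Average salary per department, returned as a named vector
avg_by_dept <- tapply(staff$Salary, staff$Department, mean)
print(avg_by_dept)

# Number of employees per department
headcount <- table(staff$Department)
print(headcount)
```

Checking the group sizes alongside the averages is a useful habit, since a mean over one or two rows says little about the group.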
6. Application Examples using Built-in Datasets
The variables in mtcars can stand in for demographic data: the techniques applied here to car
characteristics carry over to socio-economic features.
```r
# Analyzing the relationship between weight (wt) and miles per gallon (mpg)
cor(mtcars$wt, mtcars$mpg)
# Visualization
plot(mtcars$wt, mtcars$mpg, main = "Car Weight vs. Miles Per Gallon",
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon", col = "purple",
     pch = 18)
```
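The same cor() function can compare several variables at once when given a data frame of numeric columns. The sketch below picks mpg, wt, and hp as an arbitrary illustration; the choice of columns is an assumption, not part of the original notes.

```r
data(mtcars)

# Columns chosen for illustration
vars <- c("mpg", "wt", "hp")

# Pairwise correlations between the selected columns
cor_matrix <- cor(mtcars[, vars])
print(round(cor_matrix, 2))
```

Each off-diagonal entry is the correlation between the row variable and the column variable, so the wt and mpg entry matches the single value computed above. A strongly negative value there indicates that heavier cars tend to have lower fuel economy.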
The species in iris can stand in for customer segments, with per-group summaries describing
the characteristics of each segment.
```r
# Calculate the average Sepal Length for each Species
avg_sepal_length <- aggregate(Sepal.Length ~ Species, data = iris, mean)
print(avg_sepal_length)
```
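To profile each group on every measurement at once, the formula interface of aggregate() accepts a dot on the left-hand side, meaning "all other columns". This is a minimal sketch using only base R.

```r
data(iris)

# Mean of every measurement column within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

The result has one row per species and one column per measurement, which makes it easy to compare groups side by side.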
Conclusion
Data frames are the go-to structure for most tabular data in R. Understanding how to create,
manipulate, and analyze data frames is crucial for any data analysis work. With built-in
datasets like mtcars and iris, you can practice these concepts and explore more complex
analyses.
Happy analyzing!