Phan Project2 Report
Key Findings:
Assignment Part 1:
1.
Reads a CSV file named "2015.csv" and stores the data in a variable called data_2015. The
header = TRUE argument indicates that the first row of the CSV file contains the column
names. The first few rows of the data_2015 dataset are then displayed using the head() function.
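A minimal sketch of the calls described in this step:
data_2015 <- read.csv("2015.csv", header = TRUE)  # first row supplies the column names
head(data_2015)                                   # preview the first few rows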
2.
Renaming Columns: This line assigns a new set of column names to the data_2015 dataframe. It
uses the names() function to update the column names to more readable and properly formatted
labels.
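A sketch of the renaming step; the exact labels used in the project are not reproduced here, so the vector below is illustrative and assumes the standard columns of the 2015 World Happiness dataset:
names(data_2015) <- c("Country", "Region", "Happiness Rank", "Happiness Score",
                      "Standard Error", "Economy (GDP per Capita)", "Family",
                      "Health (Life Expectancy)", "Freedom",
                      "Trust (Government Corruption)", "Generosity",
                      "Dystopia Residual")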
3.
View the dataset in a new tab:
View(data_2015)
4.
This line uses the glimpse() function to provide a transposed, concise view of the data_2015
dataframe, allowing you to see the structure and a preview of the dataset.
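For reference, the call is simply:
library(dplyr)       # glimpse() is re-exported by dplyr
glimpse(data_2015)   # one row per column, showing type and a preview of values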
5.
This line loads the janitor package, which provides tools for cleaning and organizing data in R,
such as the clean_names() function.
The clean_names() function is used to modify the column names of the data_2015 dataframe.
This function converts column names to lowercase and replaces any spaces or special characters
with underscores (_). It ensures that all column names are in a consistent and more
programming-friendly format.
Display the data_2015 dataframe after cleaning the column names.
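A sketch of the cleaning step:
library(janitor)

data_2015 <- clean_names(data_2015)   # e.g. "Happiness Score" becomes "happiness_score"
data_2015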
6.
• Using the dplyr Package: The %>% operator (pipe operator) is commonly used in the
dplyr package to pass the output of one function as the input to the next function, making
the code more readable and concise.
• Selecting Specific Columns: The select() function is used to extract specific columns
from the data_2015 dataframe.
• Storing the Selected Data: The resulting dataframe, which contains only the specified
columns, is stored in a new variable named happy_df.
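A sketch of the selection; the exact column list is not reproduced in the report, so the columns below are assumed from those referenced later in the analysis (country, region, happiness_score, freedom):
library(dplyr)

happy_df <- data_2015 %>%
  select(country, region, happiness_score, freedom)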
7.
Function Used: The head() function is used to return the first few rows of a dataframe.
Operation:
• happy_df: This is the dataframe from which rows are being selected.
• 10: Specifies that the first 10 rows of happy_df should be returned.
Storing the Result: The first 10 rows of the happy_df dataframe are stored in a new dataframe
called top_ten_df.
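A sketch of this step:
top_ten_df <- head(happy_df, 10)   # keep only the first 10 rows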
8.
Using the Pipe Operator (%>%): The pipe operator is used to pass happy_df as input to the
filter() function.
Filtering Rows: The filter() function, which is part of the dplyr package, is used to select rows
that satisfy a specific condition.
Storing the Filtered Data: The resulting filtered dataframe is stored in a new variable called
no_freedom_df.
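The exact filtering condition is not shown in the report; the cutoff below is an illustrative assumption for identifying low-freedom countries:
no_freedom_df <- happy_df %>%
  filter(freedom < 0.2)   # assumed threshold; replace with the project's actual condition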
9.
Using the Pipe Operator (%>%): The pipe operator is used to pass happy_df as input to the
arrange() function.
Sorting the Data: The arrange() function from the dplyr package is used to sort rows of the
dataframe.
Storing the Sorted Data: The resulting dataframe, sorted in descending order of freedom, is
stored in a new variable called best_freedom_df.
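A sketch of the sorting step:
best_freedom_df <- happy_df %>%
  arrange(desc(freedom))   # highest freedom scores first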
10.
Using the Pipe Operator (%>%): The pipe operator is used to pass the data_2015 dataframe as
input to the mutate() function.
Creating a New Column: The mutate() function from the dplyr package is used to add a new
column or modify existing ones.
Storing the Updated Dataframe: The resulting dataframe, which now includes the new gff_stat
column, is stored back into data_2015.
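The formula behind gff_stat is not reproduced in the report; the sum below (generosity + family + freedom) is an assumption for illustration only:
data_2015 <- data_2015 %>%
  mutate(gff_stat = generosity + family + freedom)   # assumed definition of gff_stat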
11.
The group_by() function groups the happy_df dataframe by the region column. This means that
subsequent operations will be performed separately for each group (each region).
The summarise() function is used to create new summary variables for each group:
• country_count = n(): The n() function counts the number of rows (countries) in each
region.
• mean_happiness = mean(happiness_score): Calculates the mean (average) of the
happiness_score column for each region.
• mean_freedom = mean(freedom): Calculates the mean of the freedom column for each region.
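A sketch of the grouped summary, assuming the cleaned column names used above:
happy_df %>%
  group_by(region) %>%
  summarise(
    country_count  = n(),
    mean_happiness = mean(happiness_score),
    mean_freedom   = mean(freedom)
  )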
12.
Reading the CSV File:
• read.csv("baseball.csv", header = TRUE): This function reads a CSV file named
baseball.csv and loads its content into a dataframe named baseball.
• The header = TRUE argument specifies that the first row of the CSV file contains column
names, so R will use this row to name the columns in the dataframe.
Storing the Data: The loaded data is stored in the baseball dataframe. This dataframe will contain
all rows and columns from the CSV file, with column names set according to the first row of the
CSV.
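A sketch of the import:
baseball <- read.csv("baseball.csv", header = TRUE)   # first row supplies the column names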
14.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe as
input to the filter() function.
Filtering the Rows: The filter() function, which is part of the dplyr package, is used to keep rows
that meet a certain condition. This condition checks the AB column (likely representing the
number of "At Bats" for each player). The filter() function will retain only the rows where AB is
not equal to 0. All rows where AB is 0 will be removed.
Storing the Filtered Data: The filtered dataframe, which no longer includes players with 0 at-bats,
is stored back into the original baseball dataframe.
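A sketch of this step:
library(dplyr)

baseball <- baseball %>%
  filter(AB != 0)   # drop players with zero at-bats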
15.
• Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe
as input to the mutate() function.
• Creating a New Column: The mutate() function, which is part of the dplyr package, is
used to add or modify columns in a dataframe. This formula divides the number of hits
(H) by the number of at-bats (AB) for each player to calculate the batting average (BA).
• Storing the Updated Dataframe: The resulting dataframe, which now includes the new
BA column, is stored back into the original baseball dataframe.
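A sketch of the batting-average calculation described above:
baseball <- baseball %>%
  mutate(BA = H / AB)   # batting average = hits / at-bats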
16.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe as
input to the mutate() function.
Creating a New Column: The mutate() function from the dplyr package is used to add or
modify columns in a dataframe. This formula calculates the On-Base Percentage (OBP) for each
player. It is computed as the sum of hits (H) and bases on balls (BB) divided by the sum of
at-bats (AB) and bases on balls (BB).
Storing the Updated Dataframe: The resulting dataframe, which now includes the new OBP
column, is stored back into the original baseball dataframe.
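A sketch of this step, using the OBP formula stated above:
baseball <- baseball %>%
  mutate(OBP = (H + BB) / (AB + BB))   # on-base percentage as defined in this report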
17.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe
through multiple operations.
The arrange() function is used to sort the rows of the dataframe. The argument desc(SO) sorts the
dataframe in descending order based on the SO column, where SO likely represents the number
of strikeouts. This means players with the highest number of strikeouts are ordered at the top.
The top_n() function selects the top n rows from the dataframe. In this case:
• 10: Specifies that the top 10 rows should be selected.
• wt = SO: Indicates that the selection should be based on the SO column, so it picks the
top 10 players with the highest strikeouts.
Storing the Result: The resulting dataframe, which contains the top 10 players with the most
strikeouts, is stored in a new variable named strikeout_artist.
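A sketch of the chained operations described in this step:
strikeout_artist <- baseball %>%
  arrange(desc(SO)) %>%      # most strikeouts first
  top_n(10, wt = SO)         # keep the top 10 players by strikeouts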
18.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe as
input to the filter() function.
Filtering Rows: The filter() function, which is part of the dplyr package, is used to select rows
that meet one or more conditions.
• AB >= 300: Selects players who have had at least 300 at-bats.
• G >= 100: Selects players who have played at least 100 games.
• The | symbol represents the logical "OR" operator. This means that rows meeting either
of the two conditions (i.e., AB >= 300 or G >= 100) will be retained in the resulting
dataframe.
Storing the Filtered Data: The filtered dataframe, which includes only eligible players based on
the given criteria, is stored in a new variable called eligible_df.
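A sketch of the eligibility filter:
eligible_df <- baseball %>%
  filter(AB >= 300 | G >= 100)   # at least 300 at-bats OR at least 100 games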
19.
Using ggplot() Function:
• ggplot(eligible_df, aes(BA)): This initializes a ggplot object using the eligible_df
dataframe.
o eligible_df: The dataframe that contains the data to be visualized.
o aes(BA): The aes() function maps the BA column (batting average) to the x-axis
of the plot.
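The geom layer used in the project is not shown in this excerpt; a histogram of batting average is assumed below for illustration:
library(ggplot2)

ggplot(eligible_df, aes(BA)) +
  geom_histogram(bins = 30)   # assumed geom; the report only describes the ggplot()/aes() call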
20.
Selects specific columns from the baseball dataframe using the select() function.
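The exact columns selected are not listed in the report; the set below is illustrative and uses only columns mentioned elsewhere in this analysis:
baseball %>%
  select(G, AB, H, BB, SO, BA, OBP)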
Conclusion/Recommendation:
In conclusion, this project demonstrates the utility of R programming for data manipulation and
visualization using the ggplot2, tidyverse, and dplyr packages. The two-part analysis provides
insights into distinct datasets, focusing on understanding happiness metrics across global regions
and evaluating baseball player performance based on key statistics.