
Project 2 – Exploratory Data Analysis

Yvette Lee Phan


College of Professional Studies, Northeastern University
ALY 6000 Introduction to Analytics
Professor Kayal Chandrasekaran
September 28, 2024
Introduction:
This project explores data manipulation and visualization techniques in R using the tidyverse,
chiefly the `dplyr` and `ggplot2` packages. The analysis is divided into two main parts:
1. Happiness Dataset Analysis (2015 Data):
- Reads and cleans the 2015 World Happiness dataset.
- Conducts exploratory data analysis (EDA) including column renaming, data selection,
filtering, and grouping.
- Creates new variables and summary statistics to better understand happiness scores across
different regions.

2. Baseball Dataset Analysis:
- Focuses on performance metrics for baseball players, including batting average (BA) and
on-base percentage (OBP).
- Filters and ranks players based on performance criteria.
- Visualizes player statistics through histograms and determines the most valuable player
(MVP).

Key Findings:

Assignment Part 1:

1.
Reads a CSV file named "2015.csv" and stores the data in a variable called data_2015. The
header = TRUE argument indicates that the first row of the CSV file contains the column
names. The head() function then displays the first few rows of the data_2015 dataset.
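The read-and-preview step can be sketched as follows. Because the original "2015.csv" is not reproduced in this report, the sketch writes a tiny stand-in file to a temporary path; the three rows and their values are made up for illustration only.

```r
# Hypothetical stand-in for "2015.csv": three made-up rows in a temp file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Region,Happiness Score",
  "Aland,North,7.5",
  "Borduria,South,5.2",
  "Cascadia,North,6.8"
), csv_path)

# header = TRUE takes the column names from the first row of the file.
data_2015 <- read.csv(csv_path, header = TRUE)

# head() shows up to the first six rows by default (all three rows here).
head(data_2015)

# By default read.csv() makes names syntactically valid, so the space in
# "Happiness Score" becomes a dot.
names(data_2015)
```

The dotted column name produced here is one reason the later renaming and cleaning steps are useful.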

2.
Renaming Columns: This line assigns a new set of column names to the data_2015 dataframe. It
uses the names() function to update the column names to more readable and properly formatted
labels.
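A minimal sketch of renaming with names() is shown below. The three columns and the new labels are assumptions chosen for illustration; the report does not list the exact names that were assigned.

```r
# Made-up two-row data frame with dotted names, as read.csv() would produce.
data_2015 <- data.frame(
  Country = c("Aland", "Borduria"),
  Happiness.Score = c(7.5, 5.2),
  Economy..GDP.per.Capita. = c(1.3, 0.9)
)

# Assigning to names() replaces all column names at once; the vector must
# have one entry per column, in column order.
names(data_2015) <- c("Country", "Happiness Score", "Economy (GDP per Capita)")

names(data_2015)
```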

3.
View the dataset in a new tab:
View(data_2015)

4.
This line uses the glimpse() function to provide a transposed, concise view of the data_2015
dataframe, allowing you to see the structure and a preview of the dataset.

5.
The janitor package is loaded first; it provides tools for cleaning and organizing data in R,
such as the clean_names() function.
The clean_names() function is then used to modify the column names of the data_2015 dataframe.
It converts column names to lowercase and replaces spaces and special characters with
underscores (_), so that all column names are in a consistent, programming-friendly format.
Finally, the data_2015 dataframe is displayed to confirm the cleaned column names.
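The cleaning step might look like the sketch below, assuming the janitor package is installed. The data frame is a made-up stand-in with human-readable column names.

```r
library(janitor)

# Made-up data with readable but awkward column names.
data_2015 <- data.frame(
  `Country` = c("Aland", "Borduria"),
  `Happiness Score` = c(7.5, 5.2),
  check.names = FALSE  # keep the space in the name as-is
)

# clean_names() lowercases names and replaces spaces with underscores.
data_2015 <- clean_names(data_2015)

names(data_2015)
```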
6.

• Using the dplyr Package: The %>% operator (pipe operator), which dplyr re-exports from
the magrittr package, passes the output of one function as the first argument of the next
function, making the code more readable and concise.
• Selecting Specific Columns: The select() function is used to extract specific columns
from the data_2015 dataframe.
• Storing the Selected Data: The resulting dataframe, which contains only the specified
columns, is stored in a new variable named happy_df.
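A sketch of the selection step follows. The report does not state which columns were kept, so the four columns chosen here (country, region, happiness_score, freedom) are an assumption based on the columns used in later steps; the data values are made up.

```r
library(dplyr)

# Made-up stand-in for the cleaned data_2015 dataframe.
data_2015 <- data.frame(
  country = c("Aland", "Borduria"),
  region = c("North", "South"),
  happiness_score = c(7.5, 5.2),
  family = c(1.3, 0.9),
  freedom = c(0.6, 0.3)
)

# Keep only the (assumed) columns of interest.
happy_df <- data_2015 %>%
  select(country, region, happiness_score, freedom)

happy_df
```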

7.
Function Used: The head() function is used to return the first few rows of a dataframe.
Operation:
• happy_df: This is the dataframe from which rows are being selected.
• 10: Specifies that the first 10 rows of happy_df should be returned.
Storing the Result: The first 10 rows of the happy_df dataframe are stored in a new dataframe
called top_ten_df.

8.
Using the Pipe Operator (%>%): The pipe operator is used to pass happy_df as input to the
filter() function.
Filtering Rows: The filter() function, which is part of the dplyr package, is used to select rows
that satisfy a specific condition.
Storing the Filtered Data: The resulting filtered dataframe is stored in a new variable called
no_freedom_df.
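The filtering step can be sketched as below. The actual condition is not stated in this report; the threshold freedom < 0.20 is a hypothetical choice made only to show the pattern, and the data are made up.

```r
library(dplyr)

# Made-up stand-in for happy_df.
happy_df <- data.frame(
  country = c("Aland", "Borduria", "Cascadia", "Doria"),
  freedom = c(0.45, 0.15, 0.67, 0.08)
)

# Keep rows whose freedom score is below a hypothetical cutoff of 0.20.
no_freedom_df <- happy_df %>%
  filter(freedom < 0.20)

no_freedom_df
```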
9.
Using the Pipe Operator (%>%): The pipe operator is used to pass happy_df as input to the
arrange() function.
Sorting the Data: The arrange() function from the dplyr package is used to sort rows of the
dataframe.
Storing the Sorted Data: The resulting dataframe, sorted in descending order of freedom, is
stored in a new variable called best_freedom_df.
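A minimal sketch of the descending sort, using a made-up stand-in for happy_df:

```r
library(dplyr)

# Made-up stand-in for happy_df.
happy_df <- data.frame(
  country = c("Aland", "Borduria", "Cascadia"),
  freedom = c(0.45, 0.67, 0.12)
)

# desc() reverses the default ascending order, so the highest freedom
# score comes first.
best_freedom_df <- happy_df %>%
  arrange(desc(freedom))

best_freedom_df
```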

10.
Using the Pipe Operator (%>%): The pipe operator is used to pass the data_2015 dataframe as
input to the mutate() function.
Creating a New Column: The mutate() function from the dplyr package is used to add a new
column or modify existing ones.
Storing the Updated Dataframe: The resulting dataframe, which now includes the new gff_stat
column, is stored back into data_2015.
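The mutate() step might look like the sketch below. The report does not give the formula for gff_stat; defining it as the sum of family, freedom, and generosity is an assumption made here for illustration, and the data are made up.

```r
library(dplyr)

# Made-up stand-in for data_2015.
data_2015 <- data.frame(
  country = c("Aland", "Borduria"),
  family = c(1.3, 0.9),
  freedom = c(0.6, 0.3),
  generosity = c(0.3, 0.2)
)

# Assumed definition: gff_stat = family + freedom + generosity.
data_2015 <- data_2015 %>%
  mutate(gff_stat = family + freedom + generosity)

data_2015
```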

11.
The group_by() function groups the happy_df dataframe by the region column. This means that
subsequent operations will be performed separately for each group (each region).
The summarise() function is used to create new summary variables for each group:
• country_count = n(): The n() function counts the number of rows (countries) in each
region.
• mean_happiness = mean(happiness_score): Calculates the mean (average) of the
happiness_score column for each region.
• mean_freedom = mean(freedom): Calculates the mean of the freedom column for each
region.

Storing the Results: The resulting summary statistics are stored in a new dataframe called
regional_stats_df.
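The grouping and summary described above can be sketched as follows, using the three summary expressions named in the text and a made-up stand-in for happy_df:

```r
library(dplyr)

# Made-up stand-in for happy_df: two regions, three countries.
happy_df <- data.frame(
  country = c("Aland", "Cascadia", "Borduria"),
  region = c("North", "North", "South"),
  happiness_score = c(7.5, 6.5, 5.0),
  freedom = c(0.6, 0.4, 0.3)
)

# One output row per region, with a count and two means.
regional_stats_df <- happy_df %>%
  group_by(region) %>%
  summarise(
    country_count = n(),
    mean_happiness = mean(happiness_score),
    mean_freedom = mean(freedom)
  )

regional_stats_df
```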

12.
Reading the CSV File:
• read.csv("baseball.csv", header = TRUE): This function reads a CSV file named
baseball.csv and loads its content into a dataframe named baseball.
• The header = TRUE argument specifies that the first row of the CSV file contains column
names, so R will use this row to name the columns in the dataframe.
Storing the Data: The loaded data is stored in the baseball dataframe. This dataframe will contain
all rows and columns from the CSV file, with column names set according to the first row of the
CSV.

14.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe as
input to the filter() function.
Filtering the Rows: The filter() function, which is part of the dplyr package, is used to keep rows
that meet a certain condition. Here the condition checks the AB column (the number of at-bats
for each player) and retains only the rows where AB is not equal to 0; all rows where AB is 0 are
removed. This also prevents division by zero when batting average is computed in a later step.
Storing the Filtered Data: The filtered dataframe, which no longer includes players with 0
at-bats, is stored back into the original baseball dataframe.
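A minimal sketch of this filter, using made-up players in place of the real baseball.csv data:

```r
library(dplyr)

# Made-up stand-in for the baseball dataframe.
baseball <- data.frame(
  Last = c("Adams", "Baker", "Cruz"),
  AB = c(500, 0, 320),
  H = c(150, 0, 80)
)

# Drop players who never batted (AB equal to 0).
baseball <- baseball %>%
  filter(AB != 0)

baseball
```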
15.
• Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe
as input to the mutate() function.
• Creating a New Column: The mutate() function, which is part of the dplyr package, is
used to add or modify columns in a dataframe. This formula divides the number of hits
(H) by the number of at-bats (AB) for each player to calculate the batting average (BA).
• Storing the Updated Dataframe: The resulting dataframe, which now includes the new
BA column, is stored back into the original baseball dataframe.

16.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe as
input to the mutate() function.
Creating a New Column: The mutate() function from the dplyr package is used to add or
modify columns in a dataframe. This formula calculates the On-Base Percentage (OBP) for each
player. It is computed as the sum of hits (H) and bases on balls (BB) divided by the sum of
at-bats (AB) and bases on balls (BB).
Storing the Updated Dataframe: The resulting dataframe, which now includes the new OBP
column, is stored back into the original baseball dataframe.
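The two derived columns described in steps 15 and 16 can be sketched together as below. The formulas are the ones given in the text; the two players and their numbers are made up.

```r
library(dplyr)

# Made-up stand-in for the baseball dataframe.
baseball <- data.frame(
  Last = c("Adams", "Baker"),
  H = c(150, 90),    # hits
  AB = c(500, 400),  # at-bats
  BB = c(50, 20)     # bases on balls
)

baseball <- baseball %>%
  mutate(BA = H / AB) %>%                  # batting average
  mutate(OBP = (H + BB) / (AB + BB))       # on-base percentage, as defined in this report

baseball
```

For the first made-up player this gives BA = 150 / 500 and OBP = (150 + 50) / (500 + 50).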

17.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe
through multiple operations.
The arrange() function is used to sort the rows of the dataframe. The argument desc(SO) sorts the
dataframe in descending order based on the SO column, where SO represents the number of
strikeouts. This means players with the highest number of strikeouts are ordered at the top.
The top_n() function selects the top n rows from the dataframe, ranked by a weighting column.
In this case:
• 10: Specifies that the top 10 rows should be selected.
• wt = SO: Indicates that the selection should be based on the SO column, so it picks the
10 players with the highest strikeouts. If several players tie at the cutoff, top_n() keeps
all of them, so the result can contain more than 10 rows.

Storing the Result: The resulting dataframe, which contains the top 10 players with the most
strikeouts, is stored in a new variable named strikeout_artist.
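A sketch of the sort-then-select pipeline follows, with twelve made-up players and no tied strikeout totals. Note that top_n() still works but is marked as superseded in current dplyr releases in favour of slice_max().

```r
library(dplyr)

# Made-up stand-in for the baseball dataframe: 12 players, distinct SO values.
baseball <- data.frame(
  Last = paste0("Player", 1:12),
  SO = c(120, 95, 143, 80, 101, 77, 132, 60, 88, 110, 54, 99)
)

# Sort by strikeouts (highest first), then keep the 10 largest SO values.
# top_n() preserves the row order produced by arrange().
strikeout_artist <- baseball %>%
  arrange(desc(SO)) %>%
  top_n(10, wt = SO)

strikeout_artist
```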

18.
Using the Pipe Operator (%>%): The pipe operator is used to pass the baseball dataframe as
input to the filter() function.
Filtering Rows: The filter() function, which is part of the dplyr package, is used to select rows
that meet one or more conditions.
• AB >= 300: Selects players who have had at least 300 at-bats.
• G >= 100: Selects players who have played at least 100 games.
• The | symbol represents the logical "OR" operator. This means that rows meeting either
of the two conditions (i.e., AB >= 300 or G >= 100) will be retained in the resulting
dataframe.
Storing the Filtered Data: The filtered dataframe, which includes only eligible players based on
the given criteria, is stored in a new variable called eligible_df.
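The eligibility filter with the logical OR can be sketched as follows, using four made-up players:

```r
library(dplyr)

# Made-up stand-in for the baseball dataframe.
baseball <- data.frame(
  Last = c("Adams", "Baker", "Cruz", "Diaz"),
  AB = c(450, 120, 310, 50),  # at-bats
  G = c(140, 110, 80, 30)     # games played
)

# A player is kept if EITHER condition holds.
eligible_df <- baseball %>%
  filter(AB >= 300 | G >= 100)

eligible_df
```

Here Baker qualifies on games alone and Cruz on at-bats alone, while Diaz meets neither condition and is dropped.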

19.
Using ggplot() Function:
• ggplot(eligible_df, aes(BA)): This initializes a ggplot object using the eligible_df
dataframe.
o eligible_df: The dataframe that contains the data to be visualized.
o aes(BA): The aes() function maps the BA column (batting average) to the x-axis
of the plot.

Adding the Histogram Layer:
• geom_histogram(): Adds a histogram layer to the ggplot.
• Parameters:
o binwidth = 0.025: Specifies the width of each bin in the histogram. Each bin will
represent a range of 0.025 units of batting average.
o fill = "green": Sets the fill color of the bars to green.
o color = "blue": Sets the border color of the bars to blue.
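The histogram described above might be built as in the sketch below, using the same aesthetic mapping and parameters named in the text. The eight batting averages are made up, and the object name ba_hist is introduced here only for illustration.

```r
library(ggplot2)

# Made-up stand-in for eligible_df with a BA column.
eligible_df <- data.frame(
  BA = c(0.251, 0.302, 0.275, 0.230, 0.288, 0.264, 0.310, 0.245)
)

# Map BA to the x-axis and add a histogram layer with 0.025-wide bins.
ba_hist <- ggplot(eligible_df, aes(BA)) +
  geom_histogram(binwidth = 0.025, fill = "green", color = "blue")

# Printing the object draws the plot.
ba_hist
```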

20.
Select specific columns from the baseball dataframe:
• Last: Last name of the player.
• First: First name of the player.
• HR: Home runs hit by the player.
• RBI: Runs batted in.
• OBP: On-base percentage.
The resulting dataframe will contain only these columns.
Sort the dataframe in descending order based on the following criteria:
• desc(HR): Home runs (HR) are sorted in descending order, so players with the most
home runs are at the top.
• desc(RBI): If two or more players have the same number of home runs, the dataframe
is further sorted by runs batted in (RBI) in descending order.
• desc(OBP): If two or more players have the same HR and RBI, the dataframe is sorted
by on-base percentage (OBP) in descending order.
Display the first row of the sorted mvp_award dataframe, representing the player who ranks
highest according to the specified criteria (most home runs, then most RBIs, then highest
on-base percentage).
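The MVP selection can be sketched as follows. The three players and their statistics are made up, and two of them are deliberately tied on home runs to show the RBI tie-break.

```r
library(dplyr)

# Made-up stand-in for the baseball dataframe.
baseball <- data.frame(
  Last = c("Adams", "Baker", "Cruz"),
  First = c("Al", "Bo", "Cy"),
  HR = c(40, 40, 35),
  RBI = c(100, 110, 120),
  OBP = c(0.380, 0.360, 0.400),
  G = c(150, 148, 155)  # extra column that select() drops
)

# Keep the five columns of interest, then sort by HR, RBI, and OBP,
# each in descending order.
mvp_award <- baseball %>%
  select(Last, First, HR, RBI, OBP) %>%
  arrange(desc(HR), desc(RBI), desc(OBP))

# The first row is the top-ranked player.
head(mvp_award, 1)
```

In this made-up data, Adams and Baker both have 40 home runs, so Baker's higher RBI total places him first.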

Conclusion/Recommendation:
In conclusion, this project demonstrates the utility of R programming for data manipulation and
visualization using the ggplot2, tidyverse, and dplyr packages. The two-part analysis provides
insights into distinct datasets, focusing on understanding happiness metrics across global regions
and evaluating baseball player performance based on key statistics.
