Project 3 – Exploring Visualization

Yvette Lee Phan


College of Professional Studies, Northeastern University
ALY 6000 Introduction to Analytics
Professor Kayal Chandrasekaran
October 6th, 2024
Introduction:

This report outlines a series of data cleaning, transformation, and visualization steps
conducted on a books dataset. It begins by using `clean_names()` to make column names more
readable and `mdy()` to standardize date formats. A new column is then created to hold the
year extracted from `first_publish_date`, and the data are filtered to focus on books published
between 1990 and 2020, with unnecessary columns removed. Book ratings and page counts are
explored using a histogram and a box plot. The data are then grouped and summarized to analyze
book counts by year and by publisher. A Pareto chart with an ogive visualizes cumulative book
counts, and a bar chart shows average ratings for the top 10 publishers, highlighting key
trends within the dataset.

Key Findings:
1.
[Screenshots: column names before and after cleaning]

The clean_names() function from the janitor package converts the column names to a consistent,
easy-to-read format.
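
A minimal sketch of this step, assuming the janitor package is installed. The raw column names in the toy data frame are invented for illustration and are not taken from the actual books dataset.

```r
library(janitor)

# Hypothetical raw column names with mixed case, spaces, and punctuation
raw <- data.frame(
  "Book Title"       = "Example",
  "firstPublishDate" = "01/15/2001",
  "Rating (Avg)"     = 4.1,
  check.names = FALSE
)

# clean_names() rewrites every column name in snake_case
cleaned <- clean_names(raw)
print(names(raw))
print(names(cleaned))
```

Printing the names before and after mirrors the two screenshots: only the column names change, not the values.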

2.
[Screenshots: date column before and after conversion]

The mdy() function parses dates written in month-day-year order (for example "MM-DD-YYYY")
into date values, which R displays as "YYYY-MM-DD" (year-month-day).

3.
The R code snippet books$year <- year(books$first_publish_date) is used to extract the
year component from the first_publish_date column in the books data frame and assign it
to a new column named year.
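
Steps 2 and 3 can be sketched together as follows, assuming the lubridate package provides `mdy()` and `year()`. The date strings are invented for illustration.

```r
library(lubridate)

# Toy data frame; the date strings are invented, not taken from the dataset
books <- data.frame(
  first_publish_date = c("09/14/2008", "01/05/1997", NA),
  stringsAsFactors = FALSE
)

# mdy() parses month-day-year text into Date values (printed as YYYY-MM-DD)
books$first_publish_date <- mdy(books$first_publish_date)

# year() pulls the year component out of each Date; missing dates stay NA
books$year <- year(books$first_publish_date)

print(books)
```

Rows whose date is missing end up with `NA` in `year`, which is why the next step removes them.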
4-9.

After extracting the year from the first_publish_date column, rows with NA values in
the year column are removed. The data is then filtered to include only books published
between 1990 and 2020. Following this, unnecessary columns are dropped from the
dataframe. The resulting data is further filtered to retain only books with fewer than 700
pages. Any remaining rows containing NA values in any column are then removed to ensure
a complete dataset. The structure of the dataframe is examined to confirm data types and
view the first few rows, and descriptive statistics are obtained for each column, providing a
summary of the dataset’s central tendencies, dispersion, and distribution.
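
The sequence in steps 4-9 might look like the sketch below, assuming dplyr is used. The toy values and the dropped `isbn` column are hypothetical; the report does not list which columns were removed.

```r
library(dplyr)

# Toy stand-in for the books data; all values and the `isbn` column are invented
books <- tibble(
  title  = c("A", "B", "C", "D", "E", "F"),
  year   = c(1985, 1995, 2005, NA, 2010, 2000),
  pages  = c(300, 250, 900, 120, 400, 320),
  rating = c(3.9, 4.1, 4.3, 3.5, NA, 4.2),
  isbn   = c("x1", "x2", "x3", "x4", "x5", "x6")
)

books_clean <- books %>%
  filter(!is.na(year)) %>%                # drop rows with no year
  filter(year >= 1990, year <= 2020) %>%  # keep books published 1990-2020
  select(-isbn) %>%                       # drop a column not needed here
  filter(pages < 700) %>%                 # keep books with fewer than 700 pages
  na.omit()                               # drop rows with any remaining NA

str(books_clean)      # structure and data types
summary(books_clean)  # descriptive statistics for each column
```

In this toy example only books B and F survive: A is too early, C is too long, D has no year, and E has a missing rating.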

10.
X-axis (Rating): The rating values range approximately from 2 to 5, so the histogram covers
only this range of book ratings.
Y-axis (Number of Books): The y-axis shows the frequency of books for each rating bin, with
a scale from 0 to 3000.
Distribution: The histogram is left-skewed: most book ratings are clustered between 3.5 and
4.5, and a thin tail extends toward the lower ratings. The highest frequency occurs around a
rating of 4, where over 3000 books fall into this rating category.
Bars: The bars show that very few books have a rating below 3, while the majority have
ratings between 3.5 and 4.5, tapering off as the rating approaches 5.
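
A histogram of this kind could be drawn roughly as follows. This is a sketch that assumes ggplot2 is the plotting package (the report does not name it); the simulated ratings, bin width, and colors are illustrative choices, not the report's settings.

```r
library(ggplot2)

# Simulated ratings for illustration only, clamped to the 2-5 range
set.seed(1)
books <- data.frame(
  rating = pmin(5, pmax(2, rnorm(1000, mean = 4, sd = 0.35)))
)

p <- ggplot(books, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "black") +
  labs(x = "Rating", y = "Number of Books") +
  theme_bw()

print(p)
```

The bin width controls how coarse the bars are; a smaller value shows more detail around the peak near 4.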

11.
X-axis (Pages): Represents the page counts of books, with values ranging from 0 to around 700.
Box: The red box represents the interquartile range (IQR), which contains the middle 50% of
the data. Because the plot is horizontal, the left and right edges of the box correspond to
the 25th and 75th percentiles, respectively.
Median Line: A vertical black line inside the box marks the median (50th percentile) of the
page counts.
Whiskers: The whiskers extend from the box edges to the smallest and largest values that lie
within 1.5 times the IQR of those edges. Most of the data falls within this range.
Outliers: A few points to the right of the box plot, beyond the whisker, are identified as
outliers. These represent books with exceptionally high page counts compared to the majority.
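
The whisker and outlier rule described above can be worked through by hand on a small example. The page counts below are invented, and base R's `boxplot()` is used here only to keep the sketch dependency-free; it is not necessarily the function used in the report.

```r
# Invented page counts for illustration; 690 sits far from the rest
pages <- c(120, 180, 200, 220, 250, 280, 300, 320, 350, 400, 690)

q   <- quantile(pages, c(0.25, 0.50, 0.75))  # box edges and median
iqr <- IQR(pages)                            # width of the box

# Fences sit 1.5 * IQR beyond the box edges
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Values beyond the fences are drawn as individual outlier points
outliers <- pages[pages < lower_fence | pages > upper_fence]
print(outliers)

# Horizontal box plot; boxplot() also returns the outliers it found in $out
b <- boxplot(pages, horizontal = TRUE, col = "red", xlab = "Pages")
print(b$out)
```

Here the box spans 210 to 335 pages, so the upper fence is 522.5 and only the 690-page book is flagged as an outlier.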
12-13.

The code creates a new dataframe, `by_year`, containing the count of books published each
year. It groups the `books` dataframe by the `year` column and uses the `summarise()` function
to calculate the number of books per year, stored in a column named `total_books`. The
resulting dataframe is then arranged in ascending order by year. Subsequently, the code filters
`by_year` to create a new dataframe, `by_year_filtered`, which includes only the years between
1990 and 2020, providing a summary of book counts specifically for this time period.
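
A minimal sketch of this grouping step, assuming dplyr. The names `by_year`, `total_books`, and `by_year_filtered` come from the description above; the toy years are invented.

```r
library(dplyr)

# Toy data; the years are invented for illustration
books <- tibble(year = c(1988, 1995, 1995, 2003, 2003, 2003, 2021))

# Count books per year and sort by year (ascending)
by_year <- books %>%
  group_by(year) %>%
  summarise(total_books = n()) %>%
  arrange(year)

# Keep only the years 1990 through 2020
by_year_filtered <- by_year %>%
  filter(year >= 1990, year <= 2020)

print(by_year_filtered)
```

With these toy values, 1988 and 2021 fall outside the window, leaving two rows: 1995 with 2 books and 2003 with 3 books.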

X-axis (Year): Represents the years ranging from 1990 to 2020.
Y-axis (Total Books): Indicates the total number of books rated, with values ranging from 0
to over 600.
Line Plot: A solid black line connects data points, showing the number of books rated for
each year.
Trend: The number of books rated per year remained relatively stable from 1990 to the late
1990s. After 2000, there was a steady increase, peaking around 2010-2012 at over 600 books.
Following this peak, the number of books rated declines sharply, reaching a low in 2020.

14-20.

This code creates a new dataframe, `book_publisher`, that counts the number of books for each
publisher by first grouping the `books` dataframe by the `publisher` column. It filters out
publishers with fewer than 125 books, keeping only those with higher counts. The dataframe is
then sorted in descending order by book count. A cumulative sum column (`cum_counts`) is
added to track the cumulative total of books, followed by a relative frequency column
(`rel_freq`), showing the proportion of books for each publisher. Additionally, a cumulative
frequency column (`cum_freq`) is calculated, and finally, the `publisher` column is converted
into a factor with levels corresponding to the current order of publishers. This process results in a
well-structured summary of book counts by publisher, with relevant cumulative and relative
frequencies.
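
The publisher summary might be built along these lines, assuming dplyr. The column name `book_count` is a hypothetical choice (the report names only `cum_counts`, `rel_freq`, and `cum_freq`), the publishers are invented, and the threshold is lowered from 125 to 2 so that the toy data keeps some rows.

```r
library(dplyr)

# Toy data; publisher names and counts are invented
books <- tibble(
  publisher = c(rep("Alpha Press", 5), rep("Beta Books", 3),
                rep("Gamma House", 2), "Delta Ink")
)

book_publisher <- books %>%
  group_by(publisher) %>%
  summarise(book_count = n()) %>%      # books per publisher
  filter(book_count >= 2) %>%          # the report uses 125 as the cutoff
  arrange(desc(book_count)) %>%        # largest publishers first
  mutate(
    cum_counts = cumsum(book_count),              # running total of books
    rel_freq   = book_count / sum(book_count),    # share of the kept books
    cum_freq   = cumsum(rel_freq),                # running share
    publisher  = factor(publisher, levels = publisher)  # lock in current order
  )

print(book_publisher)
```

Converting `publisher` to a factor with levels in the current row order keeps the bars of a later chart in descending order instead of alphabetical order.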
21.
X-axis (Publisher): Represents six publishers: Random House, Harper Collins, MacMillan,
Hachette, Simon and Schuster, and Scholastic Books. The publisher names are displayed at a
slight angle for clarity.
Left Y-axis (Number of Books): Shows the number of books published by each publisher, with
values ranging from 0 to 1200.
Right Y-axis (Cumulative Number of Books): Represents the cumulative sum of books published,
with values ranging from 0 to 3000.
Bars (Pareto Chart): Each cyan-colored bar represents the number of books published by a
specific publisher. Random House has the highest number of books (just over 1000), followed by
Harper Collins with approximately 600 books. The remaining publishers have fewer than 400
books each.
Line Plot (Ogive): A black line plot connects points above each bar, showing the cumulative
number of books published as more publishers are included. The line starts at Random House
and rises steadily until reaching over 3000 total books when all publishers are considered.
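
A combined Pareto chart and ogive could be sketched as below, assuming ggplot2 (the report does not name the plotting package). The publishers and counts are invented, `book_count` is a hypothetical column name, and for simplicity the right-hand axis reuses the left-hand scale rather than the separate scale described above.

```r
library(ggplot2)

# Invented counts for illustration, already sorted in descending order
book_publisher <- data.frame(
  publisher  = c("Alpha Press", "Beta Books", "Gamma House", "Delta Ink"),
  book_count = c(50, 30, 15, 5)
)
book_publisher$publisher <- factor(book_publisher$publisher,
                                   levels = book_publisher$publisher)
book_publisher$cum_counts <- cumsum(book_publisher$book_count)

p <- ggplot(book_publisher, aes(x = publisher)) +
  geom_col(aes(y = book_count), fill = "cyan") +   # Pareto bars
  geom_point(aes(y = cum_counts)) +                # ogive points
  geom_line(aes(y = cum_counts, group = 1)) +      # ogive line
  scale_y_continuous(
    name = "Number of Books",
    sec.axis = sec_axis(~ ., name = "Cumulative Number of Books")
  ) +
  labs(x = "Publisher") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

The `group = 1` aesthetic is needed because the x-axis is a factor; without it, `geom_line()` treats each publisher as its own group and draws no connecting line.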

22.
X-axis (Publisher): Displays the names of the top 10 publishers, arranged in descending order
of average rating. The publisher names are displayed at an angle for better readability and
include: VIZ Media LLC, Tyndale House Publishers, Simon and Schuster, Hachette, Createspace
Independent Publishing Platform, Harper Collins, Random House, Pocket Books, Scholastic Books,
and MacMillan.
Y-axis (Average Rating): Represents the average book ratings, with values ranging from 0 to 5.
Bars: Each blue bar represents the average rating for the respective publisher. All bars are
clustered around the upper part of the scale, indicating high average ratings across all
publishers.
Trends: The average ratings are very close to each other, ranging between approximately 4.0
and 4.2. VIZ Media LLC has the highest average rating, slightly above 4.2, while MacMillan has
the lowest average rating, close to 4.0.
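
One way such a chart could be produced is sketched below, assuming dplyr and ggplot2. The report does not state how the top 10 publishers were selected; this sketch assumes they are the 10 publishers with the most books. All publisher names, counts, and ratings are invented, and `book_count`, `avg_rating`, and `top_publishers` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Invented data: 12 hypothetical publishers with 5 to 16 books each
set.seed(42)
books <- tibble(
  publisher = rep(paste("Publisher", LETTERS[1:12]), times = 5:16),
  rating    = round(runif(126, min = 3.5, max = 4.5), 2)
)

top_publishers <- books %>%
  group_by(publisher) %>%
  summarise(book_count = n(), avg_rating = mean(rating)) %>%
  slice_max(book_count, n = 10, with_ties = FALSE) %>%  # assumed selection rule
  arrange(desc(avg_rating)) %>%                         # highest rating first
  mutate(publisher = factor(publisher, levels = publisher))

p <- ggplot(top_publishers, aes(x = publisher, y = avg_rating)) +
  geom_col(fill = "blue") +
  scale_y_continuous(limits = c(0, 5)) +
  labs(x = "Publisher", y = "Average Rating") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

Starting the y-axis at 0 makes the bars look nearly identical, as in the report; zooming the axis to the 3.5-4.5 range would exaggerate small differences between publishers.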

Conclusion/Recommendation:

The analysis effectively cleaned and transformed the books dataset to extract meaningful insights
regarding publication trends, publisher influence, and book ratings. By filtering data and
grouping by key attributes, the visualizations provided a clear view of book ratings, page
distributions, and the impact of top publishers over time. The Pareto and ogive charts highlighted
the dominance of a few major publishers, while average rating comparisons showed consistently
high ratings across top publishers. Overall, these findings reveal patterns in book publishing and
ratings, offering valuable information for understanding the dynamics of the book industry
during the specified period.
