Phan Project3 Report


Project 3 – Exploring Visualization

Yvette Lee Phan

College of Professional Studies, Northeastern University
ALY 6000 Introduction to Analytics
Professor Kayal Chandrasekaran
October 6th, 2024
Introduction:

The following outlines a series of data cleaning, transformation, and visualization steps
conducted on a books dataset. It starts with using `clean_names()` for better readability of
column names and applying `mdy()` to standardize date formats. A new column is created to
extract the year from the `first_publish_date`, followed by filtering and removing unnecessary
data to focus on books published between 1990 and 2020. Further, a summary of book ratings
and page counts is explored using histograms and box plots. Data grouping and summarization
are performed to analyze book counts by year and publisher. Pareto and ogive charts are created
to visualize cumulative book counts, while bar charts show average ratings by the top 10
publishers, highlighting key trends and insights within the dataset.

Key Findings:
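1.
[Figure: column names before and after clean_names()]

The clean_names() function from the janitor package standardizes the column names (lowercase,
with underscores in place of spaces), which makes them easier to read and to reference in code.

2.
[Figure: date column before and after mdy()]

The mdy() function from the lubridate package parses date strings written in "MM-DD-YYYY"
(month-day-year) order into date values, which R displays as "YYYY-MM-DD" (year-month-day).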
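As a minimal sketch of these two steps, the snippet below applies clean_names() and mdy() to a two-row stand-in data frame. The column names and values are assumptions made up for illustration; they are not taken from the actual books dataset.

```r
library(janitor)    # clean_names()
library(lubridate)  # mdy()

# Hypothetical stand-in for the raw data (names and values are assumptions).
books <- data.frame(
  `Book Title` = c("Book A", "Book B"),
  `First Publish Date` = c("09-14-2008", "06-26-1997"),
  check.names = FALSE
)

# clean_names() rewrites the headers in lower snake_case.
books <- clean_names(books)
names(books)

# mdy() parses month-day-year text into Date values, printed as YYYY-MM-DD.
books$first_publish_date <- mdy(books$first_publish_date)
books$first_publish_date
```

After these calls the date column holds Date values rather than text, so functions such as year() can operate on it directly.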

3.
The R code snippet books$year <- year(books$first_publish_date) is used to extract the
year component from the first_publish_date column in the books data frame and assign it
to a new column named year.
4-9.

After extracting the year from the first_publish_date column, rows with NA values in
the year column are removed. The data is then filtered to include only books published
between 1990 and 2020. Following this, unnecessary columns are dropped from the
dataframe. The resulting data is further filtered to retain only books with fewer than 700
pages. Any remaining rows containing NA values in any column are then removed to ensure
a complete dataset. The structure of the dataframe is examined to confirm data types and
view the first few rows, and descriptive statistics are obtained for each column, providing a
summary of the dataset’s central tendencies, dispersion, and distribution.
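The cleaning sequence described above could be sketched roughly as follows. The sample rows, the dropped column (`isbn`), and the inclusive treatment of 1990 and 2020 are assumptions for illustration; the report does not list which columns were removed.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample; column names and values are assumptions.
books <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  first_publish_date = as.Date(c("1985-03-01", "1999-07-15", NA,
                                 "2010-01-20", "2015-11-02")),
  pages = c(320, 250, 410, 900, 180),
  rating = c(4.1, 3.8, 4.0, 4.3, NA),
  isbn = c("x1", "x2", "x3", "x4", "x5")
)

# Step 3: extract the publication year.
books$year <- year(books$first_publish_date)

books_clean <- books %>%
  filter(!is.na(year)) %>%                # drop rows with no year
  filter(year >= 1990, year <= 2020) %>%  # keep 1990 through 2020 (assumed inclusive)
  select(-isbn) %>%                       # drop an unneeded column (hypothetical choice)
  filter(pages < 700) %>%                 # keep books with fewer than 700 pages
  na.omit()                               # drop rows with NA in any remaining column

str(books_clean)      # data types and a preview of the values
summary(books_clean)  # descriptive statistics per column
```

In this toy sample only book "B" survives every filter, which shows how quickly the successive conditions narrow the data.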

10.
X-axis (Rating): The rating values range approximately from 2 to 5, indicating that the
histogram focuses on this specific range of book ratings.
Y-axis (Number of Books): The y-axis shows the frequency of books for each rating value, with
a scale from 0 to 3000.
Distribution: The histogram is left-skewed: the majority of book ratings are clustered between
3.5 and 4.5, and the longer tail extends toward the lower ratings. The highest frequency occurs
around a rating of 4, where over 3000 books fall into this rating category.
Bars: The bars show that very few books have a rating below 3, while the majority have
ratings between 3.5 and 4.5, tapering off as the rating approaches 5.
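A histogram like the one described could be drawn along these lines. This is a sketch that assumes ggplot2 was used; the ratings are simulated to cluster around 4 within the 2 to 5 range, and the bin width, colors, and title are assumptions rather than the report's settings.

```r
library(ggplot2)

set.seed(1)
# Simulated ratings between 2 and 5 (not the report's data).
books <- data.frame(rating = 2 + 3 * rbeta(500, 6, 3))

rating_hist <- ggplot(books, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "white") +
  labs(title = "Histogram of Book Ratings",
       x = "Rating", y = "Number of Books")

rating_hist
```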

11.
X-axis (Pages): Represents the page counts of books, with values ranging from 0 to around 700.
Box: The red box represents the interquartile range (IQR), which contains the middle 50% of
the data. The left and right edges of the box correspond to the 25th and 75th percentiles,
respectively.
Median Line: There is a vertical black line inside the box that represents the median (50th
percentile) of the page counts.
Whiskers: The whiskers extend from the box to the smallest and largest values that lie within
1.5 times the IQR of the box edges. Most of the data falls within this range.
Outliers: There are a few points to the right of the box plot, beyond the whisker, which
are identified as outliers. These represent books with exceptionally high page counts
compared to the majority.
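To make the box plot components concrete, the sketch below computes the quartiles, the 1.5 x IQR fences, the whisker ends, and the outliers for a small set of hypothetical page counts, then draws a horizontal box plot with base R. The values are invented for illustration.

```r
# Hypothetical page counts (an assumption, not the report's data).
pages <- c(120, 180, 200, 220, 250, 280, 300, 320, 350, 400, 690)

q1  <- quantile(pages, 0.25, names = FALSE)  # 25th percentile: 210
q3  <- quantile(pages, 0.75, names = FALSE)  # 75th percentile: 335
iqr <- q3 - q1                               # interquartile range: 125

# Fences sit 1.5 * IQR beyond the box edges.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
lower_whisker <- min(pages[pages >= lower_fence])  # 120
upper_whisker <- max(pages[pages <= upper_fence])  # 400

# Anything outside the fences is plotted as an outlier point.
outliers <- pages[pages < lower_fence | pages > upper_fence]  # 690

boxplot(pages, horizontal = TRUE, col = "red", xlab = "Pages")
```

Here the single 690-page book lies beyond the upper fence of 522.5, so it appears as an isolated point to the right of the whisker.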
12-13.

The code creates a new dataframe, `by_year`, which contains the count of books published each
year by first grouping the `books` dataframe by the `year` column and using the `summarise()`
function to calculate the number of books per year, stored in a column named `total_books`. The
resulting dataframe is then arranged in ascending order by year. Subsequently, the code filters
`by_year` to create a new dataframe, `by_year_filtered`, which includes only the years between
1990 and 2020, providing a summary of book counts specifically for this time period.
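The grouping described above might look like the following sketch, which uses the dataframe and column names from the text (`by_year`, `total_books`, `by_year_filtered`). The sample rows are hypothetical.

```r
library(dplyr)

# Hypothetical sample with one row per book.
books <- data.frame(
  title = paste("Book", 1:7),
  year = c(1988, 1995, 1995, 2003, 2003, 2003, 2021)
)

# Count books per year, sorted by year in ascending order.
by_year <- books %>%
  group_by(year) %>%
  summarise(total_books = n()) %>%
  arrange(year)

# Keep only 1990 through 2020 (assumed inclusive).
by_year_filtered <- by_year %>%
  filter(year >= 1990, year <= 2020)

by_year_filtered
```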

X-axis (Year): Represents the years ranging from 1990 to 2020.
Y-axis (Total Books): Indicates the total number of books rated, with values ranging from 0 to
over 600.
Line Plot: A solid black line connects data points, showing the number of books rated for each
year.
Trend: The number of books rated per year remained relatively stable from 1990 to the late
1990s. After 2000, there was a steady increase in the number of books rated, peaking around
2010-2012 at over 600 books. Following this peak, there is a sharp decline in book ratings,
reaching a low in 2020.
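A line plot of yearly counts could be produced roughly as below, assuming ggplot2. The four data points are placeholders invented for the example and do not reproduce the report's counts; the title and axis labels are likewise assumptions.

```r
library(ggplot2)

# Placeholder yearly counts (hypothetical values).
by_year_filtered <- data.frame(
  year = c(1990, 2000, 2010, 2020),
  total_books = c(10, 25, 60, 5)
)

books_per_year_plot <- ggplot(by_year_filtered,
                              aes(x = year, y = total_books)) +
  geom_line(color = "black") +   # solid line connecting the yearly totals
  geom_point() +                 # one marker per year
  labs(title = "Total Number of Books Rated Per Year",
       x = "Year", y = "Total Books")

books_per_year_plot
```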

14-20.

This code creates a new dataframe, `book_publisher`, that counts the number of books for each
publisher by first grouping the `books` dataframe by the `publisher` column. It filters out
publishers with fewer than 125 books, keeping only those with higher counts. The dataframe is
then sorted in descending order by book count. A cumulative sum column (`cum_counts`) is
added to track the cumulative total of books, followed by a relative frequency column
(`rel_freq`), showing the proportion of books for each publisher. Additionally, a cumulative
frequency column (`cum_freq`) is calculated, and finally, the `publisher` column is converted
into a factor with levels corresponding to the current order of publishers. This process results in a
well-structured summary of book counts by publisher, with relevant cumulative and relative
frequencies.
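The publisher summary described above can be sketched as follows. The column names `cum_counts`, `rel_freq`, and `cum_freq` come from the text; `book_count` and `min_books` are hypothetical names, and the threshold is lowered from 125 to 2 so that the tiny sample produces output.

```r
library(dplyr)

# Hypothetical sample: one row per book.
books <- data.frame(
  publisher = rep(c("Pub A", "Pub B", "Pub C", "Pub D"), times = c(5, 3, 2, 1))
)

min_books <- 2  # the report uses 125; lowered here for the toy data

book_publisher <- books %>%
  group_by(publisher) %>%
  summarise(book_count = n()) %>%
  filter(book_count >= min_books) %>%   # drop publishers below the threshold
  arrange(desc(book_count)) %>%         # largest publishers first
  mutate(
    cum_counts = cumsum(book_count),              # running total of books
    rel_freq   = book_count / sum(book_count),    # share of the retained books
    cum_freq   = cumsum(rel_freq),                # running share
    publisher  = factor(publisher, levels = publisher)  # lock in the sorted order
  )

book_publisher
```

Converting `publisher` to a factor with levels in the current row order matters for plotting: without it, a chart would sort the publishers alphabetically instead of by book count.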
21.
X-axis (Publisher): Represents six publishers: Random House, Harper Collins, MacMillan,
Hachette, Simon and Schuster, and Scholastic Books. The publisher names are displayed at a
slight angle for clarity.
Left Y-axis (Number of Books): Shows the number of books published by each publisher, with
values ranging from 0 to 1200.
Right Y-axis (Cumulative Number of Books): Represents the cumulative sum of books published,
with values ranging from 0 to 3000.
Bars (Pareto Chart): Each cyan-colored bar represents the number of books published by a
specific publisher. Random House has the highest number of books (just over 1000), followed by
Harper Collins with approximately 600 books. The remaining publishers have fewer than 400
books each.
Line Plot (Ogive): A black line plot connects points above each bar, showing the cumulative
number of books published as more publishers are included. The line starts at Random House
and rises steadily until reaching over 3000 total books when all publishers are considered.
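One way to combine the bars and the cumulative line in a single chart is sketched below, assuming ggplot2. The publishers and counts are placeholders, and for simplicity both y-axes share one scale here, whereas the report's chart appears to use different ranges on the left and right axes.

```r
library(ggplot2)

# Placeholder publisher counts, already sorted in descending order.
book_publisher <- data.frame(
  publisher = c("Pub A", "Pub B", "Pub C"),
  book_count = c(500, 300, 200)
)
book_publisher$cum_counts <- cumsum(book_publisher$book_count)
book_publisher$publisher <- factor(book_publisher$publisher,
                                   levels = book_publisher$publisher)

pareto_plot <- ggplot(book_publisher, aes(x = publisher)) +
  geom_col(aes(y = book_count), fill = "cyan") +           # Pareto bars
  geom_line(aes(y = cum_counts, group = 1), color = "black") +  # ogive
  geom_point(aes(y = cum_counts)) +
  scale_y_continuous(
    name = "Number of Books",
    sec.axis = sec_axis(~ ., name = "Cumulative Number of Books")
  ) +
  labs(x = "Publisher") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # angled labels

pareto_plot
```

The `group = 1` aesthetic tells ggplot2 to connect the points across the discrete publisher categories; without it no line is drawn.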

22.
X-axis (Publisher): Displays the names of the top 10 publishers, arranged in descending order of
average rating. The publisher names are displayed at an angle for better readability and
include: VIZ Media LLC, Tyndale House Publishers, Simon and Schuster, Hachette, Createspace
Independent Publishing Platform, Harper Collins, Random House, Pocket Books, Scholastic Books,
and MacMillan.
Y-axis (Average Rating): Represents the average book ratings, with values ranging from 0 to 5.
Bars: Each blue bar represents the average rating for the respective publisher. All bars are
clustered around the upper part of the scale, indicating high average ratings across all publishers.
Trends: The average ratings are very close to each other, ranging between approximately 4.0
and 4.2. VIZ Media LLC has the highest average rating, slightly above 4.2, while MacMillan has
the lowest average rating, close to 4.0.
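A bar chart of average rating by publisher could be built along these lines. The report does not say how the top 10 publishers were selected; this sketch assumes they are the 10 publishers with the most books, then orders them by average rating. The data and the names `publisher_ratings`, `book_count`, and `avg_rating` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical sample: one row per book.
books <- data.frame(
  publisher = c("Pub A", "Pub A", "Pub B", "Pub B", "Pub C", "Pub C"),
  rating = c(4.0, 4.2, 3.9, 4.1, 4.3, 4.5)
)

publisher_ratings <- books %>%
  group_by(publisher) %>%
  summarise(book_count = n(), avg_rating = mean(rating)) %>%
  slice_max(book_count, n = 10, with_ties = FALSE) %>%  # assumed "top 10" rule
  arrange(desc(avg_rating)) %>%                         # highest rating first
  mutate(publisher = factor(publisher, levels = publisher))

avg_rating_plot <- ggplot(publisher_ratings,
                          aes(x = publisher, y = avg_rating)) +
  geom_col(fill = "blue") +
  scale_y_continuous(limits = c(0, 5)) +
  labs(x = "Publisher", y = "Average Rating") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

avg_rating_plot
```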

Conclusion/Recommendation:

The analysis effectively cleaned and transformed the books dataset to extract meaningful insights
regarding publication trends, publisher influence, and book ratings. By filtering data and
grouping by key attributes, the visualizations provided a clear view of book ratings, page
distributions, and the impact of top publishers over time. The Pareto and ogive charts highlighted
the dominance of a few major publishers, while average rating comparisons showed consistent
high ratings across top publishers. Overall, these findings reveal patterns in book publishing and
ratings, offering valuable information for understanding the dynamics of the book industry
during the specified period.
