Naan Mudhalvan - Google Cloud Data Analytics
Submitted by
ELAKYA P - 421322104012
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Year / Semester – III / VI
Name: ELAKYA P
Reg. No.: 421322104012
Sem/Dept: III / VI
Certified that this is the bona fide record of work done by the above
student in the NM Course - Google Cloud Data Analytics during the
academic year 2024 – 2025.
EX.NO:01 EDA ON GLOBAL SUPERSTORE DATASET
AIM:
To perform Exploratory Data Analysis (EDA) on the Global Superstore dataset.
INTRODUCTION:
The Global Superstore dataset contains detailed transactional data, including order
dates, shipping dates, customer demographics, product categories, sales figures, profit
margins, and shipping modes.
OBJECTIVES:
PROGRAM:
import pandas as pd
# Load the dataset
df = pd.read_excel("/content/Global Superstore.xls")
PROGRAM:
print(df.head())
OUTPUT:
PROGRAM:
PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)
OUTPUT:
II. Filling the Missing Values:
PROGRAM:
# Create a city → postal code mapping from known (non-null) data
postal_map = df.dropna(subset=['Postal Code']).drop_duplicates('City').set_index('City')['Postal Code']
# Fill missing postal codes using map and fillna
df['Postal Code'] = df['Postal Code'].fillna(df['City'].map(postal_map))
print(df['Postal Code'].isnull().sum())
OUTPUT:
PROGRAM:
OUTPUT:
STEP 3 : Calculate summary statistics (mean, median, standard deviation) for Sales
and Profit.
PROGRAM:
# Calculate mean, median, and standard deviation for 'Sales' and 'Profit'
mean_sales = df['Sales'].mean()
median_sales = df['Sales'].median()
std_sales = df['Sales'].std()
mean_profit = df['Profit'].mean()
median_profit = df['Profit'].median()
std_profit = df['Profit'].std()
# Print the results
print(f"Sales - Mean: {mean_sales}, Median: {median_sales}, Std Dev: {std_sales}")
print(f"Profit - Mean: {mean_profit}, Median: {median_profit}, Std Dev:
{std_profit}")
OUTPUT:
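The same statistics can also be computed in one call with pandas' agg; a compact alternative sketch (not part of the original program):
# Compute mean, median, and standard deviation for Sales and Profit in a single table
summary_stats = df[['Sales', 'Profit']].agg(['mean', 'median', 'std'])
print(summary_stats)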
STEP 4 : Analyze sales and profit by category, region, and year.
PROGRAM:
OUTPUT:
PROGRAM:
# Group by 'Category' and calculate total profit for each category
total_profit_per_category = df.groupby('Category')['Profit'].sum()
# Sort the categories by total profit in descending order
sorted_profit = total_profit_per_category.sort_values(ascending=False)
# Display the top 5 most profitable categories
top_5_profitable_categories = sorted_profit.head(5)
print(top_5_profitable_categories)
OUTPUT:
PROGRAM:
OUTPUT:
PROGRAM:
# Derive the order year from 'Order Date' (in case a 'Year' column is not already present)
df['Year'] = pd.to_datetime(df['Order Date']).dt.year
# Group by 'Year' and calculate total sales for each year
sales_by_year = df.groupby('Year')['Sales'].sum()
# Display the total sales per year
print(sales_by_year)
OUTPUT:
STEP 5 : Visualizations
I. Bar chart: Total sales by region.
PROGRAM:
import matplotlib.pyplot as plt
# Total sales per region (grouped from the Sales column)
total_sales_per_region = df.groupby('Region')['Sales'].sum()
# Plotting the sales by region
plt.figure(figsize=(10, 6))
total_sales_per_region.plot(kind='bar', color='violet')
# Add labels and title
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.show()
OUTPUT:
Insights:
The bar chart depicting total sales by region reveals significant variations in sales
performance across different geographic areas. The Asia-Pacific and US regions
typically lead in overall sales, indicating strong market presence and customer demand.
In contrast, Africa and Canada show comparatively lower sales volumes, suggesting
either limited market reach or fewer transactions recorded in those areas.
This distribution may reflect regional differences in customer base size, product
availability, or operational scale. The insights from this chart can guide strategic
decisions such as regional marketing investments, supply chain adjustments, and
potential market expansion opportunities.
II. Line chart: Year-wise sales trend.
PROGRAM:
import matplotlib.pyplot as plt
# Plotting the year-wise sales trend as a line chart
plt.figure(figsize=(10, 6))
sales_by_year.plot(kind='line', marker='o', color='b')
# Add labels and title
plt.title('Year-Wise Sales Trend')
plt.xlabel('Year')
plt.ylabel('Total Sales')
plt.grid(True) # Add gridlines for better readability
plt.xticks(rotation=45) # Rotate x-axis labels for better visibility
plt.show()
OUTPUT:
Insights:
The line chart illustrating year-wise sales trends shows a generally increasing
trajectory in total sales over the years, indicating business growth and expanding
customer demand. In particular, there is a noticeable spike in sales in the final year
(e.g., 2014), which could be attributed to seasonal campaigns, improved logistics, or
expanded operations.
Google Colab Link:
https://colab.research.google.com/drive/1cZQihPLXWEbxvgih2axmp2JhaQJJQ8uk?usp=sharing
RESULT:
The EDA of the Global Superstore dataset shows steady growth in sales over the
years, with peak sales in the most recent year. The Consumer segment and Technology
category drive most of the revenue, while Furniture often results in losses. The US and
Asia-Pacific regions perform best, whereas Africa and Canada underperform. High
discounts and costly shipping methods reduce profitability. Outliers reveal cases of high
sales with negative profits, suggesting areas for operational improvement. Thus, the
Exploratory Data Analysis was performed on the Global Superstore dataset.
EX.NO:02 EDA ON COVID-19 GLOBAL DATASET
AIM:
To perform Exploratory Data Analysis (EDA) on the COVID-19 state-wise India dataset.
INTRODUCTION:
The COVID-19 pandemic has significantly impacted countries across the globe,
and India, with its vast and diverse population, has faced unique challenges in managing
the spread and effects of the virus. To better understand the dynamics of the pandemic
within the country, it is crucial to analyze COVID-19 data at a more granular level,
specifically state-wise. This project focuses on performing Exploratory Data Analysis
(EDA) on a state-wise COVID-19 dataset for India.
By examining key metrics such as confirmed cases, recoveries, active cases, and
deaths, we aim to gain meaningful insights into the regional progression of the pandemic.
The findings from this EDA can help identify states with high transmission rates, assess
healthcare response effectiveness, and provide a data-driven foundation for public health
decision-making.
OBJECTIVES:
PROGRAM:
import pandas as pd
# Load the dataset
df = pd.read_csv("/state_wise_data.csv")
PROGRAM:
print(df.head())
OUTPUT:
II. Display the Total Number of Rows and Columns.
PROGRAM:
OUTPUT:
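The program for this step is not shown; a minimal sketch using the DataFrame's shape attribute:
# Display the total number of rows and columns
rows, cols = df.shape
print(f"Rows: {rows}, Columns: {cols}")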
PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)
OUTPUT:
II. Convert date columns to datetime format.
PROGRAM:
OUTPUT:
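The conversion code is not shown here; a minimal sketch, assuming the date column is named 'Date' (the name used in the later plotting step):
# Convert the 'Date' column to datetime; unparseable values become NaT
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
print(df['Date'].dtype)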
STEP 3 : Calculate key metrics (confirmed cases, recoveries, active cases, and deaths).
PROGRAM:
PROGRAM:
PROGRAM:
OUTPUT:
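The calculation programs for this step are not shown; a minimal sketch of overall and state-wise confirmed-case totals, using only the 'State' and 'Confirmedcases' columns referenced elsewhere in this exercise:
# Total confirmed cases across all records
total_confirmed = df['Confirmedcases'].sum()
print(f"Total confirmed cases: {total_confirmed}")
# State-wise confirmed-case totals, highest first
statewise_confirmed = df.groupby('State')['Confirmedcases'].sum().sort_values(ascending=False)
print(statewise_confirmed)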
STEP 4 : Visualizations:
PROGRAM:
import matplotlib.pyplot as plt
# Total confirmed cases per state
state_confirmed = df.groupby('State')['Confirmedcases'].sum()
# Get top 5 states
top_5 = state_confirmed.sort_values(ascending=False).head(5)
# Plot pie chart
plt.figure(figsize=(8, 8))
plt.pie(top_5, labels=top_5.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 5 States by Confirmed COVID-19 Cases')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()
OUTPUT:
Insights:
During the exploratory data analysis of the COVID-19 state-wise dataset for India, a pie
chart provides a clear visual representation of how cases are distributed across different
states. It reveals that a small number of states, such as Maharashtra, Kerala, and Delhi,
contribute to a disproportionately large share of confirmed cases, indicating regional
hotspots of infection. Similarly, when visualizing active cases, the pie chart highlights
the states where the virus remains prevalent, helping to identify areas that may still be
under significant healthcare pressure. In the case of deaths and recoveries, the chart helps
assess how effectively different states have managed the pandemic, with larger slices
suggesting better recovery efforts or, conversely, higher mortality.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Ensure date is in datetime format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
# Sort by date
df.sort_values('Date', inplace=True)
# Plot as-is: all points
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Confirmedcases'], linestyle='-', marker='.', color='red')
plt.title('Trend of daily Confirmed COVID-19 Cases')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.grid(True)
plt.tight_layout()
plt.show()
OUTPUT:
Insights:
Google Colab Link:
https://colab.research.google.com/drive/11DO_s_JTgtanNu3PjaIheMfRe3bg8X8P?usp=sharing
RESULT:
The exploratory data analysis (EDA) of the COVID-19 state-wise India dataset
revealed several key findings. It was observed that a few states, such as Maharashtra,
Kerala, and Karnataka, accounted for the majority of confirmed and active cases,
indicating regional hotspots. Line charts showed clear trends in the rise and fall of cases
over time, highlighting critical periods such as the peaks of the first and second waves.
Recovery and mortality patterns varied significantly among states, with some achieving
high recovery rates while others showed relatively higher fatality ratios. Pie charts and
bar graphs provided a comparative view of the burden across states, emphasizing the
uneven impact of the pandemic in India. The analysis also helped identify outliers, data
inconsistencies, and states with efficient healthcare responses. Overall, the EDA offered
valuable insights that can support data-driven decision-making and better preparedness
for future health emergencies.
EX.NO.03 EDA ON YOUTUBE TRENDING VIDEOS DATASET
AIM:
To perform Exploratory Data Analysis (EDA) on the YouTube Trending Videos dataset.
INTRODUCTION:
By analyzing features such as views, likes, dislikes, comment counts, tags, and
publish times, the goal is to identify the key factors that contribute to a video becoming
viral. The dataset includes videos from different categories and regions, allowing us to
explore trends across genres and understand regional preferences. This analysis provides
a data-driven perspective on content popularity and can be useful for content creators,
marketers, and platform analysts.
OBJECTIVES:
PROGRAM:
import pandas as pd
# Load the dataset
df = pd.read_csv("/content/youtube.csv")
PROGRAM:
print(df.head())
OUTPUT:
II. Display the Total Number of Rows and Columns.
PROGRAM:
OUTPUT:
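The program for this step is not shown; a minimal sketch printing the DataFrame's shape:
# Display the total number of rows and columns
print("Shape (rows, columns):", df.shape)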
PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)
OUTPUT:
STEP 3 : Calculate:
PROGRAM:
OUTPUT :
PROGRAM:
PROGRAM:
OUTPUT:
III. Average likes, views, and comments.
PROGRAM:
OUTPUT:
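The program for this step is not shown; a minimal sketch, assuming the columns are named 'views', 'likes', and 'comment_count' (these names are assumptions and may differ in the actual file):
# Average views, likes, and comments across all trending videos
# NOTE: the column names below are assumed, not taken from the original program
avg_metrics = df[['views', 'likes', 'comment_count']].mean()
print(avg_metrics)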
STEP 4: Visualizations
Bar chart: Video count by category.
Scatter plot: Likes vs. Views.
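The plotting code for the two charts listed above is not shown; a minimal sketch, assuming the columns 'category_id', 'likes', and 'views' (these names are assumptions):
import matplotlib.pyplot as plt
# Bar chart: number of trending videos per category (assumed 'category_id' column)
df['category_id'].value_counts().plot(kind='bar', color='skyblue', figsize=(10, 6))
plt.title('Video Count by Category')
plt.xlabel('Category')
plt.ylabel('Number of Videos')
plt.show()
# Scatter plot: likes vs. views (assumed 'likes' and 'views' columns)
plt.figure(figsize=(10, 6))
plt.scatter(df['views'], df['likes'], alpha=0.5, color='green')
plt.title('Likes vs. Views')
plt.xlabel('Views')
plt.ylabel('Likes')
plt.show()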
OUTPUT:
Insights:
Video popularity and engagement trends have evolved significantly, driven largely
by the rise of short-form content and algorithmic curation. Platforms like TikTok,
YouTube Shorts, and Instagram Reels have popularized brief, visually engaging videos
that cater to short attention spans, resulting in higher completion rates and shareability.
Personalized recommendation systems now play a critical role in surfacing content,
meaning that creators who target niche interests often see stronger engagement.
Authentic, user-generated content continues to outperform polished productions,
especially when it fosters relatability and trust. Storytelling and interactive elements—
such as calls to comment or participate—boost viewer involvement, while live streaming
enhances real-time engagement and community building. Moreover, the mobile-first
nature of video consumption has made vertical formats and captivating intros essential
for capturing and maintaining attention. Together, these trends highlight the importance
of agility, authenticity, and platform-specific strategies in driving video success today.
Google Colab Link:
https://colab.research.google.com/drive/1jH3TkCt9cAS8Zxi60WuWrsTB7eAcO3gk?usp=sharing
RESULT:
The EDA of YouTube trending videos reveals several key patterns in video
popularity and engagement. Videos with titles that include emotionally charged or
curiosity-driven words tend to attract more views and clicks. Content in categories such
as music, entertainment, and gaming appears most frequently in the trending list,
indicating strong viewer demand. High-performing videos often have a high like-to-
dislike ratio and generate significant comment activity, suggesting that viewer
engagement is a major factor in trending status. Additionally, channels with consistent
upload schedules and a high subscriber base tend to trend more often, highlighting the
importance of audience loyalty. Lastly, video length impacts performance—shorter
videos generally trend more frequently, but longer videos (7–15 minutes) tend to sustain
higher average view durations when well-produced. These insights suggest that content
quality, emotional appeal, and audience interaction are critical to driving trends on
YouTube.
Google Cloud Data Analytics Course Completion Certificate: