
NAAN MUDHALVAN - GOOGLE CLOUD DATA ANALYTICS

Submitted by

ELAKYA P - 421322104012
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Year / Semester – III /VI

ANNA UNIVERSITY: CHENNAI 600025


Nov-Dec 2025
BONAFIDE CERTIFICATE

Name ELAKYA P

Reg.No 421322104012

Sem/Dep III / VI

Course Name Google Cloud Data Analytics

Certified that this is the bonafide record of work done by the above
student in the NM Course - Google Cloud Data Analytics during the
academic year 2024 – 2025.

Signature of the Course Coordinator Signature of the HOD

Submitted for the practical examination held on …………………….

Internal Examiner External Examiner


EX.NO: 01 EDA ON GLOBAL SUPERSTORE SALES DATASET

AIM:
To perform Exploratory Data Analysis (EDA) on the Global Superstore dataset.

INTRODUCTION:

Exploratory Data Analysis (EDA) is a fundamental step in the data analysis process that helps uncover patterns, spot anomalies, test hypotheses, and check assumptions through statistical and visual techniques. When applied to the Global Superstore dataset, a widely used sample dataset in the domain of sales, logistics, and customer analytics, EDA provides valuable insights into business operations across multiple regions, categories, and customer segments.

The Global Superstore dataset contains detailed transactional data, including order
dates, shipping dates, customer demographics, product categories, sales figures, profit
margins, and shipping modes.

OBJECTIVES:

 Understand the overall structure and quality of the data
 Identify trends in sales and profits over time
 Analyze regional and segment-based performance
 Detect high-performing product categories and sub-categories
 Uncover relationships between shipping modes, delivery times, and profitability (see the sketch after this list)
 Highlight outliers or inconsistencies in data entries
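
The shipping-mode objective is not worked through in the steps below, so here is a minimal sketch. It assumes the dataset has been loaded into df as in Step 1, uses the standard 'Order Date', 'Ship Date', 'Ship Mode', and 'Profit' columns, and introduces a derived 'Delivery Days' column:

import pandas as pd
# Convert dates defensively in case they were read as strings
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# Delivery time in days: gap between order and ship dates
df['Delivery Days'] = (df['Ship Date'] - df['Order Date']).dt.days
# Average delivery time and total profit per shipping mode
print(df.groupby('Ship Mode').agg(avg_delivery_days=('Delivery Days', 'mean'),
                                  total_profit=('Profit', 'sum')))

Slower, cheaper modes (e.g. Standard Class) can then be compared against faster ones on both speed and profit contribution.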

DATASET : Global Superstore Dataset (Excel)

DATA SOURCE LINK : https://www.kaggle.com/datasets/shekpaul/global-superstore


Perform Exploratory Data Analysis:

STEP 1 : Load the dataset using Pandas.

PROGRAM:

import pandas as pd
# Load the dataset
df = pd.read_excel("/content/Global Superstore.xls")

I. Display first five Rows.

PROGRAM:

print(df.head())

OUTPUT:

II. Display the Total Number of Rows and Columns.

PROGRAM:

# Print number of rows and columns
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])
OUTPUT:

STEP 2 : Clean missing data and remove duplicates.

I. Check for Missing Values

PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

OUTPUT:
II. Filling the Missing Values:

PROGRAM:
# Create a city → postal code mapping from known (non-null) data
postal_map = df.dropna(subset=['Postal Code']).drop_duplicates('City').set_index('City')['Postal Code']
# Fill missing postal codes using map and fillna
df['Postal Code'] = df['Postal Code'].fillna(df['City'].map(postal_map))
print(df['Postal Code'].isnull().sum())

OUTPUT:

PROGRAM:

df['Postal Code'] = df['Postal Code'].fillna(0).astype(int)

print("Missing Postal Codes after filling:", df['Postal Code'].isnull().sum())

OUTPUT:
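
STEP 2 also calls for removing duplicates, which the steps above do not show; a minimal sketch:

# Drop rows that are exact duplicates, keeping the first occurrence
rows_before = df.shape[0]
df = df.drop_duplicates()
print("Duplicate rows removed:", rows_before - df.shape[0])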
STEP 3 : Calculate summary statistics (mean, median, standard deviation) for Sales
and Profit.

PROGRAM:

# Calculate mean, median, and standard deviation for 'Sales' and 'Profit'
mean_sales = df['Sales'].mean()
median_sales = df['Sales'].median()
std_sales = df['Sales'].std()
mean_profit = df['Profit'].mean()
median_profit = df['Profit'].median()
std_profit = df['Profit'].std()
# Print the results
print(f"Sales - Mean: {mean_sales}, Median: {median_sales}, Std Dev: {std_sales}")
print(f"Profit - Mean: {mean_profit}, Median: {median_profit}, Std Dev:
{std_profit}")

OUTPUT:

STEP 4 : Analyze

 Total sales per region.
 Top 5 most profitable product categories.
 Year-wise sales trend.
I. Total Sales per Region:

PROGRAM:

# Group by 'Region' and calculate total sales for each region
total_sales_per_region = df.groupby('Region')['Sales'].sum()
# Display the result
print(total_sales_per_region)

OUTPUT:

II. Top 5 most profitable product categories.

PROGRAM:
# Group by 'Category' and calculate total profit for each category
total_profit_per_category = df.groupby('Category')['Profit'].sum()
# Sort the categories by total profit in descending order
sorted_profit = total_profit_per_category.sort_values(ascending=False)
# Display the top 5 most profitable categories
top_5_profitable_categories = sorted_profit.head(5)
print(top_5_profitable_categories)
OUTPUT:

III. Year-wise sales trend.

PROGRAM:

# Ensure 'Order Date' is datetime, then extract the year
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Year'] = df['Order Date'].dt.year
# Display the first few rows to check the new 'Year' column
print(df[['Order Date', 'Year']].head())

OUTPUT:

PROGRAM:
# Group by 'Year' and calculate total sales for each year
sales_by_year = df.groupby('Year')['Sales'].sum()
# Display the total sales per year
print(sales_by_year)
OUTPUT:

STEP 5 : Visualizations

 Bar chart: Sales by region.
 Line chart: Year-wise sales trend.

I. Bar chart: Sales by region.

PROGRAM:
import matplotlib.pyplot as plt
# Plotting the sales by region
plt.figure(figsize=(10, 6))
total_sales_per_region.plot(kind='bar', color='violet')
# Add labels and title
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=0) # Optional: Rotate x-axis labels for better visibility
plt.show()
OUTPUT:

Insights:

The bar chart depicting total sales by region reveals significant variations in sales
performance across different geographic areas. The Asia-Pacific and US regions
typically lead in overall sales, indicating strong market presence and customer demand.
In contrast, Africa and Canada show comparatively lower sales volumes, suggesting
either limited market reach or fewer transactions recorded in those areas.

This distribution may reflect regional differences in customer base size, product
availability, or operational scale. The insights from this chart can guide strategic
decisions such as regional marketing investments, supply chain adjustments, and
potential market expansion opportunities.
II. Line chart: Year-wise sales trend.

PROGRAM:
import matplotlib.pyplot as plt
# Plotting the year-wise sales trend as a line chart
plt.figure(figsize=(10, 6))
sales_by_year.plot(kind='line', marker='o', color='b')
# Add labels and title
plt.title('Year-Wise Sales Trend')
plt.xlabel('Year')
plt.ylabel('Total Sales')
plt.grid(True) # Add gridlines for better readability
plt.xticks(rotation=45) # Rotate x-axis labels for better visibility
plt.show()

OUTPUT:
Insights:

The line chart illustrating year-wise sales trends shows a generally increasing
trajectory in total sales over the years, indicating business growth and expanding
customer demand. In particular, there's often a noticeable spike in sales in the final year
(e.g., 2014), which could be attributed to seasonal campaigns, improved logistics, or
expanded operations.

However, some fluctuations or dips may be observed in intermediate years, possibly due to market changes, economic conditions, or internal operational shifts. These patterns highlight the importance of year-over-year performance monitoring to identify what strategies are driving growth or where improvements are needed. This trend analysis can inform forecasting, budgeting, and strategic planning for future business initiatives.

Google Colab Link:

https://colab.research.google.com/drive/1cZQihPLXWEbxvgih2axmp2JhaQJJQ8uk?usp=sharing

Result:
The EDA of the Global Superstore dataset shows steady growth in sales over the
years, with peak sales in the most recent year. The Consumer segment and Technology
category drive most of the revenue, while Furniture often results in losses. The US and
Asia-Pacific regions perform best, whereas Africa and Canada underperform. High
discounts and costly shipping methods reduce profitability. Outliers reveal cases of high
sales with negative profits, suggesting areas for operational improvement. Thus, Exploratory Data Analysis has been performed on the Global Superstore dataset.
EX.NO: 02 EDA ON COVID-19 INDIA STATE-WISE DATASET

AIM:
To perform Exploratory Data Analysis (EDA) on the COVID-19 India state-wise dataset.

INTRODUCTION:
The COVID-19 pandemic has significantly impacted countries across the globe,
and India, with its vast and diverse population, has faced unique challenges in managing
the spread and effects of the virus. To better understand the dynamics of the pandemic
within the country, it is crucial to analyze COVID-19 data at a more granular level, specifically state-wise. This project focuses on performing Exploratory Data Analysis
(EDA) on a state-wise COVID-19 dataset for India.

By examining key metrics such as confirmed cases, recoveries, active cases, and
deaths, we aim to gain meaningful insights into the regional progression of the pandemic.
The findings from this EDA can help identify states with high transmission rates, assess
healthcare response effectiveness, and provide a data-driven foundation for public health
decision-making.

OBJECTIVES:

 Understand the overall distribution of COVID-19 cases across Indian states.
 Identify states with the highest and lowest number of confirmed, recovered, and deceased cases.
 Analyze trends over time for confirmed, active, recovered, and death cases.
 Compare the rate of recovery and mortality across different states (see the sketch after this list).
 Detect outliers or anomalies in the data that may indicate reporting issues or sudden surges.
 Visualize the progression of the pandemic using charts and graphs for easier interpretation.
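
The recovery and mortality comparison can be computed directly once the dataset is loaded into df (Step 1 below); a minimal sketch, assuming the 'Confirmedcases', 'Recovered', and 'Death' columns used later in this exercise:

# Aggregate per state, then derive rates as percentages of confirmed cases
statewise = df.groupby('State')[['Confirmedcases', 'Recovered', 'Death']].sum()
statewise['Recovery Rate (%)'] = statewise['Recovered'] / statewise['Confirmedcases'] * 100
statewise['Mortality Rate (%)'] = statewise['Death'] / statewise['Confirmedcases'] * 100
print(statewise.sort_values('Recovery Rate (%)', ascending=False).head())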
DATASET : COVID-19 India State-wise Data (CSV)

DATA SOURCE LINK: https://www.kaggle.com/datasets/n1sarg/covid19-india-datasets?select=state_wise_data.csv

Perform Exploratory Data Analysis:

STEP 1 : Load the dataset using Pandas.

PROGRAM:

import pandas as pd
# Load the dataset (a CSV file, so read_csv is used)
df = pd.read_csv("/state_wise_data.csv")

I. Display first five Rows.

PROGRAM:

print(df.head())

OUTPUT:
II. Display the Total Number of Rows and Columns.

PROGRAM:

# Print number of rows and columns
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])

OUTPUT:

STEP 2 : Clean missing data and remove duplicates.

I. Check for Missing Values

PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

OUTPUT:
II. Convert date columns to datetime format.

PROGRAM:

# Assuming your DataFrame is named df
df['Date'] = pd.to_datetime(df['Date'])
# Optional: confirm the conversion
print(df['Date'].dtype)

OUTPUT:
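
STEP 2 also calls for removing duplicates; a minimal sketch, where the choice of key columns ('State' and 'Date') is an assumption about what makes a record unique in this dataset:

# A state should have at most one record per date; keep the latest
df = df.drop_duplicates(subset=['State', 'Date'], keep='last')
print("Rows after removing duplicates:", df.shape[0])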

STEP 3: Calculate

 Total confirmed, recovered, and death cases for each state.
 State with the highest number of cases.
 Daily trend of new cases.

I. Total confirmed, recovered, and death cases for each state.

PROGRAM:

# Group by 'State' and sum the relevant columns
statewise_summary = df.groupby('State')[['Confirmedcases', 'Recovered', 'Death']].sum().reset_index()
# Display the result
print(statewise_summary)
OUTPUT:

II. State with the highest number of cases.

PROGRAM:

# Group by State and sum total cases
grouped = df.groupby('State')['Total cases'].sum().reset_index()
# Get the state with the highest total cases
max_state = grouped.loc[grouped['Total cases'].idxmax()]
# Display the result
print("State with the highest total number of cases:")
print(max_state)
OUTPUT:

III. Daily trend of new cases.

PROGRAM:

# Sort by state and date
df.sort_values(['State', 'Date'], inplace=True)
# Calculate new daily cases per state
df['New_Cases'] = df.groupby('State')['Confirmedcases'].diff()
# Display a few rows
print(df[['State', 'Date', 'Confirmedcases', 'New_Cases']].head())

OUTPUT:

STEP 4 : Visualizations:

 Pie chart: Top 5 states by confirmed cases.
 Line graph: Trend of daily confirmed cases.
I. Pie chart: Top 5 states by confirmed cases.

PROGRAM:
import matplotlib.pyplot as plt
# Total confirmed cases per state
state_confirmed = df.groupby('State')['Confirmedcases'].sum()
# Get top 5 states
top_5 = state_confirmed.sort_values(ascending=False).head(5)
# Plot pie chart
plt.figure(figsize=(8, 8))
plt.pie(top_5, labels=top_5.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 5 States by Confirmed COVID-19 Cases')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

OUTPUT:
Insights:
During the exploratory data analysis of the COVID-19 state-wise dataset for India, a pie
chart provides a clear visual representation of how cases are distributed across different
states. It reveals that a small number of states, such as Maharashtra, Kerala, and Delhi,
contribute to a disproportionately large share of confirmed cases, indicating regional
hotspots of infection. Similarly, when visualizing active cases, the pie chart highlights
the states where the virus remains prevalent, helping to identify areas that may still be
under significant healthcare pressure. In the case of deaths and recoveries, the chart helps
assess how effectively different states have managed the pandemic, with larger slices
suggesting better recovery efforts or, conversely, higher mortality.

II. Line graph: Trend of daily confirmed cases.

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Ensure date is in datetime format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
# Sort by date
df.sort_values('Date', inplace=True)
# Plot as-is: all points
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Confirmedcases'], linestyle='-', marker='.', color='red')
plt.title('Trend of daily Confirmed COVID-19 Cases')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.grid(True)
plt.tight_layout()
plt.show()
OUTPUT:

Insights:

A line chart is particularly effective in revealing the temporal trends of COVID-19 across Indian states. By plotting confirmed, recovered, and death cases over time, it
allows us to observe how the pandemic evolved in different regions. The chart can
highlight periods of sharp spikes, such as during the first or second waves, and indicate
how quickly or slowly each state responded to surges. For instance, a steep upward trend
in confirmed cases followed by a delayed increase in recoveries might point to delayed
interventions or healthcare strain. In contrast, states showing a synchronized rise in
recoveries suggest more effective management. Line charts also help compare growth
patterns among states, revealing which areas flattened the curve earlier or experienced
prolonged waves. Overall, these trends are vital for understanding the dynamics of the
outbreak and guiding future preparedness efforts.
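
The cross-state comparison described above can be made concrete with one line per state; a minimal sketch, where the chosen states are simply the ones named in these insights:

import matplotlib.pyplot as plt
# Plot each state's confirmed-case trajectory on a shared time axis
for state in ['Maharashtra', 'Kerala', 'Delhi']:
    subset = df[df['State'] == state]
    plt.plot(subset['Date'], subset['Confirmedcases'], label=state)
plt.title('Confirmed COVID-19 Cases Over Time: Selected States')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.legend()
plt.grid(True)
plt.show()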

Google Colab Link :

https://colab.research.google.com/drive/11DO_s_JTgtanNu3PjaIheMfRe3bg8X8P?usp=sharing
RESULT:
The exploratory data analysis (EDA) of the COVID-19 state-wise India dataset
revealed several key findings. It was observed that a few states, such as Maharashtra,
Kerala, and Karnataka, accounted for the majority of confirmed and active cases,
indicating regional hotspots. Line charts showed clear trends in the rise and fall of cases
over time, highlighting critical periods such as the peaks of the first and second waves.
Recovery and mortality patterns varied significantly among states, with some achieving
high recovery rates while others showed relatively higher fatality ratios. Pie charts and
bar graphs provided a comparative view of the burden across states, emphasizing the
uneven impact of the pandemic in India. The analysis also helped identify outliers, data
inconsistencies, and states with efficient healthcare responses. Overall, the EDA offered
valuable insights that can support data-driven decision-making and better preparedness
for future health emergencies.
EX.NO: 03 EDA ON YOUTUBE TRENDING VIDEOS DATASET

AIM:
To perform Exploratory Data Analysis (EDA) on the YouTube Trending Videos dataset.
INTRODUCTION:

YouTube is one of the largest video-sharing platforms in the world, influencing entertainment, news, marketing, and public opinion on a massive scale. Understanding
what makes a video trend can offer valuable insights into audience behavior, content
performance, and digital marketing strategies. This project focuses on performing
Exploratory Data Analysis (EDA) on a YouTube Trending Videos dataset to uncover
patterns and trends in popular content.

By analyzing features such as views, likes, dislikes, comment counts, tags, and
publish times, the goal is to identify the key factors that contribute to a video becoming
viral. The dataset includes videos from different categories and regions, allowing us to
explore trends across genres and understand regional preferences. This analysis provides
a data-driven perspective on content popularity and can be useful for content creators,
marketers, and platform analysts.

OBJECTIVES:

 Analyze the distribution of trending videos across different categories.
 Identify the most frequently trending video titles, channels, and tags.
 Examine relationships between views, likes, dislikes, and comment counts.
 Determine the upload days and times when videos are most likely to trend.
 Understand audience engagement through like/dislike and comment ratios (see the sketch after this list).
 Compare the performance of videos across different countries.
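
The engagement-ratio objective can be computed directly once the dataset is loaded into df (Step 1 below); a minimal sketch, where the 'dislikes' column is an assumption (it is present in the classic YouTube trending datasets, but should be verified against this one):

# Like/dislike and comment/view ratios per video; guard against division by zero
df['like_dislike_ratio'] = df['likes'] / df['dislikes'].replace(0, 1)
df['comment_view_ratio'] = df['comment_count'] / df['views'].replace(0, 1)
print(df[['like_dislike_ratio', 'comment_view_ratio']].describe())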
DATASET: YouTube Trending Videos

DATA SOURCE LINK : https://www.kaggle.com/datasets/thedevastator/youtube-trending-videos-dataset

Perform Exploratory Data Analysis:

STEP 1 : Load the dataset using Pandas.

PROGRAM:

import pandas as pd
# Load the dataset (a CSV file, so read_csv is used)
df = pd.read_csv("/content/youtube.csv")

I. Display first five Rows.

PROGRAM:

print(df.head())

OUTPUT:
II. Display the Total Number of Rows and Columns.

PROGRAM:

# Print number of rows and columns
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])

OUTPUT:

STEP 2 : Clean missing data and remove duplicates.

I. Check for Missing Values

PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

OUTPUT:
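
STEP 2 also calls for removing duplicates; a minimal sketch, where the key columns ('video_id' and 'trending_date') are an assumption about this dataset's schema; a plain df.drop_duplicates() is the safe fallback:

# The same video can trend on several days, so deduplicate on video and date
df = df.drop_duplicates(subset=['video_id', 'trending_date'])
print("Rows after removing duplicates:", df.shape[0])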
STEP 3 : Calculate:

 Most common video categories.
 Top 5 channels with the highest number of trending videos.
 Average likes, views, and comments.

I. Most common video categories.

PROGRAM:

# Count videos in each category
most_common_categories = df['category_id'].value_counts()
# Display the top 5 most common categories
print(most_common_categories.head())

OUTPUT :

PROGRAM:

# Count the most common video titles
most_common_titles = df['title'].value_counts()
# Display the top 10 most common video titles
print(most_common_titles.head(10))
OUTPUT:

II. Top 5 channels with the highest number of trending videos.

PROGRAM:

# Group by 'channel_title' and count the number of videos per channel
channel_video_count = df.groupby('channel_title').size()
# Sort and display the top 5 channels with the most videos
top_5_channels = channel_video_count.sort_values(ascending=False).head(5)
print(top_5_channels)

OUTPUT:
III. Average likes, views, and comments.

PROGRAM:

# Calculate the average of likes, comments, and views
average_likes = df['likes'].mean()
average_comments = df['comment_count'].mean()
average_views = df['views'].mean()
# Print the results
print(f'Average Likes: {average_likes}')
print(f'Average Comments: {average_comments}')
print(f'Average Views: {average_views}')

OUTPUT:

STEP 4: Visualizations
 Bar chart: Video count by category.
 Scatter plot: Likes vs. Views.

I. Bar chart: Video count by category.

PROGRAM:
import matplotlib.pyplot as plt
# Count videos per category (this series must be defined before plotting)
video_count_by_category = df['category_id'].value_counts()
# Plotting the video count by category
plt.figure(figsize=(10, 6)) # Adjust the size of the plot
video_count_by_category.plot(kind='barh', color='pink') # Horizontal bar chart
plt.title('Video Count by Category') # Title
plt.xlabel('Number of Videos') # X-axis label
plt.ylabel('Category') # Y-axis label
plt.gca().invert_yaxis() # Invert the y-axis for better visibility of top categories
plt.show()

OUTPUT:

II. Scatter plot: Likes vs. Views.

PROGRAM:

import matplotlib.pyplot as plt
# Create a scatter plot of likes vs views
plt.figure(figsize=(10, 6)) # Adjust the size of the plot
plt.scatter(df['views'], df['likes'], alpha=0.5, color='red')
# Add labels and title
plt.title('Likes vs Views on YouTube Trending Videos')
plt.xlabel('Number of Views')
plt.ylabel('Number of Likes')
# Display the plot
plt.show()
OUTPUT:

Insights: Describing trends in video popularity and engagement.

Video popularity and engagement trends have evolved significantly, driven largely
by the rise of short-form content and algorithmic curation. Platforms like TikTok,
YouTube Shorts, and Instagram Reels have popularized brief, visually engaging videos
that cater to short attention spans, resulting in higher completion rates and shareability.
Personalized recommendation systems now play a critical role in surfacing content,
meaning that creators who target niche interests often see stronger engagement.
Authentic, user-generated content continues to outperform polished productions,
especially when it fosters relatability and trust. Storytelling and interactive elements—
such as calls to comment or participate—boost viewer involvement, while live streaming
enhances real-time engagement and community building. Moreover, the mobile-first
nature of video consumption has made vertical formats and captivating intros essential
for capturing and maintaining attention. Together, these trends highlight the importance
of agility, authenticity, and platform-specific strategies in driving video success today.
Google Colab Link:

https://colab.research.google.com/drive/1jH3TkCt9cAS8Zxi60WuWrsTB7eAcO3gk?usp=sharing

RESULT:

The EDA of YouTube trending videos reveals several key patterns in video
popularity and engagement. Videos with titles that include emotionally charged or
curiosity-driven words tend to attract more views and clicks. Content in categories such
as music, entertainment, and gaming appears most frequently in the trending list,
indicating strong viewer demand. High-performing videos often have a high like-to-
dislike ratio and generate significant comment activity, suggesting that viewer
engagement is a major factor in trending status. Additionally, channels with consistent
upload schedules and a high subscriber base tend to trend more often, highlighting the
importance of audience loyalty. Lastly, video length impacts performance—shorter
videos generally trend more frequently, but longer videos (7–15 minutes) tend to sustain
higher average view durations when well-produced. These insights suggest that content
quality, emotional appeal, and audience interaction are critical to driving trends on
YouTube.
Google Cloud Data Analytics Course Completion Certificate:
