
NAAN MUDHALVAN - GOOGLE CLOUD DATA ANALYTICS

Submitted by

ELAKYA P - 421322104012
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Year / Semester – III /VI

ANNA UNIVERSITY: CHENNAI 600025


Nov-Dec 2025
BONAFIDE CERTIFICATE

Name ELAKYA P

Reg.No 421322104012

Sem/Dep III / VI

Course Name Google Cloud Data Analytics

Certified that this is the bonafide record of work done by the above
student in the NM Course - Google Cloud Data Analytics during the
academic year 2024 – 2025.

Signature of the Course Coordinator Signature of the HOD

Submitted for the practical examination held on …………………….

Internal Examiner External Examiner


EX.NO: 01 EDA ON GLOBAL SUPERSTORE SALES DATASET

AIM:
To perform Exploratory Data Analysis (EDA) on the Global Superstore dataset.

INTRODUCTION:

Exploratory Data Analysis (EDA) is a fundamental step in the data analysis process that helps uncover patterns, spot anomalies, test hypotheses, and check assumptions through statistical and visual techniques. When applied to the Global Superstore dataset, a widely used sample dataset in the domain of sales, logistics, and customer analytics, EDA provides valuable insights into business operations across multiple regions, categories, and customer segments.

The Global Superstore dataset contains detailed transactional data, including order
dates, shipping dates, customer demographics, product categories, sales figures, profit
margins, and shipping modes.

OBJECTIVES:

 Understand the overall structure and quality of the data
 Identify trends in sales and profits over time
 Analyze regional and segment-based performance
 Detect high-performing product categories and sub-categories
 Uncover relationships between shipping modes, delivery times, and profitability (see the sketch after this list)
 Highlight outliers or inconsistencies in data entries
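
The shipping-mode objective is not worked through in the steps below, so here is a minimal sketch. It assumes the dataset has been loaded into df as in Step 1, uses the standard 'Order Date', 'Ship Date', 'Ship Mode', and 'Profit' columns, and introduces a derived 'Delivery Days' column:

import pandas as pd
# Convert dates defensively in case they were read as strings
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# Delivery time in days: gap between order and ship dates
df['Delivery Days'] = (df['Ship Date'] - df['Order Date']).dt.days
# Average delivery time and total profit per shipping mode
print(df.groupby('Ship Mode').agg(avg_delivery_days=('Delivery Days', 'mean'),
                                  total_profit=('Profit', 'sum')))

Slower, cheaper modes (e.g. Standard Class) can then be compared against faster ones on both speed and profit contribution.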

DATASET : Global Superstore Dataset (Excel)

DATA SOURCE LINK : https://www.kaggle.com/datasets/shekpaul/global-superstore


Perform Exploratory Data Analysis:

STEP 1 : Load the dataset using Pandas.

PROGRAM:

import pandas as pd
# Load the dataset
df = pd.read_excel("/content/Global Superstore.xls")

I. Display first five Rows.

PROGRAM:

print(df.head())

OUTPUT:

II. Display the Total Number of Rows and Columns.

PROGRAM:

# Print number of rows and columns
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])
OUTPUT:

STEP 2 : Clean missing data and remove duplicates.

I. Check for Missing Values

PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

OUTPUT:
II. Filling the Missing Values:

PROGRAM:
# Create a city → postal code mapping from known (non-null) data
postal_map = df.dropna(subset=['Postal Code']).drop_duplicates('City').set_index('City')['Postal Code']
# Fill missing postal codes using map and fillna
df['Postal Code'] = df['Postal Code'].fillna(df['City'].map(postal_map))
print(df['Postal Code'].isnull().sum())

OUTPUT:

PROGRAM:

df['Postal Code'] = df['Postal Code'].fillna(0).astype(int)

print("Missing Postal Codes after filling:", df['Postal Code'].isnull().sum())

OUTPUT:
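
STEP 2 also calls for removing duplicates, which the steps above do not show; a minimal sketch:

# Drop rows that are exact duplicates, keeping the first occurrence
rows_before = df.shape[0]
df = df.drop_duplicates()
print("Duplicate rows removed:", rows_before - df.shape[0])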
STEP 3 : Calculate summary statistics (mean, median, standard deviation) for Sales
and Profit.

PROGRAM:

# Calculate mean, median, and standard deviation for 'Sales' and 'Profit'
mean_sales = df['Sales'].mean()
median_sales = df['Sales'].median()
std_sales = df['Sales'].std()
mean_profit = df['Profit'].mean()
median_profit = df['Profit'].median()
std_profit = df['Profit'].std()
# Print the results
print(f"Sales - Mean: {mean_sales}, Median: {median_sales}, Std Dev: {std_sales}")
print(f"Profit - Mean: {mean_profit}, Median: {median_profit}, Std Dev:
{std_profit}")

OUTPUT:

STEP 4 : Analyze

 Total sales per region.
 Top 5 most profitable product categories.
 Year-wise sales trend.
I. Total Sales per Region:

PROGRAM:

# Group by 'Region' and calculate total sales for each region
total_sales_per_region = df.groupby('Region')['Sales'].sum()
# Display the result
print(total_sales_per_region)

OUTPUT:

II. Top 5 most profitable product categories.

PROGRAM:
# Group by 'Category' and calculate total profit for each category
total_profit_per_category = df.groupby('Category')['Profit'].sum()
# Sort the categories by total profit in descending order
sorted_profit = total_profit_per_category.sort_values(ascending=False)
# Display the top 5 most profitable categories
top_5_profitable_categories = sorted_profit.head(5)
print(top_5_profitable_categories)
OUTPUT:

III. Year-wise sales trend.

PROGRAM:

# Ensure 'Order Date' is datetime, then extract the year
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Year'] = df['Order Date'].dt.year
# Display the first few rows to check the new 'Year' column
print(df[['Order Date', 'Year']].head())

OUTPUT:

PROGRAM:
# Group by 'Year' and calculate total sales for each year
sales_by_year = df.groupby('Year')['Sales'].sum()
# Display the total sales per year
print(sales_by_year)
OUTPUT:

STEP 5 : Visualizations

 Bar chart: Sales by region.
 Line chart: Year-wise sales trend.

I. Bar chart: Sales by region.

PROGRAM:
import matplotlib.pyplot as plt
# Plotting the sales by region
plt.figure(figsize=(10, 6))
total_sales_per_region.plot(kind='bar', color='violet')
# Add labels and title
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=0) # Optional: Rotate x-axis labels for better visibility
plt.show()
OUTPUT:

Insights:

The bar chart depicting total sales by region reveals significant variations in sales
performance across different geographic areas. The Asia-Pacific and US regions
typically lead in overall sales, indicating strong market presence and customer demand.
In contrast, Africa and Canada show comparatively lower sales volumes, suggesting
either limited market reach or fewer transactions recorded in those areas.

This distribution may reflect regional differences in customer base size, product
availability, or operational scale. The insights from this chart can guide strategic
decisions such as regional marketing investments, supply chain adjustments, and
potential market expansion opportunities.
II. Line chart: Year-wise sales trend.

PROGRAM:
import matplotlib.pyplot as plt
# Plotting the year-wise sales trend as a line chart
plt.figure(figsize=(10, 6))
sales_by_year.plot(kind='line', marker='o', color='b')
# Add labels and title
plt.title('Year-Wise Sales Trend')
plt.xlabel('Year')
plt.ylabel('Total Sales')
plt.grid(True) # Add gridlines for better readability
plt.xticks(rotation=45) # Rotate x-axis labels for better visibility
plt.show()

OUTPUT:
Insights:

The line chart illustrating year-wise sales trends shows a generally increasing
trajectory in total sales over the years, indicating business growth and expanding
customer demand. In particular, there's often a noticeable spike in sales in the final year
(e.g., 2014), which could be attributed to seasonal campaigns, improved logistics, or
expanded operations.

However, some fluctuations or dips may be observed in intermediate years, possibly due to market changes, economic conditions, or internal operational shifts. These patterns highlight the importance of year-over-year performance monitoring to identify what strategies are driving growth or where improvements are needed. This trend analysis can inform forecasting, budgeting, and strategic planning for future business initiatives.

Google Colab Link:

https://colab.research.google.com/drive/1cZQihPLXWEbxvgih2axmp2JhaQJJQ8uk?usp=sharing

Result:
The EDA of the Global Superstore dataset shows steady growth in sales over the
years, with peak sales in the most recent year. The Consumer segment and Technology
category drive most of the revenue, while Furniture often results in losses. The US and
Asia-Pacific regions perform best, whereas Africa and Canada underperform. High
discounts and costly shipping methods reduce profitability. Outliers reveal cases of high
sales with negative profits, suggesting areas for operational improvement. Thus, Exploratory Data Analysis has been performed on the Global Superstore dataset.
EX.NO: 02 EDA ON COVID-19 INDIA STATE-WISE DATASET

AIM:
To perform Exploratory Data Analysis (EDA) on the COVID-19 India state-wise dataset.

INTRODUCTION:
The COVID-19 pandemic has significantly impacted countries across the globe,
and India, with its vast and diverse population, has faced unique challenges in managing
the spread and effects of the virus. To better understand the dynamics of the pandemic
within the country, it is crucial to analyze COVID-19 data at a more granular level, specifically state-wise. This project focuses on performing Exploratory Data Analysis
(EDA) on a state-wise COVID-19 dataset for India.

By examining key metrics such as confirmed cases, recoveries, active cases, and
deaths, we aim to gain meaningful insights into the regional progression of the pandemic.
The findings from this EDA can help identify states with high transmission rates, assess
healthcare response effectiveness, and provide a data-driven foundation for public health
decision-making.

OBJECTIVES:

 Understand the overall distribution of COVID-19 cases across Indian states.
 Identify states with the highest and lowest number of confirmed, recovered, and deceased cases.
 Analyze trends over time for confirmed, active, recovered, and death cases.
 Compare the rate of recovery and mortality across different states (see the sketch after this list).
 Detect outliers or anomalies in the data that may indicate reporting issues or sudden surges.
 Visualize the progression of the pandemic using charts and graphs for easier interpretation.
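
The recovery and mortality comparison can be computed directly once the dataset is loaded into df (Step 1 below); a minimal sketch, assuming the 'Confirmedcases', 'Recovered', and 'Death' columns used later in this exercise:

# Aggregate per state, then derive rates as percentages of confirmed cases
statewise = df.groupby('State')[['Confirmedcases', 'Recovered', 'Death']].sum()
statewise['Recovery Rate (%)'] = statewise['Recovered'] / statewise['Confirmedcases'] * 100
statewise['Mortality Rate (%)'] = statewise['Death'] / statewise['Confirmedcases'] * 100
print(statewise.sort_values('Recovery Rate (%)', ascending=False).head())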
DATASET : COVID-19 India State-wise Data (CSV)

DATA SOURCE LINK: https://www.kaggle.com/datasets/n1sarg/covid19-india-datasets?select=state_wise_data.csv

Perform Exploratory Data Analysis:

STEP 1 : Load the dataset using Pandas.

PROGRAM:

import pandas as pd
# Load the dataset (a CSV file, so read_csv is used)
df = pd.read_csv("/state_wise_data.csv")

I. Display first five Rows.

PROGRAM:

print(df.head())

OUTPUT:
II. Display the Total Number of Rows and Columns.

PROGRAM:

# Print number of rows and columns
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])

OUTPUT:

STEP 2 : Clean missing data and remove duplicates.

I. Check for Missing Values

PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

OUTPUT:
II. Convert date columns to datetime format.

PROGRAM:

# Assuming your DataFrame is named df
df['Date'] = pd.to_datetime(df['Date'])
# Optional: confirm the conversion
print(df['Date'].dtype)

OUTPUT:
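
STEP 2 also calls for removing duplicates; a minimal sketch, where the choice of key columns ('State' and 'Date') is an assumption about what makes a record unique in this dataset:

# A state should have at most one record per date; keep the latest
df = df.drop_duplicates(subset=['State', 'Date'], keep='last')
print("Rows after removing duplicates:", df.shape[0])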

STEP 3: Calculate

 Total confirmed, recovered, and death cases for each state.
 State with the highest number of cases.
 Daily trend of new cases.

I. Total confirmed, recovered, and death cases for each state.

PROGRAM:

# Group by 'State' and sum the relevant columns
statewise_summary = df.groupby('State')[['Confirmedcases', 'Recovered', 'Death']].sum().reset_index()
# Display the result
print(statewise_summary)
OUTPUT:

II. State with the highest number of cases.

PROGRAM:

# Group by State and sum total cases
grouped = df.groupby('State')['Total cases'].sum().reset_index()
# Get the state with the highest total cases
max_state = grouped.loc[grouped['Total cases'].idxmax()]
# Display the result
print("State with the highest total number of cases:")
print(max_state)
OUTPUT:

III. Daily trend of new cases.

PROGRAM:

# Sort by state and date
df.sort_values(['State', 'Date'], inplace=True)
# Calculate new daily cases per state
df['New_Cases'] = df.groupby('State')['Confirmedcases'].diff()
# Display a few rows
print(df[['State', 'Date', 'Confirmedcases', 'New_Cases']].head())

OUTPUT:

STEP 4 : Visualizations:

 Pie chart: Top 5 states by confirmed cases.
 Line graph: Trend of daily confirmed cases.
I. Pie chart: Top 5 states by confirmed cases.

PROGRAM:
import matplotlib.pyplot as plt
# Total confirmed cases per state
state_confirmed = df.groupby('State')['Confirmedcases'].sum()
# Get top 5 states
top_5 = state_confirmed.sort_values(ascending=False).head(5)
# Plot pie chart
plt.figure(figsize=(8, 8))
plt.pie(top_5, labels=top_5.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 5 States by Confirmed COVID-19 Cases')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

OUTPUT:
Insights:
During the exploratory data analysis of the COVID-19 state-wise dataset for India, a pie
chart provides a clear visual representation of how cases are distributed across different
states. It reveals that a small number of states, such as Maharashtra, Kerala, and Delhi,
contribute to a disproportionately large share of confirmed cases, indicating regional
hotspots of infection. Similarly, when visualizing active cases, the pie chart highlights
the states where the virus remains prevalent, helping to identify areas that may still be
under significant healthcare pressure. In the case of deaths and recoveries, the chart helps
assess how effectively different states have managed the pandemic, with larger slices
suggesting better recovery efforts or, conversely, higher mortality.

II. Line graph: Trend of daily confirmed cases.

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Ensure date is in datetime format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
# Sort by date
df.sort_values('Date', inplace=True)
# Plot as-is: all points
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Confirmedcases'], linestyle='-', marker='.', color='red')
plt.title('Trend of daily Confirmed COVID-19 Cases')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.grid(True)
plt.tight_layout()
plt.show()
OUTPUT:

Insights:

A line chart is particularly effective in revealing the temporal trends of COVID-19 across Indian states. By plotting confirmed, recovered, and death cases over time, it
allows us to observe how the pandemic evolved in different regions. The chart can
highlight periods of sharp spikes, such as during the first or second waves, and indicate
how quickly or slowly each state responded to surges. For instance, a steep upward trend
in confirmed cases followed by a delayed increase in recoveries might point to delayed
interventions or healthcare strain. In contrast, states showing a synchronized rise in
recoveries suggest more effective management. Line charts also help compare growth
patterns among states, revealing which areas flattened the curve earlier or experienced
prolonged waves. Overall, these trends are vital for understanding the dynamics of the
outbreak and guiding future preparedness efforts.
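
The cross-state comparison described above can be made concrete with one line per state; a minimal sketch, where the chosen states are simply the ones named in these insights:

import matplotlib.pyplot as plt
# Plot each state's confirmed-case trajectory on a shared time axis
for state in ['Maharashtra', 'Kerala', 'Delhi']:
    subset = df[df['State'] == state]
    plt.plot(subset['Date'], subset['Confirmedcases'], label=state)
plt.title('Confirmed COVID-19 Cases Over Time: Selected States')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.legend()
plt.grid(True)
plt.show()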

Google Colab Link :

https://colab.research.google.com/drive/11DO_s_JTgtanNu3PjaIheMfRe3bg8X8P?usp=sharing
RESULT:
The exploratory data analysis (EDA) of the COVID-19 state-wise India dataset
revealed several key findings. It was observed that a few states, such as Maharashtra,
Kerala, and Karnataka, accounted for the majority of confirmed and active cases,
indicating regional hotspots. Line charts showed clear trends in the rise and fall of cases
over time, highlighting critical periods such as the peaks of the first and second waves.
Recovery and mortality patterns varied significantly among states, with some achieving
high recovery rates while others showed relatively higher fatality ratios. Pie charts and
bar graphs provided a comparative view of the burden across states, emphasizing the
uneven impact of the pandemic in India. The analysis also helped identify outliers, data
inconsistencies, and states with efficient healthcare responses. Overall, the EDA offered
valuable insights that can support data-driven decision-making and better preparedness
for future health emergencies.
EX.NO: 03 EDA ON YOUTUBE TRENDING VIDEOS DATASET

AIM:
To perform Exploratory Data Analysis (EDA) on the YouTube Trending Videos dataset.
INTRODUCTION:

YouTube is one of the largest video-sharing platforms in the world, influencing entertainment, news, marketing, and public opinion on a massive scale. Understanding
what makes a video trend can offer valuable insights into audience behavior, content
performance, and digital marketing strategies. This project focuses on performing
Exploratory Data Analysis (EDA) on a YouTube Trending Videos dataset to uncover
patterns and trends in popular content.

By analyzing features such as views, likes, dislikes, comment counts, tags, and
publish times, the goal is to identify the key factors that contribute to a video becoming
viral. The dataset includes videos from different categories and regions, allowing us to
explore trends across genres and understand regional preferences. This analysis provides
a data-driven perspective on content popularity and can be useful for content creators,
marketers, and platform analysts.

OBJECTIVES:

 Analyze the distribution of trending videos across different categories.
 Identify the most frequently trending video titles, channels, and tags.
 Examine relationships between views, likes, dislikes, and comment counts.
 Determine the upload days and times when videos are most likely to trend.
 Understand audience engagement through like/dislike and comment ratios (see the sketch after this list).
 Compare the performance of videos across different countries.
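
The engagement-ratio objective can be computed directly once the dataset is loaded into df (Step 1 below); a minimal sketch, where the 'dislikes' column is an assumption (it is present in the classic YouTube trending datasets, but should be verified against this one):

# Like/dislike and comment/view ratios per video; guard against division by zero
df['like_dislike_ratio'] = df['likes'] / df['dislikes'].replace(0, 1)
df['comment_view_ratio'] = df['comment_count'] / df['views'].replace(0, 1)
print(df[['like_dislike_ratio', 'comment_view_ratio']].describe())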
DATASET: YouTube Trending Videos

DATA SOURCE LINK : https://www.kaggle.com/datasets/thedevastator/youtube-trending-videos-dataset

Perform Exploratory Data Analysis:

STEP 1 : Load the dataset using Pandas.

PROGRAM:

import pandas as pd
# Load the dataset (a CSV file, so read_csv is used)
df = pd.read_csv("/content/youtube.csv")

I. Display first five Rows.

PROGRAM:

print(df.head())

OUTPUT:
II. Display the Total Number of Rows and Columns.

PROGRAM:

# Print number of rows and columns
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])

OUTPUT:

STEP 2 : Clean missing data and remove duplicates.

I. Check for Missing Values

PROGRAM:
# Check missing values
missing_values = df.isnull().sum()
print(missing_values)

OUTPUT:
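
STEP 2 also calls for removing duplicates; a minimal sketch, where the key columns ('video_id' and 'trending_date') are an assumption about this dataset's schema; a plain df.drop_duplicates() is the safe fallback:

# The same video can trend on several days, so deduplicate on video and date
df = df.drop_duplicates(subset=['video_id', 'trending_date'])
print("Rows after removing duplicates:", df.shape[0])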
STEP 3 : Calculate:

 Most common video categories.
 Top 5 channels with the highest number of trending videos.
 Average likes, views, and comments.

I. Most common video categories.

PROGRAM:

# Count videos in each category
most_common_categories = df['category_id'].value_counts()
# Display the top 5 most common categories
print(most_common_categories.head())

OUTPUT :

PROGRAM:

# Count the most common video titles
most_common_titles = df['title'].value_counts()
# Display the top 10 most common video titles
print(most_common_titles.head(10))
OUTPUT:

II. Top 5 channels with the highest number of trending videos.

PROGRAM:

# Group by 'channel_title' and count the number of videos per channel
channel_video_count = df.groupby('channel_title').size()
# Sort and display the top 5 channels with the most videos
top_5_channels = channel_video_count.sort_values(ascending=False).head(5)
print(top_5_channels)

OUTPUT:
III. Average likes, views, and comments.

PROGRAM:

# Calculate the average of likes, comments, and views
average_likes = df['likes'].mean()
average_comments = df['comment_count'].mean()
average_views = df['views'].mean()
# Print the results
print(f'Average Likes: {average_likes}')
print(f'Average Comments: {average_comments}')
print(f'Average Views: {average_views}')

OUTPUT:

STEP 4: Visualizations
 Bar chart: Video count by category.
 Scatter plot: Likes vs. Views.

I. Bar chart: Video count by category.

PROGRAM:
import matplotlib.pyplot as plt
# Count videos per category (this series must be defined before plotting)
video_count_by_category = df['category_id'].value_counts()
# Plotting the video count by category
plt.figure(figsize=(10, 6)) # Adjust the size of the plot
video_count_by_category.plot(kind='barh', color='pink') # Horizontal bar chart
plt.title('Video Count by Category') # Title
plt.xlabel('Number of Videos') # X-axis label
plt.ylabel('Category') # Y-axis label
plt.gca().invert_yaxis() # Invert the y-axis for better visibility of top categories
plt.show()

OUTPUT:

II. Scatter plot: Likes vs. Views.

PROGRAM:

import matplotlib.pyplot as plt
# Create a scatter plot of likes vs views
plt.figure(figsize=(10, 6)) # Adjust the size of the plot
plt.scatter(df['views'], df['likes'], alpha=0.5, color='red')
# Add labels and title
plt.title('Likes vs Views on YouTube Trending Videos')
plt.xlabel('Number of Views')
plt.ylabel('Number of Likes')
# Display the plot
plt.show()
OUTPUT:

Insights: Describing trends in video popularity and engagement.

Video popularity and engagement trends have evolved significantly, driven largely
by the rise of short-form content and algorithmic curation. Platforms like TikTok,
YouTube Shorts, and Instagram Reels have popularized brief, visually engaging videos
that cater to short attention spans, resulting in higher completion rates and shareability.
Personalized recommendation systems now play a critical role in surfacing content,
meaning that creators who target niche interests often see stronger engagement.
Authentic, user-generated content continues to outperform polished productions,
especially when it fosters relatability and trust. Storytelling and interactive elements—
such as calls to comment or participate—boost viewer involvement, while live streaming
enhances real-time engagement and community building. Moreover, the mobile-first
nature of video consumption has made vertical formats and captivating intros essential
for capturing and maintaining attention. Together, these trends highlight the importance
of agility, authenticity, and platform-specific strategies in driving video success today.
Google Colab Link:

https://colab.research.google.com/drive/1jH3TkCt9cAS8Zxi60WuWrsTB7eAcO3gk?usp=sharing

RESULT:

The EDA of YouTube trending videos reveals several key patterns in video
popularity and engagement. Videos with titles that include emotionally charged or
curiosity-driven words tend to attract more views and clicks. Content in categories such
as music, entertainment, and gaming appears most frequently in the trending list,
indicating strong viewer demand. High-performing videos often have a high like-to-
dislike ratio and generate significant comment activity, suggesting that viewer
engagement is a major factor in trending status. Additionally, channels with consistent
upload schedules and a high subscriber base tend to trend more often, highlighting the
importance of audience loyalty. Lastly, video length impacts performance—shorter
videos generally trend more frequently, but longer videos (7–15 minutes) tend to sustain
higher average view durations when well-produced. These insights suggest that content
quality, emotional appeal, and audience interaction are critical to driving trends on
YouTube.
Google Cloud Data Analytics Course Completion Certificate:
