
Google Cloud Data Analytics Lab Experiments

Experiment 1: Simple Sales Report


Aim:

To generate a sales report by cleaning missing data, computing total sales, and identifying top-performing products.

Algorithm:

- Load the dataset into a Pandas DataFrame.

- Fill missing values in the 'Price' column with the average price of the respective product.

- Create a new column: Total_Sales = Quantity × Price.

- Identify the product with the highest total sales.

- Visualize total sales by product using a bar chart.

Procedure:

- Open Google Colab and upload the dataset or use sample data.

- Handle missing 'Price' values by filling in average prices using group-by.

- Create a new column for total sales.

- Group by product and sum total sales.

- Identify the highest selling product.

- Plot a bar chart using matplotlib.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Sample sales data with missing prices.
data = {
    'Product': ['Pen', 'Pencil', 'Notebook', 'Pen', 'Pencil', 'Notebook'],
    'Quantity': [10, 15, 5, 12, 18, 7],
    'Price': [5, None, 20, 5, 3, None]
}
df = pd.DataFrame(data)

# Fill missing prices with the mean price of the same product.
df['Price'] = df.groupby('Product')['Price'].transform(lambda x: x.fillna(x.mean()))

# Total sales per row, then summed per product.
df['Total_Sales'] = df['Quantity'] * df['Price']
sales_by_product = df.groupby('Product')['Total_Sales'].sum()
top_product = sales_by_product.idxmax()

# Bar chart of product-wise totals.
sales_by_product.plot(kind='bar', title='Total Sales by Product')
plt.ylabel('Total Sales')
plt.show()
print("Product with highest total sales:", top_product)

Sample Output:

Product with highest total sales: Notebook

Result:

The program successfully computes and visualizes product-wise sales and identifies the top-selling item.

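The Procedure above also allows uploading a dataset rather than using the inline sample. A minimal sketch of that path, using an in-memory CSV string as a stand-in for the uploaded file (the contents below simply mirror the sample data; in Colab you would pass the uploaded file's path to pd.read_csv instead):

```python
import io

import pandas as pd

# Hypothetical contents of an uploaded sales.csv; blank Price fields
# become NaN when parsed, just like None in the sample dictionary.
csv_text = """Product,Quantity,Price
Pen,10,5
Pencil,15,
Notebook,5,20
Pen,12,5
Pencil,18,3
Notebook,7,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same cleaning and aggregation steps as in the experiment's code.
df['Price'] = df.groupby('Product')['Price'].transform(lambda x: x.fillna(x.mean()))
df['Total_Sales'] = df['Quantity'] * df['Price']
top_product = df.groupby('Product')['Total_Sales'].sum().idxmax()
print("Product with highest total sales:", top_product)
```

Because the CSV mirrors the sample data, this produces the same result as the dictionary-based version.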
Experiment 2: Daily Temperature Tracker
Aim:

To process temperature data, handle missing values, and visualize average temperature trends over time.

Algorithm:

- Load the temperature dataset.

- Fill missing values in Min_Temp and Max_Temp with their column means.

- Calculate Average_Temp = (Min_Temp + Max_Temp)/2.

- Find the date with the highest average temperature.

- Plot a line graph of average temperature over time.

Procedure:

- Load the dataset with dates, min temp, and max temp.

- Use fillna() to replace nulls with column averages.

- Compute Average_Temp and add to the DataFrame.

- Use idxmax() to find the date with the highest average.

- Plot temperature trends over time.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Five days of temperature readings with missing values.
data = {
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Min_Temp': [21, 23, None, 22, 25],
    'Max_Temp': [30, None, 35, 31, 34]
}
df = pd.DataFrame(data)

# Fill missing readings with the column means (assignment is used
# instead of inplace=True, which is deprecated for chained fillna calls).
df['Min_Temp'] = df['Min_Temp'].fillna(df['Min_Temp'].mean())
df['Max_Temp'] = df['Max_Temp'].fillna(df['Max_Temp'].mean())

# Daily average and the hottest day.
df['Average_Temp'] = (df['Min_Temp'] + df['Max_Temp']) / 2
hottest_day = df.loc[df['Average_Temp'].idxmax(), 'Date']

plt.plot(df['Date'], df['Average_Temp'], marker='o')
plt.title("Average Temperature Over Time")
plt.xlabel("Date")
plt.ylabel("Average Temp")
plt.grid(True)
plt.show()
print("Date with highest average temperature:", hottest_day.date())

Sample Output:

Date with highest average temperature: 2023-01-05

Result:

The trend line provides a visual representation of temperature changes, and the hottest day is identified.

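Mean-filling ignores the time ordering of the readings. For time series like this, linear interpolation between neighbouring days is a common alternative; a sketch on the same sample data (not part of the original experiment):

```python
import pandas as pd

# Same temperature data as in the experiment.
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Min_Temp': [21, 23, None, 22, 25],
    'Max_Temp': [30, None, 35, 31, 34]
})

# interpolate() fills each gap from its neighbouring values instead of
# the global column mean, so the fill respects the day-to-day trend.
df[['Min_Temp', 'Max_Temp']] = df[['Min_Temp', 'Max_Temp']].interpolate()
df['Average_Temp'] = (df['Min_Temp'] + df['Max_Temp']) / 2
print(df['Average_Temp'].tolist())
```

Here the missing Min_Temp on day 3 becomes 22.5 (midway between 23 and 22) rather than the column mean of 22.75; the hottest day is unchanged.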

Experiment 3: COVID-19 Daily Cases


Aim:

To analyze COVID-19 daily case data by cleaning missing values and visualizing trends.

Algorithm:

- Load dataset with Date and Cases.

- Fill missing case values with 0.

- Calculate total and average daily cases.

- Find the date with the highest case count.

- Plot a line chart of daily cases.

Procedure:

- Import dataset into a DataFrame.

- Use fillna(0) for missing cases.

- Use sum() and mean() to compute totals.

- Use idxmax() for the peak day.

- Plot the data as a line graph.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Daily case counts with missing entries.
data = {
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Cases': [100, None, 250, 400, None]
}
df = pd.DataFrame(data)

# Treat missing days as zero cases (assignment instead of the
# deprecated chained inplace=True pattern).
df['Cases'] = df['Cases'].fillna(0)

# Summary statistics and the peak day.
total_cases = df['Cases'].sum()
average_cases = df['Cases'].mean()
peak_day = df.loc[df['Cases'].idxmax(), 'Date']

plt.plot(df['Date'], df['Cases'], marker='o')
plt.title("COVID-19 Daily Cases")
plt.xlabel("Date")
plt.ylabel("Cases")
plt.grid(True)
plt.show()
print("Total cases:", total_cases)
print("Average daily cases:", average_cases)
print("Date with highest number of cases:", peak_day.date())

Sample Output:

Total cases: 750.0

Average daily cases: 150.0

Date with highest number of cases: 2023-01-04

Result:

The program correctly shows the daily case trend and identifies the peak infection date with a clear graph.

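Daily case counts are often noisy, so a rolling average is a standard way to smooth the trend line. A sketch on the same sample data (the 3-day window is an arbitrary illustrative choice, not part of the original experiment):

```python
import pandas as pd

# Same case data as in the experiment, with missing days set to 0.
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Cases': [100, None, 250, 400, None]
})
df['Cases'] = df['Cases'].fillna(0)

# 3-day rolling mean; min_periods=1 keeps the first days defined
# instead of NaN while the window is still filling up.
df['Rolling_Avg'] = df['Cases'].rolling(window=3, min_periods=1).mean()
print(df['Rolling_Avg'].tolist())
```

Plotting Rolling_Avg alongside Cases on the same axes would show the smoothed trend against the raw daily counts.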
Experiment 4: Movie Ratings Dataset
Aim:

To analyze movie ratings and identify top movies based on viewer feedback.

Algorithm:

- Load the movie dataset.

- Remove entries with missing ratings.

- Calculate the average rating.

- Find the top 3 movies with the highest ratings.

- Display a bar chart of top 5 movies.

Procedure:

- Load the movie dataset.

- Drop rows with null ratings.

- Use mean() to get average rating.

- Use nlargest() to get top movies.

- Visualize with matplotlib.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Movie ratings; one rating is missing and its row will be dropped.
data = {
    'Movie_Name': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E', 'Movie F'],
    'Viewer_Rating': [4.5, 4.8, None, 4.2, 4.9, 4.3]
}
df = pd.DataFrame(data)
df.dropna(inplace=True)

# Overall average and the highest-rated movies.
average_rating = df['Viewer_Rating'].mean()
top_movies = df.nlargest(3, 'Viewer_Rating')
top_5 = df.nlargest(5, 'Viewer_Rating')

# Bar chart of the five highest-rated movies.
plt.bar(top_5['Movie_Name'], top_5['Viewer_Rating'], color='skyblue')
plt.title("Top 5 Movie Ratings")
plt.ylabel("Rating")
plt.xticks(rotation=45)
plt.show()
print("Average Rating:", average_rating)
print("Top 3 Movies:")
print(top_movies[['Movie_Name', 'Viewer_Rating']])

Sample Output:

Average Rating: 4.54

Top 3 Movies:

  Movie_Name  Viewer_Rating
4    Movie E            4.9
1    Movie B            4.8
0    Movie A            4.5

Result:

The program identifies and displays the top 3 movies with a supporting bar chart visualization.

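nlargest(3, 'Viewer_Rating') is equivalent to sorting the column in descending order and taking the first three rows; a quick check on the same sample data:

```python
import pandas as pd

# Same ratings data as in the experiment, with the null row dropped.
df = pd.DataFrame({
    'Movie_Name': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E', 'Movie F'],
    'Viewer_Rating': [4.5, 4.8, None, 4.2, 4.9, 4.3]
}).dropna()

top_3 = df.nlargest(3, 'Viewer_Rating')
# Equivalent spelling with an explicit sort.
top_3_sorted = df.sort_values('Viewer_Rating', ascending=False).head(3)
print(top_3['Movie_Name'].tolist())
```

Both spellings yield Movie E, Movie B, Movie A; nlargest is simply the more direct idiom for "top N by column".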
Experiment 5: Online Course Completion Data
Aim:

To analyze student course completion status and visualize completion vs non-completion.

Algorithm:

- Load the dataset.

- Replace missing Completion_Status with 'No'.

- Count 'Yes' and 'No' entries.

- Plot a pie chart of the results.

Procedure:

- Load the dataset into Pandas.

- Replace nulls in Completion_Status with 'No'.

- Use value_counts() to count 'Yes' and 'No'.

- Visualize using a pie chart.

Code:
import pandas as pd
import matplotlib.pyplot as plt

# Completion records; a missing status is treated as not completed.
data = {
    'Student_ID': [101, 102, 103, 104, 105],
    'Completion_Status': ['Yes', None, 'No', 'Yes', None]
}
df = pd.DataFrame(data)

# Assignment instead of the deprecated chained inplace=True pattern.
df['Completion_Status'] = df['Completion_Status'].fillna("No")

# Count completions vs non-completions.
completion_count = df['Completion_Status'].value_counts()

plt.pie(completion_count, labels=completion_count.index, autopct='%1.1f%%',
        startangle=140)
plt.title("Course Completion vs Non-Completion")
plt.axis('equal')
plt.show()
print("Course Completion Counts:")
print(completion_count)

Sample Output:
Course Completion Counts:

No     3
Yes    2
Name: Completion_Status, dtype: int64

Result:

The pie chart clearly shows the distribution of student completion status.
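The percentages shown on the pie chart can also be computed directly with value_counts(normalize=True); a sketch on the same sample data:

```python
import pandas as pd

# Same completion data as in the experiment.
df = pd.DataFrame({
    'Student_ID': [101, 102, 103, 104, 105],
    'Completion_Status': ['Yes', None, 'No', 'Yes', None]
})
df['Completion_Status'] = df['Completion_Status'].fillna('No')

# normalize=True returns fractions instead of raw counts:
# No = 3/5 = 0.6, Yes = 2/5 = 0.4.
shares = df['Completion_Status'].value_counts(normalize=True)
print(shares.to_dict())
```

This is a convenient way to verify the 60.0% / 40.0% labels that autopct prints on the chart.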
