
1.

Here is a simplified Python script to address the given problem. The code assumes you have a dataset (e.g., a CSV file) with student names and their scores in various subjects (e.g., Math, Science, English).

### Python Code

```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a Pandas DataFrame
# Replace 'your_file.csv' with the path to your dataset
df = pd.read_csv('your_file.csv')

# Handle missing values by replacing them with the mean of the respective numeric column
df.fillna(df.mean(numeric_only=True), inplace=True)

# Calculate the average score for each student
df['Average_Score'] = df.iloc[:, 1:].mean(axis=1)  # Assuming the first column is student names

# Categorize students into performance levels
def categorize_performance(avg_score):
    if avg_score >= 80:
        return 'High'
    elif avg_score >= 50:
        return 'Medium'
    else:
        return 'Low'

df['Performance_Category'] = df['Average_Score'].apply(categorize_performance)

# Identify the subject with the highest average score across students
subject_avg_scores = df.iloc[:, 1:-2].mean()  # Excludes the name column and the two derived columns
highest_avg_subject = subject_avg_scores.idxmax()

# Determine the number of students in each performance category
category_counts = df['Performance_Category'].value_counts()

# Visualization: Bar chart for average score per subject
subject_avg_scores.plot(kind='bar', title='Average Score Per Subject', ylabel='Average Score',
                        xlabel='Subjects', color='skyblue')
plt.show()

# Visualization: Pie chart for performance category distribution
category_counts.plot(kind='pie', autopct='%1.1f%%', title='Performance Category Distribution',
                     ylabel='')
plt.show()

# Print key results
print(f"Subject with the highest average score: {highest_avg_subject}")
print("Performance category counts:")
print(category_counts)
```

### Explanation

1. **Data Loading and Cleaning:**

- Loads a CSV file into a Pandas DataFrame.

- Handles missing values by replacing them with the column mean.

2. **Data Manipulation:**

- Calculates the average score for each student.

- Categorizes students based on their average score into "High," "Medium," or "Low" (an equivalent vectorized version is sketched after this list).

3. **Analysis:**

- Finds the subject with the highest average score across all students.

- Counts the number of students in each performance category.

4. **Visualization:**

- Creates a bar chart showing the average scores for each subject.

- Creates a pie chart showing the percentage of students in each performance category.
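
The same High/Medium/Low labels can also be produced without an explicit function by using `pd.cut` with the thresholds above. This is only an alternative sketch and assumes the `df` and `Average_Score` column from the script:

```python
import pandas as pd

# Vectorized categorization equivalent to categorize_performance
# Bins: [-inf, 50) -> Low, [50, 80) -> Medium, [80, inf) -> High
df['Performance_Category'] = pd.cut(
    df['Average_Score'],
    bins=[-float('inf'), 50, 80, float('inf')],
    labels=['Low', 'Medium', 'High'],
    right=False  # left-closed intervals so a score of exactly 80 is 'High'
)
```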

### Instructions to Run

1. Upload your dataset (e.g., `your_file.csv`) to Google Colab.

2. Replace `'your_file.csv'` in the code with the actual file path.

3. Run the code cells step-by-step in Google Colab.
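
If you don't have a scores file handy, you can generate a small test CSV first. This is a minimal sketch; the column names `Name`, `Math`, `Science`, `English` and the values are illustrative and should match whatever your dataset actually uses:

```python
import pandas as pd

# Create a small sample dataset for testing (illustrative values only)
sample = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Chitra', 'Dev'],
    'Math': [92, 55, 40, 78],
    'Science': [88, 61, 45, 82],
    'English': [75, 58, None, 69],  # one missing value to exercise fillna
})
sample.to_csv('your_file.csv', index=False)
```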

### Sample Output (Assuming Example Dataset)


**Bar Chart:**

Displays a bar chart with average scores for Math, Science, and English.

**Pie Chart:**

Shows a pie chart with categories like "High" (30%), "Medium" (50%), and "Low" (20%).

**Console Output:**

- Subject with the highest average score: `Science`

- Performance category counts:

```

Medium 5

High 3

Low 2

Name: Performance_Category, dtype: int64

```
Here’s a concise Python script that you can run in Google Colab to analyze a COVID-19 dataset as described in the question.

### Python Code

```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
# Replace 'covid_data.csv' with the path to your dataset
df = pd.read_csv('covid_data.csv')

# Handle missing values and duplicates
df.fillna(0, inplace=True)
df.drop_duplicates(inplace=True)

# Parse the 'Date' column and sort so per-country differences are meaningful
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(['Country', 'Date'], inplace=True)

# Add a new column for daily new cases (difference of each country's cumulative total)
df['New_Cases'] = df.groupby('Country')['Total_Cases'].diff().fillna(0)

# Extract 'Date' into separate columns for Year, Month, and Day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Calculate total cases and deaths globally
# (Total_Cases/Total_Deaths are cumulative, so take each country's latest figure and sum)
total_cases = df.groupby('Country')['Total_Cases'].max().sum()
total_deaths = df.groupby('Country')['Total_Deaths'].max().sum()

# Identify the country with the highest number of cases and deaths
country_cases = df.groupby('Country')['Total_Cases'].max()
country_deaths = df.groupby('Country')['Total_Deaths'].max()
highest_cases_country = country_cases.idxmax()
highest_deaths_country = country_deaths.idxmax()

# Analyze daily new cases trend (last 30 days)
last_30_days = df[df['Date'] >= (df['Date'].max() - pd.Timedelta(days=30))]

# Visualization: Line chart for total cases trend
df.groupby('Date')['Total_Cases'].sum().plot(kind='line', title='Trend of Total COVID-19 Cases Over Time',
                                             ylabel='Total Cases', xlabel='Date')
plt.show()

# Bar chart for top 5 countries with the highest cases
top_5_countries = country_cases.nlargest(5)
top_5_countries.plot(kind='bar', title='Top 5 Countries with Highest Cases', ylabel='Total Cases',
                     xlabel='Countries', color='orange')
plt.show()

# Pie chart for proportion of cases by continent (latest figure per country, summed by continent)
continent_cases = df.groupby(['Continent', 'Country'])['Total_Cases'].max().groupby(level='Continent').sum()
continent_cases.plot(kind='pie', autopct='%1.1f%%', title='Proportion of Cases by Continent', ylabel='')
plt.show()

# Print key results
print(f"Total cases globally: {total_cases}")
print(f"Total deaths globally: {total_deaths}")
print(f"Country with highest cases: {highest_cases_country}")
print(f"Country with highest deaths: {highest_deaths_country}")
```
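
The `last_30_days` frame above is prepared but not plotted. A minimal sketch of one way to visualize it, assuming the same column names and running in the same notebook:

```python
# Line chart of global daily new cases over the last 30 days
last_30_days.groupby('Date')['New_Cases'].sum().plot(
    kind='line', title='Daily New Cases (Last 30 Days)',
    ylabel='New Cases', xlabel='Date')
plt.show()
```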

---

### Explanation

1. **Data Loading and Cleaning:**

- The dataset is loaded into a DataFrame, missing values are replaced with 0, and duplicates are
dropped.

2. **Data Manipulation:**

- Calculates daily new cases (`New_Cases`) as the day-over-day difference of each country's cumulative total.

- Extracts `Year`, `Month`, and `Day` from the `Date` column for analysis.

3. **Analysis:**

- Computes total global cases and deaths from each country's latest cumulative figures.

- Identifies the countries with the highest cases and deaths.

- Filters data for the last 30 days to analyze trends.

4. **Visualization:**

- **Line Chart:** Shows the trend of total cases over time.

- **Bar Chart:** Displays the top 5 countries with the highest cases.

- **Pie Chart:** Shows the proportion of cases by continent.

---
### Instructions to Run

1. Upload your dataset (e.g., `covid_data.csv`) to Google Colab.

2. Replace `'covid_data.csv'` in the code with the file name.

3. Run each code cell step-by-step to load, analyze, and visualize the data.
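
If you don't have a COVID-19 dataset available, a small synthetic file can be generated for a dry run. This is only a sketch; the columns `Date`, `Country`, `Continent`, `Total_Cases`, `Total_Deaths` are assumptions matching the code above, and the numbers are random:

```python
import pandas as pd
import numpy as np

# Build a tiny synthetic dataset: 60 days of cumulative counts for two countries
dates = pd.date_range('2021-01-01', periods=60, freq='D')
rows = []
for country, continent, scale in [('USA', 'North America', 1000), ('Brazil', 'South America', 600)]:
    cases = np.cumsum(np.random.randint(0, scale, size=len(dates)))
    deaths = (cases * 0.02).astype(int)
    rows.append(pd.DataFrame({'Date': dates, 'Country': country, 'Continent': continent,
                              'Total_Cases': cases, 'Total_Deaths': deaths}))
pd.concat(rows).to_csv('covid_data.csv', index=False)
```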

---

### Sample Output (Assuming Example Dataset)

**Console Output:**

```

Total cases globally: 500,000,000

Total deaths globally: 5,000,000

Country with highest cases: USA

Country with highest deaths: Brazil

```

**Visualizations:**

1. Line chart showing the rising trend of total cases globally.

2. Bar chart highlighting the top 5 countries with the highest total cases.

3. Pie chart dividing the proportion of cases by continent.


Here’s a simple Python script that you can run in Google Colab to analyze a sales dataset as described in the question.

### Python Code

```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
# Replace 'sales_data.csv' with the path to your dataset
df = pd.read_csv('sales_data.csv')

# Handle missing values and duplicates
df.fillna(0, inplace=True)
df.drop_duplicates(inplace=True)

# Add a new column for total revenue
df['Total_Revenue'] = df['Quantity'] * df['Price']

# Group by product category to calculate total revenue and number of items sold
category_summary = df.groupby('Product_Category').agg(
    Total_Revenue=('Total_Revenue', 'sum'),
    Total_Quantity=('Quantity', 'sum')
)

# Identify the top 3 products generating the highest revenue
top_products = df.groupby('Product').agg(Total_Revenue=('Total_Revenue', 'sum')).nlargest(3, 'Total_Revenue')

# Determine the month with the highest total sales
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month').agg(Total_Revenue=('Total_Revenue', 'sum'))
highest_sales_month = monthly_sales['Total_Revenue'].idxmax()

# Visualization: Bar chart for total revenue by product category
category_summary['Total_Revenue'].plot(kind='bar', title='Total Revenue by Product Category',
                                       ylabel='Total Revenue', xlabel='Product Category', color='green')
plt.show()

# Visualization: Line graph for monthly sales trends
monthly_sales.plot(kind='line', title='Monthly Sales Trends', ylabel='Total Revenue', xlabel='Month',
                   marker='o', color='blue')
plt.show()

# Print key results
print("Top 3 products generating highest revenue:")
print(top_products)
print(f"Month with highest total sales: {highest_sales_month}")
```

---

### Explanation

1. **Data Loading and Cleaning:**

- Loads the sales dataset into a Pandas DataFrame.

- Handles missing values by replacing them with 0 and removes duplicate entries.

2. **Data Manipulation:**

- Calculates `Total_Revenue` for each transaction as `Quantity × Price`.

- Groups the data by `Product_Category` to calculate total revenue and number of items sold.

3. **Analysis:**

- Identifies the top 3 products generating the highest revenue.

- Determines the month with the highest total sales.

4. **Visualization:**

- **Bar Chart:** Displays total revenue by product category.

- **Line Graph:** Shows monthly sales trends.

---
### Instructions to Run

1. Upload your dataset (e.g., `sales_data.csv`) to Google Colab.

2. Replace `'sales_data.csv'` in the code with your dataset's filename.

3. Run each code cell step-by-step to analyze and visualize the data.
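
As with the other examples, a small synthetic sales file can be created for testing. This is only a sketch; the columns `Date`, `Product`, `Product_Category`, `Quantity`, `Price` are assumptions matching the code above:

```python
import pandas as pd

# Tiny illustrative sales dataset (made-up values)
sample = pd.DataFrame({
    'Date': ['2024-04-03', '2024-05-12', '2024-05-20', '2024-06-01'],
    'Product': ['Product_A', 'Product_B', 'Product_A', 'Product_C'],
    'Product_Category': ['Electronics', 'Furniture', 'Electronics', 'Furniture'],
    'Quantity': [3, 1, 5, 2],
    'Price': [250.0, 800.0, 250.0, 150.0],
})
sample.to_csv('sales_data.csv', index=False)
```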

---

### Sample Output (Assuming Example Dataset)

**Console Output:**

```

Top 3 products generating highest revenue:

Total_Revenue

Product

Product_A 100000.00

Product_B 80000.00

Product_C 75000.00

Month with highest total sales: 2024-05

```

**Visualizations:**

1. **Bar Chart:** Shows total revenue for categories like "Electronics," "Furniture," etc.

2. **Line Graph:** Displays sales trends over months with peaks and valleys.
Below is a Python code template to solve the tourism data analysis problem described. You'll need a tourism dataset in CSV format to run it. The code includes the required steps, explanations, and instructions to execute it in Google Colab.

### Code

```python
# Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Load the Dataset
# Replace 'tourism_data.csv' with your actual file name
from google.colab import files
uploaded = files.upload()  # Upload the dataset
data = pd.read_csv(list(uploaded.keys())[0])

# Step 3: Data Cleaning
data.drop_duplicates(inplace=True)  # Remove duplicate rows
data.dropna(inplace=True)           # Drop rows with missing values

# Step 4: Data Manipulation
# Add Total Visitors column
data['Total_Visitors'] = data['Domestic_Visitors'] + data['International_Visitors']

# Extract year and month from the 'Date' column
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month

# Step 5: Analysis
# Identify the record (month) with the highest total visitors
highest_month = data.loc[data['Total_Visitors'].idxmax()]

# Calculate the average number of visitors per year
average_visitors_per_year = data.groupby('Year')['Total_Visitors'].mean()

# Proportion of domestic vs international visitors by year
proportion = data.groupby('Year')[['Domestic_Visitors', 'International_Visitors']].sum()
total_by_year = proportion['Domestic_Visitors'] + proportion['International_Visitors']
proportion['Domestic_Proportion'] = proportion['Domestic_Visitors'] / total_by_year
proportion['International_Proportion'] = proportion['International_Visitors'] / total_by_year

# Step 6: Visualization
# Bar Chart - Total Visitors per Month
monthly_totals = data.groupby('Month')['Total_Visitors'].sum()
monthly_totals.plot(kind='bar', title='Total Visitors Per Month', ylabel='Visitors', xlabel='Month')
plt.show()

# Pie Chart - Proportion of Domestic vs International Visitors (latest year)
latest_year = data['Year'].max()
latest_data = proportion.loc[latest_year]
latest_data[['Domestic_Proportion', 'International_Proportion']].plot(
    kind='pie', autopct='%1.1f%%',
    title=f'Domestic vs International Visitors ({latest_year})', ylabel='')
plt.show()

# Line Graph - Trend of Total Visitors Over the Years
yearly_totals = data.groupby('Year')['Total_Visitors'].sum()
yearly_totals.plot(kind='line', title='Total Visitors Over the Years', ylabel='Visitors', xlabel='Year')
plt.show()

# Step 7: Output Results
print("Month with Highest Total Visitors:")
print(highest_month)
print("\nAverage Visitors Per Year:")
print(average_visitors_per_year)
print("\nProportion of Domestic vs International Visitors by Year:")
print(proportion)
```

---

### Instructions to Run in Google Colab

1. **Upload the Dataset**: Replace the placeholder `tourism_data.csv` with your dataset. When you
run the `files.upload()` block, it will prompt you to upload your file.

2. **Install Required Libraries**: Google Colab already includes `pandas` and `matplotlib`. No
additional installations are necessary.
3. **Run the Cells**: Copy and paste the code into Google Colab and execute each cell sequentially.

---

### Explanation of the Code

1. **Data Cleaning**: Handles duplicates and missing values to ensure data consistency.

2. **Data Manipulation**: Calculates the total visitors and extracts `Year` and `Month` for analysis.

3. **Analysis**:

- Finds the month with the highest visitors.

- Calculates average yearly visitors.

- Analyzes proportions of domestic and international visitors.

4. **Visualization**: Uses bar, pie, and line plots to display results graphically.

---

### Output

The output includes:

1. **Text Outputs**:

- The month with the highest visitors.

- The average number of visitors per year.

- Proportion data.

2. **Graphs**:

- A bar chart for total visitors per month.

- A pie chart for domestic vs international visitor proportions.

- A line graph showing the trend of total visitors over the years.

Run the code to view the exact outputs based on your dataset. If you'd like me to adjust the code or
work with a sample dataset, let me know!
Here's an example of what the output might look like with a fictional tourism dataset. This will give you an idea of the expected results:

### **Sample Text Output**

#### **Month with Highest Total Visitors**

```

Date 2023-07-01

Domestic_Visitors 500,000

International_Visitors 300,000

Total_Visitors 800,000

Year 2023

Month 7

Name: 189, dtype: object

```

#### **Average Visitors Per Year**

```

Year

2019 450,000.0

2020 200,000.0

2021 350,000.0

2022 500,000.0

2023 600,000.0

Name: Total_Visitors, dtype: float64

```

#### **Proportion of Domestic vs International Visitors by Year**

```
      Domestic_Visitors  International_Visitors  Domestic_Proportion  International_Proportion
Year
2019          2,000,000                 700,000                0.740                     0.260
2020          1,200,000                 500,000                0.706                     0.294
2021          1,500,000                 700,000                0.682                     0.318
2022          2,000,000               1,000,000                0.667                     0.333
2023          2,500,000               1,300,000                0.658                     0.342

```

---

### **Sample Visualizations**

1. **Bar Chart: Total Visitors Per Month**

A bar chart showing total visitors for each month, with July as the peak month.

2. **Pie Chart: Proportion of Domestic vs International Visitors (2023)**

A pie chart for 2023 might show:

- **65.8% Domestic Visitors**

- **34.2% International Visitors**

3. **Line Graph: Total Visitors Over the Years**

A line graph showing a general upward trend in tourism, with a dip in 2020 (likely due to external
factors like a pandemic) and steady growth afterward.

---

### **Key Notes**

- The outputs will vary depending on your dataset.


- If you don't have real tourism data, you can simulate it by creating a CSV file with columns like `Date`, `Domestic_Visitors`, and `International_Visitors` (see the sketch after these notes).

- Let me know if you’d like help generating sample data for testing!
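
A minimal sketch for simulating such a file (monthly rows with `Date`, `Domestic_Visitors`, and `International_Visitors`; the value ranges are made up):

```python
import pandas as pd
import numpy as np

# Simulate monthly tourism data for 2019-2023
dates = pd.date_range('2019-01-01', '2023-12-01', freq='MS')  # month-start dates
rng = np.random.default_rng(42)
sample = pd.DataFrame({
    'Date': dates,
    'Domestic_Visitors': rng.integers(100_000, 500_000, size=len(dates)),
    'International_Visitors': rng.integers(30_000, 300_000, size=len(dates)),
})
sample.to_csv('tourism_data.csv', index=False)
```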

Here is how you can run the code on **Google Colab**, step-by-step:

---

### **Step 1: Open Google Colab**

1. Go to [Google Colab](https://colab.research.google.com/).

2. Create a new notebook by clicking on **"File > New Notebook"**.

---

### **Step 2: Upload the Dataset**

1. Save your dataset (e.g., `bank_campaign_data.csv`) on your local machine.


2. In Google Colab, click on the folder icon in the left sidebar.

3. Click the upload icon and upload your dataset.

---

### **Step 3: Run the Code**

1. Copy and paste the following Python code into a code cell in Colab:

```python
# Import necessary libraries (pre-installed in Google Colab)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load the dataset
from google.colab import files
uploaded = files.upload()  # Upload the dataset here
file_path = list(uploaded.keys())[0]  # Get the uploaded file name
data = pd.read_csv(file_path)

# Step 2: Data Cleaning
# Handle missing values (forward fill)
data.ffill(inplace=True)

# Drop duplicate entries
data.drop_duplicates(inplace=True)

# Step 3: Data Manipulation
# Add a column for Contacted_Last_Month
data['Contacted_Last_Month'] = data['campaign'].apply(lambda x: 'Yes' if x > 0 else 'No')

# Convert categorical variables to numeric using one-hot encoding
# (kept in a separate DataFrame so the original text columns stay available for the analysis below)
categorical_cols = ['job', 'marital', 'education']
data_encoded = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Step 4: Analysis
# Average age of customers who subscribed
avg_age = data[data['y'] == 'yes']['age'].mean()

# Most common job category for subscribed customers
most_common_job = data[data['y'] == 'yes']['job'].mode()[0]

# Proportion of subscribed customers
subscribed_proportion = len(data[data['y'] == 'yes']) / len(data)

# Step 5: Visualization
# Bar chart showing subscription rate by job
sns.countplot(x='job', hue='y', data=data)
plt.title('Subscription Rate by Job')
plt.xticks(rotation=45)
plt.show()

# Pie chart showing subscription proportion
# (labels assume 'no' is the majority class, so it appears first in value_counts)
data['y'].value_counts().plot.pie(autopct='%1.1f%%', labels=['Not Subscribed', 'Subscribed'])
plt.title('Subscription Proportion')
plt.ylabel('')
plt.show()

# Histogram for age distribution
data['age'].plot.hist(bins=10)
plt.title('Distribution of Customer Ages')
plt.xlabel('Age')
plt.show()

# Print analysis results
print(f"Average Age of Subscribed Customers: {avg_age:.2f}")
print(f"Most Common Job for Subscribed Customers: {most_common_job}")
print(f"Proportion of Subscribed Customers: {subscribed_proportion:.2%}")
```

2. Run the cell.

3. When prompted, upload your dataset (e.g., `bank_campaign_data.csv`).

---

### **Sample Output**:

1. The console will display:

```

Average Age of Subscribed Customers: 41.20

Most Common Job for Subscribed Customers: admin

Proportion of Subscribed Customers: 12.50%

```

2. Visualizations:

- **Bar Chart**: Subscription rate by job category.

- **Pie Chart**: Proportion of subscribed vs. not subscribed customers.


- **Histogram**: Age distribution of customers.

---

### **Note**:

Make sure your dataset includes the necessary columns like `age`, `job`, `campaign`, `y`, and other
required fields. Adjust column names in the code if they differ in your dataset.
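
If you want to try the code without the real campaign file, a minimal synthetic `bank_campaign_data.csv` can be generated. This is only a sketch containing just the columns the script touches (`age`, `job`, `marital`, `education`, `campaign`, `y`), with made-up values:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the bank marketing dataset (illustrative values only)
rng = np.random.default_rng(0)
n = 200
sample = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'job': rng.choice(['admin', 'technician', 'blue-collar', 'services'], size=n),
    'marital': rng.choice(['single', 'married', 'divorced'], size=n),
    'education': rng.choice(['primary', 'secondary', 'tertiary'], size=n),
    'campaign': rng.integers(1, 6, size=n),
    'y': rng.choice(['yes', 'no'], size=n, p=[0.12, 0.88]),
})
sample.to_csv('bank_campaign_data.csv', index=False)
```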
