
ML Lab A1 A4

The document provides Python code for exploratory data analysis on two datasets, focusing on observations, features, occupations, and various metrics related to a football tournament. It also includes a function for calculating regression error metrics such as SSE, MSE, RMSE, and R2 score using actual and predicted values. The code demonstrates data manipulation and analysis using pandas and sklearn libraries.

Uploaded by

safiapathan03

A1).

Load the dataset from the file below and write Python code to answer the following exploratory
analysis questions:
a) How many observations are there in this dataset
num_observations = len(df)  # df is the DataFrame loaded from the CSV file

print(f"There are {num_observations} observations in the dataset.")

b) How many various features are there in the dataset


num_features = len(df.columns)

print(f"There are {num_features} features in the dataset.")

c) How many different occupations (unique) are there in the dataset.


num_unique_occupations = df['Occupation'].nunique()

print(f"There are {num_unique_occupations} different occupations in the dataset.")

d) What occupation is the most common.


most_common_occupation = df['Occupation'].mode()[0]

print(f"The most common occupation is: {most_common_occupation}")

e) What is the average age of all the people in this dataset


average_age = df['Age'].mean()

print(f"The average age of all people in the dataset is: {average_age:.2f}")

f) What is the average age of people in each occupation group


average_age_per_occupation = df.groupby('Occupation')['Age'].mean()
print("Average age of people in each occupation group:")
print(average_age_per_occupation)

g) What are the occupations of the youngest and oldest people in this dataset

youngest_person_age = df['Age'].min()
youngest_person_occupation = df[df['Age'] == youngest_person_age]['Occupation'].iloc[0]

oldest_person_age = df['Age'].max()
oldest_person_occupation = df[df['Age'] == oldest_person_age]['Occupation'].iloc[0]

print(f"The occupation of the youngest person is: {youngest_person_occupation}")


print(f"The occupation of the oldest person is: {oldest_person_occupation}")
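
The A1 snippets assume a DataFrame df with Occupation and Age columns is already loaded. As a minimal sketch, the same calls can be run end to end on a small made-up table (the rows below are hypothetical, not taken from the lab dataset). For part g) this version lists every occupation found at the minimum or maximum age, so ties are not silently dropped:

```python
import pandas as pd

# Hypothetical toy data standing in for the lab dataset.
df = pd.DataFrame({
    "Age": [24, 53, 23, 24, 33, 53, 41],
    "Occupation": ["technician", "other", "writer", "technician",
                   "other", "educator", "technician"],
})

num_observations = len(df)                                   # a)
num_features = len(df.columns)                               # b)
num_unique_occupations = df["Occupation"].nunique()          # c)
most_common_occupation = df["Occupation"].mode()[0]          # d)
average_age = df["Age"].mean()                               # e)
average_age_per_occupation = df.groupby("Occupation")["Age"].mean()  # f)

# g) All occupations found at the minimum and maximum age (handles ties).
youngest = sorted(df.loc[df["Age"] == df["Age"].min(), "Occupation"].unique())
oldest = sorted(df.loc[df["Age"] == df["Age"].max(), "Occupation"].unique())

print(youngest, oldest)
```
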

A2). Load the dataset from the file below and write Python code to answer the following exploratory
analysis questions:
a) How many teams participated in this tournament.
b) List top two teams with high discipline and bottom two teams with low discipline (you can
consider red and yellow cards to calculate discipline)
c) On average, how many yellow cards are given per team.
d) How many teams scored more than 5 goals and which are those teams.
e) Which team is most accurate in shooting?
f) How many teams made more fouls than their opponents?

import pandas as pd
import numpy as np
data = pd.read_csv('A4-Football.csv')  # the answers below refer to this DataFrame as data
data.head()

a)
teams_participated = data['Team'].nunique()


print(f"The number of teams participated in the tournament: {teams_participated}")

b)
# Total cards per row; fewer cards are read here as better discipline
data['Discipline'] = data['Red Cards'] + data['Yellow Cards']
cards_per_team = data.groupby('Team')['Discipline'].sum()

# Find the top two teams with the highest discipline (fewest cards)
top_teams = cards_per_team.nsmallest(2)
print("Top two teams with highest discipline (fewest cards):")
print(top_teams)

# Find the bottom two teams with the lowest discipline (most cards)
bottom_teams = cards_per_team.nlargest(2)
print("\nBottom two teams with lowest discipline (most cards):")
print(bottom_teams)
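
One way to read part b) is that a team with fewer cards is more disciplined. The sketch below applies that reading to hypothetical rows (the team names and card counts are invented; only the column names come from this lab) and counts red and yellow cards equally, which is itself an assumption:

```python
import pandas as pd

# Hypothetical rows; the column names match those used in this lab.
data = pd.DataFrame({
    "Team": ["A", "B", "C", "D", "E"],
    "Yellow Cards": [9, 4, 7, 3, 5],
    "Red Cards": [0, 0, 1, 0, 2],
})

# Total cards per team: a lower total is read here as better discipline.
cards = data.set_index("Team")[["Yellow Cards", "Red Cards"]].sum(axis=1)

most_disciplined = cards.nsmallest(2)    # fewest cards
least_disciplined = cards.nlargest(2)    # most cards

print(most_disciplined)
print(least_disciplined)
```

If red cards should count for more than yellow cards, the sum can be replaced with a weighted total.
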

c)
average_yellow = data.groupby('Team')['Yellow Cards'].mean()

# Calculate overall average of yellow cards across all teams


overall_average_yellow = average_yellow.mean()
print(f"On average, {overall_average_yellow:.2f} yellow cards are given per team.")

d)
teams_goals = data[data['Goals Scored'] > 5]

# Count the number of teams that scored more than 5 goals
num_teams_more_5_goals = teams_goals['Team'].nunique()
print(f"{num_teams_more_5_goals} teams scored more than 5 goals.")
print("\nThe teams that scored more than 5 goals:")
print(teams_goals[['Team', 'Goals Scored']])
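
Parts d) and f) each ask for both a count and a list of teams. As a small illustration on invented data, unique() returns the distinct names while nunique() returns how many there are:

```python
import pandas as pd

# Hypothetical rows; only the column names come from this lab.
data = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Goals Scored": [4, 6, 12, 5],
})

high_scorers = data.loc[data["Goals Scored"] > 5, "Team"]
names = high_scorers.unique()     # array of distinct team names
count = high_scorers.nunique()    # integer number of distinct teams

print(count, list(names))
```
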

e)
most_accurate_team = data.loc[data['Shooting Accuracy'].idxmax()]

print(f"The most accurate team in shooting is: {most_accurate_team['Team']} "
      f"with a shooting accuracy of {most_accurate_team['Shooting Accuracy']}.")

# If you need the shooting accuracy values for all teams, you can sort the DataFrame:
sorted_teams_by_accuracy = data.sort_values(by='Shooting Accuracy', ascending=False)
print("\nTeams sorted by shooting accuracy:")
print(sorted_teams_by_accuracy[['Team', 'Shooting Accuracy']])
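
The idxmax() call in part e) needs a numeric column. If Shooting Accuracy is stored as text such as "51.9%" (an assumption; the lab file may already hold numbers), the values can be converted first. A minimal sketch on invented rows:

```python
import pandas as pd

# Hypothetical rows with accuracy stored as percentage strings.
data = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Shooting Accuracy": ["51.9%", "37.9%", "47.8%"],
})

# Strip the trailing percent sign and convert the remainder to floats.
accuracy = pd.to_numeric(data["Shooting Accuracy"].str.rstrip("%"))
most_accurate_team = data.loc[accuracy.idxmax(), "Team"]

print(most_accurate_team)
```
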

f)
teams_more_fouls_than_opponents = data[data['Own Fouls'] > data['Opponent Fouls']]

# Count the number of teams that made more fouls than their opponents
num_teams_more_fouls_than_opponents = teams_more_fouls_than_opponents['Team'].nunique()
print(f"{num_teams_more_fouls_than_opponents} teams made more fouls than their opponents.")
print("\nThe teams that made more fouls than their opponents:")
print(teams_more_fouls_than_opponents[['Team', 'Own Fouls', 'Opponent Fouls']])

A4). Write python code for calculating various regression errors/error metrics such as SSE, MSE,
RMSE and R2 score. The function should take actual target values and predicted targets from
the model as input and return these error metrics as output
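
A minimal sketch of such a function, using only NumPy. The name regression_metrics and the dictionary return format are illustrative choices rather than part of the assignment, and the sample values are the ones used in the script later in this answer:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return SSE, MSE, RMSE and R2 for actual and predicted target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    residuals = y_true - y_pred
    sse = float(np.sum(residuals ** 2))      # sum of squared errors
    mse = sse / y_true.size                  # mean squared error
    rmse = float(np.sqrt(mse))               # root mean squared error
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    # R2 is undefined when all actual values are identical (ss_tot == 0).
    r2 = 1.0 - sse / ss_tot if ss_tot > 0 else float("nan")
    return {"SSE": sse, "MSE": mse, "RMSE": rmse, "R2": r2}

y = [-3, -1, -2, 1, -1, 1, 2, 1, 3, 4, 3, 5]
yhat = [-2, 1, -1, 0, -1, 1, 2, 2, 3, 3, 3, 5]
metrics = regression_metrics(y, yhat)
print(metrics)
```
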

The script further below calculates regression error metrics, including the Mean Squared Error
(MSE), Root Mean Squared Error (RMSE), and R-squared (R2) score, twice: once manually with
NumPy and once with the sklearn.metrics module. That module provides the functions
mean_squared_error and r2_score for evaluating regression models:
mean_squared_error: This function calculates the Mean Squared Error (MSE), which
measures the average squared difference between the actual and predicted values. It is a
widely used metric for evaluating regression models. The formula for MSE is:

MSE = Σ(actual - predicted)^2 / n
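
As a worked example of this formula on three made-up values, the sum in the numerator is the Sum of Squared Errors (SSE), dividing it by n gives the MSE, and the square root of the MSE is the RMSE:

```python
import math

# Made-up values for illustration only.
actual = [3, 5, 2]
predicted = [2, 5, 4]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]  # [1, 0, 4]
sse = sum(squared_errors)      # 1 + 0 + 4 = 5
mse = sse / len(actual)        # 5 / 3
rmse = math.sqrt(mse)

print(sse, mse, rmse)
```
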


r2_score: This function calculates the R-squared (R2) score, also known as the coefficient of
determination. It measures the proportion of variance in the dependent variable (target)
that is predictable from the independent variables (features). An R2 score of 1 indicates a
perfect fit, a score of 0 matches a model that always predicts the mean of the actual values,
and the score is negative when the model does worse than that.

R2 Score = 1 - (SSres / SStot)

where SSres is the sum of squared residuals and SStot is the total sum of squares.
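
The formula can be checked by hand on a tiny made-up example. Note that R2 is not bounded below by 0: predicting the mean of the actual values gives exactly 0, and predictions worse than that give a negative score. The helper name r2 below is illustrative:

```python
import numpy as np

# Made-up values for illustration only.
y = np.array([1.0, 2.0, 3.0])
yhat_reversed = np.array([3.0, 2.0, 1.0])   # predictions in the wrong order
yhat_mean = np.full_like(y, y.mean())       # always predict the mean of y

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

r2_reversed = r2(y, yhat_reversed)   # 1 - 8/2
r2_mean = r2(y, yhat_mean)           # 1 - 2/2

print(r2_reversed, r2_mean)
```
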

import numpy as np
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Actual (y) and predicted (yhat) target values
y = np.array([-3, -1, -2, 1, -1, 1, 2, 1, 3, 4, 3, 5])
yhat = np.array([-2, 1, -1, 0, -1, 1, 2, 2, 3, 3, 3, 5])

# Plot the actual values as points and the predictions as a line
x = list(range(len(y)))
plt.scatter(x, y, color="blue", label="original")
plt.plot(x, yhat, color="red", label="predicted")
plt.legend()
plt.show()

# calculate manually
d = y - yhat  # residuals
sse_f = np.sum(d**2)
mse_f = np.mean(d**2)
mae_f = np.mean(np.abs(d))
rmse_f = np.sqrt(mse_f)
r2_f = 1 - (np.sum(d**2) / np.sum((y - np.mean(y))**2))
print("Results by manual calculation:")
print("SSE:", sse_f)
print("MAE:", mae_f)
print("MSE:", mse_f)
print("RMSE:", rmse_f)
print("R-Squared:", r2_f)

# calculate with sklearn.metrics
mae = metrics.mean_absolute_error(y, yhat)
mse = metrics.mean_squared_error(y, yhat)
rmse = np.sqrt(mse)  # same as mse**0.5
r2 = metrics.r2_score(y, yhat)
print("Results of sklearn.metrics:")
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R-Squared:", r2)

Output:
