ML A 6 Project
A6 PROJECT REPORT
of
MACHINE LEARNING WITH PYTHON
PGCSA113
Submitted in partial fulfilment of the requirement for the award of Degree of
Submitted to
ACKNOWLEDGEMENT
I take this opportunity to express my gratitude and humble regards to
Vivekananda Global University for providing the opportunity to present a project on
“On my Dataset Itineraries”, which is a “Machine Learning with Python”
based project.
I am also thankful to my project guide, “Mr. Katib Showkat Zarger”, for helping me
complete the project and its documentation. I have put effort into this project,
but its success would not have been possible without his support and
encouragement.
I would like to thank our Dean, “Dr. R.C. Tripathi”, for providing all
the necessary books and other materials as and when required. I express my gratitude to
the authors whose books served as a guide in the completion of my
project. I am also thankful to my classmates and friends, who encouraged me
throughout the project.
Thanks
Hemant Maurya
Enrollment No.: 24CSA3BC006
Place: Jaipur
Date: 01-05-2025
DECLARATION
We hereby declare that this Project Report titled “On my Dataset Itineraries”,
submitted by us and approved by our project guide to the Vivekananda Global University,
Jaipur, is a bonafide work undertaken by us and has not been submitted to any other University
or Institution for the award of any degree, diploma, or certificate, nor published at any time
before.
1. Import Dataset
# Mount Google Drive so the dataset is reachable from Colab
from google.colab import drive
drive.mount('/content/drive')

import numpy as np
import pandas as pd
import plotly.express as px

# Load the itineraries dataset and preview the first five rows
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/itineraries.csv")
df.head(5)
3. Average Total Fare by Starting Airport
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource

output_notebook()

# Mean fare per starting airport, one row per airport
avg_fare = df.groupby('startingAirport')['totalFare'].mean().reset_index()
source = ColumnDataSource(avg_fare)

# A categorical x-axis takes the airport codes as a list of strings
p = figure(x_range=list(avg_fare['startingAirport']), height=600, width=800,
           title="Average Total Fare by Starting Airport", toolbar_location=None,
           tools="")
p.vbar(x='startingAirport', top='totalFare', width=0.9, source=source,
       color="skyblue", line_color="black")
p.xaxis.axis_label, p.yaxis.axis_label = 'Starting Airport', 'Average Total Fare ($)'
p.xaxis.major_label_orientation, p.grid.grid_line_color = "vertical", None
show(p)
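4. Drop Unused Columns and Compute the Correlation Matrix
# Drop three columns that are not used in the analysis
new_df1 = df.drop(columns=['legId', 'segmentsArrivalTimeEpochSeconds',
                           'segmentsDistance'])
print(new_df1)

# Pairwise correlation between the numeric columns
correlation_matrix = df.select_dtypes(include='number').corr()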
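5. Box Plot of Features
import matplotlib.pyplot as plt
import seaborn as sns

# Box plots need numeric data, so convert duration strings to seconds if needed
duration = df["travelDuration"]
if not pd.api.types.is_numeric_dtype(duration):
    duration = pd.to_timedelta(duration).dt.total_seconds()
plt.figure(figsize=(5, 4))
sns.boxplot(data=pd.DataFrame({"totalFare": df["totalFare"], "travelDuration": duration}))
plt.title("Box Plot of Features")
plt.xlabel("Features")
plt.ylabel("Value")
plt.show()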
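6. Line Graph: Mean Values across Samples
# The mean fare is a single value, so this draws a horizontal reference line
mean_fare = df['totalFare'].mean()
plt.figure(figsize=(5, 4))
plt.plot(df.index, [mean_fare] * len(df), marker="o", linestyle="-",
         color="blue", alpha=0.6)
plt.title("Line Graph: Mean values across Samples")
plt.xlabel("Sample Index")
plt.ylabel("Mean")
plt.show()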
7. Scatter Plot: Mean vs Variance
# Mean and variance of totalFare are scalars, so every row carries the same
# values and the scatter plot collapses to a single point
df['Mean'] = df['totalFare'].mean()
df['Variance'] = df['totalFare'].var()
df['Target'] = 15
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Mean"], y=df["Variance"], hue=df["Target"],
                palette="coolwarm")
plt.title("Scatter Plot: Mean vs Variance")
plt.xlabel("Mean")
plt.ylabel("Variance")
plt.legend(title="Target")
plt.show()
8. Visualization of the dataset using different libraries.
correlation_matrix = df.select_dtypes(include=['number']).corr()
# Histogram of 'totalFare'
plt.figure(figsize=(8,6))
plt.hist(df['totalFare'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Total Fare')
plt.ylabel('Frequency')
plt.title('Distribution of Total Fare')
plt.show()
# Box plot of 'totalFare' for different 'Starting Airport'
plt.figure(figsize=(10,6))
sns.boxplot(x='startingAirport', y='totalFare', data=df)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Total Fare')
plt.xlabel('Starting Airport')
plt.title('Total Fare Distribution by Starting Airport')
plt.show()
# Heatmap of the correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()
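10. Train and test a Linear Regression Model on your Dataset.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Convert duration strings to seconds once; calling pd.to_timedelta again on
# already-numeric values would reinterpret them as nanoseconds
if not pd.api.types.is_numeric_dtype(df['travelDuration']):
    df['travelDuration'] = pd.to_timedelta(df['travelDuration']).dt.total_seconds()

X = df[['travelDuration']]  # single feature: travel duration in seconds
y = df['totalFare']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out 20% of the data
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Travel Duration (seconds)')
plt.ylabel('Total Fare')
plt.title('Linear Regression Model')
plt.legend()
plt.show()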
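
The two metrics printed above can be checked by hand. The following is a minimal sketch on small made-up arrays (the values are illustrative and are not taken from the itineraries dataset; the names y_true and y_hat are hypothetical) showing how the Mean Squared Error and R-squared follow from their definitions.

```python
import numpy as np

# Hypothetical actual and predicted fares, for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Mean Squared Error: {mse_manual}")
print(f"R-squared: {r2_manual}")
```

An R-squared close to 1 means the model explains most of the variance in the target, while a value near 0 means it does little better than always predicting the mean fare.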
11. Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data with Gaussian noise (not taken from the itineraries dataset)
X = np.random.rand(100, 1) * 10
y = 2.5 * X**2 - 1.5 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Expand X into [1, X, X^2], then fit an ordinary linear model on the expanded features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_pred = model.predict(X_test_poly)

plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.title('Polynomial Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
12. Logistic Regression: Travel Duration vs Total Fare
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X = df[['travelDuration']].copy()
# Convert duration strings to seconds only if the column is not numeric yet
if not pd.api.types.is_numeric_dtype(X['travelDuration']):
    X['travelDuration'] = pd.to_timedelta(X['travelDuration']).dt.total_seconds()

# Binary target: 1 if the fare is above the median fare, otherwise 0
threshold = df['totalFare'].median()
y = (df['totalFare'] > threshold).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Fill missing durations with the training-set mean (fitted on the training data only)
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

plt.figure(figsize=(6, 4))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Travel Duration (seconds)')
plt.ylabel('Total Fare (Binary)')
plt.title('Logistic Regression: Travel Duration vs Total Fare')
plt.legend()
plt.show()
13. Linear Regression: Travel Duration vs Total Fare
# Convert duration strings to seconds only if the column is not numeric yet
if not pd.api.types.is_numeric_dtype(df['travelDuration']):
    df['travelDuration'] = pd.to_timedelta(df['travelDuration']).dt.total_seconds()

X = df[['travelDuration']]
y = df['totalFare']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Travel Duration (seconds)')  # the feature is stored in seconds
plt.ylabel('Total Fare')
plt.title('Linear Regression: Travel Duration vs Total Fare')
plt.legend()
plt.show()
14. Gaussian Distribution in Machine Learning
# Draw synthetic samples from a normal distribution that has the same mean
# and standard deviation as totalFare
mu = df['totalFare'].mean()
sigma = df['totalFare'].std()
df['Gaussian_TotalFare'] = np.random.normal(mu, sigma, len(df))

plt.figure(figsize=(5, 4))
plt.hist(df['Gaussian_TotalFare'], bins=30, density=True, alpha=0.6,
         color='blue', label='Gaussian Distribution of totalFare')
plt.xlabel('Total Fare')
plt.ylabel('Probability Density')
plt.title('Gaussian Distribution of Total Fare')
plt.legend()
plt.show()
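
The histogram above approximates a Gaussian density from random samples. As a rough sketch of the curve it approximates, the following evaluates the normal density formula directly. The helper gaussian_pdf and the parameter values are hypothetical and chosen for illustration; in this report they would correspond to the mean and standard deviation of totalFare.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical parameters, for illustration only
mu, sigma = 300.0, 100.0
xs = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 2001)
density = gaussian_pdf(xs, mu, sigma)

# The density peaks at the mean and integrates to approximately 1
peak_x = xs[np.argmax(density)]
area = np.sum(density) * (xs[1] - xs[0])
print(peak_x, area)
```

Plotting xs against density with plt.plot on the same axes as a histogram drawn with density=True is one way to compare the sampled distribution with the analytic curve.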
15. Precision, Recall, and F1 Score
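
As a minimal sketch of how these three metrics are defined, the following computes them from true-positive, false-positive, and false-negative counts. The label arrays are made up for illustration (1 standing for a fare above the median, 0 otherwise) and are not results from the itineraries dataset.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # predicted 1, actually 0
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # predicted 0, actually 1

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
```

In scikit-learn, precision_score, recall_score, and f1_score from sklearn.metrics compute the same quantities, and classification_report prints them per class.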
K-Means Clustering Practical
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Convert duration strings to seconds only if the column is not numeric yet
if not pd.api.types.is_numeric_dtype(df['travelDuration']):
    df['travelDuration'] = pd.to_timedelta(df['travelDuration']).dt.total_seconds()

# Standardize both features so that neither dominates the distance computation
X = df[['travelDuration', 'totalFare']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=15, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Centroids are in standardized units; map them back to seconds and fare for plotting
centroids = scaler.inverse_transform(kmeans.cluster_centers_)

plt.figure(figsize=(5, 3))
plt.scatter(df['travelDuration'], df['totalFare'], c=df['cluster'], cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red',
            label='Centroids', marker='*')
plt.xlabel('Travel Duration')
plt.ylabel('Total Fare')
plt.title('KMeans Clustering of Travel Data')
plt.legend()
plt.show()
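
K-Means above is fitted on standardized features, so kmeans.cluster_centers_ is expressed in standardized units rather than in seconds and fare. The following NumPy-only sketch uses made-up points and assumes the usual z-score definition that StandardScaler applies (subtract the column mean, divide by the column standard deviation). It shows how a centroid maps back to the original units.

```python
import numpy as np

# Hypothetical (travelDuration in seconds, totalFare) points, for illustration only
X = np.array([[3600.0, 100.0],
              [7200.0, 200.0],
              [10800.0, 600.0]])

# Standardize each column: z = (x - mean) / std
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

# Centroid of a single cluster containing all points, in standardized units
centroid_scaled = X_scaled.mean(axis=0)

# Undo the scaling to express the centroid in the original units
centroid_original = centroid_scaled * std + mean
print(centroid_scaled, centroid_original)
```

With a fitted StandardScaler, scaler.inverse_transform performs the same mapping in one call.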