
ML A 6 Project

The document is a project report for a Machine Learning with Python course, detailing the analysis of a dataset related to travel itineraries. It includes sections on data import, visualization, correlation analysis, and the application of various machine learning models such as linear regression and K-means clustering. The report acknowledges the support of faculty and peers and is submitted for the Master's degree at Vivekananda Global University.

Dataset Itineraries

A6 PROJECT REPORT
of
MACHINE LEARNING WITH PYTHON
PGCSA113
Submitted in partial fulfilment of the requirement for the award of Degree of

MASTER OF COMPUTER APPLICATIONS

Submitted to:
Mr. Katib Showkat Zarger
Computer Science & Application

Submitted by:
Hemant Maurya
Enroll No. 24CSA3BC006

Department of Computer Science & Application


Vivekananda Global University, Jaipur
Year- 2025

ACKNOWLEDGEMENT
I take this opportunity to express my gratitude and humble regards to
Vivekananda Global University for providing the opportunity to present a project
titled “On my Dataset Itineraries”, which is a “Machine Learning with Python”
based project.
I am also thankful to my project guide, Mr. Katib Showkat Zarger, for helping me
complete the project and its documentation. I have put effort into this project,
but its success would not have been possible without his support and
encouragement.
I would like to thank our Dean, Dr. R.C. Tripathi, for providing all
the necessary books and other materials as and when required. I express my gratitude to
the authors whose books served as a guide in the completion of my
project. I am also thankful to my classmates and friends who encouraged me in
the course of completing the project.

Thanks

Hemant Maurya
Enrollment No: 24CSA3BC006

Place: Jaipur
Date: 01-05-2025

DECLARATION
We hereby declare that this Project Report titled “On my Dataset Itineraries”,
submitted by us and approved by our project guide to the Vivekananda Global University,
Jaipur, is a bonafide work undertaken by us and has not been submitted to any other University
or Institution for the award of any degree, diploma or certificate, or published at any time
before.

Student Name: Hemant Maurya (24CSA3BC006)

ERP: 2430815

Project Guide: Mr. Katib Showkat Zarger

1. Import Dataset
from google.colab import drive
drive.mount('/content/drive')

import numpy as np
import pandas as pd
import plotly.express as px

# Load the itineraries dataset from the mounted Google Drive
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/itineraries.csv")
df.head(5)

2. Average Total Fare by Starting Airport (Seaborn)
import seaborn as sns
import matplotlib.pyplot as plt

# Mean fare for each departure airport
airport_fare = df.groupby('startingAirport')['totalFare'].mean().reset_index()
sns.set(style="whitegrid")
plt.figure(figsize=(7, 4))
sns.barplot(x='startingAirport', y='totalFare', data=airport_fare, palette="Blues_d")
plt.title('Average Total Fare by Starting Airport', fontsize=14)
plt.xlabel('Starting Airport', fontsize=12)
plt.ylabel('Average Total Fare ($)', fontsize=12)
plt.xticks(rotation=45)
plt.show()

3. Average Total Fare by Starting Airport (Bokeh)
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource

output_notebook()
# Mean fare for each departure airport
airport_fare = df.groupby('startingAirport')['totalFare'].mean().reset_index()
source = ColumnDataSource(airport_fare)
# A categorical x_range expects a plain list of category names
p = figure(x_range=list(airport_fare['startingAirport']), height=600, width=800,
           title="Average Total Fare by Starting Airport", toolbar_location=None,
           tools="")
p.vbar(x='startingAirport', top='totalFare', width=0.9, source=source,
       color="skyblue", line_color="black")
p.xaxis.axis_label = 'Starting Airport'
p.yaxis.axis_label = 'Average Total Fare ($)'
p.xaxis.major_label_orientation = "vertical"
p.grid.grid_line_color = None
show(p)

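4. Drop unused columns and compute the correlation matrix
df1 = pd.DataFrame(df)
# Drop identifier and per-segment columns that are not needed for the analysis
new_df1 = df1.drop(['legId', 'segmentsArrivalTimeEpochSeconds',
                    'segmentsDistance'], axis=1)
print(new_df1)
# Correlation is only defined for numeric columns
corr2 = df.select_dtypes(include='number')
correlation_matrix = corr2.corr()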
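
As a rough illustration of what `corr()` computes for each pair of numeric columns, the sketch below works out the Pearson coefficient by hand and compares it with pandas. The frame `toy` and its values are made up for the example and do not come from itineraries.csv; `pearson` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up stand-in values; NOT taken from itineraries.csv.
toy = pd.DataFrame({
    "totalFare": [100.0, 150.0, 200.0, 250.0, 400.0],
    "travelDuration": [7200.0, 9000.0, 12600.0, 14400.0, 30000.0],  # seconds
})

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual_r = pearson(toy["totalFare"], toy["travelDuration"])
pandas_r = float(toy.corr().loc["totalFare", "travelDuration"])
print(f"manual: {manual_r:.4f}  pandas: {pandas_r:.4f}")
```

The two numbers should agree up to floating-point rounding, since `DataFrame.corr()` uses the Pearson method by default.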

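5. Box Plot of Features
# Work on a copy so the original travelDuration column is left untouched
plot_df = df[["totalFare", "travelDuration"]].copy()
if not pd.api.types.is_numeric_dtype(plot_df["travelDuration"]):
    plot_df["travelDuration"] = pd.to_timedelta(plot_df["travelDuration"]).dt.total_seconds()
plt.figure(figsize=(5, 4))
sns.boxplot(data=plot_df)
plt.title("Box Plot of Features")
plt.xlabel("Features")
plt.ylabel("Value")
plt.show()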

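6. Line Graph: Mean values across Samples
plt.figure(figsize=(5, 4))
# The mean fare is a single value, so this draws a horizontal line across all samples
plt.plot(df.index, [df['totalFare'].mean()] * len(df), marker="o", linestyle="-",
         color="blue", alpha=0.6)
plt.title("Line Graph: Mean values across Samples")
plt.xlabel("Sample Index")
plt.ylabel("Mean")
plt.show()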

7. Scatter Plot: Mean vs Variance
# Mean, Variance and Target are constants broadcast to every row
df['Mean'] = df['totalFare'].mean()
df['Variance'] = df['totalFare'].var()
df['Target'] = 15
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Mean"], y=df["Variance"], hue=df["Target"],
                palette="coolwarm")
plt.title("Scatter Plot: Mean vs Variance")
plt.xlabel("Mean")
plt.ylabel("Variance")
plt.legend(title="Target")
plt.show()

8. Visualization of the dataset using different libraries.

correlation_matrix = df.select_dtypes(include=['number']).corr()

 # Histogram of 'totalFare'
plt.figure(figsize=(8,6))
plt.hist(df['totalFare'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Total Fare')
plt.ylabel('Frequency')
plt.title('Distribution of Total Fare')
plt.show()

# Scatter plot of 'travelDuration' vs. 'totalFare'
plt.figure(figsize=(8,6))
plt.scatter(df['travelDuration'], df['totalFare'], alpha=0.5)
plt.xlabel('Travel Duration')
plt.ylabel('Total Fare')
plt.title('Travel Duration vs. Total Fare')
plt.show()

 # Box plot of 'totalFare' for different 'Starting Airport'
plt.figure(figsize=(10,6))
sns.boxplot(x='startingAirport', y='totalFare', data=df)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Total Fare')
plt.xlabel('Starting Airport')
plt.title('Total Fare Distribution by Starting Airport')
plt.show()

# Pairplot of numerical features
sns.pairplot(df.select_dtypes(include=['number']))
plt.show()

 # Heatmap of the correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()

9. Calculating Correlation between different variables in a separate data frame.


numerical_df = df.select_dtypes(include=np.number)
correlation_matrix = numerical_df.corr()
print(correlation_matrix)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()

10. Train and test a Linear Regression Model on your Dataset.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Convert travelDuration to seconds once; skip if the column is already numeric
if not pd.api.types.is_numeric_dtype(df['travelDuration']):
    df['travelDuration'] = pd.to_timedelta(df['travelDuration']).dt.total_seconds()
X = df[['travelDuration']]  # single feature: travel duration in seconds
y = df['totalFare']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Travel Duration (seconds)')
plt.ylabel('Total Fare')
plt.title('Linear Regression Model')
plt.legend()
plt.show()
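
To make the two reported metrics concrete, the sketch below computes the mean squared error and R-squared by hand and checks them against scikit-learn. The arrays `y_true` and `y_hat` are made-up numbers chosen only for illustration, not results from the itineraries data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up fares and predictions, for illustration only.
y_true = np.array([200.0, 310.0, 150.0, 420.0, 275.0])
y_hat = np.array([220.0, 300.0, 180.0, 390.0, 260.0])

residuals = y_true - y_hat

# MSE: average of the squared residuals
mse_manual = float(np.mean(residuals ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: manual={mse_manual:.2f}  sklearn={mean_squared_error(y_true, y_hat):.2f}")
print(f"R^2: manual={r2_manual:.4f}  sklearn={r2_score(y_true, y_hat):.4f}")
```

An R-squared near 1 means the model explains most of the variance in the target, while a value near 0 means it does little better than always predicting the mean fare.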

11. Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy quadratic, not taken from the itineraries dataset
X = np.random.rand(100, 1) * 10
y = 2.5 * X**2 - 1.5 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Expand X into the columns [1, X, X^2] and fit a linear model on them
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_pred = model.predict(X_test_poly)
plt.scatter(X_test, y_test, color='blue')
plt.scatter(X_test, y_pred, color='red')
plt.title('Polynomial Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
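
The following minimal sketch shows what `PolynomialFeatures(degree=2)` produces for a single input column and that a linear model on those columns recovers a quadratic. It uses a tiny noise-free dataset with the same coefficients as the synthetic example above; the names `expander` and `quad_model` are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[0.0], [1.0], [2.0], [3.0]])

# degree=2 expands one column x into three columns: 1, x, x^2
expander = PolynomialFeatures(degree=2)
x_poly = expander.fit_transform(x)
print(x_poly)

# Noise-free quadratic target: y = 2.5 * x^2 - 1.5 * x
y = 2.5 * x[:, 0] ** 2 - 1.5 * x[:, 0]

quad_model = LinearRegression().fit(x_poly, y)
# coef_[1] multiplies x and coef_[2] multiplies x^2
print("coefficient of x:  ", quad_model.coef_[1])
print("coefficient of x^2:", quad_model.coef_[2])
```

Because the model is still linear in the expanded columns, polynomial regression is ordinary least squares on transformed features.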

12. Logistic Regression: Travel Duration vs Total Fare
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Convert travelDuration to seconds before splitting; skip if already numeric
X = df[['travelDuration']].copy()
if not pd.api.types.is_numeric_dtype(X['travelDuration']):
    X['travelDuration'] = pd.to_timedelta(X['travelDuration']).dt.total_seconds()
# Binary target: 1 if the fare is above the median fare, otherwise 0
threshold = df['totalFare'].median()
y = (df['totalFare'] > threshold).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Fill missing durations with the mean of the training set
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
plt.figure(figsize=(6, 4))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Travel Duration (seconds)')
plt.ylabel('Total Fare (Binary)')
plt.title('Logistic Regression: Travel Duration vs Total Fare')
plt.legend()
plt.show()
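
As a sketch of how the median threshold and the classifier interact, the example below builds a synthetic dataset (random durations and fares, not the itineraries data), labels fares above the median as 1, and fits a logistic regression. Standardising the feature first is a choice made for this sketch, since raw durations in seconds are large numbers; the report's own code does not scale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: durations in seconds and fares that grow with duration.
duration_s = rng.uniform(3600, 36000, size=200)
fare = 100 + 0.01 * duration_s + rng.normal(0, 30, size=200)

# Thresholding at the median gives two classes of equal size.
is_expensive = (fare > np.median(fare)).astype(int)

features = duration_s.reshape(-1, 1)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(features, is_expensive)

proba = clf.predict_proba(features)[:, 1]   # estimated P(class = 1)
pred = clf.predict(features)                # class 1 where that probability exceeds 0.5

print("class counts:", np.bincount(is_expensive))
print("training accuracy:", (pred == is_expensive).mean())
```

The accuracy printed here is measured on the training data, so it is only a sanity check and not an estimate of performance on unseen itineraries.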

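13. Linear Regression: Travel Duration vs Total Fare
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Convert travelDuration to seconds; skip if an earlier step already converted it
if not pd.api.types.is_numeric_dtype(df['travelDuration']):
    df['travelDuration'] = pd.to_timedelta(df['travelDuration']).dt.total_seconds()
X = df[['travelDuration']]
y = df['totalFare']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Travel Duration (seconds)')
plt.ylabel('Total Fare')
plt.title('Linear Regression: Travel Duration vs Total Fare')
plt.legend()
plt.show()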
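
Several steps in this report convert `travelDuration` with `pd.to_timedelta`. When that call receives numbers instead of strings it interprets them as nanoseconds by default, so converting the same column twice would shrink the values. A minimal sketch of a conversion that is safe to repeat is shown below; `duration_to_seconds` is a hypothetical helper, and the sample strings are made up rather than taken from the dataset.

```python
import pandas as pd

def duration_to_seconds(series):
    """Return durations as float seconds.

    Numeric input is assumed to be in seconds already and is returned as floats,
    so calling this function more than once does not change the values.
    """
    if pd.api.types.is_numeric_dtype(series):
        return series.astype(float)
    return pd.to_timedelta(series).dt.total_seconds()

# Made-up duration strings in a format pandas parses.
raw = pd.Series(["02:29:00", "0 days 05:00:00"])

once = duration_to_seconds(raw)
twice = duration_to_seconds(once)   # second call leaves the values unchanged

print(once.tolist())
print(twice.tolist())
```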

14. Gaussian Distribution in Machine Learning
mu = df['totalFare'].mean()
sigma = df['totalFare'].std()
# Draw random samples from a normal distribution with the same mean and
# standard deviation as totalFare
df['Gaussian_TotalFare'] = np.random.normal(mu, sigma, len(df))

plt.figure(figsize=(5, 4))
plt.hist(df['Gaussian_TotalFare'], bins=30, density=True, alpha=0.6,
         color='blue', label='Gaussian Distribution of totalFare')

# Overlay the theoretical normal density on the histogram
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
plt.plot(x, p, 'k', linewidth=2, label='Theoretical Gaussian Distribution')

plt.xlabel('Total Fare')
plt.ylabel('Probability Density')
plt.title('Gaussian Distribution of Total Fare')
plt.legend()
plt.show()
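
The density formula used for the black curve can be checked numerically. The sketch below evaluates it on a fine grid with made-up values of `mu` and `sigma` (not the dataset's statistics) and confirms two familiar properties: the total area is about 1, and roughly 68% of it lies within one standard deviation of the mean.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

# Made-up parameters, for illustration only.
mu, sigma = 300.0, 50.0

grid = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 10001)
density = gaussian_pdf(grid, mu, sigma)
dx = grid[1] - grid[0]

# Simple Riemann sums approximate the integrals.
area = float(np.sum(density) * dx)
within_one_sigma = float(np.sum(density[np.abs(grid - mu) <= sigma]) * dx)

print(f"total area: {area:.6f}")
print(f"area within one sigma: {within_one_sigma:.4f}")
```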

15. Precision, Recall and F1 Score
from sklearn.metrics import precision_score, recall_score, f1_score

# y_test and y_pred are the continuous fares from the linear regression above;
# binarize both at the median fare so classification metrics apply
threshold = df['totalFare'].median()
y_test_binary = (y_test > threshold).astype(int)
y_pred_binary = (y_pred > threshold).astype(int)
precision = precision_score(y_test_binary, y_pred_binary)
recall = recall_score(y_test_binary, y_pred_binary)
f1 = f1_score(y_test_binary, y_pred_binary)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
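
For reference, the three scores can be derived from simple counts of true positives, false positives and false negatives. The sketch below does this on a small made-up pair of label arrays (not results from the dataset) and compares the outcome with scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, actually 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, actually 1

precision_manual = tp / (tp + fp)   # share of predicted positives that are correct
recall_manual = tp / (tp + fn)      # share of actual positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision_manual:.2f} recall={recall_manual:.2f} f1={f1_manual:.2f}")
```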

16. K Means Clustering Practical
from sklearn.cluster import KMeans

# Copy the two features so the conversion below does not modify a view of df
X = df[['totalFare', 'travelDuration']].copy()
if not pd.api.types.is_numeric_dtype(X['travelDuration']):
    X['travelDuration'] = pd.to_timedelta(X['travelDuration']).dt.total_seconds()
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_
df['cluster'] = labels
print(df)
# Plot the numeric features used for clustering, coloured by cluster label
fig = px.scatter(X.assign(cluster=labels), x='totalFare', y='travelDuration',
                 color='cluster', title='K-Means Clustering')
fig.show()

K Means Clustering Practical (with StandardScaler)
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Convert travelDuration to seconds; skip if an earlier step already converted it
if not pd.api.types.is_numeric_dtype(df['travelDuration']):
    df['travelDuration'] = pd.to_timedelta(df['travelDuration']).dt.total_seconds()
X = df[['travelDuration', 'totalFare']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=15, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Centroids live in scaled space; map them back to the original units for plotting
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
plt.figure(figsize=(5, 3))
plt.scatter(df['travelDuration'], df['totalFare'], c=df['cluster'], cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red',
            label='Centroids', marker='*')
plt.xlabel('Travel Duration (seconds)')
plt.ylabel('Total Fare')
plt.title('KMeans Clustering of Travel Data')
plt.legend()
plt.show()
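
The report uses 5 and 15 clusters without stating how those numbers were chosen. One common heuristic, sketched below under the assumption that it suits this data, is to fit K-Means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. The three synthetic blobs here are made up for the example; the same sketch also shows mapping centroids back to the original units with `inverse_transform`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three synthetic blobs of (duration in seconds, fare); NOT the itineraries data.
true_centers = np.array([[7200.0, 150.0], [18000.0, 350.0], [36000.0, 700.0]])
points = np.vstack([
    center + rng.normal(0.0, [600.0, 20.0], size=(100, 2))
    for center in true_centers
])

scaler = StandardScaler()
scaled = scaler.fit_transform(points)

# Inertia for k = 1..6; the "elbow" is where further increases in k help little.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = float(km.inertia_)
    print(f"k={k}  inertia={inertias[k]:.2f}")

# Fit the chosen k and express the centroids in the original units.
best = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
centroids_original_units = scaler.inverse_transform(best.cluster_centers_)
print(centroids_original_units)
```

On real data the elbow is often less clear than in this synthetic case, so the plot of inertia against k is a guide rather than a rule.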
