
GOVERNMENT COLLEGE OF ENGINEERING, THANJAVUR

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

PHASE 5 PROJECT SUBMISSION

College Code: 8227


Technology: Artificial Intelligence (AI)
Total number of Students in the group: 5

Completed the project titled

PREDICTIVE MAINTENANCE

Submitted by

Lokesh R

Meganthan P

Krishna kumar R

Rathish GP

Dhayanithi S
Project Title: Project Development – Dynamic Pricing

Introduction:

 Dynamic pricing, also known as real-time pricing, is a pricing strategy where businesses
adjust the prices of their products or services in response to real-time supply and demand
conditions, market trends, competitor pricing, and other external factors.
 This approach contrasts with static pricing, where prices remain fixed over a period of time
regardless of changing circumstances.
 Dynamic pricing is widely adopted in various industries, including airlines, hospitality, e-
commerce, and ride-sharing, due to its potential to maximize revenue and optimize
resource utilization.
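The supply-and-demand adjustment described above can be sketched as a simple pricing rule. This is purely illustrative: the demand/supply ratio as a multiplier and the clamp bounds are invented for this example, not taken from the project.

```python
def dynamic_price(base_price, demand, supply, min_mult=0.8, max_mult=2.0):
    """Scale a base price by the demand/supply ratio, clamped to a band."""
    if supply <= 0:
        multiplier = max_mult  # no supply available: charge the cap
    else:
        multiplier = demand / supply
    multiplier = max(min_mult, min(max_mult, multiplier))
    return round(base_price * multiplier, 2)

# Peak demand (90 riders, 30 drivers): ratio 3.0, clamped to 2.0
print(dynamic_price(50.0, 90, 30))   # 100.0
# Low demand (20 riders, 40 drivers): ratio 0.5, clamped to 0.8
print(dynamic_price(50.0, 20, 40))   # 40.0
```

The clamp prevents extreme surges or collapses in price, which is one common safeguard in real deployments.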

Project Objective:

 Maximizing Revenue: Increase overall revenue by setting prices that reflect current
demand levels, ensuring higher prices during peak demand and competitive pricing during
low demand.
 Improving Profit Margins: Optimize profit margins by efficiently balancing supply and
demand, reducing the occurrence of overpricing or underpricing.
 Enhancing Market Responsiveness: Quickly adapt to changes in market conditions,
including competitor pricing, seasonal variations, and consumer behavior trends.
 Personalizing Customer Experience: Use customer data to offer personalized pricing,
improving customer satisfaction and loyalty.
 Optimizing Inventory Management: Adjust prices to manage inventory levels
effectively, preventing stockouts and overstock situations.
 Leveraging Technology: Implement advanced technologies such as machine learning
algorithms and big data analytics to predict demand and automate price adjustments.
 Maintaining Competitive Edge: Stay ahead of competitors by continually analyzing and
responding to their pricing strategies.
 Ensuring Transparency and Trust: Clearly communicate the reasons for price changes
to maintain customer trust and avoid perceptions of unfair pricing practices.

About the Dataset:


o Product ID: Unique identifier for each product or service (e.g., "P12345")
o Product Name: Descriptive name of the product (e.g., "Wireless Headphones")
o Category: Product category or type (e.g., "Electronics")
o Base Price: Initial or standard price before any dynamic adjustments (e.g., $50.00)
o Current Price: The dynamically adjusted price at a given time (e.g., $45.00)
o Demand Level: Indicator of current demand (e.g., "High", "Medium", "Low")
o Stock Level: Quantity of product currently available (e.g., 100 units)
o Competitor Prices: Prices offered by competitors for the same or similar products (e.g., $48.00)
o Time of Day: Specific time when the price was recorded or adjusted (e.g., "14:00")
o Day of Week: Day when the price was recorded or adjusted (e.g., "Monday")
o Date: Specific date when the price was recorded or adjusted (e.g., "2024-06-04")
o Season: Current season affecting pricing (e.g., "Summer", "Winter")
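For illustration, the fields above can be held as one row of a pandas DataFrame. The values are the example values from the list; the derived Discount % column is an invented addition for this sketch.

```python
import pandas as pd

# One illustrative record using the fields described above
sample = pd.DataFrame([{
    "Product ID": "P12345",
    "Product Name": "Wireless Headphones",
    "Category": "Electronics",
    "Base Price": 50.00,
    "Current Price": 45.00,
    "Demand Level": "High",
    "Stock Level": 100,
    "Competitor Prices": 48.00,
    "Time of Day": "14:00",
    "Day of Week": "Monday",
    "Date": "2024-06-04",
    "Season": "Summer",
}])

# Current discount relative to the base price (derived field)
sample["Discount %"] = (1 - sample["Current Price"] / sample["Base Price"]) * 100
print(sample[["Product ID", "Current Price", "Discount %"]])
```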

System Requirements:

 Data:
o Description of the dynamic pricing dataset used, including its source, size, and
attributes.
o Explanation of how the data was collected and preprocessed.
 Hardware:
o Specifications for the hardware required to run the dynamic pricing
system, such as computational resources.
 Software:
o List of software tools and libraries used for data preprocessing, model
development, and evaluation.
o Description of any specific software requirements for deploying the dynamic
pricing system.

Methodology:
 Overview of the methodology followed in the project, including the steps involved in:
o Data preprocessing: Cleaning the dataset, handling missing values, encoding
categorical variables, and feature scaling.
o Model development: Choosing appropriate algorithms (e.g.,
RandomForestRegressor) and hyperparameter tuning.
o Model evaluation: Splitting the data into training and testing sets, assessing
model performance with regression metrics (RMSE, R²), and visualizing
results to compare the models.
 Explanation of any additional steps taken, such as feature engineering or ensemble
techniques.

Data Preprocessing:
 Detailed explanation of the data preprocessing steps undertaken, including:
o Handling missing values: Imputation techniques used, if any.
o Encoding categorical variables: One-hot encoding or label encoding methods
applied.
o Feature scaling: Standardization or normalization of features to ensure
consistent scale across variables.
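The three preprocessing steps above can be combined into a single scikit-learn pipeline. The toy data, column names, and imputation strategies below are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame mimicking the ride data (values invented)
df = pd.DataFrame({
    "Expected_Ride_Duration": [30.0, np.nan, 90.0, 45.0],
    "Number_of_Riders": [60, 80, np.nan, 40],
    "Vehicle_Type": ["Economy", "Premium", "Economy", np.nan],
})

numeric = ["Expected_Ride_Duration", "Number_of_Riders"]
categorical = ["Vehicle_Type"]

preprocess = ColumnTransformer([
    # numeric columns: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # categorical columns: most-frequent imputation, then one-hot encoding
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows: 2 scaled numeric columns + 2 one-hot columns
```

Bundling the steps this way ensures the same transformations learned on the training split are applied unchanged to the test split.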

Model Evaluation:
 Description of the model evaluation process, covering:
o Splitting the dataset into training and testing sets.
o Training the predictive model using appropriate algorithms and
hyperparameters.
o Evaluating the model's performance on the test set using relevant
regression metrics (RMSE, R²).
o Visualizing and tabulating evaluation results to gain insights into each
model's performance.
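A compact sketch of this split-train-evaluate loop. The synthetic data and all parameter choices here are invented; regression metrics (RMSE, R²) are shown because the implementation later in this report predicts a continuous fare.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for ride features and fares (all values invented)
X = rng.uniform(10, 180, size=(500, 3))           # e.g. duration, riders, drivers
y = 3.5 * X[:, 0] + rng.normal(0, 20, size=500)   # fare roughly linear in duration

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = float(r2_score(y_test, y_pred))
print(f"RMSE={rmse:.2f}  R^2={r2:.3f}")
```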

Existing Work:
 Review of existing literature, research papers, and projects related to dynamic
pricing strategies.
 Summary of methodologies, techniques, and findings from previous studies.
 Identification of gaps or limitations in existing approaches.

Proposed Work:
 Overview of the proposed methodology and objectives of the project.
 Explanation of how the proposed approach addresses the limitations or gaps identified
in existing work.
 Description of the dynamic pricing model for fare prediction and its
components.

Flow chart:
Implementation:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = (
    'dynamic-pricing-dataset:'
    'https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets'
    '%2F4365344%2F7496965%2Fbundle%2Farchive.zip'
    '%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256'
    '%26XGoog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com'
    '%252F20240604%252Fauto%252Fstorage%252Fgoog4_request'
    '%26X-Goog-Date%3D20240604T141813Z'
    '%26X-Goog-Expires%3D259200'
    '%26X-Goog-SignedHeaders%3Dhost'
    '%26X-Goog-Signature%3D4b0752a7b69b9e7909ee9c3329842cd9ba61766b7eda89a24962a67551b95d3eaa7'
    'dea046d402fae2dfc42360f03a955d9127029345c2f54cc95910da9b072fd57efe3dc180'
    'ad7b4abede0315918728a5fd9f81fb9390eee7010e9dd76c0d32a76fb2f3f5173ba1'
    'e8c1429d53fbe7656a876b9c3d2823829bdbe6c65c163cd94d6f50800a16cb544b2c29'
    'd348abdf43465906677d00f2bb0ff96b69f9b537b503b539fc31ec85a8e78912fb4dff835'
    'bed78eda36be723f116855c4f47ef14db5c58d570f8efd7d9f5f98edc77ddff8b112'
    'aa83d61495d4bfcc042bd2f98774cbceaf3f68ffb954c43201a931cddd1f67afcb67a'
    'd966e30bb4606f119e24f1bda'
)

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null


shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
    os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
    pass
try:
    os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
    pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
                with ZipFile(tfile) as zfile:
                    zfile.extractall(destination_path)
            else:
                # do not shadow the tarfile module with the archive handle
                with tarfile.open(tfile.name) as tar:
                    tar.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading dynamic-pricing-dataset, 22341 bytes compressed
[==================================================] 22341 bytes downloaded
Downloaded and uncompressed: dynamic-pricing-dataset
Data source import complete.
Introduction
This notebook explores a dataset provided by a ride-sharing company seeking to implement a dynamic
pricing strategy. Currently, the company sets fares based solely on ride duration. This project aims to
leverage data-driven techniques to develop a predictive model for dynamic pricing that adjusts fares in
response to real-time market conditions.
The provided dataset encompasses historical ride information, including features like the number of
riders, drivers, location categories, customer loyalty, past rides, average ratings, booking time, vehicle
type, expected ride duration, and historical costs.
Our objective here is to build a dynamic pricing model that utilizes these features to predict optimal fares
for rides in real-time, considering factors like demand patterns and driver availability.

import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

EDA
# Loading data
data = pd.read_csv("/kaggle/input/dynamic-pricing-dataset/dynamic_pricing.csv")
data.head()
[output: preview of the first five rows. 1000 rows, 10 columns: Number_of_Riders,
Number_of_Drivers, Location_Category, Customer_Loyalty_Status, Number_of_Past_Rides,
Average_Ratings, Time_of_Booking, Vehicle_Type, Expected_Ride_Duration, Historical_Cost_of_Ride]

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Number_of_Riders         1000 non-null   int64
 1   Number_of_Drivers        1000 non-null   int64
 2   Location_Category        1000 non-null   object
 3   Customer_Loyalty_Status  1000 non-null   object
 4   Number_of_Past_Rides     1000 non-null   int64
 5   Average_Ratings          1000 non-null   float64
 6   Time_of_Booking          1000 non-null   object
 7   Vehicle_Type             1000 non-null   object
 8   Expected_Ride_Duration   1000 non-null   int64
 9   Historical_Cost_of_Ride  1000 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 78.2+ KB

data.describe()

[output: descriptive statistics (count, mean, std, min, quartiles, max) for the six numeric columns]
In the absence of a complete column description, we need to make some assumptions. For the further
analysis we assume the following:

Number_of_Riders: The number of riders available at the time of booking, reflecting the market situation.
Number_of_Drivers: The number of drivers available at the time of booking, reflecting the market situation.
Location_Category: The category representing the geographical location where the ride was booked, such as Urban, Suburban, or Rural.
Customer_Loyalty_Status: The loyalty status of the customer towards the ride-sharing company, indicating whether the customer is a regular user or enrolled in a loyalty program.
Number_of_Past_Rides: The number of past rides taken by the customer, indicating their experience and familiarity with the service.
Average_Ratings: The average rating given by the customer for past rides, reflecting customer satisfaction and feedback.
Time_of_Booking: The time of the day when the ride was booked, categorized into different time slots such as Morning, Afternoon, Evening, or Night.
Vehicle_Type: The type of vehicle used for the ride, such as Premium, Economy, or other classes.
Expected_Ride_Duration: The expected duration of the ride in minutes.
Historical_Cost_of_Ride: The historical cost of past rides, indicating pricing patterns and customer spending.
numerical_data = data[['Number_of_Riders', 'Number_of_Drivers',
'Number_of_Past_Rides',
'Average_Ratings', 'Expected_Ride_Duration',
'Historical_Cost_of_Ride']]

sns.pairplot(numerical_data, diag_kind='hist')
plt.show()
sns.regplot(x='Expected_Ride_Duration', y='Historical_Cost_of_Ride',
            data=data, scatter=True, color='cornflowerblue',
            line_kws={"color": "green"})

plt.title('Scatterplot of Expected Ride Duration vs. Historical Cost of Ride with Trendline')
plt.xlabel('Expected Ride Duration')
plt.ylabel('Historical Cost of Ride')
plt.show()
cat = ['Location_Category', 'Customer_Loyalty_Status',
'Time_of_Booking', 'Vehicle_Type']

# create subplots
plt.figure(figsize=(12,10))

for i, c in enumerate(cat, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(y=data['Historical_Cost_of_Ride'], x=data[c], palette='GnBu')

plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()
plt.figure(figsize=(12, 10))

for i, c in enumerate(cat, 1):
    plt.subplot(2, 2, i)
    c_counts = data[c].value_counts()
    sns.barplot(x=c_counts.index, y=c_counts.values, palette='GnBu')

plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()
A look at the results:
• Urban rides are the cheapest
• Regular and Gold customers pay surprisingly near-identical prices
• Afternoon rides tend to be significantly more expensive
• There is a big difference between the vehicle types
correlation_matrix = numerical_data.corr()

plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='crest',
linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
Preprocessing
data = pd.get_dummies(data, columns= cat, dtype = int)
data.head()
[output: preview of the first five rows after encoding. The four categorical columns are replaced by
0/1 indicator columns such as Location_Category_Urban, Customer_Loyalty_Status_Gold,
Time_of_Booking_Night, and Vehicle_Type_Premium]
X = data.drop('Historical_Cost_of_Ride', axis = 1)
y = data['Historical_Cost_of_Ride']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =
0.3, random_state = 42)
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.cluster import KMeans

from sklearn.metrics import r2_score, mean_squared_error

Modeling
Upon completing exploratory data analysis (EDA) and gaining insights into the dataset, the next step
is to develop predictive models for price estimation. In this phase, we will explore the performance of
various machine learning algorithms, namely K-Nearest Neighbors, Random Forest, Linear
Regression, and Gradient Boosting, in predicting prices based on the available features. Through
rigorous evaluation, we aim to identify the most effective model for accurate price prediction, thereby
facilitating informed decision-making and enhancing the overall efficiency of our system.
# Creating a dictionary to store the results
results = {}
# In this section, we'll use the Elbow Method to determine the
# optimal number of clusters for the KMeans algorithm.
from sklearn.cluster import KMeans

wcss = []
max_cluster = 10

for c in range(1, max_cluster + 1):
    kmeans = KMeans(n_clusters=c, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, max_cluster+1), wcss, marker='o')


plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS - Within-Cluster-Sum-of-Squares')
plt.show()

# Linear regression specifically targeting features that exhibit
# linear relationships, as observed during the EDA phase.
lr = LinearRegression()
lr.fit(X_train[['Expected_Ride_Duration']], y_train)
y_pred_lr = lr.predict(X_test[['Expected_Ride_Duration']])

RMSE = mean_squared_error(y_test, y_pred_lr, squared=False)
r2 = r2_score(y_test, y_pred_lr)

results['Linear Regression'] = {'RMSE': RMSE.round(3), 'r2': r2.round(3)}
In this phase, we assess the performance of several machine learning models in predicting prices
based on the available features. The models under consideration include K-Nearest Neighbor,
Random Forest, and Gradient Boosting algorithms.
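The fitting code for these three models is not shown in the report; a minimal sketch of how they might be trained and scored follows. The synthetic data and variable names (X_syn, sketch_results, and so on) are invented here, chosen to avoid clashing with the notebook's own results dictionary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded feature matrix and fares (invented)
X_syn = rng.uniform(0, 1, size=(400, 5))
y_syn = 100 * X_syn[:, 0] + 50 * X_syn[:, 1] + rng.normal(0, 5, size=400)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

models = {
    'K-Nearest-Neighbor': KNeighborsRegressor(n_neighbors=5),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}

# Separate dictionary so the notebook's own `results` is left untouched
sketch_results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    sketch_results[name] = {
        'RMSE': round(float(np.sqrt(mean_squared_error(yte, pred))), 3),
        'r2': round(float(r2_score(yte, pred)), 3),
    }

for name, scores in sketch_results.items():
    print(name, scores)
```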
df_result = pd.DataFrame(results)
df_result

      Linear Regression  K-Nearest-Neighbor  Random Forest  Gradient Boosting
RMSE             73.508             425.927         74.971             72.572
r2                0.847              -4.143          0.841              0.851

Upon evaluating the machine learning models for price prediction, the following observations were
made:
Linear Regression: Linear Regression demonstrates reasonable performance, as evidenced by an
RMSE of 73.508 and an R² of 0.847. These metrics indicate that the model effectively explains a
substantial amount of variance in the data, suggesting its suitability for this predictive task.
K-Nearest Neighbor: In contrast, K-Nearest Neighbor exhibits poor performance relative to other
models. It yields a considerably high RMSE of 425.927 and a negative R² value, indicating that it fails
to accurately capture the underlying patterns in the data. Thus, K-Nearest Neighbor is not deemed
suitable for the task of price prediction in this context.
Random Forest and Gradient Boosting: Both Random Forest and Gradient Boosting models perform
comparably well, with RMSE values hovering around 74 and R² scores ranging between
0.84 and 0.85. These models effectively capture the underlying patterns in the data, demonstrating
their potential for accurate price prediction.

Future Enhancement:

 Contextual Factors: Dynamic pricing adjusts prices based on real-time market
conditions, including demand, supply, and competitor actions.
 Advanced Algorithms: Machine learning models, such as regression and decision
trees, continuously optimize pricing.
 Data Sources: Internal data (sales, customer behavior) and external signals (traffic,
conversions) inform pricing decisions.
 Competitive Advantage: Companies embracing dynamic pricing gain a sustained
edge in today's competitive landscape.
 Future Trends: Ongoing development of machine learning technology will further
shape real-time pricing.

Conclusion:
 Summary of key findings and contributions of the project.
 Reflection on the effectiveness and implications of the dynamic pricing
approach for fare prediction.
 Recommendations for future research and areas for further exploration in the
field of data-driven pricing analytics.
