Bike Sharing Prediction Project Structure
2. Problem Statement:
• Objective:
Predict the number of bikes rented at different hours of the day based on environmental
factors like temperature, humidity, wind speed, and whether it’s a working day or holiday.
• Goal:
Help bike-sharing companies optimize their bike distribution, improve availability, and
ensure customer satisfaction by predicting demand.
5. Code Implementation:
The code below plots the number of bikes rented against temperature:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
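# Load the dataset (file name assumed from the data-entry script in this report)
data = pd.read_csv('bike_sharing_data.csv')
# Data Visualization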
plt.figure(figsize=(10, 6))
plt.scatter(data['temperature'], data['bikes_rented'], alpha=0.6)
plt.title('Temperature vs Bikes Rented')
plt.xlabel('Temperature (°C)')
plt.ylabel('Number of Bikes Rented')
plt.show()
6. Future Improvements/Advancements:
• Hyperparameter Tuning & Model Evaluation:
Implement grid search or random search to fine-tune model parameters.
• Deep Learning:
Use deep learning models like neural networks or LSTMs for time series forecasting.
• Real-Time Data Updates:
Add real-time data fetching and updating functionality for dynamic predictions.
Output:
   hour  temperature   humidity  windspeed  holiday  working_day  bikes_rented
0     6     7.870921  62.052065  14.452489        0            0    192.811250
1    19    20.265501  44.090339   7.785484        1            0    186.177394
2    14    12.215764  41.548714  10.163134        0            0    229.449734
3    10    19.530956  48.109977   4.677958        0            0    240.951000
4     7     8.859068  97.786907  11.609871        0            1    147.003672
CODE IMPLEMENTATION:
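import joblib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
# Load the dataset and separate the features (X) from the target (y); file name is assumed
data = pd.read_csv('bike_sharing_data.csv')
X = data.drop(columns=['bikes_rented'])
y = data['bikes_rented']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features, fitting the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Tune the Random Forest with GridSearchCV (grid values are illustrative assumptions)
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10], 'min_samples_split': [2, 5]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
# Calculate Mean Squared Error and R-squared for the best model
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print(f'Best model MSE: {mse_best:.2f}, R-squared: {r2_best:.2f}')
# Optional: Save the model for future use (e.g., for deployment)
joblib.dump(best_model, 'best_bike_sharing_model.pkl')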
Output:
Explanation of the Code:
1. Imports:
o Libraries such as pandas, scikit-learn, matplotlib, and joblib are used to process data,
train models, evaluate, and visualize results.
2. Data Loading:
o The dataset is loaded from a CSV file, and the features (X) and target variable (y) are
extracted.
3. Data Preprocessing:
o The data is split into training and testing sets (80% for training and 20% for testing).
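o The features are standardized using StandardScaler. Tree-based models such as Random
Forest are largely insensitive to feature scale, so this step mainly matters if a
scale-sensitive model (e.g., linear regression or a neural network) is substituted later.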
4. Model Training:
o A Random Forest Regressor model is trained on the scaled training data.
o Predictions are made on the test set.
5. Model Evaluation:
o The model’s performance is evaluated using Mean Squared Error (MSE) and R-
squared (R²), which help determine how well the model fits the data.
6. Visualization:
o A scatter plot visualizes the predicted values vs. the actual values, helping assess how
well the model performs.
7. Hyperparameter Tuning (GridSearchCV):
o GridSearchCV is used to find the best hyperparameters for the Random Forest model.
It searches through different combinations of hyperparameters like the number of
trees (n_estimators), maximum depth of trees (max_depth), and minimum samples to
split a node (min_samples_split).
o After finding the best hyperparameters, the model is retrained, and performance is
evaluated again.
8. Saving the Model:
o The trained and tuned model is saved using joblib for future use, such as deployment
in a real-time prediction system.
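Steps 3 to 6 can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the report's original listing: the data is synthetic (column names follow the sample output, but the generating formula is invented), and StandardScaler and RandomForestRegressor are wrapped in a scikit-learn Pipeline so that the scaler is fitted on the training split only. The output file name predicted_vs_actual.png is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the report's dataset)
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temperature": rng.uniform(0, 35, n),
    "humidity": rng.uniform(20, 100, n),
    "windspeed": rng.uniform(0, 20, n),
    "holiday": rng.integers(0, 2, n),
    "working_day": rng.integers(0, 2, n),
})
# Hypothetical rental formula: warmer and drier hours get more rentals
data["bikes_rented"] = (
    100 + 5 * data["temperature"] - 0.5 * data["humidity"] + rng.normal(0, 10, n)
)

# Step 3: split into 80% training and 20% testing data
X = data.drop(columns=["bikes_rented"])
y = data["bikes_rented"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: scale and train in one pipeline, then predict on the test set
model = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 5: evaluate with MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R-squared: {r2:.2f}")

# Step 6: predicted vs. actual scatter plot; points near the dashed line are good predictions
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(y_test, y_pred, alpha=0.6)
limits = [y_test.min(), y_test.max()]
ax.plot(limits, limits, "r--")
ax.set_xlabel("Actual bikes rented")
ax.set_ylabel("Predicted bikes rented")
ax.set_title("Predicted vs. Actual")
fig.savefig("predicted_vs_actual.png")
```

Using a pipeline keeps the test split out of the scaler's statistics and lets the whole object be saved with joblib as a single artifact.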
Future Advancements:
1. Cross-Validation:
o Use cross_val_score to perform cross-validation to get a better estimate of model
performance.
2. Advanced Models:
o Consider using gradient-boosted tree models such as XGBoost or LightGBM, which often
outperform Random Forest on tabular regression tasks.
3. Time Series Modeling:
o Incorporate time series forecasting techniques (e.g., ARIMA, LSTM) to predict bike
rentals based on time-related patterns.
4. Real-Time Data Integration:
o Integrate real-time data (e.g., weather, traffic conditions) into the model to provide
up-to-date predictions for bike-sharing demand.
5. Deployment:
o Deploy the trained model using a Flask or FastAPI web application for real-time
prediction.
6. Automated Retraining:
o Set up a pipeline for automated retraining of the model as new data becomes
available to keep the model's predictions up-to-date.
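As an illustration of the cross-validation item, the following sketch uses scikit-learn's cross_val_score with 5 folds. The synthetic data and its generating formula are assumptions made so the example runs on its own; only the feature names echo the report's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the report's dataset)
rng = np.random.default_rng(0)
n = 300
temperature = rng.uniform(0, 35, n)
humidity = rng.uniform(20, 100, n)
X = np.column_stack([temperature, humidity])
y = 100 + 5 * temperature - 0.5 * humidity + rng.normal(0, 10, n)

# Train and score the model on 5 different train/validation splits
model = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

print(f"R-squared per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds shows how much a single 80/20 split could overstate or understate performance.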
CODE IMPLEMENTATION:
import pandas as pd
import numpy as np
# Define a function to get user input and save the data to CSV
def get_user_input():
    # Get input from the user
    hour = int(input("Enter the hour of the day (0-23): "))
    temperature = float(input("Enter the temperature (in Celsius): "))
    humidity = float(input("Enter the humidity (%): "))
    windspeed = float(input("Enter the windspeed (in m/s): "))
    holiday = int(input("Is it a holiday? (1 for Yes, 0 for No): "))
    working_day = int(input("Is it a working day? (1 for Yes, 0 for No): "))
    bikes_rented = float(input("Enter the number of bikes rented: "))
    # Put the new record into a single-row DataFrame
    new_df = pd.DataFrame([{
        'hour': hour,
        'temperature': temperature,
        'humidity': humidity,
        'windspeed': windspeed,
        'holiday': holiday,
        'working_day': working_day,
        'bikes_rented': bikes_rented
    }])
    try:
        # If the CSV exists, append the new data to it
        existing_df = pd.read_csv('bike_sharing_data.csv')
        updated_df = pd.concat([existing_df, new_df], ignore_index=True)
        updated_df.to_csv('bike_sharing_data.csv', index=False)
    except FileNotFoundError:
        # If the CSV doesn't exist, create a new one with the new data
        new_df.to_csv('bike_sharing_data.csv', index=False)
Future Advancements:
1. Data Validation:
o Implement checks to ensure the user inputs are within valid ranges, e.g., ensuring the
hour is between 0 and 23.
2. User Interface (UI):
o Consider creating a graphical user interface (GUI) using libraries like Tkinter or
PyQt for easier data input or even a web interface with Flask or Django.
3. Automated Data Aggregation:
o Add functionality to automatically aggregate data by time (e.g., average bike rentals
per day, per hour) for better data analysis.
4. Real-time Data Integration:
o Incorporate real-time data sources, such as weather APIs or bike-sharing systems, to
collect data automatically without manual input.
5. Data Preprocessing:
o Introduce preprocessing steps like normalization, encoding categorical features, or
handling missing values to prepare the dataset for predictive modeling.
6. Predictive Modeling:
o With enough data, you could apply machine learning algorithms (e.g., Random
Forest, Linear Regression, Time-series forecasting) to predict future bike rentals
based on input features.
7. Visualization:
o Visualize the data using plotting libraries like matplotlib or seaborn to identify
trends in bike rentals based on time, temperature, and other factors.
8. Data Storage and Security:
o For large datasets, consider moving from a CSV file to a more scalable database
solution (e.g., SQLite, PostgreSQL) for better performance and security.
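The data validation item can be sketched as a small helper that checks each field before a record is saved. The function name validate_record is hypothetical, and every range other than the 0 to 23 hour range is an assumption chosen for illustration.

```python
def validate_record(hour, temperature, humidity, windspeed, holiday, working_day, bikes_rented):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    if not 0 <= hour <= 23:
        errors.append("hour must be between 0 and 23")
    # Assumed plausible outdoor range in Celsius
    if not -50 <= temperature <= 60:
        errors.append("temperature must be between -50 and 60")
    if not 0 <= humidity <= 100:
        errors.append("humidity must be between 0 and 100")
    if windspeed < 0:
        errors.append("windspeed cannot be negative")
    if holiday not in (0, 1):
        errors.append("holiday must be 0 or 1")
    if working_day not in (0, 1):
        errors.append("working_day must be 0 or 1")
    if bikes_rented < 0:
        errors.append("bikes_rented cannot be negative")
    return errors


# An out-of-range hour is reported; a valid record yields no errors
print(validate_record(25, 18.5, 60.0, 3.2, 0, 1, 120))
print(validate_record(8, 18.5, 60.0, 3.2, 0, 1, 120))
```

Returning all errors at once, rather than raising on the first one, lets the input script show the user everything that needs correcting in a single pass.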
Code:
import matplotlib.pyplot as plt
data = pd.read_csv('bike_sharing_data.csv')
data.groupby('hour')['bikes_rented'].mean().plot()  # average rentals for each hour of the day
plt.show()
OUTPUT:
CODE IMPLEMENTATION:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
OUTPUT:
CODE IMPLEMENTATION:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
'windspeed': windspeed,
'holiday': holiday,
'working_day': working_day,
'bikes_rented': bikes_rented
})
EXPLANATION:
Synthetic Data Generation: It creates synthetic bike rental data based on features such as hour,
temperature, humidity, windspeed, holiday status, and working day status.
Adding weekday Column: It adds a column representing the day of the week, assuming that the
dataset starts on a Monday.
Adding epoch Column: It adds a column with the number of seconds since the Unix epoch (1970-
01-01), based on the datetime column.
Saving the Data: Finally, it saves the updated dataset to a CSV file
(updated_bike_sharing_data.csv).
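The four steps can be sketched as follows. This is an illustration under stated assumptions rather than the report's own listing: the hourly datetime column, its start date (2024-01-01, a Monday), the sample count, and the rental formula are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
num_samples = 48  # assumption: two days of hourly records

# Assumption: one record per hour, starting on Monday 2024-01-01
datetimes = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(num_samples), unit="h")

# Synthetic features (ranges are illustrative)
temperature = rng.uniform(0, 35, num_samples)
humidity = rng.uniform(20, 100, num_samples)
windspeed = rng.uniform(0, 20, num_samples)
holiday = rng.integers(0, 2, num_samples)
working_day = rng.integers(0, 2, num_samples)
bikes_rented = 100 + 5 * temperature - 0.5 * humidity + rng.normal(0, 10, num_samples)

data = pd.DataFrame({
    "datetime": datetimes,
    "hour": datetimes.hour,
    "temperature": temperature,
    "humidity": humidity,
    "windspeed": windspeed,
    "holiday": holiday,
    "working_day": working_day,
    "bikes_rented": bikes_rented,
})

# weekday: 0 = Monday, ..., 6 = Sunday
data["weekday"] = data["datetime"].dt.dayofweek

# epoch: whole seconds elapsed since 1970-01-01 00:00:00
data["epoch"] = (data["datetime"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Save the updated dataset
data.to_csv("updated_bike_sharing_data.csv", index=False)
print(data[["datetime", "weekday", "epoch"]].head(3))
```

Deriving weekday from an actual datetime column avoids having to assume which day the dataset starts on once real timestamps are available.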
CODE IMPLEMENTATION:
import pandas as pd
import numpy as np
from datetime import timedelta
# Bike companies
bike_companies = ['Hero', 'Honda', 'Pulsar', 'Bajaj', 'TVS Scooty', 'Honda Activa', 'Elecfasion']
'windspeed': windspeed,
'holiday': holiday,
'working_day': working_day,
'bikes_rented': bikes_rented
})
# Bike Name - Randomly selecting from predefined companies and generating bike names
bike_names = []
for company in bike_companies:
    bike_names += [f'{company}-{i}' for i in range(1, 101)]  # Creating 100 bikes per company
# Enrolling Date (random date between January 2020 and January 2023)
enrollment_dates = pd.to_datetime(
    np.random.choice(pd.date_range('2020-01-01', '2023-01-01', freq='D'), size=num_samples)
)
data['enrollment_date'] = enrollment_dates
# Servicing Time (a random 90 to 119 days, roughly 3 months, after the enrollment date)
data['next_servicing'] = data['enrollment_date'] + pd.to_timedelta(
    np.random.randint(90, 120, size=num_samples), unit='D'
)
OUTPUT:
   hour  temperature   humidity  windspeed  holiday  working_day  \
0     6     7.870921  62.052065  14.452489        0            0
1    19    20.265501  44.090339   7.785484        1            0
2    14    12.215764  41.548714  10.163134        0            0
3    10    19.530956  48.109977   4.677958        0            0
4     7     8.859068  97.786907  11.609871        0            1
# Group by bike company and calculate the average number of bikes rented
average_rentals = data.groupby('bike_company')['bikes_rented'].mean()
OUTPUT:
The best bike company based on the highest average bike rentals is: Honda Activa
With an average of 187.26 bikes rented.
EXPLANATION:
Load the dataset: Reads the CSV file updated_bike_sharing_with_bike_details.csv that contains
the bike-sharing data.
Group by bike company: It groups the data by bike_company and calculates the mean number of
bikes rented for each company.
Identify the best company: Finds the bike company with the highest average number of bikes
rented using idxmax() and max().
Display results: Prints out the name of the best company and its average rentals, formatted to two
decimal places.
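A minimal sketch of these four steps follows. To stay self-contained it builds a small in-memory table instead of reading updated_bike_sharing_with_bike_details.csv; the rows and rental figures are made up for illustration and are not the report's data.

```python
import pandas as pd

# Made-up stand-in for the CSV contents (assumption)
data = pd.DataFrame({
    "bike_company": ["Hero", "Hero", "Honda", "Honda", "Bajaj", "Bajaj"],
    "bikes_rented": [150.0, 170.0, 200.0, 180.0, 120.0, 140.0],
})

# Group by bike company and calculate the average number of bikes rented
average_rentals = data.groupby("bike_company")["bikes_rented"].mean()

# idxmax() gives the company label, max() gives its average
best_company = average_rentals.idxmax()
best_average = average_rentals.max()

print(f"The best bike company based on the highest average bike rentals is: {best_company}")
print(f"With an average of {best_average:.2f} bikes rented.")
```

With the real dataset, the DataFrame construction would be replaced by a pd.read_csv call on the file named above.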
CODE IMPLEMENTATION:
import pandas as pd
import numpy as np
from datetime import timedelta
# Bike companies
bike_companies = ['Hero', 'Honda', 'Pulsar', 'Bajaj', 'TVS Scooty', 'Honda Activa', 'Elecfasion']
'holiday': holiday,
'working_day': working_day,
'bikes_rented': bikes_rented
})
# Bike Name - Randomly selecting from predefined companies and generating bike names
bike_names = []
for company in bike_companies:
    bike_names += [f'{company}-{i}' for i in range(1, 101)]  # Creating 100 bikes per company
# Enrolling Date (random date between January 2020 and January 2023)
enrollment_dates = pd.to_datetime(
    np.random.choice(pd.date_range('2020-01-01', '2023-01-01', freq='D'), size=num_samples)
)
data['enrollment_date'] = enrollment_dates
# Servicing Time (a random 90 to 119 days, roughly 3 months, after the enrollment date)
data['next_servicing'] = data['enrollment_date'] + pd.to_timedelta(
    np.random.randint(90, 120, size=num_samples), unit='D'
)
Output:
CODE IMPLEMENTATION:
import pandas as pd
# Function to interact with the user and fetch the relevant bike rental information
def get_rental_info():
    # Ask the user to provide some criteria for filtering
    print("Available bike companies: Hero, Honda, Pulsar, Bajaj, TVS Scooty, Honda Activa, Elecfasion")
    company_name = input("Enter the bike company name to see rentals (or type 'exit' to quit): ").strip()
    if company_name.lower() == 'exit':
        return
    # Assumed step, not shown in the original listing: load the data and filter by company
    data = pd.read_csv('updated_bike_sharing_with_bike_details.csv')
    filtered_data = data[data['bike_company'].str.lower() == company_name.lower()]
    if filtered_data.empty:
        print("No data found for this company. Please check the name and try again.")
        return
Please provide your feedback on the service (e.g., 'Good service', 'Needs improvement'): 1
Rate the service (1-5, with 5 being the best): 5