Uber Trip Analysis Machine Learning Project (Data Analyst)
Uber Trip Analysis Machine Learning Project (Data Analyst)
Uber Trip Analysis Machine Learning Project (Data Analyst)
Dataset : Dataset is available in the given link. You can download it at your
convenience.
FiveThirtyEight obtained the data from the NYC Taxi & Limousine Commission (TLC) by submitting a Freedom of
Information Law request on July 20, 2015. The TLC has sent us the data in batches as it continues to review trip
data Uber and other HFV companies have submitted to it. The TLC's correspondence with FiveThirtyEight is
included in the files TLC_letter.pdf, TLC_letter2.pdf and TLC_letter3.pdf. TLC records requests can be
made here.
This data was used for four FiveThirtyEight stories: Uber Is Serving New York’s Outer Boroughs More Than Taxis
Are, Public Transit Should Be Uber’s New Best Friend, Uber Is Taking Millions Of Manhattan Rides Away From
Taxis, and Is Uber Making NYC Rush-Hour Traffic Worse?.
The Data
The dataset contains, roughly, four groups of files:
● Uber trip data from 2014 (April - September), separated by month, with detailed location information
● Uber trip data from 2015 (January - June), with less fine-grained location information
● non-Uber FHV (For-Hire Vehicle) trips. The trip information varies by company, but can include day of trip,
time of trip, pickup location, driver's for-hire license number, and vehicle's for-hire license number.
● aggregate ride and vehicle statistics for all FHV companies (and, occasionally, for taxi companies)
● uber-raw-data-apr14.csv
● uber-raw-data-aug14.csv
● uber-raw-data-jul14.csv
● uber-raw-data-jun14.csv
● uber-raw-data-may14.csv
● uber-raw-data-sep14.csv
B02512 : Unter
B02598 : Hinter
B02617 : Weiter
B02682 : Schmecken
B02764 : Danach-NY
B02765 : Grun
B02835 : Dreist
B02836 : Drinnen
For coarse-grained location information from these pickups, the file taxi-zone-lookup.csv shows the taxi Zone
● American_B01362.csv
● Diplo_B01196.csv
● Highclass_B01717.csv
● Skyline_B00111.csv
● Carmel_B00256.csv
● Federal_02216.csv
● Lyft_B02510.csv
● Dial7_B00887.csv
● Firstclass_B01536.csv
● Prestige_B01338.csv
Aggregate Statistics
There is also a file other-FHV-data-jan-aug-2015.csv containing daily pickup data for 329 FHV companies
from January 2015 through August 2015.
The file Uber-Jan-Feb-FOIL.csv contains aggregated daily Uber trip statistics in January and February 2015.
Project Overview
The goal of this project is to analyze Uber trip data to identify patterns and build a
predictive model for trip demand. The analysis will cover various aspects such as popular
pickup times, busiest days, and fare prediction.
Dataset
The dataset used for this project is typically Uber's trip data, which includes details such
as:
Uber provides various datasets on platforms like Kaggle, which you can download for
analysis.
1. Data Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Model Building
5. Model Evaluation
6. Visualization
Implementation Code
Here is a sample implementation in Python:
# Data Preprocessing
# Convert Date/Time to datetime object
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
# Model Building
# Train a Random Forest Regressor
rfr = RandomForestRegressor(random_state=42)
rfr.fit(X_train, y_train)
# Model Evaluation
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))
# Visualization of Predictions
plt.figure(figsize=(10,6))
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel('Actual Trips')
plt.ylabel('Predicted Trips')
plt.title('Actual vs Predicted Trips')
plt.show()
Explanation of Code
1. Data Preprocessing:
○ Load the dataset and convert the 'Date/Time' column to a datetime object.
○ Extract useful information like hour, day, day of the week, and month from the
'Date/Time' column.
2. Exploratory Data Analysis (EDA):
○ Visualize the number of trips per hour and per day of the week using count
plots.
3. Feature Engineering:
○ Create dummy variables for the categorical feature 'Base'.
○ Define the feature set X and the target variable y.
4. Model Building:
○ Split the data into training and testing sets.
○ Train a Random Forest Regressor on the training data.
○ Predict the number of trips on the test data.
5. Model Evaluation:
○ Evaluate the model using Mean Squared Error (MSE) and R² score.
○ Visualize the actual vs predicted trips to assess model performance.
Additional Resources
This implementation provides a framework for analyzing and predicting Uber trips. You
can extend it by adding more features, trying different models, and improving feature
engineering techniques.
Sample code
This notebook aims to delve into the predictive power of XGBoost, Random Forest and Gradient Boosted Tree
Regressor in forecasting Uber trips using historical data from 2014.
While there are currently other state-of-the-art strategies to predict univariate time series, this Notebook intends to
provide a machine learning oriented approach to time series forecasting as an alternative tool.
Objectives
● Data Exploration and Preprocessing: Understand and prepare the 2014 Uber trip data for model training.
● Model Training: Train three distinct types of models—XGBoost, GBTR and Random Forests
networks—using the 2014 data.
● Model Evaluation: Assess the performance of each model using Mean Average Percentage Error as the
main evaluation metric.
● Ensemble Techniques: Explore ensemble methods to combine the strengths of the individual models and
enhance forecasting accuracy.
● Comparative Analysis: Provide a comparative analysis of the forecasting capabilities of the models and
the ensemble approach.
In [1]:
import warnings
warnings.filterwarnings("ignore")
import os
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from xgboost import plot_importance, plot_tree
from sklearn.model_selection import train_test_split
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV,
TimeSeriesSplit
In [2]:
def PlotDecomposition(result):
plt.figure(figsize=(22,18))
plt.subplot(4,1,1)
plt.plot(result.observed,label='Observed',lw=1)
plt.legend(loc='upper left')
plt.subplot(4,1,2)
plt.plot(result.trend,label='Trend',lw=1)
plt.legend(loc='upper left')
plt.subplot(4, 1, 3)
plt.plot(result.seasonal, label='Seasonality',lw=1)
plt.legend(loc='upper left')
plt.subplot(4, 1, 4)
plt.plot(result.resid, label='Residuals',lw=1)
plt.legend(loc='upper left')
plt.show()
def CalculateError(pred,sales):
percentual_errors = []
for A_i, B_i in zip(sales, pred):
percentual_error = abs((A_i - B_i) / B_i)
percentual_errors.append(percentual_error)
return sum(percentual_errors) / len(percentual_errors)
def PlotPredictions(plots,title):
plt.figure(figsize=(18, 8))
for plot in plots:
plt.plot(plot[0], plot[1], label=plot[2], linestyle=plot[3],
color=plot[4],lw=1)
plt.xlabel('Date')
plt.ylabel("Trips")
plt.title(title)
plt.legend()
plt.xticks(rotation=30, ha='right')
plt.show()
def create_lagged_features(data, window_size):
X, y = [], []
for i in range(len(data) - window_size):
X.append(data[i:i+window_size])
y.append(data[i+window_size])
return np.array(X), np.array(y)
In [3]:
files = []
In [4]:
In [5]:
In [6]:
uber2014.head()
Out[6]:
Count
Date
2014-04-01 01:00:00 66
2014-04-01 02:00:00 53
2014-04-01 03:00:00 93
In [7]:
print(uber2014.index.min())
print(uber2014.index.max())
2014-04-01 00:00:00
2014-09-30 22:00:00
In [8]:
result=seasonal_decompose(uber2014['Count'],model='add', period=24*1)
PlotDecomposition(result)
In [10]:
In [11]:
uber2014_train = uber2014.loc[:cutoff_date]
uber2014_test = uber2014.loc[cutoff_date:]
In [12]:
uber2014_test.rename(columns={'Count':'TEST
SET'}).join(uber2014_train.rename(columns={'Count':'TRAINING SET'}),
how='outer').plot(figsize=(15,5),title='Train / Test Sets', style='-',lw=1)
Out[12]:
In [14]:
test_data = np.concatenate([uber2014_train['Count'].values[-window_size:],
uber2014_test['Count'].values])
X_test, y_test = create_lagged_features(test_data, window_size)
In [15]:
seed = 12345
In [16]:
tscv = TimeSeriesSplit(n_splits=5)
In [17]:
xgb_param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 6, 9],
'learning_rate': [0.01, 0.1, 0.3],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0]
}
In [18]:
In [19]:
Out[19]:
GridSearchCV
estimator: XGBRegressor
XGBRegressor
In [20]:
In [21]:
xgb_predictions = xgb_grid_search.best_estimator_.predict(X_test)
In [22]:
PlotPredictions([
(uber2014_test.index,uber2014_test['Count'],'Test','-','darkslateblue'),
(uber2014_test.index,xgb_predictions,'XGBoost Predictions','--','red')],
'Uber 2014 Trips: XGBoost Predictions vs Test')
In [23]:
In [24]:
rf_param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': [None, 'sqrt', 'log2']
}
In [25]:
rf_model = RandomForestRegressor(random_state=seed)
In [26]:
Out[26]:
GridSearchCV
estimator: RandomForestRegressor
RandomForestRegressor
In [27]:
In [28]:
rf_predictions = rf_grid_search.best_estimator_.predict(X_test)
In [29]:
PlotPredictions([
(uber2014_test.index,uber2014_test['Count'],'Test','-','gray'),
(uber2014_test.index,rf_predictions,'Random Forest Predictions','--','green')],
'Uber 2014 Trips: Random Forest Predictions vs Test')
In [30]:
gbr_param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 4, 5],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2']
}
In [32]:
gbr_model = GradientBoostingRegressor(random_state=seed)
In [33]:
Out[33]:
GridSearchCV
estimator: GradientBoostingRegressor
GradientBoostingRegressor
In [34]:
In [35]:
gbr_predictions = gbr_grid_search.best_estimator_.predict(X_test)
In [36]:
PlotPredictions([
(uber2014_test.index,uber2014_test['Count'],'Test','-','gray'),
(uber2014_test.index,gbr_predictions,'GBRT Predictions','--','orange')],
'Uber 2014 Trips: GBRT Predictions vs Test')
In [37]:
PlotPredictions([
(uber2014_test.index,uber2014_test['Count'],'Test','-','gray'),
(uber2014_test.index,xgb_predictions,'XGBoost Predictions','--','red'),
(uber2014_test.index,gbr_predictions,'GBRT Predictions','--','orange'),
(uber2014_test.index,rf_predictions,'Random Forest Predictions','--','green')],
'Uber 2014 Trips: All Models Predictions vs Test')
The above plot shows how all algorithms have actually being very close to predicting the test set. Visually, we can
safely assume that using either algorithm could be a safe bet. The last step is to try an ensemble to
8. Ensemble
Building the ensemble requires to understand how each algorithm has performed individually first. Then, decide
how we can leverage each one's strenghts to our advantage.
In [39]:
print(f'XGBoost MAPE:\t\t\t{xgb_mape:.2%}')
print(f'Random Forest MAPE:\t\t{rf_mape:.2%}')
print(f'GBTR Percentage Error:\t\t{gbr_mape:.2%}')
Convert MAPE scores to weights: Since MAPE is inversely related to model performance, we can use the
reciprocal of MAPE as a starting point for determining the weights. Normalize these reciprocals to get the weights.
The ensemble prediction formula can be expressed as follows:
After doing the sum of all of them and applying each one's weight, we come up with the following formula:
Ensemble Prediction = 0.368 XGBoost Prediction + 0.322 Random Forest Prediction + 0.310 *
GBTR Prediction
In [40]:
# Weights
weights = np.array([0.368, 0.322, 0.310])
PlotPredictions([
(uber2014_test.index,uber2014_test['Count'],'Test','-','gray'),
(uber2014_test.index,ensemble_predictions,'Ensemble
Predictions','--','purple')],
'Uber 2014 Trips: Ensemble Predictions vs Test')
In [42]:
In [43]:
print(f'XGBoost MAPE:\t\t{xgb_mape:.2%}')
print(f'Random Forest MAPE:\t{rf_mape:.2%}')
print(f'GBTR MAPE:\t\t{gbr_mape:.2%}')
print(f'Ensemble MAPE:\t\t{ensemble_mape:.2%}')
linkcode
9. Insights and Conclusions from Training and Evaluation
● XGBoost: With a MAPE of 8.37%, XGBoost remains the top-performing model, effectively capturing
patterns in the Uber Trip 2014 data. Its strong performance highlights its ability to manage complex
interactions and temporal dependencies.
● Random Forest: Recorded a MAPE of 9.61%, showing good performance. This model effectively utilizes
the window-based logic to capture time-dependent variations in the data.
● Gradient Boosted Tree Regressor (GBTR): Achieved a MAPE of 10.02%, indicating reasonable
performance, although it does not match the effectiveness of XGBoost or Random Forest.
Ensemble Model:
● The ensemble model achieved a MAPE of 8.60%, which is an improvement over both Random Forest and
GBTR. This performance showcases the ensemble's ability to integrate the strengths of the individual
models while providing robust and stable predictions.
● The ensemble combines predictions from XGBoost, Random Forest, and GBTR, capitalizing on the
complementary strengths of each model.
● Applying window-based logic to model training has effectively captured temporal dependencies in the
data, resulting in enhanced predictive accuracy across all models.
● This approach ensures that the models can better handle seasonality and trends, which is crucial for
accurate time series forecasting, particularly in dynamic contexts like ride-sharing demand.
Cross-Validation and Parameter Tuning:
● Cross-validation has provided a reliable assessment of model performance in temporal contexts, ensuring
robustness and reducing the risk of overfitting.
● Parameter tuning, particularly for XGBoost and GBTR, has likely contributed to their strong performances,
reflecting effective optimization efforts.
Practical Implications:
● For practical applications, XGBoost is recommended for scenarios where achieving the lowest error is
critical due to its superior MAPE.
● The ensemble model serves as a strong alternative, providing improved predictive performance over the
individual models, particularly useful for scenarios requiring stability and reliability.
Final Conclusion
The training and evaluation of these models underscore the effectiveness of XGBoost, with its best-in-class
MAPE of 8.37%. The ensemble model, achieving a MAPE of 8.60%, effectively combines the strengths of the
individual models, resulting in robust and reliable predictions. These findings highlight the importance of
considering temporal structures in time series data and lay a strong foundation for future predictive modeling
efforts in similar applications.
Reference link