Uber - Rides - Analysis - Jupyter Notebook

This document analyzes Uber ride data from a dataset containing over 1,000 rides. The analysis includes data cleaning, exploring categorical variables through count plots, and visualizing correlations between features. Key insights are that most rides are for business purposes, meetings and meals are common purposes, and rides are most frequent in the afternoons. The data is preprocessed and encoded before further analysis of relationships between variables.

9/12/23, 12:13 AM uber_rides_analysis - Jupyter Notebook

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

dataset = pd.read_csv("UberDataset.csv")
dataset.head()

Out[2]:

   START_DATE        END_DATE          CATEGORY  START        STOP             MILES  PURPOSE
0  01-01-2016 21:11  01-01-2016 21:17  Business  Fort Pierce  Fort Pierce        5.1  Meal/Entertain
1  01-02-2016 01:25  01-02-2016 01:37  Business  Fort Pierce  Fort Pierce        5.0  NaN
2  01-02-2016 20:25  01-02-2016 20:38  Business  Fort Pierce  Fort Pierce        4.8  Errand/Supplies
3  01-05-2016 17:31  01-05-2016 17:45  Business  Fort Pierce  Fort Pierce        4.7  Meeting
4  01-06-2016 14:42  01-06-2016 15:49  Business  Fort Pierce  West Palm Beach   63.7  Customer Visit

In [4]:

dataset.shape

Out[4]:

(1156, 7)

In [5]:

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 START_DATE 1156 non-null object
1 END_DATE 1155 non-null object
2 CATEGORY 1155 non-null object
3 START 1155 non-null object
4 STOP 1155 non-null object
5 MILES 1156 non-null float64
6 PURPOSE 653 non-null object
dtypes: float64(1), object(6)
memory usage: 63.3+ KB
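
The `info()` output lists non-null counts, so the missing values per column are 1156 minus each count (for example, 1156 - 653 = 503 for PURPOSE). A minimal sketch of getting those numbers directly with `isna().sum()` follows; the `sample` frame is a made-up stand-in for `dataset`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mimicking some of the columns above (values are made up)
sample = pd.DataFrame({
    "START_DATE": ["01-01-2016 21:11", "01-02-2016 01:25", "01-02-2016 20:25"],
    "END_DATE": ["01-01-2016 21:17", "01-02-2016 01:37", np.nan],
    "MILES": [5.1, 5.0, 4.8],
    "PURPOSE": ["Meal/Entertain", np.nan, np.nan],
})

# Count missing entries per column
missing = sample.isna().sum()
print(missing)
```

Running the same call on `dataset` would be expected to show where the gaps are before deciding how to fill or drop them.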

In [6]:

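# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply
dataset['PURPOSE'] = dataset['PURPOSE'].fillna("NOT")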

In [7]:

dataset.head()

Out[7]:

   START_DATE        END_DATE          CATEGORY  START        STOP             MILES  PURPOSE
0  01-01-2016 21:11  01-01-2016 21:17  Business  Fort Pierce  Fort Pierce        5.1  Meal/Entertain
1  01-02-2016 01:25  01-02-2016 01:37  Business  Fort Pierce  Fort Pierce        5.0  NOT
2  01-02-2016 20:25  01-02-2016 20:38  Business  Fort Pierce  Fort Pierce        4.8  Errand/Supplies
3  01-05-2016 17:31  01-05-2016 17:45  Business  Fort Pierce  Fort Pierce        4.7  Meeting
4  01-06-2016 14:42  01-06-2016 15:49  Business  Fort Pierce  West Palm Beach   63.7  Customer Visit

In [8]:

# Values that cannot be parsed as dates become NaT
dataset['START_DATE'] = pd.to_datetime(dataset['START_DATE'],
                                       errors='coerce')
dataset['END_DATE'] = pd.to_datetime(dataset['END_DATE'],
                                     errors='coerce')
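
As a rough illustration of what `errors='coerce'` does, the sketch below parses a few strings shaped like the ones in the table above. The explicit `format` string assumes month-first dates (`MM-DD-YYYY HH:MM`); that is an assumption about the file, not something the notebook states, and the `raw` series is made up.

```python
import pandas as pd

# Made-up strings in the same shape as START_DATE, plus one unparseable value
raw = pd.Series(["01-01-2016 21:11", "01-02-2016 01:25", "not a date"])

# Assumed format: month-day-year hour:minute
parsed = pd.to_datetime(raw, format="%m-%d-%Y %H:%M", errors="coerce")

# Unparseable entries become NaT rather than raising an error
n_failed = parsed.isna().sum()
print(parsed)
print("rows that failed to parse:", n_failed)
```

Counting the `NaT` values after conversion is a cheap way to see how many rows a later `dropna` would remove.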

In [9]:

dataset['date'] = pd.DatetimeIndex(dataset['START_DATE']).date
dataset['time'] = pd.DatetimeIndex(dataset['START_DATE']).hour

# Bin the start hour into parts of the day. Bins are right-closed:
# (0, 10] Morning, (10, 15] Afternoon, (15, 19] Evening, (19, 24] Night.
# Hour 0 falls outside the first bin and becomes NaN.
dataset['day-night'] = pd.cut(x=dataset['time'],
                              bins=[0, 10, 15, 19, 24],
                              labels=['Morning', 'Afternoon', 'Evening', 'Night'])
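
To see how `pd.cut` assigns hours with these bin edges, here is a small sketch on a made-up series of hours. It uses the same bins and labels as the cell above; the `include_lowest=True` variant is shown only as one possible way to keep hour 0, not as what the notebook does.

```python
import pandas as pd

# Made-up start hours covering the bin edges
hours = pd.Series([0, 1, 10, 11, 15, 16, 19, 20, 23])
bins = [0, 10, 15, 19, 24]
labels = ['Morning', 'Afternoon', 'Evening', 'Night']

# Default behaviour: intervals are open on the left, closed on the right
parts = pd.cut(x=hours, bins=bins, labels=labels)

# Variant that makes the first interval include its left edge (hour 0)
parts_with_zero = pd.cut(x=hours, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"hour": hours, "default": parts, "include_lowest": parts_with_zero}))
```

With the default settings, hour 10 is still Morning, hour 11 is the first Afternoon hour, and hour 0 is left unassigned.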

In [10]:

dataset.dropna(inplace=True)

In [11]:

dataset.drop_duplicates(inplace=True)

In [13]:

# Boolean mask of the columns with object dtype
obj = (dataset.dtypes == 'object')
object_cols = list(obj[obj].index)

# Number of distinct values in each object column
unique_values = {}
for col in object_cols:
    unique_values[col] = dataset[col].unique().size
unique_values

Out[13]:

{'CATEGORY': 2, 'START': 175, 'STOP': 186, 'PURPOSE': 11, 'date': 291}
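
The same kind of per-column count can also be obtained without an explicit loop. The sketch below uses `DataFrame.nunique()` on a made-up frame; note that `nunique()` ignores missing values by default, whereas `unique().size` counts NaN as a value, so the two only agree once missing values have been handled.

```python
import pandas as pd

# Made-up stand-in for the cleaned dataset
sample = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Personal"],
    "PURPOSE": ["Meeting", "NOT", "Meeting"],
    "MILES": [5.1, 5.0, 4.8],
})

# Distinct values per selected column, returned as a plain dict
counts = sample[["CATEGORY", "PURPOSE"]].nunique().to_dict()
print(counts)
```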

Data Visualization
In [18]:

plt.figure(figsize=(10,5))

plt.subplot(1,2,1)
sns.countplot(data=dataset, x='CATEGORY')
plt.xticks(rotation=90)

plt.subplot(1,2,2)
sns.countplot(data=dataset, x='PURPOSE')
plt.xticks(rotation=90)

Out[18]:

(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
[Text(0, 0, 'Meal/Entertain'),
Text(1, 0, 'NOT'),
Text(2, 0, 'Errand/Supplies'),
Text(3, 0, 'Meeting'),
Text(4, 0, 'Customer Visit'),
Text(5, 0, 'Temporary Site'),
Text(6, 0, 'Between Offices'),
Text(7, 0, 'Charity ($)'),
Text(8, 0, 'Commute'),
Text(9, 0, 'Moving'),
Text(10, 0, 'Airport/Travel')])

In [20]:

sns.countplot(data=dataset,x='day-night')
plt.xticks(rotation=90)

Out[20]:

(array([0, 1, 2, 3]),
[Text(0, 0, 'Morning'),
Text(1, 0, 'Afternoon'),
Text(2, 0, 'Evening'),
Text(3, 0, 'Night')])

In [21]:

plt.figure(figsize=(15, 5))
sns.countplot(data=dataset, x='PURPOSE', hue='CATEGORY')
plt.xticks(rotation=90)
plt.show()

Insights from the above count plots:

Most rides are booked for business purposes.

Meeting and Meal/Entertain are the most common stated purposes.

Most rides are booked in the Afternoon, which the bins above define as start hours 11 through 15 (11:00 to 15:59).
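
The count plots give a visual impression; the underlying shares can also be tabulated. This is a minimal sketch on a made-up frame, using `value_counts(normalize=True)` for proportions and `pd.crosstab` for the purpose-by-category breakdown that the last plot shows.

```python
import pandas as pd

# Made-up stand-in for the cleaned dataset
sample = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Business", "Personal"],
    "PURPOSE": ["Meeting", "Meal/Entertain", "Meeting", "NOT"],
})

# Fraction of rides in each category
category_share = sample["CATEGORY"].value_counts(normalize=True)

# Ride counts for every (purpose, category) pair
purpose_by_category = pd.crosstab(sample["PURPOSE"], sample["CATEGORY"])

print(category_share)
print(purpose_by_category)
```

Applied to `dataset` before encoding, the same two calls would put numbers behind statements such as "most rides are for business".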

In [23]:

from sklearn.preprocessing import OneHotEncoder

object_cols = ['CATEGORY', 'PURPOSE']
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 (the old name is removed in 1.4)
OH_encoder = OneHotEncoder(sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(dataset[object_cols]))
OH_cols.index = dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
# Replace the original categorical columns with their one-hot encoded versions
df_final = dataset.drop(object_cols, axis=1)
dataset = pd.concat([df_final, OH_cols], axis=1)
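
To make the result of one-hot encoding concrete, the sketch below encodes a made-up three-row frame. It uses `pd.get_dummies` rather than scikit-learn's `OneHotEncoder`, purely to keep the example dependency-free; the resulting column layout (one 0/1 column per category value) is the same idea.

```python
import pandas as pd

# Made-up stand-in for the dataset
sample = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "PURPOSE": ["Meeting", "NOT", "Customer Visit"],
    "MILES": [4.7, 5.0, 63.7],
})

# Each listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(sample, columns=["CATEGORY", "PURPOSE"], dtype=float)
print(encoded)
```

Each row has exactly one 1.0 among the `CATEGORY_*` columns and exactly one among the `PURPOSE_*` columns.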

In [24]:

plt.figure(figsize=(12, 6))
# numeric_only=True restricts the correlation to numeric columns, which
# newer pandas versions require when the frame also holds dates and strings
sns.heatmap(dataset.corr(numeric_only=True),
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)

Out[24]:

<Axes: >

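Insights from the heatmap:

The Business and Personal category columns are strongly negatively correlated. This is expected: after one-hot encoding, each ride is exactly one of the two, so the columns are complements of each other. The plot is consistent with the earlier count plots.

There is not much correlation between the other features.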
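
The strong negative correlation between the two category columns follows from how they are built. When a variable has only two values, its two indicator columns always sum to 1, and the Pearson correlation between a column and its complement is -1. A small sketch on made-up indicator values:

```python
import pandas as pd

# Made-up indicator column: 1.0 where a ride is Business, 0.0 otherwise
is_business = pd.Series([1.0, 1.0, 0.0, 1.0, 0.0])

# With only two categories, the Personal indicator is the complement
is_personal = 1.0 - is_business

# Pearson correlation between the two indicator columns
r = is_business.corr(is_personal)
print(r)
```

For this reason the Business/Personal cell of the heatmap carries no new information beyond the fact that there are two categories.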

In [25]:

dataset['MONTH'] = pd.DatetimeIndex(dataset['START_DATE']).month
month_label = {1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'April',
               5.0: 'May', 6.0: 'June', 7.0: 'July', 8.0: 'Aug',
               9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'}
dataset["MONTH"] = dataset.MONTH.map(month_label)

# Number of rides per month
mon = dataset.MONTH.value_counts(sort=False)

# Month total rides count vs Month ride max count:
# "MONTHS" holds the number of rides in each month, "VALUE COUNT" holds the
# longest ride (max MILES) in that month. Reindexing aligns both by month label.
max_miles = dataset.groupby('MONTH', sort=False)['MILES'].max()
df = pd.DataFrame({"MONTHS": mon.reindex(max_miles.index),
                   "VALUE COUNT": max_miles})

p = sns.lineplot(data=df)
p.set(xlabel="MONTHS", ylabel="VALUE COUNT")

Out[25]:

[Text(0.5, 0, 'MONTHS'), Text(0, 0.5, 'VALUE COUNT')]

Insights from the above plot:

The counts are very irregular.

Even so, the counts are clearly lower during Nov, Dec, and Jan, which coincides with winter in Florida, US.
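
Seasonal patterns like this are easier to read when the months appear in calendar order. The sketch below shows one possible way to do that by reindexing monthly counts against an explicit month list; the `months` series is made up, and the labels match the `month_label` values used above.

```python
import pandas as pd

# Calendar order, using the same labels as month_label
month_order = ['Jan', 'Feb', 'Mar', 'April', 'May', 'June',
               'July', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Made-up month labels, one per ride
months = pd.Series(['Feb', 'Jan', 'Dec', 'Jan', 'Feb', 'Feb'])

# Count rides per month, then force calendar order (months with no rides get 0)
rides_per_month = months.value_counts().reindex(month_order, fill_value=0)
print(rides_per_month)
```

Passing a series ordered this way to the plotting call should lay out the x-axis from Jan to Dec rather than in order of first appearance.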

In [26]:

dataset['DAY'] = dataset.START_DATE.dt.weekday
# dt.weekday numbers the days from 0 (Monday) to 6 (Sunday)
day_label = {
    0: 'Mon', 1: 'Tues', 2: 'Wed', 3: 'Thurs', 4: 'Fri', 5: 'Sat', 6: 'Sun'
}
dataset['DAY'] = dataset['DAY'].map(day_label)

In [27]:

day_label = dataset.DAY.value_counts()
sns.barplot(x=day_label.index, y=day_label);
plt.xlabel('DAY')
plt.ylabel('COUNT')

Out[27]:

Text(0, 0.5, 'COUNT')

In [28]:

sns.boxplot(dataset['MILES'])

Out[28]:

<Axes: >

In [29]:

sns.boxplot(dataset[dataset['MILES']<100]['MILES'])

Out[29]:

<Axes: >

In [30]:

# `distplot` is deprecated (scheduled for removal in seaborn v0.14.0);
# `histplot` with a KDE overlay on a density scale is the closest replacement
sns.histplot(dataset[dataset['MILES']<40]['MILES'], kde=True, stat='density')

Out[30]:

<Axes: xlabel='MILES', ylabel='Density'>

Insights from the above plots:

Most rides cover a distance of about 4-5 miles.

The majority of rides fall in the 0-20 mile range.

Rides longer than 20 miles are comparatively rare.
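
Statements about distance ranges can be checked numerically by binning `MILES` and looking at the share of rides per band. This is a sketch on made-up distances; the band edges are illustrative choices, not values taken from the notebook.

```python
import pandas as pd

# Made-up ride distances in miles
miles = pd.Series([4.7, 5.0, 4.8, 5.1, 63.7, 12.3, 18.0, 25.5])

# Illustrative distance bands (right-closed intervals)
bands = pd.cut(miles,
               bins=[0, 5, 10, 20, 40, float("inf")],
               labels=["0-5", "5-10", "10-20", "20-40", "40+"])

# Share of rides in each band, kept in band order
share = bands.value_counts(normalize=True, sort=False)
print(share)
```

On the real `dataset['MILES']` column, the combined share of the first three bands would quantify the "0-20 miles" observation.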
