Uber - Rides - Analysis - Jupyter Notebook
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
dataset = pd.read_csv("UberDataset.csv")
dataset.head()
Out[2]:
In [4]:
dataset.shape
Out[4]:
(1156, 7)
In [5]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 START_DATE 1156 non-null object
1 END_DATE 1155 non-null object
2 CATEGORY 1155 non-null object
3 START 1155 non-null object
4 STOP 1155 non-null object
5 MILES 1156 non-null float64
6 PURPOSE 653 non-null object
dtypes: float64(1), object(6)
memory usage: 63.3+ KB
localhost:8888/notebooks/uber_rides_analysis.ipynb 1/12
9/12/23, 12:13 AM uber_rides_analysis - Jupyter Notebook
In [6]:
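# Assign the result back; in-place fillna on a selected column is deprecated in recent pandas.
dataset['PURPOSE'] = dataset['PURPOSE'].fillna("NOT")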
In [7]:
dataset.head()
Out[7]:
In [8]:
dataset['START_DATE'] = pd.to_datetime(dataset['START_DATE'],
                                       errors='coerce')
dataset['END_DATE'] = pd.to_datetime(dataset['END_DATE'],
                                     errors='coerce')
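To illustrate what `errors='coerce'` does in the cell above, here is a minimal sketch on a made-up three-row frame. The sample values and the explicit `format` string are assumptions for illustration, not taken from UberDataset.csv. Strings that cannot be parsed become `NaT` instead of raising, so a later `dropna` removes those rows.

```python
import pandas as pd

# Hypothetical sample; the second entry stands in for a non-date row.
sample = pd.DataFrame(
    {"START_DATE": ["01-01-2016 21:11", "Totals", "01-02-2016 01:25"]}
)

# Unparseable strings are coerced to NaT rather than raising an error.
# The format string is an assumption about how the dates are written.
sample["START_DATE"] = pd.to_datetime(
    sample["START_DATE"], format="%m-%d-%Y %H:%M", errors="coerce"
)

n_invalid = int(sample["START_DATE"].isna().sum())  # rows that failed to parse
cleaned = sample.dropna()                           # drops the NaT row
print(n_invalid, len(cleaned))
```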
In [9]:
dataset['date'] = pd.DatetimeIndex(dataset['START_DATE']).date
dataset['time'] = pd.DatetimeIndex(dataset['START_DATE']).hour
In [10]:
dataset.dropna(inplace=True)
In [11]:
dataset.drop_duplicates(inplace=True)
In [13]:
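# object_cols is not defined in the cells shown (In [12] is missing);
# it is assumed here to be the names of the object-dtype columns.
object_cols = dataset.select_dtypes(include='object').columns
unique_values = {}
for col in object_cols:
    unique_values[col] = dataset[col].unique().size
unique_values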
Out[13]:
Data Visualization
In [18]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.countplot(data=dataset, x='CATEGORY')
plt.xticks(rotation=90)
plt.subplot(1,2,2)
sns.countplot(data=dataset, x='PURPOSE')
plt.xticks(rotation=90)
Out[18]:
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
[Text(0, 0, 'Meal/Entertain'),
Text(1, 0, 'NOT'),
Text(2, 0, 'Errand/Supplies'),
Text(3, 0, 'Meeting'),
Text(4, 0, 'Customer Visit'),
Text(5, 0, 'Temporary Site'),
Text(6, 0, 'Between Offices'),
Text(7, 0, 'Charity ($)'),
Text(8, 0, 'Commute'),
Text(9, 0, 'Moving'),
Text(10, 0, 'Airport/Travel')])
In [20]:
sns.countplot(data=dataset,x='day-night')
plt.xticks(rotation=90)
Out[20]:
(array([0, 1, 2, 3]),
[Text(0, 0, 'Morning'),
Text(1, 0, 'Afternoon'),
Text(2, 0, 'Evening'),
Text(3, 0, 'Night')])
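The `day-night` column plotted in In [20] is not created in any of the cells shown. One plausible way to derive it from the hour stored in `dataset['time']` is `pd.cut`. The sketch below is an assumption: the bin edges and the use of `include_lowest=True` are illustrative choices and should be adjusted to whatever the missing cell actually used. Only the four labels are taken from the output above.

```python
import pandas as pd

# Hypothetical hours of day (0-23), standing in for dataset['time'].
hours = pd.Series([8, 12, 16, 22])

# Assumed bin edges; the notebook's real edges are not shown.
# Bins are right-closed: [0, 10], (10, 15], (15, 19], (19, 24].
# include_lowest=True keeps hour 0 inside the first bin.
day_night = pd.cut(
    x=hours,
    bins=[0, 10, 15, 19, 24],
    labels=["Morning", "Afternoon", "Evening", "Night"],
    include_lowest=True,
)
print(day_night.tolist())
```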
In [21]:
plt.figure(figsize=(15, 5))
sns.countplot(data=dataset, x='PURPOSE', hue='CATEGORY')
plt.xticks(rotation=90)
plt.show()
Most rides with a recorded purpose are booked for Meeting and Meal/Entertain.
Most cabs are booked between 10am and 5pm (Afternoon).
In [23]:
C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\preprocessing\_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
  warnings.warn(
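The code for In [23] did not survive in this printout. The warning suggests it called scikit-learn's `OneHotEncoder` with the old `sparse` argument. As a rough stand-in, and not the notebook's own method, the sketch below one-hot encodes the two categorical columns with pandas' `get_dummies`. The rows, including the `Business` and `Personal` category labels, are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the CATEGORY values are assumptions, the PURPOSE
# values reuse labels that appear in the count plot output.
rides = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "PURPOSE": ["Meeting", "NOT", "Commute"],
    "MILES": [5.1, 2.0, 7.5],
})

# Each distinct label becomes its own 0/1 indicator column.
encoded = pd.get_dummies(rides, columns=["CATEGORY", "PURPOSE"], dtype=int)
print(encoded.columns.tolist())
```

Encoding the categorical columns as numeric indicators is what lets them take part in a correlation matrix alongside `MILES`.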
In [24]:
plt.figure(figsize=(12, 6))
# numeric_only=True restricts the correlation to numeric columns and
# silences the FutureWarning about the changing default in DataFrame.corr.
sns.heatmap(dataset.corr(numeric_only=True),
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)
Out[24]:
<Axes: >
In [25]:
dataset['MONTH'] = pd.DatetimeIndex(dataset['START_DATE']).month
month_label = {1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'April',
               5.0: 'May', 6.0: 'June', 7.0: 'July', 8.0: 'Aug',
               9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'}
dataset["MONTH"] = dataset.MONTH.map(month_label)
mon = dataset.MONTH.value_counts(sort=False)
# `df` was undefined in the original cell; it is assumed here to hold
# the number of rides per month.
df = pd.DataFrame({"VALUE COUNT": mon})
p = sns.lineplot(data=df)
p.set(xlabel="MONTHS", ylabel="VALUE COUNT")
Out[25]:
Still, it is clear that the ride counts are lowest in Nov, Dec, and Jan, which is consistent with those months being winter in Florida, US.
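Reading a seasonal dip off the line plot is easier when the months appear in calendar order, which `value_counts(sort=False)` does not guarantee. A minimal sketch of one way to enforce that order, using made-up month labels (the sample series is an assumption; the label spellings follow `month_label` above):

```python
import pandas as pd

# Hypothetical month labels in arbitrary order, standing in for dataset['MONTH'].
months = pd.Series(["Mar", "Jan", "Mar", "Dec", "Jan", "Mar"])

calendar_order = ["Jan", "Feb", "Mar", "April", "May", "June",
                  "July", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Count rides per month, then reorder to calendar order;
# months with no rides get a count of 0.
counts = months.value_counts().reindex(calendar_order, fill_value=0)
print(counts[counts > 0])
```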
In [26]:
dataset['DAY'] = dataset.START_DATE.dt.weekday
day_label = {
    0: 'Mon', 1: 'Tues', 2: 'Wed', 3: 'Thurs', 4: 'Fri', 5: 'Sat', 6: 'Sun'
}
dataset['DAY'] = dataset['DAY'].map(day_label)
In [27]:
# Use a separate name so the day_label mapping dict is not overwritten.
day_counts = dataset.DAY.value_counts()
sns.barplot(x=day_counts.index, y=day_counts.values)
plt.xlabel('DAY')
plt.ylabel('COUNT')
Out[27]:
In [28]:
sns.boxplot(dataset['MILES'])
Out[28]:
<Axes: >
In [29]:
sns.boxplot(dataset[dataset['MILES']<100]['MILES'])
Out[29]:
<Axes: >
In [30]:
# distplot is deprecated in seaborn; histplot with a KDE overlay is the
# replacement suggested by the migration guide:
# https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
sns.histplot(dataset[dataset['MILES']<40]['MILES'], kde=True, stat="density")
Out[30]: