UBER Data Wrangling
## Data Exploration and Wrangling for UBER Ride Sharing Data
Description:
Here we work with the UBER ride-sharing dataset for Victoria (VIC), Australia. We will load the given datasets:
- dirty_data.csv: we perform some EDA and find the different types of errors (syntactic/semantic).
- outliers.csv: we analyse all the dimensions of this file and find outliers (if any).
- missing_value.csv: we find missing values (if any) in this file.
- edges.csv: used to find the journey distance and travel time between different nodes.
- nodes.csv: used together with edges.csv to find the journey distance and travel time.
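A minimal loading sketch (file paths are assumed to be relative to the notebook; the variable names match those used in the cells below):

import pandas as pd

dirty_df   = pd.read_csv("dirty_data.csv")
outlier_df = pd.read_csv("outliers.csv")
miss_df    = pd.read_csv("missing_value.csv")
edges_df   = pd.read_csv("edges.csv")
nodes_df   = pd.read_csv("nodes.csv")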
(tail of a dirty_df.head() output; the columns are likely Unnamed: 0, Uber Type, Origin Region, Destination Region)
3 ID1404486339 0 7 8
4 ID5960060344 2 5 6
In [4]: dirty_df.describe()
(the numeric columns are likely Journey Distance(m), Travel Time(s) and Fare$)
mean 15757.016071 4107.506571 81.731607
std 16165.420364 3855.761494 167.726843
min 154.000000 38.460000 2.820000
25% 5520.750000 1426.905000 12.577500
50% 8596.500000 2555.070000 20.490000
75% 13986.000000 4293.870000 56.722500
max 51032.000000 13204.980000 857.050000
from datetime import datetime

def day_classifier(df):
    # day factor: 0 for weekdays, 1 for weekends
    days = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
    day_class = {"Monday":0,"Tuesday":0,"Wednesday":0,"Thursday":0,"Friday":0,
                 "Saturday":1,"Sunday":1}
    for i in range(0, len(df)):
        dt = df.loc[i, "Departure Date"].split("-")
        day = days[datetime(int(dt[0]), int(dt[1]), int(dt[2])).weekday()]
        df.loc[i, 'day_factor'] = day_class[day]
FMT = '%H:%M:%S'

def time_classifier(df):
    # time factor: 0 = morning, 1 = afternoon, 2 = evening/night
    # (the upper bounds of the first two bands are assumptions)
    for i in range(0, len(df)):
        hour = datetime.strptime(str(df.loc[i]["Departure Time"]), FMT).hour
        if (hour >= 6) and (hour < 12):
            df.loc[i, 'time_factor'] = 0
        elif (hour >= 12) and (hour < 18):
            df.loc[i, 'time_factor'] = 1
        else:
            df.loc[i, 'time_factor'] = 2
Now, let's plot the ride origin and destination locations on a map and do some further exploratory analysis, to understand the data better and find errors and inconsistencies (if any).
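A sketch of how the origin map might be built (assuming folium is the mapping library behind o_map/d_map, and that the longitude column is named 'Origin Longitude'):

import folium

o_map = folium.Map(location=[-37.81, 144.96], zoom_start=10)   # centred on Melbourne
for _, row in dirty_df.iterrows():
    folium.CircleMarker([row["Origin Latitude"], row["Origin Longitude"]],
                        radius=2, color="blue").add_to(o_map)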
        # (tail of the helper that picks a marker colour used when drawing the maps)
        else:
            color = "pink"
        return color
o_map
d_map
As we can see from the origin and destination points of the rides plotted on the map, a few coordinates fall outside VIC, in fact outside Australia. We need to fix the latitudes, as we also saw that there are positive latitudes, which should not be the case if the points belong in VIC.
<h3><u> Error 1, 2 --> Outliers in origin/destination latitude (Semantic Error): </u></h3>
<p> As we know, the points are all located in VIC, so any point outside it is an outlier / erroneous. In particular, positive latitudes are impossible for points in VIC. </p>
<h3><u> Fix: </u></h3>
<p> The best possible fix is to change those positive latitudes to negative, so that the points fall back inside VIC. </p>
In [10]: # correcting the incorrect lat long for origin and destination
# setting the corrected flag for the rows which are being corrected
dirty_df.loc[dirty_df["Origin Latitude"] > 0, "corrected"] = 1
dirty_df.loc[dirty_df["Destination Latitude"] > 0, "corrected"] = 1
from datetime import timedelta
id_list = []
date_list =[]
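# (sketch of the check that fills id_list / date_list: a valid Departure Date should
#  parse as YYYY-MM-DD; month/day swaps and out-of-range days raise ValueError)
for i in range(0, len(dirty_df)):
    try:
        datetime.strptime(str(dirty_df.loc[i, "Departure Date"]), "%Y-%m-%d")
    except ValueError:
        id_list.append(dirty_df.loc[i, "Unnamed: 0"])
        date_list.append(dirty_df.loc[i, "Departure Date"])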
d ={}
for i in range(0,len(id_list)):
d[id_list[i]] = date_list[i]
invalid_date = pd.Series(d).to_frame().reset_index()
invalid_date.columns = ["id","date"]
invalid_date.head(5)
Out[11]: id date
0 ID1959463567 2018-18-05
1 ID3198565330 2018-17-03
2 ID1328595966 2018-16-05
3 ID3130522574 2018-23-06
4 ID3230556880 2018-26-02
As we can see, there are a few dates not in the right format: either the month and day are swapped, or the day exceeds the range of the month.
Let’s check time format as well.
id2_list = []
time_list =[]
for i in range(0, len(dirty_df)):
    # a valid time must look like HH:MM:SS
    dept_flag = re.match(r"^(0+[0-9]|1[0-9]|2[0-3])[\:]([0-5][0-9])[\:]([0-5][0-9])$",
                         str(dirty_df.loc[i]['Departure Time']))
    arr_flag = re.match(r"^(0+[0-9]|1[0-9]|2[0-3])[\:]([0-5][0-9])[\:]([0-5][0-9])$",
                        str(dirty_df.loc[i]['Arrival Time']))
    if not dept_flag:
        id2_list.append(dirty_df.loc[i]['Unnamed: 0'])
        time_list.append([dirty_df.loc[i]['Departure Time'], "Departure Time"])
    if not arr_flag:
        id2_list.append(dirty_df.loc[i]['Unnamed: 0'])
        time_list.append([dirty_df.loc[i]['Arrival Time'], "Arrival Time"])
time = {}
for i in range(0, len(id2_list)):
    time[id2_list[i]] = time_list[i]
invalid_time = pd.DataFrame.from_dict(time).transpose()
invalid_time.reset_index(inplace=True)
invalid_time.columns = ["id","time","type"]
invalid_time.head(5)
So, there are some time values that are inconsistent and do not follow the HH:MM:SS format (e.g. 00:00:00). We need to fix these inconsistencies as well before we move ahead.
<h3><u> Error 3, 4 --> Departure Date: month and day swapped, or day exceeding the month (Syntactic Error): </u></h3>
<p> The date does not follow a consistent format. In some places the month and day are swapped, and in others the day exceeds the number of days in that month. </p>
<h3><u> Fix: </u></h3>
<ul>
<li> <b> For month and day swapped: swap them back. </b> </li>
<li> <b> For a day exceeding the month: reduce the day to the last day of that month. </b> </li>
</ul>
    # zero-pad a single-digit month so the corrected date keeps the YYYY-MM-DD format,
    # and clamp the day to the last day of that month
    if len(str(month)) == 1:
        return (str(year) + "-0" + str(month) + "-" + str(min(day, lastday.day)))
    else:
        return (str(year) + "-" + str(month) + "-" + str(min(day, lastday.day)))
for i in range(0, len(invalid_date)):
    dt = invalid_date.loc[i]["corr_date"].split("-")
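For reference, a self-contained sketch of the swap/clamp fix described above (the helper name correct_date and the use of calendar.monthrange are assumptions, not necessarily the original implementation):

import calendar

def correct_date(bad_date):
    # swap month and day when the month is impossible, then clamp the day to the month's end
    year, month, day = (int(x) for x in bad_date.split("-"))
    if month > 12:
        month, day = day, month
    last_day = calendar.monthrange(year, month)[1]
    return "{:04d}-{:02d}-{:02d}".format(year, month, min(day, last_day))

correct_date("2018-18-05")   # -> '2018-05-18'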
Now that the Departure Date format has been fixed, we can add the day factor to our dataset: 0 for weekdays and 1 for weekends.
In [14]: day_classifier(dirty_df)

# enlarge the default figure size before plotting, then draw the correlation matrix
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 25
fig_size[1] = 25
plt.rcParams["figure.figsize"] = fig_size
plt.matshow(newdf.corr())
plt.xticks(range(len(newdf.columns)), newdf.columns, rotation="vertical")
plt.yticks(range(len(newdf.columns)), newdf.columns)
plt.colorbar()
plt.show()
[Figure: correlation-matrix heatmap of the numeric columns of newdf]
[Output omitted: wide row listings from dirty_df, wrapped across several column blocks (ID, Uber Type, origin/destination regions and coordinates, Journey Distance(m), departure timestamps, Arrival Time, Travel Time(s), Fare$ and correction flags).]
(tail of a summary table: maximum Fare$ per Uber Type)
Uber Type 0: 29.81
Uber Type 1: 75.55
Uber Type 2: 857.05
Uber Type 3: 28.40
plt.show()
plt.clf()
plt.close()
Viewing these scatter plots, we can see that the two records of UBER Type 3 coincide with UBER Type 0. This suggests they are wrong entries and should ideally be type 0. But let's verify it with the help of the transaction ID ('Unnamed: 0'), since it follows a pattern: its third character is the same for all records of the same UBER Type.
Uber Type 0: '1'
Uber Type 1: '3'
Uber Type 2: '5'
Uber Type 3: '1', '3' (only 2 records, and we know only 3 categories of UBER Type exist in the data, so these must be changed)
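One quick way to tabulate this pattern (a sketch using the columns already in dirty_df):

# how often each third ID character occurs within every Uber Type
dirty_df.groupby('Uber Type')['Unnamed: 0'].apply(lambda s: s.str[2].value_counts())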
Let’s try to observe the pattern visually and more carefully.
In [22]: # SETTING STYLE FOR seaborn PLOT
sb.set(font_scale=1.5)
sb.set_style('whitegrid')
plt.show()
plt.clf()
plt.close()
As observed, there is a pattern in the 'Unnamed: 0' (ID) column for the different UBER Types: type 0 IDs start with 'ID1...', type 1 with 'ID3...', and so on. Also, in the 'ID classification on UBER Type' scatter plot we can see that for types 1 and 2 there are a few IDs with slight deviation.
Uber Type 0: '1'
Uber Type 1: '3', '1' ('1' appears for only two or three records that do not follow the trend, so it is an error wherever '1' occurs for this type)
Uber Type 2: '5', '3', '1' ('3' and '1' appear for only two or three records that do not follow the trend, so they are errors wherever '3' or '1' occurs for this type)
Uber Type 3: '1', '3' (only 2 records, and we know only 3 categories of UBER Type exist in the data, so these must be changed)
In [23]: uber_type = {'0':[], '1':[], '2':[]}
for i in range(len(dirty_df)):
    if (dirty_df['Uber Type'][i] == 0) & (dirty_df['Unnamed: 0'][i][2] != '1'):
        uber_type['0'].append(dirty_df.loc[i]["Unnamed: 0"])
        #print(dirty_df.loc[i]["Unnamed: 0"], dirty_df.loc[i]["Uber Type"])
    elif (dirty_df['Uber Type'][i] == 1) & (dirty_df['Unnamed: 0'][i][2] != '3'):
        uber_type['1'].append(dirty_df.loc[i]["Unnamed: 0"])
        #print(dirty_df.loc[i])
    elif (dirty_df['Uber Type'][i] == 2) & (dirty_df['Unnamed: 0'][i][2] != '5'):
        uber_type['2'].append(dirty_df.loc[i]["Unnamed: 0"])
        #print(dirty_df.loc[i])
uber_type
For UBER Type 1: 'ID1885469348' and 'ID1574949198' do not follow the trend.
For UBER Type 2: 'ID1701214714', 'ID1854884366', 'ID3988603151' and 'ID1248152001' do not follow the trend.
<h3><u> Error 5 --> Additional UBER Type (i.e. type 3) (Syntactic Error): </u></h3>
<p> There are only 3 UBER Types as per the dataset specification: 0, 1 and 2. Type 3 therefore should not exist and must be a wrong entry. </p>
<h3><u> Fix: </u></h3>
<p> Based on the pattern observed in the fares and IDs, the 2 records of type 3 are re-assigned to the type implied by the third character of their ID ('1' maps to type 0, '3' to type 1, '5' to type 2). </p>
        else:  # third character '5' --> type 2
            dirty_df.loc[dirty_df["Unnamed: 0"] == utype.iloc[i]["Unnamed: 0"], "Uber Type"] = 2
            # marking the corrected flag as 1
            dirty_df.loc[dirty_df["Unnamed: 0"] == utype.iloc[i]["Unnamed: 0"], "corrected"] = 1
dirty_df.loc[dirty_df["Uber Type"] == 3]
<h3><u> Error 6 --> UBER Type 1, 2 (Semantic Error): </u></h3>
<p> Types 1 and 2 have two to four IDs that do not follow the ID pattern of their type. </p>
<h3><u> Fix: </u></h3>
<p> The best possible fix is to change the UBER Type of these records to the type implied by the third character of their ID. </p>
for i in range(0, len(uber_type['2'])):
    if uber_type['2'][i][2] == '1':
        dirty_df.loc[dirty_df["Unnamed: 0"] == uber_type['2'][i], "Uber Type"] = 0
        # marking the corrected flag as 1
        dirty_df.loc[dirty_df["Unnamed: 0"] == uber_type['2'][i], "corrected"] = 1
In [26]: # GREAT-CIRCLE (HAVERSINE) DISTANCE BETWEEN ANY 2 POINTS
import math

def distance(origin, destination):
    # origin / destination are (latitude, longitude) pairs in degrees
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6378  # km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c
    return d
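A quick sanity check of the helper (the first coordinate is an illustrative point in central Melbourne; the second is one of the outlying points seen in the data):

print(distance((-37.8136, 144.9631), (-38.110916, 144.654173)))   # ~ 42.8 km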
alldata_reg = miss1.append([miss2,out1,out2,dirty1,dirty2])
alldata_reg.drop_duplicates(inplace=True)
alldata_reg.reset_index(inplace=True, drop = True)
alldata_reg["dist_center(km)"] = 0
alldata_reg["predicted_region"] = -1
region_center = alldata_reg.groupby('Region').agg('mean')
region_center.reset_index(inplace=True)
#region_center
predict_dict = {}
# (the lines below run inside a loop over alldata_reg rows, where `origin` is the point's
#  [Latitude, Longitude] and dist_center(km) its distance to its own region centre)
if alldata_reg["dist_center(km)"][i] > 2:
    # check the distance to every other region centre
    for k in range(0, len(region_center)):
        dist = distance(origin, [region_center.loc[k, "Latitude"], region_center.loc[k, "Longitude"]])
        predict_dict[region_center.loc[k, "Region"]] = dist
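    # (sketch of the step that follows: with predict_dict holding the distance from this
    #  point to every region centre, re-assign the point to the nearest centre)
    alldata_reg.loc[i, "predicted_region"] = min(predict_dict, key=predict_dict.get)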
<h3><u> Error 7 --> Wrong Region allocation for a few points (Coverage Error): </u></h3>
<p> The region has been allocated wrongly for some points, even though visually they appear to belong to another region. </p>
<h3><u> Fix: </u></h3>
<p> The best possible fix is to take the mean latitude and longitude of each region as its centre; any point lying far from its own centre is re-assigned to the nearest centre. </p>
3.7 Validating Journey Distance(m) (Is it really the shortest path?)
Here we will try to validate whether the journey distance given in the dataset is actually the shortest-path distance for each ride, as it could be wrong due to data entry issues, so it is better to verify before moving ahead. We will use Dijkstra's algorithm to find the shortest journey distance between the source node and the destination node and then compare it with the journey distance given in the dataset.
# creating the graph from the edges and nodes in edges.csv and nodes.csv respectively
graph = netx.from_pandas_edgelist(edges_df, 'u', 'v', ['distance(m)'])
id3_list = []
dist_list = []
for i in range(0, len(dirty_df)):
    # computing the shortest distance between the ride's origin and destination nodes
    # (origin_node / dest_node stand for the node ids looked up from nodes_df)
    distance, path = netx.single_source_dijkstra(graph, source=origin_node,
                                                 target=dest_node, weight='distance(m)')
    # checking if the given distance and the Dijkstra shortest distance disagree
    if distance != round(dirty_df.iloc[i]['Journey Distance(m)']):
        id3_list.append(dirty_df.iloc[i]["Unnamed: 0"])
        dist_list.append([round(dirty_df.iloc[i]["Journey Distance(m)"]), distance])
dist = {}
for i in range(0,len(id3_list)):
dist[id3_list[i]] = dist_list[i]
invalid_dist = pd.DataFrame.from_dict(dist).transpose()
invalid_dist.reset_index(inplace=True)
invalid_dist.columns = ["id","given_distance","computed_distance"]
invalid_dist
# writing the computed shortest distance back for the rows that disagreed
for i in range(0, len(invalid_dist)):
    dirty_df.loc[dirty_df["Unnamed: 0"] == invalid_dist.loc[i, "id"], "Journey Distance(m)"] = invalid_dist.loc[i, "computed_distance"]
    dirty_df.loc[dirty_df["Unnamed: 0"] == invalid_dist.loc[i, "id"], "corrected"] = 1
    return tt   # tail of the traveltime(path, edges_df) helper

def time_diff(at, dt):
    # positive difference in seconds between arrival time `at` and departure time `dt`,
    # wrapping across midnight instead of going negative
    FMT = '%H:%M:%S'
    at = str(at)
    dt = str(dt)
    tdelta = datetime.strptime(at, FMT) - datetime.strptime(dt, FMT)
    if tdelta.days < 0:
        tdelta = timedelta(days=0, seconds=tdelta.seconds, microseconds=tdelta.microseconds)
    return tdelta.total_seconds()
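For example, the wrap-around handling means a ride departing just before midnight still gets a positive duration:

time_diff("00:05:00", "23:50:00")   # 900.0 seconds rather than a negative value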
In [35]: invalid_tt = []
tt_dict = {}
# (fragment: this runs inside a loop over dirty_df rows; `p` holds the candidate shortest
#  paths for row i, and correct_tt_list collects the travel time of each path)
for each in p:
    flag = 0
    correct_tt = traveltime(each, edges_df)
    calculated_tt = time_diff(dirty_df.loc[i]['Arrival Time'], dirty_df.loc[i]['Departure Time'])
    correct_tt_list.append(correct_tt)
    # (the comparison of calculated_tt with correct_tt, which sets flag, is not shown)
if flag == 0:
    tt_dict[dirty_df.loc[i]['Unnamed: 0']] = {"Dept_Time": str(dirty_df.loc[i]['Departure Time']),
                                              "Path_Time": correct_tt_list}
invalid_tt = pd.DataFrame.from_dict(tt_dict).transpose()
invalid_tt.reset_index(inplace=True)
invalid_tt
Path_Time
0 [2390.1600000000003, 2408.3400000000006, 2402...
1 [1334.88]
2 [341.76, 331.56, 349.02, 349.37999999999994, 3...
3 [10688.16, 10689.0, 10688.16, 10689.0]
4 [10217.159999999996]
5 [2850.839999999999, 2852.639999999999, 2852.87...
6 [2709.6, 2709.6]
7 [3515.1000000000004, 3523.32, 3514.32, 3522.54...
8 [4298.280000000001, 4316.460000000001, 4310.52...
9 [1549.9800000000007]
3.9 Validating Arrival Time (except for the corrected rows)
We need to verify whether the Arrival Time is correct, assuming that the Departure Time and Travel Time are correct. Since we have already corrected the Travel Time for a few rows, we will exclude those rows, as we cannot use a corrected Travel Time to validate the Arrival Time.
In [37]: arr_dict = {}
for i in range(0, len(dirty_df)):
    if dirty_df.loc[i]["corrected"] == 1:
        continue
    else:
        arr_time = str(dirty_df.iloc[i]['Arrival Time']).split(":")
        dept_time = str(dirty_df.iloc[i]['Departure Time']).split(":")
        travel_time = dirty_df.iloc[i]['Travel Time(s)']
        # arrival time in seconds vs departure time + travel time (wrapped past midnight)
        arr_time_secs = int(arr_time[0]) * 3600 + int(arr_time[1]) * 60 + int(arr_time[2])
        dept_tt_secs = (int(dept_time[0]) * 3600 + int(dept_time[1]) * 60
                        + int(dept_time[2]) + round(travel_time)) % 86400
        if arr_time_secs != dept_tt_secs:
            arr_dict[dirty_df.iloc[i]['Unnamed: 0']] = {"Travel Time": dirty_df.iloc[i]['Travel Time(s)'],
                                                        "Departure Time": dirty_df.iloc[i]['Departure Time'],
                                                        "Arrival Time": dirty_df.iloc[i]['Arrival Time'],
                                                        "dept_tt(secs)": dept_tt_secs}
invalid_arr = pd.DataFrame.from_dict(arr_dict).transpose()
invalid_arr.reset_index(inplace=True)
invalid_arr
dept_tt(secs)
0 28049
1 3695
2 68318
3 10317
4 72727
5 62471
6 27698
7 76528
8 40792
9 27618
10 76744
11 55217
12 10649
13 22494
14 79254
15 52589
16 57218
17 48799
18 3378
19 2226
# fixing the rows whose Arrival Time was inconsistent
for i in range(0, len(invalid_arr)):
    dept_time = invalid_arr.loc[i]["Departure Time"]
    arr_time = invalid_arr.loc[i]["Arrival Time"]
    travel_time = invalid_arr.loc[i]["Travel Time"]
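    # (sketch of the remaining steps, assumed: recompute the arrival time from departure +
    #  travel time and write it back; the Departure/Arrival swap case is handled similarly.
    #  The "index" column of invalid_arr holds the transaction id after reset_index.)
    h, m, s = (int(x) for x in str(dept_time).split(":"))
    fixed = (h * 3600 + m * 60 + s + round(travel_time)) % 86400
    fixed_str = "{:02d}:{:02d}:{:02d}".format(fixed // 3600, (fixed % 3600) // 60, fixed % 60)
    dirty_df.loc[dirty_df["Unnamed: 0"] == invalid_arr.loc[i]["index"], "Arrival Time"] = fixed_str
    dirty_df.loc[dirty_df["Unnamed: 0"] == invalid_arr.loc[i]["index"], "corrected"] = 1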
Now that the Arrival Time is fixed, and the Departure and Arrival Times have been swapped wherever that was needed, we can be confident about the consistency of our data and add the time factor as another dimension.
In [39]: time_classifier(dirty_df)
We have checked every dimension during our EDA process, understood the relations between the dimensions, and wrangled the data for quality issues such as syntactic and semantic errors. We can now move ahead to fix missing values and perform outlier detection based on our cleansed 'dirty data'.
In [41]: miss_df.describe()
(the columns shown are likely Uber Type, Origin Region, Destination Region and Origin Latitude)
min 0.000000 1.000000 1.000000 -38.110916
25% 0.000000 2.000000 2.000000 -37.819819
50% 1.000000 5.000000 5.000000 -37.815449
75% 1.000000 7.000000 8.000000 -37.807202
max 2.000000 9.000000 9.000000 -37.773803
As we can see, Uber Type and Fare$ have a few missing values (blank, NaN, or NA in this case). So we need to figure out a way to handle these missing values, perhaps by imputing them with their closest possible values.
Let's add the day factor and time factor to this dataset, as they might prove useful in imputing the missing Fare$ values.
In [42]: # adding day factors and time factors based on Departure date and time
day_classifier(miss_df)
time_classifier(miss_df)
Since we have identified the 2 dimensions that have missing values, let's try to impute them with the help of the relations they have with the other dimensions, which we figured out in our previous EDA step.
4.1 Imputing Uber Type (can be derived from Unnamed: 0 or ID)
We already know that Uber type is not actually missing and we can clearly derive it from the
transaction ID or ‘Unnamed: 0’. The 3rd character of the column ‘Unnamed: 0’ is fixed for specific
Uber Types.
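Since the mapping is deterministic, the imputation can also be written as one vectorized expression (an equivalent sketch; the loop in the next cell is the original approach):

# third ID character -> Uber Type, applied only where Uber Type is missing
mask = miss_df["Uber Type"].isna()
miss_df.loc[mask, "Uber Type"] = miss_df.loc[mask, "Unnamed: 0"].str[2].map({"1": 0, "3": 1, "5": 2})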
# mdf: presumably the boolean mask of rows whose Uber Type is missing
if mdf[i]:
    if miss_df.iloc[i]["Unnamed: 0"][2] == '1':     # type 0
        miss_df.loc[i, "Uber Type"] = 0
    elif miss_df.iloc[i]["Unnamed: 0"][2] == '3':   # type 1
        miss_df.loc[i, "Uber Type"] = 1
    else:                                           # type 2
        miss_df.loc[i, "Uber Type"] = 2
In [46]: # DATA FOR UBER TYPE 2
dd_ut2 = dirty_df[['Journey Distance(m)', 'Travel Time(s)', 'day_factor', 'time_factor', 'Fare$']]    # trailing columns assumed
miss_ut2 = miss_df[['Journey Distance(m)', 'Travel Time(s)', 'day_factor', 'time_factor', 'Fare$']]   # trailing columns assumed
#import random
# for i in range(0,100):
# a = random.randint(0,102)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)  # seed value assumed
lm_ut0 = LinearRegression()
reg = lm_ut0.fit(x_train, y_train)
predictions = lm_ut0.predict(x_test)
f, ax = plt.subplots(figsize=(10, 5))
f.subplots_adjust(wspace=0.4, hspace=0.4)
# plt.scatter(y_test, predictions, color="red")
a = sb.regplot(x=y_test, y=predictions)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('R Square:', metrics.r2_score(y_test, predictions))
# import random
# for i in range(0,100):
#     a = random.randint(0,102)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)  # seed value assumed
lm_ut1 = LinearRegression()
reg = lm_ut1.fit(x_train, y_train)
predictions = lm_ut1.predict(x_test)
plt.subplots(figsize=(10,10))
# plt.scatter(y_test,predictions,color="red")
a = sb.regplot(x=y_test, y=predictions)
In [50]: # MODEL FOR UBER TYPE: 2 -----------------------------------------------
x = model_data_ut2.iloc[:, :-1]    # creating x variable for training
y = model_data_ut2['Fare$']        # creating y variable for training
#import random
#for i in range(0,100):
#    a = random.randint(0,102)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)  # seed value assumed
lm_ut2 = LinearRegression()
reg = lm_ut2.fit(x_train, y_train)
predictions = lm_ut2.predict(x_test)
plt.subplots(figsize=(10,10))
# plt.scatter(y_test,predictions,color="red")
sb.regplot(x=y_test, y=predictions)
4.2.3 Predicting & imputing the missing 'Fare$' values using the models built:
# predicting Fare$ for the Uber Type 0 rows with missing fares (x holds their feature columns)
predictions = lm_ut0.predict(x)
miss_ut0.loc[:, 'Fare$'] = predictions
miss_ut0.reset_index(inplace=True)
for i in range(0, len(miss_ut0)):
    miss_df.loc[miss_df["Unnamed: 0"] == miss_ut0.iloc[i]["Unnamed: 0"], "Fare$"] = miss_ut0.iloc[i]["Fare$"]
In [52]: miss_ut1 = miss_df.loc[(miss_df['Uber Type'] == 1) & (miss_df['Fare$'].isna()), :]
# (x is built from miss_ut1's feature columns, as for type 0)
predictions = lm_ut1.predict(x)
miss_ut1.loc[:, 'Fare$'] = predictions
miss_ut1.reset_index(inplace=True)
for i in range(0, len(miss_ut1)):
    miss_df.loc[miss_df["Unnamed: 0"] == miss_ut1.iloc[i]["Unnamed: 0"], "Fare$"] = miss_ut1.iloc[i]["Fare$"]
# same again for type 2 (miss_ut2 selected in the same way)
predictions = lm_ut2.predict(x)
miss_ut2.loc[:, 'Fare$'] = predictions
miss_ut2.reset_index(inplace=True)
for i in range(0, len(miss_ut2)):
    miss_df.loc[miss_df["Unnamed: 0"] == miss_ut2.loc[i, "Unnamed: 0"], ["Fare$"]] = miss_ut2.loc[i, "Fare$"]
Out[54]: Unnamed: 0 Unnamed: 0.1 Uber Type Origin Region Destination Region \
0 0 ID1717434181 0 3 1
1 1 ID1773039045 0 1 2
2 2 ID1239989104 0 6 9
3 3 ID5826371523 2 1 8
4 4 ID1551372482 0 6 1
Travel Time(s) Arrival Time Fare$
0 668.40 8:15:18 5.44
1 268.56 23:25:55 10.92
2 11350.50 8:45:00 34.64
3 2464.98 8:14:27 160.91
4 1505.82 18:04:45 9.47
In [55]: outlier_df.describe()
In [56]: # adding day factors and time factors based on Departure date and time
day_classifier(outlier_df)
time_classifier(outlier_df)
# predicting fare for the whole dirty data (cleansed) + missing data
predictions = lm_ut0.predict(x)
model_data_ut0["predicted Fare$"] = predictions
# calculating the residual between the actual value and the fitted value
model_data_ut0["Residual"] = abs(model_data_ut0["Fare$"] - model_data_ut0["predicted Fare$"])
# predicting fare for the outlier dataset using our model
predictions = lm_ut0.predict(x)
out_ut0['predicted Fare$'] = predictions
#out_ut0.reset_index(inplace=True)
# calculating the residual between the actual value and the fitted value
out_ut0["Residual"] = abs(out_ut0["Fare$"] - out_ut0["predicted Fare$"])
# finding & dropping the outliers based on the 3-sigma rejection rule (a residual more than
# three standard deviations away from the mean residual marks the row as an outlier)
for i in out_ut0.index:
    if abs(out_ut0.loc[i]["Residual"] - mean0) > (3 * std0):
        print(out_ut0.loc[i]["Fare$"], out_ut0.loc[i]['predicted Fare$'], out_ut0.loc[i]["Residual"])
        out_ut0.drop(i, axis=0, inplace=True)
        outlier_df.drop(i, axis=0, inplace=True)
For Uber Type 1 (finding outliers):
predictions = lm_ut1.predict(x)
In [60]: out_ut1 = outlier_df.loc[outlier_df['Uber Type'] == 1, :]
predictions = lm_ut1.predict(x)
# (Residual, mean1 and std1 are computed as for type 0)
for i in out_ut1.index:
    if abs(out_ut1.loc[i]["Residual"] - mean1) > (3 * std1):
        print(out_ut1.loc[i]["Fare$"], out_ut1.loc[i]['predicted Fare$'], out_ut1.loc[i]["Residual"])
        out_ut1.drop(i, axis=0, inplace=True)
        outlier_df.drop(i, axis=0, inplace=True)
54.95 67.1840899082007 12.234089908200701
53.81 62.061939841446744 8.251939841446742
43.97 54.052970925890094 10.082970925890095
9.4 20.725492795302436 11.325492795302436
52.46 62.699743001436346 10.239743001436345
6.965 16.92427380127309 9.959273801273088
58.89 68.01466076429216 9.124660764292159
11.355 25.54959529940397 14.194595299403968
26.95 31.49292312396113 4.542923123961131
46.44 56.59000726351887 10.150007263518873
58.72 71.55622822736086 12.836228227360863
49.13 59.49888606901527 10.368886069015268
21.93 25.8220394673005 3.892039467300499
8.49 20.564830542098537 12.074830542098537
8.795 20.44568222779455 11.65068222779455
predictions = lm_ut2.predict(x)
model_data_ut2.head(5)
mean2 = statistics.mean(model_data_ut2["Residual"])
std2 = statistics.pstdev(model_data_ut2["Residual"])
predictions = lm_ut2.predict(x)
plt.subplots(figsize=(10,5))
sb.boxplot('Fare$', data=out_ut2, palette="rocket")
for i in out_ut2.index:
    if abs(out_ut2.loc[i]["Residual"] - mean2) > (3 * std2):
        print(out_ut2.loc[i]["Fare$"], out_ut2.loc[i]['predicted Fare$'], out_ut2.loc[i]["Residual"])
        out_ut2.drop(i, axis=0, inplace=True)
        outlier_df.drop(i, axis=0, inplace=True)
In [65]: del miss_df["day_factor"]
del miss_df["time_factor"]
del outlier_df["day_factor"]
del outlier_df["time_factor"]
6.3 Generating the solution files for dirty_data, missing_values and outliers