Group 1 ML Project.ipynb - Colab
The company proposed two goals. Carrying out both tasks correctly earns full points; the two tasks carry equal weight.
1. To identify 3 bike stations that could be removed or substantially reduced in size, and to identify areas for 3 new bike stations. Simulate
the time-course for the new bike stations. Quantify your result: you should make a case that your solution may actually improve the user
experience.
2. Build an algorithm that can warn DublinBikes when a bike station needs refill or to be emptied. This is not as simple as refilling all stations
that don’t have bikes. Instead, the task involves anticipating whether a station is going to be empty in the near future (e.g., the next 30
minutes, or the next hour). This will require machine learning. You will need to define appropriate evaluation metrics and to justify your
reasoning. There are many angles from which to tackle this task. Each group is asked to propose one solution (an illustrative sketch follows below).
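As an illustration only (our sketch, not a prescribed solution), task 2 could be framed as binary classification: label each snapshot by whether the station is empty a fixed horizon later, and judge the model on precision and recall, since empty-station events are rare. The snippet assumes a DataFrame df sampled every 5 minutes per station, with the column names used later in this notebook; the EMPTY_IN_30MIN label, the feature list, and the model choice are all hypothetical.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Label: will the station be empty 30 minutes (6 samples of 5 min) from now?
df['EMPTY_IN_30MIN'] = (df.groupby('STATION ID')['AVAILABLE_BIKES'].shift(-6) == 0).astype(int)
features = ['AVAILABLE_BIKES', 'AVAILABLE_BIKE_STANDS', 'HOUR']  # hypothetical feature set
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['EMPTY_IN_30MIN'], shuffle=False)  # no shuffling: test data stays in the future
model = RandomForestClassifier().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))  # precision/recall on the 'empty soon' class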
You are free to use classification, regression, clustering, and any other methodology that was covered in class. You are free to use Gemini.
However, your report will have to make it clear that you understand what your code is doing. To do so, your report will need to include high-level
explanations in plain English of your methodology (e.g., the dataset has XX features; since some of them were not numeric, we converted them
to numerical values; outliers were removed by applying a threshold, etc.).
A few tips: The manager suggested focussing on a subset of the city for simplicity (of course, you are free to do more than that, if you like).
If you plan on combining datasets from different quarters of the year, make sure that the data for each bike station you use is available in all of them. Missing or bad data-points
can be a problem, so identifying stations with good data will make your life easier (but feel free to make your life more complicated if you like
the challenge).
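The export begins after the setup cell. Below is a minimal reconstruction of it, assuming the libraries used throughout the notebook and that the CSV lives on Google Drive; DATA_PATH is a placeholder, since the real path is not shown in the export.

from google.colab import drive
import pandas as pd
import numpy as np
import folium
from folium.plugins import HeatMap
from branca.colormap import linear

drive.mount('/content/drive')
DATA_PATH = '/content/drive/...'  # placeholder: the actual CSV path is not shown in the export
dataset = pd.read_csv(DATA_PATH)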
print(dataset.columns)
print(len(dataset))
dataset
Mounted at /content/drive
Index(['STATION ID', 'TIME', 'LAST UPDATED', 'NAME', 'BIKE_STANDS',
'AVAILABLE_BIKE_STANDS', 'AVAILABLE_BIKES', 'STATUS', 'ADDRESS',
'LATITUDE', 'LONGITUDE'],
dtype='object')
1994400
   STATION ID                 TIME         LAST UPDATED                NAME  BIKE_STANDS  AVAILABLE_BIKE_STANDS  AVAILABLE_BIKES STATUS             ADDRESS  LATITUDE  ...
0           1  2023-01-01 00:00:03  2022-12-31 23:59:39       CLARENDON ROW           31                     31                0   OPEN       Clarendon Row   53.3409  ...
1           2  2023-01-01 00:00:03  2022-12-31 23:57:48  BLESSINGTON STREET           20                     18                2   OPEN  Blessington Street   53.3568  ...
2           3  2023-01-01 00:00:03  2022-12-31 23:57:10       BOLTON STREET           20                      9               11   OPEN       Bolton Street   53.3512  ...
3           4  2023-01-01 00:00:03  2022-12-31 23:51:39        GREEK STREET           20                      8               12   OPEN        Greek Street   53.3469  ...
4           5  2023-01-01 00:00:03  2022-12-31 23:58:28    CHARLEMONT PLACE           40                     16               24   OPEN   Charlemont Street   53.3307  ...

(the LONGITUDE column is cut off in the export)
Task 1
# We start by preparing the data for the average daily (weekday) profiles:
# compute the absolute change in 'AVAILABLE_BIKES' between consecutive
# samples, separately within each station
dataset['AVAILABLE_BIKES_CHANGE'] = abs(dataset.groupby('STATION ID')['AVAILABLE_BIKES'].diff())
dataset = dataset.dropna()  # drops each station's first sample, whose diff is NaN
dataset
     STATION ID                 TIME         LAST UPDATED                NAME  BIKE_STANDS  AVAILABLE_BIKE_STANDS  AVAILABLE_BIKES STATUS             ADDRESS  LATITUDE  ...
113           1  2023-01-01 00:30:02  2023-01-01 00:21:23       CLARENDON ROW           31                     31                0   OPEN       Clarendon Row   53.3409  ...
114           2  2023-01-01 00:30:02  2023-01-01 00:28:07  BLESSINGTON STREET           20                     18                2   OPEN  Blessington Street   53.3568  ...
115           3  2023-01-01 00:30:02  2023-01-01 00:20:18       BOLTON STREET           20                      8               12   OPEN       Bolton Street   53.3512  ...
116           4  2023-01-01 00:30:02  2023-01-01 00:21:55        GREEK STREET           20                      8               12   OPEN        Greek Street   53.3469  ...
117           5  2023-01-01 00:30:02  2023-01-01 00:28:45    CHARLEMONT PLACE           40                     16               24   OPEN   Charlemont Street   53.3307  ...

(the LONGITUDE column is cut off in the export)
# We select months after January to exclude the system update we detected in Tutorial 1
data = dataset.copy()  # create a new copy of the dataset
data['TIME'] = pd.to_datetime(data['TIME'], format="%Y-%m-%d %H:%M:%S")  # parse the 'TIME' column as datetimes
dateMask = data['TIME'].dt.month > 1  # Mask: True for data-points recorded after January, False otherwise
data = data[dateMask]  # applying the mask
# Remove columns with information that we don't need for the clustering
data = data.drop(columns=['STATUS', 'ADDRESS', 'LAST UPDATED'])
# Keep weekdays only, then reduce the 'TIME' column to the time of day (erase the date)
weekdays_data = data[data['TIME'].dt.dayofweek < 5].copy()  # Monday-Friday; .copy() avoids a SettingWithCopyWarning
weekdays_data['HOUR'] = weekdays_data['TIME'].dt.hour
weekdays_data = weekdays_data.assign(TIME=weekdays_data['TIME'].dt.hour + weekdays_data['TIME'].dt.minute / 60)
weekdays_data
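As a quick sanity check of the fractional-hour conversion, on a toy timestamp (hypothetical value):

import pandas as pd

t = pd.Timestamp('2023-02-06 08:45:00')
print(t.hour + t.minute / 60)  # 8.75, i.e. a quarter to nine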
# Average the data by grouping it by time of the day and station ID.
# This produces a dataset with one data-point for each station and time of the day.
# Fields such as 'AVAILABLE_BIKES' then hold the average available bikes
# across all data-points (e.g., all days) for a given station and time of the day.
weekday_avg = weekdays_data.groupby(['STATION ID', 'NAME', 'HOUR']).agg('mean').reset_index()  # reset_index keeps the keys as columns for the pivots below
weekday_avg
      STATION ID               NAME  HOUR       TIME  BIKE_STANDS  AVAILABLE_BIKE_STANDS  AVAILABLE_BIKES  LATITUDE  LONGITUDE  AVAILABLE_BIKES_CHANGE
0              1      CLARENDON ROW     0   0.250000         31.0              23.565126         7.415966   53.3409   -6.26250                     ...
1              1      CLARENDON ROW     1   1.250000         31.0              23.798319         7.180672   53.3409   -6.26250                     ...
2              1      CLARENDON ROW     2   2.250000         31.0              23.794118         7.184874   53.3409   -6.26250                     ...
3              1      CLARENDON ROW     3   3.249474         31.0              23.833684         7.147368   53.3409   -6.26250                     ...
4              1      CLARENDON ROW     4   4.250000         31.0              23.794118         7.184874   53.3409   -6.26250                     ...
...          ...                ...   ...        ...          ...                    ...              ...       ...        ...                     ...
2731         117  HANOVER QUAY EAST    19  19.250529         40.0              31.494715         8.488372   53.3437   -6.23175                     ...
2732         117  HANOVER QUAY EAST    20  20.250529         40.0              31.955603         8.021142   53.3437   -6.23175                     ...

(the AVAILABLE_BIKES_CHANGE values are cut off in the export)
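To see what the groupby/mean does, a minimal sketch with made-up numbers: two days' worth of samples for one station collapse to one row per hour, with the bike counts averaged across days.

import pandas as pd

toy = pd.DataFrame({'STATION ID': [1, 1, 1, 1],
                    'NAME': ['A', 'A', 'A', 'A'],
                    'HOUR': [8, 9, 8, 9],             # day 1, then day 2
                    'AVAILABLE_BIKES': [4, 10, 6, 12]})
print(toy.groupby(['STATION ID', 'NAME', 'HOUR']).agg('mean').reset_index())
#    STATION ID NAME  HOUR  AVAILABLE_BIKES
# 0           1    A     8              5.0
# 1           1    A     9             11.0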
# We create a new column 'Percent_full' expressing the average change in available
# bikes as a percentage of the station's capacity (BIKE_STANDS)
#weekday_norm = weekday_avg['AVAILABLE_BIKES'] / weekday_avg['BIKE_STANDS']
weekday_norm = weekday_avg['AVAILABLE_BIKES_CHANGE'] / weekday_avg['BIKE_STANDS']
weekday_avg = weekday_avg.assign(Percent_full=weekday_norm)
weekday_avg['Percent_full'] = weekday_avg['Percent_full'] * 100
weekday_avg
      STATION ID               NAME  HOUR       TIME  BIKE_STANDS  AVAILABLE_BIKE_STANDS  AVAILABLE_BIKES  LATITUDE  LONGITUDE  AVAILABLE_BIKES_CHANGE
0              1      CLARENDON ROW     0   0.250000         31.0              23.565126         7.415966   53.3409   -6.26250                     ...
1              1      CLARENDON ROW     1   1.250000         31.0              23.798319         7.180672   53.3409   -6.26250                     ...
2              1      CLARENDON ROW     2   2.250000         31.0              23.794118         7.184874   53.3409   -6.26250                     ...
3              1      CLARENDON ROW     3   3.249474         31.0              23.833684         7.147368   53.3409   -6.26250                     ...
4              1      CLARENDON ROW     4   4.250000         31.0              23.794118         7.184874   53.3409   -6.26250                     ...
...          ...                ...   ...        ...          ...                    ...              ...       ...        ...                     ...
2731         117  HANOVER QUAY EAST    19  19.250529         40.0              31.494715         8.488372   53.3437   -6.23175                     ...
2732         117  HANOVER QUAY EAST    20  20.250529         40.0              31.955603         8.021142   53.3437   -6.23175                     ...

(the AVAILABLE_BIKES_CHANGE and new Percent_full columns are cut off in the export)
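Why normalise by capacity: the same absolute turnover means more at a small station. A toy check with made-up numbers:

# 8 bikes of average change at a 40-stand station vs a 20-stand station
print(8.0 / 40.0 * 100)  # 20.0 percent of capacity
print(8.0 / 20.0 * 100)  # 40.0 percent of capacity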
# Reshape to get each time of the day in a column (features) and each station in a row (data-points) for Percent_full
time_station_percent_full = weekday_avg.pivot(index='STATION ID', columns='HOUR', values='Percent_full')
print(time_station_percent_full.shape)
time_station_percent_full
(114, 24)
HOUR               0         1         2         3         4         5         6          7          8          9  ...        14        15  ...
STATION ID
1           1.883979  0.040661  0.000000  0.000000  0.000000  1.667118  4.964346   9.507997   6.986988   9.934942  ...  7.259762  5.426146  ...
2           3.088235  0.493697  0.010504  0.010526  0.000000  0.651261  7.747368   9.926471   8.928571   9.065126  ...  6.389474  4.715789  ...
3           3.119748  0.325630  0.063025  0.010526  0.010504  1.186975  4.294737   6.018908   7.163866  12.363445  ...  9.094737  6.842105  ...
4           1.176471  0.094538  0.000000  0.000000  0.000000  0.199580  4.242105   5.892857   6.974790  13.025210  ...  5.694737  5.221053  ...
5           1.953782  0.283613  0.015756  0.000000  0.000000  0.315126  5.952632  10.724790  27.657563  14.185924  ...  4.536842  3.868421  ...
...              ...       ...       ...       ...       ...       ...       ...        ...        ...        ...  ...       ...       ...  ...
113         0.231092  0.015756  0.000000  0.000000  0.000000  0.005252  0.494737   1.654412   2.032563   4.243697  ...  1.826316  1.389474  ...
114         0.651261  0.057773  0.010504  0.000000  0.000000  0.446429  4.984211  13.602941  14.994748  12.988445  ...  3.710526  5.815789  ...
115         1.729692  0.273109  0.000000  0.014035  0.000000  0.084034  2.982456   5.602241   9.593838  17.394958  ...  4.835088  4.294737  ...
116         0.567227  0.070028  0.000000  0.000000  0.000000  0.098039  0.701754   1.582633   2.100840   1.939776  ...  1.305263  1.319298  ...
117         0.477941  0.063025  0.000000  0.000000  0.000000  0.000000  0.652632   2.069328   4.858193  10.099790  ...  1.726316  1.742105  ...

[114 rows x 24 columns]
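The pivot turns the long table (one row per station and hour) into a wide one (one row per station, one column per hour), so each station becomes a single data-point for clustering. A minimal sketch on toy values:

import pandas as pd

toy = pd.DataFrame({'STATION ID': [1, 1, 2, 2],
                    'HOUR': [0, 1, 0, 1],
                    'Percent_full': [1.9, 0.04, 3.1, 0.5]})
print(toy.pivot(index='STATION ID', columns='HOUR', values='Percent_full'))
# HOUR          0     1
# STATION ID
# 1           1.9  0.04
# 2           3.1  0.50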
# Repeat the reshape for the raw (non-normalised) average change in available bikes
time_station_available_bike_change = weekday_avg.pivot(index='STATION ID', columns='HOUR', values='AVAILABLE_BIKES_CHANGE')
print(time_station_available_bike_change.shape)
time_station_available_bike_change
(114, 24)
HOUR               0         1         2         3         4         5         6         7          8         9  ...        14        15  ...
STATION ID
1           0.584034  0.012605  0.000000  0.000000  0.000000  0.516807  1.538947  2.947479   2.165966  3.079832  ...  2.250526  1.682105  ...
2           0.617647  0.098739  0.002101  0.002105  0.000000  0.130252  1.549474  1.985294   1.785714  1.813025  ...  1.277895  0.943158  ...
3           0.623950  0.065126  0.012605  0.002105  0.002101  0.237395  0.858947  1.203782   1.432773  2.472689  ...  1.818947  1.368421  ...
4           0.235294  0.018908  0.000000  0.000000  0.000000  0.039916  0.848421  1.178571   1.394958  2.605042  ...  1.138947  1.044211  ...
5           0.781513  0.113445  0.006303  0.000000  0.000000  0.126050  2.381053  4.289916  11.063025  5.674370  ...  1.814737  1.547368  ...
...              ...       ...       ...       ...       ...       ...       ...       ...        ...       ...  ...       ...       ...  ...
113         0.092437  0.006303  0.000000  0.000000  0.000000  0.002101  0.197895  0.661765   0.813025  1.697479  ...  0.730526  0.555789  ...
114         0.260504  0.023109  0.004202  0.000000  0.000000  0.178571  1.993684  5.441176   5.997899  5.195378  ...  1.484211  2.326316  ...
115         0.518908  0.081933  0.000000  0.004211  0.000000  0.025210  0.894737  1.680672   2.878151  5.218487  ...  1.450526  1.288421  ...
116         0.170168  0.021008  0.000000  0.000000  0.000000  0.029412  0.210526  0.474790   0.630252  0.581933  ...  0.391579  0.395789  ...
117         0.191176  0.025210  0.000000  0.000000  0.000000  0.000000  0.261053  0.827731   1.943277  4.039916  ...  0.690526  0.696842  ...

[114 rows x 24 columns]
# We look at peak hours in the morning and in the evening to capture the usage of bikes for work-related travel
time_station_percent_full = time_station_percent_full[[7, 8, 9, 10, 17, 18, 19]]
time_station_percent_full
HOUR          7    8    9    10   17   18   19
STATION ID
[114 rows x 7 columns; the values are cut off in the export]
# Heat map of percent full for our peak hours (all together)
# Get unique station names, latitudes, and longitudes
stations, mask_stations = np.unique(dataset.NAME, return_index=True)
lats = dataset.LATITUDE.iloc[mask_stations]
longs = dataset.LONGITUDE.iloc[mask_stations]
# Get 'Percent_full' for the unique stations from time_station_percent_full,
# averaging across the selected peak-hour columns for each station
percent_full = time_station_percent_full.loc[dataset['STATION ID'].iloc[mask_stations]].mean(axis=1)
# Create a colormap spanning the observed range
colormap = linear.YlOrRd_09.scale(percent_full.min(), percent_full.max())
# Build [lat, lon, weight] triples and render the heat map
heatmap_data = [[lats.iloc[i], longs.iloc[i], percent_full.iloc[i]] for i in range(len(stations))]
mp = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
HeatMap(heatmap_data, radius=15).add_to(mp)

mp.save('heatmap.html')
mp
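Note that HeatMap does not consume the branca colormap built above (it uses its own gradient). A hedged alternative sketch that does use it, drawing one circle per station with the same variables as in the cell above:

# Alternative rendering: one marker per station, coloured by the shared scale
colormap.caption = 'Average percent full during peak hours'
marker_map = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
for i in range(len(stations)):
    folium.CircleMarker(
        location=[lats.iloc[i], longs.iloc[i]],
        radius=6,
        color=colormap(percent_full.iloc[i]),
        fill=True,
        fill_color=colormap(percent_full.iloc[i]),
    ).add_to(marker_map)
colormap.add_to(marker_map)  # renders the colour scale as a legend
marker_map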
# Heat map of percent full for our peak hours separated into morning and evening
# Filter data for peak hours
peak_hours = [7, 8, 9, 10, 17, 18, 19]
peak_hour_data = weekday_avg[weekday_avg['HOUR'].isin(peak_hours)]
def get_heatmap_data(data):
    # Get unique station names, latitudes, and longitudes
    stations, mask_stations = np.unique(dataset.NAME, return_index=True)
    lats = dataset.LATITUDE.iloc[mask_stations]
    longs = dataset.LONGITUDE.iloc[mask_stations]
    # Average 'Percent_full' per station across the hours present in the passed subset
    percent_full = data.groupby('STATION ID')['Percent_full'].mean()
    percent_full = percent_full.loc[dataset['STATION ID'].iloc[mask_stations]]
    # Use the station's coordinates with its average percent full as the heat-map weight
    heatmap_data = [[lats.iloc[i], longs.iloc[i], percent_full.iloc[i]] for i in range(len(stations))]
    return heatmap_data
# Morning peak: hours 7-10
morning = peak_hour_data[peak_hour_data['HOUR'].isin([7, 8, 9, 10])]
morning_heatmap_data = get_heatmap_data(morning)
morning_map = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
HeatMap(morning_heatmap_data, radius=15).add_to(morning_map)
display(morning_map)
# Evening peak: hours 17-19
evening = peak_hour_data[peak_hour_data['HOUR'].isin([17, 18, 19])]
evening_heatmap_data = get_heatmap_data(evening)
evening_map = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
colormap_evening = linear.YlOrRd_09.scale(min(data[2] for data in evening_heatmap_data), max(data[2] for data in evening_heatmap_data))
HeatMap(evening_heatmap_data, radius=15).add_to(evening_map)
display(evening_map)
# Keep the same peak-hour columns for the raw bike-change table
time_station_available_bike_change = time_station_available_bike_change[[7, 8, 9, 10, 17, 18, 19]]
time_station_available_bike_change
HOUR          7    8    9    10   17   18   19
STATION ID
[114 rows x 7 columns; the values are cut off in the export]
# Get the average peak-hour bike change for the unique stations from time_station_available_bike_change
change = time_station_available_bike_change.loc[dataset['STATION ID'].iloc[mask_stations]].mean(axis=1)
# Create a colormap spanning the observed range of average change
colormap = linear.YlOrRd_09.scale(change.min(), change.max())
# Build [lat, lon, weight] triples and render the heat map
heatmap_data = [[lats.iloc[i], longs.iloc[i], change.iloc[i]] for i in range(len(stations))]
mp = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
HeatMap(heatmap_data, radius=15).add_to(mp)

mp.save('heatmap.html')
mp
# Heat map of bike change for our peak hours separated into morning and evening
# Filter data for peak hours
peak_hours = [7, 8, 9, 10, 17, 18, 19]
peak_hour_data = weekday_avg[weekday_avg['HOUR'].isin(peak_hours)]
def get_heatmap_data(data):
    # Get unique station names, latitudes, and longitudes
    stations, mask_stations = np.unique(dataset.NAME, return_index=True)
    lats = dataset.LATITUDE.iloc[mask_stations]
    longs = dataset.LONGITUDE.iloc[mask_stations]
    # Average the bike change per station across the hours present in the passed subset
    change = data.groupby('STATION ID')['AVAILABLE_BIKES_CHANGE'].mean()
    change = change.loc[dataset['STATION ID'].iloc[mask_stations]]
    # Use the station's coordinates with its average change as the heat-map weight
    heatmap_data = [[lats.iloc[i], longs.iloc[i], change.iloc[i]] for i in range(len(stations))]
    return heatmap_data
# Morning peak: hours 7-10
morning = peak_hour_data[peak_hour_data['HOUR'].isin([7, 8, 9, 10])]
morning_heatmap_data = get_heatmap_data(morning)
morning_map = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
HeatMap(morning_heatmap_data, radius=15).add_to(morning_map)
display(morning_map)
# Evening peak: hours 17-19
evening = peak_hour_data[peak_hour_data['HOUR'].isin([17, 18, 19])]
evening_heatmap_data = get_heatmap_data(evening)
evening_map = folium.Map(location=[53.34, -6.2603], zoom_start=10, tiles='cartodbpositron')
HeatMap(evening_heatmap_data, radius=15).add_to(evening_map)
display(evening_map)
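Finally, a minimal sketch (our assumption, not a conclusion drawn yet) of how the peak-hour turnover table could feed the Task 1 decision: stations with the lowest average turnover are natural candidates for removal or downsizing, while the busiest mark areas where extra capacity would help.

# Rank stations by average peak-hour turnover (bikes changed per sample)
avg_change = time_station_available_bike_change.mean(axis=1).sort_values()
print(avg_change.head(3))  # least-used stations: candidates for removal/downsizing
print(avg_change.tail(3))  # busiest stations: where extra capacity may help most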