GoTo Data Science Recruiting Assignment
Detailed Walkthrough
1. Initially, the code set the target to 1 wherever the merged dataset had
"ACCEPTED" in the `participant_status` column. This was incorrect because a
"CREATED" event is logged whenever the system polls a driver, after which the
driver either accepts, rejects, or ignores the request. Every "CREATED" row
therefore received a target of 0 even though it records no driver decision, which
biased the target towards 0. To correct this, we should remove the "CREATED" rows,
keep the target at 1 for "ACCEPTED", and set it to 0 only for "REJECTED" or
"IGNORED" statuses, as our goal is to maximize the "ACCEPTED" responses.
2. To implement the new feature capturing each driver's track record, I counted the
unique rides with status COMPLETED for each driver in the `booking_log`
database. I then merged these counts into the master database on the `driver_id`
column and dropped the rows with null values.
3. To evaluate the model, I used the accuracy, precision, recall, and F1 score metrics
from the `sklearn` library. Because the existing `predict` function of the
`SklearnClassifier` class returns predicted probabilities, I added a new
`predict_class` function to that class that returns the predicted classes instead.
4. The model already performed well on this data with the predefined parameters.
I still tuned the hyperparameters, reducing `max_depth` to limit overfitting, which
resulted in a better score on test_data. Note that without the changes described
in Step 1, the model performed poorly on test_data, which is consistent with the
target having been biased towards 0 initially.
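
The target correction from Step 1 can be sketched roughly as follows. This is a minimal illustration and not the assignment's actual code: the sample rows, the `order_id` column, and the `build_target` helper are hypothetical. Only `participant_status` and its four status values come from the walkthrough.

```python
import pandas as pd

# Hypothetical rows standing in for the merged dataset.
merged = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 3],
    "driver_id": [10, 10, 11, 11, 12, 12],
    "participant_status": [
        "CREATED", "ACCEPTED",
        "CREATED", "REJECTED",
        "CREATED", "IGNORED",
    ],
})


def build_target(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that record a driver response and label them.

    ACCEPTED -> 1, REJECTED / IGNORED -> 0. CREATED rows (and, as an
    assumption of this sketch, any other status) are dropped.
    """
    responses = ["ACCEPTED", "REJECTED", "IGNORED"]
    responded = df[df["participant_status"].isin(responses)].copy()
    responded["target"] = (responded["participant_status"] == "ACCEPTED").astype(int)
    return responded


labelled = build_target(merged)
```

With the polling rows removed, each remaining row corresponds to one driver decision, so the share of zeros reflects actual rejections and ignores.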
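
The driver track-record feature from Step 2 might look like the sketch below. The column names `order_id` and `booking_status`, the feature name `completed_rides`, the helper `add_completed_rides`, and all sample values are assumptions made for illustration. Only `booking_log`, `driver_id`, and the COMPLETED status appear in the walkthrough.

```python
import pandas as pd

# Hypothetical booking log; column names other than driver_id are assumed.
booking_log = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "driver_id": [10, 10, 10, 11, 12],
    "booking_status": [
        "COMPLETED", "COMPLETED", "COMPLETED", "COMPLETED", "CUSTOMER_CANCELLED",
    ],
})

# Hypothetical master table to which the feature is added.
master = pd.DataFrame({"order_id": [5, 6, 7, 8], "driver_id": [10, 11, 12, 13]})


def add_completed_rides(master: pd.DataFrame, booking_log: pd.DataFrame) -> pd.DataFrame:
    """Attach the number of unique completed rides per driver to the master table."""
    completed = booking_log[booking_log["booking_status"] == "COMPLETED"]
    rides = (
        completed.groupby("driver_id")["order_id"]
        .nunique()                      # count each ride once, even if logged twice
        .rename("completed_rides")
        .reset_index()
    )
    merged = master.merge(rides, on="driver_id", how="left")
    # Drivers without any completed ride end up with NaN and are dropped here,
    # mirroring the "dropped the null values" step.
    return merged.dropna(subset=["completed_rides"])


features = add_completed_rides(master, booking_log)
```

Dropping the null rows removes drivers with no completed rides from the data. Filling those values with 0 would be an alternative, though the walkthrough does not describe doing so.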
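
The evaluation from Step 3 can be illustrated with the sketch below. The `SklearnClassifier` shown here is a hypothetical minimal stand-in, not the assignment's actual class. The random forest estimator, the 0.5 threshold, the `max_depth=3` value, the `evaluate` helper, and the synthetic data are all assumptions. The metric functions are the standard ones in `sklearn.metrics`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


class SklearnClassifier:
    """Hypothetical minimal wrapper; the real class in the assignment may differ."""

    def __init__(self, estimator):
        self.clf = estimator

    def train(self, X, y):
        self.clf.fit(X, y)

    def predict(self, X):
        # Probability of the positive class (ACCEPTED).
        return self.clf.predict_proba(X)[:, 1]

    def predict_class(self, X, threshold=0.5):
        # Convert probabilities to hard 0/1 labels; the threshold is an assumption.
        return (self.predict(X) >= threshold).astype(int)


def evaluate(y_true, y_pred):
    """Compute the four metrics named in Step 3."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Synthetic data, used only so that the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# A small max_depth, in the spirit of Step 4; the value 3 is arbitrary.
model = SklearnClassifier(
    RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
)
model.train(X[:150], y[:150])
y_pred = model.predict_class(X[150:])
scores = evaluate(y[150:], y_pred)
```

Precision, recall, and F1 require hard class labels, which is why a method like `predict_class` is needed alongside the probability-returning `predict`.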