Task 1 EDA Report Week 0
Task Objective: Perform exploratory data analysis to identify insights that will inform
feature engineering and support the modelling tasks.
Completed Tasks:
An initial analysis of the data has been carried out. The main observations are listed
below.
● More than 80% of the records have a discarded status
● Only about 14k images are successfully labelled
● The Food Ids (food_id) 63 and 146 have the highest value counts in the data
(>1500). The rest are capped at 1464
● Some records have a NULL food_id; of those, 27 have a "successful"
status_type. Interestingly, 25 belong to device_id BTCH123000V3 and 2 belong to
device_id BTCH123000W7
● Of the records with a "successful" status, 3k+ have no labelled_at and 3 have
processed_image set to False
Interestingly: number of status_type = successful < food_id = NA < labelled_at = NA
< processed_image = True
● There are 19 NaN values for device_id, 92k+ for ingredient_id and 93k+ for
event_webhook_id (but these are not critical)
● In core_food_data, there are food items such as Empty Bin (food_id 57) and
Plastic waste (food_id 58)
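The counts above could be reproduced with a few pandas calls. The following is a minimal sketch, not the code used for this report: it assumes the events are loaded into a DataFrame with the columns named in the observations (status_type, food_id, device_id, labelled_at), and the function name summarise_events is hypothetical.

```python
import pandas as pd


def summarise_events(events: pd.DataFrame) -> dict:
    """Collect headline counts similar to the observations in this report.

    Assumes the columns status_type, food_id, device_id and labelled_at exist.
    """
    # Normalise case so "Successful" and "successful" are counted together.
    status = events["status_type"].str.lower()
    successful = events[status == "successful"]

    return {
        # Fraction of records per status (e.g. the discarded share).
        "status_share": status.value_counts(normalize=True),
        # Most frequent food_id values.
        "top_food_ids": events["food_id"].value_counts().head(5),
        # Missing values per column.
        "null_counts": events.isna().sum(),
        # Successful records with no food_id, broken down by device.
        "successful_null_food_by_device": successful.loc[
            successful["food_id"].isna(), "device_id"
        ].value_counts(),
        # Successful records that were never given a labelled_at timestamp.
        "successful_without_labelled_at": int(successful["labelled_at"].isna().sum()),
    }
```

Lower-casing status_type is a defensive choice, since the report writes the status both as "successful" and "Successful".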
Data Trends
Plot 1: Item weight of successful events for device BTCH123000V3. There are sometimes
stretches of several days with no events
Plot 2: Average item weight per month, day, hour and day_name (based on logged_at
time)
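The gaps visible in Plot 1 could be listed explicitly rather than read off the chart. This is a minimal sketch under assumptions: the events carry a logged_at timestamp column (timezone-naive) alongside device_id and status_type, and the helper name days_without_events is hypothetical.

```python
import pandas as pd


def days_without_events(
    events: pd.DataFrame, device_id: str, status: str = "successful"
) -> pd.DatetimeIndex:
    """Return calendar days, between a device's first and last event,
    on which no event with the given status was logged."""
    mask = (events["device_id"] == device_id) & (
        events["status_type"].str.lower() == status
    )
    ts = pd.to_datetime(events.loc[mask, "logged_at"])
    if ts.empty:
        return pd.DatetimeIndex([])

    # Days on which at least one event was seen (time of day dropped).
    seen_days = pd.DatetimeIndex(ts.dt.normalize().unique())
    # Every calendar day in the device's active range.
    all_days = pd.date_range(seen_days.min(), seen_days.max(), freq="D")
    return all_days.difference(seen_days)
```

For example, days_without_events(events, "BTCH123000V3") would return the idle days for the device shown in Plot 1.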
Observations:
1. There is a huge spike in May
2. There is a spike on Sundays
3. There is a spike at the beginning of every month, mostly in the first weekend
4. The peak time of day is between 16:00 and 20:00
5. The frequency of events increased drastically from Dec 2021 to Apr 2022 (from ~0 to 7k+)
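Aggregations like those behind Plot 2 and observation 5 can be sketched as follows. This is an illustrative example, not the notebook code used for the plots: the weight column name item_weight is an assumption (the report only refers to "item weight"), and both function names are hypothetical.

```python
import pandas as pd


def mean_weight_by_time(events: pd.DataFrame, weight_col: str = "item_weight") -> dict:
    """Average item weight grouped by month, day of month, hour and day name,
    all derived from the logged_at timestamp."""
    ts = pd.to_datetime(events["logged_at"])
    weight = events[weight_col]
    return {
        "month": weight.groupby(ts.dt.month).mean(),
        "day": weight.groupby(ts.dt.day).mean(),
        "hour": weight.groupby(ts.dt.hour).mean(),
        "day_name": weight.groupby(ts.dt.day_name()).mean(),
    }


def events_per_month(events: pd.DataFrame) -> pd.Series:
    """Number of events per calendar month, in chronological order."""
    ts = pd.to_datetime(events["logged_at"])
    return ts.dt.to_period("M").value_counts().sort_index()
```

Each entry of the returned dict is a Series that can be plotted directly, for example mean_weight_by_time(events)["hour"].plot(kind="bar").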
Future Work:
● More analysis of the data including the new dataset provided by Mark
● Our first team meeting will be on Saturday and we expect all those who have
contributed to the analysis to attend so we can collate ideas and move forward
Challenges
● N/A
Limitations:
● N/A