1 Introduction

The ability to individually track animals provides an opportunity to identify inter-individual differences and personality patterns and to compare cognitive performance on various tasks (Christakis et al. 2012; Toledo et al. 2020). Some animals are relatively easy to track, especially those that are large and move slowly in two dimensions; others, such as insects, which can be small and fly fast in three dimensions, are more challenging. For example, a honey bee, with an average length of 14 mm and a flight speed exceeding 25 km/h, is difficult to track in space (Barron and Srinivasan 2006).

Thanks to technological advancements, movement ecology now relies on advanced tracking and spatial analysis tools (Brum-Bastos et al. 2022). Beneficial insects, such as bees and wasps, are vital for agricultural services like pollination and biological control, and understanding their movement patterns is crucial for conservation biology (Chapman et al. 2023). When direct observation is impractical, theoretical frameworks can estimate flight distances and predict movement between flower patches (Cresswell et al. 2000; Brunet et al. 2023). However, a gap remains between theoretical predictions and actual observations in pollination studies (Kendall et al. 2022).

The main aim of this paper is to establish a machine learning (ML) based system to recognize and track insects to enable the investigation of various aspects of their spatial behavior. Yet, building ML systems requires existing data. Hence, the systems described in this paper were built to assist in our research regarding the influence of nutrition on bees’ spatial cognition. Additionally, given the challenge of tracking small insects, there is a need to develop dedicated models for various environments. Consequently, an integral aspect of our aim is to provide detailed guidance on building the database and training the models. This will enable other researchers to replicate and customize the system for their own experiments.

In our study, we used worker honey bees as a model organism, tracking them individually to assess their behavior and the impact of stressors on their performance. This study presents the development of an AI system using machine-learning models to identify and track bees. We divide the study into two main steps: first, developing a machine-learning system (the planar bee tracking system) to identify bees in a closed 8-arm radial maze for studying behavior while walking; and second, creating a machine-learning system (the spatial bee tracking system) to track flying bees for advanced research on spatial behavior and flight patterns in the real world.

1.1 Background

Studying the spatial behavior of bees and the effect of various stressors on performance in a spatial task is a multifaceted activity. Bees are exposed to multiple biotic and abiotic stressors (Goulson et al. 2015; Samuelson et al. 2016; Gómez-Moracho et al. 2017; Belsky and Joshi 2019). We wanted to develop a system that would enable these studies to be extended and allow researchers to rigorously quantify the effect of stressors on bees' spatial behavior.

Bees’ spatial behavior can be studied using radar to examine their flight path in open areas (Menzel et al. 2005). Alternatively, bees can be trained to recognize landmarks and tested on changed terrain to observe their use of landmarks for navigation (Ushitani et al. 2016). For studying spatial behavior in bees, we designed an 8-arm radial maze similar to those used with rodents (Kurzina et al. 2020). While basic maze performance can be observed in real-time, video recording allows for more accurate and detailed analysis. Developing a machine-learning system enables high-resolution tracking of bee movement, providing precise data on speed, acceleration, and movement patterns, enhancing analysis efficiency.

A machine-learning model was trained to recognize and track the bee’s position in the maze. It provides insights into the bee’s location, maze completion, time taken, speed, and total walking distance. Another model was developed to analyze flying patterns, locations, and flight duration of bees in real-world scenarios. Videos of bees flying in a controlled environment were used to train and test this model.

Computer vision and AI have been utilized in numerous studies exploring different aspects of bee biology. For instance, one study (Bjerge et al. 2019) employed computer vision and AI to detect mite infestations in beehives, while another study (Bozek et al. 2021) investigated crowd behavior in bee colonies. Additional research (Ramírez et al. 2012; Elizondo et al. 2013) used computer vision to monitor honeycomb cells in hives, or to identify honey bee foragers that returned to the hive carrying pollen loads on their hind legs (Rodriguez et al. 2018). A comprehensive review (Odemer 2022) critically examines the historical progression of automated techniques for measuring bee flight and counting bees, with a focus on improving validation methods.

The results of these studies indicate that AI approaches provide high measurement accuracy, which may be better than manual or traditional automated approaches. Recently, Bjerge et al. (2022) provided an automated AI-based real-time insect monitoring system that identifies individual insects in the field, providing information about phenology and foraging behavior. Similarly, Zhang et al. (2024) developed machine-learning techniques to discriminate between three species of bees that pollinate alfalfa. Methods are also being developed for detecting small moving objects, such as bees, in videos recorded by unmanned aerial vehicles (Stojnić et al. 2021).

Our study introduces an automated tool for identifying and tracking individual bees. We provide a detailed description of the system, covering data preparation, training, classification, and tracking. Unlike previous research on swarm tracking or individual identification, our focus is on precisely measuring movement patterns of individuals. This detailed measurement enables analysis of walking and flying patterns, contributing to a better understanding of spatial behavior.

1.2 Machine learning background

Machine learning is a subset of AI that uses algorithms to learn patterns from data for classification, prediction, and decision-making. Supervised learning is a technique in which a model learns from labeled examples: during training, the model's parameters are adjusted until it produces accurate solutions, and its accuracy is then evaluated on a separate set of labeled examples.

Artificial neural networks (ANNs) are supervised learning algorithms inspired by brain processes (Bhattacharya et al. 2021). ANNs consist of interconnected units arranged in input, hidden, and output layers; each neuron performs a mathematical operation on the weighted contributions of its inputs. Convolutional neural networks (CNNs) are a type of ANN used for image processing. They use convolutions instead of general matrix multiplications, reducing the image representation while extracting features. You Only Look Once (YOLO) is a fast and accurate CNN algorithm for object detection (Fang et al. 2019). It divides the image into a grid of cells and predicts, for each cell, bounding boxes and how well each box fits an object (Fig. 1).

Fig. 1 You only look once (YOLO) algorithm: dividing the image into small parts in order to identify objects (Bandyopadhyay 2022)

Fig. 2 The network architecture of you only look once (YOLO) (Redmon et al. 2016) and the evolution of its versions (Bandyopadhyay 2022)

YOLO’s architecture has a total of 24 convolutional layers with two fully connected layers at the end, as illustrated in Fig. 2a. Over the years, the algorithm has been improved and developed (see timeline in Fig. 2b), and today there is a fifth version (YOLO-v5), which we used in this study. We chose to use this version because of its high speed, which is vital for developing a real-time tracking system.

When recognizing bees using the YOLO algorithm, we must also consider the possibility that several bees appear in the same frame, which would negatively impact the tracking process. Tracking means that from the moment we recognize a particular object, we continue to recognize it in the following frames as the same object we recognized first, even if more objects are recognized in those frames. Hence, we need to differentiate between the various objects by giving each object a unique number.

Simple online and realtime tracking (SORT) (Bewley et al. 2016) is a tracking algorithm written as an open-source project. This algorithm uses detection data from previous and current frames to make associations. For this purpose, it uses the Kalman filter (Bishop and Welch 2001). The SORT algorithm is one of the fastest open-source tracking algorithms for multiple objects.

Tracking algorithms use previous movement information (speed, direction) to predict future movement. When new samples are received, we associate them with previous ones based on our prediction. The Kalman filter balances the weight between the prediction and the noisy, inaccurate samples to obtain an accurate path in tracking.
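To make this concrete, the following minimal sketch (our illustration, not code from the SORT library) shows a one-dimensional constant-velocity Kalman filter balancing its prediction against noisy position samples; SORT applies the same predict/update cycle to the bounding-box state of each tracked object. All parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal 1-D constant-velocity Kalman filter (illustrative sketch).
# State x = [position, velocity]; we only measure the (noisy) position.
dt = 1.0                                 # time step between frames
F = np.array([[1, dt], [0, 1]])          # state transition (constant velocity)
H = np.array([[1, 0]])                   # measurement model: position only
Q = np.eye(2) * 1e-3                     # process noise covariance
R = np.array([[0.5]])                    # measurement noise covariance

x = np.array([[0.0], [0.0]])             # initial state estimate
P = np.eye(2)                            # initial estimate covariance

def kalman_step(x, P, z):
    # Predict: propagate the state and its uncertainty one frame forward.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: blend the prediction with the noisy measurement z.
    y = z - H @ x_pred                   # innovation (measurement residual)
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain: prediction-vs-sample weight
    return x_pred + K @ y, (np.eye(2) - K @ H) @ P_pred

for z in [1.1, 2.0, 2.9, 4.2, 5.0]:      # noisy position samples
    x, P = kalman_step(x, P, np.array([[z]]))
    print(f"position {x[0, 0]:.2f}, velocity {x[1, 0]:.2f}")
```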

2 Planar bee tracking system

This system tracks a single bee in an 8-arm radial maze. It recognizes and records the bee’s location and movement within the maze. In the upcoming sections, we provide an overview of the biological experiment, its technical aspects, and the development of the machine-learning system.

2.1 Biological and technical description of the input data

Our experiment tested honey bee foragers (Apis mellifera). The tested bees were kept in a temperature-controlled room at 25 °C. The room was naturally illuminated by one window, which provided the colony with circadian cues about the time of day. It was further illuminated by 36 fluorescent light bulbs, which turned on about half an hour after sunrise and turned off about half an hour before sunset throughout the entire experimental period. This allowed the bees to naturally return to their hive towards sunset. Every third bulb was connected to a different phase of a three-phase electricity system, which reduces the flicker frequency to which bees are sensitive. The bees were divided into two groups of 30 bees each: the control group received a balanced diet containing a 1:1 ratio of omega-6 and omega-3 essential fatty acids, and the other group received an unbalanced diet containing five times more omega-6 than omega-3, which is known to impair associative learning in bees (Arien et al. 2018).

After the feeding period, the behavior of the bees in the bee maze was tested. The maze was built in the form of 8 arms, as illustrated in Fig. 3. Honey bees have been tested in a 6-arm maze before (Brown et al. 1997), and 8-arm mazes are typically used with rodents (Olton and Samuelson 1976), thus facilitating comparisons. A feeder at the end of each arm contained 3 micro-liters of 20% w/w sucrose solution, since the quality of the reward can affect the motivation of foraging bees. Honey bees collect nectar in their crop mainly to carry back to the colony. A bee can load upwards of 50 micro-liters, so the bee in our experiment remained motivated to feed throughout the test. The bee was expected to collect food from all the arms in less than 5 minutes. The bee's performance was assessed by several measures: the speed of completing the maze, the number of times the bee entered each arm, and fine details of its movement patterns.

Fig. 3 Photo (left) and diagram (right) of the 8-arm radial maze used in this study. At the end of each arm is a feeder that could be filled with a small drop of sugar solution. The maze was covered with a removable transparent plexiglass ceiling, which had a small hole at the center, allowing the bee to fly into the center of the maze (marked by a dark circle). This hole was then covered by an additional small transparent flap, so that no additional bees could enter the maze. When the bee finished visiting all the arms of the maze, the plexiglass ceiling was lifted, and the bee flew away towards the hive. The diagram shows measurements in centimeters

The experiment was filmed using a camera fixed at a constant height and angle above the maze (see Fig. 4). Recording was started manually when the bee entered the maze and stopped when it exited.

Fig. 4 A camera is fixed to a tripod placed above the 8-arm radial maze. The maze is fixed to a Styrofoam mold, and two of the three tripod legs are fixed to the mold as well. We used a Panasonic camera, model HC-VX870, with a resolution of 1080p and an image capture rate of 50 fps. The distance between the camera and the maze is 70 cm, and the field of view (FOV) is 63.57°

We worked with individually marked bees, and each bee made repeated visits to the maze. Bees were initially trained to approach the maze entrance. When an individual bee had learned to approach the maze entrance, it was allowed to enter the maze and the video recording began. The entrance to the maze was then blocked by a transparent cover to prevent more bees from entering. The arms visited by the bee were noted by the experimenter, and when the bee had visited all eight arms, the ceiling was lifted, and the bee flew back to the hive.

2.2 Constructing the model

To train a machine-learning model, we need to use preexisting labeled examples, which enable the model to identify repetitive attributes and use them to recognize the desired object. Therefore, to train a model that can recognize a bee, we need to provide it with many bee images. The training data must contain varied images with the bee in different locations and postures, so that the model learns to recognize the bee itself rather than some constant background feature.

Fig. 5 Preparing the training dataset for the planar bee tracking system model

2.2.1 Creating the training set

Many tools require manually marking the object's location in each image to create the labeled training dataset, a highly time-consuming process. Hence, we created an automatic image-processing system to mark the bee's location. This system enables the quick creation of a set of marked images through several preparation stages, as follows (see Fig. 5).

(a1) Create reference image. To prepare the reference background image of the maze, we took two different frames from a video of a bee in the maze, such that in the first frame the bee is on the lower side of the maze and in the second frame it is on the upper side. We cut the two frames in the middle and took the parts without the bee; joining the two parts created the reference image.

(a2) Convert reference image to grayscale. Given a colorful reference image, we eliminate the unnecessary color information by converting the image to grayscale.

(b1) Convert the frame to grayscale. As in the previous stage, we eliminate unnecessary color information, this time from the current frame.

(b2) Subtract frame from reference image. For each frame, we subtract the current frame from the reference image. As a result, most pixels are reset to zero (black); where the difference is negative, the unsigned 8-bit arithmetic wraps around to values near the maximum of 255 (white). Hence, we are left with the bee in grayscale (around the value 90) and many weak noise points near the values 255 and 0.

(b3) Clean noise. We clean the image by resetting the noise (values in the ranges 0–30 and 225–255) to black, resulting in a smooth image with a grayscale bee.

(b4) Convert to black and white. Every pixel that is not black is set to white. That way, we get a two-tone image: a black background and a white bee in the middle. Note that additional white noise may remain in the image.

(b5) Morphological opening. We run an opening operation on the current frame in two phases:

  1. Erosion. To remove small noise from the frame, we blacken each pixel that is not fully surrounded by white pixels in the original image. This effectively removes unwanted disturbances.

  2. Dilation. We set each pixel to white if at least one of its neighbors is white in the original image. This restores the bee to approximately its initial size.

After these operations, we obtain a noise-free frame, with the bee in a slightly coarser form, without legs or other delicate parts.

(b6) Find bee location. From the resulting image, we determine the upper-left and bottom-right coordinates of the white pixels to generate a bounding box. We then verify that this bounding box matches the size of a bee, thus filtering out frames with multiple bees or human hands (for example, the hands of the experimenter). If the bounding box matches a bee's size, we have successfully automated the detection of the bee in the frame.

(b7) Save image and text files. We save the original image and generate a new text file containing the bee's position in the image and the width and height of the bounding box, normalized to the size of the image. These two output files constitute the training set, which is the input for the YOLO-v5 machine-learning model.

(b8) Human validation. We duplicated each image for accuracy validation and added a red square indicating the calculated bee location. By converting these images into a video and reviewing them manually, we ensured that the red square consistently appeared in the correct position, i.e., that the bee was identified accurately. This confirms the validity of the automatic detection process.

To create a training set, we generated 4000 images using the aforementioned process.
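For concreteness, the following OpenCV sketch outlines stages (b1)–(b7) above under simplifying assumptions: the file names, bee-size bounds, and kernel size are hypothetical, and the bee-size check is reduced to a single width/height test.

```python
import os

import cv2
import numpy as np

# Sketch of the automatic labeling pipeline (stages b1-b7), assuming
# 'reference.png' is the grayscale background image from stages a1-a2
# and 'maze_video.mp4' is a recording of a single bee in the maze.
os.makedirs("dataset/images", exist_ok=True)
os.makedirs("dataset/labels", exist_ok=True)

reference = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)
cap = cv2.VideoCapture("maze_video.mp4")
kernel = np.ones((3, 3), np.uint8)
saved = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # b1: grayscale
    diff = reference - gray                            # b2: uint8 subtraction; negatives wrap to ~255
    diff[(diff <= 30) | (diff >= 225)] = 0             # b3: clean noise near 0 and 255
    bw = np.where(diff > 0, 255, 0).astype(np.uint8)   # b4: black background, white bee
    bw = cv2.morphologyEx(bw, cv2.MORPH_OPEN, kernel)  # b5: erosion then dilation
    ys, xs = np.nonzero(bw)                            # b6: bounding box of white pixels
    if xs.size == 0:
        continue
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    w, h = x1 - x0, y1 - y0
    if not (10 < w < 80 and 10 < h < 80):              # crude bee-size filter (assumed bounds)
        continue                                       # skips multiple bees, hands, etc.
    H, W = gray.shape
    # b7: YOLO label format: class x_center y_center width height, normalized
    label = f"0 {(x0 + w / 2) / W:.6f} {(y0 + h / 2) / H:.6f} {w / W:.6f} {h / H:.6f}\n"
    cv2.imwrite(f"dataset/images/frame_{saved:05d}.jpg", frame)
    with open(f"dataset/labels/frame_{saved:05d}.txt", "w") as f:
        f.write(label)
    saved += 1

cap.release()
```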

2.2.2 Training the model

To train the model using the YOLO algorithm, we divided the compiled database into training (3200 images), testing (400 images), and validation (400 images) datasets.

We then trained the model for 300 epochs (an epoch is one iteration over the complete training set) with a batch size of 32 images. The batch size and number of epochs were set according to the recommended values for the memory available on our hardware (16 GB of RAM).
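As an illustration, the split and a typical training invocation might look as follows, assuming the directory layout from the labeling step above and the standard scripts of the YOLO-v5 repository; all file and path names are hypothetical.

```python
import random
import shutil
from pathlib import Path

# Sketch: split the 4000 auto-labeled images into train/test/validation
# sets (3200/400/400), matching the layout YOLO-v5 expects.
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(42)
random.shuffle(images)
splits = {"train": images[:3200], "test": images[3200:3600], "val": images[3600:]}

for split, files in splits.items():
    for sub in ("images", "labels"):
        Path(f"bee_maze/{sub}/{split}").mkdir(parents=True, exist_ok=True)
    for img in files:
        label = Path("dataset/labels") / (img.stem + ".txt")
        shutil.copy(img, f"bee_maze/images/{split}/{img.name}")
        shutil.copy(label, f"bee_maze/labels/{split}/{label.name}")

# Training is then launched with the YOLO-v5 repository's train.py script,
# pointing a data YAML (with a single 'bee' class) at bee_maze/, e.g.:
#   python train.py --img 640 --batch 32 --epochs 300 \
#       --data bee_maze.yaml --weights yolov5s.pt
```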

The training process was completed after 4.5 hours, resulting in a fully trained model. Figure 6 shows the loss function and metrics as a function of the number of epochs completed during the training process.

Fig. 6 Loss functions as a function of the number of epochs (x-axis) in the training process

The YOLO loss function has three different measurements:

  • Box (box_loss) Loss due to a box prediction not exactly covering an object (bounding box regression loss).

  • Objectness (obj_loss) Loss due to a wrong prediction of the box-object IoU (intersection over union).

  • Classification (cls_loss) Loss due to deviations from predicting ‘1’ for the correct classes and ‘0’ for all the other classes for the object in that box.

The YOLO accuracy functions use mean average precision (mAP), which compares the ground-truth bounding box to the detected box and returns a score: the higher the score, the more accurate the model is in its detection.

  • mAP 0.5 With the IoU threshold set to 0.5, the AP (average precision) is computed for each category over all images and then averaged across categories to give the mAP.

  • mAP 0.5–0.95 The average mAP over different IoU thresholds (from 0.5 to 0.95 in steps of 0.05).

The six left plots in Fig. 6 display the loss functions. The upper three plots show training set loss, while the lower three plots show validation set loss.

All loss plots decreased with increasing epochs, indicating improved model performance. No overfitting was observed, as improvement was seen in both training and validation sets. After 300 epochs, further training yielded negligible improvement. Continuing beyond this point was unnecessary. Additionally, the four right-hand plots in Fig. 6 demonstrate significant initial improvement in measured accuracy.

Fig. 7 Flow chart for the bee recognition process

2.3 Recognizing a bee in the maze

The process for recognizing a bee in a maze from a video consists of several steps (see Fig. 7):

(a) Detect the maze's center. In the preprocessing step, we determine the location of the maze's center as a reference point. This step is crucial for accurately calculating the bee's position and distance in the video. Instead of calculating the center's location in each frame, which is inefficient and prone to noise from human hands or the bee, we detect the center by analyzing information from 100 frames. We select the point with the highest likelihood of being the center, grading each candidate by its frequency and the confidence value obtained from the model.

(b1) Detect a bee. The recognition machine-learning model processes video frames and provides the bee's position as (minX, maxX, minY, maxY), the pixel coordinates of the bee's bounding box.

(b2) Filter unreasonable results. To validate the bee's position in the current frame, we verify that it falls within the known boundaries of the maze. Additionally, we compare the current and previous locations of the bee to ensure consistent movement and to filter out bees that are merely flying above the maze.

(b3) Calculate the bee's position in centimeters. Given the machine-learning output, the bounding box (minX, maxX, minY, maxY), we calculate the center of the bee in centimeters as follows (see also the code sketch after this list):

$$\begin{aligned} x_{bee}[\mathrm{cm}] &= \left( minX_{bee}+\frac{maxX_{bee}-minX_{bee}}{2}+X_{MazeCenter}\right) \cdot pixelToCm \\ y_{bee}[\mathrm{cm}] &= \left( minY_{bee}+\frac{maxY_{bee}-minY_{bee}}{2}+Y_{MazeCenter}\right) \cdot pixelToCm \end{aligned}$$

where \(X_{MazeCenter}\) and \(Y_{MazeCenter}\) are the maze-center coordinates in pixels (see phase (a)), and pixelToCm is a constant whose value, 0.039489 cm, was measured from a video; it expresses the pixel size in centimeters.

(b4) Locate the arm and the distance. To track the bee's progress in the 8-arm maze, we determine the specific arm (numbered 1 to 8) the bee has visited and measure the distance it has traveled. We accomplish this by converting the bee's Cartesian coordinates to polar coordinates and checking which arm's angular range contains the bee's angle, as illustrated in Fig. 8.

(b5) Calculate the total path length and the average speed. The spatial cognition abilities of a bee can be evaluated by the length of its path and the speed at which it completes the maze visit: a bee with strong spatial cognition will exhibit shorter paths and faster completion times, which we aim to calculate.

We determine the total length traveled by the bee in the maze by comparing consecutive frames and measuring the distance between the bee's locations. This allows us to assess the bee's efficiency in navigating the maze.

To ensure accuracy and to exclude small movements while the bee is feeding (from the feeder at the end of each maze arm), we set a minimum distance threshold between frames. Only distances exceeding this threshold are added to the total length. The threshold was chosen experimentally and fine-tuned by observing the videos, so that genuine movement is measured while feeding behavior is excluded. The threshold also depends on the frame rate, which is 50 fps in our videos. Based on these considerations, we set the minimum movement distance to a shift of three pixels between frames. Figure 9 shows an example of a bee's processed path, plotted with Wolfram Mathematica.

To calculate the average walking speed, we divide the total path length by the total time spent walking (excluding feeding time).

(b6) Save results to new video and CSV files. The bee-recognition analysis produces two output files:

  1. A video file in which the bee and maze are annotated and an information panel is displayed on the left side (see Fig. 10 for visualization). The video makes it possible to review the detections 'in real time' and to understand the software's calculations and decisions. (For example, if the maze's center was detected in the wrong location, the video helps explain the strange results obtained when decoding that recording.)

  2. A CSV file containing all the information collected about the bee in each frame (see the example in Table 1). The CSV file enables analysis of the extracted data with additional external tools.
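The following minimal sketch illustrates the geometric core of steps (b3)–(b5). It assumes coordinates expressed relative to the maze center (as required for the polar-angle arm test); the sector-to-arm numbering and the example track are hypothetical.

```python
import math

PIXEL_TO_CM = 0.039489   # pixel size in centimeters, as measured from a video
MIN_SHIFT_PX = 3         # minimum per-frame shift counted as real movement

def bee_center(min_x, max_x, min_y, max_y, maze_cx, maze_cy):
    # b3: center of the bounding box, expressed relative to the maze
    # center (needed for the polar-angle arm test below), in pixels.
    return (min_x + (max_x - min_x) / 2 - maze_cx,
            min_y + (max_y - min_y) / 2 - maze_cy)

def arm_number(x, y):
    # b4: convert to polar coordinates and map the angle to one of the
    # eight 45-degree sectors. The sector-to-arm numbering here is
    # illustrative; the actual numbering follows Fig. 8.
    angle = math.degrees(math.atan2(y, x)) % 360
    return int(angle // 45) + 1

def path_length_cm(centers):
    # b5: accumulate frame-to-frame distances, skipping shifts below the
    # threshold (e.g., small movements while the bee is feeding).
    total = 0.0
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        d = math.hypot(x1 - x0, y1 - y0)
        if d > MIN_SHIFT_PX:
            total += d * PIXEL_TO_CM
    return total

# Example: a short, hypothetical track (maze-centered pixel coordinates)
track = [(120.0, -3.0), (124.0, -1.0), (125.0, -1.0), (131.0, 2.0)]
print(arm_number(*track[0]), f"{path_length_cm(track):.2f} cm")
```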

Fig. 8 Numbering and locations of the arms

Fig. 9 An example of the paths taken by a bee. Green dots show where the bee walked around the maze. The blue straight line indicates the straight path the bee could have walked to reach the feeder; the red dots indicate the actual path of the first entry. The numbers on the axes represent centimeters

Fig. 10 Video output file example. An example of a frame in the output video with the information block and markings (the bee in red, the maze center in magenta, and the black maze flags in green). The information panel provides comprehensive details about the bee's location, including Cartesian and polar coordinates, total distance traveled, average speed, and its position within the maze (center or a specific arm) at any given time

Table 1 CSV output file example

2.3.1 Challenges in using the model

As part of analyzing the results, when we identified anomalies, we validated the bee-tracking process by manually watching the processed videos (see Fig. 10) in which the abnormal behavior appeared. This manual procedure revealed two primary issues: incorrect detection of the maze's center and the detection of flying bees outside the maze.

In some videos, the maze’s center was inaccurately identified, typically due to obstructions (e.g., the researcher’s hand) in the initial frames. Consequently, an erroneous center was selected, leading to incorrect calculations of the bee’s location. To resolve this issue, we extended the maze’s center detection stage, as outlined in step (a) (Fig. 7) above.

Another challenge arose when another bee flying above the maze was mistakenly identified as the tracked bee. To mitigate this problem, we implemented a filtering mechanism that eliminates information about other bees in the videos, as described in step (b2).

3 Spatial bee tracking system

The second system for studying bees' spatial behavior focuses on tracking flying bees. Tracking flying bees allows us to change the experimental environment, take the bee out of the controlled room, and place it in the open field. Whereas the first system is designed for specific experiments following a single bee in a planar maze, the system described in this section is intended to be more general and to support various experiments, including following several bees in less planar environments. This makes it possible to use the system in many scenarios in which the behavior of individual bees needs to be tracked, either in more sophisticated lab-oriented experiments studying spatial learning and orientation of flying bees, or outdoors.

For example, bee traffic in and out of the hive is one of the measures used to assess colony health and strength. It is also a pertinent metric for assessing the pollination potential of the hive, for example, when colonies are rented to provide pollination services. In such cases, a camera would be placed in front of a hive to monitor hive activity. To add recognition of individual bees over a longer time frame, the system could be combined with one that identifies individually marked bees at the hive entrance; it is feasible to individually mark up to several hundred bees with colored numbered tags, RFID tags, or barcodes (e.g., Crall et al. 2015; Alburaki et al. 2021; Warren et al. 2024). It would also be possible to place a camera in the field, in front of a flowering tree or patch of flowers, to assess the floral visitation behavior of the bees. The system could further be extended to observe other bee species and focus, for example, on nests of ground-nesting bees, to quantify nest building, procurement of nectar and pollen to the nest, and nest orientation behaviors.

Many of these kinds of observations are often performed by human observers, who can only sample a small fraction of the total time and with limited measures. Cameras can greatly enlarge the amount of data collected, but manual extraction of measurements from the videos is highly time-consuming and limits the quality of the data. An AI system can track bees for longer times, with fewer errors, and provide much more data, allowing more sophisticated measures and statistical analyses.

3.1 Training the model

The model was trained to detect flying bees inside a room with static cameras. Tracking flying bees involves three-dimensional data, requiring more than one camera. However, in our scenario, bees typically maintain a consistent height during flight. Therefore, we ignored this dimension to simplify data collection and analysis.

For training, we used a process similar to that described in Sect. 2 to automatically turn a video of flying bees into a dataset, without needing to manually identify the bees in each frame. We first checked whether the model from the previous part, which recognizes a bee in the maze, also manages to recognize flying bees; it turned out that it does not. Hence, we trained a new model on the new dataset of flying bees, using YOLO-v5, as described in Sect. 2. The training graphs appear in Fig. 11; the meaning of the different graphs is explained in Sect. 2.2.2. These results indicate that the training process improved the model's accuracy. A sketch of how the trained detector is applied to video frames follows Fig. 11.

Fig. 11 The training graphs of the flying bee recognition model
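As a sketch of how such a trained detector can be applied to video frames, assuming the standard torch.hub interface of the YOLO-v5 repository (the weights file, video name, and confidence threshold are hypothetical):

```python
import cv2
import torch

# Load the trained detector through the YOLO-v5 torch.hub interface;
# 'best.pt' is the hypothetical weights file produced by training.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.4                                 # confidence threshold (assumed value)

cap = cv2.VideoCapture("flying_bees.mp4")        # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) # YOLO-v5 expects RGB input
    results = model(rgb)
    detections = results.xyxy[0].cpu().numpy()   # rows: x1, y1, x2, y2, conf, class
    for x1, y1, x2, y2, conf, cls in detections:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 0, 255), 2)
cap.release()
```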

3.2 Tracking

Tracking means that from the moment a specific bee is identified, we continue to identify it in subsequent frames as the same bee, even as other bees enter and exit the frame. We differentiate between bees by giving each one a unique label. Different tracking methods are available; in this study, we used the SORT algorithm (Bewley et al. 2016), which associates new detections with those already obtained from the machine-learning model. In each frame, the SORT algorithm receives an array of fresh detections from the machine-learning model. In the first frame, the algorithm assigns a unique label to each detection and returns the list of detections with their labels. In the following frames, it tries to associate the new detections with previous ones according to proximity, speed, and direction of movement. Each bee in every frame of the output video is marked with a distinct color, together with the label obtained from the tracking algorithm. After executing the algorithm, we can see in the video that the same bee keeps the same color and label across frames, even as it moves from place to place and when several bees appear at once (see example in Fig. 12).

Fig. 12 Tracking in different frames: maintaining the same unique label and color. The distance between the camera and the hive is 150 cm, and the field of view (FOV) is 63.57°
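Connecting the detector to the tracker is then a thin loop. The following sketch assumes the open-source sort.py module from the SORT repository of Bewley et al. is on the Python path; the detections shown are hypothetical.

```python
import numpy as np
from sort import Sort   # sort.py from the open-source SORT repository

# Tracker with its published default parameters (the iou_threshold
# argument exists in recent versions of sort.py).
tracker = Sort(max_age=1, min_hits=3, iou_threshold=0.3)

# In each frame, detections are rows of [x1, y1, x2, y2, score], e.g.,
# converted from the YOLO-v5 output shown in Sect. 3.1.
detections = np.array([
    [100.0, 120.0, 118.0, 140.0, 0.91],    # hypothetical bee no. 1
    [300.0,  80.0, 317.0, 101.0, 0.88],    # hypothetical bee no. 2
])
tracks = tracker.update(detections)         # rows: [x1, y1, x2, y2, track_id]
for x1, y1, x2, y2, track_id in tracks:
    print(f"bee {int(track_id)}: ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f})")
```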

3.2.1 Challenges in using the tracking model

The tracking algorithm (Bewley et al. 2016) is not directly compatible with our detections because it relies on overlaps between the identified areas of the same object in consecutive frames to track objects continuously. In our system, however, the objects are bees: the identified area of a bee in each frame is small, while the bees move quickly. The fast movement therefore creates a significant distance between a bee's locations in consecutive frames, so the overlaps the algorithm needs in order to function effectively do not occur.

Fig. 13 Artificial enlargement of the object detection squares to improve the results of the tracking algorithm

Several empirical experiments were conducted in search of a solution to this problem. In the end, we created a two-step fix for this situation.

  • The parameters of the tracker were changed so that even if an object is not detected for several frames, the tracker keeps the track alive in case a new overlap is later detected.

  • The area of the detected object (the square surrounding the bee) was artificially enlarged before being reported to the tracking algorithm and reduced back to its original size after the tracking algorithm returned its results. The size of the original detection is thus preserved, yet the tracking algorithm sees a larger square. This creates overlap between the identifications in consecutive frames, which allows the tracking algorithm to maintain accurate tracking across frames (see Fig. 13 and the sketch below).

This fix enables the tracking algorithm to maintain consistent tracking of bees throughout the video. Note that, despite the algorithm's reasonable performance and the bees' high flight speed, there are rare instances where the algorithm loses a bee and assigns a new label to the same bee.
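A minimal sketch of this two-step fix, assuming detections as rows of [x1, y1, x2, y2, score] and hypothetical parameter values:

```python
import numpy as np
from sort import Sort

PAD = 20   # pixels added to each side before tracking (hypothetical value)

# Step one of the fix: max_age raised so a track survives several frames
# without a matching detection; the exact values here are assumptions
# and were tuned empirically in practice.
tracker = Sort(max_age=10, min_hits=1, iou_threshold=0.1)

def track_with_padding(detections):
    """detections: numpy array with rows of [x1, y1, x2, y2, score]."""
    padded = detections.copy()
    padded[:, [0, 1]] -= PAD       # enlarge: move the top-left corner out...
    padded[:, [2, 3]] += PAD       # ...and the bottom-right corner out
    tracks = tracker.update(padded)
    tracks[:, [0, 1]] += PAD       # shrink the returned boxes back, so the
    tracks[:, [2, 3]] -= PAD       # reported size matches the detection size
    return tracks                  # rows of [x1, y1, x2, y2, track_id]
```

The padding size trades off the overlap needed for association against the risk of merging nearby bees, so it must be tuned to the footage.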

3.2.2 Limitations of the tracking model

The recognition model did not recognize all the bees in the frames. The reason is the small size of the bees and the distance of the camera from them. In addition, during fast flight the bee sometimes appears smeared by motion blur, so that even the human eye finds it difficult to identify it and to differentiate it from dirt or the various staples visible on the hive in the video.

Another limitation relates to the speed of the bees' flight. Consistent tracking requires very fast and powerful processing. This research used a relatively strong computer, given budget limitations (Intel i7-10750H, 16 GB RAM, GeForce RTX 2070 GPU). However, it was not fast enough for real-time processing of high-quality Full HD (1920×1080) videos at 50 fps. We therefore adjusted the algorithm to support the weaker hardware while still allowing real-time analysis: every second frame was dropped from processing, so we handled only 25 fps. This, in turn, increases the distance between the same bee's locations in consecutive frames, which amplifies the problem described and solved in Sect. 3.2.1.
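A sketch of the frame-dropping adjustment (processing every second frame of the 50 fps source, for an effective 25 fps; the video name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("flying_bees.mp4")   # 50 fps Full HD source (assumed)
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % 2 == 0:
        continue        # drop every second frame: effective rate of 25 fps
    # ... detection and tracking on the remaining frames go here ...
cap.release()
```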

4 Discussion and conclusion

This article presents machine-learning-based systems for recognizing and tracking bees in both planar and spatial environments. These systems can be used in various behavioral studies of bees and other insects.

Besides developing a tool to study insect movement, the paper's main contribution is the description of the different phases and the overall architecture of the machine-learning-based systems. A dedicated tracking system was adapted and built for each configuration of a bee, walking or flying, because the bee's appearance differs between them. In that way, each system maximizes its performance, since it is tailored specifically to its assignment.

In the first system, the planar bee tracking system, the bees were recorded while walking in a two-dimensional maze. The maze videos are analyzed by creating training and testing databases, training the model, testing it, fine-tuning it, and eventually using the system to recognize the bee and analyze its path.

The second system, the spatial bee tracking system, was built to track bees in a spatial environment, such as when they are flying or walking in a closed three-dimensional environment. This system required building its own machine-learning model and combining it with tracking algorithms, along with a few modifications of ours to enable consistent tracking of the bees across the frames of the videos. The system uses flight videos and serves as an infrastructure for future experiments investigating bees' spatial behavior while flying. A comparison between the two systems appears in Table 2.

Table 2 Planar bee tracking system versus spatial bee tracking system

The two systems share common attributes and challenges. Both demonstrate the significant contribution that machine learning can make to insect research. In training the models, we used simple computer-vision algorithms to generate the needed datasets easily and quickly, based on distinctive characteristics that appeared in the videos. We showed that generating the datasets automatically rather than manually significantly reduces the resources required. The systems we developed allow cheap real-time tracking, which reduces the resources needed for experiments and allows accurate execution of experiments that need adjustments as they proceed. Furthermore, the entire process of recognizing and tracking the bee allowed for an accurate analysis of the duration, direction, and length of the bee's movement, providing insights into patterns, trends, and anomalies.

Manual tracking allows for the sampling of simple parameters, such as the times and durations of events or the number of errors the agent makes; more fine-grained characteristics that allow a better understanding of spatial behavior cannot be extracted by manual tracking. Additional advantages of automatic over manual tracking are the ability to monitor multiple bees simultaneously or over extended observation periods, integration with other technologies, such as sensors and environmental monitoring devices, and consistency in tracking and analyzing bee behavior. Automatic tracking avoids variations that may occur in manual tracking due to factors such as fatigue or subjective judgment.

Each system has its own challenges and limitations, which we resolved or acknowledged (see Sects. 2.3.1 and 3.2.1). This study gives a detailed description of the system construction process and the approaches used to address these setbacks, which can serve as a blueprint for researchers implementing similar systems or tackling similar challenges.

The systems described in this article were developed for experiments conducted to study the spatial cognition of bees, which were examined in laboratory conditions. To adapt the system to various applications in field conditions, one needs to handle the main challenge of identifying small flying bees in highly cluttered backgrounds (e.g., soil, rocks, plants). Although the system showed reasonable performance in a slightly noisy environment, more research is needed to improve the detection while also dealing with the high computational power required to handle fast cameras with high resolution. A further research extension could explore the flight patterns of bees, which would necessitate three-dimensional tracking. This process would involve the calibration of multiple cameras or integration of RFID tags into the experiment, along with adjustments to the models to accommodate these changes.

Recently, methods have been developed to employ unoccupied aerial vehicles (UAVs, a.k.a. drones) for remote sensing of vegetation in order to estimate biodiversity and bee diversity and abundance (Torresani et al. 2023, 2024), and even for directly detecting small moving objects, such as bees, from videos recorded by these UAVs (Stojnić et al. 2021). An exciting, challenging direction for future research would be to implement our system to track bees using UAVs.

Recognizing and tracking small, fast-moving objects are complex operations that machine-learning systems will have to handle in the coming years as part of the transition to a world of autonomous systems and a growing need to understand insect flight patterns and their implications. This work is meant as another milestone toward achieving this goal.