3.1. Overall Design of the Proposed System
This work aimed to improve the convenience and autonomy of BVI people in performing daily indoor activities. For this purpose, we propose a fire detection and notification system based on YOLOv4 and smart glasses, which captures images through a tiny camera and transmits them to a server equipped with an AI module that returns fire detection results as voice feedback. The proposed system uses deep CNNs to detect fire regions with high accuracy and a powerful processor to perform image processing in real time. Thus, we introduce a client–server architecture in which the smart glasses and a smartphone form the client, and an AI server performs the image processing tasks.
Figure 2 depicts the general architecture of the proposed system.
Section 3.3 gives a more detailed explanation. We added home security cameras to handle circumstances in which BVI people are not at home, are asleep, or are not wearing the smart glasses, so that early fire detection and alerting does not fail in these situations. In these circumstances, the AI server sends fire prediction results to the BVI user and to the fire department, as shown in Figure 2. If a fire is confirmed by the BVI user or the fire department, it can be suppressed by activating fire extinguishing devices. The client part consists of the smart glasses and a smartphone, which exchange data through Bluetooth, and a home security camera that records video continuously. Meanwhile, the AI server receives the images from the client, processes them, and returns the result in an audio format. The smart glasses receive the audio results and communicate them to the user via the built-in speaker or through the smartphone.
The client part of the system works as follows. Initially, the user connects the smart glasses to a smartphone through Bluetooth. Subsequently, the user can request the smart glasses to capture images, which are then sent to the smartphone. Capturing images on demand reduces the power consumption of the glasses and is more efficient than continuous video recording. The results from the AI server are then conveyed via earphones or speakers as voice feedback, and BVI users with tactile devices can also touch and feel the outline of salient objects. Despite the recent introduction of lightweight deep CNN models, we used an AI server to perform the deep-learning-based computer vision tasks, because the GPUs in wearable assistive devices have limited specifications compared with a powerful AI server. Because the smart glasses and smartphone are used only for capturing photos, this design also extends the battery life of these devices. Furthermore, the AI server makes it convenient to further improve the accuracy of the deep CNN models and to add new features. The AI server receives the images, applies the fire detection and object recognition models, and converts the results to speech using text-to-speech. The audio results are then delivered to the client as the AI server's response to the request, and the fire prediction results are sent to the fire department for confirmation.
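As a concrete illustration of this client–server exchange, the following is a minimal sketch of a server-side endpoint, assuming an HTTP/Flask-based interface; the endpoint name, the detect_fire helper, and the JSON response format are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of the AI-server request/response loop (hypothetical interface).
import cv2
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

def detect_fire(image):
    """Placeholder for the YOLOv4-based fire detector; would return a list of
    (label, confidence, bounding_box) tuples from model inference."""
    return []  # the real model would run inference here

@app.route("/predict", methods=["POST"])
def predict():
    # The client (smartphone) uploads an image captured by the smart glasses.
    file_bytes = np.frombuffer(request.files["image"].read(), np.uint8)
    frame = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)

    detections = detect_fire(frame)
    message = "fire detected" if detections else "no fire detected"

    # The text result is returned to the client, which converts it to speech
    # (alternatively, the server could synthesize the audio before responding).
    return jsonify({"message": message, "detections": len(detections)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```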
3.2. Indoor Fire Detection Dataset
The precision of a deep learning model depends primarily on the dataset used in the training and testing procedures. Our review of fire detection datasets showed that the datasets developed for vision-based fire detection systems are insufficient and that existing open-access datasets have several drawbacks. To address these issues, we created a fire image dataset for indoor fire scenes. First, we classified fires based on the material that forms the fuel source. We then researched which fires are the most common in indoor situations. To the best of our knowledge, class A and B fires represent the most common fuel sources in home fires, including wood, paper, cloth, rubber, trash, plastics, gas, and oil-based products. Finally, we gathered fire and non-fire images from various open-access sources such as Kaggle, GitHub, Google, and Flickr, selecting images depicting a range of conditions (shape, color, size, time of day, and indoor environment). Our fire image dataset consisted of 6000 indoor fire and non-fire images, as shown in Table 1.
A large amount of labeled training data is a key factor in the success of any deep learning model. However, it was challenging to obtain robust fire detection results in real-world scenarios using this dataset alone, which may be due to overfitting, underfitting, or class imbalance. An overfitted model cannot capture the underlying patterns in new images, and underfitting can result from a shortage of data; hence, we employed image data augmentation (modifying and reusing images) to improve the inference power of the model. Based on our review [40,41,42,43] and experiments [39,44], we found that augmentation techniques based on geometric transformations, such as flipping and rotation, were the most effective methods for our research. The size and resolution of the training images largely determine the power of a CNN model. Therefore, we increased the number of images in the fire detection dataset by rotating each original fire image by 60° and 120° and horizontally flipping each original and rotated image, as shown in Figure 3. In this way, we modified the existing training images to generalize them to different circumstances, allowing the model to learn from a wider range of situations. Manually flipping, rotating, and relabeling all of the images in the dataset is very time-consuming; to automate the image modification process, we developed software that flips and rotates images automatically using the OpenCV library.
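A minimal sketch of such an augmentation step is shown below, assuming OpenCV is available; the angles mirror the description above, but the function names and file layout are illustrative only.

```python
# Sketch: generate rotated (60 deg, 120 deg) and horizontally flipped copies of each image.
import os
import cv2

def rotate(image, angle_deg):
    """Rotate an image about its center, keeping the original canvas size."""
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))

def augment_folder(src_dir, dst_dir, angles=(60, 120)):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = cv2.imread(os.path.join(src_dir, name))
        if img is None:
            continue
        variants = {"orig": img}
        for a in angles:
            variants[f"rot{a}"] = rotate(img, a)
        # Horizontally flip the original and every rotated copy.
        for key in list(variants):
            variants[f"{key}_flip"] = cv2.flip(variants[key], 1)
        for key, out in variants.items():
            cv2.imwrite(os.path.join(dst_dir, f"{key}_{name}"), out)

# Example usage (paths are placeholders):
# augment_folder("dataset/train/fire", "dataset/train/fire_augmented")
```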
Flipping horizontally. The image is flipped along the vertical axis so that left and right exchange positions. This is expressed mathematically in Equation (1), where x and y are the horizontal and vertical coordinates of an original pixel, and I_x and I_y are the corresponding coordinates of the resulting pixel.
Flipping vertically. Similarly, the image is flipped about the horizontal axis, interchanging the top and bottom of the image, as shown in Equation (2).
Rotation of the image. The corresponding rotation transform is given in Equation (3).
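Since the original equations are not reproduced here, the following is a plausible reconstruction of Equations (1)–(3) using the standard forms of these transforms, where W and H denote the image width and height and θ is the rotation angle; the exact notation in the source may differ.

$$I_x = W - x, \qquad I_y = y \tag{1}$$

$$I_x = x, \qquad I_y = H - y \tag{2}$$

$$I_x = x\cos\theta - y\sin\theta, \qquad I_y = x\sin\theta + y\cos\theta \tag{3}$$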
The coordinates of the flame in an image naturally change when a labeled picture is rotated by a specific angle. To avoid labeling the images again manually, we developed software that reads all of the pictures in a folder, rotates them to the specified angles, and updates their labels accordingly. Using the LabelImg tool 1.8.0, we marked the location of the fires in each picture according to the YOLOv4 training annotation format. Each label is a text file that records the fire coordinates and is used as part of the CNN learning process. We also included non-fire and fire-like images in the training set to decrease false-positive detections.
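For the simple case of a horizontal flip, updating YOLO-format labels reduces to mirroring the normalized x-coordinate of each box center; the sketch below illustrates this, assuming standard YOLO annotation lines (class x_center y_center width height, all normalized). Rotations additionally require rotating the box corners and recomputing the enclosing box, which is omitted here.

```python
# Sketch: update YOLO-format labels for a horizontally flipped image.
def flip_labels_horizontally(label_path, out_path):
    updated = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue
            cls, x_c, y_c, w, h = parts[0], *map(float, parts[1:])
            x_c = 1.0 - x_c  # mirror the normalized x-coordinate of the box center
            updated.append(f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    with open(out_path, "w") as f:
        f.write("\n".join(updated))

# Example usage (paths are placeholders):
# flip_labels_horizontally("labels/fire_001.txt", "labels/fire_001_flip.txt")
```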
The 6000 fire detection images were divided into training and test sets, with 80 percent (4800 images) used for training. Applying the data augmentation methods to the training set only, we generated five augmented images for each original training image (Figure 3). As shown in Table 2, the total number of fire detection images thereby increased to 30,000.
3.3. Implementation of Fire Detection and Notification System
The overall design of the proposed system, comprising the client and AI server components, is explained in Section 3.1. In this section, we describe the deep-learning-based computer vision techniques that run on the AI server to achieve our goals.
Data preprocessing. As illustrated in Figure 4 (and explained in Section 3.2), we first collected 6000 images to create an indoor fire detection dataset. We then used this dataset, together with an improved deep CNN model, to increase the indoor fire detection accuracy. The YOLOv4 model is currently one of the most suitable deep CNN models for training with a custom image resolution; the input resolution for a YOLO model must be a multiple of 32. We therefore resized the original fire images in the dataset to a standard resolution of 416 × 416 pixels, because large input images slowed the training process and reduced the frames-per-second (fps) rate [43]. At the test stage, however, the performance of the trained model was evaluated at several image resolutions, including 416 × 416, 512 × 512, 608 × 608, 832 × 832, and 960 × 960 pixels.
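A short sketch of this resizing step is given below, assuming OpenCV; the helper that snaps a requested size to the nearest multiple of 32 is illustrative and not part of the original pipeline.

```python
# Sketch: resize dataset images to a YOLO-compatible resolution (multiple of 32).
import cv2

def nearest_multiple_of_32(value):
    """Snap a requested dimension to the nearest multiple of 32 (YOLO requirement)."""
    return max(32, int(round(value / 32)) * 32)

def resize_for_yolo(image, target=416):
    side = nearest_multiple_of_32(target)
    return cv2.resize(image, (side, side), interpolation=cv2.INTER_LINEAR)

# Example usage (path is a placeholder):
# img = cv2.imread("dataset/train/fire_001.jpg")
# img_416 = resize_for_yolo(img, target=416)
```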
We performed quantitative experiments by applying object detection evaluation metrics, including precision, recall, and average precision (AP), as in our previous research [18,44,45,46], and analyzed the results. Precision is the ability of a classifier to identify only the relevant objects, i.e., the proportion of detections that are true positives. Recall measures the ability of the model to identify all relevant cases; it is the proportion of true positives detected among all ground truths. A good model identifies most ground-truth objects (high recall) while identifying only the relevant objects (high precision); a perfect model has zero false negatives (recall = 1) and zero false positives (precision = 1). Precision and recall rates were obtained by comparing pixel-level ground-truth images with the results of the proposed method. In these calculations, TP denotes true positives (the number of correctly detected fires), FP denotes false positives (the number of background regions detected as fires), and FN denotes false negatives (the number of fires detected as background regions). We calculated the average precision (AP) as shown in Equation (6).
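The standard definitions of these metrics, which the text above describes (and which presumably correspond to Equations (4)–(6) in the original), can be reconstructed as follows; the AP formula assumes the usual area-under-the-precision–recall-curve definition.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$AP = \int_{0}^{1} p(r)\,dr \tag{6}$$

where p(r) denotes the precision at recall level r.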
Figure 5 shows that the performance of the model improved as the resolution of the test images increased, with the best performance obtained at 608 × 608 pixels.
Next, we tested the performance of the deep CNN model with the original 6000 images and then with the full augmented dataset. The performance of the deep CNN model was better with the full dataset than with the original dataset, as shown in Table 3.
After completing the training and testing procedures, we additionally tested 1876 daytime and night-time pictures that resembled fire scenes; the number of false positives on these images also helped to verify the performance of the trained weights. Sunlight is a common distraction for fire detection cameras because sunlight and artificial-lighting pixel values are very close to fire color intensities, even though they are not actual fires. We therefore included these 1876 non-fire and fire-like images, such as sunsets, sunrises, and lighting, in the training and testing steps of the model. Examples of non-fire images are shown in Figure 6.
Fire detection model. YOLO is a deep-learning-based object detection framework that has so far been released in five versions, YOLOv1 through YOLOv5. Almost all versions are used to detect objects, although they are not equally effective at detecting fires. In this study, we used YOLOv4, an extension of the YOLOv3 model, for the fire detection task. YOLOv4 is a real-time, high-precision, single-stage, regression-based object detection model presented in 2020. It incorporates the features of a series of YOLO detectors, such as a path aggregation network (PANet), the Mish activation function, spatial pyramid pooling (SPP), self-adversarial training, mosaic data augmentation, CmBN, and many other techniques, to significantly enhance detection precision. The model structure consists of three parts: feature extraction (CSPDarknet53), feature fusion or neck (PAN and SPP), and prediction (bounding box). We made some improvements to the original YOLOv4 network architecture to obtain robust indoor fire detection results. To reduce the running time and increase the robustness of the deep CNN model, the h-swish activation function was used to help prevent gradient explosion. Other parts of the deep CNN model were improved by adding a convolutional block attention module, as presented in [47]. We tested the performance of the proposed approach by experimenting with other versions of YOLO on the original fire dataset (6000 images) and compared the final precisions (Table 4).
Fire prediction. At the fire prediction stage, the smart glasses and security cameras record video and capture image frames, which are then sent to the AI server for processing. The AI server receives the image frames and resizes them to a resolution of 608 × 608 pixels. In indoor settings, the contrast of the images is usually low, owing to the lack of natural light and other external factors; therefore, we applied contrast enhancement methods to the input images to obtain the desired results. In a pixel transformation, the value of each output pixel depends only on the value of the corresponding input pixel, and brightness and contrast adjustments are common examples of pixel transformations that improve image quality.
In Equation (7), α > 0 and β are the gain and bias parameters, respectively, which control image contrast and brightness; I(x) represents a source pixel of the original image, and O(x) denotes the corresponding output pixel of the final image:

$$O(x) = \alpha I(x) + \beta \tag{7}$$

To make Equation (7) easier to apply pixel by pixel, we consider Equation (8):

$$O(i,j) = \alpha \cdot I(i,j) + \beta \tag{8}$$
where i and j denote the pixel in the i-th row and j-th column. By modifying the values of α (contrast, in the range [1, 2]) and β (brightness, in the range [10, 50]), we generated augmented images for the dataset. Brightness enhancement is one of the most effective approaches for image refinement during preprocessing. We experimentally tested these techniques in our previous research [39,44], using global color contrast enhancement [46,49] and combined local and global contrast enhancement approaches, as shown in Figure 7.
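A brief sketch of this linear brightness/contrast transform, assuming OpenCV and the α and β ranges mentioned above, is shown below; the specific parameter values are illustrative.

```python
# Sketch: apply the linear pixel transform O(i, j) = alpha * I(i, j) + beta.
import cv2

def enhance_contrast_brightness(image, alpha=1.5, beta=30):
    """alpha (gain) controls contrast, beta (bias) controls brightness.
    convertScaleAbs clips the result to the valid 8-bit range [0, 255]."""
    return cv2.convertScaleAbs(image, alpha=alpha, beta=beta)

# Example usage (path is a placeholder):
# frame = cv2.imread("frames/indoor_001.jpg")
# enhanced = enhance_contrast_brightness(frame, alpha=1.5, beta=30)
```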
In addition, we tested the performance of the proposed approach by experimenting with other versions of YOLO on the augmented fire dataset (30,000 images) and compared the final precision results.
Table 5 shows that the improved YOLOv4 model ranked highest in the training and testing stages, with 73.6% and 71.5% accuracy, respectively. In addition, the standard YOLOv4 achieved 72.8% (a difference of 0.8% from the improved YOLOv4 model) and was only marginally behind the improved model in terms of testing accuracy. In training, YOLOv4-tiny and YOLOv3-tiny reached accuracies of 51.5% and 43.9%, respectively. Owing to the larger number of dataset images, these models also took longer to train than in the previous experiments. Although its processing time was greater than that of the YOLOv4-tiny method, YOLOv4 was regarded as an efficient and strong fire detection model with the highest prediction accuracy. Using the data augmentation methods, we increased the training accuracy from 69.3% to 73.6% (by 4.3 percentage points) and the test accuracy from 67.9% to 71.5% (by 3.6 percentage points).
Although we obtained 71.5% accuracy with the test set, we further researched and assessed numerous recently presented methods to enhance this result. To the best of our knowledge, most proposed methods fail to detect small-sized fire regions [50]. Thus, we gathered small-sized fire images to enlarge our dataset and enhance the fire detection accuracy. Figure 8 shows some examples of small-sized fire images. As indicated in [44], we used a large-scale feature map to detect tiny moving objects and concatenated it with a feature map from earlier layers to maintain the fine-grained features. This large-scale feature map identifies small-sized fire pixels by combining the location information from earlier layers with the complex features from deeper layers. Using this approach, we improved the fire detection accuracy on the test set to 72.6%.
Fire notification. Once fire regions are detected, two different actions are triggered at the fire notification stage: (1) the AI server sends audio and text messages to the user's smartphone, and (2) the AI server sends the detected fire image to the fire department. Regarding the first action, by wearing the smart glasses, BVI people can monitor the surrounding situation and differentiate daily lifestyle fires from hazardous fires. If they confirm a hazardous fire, they can self-evacuate using fire, object, and text recognition [51] methods, together with object mapping methods that determine the relationships among different objects. The relationships among objects provide additional information using keywords such as "in", "on", "next to", "below", and "above", for instance, "fire above oven" or "fire next to chair". BVI people can hear voice guidelines and receive tactile information to assist indoor navigation from fire zones to safe zones. The most challenging relationships involve the keywords "on" and "in" because they rely on the interaction of one object's pixels with another object's top line. If two bounding boxes are within a defined range of pixels (pixel tolerance) of one another when tested horizontally, the "next to" relationship is assigned. Because separate objects might have appendages, the "below" and "above" relationships are specified using the mass boxes of the objects rather than their bounding boxes. The bounding box for a table, for example, contains the table's legs and is centered in the free area underneath the table, whereas the mass box is centered closer to the table's actual surface. The mass box for an object is computed by starting from the coordinates of the bounding box and scanning the image one axis at a time; at each iteration of the pixel movement, the total number of mask pixels remaining in the box is multiplied by the percentage of the mask pixels remaining in the box. The "below" and "above" keywords then examine how closely the pixels of the two objects are aligned with each other, using the center of each object's mass box.
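The following is a minimal sketch of how such spatial relationships could be inferred from detection outputs, using a pixel tolerance for "next to" and mass-box centers for "above"/"below"; the function names, the tolerance value, and the box representation are illustrative assumptions, not the paper's implementation.

```python
# Sketch: infer simple spatial relationships between a fire region and an object.
# Boxes are (x_min, y_min, x_max, y_max) in image pixel coordinates.

def horizontal_gap(box_a, box_b):
    """Horizontal distance between two boxes (0 if they overlap horizontally)."""
    if box_a[2] < box_b[0]:
        return box_b[0] - box_a[2]
    if box_b[2] < box_a[0]:
        return box_a[0] - box_b[2]
    return 0

def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def relationship(fire_box, object_mass_box, pixel_tolerance=40):
    """Return a coarse relationship keyword between a fire region and an object."""
    fx, fy = center(fire_box)
    ox, oy = center(object_mass_box)
    # "above"/"below": centers roughly aligned horizontally, compared vertically.
    if abs(fx - ox) <= pixel_tolerance:
        return "above" if fy < oy else "below"  # image y grows downward
    # "next to": boxes within the pixel tolerance when tested horizontally.
    if horizontal_gap(fire_box, object_mass_box) <= pixel_tolerance:
        return "next to"
    return "far from"

# Example: relationship((120, 40, 200, 110), (100, 150, 260, 300)) -> "above"
```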
Regarding the second action, the fire department receives a photo of the fire and can determine whether it is a daily lifestyle fire or a dangerous fire. If it is a potentially dangerous fire, the department can respond according to its standard procedures.