
Received May 9, 2022, accepted May 19, 2022, date of publication May 25, 2022, date of current version June 6, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3177743

Instruction Set Extension of a RiscV Based SoC for Driver Drowsiness Detection
SEYED KIAN MOUSAVIKIA 1, ERFAN GHOLIZADEHAZARI 2, MORTEZA MOUSAZADEH 1, AND SIDDIKA BERNA ORS YALCIN 2

1 Department of Electrical Engineering, Faculty of Electrical and Computer Engineering, Urmia University, Urmia 5756151818, Iran
2 Department of Electronics and Communication Engineering, Istanbul Technical University (ITU), Maslak, Istanbul 34469, Turkey
Corresponding author: Morteza Mousazadeh ([email protected])
This work was supported in part by the Scientific and Technological Research Council of Turkey (TUBITAK) and the Ministry of Science,
Research and Technology of Iran (MSRT) under Project 119N641, and in part by Urmia University and Istanbul Technical University with
the support of Tabriz University and the Center for International Scientific Cooperation of MSRT.

ABSTRACT This paper describes the design and implementation of a driver drowsiness detection (DDD) system using a modified RiscV processor on a field-programmable gate array (FPGA). To detect drowsiness, a convolutional neural network (CNN) is implemented on the RiscV processor. The CNN is trained to classify four primary driver expressions: distraction, normal, sleep, and yawn. The trained CNN reaches an accuracy of 81.07% on validation data. Because of FPGA memory limitations, the C code written for the trained CNN is optimized in several ways, including the use of dynamic fixed-point data types and dynamic memory allocation. In addition, the processor is modified by adding three custom instructions, namely custom store, conv2d(2 × 2), and multiply and accumulation (MAC), to enhance the computation rate. As a result, the processor with all three custom instructions achieves the best latency, with an improvement factor of 1.7 over the base processor and 1.25 over the processor with only custom store and MAC, in exchange for a slight increase in area.

INDEX TERMS Convolutional neural network, driver drowsiness detection, FPGA, hardware implementation, modified RiscV processor.

The associate editor coordinating the review of this manuscript and approving it for publication was Remigiusz Wisniewski.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

VOLUME 10, 2022
S. K. Mousavikia et al.: Instruction Set Extension of RiscV Based SoC for Driver Drowsiness Detection

I. INTRODUCTION
Driver fatigue and drowsiness are the leading cause of human casualties in traffic accidents. Moreover, property damage is another concern for governments. To minimize these casualties, automotive companies have spent a tremendous amount of time and money designing systems that detect drowsiness and sleep and then alert the driver [1]. These systems must detect drowsiness before it leads to an accident; consequently, the designed system must have high accuracy [2].

In recent years, neural networks, especially convolutional neural networks (CNNs), have come to satisfy many of these requirements. These networks have attracted wide attention because of their outstanding classification performance and considerable accuracy [3]. Furthermore, they can be implemented on field-programmable gate arrays (FPGAs) by utilizing an embedded processor [4]. The embedded processor receives the camera's input, processes it, and finally displays the network's output on very basic peripherals such as seven-segment displays or even light-emitting diodes (LEDs). In this paper, a CNN model is trained to detect driver drowsiness. Moreover, this trained model is implemented on the FPGA, which requires noticeable modification of the model parameters because of the shortage of resources in low-cost FPGAs. Even though this shortage of resources can be challenging, FPGAs are still a great choice for hardware designers. The designer can easily change the architecture of the designed system inside the FPGA by adding or removing some lines of code, which is not possible on other hardware such as GPUs. Moreover, application-specific integrated circuit (ASIC) implementation starting from the FPGA design is also possible.

Although a wide range of research considers drowsiness the main factor in car accidents, driver distraction also leads to accidents. Moreover, yawning
is a noticeable sign of sleepiness. Consequently, designing a system that can detect driver drowsiness is highly recommended. This paper aims to decrease car accidents by considering both driver drowsiness and distraction. The hardware architecture of the designed system is such that it can form a system on a chip (SoC) compatible with future ASIC implementation. Moreover, RiscV is chosen as the instruction set architecture (ISA) of the embedded processor [5]. Because RiscV is simple and completely open-source, the ISA can easily be modified to optimize the processor for any application. According to the research firm Semico, the number of chips that include at least some RISC-V technology will grow 73.6 percent per year to 2027, when there will be some 25 billion AI chips produced, accounting for US $291 billion in revenue [6]. In addition, utilizing an embedded processor makes the CNN implementation less challenging, because the designed CNN can be written in C/C++. Code in C/C++ can then be compiled into RiscV machine-level code with the reliable RiscV GNU toolchain found in [7]. Optimizations on the software side are done by managing memory efficiently, for instance through the compiler, appropriate linker files, and a reduced number of parameters. Likewise, optimizations of the written code are considered because of the shortage of memory resources on low-cost FPGAs. These optimizations reduce the size of the machine-level code.

The organization of the paper is as follows. Section II provides information regarding previous research in designing driver drowsiness detection systems and RiscV processors. Section III describes the software design of the CNN and the dataset utilized for system training and validation. Moreover, the model's modifications for software optimization of the implemented CNN are explained in this section. Section IV provides information concerning hardware implementation and optimizations. In Section V, the results and achievements of the paper are presented. Finally, Section VI concludes the paper.

II. RELATED WORK
As the number of cars and trucks increases, the rate of car accidents regrettably increases as well. In these accidents, driver drowsiness plays the most dominant role. Consequently, researchers in both industry and academia endeavor to reduce the rate of car accidents by designing systems to detect drowsiness.

Among the studies in the area of driver drowsiness detection, a significant number utilize biomedical methods. In [1], the authors provide valuable information about physiological signals that can reveal a driver's drowsiness level. The authors also measured drowsiness levels utilizing methods such as electroencephalography (EEG), electrocardiography (ECG), electrooculogram (EOG), and electromyogram (EMG). The measured accuracy of some methods, such as EEG and ECG, is above 95%. In [8], brain activity is recorded with an EEG headset, and accuracies of 95.8% and 93.8% were achieved with k-nearest neighbor and support vector machine (SVM) classifiers. In [9], driver drowsiness was detected from the ECG signal with the same classifiers as in [8], and above 90% accuracy was achieved for two-class classification only. In [10], the validity of detecting driver drowsiness by measuring the muscle activity associated with steering wheel grip using EMG was analyzed. The results indicated that the validity of the proposed algorithm needs to be tested more robustly with additional subjects and different experimental designs. In [11], five approaches to drowsiness detection, i.e., subjective reporting, driver biological features, driver physical features, vehicular features while driving, and hybrid features, are discussed and compared. The comparison indicates that all methods have pros and cons, and that even the methods that are very complex and need many sensors still need improvement. As a result, although the accuracy of these systems is considerably high in certain situations, the use of numerous sensors, sometimes simultaneously, and the increased system complexity are the main downsides of this approach. Moreover, these methods are not affordable, and the flexibility to modify these systems to add other options is poor.

Another approach to detecting drowsiness is image processing. These methods suit high-level software approaches, since they are hard to implement in hardware without the help of high-level design languages such as Matlab or Python. In [12], an intelligent surveillance system based on signal processing and embedded tools is designed with three interlinked modules: driver fatigue detection, alcohol content detection, and vehicular crash detection. Although the system is complete and practical, its brain is a Raspberry Pi microcomputer, which is not flexible and is far from optimized for such a large task. In [13], however, an FPGA-based system for drowsiness detection is proposed. The system utilizes two signs to detect the driver's drowsiness: closed eyes and an open mouth. Although the system is on an FPGA, which means it can be designed to be optimal in terms of power and delay and suitable for ASIC implementation, the use of high-level design tools, especially Matlab HDL Coder, reduces the hardware designer's control over the final architecture. In [14], a real-time system is proposed that uses a camera to automatically track and process the driver's eyes using Python, dlib, and OpenCV. Because it relies on Python and high-level libraries such as OpenCV and dlib, small bare-metal processors cannot host this method.

Another method that can be used for detecting drowsiness is neural networks, especially convolutional neural networks (CNNs). In [15], DriCare, a driver drowsiness detection method using face landmark detection with the help of a CNN, is proposed. Although the accuracy of DriCare is about 92%, the system is tested on an Intel Core i7 CPU, which is a supercomputer in comparison with small embedded processors. In [16], a real-time model based on deep neural networks on an embedded processor is proposed.


In that work, utilizing minimal facial landmarks, the accuracy is about 89.5% for three-class classification, and the detection speed is 14.9 frames per second (FPS), tested on a Jetson TK1. Although the network is modified for an embedded processor, no hardware optimization is proposed, since the hardware is fixed and not flexible. In [17], a similar approach using a multi-task CNN (MTCNN) is proposed. Eye and mouth characteristics are utilized for the driver's behavior model, and changes in these characteristics are used to monitor driver fatigue. Again, the accuracy is high (98%) on the proposed datasets, but hardware is not discussed.

In this paper, we decided to utilize neural networks to detect drowsiness, since image processing techniques are very heavy to implement on embedded processors. Also, biomedical techniques can result in an accurate system in some situations, but they are not affordable.

As a hardware-software codesign matter, the selected method must run on suitable hardware. In recent years, RiscV has gained popularity due to its open-source nature. For deep learning algorithms, however, the majority of research pairs a base CPU with an accelerator to decrease the latency of the implemented neural network. In [18], the authors proposed a RiscV-based hardware accelerator designed for the Yolo object detection system. The designed system executed Yolo in 400 ms. In [19], the authors compared two hardware accelerators, NVDLA and Gemmini, and showed that NVIDIA's NVDLA accelerator outperforms Gemmini by 3.77x when running ResNet-50 on an equivalent configuration with the same system setup. However, in [20], the author mentioned that a high-end FPGA is required to test NVIDIA's NVDLA accelerator. As a result, they proposed integrating NVDLA into a real RiscV SoC on the Amazon cloud FPGA using a tool named FireSim. Thus, although adding an accelerator to a base CPU is logical, the area added to the base processor requires higher-end FPGAs and, as a result, a bigger chip, all of which leads to higher costs. Moreover, the RiscV ISA is open, and the designer can add nearly as many new instructions as needed to customize the processor. In [21], a Laplacian filter SoC with a RiscV processor over the Wishbone protocol is implemented. There, the hardware implementation with a single added DSP block was 36 times better than the software implementation of the Laplacian filter. Consequently, we tried to implement a CNN that is very efficient in its number of parameters on a modified RiscV CPU.

III. DATASET AND CNN MODEL
In this section, we present the dataset and our designed CNN models. The dataset is described in III-A, and the CNN models are described in III-B. In this paper, the classified images depict various driver behaviors. Here, driver yawning, sleep, and distraction are considered as criteria that increase the probability of car accidents.

A. DATASET
In this paper, we consider several criteria that lead to car accidents. While driver drowsiness is pivotal, considering driver distraction is essential as well. Consequently, we designed the system to detect yawning and sleeping as signs of drowsiness, and driver distraction is considered as well. Despite numerous methods based on image processing and machine vision, a dataset that addresses our requirements is not available. Therefore, in this paper, frames of the videos from reference [22] are extracted to design an accurate system. The frames are extracted at a rate of 1 frame per second and then labeled in four distinct classes: normal, distraction, yawn, and sleep. After the initial design with the extracted dataset, several images were added to the dataset to increase the system's accuracy. We consider a scenario where the camera is fixed on the car's mirror. Fig. 1 (a) to (d) illustrate samples of the distinct classes normal, distraction, sleep, and yawn, respectively.

For these classes, to improve the system accuracy and sensitivity, the extracted frames are augmented by adding Gaussian noise to the images, changing the brightness of the images, translation, and rotation, as can be seen in Fig. 1 (e) to (h). It must be noted that the quality of the received images might be affected by driving through tunnels, on bumpy roads, and the like. Consequently, we consider Gaussian noise with several values of σ: 0.01, 0.02, and 0.04. Noise is considered because camera quality varies, which changes the quality of the input images. Furthermore, since cars are driven at different times of day, the light varies considerably. For this reason, to improve the system sensitivity, the effect of light is considered. It must also be noted that, because cars move in various directions and over hills and valleys on the roads, the driver's location and position in the camera frames change conspicuously. Without accounting for this, the system cannot detect the correct position of the driver. For this reason, we consider rotation and translation of the images. Overall, with these data augmentation criteria, the numbers of samples for normal, distraction, yawn, and sleep are 5069, 5063, 5062, and 5109, respectively.

B. CNN MODEL
Driver drowsiness detection systems have been used widely due to their profound effects on society. In this paper, the system is implemented on both sides, software and hardware. It must be noted that, when designing the CNN, some limitations related to the shortage of hardware memory must be considered. Furthermore, while the number of CNN parameters has a profound effect on the system accuracy, the driver drowsiness detection system must be real-time, so the system must be lightweight. The implemented CNN for this system is presented in Model Summary 1, which has 3588 parameters:


Model Summary 1: Implemented CNN Structure
Conv2D(4, (2,2), input_shape = (100,100,1), activation = 'relu')
MaxPooling2D(pool_size = (2,2))
Conv2D(4, (2,2), activation = 'relu')
MaxPooling2D(pool_size = (2,2))
Conv2D(4, (2,2), activation = 'relu')
MaxPooling2D(pool_size = (2,2))
Conv2D(4, (2,2), activation = 'relu')
MaxPooling2D(pool_size = (2,2))
Flatten()
Dense(32, activation = 'relu')
Dense(classes, activation = 'softmax')

FIGURE 1. Extracted frames of driver expressions (a) Normal, (b) Distracted, (c) Sleep, (d) Yawn, (e) Gaussian Noise, (f) Brighter, (g) Translation, (h) Rotation.

As illustrated by Model Summary 1, the system is extremely lightweight and has a low number of parameters. The input image size is 100 × 100, and the images for training and testing are grayscale. It must be noted that the CNN model is designed based on the hardware limitations. When implementing the model on hardware, the input image size, together with the number and size of the filters in the first convolutional layers, has a significant effect on the area of the implemented model. Therefore, memory usage is kept to a minimum because of the lack of memory on low-cost FPGAs. The activation function of all convolution layers is the rectified linear unit (ReLU). In the max-pooling layers, the stride is two and no padding is added. There are also two fully connected layers at the end of the model, the first with ReLU activation and the second with SoftMax activation over the four classes. After training the model on the described dataset, the achieved accuracy on validation data is 81.07%. Fig. 2 (c) and Fig. 2 (f) depict the system accuracy and loss against the number of epochs for this CNN model.

FIGURE 2. (a): CNN I Accuracy, (b): CNN II Accuracy, (c): Implemented CNN Accuracy, (d): CNN I Loss, (e): CNN II Loss, (f): Implemented CNN Loss.

In this paper, due to the hardware limitations, the designed CNN is compact. The implemented CNN's layers are listed in Model Summary 1. In a driver drowsiness detection system, however, the accuracy must be as high as possible. For this reason, two additional models, CNN I and CNN II, are designed to obtain a system with high accuracy. These models are implementable like Model Summary 1, but they require more memory. The CNN II model is depicted in Fig. 3. In CNN I, the first two convolution layers have three channels, as in CNN II. However, the remaining convolution layers in CNN I have 4, 8, 16, and 32 channels. After every two convolution layers, a max-pooling layer is placed. Additionally, the first dense layer has 128 neurons. The input image size in CNN I and CNN II is 160 × 120, and the numbers of parameters in CNN I and CNN II are 21,282 and 45,546, respectively. The accuracy and loss curves of CNN I and CNN II are depicted in Fig. 2.

To evaluate the proposed models, we trained several well-known models with our dataset, including MobileNet V2, VGG-16, and Inception. While the network of Model Summary 1 was trained with grayscale images, this time the networks are trained with an RGB version of our dataset. In Table 1, the accuracy and number of parameters of CNN II are compared with these well-known state-of-the-art networks. The reason for training with RGB data is that MobileNet V2, VGG-16, and Inception are pre-trained on the ImageNet dataset, which contains RGB data, so their weights are extracted for RGB images. To have a fair comparison between CNN II and the pre-trained models, CNN II is trained with RGB data as well. While the accuracy of MobileNet is significantly higher than that of CNN II, the accuracy of VGG-16 is approximately the same as that of CNN II trained with RGB data. Moreover, the accuracy of Inception is lower than that of CNN II. It must be noted that the number of parameters in these pre-trained models is far larger than in CNN II. These results indicate that although CNN I and CNN II have more parameters than Model Summary 1, they are a much better choice than other networks like VGG-16. Furthermore, if a bigger FPGA is available, CNN I and CNN II are better candidates than networks like VGG-16, Inception, and even MobileNet V2.
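The parameter count quoted for Model Summary 1 can be checked by hand. The following is a minimal sketch, assuming Keras-style defaults that the text does not state for the convolutions (valid padding, stride 1) and four output classes. The helper names `conv_params`, `dense_params`, and `model_summary_1` are hypothetical and do not come from the paper.

```python
# Sketch: count the trainable parameters of Model Summary 1 by hand.
# Assumptions (not stated in the paper): 'valid' convolutions with stride 1,
# 2x2 max-pooling with stride 2 and no padding, and classes = 4.

def conv_params(in_ch, out_ch, k):
    """Weights (k*k*in_ch per filter) plus one bias per filter."""
    return out_ch * (k * k * in_ch) + out_ch


def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output neuron."""
    return n_in * n_out + n_out


def model_summary_1(size=100, in_ch=1, filters=4, k=2, hidden=32, classes=4):
    """Return (flattened feature count, total parameter count)."""
    total = 0
    ch = in_ch
    for _ in range(4):                 # four Conv2D + MaxPooling2D pairs
        total += conv_params(ch, filters, k)
        size = size - k + 1            # valid convolution shrinks the map
        size = size // 2               # 2x2 pooling with stride 2
        ch = filters
    flat = size * size * ch            # Flatten()
    total += dense_params(flat, hidden)
    total += dense_params(hidden, classes)
    return flat, total


flat, total = model_summary_1()
print(flat, total)   # 100 3588
```

Under these assumptions, most of the 3588 parameters (3232) sit in the first dense layer, which suggests why the flattened feature map is kept as small as 5 × 5 × 4.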
All of the above models were trained with the Python library Keras, using the Adam optimizer and a cross-entropy loss function. Furthermore, we extracted the weights and biases of Model Summary 1 for the C/C++ implementation. In addition to designing the model with a low number of parameters, the format of the weights and biases is modified as well. The reasons for and results of these modifications are explained in subsection III-C.

FIGURE 3. The structure design of the improved CNN II model.

C. SOFTWARE OPTIMIZATION
This subsection covers the optimizations considered in writing the C code to achieve a minimum code size for the implemented CNN. As mentioned in subsection III-B, the lack of memory in low-cost FPGA models leads to selecting a CNN model with a low number of weights and biases. Even with a lightweight model, the memory issue requires further consideration. For some systems, especially memory-hungry ones, the memory unit on FPGAs might be insufficient. For this reason, to decrease the memory block size needed on the FPGA, all of the weights and biases are converted into fixed-point numbers. For the conversion from floating-point to fixed-point, a Python package named fxpmath is utilized [23]. For a better understanding of the fixed-point configuration in our design, Fig. 4 shows our fixed-point configuration and compares it with the well-known IEEE 754 standard for floating-point numbers.
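To make the float-to-fixed-point conversion concrete, here is a minimal pure-Python sketch of quantizing a value into a signed 8-bit word. It does not use fxpmath. The function names `to_fixed` and `to_float`, the round-to-nearest rule, and the saturation on overflow are assumptions for illustration, since the paper does not state which modes were used. The fraction width of 6 in the example is the value the paper reports in Table 2.

```python
# Sketch: signed fixed-point quantization with a configurable fraction width.
# Assumptions: round-to-nearest and saturation on overflow.

def to_fixed(x, frac_bits, word_bits=8):
    """Return the integer code of x in a signed word with frac_bits fraction bits."""
    lo = -(1 << (word_bits - 1))        # most negative code, -128 for 8 bits
    hi = (1 << (word_bits - 1)) - 1     # most positive code, 127 for 8 bits
    code = int(round(x * (1 << frac_bits)))
    return max(lo, min(hi, code))       # saturate instead of wrapping around


def to_float(code, frac_bits):
    """Convert an integer code back to the real value it represents."""
    return code / (1 << frac_bits)


w = 0.7312
code = to_fixed(w, frac_bits=6)         # 1 sign + 1 integer + 6 fraction bits
print(code, to_float(code, 6))          # 47 0.734375
```

With 6 fraction bits the step size is 1/64, so a weight is stored with an error of at most 1/128 as long as it lies inside the representable range of roughly -2 to 2.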


TABLE 1. CNN II accuracy and well-known models accuracy.

FIGURE 4. Configuration of utilized fixed-point number.

Unlike hardware description languages such as Verilog or Very High-Speed Integrated Circuit Hardware Description Language (VHDL), the C and C++ programming languages offer only 8-bit, 16-bit, 32-bit, and 64-bit data types, so there are not many options for representing the network parameters. As a result, 8 bits are used for network parameters and 16 bits for intermediate variables, which reduces the compiled code size compared to int (32 bits) and double (64 bits) in C, respectively. Nevertheless, this modification requires care to prevent a considerable drop in the system accuracy. Using 32-bit float data types for the computation and the representation of weights and biases undeniably leads to the same accuracy as achieved in the training phase; in our case, by comparison, the number of bits for storing the network parameters and for computation is decreased. Therefore, the system accuracy is recalculated utilizing fxpmath. We first tested the system accuracy over a wide range of fraction bits (1 to 15 bits), changing only the network parameters. Fig. 5 shows the system accuracy over this range of fraction bits. Note that the integer part of the number is set to 8 bits. This test demonstrates that, for the network parameters, the number of fraction bits (F bits) matters most to the system's accuracy.

FIGURE 5. Accuracy evaluation of implemented CNN (Integer part is equal 8 bits including 1 bit for sign).

As mentioned, data types in C/C++ are limited. As a result, a second test was carried out to find the best configuration for an 8-bit fixed-point representation. In this test, the whole number is 8 bits (S + I + F), and the fraction part is changed from 0 to 8 bits to find the best configuration for the network parameters. Table 2 indicates that 6 bits for the fraction part and 1 bit for the integer part, plus 1 bit for the sign (I = 2 including the sign, F = 6), result in 73.7% accuracy, which is the closest to the 81.07% validation accuracy achieved with float data types.

TABLE 2. Accuracy evaluation of implemented CNN with fixed-point parameters (whole number is 8 bits).

For other variables, such as the input of the network and the intermediate variables, a fixed-point configuration must be decided as well. Since the input of the CNN is normalized between 0 and 1 in the training phase, all 8 bits can be dedicated to the fraction part. However, the utilized camera gives RGB565 output (5 bits for red, 6 bits for green, and 5 bits for blue). To convert the input picture to grayscale, all three channels are added, which leads to a maximum of 7 bits (31 + 63 + 31 = 125 < 128). Consequently, the input layer configuration is selected to be 7 bits for the fraction part, leaving 1 bit for the sign, which is always 0. For intermediate variables, multiplying two 8-bit numbers gives a 16-bit result. These 16-bit results are reduced to 8 bits after all the calculations of the layer are done. With this method, both the precision requirement and the memory shortage can be satisfied. However, when reducing 16 bits to 8 bits, the amount of right shift decides the configuration of the 8-bit number. This configuration must change throughout the network: in the first layers the numbers are small and close together, so more fraction bits are needed, but in the final layers the numbers grow and spread out, so more bits for the integer part are needed. Fig. 6 supports this idea and depicts the output range of each layer of our implemented CNN over all of the pictures in our dataset. This decision results in a dynamic fixed-point configuration. Table 3 lists the configuration of each layer's output, which is deduced from Fig. 6. Note that all weights have the configuration of 1 bit for the sign, 1 bit for the integer part, and 6 bits for the fraction part, resulting from Table 2.
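The 7-bit bound on the grayscale input can be checked with a short sketch. It assumes the common RGB565 packing with red in the top five bits of a 16-bit pixel. That bit order and the function name `rgb565_to_gray7` are assumptions, since the paper does not specify the camera's layout.

```python
# Sketch: unweighted grayscale from an RGB565 pixel, as a plain channel sum.
# Assumption: bits 15..11 are red, 10..5 are green, 4..0 are blue.

def rgb565_to_gray7(pixel):
    """Sum the three channels of a 16-bit RGB565 pixel."""
    r = (pixel >> 11) & 0x1F    # 5 bits, 0..31
    g = (pixel >> 5) & 0x3F     # 6 bits, 0..63
    b = pixel & 0x1F            # 5 bits, 0..31
    return r + g + b            # at most 31 + 63 + 31 = 125


print(rgb565_to_gray7(0xFFFF))  # 125
print(rgb565_to_gray7(0x0000))  # 0
```

Because the largest possible sum is 125, the result always fits in 7 bits and the sign bit of an 8-bit word stays 0, which matches the input configuration described above.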

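The relation between the right shift and the resulting 8-bit format can be sketched as follows. Multiplying a value with a given number of fraction bits by a weight with its own fraction bits yields a product whose fraction width is the sum of the two, so the shift amount fixes how many fraction bits survive. The function name `requantize`, the saturation step, and the example formats are assumptions for illustration and are not the per-layer settings of Table 3.

```python
# Sketch: reduce a wide accumulator to 8 bits with a chosen fraction width.
# Assumptions: arithmetic right shift (floor) and saturation on overflow.

def requantize(acc, in_frac, w_frac, out_frac, out_bits=8):
    """Shift a product with in_frac + w_frac fraction bits down to out_frac."""
    shift = in_frac + w_frac - out_frac     # number of fraction bits to drop
    if shift < 0:
        raise ValueError("output cannot keep more fraction bits than the product")
    val = acc >> shift                      # Python's >> floors for negatives too
    lo = -(1 << (out_bits - 1))
    hi = (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, val))            # saturate to the signed 8-bit range


x = 64            # 0.5 with 7 fraction bits  (0.5 * 2**7)
w = 48            # 0.75 with 6 fraction bits (0.75 * 2**6)
acc = x * w       # 3072, carrying 13 fraction bits
y = requantize(acc, in_frac=7, w_frac=6, out_frac=5)
print(y, y / 2**5)   # 12 0.375
```

Choosing a smaller `out_frac` for later layers trades resolution for range, which is the idea behind changing the configuration from layer to layer.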

FIGURE 6. Output histogram of (a) First, (b) Second, (c) Third, and (d) Fourth convolution layer, (e) First dense layer.

TABLE 3. Fixed-point configuration of each layer's output.

Finally, for the configuration of the last dense layers, we decided to keep a 16-bit representation to better differentiate between the four classes of the network.

For the second optimization, because the number of intermediate variables increases after each convolution layer, we utilized dynamic memory allocation (DMA) functions, especially malloc() and free(), in the code. With these functions, the heap memory of the processor is allocated to and deallocated from the variables dynamically. For example, after the first layer's output is calculated, there is no need for the first layer's input to occupy memory anymore. The occupied memory is therefore released with the free() function and reused for the following layer's variables. However, due to the architecture of the RiscV compiler, dynamic memory allocation functions in C or C++ cannot be used unless the target embedded processor has an operating system. The libgloss library contains these functions, but it applies to processors with an operating system. In this case, we use a bare-metal processor on an FPGA, so the use of libgloss must be avoided in the Makefile of the compiler. Instead, a specific version of the _sbrk() function for the bare-metal compiler is used to implement dynamic memory allocation. Fig. 7 provides a block diagram that indicates how to compile C code that contains DMA functions for bare-metal CPUs. Regarding Fig. 7, we should also mention that _sbrk() must be added at the top of the written C code, before other functions, to allow the compiler to utilize the DMA functions. Moreover, we endeavored to use the points stated in [24] to optimize the written Makefile and linker file for the compiler.

Alongside the memory-usage optimizations mentioned above, we optimized the code in terms of run time and latency with the help of custom instructions. Overall, the convolution function is the most frequently called function in this code. Algorithm 2 gives the pseudo-code of a standard convolution, which shows that the multiply and accumulation (MAC) operation happens frequently inside the for loops. For this reason, it was convenient to add the MAC operation, or even the whole convolution operation, to the hardware. Besides, the MAC operation can also be used for the two fully connected layers of our implemented CNN to reduce the run time of the code even more. In Section IV, the hardware optimizations are explained by considering the process of adding a custom store as custom0, the conv2d(2 × 2) operation as custom1, and MAC as custom2 to the arithmetic logic unit (ALU) and the processor decoder. On the C code side, Fig. 8 displays a block diagram of how to modify the C code and the compiler to make use of the custom instructions. It must be noted that for the conv2d(2 × 2) operation as custom1, given the 8-bit representation and the 2 × 2 size of the convolution filters, only the first operand of the 32-bit ALU is


Algorithm 2: Normal Convolution

for i from 0 to (InputFeatureMapRowSize - 1)
    for j from 0 to (InputFeatureMapColumnSize - 1)
        oc[i][j] = 0
        for k from 0 to (WeightMatrixRowSize - 1)
            for l from 0 to (WeightMatrixColumnSize - 1)
                oc[i][j] = oc[i][j] + w[k][l] * in[i+k][j+l]
            end for
        end for
    end for
end for
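The loop nest of Algorithm 2 can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function name, the row-major flat arrays, and the 8-bit inputs with a 32-bit accumulator are assumptions. It also assumes a "valid" convolution, so the output is limited to positions where the whole filter window lies inside the input, which keeps the in[i+k][j+l] access in bounds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: valid (no padding) 2-D convolution in the style of
 * Algorithm 2. All matrices are row-major:
 *   in : in_rows x in_cols
 *   w  : k_rows  x k_cols
 *   oc : (in_rows - k_rows + 1) x (in_cols - k_cols + 1)               */
void conv2d_valid(const int8_t *in, int in_rows, int in_cols,
                  const int8_t *w, int k_rows, int k_cols,
                  int32_t *oc)
{
    assert(k_rows > 0 && k_cols > 0);
    assert(k_rows <= in_rows && k_cols <= in_cols);

    const int out_rows = in_rows - k_rows + 1;
    const int out_cols = in_cols - k_cols + 1;

    for (int i = 0; i < out_rows; i++) {
        for (int j = 0; j < out_cols; j++) {
            int32_t acc = 0;
            for (int k = 0; k < k_rows; k++) {
                for (int l = 0; l < k_cols; l++) {
                    /* One multiply and accumulation (MAC) per iteration */
                    acc += (int32_t)w[k * k_cols + l] *
                           (int32_t)in[(i + k) * in_cols + (j + l)];
                }
            }
            oc[i * out_cols + j] = acc;
        }
    }
}
```

The innermost statement is the MAC operation that the text proposes moving into hardware; with a 2 × 2 filter, the two inner loops perform exactly four MACs per output element.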
FIGURE 7. Compiling codes containing DMA functions for bare-metal CPUs.
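To make the bare-metal allocation idea behind Fig. 7 more concrete, the sketch below shows a bump-pointer "program break" function in the spirit of sbrk(), which is the kind of low-level hook a C library's malloc() typically relies on when no operating system is present. It is an assumption-laden illustration rather than the authors' implementation: the name sketch_sbrk, the static array used as heap, and the 4096-byte size are all hypothetical, and a real port would normally take the heap bounds from the linker script instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HEAP_BYTES 4096u   /* hypothetical heap size */

static uint8_t sketch_heap[SKETCH_HEAP_BYTES]; /* stands in for linker-defined heap */
static size_t  sketch_brk = 0;                 /* offset of the current break */

/* Moves the break by incr bytes and returns the previous break,
 * or (void *)-1 if the request does not fit in the heap. */
void *sketch_sbrk(ptrdiff_t incr)
{
    assert(sketch_brk <= SKETCH_HEAP_BYTES);

    if (incr < 0) {
        /* Magnitude computed in unsigned arithmetic to avoid overflow */
        size_t mag = (size_t)0 - (size_t)incr;
        if (mag > sketch_brk) {
            return (void *)-1;
        }
        void *old = &sketch_heap[sketch_brk];
        sketch_brk -= mag;
        return old;
    }

    if ((size_t)incr > SKETCH_HEAP_BYTES - sketch_brk) {
        return (void *)-1;   /* out of heap */
    }
    void *old = &sketch_heap[sketch_brk];
    sketch_brk += (size_t)incr;
    return old;
}
```

With such a hook in place, releasing one layer's buffers and reusing the space for the next layer, as described for the software optimization, needs no operating system support.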

occupied. The second, unused operand can be utilized if a 16-bit representation is used.

FIGURE 8. Compiling codes containing DMA and custom instruction functions for bare-metal CPUs.

Regarding the SoftMax activation function used in the last layer of almost all CNNs, instead of evaluating (1), where i is the target class, the code compares the outputs of the final fully connected layer and picks the index of the largest value as the predicted class of the input picture. Because the exponential is monotonically increasing, the largest input also yields the largest SoftMax output, so the predicted class is unchanged. This approach reduces run time and resource utilization, since executing simple if statements is much cheaper for the processor than evaluating an expression like (1).

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=0}^{\#\mathrm{classes}} e^{x_j}} \tag{1}$$

Finally, to better deduce the driver’s primary expression and to suppress unnecessary alarms, the system does not classify a single picture. Instead, depending on the frame rate, it classifies frames for 15 seconds and then deduces the driver’s expression from the class that was selected most often. This method is used in all of our implementations.

IV. HARDWARE IMPLEMENTATION AND OPTIMIZATION
This section describes the hardware implementation and optimizations of the trained CNN. As for the embedded processor, we utilized the Ibex core [25]. Ibex is an RV32IMC core written in SystemVerilog with two pipeline stages, as shown in Fig. 9. The two stages of the Ibex core are instruction fetch (IF) and a second stage in which the instruction decoder (ID) and execution block (EX) are merged together [25]. First, instructions are fetched into a prefetch buffer. Then the instruction fetch controller supplies new instructions and their program counter to the ID-EX stage. After that, the ID-EX stage takes instructions and data from the IF stage, decodes them, and executes the instructions. This stage is made up of multiple sub-blocks such as the controller, decoder, register file, arithmetic logic unit (ALU), multiplier/divider block (MULT/DIV), control and status register block (CSR), and load-store unit (LSU), as shown in Fig. 9.

FIGURE 9. Block diagram of ibex core.

We utilized the Ibex core as the central processor because of its simplicity and easy-to-understand architecture. In addition, the Ibex core is small enough to fit on nearly any low-cost FPGA. Other, more complicated RiscV cores such as Berkeley’s Rocket or BOOM, found in the Chipyard repository [26], can also be tested in future designs.

Additionally, the two remaining building blocks of our design are a camera module and a monitor. These modules communicate with Ibex through the designed Wishbone B4 interface [27]. Wishbone was chosen over other interfaces like AXI and AMBA [28] because it is simple and open source, like Ibex and the RiscV ISA. Moreover, in our designed Wishbone protocol, all masters and slaves communicate through a shared bus called the interconnect, with an arbiter register controlling the masters’ priority. Furthermore, slaves are selected according to their base addresses, which the masters apply. Subsequently, first,


FIGURE 12. R-type instruction format.
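The R-type layout of Fig. 12 can be illustrated with a small encoder. The field positions (funct7, rs2, rs1, funct3, rd, opcode) follow the standard RISC-V R-type format. The three opcode constants are the major opcodes that the RISC-V base opcode map reserves for custom extensions; whether the paper's custom0, custom1, and custom2 instructions map onto exactly these opcodes, and which funct3/funct7 values they use, is not stated in the text and is an assumption here. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Major opcodes reserved for custom extensions in the RISC-V opcode map.
 * Their use for the paper's three instructions is an assumption. */
#define OPCODE_CUSTOM0 0x0Bu  /* 0001011 */
#define OPCODE_CUSTOM1 0x2Bu  /* 0101011 */
#define OPCODE_CUSTOM2 0x5Bu  /* 1011011 */

/* Packs the six R-type fields into one 32-bit instruction word:
 * funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
uint32_t encode_rtype(uint32_t opcode, uint32_t rd, uint32_t funct3,
                      uint32_t rs1, uint32_t rs2, uint32_t funct7)
{
    assert(opcode < 128u && funct7 < 128u && funct3 < 8u);
    assert(rd < 32u && rs1 < 32u && rs2 < 32u);

    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}
```

As a sanity check, encoding the standard instruction add x1, x2, x3 (opcode 0x33, funct3 = 0, funct7 = 0) with this function yields 0x003100B3. A custom instruction differs only in the opcode and function fields, which is why the decoder can select the new operations from funct7 and funct3.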

the camera IP becomes active and populates the RAM with pixels. Next, the Ibex core’s instruction and data memory become active, and the processor runs the written neural network code on the image. Finally, the chosen class is displayed on the LEDs. The overall structure of our system with the added custom instruction blocks is depicted in Fig. 13. In this figure, the VGA block is optional and can be omitted without affecting the system. Also, a built-in timer inside the processor is used in evaluation mode to measure the processing time in the hardware implementation.

As mentioned, the most called function in our C code is the convolution function. In this function, the most frequent operation is MAC. Besides the convolution function, the MAC operation is also used in the fully connected layers. The MAC operation in our C code happens inside for loops, as noted in Algorithm 2. As a result, one MAC operation can be added to the processor (Fig. 10). Moreover, for the convolution function, since 2 × 2 convolutions are used in our design, four MAC operations can be combined and operated in parallel lanes to form one complete convolution (Fig. 11). This instruction is called conv2d(2 × 2) since we have two-dimensional convolutions with 2 × 2 filters in our implemented CNN.

FIGURE 10. One MAC block.

FIGURE 11. Conv2d(2 × 2).

In this way, each time the data iterates through the for loop, we gain a speedup relative to the method we use, and this speedup is escalated each time the code calls a convolution function. However, the only downside of adding the mentioned custom instructions is the overhead of resource utilization in the FPGA. Fortunately, this increase is negligible in both cases, as we discuss with the results of adding these instructions to the hardware in the following section. As a result, three custom instructions, one for storing kernel weights, one for the MAC operation that can be used in convolution and fully connected layers, and one for the whole convolution operation, are added to the decoder and ALU stage of the Ibex processor. Furthermore, these instructions read the rs1 and rs2 registers as source operands and write the result back into the rd register, like other R-type instructions in the RiscV ISA [5]. Moreover, the operation is selected by funct7 and funct3, as shown in Fig. 12.

It is essential to note that the MAC and convolution operations both follow a preceding custom store instruction, because they operate on the previously stored data. Besides adding these instructions to the hardware of the core, the RiscV GNU toolchain must be recompiled so that it recognizes the newly added custom instruction opcodes.

The other optimization we considered for lowering memory size is to match the size of the picture received from the camera to the size the CNN is trained for. In camera modules, the usual output picture size is VGA (640 × 480), but its fractions such as QVGA (320 × 240) and QQVGA (160 × 120) are also available. Alternatively, the image size can be adapted inside the C code by either cropping the image or manually resizing it. Since cropping can reduce accuracy by discarding essential parts of the picture, resizing is preferred. In our design, the first resizing is handled in hardware by taking one of every two pixels coming from the camera: the camera outputs standard QVGA (320 × 240), and its images are reduced to QQVGA (160 × 120). This hardware step resulted in a 14% reduction in BRAM usage on the utilized FPGA board with no added code in the software. The second conversion, cropping, happens inside the C code and converts QQVGA to 100 × 100, the input size of the implemented CNN. This step is not required if the CNN’s input is a fraction of VGA, as for the CNN1 and CNN2 networks described in III-B.

A. RASPBERRY PI IMPLEMENTATION
This subsection covers the Raspberry Pi implementation of the designed CNN. Since implementing the designed CNNs on FPGA is very time-consuming, it is recommended to implement them on a Raspberry Pi first to test the system’s accuracy. Fig. 14 depicts our driver drowsiness detection system utilizing a Raspberry Pi board. In this system, four


TABLE 4. Comparison between custom instructions.
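As a quick check on the improvement factors, they follow directly from the latencies reported in Section V for one example picture (390 ms for the base Ibex, 289 ms with the custom store and MAC, and 231 ms with the custom store, conv2d(2 × 2), and MAC):

$$\frac{390\ \text{ms}}{289\ \text{ms}} \approx 1.35, \qquad \frac{390\ \text{ms}}{231\ \text{ms}} \approx 1.69 \approx 1.7$$

Both values agree with the factors of 1.35 and 1.7 quoted in the conclusion.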

TABLE 5. Board utilization comparison between processors.

TABLE 6. Raspberry Pi and FPGA implementation comparison.

FIGURE 15. Designed system with FPGA (Nexys 4 DDR) board.

FIGURE 13. Overall system.
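The decision step at the end of the pipeline in Fig. 13, as described in Section III, replaces SoftMax with a comparison of the final layer's outputs and then takes the most frequent class over a 15-second window of frames. The sketch below illustrates both steps in C. The function names, the class numbering, the 32-bit score type, and the tie-breaking rule (lowest class index wins) are assumptions made for illustration and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CLASSES 4  /* distraction, natural, sleep, yawning (order assumed) */

/* Returns the index of the largest final-layer output.
 * Equivalent to the SoftMax prediction because exp() is monotonic. */
size_t argmax_class(const int32_t scores[NUM_CLASSES])
{
    size_t best = 0;
    for (size_t c = 1; c < NUM_CLASSES; c++) {
        if (scores[c] > scores[best]) {
            best = c;
        }
    }
    return best;
}

/* Returns the class predicted most often among the frames of one window.
 * Ties are resolved in favour of the lowest class index (assumption). */
size_t majority_class(const size_t *frame_classes, size_t num_frames)
{
    size_t counts[NUM_CLASSES] = {0};
    for (size_t n = 0; n < num_frames; n++) {
        assert(frame_classes[n] < NUM_CLASSES);
        counts[frame_classes[n]]++;
    }

    size_t best = 0;
    for (size_t c = 1; c < NUM_CLASSES; c++) {
        if (counts[c] > counts[best]) {
            best = c;
        }
    }
    return best;
}
```

In use, argmax_class would be called once per classified frame, and majority_class once per window on the collected per-frame results; the number of frames per window depends on the achieved frame rate.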


LEDs illustrate the current expression of the driver based on the received picture. Also, a 5-inch HDMI resistive touch screen is provided to display the pictures taken from the camera and to form a GUI (Graphical User Interface) for the user.

FIGURE 14. Designed system with raspberry pi board.

Furthermore, the designed CNNs are implemented via the MATLAB support package for Raspberry Pi hardware [29]. Although all CNN models are trained in the Keras framework on an NVIDIA Tesla K80 accelerator, the CNN model .h5 file is converted to DAG or series network format with the help of the MATLAB Deep Learning Toolbox Converter for TensorFlow models, found in the Add-Ons section of MATLAB [30]. In this way, MATLAB compiles the CNN model for the ARM-based processor core of the Raspberry Pi. However, MATLAB implements the networks with the IEEE 754 floating-point standard, and the camera used on the Raspberry Pi is five megapixels instead of the 0.3 megapixels of the OV7670 used in the FPGA design. As a result, the Raspberry Pi implementation is the best-case scenario and can achieve better accuracy than the same network implemented on FPGA. Besides accuracy, the frame rate achieved by the Raspberry Pi can also be higher due to its much faster 1.2 GHz clock and multiple cores, compared to the 50 MHz clock and single core of Ibex.

V. RESULTS
In this section, the results of the designed system are demonstrated. For hardware implementation on FPGA, Nexys 4 DDR FPGA board [31] and OV7670 camera module


are used. To display the system output and the images captured by the camera module, the HDMI monitor used in Fig. 14 is connected to the VGA connector of the FPGA board via a VGA-to-HDMI converter (Fig. 15). However, the monitor block is optional and can be omitted in the real-world application. For latency evaluation, we enabled a counter at the beginning of the C code and disabled it at the end of the code to measure the run time of each code precisely. For an example picture from the dataset, the latency is 390 ms for Ibex with no custom instruction, 289 ms for Ibex with the custom store and MAC used in the convolution and fully connected functions, and 231 ms for Ibex with the custom store, conv2d(2 × 2) for convolution, and MAC for the fully connected function, as shown in Table 4. For accuracy measurements, we tested both the FPGA and Raspberry Pi based systems in the real world by deploying them inside a car, as can be seen in Fig. 16. The accuracy results can be seen in Tables 4 and 6. In addition, for the FPGA implementation, the Vivado utilization table is used to compare resource utilization with and without custom instructions, as shown in Table 5. Finally, Table 6 compares the FPGA implementation with the Raspberry Pi implementation.

FIGURE 16. Real-world test of designed systems.

Overall, the results indicate that the processor with the added custom store, conv2d(2 × 2), and MAC operations achieves the best latency with only a negligible increase in the usage of the DSP blocks of the board.

VI. CONCLUSION
This paper discusses the implementation of a modified embedded processor based on the RiscV ISA for driver drowsiness detection systems. The whole system consists of an embedded processor, a camera, and a monitor connected to the VGA port of the FPGA. Drowsiness is detected with a convolutional neural network. The implemented CNN classifies input images taken of the driver into four classes: distraction, natural, sleep, and yawning. Moreover, the system’s hardware and software parts are optimized for this application. On the software side, Dynamic Memory Allocation is used, together with dynamic fixed-point numbers instead of floating-point numbers for the weights, biases, and intermediate variables of the CNN. On the hardware side, three custom instructions, namely a custom store, conv2d(2 × 2), and multiply and accumulation (MAC) operations, are added to the processor’s ISA, decoder, and ALU sections to decrease the run time of the code. Also, the size of the picture taken by the camera module is changed from QVGA to QQVGA to save more memory for the CNN. As a result, adding the custom store, MAC, and conv2d(2 × 2) as custom instructions gives an improvement factor of 1.7 over the base processor, which is higher than the improvement factor of 1.35 achieved by adding only the custom store and MAC.

REFERENCES
[1] A. Chowdhury, R. Shankaran, M. Kavakli, and M. M. Haque, "Sensor applications and physiological features in drivers’ drowsiness detection: A review," IEEE Sensors J., vol. 18, no. 8, pp. 3055–3067, Apr. 2018.
[2] A. Quddus, A. S. Zandi, L. Prest, and F. J. E. Comeau, "Using long short term memory and convolutional neural networks for driver drowsiness detection," Accident Anal. Prevention, vol. 156, Jun. 2021, Art. no. 106107.
[3] M. Hashemi, A. Mirrashid, and A. B. Shirazi, "Driver safety development: Real-time driver drowsiness detection system based on convolutional neural network," Social Netw. Comput. Sci., vol. 1, no. 5, p. 289, Sep. 2020, doi: 10.1007/s42979-020-00306-9.
[4] D. T. Nguyen, T. N. Nguyen, H. Kim, and H. J. Lee, "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1861–1873, Aug. 2019.
[5] RISC-V Specification, vol. 1, RISC-V, Unprivileged Spec V, Cham, Switzerland, 2019.
[6] S. K. Moore, "RISC-V AI chips will be everywhere," IEEE Spectrum, New York, NY, USA, Tech. Rep., 2022.
[7] P. Dabbelt et al. (2020). GNU Toolchain for RISC-V, Including GCC. [Online]. Available: https://github.com/riscv/riscv-gnu-toolchain
[8] S. Arif, M. Arif, S. Munawar, Y. Ayaz, M. J. Khan, and N. Naseer, "EEG spectral comparison between occipital and prefrontal cortices for early detection of driver drowsiness," in Proc. Int. Conf. Artif. Intell. Mechatronics Syst. (AIMS), Apr. 2021, pp. 1–6.
[9] S. Murugan, J. Selvaraj, and A. Sahayadhas, "Detection and analysis: Driver state with electrocardiogram (ECG)," Phys. Eng. Sci. Med., vol. 43, no. 2, pp. 525–537, Jun. 2020.
[10] A. T. Satti, J. Kim, E. Yi, H.-Y. Cho, and S. Cho, "Microneedle array electrode-based wearable EMG system for detection of driver drowsiness through steering wheel grip," Sensors, vol. 21, no. 15, p. 5091, Jul. 2021.
[11] G. Sikander and S. Anwar, "Driver fatigue detection systems: A review," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 6, pp. 2339–2352, Jun. 2018.
[12] V. S. Kumar, S. N. Ashish, I. V. Gowtham, S. P. A. Balaji, and E. Prabhu, "Smart driver assistance system using raspberry pi and sensor networks," Microprocessors Microsyst., vol. 79, Nov. 2020, Art. no. 103275.
[13] S. Gupta, P. Jain, and E. Rufus, "Drowsy driver alerting system," in Proc. 2nd Int. Conf. Electron., Commun. Aerosp. Technol. (ICECA), Mar. 2018, pp. 1665–1670.
[14] A. A. Suhaiman, Z. May, and N. A. Rahman, "Development of an intelligent drowsiness detection system for drivers using image processing technique," in Proc. IEEE Student Conf. Res. Develop. (SCOReD), Sep. 2020, pp. 233–236.
[15] W. Deng and R. Wu, "Real-time driver-drowsiness detection system using facial features," IEEE Access, vol. 7, pp. 118727–118738, 2019.
[16] B. Reddy, Y.-H. Kim, S. Yun, C. Seo, and J. Jang, "Real-time driver drowsiness detection for embedded system using model compression of deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jul. 2017, pp. 121–128.
[17] B. K. Savas and Y. Becerikli, "Real time driver fatigue detection system based on multi-task ConNN," IEEE Access, vol. 8, pp. 12491–12498, 2020.


[18] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun, and F. Liang, "A RISC-V based hardware accelerator designed for Yolo object detection system," in Proc. IEEE Int. Conf. Intell. Appl. Syst. Eng. (ICIASE), Apr. 2019, pp. 9–11.
[19] A. Gonzalez and C. Hong, "A chipyard comparison of NVDLA and Gemmini," Berkeley, CA, USA, Tech. Rep. EE 290-2, 2020.
[20] F. Farshchi, Q. Huang, and H. Yun, "Integrating NVIDIA deep learning accelerator (NVDLA) with RISC-V SoC on FireSim," in Proc. 2nd Workshop Energy Efficient Mach. Learn. Cognit. Comput. Embedded Appl. (EMC), Feb. 2019, pp. 21–25.
[21] E. Gholizadehazari, T. Ayhan, and B. Ors, "An FPGA implementation of a RISC-V based SoC system for image processing applications," in Proc. 29th Signal Process. Commun. Appl. Conf. (SIU), Jun. 2021, pp. 1–4.
[22] S. Abtahi, M. Omidyeganeh, S. Shirmohammadi, and B. Hariri, "YawDD: A yawning detection dataset," in Proc. 5th ACM Multimedia Syst. Conf. (MMSys), 2014, pp. 24–28.
[23] A. Franco, J. Charlong, and E. Badger. A Python Library for Fractional Fixed-Point (Base 2) Arithmetic and Binary Manipulation With Numpy Compatibility. Accessed: 2020. [Online]. Available: https://github.com/francof2a/fxpmath
[24] M. Perotti, P. D. Schiavone, G. Tagliavini, D. Rossi, T. Kurd, M. Hill, L. Yingying, and L. Benini, "HW/SW approaches for RISC-V code size reduction," in Proc. Workshop Comput. Archit. Res. RISC-V (CARRV), 2020, pp. 1–8.
[25] LowRISC. Ibex Core Documentation. Accessed: 2017. [Online]. Available: https://ibex-core.readthedocs.io/en/latest/
[26] A. Amid et al., "Chipyard: Integrated design, simulation, and implementation framework for custom SoCs," IEEE Micro, vol. 40, no. 4, pp. 10–21, 2020, doi: 10.1109/MM.2020.2996616.
[27] R. Herveille, "WISHBONE system-on-chip (SoC) interconnection architecture for portable IP cores," OpenCores, Amsterdam, The Netherlands, Tech. Rep., 2010.
[28] ARM Limited. (2011). AXI Spec. [Online]. Available: http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI%4_specification.pdf
[29] MathWorks. MATLAB Support Package for Raspberry PI Hardware. Accessed: 2014. [Online]. Available: https://www.mathworks.com/hardware-support/raspberry-pi-MATLAB.html
[30] MathWorks. Deep Learning Toolbox Converter for Tensorflow Models. Accessed: 2017. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/64649-deep-learning-toolbox-converter-for-tensorflow-models
[31] Digilent. Nexys 4 DDR Board. Accessed: 2013. [Online]. Available: https://digilent.com/reference/programmable-logic/nexys-4-ddr/start

SEYED KIAN MOUSAVIKIA was born in Urmia, Iran. He received the B.Sc. degree (Hons.) in electronics engineering from the University of Urmia, in 2019, where he is currently pursuing the M.Sc. degree in electronics engineering. From 2018 to 2019, he was a Researcher at Urmia University, where he was involved in the design of a signal generator based on direct digital synthesis method (DDS) on field programmable gate arrays. In 2020, he was a Researcher for the project between TUBİTAK (The Scientific and Technological Research Institution of Turkey) and the Ministry of Science, Research and Technology, Iran, developing RISC-V based on SoC system for driver fatigue detection algorithms. His current research interest includes HW/SW codesign with the emphasis on deep learning and machine learning algorithms on field programmable gate arrays.

ERFAN GHOLIZADEHAZARI was born in Urmia, Iran. He received the B.Sc. and M.Sc. degrees (Hons.) in electronics from Istanbul Technical University, Turkey, in 2021. From 2017 to 2018, he was a Researcher at Urmia University, Urmia, where he was involved in the IR signal decoding project using NEC protocol for controlling projection systems. His undergraduate research was about wireless charger designing and implementation. From 2020 to 2021, he was a Graduate Researcher for the project between TUBİTAK (The Scientific and Technological Research Institution of Turkey) and Ministry of Science, Research and Technology, Iran, developing RISC-V based on SoC system for driver fatigue detection algorithms. His research interests include microprocessor architecture, embedded systems, and HW/SW codesign.

MORTEZA MOUSAZADEH was born in Urmia, Iran. He received the B.S. degree in electrical engineering from the Iran University of Science and Technology, Tehran, Iran, in 2003, and the M.S. degree in electrical engineering and the Ph.D. degree in microelectronics from Urmia University, Urmia, in 2006 and 2014, respectively. He is currently a Professor at Urmia University. His research interests include mixed mode IC design, data converter, and AI accelerator.

SIDDIKA BERNA ORS YALCIN received the bachelor’s and M.Sc. degrees in electronics and communication engineering from Istanbul Technical University (ITU), Turkey, in 1995 and 1998, respectively, and the Electrical Engineering degree in applied sciences from Katholieke Universiteit Leuven, Belgium, in 2005. She is currently an Associate Professor at ITU. Her main research interests include cryptography, embedded systems, and side-channel attacks.
