# Sound Event Detection with Binary Neural Networks on Tightly Power-Constrained IoT Devices

Gianmarco Cerutti
Fondazione Bruno Kessler
Trento, Italy
ETH Zurich
Zurich, Switzerland
gcerutti@fbk.eu

Renzo Andri
ETH Zurich
Zurich, Switzerland
andrire@iis.ee.ethz.ch

Lukas Cavigelli
ETH Zurich
Zurich, Switzerland
cavigelli@iis.ee.ethz.ch

Elisabetta Farella
Fondazione Bruno Kessler
Trento, Italy
efarella@fbk.eu

Michele Magno
ETH Zurich
Zurich, Switzerland
magno@iis.ee.ethz.ch

Luca Benini
ETH Zurich
Zurich, Switzerland
University of Bologna
Bologna, Italy
benini@iis.ee.ethz.ch

## **ABSTRACT**

Sound event detection (SED) is a hot topic in consumer and smart city applications. Existing approaches based on deep neural networks (DNNs) are very effective, but highly demanding in terms of memory, power, and throughput when targeting ultra-low power always-on devices.

Latency, availability, cost, and privacy requirements are pushing recent IoT systems to process data on the node, close to the sensor, with a very limited energy supply and tight constraints on memory size and processing capabilities that preclude running state-of-the-art DNNs.

In this paper, we explore the combination of extreme quantization to a small-footprint binary neural network (BNN) with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller. Starting from an existing CNN for SED whose footprint (815 kB) exceeds the 512 kB of memory available on our platform, we retrain the network using binary filters and activations to match these memory constraints. (Fully) binary neural networks come with a natural drop in accuracy of 12-18% on the challenging ImageNet object recognition task compared to their equivalent full-precision baselines. Our BNN reaches an accuracy of 77.9%, just 7 percentage points lower than the full-precision version, with 58 kB (7.2× less) for the weights and 262 kB (2.4× less) memory in total. With our BNN implementation, we reach a peak throughput of 4.6 GMAC/s and 1.5 GMAC/s over the full network, including preprocessing with Mel bins, which corresponds to an efficiency of 67.1 GMAC/s/W and 31.3 GMAC/s/W, respectively. Compared to the performance of an ARM Cortex-M4 implementation, our system has a 10.3× faster execution time and a 51.1× higher energy efficiency.

## **KEYWORDS**

Binary Neural Networks, Sound Event Detection, Ultra Low Power

## **ACM Reference Format:**

Gianmarco Cerutti, Renzo Andri, Lukas Cavigelli, Elisabetta Farella, Michele Magno, and Luca Benini. 2020. Sound Event Detection with Binary Neural Networks on Tightly Power-Constrained IoT Devices. In *ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '20), August 10–12, 2020, Boston, MA, USA.* ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3370748.3406588

## 1 INTRODUCTION

Cloud computing is the most widely adopted paradigm for deploying artificial intelligence (AI), and specifically DNNs, to extract useful information from sensors in the internet-of-things (IoT) era [12]. However, this cloud-centric approach has several drawbacks: high latency due to communication delays, availability and reliability limited by the communication infrastructure, privacy issues due to the streaming of sensitive data to a remote site, and a high energy cost for data transmission [29]. Edge computing is a novel alternative that addresses these limitations by pushing AI close to the sensors and transmitting only relevant information and alerts [8]. Typically, IoT end-nodes are battery-powered and target a long battery life, ideally aiming at self-sustainable operation with the help of energy harvesters, whose collected energy is far from sufficient to power high-performance processors or GPUs [1]. Microcontrollers (MCUs), with their low power consumption and low cost, are the platform of choice to enable the migration of AI to the edge. The leading MCU architecture is the ARM Cortex-M series, with power consumption in the range of milliwatts and throughput in the order of MOPS. To overcome these constraints, many researchers have in recent years put effort into specialized hardware and optimized inference algorithms to run DNNs on power-constrained devices.

On the software side, reducing network complexity while preserving the quality of predictions is of significant interest when porting deep and complex architectures to a heavily constrained IoT node. There are several approaches to this goal, e.g., knowledge distillation [15], network pruning [13], or network quantization [21]. However, only a few implementations of DNNs on microcontrollers are presented in the literature [3, 19, 39]. An extreme case of quantization is the Binary Neural Network (BNN), in which all the weights and activations are described by a single bit representing the value -1 or 1 [30]. As a consequence, BNNs significantly reduce the amount of memory required and compress 32 MAC operations into just two operations without significantly compromising the accuracy [30]. These two advantages make BNNs a promising approach when resource-constrained devices are involved in edge computing.

On the hardware side, new approaches enabling near-threshold parallel computing in the MCU space have been explored by researchers in both industry and academia [7]. For instance, a novel parallel processor based on the RISC-V ISA has been launched recently [36]. GAP8 is a commercial processor derived from the Parallel Ultra Low Power (PULP) open-source project<sup>1</sup>. This processor has power requirements similar to those of the Cortex-M family (hundreds of mW), with up to 20 times higher computation performance for machine learning applications [36]. Furthermore, it features RISC-V extensions that accelerate BNN processing. In particular, the popcount instruction significantly speeds up the processing of BNNs and other quantized neural networks.

Looking at applications, scene understanding and context analysis are among the domains where edge processing can be crucial. They often rely on computer vision. However, the combination with audio processing can greatly improve the accuracy of event detection and activity recognition, complementing vision where line-of-sight occlusions or environmental light changes occur [37]. Furthermore, the use of audio detection alone can partially solve privacy concerns. Thus, sound event detection (SED) is a powerful tool for many applications such as traffic monitoring [27], crowd monitoring [22], measurement of occupancy levels for smart and energy-efficient buildings [35], and emergency detection [11].

This paper proposes a novel Binary Neural Network (BNN) for SED on resource-constrained, low-power microcontrollers, i.e., for classifying which sound event is present in an audio recording. The proposed BNN has been implemented on the GreenWaves GAP8.

The main contributions of this paper are as follows:

- (1) We propose, train, and efficiently implement a novel BNN architecture for SED, comparing it with a full-precision baseline network.
- (2) We present the design of a full system based on the low-power GAP8 microcontroller and its optimized instruction set architecture (ISA). The full pipeline is developed from audio acquisition with a low-power microphone, through the Mel bins feature extraction, to the on-board classification. We present a detailed analysis of the throughput and energy trade-off in a variety of supported configurations, as well as on-board measurements.
- (3) We demonstrate that binarization of weights and activations is the key factor in matching hardware constraints. Experimental evaluation shows that our implementation on the PULP platform is 51× more energy-efficient and 10× faster than the implementation of the same network on its Cortex-M4 based counterpart.

## 2 RELATED WORK

The most common techniques for SED, and for audio processing in general, employ Mel-frequency cepstral coefficient (MFCC) features followed by a GMM, HMM, or SVM classifier [23, 34, 42]. Recently, DNNs [24], convolutional neural networks (CNNs) [14], and recurrent neural networks (RNNs) [2] have been used instead. However, those models require a large amount of memory to deliver accurate predictions: for instance, DNNs for SED such as L3 [6] and VGGish [14] require approximately 4M and 70M parameters, respectively.

Reducing the size of existing networks for SED has been widely investigated in the recent literature. In particular, knowledge distillation has been deployed to compress the L3 network to edge-L3 in [6], and VGGish is further compressed to baby VGGish in [2].

By replacing the fully connected layer of an existing CNN with average max-pooling, Meyer et al. [25] reduced the number of parameters while increasing the accuracy for the targeted dataset. Still, MeyerNet is not suitable for our very constrained IoT use case. Therefore, further model compression is required to match these constraints.

In addition to model structure modification, recent works on CNNs have investigated quantization to reduce the storage and computational costs of the inference task [17, 20, 21]. As an extreme case of quantization, BNNs reduce the precision of both weights and neuron activations to a single bit [5, 30]. BNNs work on simple tasks like MNIST, CIFAR-10, and SVHN without a drop in accuracy [16]. On the challenging ImageNet dataset, binary and ternary neural networks (BNNs/TNNs) show an accuracy drop of 12%/6.5% [32, 40]. Recent approaches use multiple binary weight bases or perform part of the convolutions in full precision. With these, the accuracy drop has been reduced to 3.2% [41]; unfortunately, these approaches increase the weight memory footprint and the computational complexity.

BNNs are well suited to resource-constrained platforms, thanks to their reduced memory requirements and their potential to convert multiplications into hardware-friendly XNOR operations.

Peak throughput and energy efficiency are achieved by ASIC accelerators. In particular, BinarEye [26] achieves an energy efficiency of 115 TMAC/s/W. However, these accelerators are not available on the market and are usually restricted to a few network types.

Several works have implemented CNNs with fixed-point formats and operations, both in the video domain [3, 28] and in the audio domain, e.g., keyword spotting on Cortex-M4 based microcontrollers [39] and on Cortex-M0+ and Raspberry Pi based platforms [19].

One of the challenges in this field is the development of energy-efficient neural network (NN) firmware implementations for embedded systems.

Wang et al. [38] developed a library for porting neural networks from the FANN framework to ARM MCUs and PULP platforms. In this case, the hardware is fully utilized, but only multilayer perceptrons are supported. Garofalo et al. developed a custom library for quantized convolutional neural networks on PULP [10]. However, their focus has been on the precision-throughput trade-off, thereby omitting several optimizations specific to the corner case of binary neural networks and limiting the evaluations to a synthetic single-layer benchmark.

<sup>1</sup>https://www.pulp-platform.org

To the best of our knowledge, this is the first BNN proposed and implemented on a parallel RISC-V based microcontroller.

## 3 FEATURE EXTRACTION AND BNN

The idea behind BNNs is to approximate the multi-bit filter weights and inputs with binary values in NNs. Binary weights and activations imply a significant decrease in memory usage as well as computational cost [30]. In this section, we describe the structure of the network, starting from the audio stream to the final prediction.

## 3.1 Feature Extraction (Mel Bins)

The preprocessing stage computes the short-time Fourier transform (STFT) in windows of 32 ms with a hop of 8 ms. We then apply the Mel filters to generate 64 Mel bins per window. 400 consecutive feature vectors are assembled into the Mel-spectrogram for 3.2 s of audio, and the resulting $64 \times 400$ matrix is the input to the neural network.
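The framing described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the firmware implementation: the Hann window, the HTK-style Mel scale, the power spectrum, and the zero-padding at the end of the signal (needed to obtain exactly 400 frames from 3.2 s) are all assumptions, and the function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, sampling rate of the dataset
WIN = 512              # 32 ms analysis window
HOP = 128              # 8 ms hop
N_MELS = 64            # Mel bins per frame
N_FRAMES = 400         # 3.2 s / 8 ms

def hz_to_mel(f):
    # HTK-style Mel scale (assumed; the paper does not specify the variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=WIN, sr=SAMPLE_RATE):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_spectrogram(audio):
    """Return a (64, 400) Mel-spectrogram for one 3.2 s chunk of audio."""
    needed = (N_FRAMES - 1) * HOP + WIN
    audio = np.pad(audio, (0, max(0, needed - audio.size)))  # assumed padding
    frames = np.stack([audio[i * HOP:i * HOP + WIN] for i in range(N_FRAMES)])
    power = np.abs(np.fft.rfft(frames * np.hanning(WIN), axis=1)) ** 2
    return mel_filterbank() @ power.T
```

On the device, the same steps would run in fixed-point arithmetic; the sketch uses floating point only for readability.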

## 3.2 First Layer and Binarization

The input data to the network is non-binary and has, therefore, to be treated separately. A robust approach is to keep the first network layer in full-precision, like in Courbariaux et al. [5]. In this way, the network learns the binarization function from the training set.

After the convolution, batch normalization is applied, which can be replaced in inference by a bias and a scaling factor, and is finally followed by the signum activation function for binarization.

To avoid floating-point operations, all the operations described in this section are done in fixed-point. Even on embedded systems with floating-point support, fixed-point operations are more efficient in terms of execution time and energy consumption without a significant loss of performance [21]; this is evaluated in more detail in the experimental results section.

On the other hand, fixed-point quantization requires additional effort to find the right number of integer and fractional bits for each parameter representation. To do this, we check the range of the parameters and choose the number of integer bits that represents most of the values (99.9%) without overflow.
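One way to automate this choice is sketched below. It assumes a signed 16-bit word with one sign bit and picks the smallest number of integer bits that covers the 99.9th percentile of the absolute parameter values; the function names and the saturating behaviour of `quantize` are illustrative assumptions rather than the procedure used for the reported results.

```python
import numpy as np

def choose_q_format(values, total_bits=16, coverage=0.999):
    """Return (integer_bits, fractional_bits) for a signed fixed-point word."""
    magnitude = np.quantile(np.abs(values), coverage)
    int_bits = 0
    while 2.0 ** int_bits <= magnitude:
        int_bits += 1
    frac_bits = total_bits - 1 - int_bits   # one bit is reserved for the sign
    return int_bits, frac_bits

def quantize(values, int_bits, frac_bits):
    """Round to the chosen grid and saturate values outside the range."""
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits))
    hi = 2 ** (int_bits + frac_bits) - 1
    return np.clip(np.round(values * scale), lo, hi).astype(np.int32)
```

With this rule, the few values beyond the chosen percentile saturate instead of wrapping around, which trades a small clipping error for finer resolution on the bulk of the parameters.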

## 3.3 Binary Convolution

BNNs constrain weights and inputs to  $\mathbf{I} \in \{-1,1\}^{n_{in} \times h \times b}$  and  $\mathbf{W} \in \{-1,1\}^{n_{out} \times n_{in} \times k_y \times k_x}$ . To avoid using two bits, we represent -1 with 0; the resulting binary numbers are indicated with a hat (i.e.,  $\hat{i} = (i+1)/2$ ). It turns out that multiplications become xnor operations  $\bar{\oplus}$  [30]. Formally, the output  $o_k$  of an output channel  $k \in \{0,...,n_{out}-1\}$  can be described as<sup>2</sup>:

$$\begin{aligned}
\mathbf{o_k} &= \operatorname{sgn}\left(\sum_{n=0}^{n_{in}-1} \mathbf{i_n} * \mathbf{w_{k,n}}\right) = \operatorname{sgn}\left(\sum_{n=0}^{n_{in}-1} 2\left(\hat{\mathbf{i_n}} * \hat{\mathbf{w}_{k,n}}\right) - k_y k_x\right) \\
&= \operatorname{sgn}\left(\sum_{n=0}^{n_{in}-1} \sum_{(\Delta x, \Delta y)} 2\left(\hat{\mathbf{i_n}}^{y+\Delta y, x+\Delta x} \bar{\oplus} \hat{\mathbf{w}_{k,n}}^{\Delta y, \Delta x}\right) - 1\right)
\end{aligned}$$

Here,  $\Delta y$  and  $\Delta x$  are the relative filter tap positions (e.g.,  $(\Delta y, \Delta x) \in \{-1, 0, 1\}^2$  for  $3 \times 3$  filters). As calculating single-bit operations on a microcontroller is not efficient, we pack several input channels into a 32-bit integer (e.g., the feature map pixels at  $(y + \Delta y, x + \Delta x)$  in the spatial dimension and input channels 32n to (32(n+1)-1) are packed in  $\hat{i}_{32n:+32}^{y+\Delta y,x+\Delta x}$ ), while the multiply-accumulates (MACs) can be implemented with *popcount* and *xnor* operations.

Furthermore, as common embedded platforms like GAP8 do not have a built-in *xnor* operator, the *xor* operator  $\oplus$  is used and the result is inverted. Therefore, the final equation for the output channel is

$$\mathbf{o_k} = \operatorname{sgn}\left(\sum_{n=0}^{\frac{n_{in}}{32}-1} \sum_{(\Delta x, \Delta y)} \left(32 - 2\operatorname{popcnt}\left(\hat{\mathbf{i}}_{32n:+32}^{y+\Delta y, x+\Delta x} \oplus \hat{\mathbf{w}}_{k,32n:+32}^{\Delta y, \Delta x}\right)\right)\right)$$
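The identity behind this formula can be checked with a few lines of Python. The sketch below packs 32 values from {-1, +1} into one integer (mapping -1 to 0) and compares the xor/popcount result with the ordinary dot product; `pack_bits` and `binary_dot32` are hypothetical helper names, and the software bit count stands in for the hardware popcount instruction.

```python
import numpy as np

def pack_bits(pm1):
    """Pack values from {-1, +1} into one integer, mapping -1 -> 0 and +1 -> 1."""
    word = 0
    for bit, v in enumerate(pm1):
        word |= (1 if v > 0 else 0) << bit
    return word

def binary_dot32(i_word, w_word):
    """Dot product of two packed 32-element {-1, +1} vectors.

    Equal bits contribute +1 and differing bits -1, hence 32 - 2 * popcount(xor).
    """
    return 32 - 2 * bin(i_word ^ w_word).count("1")

rng = np.random.default_rng(0)
i_vec = rng.choice([-1, 1], size=32)
w_vec = rng.choice([-1, 1], size=32)
assert binary_dot32(pack_bits(i_vec), pack_bits(w_vec)) == int(i_vec @ w_vec)
```

One xor and one popcount thus replace 32 multiply-accumulate operations on the packed word.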

## 3.4 Batch Normalization and Binarization

A batch normalization layer follows each binary convolutional layer. As the outputs of binary layers are integer values, and the signum function can be written as a comparison function, the activation function is simplified to:

$$binAct(x) = \begin{cases} 0, & \text{if } x \cdot sgn(\gamma') \ge \left\lfloor \frac{\beta'}{\gamma'} \right\rfloor \\ 1, & \text{if } x \cdot sgn(\gamma') < \left\lfloor \frac{\beta'}{\gamma'} \right\rfloor \end{cases} . \tag{1}$$

where  $\gamma'$  is the scaling factor and  $\beta'$  is the bias based on the batch normalization parameters. While exporting the model, we compute the integer threshold value  $\lfloor \frac{\beta'}{\gamma'} \rfloor$  in advance. During inference, one sign comparison and one threshold comparison have to be calculated for each activation value.
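The folding of an affine normalization into a single integer comparison can be illustrated as follows. The sketch assumes that batch normalization has already been reduced to an integer affine map $a \cdot x + b$ with $a \neq 0$ (both hypothetical fixed-point integers) and that the output bit is 1 when the result is non-negative; this encoding and orientation are assumptions and may differ from the convention of Eq. (1).

```python
def fold_threshold(a, b):
    """Fold sgn(a * x + b) into (sign of a, integer threshold).

    For integer x and a != 0:
        a * x + b >= 0  <=>  x * sgn(a) >= ceil(-b / |a|)
    """
    sign = 1 if a > 0 else -1
    threshold = -(b // abs(a))   # ceil(-b / |a|) using integer floor division
    return sign, threshold

def bin_act(x, sign, threshold):
    """Return 1 for a non-negative normalized value and 0 otherwise."""
    return 1 if x * sign >= threshold else 0
```

Because the threshold is computed once at export time, inference needs no multiplication or division per activation, only the sign flip and the comparison mentioned above.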

## 3.5 Last Layer and Prediction

In the last layer, the outputs of the last binary layer are convolved with the fixed-point weights, and N output channels are calculated, where N is the number of classes. Finally, the network performs an average pooling over the whole image, giving one prediction for each of the N classes.

## 3.6 Neural Network Architecture

Tbl. 1 summarizes the architecture of the NN. The neural network consists of 7 hidden layers, 5 of which are binary. The first and last layers are real-valued. Their required computations are significantly smaller than in the binary layers (e.g., 7 MMAC in the first layer compared to 109 MMAC in the second layer), and therefore they minimally contribute to the overall computational effort. The reason for having real-valued layers is the high loss of accuracy with entirely binarized neural networks [30].
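The MAC counts quoted for the first six rows of Tbl. 1 can be cross-checked with a short calculation. The sketch assumes a single-channel $64 \times 400$ input and unpadded ("valid") convolutions; the padding mode is not stated in the text and is inferred here only because it reproduces the listed values. The last real-valued layer is not covered by the sketch.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1

# (name, kernel size, output channels, stride) as listed in Tbl. 1
LAYERS = [
    ("First", 3, 32, 1),
    ("Binary 1", 3, 64, 2),
    ("Binary 2", 3, 128, 1),
    ("Binary 3", 3, 128, 2),
    ("Binary 4", 3, 128, 1),
    ("Binary 5", 1, 128, 1),
]

def mac_counts(height=64, width=400, in_channels=1):
    """Return a dict mapping layer name to its number of MAC operations."""
    counts = {}
    for name, k, out_channels, stride in LAYERS:
        height = conv_out(height, k, stride)
        width = conv_out(width, k, stride)
        counts[name] = height * width * k * k * in_channels * out_channels
        in_channels = out_channels
    return counts

for layer, macs in mac_counts().items():
    print(f"{layer}: {macs / 1e6:.1f} MMAC")
```

The dominant cost sits in the second binary layer, where the channel count doubles while the stride returns to 1.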

## 4 EMBEDDED IMPLEMENTATION

The Mel bins extraction and the BNN are implemented on GAP8. The application scenario for this device is low-latency, low-power signal processing. The device has a tunable frequency and voltage supply. Fig. 1 shows the main blocks of the chip: GAP8 has two main programmable components, the fabric controller (FC) and the cluster. The FC is the central microcontroller unit, and it is meant to manage peripherals and offload workloads to the cluster. The cluster is composed of eight parallel RISC-V cores, a convolution accelerator, and shared memory banks. The two domains share the same voltage source but run at two different frequencies: on-chip DC-DC converters translate the voltage, and two independent frequency-locked loops (FLLs) generate the two clock domains. The FC is a single-core in-order microcontroller implementing the RISC-V instruction set. To tailor the core to signal processing applications, GAP8 extends the RISC-V IMC instruction set: in addition to the integer, multiplication, and compressed instructions (IMC), the GAP8 ISA supports multiply-accumulate, single instruction multiple data (SIMD), bit manipulation, post-increment load/store, and hardware loops. The FC is directly connected to an L2 memory of 512 kB SRAM. The cluster has eight cores identical to the FC. The cores share the 64 kB L1 SRAM scratchpad memory, equipped with a logarithmic interconnect that supports single-cycle concurrent access from different cores requesting memory locations on separate banks.

<sup>2</sup>For simplicity, we omit bias and scaling factor in the formula.

Table 1: Kernel size, number of channels, stride, and computational effort (MACs) for each layer.

| Layer               | Kernel Size  | Channels | Stride | MACs |
|---------------------|--------------|----------|--------|------|
| First (real-valued) | $3 \times 3$ | 32       | 1      | 7M   |
| 1. Binary Layer     | $3 \times 3$ | 64       | 2      | 109M |
| 2. Binary Layer     | $3 \times 3$ | 128      | 1      | 405M |
| 3. Binary Layer     | $3 \times 3$ | 128      | 2      | 186M |
| 4. Binary Layer     | $3 \times 3$ | 128      | 1      | 154M |
| 5. Binary Layer     | $1 \times 1$ | 128      | 1      | 17M  |
| Last (real-valued)  | $1 \times 1$ | 28       | 1      | 6M   |
| Total:              |              |          |        | 884M |

Figure 1: Architecture of the GAP8 embedded processor [9]

The cores fetch instructions from a multi-ported instruction cache to maximize the energy efficiency on data-parallel code. Moreover, an efficient DMA (called  $\mu DMA$ ) enables multiple direct transfers from peripherals and L1 to the L2 memory. The cluster has a hardware synchronizer for event management and efficient dispatching of parallel threads. The FC and the cluster communicate with each other over a bidirectional AXI-64 bus. The software running on the FC oversees all tasks offloaded to the cluster and the  $\mu DMA$ . At the same time, a low-overhead runtime on the cluster cores exploits the hardware synchronizer to implement shared-memory parallelism in the fashion of OpenMP [4].

## 5 EXPERIMENTAL RESULTS

To accurately evaluate the BNN, we designed a full system, so that the power and energy-efficiency measurements are performed on the actual hardware platform.

## 5.1 Dataset

In this work, we use the dataset of Takahashi et al. [33], which is based on the Freesound database, an online collaborative sound database. It consists of 28 different event types, e.g., instruments, animals, and mechanical sounds. Each clip has a variable length, and the total length of all 5223 audio files is 768 minutes. All audio samples are single-channel, with a sampling rate of 16 kHz and a bit depth of 16 bits. The dataset is split into a training set (75%) and a test set (25%). We compute the STFT in windows of 512 samples every 128 samples, corresponding to 32 ms and 8 ms, respectively. Then we apply 64 Mel filters to generate 64 Mel bins. 400 feature vectors are then tiled together to create the Mel-spectrogram for 3.2 s of audio (see Sec. 3.1). For the training set, we split each audio clip into consecutive chunks of 3.2 s.

Chunks shorter than 3.2 s are discarded, or zero-padded if they are the only chunk of a clip. In the test set, we extract one single patch of 3.2 s, starting from the middle of the clip.

## 5.2 Firmware Details

To cope with the L1 memory constraints, we split the image into 4 tiles and run the prediction on each of them. The tiles overlap by 20 pixels to account for the receptive field of the convolutional kernels at the tile borders. The firmware implements double buffering for the weight loading: before the program processes the input of a specific layer, the cores configure the DMA to load the weights of the next layer from the L2 memory to the single-cycle accessible L1 memory. A useful feature of GAP8 is the built-in popcount instruction, which takes just one cycle and significantly decreases the execution time of the binary layers. The application of a single  $3\times3\times C$  kernel is sped up by loop unrolling. Finally, the code is parallelized over the eight cores using the OpenMP API.
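The tiling can be illustrated with a small index calculation. The sketch assumes that the split runs along the 400-frame time axis and that neighbouring tiles share exactly 20 columns; neither detail is stated explicitly above, and `tile_ranges` is a hypothetical helper rather than part of the firmware.

```python
def tile_ranges(length, n_tiles, overlap):
    """Return (start, end) column ranges of equally wide, overlapping tiles."""
    total = length + (n_tiles - 1) * overlap
    if total % n_tiles != 0:
        raise ValueError("length and overlap do not give equally wide tiles")
    width = total // n_tiles
    step = width - overlap
    return [(i * step, i * step + width) for i in range(n_tiles)]

# Four tiles over 400 frames with 20 shared columns between neighbours
print(tile_ranges(400, 4, 20))
```

Under these assumptions each tile is 115 columns wide, so the overlap adds 60 columns of redundant computation in exchange for fitting the working set into the 64 kB L1 memory.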

## 5.3 Accuracy

We start from MeyerNet [25] and use the Additive Noise Annealing (ANA) algorithm [32] to train the network with binary weights and activations. Tbl. 2 provides an overview of the original MeyerNet and the BNN. The BNN-GAP8 network keeps the first and the last layer in 16-bit fixed-point, whereas the other layers are binary. For the accuracy of MeyerNet, we consider its 16-bit quantized version because it is expected<sup>3</sup> to be the same as that of the FP32 baseline.

The BNN achieves an accuracy of 77.9%, which is 7.2 percentage points below the full-precision baseline and is in line with state-of-the-art binary and ternary networks (i.e., a drop of 12% for binary and 6.5% for ternary neural networks on ImageNet [32, 40]).

Tbl. 2 shows that the BNN meets the memory constraint of  $512\,\mathrm{kB}$  of L2 memory in the GAP8 chip, in contrast to the fixed-point baseline.

<sup>3</sup>DNNs are robust to quantization down to 16 bit [18, 20, 28]

Table 2: Accuracy and memory footprint of the baseline CNN (16-bit fixed-point precision) and of the BNN with the first and last layers in 16-bit fixed-point.

|                         | CNN [25]          | BNN-GAP8 |
|-------------------------|-------------------|----------|
| Accuracy                | 85.1%             | 77.9%    |
| Memory for weights [kB] | 815               | 58       |
| Memory for input [kB]   | 204               | 204      |
| Memory requirement [kB] | 1019 <sup>a</sup> | 262      |

<sup>a</sup>It does not fit into the 512 kB SRAM of the GAP8 microcontroller.
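The weight footprint of the BNN in Tbl. 2 can be approximated from the layer shapes in Tbl. 1. The sketch below assumes a single input channel, 16-bit weights in the first and last layers, 1-bit weights in the binary layers, and it ignores biases and batch-normalization thresholds; these simplifications are assumptions made for illustration.

```python
# (kernel size, input channels, output channels, bits per weight)
LAYER_WEIGHTS = [
    (3, 1, 32, 16),     # first layer, 16-bit fixed-point
    (3, 32, 64, 1),     # binary layers
    (3, 64, 128, 1),
    (3, 128, 128, 1),
    (3, 128, 128, 1),
    (1, 128, 128, 1),
    (1, 128, 28, 16),   # last layer, 16-bit fixed-point
]

def weight_bytes(layers=LAYER_WEIGHTS):
    """Total weight storage in bytes for the given layer list."""
    bits = sum(k * k * c_in * c_out * b for k, c_in, c_out, b in layers)
    return bits // 8

print(f"{weight_bytes() / 1000:.1f} kB")
```

Under these assumptions the binary layers account for most of the parameters but only a modest share of the bytes, which is why the two 16-bit layers remain affordable.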



Figure 2: Throughput and energy efficiency at different supply voltages and operating frequencies. All of the measured settings fulfill the requirement of one classification every 3.2s (see the grey dashed line).

## 5.4 Energy Efficiency

In this section, we discuss the trade-off between throughput and energy efficiency. We sweep the independent cluster and fabric controller frequencies  $(f_{cl}, f_{fc}) \in \{30, 50, 85, 100, 150\}$  MHz  $\times \{10, 30, 50, 100, 150\}$  MHz at 1 V, and  $(f_{cl}, f_{fc}) \in \{50, 100, 150, 200, 250\}$  MHz  $\times \{10, 30, 50, 100, 150\}$  MHz at 1.2 V, as supported by the GAP8 microcontroller. We set the real-time constraint to 0.3125 frames per second, i.e., one classification per 3.2 s long audio sample.

Fig. 2 clearly shows that the 1.0 V corners Pareto-dominate the faster 1.2 V corners. The most energy-efficient corner is at 100 MHz for the FC and 150 MHz for the cluster, where the system achieves an energy efficiency of 31.3 GMAC/s/W and a throughput of 1.5 GMAC/s.

## 5.5 Execution Time and Power Consumption

We profile the execution time, throughput, and energy efficiency of each layer of the NN. Tbl. 1 lists the network architecture together with the number of multiply-accumulate (MAC) operations required for each layer. All measurements are taken at the most energy-efficient corner according to the analysis in the previous section (i.e.,  $V_{dd} = 1.0 \,\text{V}$ ,  $(f_{cl}, f_{fc}) = (150 \,\text{MHz}, 100 \,\text{MHz})$ ).

The measurements are performed with the *Rocketlogger* [31]. The voltage and current of the system-on-chip (SoC) are logged. We evaluate the power and duration of the measurements and calculate the energy consumption. The results for each layer are listed in Tbl. 3.

Table 3: Duration and energy consumption of each layer, together with throughput and energy efficiency relative to the number of MACs.

| Layers                   | MACs | Time [ms] | Energy [mJ] | Throughput [MAC/s] | Efficiency [MAC/s/W] |
|--------------------------|------|-----------|-------------|--------------------|----------------------|
| Mel bins                 | -    | 77.0      | 2.64        | -                  | -                    |
| First Layer              | 7M   | 130.8     | 5.94        | 54M                | 1.2G                 |
| 1. Bin Layer             | 109M | 73.3      | 3.57        | 1494M              | 30.6G                |
| 2. Bin Layer             | 404M | 168.0     | 8.86        | 2404M              | 45.6G                |
| 3. Bin Layer             | 185M | 51.2      | 2.94        | 3628M              | 63.2G                |
| 4. Bin Layer             | 154M | 40.3      | 2.29        | 3822M              | 67.1G                |
| 5./6. Layer <sup>4</sup> | 21M  | 47.4      | 1.93        | 1724M              | 1.9G                 |
| Total/Average            | 882M | 588.0     | 28.18       | 1503M              | 31.3G                |

Figure 3: Improvement in throughput and energy efficiency compared to the ARM Cortex-M4 implementation.

Binary layers are the most efficient ones because the combination of xor and popcount instructions processes 32 binary values in just 2 instructions. The efficiency peaks at 67.1 GMAC/s/W in the fourth binary layer. Averaged over the network alone (i.e., excluding the Mel bins extraction), the efficiency is 34.5 GMAC/s/W, compared to 31.3 GMAC/s/W for the full pipeline in Tbl. 3. The most efficient configuration meets the real-time constraint, and the entire network runs within 0.511 s.
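The relation between the columns of Tbl. 3 is simple: throughput is the number of MACs divided by the execution time, and efficiency in MAC/s/W equals MACs per joule. The short sketch below applies this to the totals of Tbl. 3, once for the full pipeline and once with the Mel bins row subtracted; the helper name is hypothetical.

```python
def throughput_and_efficiency(macs, time_s, energy_j):
    """Return (MAC/s, MAC/s/W); the latter is the same as MACs per joule."""
    return macs / time_s, macs / energy_j

# Totals from Tbl. 3 (882 MMAC, 588.0 ms, 28.18 mJ)
full_pipeline = throughput_and_efficiency(882e6, 0.588, 28.18e-3)

# Same totals without the Mel bins row (77.0 ms, 2.64 mJ)
network_only = throughput_and_efficiency(882e6, 0.588 - 0.077, 28.18e-3 - 2.64e-3)

print(f"full pipeline: {full_pipeline[1] / 1e9:.1f} GMAC/s/W")
print(f"network only:  {network_only[1] / 1e9:.1f} GMAC/s/W")
```

Subtracting the preprocessing also gives the 511 ms and 25.54 mJ figures for the network on its own.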

To further investigate the improvement in throughput and energy efficiency enabled by the GAP8 SoC, we have also implemented the BNN on the STM32F469I Discovery board. Fig. 3 gives an overview of the improvements of the GAP8 implementations compared to the single-core ARM Cortex-M4F implementation, which has popcount implemented in software. We port the SW-popcount (i.e., 12 cycles) to GAP8 and run the code on a single core and on all 8 cores. When both platforms run the BNN on a single core without HW-popcount, GAP8 shows a 7.9× better energy efficiency than the STM32F469I, but a 1.6× lower throughput due to the higher operating frequency of the ARM core. Enabling the HW-popcount gives a significant improvement in energy efficiency (2.8×) and computation speed (4.3×). Running the BNN on all 8 cores gives an improvement of 6.9× in throughput and 2.4× in energy efficiency. Finally, the popcount ISA extension gives another boost of  $2.4 \times$  and  $2.6 \times$ , respectively.

Overall the GAP8 implementation that uses all the functionality of the core (i.e., popcount instruction and multi-core) is  $10\times$  faster and  $51\times$  more efficient than running the same network on the Cortex-M4F.
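One plausible reading of how the individual factors combine into the overall figures is sketched below. It assumes that the steps chain multiplicatively in the order single core with SW-popcount, then all 8 cores, then HW-popcount; this ordering is an interpretation of the text, and the inputs are the rounded factors quoted above, so the products only approximate the reported overall improvements.

```python
# (step, throughput factor, energy-efficiency factor); the ordering is an assumption
STEPS = [
    ("GAP8 single core, SW-popcount vs. Cortex-M4F", 1 / 1.6, 7.9),
    ("all 8 cores vs. single core", 6.9, 2.4),
    ("HW-popcount on 8 cores", 2.4, 2.6),
]

def cumulative(steps):
    """Multiply the per-step factors into overall improvement factors."""
    speed, efficiency = 1.0, 1.0
    for _, s, e in steps:
        speed *= s
        efficiency *= e
    return speed, efficiency

speed, efficiency = cumulative(STEPS)
print(f"throughput: {speed:.1f}x, energy efficiency: {efficiency:.1f}x")
```

The products land close to the reported overall factors; the remaining gap is presumably due to rounding in the individual factors.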



Figure 4: Power trace of running the BNN on one tile on the GAP8 platform.

Fig. 4 shows the power trace of the layers in the same setup as in Tbl. 3. As described in Sec. 5.2, we split the input data into tiles to match the memory constraints. The trace refers to one tile out of four; thus, the execution time is approximately one-fourth of the one presented in Tbl. 3. Between layers, the FC takes over from the cluster to configure the next layer: it switches the input and output buffers, allocates memory for the next weights, configures the DMA, and so on. This behavior is visible as drops in the power trace, because the cluster is asleep and the activity of the FC consumes less power. Similar behavior can be observed inside the binary layers, where the processing is split into chunks of 32 channels.

## 6 CONCLUSIONS

Starting from the best-performing DNN for sound event detection on our target dataset, we have proposed and trained a DNN with the same topology but binary weights and activations. The proposed BNN meets the memory and resource constraints of the milliwatt-range target embedded platform. The resulting BNN has an accuracy of 77.9%, a drop of 7.2 percentage points from the full-precision baseline, which is in line with similar state-of-the-art BNNs/TNNs (i.e., 6.5-19%). The overall program requires 262 kB of RAM, 3.9× less than the system using the 16-bit quantized baseline CNN. Due to this compression, the network fits on the GAP8 PULP platform. We evaluated the energy efficiency with experimental measurements of the power consumption of the full system. The classification of 3.2 s of audio requires 511 ms and 25.54 mJ, with a peak energy efficiency of 67.1 GMAC/s/W and an average of 34.5 GMAC/s/W. The GAP8 implementation has been shown to be  $10 \times$  faster and 51× more energy-efficient than the one on an ARM Cortex-M4F platform, which comes from the multi-core capabilities (i.e., 4.3/19.3×) and the built-in popcount instruction (i.e., 2.4/2.6×).

## **ACKNOWLEDGMENTS**

This work was in part funded by the U.S. Office of Naval Research Global under the project ONRG - NICOP - N62909-19-1-2018, "Zeropower sensing for underwater monitoring."

## REFERENCES

- [1] Massimo Alioto. 2017. IoT: bird's eye view, megatrends and perspectives. In Enabling the Internet of Things. Springer.
- [2] Gianmarco Cerutti et al. 2019. Neural network distillation on IoT platforms for sound event detection. In Proc. INTERSPEECH, Vol. 2019-Septe.
- [3] Gianmarco Cerutti et al. 2020. Compact recurrent neural networks for acoustic event detection on low-energy low-complexity platforms. IEEE JSTSP (2020).
- [4] Francesco Conti et al. 2016. Enabling the heterogeneous accelerator model on ultra-low power microcontroller platforms. In Proc. IEEE DATE.
- [5] Matthieu Courbariaux et al. 2016. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. In arXiv:1602.02830.
- [6] Jason Cramer et al. 2019. Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings. In ICASSP, Vol. 2019-May.
- [7] Ronald G Dreslinski et al. 2010. Near-threshold computing: Reclaiming Moore's law through energy efficient ICs. Proc. IEEE 98, 2 (2010).
- [8] Elisabetta Farella et al. 2017. Technologies for a thing-centric internet of things. In Proc. IEEE FiCloud.
- [9] Eric Flamand et al. 2018. GAP-8: A RISC-V SoC for AI at the Edge of the IoT. In ASAP, IEEE.
- [10] Angelo Garofalo et al. 2020. PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors. Philos. Trans. R. Soc. A 378, 2164 (2020).
- [11] Luigi Gerosa et al. 2007. Scream and gunshot detection in noisy environments. In Proc. IEEE EUSIPCO.
- [12] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. 2013. Internet of Things (IoT): A vision, architectural elements, and future directions. Future generation computer systems 29, 7 (2013), 1645–1660.
- [13] Yihui He et al. 2017. Channel pruning for accelerating very deep neural networks. In Proc. ICCV.
- [14] Shawn Hershey et al. 2017. CNN architectures for large-scale audio classification. In ICASSP. IEEE.
- [15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 (2015).
- [16] Itay Hubara et al. 2016. Binarized neural networks. In Adv. NIPS. 4107–4115.
- [17] Forrest N Iandola et al. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360 (2016).
- [18] Benoit Jacob et al. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. IEEE CVPR.
- [19] Aditya Kusupati et al. 2018. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In NIPS, Vol. 2018-Decem.
- [20] Liangzhen Lai et al. 2018. CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs. arXiv:1801.06601 (2018).
- [21] Darryl D. Lin et al. 2016. Fixed point quantization of deep convolutional networks. In Proc. ICML, Vol. 6.
- [22] Qi Meng et al. 2015. The influence of crowd density on the sound environment of commercial pedestrian streets. Sci. Total Environ. 511 (2015).
- [23] Annamaria Mesaros et al. 2010. Acoustic event detection in real life recordings. In Proc. IEEE EUSIPCO.
- [24] Annamaria Mesaros et al. 2017. DCASE 2017 Challenge setup: Tasks, datasets and baseline system. In Proc. DCASE.
- [25] Matthias Meyer et al. 2017. Efficient Convolutional Neural Network For Audio Event Detection. arXiv:1709.09888 (2017).
- [26] Bert Moons et al. 2018. BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip. In Proc. IEEE CICC.
- [27] Yueyue Na et al. 2015. An acoustic traffic monitoring system: Design and implementation. In Proc. IEEE UIC-ATC-ScalCom.
- [28] Daniele Palossi et al. 2019. A 64-mW DNN-Based Visual Navigation Engine for Autonomous Nano-Drones. IEEE IoT Journal 6, 5 (2019).
- [29] Gopika Premsankar et al. 2018. Edge computing for the Internet of Things: A case study. IEEE IoT Journal 5, 2 (2018), 1275–1284.
- [30] Mohammad Rastegari et al. 2016. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV (2016).
- [31] Lukas Sigrist et al. 2017. Measurement and Validation of Energy Harvesting IoT Devices. In Proc. IEEE DATE.
- [32] Matteo Spallanzani et al. 2019. Additive noise annealing and approximation properties of quantized neural networks. arXiv:1905.10452 (2019).
- [33] Naoya Takahashi et al. 2016. Deep convolutional neural networks and data augmentation for acoustic event recognition. In INTERSPEECH.
- [34] Andrey Temko et al. 2007. CLEAR evaluation of acoustic event detection and classification systems. In LNCS, Vol. 4122 LNCS.
- [35] Sebastian Uziel et al. 2013. Networked embedded acoustic processing system for smart building applications. In DASIP. IEEE, 349–350.
- [36] VentureBeat.com. 2018. GreenWaves Technologies unveils Gap8 processor for AI at the edge.
- [37] Van-Thinh Vu, François Brémond, Gabriele Davini, Monique Thonnat, Quoc-Cuong Pham, Nicolas Allezard, Patrick Sayd, J Rouas, Sébastien Ambellouis, and Amaury Flancquart. 2006. Audio-video event recognition system for public transport security. (2006).
- [38] Xiaying Wang et al. 2019. FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things. arXiv:1911.03314 (2019).
- [39] Yundong Zhang et al. 2017. Hello Edge: Keyword Spotting on Microcontrollers. arXiv:1711.07128 (2017).
- [40] Shuchang Zhou et al. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 (2016).
- [41] Bohan Zhuang et al. 2019. Structured binary neural networks for accurate image classification and semantic segmentation. In IEEE CVPR.
- [42] Xiaodan Zhuang et al. 2010. Real-world acoustic event detection. Pattern Recognit. Lett. 31, 12 (2010).