All content following this page was uploaded by Hiromitsu Nishizaki on 15 March 2021.
Abstract—Internet-of-Things (IoT) devices have rapidly become important in understanding conditions in an environment. The sensed data from an IoT (or sensor) device generally form a time sequential signal whose values vary with time. This study describes time sequential signal processing using a recurrent-based neural network and particularly focuses on two sorts of signal classification tasks: a sound classification and a tennis swing motion classification. We introduce these classification tasks and their evaluation results using recurrent neural networks. The experimental results show that the recurrent neural networks could classify the signals well. Moreover, the bi-directional analysis is critical to achieving high-performance classification.

Index Terms—deep learning, neural network, signal processing, signal classification

I. INTRODUCTION

We have recently become capable of obtaining sensing data from various Internet-of-Things (IoT) devices in real time and storing them. The sensing (signal) data from these devices are then analyzed, and the analysis results are utilized for various purposes. The deep learning framework has recently received much attention. Deep learning is a very effective technology for understanding various sorts of data. Object recognition [1], [2] on a photo/movie, speech recognition of an utterance [3], and machine translation [4] have been developed by using the deep learning framework. Many research papers on deep learning are published every year.

This study focuses on signal processing in a deep learning framework among various media. We particularly deal with two types of signal processing tasks: sound classification and tennis swing motion classification.

Sound data are generally made by capturing, with a microphone sensor device, an oscillation wave that travels through the air from a sound source, such as a human mouth. The analog signal sensed by a microphone is translated into a digital signal at a suitable sampling frequency and quantization bit rate. Fig. 1 shows an example of a sound waveform. Similar to sound, the sensing data from an IoT device, such as an electroencephalograph or a thermometer, are time sequential. Therefore, these sensed data can also be represented as in Fig. 1.

Fig. 1. Example of a sound waveform.

A signal waveform is characterized by the wave contour at a certain time. In other words, the waveform varies depending on the time. Therefore, when classifying a signal waveform, we should consider the continuous change of the waveform over time. To handle such data with deep learning, we should adopt a neural network that can consider a time sequential change. The recurrent neural network (RNN) is very suitable for dealing with time sequential data, such as signal waveforms.

Therefore, many studies deal with signal processing using RNNs [5]-[9]. The audio processing research field focuses on acoustic event detection tasks. For example, the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges (http://dcase.community) have been held every year since 2014. Many techniques with RNNs were proposed in the previous DCASE workshops [8], [9]. Meanwhile, in the music information processing research field, the music classification task [7], [10] is a well-known signal classification task, and many RNN-based approaches have been proposed for it [11], [12]. Apart from that, many other tasks have been proposed for signal processing, and various sorts of signal datasets have been released on Kaggle and other websites.

We pick up the following two signal classification tasks: sound classification and tennis swing motion classification. We also propose the usage of RNN-based approaches to deal with these signal classification tasks herein.

II. SNACK SOUND CLASSIFICATION

A. Dataset

The first signal processing task was the sound classification (snack name identification) task. We focused on a small dataset containing sounds recorded by shaking a bag of snacks. This dataset included six types of snack sound recorded with five sorts of microphone devices. Fig. 2 shows the six sorts of popular Japanese snacks used in the dataset. All of them are sold in Japan.

Table I lists the microphone devices used for recording the snack sounds. The duration of each recorded snack [...]
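The matched/unmatched-microphone experiments reported later require per-microphone train/test splits of these recordings. A minimal sketch of that bookkeeping, where the record layout, the 80/20 ratio, and the helper name `split_by_microphone` are illustrative assumptions and not details from the paper:

```python
import random
from collections import defaultdict

def split_by_microphone(records, train_ratio=0.8, seed=0):
    """Group (snack_label, mic_id, clip) records by microphone and split each
    group into train/test portions, so a model can be trained on one device's
    recordings and tested on another's (matched vs. unmatched conditions).
    The record layout and split ratio are hypothetical."""
    by_mic = defaultdict(list)
    for snack, mic, clip in records:
        by_mic[mic].append((snack, clip))
    rng = random.Random(seed)          # fixed seed -> reproducible split
    splits = {}
    for mic, recs in sorted(by_mic.items()):
        recs = list(recs)
        rng.shuffle(recs)
        cut = int(len(recs) * train_ratio)
        splits[mic] = {"train": recs[:cut], "test": recs[cut:]}
    return splits
```

Training on one microphone's "train" portion and evaluating on another microphone's "test" portion then corresponds to the unmatched condition examined in the experiments.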
Fig. 2. Six sorts of snacks in the dataset for the sound classification task. Top left: Tongari Corn; top middle: Kappa Ebisen; top right: Babystar; bottom left: Sapporo Potato Vegetable; bottom middle: Sapporo Potato BBQ; bottom right: Calbee Potato Chips.

Fig. 3. [Spectrograms of the recorded snack sounds (time [s] vs. frequency [Hz]): (a) Calbee Potato Chips; (b) Tongari Corn; (c) Babystar; (d) Sapporo Potato BBQ.]
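Spectrograms such as those in the figure above can be computed by framing the waveform, windowing each frame, and taking the squared magnitude of its FFT. A minimal numpy sketch; the sampling rate, 25 ms window, and Hann windowing are assumptions (only the 512-bin, 10 ms framing is suggested by the paper's pre-processing figure):

```python
import numpy as np

def spectrogram(wave, sr=16000, win_ms=25, hop_ms=10, n_bins=512):
    """Power spectrogram: slide a Hann window over the waveform every hop_ms
    and take the squared magnitude of each frame's FFT.  Returns a
    (num_frames, n_bins) time-frequency matrix."""
    win = int(sr * win_ms / 1000)      # samples per analysis window
    hop = int(sr * hop_ms / 1000)      # samples between successive frames
    window = np.hanning(win)
    frames = [wave[s:s + win] * window
              for s in range(0, len(wave) - win + 1, hop)]
    # 2 * n_bins-point FFT with zero padding; keep n_bins positive-frequency bins
    spec = np.abs(np.fft.rfft(frames, n=2 * n_bins, axis=1)) ** 2
    return spec[:, :n_bins]
```

For a 440 Hz tone sampled at 16 kHz, the energy concentrates near bin 28 of each frame (bin spacing 15.625 Hz), which is the kind of time-frequency pattern the spectrogram panels visualize.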
TABLE I. FIVE SORTS OF MICROPHONE DEVICES

Mic. ID | Mic. type           | Model number
01      | Microphone          | JVC-KENWOOD MZ-V8
02      | USB mic             | BUFFALO BSHSM05KB
03      | Laptop built-in mic | FMVS90PWD1
04      | Voice recorder      | Olympus DS-850
05      | Smartphone          | VAIO VPA0511S

B. Neural network model

Fig. 4. Architecture of the RNN-based neural network for the snack sound classification task. [A power spectrum sequence of 512-dim. vectors X1, X2, ..., XT (num. of samples is T) is fed to an LSTM layer (512 units).]

Fig. 5. Pre-process for the sound signal waveform. [The waveform is converted into a 512-dim. power spectrum vector for each 10 ms.]
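The pipeline of Figs. 4 and 5, a power-spectrum sequence fed frame by frame to an LSTM whose final state drives a fully-connected softmax layer, was implemented by the authors in Chainer. To make the recurrence explicit, here is a framework-free numpy sketch of the forward pass only; the gate layout, the single FC layer, and the small dimensions in the test are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_forward(xs, W, U, b):
    """Run one uni-directional LSTM pass over a (T, d_in) sequence and return
    the final hidden state.  W: (4H, d_in), U: (4H, H), b: (4H,).
    Assumed gate order in the stacked weights: input, forget, cell, output."""
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in xs:                       # one spectrum frame at a time
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell-state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
    return h

def classify(xs, params):
    """LSTM over the spectrum sequence, then a fully-connected softmax layer."""
    h = lstm_forward(xs, params["W"], params["U"], params["b"])
    logits = params["Wfc"] @ h + params["bfc"]
    e = np.exp(logits - logits.max())                  # numerically stable softmax
    return e / e.sum()
```

With random weights this only demonstrates the shapes and the recurrence; actual training (softmax cross entropy, MomentumSGD, as in Table II) would be done in a framework such as Chainer.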
TABLE II. TRAINING CONDITION OF THE NEURAL NETWORK

Mini-batch size          | 16
Num. of epochs           | 50
Activation at the 1st FC | ReLU [16]
Batch normalization [13] | Yes (the LSTM and the 1st FC)
Loss func.               | Softmax cross entropy
Optimizer                | MomentumSGD
Learning rate            | 0.001

Fig. 6. An example of a tennis swing signal (x/y/z acceleration [G] over time [sec]).

C. Classification result

Table III shows the accuracy rates of the snack sound classification as a confusion matrix over the microphone devices. For example, when the neural network model was trained on the sound signals recorded by microphone 01, the accuracy for the sound signals recorded by the same microphone 01 (matched condition) was 85.7%. However, a mismatch between the training and testing microphone environments drastically degraded the classification accuracies. Under the matched condition of the microphone environments, the RNN-based neural network achieved high sound classification performance. In addition, the accuracy rates improved further when the recorded signals from all the microphone devices were used for model training. The neural network model was implemented using Chainer ver. 5.3.0.

TABLE III. SNACK SOUND CLASSIFICATION ACCURACY RATES [%]

Mic. ID     | 01 (test) | 02 (test) | 03 (test) | 04 (test) | 05 (test)
01 (train)  |   85.7    |   21.2    |   50.6    |   21.7    |   28.5
02 (train)  |   35.4    |   86.3    |   18.4    |   19.3    |   29.0
03 (train)  |   24.4    |   16.2    |   87.8    |   30.6    |   35.6
04 (train)  |   35.4    |   17.3    |   50.2    |   87.4    |   25.7
05 (train)  |   37.6    |   16.9    |   29.3    |   39.2    |   79.3
All devices |   91.1    |   88.8    |   94.0    |   86.7    |   84.3

III. TENNIS SWING MOTION CLASSIFICATION

A. Dataset

The second task was the tennis swing motion classification. This task challenged a model to estimate who swung a racket from the recorded swing motion signal. The number of subjects who swung a racket was 10. Each subject swung a racket 50 times, giving 500 swing signals in total. The 500 swing signals were separated into 400 signals for training and 100 signals for testing.

B. Neural network model

We also used an RNN for the classification of the tennis swing motion. Fig. 7 shows the architecture of the RNN-based neural network, which was almost the same as the model for classifying a snack sound signal. The difference between the tennis swing model and the snack sound model was the use of a bi-directional LSTM. In addition, the input signal did not undergo any pre-processing such as the Fourier transformation; the sampled raw data (a sequence of 3D acceleration vectors) were directly input to the neural network. The training conditions were almost the same as those in Table II, except that we did not apply any batch normalization.

C. Classification result

Table IV shows the classification results of the tennis swing signals. A∼J denote the subject IDs. The table is represented as a confusion matrix, in which the subject ID on the horizontal axis depicts whose tennis swing signal was input to the neural network, while the IDs on the vertical axis denote the output of the neural network.

Table IV illustrates that most tennis swing signals can be correctly classified using the bi-directional RNN model. The accuracy rate for all the subjects' swing signals was 97% (97/100). We also applied the same neural network model (the uni-directional LSTM model) in Fig. 4 to the
tennis swing classification task. The accuracy rate for all the subjects was 83% (83/100), which is substantially worse than that of the bi-directional LSTM model. The tennis swing classification was a comparatively hard task compared to the snack sound classification; nevertheless, the bi-directional LSTM model can capture the characteristics of a tennis swing signal.

Fig. 7. Architecture of the RNN-based neural network for the tennis swing motion classification. [Bottom to top: tennis swing data, a sequence of T acceleration samples (X1, Y1, Z1), ..., (XT, YT, ZT); forward and backward LSTM layers (128 units each); dropout (rate = 0.5) on each direction; fully-connected layer (128); dropout (rate = 0.5); fully-connected layer; output layer over the 10 players.]

TABLE IV. CLASSIFICATION MATRIX AMONG SUBJECTS

Subject ID | A | B | C  | D  | E  | F  | G | H  | I  | J
A          | 9 | 0 | 0  | 0  | 0  | 0  | 0 | 1  | 0  | 0
B          | 0 | 9 | 0  | 0  | 0  | 0  | 1 | 0  | 0  | 0
C          | 0 | 0 | 10 | 0  | 0  | 0  | 0 | 0  | 0  | 0
D          | 0 | 0 | 0  | 10 | 0  | 0  | 0 | 0  | 0  | 0
E          | 0 | 0 | 0  | 0  | 10 | 0  | 0 | 0  | 0  | 0
F          | 0 | 0 | 0  | 0  | 0  | 10 | 0 | 0  | 0  | 0
G          | 0 | 0 | 0  | 0  | 0  | 0  | 9 | 1  | 0  | 0
H          | 0 | 0 | 0  | 0  | 0  | 0  | 0 | 10 | 0  | 0
I          | 0 | 0 | 0  | 0  | 0  | 0  | 0 | 0  | 10 | 0
J          | 0 | 0 | 0  | 0  | 0  | 0  | 0 | 0  | 0  | 10

IV. CONCLUSIONS

This study introduced signal classification tasks using the deep learning framework. We showed that the recurrent-based neural network was very effective in understanding and classifying the signals. In particular, the bi-directional LSTM can realize robust classification.

Although we dealt with the snack sound classification and the tennis swing classification, the RNN-based model can be widely applied to the time sequential data from various IoT devices. In future work, we will develop an environment-understanding system for in-home [17] or forest environments [18] with IoT devices and the deep learning framework.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (B) Grant Number 17H01977.

REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
[3] G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82-97, 2012.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, 2018.
[5] J. Nam, K. Choi, J. Lee, S.-Y. Chou, and Y.-H. Yang, "Deep Learning for Audio-Based Music Classification and Tagging: Teaching Computers to Distinguish Rock from Bach," IEEE Signal Process. Mag., vol. 36, no. 1, pp. 41-51, 2019.
[6] J. Lee, J. Park, K. L. Kim, and J. Nam, "Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms," in Sound and Music Computing Conference, 2017, pp. 220-226.
[7] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.
[8] I. Himawan, M. Towsey, and P. Roe, "3D convolutional recurrent neural networks for bird sound detection," in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2018), 2018.
[9] Y. Guo, M. Xu, J. Wu, Y. Wang, and K. Hoashi, "Multi-scale convolutional recurrent neural network with ensemble method for weakly labeled sound event detection," in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2018), 2018.
[10] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, "FMA: A dataset for music analysis," in 18th International Society for Music Information Retrieval Conference, 2017, pp. 316-323.
[11] K. Choi, G. Fazekas, K. Cho, and M. Sandler, "Convolutional recurrent neural networks for music classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2392-2396.
[12] J. Dai, S. Liang, W. Xue, C. Ni, and W. Liu, "Long Short-term Memory Recurrent Neural Network based Segment Features for Music Genre Classification," in 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2016, pp. 1-5.
[13] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167v3, 2016.
[14] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, 2014.
[16] X. Glorot, A. Bordes, and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, vol. 15, pp. 315-323.
[17] A. S. Abdull Sukor, A. Zakaria, N. A. Rahim, L. M. Kamarudin, R. Setchi, and H. Nishizaki, "A hybrid approach of knowledge-driven and data-driven reasoning for activity recognition in smart homes," J. Intell. Fuzzy Syst., vol. 36, no. 5, pp. 4177-4188, May 2019.
[18] R. Gunasagaran, L. M. Kamarudin, E. Kanagaraj, A. Zakaria, and A. Y. M. Shakaff, "Internet of Things: Solar power under forest canopy," in 2016 IEEE Student Conference on Research and Development (SCOReD), 2016.