
Conference Paper · July 2019
DOI: 10.1109/SENSORSNANO44414.2019.8940077
Publication page: https://www.researchgate.net/publication/338368907


Signal Classification Using Deep Learning
Hiromitsu Nishizaki Koji Makino
Graduate School of Interdisciplinary Research Graduate School of Interdisciplinary Research
University of Yamanashi University of Yamanashi
4-3-11 Takeda, Kofu-shi, Japan 4-3-11 Takeda, Kofu-shi, Japan
[email protected] [email protected]

Abstract—Internet-of-Things (IoT) devices have rapidly become important in understanding conditions in an environment. The sensed data from an IoT (or sensor) device generally form a time sequential signal whose values vary with time. This study describes time sequential signal processing using a recurrent-based neural network and particularly focuses on two sorts of signal classification tasks: sound classification and tennis swing motion classification. We introduce these classification tasks and their evaluation results using recurrent neural networks. The experimental results show that the recurrent neural networks could classify the signals well. Moreover, bi-directional analysis is critical to achieving high-performance classification.

Index Terms—deep learning, neural network, signal processing, signal classification

I. INTRODUCTION

We have recently become capable of obtaining sensing data from various Internet-of-Things (IoT) devices in real time and storing them. The sensing (signal) data from these devices are then analyzed, and the analysis results are utilized for various purposes. The deep learning framework has recently received much attention. Deep learning is a very effective technology for understanding various sorts of data: object recognition [1], [2] in photos and movies, speech recognition of utterances [3], and machine translation [4] have all been advanced with the deep learning framework, and many research papers on deep learning are published every year.

This study focuses on signal processing in a deep learning framework. We particularly deal with two types of signal processing tasks: sound classification and tennis swing motion classification.

Sound data are generally made by capturing, with a microphone sensor device, an oscillation wave that travels through the air from a sound source such as a human mouth. The analog signal sensed by a microphone is translated into a digital signal at a suitable sampling frequency and quantization bit rate. Fig. 1 shows an example of a sound waveform. Similar to sound, the sensing data from an IoT device, such as an electroencephalograph or a thermometer, are time sequential. Therefore, such sensed data can also be represented as in Fig. 1.

Fig. 1. Example of a sound waveform.

A signal waveform is characterized by its contour at a certain time; in other words, the waveform varies with time. Therefore, when classifying a signal waveform, we should consider the continuous change of the waveform over time. To deal with sound classification with deep learning, we should adopt a neural network that can consider time sequential change. The recurrent neural network (RNN) is very suitable for dealing with time sequential data such as signal waveforms.

Accordingly, many studies deal with signal processing using RNNs [5]–[9]. The audio processing research field focuses on acoustic event detection tasks; for example, the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges¹ have been held every year since 2014, and many RNN-based techniques were proposed in previous DCASE workshops [8], [9]. Meanwhile, in the music information processing research field, the music classification task [7], [10] is a well-known signal classification task, and many RNN-based approaches have been proposed for it [11], [12]. Apart from these, many other signal processing tasks have been proposed, and various signal datasets have been released on Kaggle² and other websites.

We pick up the following two signal classification tasks: sound classification and tennis swing motion classification. We also propose RNN-based approaches to deal with these signal classification tasks.

¹http://dcase.community
²https://www.kaggle.com/datasets
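To illustrate why a recurrent network fits time sequential signals, the following is a minimal, untrained Elman-style RNN sketch in NumPy. This is not the Chainer model used in this paper, and the dimensions are toy values chosen for illustration; only the recurrence itself (the hidden state carrying history from step to step) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the paper's 512-dim spectrum frames
# would flow through the same recurrence.
input_dim, hidden_dim, num_classes, T = 4, 8, 6, 10

# Randomly initialized weights of a plain (Elman) RNN cell.
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, num_classes))

def classify_sequence(x_seq):
    """Run the recurrence over time and classify from the last hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in x_seq:                       # h carries the signal's history
        h = np.tanh(x_t @ W_xh + h @ W_hh)  # across time steps
    logits = h @ W_hy
    return int(np.argmax(logits))

x_seq = rng.normal(size=(T, input_dim))     # a toy time sequential signal
label = classify_sequence(x_seq)
print(label)
```

An LSTM replaces the `tanh` update with gated memory cells, which is what makes the longer-span history in the paper's models possible.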

978-1-5386-5619-8/19/$31.00 ©2019 IEEE


II. SNACK SOUND CLASSIFICATION

A. Dataset

The first signal processing task was the sound classification (snack name identification) task. We focused on a small dataset containing sounds recorded by shaking a bag of snacks. The dataset included six types of snack sounds recorded with five sorts of microphone devices. Fig. 2 shows the six sorts of popular Japanese snacks used in the dataset; all of them are sold in Japan.

[Fig. 2: photos of the six snacks.]
Fig. 2. Six sorts of snacks in the dataset for the sound classification task. Top left: Tongari Corn; top middle: Kappa Ebisen; top right: Babystar; bottom left: Sapporo Potato Vegetable; bottom middle: Sapporo Potato BBQ; and bottom right: Calbee Potato Chips.

Table I lists the microphone devices used for recording the snack sounds. The duration of each recorded snack sound by a microphone device was 300 s. Each sound was segmented into 150 WAVE files, each with a 2 s duration. The sampling frequency and the quantization bit rate were 44.1 kHz and 16 bits/sample, respectively. A total of 900 WAVE files for each microphone device were used for the neural network training. For the evaluation, we also prepared 60 s of snack sound for each snack and microphone device; each sound was segmented into 30 WAVE files for testing.

TABLE I. FIVE SORTS OF MICROPHONE DEVICES
Mic. ID   Mic. type             Model number
01        Microphone            JVC-KENWOOD MZ-V8
02        USB mic               BUFFALO BSHSM05KB
03        Laptop built-in mic   FMVS90PWD1
04        Voice recorder        Olympus DS-850
05        Smartphone            VAIO VPA0511S

This snack sound classification task was tough for humans. Fig. 3 shows examples of the power spectrograms of the snack sounds; as shown in Fig. 3, we can hardly find characteristic differences between them.

[Fig. 3: power spectrogram panels (time [s] × frequency [Hz]) for (a) Calbee Potato Chips, (b) Tongari Corn, (c) Babystar, (d) Sapporo Potato BBQ, (e) Sapporo Potato Vegetable, and (f) Kappa Ebisen.]
Fig. 3. Power spectrograms of the snack sounds.
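The dataset bookkeeping above can be sanity-checked with a few lines of arithmetic; the constant names are ours, not from the paper.

```python
# Segmentation counts for the snack sound dataset described in Sec. II-A.
RECORD_SECONDS_TRAIN = 300   # recording length per snack and microphone
RECORD_SECONDS_TEST = 60     # evaluation recording length
SEGMENT_SECONDS = 2          # duration of each WAVE file
NUM_SNACKS = 6

files_per_recording = RECORD_SECONDS_TRAIN // SEGMENT_SECONDS   # 150 files
train_files_per_mic = files_per_recording * NUM_SNACKS          # 900 files
test_files_per_snack = RECORD_SECONDS_TEST // SEGMENT_SECONDS   # 30 files

print(files_per_recording, train_files_per_mic, test_files_per_snack)
# 150 900 30
```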

B. Neural network model

We used an RNN for the snack sound classification. Fig. 4 depicts the architecture of the RNN-based neural network.

[Fig. 4: a power spectrum sequence X1, X2, …, XT (512 dim., T samples) is fed to an LSTM layer, followed by a fully-connected layer (512) with dropout (rate = 0.5), a second fully-connected layer (512), and an output layer (6) for the six snacks.]
Fig. 4. Architecture of the RNN-based neural network for the snack sound classification task.

The neural network was based on a single layer of long short-term memory (LSTM) [14]. An LSTM is well known to be effective in classifying time sequential data because it can keep a longer-span history than a normal RNN.

Before being input into the neural network, the sound waveform underwent a short-time Fourier transformation with a 1024-point window width to extract a sequence of 512-dimensional power spectrum vectors. Fig. 5 shows this pre-processing procedure.

Each input datum (i.e., the 512-dimensional power spectrum vector at time i) was input to the LSTM layer. The LSTM layer output a 512-dimensional hidden vector at time T, which was input to the first fully-connected (FC) layer. The first FC layer output a 512-dimensional hidden vector before proceeding to the second FC layer. The number of output nodes was six because the number of snacks was six. Dropout [15] with a ratio of 0.5 was applied at the outputs of both the LSTM and the first FC layer. Table II presents the training conditions of the neural network.

The neural network model was implemented using Chainer³ ver. 5.3.0, a Python framework for deep learning.

³https://chainer.org/
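The pre-processing just described (1024-point Hamming window, 10 ms frame shift, power spectrum) can be sketched as follows. This is our NumPy approximation, not the authors' code; in particular, keeping the first 512 rFFT bins is our assumption to match the stated 512-dimensional feature size.

```python
import numpy as np

def sound_to_spectra(waveform, sr=44100, win=1024, shift_ms=10):
    """Short-time Fourier transform pre-processing sketched from Fig. 5:
    Hamming window of 1024 points, 10 ms frame shift, power spectrum."""
    shift = int(sr * shift_ms / 1000)            # 441 samples per 10 ms
    window = np.hamming(win)
    frames = []
    for start in range(0, len(waveform) - win + 1, shift):
        frame = waveform[start:start + win] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2   # power spectrum
        frames.append(spectrum[:512])            # 512 dims, as in the paper
    return np.array(frames)                      # shape: (T, 512)

# A 2 s toy signal at 44.1 kHz, like one segmented WAVE file.
signal = np.sin(2 * np.pi * 440 * np.arange(2 * 44100) / 44100)
spectra = sound_to_spectra(signal)
print(spectra.shape)   # one 512-dim vector per 10 ms frame
```

The resulting (T, 512) matrix is exactly the kind of sequence that the LSTM in Fig. 4 consumes frame by frame.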
[Fig. 5: the sampled waveform (44.1 kHz, 16 bit) is analyzed frame by frame with a Hamming window (width: 1024 points) and a frame shift of 10 ms; an FFT yields a 512-dimensional power spectrum vector for each 10 ms.]
Fig. 5. Pre-process for the sound signal waveform.

TABLE II. TRAINING CONDITIONS OF THE NEURAL NETWORK
Mini-batch size             16
Num. of epochs              50
Activation at the 1st FC    ReLU [16]
Batch normalization [13]    Yes (the LSTM and the 1st FC)
Loss func.                  Softmax cross entropy
Optimizer                   MomentumSGD
Learning rate               0.001

C. Classification result

Table III shows the accuracy rates of the snack sound classification as a confusion matrix between the microphone devices. For example, when the neural network model was trained on the sound signals recorded by microphone 01, the accuracy for sound signals recorded by the same microphone 01 (matched condition) was 85.7%. However, mismatched microphone environments drastically degraded the classification accuracies. Under the matched condition, the RNN-based neural network achieved high sound classification performance. In addition, the accuracy rates improved further when the recorded signals from all the microphone devices were used for model training.

TABLE III. SNACK SOUND CLASSIFICATION ACCURACY RATES [%]
Mic. ID       01 (test)  02 (test)  03 (test)  04 (test)  05 (test)
01 (train)       85.7       21.2       50.6       21.7       28.5
02 (train)       35.4       86.3       18.4       19.3       29.0
03 (train)       24.4       16.2       87.8       30.6       35.6
04 (train)       35.4       17.3       50.2       87.4       25.7
05 (train)       37.6       16.9       29.3       39.2       79.3
All devices      91.1       88.8       94.0       86.7       84.3

III. TENNIS SWING MOTION CLASSIFICATION

A. Dataset

The second task was the tennis swing motion classification. The task was to identify the person swinging a tennis racket. The swing data from a player were obtained with a three-dimensional (3D) accelerometer attached to the player's hand. The accelerometer output the X, Y, and Z directions, which were saved as time sequential data. The sampling frequency was 40 Hz; each sample was therefore a 3D vector. Fig. 6 shows an example of a swing signal.

[Fig. 6: acceleration [G] of the x, y, and z axes, ranging from −2 to 2 G, plotted over 0–6 s.]
Fig. 6. An example of a tennis swing signal.

The number of subjects who swung a racket was 10, and each subject swung the racket 50 times, so the total number of swing signals was 500. The 500 swing signals were separated into 400 signals for training and 100 for testing.

B. Neural network model

We also used an RNN for the classification of the tennis swing motion. Fig. 7 shows the architecture of the RNN-based neural network, which was almost the same as the model for classifying a snack sound signal. The difference between the tennis swing model and the snack sound model was the use of a bi-directional LSTM. In addition, the input signal did not undergo any pre-processing such as the Fourier transformation; the sampled raw data (a sequence of 3D vectors) were directly input to the neural network. The training conditions were almost the same as those in Table II, except that we did not perform any batch normalization. The neural network model was again implemented using Chainer ver. 5.3.0.

C. Classification result

Table IV shows the classification results of the tennis swing signals, where A–J denote the subject IDs. The table is a confusion matrix in which the subject ID on the horizontal axis indicates whose tennis swing signal was input to the neural network, while the IDs on the vertical axis denote the output of the neural network.

Table IV illustrates that most tennis swing signals can be correctly classified using the bi-directional RNN model. The accuracy rate over all the subjects' swing signals was 97% (97/100).
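The bi-directional analysis can be sketched schematically as follows. Here a plain recurrent cell (NumPy, untrained) stands in for the LSTM, and the final states of the forward and backward passes are concatenated, as in the paper's Fig. 7; all sizes are toy values of our choosing.

```python
import numpy as np

rng = np.random.default_rng(1)
in_dim, hid = 3, 16        # 3D accelerometer samples, toy hidden size
W_x = rng.normal(scale=0.1, size=(in_dim, hid))
W_h = rng.normal(scale=0.1, size=(hid, hid))

def run_rnn(seq):
    """Plain recurrent pass over a sequence; returns the final hidden state."""
    h = np.zeros(hid)
    for x_t in seq:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

def bidirectional_features(seq):
    """Concatenate a forward pass and a backward pass (reversed time axis),
    so the classifier sees the swing from both ends of the time axis."""
    return np.concatenate([run_rnn(seq), run_rnn(seq[::-1])])

swing = rng.normal(size=(240, in_dim))   # ~6 s of 40 Hz 3D samples
feat = bidirectional_features(swing)
print(feat.shape)   # twice the hidden size
```

The doubled feature captures both the start and the end of the motion with equal fidelity, which is one intuition for why the bi-directional model outperforms the uni-directional one on this task.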
We also applied the uni-directional LSTM model of Fig. 4 to the tennis swing classification task. The accuracy rate for all the subjects was 83% (83/100), which is substantially worse than that of the bi-directional LSTM model. The tennis swing classification was a comparatively hard task compared to the snack sound classification; however, the bi-directional LSTM model could capture the characteristics of a tennis swing signal.

[Fig. 7: tennis swing data (X1, Y1, Z1), …, (XT, YT, ZT) (T samples) are fed to a forward LSTM and a backward LSTM (128 each), each followed by dropout (rate = 0.5); their outputs feed a fully-connected layer (128) with dropout (rate = 0.5), a second fully-connected layer, and an output layer (10) for the 10 players.]
Fig. 7. Architecture of the RNN-based neural network for the tennis swing motion classification.

TABLE IV. CLASSIFICATION MATRIX AMONG SUBJECTS
Subject ID   A   B   C   D   E   F   G   H   I   J
A            9   0   0   0   0   0   0   1   0   0
B            0   9   0   0   0   0   1   0   0   0
C            0   0  10   0   0   0   0   0   0   0
D            0   0   0  10   0   0   0   0   0   0
E            0   0   0   0  10   0   0   0   0   0
F            0   0   0   0   0  10   0   0   0   0
G            0   0   0   0   0   0   9   1   0   0
H            0   0   0   0   0   0   0  10   0   0
I            0   0   0   0   0   0   0   0  10   0
J            0   0   0   0   0   0   0   0   0  10

IV. CONCLUSIONS

This study introduced signal classification tasks using the deep learning framework. We showed that the recurrent-based neural network was very effective in understanding and classifying signals. In particular, the bi-directional LSTM realized robust classification.

Although we dealt with the snack sound classification and the tennis swing classification, the RNN-based model can be widely applied to time sequential data from various IoT devices. In future work, we will develop an environment-understanding system for in-home [17] or forest [18] environments with IoT devices and the deep learning framework.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (B) Grant Number 17H01977.

REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
[3] G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, 2018.
[5] J. Nam, K. Choi, J. Lee, S.-Y. Chou, and Y.-H. Yang, "Deep Learning for Audio-Based Music Classification and Tagging: Teaching Computers to Distinguish Rock from Bach," IEEE Signal Process. Mag., vol. 36, no. 1, pp. 41–51, 2019.
[6] J. Lee, J. Park, K. L. Kim, and J. Nam, "Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms," in the Sound and Music Computing Conference, 2017, pp. 220–226.
[7] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
[8] I. Himawan, M. Towsey, and P. Roe, "3D convolutional recurrent neural networks for bird sound detection," in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2018), 2018.
[9] Y. Guo, M. Xu, J. Wu, Y. Wang, and K. Hoashi, "Multi-scale convolutional recurrent neural network with ensemble method for weakly labeled sound event detection," in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2018), 2018.
[10] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, "FMA: A dataset for music analysis," in 18th International Society for Music Information Retrieval Conference, 2017, pp. 316–323.
[11] K. Choi, G. Fazekas, K. Cho, and M. Sandler, "Convolutional recurrent neural networks for music classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2392–2396.
[12] J. Dai, S. Liang, W. Xue, C. Ni, and W. Liu, "Long Short-term Memory Recurrent Neural Network based Segment Features for Music Genre Classification," in 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2016, pp. 1–5.
[13] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167v3, 2016.
[14] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[16] X. Glorot, A. Bordes, and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, vol. 15, pp. 315–323.
[17] A. S. Abdull Sukor, A. Zakaria, N. A. Rahim, L. M. Kamarudin, R. Setchi, and H. Nishizaki, "A hybrid approach of knowledge-driven and data-driven reasoning for activity recognition in smart homes," J. Intell. Fuzzy Syst., vol. 36, no. 5, pp. 4177–4188, May 2019.
[18] R. Gunasagaran, L. M. Kamarudin, E. Kanagaraj, A. Zakaria, and A. Y. M. Shakaff, "Internet of Things: Solar power under forest canopy," in 2016 IEEE Student Conference on Research and Development (SCOReD), 2016.
