
ViT-Based_Multi-Scale_Classification_Using_Digital_Signal_Processing_and_Image_Transformation

This study proposes a ViT-based multi-scale classification methodology for time-series data, addressing challenges like complexity and dynamic variation through digital signal processing (DSP) and image transformation. By extracting features from time-series data and converting them into images, the method utilizes a vision transformer (ViT) for improved classification accuracy compared to traditional models. Experimental results demonstrate the effectiveness of this approach in handling complex patterns and achieving high performance in multi-class classification tasks.

Uploaded by

shirisha edikoju

Received 27 February 2024, accepted 11 April 2024, date of publication 16 April 2024, date of current version 1 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3389808

ViT-Based Multi-Scale Classification Using Digital Signal Processing and Image Transformation
GYU-IL KIM 1 AND KYUNGYONG CHUNG 2
1 Department of Computer Science, Kyonggi University, Suwon 16227, South Korea
2 Division of AI Computer Science and Engineering, Kyonggi University, Suwon 16227, South Korea

Corresponding author: Kyungyong Chung ([email protected])


This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the
Ministry of Education (2020R1A6A1A03040583).

ABSTRACT The classification of time-series data poses difficulties that traditional methodologies
struggle to address, such as complexity and dynamic variation. Difficulty with pattern recognition and
long-term dependency modeling, high dimensionality and complex interactions between variables, and
incompleteness in the form of irregular intervals, missing values, and noise are the main causes of degraded
model performance. Therefore, it is necessary to develop new classification methodologies that effectively
process time-series data and can be applied in real-world settings. Accordingly, this study proposes ViT-based
multi-scale classification using digital signal processing and image transformation. It comprises feature
extraction through digital signal processing (DSP), image transformation, and vision transformer (ViT) based
classification. In the DSP stage, five features are extracted through sampling, quantization, and the
discrete Fourier transform (DFT): the sampling time, the sampled signal, the quantized signal, and the
magnitudes and phases obtained from the DFT. Subsequently, the extracted multi-scale features are used to
generate new images. Finally, a ViT model is applied to the generated images to perform multi-class
classification. This study confirms the superiority of the proposed approach by comparing traditional models
with ViT and convolutional neural network (CNN) models. In particular, by showing excellent classification
performance even for the most challenging classes, it demonstrates effective processing of diverse
data. Ultimately, this study suggests a methodology for the analysis and classification of time-series
data and shows that it has the potential to be applied to a wide range of data analysis problems.

INDEX TERMS Digital signal processing, image transformation, multi-class classification, multiscale, time
series, vision transformer.

The associate editor coordinating the review of this manuscript and approving it for publication was Tianhua Xu.

2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

VOLUME 12, 2024 58625
G.-I. Kim, K. Chung: ViT-Based Multi-Scale Classification Using DSP and Image Transformation

I. INTRODUCTION
The classification of time-series data is an important research topic in various domains [1]. However, the inherent complexity and dynamic variation of time-series data pose several challenges when traditional classification methodologies are used [2]. For instance, recognizing patterns over time or modeling long-term dependencies is a highly challenging task, which directly impacts classification accuracy. Additionally, time-series data often have high dimensionality and complex interactions between variables, so existing models have difficulty effectively handling and learning from the data [3]. Moreover, real-world time-series data may be incomplete, with irregular intervals, missing values, and noise, which are the main contributors to model performance degradation [4]. Therefore, developing new classification methodologies that can effectively handle the complexity of time-series data and be applied to real-world problems in various domains has emerged as a significant concern in both research and industry.
The approach of transforming time-series data into images offers advantages in overcoming these challenges. Through image transformation, it is possible to visually express the temporal patterns and characteristics of time-series data. This means that deep learning models, especially CNN or ViT models, which have demonstrated strong performance in image analysis, can also be applied to time-series data. Furthermore, the transformation into images allows for a more effective capture of multi-scale information and complex patterns and helps a model learn them. With the recent advancement of deep learning and DSP technologies, the paradigm of time-series data analysis is being re-established [5].
Therefore, this study proposes a new methodology aimed at overcoming the limitations of traditional time-series data classification and making the most of the benefits of image transformation. The proposed methodology consists of three main steps. In the first step, a DSP technique is applied to extract significant features from time-series data. In the second step, based on the extracted features, the data are transformed into images. In the last step, the transformed images are classified through the ViT model. This process is aimed at effectively modeling the complex patterns and dynamic variations of time-series data and achieving high classification accuracy. By providing the specific methodology of the new approach, experimental results, and the insights gained from them, this study comprehensively evaluates the impact of the convergence of deep learning and DSP technology on the field of time-series data analysis.
The contributions of the proposed method in this study are as follows:
• Proposing a new approach through the fusion of signal processing and image conversion. By combining DSP and image conversion techniques, we propose a new methodology to extract deeper features from time series data and visualize them effectively. This approach provides a new perspective on time series data analysis and explores applicability in a variety of fields.
• Development of a high-performance multi-class classification model using ViT. By applying the latest deep learning model, ViT, to data classification, higher accuracy and efficiency are achieved compared to existing models. This represents a significant technological advance in recognizing and classifying complex patterns in time series data.
• Verification of the excellence of the methodology through performance comparison with various base models. Through comparison with traditional time series analysis models such as long short-term memory (LSTM), gated recurrent unit (GRU), and temporal convolutional network (TCN), the performance of this research methodology is objectively verified. This demonstrates the excellence of the proposed model and sets a new standard contributing to the field of time series data analysis.
• Detailed analysis of the model through performance optimization and hyperparameter adjustment. Through various hyperparameter settings and performance optimization experiments, model performance is further improved and in-depth analysis is provided. Through this, we increase understanding of the model and suggest guidelines for future research and application.
The organization of this paper is as follows. Section II describes basic and advanced time-series classification. Section III discusses DSP, image transformation, and ViT-based multi-scale classification. Section IV presents the results and performance evaluation. Section V concludes this study.

II. RELATED WORK
Time-series data analysis has advanced beyond traditional techniques into more sophisticated techniques capable of understanding and classifying even more complex and high-dimensional data structures [6]. This section reviews recent advanced time-series classification techniques and research trends and discusses how they relate to this study.

A. BASIC TIME SERIES CLASSIFICATION
Time-series data analysis has emerged as a significant research topic in various fields. Particularly, time-series classification is applied to a wide range of areas, such as pattern recognition, predictive modeling, and decision-making systems [7]. This section provides an overview of basic time-series classification methodology and introduces important theories and algorithms widely adopted in this field.
Time-series data is defined as a collection of data points observed in a time sequence. Given the data characteristics, it is important to understand the inherent temporal continuity and patterns. Early research in time-series classification mainly focused on statistical methodologies [8]. For instance, Alnaa and Ahiakpor [9] introduced the autoregressive integrated moving average (ARIMA) model for the temporal structure of time-series data. This model transforms non-stationary time-series data into stationary data and thereby supports the prediction of future values. With the development of machine learning, various algorithms have been applied to time-series classification. However, these approaches are limited in capturing nonlinear patterns and are sensitive to model parameter selection. Wagner et al. [10] proposed the Fuzzy DTW and Fuzzy BOSS models by applying the k-NN algorithm to a new time-series classifier. The proposed model effectively handles uncertain labels by applying fuzzy theory to the existing DTW and BOSS algorithms. It shows robust performance even with noisy data and especially outperforms conventional methods in handling uncertain labels. However, the proposed method increases computational costs and is harder to interpret because of the fuzzy labels and complex algorithm structure.
Aside from that, algorithms such as decision tree, support vector machine (SVM), and random forest are widely used in time-series classification. These algorithms learn the inherent patterns and structures of the data and use them to classify the data. Deng et al. [11] proposed a time-series Forest using a random forest algorithm. Based on the statistical features computed at various time intervals, it employs an ensemble of decision trees. This approach effectively classifies features

of time-series data and achieves high classification accuracy even with high-dimensional time-series data. However, considering diverse features and splitting criteria can make the model difficult to interpret.

B. ADVANCED TIME SERIES CLASSIFICATION
Recently, deep learning techniques have been increasingly applied to time-series classification. Karim et al. [12] used LSTM to address time-series data classification. They proposed a novel model combining LSTM with fully convolutional networks. The proposed model makes it possible to achieve high classification performance and minimize data preprocessing. However, its limitations include an increased model size and a model that is harder to interpret. Elsayed et al. [13] utilized a GRU for time-series analysis. GRU handles long-term dependencies in a similar way to LSTM but has a simpler structure, which leads to higher computational efficiency. Therefore, it has lower model complexity than LSTM and allows for faster learning, effectively capturing important information in time-series data. However, compared to LSTM, GRU can be limited in modeling expressiveness, which results in lower performance.
These days, models such as 1D CNNs and TCN demonstrate good performance when applied to time-series classification. Koh et al. [14] used a TCN model to address time-series classification problems. TCN is based on 1D convolutional neural networks and was designed to effectively learn long-term dependencies in time-series data. It employs data processing and connection operations to deliver temporal characteristics of data to deep network layers. In this way, it is possible to effectively capture the temporal structure of time-series data and to improve classification performance. However, it incurs high computational costs and has a high potential for overfitting due to network complexity.
In addition to deep learning techniques, various ensemble methods and feature extraction techniques are being researched to effectively classify the nonlinear patterns of time-series data. For example, Fawaz et al. [15] revealed that ensembling various deep learning models can improve the performance of time-series classification. Moreover, research findings show that it is possible to further enhance the classification performance of models by applying appropriate feature extraction methods.
Based on these advanced time-series classification techniques, this study proposes a new approach by combining DSP and image transformation techniques. This approach is aimed at capturing more detailed and intricate data structures, compared to traditional time-series classification methods, and exploring the potential for effectively classifying them. Therefore, this study is expected to contribute to achieving technical advancements and providing a new perspective in the field of time-series classification.

III. METHODOLOGY
This study consists of a total of 4 stages. In the first stage, the data to be used in experiments are collected. In the second stage, digital signal processing is applied to the collected data. In the third stage, the features extracted through the DSP process are transformed into images. Lastly, in the fourth stage, multi-class classification is performed with the transformed images. Figure 1 shows the overall process of the methodology proposed in this paper.

FIGURE 1. Process of ViT-Based multi-scale classification using digital signal processing and image transformation.

Firstly, significant features are extracted from time-series data through the DSP process [16]. Multidimensional features are extracted through sampling, quantization, and Fourier transform. Based on them, multi-scale characteristics of the data are transformed into images [17], [18], [19]. With the images generated in the process, the complexity and multidimensionality of the data are effectively expressed with the use of RGB channels and a variety of color mapping techniques [20]. By applying multi-scale features to the images, it is possible to effectively address the potential issue of losing features in the course of transforming data into images [21].
Next, the multi-class classification approach using ViT is introduced [22]. Recently, ViT has demonstrated excellent performance in the field of image classification. It utilizes 232 × 5-sized images as inputs and effectively classifies various classes of the data used in the study. This study proves the excellent performance of ViT by comparing it with baseline models such as LSTM, GRU, and TCN [23], [24], [25]. Additionally, hyperparameter tuning experiments are conducted to optimize the performance of ViT and the impact of image size variation is analyzed [26].
Additionally, the reason why ViT models show higher performance than traditional CNN models is analyzed. Aside from that, the proposed methodology is compared with general image transformation methodologies such as spectrograms to clarify its distinctiveness in terms of performance [27].

A. DATA COLLECTION
The experimental data used in this study are measurements of the reflection patterns of electromagnetic waves passing through concrete [28]. Experiments are conducted using a purpose-built mold 3 m in length. The mold is divided into three sections of one meter each. Concrete is placed at both ends, and various mediums are used in the middle section for experimentation. As the medium changes, the reflection patterns of electromagnetic waves also change. Therefore, if materials other than concrete are present, they can be detected from differences in reflection patterns. This enables the identification of defects in concrete structures, such as internal cavities, necking, and cracks.
The data generated from the experiments have a total of eight classes. These include 'void' for the empty air state, 'dry' for the state of filling with dry earth, 'water' for the state of filling with only water, and water with different moisture content percentages: 'water 6%(s6)', 'water 12%(s12)', 'water 18%(s18)', 'water 24%(s24)', and 'water 30%(s30)'. The experiments include reflection coefficient values from about 0 to 82 seconds at an interval of 0.01 seconds. The reflection coefficient is the value of the reflection pattern of electromagnetic waves measured through time domain reflectometry (TDR) in an experimental setting. Each class consists of 2 columns and 8191 rows, which represent time and reflection coefficient. Figure 2 shows the graphs of time and reflection coefficient for the eight classes.

TABLE 1. Experimental data measured by time domain reflectometry.

FIGURE 2. Time and reflection coefficient graphs for 8 classes.

In Figure 2, the horizontal axis represents Time, and the vertical axis represents Reflection Coefficient. At the ends of the graph, similar patterns are observed, but around the middle, where a red circle is present, different patterns emerge. The green graph represents 'void', with a reflection coefficient of about 0.37 at around 40 seconds. In the case of 'dry', the reflection coefficient is about 0.3 at around the same time. 'Dry' and 'void' have similar patterns at around 50 seconds. In addition, the reflection coefficients of the purple graph 'water 12%' and the brown graph 'water 18%' are found to be similar.

B. FEATURE EXTRACTION USING DIGITAL SIGNAL PROCESSING
In this study, in the DSP stage, time-series data are processed
into a suitable format for image transformation. Through the
DSP, a variety of multi-scale information is captured from the
time-series data, and the features to be applied to the image
transformation are extracted. The DSP stage consists of three
steps: sampling, quantization, and Fourier transform.
Algorithm 1 represents DSP, illustrating the process
described in Section III-B. It depicts the processes from data
loading to the creation of five features.

Algorithm 1 Digital Signal Processing
Input: input_path // Paths to csv files
Output: output_path //
procedure DSP(csv_file)
  Load Data:
    dataframe ← pd.read_csv(csv_file)
    time ← first column of dataframe
    signal ← second column of dataframe
  Sampling:
    sampled_df ← select data from dataframe matching specific condition
    sampled_time ← first column of sampled_df
    sampled_signal ← second column of sampled_df
  Quantization:
    quantized_signal ← apply quantization function to (sampled_signal, desired_levels)
  Perform DFT:
    dft_result ← apply DFT to (quantized_signal)
    magnitude ← calculate magnitude from dft_result
    phase ← calculate phase from dft_result
  return sampled_time, sampled_signal, quantized_signal, magnitude, phase
end procedure
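
As a rough illustration of Algorithm 1, the following Python sketch strings the three DSP steps together on in-memory lists. It is a minimal sketch under stated assumptions, not the authors' implementation: the sampling rule (keep every 10th row, which turns a 0.01 s grid into a 0.1 s grid), the uniform quantizer over 16 evenly spaced levels between the minimum and maximum, and the names `sample`, `quantize`, `dft`, and `dsp_features` are all hypothetical. A naive DFT is used to keep the block dependency-free; in practice `numpy.fft.fft` would be the usual choice.

```python
import cmath
import math


def sample(time, signal, step=10):
    # ASSUMPTION: the "specific condition" in Algorithm 1 is taken to mean
    # "keep every step-th row" (0.01 s grid -> 0.1 s grid when step=10).
    return time[::step], signal[::step]


def quantize(values, levels=16):
    # ASSUMPTION: uniform quantizer with `levels` evenly spaced values
    # spanning [min, max]; each sample is rounded to the nearest level.
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    width = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / width) * width for v in values]


def dft(values):
    # Naive O(N^2) discrete Fourier transform: X[k] = sum_t x[t] e^{-2*pi*i*k*t/N}.
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def dsp_features(time, signal, step=10, levels=16):
    # Returns the five features in the order used by Algorithm 1.
    sampled_time, sampled_signal = sample(time, signal, step)
    quantized_signal = quantize(sampled_signal, levels)
    spectrum = dft(quantized_signal)
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return sampled_time, sampled_signal, quantized_signal, magnitude, phase


if __name__ == "__main__":
    # Synthetic stand-in for one class of TDR data (not real measurements).
    time = [i * 0.01 for i in range(1000)]
    signal = [0.3 * math.sin(2 * math.pi * 0.5 * t) for t in time]
    features = dsp_features(time, signal)
    print([len(f) for f in features])
```

All five returned lists have the same length, which is what allows them to be laid out later as aligned image channels.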

Figure 3 shows the results of applying sampling to the source data.
As shown in Figure 3, sampling is applied to each of the 8 classes. The orange point in Figure 3 represents the sampled point. Sampling is the process of converting a continuous-time signal into a discrete-time signal [17]. It is carried out as a way of measuring and recording continuous signals at a specific time interval. The primary purpose of sampling is to convert a signal into a digital format and thereby process it in computers or digital devices. In this study, sampling is conducted at an interval of 0.1 seconds. This makes it possible to extract key sampled data points from the time-series data. This process adjusts the temporal resolution of data, and these data points are used as the first feature. The next step is the quantization of the sampled data. Figure 4 shows the application of quantization to the sampled data.

FIGURE 3. Results of applying sampling to source data.

As shown in Figure 4, quantization is applied to the sampled data. The orange point represents the sampled point, and the green point is the quantized point. Quantization is the process of limiting a sampled value to a specific resolution and converting the analog value of a signal into a digital value. In other words, it refers to the process of expressing a range of continuous values with a finite number of levels. In quantization, the value of each sample is rounded to the nearest standard value. A higher quantization resolution allows for more precise expression of original signals, but it requires more storage space and processing power. In this study, the range from the minimum to the maximum value is quantized into 16 levels. The quantized data points adjust the size of the data and are used as the second feature. The next step is the Fourier transform. Figure 5 shows the Fourier transform results for the quantized data.

FIGURE 4. Results of applying quantization to source data.

FIGURE 5. Results of applying discrete Fourier transform to source data.
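
The trade-off between quantization resolution and storage described above can be made concrete with a small worked example. The sketch below assumes a uniform quantizer whose levels are evenly spaced between the minimum and maximum of the signal, which is one plausible reading of "quantized into 16 levels" rather than a confirmed detail of the study. Under that assumption, 16 levels need 4 bits per sample and the rounding error is at most half the spacing between levels. The helper names `quantize` and `resolution_report`, and the synthetic sine input, are hypothetical.

```python
import math


def quantize(values, levels):
    # ASSUMPTION: uniform quantizer, `levels` evenly spaced values on [min, max].
    lo, hi = min(values), max(values)
    width = (hi - lo) / (levels - 1)
    quantized = [lo + round((v - lo) / width) * width for v in values]
    return quantized, width


def resolution_report(values, levels):
    # Summarize storage cost (bits per sample) and worst-case rounding error.
    quantized, width = quantize(values, levels)
    max_error = max(abs(q - v) for q, v in zip(quantized, values))
    bits = math.ceil(math.log2(levels))
    return {
        "levels": levels,
        "bits_per_sample": bits,
        "step": width,
        "max_error": max_error,
    }


# Synthetic stand-in for a sampled reflection-coefficient trace.
signal = [0.3 * math.sin(2 * math.pi * i / 100) for i in range(100)]
for levels in (4, 16, 256):
    print(resolution_report(signal, levels))
```

Raising the number of levels shrinks the step and the worst-case error while increasing the bits needed per sample, which is the precision versus storage trade-off the text refers to.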
As shown in Figure 5, DFT is applied to the quantized data. The blue solid line represents the Magnitude obtained through the DFT process, and the orange dotted line represents the Phase. Since a signal is converted into a discrete-time domain signal in the quantization process, the DFT, rather than the continuous Fourier transform, is used. The DFT is the process of converting a discrete-time domain signal into a frequency domain [19]. This makes it possible to analyze the frequency components of a signal. The two main kinds of information obtained through the transformation are Magnitude and Phase. Magnitude is the size of each frequency component, indicating how strong that frequency component is in the signal. Magnitude is a crucial factor in understanding the energy distribution of a signal, helping identify which frequency components are dominant. Phase is the phase angle of each frequency component, providing information about the temporal structure and shape of a signal. Phase impacts a signal's shape and timing. Even with the same magnitude, having different phases can result in completely different signal shapes. The Magnitude and Phase extracted through DFT can be used to capture the complex composition and characteristics of signals, making it possible to analyze the basic structures and patterns of signals. In this study, Magnitude and Phase are used as the third and fourth features, respectively.
Lastly, the time information of time-series data is used as the fifth feature that reflects the temporal characteristics of data. This is crucial in capturing the dynamic features of time-series data. These five extracted features comprehensively reflect the multi-scale information of data. This helps to preserve the essential characteristics and patterns of data as much as possible, even if inevitable compression of information occurs during image transformation. Therefore, the DSP stage serves as the foundation for extracting important information from data and effectively converting them into images.

C. IMAGE TRANSFORMATION USING EXTRACTED FEATURES
This section describes the process of converting time-series data into images with the use of the five features extracted in Section III-B. The purpose of this transformation is to preserve the multi-scale characteristics of time-series data and visually express the data. The transformation process involves creating two separate images and then combining them.
The process of generating the first image is as follows: The magnitude of the DFT, the phase of the DFT, and the value of time data are normalized in a range between 0 and 1. This normalization ensures that each feature can be appropriately mapped to the pixel value of the image channel. With the use of the normalized features, a 3-channel image is constructed. In this case, the R channel is set to the normalized value corresponding to the magnitude of DFT, the G channel to the normalized value corresponding to the phase of DFT, and the B channel to the normalized value corresponding to time data.
The process of generating the second image is as follows: The values of the sampled signal and the quantized signal are normalized in a range between 0 and 1. These two features are used to construct the first and second channels of the image. With the use of the normalized features, a 3-channel image is constructed. In this case, the R channel is set to the normalized value of the sampled signal, the G channel to the normalized value of the quantized signal, and the B channel to the combination or extension of these two values. Since this study lacks one feature, the B channel is created through the extension based on the information from the R and G channels.
The two images generated through the above two processes visually express each feature. These two images are concatenated horizontally to create a final image of size 5 × 232. Additionally, an image of size 224 × 224 is generated in line with the input of the Vision Transformer model. Later, both the original and resized images are used to conduct performance evaluation to find a balance of the image size optimized for input specifications. These generated images contain a variety of multi-scale information of time-series data and aim to overcome the information compression that may occur during image transformation.
This image transformation method makes it possible to express the complexity and diversity of time-series data visually and effectively, and the transformed images will be used in the multi-class classification of ViT. This process demonstrates the possibility of maintaining the multi-scale characteristics of time-series data and achieving effective classification performance.
Figure 6 shows the appearance of the final image generated for the eight classes, and Figure 7 presents the resized image.

FIGURE 6. Images created from the five features for the eight classes.

FIGURE 7. Resized images created from the five features for the eight classes.
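
The two-image construction described in this section can be sketched as follows. This is only an illustration under explicit assumptions: each feature is laid out as a single row of pixels, the third channel of the second image is filled with the mean of the other two channels (the text says only that it is an "extension" of them), and the reshaping that yields the final 5 × 232 layout is not specified in the text and is therefore omitted. The function names `normalize`, `stack_channels`, and `build_composite` are hypothetical.

```python
def normalize(values):
    # Min-max scaling to [0, 1]; a constant feature maps to all zeros.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def stack_channels(r, g, b):
    # One-row RGB image represented as a list of (r, g, b) pixels.
    return [tuple(pixel) for pixel in zip(r, g, b)]


def build_composite(time, sampled, quantized, magnitude, phase):
    # First image: R = DFT magnitude, G = DFT phase, B = time.
    image1 = stack_channels(normalize(magnitude), normalize(phase), normalize(time))

    # Second image: R = sampled signal, G = quantized signal.
    r, g = normalize(sampled), normalize(quantized)
    # ASSUMPTION: the filler channel is the mean of R and G; the source only
    # states that it is derived from those two channels.
    filler = [(a + b) / 2 for a, b in zip(r, g)]
    image2 = stack_channels(r, g, filler)

    # Horizontal concatenation of the two one-row images.
    return image1 + image2


if __name__ == "__main__":
    composite = build_composite(
        time=[0.0, 0.1, 0.2, 0.3],
        sampled=[0.10, 0.37, 0.22, 0.30],
        quantized=[0.10, 0.38, 0.22, 0.30],
        magnitude=[4.0, 1.0, 0.0, 1.0],
        phase=[0.0, -1.0, 0.0, 1.0],
    )
    print(len(composite), composite[0])
```

Scaling the normalized values by 255 and casting to 8-bit integers would turn the result into a standard RGB image for a vision model.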

Algorithm 2 shows the Image Transformation, depicting the process described in Section III-C and the process of generating images from the five features.

Algorithm 2 Image Transformation
Input: input // sampled_time, sampled_signal, quantized_signal, magnitude, phase
Output: output_path //
procedure IMAGE TRANSFORMATION(sampled_time, sampled_signal, quantized_signal, magnitude, phase)
  3 features to image:
    norm_magnitude ← NORMALIZE(magnitude)
    norm_phase ← NORMALIZE(phase)
    norm_time ← NORMALIZE(sampled_time)
    image1 ← STACKCHANNELS(norm_magnitude, norm_phase, norm_time)
  2 features to image:
    norm_sampled_signal ← NORMALIZE(sampled_signal)
    norm_quantized_signal ← NORMALIZE(quantized_signal)
    dummy_channel ← CREATEDUMMYCHANNEL
    image2 ← STACKCHANNELS(norm_sampled_signal, norm_quantized_signal, dummy_channel)
  Create composite image from features:
    composite_image ← COMBINEIMAGESIDEBYSIDE(image1, image2)
  return composite_image
end procedure

By transforming time-series data into images, it is possible to effectively express the multidimensional features and temporal and spatial patterns of the data [29]. This helps deep learning models better understand and learn the complex structures of data. Furthermore, it is possible to integrate multi-scale information during image transformation. Such integration is good for capturing various frequencies and time-scale patterns of time-series data. For these reasons, time-series data are transformed into images [30].

D. ViT-BASED MULTI-SCALE CLASSIFICATION
This section describes the process of classifying the transformed images of time-series data using the ViT model [22]. Compared to the traditional methods for analyzing time-series data or other deep learning-based image classification models, the ViT model has shown higher accuracy and efficiency [31]. This indicates that the ViT model is especially advantageous in transforming the complex features of time-series data into images and processing them. Moreover, the ViT model provides deeper insights and predictive power than traditional approaches for analyzing time-series data [32]. The visualization of the time-series data obtained through image transformation enables the ViT model to better understand and classify complex patterns [33]. Therefore, this study employs the ViT model for the image transformation and classification of time-series data.
Figure 8 shows the structure of the ViT model used in this paper. The ViT-Base model is used to classify the images of time-series data. The model maintains basic settings, and only the input size is adjusted. The input image size for the model is set to 232 × 5, corresponding to the size of the image generated in Section III-C. The patch size for processing the input images is set to 1. Consequently, the 232 × 5 image is divided into a total of 1,160 patches (232 × 5 = 1,160) and put into the ViT model. In this paper, an input image is directly fed into the ViT model without any preprocessing, and thus no separate Feature Extractor is employed. The divided patches are passed through Linear Projection of Flattened Patches and then are put into the encoder of the transformer. The encoder of the transformer follows the structure of a conventional Transformer, consisting of 12 layers. After passing through the encoder, the patches go through the MLP Head for final class classification.

FIGURE 8. Multi-class classification model structure based on ViT.

The classification is conducted using both the original image transformed from time-series data (232 × 5) and the resized image (224 × 224) tailored to fit the input of the ViT model. This allows for the evaluation of the impact of image size variation on classification performance. In the performance evaluation process, the ViT model is compared with other transformer-based vision models and traditional CNN models. This evaluation aims to assess how effective the ViT model is for image classification of time-series data. With the changes in the hyperparameters of the ViT model, multiple experiments are carried out to obtain optimal performance. In this process, the performance

58632 VOLUME 12, 2024


G.-I. Kim, K. Chung: ViT-Based Multi-Scale Classification Using DSP and Image Transformation

changes of the model under various settings are observed and analyzed.

This study focuses on exploring the multi-class classification of time-series data using the ViT model and on systematically evaluating the impact of image size and model structure variation on classification performance. In doing so, it examines the applicability and limitations of state-of-the-art deep learning models in time-series data analysis.

IV. EXPERIMENTAL RESULTS AND COMPARISONS
A. DATA AUGMENTATION AND PERFORMANCE EVALUATION METRICS FOR EXPERIMENTS
This section describes the experimental setup, data augmentation, and evaluation metrics. All experiments were conducted with two Intel® Xeon® Silver 2.40 GHz CPUs, 128 GB RAM, and an NVIDIA RTX 3090 GPU. In this paper, noise injection is used as a data augmentation technique to increase the diversity of experimental data [34]. Data augmentation is applied to the experimental data (time-series data) acquired in Section III-A. Noise injection generates transformed versions of the data by adding random noise to the experimental data. The purpose of this process is to enhance a model’s generalization ability by exposing it to the various kinds of noise that may occur in real-world environments [34].

As a result, 30,000 augmented data samples per class are used as training data, and 9,000 augmented data samples per class as test data. During training, 10% of the training data are set aside as validation data. All experiments are conducted with the same augmented dataset, and the transformed images are also generated from the augmented data. This ensures consistent conditions for comparing and evaluating the experimental results.

Accuracy is used as an evaluation metric for the experiments [35]. Accuracy measures how accurately a model predicts the classes of input images and is calculated as the ratio of correctly classified data to total data. Eq. (1) presents the calculation method for accuracy. True Positives (TP) are correctly identified positive cases, True Negatives (TN) are correctly identified negative cases, False Positives (FP) are negative cases incorrectly identified as positive, and False Negatives (FN) are positive cases incorrectly identified as negative.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

Additionally, Precision and Recall are used to evaluate the class-wise performance of traditional time-series models and ViT. Precision is the ratio of true positive items among the items classified as positive by the model; it measures how accurately a model predicts a positive class [36]. Recall is the percentage of actual positive items that the model correctly classifies as positive; it measures how well a model captures an actual positive class [37]. Eq. (2) and Eq. (3) represent the calculation methods for Precision and Recall, respectively.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

ViT-MSC (Multi-Scale Classification), introduced in Section IV, refers to the ViT model which reflects the feature extraction and image transformation methods proposed in this paper. It is based on ViT-B, where Patch Size is set to 1 and image resizing is not applied.

B. COMPARISON WITH TRADITIONAL TIME SERIES MODELS
This section presents a comparative performance analysis between the ViT model and conventional time-series models, specifically LSTM, GRU, and TCN. The evaluation metric used for comparison is accuracy, which represents the ratio of correctly predicted instances to total instances. The experiments include three existing time-series models, LSTM, GRU, and TCN, as well as the ViT model. Table 2 shows the comparison with existing time-series models.

TABLE 2. Comparison with traditional time series models.

The ViT model used in this paper outperforms existing time-series models, achieving the highest accuracy of 0.8917. This indicates that the ViT model is well suited to capturing and classifying patterns of time-series data when the data are transformed into images. The performance advantage of the ViT model may stem from several factors. First, the ViT model benefits from the rich feature representations obtained through image transformation, which lets it capture complex data patterns better than traditional models. Unlike traditional time-series models, the ViT model utilizes spatial relationships within the transformed images to provide additional contextual information, which helps to enhance classification accuracy. The ViT model also has inherent scalability and flexibility, so it can handle various image sizes and levels of complexity relatively easily.

The slightly lower performance of GRU compared with LSTM is attributable to their different model structures. GRU has a simpler structure and fewer parameters than LSTM. For this reason, it has higher computational efficiency. However, compared to LSTM, GRU may have limitations in capturing detailed information when dealing with complex time-series


data. Therefore, due to its simpler structure, GRU is speculated to have slightly lower performance than LSTM. The primary reasons why the TCN model has relatively lower performance are the characteristics of its model structure and the complexity of the time-series data. Although the TCN is designed to handle long-term dependencies, it is sensitive to specific hyperparameter settings. For this reason, the TCN may not adequately capture the various patterns and complexity of the given time-series data. Additionally, TCN’s fixed dilation path may have limitations in effectively learning the diverse scales and patterns of the transformed image data.

In conclusion, the outstanding performance of the ViT model underscores the effectiveness of the image transformation used for time-series data classification. This approach not only enhances model performance but also highlights the potential of vision-based models in a domain traditionally dominated by sequence-based models. The results of this study support the exploration of image-based models and transformations and present a promising method for future research on time-series analysis.

C. COMPARISON AMONG VISION TRANSFORMER MODELS
This section compares four ViT models: ViT-MSC (Ours), Swin-T [38], Swin-Tv2 [39], and PVT [40]. In this experiment, resizing is applied to all models other than the ViT model. Table 3 shows the comparison between ViT models. Table 4 presents the key hyperparameter information used in the experiment.

TABLE 3. Comparison among vision transformer models.

TABLE 4. Key hyperparameter information used in the experiment.

The ViT model has higher accuracy than the other ViT models. That is mainly attributable to model flexibility and data adaptability. The ViT model can be used with the original data size of 232 × 5 without resizing. Since its Patch Size is set to 1, the model can capture detailed information of the data. This is because the structure of ViT aligns well with the specific data structure in this study and because all parts of an image can be utilized effectively through precise patch size settings. It is speculated that the other Vision Transformer models show slightly lower performance because their model structures require resizing.

The reason why PVT outperforms Swin-T is that its pyramid-shaped structure is effective for capturing multi-scale features. PVT hierarchically learns features at different resolutions. This structure allows PVT to effectively handle the diverse patterns and scales of time-series data, and it plays a key role in enabling PVT to achieve higher accuracy than Swin-T.

Swin-T likely performs better than Swin-Tv2 because it is better suited to the image transformation of time-series data used in this paper. Although Swin-Tv2 introduces more structural improvements and optimizations than Swin-T, these changes may have compromised the model’s adaptability to a specific data structure. In particular, for the image transformation of time-series data, the structure of Swin-T might have been better suited to the data characteristics in this paper.

This comparative analysis reveals that the ViT model, compared to the other ViT models, exhibited excellent performance in multi-class classification of images transformed from time-series data. This indicates that the structural flexibility and data adaptability of the ViT model played key roles in its performance.

D. COMPARISON WITH CNN MODELS
In this section, the ViT model is compared with recent CNN models: ConvNeXt [41], ConvNeXtv2 [42], EfficientNet [43], and ResNet [44]. The comparison focuses on the accuracy achieved by each model when classifying the images transformed from time-series data. Table 5 presents the performance comparison with CNN models.

TABLE 5. Comparison with CNN models.

Table 5 shows that the ViT model had better performance than recent CNN models. There are three main reasons why the ViT model outperforms the CNNs. Firstly, in the methodology used in this study, the transformation of time-series data into images generates images that are suitable for processing at the patch level. The ViT model splits an image into patches and independently processes them. This approach is good for capturing detailed information from the transformed images effectively. It enables the ViT model to utilize both local and


global information effectively. Secondly, CNN models are known to have a significant inductive bias in handling spatial hierarchy and locality. Although this bias may be advantageous for certain types of images, it can also limit a model’s adaptability. In contrast, the ViT model processes images as a patch sequence and has a low level of inductive bias that does not impose strong spatial assumptions. This feature allows the ViT model to adapt better to images transformed from time-series data, which lack the inductive biases inherent in natural images. Lastly, the images generated by the methodology proposed in this paper lack traditional inductive biases such as continuity and locality, which CNNs inherently exploit. As a result, the patch-based approach of the ViT provides a more suitable framework for capturing the unique patterns and features of these transformed images.

The performance comparison emphasizes the effectiveness of the ViT model, which outperforms traditional CNN models in handling images transformed from time-series data. ViT’s patch-based processing and low inductive bias help it adapt to the unique characteristics of the transformed images and thereby achieve higher classification accuracy. This analysis suggests that the ViT model has potential in areas such as time-series data analysis, beyond the field of natural image processing.

E. COMPARISON WITH SPECTROGRAM-BASED IMAGE TRANSFORMATION
This section compares the methodology used in this study with the spectrogram transformation, the most widely used method for transforming time-series data into images and analyzing them [45]. This comparison focuses on evaluating the performance of the two approaches to transforming time-series data into visual forms. Table 6 presents the performance comparison between the spectrogram methodology and the methodology used in this study.

TABLE 6. Comparison with spectrogram-based image transformation.

Table 6 shows that the methodology used in this paper has higher performance than the spectrogram transformation method. The main reason is that multi-scale information is effectively extracted from time-series data through DSP. The image transformation methodology used in this paper captures various scales and characteristics of time-series data through the DSP. This helps minimize the information compression and loss that may occur during image transformation and preserves the complexity and diversity of the data. Because the spectrogram transformation mainly focuses on visualizing information in the frequency domain, detailed information in the time domain may be lost in the process. Additionally, the spectrogram created with the time-series data used in this study has a size of 775 × 300. When the image is resized to 224 × 224 to match the ViT model, more information is lost [46]. In contrast, the methodology used in this paper considers the information in both the time and frequency domains and preserves more information in the transformed images. In addition, it offers flexibility in adapting to the various characteristics of time-series data. The DSP-based multi-scale information extraction makes it possible to capture data patterns in diverse scenarios and conditions and contributes to increasing accuracy in classification tasks.

This comparative analysis reveals that the methodology suggested in this study outperforms the traditional spectrogram transformation method in terms of image transformation and classification of time-series data. The suggested methodology transforms complex time-series data into visual forms more effectively and minimizes the information loss that may occur in the transformation process. This indicates that the suggested methodology is an effective alternative for time-series data analysis and classification.

F. CLASS-WISE PERFORMANCE COMPARISON BETWEEN LSTM, GRU, TCN MODELS, AND ViT MODEL
This section compares the performance of the ViT model with that of the LSTM, GRU, and TCN models across the classes of time-series data. In particular, the accuracy of each model is analyzed by class to evaluate how well the models classify specific classes of time-series data. Table 7 presents the class-by-class performance evaluation for the four models.

The LSTM, GRU, and TCN models show relatively low accuracy for the two most challenging classes, water 12% and water 18%. Additionally, they tend to treat Class 4 and Class 5 as a single class. This indicates that these models have limitations in capturing complex patterns or subtle temporal variation. If time-series data of certain classes have intricate dynamic characteristics, traditional time-series processing models may have difficulty in adequately learning and classifying them. In contrast, the ViT model that reflects the methodology suggested in this paper demonstrates high classification accuracy even for the classes that conventional models find hard to identify. Furthermore, it consistently distinguishes between the two classes rather than treating them as one. That is because, in the suggested methodology of transforming time-series data into images, the ViT model is capable of effectively capturing and understanding multi-scale patterns and complex features of data. In particular, the patch-based approach makes it possible to capture detailed information of the data. That is a main factor contributing to the excellent performance of the ViT model compared to other models.

The ViT model and the image transformation methodology used in this paper offer a remarkable approach to time-series


TABLE 7. Class-wise performance comparison between LSTM, GRU, TCN models, and ViT-MSC model.

data analysis. Compared to traditional time-series processing models, the ViT model demonstrates the capability to classify various classes of complex time-series data more accurately. These findings indicate that the ViT model will be able to open up a new possibility for analyzing time-series data, especially hard-to-classify classes.
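
The class-wise comparison in this section rests on per-class Precision and Recall (Eqs. (2) and (3)), computed one-vs-rest from a confusion matrix. The following minimal Python sketch shows that bookkeeping. The function name `per_class_metrics` and the 3-class counts are illustrative assumptions, not the authors' code or results. Two classes that a model tends to merge show up as large off-diagonal counts, which lower Recall for one class and Precision for the other.

```python
def per_class_metrics(confusion):
    """Per-class (precision, recall) plus overall accuracy.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        # Column sum minus TP: predicted as class k but actually another class.
        fp = sum(confusion[i][k] for i in range(n)) - tp
        # Row sum minus TP: actually class k but predicted as another class.
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics.append((precision, recall))
    return metrics, correct / total


# Made-up counts for illustration only (NOT results from the paper):
# classes 1 and 2 are frequently confused with each other.
toy_confusion = [
    [9, 1, 0],
    [0, 5, 5],
    [0, 4, 6],
]
metrics, accuracy = per_class_metrics(toy_confusion)
for k, (precision, recall) in enumerate(metrics):
    print(f"class {k}: precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.3f}")
```

In the toy matrix, class 0 keeps high Precision and Recall while the two confused classes both drop to roughly 0.5 to 0.6, which is the kind of pattern the text describes for classes that a model treats as one.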

G. PERFORMANCE COMPARISON WITH VARIATIONS IN ViT HYPER-PARAMETERS
In this section, to evaluate the impact of hyperparameter variations in the ViT model on performance, these researchers conduct performance comparisons according to whether to apply resizing and a change in patch size. This comparison focuses on exploring the flexibility of the ViT model and its applicability in various scenarios. Figure 9 shows the performance comparison according to variations in ViT hyper-parameters.

FIGURE 9. Performance comparison with variations in ViT Hyper-parameters.
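
To make the compared configurations concrete, the sketch below counts the patches (tokens) a ViT receives for each setting evaluated in this section: a 224 × 224 resized image with patch size 16, and the native 232 × 5 image with patch sizes 1 and 2. The helper `patch_grid` and its `pad` flag are hypothetical names introduced for illustration. Because 5 is not divisible by 2, the count for patch size 2 depends on whether the remainder is dropped or padded; the paper does not state which was used, so both are shown.

```python
from math import ceil


def patch_grid(height, width, patch, pad=False):
    """Return (rows, cols, n_patches) for non-overlapping square patches.

    pad=False drops any remainder (floor division); pad=True rounds up,
    as if the image were padded to a multiple of the patch size.
    Both behaviours are assumptions for illustration.
    """
    rows = ceil(height / patch) if pad else height // patch
    cols = ceil(width / patch) if pad else width // patch
    return rows, cols, rows * cols


configs = {
    "resized 224x224, patch 16": (224, 224, 16),
    "native 232x5, patch 1": (232, 5, 1),
    "native 232x5, patch 2": (232, 5, 2),
}
for name, (h, w, p) in configs.items():
    dropped = patch_grid(h, w, p)
    padded = patch_grid(h, w, p, pad=True)
    print(f"{name}: drop remainder -> {dropped[2]}, pad -> {padded[2]}")
# resized 224x224, patch 16: 196 either way (14 x 14 grid)
# native 232x5, patch 1:     1160 either way
# native 232x5, patch 2:     232 if the remainder is dropped, 348 if padded
```

The native image with patch size 1 yields about six times as many tokens as the resized image with patch size 16, so the finer granularity comes with a longer input sequence for the encoder.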
When resizing is applied, the default patch size of 16 is used to evaluate model performance. When resizing is not applied, patch sizes of 1 and 2 are evaluated. The model shows the highest performance when the patch size is set to 1 without resizing. This indicates that the ViT model can better capture the intricate features and patterns of data when processing detailed image information at the patch level. In particular, for images transformed from time-series data, smaller patch sizes appear advantageous for the model to learn subtle data variations more accurately.

Due to computational constraints, this study utilizes only the ViT-Base model for experimentation. If larger models such as ViT-L are employed, higher accuracy may be achievable, because larger models can capture data complexity more effectively through a greater number of parameters and deeper structures.

In this study, the performance comparison based on hyperparameter variations reveals that the detailed settings of the ViT model affect the image transformation and classification of time-series data. In particular, patch size and the application of resizing were found to be crucial factors influencing model performance. This finding provides guidelines for applying the ViT model to various data and tasks. In addition, this study suggests the potential for performance improvement with increasing model size. Therefore, it is expected that the scope and applicability of the ViT model will be extended further.

V. CONCLUSION
This study explored a new approach to time-series data analysis by transforming time-series data into images and classifying them with the ViT model. It was confirmed that the methodology developed in this study effectively transformed various time-series data into images. This indicates that the methodology can successfully reinterpret the intricate patterns and dynamic characteristics of time-series data in a visual form. The ViT model used in this paper demonstrated far better performance than time-series models and CNN models. In particular, the ViT model had the best performance when the patch size was set to 1. This means that the detailed information generated during image transformation is significant. Additionally, the methodology of this study contributed to achieving high classification accuracy for all classes, including the most challenging ones. This suggests that transformed images can effectively reflect multi-scale information of time-series data. Additionally, the comparative analysis with the spectrogram transformation method revealed that the image transformation methodology developed in this study is superior and effective. This indicates that the developed methodology can be used as a new approach to capture various features of time-series data.

Future research has the following directions. The first is to explore the construction of an end-to-end model by making parameters such as sampling and quantization in the DSP stage trainable. This will enable a model to automatically learn a transformation method more appropriate to the data features and thus enhance performance further. The second is to develop a method that varies time intervals in the sampling process to extract various features and integrate them into a single image. This aims to explore a method that can more richly reflect the multidimensional characteristics of time-series data. The last is to develop a new model specialized for this methodology and then experiment with not only the ViT model but also various other structures and architectures. This aims to optimize the approach of the methodology and look into a new possibility of time-series data analysis. This study suggests a novel approach to time-series data analysis and will lay the foundation for future research and applications in the field.

REFERENCES
[1] H. I. Fawaz, B. Lucas, G. Forestier, C. Pelletier, D. F. Schmidt, J. Weber, G. I. Webb, L. Idoumghar, P.-A. Müller, and F. Petitjean, “InceptionTime: Finding AlexNet for time series classification,” Data Mining Knowl. Discovery, vol. 34, no. 6, pp. 1936–1962, Sep. 2020, doi: 10.1007/s10618-020-00710-y.
[2] A. Dempster, F. Petitjean, and G. I. Webb, “ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels,” Data Mining Knowl. Discovery, vol. 34, no. 5, pp. 1454–1495, Jul. 2020, doi: 10.1007/s10618-020-00701-z.
[3] A. Dempster, D. F. Schmidt, and G. I. Webb, “MiniRocket: A very fast (almost) deterministic transform for time series classification,” in Proc. 27th ACM SIGKDD Conf. Knowl. Discovery Data Mining, New York, NY, USA, Aug. 2021, pp. 248–257, doi: 10.1145/3447548.3467231.
[4] X. Zhang, Y. Gao, J. Lin, and C.-T. Lu, “TapNet: Multivariate time series classification with attentional prototypical network,” in Proc. AAAI Conf. Artif. Intell., Apr. 2020, vol. 34, no. 4, pp. 6845–6852, doi: 10.1609/aaai.v34i04.6165.
[5] R. P. França, A. C. Borges Monteiro, R. Arthur, and Y. Iano, “An overview of deep learning in big data, image, and signal processing in the modern digital age,” in Trends in Deep Learning Methodologies, 1st ed., Amsterdam, Netherlands: Elsevier, 2021, ch. 3, pp. 63–87, doi: 10.1016/B978-0-12-822226-3.00003-9.
[6] A. P. Ruiz, M. Flynn, J. Large, M. Middlehurst, and A. Bagnall, “The great multivariate time series classification bake off: A review and experimental evaluation of recent algorithmic advances,” Data Mining Knowl. Discovery, vol. 35, no. 2, pp. 401–449, Mar. 2021, doi: 10.1007/s10618-020-00727-3.
[7] B. Lim and S. Zohren, “Time-series forecasting with deep learning: A survey,” Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 379, no. 2194, Feb. 2021, Art. no. 20200209, doi: 10.1098/rsta.2020.0209.
[8] M. Maleki, M. R. Mahmoudi, D. Wraith, and K.-H. Pho, “Time series modelling to forecast the confirmed and recovered cases of COVID-19,” Travel Med. Infectious Disease, vol. 37, Sep. 2020, Art. no. 101742, doi: 10.1016/j.tmaid.2020.101742.
[9] S. E. Alnaa and F. Ahiakpor, “ARIMA (autoregressive integrated moving average) approach to predicting inflation in Ghana,” J. Econ. Int. Finance., vol. 3, no. 5, pp. 328–336, Mar. 2011.
[10] N. Wagner, V. Antoine, J. Koko, and R. Lardy, “Fuzzy k-NN based classifiers for time series with soft labels,” in Proc. IPMU, Lisbon, Portugal, 2020, pp. 578–589, doi: 10.1007/978-3-030-50153-2_43.
[11] H. Deng, G. Runger, E. Tuv, and M. Vladimir, “A time series forest for classification and feature extraction,” Inf. Sci., vol. 239, pp. 142–153, Aug. 2013, doi: 10.1016/j.ins.2013.02.030.
[12] F. Karim, S. Majumdar, H. Darabi, and S. Chen, “LSTM fully convolutional networks for time series classification,” IEEE Access, vol. 6, pp. 1662–1669, 2018, doi: 10.1109/ACCESS.2017.2779939.
[13] N. Elsayed, A. S. Maida, and M. Bayoumi, “Gated recurrent neural networks empirical utilization for time series classification,” in Proc. Int. Conf. Internet Things (iThings) IEEE Green Comput. Commun. (GreenCom) IEEE Cyber, Phys. Social Comput. (CPSCom) IEEE Smart Data (SmartData), Atlanta, GA, USA, Jul. 2019, pp. 1207–1210, doi: 10.1109/iThings/GreenCom/CPSCom/SmartData.2019.00202.
[14] B. H. D. Koh, C. L. P. Lim, H. Rahimi, W. L. Woo, and B. Gao, “Deep temporal convolution network for time series classification,” Sensors, vol. 21, no. 2, p. 603, Jan. 2021, doi: 10.3390/s21020603.
[15] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Müller, “Deep learning for time series classification: A review,” Data Mining Knowl. Discovery, vol. 33, no. 4, pp. 917–963, Mar. 2019, doi: 10.1007/s10618-019-00619-1.
[16] B. Karanov, M. Chagnon, V. Aref, F. Ferreira, D. Lavery, P. Bayvel, and L. Schmalen, “Experimental investigation of deep learning for digital signal processing in short reach optical fiber communications,” in Proc. IEEE Workshop Signal Process. Syst. (SiPS), Coimbra, Portugal, Oct. 2020, pp. 1–6, doi: 10.1109/SiPS50750.2020.9195215.
[17] F. Xuan, Y. Dong, J. Li, X. Li, W. Su, X. Huang, J. Huang, Z. Xie, Z. Li, H. Liu, W. Tao, Y. Wen, and Y. Zhang, “Mapping crop type in northeast China during 2013–2021 using automatic sampling and tile-based image classification,” Int. J. Appl. Earth Observ. Geoinf., vol. 117, Mar. 2023, Art. no. 103178, doi: 10.1016/j.jag.2022.103178.
[18] T. Shen, W. Ren, and M. Han, “Quantized generalized maximum correntropy criterion based kernel recursive least squares for online time series prediction,” Eng. Appl. Artif. Intell., vol. 95, Oct. 2020, Art. no. 103797, doi: 10.1016/j.engappai.2020.103797.
[19] E. Ghaderpour, S. D. Pagiatakis, and Q. K. Hassan, “A survey on change detection and time series analysis with applications,” Appl. Sci., vol. 11, no. 13, p. 6141, Jul. 2021, doi: 10.3390/app11136141.
[20] M. Onishi and T. Ise, “Explainable identification and mapping of trees using UAV RGB image and deep learning,” Sci. Rep., vol. 11, no. 1, p. 903, Jan. 2021, doi: 10.1038/s41598-020-79653-9.
[21] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “MUSIQ: Multi-scale image quality transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montréal, QC, Canada, Oct. 2021, pp. 5128–5137.
[22] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16 × 16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Reps. (ICLR), Vienna, Austria, 2021, pp. 1–21.
[23] A. Yadav, C. K. Jha, and A. Sharan, “Optimizing LSTM for time series prediction in Indian stock market,” Proc. Comput. Sci., vol. 167, pp. 2091–2100, Jan. 2020, doi: 10.1016/j.procs.2020.03.257.
[24] W. Zheng and G. Chen, “An accurate GRU-based power time-series prediction approach with selective state updating and stochastic optimization,” IEEE Trans. Cybern., vol. 52, no. 12, pp. 13902–13914, Dec. 2022, doi: 10.1109/TCYB.2021.3121312.
[25] J. Fan, K. Zhang, Y. Huang, Y. Zhu, and B. Chen, “Parallel spatio-temporal attention-based TCN for multivariate time series prediction,” Neural Comput. Appl., vol. 35, no. 18, pp. 13109–13118, Jun. 2023, doi: 10.1007/s00521-021-05958-z.
[26] D. Ma, P. Zhao, and X. Jiao, “PerfHD: Efficient ViT architecture performance ranking using hyperdimensional computing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Vancouver, BC, Canada, Jun. 2023, pp. 2229–2236.
[27] Z. Zeng, M. G. Amin, and T. Shan, “Arm motion classification using time-series analysis of the spectrogram frequency envelopes,” Remote Sens., vol. 12, no. 3, p. 454, Feb. 2020, doi: 10.3390/rs12030454.
[28] J.-S. Lee, J. U. Song, W.-T. Hong, and J.-D. Yu, “Application of time domain reflectometer for detecting necking defects in bored piles,” NDT E Int., vol. 100, pp. 132–141, Dec. 2018, doi: 10.1016/j.ndteint.2018.09.006.


[29] S. Barra, S. M. Carta, A. Corriga, A. S. Podda, and D. R. Recupero, ‘‘Deep [43] M. Tan and Q. Le, ‘‘EfficientNet: Rethinking model scaling for convolu-
learning and time series-to-image encoding for financial forecasting,’’ tional neural networks,’’ in Proc. Int. Conf. Mach. Learn., Long Beach,
IEEE/CAA J. Autom. Sinica, vol. 7, no. 3, pp. 683–692, May 2020, doi: CA, USA, 2019, pp. 6105–6114.
10.1109/JAS.2020.1003132. [44] C. Zhang, P. Benz, D. M. Argaw, S. Lee, J. Kim, F. Rameau, J.-C. Bazin,
[30] C.-L. Yang, C.-Y. Yang, Z.-X. Chen, and N.-W. Lo, ‘‘Multivariate and I. S. Kweon, ‘‘ResNet or DenseNet? Introducing dense shortcuts
time series data transformation for convolutional neural network,’’ in to ResNet,’’ in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV),
Proc. IEEE/SICE Int. Symp. Syst. Integr. (SII), Paris, France, Jan. 2019, Jan. 2021, pp. 3549–3558.
pp. 188–192, doi: 10.1109/SII.2019.8700425. [45] H. Chaurasiya, ‘‘Time-frequency representations: Spectrogram,
[31] H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov, cochleogram and correlogram,’’ Proc. Comput. Sci., vol. 167,
‘‘A-ViT: Adaptive tokens for efficient vision transformer,’’ in Proc. pp. 1901–1910, Jan. 2020, doi: 10.1016/j.procs.2020.03.209.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), New Orleans, [46] J. Kwon, M. J. Kim, J. W. Baek, and K. Chung, ‘‘Voice frequency synthesis
LA, USA, Jun. 2022, pp. 10799–10808. using VAW-GAN based amplitude scaling for emotion transformation,’’
[32] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. H. Tay, J. Feng, KSII Trans. Internet Inf. Syst. (TIIS), vol. 16, no. 8, pp. 2787–2800,
and S. Yan, ‘‘Tokens-to-Token ViT: Training vision transformers from Aug. 2022, doi: 10.3837/tiis.2022.02.018.
scratch on ImageNet,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
Montréal, QC, Canada, Oct. 2021, pp. 538–547.
[33] M. Chen, M. Lin, K. Li, Y. Shen, Y. Wu, F. Chao, and R. Ji, ‘‘CF-ViT:
A general coarse-to-fine method for vision transformer,’’ in Proc. AAAI
Conf. Artif. Intell., Washington, DC, USA, 2023, pp. 7042–7052, doi:
10.1609/aaai.v37i6.25860.
GYU-IL KIM received the B.S. degree from the Division of Computer Science and Engineering, Kyonggi University, Suwon, South Korea, in 2023, where he is currently pursuing the master’s degree with the Department of Computer Science. He was a Researcher with the Data Mining Laboratory, Kyonggi University. His research interests include data mining, big data, deep learning, machine learning, and computer vision.

KYUNGYONG CHUNG received the B.S., M.S., and Ph.D. degrees from the Department of Computer Information Engineering, Inha University, South Korea, in 2000, 2002, and 2005, respectively. He has worked for the software technology leading department with Korea IT Industry Promotion Agency (KIPA). From 2006 to 2016, he was a Professor with the School of Computer Information Engineering, Sangji University, South Korea. Since 2017, he has been a Professor with the Division of AI Computer Science and Engineering, Kyonggi University, South Korea. He was named in 2017 as a Highly Cited Researcher by Clarivate Analytics. His research interests include data mining, artificial intelligence, healthcare, knowledge systems, HCI, and recommendation systems.
