
Personal and Ubiquitous Computing (2022) 26:355–364
https://doi.org/10.1007/s00779-019-01261-w

ORIGINAL ARTICLE

MMCNet: deep learning–based multimodal classification model using dynamic knowledge

Sung-Soo Park 1 & Kyungyong Chung 2

Received: 16 March 2019 / Accepted: 1 July 2019 / Published online: 2 August 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019

Abstract
Because of the growth of the business sector dealing in the distribution of movies, software, music, and other contents, a very large amount of contents has accumulated. Accordingly, recommendation systems that match contents to user requests have become more important. In distribution businesses, accurate content recommendations are required to secure and retain users. To establish a highly accurate recommendation system, the recommended contents must be accurately classified. As classification methods, techniques such as naive Bayes, SGD (stochastic gradient descent), and SVM (support vector machine) are mainly utilized. If all of the information on recommended subjects is applied in the classification process, high-level accuracy can be expected, but heavy calculation, a long service time, and low scalability are incurred. Given this inefficiency, effective classification that uses the metadata of contents is required. Metadata are expressed in the forms of the domain concept, relation, type, and attribute to allow the complicated relations between multimodal data (text, images, and video) to be processed efficiently. Most classification systems use single modal data to express one piece of knowledge for an item in a domain. Single modal data are limited in terms of improving classification accuracy, because they do not include the useful information provided by different knowledge types. Therefore, in this paper, we propose MMCNet, a deep learning–based multimodal classification model that uses dynamic knowledge. The proposed method consists of a classification model that applies a CNN (convolutional neural network), based on human learning principles, to multimodal data combining text and image knowledge. By using a Web robot agent, multimodal data are collected from the TMDb (The Movie Database) data set, which includes a variety of single modal data. In the preprocessing procedures, knowledge integration, knowledge conversion, and knowledge reduction are performed to create a quantified knowledge base. To handle text data, sentences are refined through morphological analysis and converted to numerical vectors by using word embedding. Image data are converted to numerical vectors by using a vector-conversion library. The converted feature vectors are utilized to create multimodal learning data, and the classification model is used for learning. To solve the problem of memory operation resources, vector model-based meta-knowledge is expanded through expression, conversion, alignment, inference, and deep learning. To evaluate its performance, the proposed model was compared with conventional classification methods in terms of accuracy, recall, and F1-score. According to this evaluation, the proposed classification model improves the accuracy, recall, and F1-score rates more than the conventional methods do. In addition, the proposed model was implemented as a deep learning–based multimodal classification system in a graphical user interface environment that allows users to provide feedback about the classification results by adjusting classification parameters. Through the convergence of the knowledge bases of various domains and multimodal deep learning, the dynamic knowledge that influences user preference is inferred.

Keywords Dynamic knowledge · Data mining · CNN · Multimodal classification · Recommendation

1 Introduction
A recommendation system can recommend items or contents to users in their environment, which changes together with the time-series flow. It analyzes the contents-related knowledge that is continuously accumulated and then recommends the contents that users are predicted to like. The problems that conventional recommendation systems suffer include the first-rating [1], data sparsity [2], dynamic preference [3], and domain dependence problems [4]. The first-rating problem is divided into the first-rater problem and the cold-start problem, according to whether a recommended item or a recommendation user is involved. The first-rater problem occurs when no review of a new item exists and thus a new item cannot be recommended because of lack of data. The cold-start problem occurs when no information on a new user exists. The data sparsity problem leads to low-level accuracy in the case of unstable data that have many missing values. The dynamic preference problem is caused by changes in the user's preference characteristics according to internal and external conditions. The domain dependence problem occurs in most recommendation systems; it means that the recommendation depends on only one domain. With the 4th industrial revolution, recommendation systems that use life big data in various domains have become more important. Contents distribution companies, such as Amazon, Netflix, Aladin, and Watcha, established their unique recommendation systems for their business and ceaselessly research and develop their systems' performance to improve it. A recommendation system should reflect the user's preference, which changes over time, and provide a personalized service that considers a variety of user dispositions. A typical recommendation method that considers consumer preferences is based on the correlations between items. It recommends an item according to correlations between items, or an item selected by other users who show a preference similar to that of the current user. The greater the detail in which an item is analyzed, the more accurate is the recommendation. Therefore, highly reliable knowledge bases should be constructed by means of accurate classification of similar items. As a fundamental classification method, human item classification shows an average error rate of 13–15% and in general is highly accurate; however, it incurs time and cost problems [5]. To solve these problems, classification algorithms have been developed and studied that can extract representative attributes and discover and advance knowledge by using sequences, logistics, regression, neural networks, and decision trees. It is important to develop an algorithm that retains the accuracy of conventional classification methods, while also increasing efficiency and automating item classification.

With the development of deep learning technology, data from various areas can now be applied in diverse domains. In particular, in the memory and intelligence-based machine learning methods that copy the human learning process, context knowledge is processed using Bi-LSTM (bi-directional long short-term memory) and CNN+RNN (convolutional neural networks plus recurrent neural networks) methods, and thus, the deep learning performance is improved [6]. In ANN (artificial neural network)-based deep learning methods, all the features are extracted from the data, and these methods are very accurate. However, the operations of ANN-based deep learning models are computationally expensive, although, with the development of the GPU, the operation speed is increasing exponentially. When the operation cost is high, an adaptive weight value is applied, or partial learning according to an entropy information-based weight value threshold is established using an ensemble approach to resolve the learning time problem. In addition, by using model parallelism for distributing learning models and data parallelism for distributing learning data, methods that render the deep learning structure lightweight are being developed.

Companies that conduct research on artificial intelligence, such as Apple and Google, now manufacture new processors, such as neural engines and Tensor Core GPUs, to increase the operation speed. This hardware development allows a reduction in the burden of expensive operation costs and the commercialization of deep learning–based models. ANN-based deep learning models are applied in diverse areas, including image classification, voice recognition, and context awareness. However, research on conventional deep learning models has been focused on the classification of single modal data, and less research is being conducted on data classification involving multimodal data. Multimodal data include more features than single modal data and allow a more accurate model to be constructed. Therefore, in this paper, a multimodal deep learning classification model (MMCNet) that considers dynamic knowledge changes, in combination with the text and image data contained in knowledge data, is proposed for predicting items. The classification of multimodal data by means of a deep learning model can lead to a better performance than the classification of single modal data.

This paper is organized as follows. In Section 2, the classification method for item recommendation is described. The proposed MMCNet, a deep learning–based multimodal classification model that uses dynamic knowledge, is described in Section 3. Section 4 presents the performance evaluation of the model. Section 5 provides the conclusion.

2 Classification method for item recommendation

The main techniques applied in recommendation systems include collaborative filtering, content-based filtering, hybrid filtering, context awareness, and data mining. They improve both the accuracy and recall rates through the classification of the items to be recommended. The accuracy and recall rates of a recommendation system are strongly influenced by the methods used to classify items. Items can be classified at different levels because of the diverse features found in data. To classify items accurately, research has been conducted on classification methods that are suitable for metadata comprised of text and images.

Fig. 1 Summary flow of classification method

For text classification, TF-IDF (term frequency-inverse document frequency) and word embedding are used. TF-IDF is a method for evaluating the extent to which a word is important to a relevant sentence [7]. It calculates a value by multiplying the TF (term frequency), that is, the appearance frequency of a particular term in a sentence, by the IDF (inverse document frequency), that is, the inverse of the appearance frequency of the term in a document group. The significance of the word analyzed by TF-IDF is used as the weight value to calculate the relation between a sentence and the classification result. Techniques for calculating this relation include naive Bayes [8] and SGD (stochastic gradient descent) [9]. Naive Bayes is a probability classifier based on the Bayes theorem. Although this classification method is relatively old, it requires fewer learning data and its operation cost is lower than that of advanced classification methods, such as SVM (support vector machine) algorithms, and its accuracy performance is high. However, the method's preprocessing operation influences the accuracy of the classification. SGD is a classification method that minimizes the expected value of a loss function. It optimizes the expected value of the gradient as the sample mean by using an estimate. Because sample data are used only for optimization, the size of the raw data is not important. Therefore, the method is applied mainly in machine learning with large amounts of operation data. However, an optimization method that uses an estimate does not guarantee that the expected value of the loss function is minimized. The TF-IDF-based classification method creates a new classification criterion whenever data are added. For this reason, to classify dynamic data, a new classification criterion must be continuously created. This means that the existing classification criteria are not reused, and thus, scalability is low.

Word embedding is a method in which a sentence is converted into a numerical vector [10]. The numerical vector is classified using logistic regression analysis [11] or an ANN. Logistic regression analysis, like linear regression analysis, classifies the results of the regression into particular types. The method can be easily implemented, and its calculation cost is low. However, because appropriate learning is not always achieved, the under-fitting problem arises. Classification methods that use ANNs include MLP (multilayer perceptron) classification [12] and FCNs (fully connected neural networks) [13]. In MLP, the perceptron, which is created in imitation of human neurons, is connected in multiple layers. The multiple-layered perceptron is able to achieve the effect of drawing multiple lines of linear classification discrimination. Using the perceptron, problems that cannot easily be solved by linear classification, such as the XOR problem and MNIST (Modified National Institute of Standards and Technology) digit classification, can be solved. MLP classification learning improves the accuracy of the classification result using forward and backward propagation. However, if the model is designed with a deep and wide structure so that it can be applied to the problems of the complicated real world, learning is not effective, because the vanishing gradient problem arises in the backward propagation process. An FCN is an ANN having multiple hidden layers between its input and output layers. In each layer, a feature can be identified by using an activation function, and the accuracy of the classification result can be improved by learning with forward and backward propagation. However, because one-dimensional data are classified, the spatial information of the data can be lost in the dimension conversion process.
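The TF-IDF weighting and the naive Bayes and SGD classifiers described above can be illustrated with a short sketch. The paper does not name an implementation, so scikit-learn, the toy synopses, and the genre labels below are assumptions made only for illustration.

```python
# Hypothetical sketch: TF-IDF features fed to naive Bayes and SGD classifiers.
# scikit-learn is assumed here; the paper does not name its implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

# Toy synopses and genre labels (illustrative only).
synopses = [
    "A detective hunts a serial killer through the city",
    "Two friends go on a hilarious road trip",
    "A haunted house terrorizes a young family",
    "An astronaut is stranded alone on a distant planet",
]
genres = ["thriller", "comedy", "horror", "science fiction"]

# TF-IDF: term frequency in a synopsis weighted by inverse document frequency.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(synopses)

# Naive Bayes classifier over the TF-IDF matrix.
nb = MultinomialNB().fit(X, genres)

# SGD classifier minimizing the expected value of a (hinge) loss function.
sgd = SGDClassifier(loss="hinge", max_iter=1000).fit(X, genres)

query = vectorizer.transform(["A lonely robot explores deep space"])
print(nb.predict(query), sgd.predict(query))
```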
For image data classification, models such as the CNN-based AlexNet [14], GoogLeNet [15], ResNet [16], and SegNet [17] have been proposed. These models were presented in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), in which new image classification models are proposed every year. To increase classification accuracy, advanced studies are being conducted. Most of the previously proposed models improve accuracy, but their operation cost is high [18–20]. This problem can be mitigated by using a distribution method that implements multiple advanced GPUs to shorten the learning period. However, these models are used in special circumstances for research, and therefore, it is difficult to apply them in practice. Thus, it is necessary to develop through research an efficient model that is highly accurate and has a low operation cost in a general situation. Figure 1 shows the summary flow of the classification method. For text classification, TF-IDF, logistic regression, MLP, and FCN are used, which calculate associations using naive Bayes and SGD. For image classification, models that utilize a CNN are used, such as AlexNet, GoogLeNet, ResNet, and SegNet.

3 MMCNet: deep learning–based multimodal classification model using dynamic knowledge

3.1 Web robot agent for knowledge collection

In conventional digital contents classification, if the number of contents registered in a platform is small, it is very likely that the data over-fitting problem will occur. Accordingly, it is necessary to learn the domain concept, relation, type, and attribute that are involved in very large amounts of metadata. In a platform that distributes digital contents, metadata include a variety of multimodal data (text, images, sound, and video) [21]. By using feature inheritance, frame-type inferences are made to allow a lower class to share the attribute and value of its upper class. The frame-based knowledge expression consists of a class frame, subclass frame, and case frame. Feature inheritance allows the knowledge of the frame to be kept and an error in the expressed knowledge to be corrected. For this reason, it is used to determine the relation between classes. The attributes that constitute the metadata of a movie are the title, poster, production company, and supported languages, as well as the genre, and those metadata are used in the classification procedure. The metadata collected in this study are comprised of the data set offered by TMDb (The Movie Database). Established in 2008, TMDb contains the data of the accumulated movie knowledge offered by multiple users [22]. It updates information in real time by using collective intelligence, and its data are in the form of metadata. For the collection of movie metadata, the TMDb API (application programming interface) is used. The TMDb API receives a request for information through a URL that includes a movie's unique ID. The URL is able to make an HTTP REST (Representational State Transfer) request for collection. By using a Web robot agent based on the Python requests library, which supports REST, data can be collected in the {key, value} format. A Web robot agent is a program that can automatically collect from the Web the attributes and their values for expressing the frame object [23]. By selecting a frame key in TMDb and choosing an appropriate attribute, data can be collected.

Fig. 2 Web robot agent for movie database knowledge collection
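The REST-based collection described above can be pictured with a minimal sketch that uses the Python requests library against the public TMDb v3 API. The endpoint, field names, and the placeholder API key are assumptions for illustration; the paper does not list the exact calls its Web robot agent makes.

```python
# Hypothetical Web-robot sketch: fetch one movie's metadata from the TMDb v3 API
# as {key, value} pairs. The endpoint and fields follow the public API docs;
# the paper does not describe its agent at this level of detail.
import requests

API_KEY = "YOUR_TMDB_API_KEY"  # issued by TMDb
BASE_URL = "https://api.themoviedb.org/3"

def fetch_movie(movie_id: int) -> dict:
    """Request a movie's metadata (title, synopsis, poster path, genres) by unique ID."""
    resp = requests.get(
        f"{BASE_URL}/movie/{movie_id}",
        params={"api_key": API_KEY, "language": "en-US"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()  # the JSON body arrives as a Python dict of {key: value} pairs
    return {
        "title": data.get("title"),
        "overview": data.get("overview"),
        "poster_path": data.get("poster_path"),
        "genres": [g["name"] for g in data.get("genres", [])],
        "original_language": data.get("original_language"),
    }

# Example: movie_id 550 is the identifier used in the TMDb documentation.
# print(fetch_movie(550))
```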
From the data collected through selection, the attribute of poster is used to collect image data from the URL in which movie poster images are saved. Figure 2 shows the Web robot agent for the TMDb knowledge collection. By using the API key issued by TMDb and the unique ID of the item to be collected, knowledge data can be collected. In other words, by using the API key and the unique ID of an item, text attributes (a title or a synopsis), an image attribute (a poster), and a result attribute (a genre) for data collection can be selected. In addition, the "not null" exception (data from all attributes must be filled) and language selection (original language) options are used to collect only classifiable data that satisfy data integrity [24]. The resulting frame of a Web robot agent contains a list of the collected items. The runtime dynamic linking of the collected documents is used to create an entry file and collect knowledge in the form of metadata.
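The "not null" exception and the original-language option can be pictured as a simple filter over the collected records. The attribute names follow the hypothetical fetch sketch above and are illustrative only.

```python
# Hypothetical integrity filter: keep only records whose required attributes are
# all filled and whose language matches the selected original language.
REQUIRED = ("title", "overview", "poster_path", "genres")

def is_classifiable(record: dict, language: str = "en") -> bool:
    not_null = all(record.get(attr) for attr in REQUIRED)          # "not null" exception
    right_language = record.get("original_language") == language   # language selection
    return not_null and right_language

# classifiable = [r for r in collected_records if is_classifiable(r)]
```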
3.2 Preprocessing process for building knowledge base

The collected metadata are used to create a knowledge base through preprocessing [25, 26]. Text is converted into a format that a computer can understand through morphological analysis. The morphological analysis utilizes the Python library offered as open source by NLTK (Natural Language Toolkit) [27]. It divides a sentence into words or tokens, the small units defined as regular expressions. A sentence is tokenized by TweetTokenizer, a tokenizing function for analyzing and defining the data in Twitter. The generated tokens include a dictionary-type morpheme, the stem of the word, and the word's ending. A morpheme is the smallest grammatical unit in a language and cannot be applied as it is from the perspective of knowledge processing. To find a dictionary-type morpheme, morphological analysis is conducted. The morphological analysis includes processes such as stemming, lemmatization, and part-of-speech tagging. Each process module is designed to operate independently and plays a complementary role. Through stemming and lemmatization, a dictionary-type morpheme is obtained. Thus, the phenomenon that the same word is distributed in various forms because of its suffix and ending, so that its importance is weakened, can be avoided. Stemming allows the suffix and ending of a transformed morpheme to be removed. Using the PorterStemmer or LancasterStemmer offered by NLTK, the ending of a word can be removed. For example, the morphemes of "dying" and "lives" can become "die" and "live" when processed by PorterStemmer; in the case of LancasterStemmer, "dying" and "liv" are displayed. Lemmatizing is performed to present the basic morpheme, after complete ending extraction, as the dictionary-type word with the same meaning. For lemmatization, when WordNetLemmatizer is applied to the basic morphemes "dying" and "liv," "dying" and "life" are displayed. The text in the form of dictionary-type words is encoded as the classification input data. It presents the feature of the extractable text based on word appearance frequency. Text encoding utilizes word embedding to convert a word into a numerical vector. Word embedding techniques include Word2Vec [10], GloVe [28], and Doc2Vec [29]. In this study, we utilized Doc2Vec, which creates a word embedding model from large block-based text such as a sentence. The method creates a model through unsupervised learning and presents the relation between the input data and the classification result as a numerical vector.
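A minimal sketch of this text pipeline is shown below, combining NLTK's TweetTokenizer, PorterStemmer, and WordNetLemmatizer with a Doc2Vec model. The paper names Doc2Vec but not a specific implementation, so gensim and all parameter values here are assumptions.

```python
# Hypothetical text-preprocessing sketch: tokenize, stem/lemmatize with NLTK,
# then embed each synopsis as a numerical vector with gensim's Doc2Vec.
# Note: WordNetLemmatizer needs the wordnet corpus (nltk.download("wordnet")).
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenizer = TweetTokenizer()
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def refine(sentence: str) -> list:
    """Tokenize a sentence and reduce each token toward a dictionary-type form."""
    tokens = tokenizer.tokenize(sentence.lower())
    return [lemmatizer.lemmatize(stemmer.stem(tok)) for tok in tokens]

synopses = [
    "A detective is dying to catch the killer",
    "Their lives change after the haunted house",
]
documents = [TaggedDocument(words=refine(s), tags=[i]) for i, s in enumerate(synopses)]

# Unsupervised Doc2Vec model; vector_size and epochs are illustrative values.
model = Doc2Vec(documents, vector_size=128, min_count=1, epochs=40)
text_vector = model.infer_vector(refine("A killer hides in an old house"))
print(text_vector.shape)  # (128,)
```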

In the case of images, color data are converted into a numerical vector. For the conversion, keras_preprocessing, included in Keras, a deep learning library [30], is used. The image module of keras_preprocessing includes functions that can convert image data into a different format. Using the load_img and img_to_array functions, image data are converted into a numerical vector. The load_img function loads input image data in the PIL (Python Imaging Library) format, and the img_to_array function outputs the loaded PIL image as a numerical vector. Figure 3 shows the preprocessing procedure of converting knowledge data into a numerical vector according to data type.

Fig. 3 Preprocessing procedure of converting knowledge data
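The image conversion can be sketched directly with the two functions named above; the target size and the scaling to [0, 1] are illustrative assumptions, not values reported in the paper.

```python
# Hypothetical image-conversion sketch: load a poster with keras_preprocessing
# and turn it into a numerical array. Target size and scaling are illustrative.
from keras_preprocessing.image import load_img, img_to_array

def poster_to_array(path: str, size=(128, 128)):
    pil_image = load_img(path, target_size=size)   # loaded as a PIL image
    array = img_to_array(pil_image)                # float32 array, shape (128, 128, 3)
    return array / 255.0                           # scale color values to [0, 1]

# x = poster_to_array("posters/movie_550.jpg")
```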
3.3 MMCNet: deep learning–based multimodal classification model using dynamic knowledge

A multimodal classification model extracts the features of the numerical vectors of a knowledge base through deep learning [31]. Using the extracted features, data are classified on the basis of a defined criterion; the classes that meet a probability criterion are defined as the multiple choices. The deep learning model is optimized using a CNN that is suitable for classification. The network has a multilayer connection structure comprising convolution layers. Each convolution layer supports efficient learning by using activation, normalization, pooling, padding, and dropout functions. As the activation function, ELU (exponential linear units) is used [32]. This function retains the advantage of ReLU (rectified linear units), namely its excellent optimization speed, and solves the dying ReLU problem, in which a negative input always produces "0" as the output. Equation (1) presents ELU. In the equation, x represents an input value and α a gradient coefficient.

f(x) = x (if x > 0), f(x) = α(e^x − 1) (if x ≤ 0)   (1)
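Equation (1) translates directly into code; the following is just the ELU definition written with NumPy, with α as the coefficient applied to negative inputs.

```python
# ELU from Eq. (1): x for x > 0, alpha * (exp(x) - 1) for x <= 0.
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-2.0, 0.0, 3.0])))  # [-0.8647  0.      3.    ]
```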


The use of a high LR (learning rate) to increase the deep learning speed may cause problems such as gradient vanishing, gradient exploding, covariate shift, and internal covariate shift. Gradient vanishing and gradient exploding mean that an LR that is inappropriate for learning leads to too small a gradient (vanishing) or too large a gradient (exploding). To compensate, BN (batch normalization) is performed [33]. Covariate shift occurs when there is a distribution difference between the learning and test data and when the input distribution is not constant in the learning process. This problem can be solved by normalizing the input distribution to zero mean and unit variance. Pooling resizes a convolution layer to output a new layer. To prevent over-fitting, the number of features is reduced. The pooling operations that exist include Mean Pool, MaxPool, and others. This model uses MaxPool, which shows the best performance, for the pooling operation [34]. The method extracts the maximum value from the filter, the size of which differs according to the stride, and then outputs a new layer value. Padding prevents the reduction in the output data that is caused by filtering and striding in a convolution layer; it pads the border of the input data with a particular value. By using zero padding, which is set to a value of "0," interference in feature extraction can be prevented. The dropout function sets a part of a convolution layer to "0" to prevent over-fitting and thereby improves the classification accuracy [35, 36]. To output the classification result on the basis of the extracted features, a fully connected layer is organized. It utilizes the softmax activation function, which is generally used. The input data of the fully connected layer differ from those of a convolution layer in terms of data shape. Accordingly, a flatten layer for converting the data into the appropriate shape is applied. The designed deep learning model is applied to each attribute of the metadata [37]. This classification model keeps the most recent knowledge by using dynamic knowledge. Figure 4 shows the configuration of the deep learning classification model using dynamic knowledge. It extracts the features of each data modality by using a CNN model, which includes a repetition of a convolution layer, BN, ELU activation, MaxPool, and dropout. To extract features from a vector effectively, the model is designed with six or more convolution layers. To output the classification results from the features, a fully connected layer is applied.

Fig. 4 Configuration of deep learning classification model using dynamic knowledge
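The configuration just described, repeated blocks of convolution, BN, ELU activation, MaxPool, and dropout per modality, followed by a flatten layer and a softmax fully connected output, can be pictured with a Keras sketch. The layer counts, filter sizes, input shapes, and the number of genre classes below are illustrative assumptions rather than the authors' exact configuration.

```python
# Hypothetical Keras sketch of a two-branch (text + image) CNN classifier built
# from the blocks described in Section 3.3. All sizes are illustrative, and
# tensorflow.keras is assumed as the Keras implementation.
from tensorflow.keras import layers, models

NUM_GENRES = 19  # assumption: one output unit per genre class

def conv_block(x, filters):
    """Convolution -> batch normalization -> ELU -> max pooling -> dropout."""
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    return layers.Dropout(0.25)(x)

# Image branch: repeated convolution blocks over the poster array.
image_in = layers.Input(shape=(128, 128, 3), name="poster")
x = image_in
for filters in (32, 64, 128):
    x = conv_block(x, filters)
image_features = layers.Flatten()(x)

# Text branch: dense layers over the Doc2Vec sentence vector.
text_in = layers.Input(shape=(128,), name="synopsis_vector")
t = layers.Dense(128)(text_in)
t = layers.BatchNormalization()(t)
t = layers.Activation("elu")(t)
text_features = layers.Dropout(0.25)(t)

# Fully connected classification head over the combined multimodal features.
merged = layers.Concatenate()([image_features, text_features])
out = layers.Dense(NUM_GENRES, activation="softmax")(merged)

model = models.Model(inputs=[image_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```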
4 Performance evaluation

4.1 Composition of deep learning–based multimodal classification system

The proposed classification model (MMCNet) was developed in an environment of Windows 10, an Intel i7-8700 CPU, 8 GB RAM, and Python 3.6.8. For knowledge data collection, a Web robot agent is used to collect the data set offered by TMDb. Movie knowledge is filtered from the TMDb data set at runtime for collection and processing. The collected knowledge data are used to create the knowledge base of deep learning data through knowledge integration, conversion, and reduction preprocessing. For text data, the morphological analysis supplied by NLTK and Doc2Vec is used. For image data, keras_preprocessing is used. The knowledge base is classified using the deep learning multimodal classification model [37–40]. Figure 5 shows the deep learning–based multimodal classification system, which collects knowledge data in real time using a Web robot agent. Morphological analysis is conducted to extract and refine unstructured and structured data to obtain knowledge. Through preprocessing, problems such as noise, incompleteness, and inconsistency are resolved. Thus, a knowledge base quantified through knowledge integration, knowledge conversion, and knowledge reduction is created. The deep learning–based multimodal classification model uses a CNN, which performs knowledge expression, knowledge conversion, and knowledge classification. Therefore, it increases the learning of the multimodal artificial intelligence model and the utility of meta-knowledge scalability and thereby solves the data sparsity problem that arises in single modality-based classification. Through mining in different knowledge domains, the proposed model is able to infer and process meaningful knowledge and execute deep learning expression, conversion, alignment, inference, and learning [41, 42]. To solve the memory operation resource problem, it scales up the vector model-based meta-knowledge through partial learning based on an operation accelerator function. The deep learning–based multimodal classification system is developed in a user-friendly PyQT-based GUI (graphical user interface) environment and provides module-by-module functions [25]. It checks the essential libraries and displays the installed version and the most recent version of each. If a library needs to be updated or installed, PyPI (the Python Package Index) is called to install the library version that fits the system operation. After the library state is checked, the scraper button and preprocessor button are activated.

Fig. 5 Deep learning–based multimodal classification system
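The library check described above can be pictured with a small sketch that reads the installed version with importlib.metadata and asks PyPI for the latest release. The package list and the update step are illustrative assumptions about how such a check might work, not the authors' implementation.

```python
# Hypothetical library-check sketch: compare installed versions against PyPI and,
# if needed, install the latest release. The package list is illustrative.
# importlib.metadata requires Python 3.8+; older versions can use the
# importlib_metadata backport instead.
import subprocess
import sys
from importlib import metadata

import requests

ESSENTIAL = ["nltk", "gensim", "keras", "requests"]

def latest_version(package: str) -> str:
    """Ask the PyPI JSON API for the newest released version of a package."""
    resp = requests.get(f"https://pypi.org/pypi/{package}/json", timeout=10)
    resp.raise_for_status()
    return resp.json()["info"]["version"]

for package in ESSENTIAL:
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    newest = latest_version(package)
    print(f"{package}: installed={installed}, latest={newest}")
    if installed != newest:
        # Install or update through pip, which resolves the package from PyPI.
        subprocess.run([sys.executable, "-m", "pip", "install", "-U", package], check=True)
```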
By activating each button, a Web robot agent or a preprocessing procedure can be called, respectively, to collect knowledge from TMDb and create a knowledge base. In the quantified knowledge base, classification based on multimodal deep learning for judgment and prediction is performed by using short- and long-term memory mechanisms that imitate human learning. It finds the optimal status by repeatedly extracting features according to knowledge data types. By using the train button, the deep learning–based multimodal classification model learning can be performed and its effectiveness verified. In the console, its progress can be checked. When the multimodal deep learning model learning has been performed 2,000 times repeatedly, an item in a list can be selected and tested. The classification result of the item is presented as the probability of each item, and the item is classified as "True" in 90% or more of cases. The classification result is compared with the correct answer, and then its accuracy is displayed.

4.2 Performance evaluation

To evaluate the performance of the proposed classification model, 50,000 filtered TMDb knowledge data were used as test data. Of these, 80% were used as a data set for learning and 20% as a data set for evaluation. The learning data set was used to establish a classification model, whereas the evaluation data set was used to check whether the performance of the proposed model is better when it is applied to new data. Through the preprocessing of image and text knowledge data, a quantified knowledge base was created. The text preprocessing was based on TF-IDF and Doc2Vec, and thus, it was possible to evaluate multiple types of classification models. The TF-IDF-based multinomial naive Bayes and SGD classifiers and the Doc2Vec-based logistic regression, MLP classification, and MMCNet were compared. For the comparison, the classification models were evaluated in terms of accuracy, recall, and F1-score, which were calculated using the generated confusion matrix of the classification result and the correct answer. Table 1 shows the confusion matrix of the classification result and the correct answer. In the table, TP means that the classification result is true and the correct answer is true. FP means that the classification result is true and the correct answer is false. FN means that the classification result is false and the correct answer is true. TN means that the classification result is false and the correct answer is false [41, 43].

Table 1 Confusion matrix of the classification result and the correct answer

                   Answer: True    Answer: False
Result: True       TP              FP
Result: False      FN              TN

Equations (2), (3), and (4) were used to calculate the accuracy, recall, and F1-score, respectively. In the equations, accuracy is the rate of consistency between the classification result and the correct answer among the total results, recall is the rate of consistency between the classification result and the correct answer among the cases where the answer was correct, and F1-score summarizes the precision and recall together by using the harmonic mean of the two values. The higher the accuracy, recall, and F1-score rates, the better the performance.

Accuracy = (TP + TN) / (TP + FN + FP + TN)   (2)

Recall = TP / (TP + FN)   (3)

F1 = 2 × (precision × recall) / (precision + recall) = 2 × TP / (2 × TP + FP + FN)   (4)
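Equations (2)–(4) can be computed directly from the four confusion-matrix counts; the helper below follows the definitions above, and the example counts are made up for illustration.

```python
# Accuracy, recall, and F1-score from confusion-matrix counts, following Eqs. (2)-(4).
def scores(tp: int, fp: int, fn: int, tn: int):
    accuracy = (tp + tn) / (tp + fn + fp + tn)          # Eq. (2)
    recall = tp / (tp + fn)                             # Eq. (3)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4), equals 2TP/(2TP+FP+FN)
    return accuracy, recall, f1

# Illustrative counts only (not taken from the paper's experiments).
print(scores(tp=900, fp=60, fn=70, tn=970))
```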

Table 2 Accuracy, recall, and F1-score evaluation results for each classification model

KB        Classification         Modal         Accuracy    Recall    F1-score
TF-IDF    MultinomialNB          TEXT          74.6%       70.1%     72.3%
TF-IDF    SGD                    TEXT          70.3%       72.1%     72.1%
Doc2Vec   Logistic Regression    TEXT          83.8%       82.3%     83.0%
Doc2Vec   MLP                    TEXT          80.2%       78.9%     79.3%
Doc2Vec   MMCNet                 TEXT          92.4%       93.0%     92.7%
Doc2Vec   MMCNet                 TEXT+IMAGE    95.6%       94.5%     95.1%

Table 2 shows the accuracy, recall, and F1-score evaluation results for each classification model. In the table, it can be seen that the Doc2Vec-based MMCNet has an accuracy rate of 92.4–95.6%, a recall value of 93.0–94.5%, and an F1-score of 92.7–95.1%. Compared with the other classification models, its performance is good. The multimodality-based model had higher accuracy, recall, and F1-score rates than the single modality-based model. The accuracy, recall, and F1-score rates of the TF-IDF-based classification models are low because of meaningless very high-frequency words. Of the Doc2Vec-based classification models, logistic regression failed to perform learning correctly, and therefore, its accuracy, recall, and F1-score rates are low.
The structure of the MLP classification model is deep and wide to allow it to be applied to complicated knowledge data, and therefore, its accuracy, recall, and F1-score rates are low because of the vanishing gradient and other problems.

5 Conclusion

In this paper, we proposed a deep learning–based multimodal classification model (MMCNet) that uses dynamic knowledge. To express efficiently the relations between the feature vectors extracted from each modality, meta-knowledge is used. Different types of modality interact with each other through knowledge base convergence. To increase the accuracy of the recommendation system, a dynamic knowledge-based multimodal classification model was developed. It combines and classifies multimodal data to increase its accuracy in comparison with single modal data classification models. As the data to classify, the data set offered by TMDb is used to collect knowledge data. For the TMDb knowledge collection, a Web robot agent is used to select and collect the data that guarantee integrity. The collected data are preprocessed and then used to create a quantified knowledge base. Text is lemmatized through a tokenizing process and morphological analysis and then expressed as a token vector. Image data are loaded using an appropriate library and then converted to numerical vectors. The created knowledge base is learned using a CNN-based deep learning model that imitates the principles of human learning. The multimodal classification model is comprised of convolution layers, BN, MaxPool, zero padding, and a dropout function. The proposed model, which was developed in consideration of the operation cost, shows a better performance than human recognition. The proposed MMCNet is implemented as a deep learning–based multimodal classification system that provides users with all the processes of collection, preprocessing, CNN learning, and result production in GUI form, step by step. The proposed MMCNet was compared with other classification methods in terms of accuracy, recall, and F1-score rates. According to the comparative evaluation, the proposed model has an accuracy of 92.4–95.6%, recall of 93.0–94.5%, and F1-score of 92.7–95.1%, and the multimodal-based model showed a better performance than the single modality-based model. The proposed MMCNet showed a better performance than the other classification methods in terms of all the evaluation factors.

A classification model was developed, and a system was constructed that considers multimodal data and dynamic knowledge. However, it did not produce results for dynamic knowledge where knowledge is constantly added and modified. This fact can be considered a limitation of this paper. The results should be derived by organizing a dynamic knowledge data set that is constantly modified. This will be addressed in future work.

Funding information This work was supported by the GRRC program of Gyeonggi province [2018-B05, Smart Manufacturing Application Technology Research].

References

1. Bobadilla J, Ortega F, Hernando A, Bernal J (2012) A collaborative filtering approach to mitigate the new user cold start problem. Knowl-Based Syst 26:225–238
2. Jung H, Chung K (2016) Knowledge-based dietary nutrition recommendation for obese management. Inf Technol Manag 17(1):29–42
3. Wu X, Zhu X, Wu G, Ding W (2014) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97–107
4. Melville P, Sindhwani V (2017) Recommender systems. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning and data mining. Springer, Boston, pp 1056–1066
5. Kobsa A, Cho H, Knijnenburg B (2016) The effect of personalization provider characteristics on privacy attitudes and behaviors: an elaboration likelihood model approach. J Assoc Inf Sci Technol 67(11):2587–2606
6. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
7. Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, Cambridge
8. Duda R, Hart P (1973) Pattern classification and scene analysis. Wiley, New York
9. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
10. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119
11. Cox D (1958) The regression analysis of binary sequences. J R Stat Soc Ser B 20(2):215–242
12. Rumelhart D, Hinton G, Williams R (1985) Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science, Report No. ICS-8506
13. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proc of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3431–3440
14. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
15. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proc of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1–9
16. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proc of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
17. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39:2481–2495
18. Yoo H, Chung K (2017) PHR based diabetes index service model using life behavior analysis. Wirel Pers Commun 93(1):161–174
19. Jung H, Chung K (2016) Life style improvement mobile service for high risk chronic disease based on PHR platform. Clust Comput 19(2):967–977
20. Park RC, Jung H, Chung K, Yoon KH (2015) Picocell based telemedicine health service for human UX/UI. Multimed Tools Appl 74(7):2519–2534
21. Yoo H, Chung K (2018) Mining-based lifecare recommendation using peer-to-peer dataset and adaptive decision feedback. Peer-to-Peer Networking and Applications 11(6):1309–1320
22. TMDb. https://www.themoviedb.org. Accessed 10 Mar 2019
23. Chung K, Na Y, Lee JH (2013) Interactive design recommendation using sensor based smart wear and weather WebBot. Wirel Pers Commun 73(2):243–256
24. Sun S, Yao W, Li X (2019) SORD: a new strategy of online replica deduplication in cloud-P2P. Clust Comput 22(1):1–23
25. Chung K, Kim JC, Park RC (2016) Knowledge-based health service considering user convenience using hybrid Wi-Fi P2P. Inf Technol Manag 17(1):67–80
26. Kim JC, Chung K (2017) Depression index service using knowledge based crowdsourcing in smart health. Wirel Pers Commun 93(1):255–268
27. NLTK. https://www.nltk.org/. Accessed 10 Mar 2019
28. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proc of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
29. Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proc of the International Conference on Machine Learning, pp 1188–1196
30. Keras. http://keras.io/. Accessed 10 Mar 2019
31. Jung H, Yoo H, Chung K (2016) Associative context mining for ontology-driven hidden knowledge discovery. Clust Comput 19(4):2261–2271
32. Klambauer G, Unterthiner T, Mayr A, Hochreiter S (2017) Self-normalizing neural networks. Adv Neural Inf Process Syst:971–980
33. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proc of the International Conference on Machine Learning, vol 37, pp 448–456
34. Scherer D, Muller A, Behnke S (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In: Proc of the International Conference on Artificial Neural Networks, pp 92–101
35. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
36. Chung K, Yoo H, Choe D, Jung H (2019) Blockchain network based topic mining process for cognitive manufacturing. Wirel Pers Commun 105(2):583–597
37. Kim JC, Chung K (2019) Mining based time-series sleeping pattern analysis for life big-data. Wirel Pers Commun 105(2):475–489
38. Kim JC, Chung K (2019) Prediction model of user physical activity using data characteristics-based long short-term memory recurrent neural networks. KSII Trans Internet Inf Syst 13(4):2060–2077
39. Chung K, Boutaba R, Hariri S (2014) Recent trends in digital convergence information system. Wirel Pers Commun 79(4):2409–2413
40. Yoo H, Chung K (2018) Heart rate variability based stress index service model using bio-sensor. Clust Comput 21(1):1139–1149
41. Kim JC, Chung K (2017) Emerging risk forecast system using associative index mining analysis. Clust Comput 20(1):547–558
42. Kim JC, Chung K (2019) Associative feature information extraction using text mining from health big data. Wirel Pers Commun 105(2):691–707
43. Jung H, Chung K (2015) Ontology-driven slope modeling for disaster management service. Clust Comput 18(2):677–692

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Affiliations

Sung-Soo Park 1 & Kyungyong Chung 2

1 Data Mining Lab., Department of Computer Science, Kyonggi University, 154-42, Gwanggyosan-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do 16227, South Korea
2 Division of Computer Science and Engineering, Kyonggi University, 154-42, Gwanggyosan-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do 16227, South Korea
