
DSA unit1 - data science applications

computer science (G.Narayanamma Institute of Technology & Science)



Downloaded by Lalu Yadav ([email protected])
lOMoARcPSD|50021907

1
Introduction to Data Science: Review, Challenges,
and Opportunities

G. R. Sinha
Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar

Ulligaddala Srinivasarao and Aakanksha Sharaff


National Institute of Technology, Raipur, Chhattisgarh, India

CONTENTS
1.1 Introduction
1.2 Data Science
1.2.1 Classification
1.2.2 Regression
1.2.3 Deep Learning
1.2.4 Clustering
1.2.5 Association Rules
1.2.6 Time Series Analysis
1.3 Applications of Data Science in Various Domains
1.3.1 Economic Analysis of Electric Consumption
1.3.2 Stock Market Prediction
1.3.3 Bioinformatics
1.3.4 Social Media Analytics
1.3.5 Email Mining
1.3.6 Big Data Analysis Mining Methods
1.4 Challenges and Opportunities
1.4.1 Challenges in Mathematical and Statistical Foundations
1.4.2 Challenges in Social Issues
1.4.3 Data-to-Decision and Actions
1.4.4 Data Storage and Management Systems
1.4.5 Data Quality Enhancement
1.4.6 Deep Analytics and Discovery
1.4.7 High-Performance Processing and Analytics
1.4.8 Networking, Communication, and Interoperation
1.5 Tools for Data Scientists
1.5.1 Cloud Infrastructure
1.5.2 Data/Application Integration
1.5.3 Master Data Management
1.5.4 Data Preparation and Processing
1.5.5 Analytics
1.5.6 Visualization


2 Data Science and Its Applications

1.5.7 Programming
1.5.8 High-Performance Processing
1.5.9 Business Intelligence Reporting
1.5.10 Social Network Analysis
1.6 Conclusion
References

1.1 Introduction
Data science is a new area of research concerned with very large datasets, and it covers
activities such as collecting, preparing, visualizing, managing, and preserving data. Although
the term suggests a close link to subject areas like computer science and databases, data
science also requires other skills, including non-mathematical ones. It combines data analysis,
statistics, and other methods, and it also encompasses the results they produce. Its aim is
to analyze and understand the phenomena behind the data by revealing hidden features of
complex social, human, and natural phenomena from a point of view that traditional methods
do not offer.
Data science includes three stages: designing the data, collecting the data, and finally
analyzing the data. The applicability of data science is growing rapidly in various areas
because data science has been making enormous strides in data processing and use. Business
analytics, social media, data mining, and other disciplines have benefited from advances in
data science and have shown good results in the literature.
Data science has made remarkable advancements in the fields of ensemble machine
learning, hybrid machine learning, and deep learning. Machine learning (ML) methods can
learn from data with minimal human intervention. Deep learning (DL) is a subset of ML
that is applicable in different areas, such as self-driving cars and earthquake prediction.
Many studies in the literature report that DL, which builds on artificial neural networks,
outperforms conventional ML methods such as k-nearest neighbors and support vector
machines (SVM) in disciplines such as medicine and social media.
Torabi et al. developed a hybrid model that combines two predictive machine learning
algorithms, together with an optimization-based method for maximizing the prediction
function [1]. Mosavi and Edalatifar illustrated that hybrid machine learning models can be
more accurate than single machine learning models [2].
This chapter presents a review of various data science methods and details how they are
used to deal with critical challenges that arise when working with big data analytics.
According to the literature, different classification, regression, clustering, and deep learning–
based methods have often been used. However, there is an opportunity to improve in new
areas, like temporal and frequent pattern discovery for load prediction. This chapter also
discusses the future trends of data science, to explore new tools and algorithms that are
capable of intelligently handling large datasets that are collected from various sources.

1.2 Data Science


Technological tools developed in recent years have helped in many domains, including
management and big data. Advancements in different areas of communications and
information technology, like email information privacy, market, stock data, data science,
and real-time monitoring, have also been a good influence.

FIGURE 1.1
Data science process

Data science builds algorithms and systems for discovering knowledge, detecting patterns,
and generating useful information from massive data. To do so, it encompasses an entire
data analysis process that starts with the extraction and cleaning of data, and extends to
data analysis, description, and summarization. Figure 1.1 depicts the complete process. It
starts with data collection. Next, the data is cleaned to select the segment that has the most
valuable information: the user filters the data or formulates queries that remove
unnecessary information. After the data is prepared, an exploratory analysis that includes
visualization tools helps decide which algorithms are suitable for gaining the required
knowledge. This complete process guides the user toward results that help them make
suitable decisions.
Depending on the primary outcomes, the complete process should be fine-tuned to
obtain improved results. This will involve changing the parameter values or making
changes to the datasets. These kinds of decisions are not made automatically, so the
involvement of an expert in result analysis is a crucial factor.
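The cleaning and exploratory steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the field meanings, and the function names clean and summarize are assumptions made for the example and do not come from the chapter.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, monthly_kwh); None marks a missing reading.
raw = [("a", 310.0), ("b", None), ("c", 295.5), ("d", -1.0), ("e", 402.5)]

def clean(records):
    """Cleaning step: keep only records with a present, non-negative reading."""
    return [(cid, kwh) for cid, kwh in records if kwh is not None and kwh >= 0]

def summarize(records):
    """Exploratory step: count, minimum, maximum, and mean of the readings."""
    values = [kwh for _, kwh in records]
    return {"n": len(values), "min": min(values), "max": max(values), "mean": mean(values)}

cleaned = clean(raw)
summary = summarize(cleaned)
print(summary)
```

Changing the filter condition in clean corresponds to the kind of parameter or dataset adjustment that, as noted above, an expert would make after inspecting the first results.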
From a technical point of view, data science consists of a set of tools and techniques that
deal with various goals corresponding to multiple situations. Some of the methods used
recently are clustering, classification, deep learning, regression, association rule mining,
and time-series analysis. These methods are often used in text mining and other areas;
anomaly detection and sequence analysis can also provide excellent results for text mining
problems.

1.2.1 Classification
Classification assigns each object in a set to a class based on its attributes. Decision trees
(DT) are used to perform and visualize that classification, as described by Wu et al. [3].
DTs may be generated using various algorithms, such as ID3, CLS, CART, C4.5, and C5.0.
Random forest (RF) is another classifier; it constructs a set of DTs and then predicts by
aggregating the values generated by each DT. A classification model can also be built with
a technique known as Least Squares Support Vector Machine (LS-SVM), which separates
the dataset into the target classes by using a hyperplane in a multidimensional space [4].
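As a rough illustration of the tree-based classifiers mentioned above, the following Python sketch fits a one-level decision tree (a decision stump) by minimizing Gini impurity and then combines the predictions of several stumps by majority vote, which is the aggregation step a random forest uses. It is a simplified sketch, not ID3, CART, or a full random forest (there is no bootstrap sampling or random feature selection), and the data and the two extra stumps are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels):
    """Return (feature, threshold, left_label, right_label) with the lowest weighted Gini."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send objects to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold, majority(left), majority(right))
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]

def predict_stump(stump, row):
    """Classify one object with a fitted stump."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

# Hypothetical training data: two numeric attributes per object.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 6.0), (8.0, 5.5), (9.0, 4.5), (10.0, 6.5)]
labels = ["low", "low", "low", "high", "high", "high"]

stump = fit_stump(rows, labels)

# Forest-style aggregation: majority vote over several (here partly hand-made) stumps.
stumps = [stump, (0, 5.0, "low", "high"), (1, 5.0, "low", "high")]
vote = majority([predict_stump(s, (2.5, 6.0)) for s in stumps])
print(stump, vote)
```

On this toy data the first attribute separates the two classes perfectly, so the fitted stump splits on it; the vote shows how one dissenting tree is overruled by the other two.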

1.2.2 Regression
Regression analysis aims at the numerical estimation of the relationship between variables.
This involves estimating whether or not the variables are independent. If a variable is not
independent, then the first step is to determine the type of dependence. As described by
Chatterjee et al. [5], regression analysis is often used for prediction and forecasting, and
also to understand how the dependent variable changes for fixed values of the independent
variables.
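A minimal sketch of the simplest case, ordinary least-squares regression with one independent variable, is given below. The function name fit_line and the sample values are assumptions made for the example; the formulas are the standard closed-form least-squares estimates of slope and intercept.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
forecast = slope * 5.0 + intercept  # prediction for a new value of x
print(slope, intercept, forecast)
```

The fitted slope quantifies the dependence between the two variables, and the last line shows the forecasting use mentioned above.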

1.2.3 Deep Learning


In deep learning, neural networks with many hidden layers are used to build a deep
representation of the input, for example an image, in order to predict accurately. The early
layers learn to detect low-level features, such as edges, and each further layer combines
the features of the previous layer to represent the input better. Fischer and Krauss [6]
applied long short-term memory (LSTM) networks to forecasting out-of-sample directional
movements in the stock market. A comparative study with a deep neural network (DNN),
a random forest (RF), and a logistic regression classifier (LOG) demonstrates that the LSTM
model outperforms the others. Tamura et al. [7] have proposed a two-dimensional approach
for predicting stock values. In this model, technical and financial indexes related to the
Japanese stock market are used as input data for the LSTM. Using this data, the financial
statements of other companies have been retrieved and are also added to the database.

1.2.4 Clustering
Jain et al. reviewed clustering methods, which group objects by their degree of
similarity [8]. In clustering, the objects are separated into groups called clusters. This type
of learning is called unsupervised learning, as there is no prior knowledge of the classes or
of which group each object belongs to. Depending on the similarity criterion, cluster
analysis has various models: (i) connectivity models, which are based on connectivity
distance, e.g., hierarchical clustering; (ii) centroid models, which assign each object to the
nearest cluster center, e.g., k-means; (iii) distribution models, which are built from
statistical distributions, e.g., the expectation-maximization algorithm; (iv) density models,
which define clusters as high-density areas in the data; and (v) graph-based models, which
use graphs to express the dataset.
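The centroid model in item (ii) can be sketched as a plain k-means loop. This is a minimal sketch under simplifying assumptions: the initial centroids are simply the first k points (practical implementations usually choose them randomly or with a seeding heuristic), the number of iterations is fixed, and the sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Basic k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # simplifying assumption: first k points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the centroid with the smallest squared Euclidean distance to p.
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster (kept unchanged if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical two-dimensional objects forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No class labels are supplied at any point, which is what makes the procedure unsupervised.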

1.2.5 Association Rules


Association rules are suitable tools to represent new information that has been extracted
from a raw dataset. These rules are expressed as implication rules that support decision
making, per Verma et al. [9]. They identify attributes that occur together frequently and
with high reliability in databases. A typical example is an association rule derived from
the database of a supermarket. Although algorithms such as ECLAT and FP-Growth are
available for large datasets, the Apriori algorithm, the generalized rule induction algorithm,
and their adaptations are often used, per Tan et al. [10].

FIGURE 1.2
Data Science Techniques
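The frequency and reliability of an implication rule are usually measured by its support and confidence. The sketch below computes both for a toy supermarket basket dataset; the transactions and item names are hypothetical and are not taken from the chapter, and the code evaluates a single given rule rather than mining rules as Apriori or FP-Growth would.

```python
# Hypothetical supermarket transactions (each one is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule bread -> milk holds in two of four transactions (support 0.5) and in two of the three transactions that contain bread (confidence about 0.67).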

1.2.6 Time Series Analysis


Das describes time-series analysis, in which data collected over time is used to build a
model, and the model is then used to predict future values of the time series [11]. The
methods often used are the following: (i) techniques for exploratory analysis, for example
wavelets, trend analysis, and autocorrelation; (ii) forecasting and prediction methods, for
example signal estimation and regression methods; (iii) classification techniques that
assign a category to patterns related to the series; and (iv) segmentation, which aims to
identify a sequence of points that share particular properties. Hüllermeier developed a
fuzzy extension that allows for processing uncertain and imprecise data related to different
domains [12]. Bezdek et al. have proposed the fuzzy c-means (FCM) method, a clustering
technique that has given efficient results in different scenarios because it permits the
assignment of each data element to one or more clusters [13]. Figure 1.2 shows different
types of techniques used in data science and its applications.
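Two of the exploratory techniques listed in item (i), trend analysis and autocorrelation, can be sketched as follows. The moving average is used here as one simple way to expose a trend, and the autocorrelation uses the common sample estimate normalized by the total variance; the function names and the sample series are assumptions made for the example.

```python
def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window for i in range(len(series) - window + 1)]

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return covariance_sum / variance_sum

# Hypothetical series: a steady upward trend and a strictly alternating signal.
trend = moving_average([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=3)
alternating = autocorrelation([1.0, -1.0, 1.0, -1.0], lag=1)
print(trend, alternating)
```

The strongly negative lag-1 autocorrelation of the alternating series reflects that each value tends to be followed by its opposite.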

1.3 Applications of Data Science in Various Domains


Data science has gained popularity out of necessity, driven by real-world applications
rather than by its role as a research domain. Its application began in the narrow field of
analytics and statistics and has since spread to different areas of industry and science.
Consequently, this section explains data science applications in the following areas:
(i) economic analysis of electric consumption, (ii) stock market prediction,
(iii) bioinformatics, (iv) social media analytics, (v) email mining, (vi) big data analysis, and
(vii) SMS mining, among others.


1.3.1 Economic Analysis of Electric Consumption


Electric companies and utilities have turned to data science to understand when and how
consumers use energy. Competition has increased among companies that use data science
to develop such information. Traditionally, this information has been determined via
classification, clustering, and pattern analysis methods using association rules. Chicco et al.
have grouped consumers into classes based on their behavior and usage of electricity [14].
A comparative evaluation was made with self-organizing maps and an improved version
of the follow-the-leader method. This was a first step toward setting tariffs for the electrical
utilities. Figueiredo et al. [15] have developed a framework for exploiting historical data,
which consists of two modules: (i) a load-profile module, which creates a set of customer
classes by using unsupervised and supervised learning, and (ii) a classification module,
which builds models for assigning customers to their respective classes.

1.3.2 Stock Market Prediction


The application of ML and DL techniques to the stock market is increasing compared to
other areas of economics. Although investing in the stock market can yield profits, high
benefits often come with high risk. Investors therefore try to estimate the value of a stock
before they invest. The price of a stock varies with factors like local politics and the
economy, which makes future trends of the stock market difficult to identify. Fischer and
Krauss [6] used LSTM to forecast future trends in the stock market. The results were
compared with LOG, DNN, and RF, and the LSTM showed improved results over the
others. Tamura et al. [7] have proposed a new method for predicting the values of stocks.
Here, financial data related to the stock market of Japan has been used as prediction input
for long short-term memory (LSTM) networks. Further, the financial statements of the
companies are retrieved and then added to the database. Sharaff and Srinivasarao [16]
proposed a Linear Support Vector Machine (LSVM) to identify the correlation among the
words in the content and subject of emails.

1.3.3 Bioinformatics
Bioinformatics is a new area that uses computers to understand biological data, such as
genomic and genetic data. This helps scientists understand the causes of disease,
physiological properties, and genetic properties. Baldi et al. [17] utilized various techniques
to estimate the applicability and efficiency of different predictive methods in the
classification task. Earlier error estimation techniques focused primarily on supervised
learning using microarray data. Michiels et al. [18] have used various random datasets to
predict cancer using microarray data. Ambroise et al. [19] solved a gene selection problem
based on microarray data. Here, 10-fold cross-validation and 0.632 bootstrap error
estimates are used to deal with prediction rules that are overfitted. The accuracy of 0.632
bootstrap estimators for microarray classification using small datasets is examined in Braga
et al. [20].
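The two error-estimation ideas named above can be sketched briefly. The first function produces the train/test index splits for k-fold cross-validation; the second combines a resubstitution (training) error with an out-of-bag bootstrap error using the standard 0.632 weighting. The function names and the example error values are assumptions for illustration and are not results from the cited studies.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def bootstrap_632(resubstitution_error, out_of_bag_error):
    """0.632 bootstrap estimate: weighted mix of training error and out-of-bag error."""
    return 0.368 * resubstitution_error + 0.632 * out_of_bag_error

splits = list(k_fold_indices(n=20, k=10))      # 10-fold split of 20 hypothetical samples
estimate = bootstrap_632(0.0, 0.5)             # hypothetical error rates
print(len(splits), splits[0], estimate)
```

An overfitted rule typically has a very low resubstitution error, so weighting it against the out-of-bag error gives a less optimistic estimate.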

1.3.4 Social Media Analytics


Joshi and Deshpande [21] have used Twitter data to classify the sentiments included in
tweets. They applied various machine learning methods to do so, and carried out a
comparative study using maximum entropy, naïve Bayes, and positive-negative word
counting. Wolny [22] proposed a model to recognize emotion in Twitter data and
performed an emotion analysis study. Here, feelings and sentiments were discussed in
detail by explaining the existing methods.
Emotion and sentiment are classified based on symbols via an unsupervised classifier, and
the lexicon is explained along with suggestions for future research. Coviello et al. [23] have
analyzed emotional contagion in Facebook data using the instrumental variable regression
technique. Here, people's negative and positive emotions during rainy days were detected.
Roelens et al. [24] explained that detecting the people who influence social networks is a
difficult research task, but one of great interest, because it allows referral marketing and
information about products to reach the largest possible network.
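The positive-negative word counting baseline mentioned above can be sketched in a few lines. The two word lists are tiny hypothetical lexicons chosen for the example; they are not the lexicons used in the cited studies, and a real system would need a far larger vocabulary and handling of negation.

```python
# Hypothetical miniature sentiment lexicons.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "sad", "hate", "awful"}

def classify(text):
    """Label a text by comparing its counts of positive and negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great phone"))
print(classify("What an awful, sad day"))
print(classify("The sky is blue"))
```

Such a counter needs no training data, which is why it serves as a simple reference point against learned classifiers such as naïve Bayes or maximum entropy.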

1.3.5 Email Mining


Spam emails are a threat to internet security. Spam emails are unwanted or unsolicited
emails. Mailboxes overload with these unwanted emails, and storage and bandwidth may
be lost, which favors the quick spread of wrong information and malicious data. Gudkova
et al. [25] conducted a study and reported that 56% of all emails are spam. Caruana and
Li [26] illustrated that machine learning methods are successful at detecting spam. These
methods learn classifier models, which map data into spam or ham classes by using
features like n-grams. Dada et al. [27] have demonstrated that email features may be
extracted either manually or automatically. Bhowmick and Hazarika [28] explained that
manual rule extraction is known as knowledge engineering, which requires expert input
and regular updates to maintain good accuracy. Text mining methods are used for the
automated extraction of useful features, like words that enable spam discrimination,
HTML markup, and so on. Using these features, an email is represented as a Bag-of-Words
(BoW), as proposed by Aggarwal [29]. Here, unstructured word tokens are used to
discriminate spam messages from the others. BoW assumes that word tokens are
independent, which prevents it from capturing the semantic content of the email well.
Sharaff and Nagwani [30] have identified email threads using an LDA- and NMF-based
methodology.
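The Bag-of-Words representation and the n-gram features described above can be sketched as follows. The tokenizer is deliberately naive (whitespace splitting plus punctuation stripping) and the example message is hypothetical; both are assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase word tokens with surrounding punctuation removed."""
    tokens = (w.strip(".,!?:;").lower() for w in text.split())
    return [t for t in tokens if t]

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, n=1):
    """Count each n-gram in the text; word order beyond n tokens is discarded."""
    return Counter(ngrams(tokenize(text), n))

message = "Win money now! Win big."
unigrams = bag_of_words(message)
bigrams = bag_of_words(message, n=2)
print(unigrams)
print(bigrams)
```

The unigram counts ignore word order entirely, which illustrates the independence assumption noted above; bigrams recover a small amount of local context.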

1.3.6 Big Data Analysis Mining Methods


Big data is one of the fastest-growing technologies, and handling it is critical in the present
era. The information is used in analytical studies that help drive decisions for providing
quick and improved services. Laney [31] proposed that big data has three characteristics:
velocity, volume, and variety. These are also called the 3Vs.
Chen et al. [32] explained that data mining is a procedure in which potentially useful,
unknown, and hidden meaningful information is extracted from noisy, random,
incomplete, and fuzzy data. The knowledge and information that has been extracted is
used to derive new insights, explain scientific events, and influence business and scientific
discovery, per Liu [33].
Two articles have aimed at improving the accuracy of data mining. Han et al. [34] have
proposed a new model using the skyline algorithm. Here, a sorted positional index list
(SSPL), which has low space overhead, has been used to reduce the input/output cost.
Table 1.1 shows an overview of data science methods used in different applications.


TABLE 1.1
An overview of data science methods used in different applications

S.no  Applications            Methods                                                 Source
1     Economic analysis       Follow-the-Leader Clustering (FLC)                      Chicco et al. [14]
                              K-Means                                                 Figueiredo et al. [15]
2     Stock Market            Long Short-Term Memory (LSTM)                           Fischer and Krauss [6]
                                                                                      Tamura et al. [7]
3     Bioinformatics          Gradient Descent Learning (GDL)                         Baldi et al. [17]
                              k-nearest-neighbors (K-NN)                              Michiels et al. [18]
                              support vector machine (SVM)                            Ambroise et al. [19]
4     Social Media analytics  Naive Bayes (NB) and Maximum Entropy Algorithms (MEA)   Joshi and Deshpande [21]
                              Lexicon Based Approach (LBA)                            Wolny [22]
                              Regression Methods (RM)                                 Coviello et al. [23]
5     Email Mining            Machine and Non-Machine Learning Methods (NMLM)         Caruana and Li [26]
                              Deep Learning Methods (DLM)                             Dada et al. [27]
                              Machine Learning Techniques (MLT)                       Bhowmick and Hazarika [28]
                              Latent Dirichlet Allocation and                         Sharaff and Nagwani [30]
                              Non-Negative Matrix Factorization (NNMF)
6     Big Data Analysis       Fuzzy Clustering (FC)                                   Chen et al. [32]
                              Data Mining Methods (DMM)                               Liu [33]
                              Skyline Algorithm (SA)                                  Han et al. [34]
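To make the skyline query behind the Skyline Algorithm entry concrete, the sketch below returns the points that are not dominated by any other point. It is a naive quadratic-time version for illustration only, not the index-based SSPL method of Han et al. [34]; the convention that smaller values are better and the sample points are assumptions made for the example.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better in at least one.
    Smaller values are treated as better (an assumption of this sketch)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Naive skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (price, distance) pairs; lower is better in both dimensions.
points = [(1, 9), (3, 3), (9, 1), (5, 5), (4, 8)]
result = skyline(points)
print(result)
```

Every pair of points is compared here, so the cost grows quadratically with the dataset size; index structures such as the one cited above aim to avoid most of those comparisons and the associated input/output.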

1.4 Challenges and Opportunities


This section summarizes the key issues, challenges, and opportunities that are related to
data science in different fields.

1.4.1 Challenges in Mathematical and Statistical Foundations


The main challenge in the mathematical and statistical fields is to find out why the existing
theoretical foundations are not enough to solve complex problems, and then to identify
and obtain a helpful action plan.

1.4.2 Challenges in Social Issues


In social contexts, the challenges are to identify, specify, and respect social issues.
Domain-specific data must be selected, and related concerns such as business, security,
protection, and privacy should be handled accurately.

1.4.3 Data-to-Decision and Actions


It is important to develop accurate decision-making systems that are data-driven. It is
equally important to be able to manage and govern these decision-making systems.


1.4.4 Data Storage and Management Systems


One of the challenges is designing a good storage and management system that can handle
large amounts of data at stream speed in real time, and that can manage such data in an
Internet-based environment, including the cloud.

1.4.5 Data Quality Enhancement


Another important challenge is data quality, which covers issues such as uncertainty,
noise, imbalance, and so on. The extent of these issues varies with the complexity of the
data.

1.4.6 Deep Analytics and Discovery


Cao [35] called for new algorithms to deal with deep and implicit analytics that cannot be
tackled with existing descriptive, latent, and predictive learning. A further challenge is
how to combine model-based and data-driven problem solving so as to balance
domain-specific data complexity, intelligence-driven evidence learning, and common
learning frameworks.

1.4.7 High-Performance Processing and Analytics


Systems must handle online, real-time, Internet-based, large-scale, high-frequency data
analytics and processing with balanced involvement of local and global resources. This
requires new disk array storage, batch processing, and high-performance parallel
processing. It is also necessary to support complex matrix calculations, data-to-knowledge
management, mixed data structures, and management systems.

1.4.8 Networking, Communication, and Interoperation


The challenge is how to support interoperation, communication, and networking between
the various data science roles across the distributed and complete cycle of problem solving
in data science. Here, it is necessary to coordinate the management of tasks, data,
workflows, control, task scheduling, and governance.

1.5 Tools for Data Scientists


This section presents the tools required for data scientists to address the aspects discussed
above. These tools are classified as data and application integration, cloud infrastructure,
programming, visualization, high-performance processing, analytics, master data man-
agement, business intelligence reporting, data preparation and processing, and project
management. The researcher can use any number of tools depending upon the complexity
of the problem being solved.


1.5.1 Cloud Infrastructure


Systems such as MapR, Google Cloud Platform, Amazon Web Services, Cloudera, Spark,
Apache Hadoop, and others may be used. Most traditional IT vendors now use cloud
platforms.

1.5.2 Data/Application Integration


These tools include CloverETL, Information Builders, Syncsort DMExpress, Oracle Data
Integrator, Informatica, Ab Initio, and so on.

1.5.3 Master Data Management


Master data management tools include SAP NetWeaver Master Data Management, Black
Watch Data, Microsoft Master Data Services, Informatica MDM, TIBCO MDM, Teradata
Warehousing, and so on.

1.5.4 Data Preparation and Processing


Stodder and Matters [36] have used some platforms and data preparation tools like
Wrangler Enterprise and Wrangler, Alpine Chorus, IBM SPSS, Teradata Loom, Platfora,
and so on.

1.5.5 Analytics
Analytics includes commercial tools like RapidMiner [37], MATLAB, IBM SPSS Modeler
and SPSS Statistics, SAS Enterprise Miner, and so on, in addition to some newer tools, like
Google Cloud Prediction API, MLbase, BigML [38], DataRobot, and others.

1.5.6 Visualization
Commercial and free visualization software listed in KDnuggets [39] includes Miner3D,
IRIS Explorer, Interactive Data Language, Quadrigram, ScienceGL, and so on.

1.5.7 Programming
Additionally, the Java, Python, SQL, SAS, and R languages have been used for data analytics.
Some data scientists also use Go, Ruby, .NET, and JavaScript [40].

1.5.8 High-Performance Processing


Around 40 computer cluster software programs, like Platform Cluster Manager, Moab
Cluster Suite, Stacki, and others, have been listed in Wikipedia [41].

1.5.9 Business Intelligence Reporting


Some of the reporting tools [42] commonly used are SAP Crystal Reports, SAS Business
Intelligence, MicroStrategy, and IBM Cognos, among others.


FIGURE 1.3
Data Science Programming Models

1.5.10 Social Network Analysis


Around 30 tools have been listed for social network analysis and for helping to visualize
data, for example EgoNet, Cuttlefish, Commetrix, Keynetiq, NodeXL, and so on [43].
Figure 1.3 shows the different types of programming languages that are used in data science.

1.6 Conclusion
This chapter has surveyed the modern advances in information technology, and the influ-
ence these advances have had on big data analytics and its applications. The effectiveness
of different data science algorithms that can be applied to solve the challenges in big data
has been examined. Data science algorithms will be extensively used in the future to
address the problems and challenges in big data applications.
In various areas, the discovery and exploitation of meaningful insights from datasets will
be very much required. Big data applications are necessary in different fields like industry,
government, and so on. This new perspective will challenge research groups to develop
better solutions for managing large, heterogeneous amounts of real-time data and for
dealing with the uncertainty associated with it. Data science techniques provide important
tools that can extract and exploit the information and knowledge that exists in the user
dataset. In the coming days, big data techniques will increase these possibilities, and may
also democratize them.

References
1. Torabi, M., Hashemi, S., Saybani, M. R., Shamshirband, S., & Mosavi, A. (2019). A Hybrid clus-
tering and classification technique for forecasting short-term energy consumption. Environmental
Progress & Sustainable Energy, 38(1), 66–76.


2. Mosavi, A., & Edalatifar, M. (2018). A hybrid neuro-fuzzy algorithm for prediction of reference
evapotranspiration. In International conference on global research and education (pp. 235–243).
Cham: Springer.
3. Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., … & Zhou, Z. H. (2008). Top
10 algorithms in data mining. Knowledge and information systems, 14(1), 1–37.
4. Suykens, J. A., Van Gestel, T., & De Brabanter, J. (2002). Least squares support vector machines.
World Scientific.
5. Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: John
Wiley & Sons Inc.
6. Fischer, T., & Krauss, C. (2018). Deep learning with long short-term memory networks for
financial market predictions. European Journal of Operational Research, 270(2), 654–669.
7. Tamura, K., Uenoyama, K., Iitsuka, S., & Matsuo, Y. (2018). Model for evaluation of stock values
by ensemble model using deep learning.
8. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys
(CSUR), 31(3), 264–323.
9. Verma, M., Srivastava, M., Chack, N., Diswar, A. K., & Gupta, N. (2012). A comparative study
of various clustering algorithms in data mining. International Journal of Engineering Research and
Applications (IJERA), 2(3), 1379–1384.
10. Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Delhi: Pearson
Education India.
11. Das, S. (1994). Time series analysis. (Vol 10). Princeton, NJ: Princeton University Press.
12. Hüllermeier, E. (2005). Fuzzy methods in machine learning and data mining: status and pros-
pects. Fuzzy Sets and Systems, 156(3), 387–406.
13. Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: the fuzzy c-means clustering algorithm.
Computers & Geosciences, 10(2–3), 191–203.
14. Chicco, G., Napoli, R., Piglione, F., Postolache, P., Scutariu, M., & Toader, C. (2004). Load pat-
tern-based classification of electricity customers. IEEE Transactions on Power Systems, 19(2),
1232–1239.
15. Figueiredo, V., Rodrigues, F., Vale, Z., & Gouveia, J. B. (2005). An electric energy consumer
characterization framework based on data mining techniques. IEEE Transactions on power sys-
tems, 20(2), 596–602.
16. Sharaff, A., & Srinivasarao, U. (2020). Towards classification of email through selection of infor-
mative features. In 2020 First International Conference on Power, Control and Computing Technologies
(ICPC2T) (pp. 316–320). IEEE.
17. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., & Nielsen, H. (2000). Assessing the accuracy
of prediction algorithms for classification: an overview. Bioinformatics, 16(5), 412–424.
18. Michiels, S., Koscielny, S., & Hill, C. (2005). Prediction of cancer outcome with microarrays: a
multiple random validation strategy. The Lancet, 365(9458), 488–492.
19. Ambroise, C., & McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of micro-
array gene-expression data. Proceedings of the national academy of sciences, 99(10), 6562–6566.
20. Braga-Neto, U. M., & Dougherty, E. R. (2004). Is cross-validation valid for small-sample micro-
array classification? Bioinformatics, 20(3), 374–380.
21. Joshi, S., & Deshpande, D. (2018). Twitter sentiment analysis system. International Journal of
Computer Applications, 180(47), 0975–8887.
22. Wolny, W. (2016). Emotion analysis of twitter data that use emoticons and emoji ideograms.
23. Coviello, L., Sohn, Y., Kramer, A. D., Marlow, C., Franceschetti, M., Christakis, N. A., & Fowler,
J. H. (2014). Detecting emotional contagion in massive social networks. PloS one, 9(3), e90315.
24. Roelens, I., Baecke, P., & Benoit, D. F. (2016). Identifying influencers in a social network: the
value of real referral data. Decision Support Systems, 91, 25–36.
25. Gudkova, D., Vergelis, M., Demidova, N., & Shcherbakova, T. (2017). Spam and phishing in
Q2 2017. Securelist, Spam and phishing reports.
https://securelist.com/spam-and-phishing-in-q2-2017/81537/.


26. Caruana, G., & Li, M. (2008). A survey of emerging approaches to spam filtering. ACM
Computing Surveys (CSUR), 44(2), 1–27.
27. Dada, E. G., Bassi, J. S., Chiroma, H., Adetunmbi, A. O., & Ajibuwa, O. E. (2019). Machine learn-
ing for email spam filtering: review, approaches and open research problems. Heliyon, 5(6),
e01802.
28. Bhowmick, A., & Hazarika, S. M. (2016). Machine learning for e-mail spam filtering: review,
techniques and trends. arXiv preprint arXiv:1606.01042.
29. Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.
30. Sharaff, A., & Nagwani, N. K. (2016). Email thread identification using latent Dirichlet alloca-
tion and non-negative matrix factorization based clustering techniques. Journal of Information
Science, 42(2), 200–212.
31. Laney, D. (2001). 3D data management: controlling data volume, velocity and variety. META
group research note, 6(70), 1.
32. Chen, M. M. S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19,
171–209.
33. Liu, L. (2013). Computing infrastructure for big data processing. Frontiers of Computer Science,
7(2), 165–170.
34. Han, X., Li, J., Yang, D., & Wang, J. (2012). Efficient skyline computation on big data. IEEE
Transactions on Knowledge and Data Engineering, 25(11), 2521–2535.
35. Cao, L. (2017). Data science: challenges and directions. Communications of the ACM, 60(8),
59–68.
36. Stodder, D., & Matters, W. D. P. (2016). Improving data preparation for business analytics.
Applying technologies and methods for establishing trusted data assets for more productive
users. Best Practices Report Q, 3(2016), 19–21.
37. RapidMiner. (2016). RapidMiner. Retrieved from https://rapidminer.com/.
38. BigML. (2016). BigML. Retrieved from https://bigml.com/.
39. KDnuggets. (2015). Visualization Software. Retrieved from
http://www.kdnuggets.com/software/visualization.html.
40. Davis, J. (2016). 10 Programming Languages And Tools Data Scientists Used.
41. Wikipedia. (2016). Comparison of Cluster Software. Retrieved from
https://en.wikipedia.org/wiki/Comparison_of_cluster_software.
42. Capterra. (2016). Top Reporting Software Products. Retrieved from
http://www.capterra.com/reporting-software/.
43. Desale, D. (2015). Top 30 Social Network Analysis and Visualization Tools. KDnuggets.
https://www.kdnuggets.com/2015/06/top-30-social-network-analysis-visualization-tools.html.


2
Recommender Systems: Challenges and
Opportunities in the Age of Big Data and Artificial
Intelligence

Mehdi Elahi
University of Bergen, Bergen, Norway

Amin Beheshti and Srinivasa Reddy Goluguri


Macquarie University, Sydney, Australia

CONTENTS
2.1 Introduction
2.2 Methods
2.2.1 Classical
2.2.2 Collaborative Filtering
2.2.3 Content-Based Recommendation
2.2.4 Hybrid FM
2.2.5 Modern Recommender Systems
2.2.6 Data-Driven Recommendations
2.2.7 Knowledge-Driven Recommendations
2.2.8 Cognition-Driven Recommendations
2.3 Application
2.3.1 Classic
2.3.1.1 Multimedia
2.3.1.2 Tourism
2.3.1.3 Food
2.3.1.4 Fashion
2.3.2 Modern
2.3.2.1 Financial Technology (Fintech)
2.3.2.2 Education
2.3.2.3 Recruitment
2.4 Challenges
2.4.1 Cold Start
2.4.2 Context Awareness
2.4.3 Style Awareness
2.5 Advanced Topics
2.5.1 AI-Enabled Recommendations
2.5.2 Cognition Aware
2.5.3 Intelligent Personalization
2.5.4 Intelligent Ranking
2.5.5 Intelligent Customer Engagement
2.6 Conclusion
References

2.1 Introduction
In the era of big data, choosing the right products is a challenge for consumers due to
the massive volume, velocity, and variety of related data produced online. Because of this,
users feel increasingly overwhelmed when choosing among a seemingly unlimited set
of options. Recommender systems are support tools that can deal with this challenge by
assisting shoppers in deciding what to purchase (Jannach, Zanker, Felfernig, and
Friedrich, 2010; Resnick and Varian, 1997; Ricci, Rokach, and Shapira, 2015). Recommender
systems can learn from the particular preferences and tastes of users and build personalized
suggestions tailored to users’ preferences and needs rather than offering suggestions
based on mainstream taste (Elahi, 2011; Elahi, Repsys, and Ricci, 2011).
Many recommender software options and algorithms have been proposed so far
by the academic and industrial communities. Most of these algorithms can take
input data of various types and exploit them to generate recommendations.
These data types can describe either the item content (e.g., category,
brand, and tags) or the user preferences (e.g., ratings, likes, and clicks). The data is
collected, pre-processed, and cleaned, and then exploited to build a model in which the items
are represented as arrays of features. A recommendation list for a specific user is then made
by filtering the items whose features are similar to those of the items that the user liked or
rated highly.
Enhanced capabilities of recommender techniques in understanding the varied categories
of user tastes and in precisely tackling information overload have enabled them to become
an important part of any online shop that has to cope with an expanding item catalogue
(Burke, 2002; Elahi, 2014). Diverse categories of recommender engines have been built in
order to generate personalized selections and relevant recommendations of products and
services ranging from clothing and outfits to movies and music. Such personalized selections
and suggestions are usually made based on the big data of a huge community of connected
users, and by computing the patterns and relationships among their preferences (Chao,
Huiskes, Gritti, and Ciuhu, 2009; Elahi, 2011; Elahi and Qi, 2020; He and McAuley, 2016;
Nguyen, Almenningen, Havig, Schistad, Kofod-Petersen, Langseth, and Ramampiaro,
2014; Quanping, 2015; Tu and Dong, 2010). The strong performance of recommender
systems has been validated in a diverse range of e-commerce applications where a
choice support mechanism is necessary to handle customers’ needs and help them when
interacting with online shops. Such assistance improves the user experience
when shopping or browsing the system catalogue (He and McAuley, 2016; Tu and Dong,
2010).
In this chapter, we provide an outline of different types of real-world recommender
systems, along with challenges and opportunities in the age of big data and AI. We
discuss the progress in cognitive technology, in addition to evolutionary development in
areas such as AI (with all relevant disciplines such as machine learning (ML), deep
learning (DL), and natural language processing (NLP)), knowledge representation (KR),
and human-computer interaction (HCI), and how they can empower recommender systems to
effectively support their users.
We argue that modern recommendation systems require access to, and the ability to
understand, big data in all its different forms, and that big data generated on data islands
can be used to build relevant and personalized recommendations tailored to each customer’s
needs and preferences. We present different application scenarios (including multimedia,
fashion, tourism, banking, and education) and review potential solutions for
recommendation. The remainder of the chapter is organized as follows: Section 2.2 briefly
describes popular methods and algorithms. Section 2.3 discusses different application
scenarios, and Section 2.4 reviews real-world challenges and potential solutions. Section 2.5
extends the previous sections with some advanced topics. Finally, in Section 2.6,
we conclude the chapter.

2.2 Methods
2.2.1 Classical
Diverse recommendation approaches have already been developed and tested, which can
be classified into a number of categories. A well-adopted category of methods is called
content-based (Pazzani and Billsus, 2007). Methods within this category suggest items
based on their descriptors (Balabanović and Shoham, 1997). For example, book recommender
systems take terms within the text of a book as descriptors and suggest to the user
other books whose descriptors are similar to those of books the user liked in the past.
Another popular category is collaborative filtering (Desrosiers and Karypis, 2011; Koren and
Bell, 2011). Collaborative filtering methods predict the preferences (i.e., ratings) of users by
learning from the preferences that a set of users provided for items, and suggest to users
those items with the highest predicted preferences. Methods within the demographic (Wang,
Chan, and Ngai, 2012) category generate recommendations by identifying similar users
based on their demographics (Pazzani, 1999). These methods attempt to group
existing users by their personal descriptors and make relevant suggestions based on their
demographic descriptions. Knowledge-based (Felfernig and Burke, 2008) methods are
another category that tries to suggest items inferred from the needs and constraints
entered by users (Burke, 2000). Knowledge-based methods are distinguished by their
knowledge about how a specific item fulfills a particular user’s needs (Claypool, Gokhale,
Miranda, Murnikov, Netes, and Sartin, 1999). Hence, these methods can draw inferences
based on the connections between the user’s needs and the possible recommendations.
Hybrid (Li and Kim, 2003) methods combine several of the individual methods noted earlier
in order to overcome the particular limitations of any individual method.

2.2.2 Collaborative Filtering


Collaborative filtering (CF) is a recommender method used in almost all application
domains. This method focuses on effective use of the user feedback (e.g., ratings)
elicited from the users to build a profile of affinities. Such profiles are used to generate
personalized recommendations. Hence, collaborative filtering relies on big data comprising
ratings acquired from a typically large network of users (Desrosiers and Karypis, 2011).
Using such data, collaborative filtering recommends items that a target user has not yet
checked, but would probably like (Koren and Bell, 2011). A cornerstone of these
systems is the ability to estimate the feedback (or ratings) that users would give to items
that they have not yet rated. Given the predicted ratings, collaborative
filtering can sort the items and recommend those with the
highest predicted ratings.
Classical methods in collaborative filtering systems are neighbor-based: they compute
user-to-user or item-to-item similarities based on the co-rating patterns of the users and
items. In item-based collaborative filtering, items are considered similar if the community
of interconnected users has rated those items in a similar way. Analogously, in user-based
collaborative filtering, users with similar rating patterns form neighborhoods that
are used for rating prediction. Hence, rating predictions are based on how the
item has been rated by other users who are considered like-minded to
the target user.
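
The item-based variant described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy ratings, the use of cosine similarity over co-rating users, and the function names `item_vector`, `cosine`, and `predict` are all choices made for this example.

```python
from math import sqrt

# Toy user -> {item: rating} data; the values are illustrative only.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2},
    "cat": {"A": 1, "B": 2, "C": 5},
    "dan": {"A": 5, "C": 1},  # dan has not rated item B yet
}

def item_vector(item):
    """Ratings given to `item`, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity computed over the users who rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(user, target):
    """Similarity-weighted average of the user's ratings on other items."""
    num = den = 0.0
    for item, rating in ratings[user].items():
        if item == target:
            continue
        sim = cosine(item_vector(target), item_vector(item))
        num += sim * rating
        den += sim
    return num / den if den else None

predicted = predict("dan", "B")
print(predicted)
```

Because item B is co-rated more like item A than like item C, dan's high rating for A dominates the prediction. Practical systems usually restrict the sum to the k most similar items and often mean-center the ratings first; both steps are omitted here for brevity.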
Another category of collaborative filtering systems adopts latent factor models in order
to generate rating predictions. A well-adopted category of these methods is matrix
factorization (Koren, 2008b; Koren and Bell, 2011). Matrix factorization builds mathematical
models on top of ratings data and forms a vector of factors for each user and each item.
These vectors, all of equal length, are learned from the ratings elicited from users. Every
factor of an item vector represents the degree to which the item exhibits a particular latent
aspect of user preference. In the movie domain, as an example, item factors could be
interpreted as the genre of the movie, while user factors could describe the taste of the users
toward such genres.
In order to identify such factors, matrix factorization decomposes the rating matrix into
the product of two lower-rank matrices:

$$R \approx S M^{T} \tag{2.1}$$

where S is a |U| × F matrix of user factors and M is an |I| × F matrix of item factors, with |U| users, |I| items, and F latent factors.


A well-known implementation of matrix factorization (Timely Development, 2008) was
proposed as Funk-SVD (Funk, 2006) and makes predictions using this
formula:

$$\hat{r}_{ui} = \sum_{f=1}^{F} s_{uf}\, m_{if} \tag{2.2}$$

where $s_{uf}$ describes the level of user u’s preference toward factor f, and $m_{if}$
describes the strength of factor f in item i (Koren, 2008b).
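
To make Equations (2.1) and (2.2) concrete, the following sketch learns the factor matrices with stochastic gradient descent on the observed ratings, in the spirit of Funk-SVD. It is an illustration under assumptions rather than the original implementation: the toy ratings, the hyperparameters (`lr`, `reg`, `epochs`), and the names `train_mf` and `predict` are hypothetical, and the L2 regularization term is a common addition that the text above does not spell out.

```python
import random

def train_mf(ratings, n_users, n_items, n_factors=2,
             lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Learn user factors S and item factors M so that the dot product of
    S[u] and M[i] approximates every observed rating r_ui (Eq. 2.2)."""
    rng = random.Random(seed)
    S = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_users)]
    M = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(S[u][f] * M[i][f] for f in range(n_factors))
            err = r - pred
            for f in range(n_factors):
                s_uf, m_if = S[u][f], M[i][f]
                # Gradient step on the squared error with L2 regularization.
                S[u][f] += lr * (err * m_if - reg * s_uf)
                M[i][f] += lr * (err * s_uf - reg * m_if)
    return S, M

def predict(S, M, u, i):
    """Predicted rating: sum over factors f of s_uf * m_if."""
    return sum(s * m for s, m in zip(S[u], M[i]))

# Observed (user, item, rating) triples; all other cells of R are unknown.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
           (2, 0, 1), (2, 2, 5), (3, 1, 1), (3, 2, 4)]
S, M = train_mf(ratings, n_users=4, n_items=3)

# Estimate a rating that was never observed: user 0 on item 2.
print(predict(S, M, 0, 2))
```

Only observed ratings contribute to the updates, which is what lets the learned factors fill in the missing cells of the rating matrix.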

2.2.3 Content-Based Recommendation


Content-based methods are also widely adopted in recommender systems. They use
content-based filtering (CBF) algorithms to build user
profiles by associating user preferences with the item content (Deldjoo and Atani, 2016;
Deldjoo, Elahi, Cremonesi, Garzotto, Piazzolla, and Quadrana, 2016). As noted earlier,
user preferences are typically given as ratings of items, and item content can be
described with diverse forms of features. Content-based recommender systems exploit
such content features and build a vector space model on top of the content data (Pazzani
and Billsus, 2007). This model projects every item into a multi-dimensional space according
to its content features (Lops, De Gemmis, and Semeraro, 2011). Content-based
methods then compute, for each item, a relevance score that reflects how well its content
features match the user’s preferences.
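
As a rough sketch of the vector space model described above, the example below represents each item as a binary tag vector, builds a user profile as the rating-weighted sum of the vectors of rated items, and scores unseen items by cosine similarity to that profile. The item names, tags, and ratings are invented for illustration, and binary tag vectors are an assumption; real systems often use TF-IDF weights or learned embeddings instead.

```python
from math import sqrt

# Hypothetical catalogue: item name -> set of content tags.
items = {
    "space_thriller": {"sci-fi", "horror"},
    "android_noir": {"sci-fi", "noir"},
    "city_romance": {"romance", "comedy"},
    "bookshop_romance": {"romance", "comedy"},
}
vocab = sorted(set().union(*items.values()))

def vectorize(tags):
    """Binary vector over the tag vocabulary."""
    return [1.0 if tag in tags else 0.0 for tag in vocab]

item_vecs = {name: vectorize(tags) for name, tags in items.items()}

def build_profile(user_ratings):
    """Rating-weighted sum of the vectors of the items the user rated."""
    profile = [0.0] * len(vocab)
    for name, rating in user_ratings.items():
        for k, x in enumerate(item_vecs[name]):
            profile[k] += rating * x
    return profile

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

user_ratings = {"space_thriller": 5, "city_romance": 1}
profile = build_profile(user_ratings)

# Relevance score of every item the user has not rated yet.
scores = {name: cosine(profile, vec)
          for name, vec in item_vecs.items() if name not in user_ratings}
best = max(scores, key=scores.get)
print(best, scores)
```

The highly rated sci-fi item pulls the profile toward the "sci-fi" dimension, so the unseen sci-fi item receives the higher relevance score.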


So far, a diverse spectrum of CBF approaches has been formulated and tested in the
context of recommender systems. A well-adopted method is K-nearest neighbors (KNN),
which exploits similarities computed from item content and builds suggestions on top of
them. The similarity scores between item j and all other items allow us to build a set of
nearest-neighbor items (i.e., $NN_j$) containing the items with the highest similarity scores
to item j. Accordingly, the preferences (e.g., likes/dislikes or star ratings) that user i has
provided for the items within the nearest-neighbor set are then used to predict the
preference $\hat{r}_{ij}$ for user i and item j:

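$$\hat{r}_{ij} = \frac{\sum_{j' \in NN_j,\; r_{ij'} > 0} r_{ij'}\, s_{jj'}}{\sum_{j' \in NN_j,\; r_{ij'} > 0} s_{jj'}} \tag{2.3}$$

where $s_{jj'}$ is the similarity between items j and j′, and the condition $r_{ij'} > 0$ restricts
the sums to the observed elements of the preference matrix R, i.e., the neighbors that user i
has actually rated.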
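
Equation (2.3) is a similarity-weighted average, which the short sketch below computes directly. The similarity values and ratings are made up for illustration, and the function name `predict_knn` and the neighborhood size `k` are assumptions of this example; in practice the similarities would come from the item content features.

```python
def predict_knn(user_ratings, sims_to_j, k=3):
    """Predict a user's preference for item j from the k items most similar
    to j, using only the neighbors the user has rated (rating > 0)."""
    # NN_j: the k items with the highest similarity to item j.
    neighbors = sorted(sims_to_j, key=sims_to_j.get, reverse=True)[:k]
    num = den = 0.0
    for item in neighbors:
        rating = user_ratings.get(item, 0)
        if rating > 0:
            num += rating * sims_to_j[item]
            den += sims_to_j[item]
    return num / den if den > 0 else None

# Similarity of each candidate neighbor to the target item j (illustrative).
sims_to_j = {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.1}
# Ratings given by user i; item "b" is unrated and therefore skipped.
user_ratings = {"a": 5, "c": 2, "d": 1}

prediction = predict_knn(user_ratings, sims_to_j, k=3)
print(prediction)
```

With k = 3 the neighborhood is {a, b, c}; only a and c are rated, so the prediction is (5 × 0.9 + 2 × 0.3) / (0.9 + 0.3) = 4.25. The function returns `None` when the user has rated none of the neighbors, a case the formula leaves undefined.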

2.2.4 Hybrid FM
While the collaborative filtering method and content-based method have both been largely
adopted by the recommender system community, they have a number of restrictions.
These restrictions will be explained later on in this chapter. In order to address such restric-
tions, hybrid methods have been developed by hybridizing these methods (Low, Bickson,
Gonzalez, Guestrin, Kyrola, and Hellerstein, 2012). While hybrid methods can also have
diverse forms, we briefly introduce one of the most recent methods, called
factorization machines (Burke, 2002; Rendle, 2012).
Factorization machines is a recommender method formed by extending the classical
matrix factorization method (Turi, 2018). Factorization machines hybridize matrix
factorization by combining it with a well-known machine learning method named
support vector machines (SVM). This hybrid method enables factorization machines to
take advantage of not only the user preferences (e.g., ratings), but also item
descriptions, as well as any additional data attributed to users. This enables factorization
machines to use a wide range of data, typically referred to as side information: item
descriptors (e.g., category, title, or tag) as well as user attributes (e.g., demographics,
emotion, mood, and personality). Hence, factorization machines build mathematical models
on top of user ratings, as well as item descriptors or user attributes, in order to make
preference predictions (Rendle, 2012).
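Predicting the user preferences (e.g., likes and dislikes, or ratings) is conducted through
the following formula:

$$\hat{r}_{ij} = \mu + w_i + w_j + a^{T} x_i + b^{T} y_j + u_i^{T} v_j \tag{2.4}$$

where µ denotes the global bias, $w_i$ is the user weight, $w_j$ is the item weight, $x_i$ and
$y_j$ are the feature vectors of the user and the item, respectively, a and b are the weight
vectors for those features, and $u_i$ and $v_j$ are the latent factor vectors of the user and
the item.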
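
The scoring rule in Equation (2.4) can be evaluated term by term, as in the sketch below. The parameter values are hand-picked for illustration and were not learned from data; the function name `fm_score` and its argument names are hypothetical. Training a factorization machine, i.e., fitting these parameters, is outside the scope of this example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fm_score(mu, w_user, w_item, a, x_user, b, y_item, u_user, v_item):
    """Global bias + user and item weights + side-feature terms
    + latent-factor interaction, as in Equation (2.4)."""
    return (mu + w_user + w_item
            + dot(a, x_user)       # contribution of user side features
            + dot(b, y_item)       # contribution of item side features
            + dot(u_user, v_item))  # user-item latent interaction

score = fm_score(
    mu=3.0,                # global bias
    w_user=0.2,            # weight of user i
    w_item=-0.1,           # weight of item j
    a=[0.5, 0.0],          # weights for the user features
    x_user=[1.0, 0.0],     # e.g., a one-hot encoded age group
    b=[0.3],               # weights for the item features
    y_item=[1.0],          # e.g., an indicator for one category
    u_user=[0.4, -0.2],    # latent factors of user i
    v_item=[0.5, 0.5],     # latent factors of item j
)
print(score)
```

Setting the side-feature vectors to zero reduces the score to a biased matrix factorization prediction, which shows how the hybrid model extends the classical one.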
There are other advanced models (such as Mooney and Roy, 2000; Ahn, Brusilovsky, Grady,
He, and Syn, 2007) that go beyond traditional methods by building probabilistic models
based on the user or item input data. For instance, in Fernández-Tobías and Cantador
(2014) and Manzato (2013), a model called gSVD++ has been developed that can take
advantage of content data incorporated into MF (Koren, 2008a).


2.2.5 Modern Recommender Systems


Despite the effectiveness of the presented methods, in the age of big data and artificial
intelligence (AI), the need for more advanced methods has been a strong force to build a
modern generation of recommender engines. Several improvements in such recommenda-
tion engines have enabled them to make quick and accurate recommendations tailored to
each customer’s needs and preferences. To achieve this goal, modern recommendation
systems have focused on three main aspects: data, knowledge, and cognition.

2.2.6 Data-Driven Recommendations


Modern recommendation systems require access to and understanding of the raw data
generated on various data islands, including open/private/social data sources (Beheshti,
Benatallah, Sheng, and Schiliro, 2019). This is important because improvements in data
communication and processing enable access to big data, and will enable intelligent and
accurate recommendations. In this context, the main challenge in harnessing big data is the
ability to ingest and organize it (from various data islands) in a centralized
repository. The concept of a data lake presents a centralized repository in which to organize
the raw data generated on various data islands. Modern approaches, such as CoreDB
(Beheshti et al., 2017a), propose the notion of a data lake as a service to facilitate managing
and querying large amounts of information (from open, social, IoT, and private data
islands) and to enable analysts to deal with the variety of data and non-standard data
models. Figure 2.1 illustrates the CoreDB (data lake as a service) architecture.
To understand the raw data, it is necessary to leverage AI (artificial intelligence) and ML
(machine learning) technologies to contextualize the raw data, and ultimately improve the
accuracy of recommendations. This enables the adoption of popular recommender systems
and facilitates the journey from analytical models to deep learning models. The goal
here is to generate better predictions by improving correlations between features and
attributes. Hence, the concept of a knowledge lake has been introduced. Accordingly, data
curation services can be adopted, which will enable automatic transformation of the raw data
into curated data. Figure 2.2 illustrates the architecture of the knowledge lake.
As a motivating scenario, we may consider recommendations on social media, such as
Twitter. Modern recommender systems would need to understand the content and context
of tweets posted by social users. Considering a tweet as raw data, the curation services
(Beheshti, Tabebordbar, Benatallah, and Nouri, 2017b) in the knowledge lake would be
able to extract information (e.g., keyword, phrase, named entity, topic, sentiment, etc.)
from the text of the tweet or a URL in the tweet, and enrich it using external knowledge
sources and services. A contextualized tweet (as illustrated in Figure 2.3) conveys more
information than a raw tweet. For example, if we are able to extract “Barack Obama”
from the text of the tweet, understand that it is a named entity of type person, and link it to
the entity Barack Obama (i.e., the 44th president of the United States) in Wikidata, the
recommendation system will understand that this tweet may be related to the topic of politics.
Similarly, if the tweet contains a keyword related to health or mentions the World Health
Organization (WHO), the tweet would be classified as related to the topic of health.
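
The contextualization step in this scenario can be mimicked with a toy example. The sketch below is not the curation API of the cited systems: the `KNOWLEDGE_BASE` dictionary is a hypothetical stand-in for an external knowledge source such as Wikidata, entity extraction is reduced to case-insensitive substring matching, and the function name `contextualize` is invented for illustration.

```python
# Hypothetical stand-in for an external knowledge source such as Wikidata.
KNOWLEDGE_BASE = {
    "barack obama": {"type": "person", "topic": "politics"},
    "world health organization": {"type": "organization", "topic": "health"},
}

def contextualize(tweet):
    """Attach the entities found in the tweet, and the topics they imply,
    to the raw text. Matching is naive substring search (an assumption)."""
    text = tweet.lower()
    entities = [
        {"mention": name, **info}
        for name, info in KNOWLEDGE_BASE.items()
        if name in text
    ]
    topics = sorted({entity["topic"] for entity in entities})
    return {"text": tweet, "entities": entities, "topics": topics}

raw = "Barack Obama spoke about the World Health Organization today."
curated = contextualize(raw)
print(curated["topics"])
```

A recommender can then match the inferred topics against a user's interests instead of working with the raw text alone. A realistic pipeline would replace the substring search with named entity recognition and entity linking services.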

2.2.7 Knowledge-Driven Recommendations


Intelligent RSs learn from domain experts’ experience and knowledge in order to understand
the domain in which the items will be recommended (Beheshti, Yakhchi, Mousaeirad,
Ghafari, Goluguri, and Edrisi, 2020b). For example, a new line of research (Beheshti
et al., 2020b) uses crowdsourcing techniques to capture domain experts’ knowledge and
use it to provide accurate and personalized recommendations. Another line of work
leverages intelligent knowledge lakes (KLs) to address the following two challenges:
(i) The cold-start problem: intelligent knowledge lakes bring in informative
data from a crowd of people and use it to generate recommendations; (ii) Bias and
variance: intelligent knowledge lakes can guide recommender systems
to choose the best next steps by following the best practices learned from domain
experts. This is important, as features used for training recommenders may be gathered by
humans, which allows biases to get into the data preparation and training phases. To build
an intelligent KL, it is important to mimic domain experts’ knowledge. This can be done
using techniques such as collecting feedback, organizing interviews, and conducting surveys.
To achieve this goal, it is important to capture important events and entities (and the
relationships among them) that are happening in real time in various disciplines and fields,
such as education and fintech.

FIGURE 2.1
The data-lake-as-a-service architecture (CoreDB; Beheshti et al., 2017a).

FIGURE 2.2
CoreKG: knowledge-lake-as-a-service architecture (Beheshti et al., 2018).
[Figure not reproduced. Recoverable labels: CoreKG REST API; Security (authentication,
access control, data encryption, etc.); Query (SPARQL, SQL); Index & Search (full-text
search, Elastic); Apache Drill, Apache Phoenix; NoSQL databases (MongoDB, CouchDB,
HBase, Hive); relational databases (MySQL, PostgreSQL, SQL Server); CRUD (create, read,
update, delete); meta-data; tracing and provenance; contextualized item.]

2.2.8 Cognition-Driven Recommendations


To support accurate and intelligent recommendations, it is vital for a recommender system
to identify similar users based on their behavior, activities, and cognitive thinking.
Accordingly, a cognition-driven recommender system should facilitate understanding
users’ personalities, emotions, moods, and affinities over time. This task aims to empower
recommender models to exploit cognitive signals and neural data, as noted
in our previous work, Personality2Vec (Beheshti, Hashemi, Yakhchi, Motahari-Nezhad,
Ghafari, and Yang, 2020a), to design mechanisms for personalized task recommendations,
and to facilitate discovering meaningful patterns from users’ social behaviors. A cognitive
RS may focus on dimensions such as explicit and implicit behavioral
patterns (Beheshti et al., 2020b). Explicit patterns may include text-based methods,
location-based approaches, action-based methods, and feature-based methods. Implicit
patterns may focus on social-based features, trust-based features, and action-based features.
Sequential recommender systems (Wang, Hu, Wang, Cao, Sheng, and Orgun, 2019) aim
to understand and model user behaviors; however, they do not consider the analysis of
users’ attitudes, behaviors, and personalities over time. Recent work introduces a new type
of recommender system, cognitive recommender systems (Beheshti et al., 2020b), which
focuses on understanding the users’ cognitive aspects.

2.3 Application
2.3.1 Classic
2.3.1.1 Multimedia
Multimedia is probably the most popular application domain in recommender systems.
Multimedia recommender systems can exploit different forms of preference data and can
use different types of multimedia descriptors when creating recommendations (Elahi,
Ricci, and Rubens, 2012; Hazrati and Elahi, 2020). While such features can have different
forms, we can classify them into two main categories: high-level and low-level forms of
descriptors (Cantador, Szomszor, Alani, Fernandez, and Castells, 2008; Hazrati and
Elahi, 2020).

FIGURE 2.3
A contextualized tweet (Beheshti et al., 2019).

High-level descriptors illustrate more of the semantic and syntactic characteristics of
multimedia items and can be aggregated from either structured forms of metadata, e.g., a
relational database or an ontology (Cantador et al., 2008; Mooney and Roy, 2000), or from
less structured forms of data, e.g., user reviews, film plots, and social tags (Ahn et al., 2007;
Hazrati and Elahi, 2020).
Low-level descriptors, on the other hand, are aggregated directly from multimedia
files (e.g., audio or visual files). In the music domain, for instance, low-level
descriptors can represent the acoustic configurations of the songs (e.g., rhythm, energy,
and melody), which can be adopted by recommender systems to find similar songs and to
generate personalized recommendations for a user (Bogdanov and Herrera, 2011; Bogdanov,
Serra, Wack, Herrera, and Serra, 2011; Knees, Pohle, Schedl, and Widmer, 2007; Seyerlehner,
Schedl, Pohle, and Knees, 2010).
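As a rough illustration of how low-level descriptors could be used to find similar songs, the sketch below compares hypothetical feature vectors with cosine similarity. The three dimensions (rhythm, energy, melody), the song names, and all numeric values are made up for this example, and cosine similarity is only one of several measures that could be used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical low-level descriptors: [rhythm, energy, melody], scaled to [0, 1].
songs = {
    "song_a": [0.90, 0.80, 0.20],
    "song_b": [0.85, 0.75, 0.25],
    "song_c": [0.10, 0.20, 0.90],
}

def most_similar(target, songs, top_n=1):
    """Return the names of the songs closest to `target`, most similar first."""
    others = [name for name in songs if name != target]
    others.sort(key=lambda name: cosine(songs[target], songs[name]), reverse=True)
    return others[:top_n]

print(most_similar("song_a", songs, top_n=2))
```

A user who liked `song_a` would be offered `song_b` before `song_c`, since its descriptors point in nearly the same direction.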
In the video domain, low-level descriptors can represent the visual aspects of the videos
and thus reflect an artistic style (Canini, Benini, and Leonardi, 2013; Lehinevych, Kokkinis-
Ntrenis, Siantikos, Dogruoz, Giannakopoulos, and Konstantopoulos, 2014; Yang, Mei,
Hua, Yang, Yang, and Li, 2007; Zhao, Li, Wang, Yuan, Zha, Li, and Chua, 2011).
Recommendation based on low-level features has not drawn much attention
in multimedia recommender systems. On the other hand, such features have received massive
attention in some related research fields, namely computer vision (Rasheed, Sheikh, and
Shah, 2005) and content-based video retrieval. Despite differences in their overall goals, these
communities share objectives such as formulating informative descriptors of video and
movie items. Hence, they report outcomes and insights that can be beneficial in the context
of multimedia recommender systems (Brezeale and Cook, 2008; Hu, Xie, Li, Zeng, and
Maybank, 2011; Rasheed et al., 2005).

2.3.1.2 Tourism
Another well-studied domain in the research on recommender systems is tourism. This
is a domain where contextualization plays an important role. We can define contextualization
as the process of incorporating contextual factors (such as weather conditions, travel
goals, and means of transportation) in the recommendation generation. The idea is to make
personal suggestions by incorporating diverse sources of user data, as well as the condition
represented by contextual factors (Adomavicius and Tuzhilin, 2011). For example, a group
of tourists may be interested in visiting suggested indoor attractions (e.g., museums) during
bad weather, but in nice weather they may prefer outdoor activities (e.g., hiking).
Recommender systems that are capable of using such contextual factors are known as
context-aware recommender systems (CARS).
CARS exploit mathematical modeling in order to better learn user
preferences in different contextual situations based on diverse sources of data, e.g., the
temperature, season, the geographical position, and even the vehicle type. Due to the popularity
of this research domain, a large body of research has already been conducted in
it (Baltrunas, Ludwig, Peer, and Ricci, 2012; Chen and Chen, 2014; Gallego,
Woerndl, and Huecas, 2013; Hariri, Mobasher, and Burke, 2012; Kaminskas, Ricci, and
Schedl, 2013; Natarajan, Shin, and Dhillon, 2013). The majority of these works exploit
the context experienced by the user in the recommending process.
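The weather example above can be sketched as a simple contextual pre-filtering step: items that do not fit the current context are removed first, and the remaining ones are ranked by predicted preference. This is a minimal sketch under stated assumptions; the `Attraction` type, the weather labels, and the scores are hypothetical, and real context-aware systems typically model context in richer ways than a hard filter.

```python
from dataclasses import dataclass

@dataclass
class Attraction:
    name: str
    indoor: bool
    score: float  # hypothetical predicted preference of the target user

def recommend(attractions, weather, top_n=2):
    """Keep attractions that fit the weather context, then rank them by score."""
    bad_weather = weather in {"rain", "snow", "storm"}
    # Indoor attractions in bad weather, outdoor activities otherwise.
    candidates = [a for a in attractions if a.indoor == bad_weather]
    ranked = sorted(candidates, key=lambda a: a.score, reverse=True)
    return [a.name for a in ranked[:top_n]]

catalog = [
    Attraction("City museum", indoor=True, score=0.7),
    Attraction("Mountain hike", indoor=False, score=0.9),
    Attraction("Aquarium", indoor=True, score=0.8),
    Attraction("Lake walk", indoor=False, score=0.6),
]

print(recommend(catalog, "rain"))   # indoor attractions only
print(recommend(catalog, "sunny"))  # outdoor activities only
```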

2.3.1.3 Food
Diverse categories of food recommendation systems have recently been
proposed by the community (Trevisiol, Chiarandini, and Baeza-Yates, 2014; West, White,
and Horvitz, 2013). For example, Freyne and Berkovsky (2010) built a food recommendation
system that, through an effective user interaction model, collects user preferences and
generates personalized suggestions. Their system converts the preferences of the users for
recipes into preferences for ingredients, and then merges these converted preferences to
form user suggestions.
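The recipe-to-ingredient conversion described above can be illustrated with a small sketch. It assumes, purely for illustration, that an ingredient's score is the mean rating of the rated recipes containing it, and that an unseen recipe's score is the mean of its known ingredient scores; the exact formulas of the cited system may differ, and the recipes and ratings below are invented.

```python
from collections import defaultdict

def ingredient_preferences(recipe_ratings, recipes):
    """Convert recipe ratings into per-ingredient scores (mean of recipe ratings)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for recipe, rating in recipe_ratings.items():
        for ingredient in recipes[recipe]:
            totals[ingredient] += rating
            counts[ingredient] += 1
    return {ing: totals[ing] / counts[ing] for ing in totals}

def predict_recipe(recipe, recipes, ing_prefs):
    """Merge ingredient scores into a recipe score; None if no ingredient is known."""
    known = [ing_prefs[i] for i in recipes[recipe] if i in ing_prefs]
    return sum(known) / len(known) if known else None

# Hypothetical recipe catalog and one user's ratings on a 1-5 scale.
recipes = {
    "tomato pasta": ["pasta", "tomato", "basil"],
    "caprese": ["tomato", "mozzarella", "basil"],
    "cheese pizza": ["dough", "tomato", "mozzarella"],
}
ratings = {"tomato pasta": 5.0, "caprese": 3.0}

prefs = ingredient_preferences(ratings, recipes)
print(predict_recipe("cheese pizza", recipes, prefs))  # mean of tomato and mozzarella scores
```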

Elahi, Ge, Ricci, Massimo, and Berkovsky (2014) devised a different approach for food
recommendation that combines the predictions for food along diverse aspects (such as
user food preferences, nutrition, ingredients, and expenditure) to compute a score for a
potential food (or meal). The objective is to take into account factors that influence
the user’s food choices in order to make a more beneficial set of recommendations (Teng,
Lin, and Adamic, 2012). In their next paper, the same authors performed an assessment of
the rating prediction method, which used a variant of MF. This method exploits more data
than ratings alone, such as subjective tags assigned to different recipes by users. The
results showed that extra data input on the user preferences allows the technique to
outperform other baseline methods, including those developed in Freyne and Berkovsky
(2013).
Generally speaking, the preferences that are aggregated by a recommender system can
have two forms, i.e., long-term affinities or short-term affinities. While obtaining and
aggregating both forms of preferences is essential, the research on recommender systems
often does not distinguish between these two forms. Only limited research works
have considered such differences (e.g., Ricci and Nguyen, 2007). The noted example is one
of the few works that developed a recommender system that elicits both generic long-term
affinities and specific short-term affinities.
We would like to point out that the traditional line of research on recommender systems
typically underestimates the importance of the human-system interaction model as an essential
component for creating an industrial-grade system. Such work mainly concentrates on
enhancing the core analytical models by supposing that the preference acquisition procedure
is conducted only in the beginning, and then ended.

2.3.1.4 Fashion
Fashion traditionally refers to the prevailing form of clothing, and it can be
characterized by the concept of change. Fashion includes diverse forms of self-fashioning,
ranging from street styles to the so-called high fashion made by designers (Bollen,
Knijnenburg, Willemsen, and Graus, 2010; Person, 2019). One of the biggest issues for this
type of application is the growing diversity and expanding number of fashion products.
This is an effect that can certainly lead to choice overload for fashion consumers. This
is not necessarily negative, since more available options mean a higher likelihood
that consumers will find a desired product. However, such an effect may lead to the
impossibility of actually choosing a product, i.e., the problem of receiving too many
options, particularly when they are very diverse (Anderson, 2006).
Recommender techniques are powerful tools that can effectively tackle this issue by
making relevant suggestions of products tailored to the needs of the users. They can
build a filtering mechanism that eliminates uninteresting and irrelevant products from a
shortlist of recommendations. They can thoroughly mine the user data in order to learn
the particular preferences of each individual user. For instance, Amazon can
look into the purchase history of users and build predictive models that can ultimately
be used to make personalized recommendations for the purchaser. Hence, the smart
engine behind the recommender can actively understand the users’ behaviors, and
obtain diverse and informative forms of data describing the user tastes in order to gain
knowledge of the individual requirements of every user (Rashid, Albert, Cosley, Lam,
Mcnee, Konstan, and Riedl, 2002; Rubens, Elahi, Sugiyama, and Kaplan, 2015; Su and
Khoshgoftaar, 2009).

2.3.2 Modern
2.3.2.1 Financial Technology (Fintech)
Financial technology (fintech) aims to use technology to provide financial services to busi-
nesses or consumers. Any form of recommendation method in this field will need to under-
stand three main dimensions to provide intelligent recommendations: (i) banking entities,
such as customers and products; (ii) banking domain knowledge, such as how different
banking segments operate across customers, sales and distribution, products and services,
people, processes, and technology; and (iii) banking processes, to help understand the best
practices learned by knowledge experts in processes such as fraud detection, customer
segmentation, managing customer data, risk modeling for investment banks, and more.
The main shortcoming of existing RSs is that they do not consider domain experts’
knowledge, and hence may not exploit user-side information such as cognitive character-
istics of the user. These aspects are quite vital to support intelligent and time-aware
recommendations.
To support data analytics focusing on customers’ cognitive activities, it is important to
understand customers’ dimensions both from banking and non-banking perspectives, as
depicted in Figure 2.4. Modern approaches, such as cognitive recommender systems (Beheshti
et al., 2020b), model the customer behavior and activities as a graph-based data model
(Beheshti, Benatallah, and Motahari-Nezhad, 2016; Hammoud, Rabbou, Nouri, Beheshti,
and Sakr, 2015) over customers’ cognitive graphs to personalize the recommendations.

2.3.2.2 Education
One of the most popular application domains in recommender systems is the field of education.
Education allows individuals to reach their full potential and aids in the development
of societies by reducing poverty and decreasing social inequalities. Recently, the world
has experienced increasing growth in this domain, in both quantity and quality
measures.
This, in turn, has already generated several challenges in the education system, such as
instructors’ workload in dealing with assessments and providing recommendations based
on students’ performance and skills assessment.
In this context, recommender systems can be important tools for personalizing
teaching and learning by understanding and analyzing important indicators such as
knowledge, performance (e.g., cognitive, affective, and psychomotor indicators), and skills
(e.g., decision-making and problem-solving). An attractive direction for future work
is to implement a time-aware deep learning model to construct and analyze learners’
profiles in order to better understand students’ performance and skills. The learning models
would enable recommender systems to identify similarly performing students, which
may facilitate personalizing the learning process, subject selections, and recruitment.

2.3.2.3 Recruitment
Talent acquisition and recruitment processes are examples of ad hoc processes that are
controlled by knowledge workers aiming to achieve a business objective/goal. Attracting
and recruiting the right talent is a key differentiator in modern organizations, and recom-
mender systems can play an important role in assisting recruiters in the recruitment process.
For example, consider a recommendation engine that has access to LinkedIn profiles,
is able to extract data and knowledge from business artifacts (e.g., candidates’ CV and
position descriptions), has access to curation algorithms to contextualize the data and
knowledge, and is able to link them to the facts in the recruitment domain knowledge base.

FIGURE 2.4
Users’ dimensions in a banking scenario (Beheshti et al., 2020b).

Artificial intelligence (AI) has enabled organizations to create business leverage by
applying cutting-edge automation techniques (Shahbaz, Beheshti, Nobari, Qu, Paik, and
Mahdavi, 2018): (i) improving the overall quality and effectiveness of the recruitment
process; (ii) extracting relevant information from a candidate’s CV automatically; (iii)
aggregating different candidate evaluations and relevant information; (iv) understanding
the best practices used by recruiters; and (v) extracting personality traits and applicant
attitudes from social media sites, something that was traditionally only possible
through interviews. All these techniques can be leveraged by recommender systems
to build effective ranking algorithms that optimize the recommendations and help
maintain a priority pool of talent. AI-enabled recommender systems would be able to
help match the behaviors of the most talented people in their organizations, and help
businesses recruit the right candidates for open jobs by aggregating information from
different sources and then ranking the candidates based on their overall score. Another
future line of work would be to use computer vision algorithms to assess interviews of
potential candidates and compare them to the organization’s best talent in order to make
recommendations (Hirevue, 2019).
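The idea of aggregating information from different sources and ranking candidates by an overall score can be sketched as a weighted average. The source names (`cv_match`, `interview`, `social_profile`), the weights, and the candidate scores are all hypothetical; the text does not prescribe a particular aggregation formula, so a weighted mean over the available sources is assumed here.

```python
def overall_score(evaluations, weights):
    """Weighted mean of the per-source scores that are available for a candidate."""
    total_weight = sum(weights[source] for source in evaluations)
    weighted_sum = sum(weights[source] * value for source, value in evaluations.items())
    return weighted_sum / total_weight

def rank_candidates(candidates, weights):
    """Return candidate names sorted by overall score, best first."""
    scored = {name: overall_score(ev, weights) for name, ev in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical source weights and per-source scores in [0, 1].
weights = {"cv_match": 0.5, "interview": 0.3, "social_profile": 0.2}
candidates = {
    "candidate_a": {"cv_match": 0.9, "interview": 0.6, "social_profile": 0.5},
    "candidate_b": {"cv_match": 0.7, "interview": 0.9, "social_profile": 0.8},
    "candidate_c": {"cv_match": 0.8, "interview": 0.5},  # no social profile available
}

print(rank_candidates(candidates, weights))
```

Normalizing by the weights of the available sources keeps a candidate with a missing source comparable to the others, which is one possible design choice among several.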

2.4 Challenges
Recommender systems typically exploit datasets that contain user feedback (e.g., likes and
dislikes, or ratings) that represent preferences produced by a large crowd of interconnected
users for a large list of items (Desrosiers and Karypis, 2011). Exploiting such data empowers
the recommender systems to learn the patterns and connections among users, and use
them to estimate the missing assessments (likes and dislikes, or ratings) of users for the
unexplored items and then suggest items that may be attractive to a target user (Koren and
Bell, 2011).
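A minimal sketch of this estimation step, assuming a user-based neighborhood approach: the missing rating of a target user is predicted as a similarity-weighted average of the ratings given by other users. The user names, items, and ratings are invented, and cosine similarity over co-rated items is only one possible choice of similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Estimate a missing rating as a similarity-weighted mean of other users' ratings."""
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        numerator += sim * their_ratings[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None

# Hypothetical rating data on a 1-5 scale; "ann" has not rated item "i3".
ratings = {
    "ann": {"i1": 5, "i2": 1},
    "bob": {"i1": 5, "i2": 1, "i3": 4},
    "cat": {"i1": 1, "i2": 5, "i3": 2},
}

print(predict(ratings, "ann", "i3"))  # closer to bob's rating than to cat's
```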
The above-mentioned procedure is oversimplified, and there are many major concerns
that have not been fully addressed so far. Hereafter, we briefly explain some of these
concerns.

2.4.1 Cold Start


Recommender systems may still encounter a wide range of challenges due to different
reasons, such as the lack of rating data for some of the users or items (Adomavicius and
Tuzhilin, 2005; Schein, Popescul, Ungar, and Pennock, 2002). One of the problematic
issues in building personalized recommendations is cold start, which is strongly related to
the low quality or quantity of the input data. A sub-problem of cold start is called the
new user situation, which refers to when a new user begins to use the system and demands
suggestions prior to giving any preferences for any existing item. Similarly, in a new item
situation, a new item is introduced to the item catalog and waits to obtain assessments from
existing users (in terms of ratings, reviews, or tags). In addition to these
typical situations in a cold-starting recommender system, there exists
another problem called sparsity. Sparsity is the complement of the data density, i.e., one
minus the ratio of the number of available feedbacks (e.g., ratings) to the number of all
possible feedbacks:

$$\text{sparsity} = 1 - \frac{\#\ \text{of existing feedbacks}}{\#\ \text{of all possible feedbacks}} \tag{2.5}$$

In some acute situations of sparsity, the effectiveness of the recommender system
can deteriorate strongly, resulting in a significant decrease in the performance
of the system. In such a condition, the quantity of available user feedback is much
smaller than the number of missing feedbacks, and the system still has to build
predictions with a satisfactory level of quality (Adomavicius and Tuzhilin, 2005; Braunhofer,
Elahi, and Ricci, 2014).
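Equation 2.5 can be evaluated directly on a rating table. The sketch below assumes the feedback is stored as a dictionary mapping each user to the items they rated; the data are made up for illustration.

```python
def sparsity(ratings, n_users, n_items):
    """Sparsity as in Equation 2.5: 1 - (existing feedbacks / all possible feedbacks)."""
    existing = sum(len(items) for items in ratings.values())
    return 1.0 - existing / (n_users * n_items)

# Hypothetical feedback: 3 users, 4 items, only 3 of the 12 possible ratings exist.
ratings = {
    "u1": {"i1": 4, "i3": 5},
    "u2": {"i2": 3},
    "u3": {},
}

print(sparsity(ratings, n_users=3, n_items=4))  # 1 - 3/12 = 0.75
```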
Different cold start conditions can take place in actual applications, namely extreme
cold start and mild cold start conditions.

• Extreme cold start conditions take place when a user starts using the system and asks
for a recommendation before producing any feedback. The problem can also happen
when a brand-new product is inserted into the catalog and has no associated data
that can represent that item. This, in turn, can lead to a failure in suggesting that new
product to existing users. Both of these situations are critical issues that have to be
handled actively by the system.
• Mild cold start conditions happen once a small number of feedbacks are produced by
a user for existing products, and the system can use this limited data to generate a
recommendation. This problem may also take place for a new product when only a small
amount of content data is available for it. Mild cold start can be seen as an
intermediate condition between extreme cold start and warm start. This can still lead to
a failure if not promptly addressed by the system.
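The distinction between these conditions can be sketched as a simple check on the number of feedbacks available for a user or an item. The threshold separating mild cold start from warm start (`mild_threshold=5`) is a hypothetical value chosen for illustration; the text does not specify one, and in practice it would be tuned per application.

```python
def cold_start_condition(n_feedbacks, mild_threshold=5):
    """Classify a user or item by how much feedback is available for it."""
    if n_feedbacks == 0:
        return "extreme"  # no feedback at all: nothing to personalize on
    if n_feedbacks < mild_threshold:
        return "mild"     # a few feedbacks: limited data for recommendations
    return "warm"         # enough feedback for the usual models

for n in (0, 3, 50):
    print(n, cold_start_condition(n))
```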

2.4.2 Context Awareness


Although context may have different meanings in various domains, it typically refers to
the circumstances that help an event to be understood (Beheshti et al., 2020b). In
general, context is represented by factors that influence computation but are still different
from the input and output data (Lieberman and Selker, 2000). Time, location, and social
relations are examples of context that have motivated researchers to focus on context-aware
techniques, such as location-aware and time-aware recommender systems.
For instance, time-aware recommender systems can help the system
understand the development of users’ affinities over a certain period of time, and can thus
provide contextual recommendations for different times (e.g., seasonal, monthly, weekly,
or daily with different weather conditions). As another example, social-aware recommender
systems can benefit from big data generated on social media in different aspects,
from the social characterization of users (including social relationships, followers, and
shared content) to the identification of the personality, behavior, and attitude of social
users (Ghafari, Yakhchi, Beheshti, and Orgun, 2018). For example, features such as intimacy
and emotional intensity, along with location- and time-aware context, can be calculated
and used to provide accurate and context-aware recommendations.

2.4.3 Style Awareness


Modern recommender systems in various application domains are becoming more and
more aware of the style of the items and products they recommend. Integration of product
style within the recommendation process is in turn becoming increasingly important. In
the multimedia domain, examples of style elements are lighting, colorfulness, movement,
and sound.
There are diverse reasons for having lighting in films and movies. The most important
one is to enable the audience to perceive the space and to make visible the objects
that the audience is seeing. But lighting can also alter the way an event is seen,
acting in a way that goes beyond the logical perception of a human.
Colors have a similar capability, evoking emotions associated with a given situation.
The specific nature of colors prevents them from being perceived separately from the
lighting of a space. Likewise, colors tend to contribute to creating a unique “perception”
of a space in comparison to other aesthetic characteristics that function in a similar
manner. Experts in media commonly believe that the impact of colors becomes larger when
they are deliberately used to achieve a particular emotional goal.
A number of research works on recommender systems have reported that users’ preferences
can be influenced more by low-level descriptors than by high-level descriptors
(expressing the semantic or syntactic forms in films) (Elahi, Deldjoo, Bakhshandegan
Moghaddam, Cella, Cereda, and Cremonesi, 2017; He, Fang, Wang, and McAuley, 2016;
Messina, Dominquez, Parra, Trattner, and Soto, 2018; Rimaz, Elahi, Bakhshandegan
Moghadam, Trattner, Hosseini, and Tkalcic, 2019; Roy and Guntuku, 2016). Examples of
such low-level descriptors are color energy, shot duration, and lighting key (Wang and
Cheong, 2006), which have been shown to influence user mood and emotion (Roberts, Hager, and
Heron, 1994). In addition to that, various forms of motion (such as camera movement) can
play a significant role and are commonly adopted by filmmakers when aiming to affect the
perception of movie watchers (Heiderich, 2018). A range of methods and techniques have
been adopted to address the task of learning visual descriptors from films (Ewerth,
Schwalb, Tessmann, and Freisleben, 2004; Savian, Elahi, and Tillo, 2020; Tan, Saur, Kulkami,
and Ramadge, 2000).
Despite the importance of low-level descriptors, their usage has not drawn
much consideration in recommendation systems (e.g., Messina et al. [2018]).
However, these audiovisual descriptors are thoroughly investigated in related areas,
namely within the computer vision community (Naphide and Huang, 2001; Snoek and
Worring, 2005).

2.5 Advanced Topics


In the age of AI, an intelligent recommender system should be highly data driven, knowledge
driven, and cognition driven. Cognitive recommender systems (Beheshti et al., 2020b)
have been proposed as a new type of intelligent recommender system that aims to analyze
and understand users’ preferences and explore mechanisms to intelligently understand
complex and changing environments. In this context, we categorize the advanced topics
in recommender systems into the following categories: AI-enabled recommendations,
cognition aware, intelligent personalization, intelligent ranking, and intelligent customer
engagement.

2.5.1 AI-Enabled Recommendations


As discussed in Section 2.3, AI-enabled recommendations should benefit from data-driven
and knowledge-driven approaches, such as data lakes (Beheshti et al., 2017a) and knowl-
edge lakes (Beheshti et al., 2018, 2019).
Data-driven recommendations leverage machine learning technologies to
contextualize big data, with the aim of enhancing the precision of automatic suggestions,
facilitating the use of content and context, and moving from statistical modeling to
advanced models based on multi-layer neural networks. This will improve the mining of
patterns between item and user descriptors to build better suggestions for users.
Knowledge-driven recommendations empower simulating the expertise of the domain
experts (e.g., using crowdsourcing methods [Beheshti et al., 2019]) and to adopt methods
