Computers 11 00158 v2
Article
A Ranking Learning Model by K-Means Clustering Technique
for Web Scraped Movie Data
Kamal Uddin Sarker 1, * , Mohammed Saqib 2 , Raza Hasan 3, * , Salman Mahmood 4 , Saqib Hussain 3 ,
Ali Abbas 5 and Aziz Deraman 6
Abstract: Business organizations experience cut-throat competition in the e-commerce era, where a
smart organization needs to come up with faster innovative ideas to enjoy competitive advantages. A
smart user decides from the review information of an online product. Data-driven smart machine
learning applications use real data to support immediate decision making. Web scraping technologies
support supplying sufficient relevant and up-to-date well-structured data from unstructured data
sources like websites. Machine learning applications generate models for in-depth data analysis
and decision making. The Internet Movie Database (IMDB) is one of the largest movie databases
on the internet. IMDB movie information is applied for statistical analysis, sentiment classification,
genre-based clustering, and rating-based clustering with respect to movie release year, budget, etc.,
for repository dataset. This paper presents a novel clustering model with respect to two different
rating systems of IMDB movie data. This work contributes to the three areas: (i) the “grey area” of
web scraping to extract data for research purposes; (ii) statistical analysis to correlate required data
fields and understanding purposes of implementation machine learning, (iii) k-means clustering is
applied for movie critics rank (Metascore) and users’ star rank (Rating). Different python libraries are
used for web data scraping, data analysis, data visualization, and k-means clustering application.
Only 42.4% of records were accepted from the extracted dataset for research purposes after cleaning.
Statistical analysis showed that votes, ratings, Metascore have a linear relationship, while random
characteristics are observed for income of the movie. On the other hand, experts’ feedback (Metascore)
and customers’ feedback (Rating) are negatively correlated (−0.0384) due to the biasness of additional
features like genre, actors, budget, etc. Both rankings have a nonlinear relationship with the income
of the movies. Six optimal clusters were selected by elbow technique and the calculated silhouette
score is 0.4926 for the proposed k-means clustering model and we found that only one cluster is in
the logical relationship of two rankings systems.
Citation: Sarker, K.U.; Saqib, M.; Hasan, R.; Mahmood, S.; Hussain, S.; Abbas, A.; Deraman, A. A Ranking
Learning Model by K-Means Clustering Technique for Web Scraped Movie Data. Computers 2022,
11, 158. https://doi.org/10.3390/computers11110158
Academic Editors: Phivos Mylonas, Katia Lida Kermanidis and Manolis Maragoudakis
Received: 14 September 2022
Accepted: 31 October 2022
Published: 8 November 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Keywords: movie data; web scraping; statistical analysis; machine learning; k-means clustering
one or more sources, that are stored, processed, and analyzed to describe factual informa-
tion. Data could be qualitative/categorical (nominal or ordinal) or quantitative/numerical
(discrete or continuous) [1]. Nominal data do not measure an object but are used to label
variables like country, age, and race. Ordinal data are represented in an order or scale that
measures a variable, like salary, range, rating, or points of a product. Discrete data consist
of fixed values, like the number of students or the price of a product, while continuous data can
take any numerical value within a range, like water pressure and walking speed. A data
scientist collects, processes, stores, analyzes, splits, merges, and applies effective algorithms
to generate knowledge for decision making. Traditionally, we can collect data from primary
sources or secondary sources, but time and reliability are sensitive in terms of competitive
advantages [1,2]. Web data extraction has become popular [1,3] because it is up-to-date and
accurate and supports better decision making.
accessed on 24 June 2022. The IMDB site uses Metascore from critics and Rating from users
to reduce the search time of its users [101].
• Research Scope: There is plenty of research on IMDB data analysis and the implementation
of machine learning applications. Most of the works developed their own supervised models,
applied clustering techniques, or performed a statistical analysis based on a repository
data set (details in Section 2). A repository data set may contain unnecessary fields and
outdated information. Moreover, according to Quora [14], a good number of customers do not
have a good experience when they select a movie based only on the scoring systems of
Metascore or Rating, because Metascore is biased by human error or business goals and
Rating is biased by users’ influential factors (age, gender, race, and culture). On the
other hand, an ordinary user has limited access to the IMDB movie information and most
users follow Rating/Metascore, but the choice becomes ambiguous when a huge difference
exists between these two scores of a movie.
1.3. Contribution
This research contributes to the three major areas:
(i) We extract up-to-date movie data (movie name, Metascore, Rating, year, votes, and
gross income) from the IMDB movie site. This is an ethical (grey area) data extraction
process from the internet, which is more accurate and reliable than data collected from
a third party.
(ii) Data cleansing and analysis are performed to show the correlation between Rating,
Metascore, votes, and gross income of the movies. The statistical analysis illustrates the
relationship between Metascore and Rating by scatter plot and boxplot for comparison
between the two scoring systems. This supports data validation and feature
selection for machine learning applications.
(iii) Finally, the k-means clustering technique is applied after analyzing the machine
learning approaches, which will support a user in selecting a movie from optimal clusters.
The rest of the paper is organized as follows: Section 2 consists of recently completed research
in data science and machine learning domains for the IMDB dataset. Section 3 illustrates
the research methodology aided with a diagram. Web-scraping data extraction,
data analysis (statistical), and implementation of k-means clustering are executed with
the required explanation and literature in the following three sections (Sections 4–6).
Section 7 explains the result and application of the research. Concluding remarks with
limitations and future work are mentioned in Section 8.
2. Related Work
We are going to use IMDB movie data for our research to study the relationship
between the two types of scoring systems. IMDB movie data are historically applied in
different machine learning techniques. Jasmine Hsieh [15] performed a statistical analysis
based on the movie rating and votes from 2005 to 2015 to see the changing pattern over the
periods according to the genres (comedy, short, sport, adult, animation, etc.) of IMDB listed
movies. Qaisar [16] applied Long Short-Term Memory (LSTM) classification for sentiment
analysis based on the users’ comments and Topal et al. [17] applied statistical analysis
to show the ranking changing pattern over a period of movie data. They worked on the
repository dataset of IMDB that is collected from a third party. It has also been analyzed by
different regressions [18] to predict popularity of the movies based on the genre information
of the Kaggle dataset. Naeem et al. applied gradient boosting classifiers, support vector
machines (SVM), Naïve Bayes classifier, and random forest [19], while Sourav M. and
Tanupriya C. applied Naïve Bayes and SVM [20] and both found that SVM is better than
any other classifier for sentiment analysis of IMDB movie review text. Hasan B. and
Serdar K. showed clustering based on the genre of a movie to compare the genres with
respect to other features like rating, release year, and gross income [21]. Aditya et al., on
the other hand, applied different clustering techniques based on the rating with respect to
genre, year, budget, Facebook likes, etc. [22]. The supplementary material (Table S1)
includes the data of 1000 movies that are in the top of rating list at IMDB.
3. Methodology
This research contributes to web data scraping, statistical analysis, and machine learning
algorithms to analyze the correlation between users’ feedback and experts’ evaluation
on the internet, predominantly on a movie ranking website. The introduction section
provided the preliminary understanding of the research domain (AI, web data, statistics)
and the aim of this scholarly work.
This article is arranged in a sequential manner (Figure 1) that selects webpages and
the features of data that are required for this research. Then, the website is inspected to
understand the location of the data, the layers of tags, content of the tags, and structure
of the pages. Anaconda, which is known as a “free distribution of Python”, is used for Python
package management and deployment. Pandas is one of the Python packages that is
particularly used in data science applications. There are plenty of Python IDEs for
web-scraping, data analytics, and machine learning (details in Section 4) with their own special
features. We used Jupyter Notebook, which is a web-based open-source application for Python
programming. BeautifulSoup is the web-scraping library used in the data extraction
program, which stores .csv files in the Jupyter directories, while Numpy supports the conversion of
extracted data into an appropriate data structure. The Pandas library is used for statistical
analysis and implementation of clustering techniques. On the other hand, Seaborn and
Matplotlib packages are utilized for visualization. Section 4 comprises data-scraping tools,
techniques, features selection, algorithm, and cleansing methods. It also includes a
web-scraping algorithm and related regular expressions that are used for data extraction and
data cleansing. Data analysis included scatter plot, box-diagram, and correlation, as
detailed in Section 5. We applied elbow functions to select the number of optimal clusters for
our dataset. The k-means clustering technique is implemented for six clusters in Section 6 after
a brief discussion of a few machine learning techniques. The result analysis is discussed
in Section 7 with a comparison study of related research. The paper is concluded by
mentioning limitations of the study and ways to further extend the research.
Figure 1. Research methodology.
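The elbow step named in the methodology can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, a plain Lloyd-style k-means with a fixed seed, and a hypothetical toy set of (Score, Rating) pairs. The function names `kmeans_inertia` and `elbow_curve` are assumptions made for this illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_inertia(points, k, seed=0, max_iter=100):
    """Run a basic Lloyd-style k-means and return the final inertia
    (sum of squared distances of points to their nearest centre)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres drawn from the data
    for _ in range(max_iter):
        # Assignment step: group points by nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                dims = zip(*members)
                new_centres.append(tuple(sum(d) / len(members) for d in dims))
            else:
                new_centres.append(centre)  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return sum(min(squared_distance(p, c) for c in centres) for p in points)

def elbow_curve(points, max_k):
    """Inertia for k = 1..max_k; the 'elbow' is where the drop flattens."""
    return {k: kmeans_inertia(points, k) for k in range(1, max_k + 1)}

# Hypothetical (Score, Rating) pairs forming three loose groups.
toy_points = [
    (72, 7.1), (74, 7.3), (73, 7.0),
    (85, 7.8), (86, 7.6), (84, 7.9),
    (97, 8.4), (98, 8.2), (96, 8.5),
]
curve = elbow_curve(toy_points, max_k=4)
for k, inertia in curve.items():
    print(k, round(inertia, 2))
```

In this sketch the Score axis dominates the distance because the two features are on different scales; in practice the features would usually be rescaled before clustering.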
4. Data Collection
Websites contain huge amounts of information, and the content of the pages is
updated regularly for better services. User-developed content is dynamic information that
is updated every moment. So, our research extracts users’ ratings from the IMDB movie
website instead of collecting them from a third party, in order to get up-to-date data. This
section describes the importance of internet data, data extraction tools and techniques
from webpages, data extraction legality, and an algorithm for movie data collection from
the IMDB website.
4.1. Importance
We are in the digital world and relate to data sources of the digital environment, which
has been increasing due to digitally recorded activities of daily life, business, and news
feeds [2,3]. Every moment, a vast amount of data is generated by the Internet of Things (IoT),
cyber security, social media, smart devices, smart cities, digital financial services, health
care, and ordinary websites. Machine learning applications are growing fast in the context
of data analysis and computing with intelligent functions [1]. Data analysis applications
need real data from that domain to train a machine learning model. For instance, cyber
security data are needed to develop automated data-driven cyber security systems and mobile
data are used in smart mobile awareness systems [5]. Similarly, the COVID-19 prevention
system should reflect actions based on the COVID-19 dataset. Data are important when
we can utilize them properly, and they come mainly in three forms: unstructured data,
semi-structured data, and structured data [4,6]. Structured data are organized in an order
and represented in a standard format. Typically, the data are highly organized in a
tabular form (i.e., relational database). Office documents, audio, video, pdf, and image files,
and websites consist of data that are unstructured and complicated to organize, manage,
and analyze. Semi-structured data are not organized like a relational database, but they
have a unique tie that supports analyzing, such as HTML, XML, NoSQL, and JSON files.
Machine learning applications are data driven and the effectiveness or efficiency relies
on the quality of data [4,7]; even the outcome of a machine learning algorithm differs
based on the characteristics of data [8]. This research extracted unstructured web page
data and converted them to a structured dataset before applying them to machine learning
algorithms and statistical analysis.
for updating a model. It also enriches the database of a company and performance of the ap-
plication. In this paper, we apply HTTP protocol data collection methods for web-scraping.
Libraries | Description
Requests | It is the most basic and essential library for web-scraping and is used for various HTTP requests like GET and POST to extract information from HTML server pages [48]. It is simple, supports HTTP(S) proxies, and makes it easy to get large amounts of data from static web pages, but it cannot parse data from retrieved HTML files or collect information from JavaScript pages [47].
Beautiful Soup | Perhaps it is the most used Python library; it creates a parse tree for parsing HTML and XML documents. It is comparatively easy and suitable for beginners to extract information, and it can be combined with lxml. It is commonly used with Requests in industry, though it is slower than pure lxml [47,49].
lxml | This is a blazingly fast HTML and XML parsing library for Python that shows high performance when scraping large datasets [49]. It also works with Requests in industry. It supports data extraction by CSS and XPath selectors but is not good for poorly designed HTML webpages [48].
Selenium | It was developed for automated webpage testing and quickly turned into a data science tool for web-scraping. It supports web-scraping of dynamically populated pages. It is not suitable for large applications but can be applied if time and speed are not a concern.
Scrapy | It is a web-scraping framework that can crawl multiple websites with spider bots. It is asynchronous, so it can send multiple HTTP requests simultaneously. It can extract data from dynamic websites using the Splash library [49].
researcher when developing new research questions or answering old questions [52], and
it allows practitioners to understand business strategy [53]. Web-scraping data are used by
government agencies, and market and business analysts without legal issues [33]. A vast
volume of web data extraction can lead to technical, legal, and ethical challenges [51,53].
There is a proliferation of selection tools, techniques, and purposes in a legal way called the
“grey area” of web-scraping [50,54]. Web-scraping is not restricted by legislative addresses,
but it is guided by a set of laws and theories: “Computer Fraud and Abuse Act” (CFAA),
“Trespass to Chattels”, “Breach of Contract”, and “Copyright Infringement” [54,55]. A
website owner can apply fundamental theories: (i) by posting a terms of use policy on
their website to prevent programmatic access, (ii) apply the fair use principle to protect
copyright property, (iii) protect premium content from commercial purposes by cease and
desist declaration, (iv) be protected from overload and damage by Trespass to Chattels law
declaration, and (v) the declaration of ethical statement, personal data protection, and data
utilization strategy. Our scraped website (IMDB) is a public website and there is no private
or copyright-protected data. The scraping program cannot damage or create extra load
for the website. It only extracts published information that will only be used for research
by academics. It will not be shared in the public domain, with third parties, or be used
for commercial application. IMDB (https://www.imdb.com/list/ls048276758/?ref=otl2
accessed on 24 June 2022) is the selected website for data extraction using Python
libraries (Beautiful Soup and Pandas). Table 2 shows the checklist of ethical data collection
criteria that are maintained by our research team. The IMDB site contains several
thousand movies in a list where a single page represents information of 100 movies. We
only scraped information from the first 10 pages (total 1000), which takes a few seconds
without interruption of their services.
Specification | Remarks
Web-scraping is explicitly prohibited by terms and conditions. | No
The extracted data are confidential for the organization. | No
Information of the website is copyrighted. | No
Are data going to be used for illegal, fraudulent, or commercial purposes? | No
Scraping causes information or system damage. | No
Web-scraping diminishes the service. | No
Collected data are going to compromise individuals’ privacy. | No
Collected information will be shared. | No
 | Unnamed:0 | Name_of_Movie | Release_Year | Duration_of_Flim | Rating | Score | Votes | Income
0 | 1 | Cameraperson | 2016 | 102 | 7.4 | 86 | 2900 | 0
1 | 2 | Goldfinger | 1964 | 110 | 7.7 | 87 | 189,661 | 51.08
2 | 6 | True Grit | 2010 | 110 | 7.6 | 80 | 337,113 | 171.24
3 | 7 | Nebraska | 2013 | 115 | 7.7 | 87 | 118,215 | 17.65
4 | 9 | Paterson | 2016 | 118 | 7.3 | 90 | 81,270 | 2.14
… | … | … | … | … | … | … | … | …
419 | 78 | Pan’s Labyrinth | 2006 | 118 | 8.2 | 98 | 662,747 | 37.63
420 | 85 | The Best Years of Our Lives | 1946 | 170 | 8.1 | 93 | 64,310 | 23.65
421 | 88 | Tampopo | 1985 | 114 | 7.9 | 87 | 19,097 | 0.22
422 | 89 | Werckmeister harmóniák | 2000 | 145 | 8.0 | 92 | 14,231 | 0.03
423 | 98 | Rio Bravo | 1959 | 141 | 8.0 | 93 | 62,485 | 12.54
424 rows × 8 columns
An ordinary user selects a movie based on the scoring with some other features like
name, actors, year of release, and subject of the movie. The movie features are in an
association rule [56] and we extracted only seven features for statistical analysis, but the aim
of the work is to suggest a movie based only on the Metascore and Rating. Seven features of
the movie records were scraped (Figure 2) for the first 1000 movies (first 10 pages) from the
list of the website based on Algorithm 1 without interfering with the website’s services. After
removing the records that consist of incomplete or garbage data, 424 records (Table 3) were
transformed into the Pandas data frame for analysis and clustering.
Figure 2. Selected fields for data-scraping.
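The field extraction described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the paper's Algorithm 1: it uses the standard-library `html.parser` module as a stand-in for BeautifulSoup so that it runs without extra packages, and the HTML snippet and its class names are hypothetical rather than the real IMDB markup.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names below are invented for this sketch
# and are not claimed to match the IMDB page structure.
SAMPLE_HTML = """
<div class="movie">
  <span class="name">Nebraska</span>
  <span class="year">(2013)</span>
  <span class="rating">7.7</span>
  <span class="score">87</span>
</div>
<div class="movie">
  <span class="name">Paterson</span>
  <span class="year">(2016)</span>
  <span class="rating">7.3</span>
  <span class="score">90</span>
</div>
"""

FIELDS = {"name", "year", "rating", "score"}

class MovieParser(HTMLParser):
    """Collect the text of selected <span> tags into one dict per movie."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current_field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "movie":
            self.records.append({})          # start a new movie record
        elif tag == "span" and css_class in FIELDS:
            self._current_field = css_class  # remember which field we are in

    def handle_data(self, data):
        if self._current_field and self.records:
            self.records[-1][self._current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_field = None

parser = MovieParser()
parser.feed(SAMPLE_HTML)
for record in parser.records:
    print(record)
```

Every extracted value is still text at this point (for example the year keeps its parentheses), which is why a separate cleansing step is needed before analysis.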
5. Data Analysis
Data are presented in the tags of the web pages and were fetched as text during
scraping (by BeautifulSoup), but we are going to apply clustering machine learning
techniques and analysis that work on meaningful numerical values. The text was
converted to a numerical form by a Python function (to_numeric()) for the required fields. We
applied three steps for cleansing: (i) dropped the records that consist of meaningless data
(data.drop(number)); (ii) removed all records that have a null value on the website but that
are scraped as “000” in the Score field (data.data(Score !=0)) and in the Income field
(data.data(income !=0)); and (iii) applied regular expressions to remove extra data within a
tag, i.e., special characters, symbols, spaces, and punctuation. In the text fields of the tags,
there might be additional symbols, special characters, or punctuation that are unnecessary
or even barriers to applying arithmetic, logical, and machine learning operations. Finally,
424 records were developed into a complete dataset (Table 3) for analysis and machine
learning algorithm application.
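The cleansing steps can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses the standard-library `re` module and plain dictionaries instead of Pandas, the raw strings are hypothetical, and the helper names `to_number` and `clean` are assumptions. The field names follow Table 3.

```python
import re

# Hypothetical raw records as they might look straight after scraping:
# every field is text, and missing values were stored as "0".
raw_records = [
    {"Name_of_Movie": "Nebraska", "Release_Year": "(2013)", "Rating": "7.7",
     "Score": "87 ", "Votes": "118,215", "Income": "$17.65M"},
    {"Name_of_Movie": "No Score Film", "Release_Year": "(2001)", "Rating": "6.9",
     "Score": "0", "Votes": "1,204", "Income": "$1.10M"},
    {"Name_of_Movie": "No Income Film", "Release_Year": "(I) (1999)", "Rating": "7.0",
     "Score": "75", "Votes": "9,870", "Income": "0"},
]

def to_number(text):
    """Strip everything except digits and the decimal point, then convert."""
    cleaned = re.sub(r"[^0-9.]", "", text)
    return float(cleaned) if cleaned else 0.0

def clean(record):
    """Return a numeric version of one record (the name stays as text)."""
    year_match = re.search(r"\d{4}", record["Release_Year"])
    return {
        "Name_of_Movie": record["Name_of_Movie"].strip(),
        "Release_Year": int(year_match.group()) if year_match else 0,
        "Rating": to_number(record["Rating"]),
        "Score": to_number(record["Score"]),
        "Votes": int(to_number(record["Votes"])),
        "Income": to_number(record["Income"]),
    }

cleaned = [clean(r) for r in raw_records]
# Drop records whose Score or Income is the zero placeholder for "missing".
dataset = [r for r in cleaned if r["Score"] != 0 and r["Income"] != 0]
print(dataset)
```

With Pandas, the same idea would typically be expressed with column-wise conversion and boolean filtering on the data frame.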
Data are stored as a file that could be in a tabular form, image, graph, or even a text
or pdf file. Each form of presentation is important for a particular application but may
not be suitable for all. We extracted data as a tabular form (Table 3), which is easier to
sense and for utilizing cleansing methods. Data scientists apply efficient techniques for
analysis, present and store information using mathematical and statistical models. For
example, the linear equation is applied for linear regression to predict price or customer
satisfaction of a business organization. It is commonly used in principal component analysis
for dimensionality reduction, encoding of the dataset, or singular value decomposition.
Our dataset was extracted without unnecessary dimensions. The matrix is used to rep-
resent, store, and analyze images. It is also used to compress a file and our dataset was
comparatively small and there was no need to apply a compression technique. Vector
operations are used to calculate and predict the movement of a machine. To understand
the nature of data, scientists use central tendency and dispersion, while probability is
used in accuracy and prediction models. Moreover, point estimation, interval estimation,
hypothesis testing, and categorization algorithms use distribution theory like Sigmoid and
Gaussian functions. Mathematical and statistical models are commonly used in machine
learning algorithms that are applied in data science. Naïve Bayes, support vector machines
(SVM), and boost algorithms are used for supervised learning [57]. Wavelet coefficients
of natural images are relatively sparse models implemented as a wavelet coefficient for
natural image processing [58], Shannon source coding theorem is used for uniform coding
in tree construction [59], sensing data modeling [60], and applications available for data
transformation, projection of objects, as well as in learning algorithms. A sample of data
could represent the concept of overall information, and normalization can also be applied
for better visualization of multiple features in a single frame. Figure 3 shows overall
relation of four movie choice parameters though it was developed on the random sample of
one-tenth dataset. It also used a normalized scale of Income and Votes with respect to Rate
and Score. Interestingly, Income of the movie is showing a random manner in relation to all
other features.
Figure 3. Sample of Rating, Score, income, and votes.
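The paper does not state which normalization was used for Figure 3. One common option, shown here purely as an assumption, is min-max scaling of Votes and Income onto the ranges of Rating and Score so that all four features can share one plot. The function name `rescale` and the target ranges are illustrative choices; the input values are taken from rows of Table 3.

```python
def rescale(values, new_min, new_max):
    """Min-max scale `values` linearly into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        return [new_min for _ in values]  # constant column: nothing to spread
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

# Votes and Income for five movies listed in Table 3.
votes = [189661, 337113, 118215, 81270, 662747]
income = [51.08, 171.24, 17.65, 2.14, 37.63]

# Assumed target ranges: the Rating scale (0-10) and the Score scale (0-100).
votes_on_rating_scale = rescale(votes, 0, 10)
income_on_score_scale = rescale(income, 0, 100)
print([round(v, 2) for v in votes_on_rating_scale])
print([round(v, 2) for v in income_on_score_scale])
```

Min-max scaling preserves the ordering and relative spacing of the values, so patterns such as the irregular behaviour of Income remain visible after rescaling.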
Figure 4a,b represent the Score (x axis) and Rating (y axis) and show a scatter plot and a
boxplot, respectively. Users’ choice (Rating) and experts’ ranking (Score) are negatively
correlated (−0.03846622277552866). Figure 4a shows that the minimum Score is 70 (that is
the considered minimum accepted value for movie critic Metascore) and the maximum is 100;
most of the Metascore values are in the range of 80–95 and they are in small groups (which
motivates us to apply clustering techniques) that are very near to each other, with different
shapes. On the other hand, the data do not clearly fit a regression, and we do not know how
many groups can be made for classification. In the box diagram (Figure 4b), the quartile
values of the boxes are not on the same level, from 70 to 100, so that we can imagine a
curve through the middle of the boxes to classify or separate into two logical groups. However,
the irregular imaginary curve does not generate any sense of classification. There are a
few boxes of very small shape that consist of very few observations, and these can be regarded
as outliers of the dataset.
Figure 4. Score vs. Rating. (a) Scatter Plot; (b) Box Plot.
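For readers who want to reproduce the correlation figure, the Pearson coefficient can be computed as in the sketch below. It is an illustration only: the function name `pearson` is an assumption, and the input is limited to the ten (Score, Rating) pairs visible in Table 3, so its result is not expected to match the value reported for the full 424-record dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# (Score, Rating) for the ten movies shown in Table 3.
score = [86, 87, 80, 87, 90, 98, 93, 87, 92, 93]
rating = [7.4, 7.7, 7.6, 7.7, 7.3, 8.2, 8.1, 7.9, 8.0, 8.0]

r = pearson(score, rating)
print(round(r, 4))
```

A coefficient close to zero, as reported in the paper, indicates that a straight line explains almost none of the relationship between the two rankings.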
6. Machine Learning
Artificial intelligence (AI) improves the ability of a machine to imitate a human
and extended AI provides the learning capabilities of a machine called machine learning
(ML). When a machine uses more learner layers (hidden layers) in neural networks, it is
called deep learning (DL). In this research, we are concentrating on ML of supervised,
semi-supervised, unsupervised, and reinforcement learning. Supervised learning maps
input data to an output based on the predefined input–output mapping to participate in
machine learning systems. It works on labeled training data and is called a task-driven
learning system with classification and regression techniques. Unsupervised machine
learning is a data-driven approach that can analyze unlabeled data for clustering, feature
learning, dimensionality reduction, anomaly detection, etc. [57]. Semi-supervised learning
can work on labeled or unlabeled datasets for clustering and classification [59]. In the
real world, labeled data are limited and a semi-supervised model is more practical for
work on unlabeled datasets [61] for better performance. Reinforcement machine learning
allows its agents to learn from the environment. It is either a model-based or model-free
technique [61] with four elements: agent, environment, reward, and policy for controlling a
system (refer to Table 4).
Over the years, the data analytics technology ecosystem has integrated big data into
sophisticated computing platforms with analysis tools, techniques, and machine learning
algorithms [62]. Cognitive robotics, virtual agents, text analytics, and video analytics
applications improve the capabilities of machine learning frameworks. Structured real data
are important to train the machine, but internet pages have plenty of unstructured data.
6.2. Regression
Regression analysis is a mathematical model that predicts a dependent variable (outcome)
based on a set of independent variable(s). It is a predictive analysis used in artificial
intelligence and data science for forecasting, time series analysis, and finding the
cause–effect relationship between variables. It is also the process of fitting a group of
points to a curve so that it can be represented by a mathematical equation to predict an
outcome from given input(s). It helps to identify the significant factors of a dataset. Market
analysis, promotion, and price changes are its most common uses in intelligent business
analytics. It can be classified based on the number of independent variables, shape of the
curves, and type of dependent variable. It has several variations, like linear and nonlinear
regression, or simple and multiple regression analysis. Linear regression is easy to
interpret and understand but sensitive to outliers [102]. It is frequently employed in pricing
models, forecasting, and assessing financial performance to support better decision
making. Regression predicts a continuous value, while classification predicts a class label
only. Polynomial regression (medicine, archaeology, environmental study), power regression
(weather forecasting, physiotherapy, environmental study), exponential regression
(exploration, bacteria growth, population), and Gompertz regression are well known in
machine learning applications. Multiple linear regression is deployed for energy performance
forecasting [103], exponential regression and the relevance vector machine are used to
estimate residual life [104], polynomial regression is used in a design optimization
technique [105], and fuzzy polynomial regression is applied for feature selection and
adjustment models [106]. Regression functions vary from simple to complex; they consist of
a set of variables and coefficient(s) that are selected based on the required
accuracy [102]. During analysis (Section 4), we noticed that the rankings of the two methods
are negatively correlated, but dispersion is too high in the scatter diagram (Figure 4a). In
this scenario, regression analysis would produce large positive and negative errors, so
implementing it would not be meaningful.
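The point about high dispersion can be illustrated with a minimal sketch. It uses synthetic (Score, Rating) pairs generated with a weak negative trend and large noise; these numbers are an assumption made for illustration and are not the paper's dataset. The helper names `pearson_r` and `fit_line` are hypothetical. The sketch fits a least-squares line and reports how little of the variance the line explains.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

# Synthetic, widely dispersed data (assumed values, not the scraped IMDB data).
rng = random.Random(0)
scores = [rng.uniform(60, 100) for _ in range(200)]
ratings = [9.5 - 0.02 * s + rng.gauss(0, 0.5) for s in scores]

r = pearson_r(scores, ratings)
a, b = fit_line(scores, ratings)

# Residuals are the signed prediction errors of the fitted line.
residuals = [y - (a + b * x) for x, y in zip(scores, ratings)]
mean_rating = sum(ratings) / len(ratings)
total_ss = sum((y - mean_rating) ** 2 for y in ratings)
r_squared = 1 - sum(e ** 2 for e in residuals) / total_ss

print(f"r = {r:.3f}, slope = {b:.4f}, R^2 = {r_squared:.3f}")
print(f"largest positive error = {max(residuals):.2f}, "
      f"largest negative error = {min(residuals):.2f}")
```

When the points are scattered widely around the line, R^2 stays small and the residuals are large in both directions, which is the situation described for Figure 4a.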
6.4. Clustering
Clustering is an unsupervised machine learning technique that creates a group of
similar items from a large dataset where each group has a specific characteristic called a
cluster. Each cluster is separated from the others. It is a machine learning technique that
makes n numbers of categories without prior knowledge about the number of clusters
or characteristics of any cluster. It is used to identify the trend or pattern of the dataset,
image segmentation, biological grouping, similarity and dissimilarity identification, logical
partitioning, noise detection, visual object detection, dictionary learning, competitive
learning, etc. [109]. A data scientist can use various clustering methods based on the
requirements or outcomes of the task. Hierarchical clustering decomposes the dataset into
multiple levels but cannot correct erroneous merges or splits [110]. It is applicable to the
hierarchical architecture of species or objects, using agglomerative or divisive approaches.
Density-based clustering (e.g., DBSCAN) uses the distance between neighboring points to form
clusters, separating higher-density regions from lower-density regions. It is commonly
used in medical images for diagnosis [111]. It is not affected by outliers, and it is commonly
applied in noise detection or image segmentation [112]. Grid-based clustering summarizes
all data into a grid and then merges grid cells into clusters, which is good for dealing with
massive datasets. Model-based clustering is derived from statistical learning methods or
neural network learning methods to generate the required clusters. Partitioning clustering
methods use the mean or medoid to identify the center of mutually exclusive spherical
clusters and are good for small to medium datasets [110]. They are distance-based and
sensitive to outliers. K-means, k-medoids, CLARA, and CLARANS are the commonly
used partitioning clustering methods [113], and they are suitable for separate clusters with
a predefined cluster number [110]. An algorithm is selected based on the application and
time complexity. The time complexity of AI algorithms is influenced by the number of
instances (n), the number of attributes (m), and the number of iterations (i) [111]; for
example, the time complexity of SVM is O(n²), DT is O(mn²), and DBN is O((n + m)i). We
implemented k-means clustering; the time complexities of different clustering algorithms are
listed in Table 5 according to Yash Dagli [110].
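The difference between a mean center (k-means) and a medoid center (k-medoids), and the sensitivity to outliers, can be seen in a small sketch. The five (Score, Rating) points are made-up values, and the function names `mean_center` and `medoid_center` are hypothetical helpers, not part of any library.

```python
import math

def mean_center(points):
    """Center used by k-means: the coordinate-wise mean (need not be a data point)."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

def medoid_center(points):
    """Center used by k-medoids: the member with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Four similar movies plus one outlier (illustrative values only).
cluster = [(80, 7.9), (82, 8.0), (84, 8.1), (86, 8.2), (20, 3.0)]

print("mean  :", mean_center(cluster))    # pulled toward the outlier
print("medoid:", medoid_center(cluster))  # stays on an actual data point
```

In this toy cluster the mean is dragged to a Score of about 70, below every typical member, while the medoid remains at (82, 8.0). This is one reason k-medoids is often preferred when outliers are kept in the data.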
7. Result Analysis
This research takes a new approach to IMDB movie data that fulfills the aim of the research.
We created six clusters of the movies that will support the user in selecting a movie from
the desired cluster. This research adds a new dimension to the study of IMDB movie
information. Table 6 differentiates our study from previous works with respect to the
objectives of the study and the data collection method.
| Research on IMDB Movie Information | Objective of the Research | Data Extraction | Outcomes |
|---|---|---|---|
| S. M. Qaisar [16] | Classification: Sentiment analysis based on the text of the review comments. | Used data repository: created by Andrew Maas [114] | The comments are classified into positive and negative classification by the Long Short-Term Memory (LSTM) classifier and showed that the classification accuracy is 89.9%. |
| Sourav M. and Tanupriya C. [20] | Classification: Sentiment analysis based on the text of the review comments. | Used data repository: the “IMDB Large Movie Review Dataset” | Compared and analyzed SVM machine with Naïve Bayes and showed that SVM is more accurate than Naïve Bayes. |
| Naeem et al. [19] | Classification: Sentiment analysis based on the text of the review comments. | Used data repository: Kaggle.com | Compared gradient boosting classifiers, support vector machines (SVM), Naïve Bayes classifier, and random forest and showed that SVM is better than any other methods. |
| Aditya TS et al. [22] | Clustering: Based on the rating with respect to years, Facebook likes, and budget. | Used data repository: the Movie Database on kaggle.com | Created different clusters based on the rating of the movies with respect to release year, Facebook likes, etc., that support the user to select a popular movie from a particular domain. |
| Hasan B. and Serdar K. [21] | Clustering: Based on the genre of the movies. | Used data repository: the “IMDB Large Movie Review Dataset” | Created different clusters based on the genre of the movies that supports the user to select a popular movie from a particular genre. |
| Our study | Clustering: Based on the Metascore and Rating. | Web-scraped up-to-date data | Our study validates the scoring systems and supports the user to make faster decision based on the outcome of both scoring systems. |
The within-cluster sum of squares (WCSS) is the measure used to develop the elbow
diagram (Figure 5i). It shows the possible number of clusters (1 to 10) on the x axis and the
sum of the squared distances of the cluster elements from their centroid on the y axis.
The elbow function is developed based on the Rating and Score fields of the dataset. Cluster
numbers 3 to 6 (Figure 5i) fall in the elbow part of the curve, so these are logical, reasonable,
and acceptable cluster numbers for this dataset. Before cluster number 3 and after cluster
number 6, the curve changes sharply and there are no distinct notable points.
Cluster numbers 3 and 6 are the notable points to consider (fractional values are ignored
because the number of clusters must be an integer). We found that six clusters are most
suitable to see the relationship between Score and Rating (Figure 5ii). It is noticeable that
cluster ‘b’ and cluster ‘e’ have a more homogeneous (comparatively less distance among the
points of the cluster) bond, but cluster ‘a’ and cluster ‘d’ have a more heterogeneous (more
distance between the points of the two clusters) bond. For movie data, we can consider that
Metascore is strongly opposite to Rating between cluster ‘a’ and cluster ‘d’, while there is a
largely similar assessment between cluster ‘b’ and cluster ‘e’.
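The elbow computation described above can be sketched without external libraries. This is an illustrative implementation of Lloyd's k-means algorithm on synthetic (Score, Rating) points drawn around three assumed centers. It is not the authors' code and does not use their scraped data, and the function names `kmeans` and `wcss` are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for k clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # an empty cluster keeps its old centroid
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, labels

def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Synthetic data: 50 points around each of three assumed (Score, Rating) centers.
rng = random.Random(42)
blobs = [(70.0, 7.7), (82.0, 8.1), (92.0, 8.6)]
points = [(rng.gauss(cx, 2.0), rng.gauss(cy, 0.15)) for cx, cy in blobs for _ in range(50)]

# WCSS for k = 1..10; plotting k against these values gives the elbow diagram.
elbow = {k: wcss(points, *kmeans(points, k)) for k in range(1, 11)}
for k, value in elbow.items():
    print(k, round(value, 1))
```

In practice the same curve is usually obtained from scikit-learn, where a fitted `KMeans` model exposes the WCSS as its `inertia_` attribute. The elbow is the range of k after which further clusters reduce WCSS only slightly.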
The x axis of Figure 5ii represents Score, and Rating is represented on the y axis to form
six clusters that are indicated by a, b, c, d, e, and f. Clusters are formed by the points that
are nearest (Euclidean distance) to the centroid. For three-cluster modeling, the data are
clustered into three groups where the common point of these clusters is in the center of
the dataset. Each cluster then spreads over an angle of approximately 120 degrees, which does
not make good sense. According to Figure 5ii (the six-cluster model), group ‘e’ represents the
movies that have a balanced ratio of Metascore and Rating compared to the other clusters.
Cluster ‘c’ has a comparatively high Metascore with a minimum user Rating, while cluster
‘a’ achieves the highest user Rating and a standard Metascore of 85–95. Cluster ‘d’ and
cluster ‘e’ have the same user rating but there is a significant gap in the movie critic score.
Cluster ‘b’ has the lowest ranks for both measures among the clusters. Here, cluster ‘e’ is
the optimal one among the six clusters to select a movie with minimum risk.
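Assigning a movie to the nearest centroid can be sketched as follows. The two centroids, their labels ‘x’ and ‘y’, and the example movie are hypothetical values chosen for illustration; they are not the centroids of Figure 5ii. The optional `scale` argument is an assumption added here to show that the result can depend on whether Score (0 to 100) and Rating (0 to 10) are put on a comparable scale. The paper does not state whether such rescaling was applied.

```python
import math

def nearest_cluster(point, centroids, scale=(1.0, 1.0)):
    """Return the label of the centroid closest to `point` in Euclidean distance.

    `scale` divides each axis before measuring distance; (1.0, 1.0) means raw units.
    """
    def scaled(p):
        return tuple(value / s for value, s in zip(p, scale))
    return min(centroids, key=lambda name: math.dist(scaled(point), scaled(centroids[name])))

# Hypothetical centroids in (Score, Rating) space.
centroids = {"x": (90.0, 7.0), "y": (80.0, 9.0)}
movie = (84.0, 7.2)

raw_label = nearest_cluster(movie, centroids)                           # raw units
scaled_label = nearest_cluster(movie, centroids, scale=(100.0, 10.0))   # both axes in 0..1
print(raw_label, scaled_label)
```

In raw units the Score axis dominates the distance and the movie is assigned to ‘y’. After rescaling both axes to the 0 to 1 range, the Rating difference carries more weight and the same movie is assigned to ‘x’.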
Figure 5. (i) The elbow method and (ii) the clustering model.
language, and culture, etc. We applied k-means clustering without removing outliers that
fulfill our objective.
Future work and recommendation: In the near future, we are going to apply k-means and
k-medoids clustering for each of the major genres from the IMDB movie list. There is
adequate scope for extending the research to develop supervised and unsupervised
models that will help users and movie producers in decision making. There is a changing
pattern of popularity with respect to movie genre. This movie data
could be helpful for multi-criteria decision making and problem solving in statistics and AI.
The deep learning model of Kamran et al. [115] supports considering multiple vectors for
automatic decision making with good accuracy. We will extend our study to multilayer
deep learning algorithms to consider all influential factors (Metascore, User Ratings, Votes,
Gross Income) for supervised modeling based on the research of Kamran et al. [115]. This
idea could extend to validating any recommendation system where multiple online ratings
exist for a product or service.
Supplementary Materials: The following supporting information can be downloaded at:
https://www.mdpi.com/article/10.3390/computers11110158/s1, Table S1: Complete Dataset.
Author Contributions: K.U.S. and M.S. contributed to the investigation and project administration.
R.H. contributed to the supervision. S.M. contributed to the visualization. A.A. and S.H. contributed
to the resources and writing—review and editing. K.U.S. and A.D. contributed to the collected
data and conducted the pre-processing of the input data. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available in supplementary
material here.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Sarker, I.H.; Kayes, A.S.M.; Badsha, S.; Alqahtani, H.; Watters, P.; Ng, A. Cybersecurity data science: An overview from machine
learning perspective. J. Big Data 2020, 7, 41. [CrossRef]
2. Sarker, I.H.; Kayes, A.S.M. Abc-ruleminer: User behavioral rule based machine learning method for context-aware intelligent
services. J. Netw. Comput. Appl. 2020, 168, 102762. [CrossRef]
3. Cao, L. Data science: A comprehensive overview. ACM Comput. Surv. (CSUR) 2017, 50, 43. [CrossRef]
4. Han, J.; Pei, J.; Kamber, M. Data Mining: Concepts and Techniques; Elsevier: Amsterdam, The Netherlands, 2011.
5. Sarker, I.H.; Hoque, M.M.; Kafil Uddin, M.; Tawfeeq, A. Mobile data science and intelligent apps: Concepts, ai-based modeling
and research directions. Mob. Netw. Appl. 2021, 26, 285–303. [CrossRef]
6. Marchand, A.; Marx, P. Automated product recommendations with preference-based explanations. J. Retail. 2020, 96, 328–343.
[CrossRef]
7. Witten, I.H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Burlington, NJ, USA, 2005.
8. Sarker, I.H.; Watters, P.; Kayes, A.S.M. Effectiveness analysis of machine learning classification models for predicting personalized
context-aware smartphone usage. J. Big Data 2019, 6, 57. [CrossRef]
9. Harmon, S.A.; Sanford, T.H.; Sheng, X.; Turkbey, E.B.; Roth, H.; Ziyue, X.; Yang, D.; Myronenko, A.; Anderson, V.; Amalou,
A.; et al. Artificial intelligence for the detection of COVID-19 pneumonia on chest ct using multinational datasets. Nat. Commun.
2020, 11, 4080. [CrossRef]
10. Chen, J.; Kou, G.; Peng, Y. The dynamic effects of online product reviews on purchase decisions. Technol. Econ. Dev. Econ. 2018,
24, 2045–2064. [CrossRef]
11. Park, D.; Lee, J. eWOM overload and its effect on consumer behavioral intention depending on consumer involvement. Electron.
Commer. Res. Appl. 2008, 7, 386–398. [CrossRef]
12. Schneider, F.; Domahidi, E.; Dietrich, F. What Is Important When We Evaluate Movies? Insights from Computational Analysis of
Online Reviews. Media Commun. 2020, 8, 153–163. [CrossRef]
13. Raney, A.A.; Bryant, J. Entertainment and enjoyment as media effect. In Media Effects: Advances in Theory and Research, 4th ed.;
Oliver, M.B., Raney, A.A., Bryant, J., Eds.; Routledge: New York, NY, USA, 2020; pp. 324–341.
14. Quora, How Trustworthy Is IMDB with Its Ratings? Available online:
https://www.quora.com/How-trustworthy-is-IMDB-with-its-ratings (accessed on 10 October 2022).
15. Hsieh, J. Final Project: IMDB Data Analysis. 2015. Available online:
http://mercury.webster.edu/aleshunas/Support%20Materials/Analysis/Hsieh-Final%20Project%20imdb.pdf (accessed on 10 October 2022).
16. Qaisar, S.M. Sentiment Analysis of IMDb Movie Reviews Using Long Short-Term Memory. In Proceedings of the 2020 2nd
International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 13–15 October 2020; pp. 1–4.
[CrossRef]
17. Topal, K.; Ozsoyoglu, G. Movie review analysis: Emotion analysis of IMDb movie reviews. In Proceedings
of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Davis, CA,
USA, 18–21 August 2016; pp. 1170–1176. [CrossRef]
18. Nithin, V.; Pranav, M.; Babu, P.S.; Lijiya, A. Predicting Movie Success Based on IMDB Data. Int. J. Bus. Intell. 2014, 3, 34–36.
[CrossRef]
19. Naeem, M.Z.; Rustam, F.; Mehmood, A.; Din, M.Z.; Ashraf, I.; Choi, G.S. Classification of movie reviews using term frequency-
inverse document frequency and optimized machine learning algorithms. PeerJ Comput. Sci. 2022, 8, e914. [CrossRef] [PubMed]
20. Mehra, S.; Choudhary, T. Sentiment Analysis of User Entered Text. In Proceedings of the International Conference of Computa-
tional Techniques, Electronics and Mechanical Systems (CTEMS), Belgaum, India, 21–22 December 2018; ISBN 978-1-5386-7709-4.
21. Bulut, H.; Korukoglu, S. Analysis and Clustering of Movie Genres. J. Comput. 2011, 3, 16–23.
22. Aditya, T.S.; Rajaraman, K.; Subashini, M.M. Comparative Analysis of Clustering Techniques for Movie Recommendation. In
Proceedings of the MATEC Web of Conferences 225, Nadu, India, 18–19 September 2018; p. 02004.
23. Lawson, R. Web Scraping with Python; Packt Publishing Ltd.: Birmingham, UK, 2015.
24. Gheorghe, M.; Mihai, F.-C.; Dârdală, M. Modern techniques of web scraping for data scientists. Int. J. User-Syst. Interact. 2018, 11,
63–75.
25. Rahman, R.U.; Tomar, D.S. Threats of price scraping on e-commerce websites: Attack model and its detection using neural
network. J. Comput. Virol. Hacking Tech. 2020, 17, 75–89. [CrossRef]
26. Watson, H.J. Tutorial: Big Data Analytics: Concepts, Technologies, and Applications. Commun. Assoc. Inf. Syst. 2014, 34,
1247–1268. [CrossRef]
27. Sarker, K.U.; Deraman, A.B.; Hasan, R.; Abbas, A. Ontological Practice for Big Data Management. Int. J. Comput. Digit. Syst. 2019,
8, 265–273. Available online: https://journal.uob.edu.bh/handle/123456789/3485 (accessed on 24 July 2022). [CrossRef]
28. Almaqbali, I.S.; Al Khufairi, F.M.; Khan, M.S.; Bhat, A.Z.; Ahmed, I. Web Scrapping: Data Extraction from Websites. J. Stud. Res.
2019, 12. [CrossRef]
29. Chaulagain, R.S.; Pandey, S.; Basnet, S.R.; Shakya, S. Cloud based web scraping for big data applications. In Proceedings of the
2017 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, USA, 3–5 November 2017; pp. 138–143.
30. Sirisuriya, D.S. A comparative study on web scraping. In Proceedings of the 8th International Research Conference, KDU,
Palisades, NY, USA, 7–10 October 2015.
31. Milev, P. Conceptual approach for development of web scraping application for tracking information. Econ. Altern. 2017, 3,
475–485.
32. Hillen, J. Web scraping for food price research. Br. Food J. 2019, 121, 3350–3361. [CrossRef]
33. Shaukat, K.; Alam, T.M.; Ahmed, M.; Luo, S.; Hameed, I.A.; Iqbal, M.S.; Li, J. A Model to Enhance Governance Issues through
Opinion Extraction. In Proceedings of the 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication
Conference (IEMCON), Vancouver, BC, Canada, 4–7 November 2020; pp. 0511–0516. [CrossRef]
34. Mitchell, R. Web Scraping with Python: Collecting More Data from the Modern Web; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018.
35. Broucke, S.V.; Baesens, B. Practical Web Scraping for Data Science: Best Practices and Examples with Python, 1st ed.; Apress: New York,
NY, USA, 2018.
36. Black, M.L. The World Wide Web as Complex Data Set: Expanding the Digital Humanities into the Twentieth Century and
Beyond through Internet Research. Int. J. Humanit. Arts Comput. 2016, 10, 95–109. [CrossRef]
37. Zhao, B. Web scraping. In Encyclopedia of Big Data; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 1–3.
38. Tarannum, T. Cleaning of Web Scraped Data with Python. Doctoral Dissertation, Brac University, Dhaka, Bangladesh, 1 April 2019.
39. Manjushree, B.S.; Sharvani, G.S. Survey on Web scraping technology. Wutan Huatan Jisuan Jishu 2020, XVI(VI), 1–8.
40. Yannikos, Y.; Heeger, J.; Brockmeyer, M. An Analysis Framework for Product Prices and Supplies in Darknet Marketplaces. In
Proceedings of the 14th International Conference on Availability, Reliability and Security, New York, NY, USA, 26 August 2019;
Association for Computing Machinery: New York, NY, USA, 2019.
41. Kurniawati, D.; Triawan, D. Increased information retrieval capabilities on e-commerce websites using scraping techniques.
In Proceedings of the 2017 International Conference on Sustainable Information Engineering and Technology (SIET), Malang,
Indonesia, 24–25 November 2017; pp. 226–229.
42. Raicu, I. Financial Banking Dataset for Supervised Machine Learning Classification. Inform. Econ. 2019, 23, 37–49. [CrossRef]
43. Mbah, R.B.; Rege, M.; Misra, B. Discovering Job Market Trends with Text Analytics. In Proceedings of the 2017 International
Conference on Information Technology (ICIT), Singapore, 27–29 December 2017; pp. 137–142. [CrossRef]
44. Farooq, B.; Husain, M.S.; Suaib, M. New Insights into Rental Housing Markets across the United States: Web Scraping and
Analyzing Craigslist Rental Listings. Int. J. Adv. Res. Comput. Sci. 2018, 9, 64–67.
45. Lunn, S.; Zhu, J.; Ross, M. Utilizing Web Scraping and Natural Language Processing to Better Inform Pedagogical Practice. In
Proceedings of the 2020 IEEE Frontiers in Education Conference (FIE), Uppsala, Sweden, 21–24 October 2020; pp. 1–9. [CrossRef]
46. Andersson, P. Developing a Python Based Web Scraper: A Study on the Development of a Web Scraper for TimeEdit. Master’s
Thesis, Mid Sweden University, Holmgatan, Sweden, 1 July 2021. Available online:
https://www.diva-portal.org/smash/get/diva2:1596457/FULLTEXT01.pdf (accessed on 8 August 2022).
47. Uzun, E.; Yerlikaya, T.; Kirat, O. Comparison of Python Libraries used for Web Data Extraction. J. Tech. Univ.–Sofia Plovdiv Branch
Bulg. “Fundam. Sci. Appl.” 2018, 24, 87–92.
48. Uzun, E.; Buluş, H.N.; Doruk, A.; Özhan, E. Evaluation of Hap, Angle Sharp and HTML Document in web content extraction. In
Proceedings of the International Scientific Conference’2017 (UNITECH’17), Gabrovo, Bulgaria, 17–18 November 2017; Volume II,
pp. 275–278.
49. Ferrara, E.; De Meo, P.; Fiumara, G.; Baumgartner, R. Web data extraction, applications and techniques: A survey. Knowl.-Based
Syst. 2014, 70, 301–323. [CrossRef]
50. Munzert, S.; Rubba, C.; Meißner, P.; Nyhuis, D. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining;
John Wiley & Sons, Ltd.: Chichester, UK, 2015.
51. Krotov, V.; Tennyson, M. Scraping Financial Data from the Web Using the R Language. J. Emerg. Technol. Account. 2018, 15,
169–181. [CrossRef]
52. Ives, B.; Palese, B.; Rodriguez, J.A. Enhancing Customer Service through the Internet of Things and Digital Data Streams. MIS Q.
Exec. 2016, 15, 4.
53. Constantiou, I.D.; Kallinikos, J. New Games, New Rules: Big Data and the Changing Context of Strategy. J. Inf. Technol. 2015, 30,
44–57. [CrossRef]
54. Snell, J.; Menaldo, N. Web Scraping in an Era of Big Data 2.0. Bloomberg BNA. 2016. Available online:
https://www.bna.com/web-scraping-era-n57982073780/ (accessed on 13 September 2022).
55. Dryer, A.J.; Stockton, J. Internet ‘Data Scraping’: A Primer for Counseling Clients. New York Law Journal. 2013. Available online:
https://www.law.com/newyorklawjournal/almID/1202610687621 (accessed on 13 September 2022).
56. Alam, T.M.; Shaukat, K.; Hameed, I.A.; Khan, W.A.; Sarwar, M.U.; Iqbal, F.; Luo, S. A novel framework for prognostic factors
identification of malignant mesothelioma through association rule mining. Biomed. Signal Process. Control. 2021, 68, 102726.
[CrossRef]
57. Sulong, G.; Mohammedali, A. Recognition of human activities from still image using novel classifier. J. Theor. Appl. Inf. Technol.
2015, 71, 59103531.
58. Mallat, S. A Wavelet Tour of Signal Processing: The Sparse Way; Academic Press: Cambridge, MA, USA, 2008.
59. Gutiérrez-Gómez, L.; Petry, F.; Khadraoui, D. A comparison framework of machine learning algorithms for mixed-type variables
datasets: A case study on tire-performances prediction. IEEE Access 2020, 8, 214902–214914. [CrossRef]
60. Starck, J.; Murtagh, F.; Fadili, J. Sparse Image and Signal Processing: Wavelets and Related Geometric Multiscale Analysis; Cambridge
University Press: Cambridge, UK, 2015.
61. Mohammed, M.; Khan, M.B.; Bashier Mohammed, B.E. Machine Learning: Algorithms and Applications; CRC Press: Boca Raton, FL,
USA, 2016.
62. Paltrinieri, N.; Comfort, L.; Reniers, G. Learning about risk: Machine learning for risk assessment. Saf. Sci. 2019, 118, 475–486.
[CrossRef]
63. Shaukat, K.; Iqbal, F.; Alam, T.M.; Aujla, G.K.; Devnath, L.; Khan, A.G.; Iqbal, R.; Shahzadi, I.; Rubab, A. The Impact of Artificial
intelligence and Robotics on the Future Employment Opportunities. Trends Comput. Sci. Inf. Technol. 2020, 5, 050–054.
64. Yu, S.; Chen, Y.; Zaidi, H. AVA: A financial service chatbot based on deep bidirectional transformers. Front. Appl. Math. Stat. 2021,
7, 604842. [CrossRef]
65. Eling, M.; Nuessl, D.; Staubli, J. The impact of artificial intelligence along the insurance value chain and on the insurability of
risks. In Geneva Paper on Risk and Insurance-Issues and Practices; Springer: Berlin/Heidelberg, Germany, 2021. [CrossRef]
66. Dornadula, V.N.; Geetha, S. Credit card fraud detection using machine learning algorithms. Procedia Comput. Sci. 2019, 165,
631–641. [CrossRef]
67. Leo, M.; Sharma, S.; Maddulety, K. Machine learning in banking risk management: A literature review. Risks 2019, 7, 29.
[CrossRef]
68. Zand, A.; Orwell, J.; Pfluegel, E. A secure framework for anti-money laundering using machine learning and secret sharing. In
Proceedings of the International Conference on Cyber Security and Protection of Digital Services, Dublin, Ireland, 15–19 June
2020; pp. 1–7. [CrossRef]
69. Gu, S.; Kelly, B.; Xiu, D. Empirical asset pricing via machine learning. Rev. Financ. Stud. 2020, 33, 2233–2273. [CrossRef]
70. Ye, T.; Zhang, L. Derivatives pricing via machine learning. J. Math. Financ. 2019, 9, 561–589. [CrossRef]
71. Javed, U.; Shaukat, K.; AHameed, I.; Iqbal, F.; Mahboob Alam, T.; Luo, S. A Review of Content-Based and Context-Based
Recommendation Systems. Int. J. Emerg. Technol. Learn. (iJET) 2021, 16, 274–306. [CrossRef]
72. Ramzan, B.; Bajwa, I.S.; Jamil, N.; Amin, R.U.; Ramzan, S.; Mirza, F.; Sarwar, N. An Intelligent Data Analysis for Recommendation
Systems Using Machine Learning. Sci. Program. 2019, 2019, 5941096. [CrossRef]
73. Zhou, Y.; Mao, H.; Yi, Z. Cell mitosis detection using deep neural networks. Knowl.-Based Syst. 2017, 137, 19–28. [CrossRef]
74. Yang, S.; Korayem, M.; AlJadda, K.; Grainger, T.; Natarajan, S. Combining content-based and collaborative filtering for job
recommendation system: A cost-sensitive statistical relational learning approach. Knowl.-Based Syst. 2017, 136 (Suppl. C), 37–45.
[CrossRef]
75. Cohen, Y.; Hendler, D.; Rubin, A. Detection of malicious webmail attachments based on propagation patterns. Knowl.-Based Syst.
2018, 141, 67–79. [CrossRef]
76. Shaukat, K.; Luo, S.; Varadharajan, V.; Hameed, I.A.; Chen, S.; Liu, D.; Li, J. Performance Comparison and Current Challenges of
Using Machine Learning Techniques in Cybersecurity. Energies 2020, 13, 2509. [CrossRef]
77. Shaukat, K.; Luo, S.; Varadharajan, V.; Hameed, I.A.; Xu, M. A Survey on Machine Learning Techniques for Cyber Security in the
Last Decade. IEEE Access 2020, 8, 222310–222354. [CrossRef]
78. Rodríguez, C.; Florian, D.; Casati, F. Mining and quality assessment of mashup model patterns with the crowd: A feasibility
study. ACM Trans. Internet Technol. 2016, 16, 17. [CrossRef]
79. Xu, K.; Zheng, X.; Cai, Y.; Min, H.; Gao, Z.; Zhu, B.; Xie, H.; Wong, T. Improving user recommendation by extracting social topics
and interest topics of users in uni-directional social networks. Knowl.-Based Syst. 2018, 140 (Suppl. C), 120–133. [CrossRef]
80. Castillo, P.A.; Mora, A.M.; Faris, H.; Merelo, J.J.; García-Sanchez, P.; Fernandez-Ares, A.J.; de las Cuevas, P.; García-Arenas, M.I.
Applying computational intelligence methods for predicting the sales of newly published books in a real editorial business
management environment. Knowl.-Based Syst. 2017, 115 (Suppl. C), 133–151. [CrossRef]
81. Hajek, P.; Henriques, R. Mining corporate annual reports for intelligent detection of financial statement fraud–Comparative study
of machine learning methods. Knowl.-Based Syst. 2017, 128, 139–152. [CrossRef]
82. Lee, W.; Chen, C.; Huang, J.; Liang, J. A smartphone-based activity aware system for music streaming recommendation.
Knowl.-Based Syst. 2017, 131 (Suppl. C), 70–82. [CrossRef]
83. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the kdd cup 99 data set. In. IEEE symposium on
computational intelligence for security and defense applications. IEEE 2009, 2009, 1–6.
84. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg,
V.; et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
85. Wu, C.-C.; Yen-Liang, C.; Yi-Hung, L.; Xiang-Yu, Y. Decision tree induction with a constrained number of leaf nodes. Appl. Intell.
2016, 45, 673–685. [CrossRef]
86. Holte, R.C. Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 1993, 11, 63–90.
[CrossRef]
87. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on
Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.: Burlington, NJ, USA, 1995; pp. 338–345.
88. Sarker, I.H. A machine learning based robust prediction model for real-life mobile phone data. Internet Things 2019, 5, 180–193.
[CrossRef]
89. LeCessie, S.; Van Houwelingen, J.C. Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1992, 41, 191–201.
90. Kibler, D.; Albert, M. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66.
91. Keerthi, S.S.; Shevade, S.K.; Bhattacharyya, C.; Radha Krishna, M.K. Improvements to platt’s smo algorithm for svm classifier
design. Neural Comput. 2001, 13, 637–649. [CrossRef]
92. Quinlan, J.R. C4.5: Programs for machine learning. Mach. Learn. 1993, 16, 235–240.
93. Sarker, I.H.; Abushark, Y.B.; Alsolami, F.; Khan, A. Intrudtree: A machine learning based cyber security intrusion detection model.
Symmetry 2020, 12, 754. [CrossRef]
94. Sarker, I.H.; Alan, C.; Jun, H.; Khan, A.I.; Abushark, Y.B.; Khaled, S. Behavdtee: A behavioral decision tree learning to build
user-centric context-aware predictive model. Mob. Netw. Appl. 2019, 25, 1151–1161. [CrossRef]
95. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [CrossRef]
96. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
97. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
98. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [CrossRef]
99. Amit, Y.; Geman, D. Shape quantization and recognition with randomized trees. Neural Comput. 1997, 9, 1545–1588. [CrossRef]
100. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2014.
101. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. Icml Citeseer 1996, 96, 148–156.
102. Iqbal, M.A. Application of Regression Techniques with their Advantages and Disadvantages. Elektron. Mag. 2021, 4, 11–17.
103. Ciulla, G.; Amico, A.D. Building energy performance forecasting: A multiple linear regression approach. Appl. Energy 2019, 253,
113500. [CrossRef]
104. Maio, F.D.; Tsui, K.L.; Zio, E. Combining relevance vector machines and exponential regression for bearing residual life estimation.
Mech. Syst. Signal Process. 2012, 31, 405–427. [CrossRef]
105. Kim, S.J.; Kim, C.H.; Jung, S.Y.; Kim, Y.J. Optimal design of novel pole piece for power density improvement of magnetic gear
using polynomial regression analysis. IEEE Trans. Energy Convers. 2015, 30, 1171–1179. [CrossRef]
106. Wi, Y.M.; Joo, S.K.; Song, K.B. Holiday load forecasting using fuzzy polynomial regression with weather feature selection and
adjustment. IEEE Trans. Power Syst. 2011, 27, 596–603. [CrossRef]
107. Wiering, M.A.; Van Otterlo, M. Reinforcement learning. Adapt. Learn. Optim. 2012, 12, 729.
108. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [CrossRef]
109. Dar, K.S.; Javed, I.; Amjad, W.; Aslam, S.; Shamim, A. A Survey of clustering applications. J. Netw. Commun. Emerg. Technol.
(JNCET) 2015, 4, 10–15.
110. Dagli, Y. Partitional Clustering using CLARANS Method with Python Example. 2019. Available online:
https://medium.com/analytics-vidhya/partitional-clustering-using-clarans-method-with-python-example-545dd84e58b4 (accessed on 29 August 2022).
111. Shaukat, K.; Masood, N.; Shafaat, A.B.; Jabbar, K.; Shabbir, H.; Shabbir, S. Dengue Fever in Perspective of Clustering Algorithms.
arXiv 2015, arXiv:abs/1511.07353.
112. Chauhan, N.S. DBSCAN Clustering Algorithm in Machine Learning. An Introduction to the DBSCAN Algorithm and Its
Implementation in Python. KDnuggets. 2022. Available online:
https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html (accessed on 30 August 2022).
113. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160.
114. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of
the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA,
USA, 19–24 June 2011; Volume 1, pp. 142–150.
115. Shaukat, K.; Luo, S.; Varadharajan, V. A novel method for improving the robustness of deep learning-based malware detectors
against adversarial attacks. Eng. Appl. Artif. Intell. 2022, 116, 105461. [CrossRef]