
Module-2 Chapter 5

Chapter 5 discusses the shift from static to dynamic machine learning (ML) frameworks, emphasizing the need for models to adapt to constantly changing data. It outlines a semi-automated dynamic ML framework that selects appropriate learning techniques based on user input and real-time data characteristics. The chapter also contrasts static and dynamic learning, highlighting the advantages of dynamic systems in responding to evolving business needs and the challenges associated with monitoring and maintaining these systems.

Chapter 5

Dynamicity in learning
Smart selection of learning techniques

DYNAMICITY IN ML

Automated machine learning (ML) represents a fundamental shift in the way
organizations approach data science. Current ML theory and practice face a
challenge. ML models are selected with the data and its format in mind;
they also depend on the business context. Both the context and the incoming
data can change, leading to a potentially scattered ML model that may have
lost its relevance. This chapter outlines an integrated online dynamic ML
framework that unifies the scattered ML technology in data science.
Conventional ML algorithms are static. They feed on datasets that were
handcrafted by experts prior to the ML training sessions. Many firms are
accustomed to training their models on such old historical data. In
reality, however, datasets keep changing at a rapid pace. Businesses need
to align their models to cope with data that changes online. The dynamic ML
framework takes ML to the next level. The framework proposed in this
chapter responds to the changing data challenge semiautomatically. Through
an expert system engine at the front-end interface, it queries the user
about the characteristics of the incoming data. The framework then infers
and chooses the appropriate ML category (supervised, unsupervised, etc.)
and ML mode (shallow, deep, transfer learning) to match the incoming data.
Once set into motion online, the framework constantly monitors the changing
data at the input, selects the appropriate category and mode of learning,
and updates the prediction at the output, without relying on the expertise
of a professional data scientist.
Applying traditional machine learning methods to real-world busi-
ness problems is a time-consuming, expensive, and error-prone endeavor.
Such attempts require hiring a team of professional data scientists. These
sought-after professionals are expensive and in short supply. Several data sci-
ence reports lament that it is hard to hire a data scientist and harder still to
keep her in the enterprise.1,2

122 Artificial Intelligence for Business Optimization

A description and comparison of static and dynamic ML are provided next.

Static learning
Static learning is performed manually most of the time. Static learning is
also known as offline learning. The datasets are almost always prepared
offline. Relevant data is collected from various sources and checked by
domain experts and data scientists for inconsistencies. The datasets are
later preprocessed by using some of the techniques described in Chapter 4.
Upon completion of the data preprocessing stage, the data is then divided
into training, validation, and testing subsets. The data scientists then look
for an appropriate model to be trained for prediction. The model is trained,
validated, and tested using the predetermined subsets. The model hyperpa-
rameters are also tuned to give the desired level of prediction accuracy. The
learned model is then frozen as a time-tested artifact. In static or offline
learning, the model is trained only once and then used for prediction for a
while.3 When new enterprise data becomes available, the model is used for
making predictions.
Static learning is easier and more cost-effective as far as implementation
and maintenance are concerned. The ML team undertakes a one-time effort and
investment to obtain the trained model. The benefits can then be reaped
against new data that steadily becomes available in the enterprise. On the
downside, models that are trained in a static manner cannot cope with
rapidly changing data. There are weekly, monthly, seasonal, and annual data
changes in any enterprise. Consider, for instance, firms selling costumes
and apparel. Their static machine learning models will not be in a position
to predict sales accurately because of seasonal variation in the data. The
statically learned models will certainly break down around the time of
Halloween, because of a huge variation in the sale of costumes. To respond
to the data-driven business requirements, a dynamic learning framework is
presented in the following sections.

Dynamic learning
Data in real life is not static. It is constantly changing, growing, and
diversifying. Dynamic learning is designed to adapt to changing data. Older
datasets are updated and newer datasets are created with the passage of
time. Dynamic learning models are flexible enough to take in new forms of
data and continue the learning task. This kind of online learning is
adaptive and continuous. The ML algorithm continuously improves its
learning, and prediction is performed on the fly. The major disadvantage of
dynamic learning systems is that the incoming data, the model, and the
entire ML pipeline have to be continually monitored.
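The contrast between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures are invented, the "model" is a simple mean, and the class name `OnlineMean` and the learning rate `alpha` are hypothetical choices, not part of the chapter's framework.

```python
def static_mean(history):
    """Static learning: train once, then freeze the prediction."""
    return sum(history) / len(history)


class OnlineMean:
    """Dynamic learning: an exponentially weighted mean updated per sample."""

    def __init__(self, initial, alpha=0.5):
        self.estimate = initial
        self.alpha = alpha  # hypothetical learning rate, chosen for illustration

    def predict(self):
        return self.estimate

    def update(self, observed):
        # Move the estimate a fraction alpha toward the new observation.
        self.estimate += self.alpha * (observed - self.estimate)


def mean_abs_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Invented numbers: ordinary weekly costume sales, then a Halloween spike.
history = [100, 102, 98, 101, 99]
stream = [100, 101, 300, 320, 310, 305]

frozen = static_mean(history)
online = OnlineMean(initial=frozen)

static_preds, online_preds = [], []
for sale in stream:
    static_preds.append(frozen)            # the static model never changes
    online_preds.append(online.predict())  # predict first, then learn
    online.update(sale)

static_error = mean_abs_error(static_preds, stream)
online_error = mean_abs_error(online_preds, stream)
print(f"static error: {static_error:.1f}, online error: {online_error:.1f}")
```

On this toy stream the online learner ends up with the smaller error because it follows the spike, while the frozen model keeps predicting the pre-Halloween average. The price, as noted above, is that the updating model has to be monitored continually.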

DATA AND ALGORITHM SELECTIONS

Data is the oil for the machinery of ML. Everything in ML is data-driven.
The type of learning and the ML algorithm to be chosen for implementing ML
are all dictated by data. Data scientists and ML experts look at the data
and manually select the appropriate ML algorithm. The criteria for
selecting an ML algorithm are input–output pairs, absence of output
variable, few input–output pairs, absence of state-action-reward tuples,
and data collection by interacting with the environment, as summarized
below.

Input–output pairs
Datasets are usually in the form of tables with rows and columns. The col-
umns represent the data attributes. The columns on the left represent the
independent variables. The rightmost column is the output vector and repre-
sents the dependent variable. Supervised learning algorithms seek the func-
tional relationship between the input (independent) and output (dependent)
variables. Table 5.1 is an example of a dataset that can be used for regres-
sion as well as classification. It is related to the red variant of the Portuguese
Vinho Verde wine.4
The rightmost column indicates the quality of red wine on a scale from 0
to 10, dependent on the 11 attributes or features given in the columns on the
left. The dataset is also available in the UCI public repository.5

Absence of output variable


The data table is almost exactly as in Table 5.1, but without the rightmost
column containing the output. The unsupervised learning algorithms seek
patterns in the input feature space and group the data into clusters such that
each cluster contains similar datapoints.

Few input–output pairs


Most real-life datasets do not have an output label with respect to the input.
Labeling the data is expensive and time-consuming. However, with expert
knowledge, some datapoints in the dataset are labeled. Semi-supervised
algorithms learn classification from the labeled datapoints and assign
pseudo-labels to the unlabeled datapoints. The pseudo-labels are refined
iteratively through self-training or co-training (Table 5.2).

Table 5.1 Dataset format

    Fixed    Volatile  Citric  Residual             Free sulfur  Total sulfur
#   acidity  acidity   acid    sugar     Chlorides  dioxide      dioxide       Density  pH    Sulphates  Alcohol  Quality
1   7.4      0.7       0       1.9       0.076      11           34            0.9978   3.51  0.56       9.4      5
2   7.8      0.88      0       2.6       0.098      25           67            0.9968   3.2   0.68       9.8      5
3   7.8      0.76      0.04    2.3       0.092      15           54            0.997    3.26  0.65       9.8      5
4   11.2     0.28      0.56    1.9       0.075      17           60            0.998    3.16  0.58       9.8      6
5   7.4      0.7       0       1.9       0.076      11           34            0.9978   3.51  0.56       9.4      5
6   7.4      0.66      0       1.8       0.075      13           40            0.9978   3.51  0.56       9.4      5
7   7.9      0.6       0.06    1.6       0.069      15           59            0.9964   3.3   0.46       9.4      5
8   7.3      0.65      0       1.2       0.065      15           21            0.9946   3.39  0.47       10       7
9   7.8      0.58      0.02    2         0.073      9            18            0.9968   3.36  0.57       9.5      7
10  7.5      0.5       0.36    6.1       0.071      17           102           0.9978   3.35  0.8        10.5     5
11  6.7      0.58      0.08    1.8       0.097      15           65            0.9959   3.28  0.54       9.2      5
12  7.5      0.5       0.36    6.1       0.071      17           102           0.9978   3.35  0.8        10.5     5
13  5.6      0.615     0       1.6       0.089      16           59            0.9943   3.58  0.52       9.9      5
14  7.8      0.61      0.29    1.6       0.114      9            29            0.9974   3.26  1.56       9.1      5
15  8.9      0.62      0.18    3.8       0.176      52           145           0.9986   3.16  0.88       9.2      5

Table 5.2 Criteria for selecting ML algorithms

Data type                                         Learning category
Input–output pairs                                Supervised learning
Absence of output variable                        Unsupervised learning
Few input–output pairs                            Semi-supervised learning
Absence of state-action-reward tuples             Reinforcement learning
Data collection by interacting with environment   Deep reinforcement learning

Absence of state-action-reward tuples

In reinforcement learning, no data is provided to the learning agent. The
agent perceives its current state and randomly takes an action. The action
is immediately rewarded or penalized by the environment. Further, the
agent’s action changes the state of the environment, and the agent takes
the next action in the new state. The state-action-reward cycles last till
the end of a reinforcement learning session, where the cumulative reward is
computed. In the absence of state-action-reward data tuples, the agent must
learn a policy that maximizes the long-term cumulative reward.
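The state-action-reward cycle can be made concrete with a small tabular Q-learning sketch. Everything in it is an assumption made for illustration: the five-state chain environment, the reward values, the epsilon-greedy exploration rule, and the hyperparameters `alpha`, `gamma`, and `epsilon` are not taken from the chapter.

```python
import random


def step(state, action, n_states=5):
    """Hypothetical chain environment: action 0 moves left, 1 moves right."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == n_states - 1
    reward = 1.0 if done else -0.01   # small penalty per move, reward at goal
    return next_state, reward, done


def q_learning(episodes=200, n_states=5, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action, n_states)
            # Immediate reward plus discounted value of the new state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy for the non-terminal states (1 means "move right").
policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(policy)
```

No dataset is supplied up front: the agent generates its own state-action-reward tuples by acting, and the learned table encodes a policy that favors the long-term cumulative reward over the per-move penalty.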

Data collection by interacting with environment


Deep reinforcement learning methods, however, require active online data
collection, where the model actively interacts with its environment. This
makes such methods hard to scale to complex real-world problems, where
active data collection means that large datasets must be collected for every
experiment – this can be expensive and, for systems such as autonomous
vehicles or robots, potentially unsafe.
In many domains of practical interest, such as autonomous driving,
robotics, and games, data is collected online as the model actively
interacts with its environment. The previously collected interaction data
consists of informative behaviors in the form of state-action-reward
tuples. Deep RL algorithms that can utilize such prior datasets will not
only scale to real-world problems, but will also lead to solutions that
generalize substantially better.6

GAME TREE AND STATE EXPLOSION

The Min-Max algorithm is a robust artificial intelligence (AI) algorithm
for developing game-playing software programs. Min and Max are metaphorical
agents playing a board game. Max (computer program) is trying to maximize
her score at every move of the game, while Min (human player) is trying to
oppose and minimize Max’s score. The algorithm draws a game tree,
specifying Min and Max’s playing turn, and gives scores to the leaf nodes
based on the evaluation of some heuristic function. It then backtracks
Max’s winning path consisting of a series of moves.7

Figure 5.1 Tree structure (labels: root, node, leaf).

Table 5.3 Game tree complexity of various board games

Game          State-space  Game-tree   Branching  Average game
              complexity   complexity  factor     length
Tic-Tac-Toe   10^3         10^5        4          9
Connect four  10^13        10^21       4          36
Othello       10^28        10^58       10         58
Checkers      10^21        10^31       -          -
Chess         10^46        10^123      35         80
Shogi         10^71        10^226      92         115
Go            10^72        10^360      250        150

The tree structure is an abstract mathematical concept for representing
objects and their relationships. The tree consisting of nodes and branches
resembles a natural tree turned upside down (Figure 5.1). Two nodes in the
tree have no more than one path between them. The starting node is called
the root, and the end nodes are called leaves. A node is called a parent node
if it has children nodes connected to it. Multiple nodes proceeding from the
same parent are called siblings. The arc or edge between two nodes is called
a branch.
The Min-Max algorithm with its variants was a classic game-playing algo-
rithm used in the early days of AI before the advent of reinforcement learn-
ing. Generating the entire game tree for any of the board games shown in
Table 5.3 is practically impossible, given the size of the datasets. To overcome
this problem, Min-Max adopts various techniques to prune the branches of
the tree.
The method at best is heuristic and not exact. The biggest disadvantage of
this game-playing algorithm is that there is no learning. Therefore, the
agent has no game-playing policy to make the winning moves. Latest research
has repeatedly demonstrated that the best way of circumventing the problem
of generating unrealistically massive datasets is to embed some kind of
reinforcement learning in the system.
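A minimal sketch of Min-Max on a toy game tree is given below. The chapter does not name a specific pruning technique; alpha-beta pruning is used here as one common choice. The nested-list tree encoding and the leaf scores are assumptions made for the example.

```python
import math


def minimax(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Return the backed-up score of a game-tree node.

    A node is either a number (a leaf score from some heuristic
    evaluation) or a list of child nodes.
    """
    if not isinstance(node, list):
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, minimax(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:      # Min would never allow this branch
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, minimax(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:          # Max already has a better option
            break
    return best


# Toy two-ply tree: Max moves first, Min replies; leaves hold scores.
tree = [[3, 5, 2], [9, 1, 7], [4, 6, 8]]
best_score = minimax(tree, maximizing=True)
best_move = max(range(len(tree)), key=lambda i: minimax(tree[i], False))
print(best_score, best_move)
```

Min reduces the three branches to 2, 1, and 4, so Max picks the third branch. In the second branch the search stops after the leaf 1, because Max already holds a guaranteed 2 from the first branch; this is the pruning that keeps larger trees tractable. The scores are fixed inputs, so nothing is learned from play, which is the limitation noted above.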

DATA AUGMENTATION

Deep learning models consume tons of data, not just in quantity but also
in diversity.
However, it is very difficult to increase the number of instances in
datasets for learning. Data augmentation refers to the process of making
new data instances by making minor modifications to the available data
instances without introducing fresh instances from external sources. Most
deep learning frameworks contain built-in data augmentation utilities.
By improving generalization, data augmentation raises the overall
prediction accuracy. It also helps overcome overfitting, a phenomenon in
which the network learns a function that fits the training data almost
perfectly but generalizes poorly to unseen data.

Image data augmentation


Consumer goods recommendation systems use a lot of pictures and images.
In addition, enterprises dealing with fashion, travel, apparel, real estate,
photography, and so on cannot do their business without handling pictures.
Knowledge of object recognition and image processing algorithms along
with the standard image datasets is a must for such enterprises.
Convolutional neural networks (CNNs) are state-of-the-art deep learn-
ing networks for computer vision tasks like object recognition, detection,
semantic segmentation, and so on. Some of the well-known publicly avail-
able datasets to test the cutting-edge CNN algorithms are CIFAR-10,
CIFAR-100, ImageNet, MNIST (hand-written digit recognition), SVHN
(street view house numbers), PASCAL VOC12, COCO, and so on (refer
to Appendix A). Massive as they are, researchers claim that these datasets
still do not contain sufficient instances to raise the prediction accuracy of
their algorithms. Numerous data augmentation techniques are tried and
tested on the above datasets. Some of the traditional data augmentation
techniques involve horizontal/vertical flipping, rotation, cropping, saturat-
ing, blurring, adding Gaussian noise, and so on (Figure 5.2).

Figure 5.2 Image data augmentation: (a) original image, (b) rotated,
(c) cropped, (d) horizontally flipped.

The dataset extended by augmentation increases by a factor equal to the
number of transformations applied. Since data augmentation is a random
process, there is no guarantee that the ML algorithm will produce a
proportionate increase in the prediction accuracy. Besides, care should be
taken when applying the transforms to a dataset. For example, a rotation or
a flip of an image in the ImageNet dataset will result in a new data
instance, but it cannot be done for digit images like 6 or 9 in the MNIST
hand-written digit dataset.
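The traditional transforms can be sketched without any imaging library by treating a grayscale image as a list of pixel rows. This is an illustrative sketch, not the API of any deep learning framework; the function names and the tiny 3×3 "image" are assumptions.

```python
def flip_horizontal(image):
    """Mirror each pixel row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image, transforms):
    """Return the original image plus one new instance per transform."""
    return [image] + [transform(image) for transform in transforms]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# The transform list is chosen per dataset: for digit images, flips and
# rotations would be left out because a rotated 6 reads as a 9.
extended = augment(image, [flip_horizontal, rotate_90])
print(len(extended))
```

Passing the transform list explicitly keeps the label-safety decision visible: the same `augment` call serves object photographs with flips and rotations enabled, and hand-written digits with only crops or noise.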
Mixing images together by averaging their pixel values, although quite
counterintuitive, is also considered to be an effective augmentation strategy.8
AutoAugment is an algorithm proposed by Google to learn the best augmen-
tation policies for a given dataset with the help of reinforcement learning. A
policy comprises five subpolicies, and each subpolicy comprises two image
operations applied in sequence. Each image operation in turn considers two
parameters: the probability of calling the operation, and the magnitude of the
operation. Because each operation is called only with its assigned
probability, it may not be applied in a given mini-batch. However, if
applied, it is applied with a fixed magnitude.9
In practice, the AutoAugment algorithm is computationally expensive.
Population-based augmentation (PBA) is an alternative data augmentation
technique that needs less compute. It generates nonstationary augmentation
policy schedules instead of a fixed augmentation policy.10

Numerical data augmentation


The above section described some of the statistical techniques for image
data augmentation. The techniques are fairly straightforward for image
data. Individual images are randomly cropped, rotated, sprinkled with
random noise, and color-contrasted to increase the number of available
samples. The augmentation techniques work well in image datasets because
minor perturbations at the pixel level do not change the image class.
Unfortunately, such a method does not work for numerical datasets. A tiny
perturbation in the numerical data can shift the perturbed sample into an
entirely different class. The following simple algorithm11 is effective in
producing data augmentation in numerical data:

• Copy each class in a subset a couple of times to achieve a reasonable
  representation.
• Randomly perturb features in each subset using the distribution’s
  mean and standard deviation as perturbation bounds.
• Validate that each new sample belongs to a proper class or reassign to
  a new class using Mahalanobis Distance (MD) and Gaussian Mixture
  Models. MD is a multidimensional generalization of how many standard
  deviations away a sample is from the mean of a distribution.
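The three steps above can be sketched as follows. This is a simplified illustration and not the cited algorithm itself: the Mahalanobis distance is computed under a diagonal-covariance assumption (features treated as uncorrelated), the Gaussian Mixture Model step is omitted, and the `copies` and `scale` parameters and the toy data are hypothetical.

```python
import math
import random
import statistics


def class_stats(samples):
    """Per-feature mean and standard deviation of one class."""
    columns = list(zip(*samples))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1e-9 for c in columns]  # avoid zero std
    return means, stds


def mahalanobis_diag(x, means, stds):
    """Mahalanobis distance, assuming a diagonal covariance matrix."""
    return math.sqrt(sum(((v - m) / s) ** 2
                         for v, m, s in zip(x, means, stds)))


def augment_numeric(data, copies=2, scale=0.1, seed=0):
    """data maps a class label to a list of feature vectors."""
    rng = random.Random(seed)
    stats = {label: class_stats(rows) for label, rows in data.items()}
    augmented = {label: list(rows) for label, rows in data.items()}
    for label, rows in data.items():
        _, stds = stats[label]
        for row in rows * copies:                       # step 1: copy
            new = [v + rng.uniform(-scale * s, scale * s)   # step 2: perturb
                   for v, s in zip(row, stds)]
            # Step 3: validate by assigning to the nearest class.
            nearest = min(stats,
                          key=lambda k: mahalanobis_diag(new, *stats[k]))
            augmented[nearest].append(new)
    return augmented


data = {
    "low": [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1]],
    "high": [[8.0, 9.0], [8.3, 9.2], [7.9, 8.8]],
}
result = augment_numeric(data)
print({label: len(rows) for label, rows in result.items()})
```

Bounding the perturbation by a fraction of each feature's standard deviation keeps new samples close to their source. The distance check is what guards against the class shift described above: a sample that drifts nearer to another class is stored under that class instead.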

Text data augmentation


The application of ML to Natural Language Processing (NLP) tasks, too,
requires a very large collection of text data. Although text data is
abundant and freely available in the form of web pages and other digital
repositories, preparing the data with annotations and in formats amenable
for ML is an extremely tedious and expensive task. Data augmentation is not
very common in NLP tasks, but a significant amount of research is already
in progress. Word-level text data augmentation and sentence-level text data
augmentation, described below, are the two major approaches proposed.

Word-level text data augmentation
This method, also known as lexical substitution, tries to substitute a
randomly chosen word in a sentence either with a synonym or another word
not too far in meaning. Synonyms are readily available in WordNet,12 which
is a standard digital thesaurus used in NLP tasks. Typical similarity
metrics used are k-nearest neighbors and cosine similarity. Yet another
method is to substitute words from pretrained word embeddings such as
Word2Vec13 or FastText.14 By far, the most successful method is the use of
transformer models developed by Google such as bidirectional encoder
representations from transformers (BERT).15 BERT is pretrained on a large
body of text using masked language modeling, in which the model has learned
to predict masked words based on the context. Hence, in text data
augmentation, BERT can replace a masked word with one that does not alter
the meaning of the original sentence. A further improvement on word
embedding data augmentation is the contextual augmentation method, which
offers a wider range of substitute words predicted by a bidirectional
language model matching the context.16

Sentence-level text data augmentation
Sentence shuffling in short paragraphs of text is also known to add useful
data instances to a training body of text. The Python NLTK library offers
sentence tokenization, after which the sentences can be shuffled. This is
useful in sentiment analysis studies, because the sentiment conveyed in the
original text is still maintained after the sentences are shuffled.
Paraphrasing is another useful method. BERT is quite powerful in
paraphrasing a phrase or a sentence owing to the masking method. Neural
machine translation is an upcoming technology in the field of translation.
It needs a tremendous amount of data in the form of parallel corpora of the
source and the target languages. There is not always a 1-1 correspondence
between the words in the two languages. Words in the target language which
are not present or omitted in the source language are introduced in the
source language by means of back-translation. Effectively, back-translation
serves as an instrument to augment data in a target translation language,
which may not have sufficient data for training a machine translation
model.17,18
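Both approaches can be sketched with the standard library alone. The sketch makes several assumptions: the small `SYNONYMS` dictionary is a hypothetical stand-in for WordNet or embedding neighbours, and the regular-expression splitter is a crude substitute for a proper sentence tokenizer such as the one in NLTK.

```python
import random
import re

# Hypothetical toy thesaurus; a real system would query WordNet or
# nearest neighbours in a pretrained embedding space.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}


def substitute_word(sentence, synonyms=SYNONYMS, seed=None):
    """Word level: replace one randomly chosen word that has a synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    if not candidates:
        return sentence                 # nothing to substitute
    i = rng.choice(candidates)
    words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


def split_sentences(paragraph):
    """Split after ., ! or ? followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def shuffle_sentences(paragraph, seed=None):
    """Sentence level: same sentences, new order."""
    sentences = split_sentences(paragraph)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


review = "The food was great. The service was slow. I would return."
print(substitute_word("the quick dog is happy", seed=1))
print(shuffle_sentences(review, seed=3))
```

Every sentence survives the shuffle unchanged, which is why the overall sentiment label of a short review can usually be reused for the new instance. Word substitution is riskier, since a poor synonym can change the meaning, so the quality of the thesaurus or language model matters.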

Synthetic dataset
Even the exquisite data augmentation techniques described above do not
suffice to produce the high quality and the right amount of data necessary
for accurate machine learning. Another well-known method is creating a
repository of synthetic data algorithmically. The main purpose of this
technique is to generate flexible and rich data to experiment with various
regression, classification, clustering, and deep learning algorithms.
Customer privacy is another notable reason for generating synthetic data.
For example, predicting customer behavior or preferences would need large
amounts of data to build the learning models. But because of privacy,
accessing real customer data is slow and does not provide good enough data
on account of extensive masking and redaction of information.19 In such
cases, generating synthetic data from a model fitted with real data is the
need of the hour. If the model is a good representation of the real data,
then the synthetic data will also pick up the statistical properties of
the data, which are indispensable for learning.
Table 5.4 shows some of the relevant properties a synthetic dataset should
possess.
Examples of synthetic datasets include SURREAL20 for estimating human
pose, shape, and motion from images and videos; SYNTHIA21 for semantic
segmentation of urban scenes, and those for text detection in natural images.22
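A small generator that exposes the properties listed in Table 5.4 (type, number of features, size, distribution, and noise) might look like the sketch below. The function name, parameter names, default values, and the thresholds used to derive binary and categorical values are all assumptions made for illustration.

```python
import random


def make_synthetic(n_records=100, n_features=3, kind="numerical",
                   distribution="gaussian", noise=0.1, seed=0):
    """Generate rows of synthetic data with controllable properties."""
    rng = random.Random(seed)

    def draw():
        if distribution == "gaussian":
            return rng.gauss(0.0, 1.0)
        if distribution == "uniform":
            return rng.uniform(-1.0, 1.0)
        raise ValueError(f"unknown distribution: {distribution}")

    rows = []
    for _ in range(n_records):
        # Base value from the chosen distribution plus Gaussian noise.
        values = [draw() + rng.gauss(0.0, noise) for _ in range(n_features)]
        if kind == "binary":
            values = [1 if v > 0 else 0 for v in values]
        elif kind == "categorical":
            values = ["low" if v < -0.5 else "high" if v > 0.5 else "mid"
                      for v in values]
        elif kind != "numerical":
            raise ValueError(f"unknown kind: {kind}")
        rows.append(values)
    return rows


numeric = make_synthetic(n_records=5, n_features=2)
binary = make_synthetic(n_records=50, n_features=4, kind="binary")
print(numeric[0], binary[0])
```

Fixing the seed makes a dataset reproducible, so different learning algorithms can be compared on identical synthetic data while the size, feature count, and noise level are varied one at a time.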

Table 5.4 Dataset format

Data attributes  Desired properties
Type             It can be numerical, binary, or categorical.
Features         The number of features of the dataset should be flexible.
Size             The size of the dataset (number of records) should be flexible.
Distribution     The dataset should be based on a statistical distribution, which can be varied.
Noise            A controllable amount of random noise should be injected in the dataset.

DYNAMIC LEARNING FRAMEWORK

This section describes a novel dynamic learning framework that partially
automates the ML pipeline. This unit of the framework contains the usual
four categories of ML: supervised, unsupervised, semi-supervised, and
reinforcement learning. Each ML category contains more algorithms than
those described in Chapter 4. The framework consists of several
computational units that work together seamlessly. Eventually, dynamic
learning can be implemented by DL bots that autonomously decide on the type
of ML models to be used. Bots can learn dynamically from previous decisions
and can also self-heal. DL systems can anticipate customer needs, market
changes, and even cybersecurity threats. Contextual, personalized, and
secure services can be provided by the dynamic learning framework.
Figure 5.3 outlines this dynamic learning framework.

Figure 5.3 Online dynamic learning framework (blocks: online data,
automatic collection, preprocessing, expert system engine, algorithm
selector switchboard, prediction, decision making, user interface).

The frontend of the ML framework contains the following subunits:

Online data repository


The customer data or any other data that is of interest to an enterprise is
stored in an online repository or cloud. This data, being dynamic in nature,
is constantly changing. The concerned enterprise may or may not be directly
in control of the data sources responsible for stacking and managing the
data in the online repository.

Automatic collection
The automatic data collection unit constantly monitors the ever-changing
data in the online repository. Whenever new chunks of data become
available, it automatically downloads them and pushes them to the
preprocessing module. The automatic collection module is also responsible
for ensuring that there is no data redundancy.
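One simple way to avoid redundancy is to fingerprint each incoming chunk and forward only unseen ones. The chapter does not say how the collection unit does this; the content-hash approach, the `Collector` class, and the dictionary-shaped chunks below are assumptions made for the sketch.

```python
import hashlib
import json


class Collector:
    """Forward only those data chunks that have not been seen before."""

    def __init__(self):
        self.seen = set()   # fingerprints of chunks already collected

    def collect(self, chunks):
        fresh = []
        for chunk in chunks:
            # Serialize with sorted keys so equal chunks hash identically.
            payload = json.dumps(chunk, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            if digest not in self.seen:
                self.seen.add(digest)
                fresh.append(chunk)
        return fresh


collector = Collector()
first = collector.collect([{"id": 1}, {"id": 2}, {"id": 1}])
second = collector.collect([{"id": 2}, {"id": 3}])
print(first, second)
```

Only the new chunks would be pushed on to the preprocessing module. Storing fingerprints instead of the chunks themselves keeps the memory cost of the check small even when the repository is large.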

Preprocessing
This unit routinely performs preprocessing of the type seen in Chapter 4.
It filters noise from the messy data, rectifies corrupted data, and imputes
values in case of missing data. It sorts out the complex nonlinear
relationships hidden in the data and performs a form of feature selection
based on the expert rules provided in the unit. The data preprocessing
module is semiautomatic. It preprocesses most of the online incoming data,
but needs the assistance of the user for certain complex and unforeseeable
tasks.

Expert system engine


An expert system is an automated AI reasoning system based on the knowledge
of domain experts. It has a large knowledge base of domain knowledge and
mimics the inference mechanism of experts in drawing conclusions from the
data supplied. It also aims at giving advice to end users. The expert
system engine subunit of the dynamic ML framework decides which machine
learning category an incoming dataset should be allotted to by interacting
with the user through data-specific queries. The three components in the
expert system engine are knowledge acquisition, knowledge representation,
and inference (Figure 5.4). These are described in the following
subsections.

Knowledge acquisition
This is the very first step in building an expert system. Knowledge
predominantly resides with the domain experts. In the ML context, data
scientists are the domain experts who have detailed knowledge of data and
the associated ML methods and algorithms. In particular, they know which ML
category or algorithm a particular dataset should be directed to. The body
of knowledge is systematically collected from data scientists and stored in
a convenient format in a large knowledge repository called the knowledge
base.

Figure 5.4 Constituent elements of expert system engine (knowledge
acquisition, knowledge representation as IF…THEN rules, and inference by
forward chaining or backward chaining).

Knowledge representation
The knowledge acquired from the expert may be in the form of unstructured
text, tables, diagrams, and annotations. This knowledge is then represented
in a well-defined format to remove ambiguities and make the coding work
easier. Semantic networks, frames, and ontologies are some of the well-known
knowledge representation formats in the knowledge engineering domain. The
IF-THEN rules are then framed from the knowledge formats. For instance,
there will be rules like IF (data contains input–output pairs) THEN (supervised
learning), IF (data does not contain input–output pairs) THEN (unsupervised
learning), etc. The framework contains a user interface at the front end,
through which the expert system engine queries the user about the
characteristics and type of the online incoming data.

Inference
Inference is the process of drawing conclusions by linking data precepts.
The main inference mechanisms found in modern expert systems are
forward-chaining, backward-chaining, and hybrid. In the forward-chaining
or data-driven inference mechanism, if a piece of data matches the “IF” part
of any “IF-THEN” production rule, the rule fires and produces another rel-
evant piece of data. The newly produced intermediate conclusions of all the
rules that have fired are kept in the working memory of the expert system.
The inference engine then searches for rules in the ruleset that match the
new contents available in the working memory. These rules are then fired,
generating more intermediate conclusions which are used to invoke appli-
cable rules in the next round. The fetch-match-fire cycle continues till the
inference engine exhausts all the fire-able rules that exist on the agenda. In
the backward chaining or goal-driven inference mechanism, the “THEN”
part of the “IF-THEN” rule is provisionally assumed to be true, and the
inference engine looks for data that will satisfy this subgoal. It then seeks
to link all the subgoals to yield the final goal. The hybrid inference is a
combination of the forward and backward chaining. Given a dataset and
a ruleset, the expert system reaches the same conclusion irrespective of the
inference chaining. The only difference is the speed of convergence, which
is problem dependent.
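The fetch-match-fire cycle of forward chaining can be sketched in a few lines. The rule set below is a hypothetical illustration loosely modelled on Table 5.2; the fact names such as `has_inputs` and `labels_for_few_rows` are invented for the example and are not the framework's actual knowledge base.

```python
def forward_chain(facts, rules):
    """Fire IF-THEN rules until no new conclusions appear.

    Each rule is a pair (set_of_conditions, conclusion).
    """
    memory = set(facts)                 # the working memory
    fired = True
    while fired:
        fired = False
        for conditions, conclusion in rules:
            # A rule fires when all its IF parts are in working memory.
            if conditions <= memory and conclusion not in memory:
                memory.add(conclusion)  # intermediate or final conclusion
                fired = True
    return memory


# Hypothetical rules relating data characteristics to an ML category.
RULES = [
    ({"has_inputs", "has_outputs"}, "labeled"),
    ({"labeled", "labels_for_all_rows"}, "supervised learning"),
    ({"labeled", "labels_for_few_rows"}, "semi-supervised learning"),
    ({"has_inputs", "no_outputs"}, "unsupervised learning"),
]

answers = {"has_inputs", "has_outputs", "labels_for_few_rows"}
conclusions = forward_chain(answers, RULES)
print(sorted(conclusions - answers))
```

The first rule produces the intermediate conclusion `labeled`, which in the next pass lets the semi-supervised rule fire. The loop ends once a full pass adds nothing to working memory, that is, when no fire-able rules remain.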
In a medical diagnosis session, for example, the expert system initiates
the session by asking the patient a couple of questions related to the
patient’s symptoms through the user interface. For instance, a backward
chaining inference expert system, which can be quicker than a forward
chaining inference system for such goal-directed problems, will hypothesize
that the patient has a particular sickness based on the initial list of
symptoms the patient has supplied and then proceed to get further
information (data) from the patient to prove the hypothesis. A similar
strategy is employed in the inference mechanism of the dynamic ML expert
system engine. It will first query the user about the specifications of the
data as shown in Table 5.2 and accordingly decide on the category and mode
of learning.

ML MODES IN DYNAMIC LEARNING

Shallow learning
The term “shallow learning” is a misnomer because the performance of the ML
algorithms pertaining to this class is far from being shallow.
“Shallowness” is not a salient feature of the “shallow learning” ML
algorithms. The term is collectively and rather loosely used to refer to
all forms of ML before the advent of “deep learning” (Figure 5.5).
The common feature of the algorithms which are included in “shallow
learning” is that they rely on handcrafted features based upon heuristics of
the target problem. The domain experts and data scientists define the fea-
tures of the problem space, and subsequently, data is collected in a format
dictated by the features. Data is preprocessed and cleaned, and then the
features are further refined. Finally, the cleaned data is fed into algorithms
like linear regression, SVM, random forests, and neural networks (NNs)
to arrive at prediction and decision-making thereof. Viewed this way, it is
closer to the truth to call shallow learning feature engineering.
Figure 5.5 AI, ML, NN, and DNN.

Figure 5.6 Rosenblatt’s perceptron (inputs x1, ..., xn; weights w1, ..., wn;
summation Σ; activation; output).

NN is one of the most common ML techniques found in the AI research
community. Historically, neural networks were proposed in the 1940s as
perceptrons (Figure 5.6).23 They were found to be capable of solving several
problems of interest at the time. However, research into their potential
applications soon began to decline when some researchers criticized the
single-layer NN as not being able to perform the XOR operation. The problem
was solved years later when a middle (hidden) layer was introduced in the
NN. It is precisely the number of layers in an NN that distinguishes deep
learning from shallow learning.
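The XOR limitation and its resolution can be illustrated with a minimal sketch. The weights below are hand-picked for illustration (they are assumptions, not taken from the cited literature), and the brute-force search only covers a coarse grid of weights; it is meant to show the idea, not to prove it.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0

def perceptron(x1, x2, w1, w2, bias):
    """Single neuron: weighted summation followed by a step activation."""
    return step(w1 * x1 + w2 * x2 + bias)

def xor_with_hidden_layer(x1, x2):
    """Two hidden neurons (OR and NAND) feeding one output neuron (AND)."""
    h_or = perceptron(x1, x2, 1.0, 1.0, -0.5)
    h_nand = perceptron(x1, x2, -1.0, -1.0, 1.5)
    return perceptron(h_or, h_nand, 1.0, 1.0, -1.5)

XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def single_layer_can_fit_xor():
    """Try every weight combination on a coarse grid for one perceptron."""
    grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
    for w1 in grid:
        for w2 in grid:
            for bias in grid:
                if all(perceptron(a, b, w1, w2, bias) == y
                       for (a, b), y in XOR_TABLE.items()):
                    return True
    return False
```

No single perceptron on the grid reproduces the XOR table, whereas the network with one hidden layer reproduces it for all four input pairs.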
Corresponding to each category and mode of learning, there are specific
algorithms that operate on the incoming data. The shallow learning mode,
for example, contains all the algorithms described in Chapter 4 and a few
more relevant ones that are not covered in this book.
The algorithm selector switchboard (Figure 5.7) in the framework helps the
user to select an appropriate ML algorithm for an ML training session. The
system will use an algorithm indicated by the user. In the ensemble mode, it
will use an ensemble of algorithms and report the best performance. When
operating in the default setting, the system will choose an algorithm at ran-
dom from multiple algorithms to perform learning.
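The three switchboard settings can be sketched as follows. This is only an illustrative sketch: the function name `select_algorithm`, the algorithm names, and the validation scores are all hypothetical and are not part of the framework described here.

```python
import random

def select_algorithm(names, mode="default", choice=None, evaluate=None, rng=None):
    """Pick an algorithm name according to the switchboard setting.

    mode="choose":   use the algorithm indicated by the user (choice).
    mode="ensemble": evaluate every algorithm and keep the best performer.
    mode="default":  pick one algorithm at random.
    """
    if mode == "choose":
        if choice not in names:
            raise ValueError(f"unknown algorithm: {choice!r}")
        return choice
    if mode == "ensemble":
        if evaluate is None:
            raise ValueError("ensemble mode needs an evaluate(name) function")
        return max(names, key=evaluate)
    if mode == "default":
        return (rng or random).choice(names)
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical suite and made-up validation scores, for illustration only.
ALGORITHMS = ["linear regression", "SVM", "random forest", "neural network"]
VALIDATION_SCORE = {
    "linear regression": 0.71,
    "SVM": 0.78,
    "random forest": 0.83,
    "neural network": 0.80,
}

chosen = select_algorithm(ALGORITHMS, mode="choose", choice="SVM")
best = select_algorithm(ALGORITHMS, mode="ensemble", evaluate=VALIDATION_SCORE.get)
randomly_picked = select_algorithm(ALGORITHMS, mode="default", rng=random.Random(0))
```

In a real system, `evaluate` would train each candidate and return its validation performance; here a lookup table stands in for that step.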

Deep learning
Deep learning methods have gained popularity because they often outperform
conventional (i.e., shallow) ML methods and can extract features
automatically from raw data with little or no preprocessing.24,25,26 The
structure of a deep neural network is very complex, containing layers on
the order of tens or hundreds. The VGGNet,27 which is often used as a
pretrained artifact for image recognition, has 16 weight layers in its
VGG-16 variant (13 convolutional and 3 fully connected). As opposed to the
shallow neural networks that do a mere weighted summation of the inputs at
each of the neurons, deep CNNs perform complex operations like weight
sharing, convolution, and pooling during each forward pass.
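A minimal sketch of the convolution and pooling operations mentioned above is given below, in plain Python for readability. The 5 x 5 image and the 2 x 2 edge-detecting kernel are made-up illustrations; deep learning libraries implement the same operations far more efficiently.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (implemented as cross-correlation, as is common
    in deep learning libraries). The same kernel weights are reused at every
    position of the image, which is what weight sharing means."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]

# A vertical bright stripe and a kernel that responds to left-to-right
# increases in intensity (both are illustrative assumptions).
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, vertical_edge)  # each row is [0, 2, 0, -2]
pooled = max_pool2d(feature_map)            # [[2, 0], [2, 0]]
```

The feature map responds strongly where the stripe begins, and pooling keeps that response while halving the resolution.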
Figure 5.7 Algorithm selector switchboard (settings: default, choose,
ensemble).

Figure 5.8 Shallow neural network (input: features).

The essential difference between shallow learning (feature engineering)
and deep learning (feature learning) is illustrated in Figures 5.8 and 5.9.
In shallow learning, the features are handcrafted and fed into an NN, which
has only one hidden layer (and is, therefore, “shallow”). In contrast, the
deep network has a large number of hidden layers, making the structure
“deep.” Handcrafted features are not fed into the deep network. Instead,
the hidden layers successively learn the features (Figure 5.9). Deep
learning, therefore, is also referred to as feature learning.
Classification problems involving text and images are difficult to handle
with shallow learning. Deep convolutional neural networks and deep recurrent
neural networks are better suited for these learning tasks. In the deep
learning mode of our framework (Figure 5.3), there is a set of deep learning
models DLM1, DLM2, ..., DLMn, from which the most suitable one is
selected by the framework to engage in the learning task.
Figure 5.9 Deep neural network (inputs).

Machine learning, in general, has made remarkable progress in recent
years, but the most outstanding success has been achieved in the area of
DL. Deep learning systems now enable previously impossible smart
applications, revolutionizing image recognition and natural language
processing, and identifying complex patterns in data.28 Deep learning has
shown great promise for tackling many tasks such as image processing,29
natural language processing,30 speech recognition,31 superhuman game
playing,32 and autonomous driving.33,34

Transfer learning
In this mode of the framework, there are several pretrained models, each
with a specific kind of input–output data. The system chooses among them
depending on the new incoming data.
ML comprises two distinct phases: training phase and testing phase. One
of the fundamental assumptions of ML is that the training and the testing
datasets come from identical statistical distributions. To a large extent this
is justified because in most of the ML cases, only one original dataset is
dealt with. This original dataset is randomly divided into training dataset
and testing dataset. The ML experiment is repeated to yield a statistically
sound k-fold cross-validation, which on average leads to a minimization
of the loss function in training and minimization of the prediction error in
testing. However, the power of ML is demonstrated in the real world when it
is trained on a known (historical) dataset and then used to make predictions
on unknown datasets – future datasets that were not available at the time of
training. This is where most of the ML models get into trouble. The future
test datasets may not share the same statistical distribution as the parent
training dataset. The performance will degrade. The most natural thing to
do when faced with such a situation is to rebuild another ML model from
scratch to fit the new data and start the training-validation-testing phases all
over again. This is where transfer learning comes to help.
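The degradation described above can be reproduced with a small sketch. The setup is an assumption chosen for illustration: the true relationship is y = x squared, a straight line is fitted on training inputs between 0 and 1, and the model is then tested both on inputs from the same range and on inputs from a shifted range between 2 and 3.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def target(x):
    """Hypothetical ground truth used to generate the data."""
    return x * x

train_x = [i / 10 for i in range(0, 11, 2)]             # 0.0, 0.2, ..., 1.0
test_same_x = [i / 10 for i in range(1, 10, 2)]         # 0.1, 0.3, ..., 0.9
test_shifted_x = [2 + i / 10 for i in range(0, 11, 2)]  # 2.0, 2.2, ..., 3.0

model = fit_line(train_x, [target(x) for x in train_x])
error_same = mean_squared_error(model, test_same_x, [target(x) for x in test_same_x])
error_shifted = mean_squared_error(
    model, test_shifted_x, [target(x) for x in test_shifted_x]
)
```

The error on the shifted test set is several orders of magnitude larger than on the test set drawn from the training range, even though the model itself has not changed.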
According to Goodfellow,35 transfer learning is defined as a “situation
where what has been learned in one setting is exploited to improve
generalization in another setting.” Inductive transfer, learning to learn,
and knowledge consolidation are some other terms often used for transfer
learning.
Transfer learning is motivated by the fact that humans easily apply knowl-
edge learned previously to solve new problems. When toddlers have learnt
to distinguish between apples and oranges, for instance, they can transfer
their knowledge to distinguish between bananas and cucumbers; if they can
distinguish a cat from a dog, they can distinguish poodles from Pomeranians,
although they have never had a lesson in identifying poodles and Pomeranians.
The same is true of everyday human skills. Having learned to ride a
bicycle, a person intuitively transfers basic skills like balance and
maneuvering to riding a motorbike. A driver used to driving on the left
side of the road with right-hand steering adapts easily to driving on the
right side of the road with left-hand steering. Similarly, for an agent
trained to drive a vehicle on a 2D surface, transfer learning extends the
agent’s capabilities to navigating in 3D space. Transfer learning is used
to further train computer programs that have learnt autonomous driving of
land vehicles to pilot drones. The programs tacitly use the knowledge they
have picked up in dodging obstacles while driving on land to avoid
obstacles in mid-air. Thus, transfer learning is the ability to transfer
knowledge across tasks. The more related the tasks, the easier it is to
transfer knowledge across them.
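One simple form of transfer, reusing the parameters learned on a source task as the starting point for a related target task, can be sketched as follows. The two linear tasks, the learning rate, and the step counts are made-up assumptions; practical transfer learning usually reuses the layers of a pretrained deep network rather than two scalar parameters.

```python
def mse_loss(w, b, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(w, b, data, steps, lr=0.05):
    """Plain gradient descent on the mean squared error."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
source_task = [(x, 2.0 * x + 1.0) for x in xs]  # task the model was trained on
target_task = [(x, 2.0 * x + 1.5) for x in xs]  # related new task

# Long training on the source task yields the "pretrained" parameters.
pretrained = train(0.0, 0.0, source_task, steps=500)

# Only a small training budget is available on the target task.
from_scratch = train(0.0, 0.0, target_task, steps=5)
fine_tuned = train(*pretrained, target_task, steps=5)

loss_scratch = mse_loss(*from_scratch, target_task)
loss_fine_tuned = mse_loss(*fine_tuned, target_task)
```

With the same five training steps, the model started from the pretrained parameters ends with a lower loss on the target task than the model trained from scratch, because the two tasks are closely related.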

ML AUTOMATION AND OPTIMIZATION

The dynamic ML framework proposed in this chapter has several benefits: It
is simpler to use for nonexperts, more reliable, and faster to deploy, and
it yields better performance than hand-designed models. There are three
major issues concerning the functionality of the framework. (a) How will
the system select the ML category, the row in the framework? (b) How will
the system select the ML mode, the column in the framework? (c) How will
the system deliver an efficient performance?
The answer to the first two questions is automation. The answer to the
third question is optimization.
In selecting the ML category (rows in the framework), the system is auto-
mated to switch among the following three strategies:

1. User selection using the switchboard (Figure 5.7)
A user experienced in data science or machine learning can select
any row or column, that is, any ML category or mode of learning
provided in the framework, to perform the learning task. If the user’s
choice is appropriate to the learning task with the data provided at the
input, the system will proceed smoothly. However, it will stop with an
error code if the selected category or mode of learning does not match
the input data.
2. Data-driven strategy
The system will gauge the scale of the data and select an appropriate
algorithm in the given category. For simple and small datasets, it will
select simpler algorithms; for larger and more complex datasets, it will
select more complex algorithms from the suite provided in the framework.
3. Ensemble learning strategy
For a given learning problem, the framework will try multiple
algorithms and select the one with the best performance. The disadvantage
is computational time. However, this cost is mitigated because the
dynamic ML framework operates online with continuous self-monitoring.
Besides, modern enterprises are equipped with massive and rapid
computing resources spread across the cloud. Graphics processing unit
(GPU) computing, which can offer orders-of-magnitude performance
increases over conventional CPU computing, is an additional advantage.
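The data-driven strategy (strategy 2 above) can be sketched as a simple lookup on the scale of the data. The size thresholds and the three-entry suite below are hypothetical placeholders chosen for illustration; the chapter does not specify how the framework measures scale or where it draws the boundaries.

```python
# Hypothetical suite, ordered from the simplest to the most complex algorithm.
# Each entry is (largest dataset size handled, algorithm name), where size is
# counted as rows x features. The thresholds are made up.
SUITE = [
    (10_000, "linear regression"),
    (1_000_000, "random forest"),
    (float("inf"), "deep neural network"),
]

def select_by_scale(n_rows, n_features, suite=SUITE):
    """Return the simplest algorithm whose size limit covers the dataset."""
    cells = n_rows * n_features
    for max_cells, name in suite:
        if cells <= max_cells:
            return name
    raise ValueError("suite has no algorithm for this data scale")

small = select_by_scale(100, 5)            # 500 cells
medium = select_by_scale(50_000, 10)       # 500,000 cells
large = select_by_scale(1_000_000, 100)    # 100,000,000 cells
```

A fuller version would also account for data complexity, not just size, as the description of the strategy suggests.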

ML is an optimization problem. Most ML algorithms sift through heaps
of data to learn a mapping function from the inputs to the output or learn
to cluster the data based on some implicit function optimization. NN has
become the de facto standard of ML. As a matter of fact, the concept of deep
learning has emerged directly from the application of NN for ML tasks. The
NN used for shallow learning has a single hidden layer sandwiched between
the input and output layers. As more and more hidden layers are stacked to
handle complex problems and improve the fine granularity of learning, the
NN structure becomes deeper and deeper, culminating in the current
paradigm of deep learning.
NN has three classes of parameters: the network topology (number of
layers and number of neurons in each layer), the connection weights, and
hyperparameters. The NN connection weights are trained, validated, and
tested by using the training, validation, and testing data subsets,
respectively. The
randomly initialized connection weights are gradually optimized by the
gradient-based backpropagation algorithm through the epochs of training. It
is evident that in the current ML procedures, only the connection weights of
the NN model are optimized. NN topology and hyperparameters are not sys-
tematically optimized. Although there is a popular dropout technique to alter
the topology of the network while training, data scientists manually adjust the
number of layers along with the number of neurons in each layer and tweak
the values of the hyperparameters with the hope of improving the performance
of the ML algorithm. The trial-and-error method is far from being efficient.
A far more efficient way that combines ML with evolutionary algorithms
is neuro-evolution, described in the following subsection. This strategy
simultaneously optimizes the topology of the network and the hyperpa-
rameters while optimizing the connection weights.
Neuro-evolution

Optimization problem formulation

An optimization problem is defined as:

Find $X$ such that $f(X)$ is minimum/maximum    (5.1)

subject to the constraints:

$g(X) = 0$    (5.2)

$h(X) < 0$    (5.3)

where $X$ is a vector of decision variables, usually bounded:

$X_{\min} \le X \le X_{\max}$    (5.4)

$f(X)$ is the objective function, $g(X)$ and $h(X)$ are the constraints,
and $X_{\min}$ and $X_{\max}$ are the bounds on the decision variables in
vector $X$.
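As a concrete instance of formulation (5.1) to (5.4), the sketch below defines a made-up two-variable problem and checks whether a candidate solution is feasible. The particular objective, constraints, and bounds are illustrative assumptions only.

```python
def f(x):
    """Objective function to be minimized, as in (5.1)."""
    return x[0] ** 2 + x[1] ** 2

def g(x):
    """Equality constraint g(X) = 0, as in (5.2): the variables sum to 1."""
    return x[0] + x[1] - 1.0

def h(x):
    """Inequality constraint h(X) < 0, as in (5.3): requires x[0] > 0.2."""
    return 0.2 - x[0]

X_MIN = (0.0, 0.0)  # lower bounds, as in (5.4)
X_MAX = (1.0, 1.0)  # upper bounds, as in (5.4)

def is_feasible(x, tol=1e-9):
    """A candidate is feasible when it respects the bounds and both constraints."""
    in_bounds = all(lo <= xi <= hi for xi, lo, hi in zip(x, X_MIN, X_MAX))
    return in_bounds and abs(g(x)) <= tol and h(x) < 0

candidate = (0.5, 0.5)
```

Here `candidate` satisfies every constraint and has an objective value of 0.5, whereas a point such as (0.1, 0.9) violates the inequality constraint and would have to be rejected or repaired.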

Genetic algorithm

Figure 5.10 Genetic algorithm flowchart (Begin GA → Randomly generate
population of N parents → Evaluate fitness of each parent → Select the
better-fit parents → Crossover → Mutation → Replace the old population →
End of iterations? No: continue iterating; Yes: End GA).

the objective function. These parents are then allowed to evolve through a
series of evolutionary cycles consisting of various genetic operators. Over
the course of evolution, the selection of fitter parents in the population and
the eventual crossover and mutations in the successive generation of off-
spring gradually result in better fit or optimal individuals.
The GA iterates through the following cycles:

Step 1 – Random generation of population
A population of N individual solutions is generated randomly.
Care must be taken to ensure that the individuals do not violate
the constraints imposed on the input variables. The individual
solutions that make up the population are called parents or
chromosomes. The bit representation of these solutions resembles a
biological chromosome composed of a long strand of bases, as shown
in Figure 5.11.
Each individual solution is a vector X containing the values of
the variables. The neuro-evolution problem deals with the
simultaneous evolution of the neural network topology, the connection
weights, and the hyperparameters. In the binary bit string
representation, each parent or chromosome (solution) is a band
containing three 8-bit units for topology, weights, and
hyperparameters (Figure 5.12).
Figure 5.11 Biological chromosome.

            topology   weights    hyperparameters
Parent 1    11100011   11001110   10100101
Parent 2    11010100   10101010   10101101

Figure 5.12 Pair of GA chromosomes encoded as bit strings.

Step 2 – Fitness evaluation
The fitness function f(X) is evaluated. In most cases, the
evaluation of f(X) is a direct computation. However, in practical
application areas, the evaluation of f(X) may involve a
time-consuming elaborate simulation. The fitness function for
maximization problems is directly proportional to the value of the
objective function. For minimization problems, it is customary to
consider the reciprocal of the objective function as the fitness. In
ML, a quantity derived from the loss function or the reward function
may be used as the GA fitness.
Step 3 – Selection
The greater the value of the fitness function, the fitter the
individual. The better-fit parents in the population are selected for
crossover (also called recombination). There are two main types of
selection schemes: tournament and roulette wheel. In the tournament
selection, a pair of parents is selected at random from the
population. Their fitness is compared and the fitter of the two is
selected for “reproduction.” In case of a tie in fitness values, the
selection is performed randomly. The selection procedure is repeated
till the number of the selected parents equals the population size.
In the roulette wheel selection scheme, the chromosomes are treated
as if they are placed on a roulette wheel according to their fitness.
The chromosomes that occupy the greater area on the wheel have
a greater chance of being selected. For each selection, the roulette
wheel is rotated as in a casino, and the chromosome at which the
pointer points when the roulette wheel stops is selected.
Step 4 – Crossover
In the natural world, crossover mixes the genetic material in the
offspring of the species and increases its chances of survival. The
following three kinds of crossover operators are common in GA.
1. One-point crossover: A single crossover point on both the parents’
strings is randomly selected. The part of the chromosome after
the crossover point is swapped between the two parent organisms
(Figure 5.13a).
2. Two-point crossover: Two distinct points are selected on the
parent chromosome strings. The part of the chromosome between the
two crossover points is swapped between the parent organisms
(Figure 5.13b).
3. Uniform crossover: Each corresponding bit between the parent
chromosomes is swapped with a probability p, commonly 0.5
(Figure 5.13c).
Step 5 – Mutation
Every bit in every individual is flipped (0 to 1 or 1 to 0) with a
very small mutation probability. Mutation makes the search wider and
helps to prevent premature convergence of the population.
Step 6 – Inserting the new offspring in the older generation
Since the initial solutions are randomly generated and evolved
through the GA operators, there is no guarantee that the final solu-
tions will always remain within the bounds imposed by the prob-
lem constraints. Any infeasible solutions are “repaired” and then
inserted back to the old population, and the above steps of the GA
cycle are iterated.
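The selection, crossover, and mutation operators of Steps 3 to 5 can be sketched on bit strings as follows. The function names and the explicit crossover points are choices made for this illustration; the two parents are the chromosomes of Figure 5.12.

```python
import random

def tournament_select(population, fitness, rng):
    """Pick two parents at random and keep the fitter; break ties randomly."""
    a, b = rng.sample(population, 2)
    if fitness(a) == fitness(b):
        return rng.choice([a, b])
    return a if fitness(a) > fitness(b) else b

def roulette_select(population, fitness, rng):
    """Pick one parent with probability proportional to its fitness."""
    weights = [fitness(c) for c in population]
    return rng.choices(population, weights=weights, k=1)[0]

def one_point_crossover(p1, p2, point):
    """Swap everything after the crossover point."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2, a, b):
    """Swap the segment between the two crossover points a and b."""
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def uniform_crossover(p1, p2, p_swap, rng):
    """Swap each pair of corresponding bits with probability p_swap."""
    c1, c2 = [], []
    for b1, b2 in zip(p1, p2):
        if rng.random() < p_swap:
            b1, b2 = b2, b1
        c1.append(b1)
        c2.append(b2)
    return "".join(c1), "".join(c2)

def mutate(chromosome, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_mut else bit
        for bit in chromosome
    )

# The two parents of Figure 5.12: topology, weights, and hyperparameter units.
parent1 = "11100011" + "11001110" + "10100101"
parent2 = "11010100" + "10101010" + "10101101"

child1, child2 = one_point_crossover(parent1, parent2, point=10)
```

With the crossover point placed after the tenth bit, `child1` and `child2` reproduce the offspring shown for the one-point crossover in Figure 5.13a.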

     Parents                             Offspring

(a)  11100011 11001110 10100101    →    11100011 11101010 10101101
     11010100 10101010 10101101    →    11010100 10001110 10100101

(b)  11100011 11001110 10100101    →    11100100 10101010 10101101
     11010100 10101010 10101101    →    11010011 11001110 10100101

(c)  11100011 11001110 10100101    →    11000011 10101110 10101101
     11010100 10101010 10101101    →    11110100 11001010 10100101

Figure 5.13 (a) One-point crossover (b) Two-point crossover (c) Uniform crossover.
The dynamic ML framework handles all the above steps automatically.
Once the learning category and mode are selected, the system automatically
sets the GA in motion to optimize the learning model performance.
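The complete GA cycle can be sketched as a short loop. This is a toy sketch under stated assumptions: the fitness simply counts the 1 bits in the 24-bit chromosome and stands in for the performance of a decoded neural network, and the population size, number of generations, and mutation probability are arbitrary values. In neuro-evolution proper, evaluating the fitness would involve building and training the network encoded by each chromosome.

```python
import random

BITS = 24  # three 8-bit units: topology, weights, hyperparameters

def fitness(chromosome):
    """Toy stand-in for model performance: the number of 1 bits."""
    return chromosome.count("1")

def tournament(population, rng):
    """Tournament selection between two randomly drawn parents."""
    a, b = rng.sample(population, 2)
    if fitness(a) == fitness(b):
        return rng.choice([a, b])
    return max(a, b, key=fitness)

def crossover(p1, p2, rng):
    """One-point crossover at a random interior point."""
    point = rng.randrange(1, BITS)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chromosome, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_mut else bit
        for bit in chromosome
    )

def run_ga(pop_size=20, generations=30, p_mut=0.01, seed=0):
    """Run the GA cycle; pop_size is assumed to be even."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice("01") for _ in range(BITS)) for _ in range(pop_size)
    ]
    best = max(population, key=fitness)
    initial_best = fitness(best)
    for _ in range(generations):
        parents = [tournament(population, rng) for _ in range(pop_size)]
        offspring = []
        for i in range(0, pop_size, 2):
            c1, c2 = crossover(parents[i], parents[i + 1], rng)
            offspring += [mutate(c1, p_mut, rng), mutate(c2, p_mut, rng)]
        population = offspring  # replace the old population
        best = max(population + [best], key=fitness)  # remember the best so far
    return best, initial_best

best_chromosome, initial_best_fitness = run_ga()
```

Because the bit strings here are unconstrained, the repair of infeasible solutions described in Step 6 is not needed in this toy version.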

Recommendation systems
Deep learning techniques may be used to develop a realistic recommenda-
tion system for online shopping customers.
Almost all businesses have gone online and keep a record of the products
purchased by customers along with the customer preferences and their per-
sonal data. Further, to enhance their sales, online businesses are routinely
engaged in maintaining recommendation systems. These systems, which
monitor customer preferences along with their data, quickly run an ML
algorithm over the collected data and come up with recommendations of new
products which may interest the customers. The recommendations are not
totally unrelated random suggestions, but are based on the customer pur-
chase history and their preferences. In a nutshell, recommendation systems
are ML systems that learn to predict the rating or preference an online
customer would give to an item or product in the market.
When a customer enters a physical store (as opposed to an online store),
she can see all the items displayed in the store. Since the collection of
items is not too large, she can easily select a handful of items to focus
on at a time, and then make her purchase choice. However, this kind of
stress-free traditional buying suddenly turns into a confusing and
stressful situation when the customer takes her shopping spree to online
stores. First of all, she has to search for relevant items using a
web-based search engine; she is immediately bombarded with a vast
collection of items, and it becomes extremely difficult to make the buying
choice. Recommendation systems come to her rescue by suggesting a relevant
list of items, not too large, so that she can concentrate on making her
selection. Recommender systems are information filtering systems that
offer a win-win situation to the buyer and the seller. They help the buyer
to select a handful of items at a time to make her buying choice and,
simultaneously, help the seller to sell additional products.
Types of recommendation systems
Below, different kinds of recommendation systems deployed in practice
are explained.

Popularity-based method
This is the simplest recommendation system conceivable. It is based on the
popularity of the item being sold. The popularity of an item is simply the
number of units of that item sold so far. Items in the online store are
given a popularity index directly based on this count, and then ranked
according to the popularity index. The most popular items are placed on
top of the list recommended to the user. The major drawback of the
popularity-based method is that it does not offer any personalization to
the customer.
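The popularity-based method amounts to counting and sorting, as the following sketch shows. The sales records are made up for illustration.

```python
from collections import Counter

def popularity_ranking(sales, top_n=3):
    """Rank items by the total number of units sold.

    sales is an iterable of (item, quantity) records.
    """
    units_sold = Counter()
    for item, quantity in sales:
        units_sold[item] += quantity
    # Most units first; ties are broken alphabetically so the order is stable.
    ranked = sorted(units_sold.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked][:top_n]

# Made-up sales records.
sales = [("kettle", 3), ("mug", 5), ("kettle", 4), ("lamp", 1), ("mug", 1)]
recommended = popularity_ranking(sales)  # ["kettle", "mug", "lamp"]
```

Every customer receives the same list, which is exactly the lack of personalization noted above.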

Collaborative filtering
It has become a common practice for online marketing enterprises to get the
customers to rate the products in a casual and noninvasive way when they
are busy inspecting the items on display in the online store before making a
purchase. The ratings dataset is converted into a matrix in which customers
occupy the rows and items occupy the columns. The cell corresponding to a
customer and an item is the rating given by the customer to that particular
item, on a scale of 1 to 5 in this case (Figure 5.14). The interaction
matrices are usually very large but sparse.
Collaborative methods work by employing appropriate machine learning
algorithms that try to learn a function that predicts each user’s
preference for items. This is based on a similarity measure that computes
the similarity of users with respect to the items.
The k-nearest neighbors algorithm often used in recommendation systems
goes through the following steps:

• Compute the similarity of users using the cosine or Pearson similarity
measure
• Determine the k users closest in similarity to the user u in question
• Recommend items preferred by those k users, but not yet purchased by
user u.
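The three steps above can be sketched for a small, made-up ratings dictionary. The sketch uses cosine similarity only (treating unrated items as zeros), and the rule that an item counts as "preferred" when rated 4 or higher is an assumption made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse rating vectors (unrated items count as 0)."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(ratings, user, k=2, min_rating=4):
    """Recommend items liked by the k most similar users but unseen by user."""
    # Step 1: similarity of every other user to the user in question.
    scored = [
        (cosine_similarity(ratings[user], ratings[other]), other)
        for other in ratings if other != user
    ]
    # Step 2: the k nearest neighbors.
    neighbors = [other for _, other in sorted(scored, reverse=True)[:k]]
    # Step 3: items the neighbors rated highly that the user has not rated.
    seen = set(ratings[user])
    liked = {}
    for other in neighbors:
        for item, rating in ratings[other].items():
            if item not in seen and rating >= min_rating:
                liked[item] = max(liked.get(item, 0), rating)
    return sorted(liked, key=lambda item: (-liked[item], item))

# Made-up ratings on a scale of 1 to 5; missing items are unrated.
ratings = {
    "u": {"item1": 5, "item2": 4},
    "a": {"item1": 5, "item2": 5, "item3": 5},
    "b": {"item1": 4, "item2": 4, "item4": 4},
    "c": {"item1": 1, "item5": 5},
}
suggestions = recommend(ratings, "u")  # ["item3", "item4"]
```

User c has tastes unlike those of user u, so item5 is not recommended even though c rated it highly.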

Figure 5.14 Users-items interaction matrix (rows: user1 to user5; columns:
item1 to item5; each filled cell holds a rating from 1 to 5, and cells for
unrated items are empty).


Deep learning for recommendation systems

Recently, the freely available Web video and music channel YouTube has
become immensely popular among its users. The Web app displays video
clips ready for playing on demand. The keyword search is also appealing
and effective. One of the most outstanding features of YouTube is the
recommendation mechanism that comes hand in hand with the services.
For example, when searching for a piece of classical music composed by
Vivaldi, it will instantly come up with a list of Vivaldi’s violin sonatas;
and depending on what the user chooses to play, the list of
recommendations will modify itself in real time to provide the user with
the best possible choice. It will also instantly recommend related Baroque
composers.
These systems deal with rapidly changing large sources of data. Millions
of video and music clips are uploaded on the internet on a daily basis.
Conventional ML systems based on shallow learning cannot cope with
the scale and complexity of such recommendation systems. Deep learning
is the best-fit candidate (Figure 5.15).39
Figure 5.15 shows the general architecture. The central unit of the
system contains two prominent deep neural networks (DNN). The first DNN
is fed with the online datasets and learns to perform collaborative
filtering based on features such as the user’s age, gender, history of
search and watch, related genre, and so on. From the search and watch
history of similar users, it learns to make predictions (filtering) about
the interests of the user in question. The outcome of the collaborative
filtering DNN is a filtered subset of media (music or videos). The
filtered subset of media is then fed to the second DNN along with the
media features (genre, stars, age, bag of words from lyrics, etc.). The
second DNN fine-tunes the selection according to the user’s taste and
produces a still smaller subset of the most relevant media to the user.
These are ranked and then displayed in the rank order.
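The two-stage flow described above can be sketched with simple rules standing in for the two DNNs. Everything in the sketch is a hypothetical illustration: the catalog entries, the user profile fields, and the scoring rule are made up, and real systems learn both stages from data rather than hand-coding them.

```python
def filter_candidates(user, catalog):
    """Stage 1 stand-in for the collaborative filtering DNN: keep only the
    media whose genre appears in the user's search and watch history."""
    return [media for media in catalog if media["genre"] in user["watched_genres"]]

def rank_candidates(user, candidates, top_n=2):
    """Stage 2 stand-in for the ranking DNN: score each candidate from its
    media features and return the ids of the top_n in rank order."""
    def score(media):
        liked = 1.0 if media["artist"] in user["liked_artists"] else 0.0
        freshness_penalty = media["age_days"] / 1000
        return liked - freshness_penalty
    ranked = sorted(candidates, key=score, reverse=True)
    return [media["id"] for media in ranked[:top_n]]

# Made-up media database and user profile.
catalog = [
    {"id": "v1", "genre": "baroque", "artist": "Vivaldi", "age_days": 400},
    {"id": "v2", "genre": "baroque", "artist": "Bach", "age_days": 30},
    {"id": "v3", "genre": "pop", "artist": "Unknown", "age_days": 5},
    {"id": "v4", "genre": "baroque", "artist": "Vivaldi", "age_days": 10},
    {"id": "v5", "genre": "jazz", "artist": "Unknown", "age_days": 50},
]
user = {"watched_genres": {"baroque"}, "liked_artists": {"Vivaldi"}}

candidates = filter_candidates(user, catalog)  # v1, v2, v4
displayed = rank_candidates(user, candidates)  # ["v4", "v1"]
```

The design point carried over from the architecture is the split itself: a cheap first stage narrows a very large catalog to a small candidate set, so that the second stage can afford to use richer features on each remaining item.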

Figure 5.15 Deep learning recommendation system (adapted from Covington
et al.). Flow: online media database → collaborative filtering DNN (with
history of search & watch) → ranking DNN (with media features) → ranked &
displayed media.



Data for fuelling recommendation systems


Large-scale enterprises do not have any dearth of data which they readily
employ to drive several ML systems. Medium-sized enterprises also have
sufficient data which can run the online ML engines. However, small-size
enterprises and start-ups often struggle to reach the data critical mass that
will make their ML algorithms effective. A prominent pain point here is
how to quickly amass data that will drive the product recommendation sys-
tems that those firms have freshly deployed. A viable alternative is to exper-
iment with datasets freely available in the public domain such as Kaggle.40

CONSOLIDATION WORKSHOP

1. What are the benefits of static learning? What are its limitations?
2. Describe the characteristics of dynamic learning.
3. What are some of the most important data characteristics?
4. What is the importance of static data?
5. Describe the differences between preprocessed and distributed data.
6. What are the challenges of continuously changing data? How is it
monitored?
7. What are the characteristics of shallow learning?
8. What is the impact of deep learning?
9. What are the challenges of transfer learning? Give some concrete
examples where one can apply transfer learning.
10. How is shallow learning used in data analytics?
11. How is deep learning used in data analytics?
12. What is neuro-evolution? How can it be applied to ML for BO?
13. How are recommendation systems implemented in businesses?
14. What is the role of the discount factor in deep Q learning?
15. What is a cold start in recommendation systems? How is it resolved?

NOTES

1. Pykes, Kurtis, Getting a data science job is harder than ever: How to use the
difficulties of landing a gig to your advantage. Medium, 2020, September 29,
https://towardsdatascience.com/getting-a-data-science-job-is-harder-than-ever-fb796aae1922.
2. Davenport, Thomas H. and Patil, D.J., Data scientist: The sexiest job of the
21st century. Harvard Business Review, 2012, October issue,
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.
3. https://developers.google.com/machine-learning/crash-course/static-vs-dynamic-training/video-lecture.
