Module-2 Chapter 5
Dynamicity in learning
Smart selection of learning techniques
DYNAMICITY IN ML
Static learning
Static learning is performed manually most of the time. Static learning is
also known as offline learning. The datasets are almost always prepared
offline. Relevant data is collected from various sources and checked by
domain experts and data scientists for inconsistencies. The datasets are
later preprocessed by using some of the techniques described in Chapter 4.
Upon completion of the data preprocessing stage, the data is then divided
into training, validation, and testing subsets. The data scientists then look
for an appropriate model to be trained for prediction. The model is trained,
validated, and tested using the predetermined subsets. The model hyperpa-
rameters are also tuned to give the desired level of prediction accuracy. The
learned model is then frozen as a time-tested artifact. In static or offline
learning, the model is trained only once and then used for prediction for a
while.3 When new enterprise data becomes available, the model is used for
making predictions.
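As an illustrative sketch of this offline workflow (assuming the scikit-learn and pandas libraries, and hypothetical file names), the following code trains a model once on a prepared dataset, checks it against the held-out subsets, and freezes the learned artifact for later use on new enterprise data.

# Static (offline) learning sketch: train once, freeze, reuse for prediction.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed dataset with a "target" column.
data = pd.read_csv("prepared_dataset.csv")
X, y = data.drop(columns=["target"]), data["target"]

# Divide into training, validation, and testing subsets (60/20/20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Train once; hyperparameters would be tuned against the validation subset.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
print("test accuracy:", model.score(X_test, y_test))

# Freeze the learned model as a time-tested artifact.
joblib.dump(model, "frozen_model.joblib")

# Later, when new enterprise data becomes available, the frozen model only predicts.
new_data = pd.read_csv("new_enterprise_data.csv")
predictions = joblib.load("frozen_model.joblib").predict(new_data)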
Static learning is easier and cost-effective as far as implementation and
maintenance are concerned. The ML team undertakes a one-time effort and
investment to obtain the trained model. The benefits can then be reaped
against new data that steadily becomes available in the enterprise. On the
downside, models that are trained in a static manner cannot cope with rap-
idly changing data. There are weekly, monthly, seasonal, and annual data
changes in any enterprise. Consider, for instance, firms selling costumes and
apparel. Their static machine learning models will not be in a position to predict sales because of seasonal variation in the data. The statically learned models will almost certainly break down around Halloween, when there is a huge spike in costume sales. To respond to such data-driven business requirements,
a dynamic learning framework is presented in the following sections.
Dynamic learning
Data in real life is not static. It is constantly changing, growing, and diver-
sifying. Dynamic learning is designed to adapt to changing data. Older
datasets are updated and newer datasets are created with the passage of
time. Dynamic learning models are flexible to take the new forms of data
and continue the learning task. This kind of online learning is adaptive
and continuous. The ML algorithm continuously improves its learning and
prediction is performed on the fly. The major disadvantage of the dynamic
learning systems is that the incoming data, the model, and the entire ML
pipeline have to be continually monitored.
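A minimal sketch of this online flavour of learning, assuming scikit-learn: an incremental learner is updated with partial_fit as each new chunk of data arrives, so learning and prediction happen on the fly. The simulated stream below stands in for the monitored online repository.

# Dynamic (online) learning sketch: the model is updated as new data arrives.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)       # supports incremental updates
classes = np.array([0, 1])                  # all classes must be declared up front

def on_new_chunk(X_chunk, y_chunk):
    """Called whenever the monitoring unit delivers a fresh chunk of data."""
    model.partial_fit(X_chunk, y_chunk, classes=classes)

# Simulated stream of incoming data chunks.
rng = np.random.default_rng(0)
for _ in range(10):
    X_chunk = rng.normal(size=(50, 4))
    y_chunk = (X_chunk.sum(axis=1) > 0).astype(int)
    on_new_chunk(X_chunk, y_chunk)          # continuous, adaptive learning
    print(model.predict(X_chunk[:3]))       # prediction is available immediately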
Input–output pairs
Datasets are usually in the form of tables with rows and columns. The col-
umns represent the data attributes. The columns on the left represent the
independent variables. The rightmost column is the output vector and repre-
sents the dependent variable. Supervised learning algorithms seek the func-
tional relationship between the input (independent) and output (dependent)
variables. Table 5.1 is an example of a dataset that can be used for regres-
sion as well as classification. It is related to the red variant of the Portuguese
Vinho Verde wine.4
The rightmost column indicates the quality of red wine on a scale from 0
to 10, dependent on the 11 attributes or features given in the columns on the
left. The dataset is also available in the UCI public repository.5
Table 5.1 Red wine quality dataset (excerpt, rows 10–15). Each row lists an index followed by fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and the quality score.

10  7.5  0.50   0.36  6.1  0.071  17  102  0.9978  3.35  0.80  10.5  5
11  6.7  0.58   0.08  1.8  0.097  15   65  0.9959  3.28  0.54   9.2  5
12  7.5  0.50   0.36  6.1  0.071  17  102  0.9978  3.35  0.80  10.5  5
13  5.6  0.615  0.00  1.6  0.089  16   59  0.9943  3.58  0.52   9.9  5
14  7.8  0.61   0.29  1.6  0.114   9   29  0.9974  3.26  1.56   9.1  5
15  8.9  0.62   0.18  3.8  0.176  52  145  0.9986  3.16  0.88   9.2  5
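As an illustrative sketch (assuming pandas and scikit-learn; the URL is the one commonly published for the UCI red wine dataset and may need adjusting), the code below loads the dataset, separates the 11 input attributes from the quality column, and fits a simple regressor to the input–output pairs.

# Input-output pairs: the 11 wine attributes are inputs, quality is the output.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")            # the file is semicolon-separated

X = wine.drop(columns=["quality"])          # independent variables (11 attributes)
y = wine["quality"]                         # dependent variable (quality score)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)   # regression view of the task
print("R^2 on the test subset:", reg.score(X_test, y_test))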
The Min-Max algorithm generates the game tree, specifying Min and Max's playing turn, and gives scores to the leaf nodes based on the evaluation of some heuristic function. It then backtracks Max's winning path consisting of a series of moves.7
The tree structure is an abstract mathematical concept for representing
objects and their relationships. The tree consisting of nodes and branches
resembles a natural tree turned upside down (Figure 5.1). Two nodes in the
tree have no more than one path between them. The starting node is called
the root, and the end nodes are called leaves. A node is called a parent node
if it has children nodes connected to it. Multiple nodes proceeding from the
same parent are called siblings. The arc or edge between two nodes is called
a branch.
The Min-Max algorithm with its variants was a classic game-playing algo-
rithm used in the early days of AI before the advent of reinforcement learn-
ing. Generating the entire game tree for any of the board games shown in
Table 5.3 is practically impossible, given the size of the datasets. To overcome
this problem, Min-Max adopts various techniques to prune the branches of
the tree.
The method at best is heuristic and not exact. The biggest disadvantage of this game-playing algorithm is that there is no learning. Therefore, the agent has no game-playing policy to make the winning moves. Recent research has repeatedly demonstrated that the best way of circumventing this limitation is reinforcement learning, in which the agent learns a game-playing policy from experience.
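The following is a generic sketch of the Min-Max procedure with alpha-beta pruning, one common technique for cutting off branches of the game tree. It is not tied to any particular board game: game_over, evaluate, legal_moves, and apply_move are hypothetical hooks that a concrete game implementation would supply.

# Min-Max with alpha-beta pruning: branches that cannot change the outcome are cut.
def minimax(state, depth, alpha, beta, maximizing, game):
    # Stop at a leaf: terminal position or depth limit; score it with a heuristic.
    if depth == 0 or game.game_over(state):
        return game.evaluate(state)

    if maximizing:                          # Max's turn: take the highest-scoring child
        best = float("-inf")
        for move in game.legal_moves(state):
            score = minimax(game.apply_move(state, move), depth - 1,
                            alpha, beta, False, game)
            best = max(best, score)
            alpha = max(alpha, best)
            if beta <= alpha:               # remaining siblings cannot help Max
                break                       # prune this branch
        return best
    else:                                   # Min's turn: take the lowest-scoring child
        best = float("inf")
        for move in game.legal_moves(state):
            score = minimax(game.apply_move(state, move), depth - 1,
                            alpha, beta, True, game)
            best = min(best, score)
            beta = min(beta, best)
            if beta <= alpha:               # remaining siblings cannot help Min
                break
        return best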
DATA AUGMENTATION
Deep learning models consume tons of data, not just in quantity but also in diversity.
However, it is very difficult to increase the instances in datasets for learn-
ing. Data augmentation refers to the process of making new data instances
by making minor modifications to the available data instances without
introducing new and fresh instances from external sources. Most of the
deep learning frameworks contain built-in data augmentation utilities.
By improving generalization, data augmentation leads to an overall improvement in prediction accuracy. It also helps overcome overfitting, a phenomenon in which the network learns a function that fits the training data too closely and fails to generalize to unseen data.
(Figure: image augmentation examples. (a) Original image, (b) rotated, (c) cropped, (d) horizontally flipped.)
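A minimal sketch of these transformations using torchvision, one of the deep learning frameworks with built-in augmentation utilities; the image file name is hypothetical.

# Image data augmentation: rotation, cropping, and horizontal flipping.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # (b) small random rotation
    transforms.RandomResizedCrop(size=224),      # (c) random crop, resized back
    transforms.RandomHorizontalFlip(p=0.5),      # (d) flip left-right half the time
])

original = Image.open("example.jpg")             # (a) original image (hypothetical file)
new_instances = [augment(original) for _ in range(10)]   # ten new training instances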
For text data, a common augmentation method, also known as lexical substitution, tries to substitute a randomly chosen word in a sentence either with a synonym or another word not too far in meaning. Synonyms are readily available in WordNet,12 which
is a standard digital thesaurus used in NLP tasks. Typical techniques for finding candidate substitutes are k-nearest neighbors search and cosine similarity. Yet another method is to substitute words using pretrained word embeddings such as Word2Vec13 or FastText.14 By far, the most successful method is the use of transformer
models developed by Google such as bidirectional encoder representations
from transformers (BERT).15 BERT is pretrained on a large body of text using
masked language modeling in which the model has learned to predict masked
words based on the context. Hence, in text data augmentation, BERT can replace a masked word with a substitute that does not alter the meaning of the original sentence. A further improvement on word embedding data augmentation is the contextual augmentation method, which offers a wider range of substitute words predicted by a bidirectional language model matching the context.16
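A minimal sketch of synonym-based lexical substitution through WordNet in the NLTK library; the WordNet corpus must be downloaded before first use, and the simple "first synonym" choice is only for illustration.

# Lexical substitution: replace a randomly chosen word with a WordNet synonym.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_substitute(sentence):
    words = sentence.split()
    idx = random.randrange(len(words))           # pick a random word to replace
    synsets = wordnet.synsets(words[idx])
    if synsets:
        # Take the first lemma of the first synset as a simple synonym choice.
        synonym = synsets[0].lemmas()[0].name().replace("_", " ")
        words[idx] = synonym
    return " ".join(words)

print(synonym_substitute("The quarterly sales report was very encouraging"))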
Sentence shuffling in short paragraphs of text is also known to add useful data instances to a training body of text. The Python NLTK library offers sentence tokenization, after which the sentences can be shuffled. This is useful in sentiment analysis studies because shuffling whole sentences generally preserves the sentiment conveyed in the original text. Paraphrasing is another useful method; BERT is quite powerful in paraphrasing a phrase or a sentence owing to the masking method. Neural machine translation is an emerging technology in the field of translation. It needs a tremendous amount of data in the form of parallel corpora of the source and the target languages, and there is not always a one-to-one correspondence between the words in the two languages. In back-translation, monolingual text in the target language is machine-translated back into the source language to create synthetic source–target sentence pairs. Effectively, back-translation serves as an instrument to augment data for a target translation language that may not have sufficient parallel data for training a machine translation model.17,18
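A minimal sketch of sentence shuffling with NLTK's sentence tokenizer; the punkt tokenizer model must be downloaded before tokenization, and the example paragraph is hypothetical.

# Sentence shuffling: reorder whole sentences to create a new training instance.
import random
import nltk

nltk.download("punkt", quiet=True)

paragraph = ("The delivery was fast. The packaging was neat. "
             "The product itself exceeded expectations.")
sentences = nltk.sent_tokenize(paragraph)    # split the paragraph into sentences
random.shuffle(sentences)                    # reorder the sentences
augmented = " ".join(sentences)              # overall sentiment is preserved
print(augmented)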
Synthetic dataset
Even the exquisite data augmentation techniques described above do not
suffice to produce the high quality and the right amount of data necessary
for accurate machine learning. Another well-known method is creating a
repository of synthetic data algorithmically. The main purpose of this tech-
nique is to generate flexible and rich data to experiment with various regres-
sion, classification, clustering, and deep learning algorithms.
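As an illustrative sketch, scikit-learn can generate such synthetic datasets algorithmically; the generator parameters below are arbitrary and only show how flexible the data can be made for regression, classification, and clustering experiments.

# Algorithmic generation of synthetic datasets for different learning tasks.
from sklearn.datasets import make_blobs, make_classification, make_regression

# Classification: 1,000 instances, 10 features, 2 classes.
X_cls, y_cls = make_classification(n_samples=1000, n_features=10,
                                   n_informative=5, n_classes=2, random_state=0)

# Regression: continuous target with controllable noise.
X_reg, y_reg = make_regression(n_samples=1000, n_features=10,
                               noise=0.1, random_state=0)

# Clustering: well-separated blobs for unsupervised experiments.
X_clu, _ = make_blobs(n_samples=1000, centers=4, random_state=0)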
Customer privacy is another notable reason for generating synthetic data. For example, predicting customer behavior or preferences would need large amounts of data to build the learning models. But because of privacy, accessing real customer data is slow and does not provide good enough data on account of extensive masking and redaction of information.19 In such cases, generating synthetic data from a model fitted with real
data is the need of the hour. If the model is a good representation of the real
data, then the synthetic data will also pick up the statistical properties of
the data, which are indispensable for learning.
Table 5.4 shows some of the relevant properties a synthetic dataset should
possess.
Examples of synthetic datasets include SURREAL20 for estimating human
pose, shape, and motion from images and videos; SYNTHIA21 for semantic
segmentation of urban scenes, and those for text detection in natural images.22
(Figure: the dynamic ML framework. Online data flows into an automatic collection unit and a preprocessing unit; an expert system engine and an algorithm selector switchboard route the data to the appropriate learning algorithm; a prediction and decision-making engine delivers results through the user interface.)
Automatic collection
The automatic data collection unit constantly monitors the ever-changing
data in the online repository. Whenever new chunks of data become available, it automatically downloads them and pushes them to the preprocessing module. The automatic collection module is responsible for ascertaining that
there is no data redundancy.
Preprocessing
This unit routinely performs preprocessing, the type of which was seen in
Chapter 4. It filters noise from the messy data, rectifies corrupted data, and
imputes values in case of missing data. It sorts out the complex nonlinear
relationships hidden in the data and performs some sort of a feature selec-
tion based on the expert rules provided in the unit. The data preprocessing module is semiautomatic: it preprocesses most of the online incoming data but needs the assistance of the user for certain complex and unforeseeable tasks.
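A minimal sketch of the kind of routine preprocessing this unit performs, assuming scikit-learn: missing values are imputed and features are put on a common scale before the data is handed to a learning algorithm.

# Semiautomatic preprocessing sketch: impute missing values, then standardize.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill in missing values
    ("scale", StandardScaler()),                    # standardize each feature
])

raw = np.array([[1.0, 200.0],
                [np.nan, 180.0],                    # missing value to be imputed
                [3.0, np.nan]])
clean = preprocess.fit_transform(raw)
print(clean)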
Knowledge acquisition
This is the very first step in building an expert system. Knowledge predomi-
nantly resides with the domain experts. In the ML context, data scientists are
the domain experts who have detailed knowledge of data and the associated
ML methods and algorithms. In particular, they know which ML category or
algorithm a particular dataset should be directed to. The body of knowledge
is systematically collected from data scientists and stored in a convenient for-
mat in a large knowledge repository called the knowledge base.
Knowledge representation
The knowledge acquired from the expert may be in the form of unstructured
text, tables, diagrams, and annotations. This knowledge is then represented
in a well-defined format to remove ambiguities and make the coding work
easier. Semantic networks, frames, and ontologies are some of the well-known
knowledge representation formats in the knowledge engineering domain. The
IF-THEN rules are then framed from the knowledge formats. For instance,
there will be rules like IF (data contains input–output pairs) THEN (supervised
learning), IF (data does not contain input–output pairs) THEN (unsupervised
learning), etc. The framework contains a user interface at the front end, through which the expert system engine queries the user about the characteristics and type of online incoming data.
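As a minimal sketch (with hypothetical attribute names and a deliberately simple rule format), such IF-THEN rules about the incoming data could be encoded as condition-conclusion pairs and evaluated against a description of the dataset supplied through the user interface.

# IF-THEN rules mapping characteristics of incoming data to an ML category.
RULES = [
    (lambda d: d.get("has_input_output_pairs", False), "supervised learning"),
    (lambda d: not d.get("has_input_output_pairs", False), "unsupervised learning"),
    (lambda d: d.get("reward_signal", False), "reinforcement learning"),
]

def classify_dataset(description):
    """Fire every rule whose IF part matches the dataset description."""
    return [conclusion for condition, conclusion in RULES if condition(description)]

print(classify_dataset({"has_input_output_pairs": True}))
# -> ['supervised learning']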
Inference
Inference is the process of drawing conclusions by linking rules with the available data.
The main inference mechanisms found in modern expert systems are
forward-chaining, backward-chaining, and hybrid. In the forward-chaining
or data-driven inference mechanism, if a piece of data matches the “IF” part
of any “IF-THEN” production rule, the rule fires and produces another rel-
evant piece of data. The newly produced intermediate conclusions of all the
rules that have fired are kept in the working memory of the expert system.
The inference engine then searches for rules in the ruleset that match the
new contents available in the working memory. These rules are then fired,
generating more intermediate conclusions which are used to invoke appli-
cable rules in the next round. The fetch-match-fire cycle continues till the
inference engine exhausts all the fire-able rules that exist on the agenda. In
the backward chaining or goal-driven inference mechanism, the “THEN”
part of the “IF-THEN” rule is provisionally assumed to be true, and the
inference engine looks for data that will satisfy this subgoal. It then seeks
to link all the subgoals to yield the final goal. The hybrid inference is a
combination of the forward and backward chaining. Given a dataset and
a ruleset, the expert system reaches the same conclusion irrespective of the
inference chaining. The only difference is the speed of convergence, which
is problem dependent.
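A minimal sketch of the forward-chaining fetch-match-fire cycle on a toy ruleset; the facts are plain strings held in working memory, and the rules themselves are illustrative rather than taken from a real knowledge base.

# Forward chaining: fire any rule whose IF part is satisfied by working memory,
# add its conclusion as a new fact, and repeat until no more rules can fire.
RULES = [
    ({"data has input-output pairs"}, "supervised learning"),
    ({"supervised learning", "output is continuous"}, "regression"),
    ({"supervised learning", "output is categorical"}, "classification"),
]

def forward_chain(facts):
    working_memory = set(facts)
    fired = True
    while fired:                                    # fetch-match-fire cycle
        fired = False
        for conditions, conclusion in RULES:
            if conditions <= working_memory and conclusion not in working_memory:
                working_memory.add(conclusion)      # intermediate conclusion
                fired = True
    return working_memory

print(forward_chain({"data has input-output pairs", "output is continuous"}))
# -> includes 'supervised learning' and 'regression'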
In a medical diagnosis session, for example, the expert system initiates the session by asking the patient a couple of questions related to the patient's symptoms, and then chains through its rules to narrow down the diagnosis.
Shallow learning
The term “shallow learning” is a misnomer because the performance of the ML
algorithms pertaining to this class is far from being shallow. “Shallowness”
is not a salient feature of the “shallow learning” ML algorithms. The term
is collectively and rather loosely used to refer to all forms of ML before the
advent of "deep learning" (Figure 5.5).
The common feature of the algorithms which are included in “shallow
learning” is that they rely on handcrafted features based upon heuristics of
the target problem. The domain experts and data scientists define the fea-
tures of the problem space, and subsequently, data is collected in a format
dictated by the features. Data is preprocessed and cleaned, and then the
features are further refined. Finally, the cleaned data is fed into algorithms
like linear regression, SVM, random forests, and neural networks (NNs)
to arrive at predictions and the decision-making based on them. Viewed this way, it is closer to the truth to call shallow learning feature engineering.
(Figure 5.5: the nested scope of AI, ML, NN, and DNN.)

NN is one of the most common ML techniques found in the AI research community. Historically, neural networks were proposed in the 1940s as perceptrons (Figure 5.6).23

(Figure 5.6: a perceptron, with inputs x1, x2, …, xn, weights w1, w2, …, wn, a weighted summation Σ, and an output.)

They were found to be capable of solving several
problems of interest at the time. However, research in the potential appli-
cations soon began to decline when some researchers criticized NN as not
being able to perform the XOR operation. The problem was solved years
later when a middle layer was introduced in the NN. It is precisely the num-
ber of layers in a NN that distinguishes deep learning from shallow learning.
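A minimal sketch, using scikit-learn, showing that a network with a single hidden (middle) layer learns the XOR function that a single-layer perceptron cannot represent.

# XOR becomes learnable once a hidden layer is introduced.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                                 # XOR truth table

# One hidden layer of four units; lbfgs copes well with this tiny dataset.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))    # expected [0 1 1 0]; a different seed may be needed
                         # if the optimizer lands in a poor local minimum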
Corresponding to each category and mode of learning, there are specific
algorithms that operate on the incoming data. The shallow learning mode,
for example, contains all the algorithms described in Chapter 4 and a few
more relevant ones that are not covered in this book.
The algorithm selector switchboard (Figure 5.7) in the framework helps the
user to select an appropriate ML algorithm for an ML training session. The
system will use an algorithm indicated by the user. In the ensemble mode, it
will use an ensemble of algorithms and report the best performance. When
operating in the default setting, the system will choose an algorithm at ran-
dom from multiple algorithms to perform learning.
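A minimal sketch of such a switchboard, with hypothetical mode names and a small suite of scikit-learn algorithms: a training request is routed to the user-selected algorithm, to an ensemble of algorithms from which the best cross-validated performer is kept, or to a randomly chosen default.

# Algorithm selector switchboard: user choice, ensemble mode, or random default.
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

ALGORITHMS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(),
}

def select_and_train(X, y, mode="default", choice=None):
    if mode == "user" and choice in ALGORITHMS:        # algorithm indicated by the user
        return ALGORITHMS[choice].fit(X, y)
    if mode == "ensemble":                             # try all, keep the best performer
        scores = {name: cross_val_score(est, X, y, cv=5).mean()
                  for name, est in ALGORITHMS.items()}
        best = max(scores, key=scores.get)
        return ALGORITHMS[best].fit(X, y)
    # Default setting: choose one of the available algorithms at random.
    return ALGORITHMS[random.choice(list(ALGORITHMS))].fit(X, y)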
Deep learning
Deep learning methods have gained popularity because they often outper-
form conventional (i.e., shallow) ML methods and can extract features auto-
matically from raw data with little or no preprocessing.24,25,26 The structure
of a deep neural network is very complex, containing layers in the order of
tens or hundreds. The VGGNet,27 which is often used as a pretrained artifact for image recognition, consists of 16 weight layers (13 convolutional and 3 fully connected). As opposed
to the shallow neural networks that do a mere weighted summation of the
inputs at each of the neurons, deep CNNs perform complex operations like
weight sharing, convolution, and pooling during each forward pass.
Transfer learning
In this unit of the framework, several pretrained models are available, each with a specific kind of input–output data. These are chosen by the system depending on the new incoming data.
ML comprises two distinct phases: training phase and testing phase. One
of the fundamental assumptions of ML is that the training and the testing
datasets come from identical statistical distributions. To a large extent this
is justified because in most of the ML cases, only one original dataset is
dealt with. This original dataset is randomly divided into training dataset
and testing dataset. The ML experiment is repeated to yield a statistically
satisfiable k-fold cross-validation, which on average leads to a minimization
of the loss function in training and minimization of the prediction error in
testing. However, the power of ML is demonstrated in the real world when it
is trained on a known (historical) dataset and then used to make predictions
on unknown datasets – future datasets that were not available at the time of
training. This is where most of the ML models get into trouble. The future
test datasets may not share the same statistical distribution as the parent
training dataset. The performance will degrade. The most natural thing to
do when faced with such a situation is to rebuild another ML model from
scratch to fit the new data and start the training-validation-testing phases all
over again. This is where transfer learning comes to help.
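A minimal sketch of transfer learning with Keras (assuming TensorFlow is installed; the class count and layer sizes are illustrative): a VGG16 base pretrained on ImageNet is frozen, and only a small new head is trained on the new, smaller dataset.

# Transfer learning sketch: reuse a pretrained VGG16 base, train only a new head.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pretrained layers

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),      # small task-specific head
    layers.Dense(3, activation="softmax"),    # e.g. three classes in the new dataset
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)  # fit on the new dataset only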
1. User-driven strategy
The user indicates the category or mode of learning to be applied. The system returns an error code if the selected category or mode of learning does not match the input data.
2. Data-driven strategy
The system will gauge the scale of the data and select an appropriate
algorithm in the given category. For simple and small datasets, it will
select simpler algorithms; for larger and more complex datasets, it will
select more complex algorithms from the suite provided in the framework.
3. Ensemble Learning strategy
For a given learning problem, the framework will try multiple algo-
rithms and select the one with the best performance. The disadvantage
is computational time. However, it is ameliorated since the dynamic
ML framework is operating online with continuous self-monitoring.
Besides, modern enterprises are equipped with massive and rapid com-
puting resources spread across the cloud. Graphics processing unit (GPU) computing, which offers an orders-of-magnitude performance increase over conventional CPU computing, is an additional blessing.
Neuro-evolution
g(X) = 0    (5.2)
Genetic algorithm
(Figure: the genetic algorithm cycle. Randomly generate a population of N parents, select the better-fit parents, apply crossover and mutation, and replace the old population with the offspring.)
An initial population of candidate solutions is randomly generated and evaluated against the objective function. These parents are then allowed to evolve through a
series of evolutionary cycles consisting of various genetic operators. Over
the course of evolution, the selection of fitter parents in the population and
the eventual crossover and mutations in the successive generation of off-
spring gradually result in better fit or optimal individuals.
The GA iterates through repeated cycles of fitness evaluation, selection, crossover, mutation, and replacement of the old population.
Figure 5.13 (a) One-point crossover, (b) two-point crossover, (c) uniform crossover, illustrated on 8-bit parent strings.
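A minimal sketch of these cycles on binary strings, with tournament selection, one-point crossover, and bit-flip mutation; the fitness function simply counts 1-bits and stands in for a real objective function.

# Genetic algorithm sketch: selection, one-point crossover, mutation, replacement.
import random

def fitness(individual):                    # stand-in objective: count the 1-bits
    return sum(individual)

def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.01):
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]             # random initial parents
    for _ in range(generations):
        next_population = []
        while len(next_population) < pop_size:
            # Select the better-fit parents (tournament selection).
            p1 = max(random.sample(population, 3), key=fitness)
            p2 = max(random.sample(population, 3), key=fitness)
            point = random.randrange(1, length)         # one-point crossover
            child = p1[:point] + p2[point:]
            child = [1 - bit if random.random() < mutation_rate else bit
                     for bit in child]                  # bit-flip mutation
            next_population.append(child)
        population = next_population                    # replace the old population
    return max(population, key=fitness)

print(evolve())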
Recommendation systems
Deep learning techniques may be used to develop a realistic recommenda-
tion system for online shopping customers.
Almost all businesses have gone online and keep a record of the products
purchased by customers along with the customer preferences and their per-
sonal data. Further, to enhance their sales, online businesses are routinely
engaged in maintaining recommendation systems. These systems, which
monitor customer preferences along with their data, quickly run an ML algorithm over the collected data and come up with recommendations of new products which may interest the customers. The recommendations are not
totally unrelated random suggestions, but are based on the customer pur-
chase history and their preferences. In a nutshell, recommendation systems
are ML systems that learn to predict the rating or preference an online
customer would give to an item or product in the market.
When a customer walks into a physical store (as opposed to an online store), she can see all the items displayed in the store. Since the collection of items is not too large, she can easily select a handful of items to focus on at a time, and then make her purchase choice. However, this kind of stress-free traditional buying suddenly transforms into a confusing and stressful situation when the customer takes her shopping spree to online stores. First of all, she has to search for relevant items using a web-based search engine; she is immediately bombarded with a vast collection of items, and it becomes extremely difficult to make the buying choice. Recommendation systems come to her rescue by suggesting a relevant list of items, not too large, so that she can concentrate on making her selection. Recommender systems are information filtering systems that offer a win-win situation to the buyer and the seller: they help the buyer to select a handful of items at a time to make her buying choice and, simultaneously, help the seller to sell additional products.
Types of recommendation systems
Below, different kinds of recommendation systems deployed in practice
are explained.
Popularity-based method
This is the simplest recommendation system conceivable. It is based on the
popularity of the item being sold. The popularity of an item is simply the
count of the number of items sold so far. Items in the online store are given
a popularity index directly based on the number of pieces of that item or
product being sold, and then ranked according to this popularity index.
The most popular items are placed on top of the list recommended to the
user. The major drawback of the popularity-based method is that it does
not offer any personalization to the customer.
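A minimal sketch, assuming pandas and a small hypothetical purchases table, of a popularity-based recommender that ranks items by how many times they have been sold and serves the same list to every customer.

# Popularity-based recommendation: rank items by their sales count.
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["u1", "u2", "u3", "u1", "u4", "u2"],
    "item":     ["laptop", "mouse", "laptop", "keyboard", "laptop", "mouse"],
})

popularity = purchases["item"].value_counts()    # popularity index per item
top_items = popularity.head(3).index.tolist()    # most popular items on top
print(top_items)            # same list for every customer: no personalization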
Collaborative filtering
It has become a common practice for online marketing enterprises to get the
customers to rate the products in a casual and noninvasive way when they
are busy inspecting the items on display in the online store before making a
purchase. The ratings dataset is converted into a matrix in which customers
occupy the rows and items occupy the columns. The cell corresponding to a
customer and an item is the rating given by the customer to that particular
item, on a scale of 1~5 in this case (Figure 5.14). The interaction matrices
are usually very large, although sparse.
Collaborative methods work by employing appropriate machine learning algorithms that try to learn a function predicting each user's preference for the items. This is based on a similarity measure that computes the similarity between users with respect to the items they have rated.
The k-nearest neighbors algorithm often used in recommendation systems goes through the following steps: compute the similarity between the target user and every other user from their rating vectors; select the k most similar users who have rated the item in question; predict the target user's rating as a similarity-weighted average of those neighbors' ratings; and recommend the items with the highest predicted ratings.
Figure 5.14 A sparse user–item interaction matrix: rows are customers (user1 to user5), columns are items, and each cell holds the rating (1 to 5) the customer gave to that item; empty cells correspond to items the customer has not rated.
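A minimal sketch of the user-based k-nearest-neighbors approach, assuming NumPy and a small illustrative rating matrix in which 0 marks an unrated item: cosine similarity between rating vectors identifies the most similar users, and a missing rating is predicted as their similarity-weighted average.

# User-based collaborative filtering with cosine similarity (0 = not rated).
import numpy as np

R = np.array([[1, 4, 3, 0, 0],        # each row is one user's ratings of five items
              [2, 5, 2, 0, 0],
              [1, 4, 4, 1, 0],
              [0, 2, 3, 2, 2],
              [5, 0, 0, 2, 1]], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def predict(user, item, k=2):
    # Similarity of this user to every other user's rating vector.
    sims = np.array([cosine(R[user], R[other]) if other != user else -1.0
                     for other in range(len(R))])
    # Keep the k most similar users who have actually rated the item.
    neighbors = [u for u in np.argsort(sims)[::-1] if R[u, item] > 0][:k]
    if not neighbors:
        return 0.0
    weights = sims[neighbors]
    return float(np.dot(weights, R[neighbors, item]) / weights.sum())

print(round(predict(user=0, item=3), 2))   # predicted rating of the fourth item for user 1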
CONSOLIDATION WORKSHOP
1. What are the benefits of static learning? What are its limitations?
2. Describe the characteristics of dynamic learning.
3. What are some of the most important data characteristics?
4. What is the importance of static data?
5. Describe the differences between preprocessed and distributed data.
6. What are the challenges of continuously changing data? How is it
monitored?
7. What are the characteristics of shallow learning?
8. What is the impact of deep learning?
9. What are the challenges of transfer learning? Give some concrete
examples where one can apply transfer learning.
10. How is shallow learning used in data analytics?
11. How is deep learning used in data analytics?
12. What is neuro-evolution? How can it be applied to ML for BO?
13. How are recommendation systems implemented in businesses?
14. What is the role of the discount factor in deep Q learning?
15. What is a cold start in recommendation systems? How is it resolved?
NOTES
1. Pykes, Kurtis, Getting a data science job is harder than ever: How to use the difficulties of landing a gig to your advantage. Medium, September 29, 2020, https://towardsdatascience.com/getting-a-data-science-job-is-harder-than-ever-fb796aae1922.
2. Davenport, Thomas H. and Patil, D.J., Data scientist: The sexiest job of the 21st century. Harvard Business Review, October 2012, https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.
3. https://developers.google.com/machine-learning/crash-course/static-vs-dynamic-training/video-lecture.