INTRODUCTION TO MACHINE LEARNING
Unit Structure
Introduction
Machine learning
Examples of Machine Learning Problems
Structure of Learning
Learning versus Designing
Training versus Testing
Characteristics of Machine learning tasks
Predictive and descriptive tasks
Summary
Unit End Questions
References
INTRODUCTION
A human child learns new things and uncovers the structure of their world
year by year as they grow to adulthood. A child's brain and senses
perceive the facts of their surroundings and gradually learn the hidden
patterns of life, which help the child craft logical rules to identify learned
patterns. This learning process makes humans the most sophisticated living
creatures in the world. Learning continuously, by discovering hidden
patterns and then innovating on them, enables us to improve ourselves
throughout our lifetime.
Superficially, we can draw some motivational similarities between the
learning process of the human brain and the concepts of machine learning.
(jlooper, n.d.)
The human brain perceives things from the real world, processes the
perceived information, makes rational decisions, and performs certain
actions based on circumstances. When we program a replica of the
intelligent behavioural process to a machine, it is called artificial
intelligence (AI).
Similarly, machine learning (ML), a subset of AI, uses specialized algorithms to
uncover meaningful information and find hidden patterns from perceived
data to support the logical decision-making process.
MACHINE LEARNING
This motivation is loosely inspired by how the human brain learns certain
things based on the data it perceives from the outside world. Machine
learning is the systematic study of algorithms and systems that improve
their knowledge or performance with experience.
Arthur Samuel described it as: "The field of study that gives computers
the ability to learn from data without being explicitly programmed." This
is an older, informal definition.
Supervised Learning:
In supervised learning, machines are trained using labelled data and, on the
basis of that data, predict the output. Labelled data means some input data is
already tagged with the correct output.
1. Regression:
Regression algorithms are used when there is a relationship between the input
variable and the output variable. They are used for the prediction of continuous
variables, such as weather forecasting, market trends, etc. Linear
Regression, Regression Trees, Non-Linear Regression, Bayesian Linear
Regression and Polynomial Regression are some popular regression
algorithms that come under supervised learning.
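As an illustration, here is a minimal sketch of supervised regression, assuming scikit-learn is available; the experience-to-salary data below is made up for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labelled data: years of experience (input) -> salary in thousands (continuous output)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 42, 48, 55])

model = LinearRegression().fit(X, y)   # learn the relationship from labelled data
print(model.predict([[6]]))            # predict the continuous output for an unseen input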
2. Classification:
Classification algorithms are used when the output variable is categorical,
i.e., it takes one of a set of classes such as Yes/No, Male/Female or
True/False. Spam filtering is a typical classification problem; Random Forests,
Decision Trees, Logistic Regression and Support Vector Machines are some
popular classification algorithms.
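A minimal classification sketch, again assuming scikit-learn; the word-count features and labels below are invented for illustration.

from sklearn.linear_model import LogisticRegression

# Hypothetical features per e-mail: [count of suspicious words, count of links]
X = [[5, 3], [4, 4], [0, 1], [1, 0]]
y = ["spam", "spam", "ham", "ham"]     # categorical output variable

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3, 2]]))           # predicted class for a new e-mail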
Unsupervised Learning:
In unsupervised learning, the machine is trained on unlabelled data and has to
discover the underlying structure of that data on its own.
Example: Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs. The
algorithm is never trained on the given dataset, which means it has no prior
idea about the features of the dataset. The task of the unsupervised learning
algorithm is to identify the image features on its own. It will perform this
task by clustering the image dataset into groups according to the similarities
between images.
1. Clustering:
Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have few or no
similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them according to the
presence and absence of those commonalities.
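A minimal clustering sketch with scikit-learn's KMeans; the points are illustrative only.

from sklearn.cluster import KMeans

# Unlabelled data: the algorithm must find the groups on its own
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each object
print(kmeans.cluster_centers_)   # the mean (commonality) of each cluster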
2. Association:
An association rule is an unsupervised learning method used for
finding relationships between variables in a large database. It
determines the sets of items that occur together in the dataset. Association
rules make marketing strategies more effective: for example, people who buy
item X (say, bread) also tend to purchase item Y (butter or jam). A
typical example of association rule mining is Market Basket Analysis.
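A minimal sketch of the idea behind a rule such as bread -> butter, computed directly in Python over a toy set of market baskets (the baskets are made up for illustration).

# Toy market baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

n = len(baskets)
bread = sum("bread" in b for b in baskets)
both = sum({"bread", "butter"} <= b for b in baskets)

support = both / n          # how often bread and butter occur together
confidence = both / bread   # how often butter is bought when bread is bought
print(f"support={support:.2f}, confidence={confidence:.2f}")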
Reinforcement Learning:
Reinforcement Learning is a feedback-based Machine learning technique
in which an agent learns to behave in an environment by performing the
actions and seeing the results of actions. For each good action, the agent
gets positive feedback, and for each bad action, the agent gets negative
feedback or penalty. In reinforcement learning, the agent learns
automatically from this feedback without any labelled data, unlike supervised
learning.
The agent learns through trial and error and, based on this experience, learns
to perform the task in a better way. Hence, we can say that reinforcement
learning is a type of machine learning method in which an intelligent agent
(a computer program) interacts with the environment and learns to act within
it. How a robotic dog learns the movement of its limbs is an example of
reinforcement learning.
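A minimal Q-learning sketch on a toy five-state corridor, showing how an agent improves by trial and error using only reward feedback; the environment and parameters are invented for illustration.

import random

n_states, goal = 5, 4
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action], actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    s = 0
    while s != goal:
        # explore occasionally, otherwise take the currently best action
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s2 == goal else 0.0       # positive feedback only at the goal
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([q.index(max(q)) for q in Q])          # greedy action learned per state (1 = move right)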
1. Image Recognition:
Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, places, etc., in digital images.
A popular use case of image recognition and face detection is the automatic
friend-tagging suggestion offered by social media platforms.
2. Speech Recognition:
While using Google, we get a "Search by voice" option; this comes under
speech recognition, which is a popular application of machine learning.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which
shows us the correct path with the shortest route and predicts the traffic
conditions.
It predicts traffic conditions using two kinds of information: the real-time
location of vehicles from the Google Maps app and road sensors, and the
average time taken on past days at the same time of day. Everyone who uses
Google Maps is helping to make the app better: it takes information from
users and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and
entertainment companies, such as Amazon and Netflix, for product
recommendations to users. Whenever we search for a product on Amazon, we
then start seeing advertisements for the same product while surfing the
internet in the same browser; this is because of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving
cars. Machine learning plays a significant role in self-driving cars. Tesla,
a well-known car manufacturer, is working on self-driving cars and uses
machine learning methods to train its car models to detect people and
objects while driving.
6. Virtual personal assistants:
Virtual personal assistants such as Google Assistant, Alexa, Cortana and Siri
help us find information using voice instructions. These assistants can help us
in various ways just through voice commands, such as playing music, calling
someone, opening an email, scheduling an appointment, etc.
7. Online fraud detection:
Machine learning also helps make online transactions secure by detecting
fraudulent transactions. For each genuine transaction, the output is converted
into hash values, and these values become the input for the next round.
Genuine transactions follow a specific pattern, which changes for a fraudulent
transaction; the model detects this change and so makes our online
transactions more secure.
STRUCTURE OF LEARNING
[Figure: a matrix of film ratings given by different people, from (FLACH, 2012)]
Try to look for columns or rows that are combinations of other columns or
rows. For instance, the third column turns out to be the sum of the first and
second columns. Similarly, the fourth row is the sum of the first and
second rows. What this means is that the fourth person combines the
ratings of the first and second person. Similarly, BADLA (B)'s ratings are
the sum of the ratings of the first two films. This is made more explicit by
writing the matrix as the following product:
[Figure: the rating matrix written as a product of three matrices, from (FLACH, 2012)]
Notice that the first and third matrices on the right-hand side are now
Boolean, and the middle one is diagonal (all off-diagonal entries are zero).
Moreover, these matrices have a very natural interpretation in terms of
film genres.
The right-most matrix associates films (in columns) with genres (in rows):
Khosla Ka Ghosla (KG) and Drishyam (D) belong to two different genres,
say drama and crime, BADLA (B) belongs to both, and Hera Phery (HP)
is a crime film and also introduces a new genre (say comedy).
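The decomposition described above can be reproduced in a small numpy sketch; the matrices below use made-up numbers and are not the actual ratings from the example.

import numpy as np

people_genres = np.array([[1, 0],    # person 1 likes drama
                          [0, 1],    # person 2 likes crime
                          [1, 1]])   # person 3 likes both
weights = np.diag([2, 1])            # how strongly each genre drives the ratings
genres_films = np.array([[1, 0, 1],  # drama: film 1 and film 3
                         [0, 1, 1]]) # crime: film 2 and film 3

ratings = people_genres @ weights @ genres_films
print(ratings)   # each rating is explained by the genres a person and a film share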
[Figure: Training Data → Machine Learning Algorithm → Building a Logical/Mathematical Model → Output]
Task, T: To classify mails into Spam or Not Spam.
Performance measure, P: Total percent of mails being correctly classified
as being “Spam” or “Not Spam”.
Step 1) Choosing the Training Experience: The first and most important
task is to choose the training data or training experience which will be fed
to the machine learning algorithm. It is important to note that the data or
experience we feed to the algorithm has a significant impact on the success or
failure of the model, so training data or experience should be chosen wisely.
Below are the attributes of the training experience which impact the success or
failure of the learner:
The second important attribute is the degree to which the learner controls the
sequence of training examples. For example, when training data is first fed to
the machine, its accuracy is very low; but as the machine gains experience by
playing again and again against itself or an opponent, the algorithm receives
feedback and controls the chess game accordingly.
Step 2) Choosing the target function: The next important step is choosing
the target function. According to the knowledge fed to the algorithm, the
machine learning system will choose a NextMove function which describes
what type of legal move should be taken. For example, while playing chess
with an opponent, when the opponent moves, the machine learning algorithm
will decide which of the possible legal moves to take in order to succeed.
Step 5) Final Design: The final design is created at the end, after the system
has gone through a number of examples, failures and successes, correct and
incorrect decisions, and so on. Example: Deep Blue, an ML-based intelligent
computer, won a chess game against chess expert Garry Kasparov and became
the first computer to beat a human chess expert.
TRAINING VERSUS TESTING
Training data and test data are two important concepts in machine learning.
Training Data:
The observations in the training set form the experience that the algorithm
uses to learn. In supervised learning problems, each observation consists
of an observed output variable and one or more observed input variables.
Test Data:
The test set is a set of observations used to evaluate the performance of the
model using some performance metric. It is important that no observations
from the training set are included in the test set. If the test set does contain
examples from the training set, it will be difficult to assess whether the
algorithm has learned to generalize from the training set or has simply
memorized it.
In addition to the training and test data, a third set of observations, called
a validation or hold-out set, is sometimes required. The validation set is
used to tune variables called hyperparameters, which control how the
model is learned. The program is still evaluated on the test set to provide
an estimate of its performance in the real world; its performance on the
validation set should not be used as an estimate of the model's real-world
performance since the program has been tuned specifically to the
validation data.
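A minimal sketch of creating training, validation and test sets with scikit-learn's train_test_split; the 60/20/20 proportions are an arbitrary illustrative choice.

from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First split off 40% of the data, then divide that 40% equally into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20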
Some training sets may contain only a few hundred observations; others
may include millions. Inexpensive storage, increased network
connectivity, the ubiquity of sensor-packed smartphones, and shifting
attitudes towards privacy have contributed to the contemporary state of big
data, or training sets with millions or billions of examples.
However, machine learning algorithms also follow the maxim "garbage in,
garbage out." A student who studies for a test by reading a large,
confusing textbook that contains many errors will likely not score better
than a student who reads a short but well-written textbook. Similarly, an
algorithm trained on a large collection of noisy, irrelevant, or incorrectly
labelled data will not perform better than an algorithm trained on a smaller
set of data that is more representative of problems in the real world.
Gathering and labelling training data can be costly in some domains.
Fortunately, several datasets are bundled with scikit-learn, allowing
developers to focus on experimenting with models instead.
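For example, one of the bundled datasets can be loaded in a couple of lines (a minimal sketch, assuming scikit-learn is installed):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 observations, 4 input features
print(iris.target_names)   # the three classes of iris flowers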
Consider for example that the original dataset is partitioned into five
subsets of equal size, labelled A through E. Initially, the model is trained
on partitions B through E, and tested on partition A. In the next iteration,
the model is trained on partitions A, C, D, and E, and tested on partition B.
The partitions are rotated until models have been trained and tested on all
of the partitions. Cross-validation provides a more accurate estimate of the
model's performance than testing on a single partition of the data.
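A minimal sketch of this rotation using scikit-learn's cross_val_score with five folds; the choice of dataset and model is illustrative only.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per held-out partition
print(scores.mean())   # a more stable estimate than a single train/test split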
2. Automation at its best:
4. The ability to take efficiency to the next level when merged with
IoT:
IoT is being designated as a strategically significant area by many
companies. And many others have launched pilot projects to gauge the
potential of IoT in the context of business operations. But attaining
financial benefits through IoT isn't easy. In order to achieve success,
companies offering IoT consulting services and platforms need to clearly
determine the areas that will change with the implementation of IoT
strategies; many of these businesses have failed to address this. In this
scenario, machine learning is probably the best technology that can be
used to attain higher levels of efficiency. By merging machine learning
with IoT, businesses can boost the efficiency of their entire production
processes.
5. The ability to change the mortgage market:
It's a fact that, for a lot of consumers, fostering a positive credit score usually
takes discipline, time and plenty of financial planning. When it comes
to lenders, the consumer credit score is one of the biggest measures of
creditworthiness; it involves a number of factors including payment
history, total debt, length of credit history, etc. But wouldn't it be great if
there is a simplified and better measure? With the help of machine
learning, lenders can now obtain a more comprehensive consumer picture.
They can now predict whether the customer is a low spender or a high
spender and understand his/her tipping point of spending. Apart from
mortgage lending, financial institutions are using the same techniques for
other types of consumer loans.
PREDICTIVE AND DESCRIPTIVE TASKS
In an overview table of machine learning settings, the rows refer to whether
the training data is labelled with a target variable, while the columns indicate
whether the learned models are used to predict a target variable or rather to
describe the given data.
SUMMARY
MACHINE LEARNING MODELS
Unit Structure
Introduction
Geometric Models
Logical Models
Probabilistic Models
Features
Feature types
Feature Construction and Transformation
Feature Selection
Summary
Unit End Questions
References
INTRODUCTION
Models form the central concept in machine learning as they are what is
being learned from the data, in order to solve a given task. There is a
considerable – not to say bewildering – range of machine learning models
to choose from. One reason for this is the ubiquity of the tasks that
machine learning aims to solve: classification, regression, clustering,
association discovery, to name but a few. Examples of each of these tasks
can be found in virtually every branch of science and engineering.
Mathematicians, engineers, psychologists, computer scientists and many
others have discovered – and sometimes rediscovered – ways to solve
these tasks. They have all brought their specific background to bear, and
consequently the principles underlying these models are also diverse. My
personal view is that this diversity is a good thing as it helps to make
machine learning the powerful and exciting discipline it is. It doesn't,
however, make the task of writing a machine learning book any easier!
Luckily, a few common themes can be observed, which allow me to
discuss machine learning models in a somewhat more systematic way. We
will discuss three groups of models: geometric models, probabilistic
models, and logical models. These groupings are not meant to be mutually
exclusive, and sometimes a particular kind of model has, for instance, both
a geometric and a probabilistic interpretation. Nevertheless, it provides a
good starting point for our purposes.
GEOMETRIC MODELS
When the features of an instance are numerical, we can use each feature as a
coordinate in a Cartesian coordinate system. A
geometric model is constructed directly in instance space, using geometric
concepts such as lines, planes and distances. For instance, the linear
classifier depicted in Figure 1 on p.5 is a geometric classifier. One main
advantage of geometric classifiers is that they are easy to visualise, as long
as we keep to two or three dimensions. It is important to keep in mind,
though, that a Cartesian instance space has as many coordinates as there
are features, which can be tens, hundreds, thousands, or even more. Such
high-dimensional spaces are hard to imagine but are nevertheless very
common in machine learning. Geometric concepts that potentially apply to
high-dimensional spaces are usually prefixed with 'hyper-': for instance, a
decision boundary in an unspecified number of dimensions is called a
hyperplane.
We will call this the basic linear classifier. It has the advantage of
simplicity, being defined in terms of addition, subtraction and rescaling of
examples only (in other words, w is a linear combination of the examples).
However, if those assumptions do not hold, the basic linear classifier can
perform poorly – for instance, note that it may not perfectly separate the
positives from the negatives, even if the data is linearly separable. Because
data is usually noisy, linear separability doesn't occur very often in
practice, unless the data is very sparse, as in text classification. Recall that
we used a large vocabulary, say 10 000 words, each word corresponding
to a Boolean feature indicating whether or not that word occurs in the
document. This means that the instance space has 10 000 dimensions, but
for any one document no more than a small percentage of the features will
be non-zero. As a result there is much 'empty space' between instances,
which increases the possibility of linear separability. However, because
linearly separable data doesn't uniquely define a decision boundary, we
are now faced with a problem: which of the infinitely many decision
boundaries should we choose? One natural option is to prefer large margin
classifiers, where the margin of a linear classifier is the distance between
the decision boundary and the closest instance. Support vector machines
are a powerful kind of linear classifier that find a decision boundary whose
margin is as large as possible (Figure 2.2).
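A minimal large-margin classifier sketch using scikit-learn's SVC with a linear kernel; the two well-separated groups of points are made up for illustration.

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the instances closest to the decision boundary
print(clf.predict([[5, 5]]))  # class predicted for a new instance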
nearest-neighbour classifier:
A very simple distance-based classifier works as follows:
To classify a new instance, we retrieve from memory the most similar
training instance (i.e., the training instance with smallest Euclidean
distance from the instance to be classified), and simply assign that training
instance's class. This classifier is known as the nearest-neighbour
classifier.
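A minimal numpy sketch of this nearest-neighbour rule; the training instances and labels are illustrative only.

import numpy as np

X_train = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 8.5]])
y_train = np.array(["ham", "ham", "spam", "spam"])

def nearest_neighbour(x_new):
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training instance
    return y_train[np.argmin(distances)]                  # copy the closest instance's class

print(nearest_neighbour(np.array([1.5, 1.2])))            # -> 'ham'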
Suppose we want to cluster our data into K clusters, and we have an initial
guess of how the data should be clustered. We then calculate the means of
each initial cluster and reassign each point to the nearest cluster mean.
Unless our initial guess was a lucky one, this will have changed some of
the clusters, so we repeat these two steps (calculating the cluster means
and reassigning points to clusters) until no change occurs.
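A minimal numpy sketch of exactly this two-step loop (compute cluster means, reassign points, repeat until nothing changes), with illustrative data and K = 2.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0], [8.5, 7.5]])
means = X[[0, 2]].copy()   # an arbitrary initial guess of the K = 2 cluster means

while True:
    # assign each point to the nearest cluster mean
    labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2), axis=1)
    new_means = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_means, means):   # stop when the means no longer change
        break
    means = new_means

print(labels, means, sep="\n")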
LOGICAL MODELS
Logical models can also be expressed as Tree models and Rule models
Logical models use a logical expression to divide the instance space into
segments and hence construct grouping models. A logical expression is an
expression that returns a Boolean value, i.e., a True or False outcome.
Once the data is grouped using a logical expression, the data is divided
into homogeneous groupings for the problem we are trying to solve. For
example, for a classification problem, all the instances in the group belong
to one class.
There are mainly two kinds of logical models: Tree models and Rule
models.
Tree models can be seen as a particular type of rule model where the if-
parts of the rules are organised in a tree structure. Both Tree models and
Rule models use the same approach to supervised learning. The approach
can be summarised in two strategies: we could first find the body of the
rule (the concept) that covers a sufficiently homogeneous set of examples
and then find a label to represent the body. Alternately, we could approach
it from the other direction, i.e., first select a class we want to learn and
then find rules that cover examples of the class.
The models of this type can be easily translated into rules that are
understandable by humans, such as: if Bonus = 1 then Class = Y = spam.
Such rules are easily organized in a tree structure, such as the one in
Figure 2.3, which is called a feature tree. The idea of such a tree is that
features are used to iteratively partition the instance space.
A feature list is a binary feature tree which always branches in the same
direction, either left or right. The tree in Figure 2.3 is a left-branching
feature list. Such feature lists can be written as nested if–then–else
statements that will be familiar to anyone with a bit of programming
experience. For instance, if we were to label the leaves in Figure 2.3 by
majority class, we obtain the following decision list as per the Rule
learning:
if bonus = 1 then Class = Y = spam
else if lottery = 1 then Class = Y = spam
else Class = Y = ham
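Written as nested if-then-else statements in Python, the decision list above becomes:

def classify(bonus: int, lottery: int) -> str:
    # bonus and lottery are Boolean features of an e-mail (0 or 1)
    if bonus == 1:
        return "spam"
    elif lottery == 1:
        return "spam"
    else:
        return "ham"

print(classify(bonus=1, lottery=0))   # spam
print(classify(bonus=0, lottery=0))   # ham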
Both tree learning and rule learning are implemented in a top-down fashion.
We select the feature from the instance space which best splits the training set
into subsets; each subset can then be split further, until finally each leaf node
contains instances of (mostly) one class. In tree learning, we follow this
divide-and-conquer approach.
In rule learning, we first write a rule based on some condition and then, step
by step, add more conditions to the rule using a set of examples from the
training dataset; the covered examples are then removed from the dataset and
the process is repeated until every example is assigned a class. Here, we
follow a separate-and-conquer approach.
if bonus = 1 then Class = Y = spam
if bonus = 0 ∧ lottery = 1 then Class = Y = spam
if bonus = 0 ∧ lottery = 0 then Class = Y = ham
Here, every path from root to a leaf is translated into a rule. As a result,
although rules from the same sub-tree share conditions (such as bonus=0),
every pair of rules will have at least some mutually exclusive conditions
(such as lottery = 1 in the second rule and lottery = 0 in the third).
However, this is not always the case: rules can have a certain overlap.
Before learning more on logical models let us understand the
terminologies – grouping and grading.
The logical rule learning system Progol found the following set of
conditions to predict whether a molecular compound is carcinogenic
(causes cancer):
1. it tests positive in the Salmonella assay; or
2. it tests positive for sex-linked recessive lethal mutation in Drosophila;
or
3. it tests negative for chromosome aberration; or
4. it has a carbon in a six-membered aromatic ring with a partial charge
of −0.13; or
5. it has a primary amine group and no secondary or tertiary amines; or
6. it has an aromatic (or resonant) hydrogen with partial charge ≥ 0.168;
or
7. it has a hydroxy oxygen with a partial charge ≥ −0.616 and an
aromatic (or resonant) hydrogen; or
8. it has a bromine; or
9. it has a tetrahedral carbon with a partial charge ≤ −0.144 and tests
positive on Progol's mutagenicity rules.
The first three conditions concerned certain tests that were carried out for
all molecules and whose results were recorded in the data as Boolean
features. In contrast, the remaining six rules all refer to the structure of the
molecule and were constructed entirely by Progol.
Another decision rule, based on likelihoods, is to predict spam if the
likelihood ratio is larger than 1 and ham otherwise.
So, which one should we use: posterior probabilities or likelihoods? As it
turns out, we can easily transform one into the other using Bayes‟ rule, a
simple property of conditional probabilities which states that
The first decision rule above suggested that we predict the class with
maximum posterior probability, which using Bayes‟ rule can be written in
terms of the likelihood function.
PROBABILISTIC MODELS
The third type of models are probabilistic in nature, like the Bayesian
classifier we considered earlier. Many of these models are based around
the following idea. Let X denote the variables we know about, e.g., our
instance's feature values; and let Y denote the target variables we're
interested in, e.g., the instance's class. The key question in machine
learning is how to model the relationship between X and Y.
Since X is known for a particular instance but Y may not be, we are
particularly interested in the conditional probabilities P(Y |X). For
instance, Y could indicate whether the e-mail is spam, and X could
indicate whether the e-mail contains the words 'bonus' and 'lottery'. The
probability of interest is then P(Y | bonus, lottery), with bonus and lottery
two Boolean variables which together constitute the feature vector X. For
a particular e-mail we know the feature values and so we might write
P(Y | bonus = 1, lottery = 0) if the e-mail contains the word 'bonus' but not
the word 'lottery'. This is called a posterior probability because it is used after
the features X are observed.
Table 2.1. An example posterior distribution. 'Bonus' and 'lottery' are two Boolean
features; Y is the class variable, with values 'spam' and 'ham'. In each row the most
likely class is indicated. Source: (FLACH, 2012)
Even though this example table is small, it will grow unfeasibly large very
quickly: with n Boolean variables, 2^n cases have to be distinguished. We
therefore don't normally have access to the full joint distribution and have
to approximate it using additional assumptions, as we will see below.
Assuming that X and Y are the only variables we know and care about, the
posterior distribution P(Y |X) helps us to answer many questions of
interest. For instance, to classify a new e-mail we determine whether the
words 'bonus' and 'lottery' occur in it, look up the corresponding
probability P(Y = spam | bonus, lottery), and predict spam if this
probability exceeds 0.5 and ham otherwise. Such a recipe to predict a
value of Y on the basis of the values of X and the posterior distribution
P(Y |X) is called a decision rule.
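A minimal sketch of this decision rule in Python, using a small hypothetical posterior table (the probabilities are made up for illustration; they are not the values from Table 2.1).

# Hypothetical posterior P(Y = spam | bonus, lottery) for each feature combination
posterior_spam = {
    (0, 0): 0.20,
    (0, 1): 0.70,
    (1, 0): 0.80,
    (1, 1): 0.95,
}

def decide(bonus: int, lottery: int) -> str:
    # predict spam if the posterior probability of spam exceeds 0.5, ham otherwise
    return "spam" if posterior_spam[(bonus, lottery)] > 0.5 else "ham"

print(decide(bonus=1, lottery=0))   # spam
print(decide(bonus=0, lottery=0))   # ham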
FEATURES
One could say that models lend the machine learning field
diversity, but tasks and features give it unity.
Figure 2.6. An overview of how machine learning is used to address a given task. A task
(upper box) requires an appropriate mapping – a model – from data described by features
to outputs. Obtaining such a mapping from training data is what constitutes a learning
problem (lower box). Source: (FLACH, 2012)
Mathematically, features are functions that map from the instance space to
some set of feature values, called the domain of the feature. Since
measurements are often numerical, the most common feature domain is
the set of real numbers. Other typical feature domains include the set of
integers, for instance when the feature counts something, such as the
number of occurrences of a particular word; the Booleans, if our feature is
a statement that can be true or false for a particular instance, such as 'this e-
mail is addressed to Beena Kapadia'; and arbitrary finite sets, such as a set
of colours, or a set of shapes.
The first two properties could be expressed by discrete features with three
and two values, respectively; or if the distinctions are more gradual, each
aspect could be rated on some numerical scale.
FEATURE TYPES
There are mainly three kinds of features – Quantitative, Ordinal and
Categorical.
Table 2.1. Kinds of features, their properties and allowable statistics. Each kind inherits
the statistics from the kinds above it in the table. For instance, the mode is a statistic of
central tendency that can be computed for any kind of feature. Source: (FLACH, 2012)
Quantitative:
They have a meaningful numerical scale and order, and most often involve a
mapping into the reals; such features are also called continuous. Even if a
feature maps into a subset of the reals, such as age expressed in years, the
various statistics such as the mean or standard deviation still require the full
scale of the reals.
Ordinal:
Features with an ordering but without scale are called ordinal features. The
domain of an ordinal feature is some totally ordered set, such as the set of
characters or strings. Even if the domain of a feature is the set of integers,
denoting the feature as ordinal means that we have to dispense with the
scale, as we did with house numbers. Another common example are
features that express a rank order: first, second, third, and so on. Ordinal
features allow the mode and median as central tendency statistics, and
quantiles as dispersion statistics.
Categorical:
Features without ordering or scale are called categorical features (or
sometimes 'nominal' features). They do not allow any statistical summary
except the mode. One subspecies of the categorical features is the Boolean
feature, which maps into the truth values true and false. The situation is
summarised in Table 2.1.
Now let's consider the naive Bayes classifier. We have seen that this
model works by estimating a likelihood function P(X|Y) for each feature
X given the class Y. For categorical and ordinal features with k values this
involves estimating P(X = v1 | Y), ..., P(X = vk | Y). In effect, ordinal
features are treated as categorical ones, ignoring the order.
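A minimal sketch of this treatment using scikit-learn's CategoricalNB, which estimates P(X = v | Y) for each value of each (integer-encoded) categorical feature; the data is invented for illustration.

from sklearn.naive_bayes import CategoricalNB

# Two categorical features, each encoded as integers 0..k-1
X = [[0, 1], [1, 1], [2, 0], [0, 0], [1, 0]]
y = ["spam", "spam", "ham", "ham", "ham"]

clf = CategoricalNB().fit(X, y)   # estimates per-value likelihoods for each feature and class
print(clf.predict([[0, 1]]))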
In a similar vein, for ordinal features we can count the number of values
between two feature values (if we encode the ordinal feature by means of
integers, this would simply be their difference). This means that distance-
based methods can accommodate all feature types by using an appropriate
distance metric. Similar techniques can be used to extend support vector
machines and other kernel-based methods to categorical and ordinal
features.
If the task is to distinguish between grammatical and ungrammatical
sentences, word order is clearly signal rather than noise, and a different
representation is called for.
FEATURE SELECTION
Once we have constructed new features it is often a good idea to select a
suitable subset of them prior to learning. Not only will this speed up
learning as fewer candidate features need to be considered, it also helps to
guard against overfitting.
(FLACH, 2012)
There are two main approaches to feature selection: the filter approach and
the wrapper approach.
The filter approach scores the features on a particular metric and the top-
scoring features are selected. Many of the metrics we have seen so far can
be used for feature scoring, including information gain, the χ² statistic and the
correlation coefficient, to name just a few.
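A minimal sketch of the filter approach with scikit-learn's SelectKBest, scoring features by the χ² statistic; the choice of dataset and k is illustrative.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(selector.scores_)                     # one score per feature
print(selector.get_support(indices=True))   # indices of the two top-scoring features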
To detect features that are useful in the context of other features, we need
to evaluate sets of features; this usually goes under the name of wrapper
approaches. The idea is that feature selection is 'wrapped' in a search
procedure that usually involves training and evaluating a model with a
candidate set of features.
Forward selection methods start with an empty set of features and add
features to the set one at a time, as long as they improve the performance
of the model. Backward elimination starts with the full set of features and
aims at improving performance by removing features one at a time. Since
there are an exponential number of subsets of features it is usually not
feasible to search all possible subsets, and most approaches apply a
'greedy' search algorithm that never reconsiders the choices it makes.
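A minimal sketch of a greedy forward-selection wrapper using scikit-learn's SequentialFeatureSelector; the estimator, dataset and number of features to select are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),   # the model whose cross-validated score guides the search
    n_features_to_select=2,
    direction="forward",                 # start empty and add one feature at a time
).fit(X, y)
print(selector.get_support(indices=True))   # the selected feature indices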
SUMMARY
After studying this chapter, you will understand different kinds of models,
such as geometric models, logical models and probabilistic models. You will
understand how features are used and why they are so important in model
design. You will also understand the different feature types, how features can
be constructed, why their transformation is required and how it can be done,
and how feature selection plays an important role in designing a model and
how to carry it out.
UNIT END QUESTIONS
7. Why are feature construction and feature transformation required?
How to achieve them?
8. What are the approaches to feature selection? Explain each one in
detail.