
INTRODUCTION TO MACHINE LEARNING
Unit Structure
Introduction
Machine learning
Examples of Machine Learning Problems
Structure of Learning
Learning versus Designing
Training versus Testing
Characteristics of Machine learning tasks
Predictive and descriptive tasks
Summary
Unit End Questions
References

INTRODUCTION

A human child learns new things and uncovers the structure of their world
year by year as they grow to adulthood. A child's brain and senses
perceive the facts of their surroundings and gradually learn the hidden
patterns of life which help the child to craft logical rules to identify
learned patterns. The learning process of the human brain makes humans
the most sophisticated living creatures in the world. Learning continuously
by discovering hidden patterns and then innovating on those patterns
enables us to make ourselves better and better throughout our lifetime.
Superficially, we can draw some motivational similarities between the
learning process of the human brain and the concepts of machine learning.
(jlooper, n.d.)

The human brain perceives things from the real world, processes the
perceived information, makes rational decisions, and performs certain
actions based on circumstances. When we program a replica of this
intelligent behavioural process into a machine, it is called artificial
intelligence (AI).

Machine learning (ML) is an important subset of artificial intelligence. ML is concerned with using specialized algorithms to uncover meaningful information and find hidden patterns from perceived data to support the logical decision-making process.

MACHINE LEARNING

Machine learning, from a systems perspective, is defined as the creation of automated systems that can learn hidden patterns from data to aid in making intelligent decisions.

This motivation is loosely inspired by how the human brain learns certain
things based on the data it perceives from the outside world. Machine
learning is the systematic study of algorithms and systems that improve
their knowledge or performance with experience.

A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps to build a better model, which predicts the output more accurately.

Suppose we have a complex problem in which we need to make some predictions. Instead of writing code for it, we just feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output.

Two definitions of Machine Learning are offered.

Arthur Samuel described it as: "The field of study that gives computers
the ability to learn from data without being explicitly programmed." This
is an older, informal definition.

Tom Mitchell provides a more modern definition. According to him, "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Example: playing checkers.


T = the task of playing checkers.
P = the probability that the program will win the next game.
E = the experience of playing many games of checkers

Let us now understand Supervised Machine Learning, Unsupervised Machine Learning and Reinforcement Learning:

Supervised Machine Learning:

Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is the process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.

Supervised learning can be further divided into two types of problems:

1. Regression:
Regression algorithms are used if there is a relationship between the input
variable and the output variable. It is used for the prediction of continuous
variables, such as Weather forecasting, Market Trends, etc. Linear
Regression, Regression Trees, Non-Linear Regression, Bayesian Linear
Regression, Polynomial Regression are some popular Regression
algorithms which come under supervised learning.

2. Classification:
Classification algorithms are used when the output variable is categorical, meaning there are two or more classes such as Yes/No, Male/Female, True/False, etc. Spam filtering is a typical classification task; Random Forest, Decision Trees, Logistic Regression and Support Vector Machines are some popular classification algorithms.
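
As a rough illustration of these two task types (not from the unit text), the following sketch fits one regression model and one classification model with scikit-learn on small synthetic datasets:

# A minimal sketch of a regression task and a classification task.
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Regression: predict a continuous output variable y from inputs X.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("Regression R^2:", reg.score(X_test, y_test))

# Classification: predict a categorical output variable (two classes here).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))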

Unsupervised Machine Learning:

There may be many cases in which we do not have labelled data and need to find the hidden patterns in a given dataset. To solve such cases, we need unsupervised learning techniques. Unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning which takes place in the human brain while learning new things.

Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.

The unsupervised learning algorithm can be further categorized into two types of problems:
1. Clustering
2. Association

1. Clustering:
Clustering is a method of grouping objects into clusters such that the objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.

2. Association:
An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rule mining is Market Basket Analysis.

K-means clustering, KNN (k-nearest neighbours), hierarchical clustering, anomaly detection and neural networks are examples of unsupervised learning; a small clustering sketch follows.
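
A minimal clustering sketch, assuming scikit-learn's KMeans and synthetic unlabelled data, to illustrate grouping by similarity without any output labels:

# K-means groups unlabelled points into clusters by similarity.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # true labels ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments of first 10 points:", kmeans.labels_[:10])
print("Cluster centres:")
print(kmeans.cluster_centers_)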

Reinforcement Learning:
Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty. In Reinforcement Learning, unlike supervised learning, the agent learns automatically from feedback without any labelled data.

Since there is no labelled data, the agent is bound to learn from its experience alone. RL solves a specific type of problem in which decision making is sequential and the goal is long-term, such as game playing, robotics, etc. The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.

The agent learns by a process of trial and error, and based on that experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it". How a robotic dog learns the movement of its limbs is an example of Reinforcement learning; a toy sketch of this idea follows.
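
The following toy sketch is entirely hypothetical — a five-cell corridor environment with a reward only at the goal cell, learned with tabular Q-learning — and is meant only to illustrate the act / feedback / update loop described above:

import numpy as np

# Hypothetical environment: cells 0..4, reward +1 only on reaching cell 4.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(100):
    state = 0
    while state != 4:                           # episode ends at the goal cell
        if rng.random() < epsilon:              # occasionally explore
            action = int(rng.integers(n_actions))
        else:                                   # otherwise act greedily
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: nudge Q towards reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.round(2))   # the 'right' action should end up with the higher values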

EXAMPLES OF MACHINE LEARNING PROBLEMS:

Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily lives, often without knowing it, in products such as Google Maps, Google Assistant, Alexa, etc. Below are some of the most popular real-world applications of Machine Learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, etc. in digital images. A popular use case of image recognition and face detection is automatic friend-tagging suggestions:

Facebook provides a feature of automatic friend-tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names; the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.

2. Speech Recognition:
While using Google, we get an option to "Search by voice"; this comes under speech recognition and is a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition". At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:

 Real-time location of the vehicle from the Google Maps app and sensors
 Average time taken on past days at the same time of day.

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies, such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser; this is because of machine learning.

Google understands the user's interest using various machine learning algorithms and suggests products as per that interest. Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc.; this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a well-known car manufacturer, is working on self-driving cars, using machine learning methods to train car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box; the technology behind this is machine learning. Below are some spam filters used by Gmail:
 Content Filter
 Header filter
 General blacklists filter
 Rules-based filters
 Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier are used for email spam filtering and malware detection.
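
As a rough sketch of this idea (the example messages and labels below are made up purely for illustration), a Naïve Bayes spam filter can be assembled in a few lines of scikit-learn:

# Text is turned into word counts, then a Naïve Bayes classifier is trained.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a bonus lottery prize now",       # spam
    "claim your free bonus today",         # spam
    "meeting agenda for tomorrow",         # ham
    "please review the attached report",   # ham
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free lottery bonus"]))       # expected: ['spam']
print(model.predict(["report for the meeting"]))   # expected: ['ham']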

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just through voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part. They record our voice instructions, send them to a server in the cloud, decode them using ML algorithms and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, fraud can occur in various ways, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. For each genuine transaction there is a specific pattern, which changes for a fraudulent transaction; the network therefore detects the fraud and makes our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. This helps in finding brain tumours and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into a language familiar to us, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition and translates the text from one language to another language.

STRUCTURE OF LEARNING

In machine learning, patterns are a manifestation of underlying structure in the data. Sometimes this structure takes the form of a single hidden or latent variable, much like unobservable but nevertheless explanatory quantities in physics, such as energy. Consider the following matrix:

[6 × 4 rating matrix omitted; see (FLACH, 2012)]

Imagine these represent ratings by six different people (in rows), on a scale of 0 to 3, of four different films – say Khosla Ka Ghosla (KG), Drishyam (D), BADLA (B), Hera Phery (HP), (in columns, from left to right). BADLA (B) seems to be the most popular of the four with an average rating of 1.5, and Khosla Ka Ghosla (KG) is the least appreciated with an average rating of 0.5. Try to find a structure in this matrix. (FLACH, 2012)

Try to look for columns or rows that are combinations of other columns or rows. For instance, the third column turns out to be the sum of the first and second columns. Similarly, the fourth row is the sum of the first and second rows. What this means is that the fourth person combines the ratings of the first and second person. Similarly, BADLA (B)'s ratings are the sum of the ratings of the first two films. This is made more explicit by writing the matrix as the following product:

[matrix factorization omitted; see (FLACH, 2012)]

Notice that the first and third matrix on the right-hand side are now
Boolean, and the middle one is diagonal (all off-diagonal entries are zero).
Moreover, these matrices have a very natural interpretation in terms of
film genres.

The right-most matrix associates films (in columns) with genres (in rows):
Khosla Ka Ghosla (KG) and Drishyam (D) belong to two different genres,
say drama and crime, BADLA (B) belongs to both, and Hera Phery (HP)
is a crime film and also introduces a new genre (say comedy).

The tall, 6-by-3 matrix then expresses people's preferences in terms of genres: the first, fourth and fifth person like drama, the second, fourth and fifth person like crime films, and the third, fifth and sixth person like comedies. Finally, the middle matrix states that the crime genre is twice as important as the other two genres in terms of determining people's preferences.
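
Since the matrices themselves are not reproduced above, the NumPy sketch below uses one rating matrix that is merely consistent with the description (column B = KG + D, row 4 = row 1 + row 2, average 1.5 for B and 0.5 for KG) and checks that it factorizes into the three matrices just described:

import numpy as np

# A rating matrix consistent with the description above (not the original figure).
# Rows: six people; columns: KG, D, B, HP; ratings on a 0-3 scale.
M = np.array([[1, 0, 1, 0],
              [0, 2, 2, 2],
              [0, 0, 0, 1],
              [1, 2, 3, 2],
              [1, 2, 3, 3],
              [0, 0, 0, 1]])

# People x genres (drama, crime, comedy), Boolean.
U = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
# Diagonal genre weights: crime counts twice as much as drama and comedy.
W = np.diag([1, 2, 1])
# Genres x films, Boolean: KG = drama, D = crime, B = both, HP = crime + comedy.
V = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 0, 1]])

print(np.array_equal(U @ W @ V, M))   # True: the product reconstructs the ratings
print(M.mean(axis=0))                  # column averages: 0.5 for KG, 1.5 for B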

LEARNING VERSUS DESIGNING

According to Arthur Samuel, "Machine Learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed."
(https://www.tutorialspoint.com/, n.d.)

In simple words, when we feed the training data to a machine learning algorithm, the algorithm produces a mathematical model, and with the help of that model the machine makes predictions and takes decisions without being explicitly programmed, as shown in figure 1.1. Also, the more the machine works with the training data, the more experience it gains, and the more experience it gains, the more efficient the results it produces.

Training Data → Machine Learning Algorithm → Building Logical/Mathematical Model → Output

Figure 1.1: Learn from data

Example: In a driverless car, the training data fed to the algorithm covers how to drive the car on highways and on busy and narrow streets, with factors like speed limits, parking, stopping at signals, etc. After that, a logical and mathematical model is created based on this, and the car then works according to the logical model. Also, the more data is fed, the more efficient the output produced.

Designing a Learning System in Machine Learning:

According to Tom Mitchell, "A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E."

Example: In Spam E-Mail detection,

Task, T: To classify mails as Spam or Not Spam.
Performance measure, P: The percentage of mails correctly classified as "Spam" or "Not Spam".
Experience, E: A set of mails labelled "Spam" or "Not Spam".


The steps for designing a learning system are shown in figure 1.2 below:

Step 1) Choosing the Training Experience: The first and very important task is to choose the training data or training experience that will be fed to the machine learning algorithm. It is important to note that the data or experience we feed to the algorithm has a significant impact on the success or failure of the model, so the training data or experience should be chosen wisely.

Choosing the Training Experience

Choosing target function

Choosing Representation for Target function

Choosing Function Approximation Algorithm

Final Design

Figure 1.2: Steps for Designing Learning System

Below are the attributes of the training experience that impact the success or failure of the model:

The training experience should be able to provide direct or indirect feedback regarding choices. For example, while playing chess, the training experience provides feedback such as: if this move is chosen instead of that one, the chances of success increase.

The second important attribute is the degree to which the learner controls the sequence of training examples. For example, when training data is first fed to the machine, accuracy is very low, but as the machine gains experience by playing again and again against itself or an opponent, the algorithm receives feedback and controls the chess game accordingly.

The third important attribute is how well the training experience represents the distribution of examples over which performance will be measured. A machine learning algorithm gains experience by going through a number of different cases and examples; thus, by passing through more and more examples, it gains more and more experience and its performance increases.

10
Step 2) Choosing the target function: The next important step is choosing the target function. This means that, according to the knowledge fed to the algorithm, the machine learning system will choose a NextMove function that describes which legal move should be taken. For example, while playing chess with an opponent, when the opponent plays, the machine learning algorithm decides which of the possible legal moves to take in order to succeed.

Step 3) Choosing a Representation for the Target function: Once the algorithm knows all the possible legal moves, the next step is to choose a representation for the optimized move, e.g., linear equations, a hierarchical graph representation, tabular form, etc. Using this representation, the NextMove function selects, out of the candidate moves, the one that offers the highest success rate. For example, if the chess-playing machine has 4 possible moves, it chooses the optimized move that leads to success.

Step 4) Choosing a Function Approximation Algorithm: An optimized move cannot be chosen from the training data alone. The system has to work through a set of examples, and from these examples it approximates which moves should be chosen; the machine then receives feedback on them. For example, when the training data for playing chess is fed to the algorithm, the machine may initially fail or succeed, and from that failure or success it estimates, for the next move, which step should be chosen and what its success rate is.

Step 5) Final Design: The final design is created at last, when the system has gone through a number of examples, failures and successes, and correct and incorrect decisions, and has learned what the next step should be. Example: Deep Blue, the intelligent computer that won a chess match against the chess expert Garry Kasparov, became the first computer to beat a human world chess champion.

TRAINING VERSUS TESTING

Training data and test data are two important concepts in machine
learning.

Training Data:
The observations in the training set form the experience that the algorithm
uses to learn. In supervised learning problems, each observation consists
of an observed output variable and one or more observed input variables.

Test Data:
The test set is a set of observations used to evaluate the performance of the
model using some performance metric. It is important that no observations
from the training set are included in the test set. If the test set does contain
examples from the training set, it will be difficult to assess whether the

algorithm has learned to generalize from the training set or has simply
memorized it.

A program that generalizes well will be able to effectively perform a task with new data. In contrast, a program that memorizes the training data by learning an overly complex model could predict the values of the response variable for the training set accurately but will fail to predict the value of the response variable for new examples.

Memorizing the training set is called over-fitting. A program that memorizes its observations may not perform its task well, as it could memorize relations and structures that are noise or coincidence. Balancing memorization and generalization, or over-fitting and under-fitting, is a problem common to many machine learning algorithms. Regularization may be applied to many models to reduce over-fitting.

In addition to the training and test data, a third set of observations, called
a validation or hold-out set, is sometimes required. The validation set is
used to tune variables called hyperparameters, which control how the
model is learned. The program is still evaluated on the test set to provide
an estimate of its performance in the real world; its performance on the
validation set should not be used as an estimate of the model's real-world
performance since the program has been tuned specifically to the
validation data.

It is common to partition a single set of supervised observations into training, validation, and test sets. There are no requirements for the sizes of the partitions, and they may vary according to the amount of data available. It is common to allocate 50 percent or more of the data to the training set, 25 percent to the test set, and the remainder to the validation set.
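
A quick sketch of such a 50/25/25 partition, assuming scikit-learn's train_test_split and a synthetic dataset (the proportions are a common convention, not a requirement):

# Split once for the training set, then split the remainder into test and validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, train_size=0.5, random_state=0)

print(len(X_train), len(X_test), len(X_val))   # 500 250 250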

Some training sets may contain only a few hundred observations; others
may include millions. Inexpensive storage, increased network
connectivity, the ubiquity of sensor-packed smartphones, and shifting
attitudes towards privacy have contributed to the contemporary state of big
data, or training sets with millions or billions of examples.

However, machine learning algorithms also follow the maxim "garbage in,
garbage out." A student who studies for a test by reading a large,
confusing textbook that contains many errors will likely not score better
than a student who reads a short but well-written textbook. Similarly, an
algorithm trained on a large collection of noisy, irrelevant, or incorrectly
labelled data will not perform better than an algorithm trained on a smaller
set of data that is more representative of problems in the real world.

Many supervised training sets are prepared manually, or by semi-automated processes. Creating a large collection of supervised data can be costly in some domains. Fortunately, several datasets are bundled with scikit-learn, allowing developers to focus on experimenting with models instead.

During development, and particularly when training data is scarce, a practice called cross-validation can be used to train and validate an algorithm on the same data. In cross-validation, the training data is partitioned. The algorithm is trained using all but one of the partitions and tested on the remaining partition. The partitions are then rotated several times so that the algorithm is trained and evaluated on all of the data.

Consider for example that the original dataset is partitioned into five
subsets of equal size, labelled A through E. Initially, the model is trained
on partitions B through E, and tested on partition A. In the next iteration,
the model is trained on partitions A, C, D, and E, and tested on partition B.
The partitions are rotated until models have been trained and tested on all
of the partitions. Cross-validation provides a more accurate estimate of the
model's performance than testing a single partition of the data.
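
The five-partition rotation just described can be sketched with scikit-learn's KFold; the dataset and model below are illustrative choices, not prescribed by the text:

# Five folds: train on four partitions, test on the fifth, then rotate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # partitions A..E
scores = cross_val_score(model, X, y, cv=cv)
print(scores, scores.mean())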

CHARACTERISTICS OF MACHINE LEARNING TASKS

To understand the actual power of machine learning, we must consider the characteristics of this technology. There are lots of examples that echo the characteristics of machine learning in today's data-rich world. Here are seven key characteristics of machine learning for which companies should prefer it over other technologies:
1. The ability to perform automated data visualization
2. Automation at its best
3. Customer engagement like never before
4. The ability to take efficiency to the next level when merged with IoT
5. The ability to change the mortgage market
6. Accurate data analysis
7. Business intelligence at its best

1. The ability to perform automated data visualization:


A massive amount of data is being generated by businesses and common
people on a regular basis. By visualizing notable relationships in data,
businesses can not only make better decisions but build confidence as
well. Machine learning offers several tools that provide rich snippets of
data which can be applied to both unstructured and structured data. With
the help of user-friendly automated data visualization platforms in
machine learning, businesses can obtain a wealth of new insights to
increase productivity in their processes.

2. Automation at its best:

import → process → visualize → model → evaluate

Figure 1.3 Machine Learning workflow

Figure 1.3 shows a machine learning workflow. One of the biggest characteristics of machine learning is its ability to automate repetitive tasks, thus increasing productivity. A huge number of organizations are already using machine learning-powered paperwork and email automation. In the financial sector, for example, a huge number of repetitive, data-heavy and predictable tasks need to be performed. Because of this, the sector uses different types of machine learning solutions to a great extent, which make accounting tasks faster, more insightful, and more accurate. Some aspects that have already been addressed by machine learning include answering financial queries with the help of chatbots, making predictions, managing expenses, simplifying invoicing, and automating bank reconciliations.

3. Customer engagement like never before:


For any business, one of the most crucial ways to drive engagement,
promote brand loyalty and establish long-lasting customer relationships is
by triggering meaningful conversations with its target customer base.
Machine learning plays a critical role in enabling businesses and brands to
spark more valuable conversations in terms of customer engagement. The
technology analyzes particular phrases, words, sentences, idioms, and
content formats which resonate with certain audience members. We can
think of Pinterest which is successfully using machine learning to
personalize suggestions to its users. It uses the technology to source
content in which users will be interested, based on objects which they have
pinned already.

4. The ability to take efficiency to the next level when merged with
IoT:
IoT is being designated as a strategically significant area by many
companies. And many others have launched pilot projects to gauge the
potential of IoT in the context of business operations. But attaining
financial benefits through IoT isn't easy. In order to achieve success,
companies, which are offering IoT consulting services and platforms, need
to clearly determine the areas that will change with the implementation of
IoT strategies. Many of these businesses have failed to address it. In this
scenario, machine learning is probably the best technology that can be
used to attain higher levels of efficiency. By merging machine learning
with IoT, businesses can boost the efficiency of their entire production
processes.

14
5. The ability to change the mortgage market:
It's a fact that fostering a positive credit score usually takes discipline,
time, and lots of financial planning for a lot of consumers. When it comes
to the lenders, the consumer credit score is one of the biggest measures of
creditworthiness that involve a number of factors including payment
history, total debt, length of credit history etc. But wouldn't it be great if
there is a simplified and better measure? With the help of machine
learning, lenders can now obtain a more comprehensive consumer picture.
They can now predict whether the customer is a low spender or a high
spender and understand his/her tipping point of spending. Apart from
mortgage lending, financial institutions are using the same techniques for
other types of consumer loans.

6. Accurate data analysis:

Traditionally, data analysis has always involved trial and error, an approach that becomes impossible when we are working with large and heterogeneous datasets. Machine learning comes as the best solution to all these issues by offering effective alternatives for analyzing massive volumes of data. By developing efficient and fast algorithms, as well as data-driven models for processing data in real time, machine learning is able to generate accurate analyses and results.

7. Business intelligence at its best:


Machine learning characteristics, when merged with big data analytical
work, can generate extreme levels of business intelligence with the help of
which several different industries are making strategic initiatives. From
retail to financial services to healthcare, and many more – machine
learning has already become one of the most effective technologies to
boost business operations.

PREDICTIVE AND DESCRIPTIVE TASKS

In a similar fashion to the distinction between supervised learning from labelled data and unsupervised learning from unlabelled data, we can draw a distinction based on whether the model output involves the target variable or not: we call it a predictive model if it does, and a descriptive model if it does not. This leads to the four different machine learning settings summarised in Table 1.1.

                          Predictive model               Descriptive model
Supervised learning       classification, regression     subgroup discovery
Unsupervised learning     predictive clustering          descriptive clustering,
                                                         association rule discovery

Table 1.1. An overview of different machine learning settings. (FLACH, 2012)

The rows refer to whether the training data is labelled with a target
variable, while the columns indicate whether the models learned are used
to predict a target variable or rather describe the given data.

Table 1.1 indicates the following points:


 The most common setting is supervised learning of predictive models –
in fact, this is what people commonly mean when they refer to
supervised learning. Typical tasks are classification and regression.
 It is also possible to use labelled training data to build a descriptive
model that is not primarily intended to predict the target variable, but
instead identifies, say, subsets of the data that behave differently with
respect to the target variable. This example of supervised learning of a
descriptive model is called subgroup discovery.
 Descriptive models can naturally be learned in an unsupervised setting,
and we have just seen a few examples of that (clustering, association
rule discovery and matrix decomposition). This is often the implied
setting when people talk about unsupervised learning.
 A typical example of unsupervised learning of a predictive model
occurs when we cluster data with the intention of using the clusters to
assign class labels to new data. We will call this predictive clustering to
distinguish it from the previous, descriptive form of clustering.

SUMMARY

This chapter gives a brief introduction to Machine Learning. After studying this chapter, you will know the definition of machine learning; what supervised, unsupervised and reinforcement learning are; applications of machine learning; how a pattern can be found in data; what training data and test data are; and predictive and descriptive tasks with respect to supervised and unsupervised learning.

UNIT END QUESTIONS

1. Define and explain Machine Learning. Also explain its examples in brief.
2. Explain supervised learning and unsupervised learning in detail.
3. Write a short note on learning versus designing.
4. Explain training data and test data in detail.
5. What are the characteristics of machine learning tasks? Explain each
one in brief.
6. What are predictive and descriptive tasks? Explain with respect to
supervised and unsupervised learning.

MACHINE LEARNING MODELS
Unit Structure
Introduction
Geometric Models
Logical Models
Probabilistic Models
Features
Feature types
Feature Construction and Transformation
Feature Selection
Summary
Unit End Questions
References

INTRODUCTION
Models form the central concept in machine learning as they are what is
being learned from the data, in order to solve a given task. There is a
considerable – not to say be wildering – range of machine learning models
to choose from. One reason for this is the ubiquity of the tasks that
machine learning aims to solve: classification, regression, clustering,
association discovery, to name but a few. Examples of each of these tasks
can be found in virtually every branch of science and engineering.
Mathematicians, engineers, psychologists, computer scientists and many
others have discovered – and sometimes rediscovered – ways to solve
these tasks. They have all brought their specific background to bear, and
consequently the principles underlying these models are also diverse. My
personal view is that this diversity is a good thing as it helps to make
machine learning the powerful and exciting discipline it is. It doesn't,
however, make the task of writing a machine learning book any easier!
Luckily, a few common themes can be observed, which allow me to
discuss machine learning models in a somewhat more systematic way. We
will discuss three groups of models: geometric models, probabilistic
models, and logical models. These groupings are not meant to be mutually
exclusive, and sometimes a particular kind of model has, for instance, both
a geometric and a probabilistic interpretation. Nevertheless, it provides a
good starting point for our purposes.

GEOMETRIC MODELS

The instance space is the set of all possible or describable instances, whether they are present in our data set or not. Usually this set has some geometric structure. For instance, if all features are numerical, then we can use each feature as a coordinate in a Cartesian coordinate system. A
geometric model is constructed directly in instance space, using geometric
concepts such as lines, planes and distances. For instance, the linear
classifier depicted in Figure 1 on p.5 is a geometric classifier. One main
advantage of geometric classifiers is that they are easy to visualise, as long
as we keep to two or three dimensions. It is important to keep in mind,
though, that a Cartesian instance space has as many coordinates as there
are features, which can be tens, hundreds, thousands, or even more. Such
high-dimensional spaces are hard to imagine but are nevertheless very
common in machine learning. Geometric concepts that potentially apply to
high-dimensional spaces are usually prefixed with 'hyper-': for instance, a
decision boundary in an unspecified number of dimensions is called a
hyperplane.

If there exists a linear decision boundary separating the two classes, we say that the data is linearly separable. As we have seen, a linear decision boundary is defined by the equation w·x = t, where w is a vector perpendicular to the decision boundary, x points to an arbitrary point on the decision boundary, and t is the decision threshold. A good way to think of the vector w is as pointing from the 'centre of mass' of the negative examples, n, to the centre of mass of the positives, p. In other words, w is proportional (or equal) to p − n. One way to calculate these centres of mass is by averaging. For instance, if P is the set of positive examples, then we can define p = (1/|P|) ∑x∈P x, and similarly n as the average of the negative examples. By setting the decision threshold appropriately, we can intersect the line from n to p half-way (Figure 2.1).

Source: (FLACH, 2012)
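
The construction just described — take the two centres of mass, set w = p − n, and place the threshold half-way between them — can be sketched in a few lines of NumPy on synthetic data:

import numpy as np

# Synthetic positive and negative examples (illustrative only).
rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
neg = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))

p = pos.mean(axis=0)          # centre of mass of the positives
n = neg.mean(axis=0)          # centre of mass of the negatives
w = p - n                     # vector perpendicular to the decision boundary
t = w @ ((p + n) / 2)         # threshold: boundary crosses the p-n line half-way

def classify(x):
    """Predict +1 if w·x exceeds the threshold t, otherwise -1."""
    return 1 if w @ x > t else -1

print(classify([2.1, 1.9]), classify([-0.2, 0.1]))   # expected: 1 -1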

We will call this the basic linear classifier. It has the advantage of simplicity, being defined in terms of addition, subtraction and rescaling of examples only (in other words, w is a linear combination of the examples).
However, if those assumptions do not hold, the basic linear classifier can
perform poorly – for instance, note that it may not perfectly separate the
positives from the negatives, even if the data is linearly separable. Because
data is usually noisy, linear separability doesn't occur very often in
practice, unless the data is very sparse, as in text classification. Recall that
we used a large vocabulary, say 10 000 words, each word corresponding
to a Boolean feature indicating whether or not that word occurs in the
document. This means that the instance space has 10 000 dimensions, but
for any one document no more than a small percentage of the features will
be non-zero. As a result there is much 'empty space' between instances,
which increases the possibility of linear separability. However, because
linearly separable data doesn't uniquely define a decision boundary, we
are now faced with a problem: which of the infinitely many decision
boundaries should we choose? One natural option is to prefer large margin
classifiers, where the margin of a linear classifier is the distance between
the decision boundary and the closest instance. Support vector machines
are a powerful kind of linear classifier that find a decision boundary whose
margin is as large as possible (Figure 2.2).

Source: (FLACH, 2012)

A very useful geometric concept in machine learning is the distance. If the distance between two instances is small then the instances are similar in terms of their feature values, and so nearby instances would be expected to receive the same classification or belong to the same cluster.

In a Cartesian coordinate system, distance can be measured by Euclidean distance, which is the square root of the sum of the squared differences along each coordinate: √((x₁ − y₁)² + (x₂ − y₂)²) in two dimensions. In n dimensions, the general formula is √(∑ᵢ (xᵢ − yᵢ)²).

Nearest-neighbour classifier:
A very simple distance-based classifier works as follows: to classify a new instance, we retrieve from memory the most similar training instance (i.e., the training instance with the smallest Euclidean distance from the instance to be classified), and simply assign that training instance's class. This classifier is known as the nearest-neighbour classifier.
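
A from-scratch sketch of this rule, using an invented two-class training set:

import numpy as np

# Tiny illustrative training set: two points per class.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])

def nearest_neighbour(x):
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    return y_train[np.argmin(distances)]                   # class of the closest instance

print(nearest_neighbour(np.array([0.9, 1.1])))  # "A"
print(nearest_neighbour(np.array([4.9, 5.1])))  # "B"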

Suppose we want to cluster our data into K clusters, and we have an initial
guess of how the data should be clustered. We then calculate the means of
each initial cluster and reassign each point to the nearest cluster mean.
Unless our initial guess was a lucky one, this will have changed some of
the clusters, so we repeat these two steps (calculating the cluster means
and reassigning points to clusters) until no change occurs.

It remains to be decided how we construct our initial guess. This is usually done randomly: either by randomly partitioning the data set into K 'clusters' or by randomly guessing K 'cluster centres'. Instead of Euclidean distance, which may not behave well in the presence of outliers, other distances can be used, such as Manhattan distance, which sums the absolute differences along each coordinate: ∑ᵢ |xᵢ − yᵢ|.
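
A minimal sketch of this procedure (random initial partition, then alternately compute cluster means and reassign points until nothing changes), on synthetic two-dimensional data; it assumes no cluster ever becomes empty:

import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    assignments = rng.integers(k, size=len(X))      # random initial "clusters"
    while True:
        # Mean of each current cluster (assumes every cluster keeps at least one point).
        means = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)   # reassign to the nearest mean
        if np.array_equal(new_assignments, assignments):
            return assignments, means                # no change: stop
        assignments = new_assignments

# Two well-separated synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)), rng.normal(3.0, 0.3, size=(30, 2))])
labels, centres = kmeans(X, k=2)
print(centres)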

LOGICAL MODELS

For a given problem, the collection of all possible outcomes represents the sample space or instance space. The basic idea for creating a taxonomy of algorithms is that we divide the instance space in one of three ways:
 Using a Logical expression.
 Using the Geometry of the instance space.
 Using Probability to classify the instance space.

The outcome of the transformation of the instance space by a machine learning algorithm using the above techniques should be exhaustive (cover all possible outcomes) and mutually exclusive (non-overlapping).

Logical models can also be expressed as Tree models and Rule models

Logical models use a logical expression to divide the instance space into
segments and hence construct grouping models. A logical expression is an
expression that returns a Boolean value, i.e., a True or False outcome.
Once the data is grouped using a logical expression, the data is divided
into homogeneous groupings for the problem we are trying to solve. For
example, for a classification problem, all the instances in the group belong
to one class.

There are mainly two kinds of logical models: Tree models and Rule
models.

Rule models consist of a collection of implications or IF-THEN rules. For tree-based models, the 'if-part' defines a segment and the 'then-part' defines the behaviour of the model for this segment. Rule models follow the same reasoning.

Tree models can be seen as a particular type of rule model where the if-
parts of the rules are organised in a tree structure. Both Tree models and
Rule models use the same approach to supervised learning. The approach
can be summarised in two strategies: we could first find the body of the
rule (the concept) that covers a sufficiently homogeneous set of examples
and then find a label to represent the body. Alternately, we could approach
it from the other direction, i.e., first select a class we want to learn and
then find rules that cover examples of the class.

The models of this type can be easily translated into rules that are understandable by humans, such as 'if bonus = 1 then Class = Y = spam'. Such rules are easily organized in a tree structure, such as the one in Figure 2.3, which is called a feature tree. The idea of such a tree is that features are used to iteratively partition the instance space.

Source: (FLACH, 2012)

The leaves of the tree therefore correspond to rectangular areas in the instance space, which we will call instance space segments, or segments for short. Depending on the task we are solving, we can then label the leaves with a class, a probability, a real value, and so on. Feature trees whose leaves are labelled with classes are commonly called decision trees. A complete feature tree, which contains all features, one at each level of the tree, is shown in figure 2.4.

A feature list is a binary feature tree which always branches in the same
direction, either left or right. The tree in Figure 2.3 is a left-branching
feature list. Such feature lists can be written as nested if–then–else
statements that will be familiar to anyone with a bit of programming
experience. For instance, if we were to label the leaves in Figure 2.3 by
majority class, we obtain the following decision list as per the Rule
learning:
if bonus = 1 then Class = Y = spam
else if lottery = 1 then Class = Y = spam
else Class = Y = ham
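
Written out as code, this decision list is just a nested if-then-else; a small sketch:

# The decision list above as a Python function over the two Boolean features.
def classify_email(bonus: int, lottery: int) -> str:
    if bonus == 1:
        return "spam"
    elif lottery == 1:
        return "spam"
    else:
        return "ham"

print(classify_email(bonus=1, lottery=0))  # spam
print(classify_email(bonus=0, lottery=1))  # spam
print(classify_email(bonus=0, lottery=0))  # ham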

Both tree learning and rule learning are implemented in a top-down fashion. We select the feature of the instance space that best splits the entire training set into a number of subsets; each subset can then be split further, until finally each leaf node is associated with a class. In tree learning, we follow a divide-and-conquer approach.

Source: (FLACH, 2012)

In rule-based learning, we first write a rule based on some condition and then, step by step, add more conditions to the rule using a set of examples from the training dataset; the covered examples are then removed from the dataset. In this way we ultimately find the class for each feature combination. Here, we follow a separate-and-conquer approach.

Logical models often have different, equivalent formulations. For instance, two alternative formulations for this model are:

if bonus = 1 ∨ lottery = 1 then Class = Y = spam
else Class = Y = ham

if bonus = 0 ∧ lottery = 0 then Class = Y = ham
else Class = Y = spam

We can also represent the same model as un-nested rules:

if bonus = 1 then Class = Y = spam
if bonus = 0 ∧ lottery = 1 then Class = Y = spam
if bonus = 0 ∧ lottery = 0 then Class = Y = ham

Here, every path from root to a leaf is translated into a rule. As a result,
although rules from the same sub-tree share conditions (such as bonus=0),
every pair of rules will have at least some mutually exclusive conditions
(such as lottery = 1 in the second rule and lottery = 0 in the third).
However, this is not always the case: rules can have a certain overlap.
Before learning more on logical models let us understand the
terminologies – grouping and grading.

Grouping and grading:


Grouping is breaking the instance space into groups or segments, the
number of which is determined at training time. Figure 2.4 shows the
example of Grouping.

Grading models are able to distinguish between arbitrary instances when working in a Cartesian instance space. The basic linear classifier constructs a decision boundary by intersecting half-way the line between the positive (p) and negative (n) centres of mass. It is described by the equation w·x = t (where x is any point on the boundary), as shown in Figure 2.5 – an example of grading.

Figure 2.5 – example of Grading Source: (FLACH, 2012)

Let us now continue understanding logical models. An interesting aspect of logical models, which sets them apart from most geometric and probabilistic models, is that they can, to some extent, provide explanations for their predictions.

For example, a prediction assigned by a decision tree could be explained by reading off the conditions that led to the prediction from root to leaf. The model itself can also easily be inspected by humans, which is why such models are sometimes called declarative. Declarative models need not be restricted to the simple rules that we have considered so far.

The logical rule learning system Progol found the following set of
conditions to predict whether a molecular compound is carcinogenic
(causes cancer):
1. it tests positive in the Salmonella assay; or
2. it tests positive for sex-linked recessive lethal mutation in Drosophila;
or
3. it tests negative for chromosome aberration; or
4. it has a carbon in a six-membered aromatic ring with a partial charge
of −0.13; or
5. it has a primary amine group and no secondary or tertiary amines; or
6. it has an aromatic (or resonant) hydrogen with partial charge ≥ 0.168;
or
7. it has a hydroxy oxygen with a partial charge ≥ −0.616 and an
aromatic (or resonant) hydrogen; or
8. it has a bromine; or
9. it has a tetrahedral carbon with a partial charge ≤ −0.144 and tests
positive on Progol's mutagenicity rules.

The first three conditions concerned certain tests that were carried out for
all molecules and whose results were recorded in the data as Boolean
features. In contrast, the remaining six rules all refer to the structure of the
molecule and were constructed entirely by Progol.

For instance, rule 4 predicts that a molecule is carcinogenic if it contains a


carbon atom with certain properties. This condition is different from the
first three in that it is not a pre-recorded feature in the data, but a new
feature that is constructed by Progol during the learning process because it
helps to explain the data.

Statisticians very often work with different conditional probabilities, given by the likelihood function P(X|Y). For example, if somebody were to send me a spam e-mail, how likely would it be that it contains exactly the words of the e-mail I'm looking at? And how likely if it were a ham e-mail instead?

With so many words to choose from, the probability of any particular combination of words would be very small indeed. What really matters is not the magnitude of these likelihoods, but their ratio: how much more likely is it to observe this combination of words in a spam e-mail than it is in a non-spam e-mail.

For instance, suppose that for a particular e-mail described by X we have P(X|Y = spam) = 3.5 · 10⁻⁵ and P(X|Y = ham) = 7.4 · 10⁻⁶; then observing X in a spam e-mail is much more likely than it is in a ham e-mail.

This suggests the following decision rule: predict spam if the likelihood ratio is larger than 1 and ham otherwise.

So, which one should we use: posterior probabilities or likelihoods? As it turns out, we can easily transform one into the other using Bayes' rule, a simple property of conditional probabilities which states that

P(Y|X) = P(X|Y) P(Y) / P(X)

where P(Y|X) is the posterior (conditional) probability,
P(X|Y) is the likelihood function,
P(Y) is the prior probability, without observing the data X, and
P(X) is the probability of the features, independent of Y.

The first decision rule above suggested that we predict the class with maximum posterior probability, which using Bayes' rule can be written in terms of the likelihood function.
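
A small numeric sketch using the likelihoods quoted above; the prior P(spam) = 0.4 is an assumed value for illustration only and is not given in the text:

# Bayes' rule applied to the spam/ham likelihoods from the example above.
p_x_given_spam = 3.5e-5
p_x_given_ham = 7.4e-6
p_spam = 0.4                 # assumed prior, purely illustrative
p_ham = 1 - p_spam

# P(X) = sum over classes of P(X|Y) P(Y)
p_x = p_x_given_spam * p_spam + p_x_given_ham * p_ham
p_spam_given_x = p_x_given_spam * p_spam / p_x

likelihood_ratio = p_x_given_spam / p_x_given_ham
print(f"likelihood ratio = {likelihood_ratio:.2f}")   # about 4.73, > 1 so predict spam
print(f"P(spam|X) = {p_spam_given_x:.3f}")            # posterior probability of spam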

PROBABILISTIC MODELS

The third type of models are probabilistic in nature, like the Bayesian
classifier we considered earlier. Many of these models are based around
the following idea. Let X denote the variables we know about, e.g., our instance's feature values; and let Y denote the target variables we're interested in, e.g., the instance's class. The key question in machine learning is how to model the relationship between X and Y.

Since X is known for a particular instance but Y may not be, we are particularly interested in the conditional probabilities P(Y|X). For instance, Y could indicate whether the e-mail is spam, and X could indicate whether the e-mail contains the words 'bonus' and 'lottery'. The probability of interest is then P(Y | bonus, lottery), with bonus and lottery two Boolean variables which together constitute the feature vector X. For a particular e-mail we know the feature values and so we might write P(Y | bonus = 1, lottery = 0) if the e-mail contains the word 'bonus' but not the word 'lottery'. This is called a posterior probability because it is used after the features X are observed.

Table 2.1 shows an example of how these probabilities might be distributed. From this distribution you can conclude that, if an e-mail doesn't contain the word 'Bonus', then observing the word 'lottery' increases the probability of the e-mail being spam from 0.31 to 0.65; but if the e-mail does contain the word 'Bonus', then observing the word 'lottery' as well decreases the spam probability from 0.80 to 0.40.

Table 2.1. An example posterior distribution. 'Bonus' and 'lottery' are two Boolean features; Y is the class variable, with values 'spam' and 'ham'. In each row the most likely class is indicated in blue. Source: (FLACH, 2012)

Even though this example table is small, it will grow unfeasibly large very quickly: with n Boolean variables, 2ⁿ cases have to be distinguished. We therefore don't normally have access to the full joint distribution and have to approximate it using additional assumptions, as we will see below.

Assuming that X and Y are the only variables we know and care about, the posterior distribution P(Y|X) helps us to answer many questions of interest. For instance, to classify a new e-mail we determine whether the words 'Bonus' and 'lottery' occur in it, look up the corresponding probability P(Y = spam | Bonus, Lottery), and predict spam if this probability exceeds 0.5 and ham otherwise. Such a recipe to predict a value of Y on the basis of the values of X and the posterior distribution P(Y|X) is called a decision rule.

FEATURES

MACHINE LEARNING IS ALL ABOUT using the right features to build the right models that achieve the right tasks – this is the slogan visualised in Figure 2.6. In essence, features define a 'language' in which we describe the relevant objects in our domain. We should not normally have to go back to the domain objects themselves once we have a suitable feature representation, which is why features play such an important role in machine learning.

A task is an abstract representation of a problem we want to solve
regarding those domain objects: the most common form of these is
classifying them into two or more classes. Many of these tasks can be
represented as a mapping from data points to outputs.

This mapping or model is itself produced as the output of a machine
learning algorithm applied to training data; there is a wide variety of
models to choose from. No matter what variety of machine learning
models you may encounter, you will find that they are designed to solve
one of only a small number of tasks and use only a few different types of
features. One could say that models lend the machine learning field
diversity, but tasks and features give it unity.

Figure 2.6. An overview of how machine learning is used to address a given task. A task
(upper box) requires an appropriate mapping – a model – from data described by features
to outputs. Obtaining such a mapping from training data is what constitutes a learning
problem (lower box). Source: (FLACH, 2012)

Features determine much of the success of a machine learning application,
because a model is only as good as its features. A feature can be thought
of as a kind of measurement that can be easily performed on any instance.

Mathematically, they are functions that map from the instance space to
some set of feature values called the domain of the feature. Since
measurements are often numerical, the most common feature domain is
the set of real numbers. Other typical feature domains include the set of
integers, for instance when the feature counts something, such as the
number of occurrences of a particular word; the Booleans, if our feature is
a statement that can be true or false for a particular instance, such as 'this
e-mail is addressed to Beena Kapadia'; and arbitrary finite sets, such as a
set of colours, or a set of shapes.

Suppose we have a number of learning models that we want to describe in
terms of a number of properties:
• the extent to which the models are geometric, probabilistic or logical
• whether they are grouping or grading models
• the extent to which they can handle discrete and/or real-valued features
• whether they are used in supervised or unsupervised learning; and
• the extent to which they can handle multi-class problems.

The first two properties could be expressed by discrete features with three
and two values, respectively; or if the distinctions are more gradual, each
aspect could be rated on some numerical scale.

FEATURE TYPES

There are mainly three kinds of features – Quantitative, Ordinal and
Categorical.

Table 2.2. Kinds of features, their properties and allowable statistics. Each kind inherits
the statistics from the kinds above it in the table. For instance, the mode is a statistic of
central tendency that can be computed for any kind of feature. Source: (FLACH, 2012)

Quantitative:
They have a meaningful numerical scale and order, and most often involve
a mapping into the reals (they are continuous). Even if a feature maps into
a subset of the reals, such as age expressed in years, statistics such as the
mean or standard deviation still require the full scale of the reals.

Ordinal:
Features with an ordering but without scale are called ordinal features. The
domain of an ordinal feature is some totally ordered set, such as the set of
characters or strings. Even if the domain of a feature is the set of integers,
denoting the feature as ordinal means that we have to dispense with the
scale, as we did with house numbers. Other common examples are
features that express a rank order: first, second, third, and so on. Ordinal
features allow the mode and median as central tendency statistics, and
quantiles as dispersion statistics.

Categorical:
Features without ordering or scale are called categorical features (or
sometimes 'nominal' features). They do not allow any statistical summary
except the mode. One subspecies of the categorical features is the Boolean
feature, which maps into the truth values true and false. The situation is
summarised in Table 2.2.

Models treat these different kinds of feature in distinct ways. First,
consider tree models such as decision trees. A split on a categorical feature
will have as many children as there are feature values. Ordinal and
quantitative features, on the other hand, give rise to a binary split, by
selecting a value v0 such that all instances with a feature value less than or
equal to v0 go to one child, and the remaining instances to the other child.
It follows that tree models are insensitive to the scale of quantitative
features. For example, whether a temperature feature is measured on the
Celsius scale or on the Fahrenheit scale will not affect the learned tree.
Neither will switching from a linear scale to a logarithmic scale have any
effect: the split threshold will simply be log v0 instead of v0. In general,
tree models are insensitive to monotonic transformations on the scale of a
feature, which are those transformations that do not affect the relative
order of the feature values. In effect, tree models ignore the scale of
quantitative features, treating them as ordinal. The same holds for rule
models.
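The following small sketch illustrates this insensitivity on made-up temperature data: the same threshold split induces the same partition of the instances before and after a logarithmic (monotonic) transformation.

import math

# Made-up temperatures and labels, used only to show that a threshold split
# gives the same partition on the original and on the log-transformed scale.
temps = [12.0, 18.0, 21.0, 30.0]          # quantitative feature
labels = ['no', 'no', 'yes', 'yes']       # class per instance

v0 = 18.0                                  # split threshold on the original scale
left = [l for t, l in zip(temps, labels) if t <= v0]
right = [l for t, l in zip(temps, labels) if t > v0]

log_v0 = math.log(v0)                      # same split on the log scale
left_log = [l for t, l in zip(temps, labels) if math.log(t) <= log_v0]
right_log = [l for t, l in zip(temps, labels) if math.log(t) > log_v0]

print(left == left_log and right == right_log)   # True: the partition is unchanged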

Now let's consider the naive Bayes classifier. We have seen that this
model works by estimating a likelihood function P(X | Y) for each feature
X given the class Y. For categorical and ordinal features with k values this
involves estimating P(X = v1 | Y), ..., P(X = vk | Y). In effect, ordinal
features are treated as categorical ones, ignoring the order.

Quantitative features cannot be handled at all, unless they are discretised
into a finite number of bins and thus converted to categorical.
Alternatively, we could assume a parametric form for P(X | Y), for instance
a normal distribution.
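The sketch below shows both options on made-up body-weight data: discretising the quantitative feature into bins, and alternatively fitting a per-class normal distribution; the cut points, values and class labels are assumptions made only for illustration.

import numpy as np

# Made-up body weights and class labels.
weights = np.array([62.0, 75.5, 81.2, 93.4, 68.9, 102.3])
labels = np.array(['neg', 'neg', 'pos', 'pos', 'neg', 'pos'])

# Option 1: discretise into bins and treat the bin index as a categorical value.
cut_points = np.array([70.0, 90.0])            # assumed cut points
bin_index = np.digitize(weights, cut_points)   # 0, 1 or 2 for each instance
print(bin_index)

# Option 2: assume a parametric (normal) form for P(X | Y) and fit it per class.
for y in np.unique(labels):
    x = weights[labels == y]
    print(y, x.mean(), x.std(ddof=1))          # mean and std dev of the fitted Gaussian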

While naive Bayes only really handles categorical features, many
geometric models go in the other direction: they can only handle
quantitative features. Linear models are a case in point: the very notion of
linearity assumes a Euclidean instance space in which features act as
Cartesian coordinates, and thus need to be quantitative. Distance-based
models such as k-nearest neighbour and K-means require quantitative
features if their distance metric is Euclidean distance, but we can adapt the
distance metric to incorporate categorical features by setting the distance
to 0 for equal values and 1 for unequal values.

In a similar vein, for ordinal features we can count the number of values
between two feature values (if we encode the ordinal feature by means of
integers, this would simply be their difference). This means that distance-
based methods can accommodate all feature types by using an appropriate
distance metric. Similar techniques can be used to extend support vector
machines and other kernel-based methods to categorical and ordinal
features.
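One possible way to combine these ideas into a single distance function is sketched below; the layout of an instance (which position holds which kind of feature) and the integer encoding of ordinal values are assumptions made for illustration.

# A sketch of a distance that mixes feature types: Euclidean for quantitative,
# 0/1 mismatch for categorical, rank difference for ordinal.
def mixed_distance(a, b, kinds):
    total = 0.0
    for x, y, kind in zip(a, b, kinds):
        if kind == 'quantitative':
            total += (x - y) ** 2              # usual squared Euclidean contribution
        elif kind == 'categorical':
            total += 0.0 if x == y else 1.0    # 0 for equal values, 1 for unequal
        elif kind == 'ordinal':
            total += abs(x - y)                # values encoded as integer ranks
    return total ** 0.5

# instance = (weight in kg, colour, rank)
print(mixed_distance((70.0, 'red', 2), (80.0, 'blue', 4),
                     ('quantitative', 'categorical', 'ordinal')))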

FEATURE CONSTRUCTION AND TRANSFORMATION

There is a lot of scope in machine learning for playing around with
features. In the spam filter example, and text classification more generally,
the messages or documents don't come with built-in features; rather, they
need to be constructed by the developer of the machine learning
application. This feature construction process is absolutely crucial for the
success of a machine learning application.

Indexing an e-mail by the words that occur in it (called a bag of words
representation as it disregards the order of the words in the e-mail) is a
carefully engineered representation that manages to amplify the 'signal'
and attenuate the 'noise' in spam e-mail filtering and related classification
tasks. However, it is easy to conceive of problems where this would be
exactly the wrong thing to do: for instance if we aim to train a classifier to
distinguish between grammatical and ungrammatical sentences, word
order is clearly signal rather than noise, and a different representation is
called for.
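A minimal bag-of-words construction might look like the sketch below; the example e-mails and the simple whitespace tokenisation are assumptions made only for illustration.

from collections import Counter

# Each e-mail is reduced to word counts, discarding word order.
emails = ["claim your lottery bonus now",
          "meeting notes attached see you tomorrow"]

vocabulary = sorted({word for email in emails for word in email.split()})

def bag_of_words(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]   # one count per vocabulary word

for email in emails:
    print(bag_of_words(email))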

Figure 2.7. (left) Artificial data depicting a histogram of body weight
measurements of people with (blue) and without (red) diabetes, with eleven fixed
intervals of 10 kilograms width each. (right) By joining the first and second, third
and fourth, fifth and sixth, and the eighth, ninth and tenth intervals, we obtain a
discretisation such that the proportion of diabetes cases increases from left to
right. This discretisation makes the feature more useful in predicting diabetes.
Source: (FLACH, 2012)

It is often natural to build a model in terms of the given features. However,
we are free to change the features as we see fit, or even to introduce new
features. For instance, real-valued features often contain unnecessary
detail that can be removed by discretisation. Imagine you want to analyse
the body weight of a relatively small group of, say, 100 people, by
drawing a histogram.

If you measure everybody's weight in kilograms with one position after
the decimal point (i.e., your precision is 100 grams), then your histogram
will be sparse and spiky. It is hard to draw any general conclusions from
such a histogram. It would be much more useful to discretise the body
weight measurements into intervals of 10 kilograms. If we are in a
classification context, say we're trying to relate body weight to diabetes,
we could then associate each bar of the histogram with the proportion of
people having diabetes among the people whose weight falls in that
interval. In fact, we can even choose the intervals such that this proportion
is monotonically increasing, as shown in Figure 2.7.
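The sketch below illustrates this discretisation on randomly generated weights (not the data behind Figure 2.7): weights are grouped into 10 kg intervals and the proportion of diabetes cases is computed per interval.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(80, 15, size=100)                 # 100 people, weight in kg
diabetes = rng.random(100) < (weights - 40) / 100      # risk loosely tied to weight (assumed)

# Ten fixed 10 kg intervals starting at 40 kg; out-of-range weights are clipped.
bin_index = np.clip(((weights - 40) // 10).astype(int), 0, 9)
for b in np.unique(bin_index):
    in_bin = bin_index == b
    low = 40 + 10 * b
    print(f"{low}-{low + 10} kg: {diabetes[in_bin].mean():.2f} with diabetes")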

The previous example gives another illustration of how, for a particular
task such as classification, we can improve the signal-to-noise ratio of a
feature. In more extreme cases of feature construction, we transform the
entire instance space. Consider, for example, two-class data that is clearly
not linearly separable: by mapping the instance space into a new
'feature space' consisting of the squares of the original features, the data
can become almost linearly separable. In fact, by adding in a third feature
we can perform a remarkable trick: we can build this feature space
classifier without actually constructing the feature space.
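The following sketch illustrates the squaring transformation on synthetic data: points inside a circle cannot be separated from points outside it by a line in the original space, but become linearly separable in the space of squared features.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # class: inside the unit circle

Z = X ** 2                                           # new feature space: (x1^2, x2^2)
# In Z-space the classes are separated by the line z1 + z2 = 1.
predictions = (Z[:, 0] + Z[:, 1] < 1).astype(int)
print((predictions == y).mean())                     # 1.0: perfectly separated here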

FEATURE SELECTION
Once we have constructed new features it is often a good idea to select a
suitable subset of them prior to learning. Not only will this speed up
learning as fewer candidate features need to be considered, it also helps to
guard against overfitting.

(FLACH, 2012)

There are two main approaches to feature selection: the filter approach
and the wrapper approach.

The filter approach scores the features on a particular metric, and the top-
scoring features are selected. Many of the metrics we have seen so far can
be used for feature scoring, including information gain, the χ² statistic and
the correlation coefficient, to name just a few.
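A minimal filter-style selection might look like the following sketch, which scores each feature by its absolute correlation with the label and keeps the top k; the synthetic data and the choice of k are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))                      # 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100) > 0).astype(int)

# Score every feature independently, then keep the k best-scoring ones.
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
k = 2
selected = np.argsort(scores)[::-1][:k]            # indices of the top-k features
print(selected)                                    # likely features 0 and 2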

An interesting variation is provided by the Relief feature selection method,
which repeatedly samples a random instance x and finds its nearest hit h
(instance of the same class) as well as its nearest miss m (instance of the
opposite class). The i-th feature's score is then decreased by Dis(xi, hi)^2
and increased by Dis(xi, mi)^2, where Dis is some distance measure (e.g.,
Euclidean distance for quantitative features, Hamming distance for
categorical features). The intuition is that we want to move closer to the
nearest hit while differentiating from the nearest miss.
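A bare-bones sketch of the Relief idea for quantitative features is given below; the tiny data set and the number of sampling iterations are assumptions made for illustration.

import numpy as np

def relief_scores(X, y, n_samples=20, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[1])
    for _ in range(n_samples):
        i = rng.integers(len(X))
        x, label = X[i], y[i]
        dists = np.linalg.norm(X - x, axis=1)
        dists[i] = np.inf                                     # never pick the instance itself
        hit = np.argmin(np.where(y == label, dists, np.inf))  # nearest same-class instance
        miss = np.argmin(np.where(y != label, dists, np.inf)) # nearest other-class instance
        scores -= (x - X[hit]) ** 2                           # move closer to the nearest hit
        scores += (x - X[miss]) ** 2                          # move away from the nearest miss
    return scores

X = np.array([[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.1, 0.8]])
y = np.array([0, 0, 1, 1])
print(relief_scores(X, y))   # feature 0, which separates the classes, gets the higher score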

One drawback of a simple filter approach is that no account is taken of
redundancy between features. Imagine, for the sake of the argument,
duplicating a promising feature in the data set: both copies score equally
high and will be selected, whereas the second one provides no added value
in the context of the first one.
Secondly, feature filters do not detect dependencies between features as
they are solely based on marginal distributions. For example, consider two
Boolean features such that half the positives have the value true for both
features and the other half have the value false for both, whereas all
negatives have opposite values (again distributed half-half over the two
possibilities). It follows that each feature in isolation has zero information
gain and hence is unlikely to be selected by a feature filter, despite their
combination being a perfect classifier. One could say that feature filters
are good at picking out possible root features for a decision tree, but not
necessarily good at selecting features that are useful further down the tree.
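The following sketch reproduces this situation with an XOR-like data set: each feature alone gives a 50/50 split of the class, yet together the two features predict it perfectly.

import numpy as np

# Positives have the same value for both features; negatives have opposite values.
f1 = np.array([1, 1, 0, 0, 1, 0, 0, 1])
f2 = np.array([1, 1, 0, 0, 0, 1, 1, 0])
y  = np.array([1, 1, 1, 1, 0, 0, 0, 0])

for name, f in [("f1", f1), ("f2", f2)]:
    # P(y = 1 | f = v) is 0.5 for both values of v, so each feature alone is useless.
    print(name, [y[f == v].mean() for v in (0, 1)])

# The pair predicts the class perfectly: y = 1 exactly when f1 == f2.
print(((f1 == f2).astype(int) == y).all())        # True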

To detect features that are useful in the context of other features, we need
to evaluate sets of features; this usually goes under the name of wrapper
approaches. The idea is that feature selection is 'wrapped' in a search
procedure that usually involves training and evaluating a model with a
candidate set of features.

Forward selection methods start with an empty set of features and add
features to the set one at a time, as long as they improve the performance
of the model. Backward elimination starts with the full set of features and
aims at improving performance by removing features one at a time. Since
the number of feature subsets is exponential, it is usually not feasible to
search all possible subsets, and most approaches apply a 'greedy' search
algorithm that never reconsiders the choices it makes.
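A sketch of greedy forward selection in the wrapper style is shown below; the evaluate function is a placeholder (here a simple correlation score on synthetic data) standing in for training and evaluating an actual model, e.g. with cross-validation.

import numpy as np

def evaluate(X, y, features):
    # Placeholder score: correlation of the summed chosen features with the target.
    if not features:
        return 0.0
    combined = X[:, sorted(features)].sum(axis=1)
    return abs(np.corrcoef(combined, y)[0, 1])

def forward_selection(X, y):
    selected, best_score = set(), 0.0
    improved = True
    while improved:
        improved = False
        for j in set(range(X.shape[1])) - selected:
            score = evaluate(X, y, selected | {j})
            if score > best_score:
                best_score, best_feature, improved = score, j, True
        if improved:
            selected.add(best_feature)   # greedily keep the best addition this round
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X[:, 1] + X[:, 3] + rng.normal(scale=0.1, size=100)
print(forward_selection(X, y))           # likely {1, 3}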

SUMMARY

After studying this chapter, you will understand different models such as
geometric models, logical models and probabilistic models. You will
understand how features are used and why they are so important in model
design. You will also understand the different feature types, how features
can be constructed, why their transformation is sometimes required and
how it can be done. Finally, you will understand the important role feature
selection plays in designing a model and how to carry it out.

UNIT END QUESTIONS

1. How does a linear classifier construct a decision boundary using
   linearly separable data? Explain in detail with respect to geometric
   models of machine learning.
2. Explain the working of the decision boundary learned by a Support
   Vector Machine from linearly separable data with respect to geometric
   models of machine learning.
3. Describe logical models.
4. Write a short note on probabilistic models.
5. Machine learning is all about using the right features to build the right
models that achieve the right tasks – justify this sentence.
6. What are the various types of features available? Explain each one in
   brief.

7. Why are feature construction and feature transformation required?
   How can they be achieved?
8. What are the approaches to feature selection? Explain each one in
detail.
