Machine Learning Class Note 1

The document provides a comprehensive overview of machine learning, detailing its definition, types (supervised, unsupervised, semi-supervised, and reinforcement learning), and applications in various fields such as finance, healthcare, and telecommunications. It also discusses the challenges faced in machine learning, including data quality and model deployment issues, as well as the components of learning processes. The course aims to educate on the fundamental concepts and real-world implications of machine learning technologies.

MACHINE LEARNING WITH PYTHON

COURSE DESCRIPTION: This course covers the difference between the two main types of machine learning methods, supervised and unsupervised; supervised learning algorithms, including classification and regression; unsupervised learning algorithms, including clustering and dimensionality reduction; how statistical modelling relates to machine learning and how to compare them; and real-life examples of the different ways machine learning affects society.
Week One

Introduction
It is said that the term machine learning was first coined by Arthur Lee Samuel, a pioneer in the AI field, in 1959. Machine learning is not just about storing a lot of data; it is a part of Artificial Intelligence (AI). Artificial Intelligence is the development of computer programs to perform tasks that typically require human intervention, such as decision making. Machine learning is the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data. According to IBM Cloud Education, machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. As a subfield of AI, one of the goals behind machine learning was to replace the need for developing computer programs "manually." Considering that programs are developed to automate processes, we can think of machine learning as the process of "automating automation": machine learning lets computers "create" programs (often with the intent of making predictions) themselves. In other words, machine learning is the process of turning data into programs.

What is Machine Learning?

Machine learning is the practice of programming computers to learn from data. In machine learning,
this data is referred to as training sets or examples.

Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive, to make predictions in the future, or descriptive, to gain knowledge from data, or both.

Machine learning is a subfield of computer science that evolved from the study of pattern
recognition and computational learning theory in Artificial Intelligence (AI).

In 1959, Arthur Samuel, an American pioneer in the field of computer gaming, machine learning, and artificial intelligence, defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed."

Machine learning is a field of computer science that involves using statistical methods to create
programs that either improve performance over time, or detect patterns in massive amounts of data
that humans would be unlikely to find.

Machine Learning explores the study and construction of algorithms that can learn from and make
predictions on data. Such algorithms operate by building a model from example inputs in order to
make data driven predictions or decisions, rather than following strictly static program instructions.
Machine Learning is a collection of algorithms and techniques used to create computational systems that learn from data in order to make predictions and inferences. Unlike a conventional computer program, which takes data and a command as input to perform a task, machine learning starts with data, learns from the data, and builds a model which is then used to generate an output. A conventional program therefore consists of only two steps: Command > Action. Machine learning entails a three-step process: Data > Model > Action. Incorporating machine learning into a system means switching out "command" for "data" and adding a "model" in order to produce an action (output).

Why Machine Learning?

One of the goals behind machine learning was to replace the need for developing computer
programs “manually.” In the early days of “intelligent” applications, many systems used hand coded
rules of “if” and “else” decisions to process data or adjust to user input. Manually crafting decision
rules is feasible for some applications, particularly those in which humans have a good
understanding of the process to model. However, using hand coded rules to make decisions has two
major disadvantages:

i. The logic required to make a decision is specific to a single domain and task. Changing
the task even slightly might require a rewrite of the whole system.
ii. Designing rules requires a deep understanding of how a decision should be made by a
human expert.

One example of where this hand coded approach will fail is in detecting faces in images. Today,
every smartphone can detect a face in an image. However, face detection was an unsolved problem
until recently. The main problem is that the way in which pixels (which make up an image in a
computer) are “perceived” by the computer is very different from how humans perceive a face. This
difference in representation makes it basically impossible for a human to come up with a good set of
rules to describe what constitutes a face in a digital image. Using machine learning, however, simply
presenting a program with a large collection of images of faces is enough for an algorithm to
determine what characteristics are needed to identify a face.

Applications of Machine Learning

Machine learning is a rapidly growing technology, and according to researchers we are in the golden years of AI and ML. It is used to solve many complex real-world problems which cannot be solved with a traditional approach. Following are some real-world applications of ML:

 Emotion analysis

 Sentiment analysis

 Error detection and prevention

 Weather forecasting and prediction

 Stock market analysis and forecasting

 Speech synthesis

 Speech recognition

 Customer segmentation


 Object recognition

 Fraud detection

 Fraud prevention

 Recommendation of products to customers in online shopping

Areas of Application of Machine Learning

The application of machine learning methods to large databases is called data mining. In data mining, a large volume of data is processed to construct a simple model with valuable use. Some typical applications of machine learning are listed below.

1. In retail business, machine learning is used to study consumer behaviour.

2. In finance, banks analyse their past data to build models to use in credit applications, fraud
detection, and the stock market.

3. In manufacturing, learning models are used for optimization, control, and troubleshooting.

4. In medicine, learning programs are used for medical diagnosis.

5. In telecommunications, call patterns are analyzed for network optimization and maximizing the
quality of service.

6. In science, the large amounts of data in physics, astronomy, and biology can only be analyzed fast enough by computers. The World Wide Web is huge and constantly growing, and searching it for relevant information cannot be done manually.

7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.

8. It is used to find solutions to many problems in vision, speech recognition, and robotics.

9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.

10. Machine learning methods have been used to develop programs for playing games such as chess, backgammon and Go.

Challenges in Machine Learning

While machine learning is rapidly evolving, making significant strides in cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go. The reason is that ML has not yet been able to overcome a number of challenges. The challenges that ML currently faces are:

Quality of data: Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-quality data leads to problems with data preprocessing and feature extraction.

Time-consuming tasks: Another challenge faced by ML models is the time consumed, especially for data acquisition, feature extraction and retrieval.

Lack of specialists: As ML technology is still in its infancy, expert resources are hard to find.

No clear objective for formulating business problems: Having no clear objective and well-defined
goal for business problems is another key challenge for ML because this technology is not that
mature yet.

Issue of overfitting and underfitting: If the model is overfitting or underfitting, it cannot represent the problem well.

Curse of dimensionality: Another challenge ML models face is data points with too many features. This can be a real hindrance.

Difficulty in deployment: The complexity of ML models makes them quite difficult to deploy in real life.

Machine Learning Model

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. This definition focuses on three parameters, which are also the main components of any learning algorithm: Task (T), Performance (P) and Experience (E). In this context, we can simplify the definition as follows: ML is a field of AI consisting of learning algorithms that:

 Improve their performance (P)

 At executing some task (T)

 Over time with experience (E)

Based on the above, the following diagram represents a Machine Learning Model:

Figure 1.1: Machine learning model, showing the relationship between Task (T), Performance (P) and Experience (E)

Task (T): From the perspective of a problem, we may define the task T as the real-world problem to be solved. The problem can be anything, like finding the best house price in a specific location or finding the best marketing strategy. In machine learning, however, the definition of a task is different, because it is difficult to solve ML-based tasks by a conventional programming approach. A task T is said to be an ML-based task when it is based on the process the system must follow for operating on data points. Examples of ML-based tasks are classification, regression, structured annotation, clustering, transcription, etc.

Experience (E): As the name suggests, this is the knowledge gained from the data points provided to the algorithm or model. Once provided with the dataset, the model runs iteratively and learns some inherent pattern. The learning thus acquired is called experience (E). Making an analogy with human learning, we can think of this as a human being gaining experience from various attributes like situations, relationships, etc. Supervised, unsupervised and reinforcement learning are some ways to learn or gain experience. The experience gained by our ML model or algorithm will be used to solve the task T.

Performance (P): An ML algorithm is supposed to perform a task and gain experience with the passage of time. The measure which tells whether the ML algorithm is performing as expected or not is its performance (P). P is basically a quantitative metric that tells how well a model performs the task, T, using its experience, E. There are many metrics that help to understand ML performance, such as accuracy score, F1 score, confusion matrix, precision, recall, sensitivity, etc.
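As an illustration, here is a minimal sketch of computing a few of these performance metrics with scikit-learn, assuming the library is available; the labels are made-up placeholders, not output of any model described above:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground-truth labels (the experience E comes from labelled data)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
# Made-up predictions produced by some model on the task T
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))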

Examples

i) Handwriting recognition learning problem


• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance travelled before an error
• Experience E: A sequence of images and steering commands recorded while observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
•Experience E: Playing practice games against itself

Components of Learning

The learning process, whether by a human or a machine, can be divided into four basic components, namely data storage, abstraction, generalization and evaluation. Figure 1.2 illustrates the various components and the steps involved in the learning process.

Figure 1.2: Components of the learning process: data storage, abstraction (concepts), generalization (inferences) and evaluation

1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning. In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals. Computers use hard disk drives, flash memory, random access memory and
similar devices to store data and use cables and other technology to retrieve data.

2. Abstraction

The second component of the learning process is known as abstraction. Abstraction is the process of
extracting knowledge about stored data. This involves creating general concepts about the data as a
whole. The creation of knowledge involves application of known models and creation of new
models. The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.

3. Generalization

The third component of the learning process is known as generalization. The term generalization describes the process of turning the knowledge about stored data into a form that can be utilized for future action. These actions are to be carried out on tasks that are similar, but not identical, to those that have been seen before. In generalization, the goal is to discover those properties of the data that will be most relevant to future tasks.

4. Evaluation

Evaluation is the last component of the learning process. It is the process of giving feedback to the
user to measure the utility of the learned knowledge. This feedback is then utilised to effect
improvements in the whole learning process.

Types of Machine Learning


Supervised Machine Learning
Supervised learning is typically the machine learning task of learning a function that maps an input to an output based on sample input-output pairs. It uses labelled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs, i.e., a task-driven approach. The most common supervised tasks are "classification", which separates the data, and "regression", which fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised Machine Learning


Unsupervised learning analyses unlabelled datasets without the need for human interference, i.e., a
data-driven process. This is widely used for extracting generative features, identifying meaningful
trends and structures, groupings in results, and exploratory purposes. The most common
unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality
reduction, finding association rules, anomaly detection, etc.

Semi Supervised Learning


Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and
unsupervised methods, as it operates on both labelled and unlabelled data. Thus, it falls between
learning “without supervision” and learning “with supervision”. In the real world, labelled data could
be rare in several contexts, and unlabelled data are numerous, where semi-supervised learning is
useful. The ultimate goal of a semi-supervised learning model is to provide a better outcome for
prediction than that produced using the labelled data alone from the model. Some application areas
where semi-supervised learning is used include machine translation, fraud detection, labelling data
and text classification.

Reinforcement Learning
Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behaviour in a particular context or environment to improve their efficiency, i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Table 1: Various types of machine learning techniques with examples

Learning type     Model building                                       Examples
Supervised        Algorithms or models learn from labeled data         Classification, regression
                  (task-driven approach)
Unsupervised      Algorithms or models learn from unlabeled data       Clustering, associations,
                  (data-driven approach)                               dimensionality reduction
Semi-supervised   Models are built using combined data                 Classification, clustering
                  (labeled + unlabeled)
Reinforcement     Models are based on reward or penalty                Classification, control
                  (environment-driven approach)
Figure 1.3: Types of Machine Learning
Week Two
Supervised Learning Algorithms
Machine learning algorithms that learn from input/output pairs are called supervised learning
algorithms because a “teacher” provides supervision to the algorithms in the form of the
desired outputs for each example that they learn from. While creating a dataset of inputs and
outputs is often a laborious manual process, supervised learning algorithms are well
understood and their performance is easy to measure. If your application can be formulated as
a supervised learning problem, and you are able to create a dataset that includes the desired
outcome, machine learning will likely be able to solve your problem.
Examples of supervised machine learning tasks include:
 Identifying the zip code from handwritten digits on an envelope
Here the input is a scan of the handwriting, and the desired output is the actual digits in the
zip code. To create a dataset for building a machine learning model, you need to collect many
envelopes. Then you can read the zip codes yourself and store the digits as your desired
outcomes.
 Determining whether a tumor is benign based on a medical image
Here the input is the image, and the output is whether the tumor is benign. To create a dataset
for building a model, you need a database of medical images. You also need an expert
opinion, so a doctor needs to look at all of the images and decide which tumors are benign
and which are not. It might even be necessary to do additional diagnosis beyond the content
of the image to determine whether the tumor in the image is cancerous or not.
 Detecting fraudulent activity in credit card transactions
Here the input is a record of the credit card transaction, and the output is whether it is likely to be fraudulent or not. Assuming that you are the entity distributing the credit cards, collecting a dataset means storing all transactions and recording if a user reports any transaction as fraudulent.
An interesting thing to note about these examples is that although the inputs and outputs look
fairly straightforward, the data collection process for these three tasks is vastly different.
While reading envelopes is laborious, it is easy and cheap. Obtaining medical imaging and
diagnoses, on the other hand, requires not only expensive machinery but also rare and
expensive expert knowledge, not to mention the ethical concerns and privacy issues. In the
example of detecting credit card fraud, data collection is much simpler. Your customers will
provide you with the desired output, as they will report fraud. All you have to do to obtain the
input/output pairs of fraudulent and non-fraudulent activity is wait.
Supervised learning is used whenever we want to predict a certain outcome from a given
input, and we have examples of input/output pairs. We build a machine learning model from
these input/output pairs, which comprise our training set. Our goal is to make accurate
predictions for new, never-before-seen data. Supervised learning often requires human effort
to build the training set, but afterward automates and often speeds up an otherwise laborious
or infeasible task.
Supervised learning is the subcategory of machine learning that focuses on learning a classification or regression model, that is, learning from labelled training data (i.e., inputs that also contain the desired outputs or targets; basically, "examples" of what we want to predict).
There are two major types of supervised machine learning problems, called classification and
regression.
Classification
In classification, the goal is to predict a class label, which is a choice from a predefined list of
possibilities.
Classification is sometimes separated into binary classification and multiclass classification.
Binary classification is the special case of distinguishing between exactly two classes; it can be
thought of as trying to answer a yes/no question. Classifying emails as either spam or not spam is
an example of a binary classification problem. In this binary classification task, the yes/no
question being asked would be "Is this email spam?" In binary classification we often speak of one
class being the positive class and the other class being the negative class.
Multiclass classification is classification between more than two classes. Classifying irises into
one of three species (as in the classic Iris dataset) is an example of a multiclass classification
problem. Another example is predicting what language a website is in from the text on the website.
The classes here would be a predefined list of possible languages.

Regression
In regression, the goal is to predict a continuous number, or a floating-point number in
programming terms (or real number in mathematical terms). Predicting a person’s annual
income from their education, their age, and where they live is an example of a regression
task. When predicting income, the predicted value is an amount, and can be any number in a
given range. Another example of a regression task is predicting the yield of a corn farm given
attributes such as previous yields, weather, and number of employees working on the farm.
The yield again can be an arbitrary number.
An easy way to distinguish between classification and regression tasks is to ask whether there
is some kind of continuity in the output. If there is continuity between possible outcomes,
then the problem is a regression problem. In contrast, for the task of recognizing the language
of a website (which is a classification problem), there is no matter of degree. A website is in
one language, or it is in another. There is no continuity between languages, and there is no
language that is between English and French.
Generalization, Overfitting and Underfitting
Generalization: In supervised learning, a model is built on the training data and then used to
make accurate predictions on new, unseen data that has the same characteristics as the training
set that was used. If a model is able to make accurate predictions on unseen data, it is said to be
able to generalize from the training set to the test set. We want to build a model that is able to
generalize as accurately as possible.
Overfitting: Building a model that is too complex for the amount of information available is
called overfitting. Overfitting occurs when you fit a model too closely to the particularities of
the training set and obtain a model that works well on the training set but is not able to
generalize to new data.
Underfitting: On the other hand, if a model is too simple, then it might not be able to capture
all the aspects of variability in the data, and the model will do badly even on the training set.
Choosing too simple a model is called underfitting.
The more complex a model is, the better it will be able to predict on the training data.
However, if a model becomes too complex, we start focusing too much on each individual
data point in the training set, and the model will not generalize well to new data.
For examples and further illustration, refer to Introduction to Machine Learning with Python
by Andreas C. Müller and Sarah Guido
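As a small illustration (a sketch only, using an assumed scikit-learn setup rather than anything from the text above), comparing training and test accuracy is the usual way to spot overfitting and underfitting:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a very simple, a moderate, and a fully grown tree
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("max_depth =", depth,
          "train accuracy =", round(tree.score(X_train, y_train), 3),
          "test accuracy =", round(tree.score(X_test, y_test), 3))

# A large gap between training and test accuracy suggests overfitting;
# low accuracy on both suggests underfitting.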
Important Supervised Algorithms
- k-Nearest Neighbors
- Linear Regression
- Neural Networks
- Support Vector Machines
- Logistic Regression
- Decision Trees and Random Forests

K-Nearest Neighbors
The k-Nearest Neighbors (k-NN) algorithm is arguably the simplest machine learning algorithm.
Building the model consists only of storing the training dataset. To make a prediction for a new
data point, the algorithm finds the closest data points in the training dataset, its "nearest
neighbors."
k-Neighbors classification: In its simplest version, the k-NN algorithm considers exactly one
nearest neighbor, which is the closest training data point to the point we want to make a
prediction for. The prediction is then simply the known output for this training point.
k-Neighbors regression: There is also a regression variant of the k-nearest neighbors algorithm.
The prediction using a single neighbor is just the target value of the nearest neighbor.
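A minimal k-NN sketch with scikit-learn, using the Iris dataset purely as an assumed illustration (KNeighborsRegressor works the same way for regression targets):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" only stores the dataset; prediction looks up the 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Test set accuracy:", knn.score(X_test, y_test))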
Strengths, weaknesses, and parameters
In principle, there are two important parameters to the K Neighbors classifier: the number of
neighbors and how you measure distance between data points. In practice, using a small
number of neighbors like three or five often works well, but you should certainly adjust this
parameter. Choosing the right distance measure is somewhat important. By default, Euclidean
distance is used, which works well in many settings.
One of the strengths of k-NN is that the model is very easy to understand, and often gives
reasonable performance without a lot of adjustments. Using this algorithm is a good baseline
method to try before considering more advanced techniques. Building the nearest neighbors
model is usually very fast, but when your training set is very large (either in number of
features or in number of samples) prediction can be slow. When using the k-NN algorithm,
it’s important to pre-process the data. This approach often does not perform well on datasets
with many features (hundreds or more), and it does particularly badly with datasets where
most features are 0 most of the time (so-called sparse datasets). So, while the k-nearest neighbors algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features.
Linear Models
Linear models make a prediction using a linear function of the input features.

Linear models for regression can be characterized as regression models for which the prediction is a line for a single feature, a plane when using two features, or a hyperplane in higher dimensions (that is, when using more features).

Comparing the predictions made by the straight line with those made by the k-neighbors regressor, using a straight line to make predictions seems very restrictive. All the fine details of the data are lost.

There are many different linear models for regression. The difference between these models lies in
how the model parameters w and b are learned from the training data, and how model complexity
can be controlled. w and b are parameters of the model that are learned, and ŷ is the prediction the
model makes.
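To make the prediction formula concrete, here is a tiny NumPy sketch of ŷ = w[0]*x[0] + ... + w[p]*x[p] + b, with made-up numbers used only for illustration:

import numpy as np

w = np.array([0.5, -1.2, 2.0])   # learned coefficients, one per feature (made up)
b = 0.3                          # learned intercept (made up)
x = np.array([1.0, 0.5, 2.0])    # one data point with three features

y_hat = np.dot(w, x) + b         # weighted sum of the features plus the intercept
print("prediction:", y_hat)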

Linear regression (aka ordinary least squares)

Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters w and b that minimize the mean squared error between predictions and the true regression targets, y, on the training set. The mean squared error is the average of the squared differences between the predictions and the true values. Linear regression has no parameters to tune, which is a benefit, but it also has no way to control model complexity.
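A minimal ordinary least squares sketch with scikit-learn, on synthetic data assumed only for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(60, 1) * 10                    # a single feature
y = 2.5 * X.ravel() + 1.0 + rng.randn(60)   # underlying w = 2.5, b = 1.0, plus noise

lr = LinearRegression().fit(X, y)
print("learned w:", lr.coef_, "learned b:", lr.intercept_)
print("R^2 on the training data:", lr.score(X, y))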

Ridge regression
Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same one used for ordinary least squares. In ridge regression, though, the coefficients (w) are chosen not only so that they predict well on the training data, but also to fit an additional constraint. We also want the magnitude of the coefficients to be as small as possible; in other words, all entries of w should be close to zero. Intuitively, this means each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called regularization. Regularization means explicitly restricting a model to avoid overfitting. The particular kind used by ridge regression is known as L2 regularization.
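A minimal ridge regression sketch; the alpha parameter sets the strength of the L2 penalty (the dataset choice here is just an assumption for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha =", alpha,
          "train R^2 =", round(ridge.score(X_train, y_train), 2),
          "test R^2 =", round(ridge.score(X_test, y_test), 2))

# Larger alpha pushes the entries of w closer to zero (stronger regularization).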

Linear models for classification

Linear models are also extensively used for classification. In this case, a prediction is made using the following formula:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b > 0

The formula looks very similar to the one for linear regression, but instead of just returning the weighted sum of the features, we threshold the predicted value at zero. If the function value is smaller than zero, we predict the class –1; if it is larger than zero, we predict the class +1.

For linear models for regression, the output, ŷ, is a linear function of the features: a line, plane, or hyperplane (in higher dimensions). For linear models for classification, the decision boundary is a linear function of the input. In other words, a (binary) linear classifier is a classifier that separates two classes using a line, a plane, or a hyperplane. There are many algorithms for learning linear models. These algorithms all differ in the following two ways:
• The way in which they measure how well a particular combination of coefficients and intercept fits the training data
• If and what kind of regularization they use
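A minimal sketch of two linear classifiers mentioned in this section, LogisticRegression and LinearSVC; in both, the parameter C controls the amount of regularization (the dataset is chosen only for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(C=1.0, max_iter=5000).fit(X_train, y_train)
svc = LinearSVC(C=1.0, dual=False).fit(X_train, y_train)

print("LogisticRegression test accuracy:", logreg.score(X_test, y_test))
print("LinearSVC test accuracy:", svc.score(X_test, y_test))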

Strengths, weaknesses, and parameters

The main parameter of linear models is the regularization parameter, called alpha in the regression
models and C in LinearSVC and LogisticRegression. Large values for alpha or small values for C mean
simple models. In particular for the regression models, tuning these parameters is quite important.

Linear models are very fast to train, and also fast to predict. They scale to very large datasets and work well with sparse data. Another strength of linear models is that it is relatively easy to understand how a prediction is made, using the formulas for regression and classification.

Unfortunately, it is often not entirely clear why coefficients are the way they are. This is particularly true if the dataset has highly correlated features; in these cases, the coefficients might be hard to
interpret. Linear models often perform well when the number of features is large compared to the
number of samples. They are also often used on very large datasets, simply because it’s not feasible
to train other models. However, in lower-dimensional spaces, other models might yield better
generalization performance.

Naive Bayes Classifiers

Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models earlier
discussed. However, they tend to be even faster in training. The price paid for this efficiency is that
naive Bayes models often provide generalization performance that is slightly worse than that of
linear classifiers. The reason that naive Bayes models are so efficient is that they learn parameters
by looking at each feature individually and collect simple per-class statistics from each feature.
There are three kinds of naive Bayes classifiers: GaussianNB, BernoulliNB, and MultinomialNB. GaussianNB can be applied to any continuous data, while BernoulliNB assumes binary data and MultinomialNB assumes count data (that is, that each feature represents an integer count of something, like how often a word appears in a sentence). BernoulliNB and MultinomialNB are mostly used in text data classification. The BernoulliNB classifier counts how often every feature of each class is not zero.
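A minimal GaussianNB sketch for continuous features (the dataset is an illustrative assumption; for word-count data one would use MultinomialNB instead):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB learns the per-class mean and variance of each feature
nb = GaussianNB().fit(X_train, y_train)
print("Test set accuracy:", nb.score(X_test, y_test))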

Strengths, weaknesses, and parameters

MultinomialNB and BernoulliNB have a single parameter, alpha, which controls model complexity.
The way alpha works is that the algorithm adds to the data alpha many virtual data points that have
positive values for all the features. This results in a “smoothing” of the statistics. A large alpha means
more smoothing, resulting in less complex models. The algorithm’s performance is relatively robust
to the setting of alpha, meaning that setting alpha is not critical for good performance. However,
tuning it usually improves accuracy somewhat.

GaussianNB is mostly used on very high-dimensional data, while the other two variants of naive Bayes are widely used for sparse count data such as text. MultinomialNB usually performs better than BernoulliNB, particularly on datasets with a relatively large number of nonzero features (i.e., large
documents). The naive Bayes models share many of the strengths and weaknesses of the linear
models. They are very fast to train and to predict, and the training procedure is easy to understand.
The models work very well with high-dimensional sparse data and are relatively robust to the
parameters. Naive Bayes models are great baseline models and are often used on very large
datasets, where training even a linear model might take too long.

Decision Trees

Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision. These questions are similar to the questions
you might ask in a game of 20 Questions. Imagine you want to distinguish between the following
four animals: bears, hawks, penguins, and dolphins. Your goal is to get to the right answer by asking
as few if/else questions as possible. You might start off by asking whether the animal has feathers, a
question that narrows down your possible animals to just two. If the answer is “yes,” you can ask
another question that could help you distinguish between hawks and penguins. For example, you
could ask whether the animal can fly. If the animal doesn’t have feathers, your possible animal
choices are dolphins and bears, and you will need to ask a question to distinguish between these
two animals—for example, asking whether the animal has fins.
Figure ….. A decision tree to distinguish among several animals (root: "Has feathers?"; if true, "Can fly?" separates hawks from penguins; if false, "Has fins?" separates dolphins from bears)
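A minimal sketch showing a learned hierarchy of if/else questions on the classic Iris dataset (an assumed example, not the animal tree above); max_depth is one of the pre-pruning parameters discussed below:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Pre-pruning: limit the depth so the hierarchy of if/else questions stays small
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned if/else questions in a readable text form
print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))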

Strengths, weaknesses, and parameters

The parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed. Usually, picking one of the pre-pruning strategies (for example, setting a maximum depth, a maximum number of leaf nodes, or a minimum number of samples per leaf) is sufficient to prevent overfitting.

Decision trees have two advantages over many other algorithms: the resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don't depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when you have features that are on completely different scales, or a mix of binary and continuous features.
The main downside of decision trees is that even with the use of pre-pruning, they tend to overfit
and provide poor generalization performance. Therefore, in most applications, the ensemble
methods we discuss next are usually used in place of a single decision tree.

Support Vector Machines

Support vector machines (often just referred to as SVMs) are an extension of linear models that allows for more complex models that are not defined simply by hyperplanes in the input space. There are support vector machines for both classification and regression.

There are two ways to map your data into a higher-dimensional space that are commonly used with
support vector machines: the polynomial kernel, which computes all possible polynomials up to a
certain degree of the original features (like feature1 ** 2 * feature2 ** 5); and the radial basis
function (RBF) kernel, also known as the Gaussian kernel. The Gaussian kernel is a bit harder to
explain, as it corresponds to an infinite-dimensional feature space. One way to explain the Gaussian
kernel is that it considers all possible polynomials of all degrees, but the importance of the features
decreases for higher degrees.
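A minimal RBF-kernel SVM sketch; gamma and C jointly control model complexity, and scaling the features first matters a great deal for SVMs (the dataset is chosen only for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then fit an SVM with the radial basis function (Gaussian) kernel
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("Test set accuracy:", svm.score(X_test, y_test))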

Strengths, weaknesses, and parameters

Support vector machines are powerful models and perform well on a variety of datasets. SVMs allow
for complex decision boundaries, even if the data has only a few features. They work well on low-
dimensional and high-dimensional data (i.e., few and many features), but don’t scale very well with
the number of samples. Running an SVM on data with up to 10,000 samples might work well, but
working with datasets of size 100,000 or more can become challenging in terms of runtime and
memory usage.

Another downside of SVMs is that they require careful pre-processing of the data and tuning of the parameters. This is why, these days, most people instead use tree-based models such as random forests or gradient boosting (which require little or no pre-processing) in many applications.

Furthermore, SVM models are hard to inspect; it can be difficult to understand why a particular
prediction was made, and it might be tricky to explain the model to a nonexpert. Still, it might be
worth trying SVMs, particularly if all of your features represent measurements in similar units (e.g.,
all are pixel intensities) and they are on similar scales.

The important parameters in kernel SVMs are the regularization parameter C, the choice of the
kernel, and the kernel-specific parameters. The RBF kernel has only one parameter, gamma, which is
the inverse of the width of the Gaussian kernel. gamma and C both control the complexity of the
model, with large values in either resulting in a more complex model. Therefore, good settings for
the two parameters are usually strongly correlated, and C and gamma should be adjusted together.

Neural Networks (Deep Learning)

A family of algorithms known as neural networks has recently seen a revival under the name “deep
learning.” While deep learning shows great promise in many machine learning applications, deep
learning algorithms are often tailored very carefully to a specific use case. One relatively simple deep learning method is the multilayer perceptron for classification and regression, which can serve as a starting point for more involved deep learning methods. Multilayer perceptrons (MLPs) are also known as feed-forward neural networks, or sometimes just neural networks.
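A minimal multilayer perceptron sketch with scikit-learn's MLPClassifier; hidden_layer_sizes sets the number of hidden layers and units per layer, and the values and dataset here are illustrative assumptions, not recommendations:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Neural networks also need careful pre-processing, so scale the features first
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
print("Test set accuracy:", mlp.score(X_test, y_test))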

Strengths, weaknesses, and parameters

Neural networks have re-emerged as state-of-the-art models in many applications of machine learning. One of their main advantages is that they are able to capture information contained in
large amounts of data and build incredibly complex models. Given enough computation time, data,
and careful tuning of the parameters, neural networks often beat other machine learning algorithms
(for classification and regression tasks).

The downsides: neural networks, particularly the large and powerful ones, often take a long time to train. They also require careful pre-processing of the data. Similarly to SVMs, they work best with "homogeneous" data, where all the features have similar meanings. For data that has very different kinds of features, tree-based models might work better. Tuning neural network parameters is also an art unto itself; these notes barely scratch the surface of possible ways to adjust neural network models and how to train them.

The most important parameters are the number of layers and the number of hidden units per layer. It is common to start with one or two hidden layers, and possibly expand from there. The number of nodes per hidden layer is often similar to the number of input features, but rarely higher than in the low to mid-thousands.

Summary

Here is a quick summary of when to use each model:

Nearest neighbors

For small datasets, good as a baseline, easy to explain.

Linear models

Go-to as a first algorithm to try, good for very large datasets, good for very high dimensional data.

Naive Bayes

Only for classification. Even faster than linear models, good for very large datasets and high-dimensional data. Often less accurate than linear models.

Decision trees

Very fast, don't need scaling of the data, can be visualized and easily explained.

Support vector machines

Powerful for medium-sized datasets of features with similar meaning. Require scaling of data,
sensitive to parameters.

Neural networks

Can build very complex models, particularly for large datasets. Sensitive to scaling of the data and to
the choice of parameters. Large models need a long time to train.

When working with a new dataset, it is in general a good idea to start with a simple model, such as a
linear model or a naive Bayes or nearest neighbors classifier, and see how far you can get. After
understanding more about the data, you can consider moving to an algorithm that can build more
complex models, such as random forests, gradient boosted decision trees, SVMs, or neural networks.
