ML Notes
Writing software is the bottleneck: we don't have enough good developers. Let the data do the work instead of people. Machine learning is a way to make programming scalable.
Traditional Programming: data and a program are run on the computer to produce the output.
Machine Learning: data and the output are run on the computer to create a program. This program can then be used in traditional programming.
Machine learning is like farming or gardening: the seeds are the algorithms, the nutrients are the data, the gardener is you, and the plants are the programs.
Applications of Machine Learning
Sample applications of machine learning:
Web search: ranking pages based on what you are most likely to click on.
Computational biology: rational design of drugs in the computer based on past experiments.
Finance: deciding who to send which credit card offers to, evaluating the risk of credit offers, and deciding where to invest money.
E-commerce: predicting customer churn; deciding whether or not a transaction is fraudulent.
Space exploration: space probes and radio astronomy.
Robotics: handling uncertainty in new environments; autonomy, e.g., self-driving cars.
Information extraction: asking questions over databases across the web.
Social networks: data on relationships and preferences; machine learning can extract value from that data.
Debugging: use in computer science problems like debugging, a labour-intensive process; a model could suggest where the bug might be.
Various Search/Optimization Algorithms
• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• etc.
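As a concrete illustration of the first two metrics in the list above, here is a minimal sketch in Python (the label vectors are made up for the example):

```python
# Minimal sketch: accuracy, precision, and recall for binary labels.
# The example label vectors below are made up, not from the notes.

def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many are found
    return accuracy, precision, recall

print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (0.6, 0.667, 0.667) approximately
```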
ML in Practice
• Understand domain, prior knowledge, and goals
• Data integration, selection, cleaning, pre-processing, etc.
• Learn models
• Interpret results
• Consolidate and deploy discovered knowledge
https://www.cmpe.boun.edu.tr/~ethem/i2ml3e/
For a given problem, the collection of all possible outcomes represents the sample space or
instance space.
The basic idea for creating a taxonomy of algorithms is that we divide the instance space in one of three ways: using a logical expression, using the geometry of the instance space, or using probability.
The outcome of the transformation of the instance space by a machine learning algorithm using the
above techniques should be exhaustive (cover all possible outcomes) and mutually exclusive (non-overlapping).
2. Logical models
There are mainly two kinds of logical models: Tree models and Rule models.
Rule models consist of a collection of implications or IF-THEN rules. For tree-based models, the ‘if-part’ defines a segment and the ‘then-part’ defines the behaviour of the model for this segment.
Rule models follow the same reasoning.
Tree models can be seen as a particular type of rule model where the if-parts of the rules are
organised in a tree structure. Both Tree models and Rule models use the same approach to
supervised learning. The approach can be summarised in two strategies: we could first find the
body of the rule (the concept) that covers a sufficiently homogeneous set of examples and then
find a label to represent the body. Alternatively, we could approach it from the other direction, i.e.,
first select a class we want to learn and then find rules that cover examples of the class.
A simple tree-based model built on the Titanic dataset illustrates this. The tree shows survival numbers of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard), and the values under the leaves show the probability of survival and the percentage of observations in the leaf. The model can be summarised as: your chances of survival were good if you were (i) a female, or (ii) a male younger than 9.5 years with sibsp less than 2.5.
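Read as IF-THEN rules, that summary translates directly into code. A minimal sketch (the function name and output strings are illustrative; the thresholds follow the summary above):

```python
# The Titanic tree from the text, written out as its equivalent IF-THEN rules.
# Thresholds (age < 9.5, sibsp < 2.5) come from the model summary above.

def predict_survival(sex, age, sibsp):
    if sex == "female":
        return "survived"                 # rule (i): females
    if age < 9.5 and sibsp < 2.5:
        return "survived"                 # rule (ii): young males with few siblings/spouses
    return "did not survive"

print(predict_survival("male", 8, 1))     # survived
print(predict_survival("male", 30, 0))    # did not survive
```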
A Concept Learning Task called “Enjoy Sport” is defined by a set of data from some example days. Each example is described by six attributes. The task is to learn to predict the value of Enjoy Sport for an arbitrary day based on the values of its attributes. The problem can be represented by a series of hypotheses, each described by a conjunction of constraints on the attributes. The training data represents a set of positive and negative examples of the target function. In this example, each hypothesis is a vector of six constraints, specifying the values of the six attributes – Sky, AirTemp, Humidity, Wind, Water, and Forecast. The training phase involves learning the set of days (as a conjunction of attributes) for which Enjoy Sport = yes.
Thus, the problem can be formulated as:
Given instances X which represent a set of all possible days, each described by the attributes:
Sky – (values: Sunny, Cloudy, Rainy),
AirTemp – (values: Warm, Cold),
Humidity – (values: Normal, High),
Wind – (values: Strong, Weak),
Water – (values: Warm, Cold),
Forecast – (values: Same, Change).
Try to identify a function that can predict the target variable Enjoy Sport as yes/no, i.e., 1 or 0.
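One classic strategy for learning such a conjunctive hypothesis is Find-S: start from the first positive example and generalise each constraint to '?' (any value) only when a later positive example contradicts it. A minimal sketch, with illustrative training rows drawn from the attribute values above:

```python
# Find-S sketch for the Enjoy Sport task: keep the most specific conjunction
# of attribute constraints that covers every positive example.
# '?' means the attribute may take any value. Training rows are illustrative.

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "no"),
    (("Sunny", "Warm", "High",   "Strong", "Cold", "Change"), "yes"),
]

hypothesis = None
for instance, label in data:
    if label != "yes":
        continue                        # Find-S ignores negative examples
    if hypothesis is None:
        hypothesis = list(instance)     # start with the first positive example
    else:
        hypothesis = [h if h == v else "?" for h, v in zip(hypothesis, instance)]

print(hypothesis)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```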
3. Geometric models
In the previous section, we have seen that with logical models, such as decision trees, a logical
expression is used to partition the instance space. Two instances are similar when they end up in
the same logical segment. In this section, we consider models that define similarity by considering
the geometry of the instance space. In Geometric models, features could be described as points in
two dimensions (x- and y-axis) or a three-dimensional space (x, y, and z). Even when features are
not intrinsically geometric, they could be modelled in a geometric manner (for example,
temperature as a function of time can be modelled in two axes). In geometric models, there are two
ways we could impose similarity.
We could use geometric concepts like lines or planes to segment (classify) the instance space.
These are called Linear models.
Alternatively, we can use the geometric notion of distance to represent similarity. In this case, if two
points are close together, they have similar values for features and thus can be classed as similar.
We call such models Distance-based models.
3.1 Linear models
Linear models are relatively simple. In this case, the function is represented as a linear
combination of its inputs. Thus, if x1 and x2 are two scalars or vectors of the same dimension and
a and b are arbitrary scalars, then ax1 + bx2 represents a linear combination of x1 and x2. In the
simplest case where f(x) represents a straight line, we have an equation of the form f (x) = mx + c
where c represents the intercept and m represents the slope.
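As a minimal sketch of learning m and c from data (synthetic data; the fit uses NumPy's least-squares polynomial fit):

```python
import numpy as np

# Minimal sketch: learn the two parameters m (slope) and c (intercept)
# of f(x) = mx + c from noisy data by least squares. The data is synthetic.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)  # true m = 2, c = 1

m, c = np.polyfit(x, y, deg=1)   # a degree-1 polynomial fit is a straight line
print(f"learned m={m:.2f}, c={c:.2f}")  # close to 2 and 1
```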
Linear models are parametric, which means that they have a fixed form with a small number of
numeric parameters that need to be learned from data. For example, in f (x) = mx + c, m and c are
the parameters that we are trying to learn from the data. This technique is different from tree or rule
models, where the structure of the model (e.g., which features to use in the tree, and where) is not
fixed in advance.
Linear models are stable, i.e., small variations in the training data have only a limited impact on the
learned model. In contrast, tree models tend to vary more with the training data, as the choice of a
different split at the root of the tree typically means that the rest of the tree is different as well. As a
result of having relatively few parameters, Linear models have low variance and high bias. This
implies that Linear models are less likely to overfit the training data than some other models.
However, they are more likely to underfit. For example, if we want to learn the boundaries between
countries based on labelled data, then linear models are not likely to give a good approximation.
In this section, we could also use algorithms that include kernel methods, such as support vector
machine (SVM). Kernel methods use the kernel function to transform data into another dimension
where easier separation can be achieved for the data, such as using a hyperplane for SVM.
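A minimal sketch of this idea, assuming scikit-learn is available (the two-ring dataset is synthetic and cannot be separated by any straight line in the original space):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Minimal sketch: an RBF-kernel SVM separating data that no straight
# line can separate (two concentric rings). The dataset is synthetic.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf")   # the kernel implicitly maps the data non-linearly
clf.fit(X, y)
print(clf.score(X, y))    # training accuracy, close to 1.0
```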
3.2 Distance-based models
Distance is applied through the concepts of neighbours and exemplars. Neighbours are points in proximity with respect to the distance measure expressed through exemplars. Exemplars are either
centroids that find a centre of mass according to a chosen distance metric or medoids that find the
most centrally located data point. The most commonly used centroid is the arithmetic mean, which
minimises squared Euclidean distance to all other points.
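A minimal sketch contrasting the two kinds of exemplar on made-up points:

```python
import numpy as np

# Minimal sketch: centroid (arithmetic mean) vs. medoid (the actual data
# point with the smallest total distance to all others). Points are made up.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

centroid = points.mean(axis=0)   # need not coincide with any data point
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[dists.sum(axis=1).argmin()]  # always an actual data point

print(centroid)  # [1.5 1.5]
print(medoid)    # [1. 0.] (ties with [0. 1.]; argmin keeps the first)
```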
Notes:
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of all the points in the figure. This definition extends to any object in n-dimensional space: its centroid is the mean position of all the points.
Medoids are similar in concept to means or centroids. Medoids are most commonly used on data
when a mean or centroid cannot be defined. They are used in contexts where the centroid is not
representative of the dataset, such as in image data.
Examples of distance-based models include the nearest-neighbour models, which use the training
data as exemplars – for example, in classification. The K-means clustering algorithm also uses
exemplars to create clusters of similar data points.
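A minimal sketch of a nearest-neighbour classifier, where the training data itself serves as the exemplars (all points and labels are made up):

```python
import numpy as np

# Minimal sketch: 1-nearest-neighbour classification. The training data
# acts directly as the exemplars. All points and labels are made up.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]])
y_train = np.array(["A", "A", "B", "B"])

def predict(x):
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each exemplar
    return y_train[dists.argmin()]               # label of the closest exemplar

print(predict(np.array([1.1, 0.9])))  # "A"
print(predict(np.array([5.5, 6.0])))  # "B"
```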
4. Probabilistic models
The third family of machine learning algorithms is the probabilistic models. We have seen before
that the k-nearest neighbour algorithm uses the idea of distance (e.g., Euclidean distance) to
classify entities, and logical models use a logical expression to partition the instance space. In this
section, we see how the probabilistic models use the idea of probability to classify new entities.
Probabilistic models see features and target variables as random variables. The process of
modelling represents and manipulates the level of uncertainty with respect to these variables.
There are two types of probabilistic models: Predictive and Generative. Predictive probability
models use the idea of a conditional probability distribution P(Y | X) from which Y can be predicted from X. Generative models estimate the joint distribution P(Y, X). Once we know the joint
distribution for the generative models, we can derive any conditional or marginal distribution
involving the same variables. Thus, the generative model is capable of creating new data points
and their labels, knowing the joint probability distribution. The joint distribution captures the relationship between the variables; once this relationship is inferred, it is possible to generate new data points.
The goal of any probabilistic classifier is, given a set of features (x_0 through x_n) and a set of classes (c_0 through c_k), to determine the probability of the features occurring in each class and to return the most likely class. Therefore, for each class, we need to calculate P(c_i | x_0, …, x_n).
The Naïve Bayes algorithm is based on the idea of Conditional Probability. Conditional probability
is based on finding the probability that something will happen, given that something else has
already happened. The task of the algorithm then is to look at the evidence, determine the likelihood of a specific class, and assign a label to each entity accordingly.
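A minimal sketch of this idea on a made-up toy table: estimate the prior P(c) and the likelihoods P(x_i | c) by counting, then return the class with the highest product (the "naïve" step is treating the features as conditionally independent given the class):

```python
# Minimal Naive Bayes sketch on a made-up toy dataset.
# P(c | x) is proportional to P(c) * product_i P(x_i | c).
# No smoothing, so unseen feature values get probability zero.

data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]

def predict(x):
    best, best_score = None, -1.0
    for c in {label for _, label in data}:
        rows = [f for f, label in data if label == c]
        score = len(rows) / len(data)                  # prior P(c)
        for feat, value in x.items():                  # likelihoods P(x_i | c)
            score *= sum(f[feat] == value for f in rows) / len(rows)
        if score > best_score:
            best, best_score = c, score
    return best

print(predict({"outlook": "rainy", "windy": "yes"}))  # "stay"
```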
Supervised learning
Unsupervised learning
Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the dataset and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.
Supervised learning can be grouped further in two categories of algorithms:
Classification
Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or a group of objects with
similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful
insights from large amounts of data. It can be further classified into two categories of
algorithms:
Clustering
Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the
agent interacts with the environment and explores it. The goal of an agent is to get the most reward
points, and hence, it improves its performance.
A robotic dog that automatically learns the movements of its limbs is an example of reinforcement learning.
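A minimal sketch of the reward/penalty idea: tabular Q-learning on a toy one-dimensional corridor (all states, rewards, and hyperparameters here are made up for illustration):

```python
import random

# Minimal sketch: tabular Q-learning on a toy 1-D corridor. States 0..4;
# reaching state 4 gives reward +1, every other step a small penalty.
# All numbers here are illustrative, not from the notes.
N_STATES, ACTIONS = 5, [-1, +1]        # actions: move left or right
alpha, gamma, eps = 0.5, 0.9, 0.1      # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: usually exploit the best known action, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else -0.01    # reward or penalty feedback
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda act: Q[(0, act)]))    # 1: the agent learned to move right
```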
In supervised learning, the training data provided to the machine works as the supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
How Supervised Learning Works
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out portion of the data not used during training), and then it predicts the output.
The working of supervised learning can be easily understood by the following example:
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
If the given shape has three sides, then it will be labelled as a triangle.
If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
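The "trained model" in this example is nothing more than the labelling rules above. A minimal sketch (function name and rules follow the example):

```python
# Minimal sketch of the shape example: after training, the model amounts to
# a mapping from the number of (equal) sides to a label.

def classify(num_sides, all_sides_equal=True):
    if num_sides == 3:
        return "triangle"
    if num_sides == 4 and all_sides_equal:
        return "square"
    if num_sides == 6 and all_sides_equal:
        return "hexagon"
    return "unknown shape"

print(classify(4))  # square
print(classify(3))  # triangle
```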
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. They are used for the prediction of continuous variables, such as weather forecasting and market trends. Below are some popular regression algorithms which come under supervised
learning:
Linear Regression
Regression Trees
Non-Linear Regression
Bayesian Linear Regression
Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means the output falls into classes such as Yes/No, Male/Female, True/False, etc. Spam filtering is a typical example. Below are some popular classification algorithms which come under supervised learning:
Random Forest
Decision Trees
Logistic Regression
Support Vector Machines
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because
unlike supervised learning, we have the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given dataset,
which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
Unsupervised learning is helpful for finding useful insights from the data.
Unsupervised learning is similar to how a human learns to think from their own experiences, which makes it closer to real AI.
Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
In the real world, we do not always have input data with the corresponding output; to solve such cases, we need unsupervised learning.
Working of Unsupervised Learning
The working of unsupervised learning can be understood as follows:
Here, we have taken unlabeled input data, which means it is not categorized and corresponding outputs are also not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find hidden patterns, and then suitable algorithms such as k-means clustering or hierarchical clustering are applied.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according
to the similarities and differences between the objects.
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them as per the presence and absence of those commonalities.
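A minimal k-means sketch on made-up two-dimensional points: alternate between assigning each point to its nearest centre and recomputing each centre as the mean of its points (the sketch skips the empty-cluster check a real implementation would need):

```python
import numpy as np

# Minimal k-means sketch (k = 2) on made-up 2-D points drawn from two blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

centres = X[rng.choice(len(X), size=2, replace=False)]  # random initial centres
for _ in range(10):
    # assign each point to its nearest centre, then move centres to the means
    labels = np.linalg.norm(X[:, None] - centres[None], axis=-1).argmin(axis=1)
    centres = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(centres.round(1))  # roughly one centre near (0, 0) and one near (5, 5)
```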
Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategy more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
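A minimal sketch of the bread-and-butter example: the support and confidence of the rule {bread} -> {butter}, computed on made-up baskets:

```python
# Minimal sketch of an association rule on made-up market-basket data:
# support and confidence of the rule {bread} -> {butter}.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

both = sum(1 for b in baskets if {"bread", "butter"} <= b)
bread = sum(1 for b in baskets if "bread" in b)

support = both / len(baskets)   # how often bread and butter co-occur overall
confidence = both / bread       # estimate of P(butter | bread)
print(support, confidence)      # 0.5 0.666...
```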
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
K-means clustering
Hierarchical clustering
Anomaly detection
Neural Networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition
https://www.javatpoint.com/reinforcement-learning
https://bloomberg.github.io/foml/#lectures