ML 1-6
Introduction to Machine Learning
Introduction
To solve a problem on a computer, we need an algorithm.
An algorithm is a sequence of instructions that should be
carried out to transform the input to output.
For example, one can devise an algorithm for sorting.
For some tasks, however, we do not have an algorithm—for
example, to tell spam emails from legitimate emails.
What we lack in knowledge, we make up for in data.
We can easily compile thousands of example messages, some
of which we know to be spam, and what we want is to
“learn” what constitutes spam from them.
Machine Learning
Definition
Machine learning is programming computers to optimize
a performance criterion using example data or past
experience.
Regression
Regression is the prediction of a numeric value.
Clustering
In unsupervised learning, there’s no label or target value given for the
data.
A task where we group similar items together is known as clustering.
Types of Machine Learning
[Diagram: branches of machine learning — Classification, Regression (Linear, Multivariate), Clustering (K-means, PCA), Association, and Semi-Supervised Learning.]
Are you trying to fit your data into some discrete groups? If so and
that’s all you need, you should look into clustering.
Do you need to have some numerical estimate of how strong the fit
is into each group? If you answer yes, then you probably should look
into a density estimation algorithm.
You should spend some time getting to know your data, and the
more you know about it, the better you’ll be able to build a
successful application.
Steps in developing a machine learning application
Collect data
Prepare the input data
Analyse the input data
Train the algorithm
Test the algorithm
Use it
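As a rough illustration of these steps, here is a minimal sketch in Python using scikit-learn; the dataset, model choice, and split ratio below are illustrative assumptions, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Collect data (here: a built-in toy dataset)
X, y = load_iris(return_X_y=True)

# 2./3. Prepare and analyse the input data (simple scaling as an example)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 4. Train the algorithm
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# 5. Test the algorithm
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Use it on a new, unseen measurement
print("prediction:", model.predict(scaler.transform([[5.1, 3.5, 1.4, 0.2]])))
```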
Training Error and Testing Error
There are two important concepts used in machine learning: the
training error and the test error.
[Figure: prediction error for training and test sets]
Bias and Variance
{Bias refers to the difference between the value predicted by the model (PV) and the actual value (AV).}
Low Bias: Suggests fewer assumptions about the form of the target
function. {Smaller gap between PV & AV}
High Bias: Suggests more assumptions about the form of the target
function. {Larger gap between PV & AV}
Learning with Regression and Trees
Learning with Regression
Regression
If two variables are closely related we may be interested in
estimating (predicting) the value of one variable given the value of
another.
For example, if advertising and sales are correlated, we can find the
expected amount of sales for a given advertising expenditure, or the
required amount of expenditure for attaining a given amount of
sales.
Similarly, if we know that the yield of rice and rainfall are closely
related, we may find out the amount of rain required to achieve a
certain production figure.
Regression analysis reveals the average relationship between two
variables, and this makes estimation or prediction possible.
The dictionary meaning of the term 'regression' is the act of returning
or going back.
Regression Contd…
The variable which is used to predict the variable of interest is called
the independent variable or explanatory variable, and the variable we
are trying to predict is called the dependent variable or explained
variable.
The independent variable is denoted by X and the dependent variable
by Y.
The analysis used is called the simple linear regression analysis-
simple because there is only one predictor or independent variable,
and linear because of the assumed linear relationship between the
dependent and independent variables.
Regression Contd…
Simple Regression Equation of Y on X
The regression equation of Y on X is expressed as follows:
Y=aX + b
It may be noted that in this equation 'Y' is the dependent variable, i.e.,
its value depends on X. 'X' is the independent variable, i.e., we can take
a given value of X and compute the value of Y.
‘b’ is "Y-intercept" because its value is the point at which the
regression line crosses the Y-axis, that is, the vertical axis.
‘a’ is the slope of the line. It represents the change in the Y variable for a unit
change in the X variable.
‘a’ and 'b' in the equation are called numerical constants because for
any given straight line, their value does not change.
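A quick numerical sketch of fitting Y = aX + b by least squares; the data values below (advertising vs. sales) are made up purely for illustration.

```python
import numpy as np

# Illustrative data: advertising expenditure (X) and sales (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Least-squares estimates of the slope 'a' and intercept 'b'
a = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - a * X.mean()

print(f"Y = {a:.2f} X + {b:.2f}")        # fitted regression line
print("predicted Y at X = 6:", a * 6 + b)  # prediction for a new X value
```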
Sums # 1
Sums # 2
Multiple Linear Regression
Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
Multiple regression is an extension of ordinary least squares (OLS) linear
regression, which uses just one explanatory variable.
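A minimal sketch of multiple linear regression with scikit-learn; the two features and the numbers below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two explanatory variables (e.g. advertising spend and number of outlets)
X = np.array([[10, 2], [12, 3], [15, 3], [18, 4], [20, 5]])
y = np.array([25, 30, 34, 41, 45])   # response variable, e.g. sales

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)    # one coefficient per explanatory variable
print("intercept:", model.intercept_)
print("prediction for [16, 4]:", model.predict([[16, 4]]))
```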
Multiple vs Multivariate Linear Regression
Sums - Multiple Linear Regression
Sums – Multivariate Linear Regression
Logistic Regression
This type of statistical model (also known as logit model) is often
used for classification and predictive analytics.
Logistic regression estimates the probability of an event occurring,
such as voted or didn’t vote, based on a given dataset of
independent variables.
Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.
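The slides do not reproduce the formula; the standard form of the logistic (sigmoid) function that maps the linear combination of inputs to a probability between 0 and 1 is:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_k x_k)}}
```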
Logistic Regression Contd…
Sums - Logistic Regression
Learning with Trees
Decision Tree
Decision Tree is the most powerful and popular tool for classification
and prediction.
A Decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds a class
label.
Strengths and Weaknesses
The strengths of decision tree methods are:
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much computation.
Decision trees are able to handle both continuous and categorical variables.
Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods :
Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
Decision tree can be computationally expensive to train. The process of growing a
decision tree is computationally expensive. At each node, each candidate splitting
field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining
weights. Pruning algorithms can also be expensive since many candidate sub-trees
must be formed and compared.
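A minimal decision tree sketch with scikit-learn, included for illustration; the dataset and the depth limit are assumptions, not part of the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limit the depth to keep the tree small and the learned rules readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the learned rules as readable attribute tests
```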
Entropy in Information Theory
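The slide's formula is not reproduced here; the usual Shannon entropy of a set S whose classes occur with proportions p_i is:

```latex
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
```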
Root Mean Squared Error (RMSE) is the square root of the Mean Squared
Error. It measures the standard deviation of the residuals.
MAE, MAPE, MSE, RMSE and R2
Accuracy
Confusion Matrix
Precision
Recall
F1 score
AUC/ROC
Kappa
Performance Metrics for Classification
Accuracy
The overall accuracy of a model is simply the number of correct predictions
divided by the total number of predictions. An accuracy score will give a
value between 0 and 1, a value of 1 would indicate a perfect model.
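A small sketch of computing accuracy and a confusion matrix with scikit-learn; the label vectors below are made-up examples.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))            # correct / total
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))                   # precision, recall, F1
```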
Ensemble Learning
K-Fold Validation
K-Fold Cross Validation
Machine learning model performance assessment is much like averaging
scores over several attempts.
The model is validated multiple times based on the value assigned as a
parameter, called K, which must be an integer.
Put simply, based on the K value the dataset is divided into parts, and
training/testing is conducted in sequence K times.
K-Fold Validation
The general process of k-fold cross-validation for evaluating a model’s
performance is:
1. Shuffle the dataset and split it into K folds of roughly equal size.
2. Hold out one fold as the test set and train the model on the remaining K-1 folds.
3. Evaluate the model on the held-out fold and record the score.
4. Repeat until every fold has been used once as the test set, then average the K scores.
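A sketch of this process using scikit-learn's cross-validation helpers; the model and the value K = 5 are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K = 5: the data is split into 5 folds; each fold is held out once for testing
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)

print("per-fold scores:", scores)
print("mean accuracy:", scores.mean())
```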
Definition :
Ensemble learning is the process by which multiple models, such as
classifiers or experts, are strategically generated and combined to solve
a particular computational intelligence problem.
Ensemble learning is primarily used to improve the (classification,
prediction, function approximation, etc.) performance of a model, or
reduce the likelihood of an unfortunate selection of a poor one.
Imagine the fable of the blind men and the elephant. Each of the blind men had
his own description of the elephant. Even though each description was true, it
would have been better for them to come together and discuss their
understanding before reaching a final conclusion. This story perfectly describes
the ensemble learning method.
Ensemble learning Types / Ways to Combine Classifiers
Ensemble learning Types
Parallel Ensemble Learning(Bagging)
Bagging is a machine learning ensemble meta-algorithm intended to improve the stability
and accuracy of machine learning algorithms used for classification and regression.
It additionally helps reduce over-fitting.
Parallel ensemble methods where the base learners are generated in parallel
Algorithms : Random Forest, Bagged Decision Trees, Extra Trees
Parallel Ensemble Learning(Bagging)
“Standard” bagging: each of the T subsamples has size n and is created by
sampling with replacement.
Below are some points that explain why we should use the Random
Forest algorithm:
It takes less training time compared to other algorithms.
It predicts output with high accuracy and runs efficiently even on large
datasets.
It can also maintain accuracy when a large proportion of data is
missing.
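A minimal Random Forest sketch (a bagging-style ensemble of decision trees); the dataset and the number of trees are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```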
Sequential Ensemble learning (Boosting)
Boosting is a machine learning ensemble meta-algorithm for principally
reducing bias, and furthermore variance, in supervised learning, and a
group of machine learning algorithms that convert weak learners into
strong ones.
Sequential ensemble methods where the base learners are generated
sequentially.
Example : Adaboost, Stochastic Gradient Boosting
Boosting
Boosting is an ensemble modelling technique that attempts to build a
strong classifier from a number of weak classifiers.
It is done by building a model using weak models in series.
Firstly, a model is built from the training data.
Then the second model is built which tries to correct the errors
present in the first model.
This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum
number of models are added.
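A sketch of sequential boosting with AdaBoost, where each new weak learner focuses on the examples the previous ones got wrong; the dataset and parameters are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default weak learner is a depth-1 decision stump; 50 of them are added
# in sequence, each re-weighting the examples the previous learners misclassified.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))
```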
Boosting
Gradient Boosting
Gradient Boosting is a popular boosting algorithm. In gradient
boosting, each predictor corrects its predecessor’s error.
Example: the Random Forest model uses bagging, while AdaBoost uses boosting techniques.
Stacking & Blending
Stacking is a way of combining multiple models, that introduces the
concept of a meta learner. It is less widely used than bagging and boosting.
Unlike bagging and boosting, stacking may be (and normally is) used to
combine models of different types.
The procedure is as follows:
1. Split the training set into two disjoint sets.
2. Train several base learners on the first part.
3. Test the base learners on the second part.
4. Using the predictions from step 3 as the inputs, and the correct responses as
the outputs, train a higher-level learner.
Example : Voting Classifier
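A minimal stacking sketch with scikit-learn, combining models of different types under a meta learner; the base models and the logistic-regression meta learner are illustrative choices, not the slides' example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners of different types; their predictions become inputs to the meta learner
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("svm", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_train, y_train)

print("test accuracy:", stack.score(X_test, y_test))
```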
Support vectors are the data points that lie closest to the
decision surface (or hyperplane)
In simplistic terms:
The kernel maps data points that are not linearly separable into a space
where they become linearly separable, so that the SVM can divide the two
classes with a hyperplane.
Contd…
Kernels
There are several kernel functions used for SVMs. Some of the
popular ones are:
Gaussian Radial Basis Function (RBF): K(x, x′) = exp(−γ‖x − x′‖²),
where γ > 0.
A special case is γ = 1/(2σ²), which gives the
Gaussian Kernel: K(x, x′) = exp(−‖x − x′‖² / (2σ²)).
Kernels
Polynomial Kernel: K(x, x′) = (x · x′ + c)^d, where d is the degree and c ≥ 0 is a constant.
Sigmoid Kernel: K(x, x′) = tanh(κ x · x′ + c).
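A short sketch comparing these kernels with scikit-learn's SVC; the toy dataset and default kernel parameters are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))
```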
Support Vector Regression
Support Vector Regression is a supervised learning algorithm that is
used to predict continuous values.
Support Vector Regression uses the same principle as the SVMs.
The basic idea behind SVR is to find the best-fit line.
In SVR, the best-fit line is the hyperplane that contains the maximum
number of points within the ε-insensitive margin.
Support Vector Regression
Unlike other Regression models that try to minimize the error
between the real and predicted value, the SVR tries to fit the best
line within a threshold value.
The threshold value is the distance between the hyperplane and
boundary line.
The fit time complexity of SVR is more than quadratic in the
number of samples, which makes it hard to scale to datasets with
more than a couple of tens of thousands of samples.
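A minimal SVR sketch with scikit-learn; the synthetic data, kernel, and epsilon threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# epsilon defines the "tube" around the fitted function within which errors are ignored
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("prediction at x = 2.0:", svr.predict([[2.0]]))
```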
Support Vector Regression
Multiclass Classification
In its most basic form, SVM doesn’t support multiclass classification.
For multiclass classification, the same principle is utilized after breaking
down the multi-classification problem into smaller subproblems, all of
which are binary classification problems.
Then, to predict the output for a new input, we predict with each of
the built SVMs and find which one puts the prediction the
farthest into the positive region (this behaves as a confidence criterion
for a particular SVM).
One vs All (OVA)
In the One vs All approach, we try to find a hyperplane that
separates each class from all the other classes. This means the separation
takes all points into account and divides them into two groups: one
group for the points of the class in question and the other group for all
remaining points.
For example, here, the green line tries to maximize the gap
between the green points and all other points at once.
One vs All (OVA)
There are some challenges in training these N SVMs, which are:
Example:
An international online catalog company wishes to group its
customers based on common features.
Company management does not have any predefined labels for these
groups.
Based on the outcome of the grouping, they will target marketing
and advertising campaigns to the different groups.
The information they have about the customers includes income,
age, number of children, marital status, location of house and
education among others.
Introduction
Density-based
Distribution-based
Centroid-based
Hierarchical-based
Density-based
Sum
Divisive Algorithm
With divisive clustering, all items are initially placed in one cluster
and clusters are repeatedly split in two until all items are in their
own cluster.
The idea is to split up clusters where some elements are not
sufficiently close to other elements.
K-Means Clustering
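The slide's worked example is not reproduced here; as a rough illustration, a minimal k-means sketch with scikit-learn (the synthetic data and k = 3 are assumptions).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means repeatedly assigns points to the nearest centroid and recomputes the centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("labels of the first ten points:", kmeans.labels_[:10])
```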
Dimensionality Reduction
Introduction
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these
features is called dimensionality reduction.
By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
Less computation and training time is required with reduced dimensions of
features.
Reduced dimensions of features of the dataset help in visualizing the data
quickly.
It removes the redundant features (if present) by taking care of
multicollinearity.
Disadvantages of Dimensionality Reduction
Filter Methods
Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a
machine learning model for its evaluation.
In this method, a subset of features is fed to the ML model and its
performance is evaluated.
The performance decides whether to add or remove those features to
increase the accuracy of the model.
This method is more accurate than the filter method, but it is more complex
to work with.
Some common techniques of wrapper methods are:
Forward Selection
Backward Selection
Bi-directional Elimination
Embedded methods
LASSO
Elastic Net
Ridge Regression, etc.
Feature extraction
After choosing the axis covering the most variability, we choose the
next axis, which has the second most variability, provided it’s
perpendicular to the first axis.
The real term used is orthogonal.
On this two-dimensional plot, perpendicular and orthogonal are the
same.
In figure, line C would be our second axis.
With PCA, we’re rotating the axes so that they’re lined up with the
most important directions from the data’s perspective.
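A minimal PCA sketch with scikit-learn, showing the rotated axes (principal components) and the variance each one explains; the dataset choice is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep the two orthogonal directions with the most variability
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_reduced.shape)   # (150, 2) instead of (150, 4)
```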
Linear Discriminant Analysis
Example:
Suppose we have two sets of data points belonging to two different
classes that we want to classify.
As shown in the given 2D graph, when the data points are plotted on
the 2D plane, there’s no straight line that can separate the two
classes of the data points completely.
Hence, in this case, LDA (Linear Discriminant Analysis) is used which
reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.
Linear Discriminant Analysis
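The closing slide's figure is not reproduced here; as a hedged illustration, a minimal LDA sketch with scikit-learn reducing labelled data to one dimension to maximize class separability (the dataset choice is an assumption).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project onto one axis chosen to maximize the separation between the class means
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print("reduced shape:", X_1d.shape)              # (150, 1)
print("classification accuracy:", lda.score(X, y))
```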