Unit 1 - Machine Learning
Machine learning
Machine learning is a branch of science that deals with programming the systems in such a way
that they automatically learn and improve with experience. Here, learning means recognizing
and understanding the input data and making wise decisions based on the supplied data.
It is very difficult to hand-code the right decision for every possible input. To tackle this
problem, ML algorithms are developed. These are algorithms that can learn from data and improve
from experience, without human intervention. These algorithms build knowledge from specific
data and experience using the principles of statistics, probability theory, logic, combinatorial
optimization, search, reinforcement learning, and control theory. The developed algorithms
form the basis of various applications such as:
• Vision processing
• Language processing
• Forecasting (e.g., stock market trends)
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics
Learning
Learning tasks may include learning the function that maps the input to the output, learning the
hidden structure in unlabelled data or ‘instance-based learning’, where a class label is produced
for a new instance by comparing the new instance (row) to instances from the training data,
which were stored in memory. ‘Instance-based learning’ does not create an abstraction from
specific instances.
Types of ML
Supervised learning
Supervised learning deals with learning a function from available training data. A supervised
learning algorithm analyses the training data and produces an inferred function, which can be
used for mapping new examples.
From the labelled training data, the computer should be able to learn the patterns.
Supervised learning algorithms try to model the relationships and dependencies
between the target output and the input features, so that the output values for
new data can be predicted from the relationships learned from previous data sets.
Common supervised learning algorithms include:
• Nearest Neighbour
• Naive Bayes
• Decision Trees
• Linear Regression
• Support Vector Machines (SVM)
• Neural Networks
Unsupervised learning
Unsupervised learning makes sense of unlabelled data without any predefined labelled dataset
for its training. Unsupervised learning is an extremely powerful tool for analysing available
data and looking for patterns and trends. It is most commonly used for clustering similar inputs
into logical groups. Common unsupervised learning algorithms include:
• k-means
• Self-organizing maps
• Hierarchical clustering
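As a rough illustration of clustering, the sketch below implements a bare-bones k-means in plain Python. The sample points, the function name k_means, and the choice of the first k points as initial centroids are assumptions made for this example only.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical unlabelled data: two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels are supplied anywhere: the grouping comes only from the distances between the points. Real implementations usually pick the initial centroids more carefully and stop once the assignments no longer change.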
Semi-supervised Learning
In the previous two types, labels are either present for all the observations in the dataset or
absent for all of them. Semi-supervised learning falls in between these two.
In many practical situations, the cost of labelling is quite high, since it requires skilled human
experts. So, when labels are absent for most of the observations but present for a few,
semi-supervised algorithms are the best candidates for model building. These methods
exploit the idea that even though the group memberships of the unlabelled data are unknown,
this data carries important information about the group parameters.
Reinforcement Learning
Reinforcement learning aims at using observations gathered from interaction with the
environment to take actions that would maximize the reward or minimize the risk. The
reinforcement learning algorithm (called the agent) continuously learns from the environment
in an iterative fashion. In the process, the agent learns from its experiences of the
environment until it explores the full range of possible states.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement
Learning is defined by a specific type of problem, and all its solutions are classed as
Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best
action to select based on its current state. When this step is repeated, the problem is known
as a Markov Decision Process.
In order to produce intelligent programs (also called agents), reinforcement learning relies
on algorithms such as:
• Q-Learning
• Temporal Difference (TD)
• Deep Adversarial Networks
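To make the agent, state, action and reward vocabulary concrete, here is a minimal sketch of tabular Q-learning. The corridor environment, the reward of 1 at the last state, and all hyperparameter values are hypothetical choices for this illustration.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1; action 0 moves left, action 1 moves right.
    Reaching the last state ends the episode with reward 1.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.choice([0, 1])
            else:
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            next_state = max(state - 1, 0) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', a').
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy for the non-terminal states (1 means "move right").
policy = [row.index(max(row)) for row in q_table[:-1]]
print(policy)
```

After training, the greedy policy should choose "move right" in every non-terminal state, since that is the shortest path to the reward.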
Applications of ML:
Google Maps is probably the app we use whenever we go out and need assistance with
directions and traffic. The other day I was traveling to another city and took the expressway,
and Maps suggested: “Despite the heavy traffic, you are on the fastest route”. It is a
combination of people currently using the service, historic data for that route collected over
time, and a few tricks acquired from other companies. Everyone using Maps provides their
location, average speed and the route on which they are traveling, which in turn helps Google
collect massive amounts of data about the traffic, letting it predict upcoming traffic and adjust
your route accordingly.
Facebook uses face detection and image recognition to automatically find the face of a
person that matches its database, and hence suggests that we tag that person based on
DeepFace. Facebook’s deep learning project DeepFace is responsible for recognizing faces
and identifying which person is in the picture. It also provides Alt Tags (Alternative Tags) for
images already uploaded on Facebook. For example, if we inspect an image on
Facebook, the alt tag contains a description of it.
Virtual Personal Assistants assist in finding useful information when asked via text or
voice. A few of the major applications of Machine Learning here are:
• Speech Recognition
• Speech to Text Conversion
• Natural Language Processing
• Text to Speech Conversion
All you need to do is ask a simple question like “What is my schedule for tomorrow?” or
“Show my upcoming flights”. To answer, your personal assistant searches for
information or recalls your related queries to collect info.
Google Translate
Think of the time you travelled to a new place and found it difficult to communicate with the
locals, or to find local spots where everything was written in a different language.
Google’s GNMT (Google Neural Machine Translation) is a neural machine translation system that
works across many languages and dictionaries and uses Natural Language Processing to
provide the most accurate translation it can of any sentence or words. Since the tone of the words
also matters, it uses other techniques like POS tagging, NER (Named Entity Recognition) and
chunking. It is one of the best and most used applications of Machine Learning.
Fraud Detection
Fraud Detection is one of the most necessary applications of Machine Learning. The number
of transactions has increased due to a plethora of payment channels: credit/debit cards,
smartphones, numerous wallets, UPI and much more. At the same time, criminals have
become adept at finding loopholes. Whenever a customer carries out a transaction,
the Machine Learning model thoroughly x-rays their profile, searching for suspicious
patterns. In Machine Learning, problems like fraud detection are usually framed as
classification problems.
Introduction to Neural Networks
Neural networks are parallel computing devices, which are basically an attempt to make a
computer model of the brain. The main objective is to develop a system that performs various
computational tasks faster than traditional systems. These tasks include pattern recognition
and classification, approximation, optimization, and data clustering.
An Artificial Neural Network (ANN) is an efficient computing system whose central theme is
borrowed from the analogy of biological neural networks. ANNs are also called “artificial
neural systems”, “parallel distributed processing systems”, or “connectionist systems”. An ANN
comprises a large collection of units that are interconnected in some pattern to allow
communication between the units. These units, also referred to as nodes or neurons, are simple
processors which operate in parallel.
Every neuron is connected to other neurons through connection links. Each connection link
is associated with a weight that carries information about the input signal. This is the most useful
information for neurons to solve a particular problem because the weight usually excites or
inhibits the signal that is being communicated. Each neuron has an internal state, which is
called an activation signal. Output signals, which are produced after combining the input
signals and the activation rule, may be sent to other units.
The following diagram represents the general model of ANN followed by its processing.
For the above general model of an artificial neural network, the net input can be calculated as
follows:
$$y_{in} = x_1 w_1 + x_2 w_2 + x_3 w_3 + \dots + x_m w_m$$
The output is obtained by applying the activation function F to the calculated net input:
$$Y = F(y_{in})$$
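The net input and output of a single neuron can be sketched in a few lines of Python. The text does not say which activation function F is used, so the sigmoid below is an assumption, as are the example inputs and weights.

```python
import math


def net_input(inputs, weights):
    """Weighted sum y_in = x1*w1 + x2*w2 + ... + xm*wm."""
    return sum(x * w for x, w in zip(inputs, weights))


def sigmoid(y_in):
    """One possible activation function F (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-y_in))


def neuron_output(inputs, weights, activation=sigmoid):
    """Output Y = F(y_in) of a single neuron."""
    return activation(net_input(inputs, weights))


x = [1.0, 2.0, 3.0]    # hypothetical input signals
w = [0.5, -1.0, 0.25]  # hypothetical connection weights
print(net_input(x, w))      # -0.75
print(neuron_output(x, w))  # a value between 0 and 0.5, since y_in is negative
```

A positive weight pushes the net input up (excites) and a negative weight pulls it down (inhibits), which is the role of the weights described above.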
SSE
In linear regression there is a neat way to measure the accuracy of the relationship (called
correlation in statistics). It has many names: SSE, SSR, RSS. I am going to refer to it as
SSE, which stands for Sum of Squared Errors.
The regression line is the line made using the function we defined above. You can think of it
as drawing a pixel for every possible meal price value, thus creating a line. Here is what it
looks like with our data set.
An error refers to how far a data point, or in this case a tip, is from the regression line. To get
the SSE we calculate the distance of each data point from the regression line, square it, and
add it to the sum. Here is what it would look like in code.
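A minimal sketch of that calculation follows. The meal prices, tips, and the slope and intercept of the line are hypothetical values picked so the arithmetic is easy to check; they are not the data set the text refers to.

```python
def predict_tip(meal_price, slope, intercept):
    """Point on the regression line for a given meal price."""
    return slope * meal_price + intercept


def sum_of_squared_errors(meal_prices, tips, slope, intercept):
    """Add up the squared vertical distance of every tip from the line."""
    sse = 0.0
    for price, tip in zip(meal_prices, tips):
        error = tip - predict_tip(price, slope, intercept)
        sse += error ** 2
    return sse


# Hypothetical data and line, chosen only to keep the arithmetic easy to follow.
meal_prices = [8.0, 12.0, 20.0]
tips = [2.5, 3.0, 4.5]
slope, intercept = 0.25, 0.0

errors = [t - predict_tip(p, slope, intercept) for p, t in zip(meal_prices, tips)]
sse = sum_of_squared_errors(meal_prices, tips, slope, intercept)
print(errors)       # [0.5, 0.0, -0.5]
print(sum(errors))  # 0.0
print(sse)          # 0.5
```

The raw errors sum to zero here even though two of the three tips miss the line, while the SSE of 0.5 still registers both misses.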
If a tip lies far away from the regression line, it is a clear indicator that the correlation is
low. Squaring the errors makes this visible: every squared error is non-negative, so each
deviation adds to the total. Simply summing the errors without squaring them would let positive
and negative errors cancel out, and would not effectively show how low the correlation actually is.
Gradient Descent
Gradient Descent is one of the most popular and widely used algorithms for training machine
learning models.
Machine learning models typically have parameters (weights and biases) and a cost function
to evaluate how good a particular set of parameters is. Many machine learning problems
reduce to finding a set of weights for the model which minimizes the cost function.
For example, if the prediction is p, the target is t, and our error metric is squared error, then
the cost function is J(W) = (p - t)².
Note that the predicted value p depends on the input X as well as the machine learning model
and (current) values of the parameters W. During training, our aim is to find a set of values
for W such that (p - t)² is small. This means our prediction p will be close to the target t.
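As a small worked sketch, gradient descent on this squared-error cost can be written for a model with a single weight, p = w * x. The model form, the training example and the learning rate are assumptions made for illustration.

```python
def gradient_descent(x, t, w=0.0, learning_rate=0.1, steps=50):
    """Minimise J(w) = (p - t)**2 for the one-weight model p = w * x."""
    for _ in range(steps):
        p = w * x                   # current prediction
        gradient = 2 * (p - t) * x  # dJ/dw by the chain rule
        w -= learning_rate * gradient
    return w


# Hypothetical training example: input 2.0 with target 6.0, so the best weight is 3.0.
w = gradient_descent(x=2.0, t=6.0)
print(w, (w * 2.0 - 6.0) ** 2)
```

Each step moves w a little in the direction that lowers the cost. With these values the weight approaches 3.0 and the cost approaches zero; a learning rate that is too large would make the updates overshoot instead.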
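Closed form
Normal Equation
The normal equation finds the weights analytically instead of iteratively. It starts from the
least-squares cost function
$$J(W) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_W(x_i) - y_i \right)^2$$
where,
h_W(x_i) : the prediction of the model with weights W for the ith training example
x_i : the input value of the ith training example
m : no. of training instances
n : no. of data-set features
y_i : the expected result of the ith instance
Let us represent the cost function in vector form.
$$\begin{bmatrix} h_W(x_1) - y_1 \\ h_W(x_2) - y_2 \\ \vdots \\ h_W(x_m) - y_m \end{bmatrix}$$
We have ignored 1/2m here as it makes no difference to the result. It was
used for mathematical convenience while calculating gradient descent, but it is
no longer needed here.
Each prediction is a weighted sum of the features, $h_W(x_i) = \sum_{j} W_j x_{ij}$, where
x_{ij} : value of the jth feature in the ith training example.
This can further be reduced to
$$XW - y$$
where X is the matrix with entries x_{ij} (one row per training example) and y is the vector of
expected results.
But each residual value is squared. We cannot simply square the above expression,
as the square of a vector/matrix is not equal to the square of each of its values. So
to get the squared value, multiply the vector/matrix with its transpose. So, the final
equation derived is
$$J(W) = (XW - y)^T (XW - y)$$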
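Setting the gradient of this least-squares cost to zero gives the well-known closed-form solution W = (X^T X)^-1 X^T y, where X holds one row per training example, y is the vector of expected results and W is the weight vector. The sketch below evaluates it in plain Python for one feature plus an intercept column. The data and the helper names are hypothetical, and the 2x2 inverse is written out by hand only to keep the example free of dependencies.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(a):
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("X^T X is singular; there is no unique solution")
    return [[s / det, -q / det], [-r / det, p / det]]


def normal_equation(xs, ys):
    """Solve W = (X^T X)^-1 X^T y for one feature plus an intercept."""
    X = [[1.0, x] for x in xs]  # the leading 1 multiplies the intercept weight
    y = [[v] for v in ys]       # column vector of expected results
    Xt = transpose(X)
    W = matmul(matmul(inverse_2x2(matmul(Xt, X)), Xt), y)
    return [row[0] for row in W]


# Hypothetical data lying exactly on y = 1 + 2x.
intercept, slope = normal_equation([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(intercept, slope)  # approximately 1.0 and 2.0
```

Unlike gradient descent, there is no learning rate and no iteration; the price is a matrix inversion, which becomes expensive when the number of features is large.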
Linearity
The linear regression model forces the prediction to be a linear combination of
features, which is both its greatest strength and its greatest limitation. Linearity
leads to interpretable models. Linear effects are easy to quantify and describe.
They are additive, so it is easy to separate the effects. If you suspect feature
interactions or a nonlinear association of a feature with the target value, you can
add interaction terms or use regression splines.
Normality
It is assumed that the target outcome given the features follows a normal
distribution. If this assumption is violated, the estimated confidence intervals of the
feature weights are invalid.
Independence
It is assumed that each instance is independent of any other instance. If you
perform repeated measurements, such as multiple blood tests per patient, the data
points are not independent. For dependent data you need special linear regression
models, such as mixed effect models or GEEs. If you use the “normal” linear
regression model, you might draw wrong conclusions from the model.
Fixed features
The input features are considered “fixed”. Fixed means that they are treated as
“given constants” and not as statistical variables. This implies that they are free of
measurement errors. This is a rather unrealistic assumption. Without that
assumption, however, you would have to fit very complex measurement error
models that account for the measurement errors of your input features. And
usually you do not want to do that.
Absence of multicollinearity
You do not want strongly correlated features, because this messes up the
estimation of the weights. In a situation where two features are strongly correlated,
it becomes problematic to estimate the weights because the feature effects are
additive and it becomes indeterminable to which of the correlated features to
attribute the effects.
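One simple way to spot strongly correlated features before fitting is to compute their pairwise Pearson correlation. The sketch below does this in plain Python; the feature names and values are hypothetical.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equally long feature columns."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical features: the second is the first in different units, so the two
# carry almost the same information; the third is only weakly related to the first.
size_m2 = [50.0, 60.0, 80.0, 100.0]
size_ft2 = [538.0, 646.0, 861.0, 1076.0]
floor = [3.0, 1.0, 1.0, 3.0]

r_duplicate = pearson_correlation(size_m2, size_ft2)
r_weak = pearson_correlation(size_m2, floor)
print(r_duplicate, r_weak)
```

A coefficient close to 1 or -1, as for the two size features, signals that the model cannot tell which of the pair deserves the weight. Pairwise correlation does not catch every case of multicollinearity, but it is a cheap first check.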
Overfitting in Machine Learning
Overfitting happens when a model learns the detail and noise in the training data to the
extent that it negatively impacts the performance of the model on new data. This means
that the noise or random fluctuations in the training data are picked up and learned as
concepts by the model. The problem is that these concepts do not apply to new data and
negatively impact the model’s ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more
flexibility when learning a target function. As such, many nonparametric machine
learning algorithms also include parameters or techniques to limit and constrain how
much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very
flexible and is subject to overfitting training data. This problem can be addressed by
pruning a tree after it has learned in order to remove some of the detail it has picked up.
An underfit machine learning model is not a suitable model either, and this will be obvious
because it performs poorly even on the training data.
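The effect can be seen on a toy problem. The sketch below compares a rigid model (a least-squares line) with a very flexible nonparametric one (a 1-nearest-neighbour lookup that memorizes the training targets). The data, with its alternating +1 and -1 noise, is hypothetical and was chosen to make the contrast obvious.

```python
def fit_line(xs, ys):
    """Least-squares line through the training data (a simple, rigid model)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """1-nearest-neighbour lookup: reproduces every training target, noise included."""
    return lambda x: min(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[1]


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the true relationship is y = 2x, and the training targets
# carry alternating noise of +1 and -1. The test targets are noise-free.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [0.8, 2.8, 4.8, 6.8, 8.8]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          mean_squared_error(model, train_x, train_y),
          mean_squared_error(model, test_x, test_y))
```

The memorizer reaches zero error on the training data because it has learned the noise as if it were a concept, and it pays for that with a much larger error on the new points than the line does.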
Validation
In machine learning, we cannot simply fit the model on the training data and claim that
the model will work accurately on real data. We must make sure that our
model has picked up the correct patterns from the data and is not picking up too much noise.
For this purpose, we use the cross-validation technique.
Cross-Validation
Cross-validation is a technique in which we train our model using a subset of the
data-set and then evaluate it using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample data-set.
2. Train the model using the rest of the data-set.
3. Test the model using the reserved portion of the data-set.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and
testing.
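The three steps, repeated so that every observation is reserved exactly once, give k-fold cross-validation. The sketch below is one possible plain-Python version; the function names, the target values, and the deliberately trivial "predict the training mean" model are assumptions for illustration.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]                # step 1: reserved portion
        train = indices[:start] + indices[stop:]  # the rest of the data-set
        yield train, test
        start = stop


def cross_validate(targets, k):
    """Score a trivial model (predict the training mean) on each fold."""
    scores = []
    for train, test in k_fold_splits(len(targets), k):
        prediction = sum(targets[i] for i in train) / len(train)             # step 2
        mse = sum((targets[i] - prediction) ** 2 for i in test) / len(test)  # step 3
        scores.append(mse)
    return scores


targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]  # hypothetical values
scores = cross_validate(targets, k=5)
print(scores, sum(scores) / len(scores))
```

The average of the per-fold scores is the out-of-sample estimate. This sketch splits the data in its stored order; in practice the data is usually shuffled first so that each fold is representative.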
Classification
Classification is the process of predicting the class of given data points. Classes are sometimes
called targets, labels or categories. Classification predictive modeling is the task of
approximating a mapping function (f) from input variables (X) to discrete output variables (y).
For example, spam detection in email service providers can be identified as a classification
problem. This is a binary classification since there are only 2 classes: spam and not spam. A
classifier utilizes some training data to understand how given input variables relate to the class.
In this case, known spam and non-spam emails have to be used as the training data. When the
classifier is trained accurately, it can be used to classify an unknown email.
Classification belongs to the category of supervised learning, where the targets are also provided
with the input data. There are many applications of classification in many domains, such as
credit approval, medical diagnosis, target marketing etc.
There are two types of learners in classification: lazy learners and eager learners.
1. Lazy learners
Lazy learners simply store the training data and wait until testing data appears. When it does,
classification is conducted based on the most related data in the stored training data. Compared
to eager learners, lazy learners spend less time training but more time predicting.
2. Eager learners
Eager learners construct a classification model based on the given training data before receiving
data for classification. The model must be able to commit to a single hypothesis that covers the
entire instance space. Due to the model construction, eager learners take a long time to train
and less time to predict.
After training a Machine Learning model on a data-set, it is often necessary to visualize the
classification of the data-points in feature space. A decision boundary on a scatter plot serves
this purpose: the scatter plot contains the data-points belonging to different classes
(denoted by colour or shape), and the decision boundary can be drawn following many
different strategies:
1. Single-Line Decision Boundary: The basic strategy for drawing the decision boundary on a
scatter plot is to find a single line that separates the data-points into regions signifying
different classes. This line is found using the parameters of the Machine Learning
algorithm that are obtained after training the model. The line coordinates are derived
from those parameters and from the intuition behind the algorithm. This strategy cannot
be used if the intuition and working mechanism of the ML algorithm are not known.
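For a linear classifier with two features, the single-line strategy amounts to solving w1*x1 + w2*x2 + b = 0 for x2. The sketch below assumes such a classifier; the parameter values are hypothetical stand-ins for what training would produce.

```python
def boundary_x2(x1, w1, w2, b):
    """x2 coordinate of the boundary w1*x1 + w2*x2 + b = 0 at a given x1."""
    if w2 == 0:
        raise ValueError("boundary is vertical; solve for x1 instead")
    return -(w1 * x1 + b) / w2


def predict(x1, x2, w1, w2, b):
    """Class 1 on one side of the line, class 0 on the other."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


# Hypothetical parameters, as if obtained from a trained linear classifier.
w1, w2, b = 1.0, 2.0, -4.0
line = [(x1, boundary_x2(x1, w1, w2, b)) for x1 in (0.0, 1.0, 2.0)]
print(line)  # [(0.0, 2.0), (1.0, 1.5), (2.0, 1.0)]
```

Plotting these coordinates over the scatter plot draws the boundary; points for which predict returns 1 fall on one side and points for which it returns 0 on the other.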
Nearest Neighbours (k-Nearest Neighbours)
As in any machine learning problem, we must first find a way to represent
data points as feature vectors. A feature vector is our mathematical representation of data, and
since the desired characteristics of our data may not be inherently numerical, pre-processing and
feature-engineering may be required in order to create these vectors. Given data with N unique
features, the feature vector would be a vector of length N, where entry i of the vector represents
that data point’s value for feature i. Each feature vector can thus be thought of as a point in R^N.
Now, unlike most other methods of classification, kNN falls under lazy learning, which means
that there is no explicit training phase before classification. Instead, any attempt to
generalize or abstract the data is made at classification time. While this does mean that we can
immediately begin classifying once we have our data, there are some inherent problems with
this type of algorithm. We must be able to keep the entire training set in memory unless we
apply some type of reduction to the data-set, and performing classifications can be
computationally expensive as the algorithm must parse through all data points for each
classification. For these reasons, kNN tends to work best on smaller data-sets that do not
have many features.
Once we have formed our training data-set, which is represented as an M x N matrix where M is
the number of data points and N is the number of features, we can now begin classifying. The
gist of the kNN method is, for each classification query, to:
1. Compute a distance value between the item to be classified and every item in the training
data-set
2. Pick the k closest data points (the items with the k lowest distances)
3. Conduct a “majority vote” among those data points — the dominating classification in that
pool is decided as the final classification
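The three steps translate almost line for line into code. The sketch below uses Euclidean distance and a small hypothetical training set; the function name and the default k = 3 are choices made for this illustration.

```python
import math
from collections import Counter


def knn_classify(query, training_points, training_labels, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training item.
    distances = [
        (math.dist(query, point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Step 2: keep the k items with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those k items.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set with two features and two classes.
training_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
training_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((1.5, 1.5), training_points, training_labels))  # a
print(knn_classify((6.5, 6.5), training_points, training_labels))  # b
```

Every query scans the whole training set, which is the cost of having no training phase. With two classes, an odd k avoids tied votes.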
There are two important decisions that must be made before making classifications. One is the
value of k that will be used; this can either be decided arbitrarily, or you can try cross-
validation to find an optimal value. The next, and the most complex, is the distance metric that
will be used.
There are many different ways to compute distance, as it is a fairly ambiguous notion, and the
proper metric to use is always going to be determined by the data-set and the classification task.
Two popular ones, however, are Euclidean distance and Cosine similarity.
Euclidean distance is probably the one that you are most familiar with; it is essentially the
magnitude of the vector obtained by subtracting the training data point from the point to be
classified.
Another common metric is Cosine similarity. Rather than calculating a magnitude, Cosine
similarity instead uses the difference in direction between two vectors.
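Both metrics are short enough to write out directly. The sketch below is a plain-Python version; the example vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Magnitude of the difference vector a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined if either vector is all zeros


u, v = [1.0, 2.0], [2.0, 4.0]    # same direction, different magnitude
print(euclidean_distance(u, v))  # about 2.236
print(cosine_similarity(u, v))   # about 1.0
```

The two vectors are far apart by Euclidean distance yet as similar as possible by cosine similarity, because only their direction is compared. Since a larger similarity means a closer match, kNN code that expects a distance often uses 1 minus the cosine similarity.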
Choosing a metric can often be tricky, and it may be best to just use cross-validation to decide,
unless you have some prior insight that clearly leads to using one over the other. For example,
for something like word vectors, you may want to use Cosine similarity because the direction
of a word is more meaningful than the sizes of the component values. Generally, both of these
methods run in roughly the same time, and both suffer on high-dimensional data.
After doing all of the above and deciding on a metric, the result of the kNN algorithm is a
decision boundary that partitions R^N into sections. Each section (colored distinctly below)
represents a class in the classification problem. The boundaries need not be formed with actual
training examples — they are instead calculated using the distance metric and the available
training points. By taking R^N in (small) chunks, we can calculate the most likely class for a
hypothetical data-point in that region, and we thus color that chunk as being in the region for
that class.
This information is all that is needed to begin implementing the algorithm and doing so should
be relatively simple. There are, of course, many ways to improve upon this base algorithm.
Common modifications include weighting, and specific pre-processing to reduce computation
and reduce noise, such as various algorithms for feature extraction and dimension reduction.
Additionally, the kNN method has also been used, although less-commonly, for regression
tasks, and operates in a manner very similar to that of the classifier through averaging.
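For regression, the majority vote is replaced by an average of the neighbours' target values. A minimal sketch, assuming Euclidean distance, an unweighted mean and a hypothetical one-feature data set:

```python
import math


def knn_regress(query, training_points, training_targets, k=3):
    """Predict a number as the average target of the k nearest training points."""
    nearest = sorted(
        zip(training_points, training_targets),
        key=lambda pair: math.dist(query, pair[0]),
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical one-feature data set.
training_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
training_targets = [10.0, 20.0, 30.0, 100.0]
print(knn_regress((2.1,), training_points, training_targets))  # 20.0
```

The distant point at 10.0 has no influence on the prediction because it is not among the three nearest neighbours. A weighted variant would let closer neighbours count for more than farther ones.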
MATLAB
MATLAB makes machine learning easy. With tools and functions for handling big data, as
well as apps to make machine learning accessible, MATLAB is an ideal environment for
applying machine learning to your data analytics.
With MATLAB, engineers and data scientists have immediate access to prebuilt functions,
extensive toolboxes, and specialized apps for classification, regression, and clustering.
• Compare approaches such as logistic regression, classification trees, support vector machines,
ensemble methods, and deep learning.
• Use model refinement and reduction techniques to create an accurate model that best captures
the predictive power of your data.
• Integrate machine learning models into enterprise systems, clusters, and clouds, and target
models to real-time embedded hardware.
• Perform automatic code generation for embedded sensor analytics.
• Support integrated workflows from data analytics to deployment.