machine learning notes


Unit -1

Machine learning (ML):


It is the scientific study of algorithms and statistical models that computer systems use to
perform a specific task without using explicit instructions, relying on patterns
and inference instead.
It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical
model based on sample data, known as "training data", in order to make predictions or decisions
without being explicitly programmed to perform the task.[1][2] Machine learning algorithms are
used in a wide variety of applications, such as email filtering and computer vision, where it is
difficult or infeasible to develop a conventional algorithm for effectively performing the task.
Although a machine learning model may apply a mix of different techniques, the
methods for learning can typically be categorized as three general types:

 Supervised learning: The learning algorithm is given labeled data and the
desired output. For example, pictures of dogs labeled “dog” will help the
algorithm identify the rules to classify pictures of dogs.
 Unsupervised learning: The data given to the learning algorithm is
unlabeled, and the algorithm is asked to identify patterns in the input data. For
example, the recommendation system of an e-commerce website where the
learning algorithm discovers similar items often bought together.
 Reinforcement learning: The algorithm interacts with a dynamic
environment that provides feedback in the form of rewards and punishments. For
example, a self-driving car being rewarded for staying on the road.
Supervised Learning
 Supervised learning performs function approximation: we train an
algorithm and, at the end of the process, pick the function that best describes the input
data, the one that for a given X makes the best estimate of y (X -> y). Most of the time
we are not able to find the true function that always makes the correct prediction; another
reason is that the algorithm relies on assumptions made by humans about how the
computer should learn, and these assumptions introduce a bias.

 Here the training dataset acts as a teacher: we feed the computer training data
containing the input predictors and show it the correct answers (the output, or label, of the
input predictors). From the training dataset, the model learns the mapping function between
the input predictors and the output variable.

 Supervised learning algorithms try to model relationships and dependencies between the
target prediction output and the input features, such that we can predict the output values for
new data based on the relationships learned from previous data sets.

 Supervised learning based models are predictive models that predict either the value of a
continuous variable (like temperature or stock price), which we call regression, or the class of
an input (like whether an image is of a dog or a cat), which we call classification.

List of Common Algorithms of Supervised learning


 Nearest Neighbor
 Naive Bayes
 Decision Trees
 Linear Regression
 Support Vector Machines (SVM)
 Neural Networks

Classification and Regression in Supervised Learning:


Classification algorithms and regression algorithms are types of supervised learning.
Classification algorithms are used when the value of the output variable is restricted to a limited
set of values, i.e. classes. For a classification algorithm that filters emails, the input would
be an incoming email, and the output would be the name of the folder in which to file the email.
For an algorithm that identifies spam emails, the output would be the prediction of either "spam"
or "not spam", represented by the Boolean values true and false.
Regression algorithms are named for their continuous outputs, meaning they may have any value
within a range. Examples of a continuous value are the temperature, length, or price of an object.
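
As a rough illustration of the two settings, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the toy data and values are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (toy data, roughly y = 10 * x)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_continuous = np.array([10.2, 19.8, 30.1, 39.9])
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[5.0]]))          # output is a value on a continuous scale

# Classification: predict a discrete class (toy spam / not-spam data)
X_emails = np.array([[0.1], [0.4], [0.8], [0.9]])   # e.g. fraction of "spammy" words
y_labels = np.array([0, 0, 1, 1])                   # 0 = not spam, 1 = spam
clf = LogisticRegression().fit(X_emails, y_labels)
print(clf.predict([[0.7]]))          # output is a class label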

In the case of semi-supervised learning algorithms, some of the training examples are missing
training labels, but they can nevertheless be used to improve the quality of a model. In weakly
supervised learning, the training labels are noisy, limited, or imprecise; however, these labels are
often cheaper to obtain, resulting in larger effective training sets.

Unsupervised learning
Unsupervised learning algorithms take a set of data that contains only inputs, and find structure
in the data, like grouping or clustering of data points. The algorithms, therefore, learn from test
data that has not been labeled, classified or categorized. Instead of responding to feedback,
unsupervised learning algorithms identify commonalities in the data and react based on the
presence or absence of such commonalities in each new piece of data. A central application of
unsupervised learning is in the field of density estimation in statistics, though unsupervised
learning encompasses other domains involving summarizing and explaining data features.
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to one or more predesignated criteria,
while observations drawn from different clusters are dissimilar. Different clustering techniques
make different assumptions on the structure of the data, often defined by some similarity
metric and evaluated, for example, by internal compactness, or the similarity between members
of the same cluster, and separation, the difference between clusters. Other methods are based
on estimated density and graph connectivity.
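
As a small illustration, the following sketch clusters a handful of unlabeled 2-D points with k-means using scikit-learn (an assumption; the notes do not name a specific clustering algorithm, and the data values are made up):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups (values are made up)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # which cluster each point was assigned to
print(kmeans.cluster_centers_)   # the centre of each cluster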

Semi-supervised learning
Semi-supervised learning falls between unsupervised learning (without any labeled training data)
and supervised learning (with completely labeled training data). Many machine-learning
researchers have found that unlabeled data, when used in conjunction with a small amount of
labeled data, can produce a considerable improvement in learning accuracy.
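
A minimal sketch of this idea, assuming scikit-learn is available: LabelPropagation spreads the few known labels to the unlabeled points (marked with -1); the data values are made up.

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Six points, only two of them labelled; -1 marks an unlabeled example
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation().fit(X, y)
print(model.transduction_)    # labels inferred for every point, including the unlabeled ones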

Reinforcement learning
Reinforcement learning is an area of machine learning concerned with how software
agents ought to take actions in an environment so as to maximize some notion of cumulative
reward. Due to its generality, the field is studied in many other disciplines, such as game
theory, control theory, operations research, information theory, simulation-based
optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In
machine learning, the environment is typically represented as a Markov Decision
Process (MDP). Many reinforcement learning algorithms use dynamic
programming techniques. Reinforcement learning algorithms do not assume knowledge of an
exact mathematical model of the MDP, and are used when exact models are infeasible.
Reinforcement learning algorithms are used in autonomous vehicles or in learning to play a game
against a human opponent.
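
To make the reward-driven loop concrete, here is a minimal tabular Q-learning sketch on a made-up five-state corridor environment (the environment, rewards and hyperparameters are illustrative, not from the notes):

import random

# A tiny 1-D "corridor" MDP: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 gives reward +1 and ends the episode.
n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

def step(state, action):
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the current Q-table, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) towards reward + gamma * max Q(s', .)
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)   # action 1 (move right) should have the higher value in every state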
Application of Machine Learning:

Limitations of Machine Learning:

Lack of Data : Many machine learning algorithms require large amounts of data
before they begin to give useful results. A good example of this is a neural
network. Neural networks are data-eating machines that require copious amounts
of training data. The larger the architecture, the more data is needed to produce
viable results. Reusing data is a bad idea, and data augmentation is useful to some
extent, but having more data is always the preferred solution. If you can get the
data, then use it.

Lack of Good Data: Despite the appearance, this is not the same as the above
comment. Let’s imagine you think you can cheat by generating ten thousand fake
data points to put in your neural network. What happens when you put it in?

1. It will train itself, and then when you come to test it on an unseen data set, it
will not perform well. You had the data but the quality of the data was not up
to scratch.
2. In the same way that having a lack of good features can cause your algorithm
to perform poorly, having a lack of good ground truth data can also limit the
capabilities of your model. No company is going to implement a machine
learning model that performs worse than human-level error.
3. Similarly, applying a model that was trained on a set of data in one situation
may not necessarily apply as well to a second situation. The best example of
this I have found so far is in breast cancer prediction.
4. Mammography databases have a lot of images in them, but they suffer from
one problem that has caused significant issues in recent years — almost all of
the x-rays are from white women. This may not sound like a big deal, but
actually, black women have been shown to be 42 percent more likely to die
from breast cancer due to a wide range of factors that may include
differences in detection and access to health care. Thus, training an algorithm
primarily on white women adversely impacts black women in this case.
5. What is needed in this specific case is a larger number of x-rays of black
patients in the training database, more features relevant to the cause of this 42
percent increased likelihood, and for the algorithm to be more equitable by
stratifying the dataset along the relevant axes.

Data Augmentation

Data augmentation is a method by which you can virtually increase the number of samples in
your dataset using data you already have. For image augmentation, it can be achieved
by performing geometric transformations, changes to color, brightness, contrast or by adding
some noise. Currently there are ongoing studies on interesting new methods in data
augmentation using Generative Adversarial Networks or by pairing samples.

Data Augmentation in image Processing:

 Position augmentation
 Scaling
 Cropping
 Flipping
 Padding
 Rotation
 Translation
 Affine transformation

 Color augmentation
 Brightness
 Contrast
 Saturation
 Hue

Scaling
In scaling or resizing, the image is resized to the given size e.g. the width of the image can be
doubled.

Cropping
In cropping, a portion of the image is selected, e.g. the central region of the image is kept and
returned at the desired size.
Flipping
In flipping, the image is flipped horizontally or vertically.

Padding
In padding, the image is padded with a given value on all sides.

Rotation
In rotation, the image is rotated by a given (often random) angle.
Translation
In translation, the image is moved either along the x-axis or y-axis.

Color augmentation
Color augmentation or color jittering deals with altering the color properties of an image by changing
its pixel values.

Brightness
One way to augment is to change the brightness of the image. The resultant image becomes darker
or lighter compared to the original one.

Contrast
The contrast is defined as the degree of separation between the darkest and brightest areas of an
image. The contrast of the image can also be changed.

Saturation
Saturation is the separation between colors of an image.

Hue
Hue can be described as the shade of the colors in an image.
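
A minimal sketch of several of the augmentations listed above, assuming the PyTorch torchvision package is installed (the file name example.jpg is hypothetical):

from PIL import Image
import torchvision.transforms as T

image = Image.open("example.jpg")    # hypothetical input image

augment = T.Compose([
    T.Resize((256, 256)),                              # scaling
    T.CenterCrop(224),                                 # cropping
    T.Pad(10),                                         # padding
    T.RandomHorizontalFlip(p=0.5),                     # flipping
    T.RandomRotation(degrees=15),                      # rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation (an affine transformation)
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),           # color augmentation
])

augmented = augment(image)           # returns a randomly augmented PIL image
augmented.save("example_augmented.jpg")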
Topic: Eigenvectors and Eigenvalues
Eigenvectors and eigenvalues have many important applications in computer vision and machine
learning in general. Well-known examples are PCA (Principal Component Analysis) for
dimensionality reduction and EigenFaces for face recognition. Eigenvectors and eigenvalues are
also used to compute error ellipses, and eigendecomposition forms the basis of the geometric
interpretation of covariance matrices. This topic provides a gentle introduction to the concept and
shows how to manually obtain the eigendecomposition of a 2D square matrix.

An eigenvector is a vector whose direction remains unchanged when a linear transformation is
applied to it. Consider the figure below, in which three vectors are shown. The green square is only
drawn to illustrate the linear transformation that is applied to each of these three vectors.

[Figure: eigenvectors (red) do not change direction when a linear transformation (e.g. scaling) is
applied to them; other vectors (yellow) do.]

The transformation in this case is a simple scaling with factor 2 in the horizontal direction and factor
0.5 in the vertical direction, such that the transformation matrix A is defined as:

A = | 2    0  |
    | 0   0.5 |

A vector v = (x, y) is then scaled by applying this transformation as v' = A v = (2x, 0.5y). The above
figure shows that the direction of some vectors (shown in red) is not affected by this linear
transformation. These vectors are called eigenvectors of the transformation, and uniquely define the
square matrix A. This unique, deterministic relation is exactly the reason that those vectors are
called 'eigenvectors' (Eigen means 'specific' in German).

In general, the eigenvector v of a matrix A is the vector for which the following holds:

A v = λ v        (1)

where λ is a scalar value called the 'eigenvalue'. This means that the linear transformation of
vector v by A is completely defined by λ.

We can rewrite equation (1) as follows:

A v − λ v = 0   =>   (A − λ I) v = 0        (2)

where I is the identity matrix of the same dimensions as A.

However, assuming that v is not the null vector, equation (2) can only hold if (A − λ I) is not
invertible. If a square matrix is not invertible, its determinant must equal zero. Therefore, to find the
eigenvectors of A, we simply have to solve the following equation:

det(A − λ I) = 0        (3)

In the following sections we will determine the eigenvectors and eigenvalues of a matrix A by
solving equation (3). Matrix A in this example is the 2x2 matrix given in equation (4). [The numerical
matrices and intermediate expressions of this worked example were images in the source and are
not reproduced here.]
Calculating the eigenvalues

To determine the eigenvalues for this example, we substitute the matrix A of equation (4) into
equation (3) and obtain det(A − λ I) = 0. Calculating the determinant gives a quadratic equation
in λ, equation (6).

To solve this quadratic equation in λ, we find its discriminant. Since the discriminant is strictly
positive, two different values for λ exist, the eigenvalues λ1 and λ2 of equation (7).

We have now determined the two eigenvalues λ1 and λ2. Note that a square matrix of
size n x n always has exactly n eigenvalues, each with a corresponding eigenvector. The
eigenvalue specifies the magnitude of the corresponding eigenvector.

Calculating the first eigenvector

We can now determine the eigenvectors by plugging the eigenvalues from equation (7) into
equation (1), which originally defined the problem. The eigenvectors are then found by solving this
system of equations.

We first do this for eigenvalue λ1, in order to find the corresponding first eigenvector v1, satisfying
A v1 = λ1 v1. Since this is simply the matrix notation for a system of equations, we can write it in its
equivalent form and solve the first equation as a function of the second component, resulting in
equation (9).

Since an eigenvector simply represents an orientation (the corresponding eigenvalue represents the
magnitude), all scalar multiples of the eigenvector are vectors that are parallel to this eigenvector,
and are therefore equivalent (if we normalized the vectors, they would all be equal). Thus,
instead of further solving the above system of equations, we can freely choose a real value for
either component and determine the other one by using equation (9).

For this example, we arbitrarily choose a value for the first component, which fixes the second
component and gives the eigenvector v1 that corresponds to eigenvalue λ1.
Calculating the second eigenvector

Calculations for the second eigenvector are similar to those needed for the first eigenvector.
We now substitute eigenvalue λ2 into equation (1), yielding A v2 = λ2 v2. Written as a system of
equations, this is equivalent to two linear equations, and solving the first equation as a function of
the second component results in a relation analogous to equation (9). We then arbitrarily choose a
value for one component and find the other, which gives the eigenvector v2 that corresponds to
eigenvalue λ2.
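
The same decomposition can be checked numerically with NumPy. The sketch below uses the scaling matrix from the illustration at the start of this topic (not the worked example's matrix, whose values are not reproduced in these notes):

import numpy as np

# Scaling matrix from the illustration: factor 2 horizontally, factor 0.5 vertically
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)       # [2.  0.5]
print(eigenvectors)      # each column is a (unit-length) eigenvector

# Check the defining relation A v = lambda v for every eigenpair
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)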
Topic: Gradient Descent Based Linear Regression

It is a kind of Supervised Learning which can be used to predict the value of a
continuous variable like temperature, pressure, or stock price.

The training dataset will be divided into two sections: one is the set of independent
variables (the set of input features) and the other is the dependent variable which
is to be predicted. For example, in a dataset for mobile price prediction, the input
feature set contains CPU speed, RAM, camera pixels and battery, and the output
feature is the price, which depends on the input feature set.

Error Function (eq. 1)

Steps to calculate the Linear Function

1. Assume random values for m and b (slope and intercept)
2. For each ith iteration or epoch, repeat steps 3 to 6
3. Evaluate the gradients (Gm, Gb) for m and b using the error from each ith sample
according to eq. 1
4. Update m and b:
5. m = m - (learning rate * Gm)
6. b = b - (learning rate * Gb)
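
A minimal NumPy sketch of these steps, assuming the error function of eq. 1 is the mean squared error (the toy data, learning rate and epoch count are illustrative):

import numpy as np

# Toy data: y is roughly 3*x + 4 plus noise (values are illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 4.0 + rng.normal(0.0, 1.0, size=100)

m, b = 0.0, 0.0            # step 1: initial slope and intercept
learning_rate = 0.01
n = len(x)

for epoch in range(1000):                   # step 2: repeat for each epoch
    error = (m * x + b) - y                 # per-sample error of the current line
    Gm = (2.0 / n) * np.sum(error * x)      # step 3: gradient of the MSE w.r.t. m
    Gb = (2.0 / n) * np.sum(error)          #         gradient of the MSE w.r.t. b
    m = m - learning_rate * Gm              # steps 4-5: update m
    b = b - learning_rate * Gb              # step 6: update b

print(m, b)    # should end up close to 3 and 4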
Statistics
The entire subject of statistics is based around the idea that you have this big set of data, and you
want to analyse that set in terms of the relationships between the individual points in that data
set. Here we look at a few of the measures you can compute on a set of data, and what they tell
you about the data itself.

1.1 Standard Deviation:

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean.
It is calculated as the square root of the variance, determined from the variation of each data point
relative to the mean. If the data points are farther from the mean, there is a higher deviation within
the data set; thus, the more spread out the data, the higher the standard deviation.

The Formula for Standard Deviation:

sigma = sqrt( (1/n) * sum over i of (xi - x̄)^2 )

where:
xi = value of the ith point in the data set
x̄ = the mean value of the data set
n = the number of data points in the data set

1.2 Standard Deviation vs. Variance


Variance is derived by taking the mean of the data points, subtracting the mean from each data
point individually, squaring each of these results and then taking another mean of these squares.
Standard deviation is the square root of the variance.

The variance helps determine the data's spread size when compared to the mean value. As the
variance gets bigger, more variation in data values occurs, and there may be a larger gap between
one data value and another. If the data values are all close together, the variance will be smaller.
This is more difficult to grasp than are standard deviations, however, because variances represent
a squared result that may not be meaningfully expressed on the same graph as the original
dataset.

Standard deviations are usually easier to picture and apply. The standard deviation is expressed
in the same unit of measurement as the data, which isn't necessarily the case with the variance.
Using the standard deviation, statisticians may determine if the data has a normal curve or other
mathematical relationship. If the data behaves in a normal curve, then 68% of the data points will
fall within one standard deviation of the average, or mean data point. Bigger variances cause
more data points to fall outside the standard deviation. Smaller variances result in more data that
is close to average.

A Big Drawback
The biggest drawback of using standard deviation is that it can be impacted by outliers and
extreme values. Standard deviation assumes a normal distribution and calculates all uncertainty
as risk, even when it’s in the investor's favor—such as above average returns.

1.3 Variance:

Variance (σ²) in statistics is a measurement of the spread between numbers in a data
set. That is, it measures how far each number in the set is from the mean and therefore
from every other number in the set.
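
A quick numerical check of both quantities, following the "mean of the squared deviations" description above (the sample data is made up):

import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # made-up data set

mean = data.mean()
variance = np.mean((data - mean) ** 2)     # "another mean of these squares"
std_dev = np.sqrt(variance)                # square root of the variance

print(variance, std_dev)             # 4.0 and 2.0 for this data
print(np.var(data), np.std(data))    # NumPy's defaults compute the same population form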

1.4 Covariance:

In mathematics and statistics, covariance is a measure of the relationship between two random
variables. The metric evaluates how much – to what extent – the variables change together. In
other words, it is essentially a measure of the variance between two variables (note that the
covariance of a variable with itself is simply its variance). However, the metric does not
assess the dependency between variables.
Unlike the correlation coefficient, covariance is measured in units. The units are computed by
multiplying the units of the two variables. The covariance can take any positive or negative value.
The values are interpreted as follows:

 Positive covariance: Indicates that two variables tend to move in the same direction.
 Negative covariance: Reveals that two variables tend to move in inverse directions.

Formula for Covariance:

Cov(X, Y) = sum over i of (Xi - X̄)(Yi - Ȳ) / n

Where:

 Xi – the values of the X-variable
 Yi – the values of the Y-variable
 X̄ – the mean (average) of the X-variable
 Ȳ – the mean (average) of the Y-variable
 n – the number of data points

1.5 Correlation Coefficient:

Covariance and correlation both primarily assess the relationship between variables. The closest
analogy to the relationship between them is the relationship between the variance and standard
deviation.

Covariance measures the total variation of two random variables from their expected values.
Using covariance, we can only gauge the direction of the relationship (whether the variables tend
to move in tandem or show an inverse relationship). However, it does not indicate the strength of
the relationship, nor the dependency between the variables.
On the other hand, correlation measures the strength of the relationship between variables.
Correlation is the scaled measure of covariance. It is dimensionless. In other words, the
correlation coefficient is always a pure value and not measured in any units.

The relationship between the two concepts can be expressed using the formula below:

Correlation(X, Y) = Cov(X, Y) / (σX * σY)

where σX and σY are the standard deviations of X and Y.
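
A small NumPy check of this relationship (the data values are made up; ddof=0 selects the population form used above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # made up, roughly y = 2x

cov_xy = np.cov(x, y, ddof=0)[0, 1]          # population covariance between x and y
corr_xy = cov_xy / (np.std(x) * np.std(y))   # scale by the two standard deviations

print(cov_xy, corr_xy)
print(np.corrcoef(x, y)[0, 1])               # the same correlation, computed by NumPy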

Z Score:
In statistics, the standard score is the signed fractional number of standard deviations by
which the value of an observation or data point is above the mean value of what is being
observed or measured. Observed values above the mean have positive standard scores,
while values below the mean have negative standard scores.
It is calculated by subtracting the population mean from an individual raw score and then
dividing the difference by the population standard deviation: z = (x − μ) / σ. It is a dimensionless
quantity. This conversion process is called standardizing or normalizing (however, "normalizing"
can refer to many types of ratios; see normalization for more).
Standard scores are also called z-values, z-scores, normal scores, and standardized
variables. They are most frequently used to compare an observation to a
theoretical deviate, such as a standard normal deviate.
Computing a z-score requires knowing the mean and standard deviation of the complete
population to which a data point belongs; if one only has a sample of observations from the
population, then the analogous computation with sample mean and sample standard
deviation yields the t-statistic.
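
A short sketch of the computation with NumPy (the scores are made up and treated as the full population):

import numpy as np

scores = np.array([55.0, 60.0, 65.0, 70.0, 75.0])   # made-up population of observations

mu = scores.mean()                # population mean
sigma = scores.std()              # population standard deviation
z = (scores - mu) / sigma         # z = (x - mu) / sigma

print(z)    # values above the mean get positive scores, values below get negative scores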

F1 Score:

In the statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a
measure of a test's accuracy. It considers both the precision p and the recall r of the test to
compute the score: p is the number of correct positive results divided by the number of all positive
results returned by the classifier, and r is the number of correct positive results divided by the
number of all relevant samples (all samples that should have been identified as positive). The F1
score is the harmonic mean of the precision and recall, F1 = 2 * p * r / (p + r), where an F1 score
reaches its best value at 1 (perfect precision and recall) and worst at 0.

Precision and Recall:

In pattern recognition, information retrieval and classification (machine learning), precision (also
called positive predictive value) is the fraction of relevant instances among the retrieved instances,
while recall (also known as sensitivity) is the fraction of the total amount of relevant instances that
were actually retrieved. Both precision and recall are therefore based on an understanding and
measure of relevance.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
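
A small worked example using hypothetical counts of true positives, false positives and false negatives:

# Counts from a hypothetical binary classifier's predictions
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)                            # 40 / 50 = 0.8
recall = TP / (TP + FN)                               # 40 / 60 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean ≈ 0.727

print(precision, recall, f1)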
