ML - Unit - 1

What is Machine Learning?

Machine Learning allows a machine to learn from examples and experience without
being explicitly programmed. Instead of writing the code yourself, you feed data
to a generic algorithm, and the algorithm/machine builds the logic based on the
given data.

How does Machine Learning Work?


A Machine Learning algorithm is trained on a training data set to create a
model. When new input data is introduced, the model makes a prediction.
The prediction is evaluated for accuracy; if the accuracy is acceptable, the
model is deployed. If the accuracy is not acceptable, the algorithm is trained
repeatedly with an augmented training data set.

Types of Machine Learning


Machine learning is sub-categorized into three types:
 Supervised Learning – Train Me!
 Unsupervised Learning – I am self-sufficient in learning
 Reinforcement Learning – My life, My rules! (Trial & Error)

What is Supervised Learning?


Supervised Learning is the one where the learning is guided by a teacher. We
have a dataset which acts as the teacher, and its role is to train the model or
the machine. Once the model is trained, it can start making predictions or
decisions when new data is given to it.

What is Unsupervised Learning?


The model learns through observation and finds structures in the data. Once the
model is given a dataset, it automatically finds patterns and relationships in
the dataset by creating clusters in it. What it cannot do is add labels to the
clusters: it cannot say this is a group of apples or that is a group of mangoes,
but it will separate the apples from the mangoes.
Suppose we present images of apples, bananas and mangoes to the model; based on
patterns and relationships, it creates clusters and divides the dataset into
them. Now, if new data is fed to the model, it adds it to one of the created
clusters.

What is Reinforcement Learning?


It is the ability of an agent to interact with the environment and find out what
the best outcome is. It follows the trial-and-error method. The agent is
rewarded or penalized with a point for a correct or wrong answer, and on the
basis of the positive reward points gained, the model trains itself. Once
trained, it is ready to make predictions on new data presented to it.
Classification of Machine Learning Algorithms
Machine Learning algorithms can be classified into:

1. Supervised Algorithms – Linear Regression, Logistic Regression, Support
Vector Machine (SVM), Decision Trees, Random Forest
2. Unsupervised Algorithms – K Means Clustering
3. Reinforcement Algorithm

1. Supervised Machine Learning Algorithms


In this type of algorithm, the data set on which the machine is trained consists
of labelled data, or, simply put, contains both the input parameters and the
required output. For example, consider classifying whether a person is male or
female. Here, male and female are our labels, and our training dataset is
already classified into these labels based on certain parameters. The machine
learns these features and patterns and classifies new input data based on what
it has learned from this training data.
Supervised Learning Algorithms can be broadly divided into two types of
algorithms, Classification and Regression.
Classification Algorithms
Just as the name suggests, these algorithms are used to classify data into
predefined classes or labels.
Regression Algorithms
These algorithms are used to determine the mathematical relationship between
two or more variables and the level of dependency between variables. These can
be used for predicting an output based on the interdependency of two or more
variables. For example, an increase in the price of a product will decrease its
consumption; in this case, the amount of consumption depends on the price of the
product. Here, the amount of consumption is called the dependent variable and
the price of the product is called the independent variable. The level of
dependency of the amount of consumption on the price of the product helps us
predict future values of consumption based on changes in the price of the
product.
We have two types of regression algorithms: Linear Regression and Logistic
Regression
(a) Linear Regression
Linear regression is used with continuously valued variables, like the previous
example in which the price of the product and amount of consumption are
continuous variables, which means that they can have an infinite number of
possible values. Linear regression can also be represented as a graph known as
scatter plot, where all the data points of the dependent and independent variables
are plotted and a straight line is drawn through them such that the maximum
number of points will lie on the line or at a smaller distance from the line. This
line – also called the regression line, will then help us determine the relationship
between the dependent and independent variables along with which the linear
regression equation is formed.
(b) Logistic Regression
The difference between linear and logistic regression is that logistic regression is
used with categorical dependent variables (eg: Yes/No, Male/Female,
Sunny/Rainy/Cloudy, Red/Blue etc.), unlike the continuous valued variables used
in linear regression. Logistic regression helps determine the probability of a
certain variable to be in a certain group like whether it is night or day, or whether
the colour is red or blue etc. The graph of logistic regression consists of a non-
linear sigmoid function which demonstrates the probabilities of the variables.
2. Unsupervised Machine Learning Algorithms
Unlike supervised learning algorithms, where we deal with labelled data for
training, the training data will be unlabelled for Unsupervised Machine Learning
Algorithms. The clustering of data into a specific group will be done on the basis
of the similarities between the variables. Some of the unsupervised machine
learning algorithms are K-means clustering and neural networks.
Another machine learning concept which is extensively used in the field is
Neural Networks.
3. Reinforcement Machine Learning Algorithms
Reinforcement Learning is a type of Machine Learning in which the machine is
required to determine the ideal behaviour within a specific context, in order to
maximize its rewards. It works on the rewards and punishment principle which
means that for any decision which a machine takes, it will be either be rewarded
or punished due to which it will understand whether or not the decision was
correct. This is how the machine will learn to take the correct decisions to
maximize the reward in the long run.
In a reinforcement algorithm, the machine can be adjusted and programmed to
focus more on either long-term rewards or short-term rewards. When the machine
is in a particular state and has to choose an action for the next state in order
to achieve the reward, this process is called a Markov Decision Process.

Supervised and unsupervised learning are the most widely used by machine
learning engineers and data geeks. Reinforcement learning is really powerful but
more complex to apply to problems.

Supervised learning

As we know, machine learning takes data as input. Let's call this data training data.

The training data includes both Inputs and Labels (Targets).

What are Inputs and Labels (Targets)? Take the addition of two numbers as an
example: a = 5, b = 6, result = 11. The Inputs are 5 and 6, and the Target is 11.

We first train the model with lots of training data (inputs & targets),
then, with new data and the logic learned before, we predict the output.

(Note: we may not get the exact answer; for a new pair that should add up to 6,
for instance, we may get a value that is only close to 6, depending on the
training data and the algorithm.)

This process is called Supervised Learning, which is really fast and accurate.
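To make this concrete, here is a minimal Python sketch of the addition example,
using scikit-learn's LinearRegression (the library choice and the training pairs
are ours, purely for illustration; they are not part of the original notes):

import numpy as np
from sklearn.linear_model import LinearRegression

# Inputs are pairs of numbers, targets are their sums; the model has to
# "learn" the addition rule from these examples alone.
X = np.array([[5, 6], [1, 2], [3, 3], [7, 2], [4, 9], [8, 1]])  # inputs
y = X.sum(axis=1)                                               # targets

model = LinearRegression()
model.fit(X, y)

# New, unseen data: the prediction comes out close to 6, as noted above,
# not necessarily exactly 6.
print(model.predict(np.array([[2, 4]])))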

Types of Supervised learning

Regression: This is a type of problem where we need to predict a continuous
response value (e.g., above we predict a number which can vary from -infinity to
+infinity).

Some examples are:

 what is the price of a house in a specific city?
 what is the value of the stock?
 how many total runs will be scored in a cricket game?

etc… there are tons of things we can predict if we wish.

Classification: This is a type of problem where we predict a categorical
response value, where the data can be separated into specific “classes” (e.g.,
we predict one of the values in a set of values).

Some examples are:

 is this mail spam or not?
 will it rain today or not?
 is this picture a cat or not?

Basically, ‘Yes/No’ type questions are called binary classification.

Other examples are:

 is this mail spam, important or promotional?
 is this picture a cat, a dog or a tiger?

This type is called multi-class classification.

Here is the final picture: Classification separates the data, Regression fits the data.

That’s all for supervised learning.

Unsupervised learning

The training data does not include Targets here, so we don't tell the system
where to go; the system has to understand that itself from the data we give.

Here the training data is not structured (it contains noisy data, unknown data,
etc.), for example random articles from different pages.

Unsupervised process

There are also different types of unsupervised learning, like Clustering and
Anomaly detection (clustering is pretty famous).

Clustering: This is a type of problem where we group similar things together.

It is a bit similar to multi-class classification, but here we don't provide the
labels; the system understands from the data itself and clusters the data.

Some examples are :

 given news articles, cluster them into different types of news
 given a set of tweets, cluster them based on the content of the tweet
 given a set of images, cluster them into different objects

Clustering with 3 clusters

Unsupervised learning is a bit more difficult to implement and is not used as
widely as supervised learning.

Reinforcement Learning is a type of Machine Learning, and thereby also a branch
of Artificial Intelligence. It allows machines and software agents to
automatically determine the ideal behavior within a specific context, in order
to maximize their performance. Simple reward feedback is required for the agent
to learn its behavior; this is known as the reinforcement signal.

There are many different algorithms that tackle this issue. As a matter of fact,
Reinforcement Learning is defined by a specific type of problem, and all its
solutions are classed as Reinforcement Learning algorithms. In the problem, an
agent is supposed to decide the best action to select based on its current
state. When this step is repeated, the problem is known as a Markov Decision
Process.

In order to produce intelligent programs (also called agents), reinforcement
learning goes through the following steps:

1. Input state is observed by the agent.


2. Decision making function is used to make the agent perform an action.
3. After the action is performed, the agent receives reward or reinforcement from
the environment.
4. The state-action pair information about the reward is stored.

List of Common Algorithms

 Q-Learning
 Temporal Difference (TD)
 Deep Adversarial Networks

Use cases:
Some applications of reinforcement learning algorithms are computer-played board
games (Chess, Go), robotic hands, and self-driving cars.
K Nearest Neighbors – Classification
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases
based on a similarity measure (e.g., distance functions). KNN has been used in statistical
estimation and pattern recognition since the beginning of the 1970s as a non-parametric
technique.

Algorithm

A case is classified by a majority vote of its neighbors, with the case being assigned to the class
most common amongst its K nearest neighbors measured by a distance function. If K = 1, then
the case is simply assigned to the class of its nearest neighbor.

It should also be noted that the three common distance measures (Euclidean, Manhattan and
Minkowski) are only valid for continuous variables. In the case of categorical variables, the
Hamming distance must be used. This also brings up the issue of standardizing the numerical
variables between 0 and 1 when there is a mixture of numerical and categorical variables in the
dataset.
Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K
value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation
is another way to retrospectively determine a good K value, by using an independent dataset to
validate the K value. Historically, the optimal K for most datasets has been between 3 and 10.
That produces much better results than 1NN.

Example:

Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000)
using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set with
Default=Y.
D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The
prediction for the unknown case is again Default=Y.
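A minimal scikit-learn sketch of this example is shown below. The handful of
training rows is illustrative only, since the full table above is not reproduced
here; the row (33, 150000, Y) is the nearest neighbour used in the distance
calculation above.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# [Age, Loan] -- hypothetical rows in the spirit of the credit-default example
X_train = np.array([
    [25,  40000],
    [35,  60000],
    [45,  80000],
    [60, 100000],
    [48, 220000],
    [33, 150000],   # the nearest neighbour of the unknown case
])
y_train = np.array(['N', 'N', 'N', 'Y', 'Y', 'Y'])

unknown = np.array([[48, 142000]])   # Age = 48, Loan = $142,000

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    print(k, knn.predict(unknown))   # 'Y' for both K=1 and K=3 on this data

# Note: Loan dominates the raw Euclidean distance here; in practice the
# numerical variables should be standardized (e.g. scaled to 0-1) first.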

What is a classifier?
A classifier is a machine learning model that is used to discriminate different
objects based on certain features.

Principle of Naive Bayes Classifier:


A Naive Bayes classifier is a probabilistic machine learning model that is used
for classification tasks. The crux of the classifier is based on Bayes' theorem.

Bayes Theorem:
Using Bayes' theorem, we can find the probability of A happening given that B
has occurred:

P(A|B) = P(B|A) * P(A) / P(B)

Here, B is the evidence and A is the hypothesis. The assumption made here is
that the predictors/features are independent, that is, the presence of one
particular feature does not affect the other. Hence it is called naive.

Example:
Let us take an example to get some better intuition. Consider the problem of
playing golf. The dataset is represented as below.

We classify whether the day is suitable for playing golf, given the features of the
day. The columns represent these features and the rows represent individual
entries. If we take the first row of the dataset, we can observe that it is not
suitable for playing golf if the outlook is rainy, the temperature is hot, the
humidity is high and it is not windy. We make two assumptions here: first, as
stated above, we consider these predictors to be independent, that is, if the
temperature is hot, it does not necessarily mean that the humidity is high. The
second assumption is that all the predictors have an equal effect on the
outcome; that is, the day being windy does not have more importance in deciding
whether to play golf or not.

According to this example, Bayes' theorem can be rewritten as:

P(y|X) = P(X|y) * P(y) / P(X)

The variable y is the class variable (play golf), which represents whether it is
suitable to play golf or not given the conditions. The variable X represents the
parameters/features.

X is given as X = (x_1, x_2, …, x_n), where x_1, x_2, …, x_n represent the
features, i.e. they can be mapped to outlook, temperature, humidity and windy.
By substituting for X and expanding using the chain rule, we get:

P(y|x_1, …, x_n) = [P(x_1|y) * P(x_2|y) * … * P(x_n|y) * P(y)] / [P(x_1) * P(x_2) * … * P(x_n)]

Now, you can obtain the values for each term by looking at the dataset and
substituting them into the equation. For all entries in the dataset, the
denominator does not change; it remains static. Therefore, the denominator can
be removed and a proportionality introduced:

P(y|x_1, …, x_n) ∝ P(y) * P(x_1|y) * P(x_2|y) * … * P(x_n|y)

In our case, the class variable (y) has only two outcomes, yes or no. There
could be cases where the classification is multiclass. Therefore, we need to
find the class y with the maximum probability:

y = argmax_y P(y) * P(x_1|y) * … * P(x_n|y)

Using the above function, we can obtain the class, given the predictors.
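Below is a minimal counting-based sketch of this calculation in Python. The
handful of rows is an illustrative stand-in for the golf dataset above, not the
full table.

from collections import Counter, defaultdict

data = [
    # (outlook, temperature, humidity, windy) -> play golf?  (illustrative rows)
    ({'outlook': 'rainy',    'temp': 'hot',  'humidity': 'high',   'windy': False}, 'no'),
    ({'outlook': 'sunny',    'temp': 'mild', 'humidity': 'normal', 'windy': False}, 'yes'),
    ({'outlook': 'sunny',    'temp': 'hot',  'humidity': 'high',   'windy': True},  'no'),
    ({'outlook': 'overcast', 'temp': 'mild', 'humidity': 'normal', 'windy': False}, 'yes'),
    ({'outlook': 'rainy',    'temp': 'cool', 'humidity': 'normal', 'windy': False}, 'yes'),
]

class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)          # counts of (feature, class) -> value
for features, label in data:
    for name, value in features.items():
        feature_counts[(name, label)][value] += 1

def posterior(features):
    """P(y) * product of P(x_i | y) for each class, up to the constant P(X)."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(data)              # prior P(y)
        for name, value in features.items():   # likelihoods P(x_i | y)
            score *= feature_counts[(name, label)][value] / count
        scores[label] = score
    return scores

print(posterior({'outlook': 'sunny', 'temp': 'mild',
                 'humidity': 'normal', 'windy': False}))
# In practice a small smoothing term (Laplace smoothing) is added so that an
# unseen feature value does not force a zero probability.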

Types of Naive Bayes Classifier:

Multinomial Naive Bayes:


This is mostly used for document classification problems, i.e. whether a
document belongs to the category of sports, politics, technology, etc. The
features/predictors used by the classifier are the frequencies of the words
present in the document.

Bernoulli Naive Bayes:


This is similar to the multinomial naive Bayes, but the predictors are boolean
variables. The parameters that we use to predict the class variable take only
the values yes or no, for example whether a word occurs in the text or not.

Gaussian Naive Bayes:


When the predictors take a continuous value and are not discrete, we assume that
these values are sampled from a Gaussian distribution.

Conclusion:
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering,
recommendation systems, etc. They are fast and easy to implement, but their
biggest disadvantage is the requirement that the predictors be independent. In
most real-life cases the predictors are dependent, which hinders the
performance of the classifier.

Naive Bayes Classifier technique is based on the so-called Bayesian theorem and
is particularly suited when the dimensionality of the inputs is high. Despite its
simplicity, Naive Bayes can often outperform more sophisticated classification
methods.
To demonstrate the concept of Naïve Bayes classification, consider the example
displayed in the illustration above. As indicated, the objects can be classified
as either GREEN or RED. Our task is to classify new cases as they arrive, i.e.,
to decide to which class label they belong, based on the currently existing
objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe
that a new case (which hasn't been observed yet) is twice as likely to have
membership GREEN rather than RED. In the Bayesian analysis, this belief is
known as the prior probability. Prior probabilities are based on previous
experience, in this case the percentage of GREEN and RED objects, and often
used to predict outcomes before they actually happen.
Thus, we can write:

Prior probability of GREEN ∝ Number of GREEN objects / Total number of objects
Prior probability of RED ∝ Number of RED objects / Total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our
prior probabilities for class membership are:

Prior probability of GREEN = 40/60
Prior probability of RED = 20/60

Having formulated our prior probability, we are now ready to classify a new
object (WHITE circle). Since the objects are well clustered, it is reasonable to
assume that the more GREEN (or RED) objects in the vicinity of X, the more
likely that the new cases belong to that particular color. To measure this
likelihood, we draw a circle around X which encompasses a number (to be chosen
a priori) of points irrespective of their class labels. Then we calculate the number
of points in the circle belonging to each class label. From this we calculate the
likelihood:

Likelihood of X given GREEN ∝ Number of GREEN objects in the vicinity of X / Total number of GREEN objects
Likelihood of X given RED ∝ Number of RED objects in the vicinity of X / Total number of RED objects

From the illustration above, it is clear that the Likelihood of X given GREEN is
smaller than the Likelihood of X given RED, since the circle encompasses 1 GREEN
object and 3 RED ones. Thus:

Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20

Although the prior probabilities indicate that X may belong to GREEN (given
that there are twice as many GREEN compared to RED) the likelihood indicates
otherwise; that the class membership of X is RED (given that there are more RED
objects in the vicinity of X than GREEN). In the Bayesian analysis, the final
classification is produced by combining both sources of information, i.e., the prior
and the likelihood, to form a posterior probability using the so-called Bayes' rule
(named after Rev. Thomas Bayes 1702-1761).
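Combining the two sources of information, up to a common normalising constant:

Posterior probability of X being GREEN ∝ Prior of GREEN * Likelihood of X given GREEN = 40/60 * 1/40 = 1/60
Posterior probability of X being RED ∝ Prior of RED * Likelihood of X given RED = 20/60 * 3/20 = 3/60 = 1/20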

Finally, we classify X as RED since its class membership achieves the largest
posterior probability.
Decision Trees
For example, you go to your nearest super store to buy milk for your family. The
very first question which comes to your mind is – how much milk should I buy
today?

To answer this basic question, your unconscious mind makes some calculations
(based on the sample questions listed below) and you end up buying the required
quantity of milk:

 Is it a normal weekday? On weekdays we require 1 Litre of Milk.

 Is it a weekend? On weekends we require 1.5 Litre of Milk.

 Are we expecting any guests today? We need to buy 250 ML extra milk for each
guest, etc.

Formally speaking, “Decision tree is a binary (mostly) structure where each node
best splits the data to classify a response variable. Tree starts with a Root which
is the first node and ends with the final nodes which are known as leaves of the
tree”.

Assume that you are given characteristic information about 10,000 people living
in your town. You are asked to study it and come up with an algorithm which
should be able to tell whether a new person coming to the town is male or
female.

Primarily you are given information about:

 Skin colour

 Hair length

 Weight

 Height

Based on this information, you can divide it in such a way that it somehow
indicates the characteristics of Males vs. Females.

Below is a hypothetical tree designed out of this data:


The tree shown above divides the data in such a way that we gain the maximum
information. To understand the tree: if a person's hair length is less than 5
inches and weight is greater than 55 kg, then there is an 80% chance of that
person being Male.

If you are familiar with Predictive Modelling, e.g., Logistic Regression, Random
Forest, etc., you might be wondering what the difference is between a Logistic
Model and a Decision Tree, because in both algorithms we are trying to predict a
categorical variable.

There are a few fundamental differences between both but ideally both the
approaches should give you the same results. The best use of Decision Trees is
when your solution requires a representation. For example, you are working for
a Telecom Operator and building a solution using which a call center agent can
take a decision whether to pitch for an upsell or not!

There is very little chance that a call center executive will understand
Logistic Regression or its equations, but using a more visually appealing
solution you might gain better adoption from your call center team.
How does Decision Tree work?
There are multiple algorithms written to build a decision tree, which can be used
according to the characteristics of the problem you are trying to solve. A few
of the commonly used algorithms are listed below:

 ID3
 C4.5

 CART

 CHAID (CHi-squared Automatic Interaction Detector)

 MARS

 Conditional Inference Trees

Though the methods differ across decision tree building algorithms, all of them
work on the principle of greediness: the algorithms search for the variable
which gives the maximum information gain, or which divides the data in the most
homogeneous way.

For example, consider the following hypothetical dataset which contains the Lead
Actor and Genre of a movie along with its success at the box office:
Lead Actor Genre Hit(Y/N)

Amitabh Bacchan Action Yes

Amitabh Bacchan Fiction Yes

Amitabh Bacchan Romance No

Amitabh Bacchan Action Yes

Abhishek Bacchan Action No

Abhishek Bacchan Fiction No

Abhishek Bacchan Romance Yes

Let's say you want to identify the success of the movie, but you can use only
one variable. There are two ways in which this can be done: split on the Lead
Actor (Method 1) or split on the Genre (Method 2). You can clearly observe that
Method 1 (based on lead actor) splits the data best, while the second method
(based on genre) produces mixed results.
Decision tree algorithms do similar things when it comes to selecting variables.

There are various metrics which decision trees use in order to find the best
split variables. We'll go through them one by one and try to understand what
they mean.
Entropy & Information Gain
The word Entropy is borrowed from thermodynamics, where it is a measure of
variability, chaos or randomness. Shannon extended the thermodynamic entropy
concept in 1948, introduced it into statistical studies, and suggested the
following formula for statistical entropy:

H = -Σ p_i * ln(p_i)

where H is the entropy of the system (a measure of randomness) and p_i is the
probability of class i.

Assume you are tossing a fair coin and want to know the entropy of the system.
As per the formula given by Shannon, the entropy would be equal to
-[0.5 ln(0.5) + 0.5 ln(0.5)], which is equal to 0.69, the maximum entropy that
can occur in this system. In other words, there will be maximum randomness in
our dataset if the probable outcomes have the same probability of occurrence.

The graph shown above shows the variation of entropy with the probability of a
class; we can clearly see that entropy is maximum when the probabilities of the
classes are equal. Now you can understand that when a decision tree algorithm
tries to split the data, it selects the variable which gives us the maximum
reduction in system entropy.

For the example of the movie success rate, the initial entropy in the system was:

Entropy_parent = -(0.57*ln(0.57) + 0.43*ln(0.43)) = 0.68

Entropy after the Method 1 Split:

Entropy_left  = -(0.75*ln(0.75) + 0.25*ln(0.25)) = 0.56
Entropy_right = -(0.33*ln(0.33) + 0.67*ln(0.67)) = 0.63

The information captured by splitting the data using Method 1 can be calculated
with the formula “Entropy(Parent) – Weighted Average of Children's Entropy”,
which is:

0.68 – (4*0.56 + 3*0.63)/7 = 0.09

This number, 0.09, is generally known as the “Information Gain”.

Entropy after the Method 2 Split:

Entropy_left   = -(0.67*ln(0.67) + 0.33*ln(0.33)) = 0.63
Entropy_middle = -(0.5*ln(0.5) + 0.5*ln(0.5)) = 0.69
Entropy_right  = -(0.5*ln(0.5) + 0.5*ln(0.5)) = 0.69

Now, using the method above, we can calculate the Information Gain as:

Information Gain = 0.68 – (3*0.63 + 2*0.69 + 2*0.69)/7 = 0.02

Hence, we can clearly see that Method 1 gives us more than 4 times the
information gain of Method 2, and hence the lead actor (Method 1) is the better
split variable.
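The arithmetic above can be checked with a few lines of Python; this is a small
sketch, with the grouping of Hit values by lead actor taken from the table above:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n)
                for c in set(labels))

# Hit values grouped by lead actor (Method 1)
groups = {'Amitabh': ['Y', 'Y', 'N', 'Y'], 'Abhishek': ['N', 'N', 'Y']}
all_hits = [h for hits in groups.values() for h in hits]

parent = entropy(all_hits)                                       # ~0.68
weighted_children = sum(len(hits) / len(all_hits) * entropy(hits)
                        for hits in groups.values())
print(round(parent, 2), round(parent - weighted_children, 2))    # ~0.68, ~0.09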
Gain Ratio
Soon after the development of entropy, mathematicians realized that information
gain is biased toward multi-valued attributes. To overcome this issue, the “Gain
Ratio” came into the picture, which is more reliable than information gain. The
gain ratio is defined as:

Gain Ratio = Information Gain / Split Info

where Split Info is defined as:

Split Info = -Σ (|D_i| / |D|) * log2(|D_i| / |D|), summed over the n child nodes

assuming we are dividing our variable into ‘n’ child nodes, where D_i represents
the number of records going into the ith child node. Hence, the gain ratio takes
care of the distribution bias while building a decision tree.

For the example discussed above, for Method 1:

Split Info = -((4/7)*log2(4/7)) - ((3/7)*log2(3/7)) = 0.98

and hence

Gain Ratio = 0.09/0.98 = 0.092


Gini Index
There is one more metric which can be used while building a decision tree: the
Gini Index (mostly used in CART). The Gini index measures the impurity of a data
partition K; the formula for the Gini index can be written as:

Gini(K) = 1 - Σ (P_i)^2, for i = 1, …, m

where m is the number of classes and P_i is the probability that an observation
in K belongs to class i. The Gini index assumes a binary split for each of the
attributes in S, say T1 and T2. The Gini index of K given this partitioning is
given by:

Gini_split(K) = (|T1|/|K|) * Gini(T1) + (|T2|/|K|) * Gini(T2)

which is nothing but a weighted sum of the impurities of the split nodes. The
reduction in impurity is given by:

ΔGini = Gini(K) - Gini_split(K)

Similar to Information Gain and Gain Ratio, the split which gives us the maximum
reduction in impurity is chosen for dividing our data.

Coming back to our movie example, if we want to calculate Gini(K):

Gini(K) = 1 - (4/7)^2 - (3/7)^2 = 0.49

Now, as per our Method 1 (splitting on the lead actor), we get Gini_split(K) as:

Gini_split(K) = (4/7)*0.375 + (3/7)*0.444 = 0.21 + 0.19 = 0.40
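Again, a short Python sketch can verify the Gini arithmetic for the Method 1
split (the label groups are taken from the movie table above):

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

amitabh = ['Y', 'Y', 'N', 'Y']      # Hit values for the Amitabh Bacchan rows
abhishek = ['N', 'N', 'Y']          # Hit values for the Abhishek Bacchan rows

parent = gini(amitabh + abhishek)                      # Gini(K), ~0.49
split = (len(amitabh) * gini(amitabh) +
         len(abhishek) * gini(abhishek)) / 7           # weighted sum, ~0.40
print(round(parent, 2), round(split, 2))
# The reduction in impurity is the difference between the two values.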

LINEAR REGRESSION

Linear regression is a statistical approach for modelling the relationship
between a dependent variable and a given set of independent variables.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single
feature.
It is assumed that the two variables are linearly related. Hence, we try to find
a linear function that predicts the response value (y) as accurately as possible
as a function of the feature or independent variable (x).
Let us consider a dataset where we have a value of response y for every feature
x:

For generality, we define:
x as the feature vector, i.e. x = [x_1, x_2, …, x_n],
y as the response vector, i.e. y = [y_1, y_2, …, y_n],
for n observations (in the above example, n = 10).
A scatter plot of the above dataset looks like this:

Now, the task is to find a line which best fits the above scatter plot so that
we can predict the response for any new feature value (i.e. a value of x not
present in the dataset).
This line is called the regression line.
The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,
 h(x_i) represents the predicted response value for the ith observation.
 b_0 and b_1 are regression coefficients and represent the y-intercept and
slope of the regression line respectively.
To create our model, we must “learn” or estimate the values of the regression
coefficients b_0 and b_1. Once we've estimated these coefficients, we can use
the model to predict responses!
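A minimal NumPy sketch of estimating b_0 and b_1 by least squares is shown
below; the ten (x, y) points are illustrative, since the table above is not
reproduced here.

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Least-squares estimates of the slope and intercept
b_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_0 = y.mean() - b_1 * x.mean()
print(b_0, b_1)                 # estimated intercept and slope

x_new = 10
print(b_0 + b_1 * x_new)        # predicted response h(x_new)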

LOGISTIC REGRESSION
Consider an example dataset which maps the number of hours of study to the
result of an exam. The result can take only two values, namely passed (1) or
failed (0):

HOURS (x): 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
PASS (y):  0    0    0    0    0    0    1    0    1    0    1

i.e. y is a categorical target variable which can take only two possible values,
“0” or “1”.
In order to generalize our model, we assume that the probability of y = 1 is
given by the sigmoid (logistic) function of a linear combination of the
features, i.e. P(y = 1 | x) = 1 / (1 + e^-(b_0 + b_1*x)).
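A minimal scikit-learn sketch fitting a logistic regression to the hours/pass
data above (the library choice is ours; it is not part of the original notes):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75,
                  2.00, 2.25, 2.50, 2.75, 3.00]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Estimated probability of passing after 1 hour vs. 3 hours of study
print(model.predict_proba([[1.0], [3.0]])[:, 1])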
Differences between Linear Regression and Logistic Regression:-

1. Linear Regression is a supervised regression model. Logistic Regression is a
supervised classification model.
2. In Linear Regression, we predict the value as a continuous number. In
Logistic Regression, we predict the value as 1 or 0.
3. In Linear Regression, no activation function is used. In Logistic Regression,
an activation function is used to convert the linear regression equation into
the logistic regression equation.
4. In Linear Regression, no threshold value is needed. In Logistic Regression, a
threshold value is added.
5. In Linear Regression, we calculate the Root Mean Square Error (RMSE) to
predict the next weight value. In Logistic Regression, we use precision to
predict the next weight value.
6. In Linear Regression, the dependent variable should be numeric and the
response variable is continuous in value. In Logistic Regression, the dependent
variable consists of only two categories; logistic regression estimates the odds
of an outcome of the dependent variable given a set of quantitative or
categorical independent variables.
7. Linear Regression is based on least squares estimation. Logistic Regression
is based on maximum likelihood estimation.
8. In Linear Regression, when we plot the training dataset, a straight line can
be drawn that touches the maximum number of points. In Logistic Regression, any
change in a coefficient leads to a change in both the direction and the
steepness of the logistic function: positive slopes result in an S-shaped curve
and negative slopes result in a Z-shaped curve.
9. Linear regression is used to estimate the dependent variable in case of a
change in the independent variables, for example to predict the price of houses.
Logistic regression is used to calculate the probability of an event, for
example to classify whether tissue is benign or malignant.
10. Linear regression assumes a normal or Gaussian distribution of the dependent
variable. Logistic regression assumes a binomial distribution of the dependent
variable.

Generalized Linear Models:-

The Generalized Linear Model (GLZ) is a generalization of the general linear
model. In its simplest form, a linear model specifies the (linear) relationship
between a dependent (or response) variable Y and a set of predictor variables,
the X's, so that
Y = b0 + b1X1 + b2X2 + ... + bkXk
In this equation, b0 is the regression coefficient for the intercept and the bi
values are the regression coefficients (for variables 1 through k) computed from
the data. So, for example, we could estimate (i.e., predict) a person's weight
as a function of the person's height and gender. You could use linear regression
to estimate the respective regression coefficients from a sample of data,
measuring height and weight and observing the subjects' gender. For many data
analysis problems, estimates of the linear relationships between variables are
adequate to describe the observed data and to make reasonable predictions for
new observations.
However, there are many relationships that cannot adequately be summarized by
a simple linear equation, for two major reasons:
Distribution of dependent variable. First, the dependent variable of interest
may have a non-continuous distribution, and thus, the predicted values should
also follow the respective distribution; any other predicted values are not logically
possible. For example, a researcher may be interested in predicting one of
three possible discrete outcomes (e.g., a consumer's choice of one of three
alternative products). In that case, the dependent variable can only take on
3 distinct values, and the distribution of the dependent variable is said to
be multinomial. Or suppose you are trying to predict people's family planning
choices, specifically, how many children families will have, as a function of
income and various other socioeconomic indicators. The dependent variable -
number of children - is discrete (i.e., a family may have 1, 2, or 3 children and so
on, but cannot have 2.4 children), and most likely the distribution of that variable
is highly skewed (i.e., most families have 1, 2, or 3 children, fewer will have 4 or
5, very few will have 6 or 7, and so on). In this case it would be reasonable to
assume that the dependent variable follows a Poisson distribution.
Link function. A second reason why the linear (multiple regression) model might
be inadequate to describe a particular relationship is that the effect of the
predictors on the dependent variable may not be linear in nature. For example,
the relationship between a person's age and various indicators of health is most
likely not linear in nature: During early adulthood, the (average) health status of
people who are 30 years old as compared to the (average) health status of people
who are 40 years old is not markedly different. However, the difference in health
status of 60 year old people and 70 year old people is probably greater. Thus, the
relationship between age and health status is likely non-linear in nature. Probably
some kind of a power function would be adequate to describe the relationship
between a person's age and health, so that each increment in years of age at older
ages will have greater impact on health status, as compared to each increment in
years of age during early adulthood. Put in other words, the link between age and
health status is best described as non-linear, or as a power relationship in this
particular example.
The generalized linear model can be used to predict responses both for dependent
variables with discrete distributions and for dependent variables which are
nonlinearly related to the predictors.

Computational Approach
To summarize the basic ideas, the generalized linear model differs from the
general linear model (of which, for example, multiple regression is a special
case) in two major respects. First, the distribution of the dependent or
response variable can be (explicitly) non-normal and does not have to be
continuous, i.e., it can be binomial, multinomial, or ordinal multinomial (i.e.,
contain information on ranks only). Second, dependent variable values are
predicted from a linear combination of predictor variables, which are
"connected" to the dependent variable via a link function.
The general linear model for a single dependent variable can be considered a
special case of the generalized linear model: In the general linear model the
dependent variable values are expected to follow the normal distribution, and the
link function is a simple identity function (i.e., the linear combination of values
for the predictor variables is not transformed).
To illustrate, in the general linear model a response variable Y is linearly
associated with values on the X variables by
Y = b0 + b1X1 + b2X2 + ... + bkXk + e
(where e stands for the error variability that cannot be accounted for by the
predictors; note that the expected value of e is assumed to be 0), while the
relationship in the generalized linear model is assumed to be
Y = g(b0 + b1X1 + b2X2 + ... + bkXk) + e
where e is the error and g(…) is a function. Formally, the inverse function of
g(…), say f(…), is called the link function, so that
f(mu_Y) = b0 + b1X1 + b2X2 + ... + bkXk
where mu_Y stands for the expected value of Y.
Link functions and distributions. Various link functions can be chosen,
depending on the assumed distribution of the Y variable values. For the
Normal, Gamma, Inverse normal, and Poisson distributions:
Identity link: f(z) = z

Log link: f(z) = log(z)

Power link: f(z) = z^a, for a given a
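As a rough illustration, the statsmodels package can fit such models; the sketch
below assumes a Poisson family with its default log link, in the spirit of the
number-of-children example above, and the data are made up.

import numpy as np
import statsmodels.api as sm

# Hypothetical data: household income (in thousands) vs. number of children
income = np.array([20, 25, 30, 35, 40, 50, 60, 70, 80, 90], dtype=float)
children = np.array([4, 3, 3, 2, 3, 2, 2, 1, 1, 1])

X = sm.add_constant(income)                    # adds the intercept column
model = sm.GLM(children, X, family=sm.families.Poisson())
result = model.fit()

print(result.params)                                   # b0 and b1 on the log scale
print(result.predict(sm.add_constant([30.0, 75.0])))   # expected counts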


SUPPORT VECTOR MACHINES

Support Vector Machines, or SVMs, are supervised learning models with associated
learning algorithms that analyze data for classification (classification means
knowing what belongs to what, e.g. 'apple' belongs to the class 'fruit' while
'dog' belongs to the class 'animals' – see Fig. 1).

Fig. 1

In support vector machines, the data looks somewhat like Fig. 2 below, which
separates the blue balls from the red ones.
An SVM is a classifier formally defined by a separating hyperplane. A hyperplane
is a subspace of one dimension less than its ambient space. The dimension of a
mathematical space (or object) is informally defined as the minimum number of
coordinates (x, y, z axes) needed to specify any point (like each blue and red
point) within it, while an ambient space is the space surrounding a mathematical
object.

Therefore the hyperplane of the two-dimensional space below (Fig. 2) is a
one-dimensional line dividing the red and blue dots.

Fig. 2

From the example above of trying to predict the breed of a particular dog, it
goes like this:

Data (all breeds of dog) → Features (skin colour, hair, etc.) → Learning algorithm

So why Kernels?
Consider the Fig. 3 below
Fig. 3

Can you try to solve the above problem linearly like we did with Fig. 2?

NO!

The red and blue balls cannot be separated by a straight line as they are
randomly distributed, and this, in reality, is how most real-life problem data
are: randomly distributed.

In machine learning, a “kernel” is usually used to refer to the kernel trick, a method
of using a linear classifier to solve a non-linear problem. It entails transforming
linearly inseparable data like (Fig. 3) to linearly separable ones (Fig. 2). The kernel
function is what is applied on each data instance to map the original non-linear
observations into a higher-dimensional space in which they become separable.

Using the dog breed prediction example again, kernels offer a better alternative.
Instead of defining a slew of features, you define a single kernel function to
compute similarity between breeds of dog. You provide this kernel, together with
the data and labels to the learning algorithm, and out comes a classifier.
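A minimal scikit-learn sketch of this idea: a dataset that no straight line can
separate (generated here with make_circles purely for illustration, not the
figure's actual data) is handled well once an RBF kernel is used.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: linearly inseparable, like Fig. 3
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel='linear').fit(X, y)
rbf_clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

print(linear_clf.score(X, y))   # poor: no separating straight line exists
print(rbf_clf.score(X, y))      # near 1.0: separable after the kernel mapping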
So this is with two features, and we see we have a 2D graph. If we had three
features, we could have a 3D graph. The 3D graph would be a little more
challenging for us to visually group and divide, but still do-able. The problem
occurs when we have four features, or four-thousand features. Now you can start
to understand the power of machine learning, seeing and analyzing a number of
dimensions imperceptible to us.

With that in mind, we're going to go ahead and continue with our two-featured
example. Now, in order to feed data into our machine learning algorithm, we first
need to compile an array of the features, rather than having them as x and y
coordinate values.

Generally, you will see the feature list being stored in a capital X variable. Let's
translate our above x and y coordinates into an array that is compiled of the x and
y coordinates, where x is a feature and y is a feature.

import numpy as np
from sklearn import svm

# Feature array: each row is an [x, y] coordinate pair
X = np.array([[1,2],
[5,8],
[1.5,1.8],
[8,8],
[1,0.6],
[9,11]])

Now that we have this array, we need to label it for training purposes. There are
forms of machine learning called "unsupervised learning," where data labeling
isn't used, as is the case with clustering, though this example is a form of
supervised learning.

For our labels, sometimes referred to as "targets," we're going to use 0 or 1.

y = [0,1,0,1,0,1]

Just by looking at our data set, we can see we have coordinate pairs that are "low"
numbers and coordinate pairs that are "higher" numbers. We've then assigned 0
to the lower coordinate pairs and 1 to the higher feature pairs.

These are the labels. In the case of our project, we will wind up having a list of
numerical features that are various statistics about stock companies, and then the
"label" will be either a 0 or a 1, where 0 is under-perform the market and a 1 is
out-perform the market.

Moving along, we are now going to define our classifier:

clf = svm.SVC(kernel='linear', C = 1.0)

We're going to be using the SVC (support vector classifier) SVM (support vector
machine). Our kernel is going to be linear, and C is equal to 1.0. What is C you
ask? Don't worry about it for now, but, if you must know, C is a valuation of "how
badly" you want to properly classify, or fit, everything. The machine learning
field is relatively new, and experimental. There exist many debates about the
value of C, as well as how to calculate the value for C. We're going to just stick
with 1.0 for now, which is a nice default parameter.

Next, we call:

clf.fit(X,y)

Note: this is an older tutorial; newer versions of Scikit-Learn expect the input
to predict() to be a two-dimensional array (one row per sample), which is why
the coordinate pairs in the predictions below are wrapped in an extra pair of
brackets.

From here, the learning is done. It should be nearly-instant, since we have such a
small data set.

Next, we can predict and test. Let's print a prediction:

print(clf.predict([[0.58,0.76]]))
We're hoping this predicts a 0, since this is a "lower" coordinate pair.

Sure enough, the prediction is a classification of 0. Next, what if we do:

print(clf.predict([[10.58,10.76]]))

And again, we have a theoretically correct answer of 1 as the classification.
This was a blind prediction, though it was really a test as well, since we knew
what the hopeful target was. Congratulations, you have 100% accuracy!

SVM Kernel Functions

SVM algorithms use a set of mathematical functions that are defined as the
kernel. The function of the kernel is to take data as input and transform it
into the required form. Different SVM algorithms use different types of kernel
functions, for example linear, nonlinear, polynomial, radial basis function
(RBF), and sigmoid.
Kernel functions can also be introduced for sequence data, graphs, text, images,
as well as vectors. The most used type of kernel function is the RBF, because it
has a localized and finite response along the entire x-axis.

The kernel functions return the inner product between two points in a suitable
feature space, thus defining a notion of similarity with little computational
cost, even in very high-dimensional spaces.

Examples of SVM Kernels


Let us see some common kernels used with SVMs and their uses:

4.1. Polynomial kernel

It is popular in image processing. The equation is:

k(x, y) = (x · y + 1)^d

where d is the degree of the polynomial.

4.2. Gaussian kernel

It is a general-purpose kernel, used when there is no prior knowledge about the
data. The equation is:

k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
4.3. Gaussian radial basis function (RBF)
It is a general-purpose kernel, used when there is no prior knowledge about the
data. The equation is:

k(x, y) = exp(-gamma * ||x - y||^2), for gamma > 0

It is sometimes parametrized using gamma = 1 / (2 * sigma^2).
4.4. Laplace RBF kernel
It is a general-purpose kernel, used when there is no prior knowledge about the
data. The equation is:

k(x, y) = exp(-||x - y|| / sigma)

4.5. Hyperbolic tangent kernel
We can use it in neural networks. The equation is:

k(x, y) = tanh(kappa * (x · y) + c), for some (not every) kappa > 0 and c < 0.

4.6. Sigmoid kernel

We can use it as a proxy for neural networks. The equation is:

k(x, y) = tanh(alpha * (x · y) + c)


4.7. Bessel function of the first kind kernel
We can use it to remove the cross term in mathematical functions. Its equation
is expressed in terms of J, the Bessel function of the first kind.

4.8. ANOVA radial basis kernel

We can use it in regression problems.
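In scikit-learn, several of these kernels can be selected directly through the
kernel parameter of SVC; a small sketch (reusing the toy X and y from the
earlier example) might look like this.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
y = [0, 1, 0, 1, 0, 1]

classifiers = {
    'linear':     SVC(kernel='linear'),
    'polynomial': SVC(kernel='poly', degree=3),      # d = degree of the polynomial
    'rbf':        SVC(kernel='rbf', gamma='scale'),  # Gaussian radial basis function
    'sigmoid':    SVC(kernel='sigmoid'),             # tanh-based kernel
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.predict([[0.58, 0.76], [10.58, 10.76]]))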

Multiclass classification

Just as binary classification involves predicting if something is from one of
two classes (e.g. “black” or “white”, “dead” or “alive”, etc.), multiclass
problems involve classifying something into one of N classes (e.g. “red”,
“white” or “blue”, etc.).

Common examples include image classification (is it a cat, dog, human, etc)
or handwritten digit recognition (classifying an image of a handwritten number
into a digit from 0 to 9).

Performance Metrics

• Accuracy can be calculated by summing the values lying on the “main diagonal”
of the confusion matrix and dividing by the total number of samples, i.e.
Accuracy = (True Positives + True Negatives) / Total Number of Samples

• Precision: it is the number of correct positive results divided by the number
of positive results predicted by the classifier.

• Recall: it is the number of correct positive results divided by the number of
all relevant samples (all samples that should have been identified as positive).
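A minimal sketch of computing these metrics with scikit-learn, on a made-up set
of true and predicted labels:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # diagonal = correct predictions
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)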
Structured prediction or structured (output) learning:-

It is an umbrella term for supervised machine learning techniques that involve
predicting structured objects, rather than scalar discrete or real values.

Similar to commonly used supervised learning techniques, structured prediction
models are typically trained by means of observed data in which the true
prediction value is used to adjust model parameters. Due to the complexity of
the model and the interrelations of the predicted variables, the processes of
prediction with a trained model and of training itself are often computationally
infeasible, and approximate inference and learning methods are used.

For example, the problem of translating a natural language sentence into a
syntactic representation such as a parse tree can be seen as a structured
prediction problem in which the structured output domain is the set of all
possible parse trees. Structured prediction is also used in a wide variety of
application domains including bioinformatics, natural language processing,
speech recognition, and computer vision.

Example: sequence tagging

Sequence tagging is a class of problems prevalent in natural language
processing, where the input data are often sequences (e.g. sentences of text).
The sequence tagging problem appears in several guises, e.g. part-of-speech
tagging and named entity recognition. In POS tagging, for example, each word in
a sequence must receive a "tag" (class label) that expresses its "type" of word:
This DT
is VB
a DT
tagged JJ
sentence NN
. .
DT-Determiner
VB-Verb
JJ-Adjective
NN-Noun
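As a rough illustration, NLTK provides an off-the-shelf POS tagger; the sketch
below assumes the nltk package and its tagger models are available, and the tags
it produces may differ slightly from the table above.

import nltk

# Resource names may vary slightly across NLTK versions
nltk.download('punkt')                        # tokenizer model
nltk.download('averaged_perceptron_tagger')   # POS tagger model

tokens = nltk.word_tokenize("This is a tagged sentence.")
print(nltk.pos_tag(tokens))   # e.g. [('This', 'DT'), ('is', 'VBZ'), ...]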

Ranking :-
Learning to Rank (LTR) is a class of techniques that apply supervised machine
learning (ML) to solve ranking problems. The main difference between LTR
and traditional supervised ML is this:

 Traditional ML solves a prediction problem (classification or regression) on
a single instance at a time. E.g. if you are doing spam detection on email, you
will look at all the features associated with that email and classify it as spam
or not. The aim of traditional ML is to come up with a class (spam or no-spam)
or a single numerical score for that instance.
 LTR solves a ranking problem on a list of items. The aim of LTR is to come up
with an optimal ordering of those items. As such, LTR doesn't care much about
the exact score that each item gets, but cares more about the relative ordering
among all the items.
The most common application of LTR is search engine ranking, but it's useful
anywhere you need to produce a ranked list of items.

The training data for a LTR model consists of a list of items and a "ground truth"
score for each of those items. For search engine ranking, this translates to a list
of results for a query and a relevance rating for each of those results with respect
to the query. The most common way used by major search engines to generate
these relevance ratings is to ask human raters to rate results for a set of queries

Learning to rank algorithms have been applied in areas other than information
retrieval:

 In machine translation, for ranking a set of hypothesized translations
 In computational biology, for ranking candidate 3-D structures in the protein
structure prediction problem
 In recommender systems, for identifying a ranked list of related news articles
to recommend to a user after he or she has read a current news article
 In software engineering, learning-to-rank methods have been used for fault
localization
