Machine learning notes

Study material: important questions

Uploaded by gg9244260

Unit – 4

QUESTION AND ANSWER BANK


2 Marks
1. What is logistic regression?

Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.

Because logistic regression predicts a categorical dependent variable, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of giving an exact value of 0 or 1, however, it gives probabilistic values that lie between 0 and 1.

2. Why do we use Logistic regression rather than Linear Regression?

Linear regression is used for regression tasks, while logistic regression handles classification tasks. Linear regression produces unbounded continuous outputs, which cannot be interpreted as class probabilities, whereas logistic regression maps its output into the interval (0, 1). Both are supervised learning techniques that learn from labeled datasets to generate accurate predictions.

3. Define logistic function

The logistic function, also called the sigmoid function, was developed by statisticians to describe population growth in ecology: rising quickly and levelling off at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits:

1 / (1 + e^-value)

where e is the base of the natural logarithms (Euler's number, or the EXP() function in a spreadsheet) and value is the numerical value you want to transform. A plot of the numbers between -5 and 5 transformed into the range 0 to 1 by the logistic function illustrates the S-shape.
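The transform above is easy to verify in code. A minimal sketch in plain Python (no libraries beyond the standard library):

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

# Values from -5 to 5 map smoothly onto (0, 1), crossing 0.5 at 0
for v in [-5, 0, 5]:
    print(v, round(sigmoid(v), 4))
```

Note that the output approaches, but never reaches, 0 and 1 at the extremes.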

4. Define cost function in logistic regression


The cost function of logistic regression is derived by taking the log of the maximum likelihood function and negating it, producing the log loss, which can then be minimized with gradient descent. This is why the cross-entropy loss function is also called the log loss function.

5. What are the types of logistic regression?

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: in binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: in multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: in ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".

6. Explain the equation of logistic regression.

The following equation represents logistic regression:

y = e^(b0 + b1·x) / (1 + e^(b0 + b1·x))

which can equivalently be written as log(y / (1 - y)) = b0 + b1·x, where:

 x = input value
 y = predicted output
 b0 = bias or intercept term
 b1 = coefficient for input (x)

This equation is similar to linear regression, where the input values are combined
linearly to predict an output value using weights or coefficient values. However, unlike
linear regression, the output value modeled here is a binary value (0 or 1) rather than a
numeric value.
7. What is classification problem?

In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data. Examples of classification problems include: given an email, classify whether it is spam or not; given a handwritten character, classify it as one of the known characters.

8. What are the problems in using the linear regression approach for classification?

Two things explain why linear regression is not suitable for classification. First, linear regression outputs continuous values, whereas classification problems require discrete values. Second, the decision threshold shifts when new data points are added, which makes the classifier unstable.

9. Define gradient descent

Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks. Training data helps these models learn over time, and the cost function within gradient descent acts as a barometer, gauging accuracy with each iteration of parameter updates.

10. Define overfitting.

Overfitting occurs when a machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing noise and inaccurate values present in the dataset, and these factors reduce its efficiency and accuracy. An overfitted model has low bias and high variance.

11. Explain neural network

A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes, or neurons, in a layered structure that resembles the human brain.

12. What is multi-class classification?

In machine learning and statistical classification, multiclass classification, or multinomial classification, is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification).
13. Which classifiers are used for multiclass classification?

 Neural networks (including extreme learning machines).
 k-nearest neighbors.
 Naive Bayes.
 Decision trees.
 Support vector machines.
 Multi expression programming.

14. What is entropy?

Entropy is defined as a measure of the randomness or disorder of the information being processed in machine learning. In other words, entropy is the machine learning metric that measures the unpredictability or impurity in the system.
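Entropy over a class distribution can be computed directly; a minimal sketch:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: higher means more impurity/unpredictability."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A pure class split has zero entropy; a 50/50 split is maximally impure.
print(entropy([1.0]))        # 0.0
print(entropy([0.5, 0.5]))   # 1.0
```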

13 MARKS

1. DEFINE LOGISTIC REGRESSION WITH A BRIEF EXAMPLE.

 Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.
 Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of giving the exact value 0 or 1, it gives probabilistic values that lie between 0 and 1.
 Logistic regression is very similar to linear regression except in how the two are used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
 In logistic regression, instead of fitting a regression line, we fit an S-shaped logistic function, which predicts two maximum values (0 or 1).
 The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
 Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets.

o Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The image below shows the logistic function:

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:

o We know the equation of a straight line can be written as:

  y = b0 + b1x1 + b2x2 + ... + bnxn

o In logistic regression y can be between 0 and 1 only, so we divide y by (1 - y):

  y / (1 - y), which is 0 for y = 0 and infinity for y = 1

o But we need a range from -[infinity] to +[infinity], so we take the logarithm, and the equation becomes:

  log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for logistic regression.
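The derivation can be checked numerically: applying log(y/(1-y)) to the sigmoid output recovers the linear term. A small sketch with made-up illustrative coefficients b0, b1 and input x:

```python
import math

b0, b1, x = -1.5, 0.8, 2.0          # illustrative coefficients, not from the text
z = b0 + b1 * x                      # linear part: 0.1
y = 1 / (1 + math.exp(-z))           # sigmoid output in (0, 1)
log_odds = math.log(y / (1 - y))     # logit transform
print(round(log_odds, 6))            # recovers z = 0.1
```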

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: in binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: in multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: in ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".

Example: There is a dataset that contains information about various users, obtained from a social networking site. A car manufacturer has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.

o For this problem, we will build a machine learning model using the logistic regression algorithm. The dataset is shown in the image below. In this problem, we will predict the purchased variable (dependent variable) using age and salary (independent variables).
2. EXPLAIN WITH A CLEAR EXAMPLE THE STEPS USED IN LOGISTIC REGRESSION.

Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of giving the exact value 0 or 1, it gives probabilistic values that lie between 0 and 1.

Logistic regression is very similar to linear regression except in how the two are used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.

Example: There is a dataset that contains information about various users, obtained from a social networking site. A car manufacturer has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.

For this problem, we will build a machine learning model using the logistic regression algorithm. The dataset is shown in the image below. In this problem, we will predict the purchased variable (dependent variable) using age and salary (independent variables).

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will
use the same steps as we have done in previous topics of Regression. Below are the steps:

o Data pre-processing step
o Fitting logistic regression to the training set
o Predicting the test result
o Testing the accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can
use it in our code efficiently. It will be the same as we have done in Data pre-processing topic.
The code for this is given below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the
given image:

Now, we will extract the dependent and independent variables from the given
dataset. Below is the code for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables are age
and salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The outputs for the training set and the test set are shown below.

In logistic regression, we will do feature scaling because we want accurate prediction results. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:

2. Fitting Logistic Regression to the Training set:


Our dataset is now well prepared, and we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class from the sklearn library.

After importing the class, we will create a classifier object and use it to fit the logistic regression model. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by
using test set data. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:
The above output image shows the predicted users: those who want to purchase the car and those who do not.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters: y_true (the actual values) and y_pred (the values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the below
image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
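These counts can also be computed from the matrix programmatically; a minimal sketch assuming a 2x2 matrix with the same counts as above (the exact row/column layout is an assumption for illustration):

```python
import numpy as np

# Confusion matrix with the counts reported above (rows: actual, cols: predicted)
cm = np.array([[65, 3],
               [8, 24]])

correct = np.trace(cm)        # diagonal entries = correct predictions
total = cm.sum()
accuracy = correct / total
print(correct, total - correct, accuracy)   # 89 11 0.89
```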

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class from the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid that extends one unit beyond the minimum and maximum of each feature. The pixel points we have taken are of 0.01 resolution.

To create a filled contour, we have used the mtp.contourf command; it creates regions of the provided colors (purple and green). In this function, we have passed classifier.predict to show the data points as predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the
result for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
o The purple point observations are for which purchased (dependent variable) is probably
0, i.e., users who did not purchase the SUV car.
o The green point observations are for which purchased (dependent variable) is probably 1
means user who purchased the SUV car.
o We can also estimate from the graph that the users who are younger with low salary, did not
purchase the car, whereas older users with high estimated salary purchased the car.
o But there are some purple points in the green region (buying the car) and some green points in the purple region (not buying the car). So we can say that some younger users with a high estimated salary purchased the car, whereas some older users with a low estimated salary did not purchase the car.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and our
goal for this classification is to divide the users who purchased the SUV car and who did not
purchase the car. So from the output graph, we can clearly see the two regions (Purple and Green)
with the observation points. The Purple region is for those users who didn't buy the car, and
Green Region is for those users who purchased the car.

Linear Classifier:

As we can see from the graph, the classifier is a Straight line or linear in nature as we
have used the Linear model for Logistic Regression. In further topics, we will learn for non-linear
Classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for
new observations (Test set). The code for the test set will remain same as above except that here
we will use x_test and y_test instead of x_train and y_train. Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two regions (purple and green), with green observations in the green region and purple observations in the purple region, so the model's predictions are good. Some of the green and purple data points are in different regions, which can be ignored, as we have already counted these errors using the confusion matrix (11 incorrect outputs).

Hence our model is pretty good and ready to make new predictions for this
classification problem.
3. WHAT IS THE COST FUNCTION IN LOGISTIC REGRESSION?

Logistic regression:

We use logistic regression to solve classification problems where the outcome is a discrete variable. Usually, we use it to solve binary classification problems; as the name suggests, binary classification problems have two possible outputs.

We utilize the sigmoid function (or logistic function) to map input values from a wide range into a limited interval. Mathematically, the sigmoid function is:

σ(z) = 1 / (1 + e^(-z))

Cost function in logistic regression:

The cost function summarizes how well the model is behaving. In other
words, we use the cost function to measure how close the model’s predictions are to the actual
outputs.
In linear regression, we use the mean squared error (MSE) as the cost function. But in logistic regression, using the mean of the squared differences between actual and predicted outcomes as the cost function would give a wavy, non-convex cost surface containing many local optima:

In this case, finding an optimal solution with the gradient descent method is not
possible. Instead, we use a logarithmic function to represent the cost of logistic
regression. It is guaranteed to be convex for all input values, containing only one minimum,
allowing us to run the gradient descent algorithm.
When dealing with a binary classification problem, the logarithmic cost of error depends on the value of y. We can define the cost for the two cases separately:

Cost(hθ(x), y) = -log(hθ(x))       if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x))   if y = 0

When the actual outcome y = 1, the cost is 0 for hθ(x) = 1 and takes its maximum value as hθ(x) approaches 0. Similarly, if y = 0, the cost is 0 for hθ(x) = 0. As the output can either be 0 or 1, we can simplify the two cases into a single equation:

Cost(hθ(x), y) = -y(i) × log(hθ(x(i))) - (1 - y(i)) × log(1 - hθ(x(i)))

In linear regression, we use the mean squared error, based on the difference between y_predicted and y_actual, and this is derived from the maximum likelihood estimator. The graph of the MSE cost function in linear regression is convex.

In logistic regression, however, the prediction is a non-linear function of the parameters, Ŷ = 1 / (1 + e^(-z)). If we use this in the above MSE equation, it gives a non-convex graph with many local minima.

The problem is that a cost function with local minima is a serious issue, because gradient descent can get stuck in one of them, miss the global minimum, and leave the error high.

In order to solve this problem, we derive a different cost function for logistic regression, called log loss, which is also derived from the maximum likelihood estimation method.

In the next section, we'll talk a little about the maximum likelihood estimator and what it is used for. We'll also look at the math behind this log loss function.
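The log loss above can be evaluated directly; a minimal NumPy sketch with illustrative predictions (the eps clipping is a common numerical safeguard, not part of the derivation):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy cost: -y*log(h) - (1-y)*log(1-h), averaged over examples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-y_true * np.log(y_pred)
                         - (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])   # illustrative sigmoid outputs
print(round(log_loss(y_true, y_pred), 4))
```

Confident correct predictions drive the loss toward 0; confident wrong predictions drive it toward infinity, which is exactly the penalty structure the two-case definition describes.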
4. DERIVE THE COST FUNCTION WITH THE REQUIRED STEPS.

COST FUNCTION:

The cost function summarizes how well the model is behaving. In other
words, we use the cost function to measure how close the model’s predictions are to
the actual outputs.

Derivation of Cost Function:

Before we derive our cost function, we first find the derivative of the sigmoid function, because it will be used in deriving the cost function:

σ'(z) = σ(z)(1 - σ(z))

Now, we derive the cost function with the help of the chain rule, as it allows us to calculate complex partial derivatives by breaking them down.

Step 1: Use the chain rule to break up the partial derivative of the negative log-likelihood J (equation 1):

∂J/∂θj = (∂J/∂p) × (∂p/∂z) × (∂z/∂θj)

Step 2: Find the derivative of J w.r.t. p:

∂J/∂p = -y/p + (1 - y)/(1 - p)

Step 3: Find the derivative of p w.r.t. z (the sigmoid derivative):

∂p/∂z = p(1 - p)

Step 4: Find the derivative of z = θᵀx w.r.t. θj:

∂z/∂θj = xj

Step 5: Put all the derivatives into equation 1. Hence the derivative of our cost function is:

∂J/∂θj = (p - y) xj = (hθ(x) - y) xj

Now that we have the derivative of the cost function, we can write our gradient descent update as:

θj := θj - α (hθ(x) - y) xj

If the slope is negative (downward slope), gradient descent will add some value to the parameter, directing it toward the minimum point of the convex curve. If the slope is positive (upward slope), gradient descent will subtract some value to direct it toward the minimum point.
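The standard logistic-regression gradient the steps above arrive at, (hθ(x) − y)·xj, can be sanity-checked against a finite-difference approximation of the cost. A sketch with synthetic data (dimensions and seed are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    """Average log loss over the dataset."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Closed-form gradient from the derivation: X^T (h - y) / m."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 examples, 3 features
y = (rng.random(20) < 0.5).astype(float)     # random binary labels
theta = rng.normal(size=3)

# Finite-difference check of the first gradient component
eps = 1e-6
t_plus, t_minus = theta.copy(), theta.copy()
t_plus[0] += eps
t_minus[0] -= eps
numeric = (cost(t_plus, X, y) - cost(t_minus, X, y)) / (2 * eps)
print(abs(numeric - gradient(theta, X, y)[0]) < 1e-6)   # True
```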

5. EXPLAIN THE GRADIENT DESCENT ALGORITHM IN LOGISTIC REGRESSION.

We have successfully calculated our Cost Function. But we need to minimize the
loss to make a good predicting algorithm. To do that, we have the Gradient Descent
Algorithm.
Here we have plotted a graph between J(θ) and θ. Our objective is to find the deepest point (global minimum) of this function; the deepest point is where J(θ) is minimum.

Two things are required to find the deepest point:

 Derivative - to find the direction of the next step.
 α (learning rate) - the magnitude of the next step.

The idea is to first select any random point on the function. Then compute the derivative of J(θ) w.r.t. θ; this points in the direction of the local minimum. Now multiply that resultant gradient by the learning rate. The learning rate has no fixed value and is decided based on the problem.

Now, subtract the result from θ to get the new θ. This update of θ should be done simultaneously for every θ(i).

Do these steps repeatedly until you reach the local or global minimum. By reaching the global minimum, you have achieved the lowest possible loss in your prediction.

Taking derivatives is simple; the basic calculus from high school is enough. The major issue is with the learning rate (α): choosing a good learning rate is important and often difficult.

If you take a very small learning rate, each step will be too small, and you will take a lot of time to reach the minimum. If you take a huge learning rate, you will overshoot the minimum and never converge. There is no specific rule for the perfect learning rate; you need to tweak it to prepare the best model.

The equation for gradient descent is:

Repeat until convergence:
  θj := θj - α (1/m) Σi (hθ(x(i)) - y(i)) x(i)j

So we can summarize the gradient descent algorithm as:

1. Start with random θ
2. Loop until convergence:
   1. Compute the gradient
   2. Update θ
3. Return θ
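The summary above can be sketched in NumPy; the synthetic data, learning rate, and iteration count are illustrative choices, not prescriptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_theta = np.array([1.5, -2.0])
y = (sigmoid(X @ true_theta) > rng.random(200)).astype(float)

theta = np.zeros(2)                  # 1. start with an initial theta
alpha = 0.1                          # learning rate, tuned by hand
for _ in range(2000):                # 2. loop until (approximate) convergence
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # compute gradient
    theta -= alpha * grad                            # update theta
print(theta)                         # 3. return: close to the generating coefficients
```

Every iteration touches all 200 examples, which is exactly the cost that motivates the stochastic variant discussed next.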
Stochastic Gradient Descent Algorithm
Now, the gradient descent algorithm is a fine algorithm for minimizing the cost function, especially for small to medium data. But when we need to deal with bigger datasets, the gradient descent algorithm turns out to be slow. The reason is simple: it needs to compute the gradient, and update values simultaneously for every parameter, over every training example.

So think about all those calculations! It’s massive, and hence there was a need for a
slightly modified Gradient Descent Algorithm, namely – Stochastic Gradient Descent
Algorithm (SGD).

The only difference SGD has from normal gradient descent is that, in SGD, we don't process the entire training set at a single time. Instead, we compute the gradient of the cost function for just a single random example at each iteration.

Now, doing so brings down the time taken for computations by a huge margin especially for
large datasets. The path taken by SGD is very haphazard and noisy (although a noisy path may
give us a chance to reach global minima).
But that is okay, since we do not have to worry about the path taken.

We only need to reach minimal loss at a faster time.

So we can summarize the stochastic gradient descent algorithm as:

1. Loop until convergence:
   1. Pick a single data point i
   2. Compute the gradient over that single point
   3. Update θ
2. Return θ
Mini-Batch Gradient Descent Algorithm
Mini-Batch Gradient Descent is another slight modification of the Gradient Descent
Algorithm. It is somewhat in between Normal Gradient Descent and Stochastic Gradient
Descent.

Mini-Batch Gradient Descent is just taking a smaller batch of the entire dataset, and then
minimizing the loss on it.

This process is more efficient than both of the above gradient descent algorithms. The batch size can of course be anything you want, but researchers have shown that it is better to keep it between 1 and 100, with 32 being a commonly used value; hence batch size = 32 is the default in many frameworks.

1. Loop until convergence:
   1. Pick a batch of b data points
   2. Compute the gradient over that batch
   3. Update θ
2. Return θ
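The mini-batch loop can be sketched the same way; each update uses a random batch of b points (data and hyperparameters below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (sigmoid(X @ np.array([1.0, -1.0])) > rng.random(1000)).astype(float)

theta = np.zeros(2)
alpha, batch_size = 0.1, 32          # 32 is a commonly used default batch size
for epoch in range(100):             # loop until (approximate) convergence
    idx = rng.permutation(len(y))    # shuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]        # pick a batch of b points
        grad = X[b].T @ (sigmoid(X[b] @ theta) - y[b]) / len(b)
        theta -= alpha * grad                    # update on that batch only
print(theta)
```

Each update costs only a batch's worth of computation, which is the middle ground between full-batch and single-example updates described above.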

6. HOW TO SOLVE THE PROBLEM OF OVERFITTING?

Let’s assume you want to predict the future price movements of a stock. You then decide to gather the historic daily prices of the stock for the last 10 days and plot the stock price on a scatter plot, as shown below. The chart shows that the actual stock prices are somewhat random.

To capture the stock price movements, you assess and gather data for the following features, which you know the stock price depends on:

1. Company’s profits

2. Company’s future announcements

3. Company’s dividends

4. Company’s current and future contracts size

5. Company’s M&A state

6. Company’s management information

7. Company’s current contracts

8. Company’s future contracts

9. Inflation

10. Interest Rates

11. Foreign Exchange Rates


12. Investor Sentiment

13. Company’s competitors

Once the data is gathered, cleaned, scaled and transformed, you split the data into
training and test data sets. Furthermore, you feed the training data into your machine learning
model to train it.

Once the model is trained, you decide to test the accuracy of your model by passing in the test data set.

What Do We Expect To See?

The chart above shows that the actual stock prices are random. However, the predicted stock price is a smooth curve: it has not fit itself too closely to the training set and is therefore capable of generalising to unseen data better.

However, let’s assume you plot actual vs. predicted stock prices and you see the following charts:

1. A Straight Line To Show The Predicted Price


What Does It Show?

This means that the algorithm has a very strong pre-conception of the data, implying high bias. This is known as under-fitting. These models are not good for predicting new data.

2. A Very Strong Closely Fitted Line

What Does It Show?

This is the other extreme. It might look as if the model is doing a great job at predicting the stock price. However, this is known as over-fitting. It is also known as high variance, because the model has learned the training data so well that it cannot generalise to make predictions on new and unseen data. These models are not good for predicting new data: if we feed the model new data, its accuracy will end up being extremely poor. It can also indicate that we are not training our model with enough data.
Overfitting is when your model has over-trained itself on the data that is fed to train it. It could be because there are way too many features in the data or because we have not supplied enough data. It shows up when the difference between the actual and predicted values on the training data is close to 0.

How Do I Detect Over-fitting?

The models that have been over-fit on the training data do not generalize well to new
examples. They are not good at predicting unseen data.

This implies that they are extremely accurate during training and yield very poor
results during the prediction of unseen data.

If a measure of error such as the mean squared error is substantially lower during training of the model, and the accuracy then deteriorates on the test data set, it implies that your model is over-fitting the data.

How Do We Resolve Overfitting?

We can randomly remove the features and assess the accuracy of the algorithm
iteratively but it is a very tedious and slow process. There are essentially four common ways to
reduce over-fitting.
1. Reduce Features:

The most obvious option is to reduce the features. You can compute the correlation
matrix of the features and reduce the features that are highly correlated with each other:
# assumes `dataframe` is a pandas DataFrame holding the features
import matplotlib.pyplot as plt
plt.matshow(dataframe.corr())  # heatmap of pairwise feature correlations
plt.show()

2. Model Selection Algorithms:

You can use model selection algorithms, which choose the features with
greater importance. The problem with these techniques is that we might end up losing valuable
information.

3. Feed More Data

You should aim to feed enough data to your models so that the models are trained,
tested and validated thoroughly. Aim to give 60% of the data to train the model, 20% of the data
to test and 20% of the data to validate the model.
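The 60/20/20 split described above can be sketched with scikit-learn's train_test_split applied twice; the data here is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# illustrative dataset: 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# first carve out 60% for training
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
# then split the remaining 40% in half: 20% test, 20% validation
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_test), len(X_val))  # 60 20 20
```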

4. Regularization:

The aim of regularization is to keep all of the features but impose a constraint on the
magnitude of the coefficients.

It is preferred because you do not have to lose the features by penalizing the features.
When the constraints are applied to the parameters, then the model is less prone to over-fitting as it
produces a smooth function.

Regularization parameters, known as penalty factors, are introduced; they control
the coefficients and ensure that the model is not over-training itself on the training data.
The penalty shrinks the coefficients towards smaller values, which reduces overfitting:
when the coefficients take large values, the regularization term penalizes the optimization function.
There are two common regularization techniques:

1. LASSO

Lasso is a feature selection tool and it can completely eliminate non-important features. It adds a
penalty which is the absolute value of the magnitude of the coefficients. This ensures that the features
do not end up applying a high weight to the prediction of the algorithm. As a result, some of the
weights will end up being exactly zero, which means the data of those features will not be
used in the algorithm.
from sklearn import linear_model
model = linear_model.Lasso(alpha=0.1)
model.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
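Inspecting the fitted coefficients illustrates the feature elimination described above; on this toy data, where the two features are identical, Lasso drives one weight to zero:

```python
from sklearn import linear_model

model = linear_model.Lasso(alpha=0.1)
model.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# one coefficient survives; the redundant one is shrunk to zero
print(model.coef_)
```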

2. RIDGE

Ridge adds a penalty which is the square of the magnitude of the coefficients. As a result, some of the
weights will be very close to 0 (but not exactly 0), which ends up smoothing the effect of the features.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])  # illustrative X, y, matching the Lasso example above

7. Which classifiers do we use in multiclass classification? When do we use them?

We use many algorithms such as Naïve Bayes, decision trees, SVM, random forest
classifiers, KNN, and logistic regression for classification. We will look at only a few of
them here, because our motive is to understand multiclass classification; using these few algorithms
we will try to cover almost all the relevant concepts related to multiclass classification.
Naive Bayes

Naive Bayes is a parametric algorithm which means it requires a fixed set of parameters or
assumptions to simplify the machine’s learning process. In parametric algorithms, the number of
parameters used is independent of the size of training data.

Naïve Bayes Assumption:

 It assumes that the features of a dataset are completely independent of each other. This is
generally not true in practice, which is why we also call it a ‘naïve’ algorithm.
It is a classification model based on conditional probability and uses Bayes theorem to predict the
class of unknown datasets. This model is mostly used for large datasets as it is easy to build and is
fast for both training and making predictions. Moreover, without hyperparameter tuning, it can give
you better results as compared to other algorithms.

Naïve Bayes can also be an extremely good text classifier, as it performs well on tasks such as the
spam/ham dataset.

Bayes’ theorem is stated as:

P(A|B) = P(B|A) · P(A) / P(B)

 By P (A|B), we are trying to find the probability of event A given that event B is true. It is
also known as posterior probability.
 Event B is known as evidence.
 P (A) is called the prior of A, which means it is the probability of the event before the evidence is seen.
 P (B|A) is known as conditional probability or likelihood.
Note: Naïve Bayes is a linear classifier, which might not be suitable for classes that are not linearly
separable in a dataset. Let us look at the figure below:
As can be seen in Fig.2b, Classifiers such as KNN can be used for non-linear classification
instead of Naïve Bayes classifier.
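As a minimal sketch (with made-up data, not the dataset in the figure), scikit-learn's GaussianNB can be trained and used for prediction like this:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# two illustrative, well-separated classes
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)

# the posterior probability picks the nearer class for each query point
print(clf.predict([[2, 2], [9, 9]]))  # [0 1]
```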

KNN (K-nearest neighbours)

KNN is a supervised machine learning algorithm that can be used to solve both
classification and regression problems. It is one of the simplest algorithms yet powerful one.
It does not learn a discriminative function from the training data but memorizes the training
data instead. Due to the very same reason, it is also known as a lazy algorithm.

How it works?

The K-nearest neighbour algorithm forms a majority vote between the K most
similar instances, using a distance metric between two data points to define similarity.
The most popular choice is Euclidean distance, which is written as:

d(x, y) = √( Σi (xi − yi)² )

K in KNN is the hyperparameter that can be chosen by us to get the best possible
fit for the dataset. If we keep the smallest value for K, i.e. K=1, then the model will show low
bias but high variance, because our model will be overfitted in this case. Whereas a larger
value for K, let’s suppose K=10, will smoothen our decision boundary, which means
low variance but high bias. So we always go for a trade-off between bias and variance,
known as the bias-variance trade-off.
Let us understand more about it by looking at its advantages and disadvantages:

Advantages-

 KNN makes no assumptions about the distribution of classes, i.e. it is a non-parametric classifier
 It is one of the methods that can be widely used in multiclass classification
 It does not get impacted by the outliers
 This classifier is easy to use and implement

Disadvantages-

 The K value is difficult to find, as it must work well with the test data also, not only with the
training data
 It is a lazy algorithm as it does not build an explicit model
 It is computationally expensive because it measures the distance to each data point
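A minimal KNN sketch with scikit-learn, using illustrative data and K=3 (the hyperparameter discussed above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# illustrative data: two clusters, one per class
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# K=3 neighbours with the default Euclidean distance metric
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# each query point gets the majority class of its 3 nearest neighbours
print(knn.predict([[0, 0], [9, 9]]))  # [0 1]
```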
Decision Trees

As the name suggests, the decision tree is a tree-like structure of decisions made
based on some conditional statements. This is one of the most used supervised learning
methods in classification problems because of their high accuracy, stability, and easy
interpretation. They can map linear as well as non-linear relationships in a good way.
Let us look at the figure below, Fig.3, where we have used adult census income dataset with
two independent variables and one dependent variable. Our target or dependent variable is
income, which has binary classes i.e, <=50K or >50K.

We can see that the algorithm works based on some conditions, such as Age <50 and
Hours>=40, to further split into two buckets for reaching towards homogeneity. Similarly, we
can move ahead for multiclass classification problem datasets, such as Iris data.

Now a question arises in our mind: how should we decide which column to split on first, and
what is the threshold for splitting? For splitting a node and deciding the threshold, we
use entropy or the Gini index as measures of the impurity of a node. We aim to maximize the purity
or homogeneity on each split, as we saw in the above diagram.
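The two impurity measures mentioned above can be computed directly from a node's class proportions; this small sketch shows that a pure node has zero impurity while a 50/50 node is maximally impure:

```python
import numpy as np

def gini(p):
    """Gini impurity of a node, given its class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy (base 2) of a node, given its class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]))     # 0.0  (pure node)
print(gini([0.5, 0.5]))     # 0.5  (maximally impure for two classes)
print(entropy([0.5, 0.5]))  # 1.0
```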

8. Define dropout regularization?

“Dropout” in machine learning refers to the process of randomly ignoring certain
nodes in a layer during training. In the figure below, the neural network on the left represents a
typical neural network where all units are activated. On the right, the red units have been
dropped out of the model — the values of their weights and biases are not considered during
training.

Adapted from Srivastava, Nitish, et al., “Dropout: A Simple Way to Prevent Neural Networks
from Overfitting”. Dropout is used as a regularization technique — it prevents overfitting by
ensuring that no units are codependent (more on this later).

How Does Dropout Work?

When we apply dropout to a neural network, we’re creating a “thinned” network with
unique combinations of the units in the hidden layers being dropped randomly at different
points in time during training. Each time the gradient of our model is updated, we generate a
new thinned neural network with different units dropped based on a probability
hyperparameter p. Training a network using dropout can thus be viewed as training loads of
different thinned neural networks and merging them into one network that picks up the key
properties of each thinned network.
This process allows dropout to reduce the overfitting of models on training data.
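The mechanism can be sketched in NumPy using the common "inverted dropout" variant: each unit is kept with probability p, and the surviving activations are scaled by 1/p so their expected value is unchanged:

```python
import numpy as np

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each unit with probability 1 - p, scale survivors by 1/p."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(activations.shape) < p  # True = keep the unit
    return activations * mask / p

a = np.ones((4, 8))      # a batch of hidden-layer activations
out = dropout(a, p=0.5)

# every output is either 0.0 (dropped) or 1/p = 2.0 (kept and rescaled)
print(sorted(set(out.ravel())))
```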

This graph, taken from the paper “Dropout: A Simple Way to Prevent Neural
Networks from Overfitting” by Srivastava et al., compares the change in classification error of
models without dropout to the same models with dropout (keeping all other hyperparameters
constant). All the models have been trained on the MNIST dataset.

It is observed that the models with dropout had a lower classification error than
the same models without dropout at any given point in time. A similar trend was observed
when the models were used to train other datasets in vision, as well as speech recognition and
text analysis.

The lower error is because dropout helps prevent overfitting on the training data
by reducing the reliance of each unit in the hidden layer on other units in the hidden layers.
These diagrams, taken from the same paper, show the features learned by an autoencoder
on MNIST with one layer of 256 units without dropout (a) and the features learned by
an identical autoencoder that used a dropout of p = 0.5 (b). It can be observed in figure a that
the units don’t seem to pick up on any meaningful feature, whereas in figure b, the units seem
to have picked up on distinct edges and spots in the data provided to them.

This indicates that dropout helps break co-adaptations among units, and each unit can
act more independently when dropout regularization is used. In other words, without dropout,
the network would never be able to catch a unit A compensating for another unit B’s flaws.
With dropout, at some point unit A would be ignored and the training accuracy would decrease
as a result, exposing the inaccuracy of unit B.

15 marks

1. EXPLAIN THE WHOLE CONCEPT OF LOGISTIC REGRESSION WITH EXAMPLE PROGRAM?
Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:
Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

   y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above
equation by (1-y):

   y / (1-y) ;  0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity]; if we take the logarithm of the equation it
will become:

   log[ y / (1-y) ] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
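The final equation can be checked numerically: applying the sigmoid (logistic) function to the log-odds recovers the original probability:

```python
import math

def sigmoid(z):
    """The logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
log_odds = math.log(p / (1 - p))  # log[y / (1-y)] from the final equation

print(round(sigmoid(log_odds), 6))  # 0.8
```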

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dog", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

CLEAR EXAMPLE WITH STEPS USED IN LOGISTIC REGRESSION:



EXAMPLE WITH STEPS EXPLANATION:

Example: There is a dataset given which contains the information of various users
obtained from social networking sites. A car manufacturer has recently launched a new SUV,
and the company wants to check how many users from the dataset want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we
will use the same steps as we have done in previous topics of Regression. Below are the
steps:

o Data Pre-processing step


o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we
can use it in our code efficiently. It will be the same as we have done in Data pre-
processing topic. The code for this is given below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the
given image:

Now, we will extract the dependent and independent variables from the given dataset. Below is the
code for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at
index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:

For test set:

For training set:


In logistic regression, we will do feature scaling because we want accurate results of predictions. Here
we will only scale the independent variables because the dependent variable has only 0 and 1 values.
Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:

We have well prepared our dataset, and now we will train the dataset using the training set. For
providing training or fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.

After importing the class, we will create a classifier object and use it to fit the model to the logistic
regression. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix here to check the accuracy of the classification. To create it,
we need to import the confusion_matrix function of the sklearn library. After importing the function,
we will call it and store the result in a new variable cm. The function takes two parameters, mainly
y_true (the actual values) and y_pred (the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, we can interpret that 65+24= 89 (Correct Output) and 8+3= 11 (Incorrect Output).
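Those counts give the model's accuracy directly:

```python
# counts read off the confusion matrix above
correct = 65 + 24    # correctly classified test samples
incorrect = 8 + 3    # misclassified test samples

accuracy = correct / (correct + incorrect)
print(accuracy)  # 0.89
```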

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid, which spans from each feature's minimum value minus 1 to its maximum value
plus 1. The pixel points we have taken are of 0.01 resolution.

To create a filled contour, we have used mtp.contourf command, it will create regions of provided
colors (purple and green). In this function, we have passed the classifier.predict to show the
predicted data points predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-axis and Estimated
salary on the y-axis.
o The purple point observations are for which purchased (dependent variable) is probably 0,
i.e., users who did not purchase the SUV car.
o The green point observations are for which purchased (dependent variable) is probably 1
means user who purchased the SUV car.
o We can also estimate from the graph that the users who are younger with low salary, did not
purchase the car, whereas older users with high estimated salary purchased the car.
o But there are some purple points in the green region (buying the car) and some green points
in the purple region (not buying the car). So we can say that some younger users with a high
estimated salary purchased the car, whereas some older users with a low estimated salary did not
purchase the car.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and our goal for this
classification is to divide the users who purchased the SUV car and who did not purchase the car. So
from the output graph, we can clearly see the two regions (Purple and Green) with the observation
points. The Purple region is for those users who didn't buy the car, and Green Region is for those
users who purchased the car.

Linear Classifier:

As we can see from the graph, the classifier is a Straight line or linear in nature as we have used the
Linear model for Logistic Regression. In further topics, we will learn for non-linear Classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (test set). The code for the test set will remain the same as above, except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations are in
the purple region. So we can say it is a good prediction and model. Some of the green and purple data
points are in different regions, which can be ignored as we have already calculated this error using
the confusion matrix (11 Incorrect output).

Hence our model is pretty good and ready to make new predictions for this classification problem.
