
Machine Learning 1

Unit 2
BSc (Data Science)

*Some figures/content have been prepared from or adapted from internet sources/books.
Unit 2
1) Prior and posterior probabilities
2) Naive Bayesian algorithm
3) Laplacian correction
4) Logistic Regression: The Logistic model
5) Estimating the regression coefficients
6) Making predictions
7) Multiple logistic regression



Statistical (Bayesian) classification
• Bayesian classifiers are statistical classifiers based on Bayes' Theorem.
• Bayesian classifiers can predict class membership probabilities, i.e. the probability that a given tuple belongs to a particular class.
• They use the given values to train a model and then use this model to classify new data.
• Bayes' Theorem:

  P(c|x) = P(x|c) * P(c) / P(x)

Above,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
Prior and posterior probabilities
Prior: probability distribution representing knowledge or uncertainty about a data object before observing it.
Posterior: conditional probability distribution representing which parameter values are likely after observing the data object.



Prior Probability
A prior probability is the probability that an observation will fall into a group
before you collect the data. The prior is a probability distribution that represents
your uncertainty over θ before you have sampled any data and attempted to
estimate it – usually denoted π(θ).

Posterior Probability
A posterior probability is the probability of assigning observations to groups given
the data. The posterior is a probability distribution representing your
uncertainty over θ after you have sampled data – denoted π(θ|X). It is a
conditional distribution because it conditions on the observed data.



Example
Consider a population where the proportion of HIV-infected individuals is 0.01. Then, the prior probability that a randomly chosen subject is HIV-infected is P_prior = 0.01.

Suppose now a subject has tested positive for HIV. It is known that the specificity of the test is 95%, and the sensitivity of the test is 99%.

What is the probability that the subject is HIV-infected? In other words, what is the conditional probability that a subject is HIV-infected given that he/she has tested positive?

The following table summarizes the calculations. (For the sake of simplicity you may consider the fractions (probabilities) as proportions of the general population.)

                   Infected (0.01)          Not infected (0.99)       Total
  Test positive    0.01 x 0.99 = 0.0099     0.99 x 0.05 = 0.0495      0.0594
  Test negative    0.01 x 0.01 = 0.0001     0.99 x 0.95 = 0.9405      0.9406



Example

• Thus, the average proportion of positive tests overall is 0.0594, and the proportion of actually infected among them is 0.0099/0.0594 = 0.167 = 16.7%. So, the posterior probability (i.e. after the test has been carried out and turns out to be positive) that the subject is really HIV-infected is 0.167.
• The difference between the prior and posterior probabilities characterizes the information we have gained from the experiment or measurement. In this example the probability changed from 0.01 (prior) to 0.167 (posterior).
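
A minimal Python sketch of this calculation (the numbers match the example above; the variable names are my own):

# Bayes' rule for the HIV example: P(HIV | +) = P(+ | HIV) * P(HIV) / P(+)
prior = 0.01        # P(HIV): prevalence in the population
sensitivity = 0.99  # P(test positive | HIV)
specificity = 0.95  # P(test negative | no HIV)

# Total probability of a positive test (law of total probability)
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)   # 0.0594

# Posterior probability of infection given a positive test
posterior = sensitivity * prior / p_positive                         # ~0.167

print(f"P(+) = {p_positive:.4f}, P(HIV | +) = {posterior:.3f}")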



Naive Bayes Classifier

• Naïve Bayes is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Bayes' theorem is also known as Bayes' Rule or Bayes' law; it is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is given as:

  P(h|D) = P(D|h) * P(h) / P(D)

Where,
P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of the data (the evidence).
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of data D given that hypothesis h is true. This is known as the likelihood.
• Basic assumptions
• Use all the attributes
• Attributes are assumed to be:
• equally important: all attributes have the same relevance to the classification task.
• statistically independent (given the class value): knowledge about the value of a particular
attribute doesn't tell us anything about the value of another attribute (if the class is
known).
• Why is it called Naïve Bayes?
• The name Naïve Bayes combines two words, Naïve and Bayes, which can be described as:

• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Properties
1. The Naive Bayes algorithm is one of the popular classification machine learning algorithms; it classifies data based on computed conditional probability values.
2. It implements Bayes' theorem for the computation and uses class labels together with feature values or vectors of predictors for classification.
3. Naive Bayes is a fast algorithm for classification problems.
4. It is a good fit for real-time prediction, multi-class prediction, recommendation systems, text classification, and sentiment analysis use cases.
5. Naive Bayes classifiers can be built using Gaussian, Multinomial and Bernoulli distributions.
6. The algorithm is scalable and easy to implement for large data sets.
7. It calculates the posterior probability P(c|x) using the prior probability of the class P(c), the prior probability of the predictor P(x), and the probability of the predictor given the class, also called the likelihood P(x|c).

Steps
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first consider the dataset:
  #    Outlook    Play
  0    Rainy      Yes
  1    Sunny      Yes
  2    Overcast   Yes
  3    Overcast   Yes
  4    Sunny      No
  5    Rainy      Yes
  6    Sunny      Yes
  7    Overcast   Yes
  8    Rainy      No
  9    Sunny      No
  10   Sunny      Yes
  11   Rainy      No
  12   Overcast   Yes
  13   Overcast   Yes

Frequency / likelihood table for the weather condition:

  Weather     No            Yes            Total
  Overcast    0             5              5/14 = 0.35
  Rainy       2             2              4/14 = 0.29
  Sunny       2             3              5/14 = 0.35
  All         4/14 = 0.29   10/14 = 0.71
• Using the likelihood table for the weather condition above, apply Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.35
P(Yes) = 10/14 = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day the player can play the game.
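
A small Python sketch of the same calculation, working from the raw dataset above (pandas assumed available; variable names are my own):

import pandas as pd

# The Outlook / Play dataset from the slides
df = pd.DataFrame({
    "Outlook": ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
                "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"],
    "Play":    ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
                "Yes", "No", "No", "Yes", "No", "Yes", "Yes"],
})

p_yes   = (df["Play"] == "Yes").mean()        # 10/14 ~ 0.71
p_no    = (df["Play"] == "No").mean()         # 4/14  ~ 0.29
p_sunny = (df["Outlook"] == "Sunny").mean()   # 5/14  ~ 0.35

sunny = df[df["Outlook"] == "Sunny"]
p_sunny_given_yes = (sunny["Play"] == "Yes").sum() / (df["Play"] == "Yes").sum()  # 3/10 = 0.3
p_sunny_given_no  = (sunny["Play"] == "No").sum()  / (df["Play"] == "No").sum()   # 2/4  = 0.5

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # ~0.60
p_no_given_sunny  = p_sunny_given_no  * p_no  / p_sunny   # ~0.40

print(f"P(Yes|Sunny) = {p_yes_given_sunny:.2f}, P(No|Sunny) = {p_no_given_sunny:.2f}")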
• Types of Naïve Bayes Model:
• There are three types of Naive Bayes model, which are given below:

• Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.

  from sklearn.naive_bayes import GaussianNB

• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
[Figure: Types of the Naive Bayes model — Gaussian (continuous data assumed to follow a normal/Gaussian distribution), Multinomial (discrete data such as word frequencies, used for documents and text classification), Bernoulli (binary/Boolean attributes). The summary table below carries the same information.]
  Classifier Type            Data Type                        Common Use Cases
  Gaussian Naive Bayes       Continuous (real values)         Sensor data analysis, medical diagnostics
  Multinomial Naive Bayes    Discrete (frequencies/counts)    Text classification, document categorization
  Bernoulli Naive Bayes      Binary (Boolean features)        Spam detection, sentiment analysis
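
A brief sketch of how the three variants are instantiated in scikit-learn; the tiny arrays below are purely illustrative, not from the slides:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])   # hypothetical class labels

# Gaussian NB: continuous, real-valued features
X_cont = np.array([[1.2, 3.4], [2.1, 0.5], [1.0, 3.0], [2.5, 0.1]])
GaussianNB().fit(X_cont, y)

# Multinomial NB: discrete counts (e.g. word frequencies per document)
X_counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 5]])
MultinomialNB().fit(X_counts, y)

# Bernoulli NB: binary/Boolean features (e.g. word present or not)
X_bin = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 1, 0]])
BernoulliNB().fit(X_bin, y)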
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for binary classification.
• It performs well in multi-class prediction as compared to other algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naive Bayes:
• Naive Bayes assumes that all predictors (or features) are independent, which rarely happens in real life. This limits the applicability of the algorithm in real-world use cases.
• The algorithm faces the 'zero-frequency problem': it assigns zero probability to a categorical variable whose category in the test data set wasn't available in the training dataset. It is best to use a smoothing technique to overcome this issue.
• Its estimates can be wrong in some cases, so its probability outputs shouldn't be taken too seriously.
Applications of Naïve Bayes Classifier (e.g. spam/email filtering, text classification, sentiment analysis, recommendation systems — see the use cases listed above)
Example 1: Probabilities of weather data

New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]


• outlook = sunny [yes (2/9); no (3/5)];
• temperature = cool [yes (3/9); no (1/5)];
• humidity = high [yes (3/9); no (4/5)];
• windy = true [yes (3/9); no (3/5)];
• play = yes [(9/14)]
• play = no [(5/14)]
• New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]
• Likelihood of the two classes (play=yes; play=no):
• yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;
• no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;
• Conversion into probabilities by normalization:
• P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
• P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795

• Answer : Play=No
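
A short Python sketch of this calculation, using the likelihoods listed above (hard-coded here for clarity):

# New instance: [outlook=sunny, temp=cool, humidity=high, windy=true]
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206

# Normalize so the two posteriors sum to 1
total = likelihood_yes + likelihood_no
p_yes = likelihood_yes / total    # ~0.205
p_no  = likelihood_no / total     # ~0.795

print(f"P(yes) = {p_yes:.3f}, P(no) = {p_no:.3f}  ->  Play = {'Yes' if p_yes > p_no else 'No'}")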
Example
• Using this data, we have to identify the species of an entity with the following attributes:

• X = {Color=Green, Legs=2, Height=Tall, Smelly=No}
• To predict the class label for the above attribute set, we first calculate the probability of the species being M or H overall:
• P(Species=M) = 4/8 = 0.5
• P(Species=H) = 4/8 = 0.5
• Next, we calculate the conditional probability of each attribute value for each class label:

P(Color=White|Species=M) = 2/4 = 0.5       P(Color=White|Species=H) = 3/4 = 0.75
P(Color=Green|Species=M) = 2/4 = 0.5       P(Color=Green|Species=H) = 1/4 = 0.25
P(Legs=2|Species=M) = 1/4 = 0.25           P(Legs=2|Species=H) = 4/4 = 1
P(Legs=3|Species=M) = 3/4 = 0.75           P(Legs=3|Species=H) = 0/4 = 0
P(Height=Tall|Species=M) = 3/4 = 0.75      P(Height=Tall|Species=H) = 2/4 = 0.5
P(Height=Short|Species=M) = 1/4 = 0.25     P(Height=Short|Species=H) = 2/4 = 0.5
P(Smelly=Yes|Species=M) = 3/4 = 0.75       P(Smelly=Yes|Species=H) = 1/4 = 0.25
P(Smelly=No|Species=M) = 1/4 = 0.25        P(Smelly=No|Species=H) = 3/4 = 0.75
• Now that we have calculated the conditional probabilities, we use them to compute a score for the new attribute set under each class (these scores are unnormalized and proportional to the posterior probabilities).

• Let us consider X = {Color=Green, Legs=2, Height=Tall, Smelly=No}.

• The score for X belonging to Species M is:

• P(M|X) ∝ P(Species=M) * P(Color=Green|Species=M) * P(Legs=2|Species=M) * P(Height=Tall|Species=M) * P(Smelly=No|Species=M)
•        = 0.5 * 0.5 * 0.25 * 0.75 * 0.25
•        = 0.0117
• Similarly, the score for X belonging to Species H is calculated as:

• P(H|X) ∝ P(Species=H) * P(Color=Green|Species=H) * P(Legs=2|Species=H) * P(Height=Tall|Species=H) * P(Smelly=No|Species=H)
•        = 0.5 * 0.25 * 1 * 0.5 * 0.75
•        = 0.0468
• So, the score for X belonging to Species M is 0.0117 and that for Species H is 0.0468. Hence, we assign the entity X with attributes {Color=Green, Legs=2, Height=Tall, Smelly=No} to species H.
Example: classify the fruit X = {Yellow, Sweet, Long}.
• (The counts used below come from a fruit table with totals of 650 Mangoes, 400 Bananas and 150 Others out of 1200 fruits, of which 800 are Yellow, 850 are Sweet and 400 are Long.)
• P(A|B) = (P(B|A) * P(A)) / P(B)
• 1. Mango:
• P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)
• a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
•                      = ((350/800) * (800/1200)) / (650/1200)
• P(Yellow | Mango) = 0.53 → (1)

• b) P(Sweet | Mango) = (P(Mango | Sweet) * P(Sweet)) / P(Mango)
•                     = ((450/850) * (850/1200)) / (650/1200)
• P(Sweet | Mango) = 0.69 → (2)

• c) P(Long | Mango) = (P(Mango | Long) * P(Long)) / P(Mango)
•                    = ((0/400) * (400/1200)) / (650/1200)
• P(Long | Mango) = 0 → (3)

• Multiplying eq. (1), (2), (3) ==> P(X | Mango) = 0.53 * 0.69 * 0

• P(X | Mango) = 0
2. Banana:
• P(X | Banana) = P(Yellow | Banana) * P(Sweet | Banana) * P(Long | Banana)

• 2.a) P(Yellow | Banana) = (P(Banana | Yellow) * P(Yellow)) / P(Banana)
•                         = ((400/800) * (800/1200)) / (400/1200)
• P(Yellow | Banana) = 1 → (4)

• 2.b) P(Sweet | Banana) = (P(Banana | Sweet) * P(Sweet)) / P(Banana)
•                        = ((300/850) * (850/1200)) / (400/1200)
• P(Sweet | Banana) = 0.75 → (5)

• 2.c) P(Long | Banana) = (P(Banana | Long) * P(Long)) / P(Banana)
•                       = ((350/400) * (400/1200)) / (400/1200)
• P(Long | Banana) = 0.875 → (6)

• Multiplying eq. (4), (5), (6) ==> P(X | Banana) = 1 * 0.75 * 0.875

• P(X | Banana) = 0.6562
3. Others:
• P(X | Others) = P(Yellow | Others) * P(Sweet | Others) * P(Long | Others)

• 3.a) P(Yellow | Others) = (P(Others | Yellow) * P(Yellow)) / P(Others)
•                         = ((50/800) * (800/1200)) / (150/1200)
• P(Yellow | Others) = 0.34 → (7)

• 3.b) P(Sweet | Others) = (P(Others | Sweet) * P(Sweet)) / P(Others)
•                        = ((100/850) * (850/1200)) / (150/1200)
• P(Sweet | Others) = 0.67 → (8)

• 3.c) P(Long | Others) = (P(Others | Long) * P(Long)) / P(Others)
•                       = ((50/400) * (400/1200)) / (150/1200)
• P(Long | Others) = 0.34 → (9)

• Multiplying eq. (7), (8), (9) ==> P(X | Others) = 0.34 * 0.67 * 0.34

• P(X | Others) = 0.0774

• So finally, from P(X | Mango) = 0, P(X | Banana) = 0.6562 and P(X | Others) = 0.0774,
• we can conclude that the fruit {Yellow, Sweet, Long} is a Banana.
• The "zero-frequency problem"
• What if an attribute value doesn't occur with every class value (e.g. humidity = high for class yes)?
• The corresponding probability will be zero, for example P(humidity=high|yes) = 0.
• The a posteriori probability will then also be zero: P(yes|E) = 0 (no matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute value-class combination, i.e. use the Laplace estimator (count + 1) / (n + k), where k is the number of possible attribute values.
• Result: probabilities will never be zero! (This also stabilizes probability estimates.)
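
A tiny illustrative sketch of the remedy; the counts below are hypothetical, chosen only to show the effect of adding 1:

# Hypothetical counts: humidity=high never occurs with class "yes"
count_high_given_yes = 0   # observed count of (humidity=high, play=yes)
n_yes = 9                  # number of "yes" training examples (assumed)
k = 2                      # number of possible humidity values (high, normal)

raw_estimate = count_high_given_yes / n_yes                    # 0.0 -> wipes out the whole product
laplace_estimate = (count_high_given_yes + 1) / (n_yes + k)    # 1/11 ~ 0.09, never zero

print(raw_estimate, laplace_estimate)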
• Query review = x1 x2 x'
• Let a test sample have three words, where we assume x1 and x2 are present in the training data but x' is not. So we have likelihoods for the first two words only.
• To predict whether the review is positive or negative, we compare P(positive|review) and P(negative|review) and choose the class with the maximum probability as our prediction for the review.
• So the probability equations become:
• P(positive|review) = K * P(x1|positive) * P(x2|positive) * P(x'|positive) * P(positive)
• Similarly,
• P(negative|review) = K * P(x1|negative) * P(x2|negative) * P(x'|negative) * P(negative)
• Here K is the proportionality constant.
• In the likelihood table, the values of P(x1|positive), P(x2|positive) and P(positive) are present, but P(x'|positive) is not, since x' does not appear in our training data, so we have no value for it.
• In a bag-of-words model we count the occurrences of words. If the number of occurrences of word x' in training is 0, then P(x'|positive) = 0 and P(x'|negative) = 0, and this makes both P(positive|review) and P(negative|review) equal to 0. This is the problem of zero probability.
• So, how do we deal with this problem?
• Idea 1: Ignore the term P(x'|positive).
• Idea 2: Throw away features that appear zero times in any class.
• Idea 3: Don't throw anything away – use smoothing.
Laplace smoothing:
1. A small-sample correction, or pseudo-count, is incorporated into every probability estimate.
2. Consequently, no probability estimate will be zero.
3. This is a way of regularizing Naive Bayes; when the pseudo-count is one, it is called Laplace (add-one) smoothing.
To ensure that our probability estimates are never zero, we add α to the numerator and α*k to the denominator.

So, in the case that we don't have a particular word in our training set, the estimated probability comes out to 1 / (N + k) (with α = 1) instead of zero.
Using Laplace smoothing, we can represent P(x'|positive) as:

P(x'|positive) = (number of reviews with x' and target_outcome=positive + α) / (N + α*k)

Here, alpha (α) represents the smoothing parameter,

k represents the number of dimensions (features) in the data,

N represents the number of reviews with target_outcome=positive.
Finding the optimal α:

Here, alpha is a hyper-parameter and you have to tune it. The basic methods for tuning it are as follows:
1. Using an elbow plot: plot the performance metric against the α hyper-parameter.
2. In most cases, the best way to determine the optimal value of alpha is through a grid search over possible parameter values, using cross-validation to evaluate the performance of the model on your data.
Interpretation of changing alpha
Let's say the occurrence count of word w with y=positive in the training data is 3. Assume we have 2 features in our dataset, i.e., k=2, and N=100 (total number of positive reviews).

Case 1 - when alpha = 1:    P(w|positive) = (3 + 1) / (100 + 1*2)       = 4/102
Case 2 - when alpha = 100:  P(w|positive) = (3 + 100) / (100 + 100*2)   = 103/300
Case 3 - when alpha = 1000: P(w|positive) = (3 + 1000) / (100 + 1000*2) = 1003/2100

As alpha increases, the likelihood probability moves towards a uniform distribution (0.5). Most of the time, alpha = 1 is used to remove the problem of zero probability.
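
A quick sketch of the three cases above (the numbers are taken from the slide; the helper function name is my own):

def smoothed_likelihood(count, n, k, alpha):
    # Laplace / additive smoothing: (count + alpha) / (n + alpha * k)
    return (count + alpha) / (n + alpha * k)

count_w_positive, n_positive, k_features = 3, 100, 2

for alpha in (1, 100, 1000):
    p = smoothed_likelihood(count_w_positive, n_positive, k_features, alpha)
    print(f"alpha={alpha:5d}  P(w|positive) = {p:.4f}")
# As alpha grows, the estimate drifts towards 1/k = 0.5 (the uniform distribution).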



Interpretation of changing alpha
• Laplace smoothing is a smoothing technique that helps tackle the problem
of zero probability in the Naïve Bayes machine learning algorithm.
• Using higher alpha values will push the likelihood towards a value of 0.5,
i.e., the probability of a word equal to 0.5 for both the positive and
negative reviews.
• Since we are not getting much information from that, it is not preferable.
Therefore, it is preferred to use alpha=1.



Advantages and Disadvantages of Laplace Smoothing
The benefit of Laplace smoothing
• It guarantees that no estimate has zero probability, so the classification can be carried out properly.
The disadvantages of Laplace smoothing
• Since the counts are adjusted to give better estimates, the true (empirical) probabilities of the events are altered.
• Additionally, to raise the probability of the zero-count value, the probabilities of the other values are decreased so that the distribution still sums to one (the law of total probability is maintained).
# Gaussian Naive Bayes with scikit-learn (X_train, y_train come from the notebook below)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()          # instantiate the model
classifier.fit(X_train, y_train)   # fit on the training features and labels

https://colab.research.google.com/drive/1DX8wOrqeWyNCOBWlN9dQzJL_IccX8wiS?authuser=2
Example application: email spam detection
Logistic Regression: The Logistic model
1. Logistic regression becomes a classification technique only when a decision threshold is brought into
the picture.
2. The setting of the threshold value is a very important aspect of Logistic regression and is dependent
on the classification problem itself.
3. The decision for the value of the threshold value is majorly affected by the values of precision and
recall.
4. Ideally, we want both precision and recall to be 1, but this seldom is the case.
5. Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the
probability of a certain class or event.
6. It is used when the data is linearly separable and the outcome is binary in nature.
7. That means Logistic regression is usually used for Binary classification problems.



Logistic Function
1. Logistic regression is named for the function used at the core of the method, the logistic function.
2. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment:

   Y = 1 / (1 + e^(-x))

3. It's an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
Logistic Function
Below is a plot of the numbers between -5 and 5 transformed into the range 0 and 1 using the
logistic function.
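
A minimal sketch that reproduces such a plot (matplotlib is assumed to be available):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 101)
y = 1.0 / (1.0 + np.exp(-x))   # logistic (sigmoid) function

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("1 / (1 + e^-x)")
plt.title("Logistic (sigmoid) function on [-5, 5]")
plt.show()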



Logistic Regression

• A supervised machine learning algorithm mainly used for classification tasks.
• The goal is to predict the probability that an instance belongs to a given class.
• Instead of using a linear function directly, logistic regression passes the linear combination of the inputs through the "sigmoid function" (or "logistic function").
It's referred to as regression because it takes the output of a linear regression function as input and uses the sigmoid function to estimate the probability of the given class.

The sigmoid function maps any real value to a value between 0 and 1. In machine learning, we use the sigmoid to map predictions to probabilities.
When to use Logistic Regression?
• Logistic Regression is used when the input needs to be separated into
“two regions” by a linear boundary.
• The data points are separated using a linear line as shown:
• Based on the number of categories, Logistic regression can be classified as:
• binomial: target variable can have only 2 possible types: “0” or “1” which
may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
• multinomial: target variable can have 3 or more possible types which are
not ordered(i.e. types have no quantitative significance) like “disease A” vs
“disease B” vs “disease C”.
• ordinal: it deals with target variables with ordered categories. For example,
a test score can be categorized as:“very poor”, “poor”, “good”, “very good”.
Here, each category can be given a score like 0, 1, 2, 3.
Interpretation of Regression Coefficients
Odds of success
• Odds (𝜃) = Probability of an event happening / Probability of an event not happening
  𝜃 = p / (1 - p)
• The values of the odds range from zero to ∞, while the values of probability lie between zero and one.
• Consider the equation of a straight line:
  𝑦 = 𝛽0 + 𝛽1 𝑥
• We use the logistic function to transform the model from linear regression to logistic regression.
• To model the odds of success, the linear model is placed on the log-odds (logit) scale:
  log( p(x) / (1 - p(x)) ) = 𝛽0 + 𝛽1 𝑥
• Exponentiating both sides, we have:
  p(x) / (1 - p(x)) = e^(𝛽0 + 𝛽1 𝑥)

Let Y = e^(𝛽0 + 𝛽1 𝑥)
Then p(x) / (1 - p(x)) = Y
p(x) = Y (1 - p(x))
p(x) = Y - Y p(x)
p(x) + Y p(x) = Y
p(x) (1 + Y) = Y
p(x) = Y / (1 + Y)
• The equation of the sigmoid function is:

  σ(z) = 1 / (1 + e^(-z))

• The sigmoid function produces an S-shaped curve. It always returns a probability value between 0 and 1.
• The sigmoid function is used to convert expected values to probabilities: it maps any real number into a number between 0 and 1.
• We use the sigmoid to translate predictions into probabilities in machine learning.
# import pandas
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

# split dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]  # Features
y = pima.label          # Target variable

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)

# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[115,   8],
       [ 30,  39]])
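
As a quick follow-up (not part of the original slides), the accuracy implied by this confusion matrix can be checked directly:

# Accuracy from the confusion matrix: (TN + TP) / total
accuracy = (115 + 39) / (115 + 8 + 30 + 39)   # ~0.80
print(f"Accuracy: {accuracy:.2f}")

# Equivalently, using scikit-learn's metric on the predictions:
print(metrics.accuracy_score(y_test, y_pred))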
Example
• The dataset of pass/fail in an exam for 5 students is given in the table below. Suppose we use logistic regression as the classifier and the optimizer suggests the following model for the odds of passing the course:
• log(Odds) = −64 + 2 × hours
• 1) What is the probability of passing for a student who studied 33 hours?
• 2) At least how many hours should a student study to pass the course with a probability of more than 95%?
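
A sketch of how these two questions can be answered from the given model (the closed-form steps follow directly from the logit equation above):

import math

# Given model: log(odds) = -64 + 2 * hours
b0, b1 = -64.0, 2.0

# 1) Probability of passing after 33 hours of study
log_odds = b0 + b1 * 33                     # = 2
p_pass = 1.0 / (1.0 + math.exp(-log_odds))  # sigmoid(2) ~ 0.88
print(f"P(pass | 33 hours) = {p_pass:.2f}")

# 2) Hours needed so that P(pass) > 0.95:
#    solve b0 + b1*h > log(0.95 / 0.05)  =>  h > (log(19) - b0) / b1
hours_needed = (math.log(0.95 / 0.05) - b0) / b1   # ~33.47 hours
print(f"Hours needed for P(pass) > 95%: {hours_needed:.2f}")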
