KNN and Bayesian Methods

machine learning notes


Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Definition: Given a database D = {t1, t2, …, tn} of tuples and a set of classes C = {C1, C2, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class Cj.
Prediction
•models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
• Credit approval: classify an applicant as a good or poor credit risk
• Target marketing: build a profile of a good customer
• Medical diagnosis: develop a profile of stroke victims
• Fraud detection: determine whether a credit card purchase is fraudulent
Classification is a two-step process.
In the learning step, a classifier is built from a training data set.
The training data set contains tuples described by attributes, one of which is the class label attribute.



Example: Training Data Set (attributes and class label)

Patient Id | Sore throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
     1     |     Yes     |  Yes  |      Yes       |    Yes     |   Yes    | Strep throat
     2     |     No      |  No   |      No        |    Yes     |   Yes    | Allergy
     3     |     Yes     |  Yes  |      No        |    Yes     |   No     | Cold
     4     |     Yes     |  No   |      No        |    No      |   No     | Strep throat
     5     |     No      |  Yes  |      No        |    Yes     |   No     | Cold
     6     |     No      |  No   |      No        |    Yes     |   No     | Allergy
     7     |     No      |  No   |      Yes       |    No      |   No     | Strep throat
     8     |     Yes     |  No   |      No        |    Yes     |   Yes    | Allergy
     9     |     No      |  Yes  |      No        |    Yes     |   Yes    | Cold
    10     |     Yes     |  Yes  |      No        |    Yes     |   Yes    | Cold

Supervised learning (classification)


Since the class label is provided, this is known as supervised learning.
Typically the model is represented in the form of classification rules, decision trees, or mathematical formulae. For example:

Swollen Glands?
  Yes → Diagnosis = Strep Throat
  No  → Fever?
          Yes → Diagnosis = Cold
          No  → Diagnosis = Allergy

In the second step the model is used for classification.
First it is applied to a test data set to check its accuracy.
Then it can be used to classify future data tuples whose class label values are not known.
Model construction:
•Describing a set of predetermined classes
•Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
•The set of tuples used for model construction is training set
•The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage:
•for classifying future or unknown objects
•Estimate accuracy of the model: the known label of each test sample is compared with the classified result from the model
•Accuracy rate is the percentage of test set samples that are
correctly classified by the model
•Test set is independent of training set, otherwise over-fitting
will occur
•If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Classification: from training data to classifier
The training data (the patient table above) is input to a classification algorithm, which outputs the classifier (model), in this case the decision tree:

Swollen Glands?
  Yes → Diagnosis = Strep Throat
  No  → Fever?
          Yes → Diagnosis = Cold
          No  → Diagnosis = Allergy


Preparing data for classification
• Data cleaning
– Preprocess data in order to reduce noise and
handle missing value
Options for handling missing values:
– Ignore the tuple
– Fill in the missing value manually: tedious and often infeasible
– Fill it in automatically with:
   • a global constant, e.g., "unknown" (effectively a new class)
   • the attribute mean
   • the attribute mean of the tuple's class: smarter
   • the most probable value: inference-based
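As a brief illustration of the automatic fill-in options above, the following is a minimal sketch using pandas; the column names and values are hypothetical, not taken from the slides' data sets.

import pandas as pd
import numpy as np

# Hypothetical data with a missing Income value
df = pd.DataFrame({
    'Income': [30.0, 45.0, np.nan, 52.0],
    'Class':  ['yes', 'yes', 'yes', 'no']
})

# Fill with a global constant (acts like a new "unknown" value)
filled_const = df['Income'].fillna(-1)

# Fill with the attribute mean
filled_mean = df['Income'].fillna(df['Income'].mean())

# Fill with the attribute mean of the tuple's class (smarter)
filled_class_mean = df['Income'].fillna(
    df.groupby('Class')['Income'].transform('mean'))

print(filled_const, filled_mean, filled_class_mean, sep='\n')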
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
– Redundant attributes can often be detected by correlation analysis
– Improves classification efficiency and scalability
• Data transformation
– Generalize and/or normalize data
• Min-Max normalization
• z-score normalization
• normalization by decimal scaling
-- Data Reduction
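A small sketch of the three normalization methods listed above, using plain NumPy on hypothetical values:

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-Max normalization to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# z-score normalization
z_score = (x - x.mean()) / x.std()

# Normalization by decimal scaling: divide by 10^j, where j is the
# smallest integer such that the largest absolute scaled value is < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep='\n')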



Choosing Classification Algorithms
• Algorithm categorization
• Distance based
• Statistical
• Decision Tree Based
• Neural network
• Rule based

• Classification categorization
• Specifying boundaries-divides input space into regions
• Probabilistic- determine probability for each class and
assign tuple to the class with highest probability



Measuring Performance
• Performance of a classification algorithm is measured by evaluating the accuracy of the classification
• Computational cost: space and time requirements
• Scalability: remains efficient even for large databases
• Robustness: ability to make correct classifications in the presence of noisy data
• Overfitting problem: the classifier fits the training data exactly but may not be applicable to a broader population of data
• Interpretability: the insight provided by the classifier
Statistical-based algorithms
Straight-line regression analysis involves a response variable y and a single predictor variable x, and models y as a linear function of x:
    y = w0 + w1·x
where w0 (y-intercept) and w1 (slope) are regression coefficients.
These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line.
Let D be the training data set containing n data points (x1, y1), (x2, y2), …, (xn, yn). The regression coefficients can be estimated as

    w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²        w0 = ȳ − w1·x̄

where x̄ and ȳ are the means of the xi and yi.
Example

Yrs experience | Salary (in k)
      3        |      30
      8        |      57
      9        |      64
     13        |      72
      3        |      36
      6        |      43
     11        |      59
     21        |      90
      1        |      20
     16        |      83

x̄ = 9.1    ȳ = 55.4

w1 = [ (3 − 9.1)(30 − 55.4) + … ] / [ (3 − 9.1)² + (8 − 9.1)² + … ] = 3.5
w0 = 55.4 − (3.5)(9.1) = 23.6

y = 23.6 + 3.5x. Using this equation we can predict salary given experience.
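The coefficients above can be verified numerically; the following is a minimal sketch using NumPy on the same ten data points.

import numpy as np

x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)        # years of experience
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)  # salary in k

x_bar, y_bar = x.mean(), y.mean()   # 9.1 and 55.4
w1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
w0 = y_bar - w1 * x_bar
print(w1, w0)   # about 3.54 and 23.2 (the slide's 23.6 results from using the rounded slope 3.5)

# Predict the salary for 10 years of experience
print(w0 + w1 * 10)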



Multiple Linear Regression

It is an extension of straight-line regression analysis that involves more than one predictor variable.
It allows the response variable y to be modeled as a linear function of n predictor variables (attributes) describing a tuple X = (x1, x2, …, xn):

    y = w0 + w1·x1 + w2·x2 + … + wn·xn

The method of least squares can be extended to solve for w0, w1, etc. The equations are much more complex and are typically solved using statistical software packages.
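As noted above, the coefficients are normally obtained with a software package; a minimal scikit-learn sketch on hypothetical data with two predictor variables:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical tuples with two predictors (e.g., years of experience, years of education)
X = np.array([[3, 12], [8, 16], [9, 16], [13, 18], [6, 14], [11, 16]])
y = np.array([30, 57, 64, 72, 43, 59])

model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)   # w0 and (w1, w2)
print(model.predict([[10, 16]]))       # predicted response for a new tuple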



The linear model gets affected by the presence of noise
or outliers (extreme, exceptional values)
Nonlinear regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear regression model. For example,
      y = w0 + w1·x + w2·x² + w3·x³
  is convertible to a linear model with the new variables x2 = x², x3 = x³:
      y = w0 + w1·x + w2·x2 + w3·x3
• Some models are intractably nonlinear (e.g., sums of exponential terms)
  – it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae
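The variable substitution described above can be done mechanically; a small sketch using scikit-learn's PolynomialFeatures on hypothetical data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.1, 9.8, 29.5, 68.1, 131.0])   # roughly cubic, hypothetical values

# Create the new variables x2 = x^2 and x3 = x^3, then fit an ordinary linear model
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x)        # columns: x, x^2, x^3
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # w0 and (w1, w2, w3)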



Logistic regression
It uses a logistic curve.
The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership.
The formula for a univariate logistic curve is
    p = e^(c0 + c1·x1) / (1 + e^(c0 + c1·x1))
or equivalently
    log(p / (1 − p)) = c0 + c1·x1
Here p is the probability of being in the class.
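A small sketch of the univariate logistic curve; the coefficients c0 and c1 below are assumed values chosen only for illustration:

import numpy as np

c0, c1 = -4.0, 0.8   # assumed coefficients, for illustration only

def logistic_prob(x1):
    # p = e^(c0 + c1*x1) / (1 + e^(c0 + c1*x1)), always between 0 and 1
    z = c0 + c1 * x1
    return np.exp(z) / (1.0 + np.exp(z))

for x1 in [0, 2, 5, 8, 10]:
    p = logistic_prob(x1)
    print(x1, round(p, 3), 'class 1' if p >= 0.5 else 'class 0')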
Bayesian Classification:
It is based on Bayes’ Theorem of conditional
probability.
It is a statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
A simple Bayesian classifier, naïve Bayesian
classifier, assumes that different attribute values
are independent which simplifies computational
process
It has comparable performance with decision tree
and selected neural network classifiers



Let X be a data tuple (“evidence”): described by
values of its n attributes
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability
that the hypothesis holds given the observed data
sample X
i.e., the probability that X belongs to class C given the attribute description of X
E.g., given that X is aged 31..40 with medium income, the probability that X will buy a computer
P(H) (prior probability of H): the initial probability
E.g., the probability that X will buy a computer, regardless of age, income, …



P(H|X) (posterior probability of H): the probability of H when the attributes of X are known
P(X) (prior probability of X): the probability that the sample data falls in the observed range
  E.g., the probability that a person is in the range 31..40 with medium income (the evidence)
P(X|H) (posterior probability of X, the likelihood)
  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayes' theorem relates all these probabilities:
    P(H|X) = P(X|H) · P(H) / P(X)
    posterior = likelihood × prior / evidence



Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
Suppose there are m classes C1, C2, …, Cm.
Classification derives the maximum posterior, i.e., the maximal P(Ci|X).
This can be derived from Bayes' theorem:
    P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized.



If the class prior probabilities are not known, it can be assumed that all classes are equally likely:
    P(C1) = P(C2) = … = P(Cm)
and the problem reduces to maximizing P(X|Ci).
If the data set has many attributes, it is computationally expensive to compute P(X|Ci).
To reduce computation, the assumption of class-conditional independence is made: the attributes are conditionally independent given the class, so
    P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)



age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no



P(xk|Ci) is the number of tuples of class Ci in training set D having the value xk, divided by the number of tuples of class Ci in D.

Classes (for the buys_computer table above):
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)



• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667
= 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”)
= 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
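The same arithmetic can be reproduced directly from the conditional probabilities above; a minimal sketch in plain Python:

# Prior probabilities from the training data
p_yes = 9 / 14
p_no = 5 / 14

# P(X|Ci) under the class-conditional independence assumption
p_x_given_yes = (2/9) * (4/9) * (6/9) * (6/9)   # about 0.044
p_x_given_no  = (3/5) * (2/5) * (1/5) * (2/5)   # about 0.019

# Compare P(X|Ci) * P(Ci) for both classes
score_yes = p_x_given_yes * p_yes               # about 0.028
score_no  = p_x_given_no * p_no                 # about 0.007
print('buys_computer =', 'yes' if score_yes > score_no else 'no')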



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Sample dataset
data = {
    'Age': [22, 25, 47, 35, 26, 41, 39, 22, 30, 26],
    'Income': ['Low', 'Medium', 'High', 'Medium', 'Low',
               'High', 'High', 'Low', 'Medium', 'Low'],
    'Student': ['No', 'No', 'Yes', 'No', 'Yes',
                'Yes', 'No', 'No', 'Yes', 'Yes'],
    'Credit_Rating': ['Fair', 'Excellent', 'Fair', 'Fair', 'Fair',
                      'Excellent', 'Excellent', 'Fair', 'Fair', 'Excellent'],
    'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes',
                      'Yes', 'Yes', 'No', 'Yes', 'Yes']
}

# Create DataFrame
df = pd.DataFrame(data)



# Preprocessing: convert categorical variables to numerical
df['Income'] = df['Income'].map({'Low': 0, 'Medium': 1, 'High': 2})
df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
df['Credit_Rating'] = df['Credit_Rating'].map({'Fair': 0, 'Excellent': 1})
df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})

# Features and target variable
X = df.drop('Buys_Computer', axis=1)
y = df['Buys_Computer']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)


# Sample data for prediction
sample = pd.DataFrame({
    'Age': [30],           # Age <= 30
    'Income': [1],         # Medium
    'Student': [1],        # Yes
    'Credit_Rating': [0]   # Fair
})

# Make prediction for the sample
prediction = model.predict(sample)

# Output the prediction
result = 'Yes' if prediction[0] == 1 else 'No'
print(f"The prediction for the sample is: {result}")

Output: The prediction for the sample is: Yes



Another tuple to classify:
X = (age = 31…40, income = low, student = no, credit_rating = excellent)



Zero-probability problem
Naïve Bayesian prediction requires each conditional
prob. to be non-zero. Otherwise, the predicted prob.
will be zero irrespective of all other probabilities
Example: suppose a data set with 1000 tuples in which income = low appears 0 times, income = medium 990 times, and income = high 10 times.
Use the Laplacian correction (Laplacian estimator): add 1 to each count.
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their
“uncorrected” counterparts and the problem of zero
probability is solved
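A short sketch of the Laplacian correction for the income example above, in plain Python:

counts = {'low': 0, 'medium': 990, 'high': 10}
n = sum(counts.values())   # 1000 tuples
k = len(counts)            # 3 distinct attribute values

# Uncorrected estimates: P(income = low) would be 0
uncorrected = {v: c / n for v, c in counts.items()}

# Laplacian correction: add 1 to each count and k to the denominator
corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003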



Advantages
• Easy to implement
• Only one scan of training data is required
• Good results obtained in most of the cases
• Can easily handle missing values
Disadvantages
• Assumption: class conditional independence,
therefore loss of accuracy
• In practice, dependencies exist among variables
  E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
  Dependencies among these cannot be modeled by a naïve Bayesian classifier; Bayesian belief networks can model such dependencies.



Distance-based algorithms
Each tuple is assigned to the class to which it is most similar.
Each class is represented by a representative tuple, usually its center or centroid.
Each tuple ti is assigned to class Cj such that sim(ti, Cj) > sim(ti, Cl) for all Cl ≠ Cj.
Each tuple must be compared to the center of each class, and there is a fixed number of classes, so the complexity depends on the number of classes.
K Nearest Neighbors is a distance-based algorithm and a lazy learning algorithm: it simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
Distance-based algorithms
Similarity or distance measures may be used to
identify the alikeness of different items in the
database
The similarity between two tuples ti and tj, sim(ti, tj), in a database D is a mapping from D×D to the range [0, 1].
Characteristics of a good similarity measure
1. sim(ti, ti)=1 for all ti
2. sim(ti, tj)=0 if ti and tj are not alike at all
3. sim(ti,tj) < sim(ti, tk) if ti is more like tk than it is like tj
Dice:    sim(ti, tj) = 2·Σ tik·tjk / (Σ tik² + Σ tjk²)
Jaccard: sim(ti, tj) = Σ tik·tjk / (Σ tik² + Σ tjk² − Σ tik·tjk)
Cosine:  sim(ti, tj) = Σ tik·tjk / √(Σ tik² · Σ tjk²)
Overlap: sim(ti, tj) = Σ tik·tjk / min(Σ tik², Σ tjk²)

Distance or dissimilarity measures are often used instead of similarity measures:
Euclidean: dis(ti, tj) = √( Σ (tih − tjh)² )
Manhattan: dis(ti, tj) = Σ | tih − tjh |
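A compact sketch of these measures for two numeric tuples, using NumPy on hypothetical vectors:

import numpy as np

ti = np.array([1.0, 3.0, 5.0])
tj = np.array([2.0, 3.0, 4.0])

dot = np.dot(ti, tj)
dice    = 2 * dot / (np.sum(ti**2) + np.sum(tj**2))
jaccard = dot / (np.sum(ti**2) + np.sum(tj**2) - dot)
cosine  = dot / np.sqrt(np.sum(ti**2) * np.sum(tj**2))
overlap = dot / min(np.sum(ti**2), np.sum(tj**2))

euclidean = np.sqrt(np.sum((ti - tj) ** 2))
manhattan = np.sum(np.abs(ti - tj))
print(dice, jaccard, cosine, overlap, euclidean, manhattan)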
The k-Nearest Neighbor algorithm
• The k closest neighbors in the training set to the given tuple are determined
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• The new item is then placed in the class that contains the most items from this set of k closest items
• The value of k can be determined experimentally: starting with k = 1, a test set is used to estimate the error rate of the classifier, and the k value that gives the minimum error rate is selected
• For real-valued prediction, k-NN returns the mean value of the k nearest neighbors of the given unknown tuple
The k-Nearest Neighbor algorithm (continued)
• The distance-weighted nearest neighbor algorithm gives greater weight to closer neighbors
• Robust to noisy data because it averages over the k nearest neighbors
• The complexity is O(d), where d is the size of the training set; it can be reduced to O(log d) by storing the training set in search trees, or to O(1) by using parallelism
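A minimal k-NN sketch with scikit-learn; it assumes the numerically encoded features X and labels y from the naïve Bayes example earlier are still available:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Reuse the encoded buys_computer features X and labels y from the earlier example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# k = 3 neighbors under Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set

# Distance-weighted variant: closer neighbors get greater weight
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_weighted.fit(X_train, y_train)
print(knn_weighted.predict(X_test))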



Example data set: maternal health risk

SN Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
1 25 130 80 15 98 86 high risk
2 35 140 90 13 98 70 high risk
3 29 90 70 8 100 80 high risk
4 30 140 85 7 98 70 high risk
5 35 120 60 6.1 98 76 low risk
6 23 140 80 7.01 98 70 high risk
7 23 130 70 7.01 98 78 mid risk
8 35 85 60 11 102 86 high risk
9 32 120 90 6.9 98 70 mid risk
10 42 130 80 18 98 70 high risk
11 23 90 60 7.01 98 76 low risk
12 19 120 80 7 98 70 mid risk
13 25 110 89 7.01 98 77 low risk
14 20 120 75 7.01 100 70 mid risk
15 48 120 80 11 98 88 mid risk
16 15 120 80 7.01 98 70 low risk
17 50 140 90 15 98 90 high risk
18 25 140 100 7.01 98 80 high risk
19 30 120 80 6.9 101 76 mid risk
20 10 70 50 6.9 98 70 low risk
21 40 140 100 18 98 90 high risk
22 50 140 80 6.7 98 70 mid risk
23 21 90 65 7.5 98 76 low risk
24 18 90 60 7.5 98 70 low risk
25 21 120 80 7.5 98 76 low risk
26 16 100 70 7.2 98 80 low risk
Variable Name | Role    | Type        | Description                                                                    | Units  | Missing Values
Age           | Feature | Integer     | Age in years of the woman during pregnancy                                     | years  | no
SystolicBP    | Feature | Integer     | Upper value of blood pressure, another significant attribute during pregnancy  | mmHg   | no
DiastolicBP   | Feature | Integer     | Lower value of blood pressure, another significant attribute during pregnancy  | mmHg   | no
BS            | Feature | Integer     | Blood glucose level, expressed as a molar concentration                        | mmol/L | no
BodyTemp      | Feature | Integer     | Body temperature                                                               | F      | no
HeartRate     | Feature | Integer     | Normal resting heart rate                                                      | bpm    | no
RiskLevel     | Target  | Categorical | Predicted risk intensity level during pregnancy, considering the previous attributes | -  | no
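As a hedged sketch, k-NN could be applied to the maternal health risk data above roughly as follows; the file name maternal_health.csv is an assumption, and the column names follow the table:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load the maternal health risk data (file name assumed)
df = pd.read_csv('maternal_health.csv')

X = df[['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate']]
y = df['RiskLevel']   # high risk / mid risk / low risk

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

# Classify a new patient (values hypothetical)
new_patient = pd.DataFrame([{'Age': 28, 'SystolicBP': 125, 'DiastolicBP': 80,
                             'BS': 7.5, 'BodyTemp': 98, 'HeartRate': 76}])
print(knn.predict(new_patient))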


