Aml CS 9 PRV
Module 5
Raja Vadhana P
Assistant Professor – BITS CSIS
BITS Pilani [email protected]
Pilani Campus
Module 5: Classification Model I
• Linear Classification
• Logistic Regression
• Generative Models
• Discriminative Models
• Tree-Based Models
Supervised task of dividing objects such that each object is assigned to one of a number of mutually exclusive and exhaustive categories called classes
1. Divide the data / records into a training set and a test set
2. Derive a model for the class attribute as a function of the other important variables from the training set
3. Pass the test set through the model and compare the predicted class values against the true labels to validate accuracy (a scikit-learn sketch of these steps follows the figure below)
[Figure: model-building workflow. A training set of records (Tid, Attrib1, Attrib2, Attrib3, Class), e.g. Tid 3 (No, Small, 70K, No) and Tid 6 (No, Medium, 60K, No), is used to induce a model; the model is then applied to test-set records such as Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?) whose class is unknown]
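A minimal scikit-learn sketch of steps 1–3, assuming a generic feature matrix X and class labels y (hypothetical names):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Divide the records into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Derive a model for the class attribute from the training set
model = LogisticRegression()
model.fit(X_train, y_train)

# 3. Pass the test set through the model and validate accuracy
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))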
• Evaluation Metrics
[Figures: decision boundaries in the (x1, x2) feature space, illustrating binary classification (y = 0 vs. y = 1), multi-class classification, and multi-label classification]
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two binary targets per digit: "large" (digit >= 7) and "odd"
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]   # shape (n_samples, 2)

# KNN supports multi-label targets directly
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])                      # returns one [is_large, is_odd] prediction
Classifier Example
Smoothing Technique
Python Example
• For a case of N input parameters X = {x1, x2, …, xN} and an output variable with K classes C = {c1, c2, …, cK}
• The probability of class cj given the input vector, based on Bayes' theorem, is:
  P(cj | x1, x2, …, xN) = P(x1, x2, …, xN | cj) · P(cj) / P(x1, x2, …, xN)
• Assuming the input parameters are conditionally independent given cj:
  P(cj | x1, x2, …, xN) = P(x1 | cj) · P(x2 | cj) ⋯ P(xN | cj) · P(cj) / P(x1, x2, …, xN)
                        = P(cj) · ∏i P(xi | cj) / P(x1, x2, …, xN)
Example – EnjoySport

Sky    AirTemp  Humidity  Wind    Forecast  EnjoySport?
Sunny  Warm     Normal    Strong  Same      Yes
Sunny  Warm     Normal    Breeze  Same      Yes
…      …        …         …       …         …
Sunny  Warm     Normal    Strong  Change    ????
Rainy  Warm     Normal    Breeze  Same      ????

• Treatment of numerical values – Idea
• Predict EnjoySport = Yes if P(Enjoy=Yes | X) > P(Enjoy=No | X)

Query 1: X = (Sunny, Warm, Normal, Strong, Change)
P(Enjoy=No | X) = P(X | Enjoy=No) · P(Enjoy=No) / P(X)
               ∝ P(X | Enjoy=No) · P(Enjoy=No)
               = P(X | Enjoy=No) · (4/7)
               = P(Sunny | Enjoy=No) · P(Warm | Enjoy=No) · P(Normal | Enjoy=No) · P(Strong | Enjoy=No) · P(Change | Enjoy=No) · (4/7)
               = (2/4) · (1/4) · (1/4) · (3/4) · (2/4) · (4/7)
               = 0.006696

Query 2: X = (Rainy, Warm, Normal, Breeze, Same)
P(Enjoy=No | X) = P(X | Enjoy=No) · P(Enjoy=No) / P(X)
               ∝ P(X | Enjoy=No) · P(Enjoy=No)
               = P(X | Enjoy=No) · (4/7)
               = P(Rainy | Enjoy=No) · P(Warm | Enjoy=No) · P(Normal | Enjoy=No) · P(Breeze | Enjoy=No) · P(Same | Enjoy=No) · (4/7)
               = (3/6) · (1/4) · (1/4) · (1/4) · (2/4) · (4/7)
               = 0.0023
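A quick arithmetic check of the second query in Python (a sketch; the conditional probabilities below are simply read off the worked example above):

# Naive Bayes score for Enjoy=No on X = (Rainy, Warm, Normal, Breeze, Same)
prior_no = 4 / 7                            # P(Enjoy=No)
likelihoods = [3/6, 1/4, 1/4, 1/4, 2/4]     # P(Rainy|No), P(Warm|No), P(Normal|No), P(Breeze|No), P(Same|No)
score_no = prior_no
for p in likelihoods:
    score_no *= p
print(score_no)                             # ~0.00223, which the slide rounds to 0.0023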
Example – Trendy news articles

Word-count representation:
Politics  Sports  Technology  Celebrity  Trendy?
10        0       0           5          Yes
1         5       5           0          No
5         6       10          0          No
0         20      0           10         Yes
(new article)                            ????

Binary (word-presence) representation:
Politics  Sports  Technology  Celebrity  Trendy?
1         0       0           1          Yes
0         1       1           0          No
1         1       1           0          No
0         1       0           1          Yes
(new article)                            ????

From the word-count representation:
P(Trendy=Yes) = 2/4
P(Trendy=No) = 2/4
P(Politics | Trendy=Yes) = 10/45
P(Politics | Trendy=No) = 6/32
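The word-count representation maps directly onto a multinomial Naive Bayes model; a sketch with scikit-learn (the new article's counts below are hypothetical):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Word-count table: columns = Politics, Sports, Technology, Celebrity
X = np.array([[10, 0, 0, 5],
              [1, 5, 5, 0],
              [5, 6, 10, 0],
              [0, 20, 0, 10]])
y = np.array(['Yes', 'No', 'No', 'Yes'])    # Trendy?

clf = MultinomialNB(alpha=1.0)              # alpha=1.0 is Laplace smoothing; alpha -> 0 recovers the raw fractions above
clf.fit(X, y)
print(clf.predict([[2, 0, 1, 3]]))          # hypothetical new article's word counts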
For the binary (word-presence) representation, the new article is scored as:

P(Trendy=No | X) = P(X | Trendy=No) · P(Trendy=No) / P(X)
                ∝ P(X | Trendy=No) · P(Trendy=No)
                = P(X | Trendy=No) · (2/4)
                = P(Politics | Trendy=No) · P(Sports | Trendy=No) · P(Technology | Trendy=No) · P(Celebrity | Trendy=No) · (2/4)
Exercise: Fit a classification model for the following data and verify the model performance metric. Use a Bayes classifier to classify the new input.
• Input: {Round, Black, Small}
• Use Laplace smoothing for empty (zero-count) attribute-class combinations
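Laplace (add-one) smoothing prevents a single unseen attribute value from zeroing out the whole product; a small sketch of the smoothed estimate (the counts used below are hypothetical):

# P(x = v | c) = (count(v in class c) + 1) / (count(class c) + V), V = number of distinct values of attribute x
def laplace_smoothed(count_value_in_class, count_class, n_values, alpha=1):
    return (count_value_in_class + alpha) / (count_class + alpha * n_values)

# A value never seen with class c (count 0) across 4 class-c rows, for an attribute with 3 possible values:
print(laplace_smoothed(0, 4, 3))            # 1/7 ~= 0.143 instead of 0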
Classifier Intuition
Numerical Classifier Example
Python Example (Please watch the uploaded Virtual Lab demo)
Discriminative model: job placement example

x1 (CGPA)   x2    Job (y)
5.5         100   1
5           105   0
8           90    1
9           105   1
6           120   0
4.5         80    0
7           90    1
8.5         95    ????
6           130   ????

[Figure: Job (y = 0 or 1) plotted against CGPA (1–10), with a decision boundary separating the two regions]
Predict y = 1 when θ0 + Σi θi xi ≥ 0
Predict y = 0 when θ0 + Σi θi xi < 0
[Figure: the same Job vs. CGPA data with a fitted discriminative model Y = θ0 + Σi θi xi; the region where θ0 + Σi θi xi ≥ 0 is predicted as y = 1]
[Figure: Job vs. CGPA with the logistic fit and its decision boundary]

Logistic (logit) form:  ln(p / (1 − p)) = θ0 + Σi θi xi

θ0 + Σi θi xi ≥ 0  ⇔  p ≥ 0.5  ⇔  CGPA ≥ 6.5
• hθ(x) = g(θ0 + θ1 x1 + θ2 x2)
  – e.g., θ0 = −3, θ1 = 1, θ2 = 1
[Figure: decision boundary separating the two classes in the (Tumor Size, Age) plane]
• Predict “𝑦 = 1” if −3 + 𝑥1 + 𝑥2 ≥ 0
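A small check of this decision rule in Python (the test points are hypothetical):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = np.array([-3.0, 1.0, 1.0])          # theta0, theta1, theta2 from the example
for x1, x2 in [(2.0, 2.0), (1.0, 1.0)]:     # hypothetical points
    z = theta @ np.array([1.0, x1, x2])     # theta0 + theta1*x1 + theta2*x2
    print(z >= 0, sigmoid(z))               # predict y = 1 exactly when z >= 0, i.e. sigmoid(z) >= 0.5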
• Training set: m examples, n features
• How to choose the parameters (feature weights)?
• Squared error on the sigmoid output gives a “non-convex” cost; the log cost below is “convex”

[Figure: per-example cost curves over hθ(x) ∈ [0, 1]: −log(hθ(x)) if y = 1 and −log(1 − hθ(x)) if y = 0]
Cost over the training set:
J(θ) = −(1/m) Σi=1..m [ y(i) · log(hθ(x(i))) + (1 − y(i)) · log(1 − hθ(x(i))) ]
Worked example – one gradient descent step, with initial weights w0 = 0.5, w1 = 0.5, w2 = 0.5:

wᵀx    h(x)   (h(x)−y)·x0   (h(x)−y)·x1   (h(x)−y)·x2
6.6    1      0             0             0
6.5    1      1             5             7
7.5    1      0             0             0
8.5    1      0             0             0
7.5    1      1             6             8
7.9    1      1             7.5           7.3

LR × mean error term (consistent with a learning rate of 0.3):   0.15    0.925    1.115
New weights:                                                     0.4     −0.4     −0.6
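A numpy sketch reproducing this single update, assuming the learning rate implied by the table is 0.3:

import numpy as np

w = np.array([0.5, 0.5, 0.5])                        # initial weights w0, w1, w2
mean_error = np.array([3, 18.5, 22.3]) / 6           # column means of the (h(x)-y)*xj terms above
alpha = 0.3                                          # assumed learning rate
print(np.round(alpha * mean_error, 3))               # [0.15  0.925 1.115]
print(np.round(w - alpha * mean_error, 3))           # [ 0.35 -0.425 -0.615], rounded on the slide to (0.4, -0.4, -0.6)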
Note:
The exponential of the regression coefficient, e^(w_cgpa), is the odds ratio associated with a one-unit increase in CGPA.
• The odds of being offered a job increase by a factor of 1.35 for every one-unit increase in CGPA
[np.exp(model.params)]
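A sketch of how that odds ratio can be read off a fitted model with statsmodels (cgpa and job are assumed to be arrays built from the table above):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(cgpa)          # intercept column plus the CGPA feature
model = sm.Logit(job, X).fit()     # `model` is the fitted results object
print(np.exp(model.params))        # exp(coefficients) = odds ratios; the slide quotes ~1.35 per CGPA unit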
• Model
  hθ(x) = P(Y = 1 | X1, X2, …, Xn) = 1 / (1 + e^(−θᵀx))
• Cost function
  J(θ) = (1/m) Σi=1..m Cost(hθ(x(i)), y(i)),  where Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0
• Learning
  Gradient descent: repeat { θj := θj − α · (1/m) Σi=1..m (hθ(x(i)) − y(i)) · xj(i) }
• Inference
  Y = hθ(x_test) = 1 / (1 + e^(−θᵀ x_test))
Note:
• σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a logistic model predicts 1 when xᵀθ is non-negative and 0 when it is negative
• logit(p) = log(p / (1 − p)) is the inverse of the logistic function: if you compute the logit of the estimated probability p, you get back t. The logit is also called the log-odds
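Putting the four pieces (model, cost, learning, inference) together, a minimal from-scratch sketch (the toy data and hyperparameters are hypothetical):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    X = np.c_[np.ones(len(X)), X]                  # prepend the bias column x0 = 1
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                     # model
        theta -= alpha * X.T @ (h - y) / len(y)    # learning: gradient of the log-loss cost
    return theta

X = np.array([[1.0], [2.0], [3.0], [4.0]])         # toy single-feature data
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(sigmoid(np.array([1.0, 2.5]) @ theta))       # inference: P(y = 1) at x = 2.5, near the boundary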
Overfitting
• Fitting the data too well
  – Features are noisy / uncorrelated to the concept

Underfitting
• Learning too little of the true concept
  – Features don't capture the concept
  – Too much bias in the model
Ways to Control Overfitting
• Regularization: add a penalty term λ Σj=1..n θj² to the cost function (n = number of weights)

Note:
The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models) but its inverse, C. The higher the value of C, the less the model is regularized.
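A sketch contrasting two values of C (X_train and y_train are assumed from the earlier examples):

from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization (C is the inverse of the usual alpha / lambda)
weak_reg = LogisticRegression(C=100.0).fit(X_train, y_train)
strong_reg = LogisticRegression(C=0.01).fit(X_train, y_train)
print(weak_reg.coef_)
print(strong_reg.coef_)            # coefficients shrink toward 0 as regularization grows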
Regularization
Ridge Regression / Tikhonov regularization
Exercise: Fit a classification model for the following data and verify the model performance metric using a Confusion Matrix on the Training Set.
Use Logistic Regression with learning rate = 0.05 to predict the Buy-Preference for the new observation (40, 60).
One-vs-All (OvA):

[Figures: a 3-class dataset in the (x1, x2) plane decomposed into three binary problems, with classifiers hθ(1)(x), hθ(2)(x), hθ(3)(x) each separating one class (Class 1, Class 2, Class 3) from the rest]

hθ(i)(x) = P(y = i | x; θ)   for i = 1, 2, 3

For an input x, predict the class i with the maximum hθ(i)(x).

Note: Scikit-Learn detects when you try to use a binary classification algorithm for a multi-class classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).
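A sketch making the OvA strategy explicit with scikit-learn (X, y assumed to be a 3-class dataset):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

ova_clf = OneVsRestClassifier(LogisticRegression())
ova_clf.fit(X, y)                  # trains one binary classifier per class
print(len(ova_clf.estimators_))    # 3 classifiers for 3 classes
print(ova_clf.predict(X[:1]))      # the class whose classifier scores highest wins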
One-vs-One (OvO): N × (N – 1) / 2 classifiers, one per pair of classes.
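And the OvO counterpart (again assuming the 3-class X, y):

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression

ovo_clf = OneVsOneClassifier(LogisticRegression())
ovo_clf.fit(X, y)
print(len(ovo_clf.estimators_))    # N*(N-1)/2 pairwise classifiers = 3 for N = 3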
Tuning the Training – e.g., Binary Class
Confusion Matrix
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn import metrics
from sklearn.metrics import confusion_matrix

logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Cross-validated predictions on the scaled training set (X_train_scaled assumed prepared earlier)
predicted = cross_val_predict(logreg, X_train_scaled, y_train, cv=3)
print(metrics.accuracy_score(y_train, predicted))
print(metrics.classification_report(y_train, predicted))
# Held-out test accuracy (fit on the training set first)
logreg.fit(X_train, y_train)
score = logreg.score(X_test, y_test)
print('Test Accuracy Score', score)
-----------------------------------------------------------------------------------------------------------
# Confusion matrix with the decision threshold raised from the default 0.5 to 0.75
probs = logreg.predict_proba(X)[:, 1]
preds = np.where(probs > 0.75, 1, 0)
confusion_matrix(y, preds)
Module 6
Decision Tree Classifiers
Ensemble Methods (start)