
ML - LAB - NLU

(Semester 2, 2024/2025)

Lab #4: Logistic Regression_kNN Notes


This lab deals with the Logistic Regression, kNN, and Decision Tree algorithms
applied to classification tasks.

==================================================================

LogisticRegression
Logistic regression predicts the probability of an outcome that can take only two
values. The prediction is based on one or several features (numerical and
categorical).

Build a model:

from sklearn.linear_model import LogisticRegression


classifier = LogisticRegression()
classifier.fit(X_train, y_train)

Test the model:

y_pred = classifier.predict(X_test)
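
Since logistic regression models class probabilities directly, the predicted probabilities can also be inspected, not just the hard class labels. A minimal sketch, assuming a binary problem and the same X_test as above:

y_proba = classifier.predict_proba(X_test)  # shape (n_samples, 2): P(class 0) and P(class 1)
print(y_proba[:5])                          # probabilities for the first five test samples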

Confusion matrix:

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test, y_pred)
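
For a binary problem, the four cells of the confusion matrix can be unpacked directly. A minimal sketch, assuming the two classes are labeled 0 and 1:

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)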

Accuracy:

from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(y_test, y_pred))

Some additional metric functions:

− metrics.confusion_matrix(y_test, y_pred[, ...]): Compute the confusion matrix to
evaluate the accuracy of a classification
− metrics.precision_score(y_test, y_pred[, ...]): Compute the precision
− metrics.recall_score(y_test, y_pred[, ...]): Compute the recall
− metrics.f1_score(y_test, y_pred[, labels, ...]): Compute the F1 score, also known as
the balanced F-score or F-measure
− metrics.accuracy_score(y_test, y_pred[, ...]): Accuracy classification score
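
A minimal sketch applying these functions to the predictions above (binary labels are assumed; multi-class data would need an average argument such as average='macro'):

from sklearn import metrics

print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall   :", metrics.recall_score(y_test, y_pred))
print("F1 score :", metrics.f1_score(y_test, y_pred))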

==================================================================

kNN algorithm (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html):

Syntax:

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform',
algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None,
n_jobs=None)

where,

n_neighbors: int, default=5. Number of neighbors to use by default for kneighbors queries.

weights:{‘uniform’, ‘distance’} or callable, default=’uniform’. Weight function used in
prediction. Possible values:
• ‘uniform’: uniform weights. All points in each neighborhood are weighted equally.
• ‘distance’: Weight points by the inverse of their distance. In this case, neighbors
closer to a query point will have a greater influence than neighbors further away.
• [callable]: a user-defined function that accepts an array of distances and returns an
array of the same shape containing the weights (see the sketch below).
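
As an illustration of the callable option, a user-defined weight function can be passed directly. This is a minimal sketch; inverse_square is a hypothetical name, not part of scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

def inverse_square(distances):
    # Must return weights with the same shape as the input distance array;
    # the small constant avoids division by zero for exact matches.
    return 1.0 / (distances ** 2 + 1e-9)

knn = KNeighborsClassifier(n_neighbors=5, weights=inverse_square)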
algorithm: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. Algorithm used to
compute the nearest neighbors:
• ‘ball_tree’ will use BallTree - a data structure used for nearest neighbor
searches in multi-dimensional space. It partitions the space into "balls" rather
than hyperrectangles. ➔ Suitable when the dimensionality (d) is medium (~10-100).
• ‘kd_tree’ will use KDTree (K-dimensional Tree) - a binary tree used for
partitioning the data space. It splits the space into hyperrectangles along
coordinate axes. Efficient for data with low dimensionality.
• ‘brute’ will use a brute-force search.
• ‘auto’ will attempt to decide the most appropriate algorithm based on the
values passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute
force.
leaf_size: int, default=30
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value
depends on the nature of the problem.
p: int, default=2
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using
manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p,
minkowski_distance (l_p) is used.
metric: str or callable, default=’minkowski’
The distance metric to use for the tree. The default metric is Minkowski, and with p=2
is equivalent to the standard Euclidean metric. For a list of available metrics, see the
documentation of DistanceMetric. If the metric is “precomputed”, X is assumed to
be a distance matrix and must be square during fit. X may be a sparse graph, in which
case only “nonzero” elements may be considered neighbors.
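
A minimal sketch combining several of the parameters above (the values chosen are illustrative, not recommendations):

from sklearn.neighbors import KNeighborsClassifier

# 7 neighbors, distance-based weighting, Manhattan distance (p=1)
knn = KNeighborsClassifier(n_neighbors=7, weights='distance', algorithm='auto', p=1)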

Usage:

from sklearn.neighbors import KNeighborsClassifier


model = KNeighborsClassifier(n_neighbors=3)
# Train the model using the training sets
model.fit(X_train,y_train)
#Predict Output
y_pred = model.predict(X_test)
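
The kNN predictions can be evaluated with the same metrics used for Logistic Regression (a minimal sketch):

from sklearn.metrics import accuracy_score, confusion_matrix

print("kNN accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))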

==================================================================

Decision Tree (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html):

Syntax:
class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best',
max_depth=None, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_features=None, random_state=None,
max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)
where,

criterion:{“gini”, “entropy”}, default=”gini”

The function to measure the quality of a split. Supported criteria are “gini” for the
Gini impurity and “entropy” for the information gain.

− Suppose attribute A has v distinct values, {a1, a2, ..., av}.
− Expected information (entropy) needed to classify a tuple in D, where p_i is the
probability that an arbitrary tuple in D belongs to class C_i:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i),  with  p_i = \frac{|C_{i,D}|}{|D|}

− Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)

− Information gain of attribute A:

  Gain(A) = Info(D) - Info_A(D)

− Gini impurity of D, where p_i is the nonzero probability that an arbitrary tuple in D
belongs to class C_i:

  Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

− A binary split on A partitions D into D1 and D2; the Gini index of D given that
partitioning is:

  Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
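
A small numeric sketch of these formulas (the class counts are hypothetical, not from any lab dataset): for a node D with 6 tuples of class C1 and 4 of class C2, p1 = 0.6 and p2 = 0.4, so Gini(D) = 1 - (0.36 + 0.16) = 0.48 and Info(D) ≈ 0.971 bits. The same computation in Python:

import math

p = [6/10, 4/10]                              # class proportions in D
gini = 1 - sum(pi**2 for pi in p)             # Gini(D) = 0.48
info = -sum(pi * math.log2(pi) for pi in p)   # Info(D) ≈ 0.971
print(gini, info)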

splitter: {“best”, “random”}, default=”best”

The strategy used to choose the split at each node. Supported strategies are “best” to
choose the best split and “random” to choose the best random split.

max_depth: int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are
pure or until all leaves contain less than min_samples_split samples.

min_samples_split: int or float, default=2

The minimum number of samples required to split an internal node:

• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a fraction and
ceil(min_samples_split * n_samples) are the minimum number of
samples for each split.

min_samples_leaf: int or float, default=1. The minimum number of samples required to be at
a leaf node.
min_weight_fraction_leaf: float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples)
required to be at a leaf node. Samples have equal weight when sample_weight is not
provided.

max_features: int, float or {“auto”, “sqrt”, “log2”}, default=None. The number of features to
consider when looking for the best split:

• If int, then consider max_features features at each split.
• If float, then max_features is a fraction and int(max_features *
n_features) features are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires effectively inspecting more than max_features
features.

max_leaf_nodes: int, default=None

Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their
relative reduction in impurity. If None, the number of leaf nodes is unlimited.

min_impurity_decrease: float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or
equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
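
A small numeric sketch of this formula, assuming equally weighted samples and hypothetical counts: N = 100 samples in total, N_t = 40 at the current node with impurity 0.5, split into N_t_L = 30 (left impurity 0.2) and N_t_R = 10 (right impurity 0.1):

decrease = 40/100 * (0.5 - 10/40 * 0.1 - 30/40 * 0.2)   # = 0.13

The node is split only if this value is greater than or equal to min_impurity_decrease.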

class_weight: dict, list of dict or “balanced”, default=None

Weights associated with classes in the form {class_label: weight}. If None, all
classes are supposed to have weight one. For multi-output problems, a list of dicts can
be provided in the same order as the columns of y.
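
A minimal sketch of the two common forms (the 5:1 weighting is purely illustrative):

from sklearn.tree import DecisionTreeClassifier

clf_balanced = DecisionTreeClassifier(class_weight='balanced')  # weights inversely proportional to class frequencies
clf_manual = DecisionTreeClassifier(class_weight={0: 1, 1: 5})  # errors on class 1 weighted five times more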

Usage:

from sklearn.tree import DecisionTreeClassifier, plot_tree


clf_model = DecisionTreeClassifier(criterion="gini", random_state=42,
                                   max_depth=3, min_samples_leaf=5)
clf_model.fit(X_train, y_train)
# Plot the fitted decision tree
plot_tree(clf_model)
# Predict X_test
y_predict = clf_model.predict(X_test)
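
The fitted tree can then be evaluated with the same metrics used in the earlier sections (a minimal sketch):

from sklearn.metrics import accuracy_score, confusion_matrix

print("Decision tree accuracy:", accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))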
