Lab 4 - Logistic Regression - KNN - Notes
(Semester 2, 2024/2025)
==================================================================
LogisticRegression
Logistic regression predicts the probability of an outcome that can take only two
values. The prediction is based on one or several features (numerical and
categorical).
Build a model:
y_pred = classifier.predict(X_test)
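The lines that create and fit classifier were lost in extraction; a minimal sketch, assuming the data has already been split into X_train and y_train (variable names are assumptions):

from sklearn.linear_model import LogisticRegression

# Build the model and fit it on the training split
classifier = LogisticRegression()
classifier.fit(X_train, y_train)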
Confusion matrix:
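The confusion-matrix snippet is also missing; a minimal sketch using y_test and y_pred from the steps above:

from sklearn.metrics import confusion_matrix

# Rows correspond to the true classes, columns to the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)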
Accuracy:
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))
==================================================================
KNeighborsClassifier
Syntax:
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
where,
n_neighbors: int, default=5. Number of neighbors to use by default for kneighbors queries.
weights:{‘uniform’, ‘distance’} or callable, default=’uniform’. Weight function used in
prediction. Possible values:
• ‘uniform’: uniform weights. All points in each neighborhood are weighted equally.
• ‘distance’: Weight points by the inverse of their distance. In this case, neighbors
closer to a query point will have a greater influence than neighbors further away.
• [callable]: a user-defined function that accepts an array of distances, and returns an
array of the same shape containing the weights.
algorithm: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. Algorithm used to
compute the nearest neighbors:
• ‘ball_tree’ will use BallTree, a data structure for nearest-neighbor searches in
multi-dimensional space. It partitions the space into "balls" rather than
hyperrectangles, which makes it suitable when the dimensionality (d) is medium
(roughly 10-100).
• ‘kd_tree’ will use KDTree (k-dimensional tree), a binary tree for partitioning the
data space. It splits the space into hyperrectangles along the coordinate axes and
is efficient for data with low dimensionality.
• ‘brute’ will use a brute-force search.
• ‘auto’ will attempt to decide the most appropriate algorithm based on the
values passed to fit method.
Note: fitting on sparse input will override the setting of this parameter, using brute
force.
leaf_size: int, default=30
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value
depends on the nature of the problem.
p: int, default=2
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using
manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p,
minkowski_distance (l_p) is used.
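For reference (not in the original notes), the Minkowski distance between two points x and y in n dimensions is

d_p(x, y) = ( \sum_{i=1}^{n} |x_i - y_i|^p )^{1/p}

so p = 1 reduces to the Manhattan (l1) distance and p = 2 to the Euclidean (l2) distance.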
metric: str or callable, default=’minkowski’
The distance metric to use for the tree. The default metric is Minkowski, and with p=2
is equivalent to the standard Euclidean metric. For a list of available metrics, see the
documentation of DistanceMetric. If the metric is “precomputed”, X is assumed to
be a distance matrix and must be square during fit. X may be a sparse graph, in which
case only “nonzero” elements may be considered neighbors.
Usage:
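The usage snippet was lost in extraction; a minimal sketch with illustrative parameter values, assuming X_train, X_test, y_train and y_test already exist:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 5 neighbors, inverse-distance weighting, Euclidean distance (illustrative choices)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='minkowski', p=2)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))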
==================================================================
DecisionTreeClassifier
Syntax:
class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best',
max_depth=None, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_features=None, random_state=None,
max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)
where,
criterion: {'gini', 'entropy'}, default='gini'. The function to measure the quality of a split.
Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
For the "entropy" criterion (information gain):

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i),  where  p_i = |C_{i,D}| / |D|

Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \cdot Info(D_j)

For the "gini" criterion:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

− p_i: the nonzero probability that an arbitrary tuple in D belongs to class C_i.
− A binary split on A partitions D into D_1 and D_2; the Gini index of D given that
partitioning is:

Gini_A(D) = (|D_1| / |D|) \cdot Gini(D_1) + (|D_2| / |D|) \cdot Gini(D_2)
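A quick worked example of both measures (hypothetical counts, not from the lab data): for a node D with 10 tuples, 6 of class C_1 and 4 of class C_2, so p_1 = 0.6 and p_2 = 0.4:

Info(D) = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) ≈ 0.971
Gini(D) = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48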
splitter: {'best', 'random'}, default='best'. The strategy used to choose the split at each node.
Supported strategies are "best" to choose the best split and "random" to choose the best random split.
max_depth: int, default=None. The maximum depth of the tree. If None, then nodes are expanded
until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_weight_fraction_leaf: float, default=0.0. The minimum weighted fraction of the sum total of
weights (of all the input samples) required to be at a leaf node. Samples have equal weight
when sample_weight is not provided.
max_features: int, float or {“auto”, “sqrt”, “log2”}, default=None. The number of features to
consider when looking for the best split:
Note: the search for a split does not stop until at least one valid partition of the node
samples is found, even if it requires effectively inspecting more than
max_features features.
max_leaf_nodes: int, default=None. Grow a tree with max_leaf_nodes in best-first fashion.
Best nodes are defined as relative reduction in impurity. If None, then an unlimited number of leaf nodes is allowed.
min_impurity_decrease: float, default=0.0. A node will be split if this split induces a decrease
of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current
node, N_t_L is the number of samples in the left child, and N_t_R is the number of
samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum if sample_weight is passed.
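For example (hypothetical numbers): with N = 100, N_t = 50, a parent impurity of 0.5, a left child of N_t_L = 30 samples with impurity 0.3 and a right child of N_t_R = 20 samples with impurity 0.2, the weighted decrease is 50/100 * (0.5 - 20/50 * 0.2 - 30/50 * 0.3) = 0.5 * 0.24 = 0.12, so the node is split only if min_impurity_decrease <= 0.12.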
class_weight: dict, list of dict or 'balanced', default=None. Weights associated with classes
in the form {class_label: weight}. If None, all classes are supposed to have weight one.
For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
Usage:
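The usage snippet was lost in extraction; a minimal sketch with illustrative parameter values, assuming X_train, X_test, y_train and y_test already exist:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Entropy criterion and a small depth limit (illustrative choices)
dt = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))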