➢ Example 3.4:
Consider the training dataset of 4 instances shown in Table 3.2. It contains details of students' performance and whether or not they received a job offer in their final semester. Apply the Find-S algorithm.
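Since Table 3.2 is not reproduced here, the sketch below applies Find-S to a hypothetical dataset of the same shape (student attributes and a job-offer label); the attribute values are illustrative assumptions, not the textbook's. Find-S keeps the most specific hypothesis consistent with the positive instances, generalizing an attribute to '?' whenever a positive instance disagrees with it.

```python
# A minimal sketch of Find-S; the dataset is a hypothetical
# stand-in for Table 3.2, not the textbook's actual values.

def find_s(training_data):
    """training_data: list of (attribute_tuple, label) pairs."""
    hypothesis = None
    for attributes, label in training_data:
        if label != "Yes":            # Find-S ignores negative instances
            continue
        if hypothesis is None:        # initialize with the first positive instance
            hypothesis = list(attributes)
        else:                         # generalize any mismatching attribute to '?'
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis

# (CGPA, Interactiveness, Practical Knowledge, Communication) -> Job Offer
data = [
    ((">=9", "Yes", "Excellent", "Good"), "Yes"),
    ((">=9", "Yes", "Good",      "Good"), "Yes"),
    (("<8",  "No",  "Average",   "Poor"), "No"),
    ((">=8", "No",  "Good",      "Good"), "Yes"),
]
print(find_s(data))  # ['?', '?', '?', 'Good']
```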
Model Performance
In this table, True Positive (TP) = the number of cancer patients who are correctly classified by the test, and True Negative (TN) = the number of normal patients (who do not have cancer) who are correctly detected.
Two errors are involved in this process: a False Positive (FP) is a false alarm, where the test shows positive although the patient has no disease, and a False Negative (FN) occurs when a patient has cancer but the test says negative or normal.
FP and FN are costly errors in this classification process.
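As a small illustration of how these four counts arise, the sketch below tallies TP, TN, FP, and FN from paired true and predicted labels; the label encoding (1 = cancer, 0 = normal) and the example vectors are assumptions for demonstration.

```python
# Tally the four contingency-table counts; 1 = cancer, 0 = normal.

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed cancer
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```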
The metrics that can be derived from this contingency table are listed below (a short computational sketch follows the list):
1. Sensitivity - The sensitivity of a test is the probability that it will produce a true positive result when used on a test dataset. It is also known as the true positive rate. The sensitivity of a test is calculated as:
Sensitivity = TP / (TP + FN)
2. Specificity - The specificity of a test is the probability that it will produce a true negative result when used on a test dataset. It is calculated as:
Specificity = TN / (TN + FP)
3. Positive Predictive Value - The positive predictive value of a test is the probability that an object is classified correctly when a positive test result is observed. It is calculated as:
PPV = TP / (TP + FP)
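The sketch below turns the three formulas above into code; the counts reuse the hypothetical tallies from the earlier confusion-matrix sketch.

```python
# A minimal sketch of the three contingency-table metrics;
# the counts are hypothetical example values.

def sensitivity(tp, fn):
    return tp / (tp + fn)  # true positive rate

def specificity(tn, fp):
    return tn / (tn + fp)  # true negative rate

def positive_predictive_value(tp, fp):
    return tp / (tp + fp)

tp, tn, fp, fn = 3, 3, 1, 1
print(sensitivity(tp, fn))                # 0.75
print(specificity(tn, fp))                # 0.75
print(positive_predictive_value(tp, fp))  # 0.75
```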
Similarity-based Learning
➢ Similarity-based learning is a supervised learning technique that predicts the class label of a test instance by gauging the similarity of that test instance to the training instances.
➢ Similarity-based learning refers to a family of instance-based learning methods used to solve both classification and regression problems.
➢ Instance-based learning makes predictions by computing distances or similarities between the test instance and a specific set of training instances local to the test instance, in an incremental process.
➢ This learning methodology improves classification performance, since it uses only a specific, local set of instances in an incremental learning task.
➢ Similarity-based classification is useful in various fields such as image processing, text classification, pattern recognition, bioinformatics, data mining, information retrieval, natural language processing, etc.
➢ A practical application of this learning is predicting daily stock index price changes. This chapter provides insight into how different similarity-based models predict the class of a new instance.
Introduction to similarity or instance-based learning
➢ Similarity-based classifiers use similarity measures to locate the nearest neighbours and classify a test instance, in contrast to other learning mechanisms such as decision trees or neural networks.
➢ Similarity-based learning is also called instance-based learning or just-in-time learning, since it does not build an abstract model of the training instances and performs lazy learning when classifying a new instance.
➢ Classification of instances is done based on a measure of similarity, in the form of distance functions over data instances.
➢ Several distance metrics are used to estimate the similarity or
dissimilarity between instances required for clustering, nearest
neighbour classification, anomaly detection, and so on.
➢ Popular distance metrics used are Hamming distance, Euclidean distance, Manhattan distance, Minkowski distance, Cosine similarity, Mahalanobis distance, Pearson's correlation (correlation similarity), Mean squared difference, Jaccard coefficient, Tanimoto coefficient, etc.; a sketch of a few of these follows.
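A minimal sketch of some of the metrics named above, implemented directly for small vectors; the example vectors are arbitrary illustrations, and the remaining metrics follow the same pattern.

```python
# Direct implementations of a few common distance/similarity measures.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # generalizes Euclidean (p = 2) and Manhattan (p = 1)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def hamming(a, b):
    # number of positions at which two equal-length sequences differ
    return sum(x != y for x, y in zip(a, b))

a, b = (7.6, 60, 8), (8.0, 55, 7)
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
print(hamming("karolin", "kathrin"))  # 3
```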
Some examples of Instance-based
learning algorithms
➢ 1. k-Nearest Neighbour (k-NN)
Example 4.2: Consider the same training dataset given in Table 4.1. Use weighted k-NN and determine the class.
Solution:
Step 1: Given the test instance (7.6, 60, 8) and the set of classes {Pass, Fail}, use the training dataset to classify the test instance using Euclidean distance and a weighting function. Assign k = 3. The distance calculation is shown in Table 4.5.
Step 2: Sort the distances in ascending order and select the first 3 nearest training instances to the test instance. The selected nearest neighbours are shown in Table 4.6.
Step 3: Predict the class of the test instance by a weighted voting technique using the 3 selected nearest instances.
➢ Compute the inverse of each distance of the 3 selected nearest instances, as shown in Table 4.7.
➢ Find the sum of the inverses.
Sum = 0.06502 + 0.09237 + 0.08294 = 0.24033
➢ Compute the weight by dividing each inverse distance by the sum, as shown in Table 4.8.
➢ Add the weights of the same class.
Fail = 0.270545 + 0.384347 = 0.654892
Pass = 0.345109
➢ Predict the class by choosing the class with the maximum vote. The class is predicted as 'Fail'.
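To tie the three steps together, here is a minimal sketch of the weighted k-NN procedure just described. Table 4.1 is not reproduced here, so the training instances below are hypothetical placeholders, chosen only so that the test instance (7.6, 60, 8) again comes out as 'Fail'.

```python
# A sketch of weighted k-NN with inverse-distance weights;
# the training data are hypothetical stand-ins for Table 4.1.
import math

def weighted_knn(train, test, k=3):
    """train: list of (feature_tuple, label) pairs; test: feature tuple."""
    # Step 1: Euclidean distance from the test instance to every training instance
    dists = [(math.dist(x, test), label) for x, label in train]
    # Step 2: sort ascending and keep the k nearest neighbours
    nearest = sorted(dists)[:k]
    # Step 3: weight each neighbour by its inverse distance, normalized by the sum
    inverses = [1.0 / d for d, _ in nearest]
    total = sum(inverses)
    votes = {}
    for (_, label), inv in zip(nearest, inverses):
        votes[label] = votes.get(label, 0.0) + inv / total
    # predict the class with the maximum weighted vote
    return max(votes, key=votes.get)

train = [
    ((6.8, 62, 6), "Fail"),
    ((8.9, 55, 9), "Pass"),
    ((7.2, 65, 7), "Fail"),
    ((9.1, 75, 8), "Pass"),
]
print(weighted_knn(train, (7.6, 60, 8)))  # 'Fail'
```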