Data Science Unit 3
Introduction
The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values
of new data points, which means that a new data point is assigned a value based on
how closely it matches the points in the training set. We can understand its working
with the help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So, during the first
step of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data
points to consider. K can be any integer.
Step 3 − For each point in the test data, do the following −
3.1 − Calculate the distance between the test data point and each row of the
training data using any of the methods, namely Euclidean, Manhattan or
Hamming distance. The most commonly used method to calculate distance
is Euclidean.
3.2 − Now, based on the distance values, sort them in ascending order.
3.3 − Next, choose the top K rows from the sorted array.
3.4 − Now, assign a class to the test point based on the most frequent
class of these rows.
Step 4 − End
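As a rough sketch of these steps in code, assuming a tiny made-up dataset, K = 3, and scikit-learn's KNeighborsClassifier:

```python
# Minimal KNN sketch following the steps above (dataset values are invented).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: load training and test data (here: a small synthetic dataset).
X = np.array([[25, 40], [30, 45], [55, 60], [60, 65], [20, 35], [65, 70]])
y = np.array([0, 0, 1, 1, 0, 1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Step 2: choose K, the number of nearest neighbours to consider.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# Step 3: for each test point, compute distances, pick the K closest training
# rows, and vote on the most frequent class (done internally by fit/predict).
knn.fit(X_train, y_train)
print(knn.predict(X_test))
```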
Example
The following is an example to understand the concept of K and the working of the
KNN algorithm −
Suppose we have a dataset which can be plotted as follows −
Now, we need to classify a new data point, shown as a black dot (at the point (60, 60)),
into the blue or red class. We assume K = 3, i.e. the algorithm finds the three nearest
data points. This is shown in the next diagram −
We can see in the above diagram the three nearest neighbors of the black-dot data
point. Among those three, two lie in the red class, hence the black dot will also be
assigned to the red class.
Pros
Cons
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, and whether that individual has characteristics similar to those of
defaulters.
KNN algorithms can also be used to find an individual’s credit rating by comparing
them with persons having similar traits.
Support vector machines (SVMs)
Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine
learning algorithms which are used for both classification and regression. Generally,
however, they are used in classification problems. SVMs were first introduced in the
1960s and were later refined in the 1990s. SVMs have their own unique way of
implementation as compared to other machine learning algorithms. Lately, they
have become extremely popular because of their ability to handle multiple continuous
and categorical variables.
Working of SVM
SVM Kernels
Linear Kernel
The linear kernel can be used as a dot product between any two observations. The
formula of the linear kernel is as below −
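In standard notation, with $x_{ij}$ denoting the $j$-th component of the training observation $x_i$:

$$K(x, x_i) = x \cdot x_i = \sum_{j} x_j \, x_{ij}$$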
From the above formula, we can see that the product between two vectors, say $x$
and $x_i$, is the sum of the multiplication of each pair of input values.
Polynomial Kernel
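It is a more generalised form of the linear kernel. One commonly written form, with polynomial degree $d$ (standard notation assumed here), is:

$$K(x, x_i) = (1 + x \cdot x_i)^{d}$$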
Pros and Cons of SVM Classifiers
SVM classifiers offer great accuracy and work well in high-dimensional spaces.
SVM classifiers basically use a subset of the training points (the support vectors),
and hence use very little memory.
They have a high training time, and hence in practice are not suitable for large
datasets. Another disadvantage is that SVM classifiers do not work well with
overlapping classes.
Decision Tree
In general, decision tree analysis is a predictive modelling tool that can be applied
across many areas. Decision trees can be constructed by an algorithmic approach
that can split the dataset in different ways based on different conditions. Decision
trees are among the most powerful algorithms that fall under the category of
supervised algorithms.
They can be used for both classification and regression tasks. The two main
entities of a tree are decision nodes, where the data is split, and leaves, where we
get the outcome. An example of a binary tree for predicting whether a person is fit
or unfit, given various information like age, eating habits and exercise habits,
is given below −
In the above decision tree, the questions are decision nodes and the final outcomes
are leaves. We have the following two types of decision trees.
Classification decision trees − In this kind of decision tree, the decision
variable is categorical. The above decision tree is an example of a
classification decision tree.
Regression decision trees − In this kind of decision tree, the decision
variable is continuous.
Splitting Criterion
The splitting criterion tells us which branches to grow from node N with respect to
the outcomes of the chosen test. More specifically, the splitting criterion indicates
the splitting attribute and may also indicate either a split-point or a splitting subset.
In a decision tree, the major challenge is the identification of the attribute for the root
node at each level. This process is known as attribute selection. We have three
popular attribute selection measures:
1. Information Gain
2. Gini Index
3. Gain Ratio
Information Gain
Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N. This attribute
minimizes the information needed to classify the tuples in the resulting partitions
and reflects the least randomness or “impurity” in these partitions.
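In the usual notation (where $p_i$ is the probability that a tuple in $D$ belongs to class $C_i$, and attribute $A$ splits $D$ into partitions $D_1, \dots, D_v$), information gain is computed as:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$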
Random Forest
We can understand the working of the Random Forest algorithm with the help of the
following steps −
Step 1 − First, start with the selection of random samples from a given
dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample.
Then it will get the prediction result from every decision tree.
Step 3 − In this step, voting will be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final
prediction result.
The following diagram will illustrate its working −
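A minimal sketch of these steps using scikit-learn's RandomForestClassifier; the iris dataset and the parameter values here are assumptions made for illustration:

```python
# Random Forest sketch: many trees on random samples, majority vote at the end.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Steps 1-2: each of the n_estimators trees is built on a random (bootstrap) sample.
# Steps 3-4: the forest aggregates the trees' votes and returns the majority class.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```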
Pros
CONFUSION MATRIX
It is the easiest way to measure the performance of a classification problem where
the output can be of two or more types of classes. A confusion matrix is nothing
but a table with two dimensions, viz. “Actual” and “Predicted”, and furthermore, both
dimensions have “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)”
and “False Negatives (FN)”, as shown below −
The explanation of the terms associated with the confusion matrix is as follows −
True Positives (TP) − It is the case when both the actual class and the predicted
class of the data point are 1.
True Negatives (TN) − It is the case when both the actual class and the predicted
class of the data point are 0.
False Positives (FP) − It is the case when the actual class of the data point is 0 and
the predicted class is 1.
False Negatives (FN) − It is the case when the actual class of the data point is 1 and
the predicted class is 0.
EXAMPLE
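A small illustrative sketch, assuming two made-up binary label vectors (these values are not from the notes):

```python
# Compute a 2x2 confusion matrix and read off TP, TN, FP, FN.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # actual classes
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # predicted classes

# Rows = actual, columns = predicted; labels=[0, 1] puts the negative class first,
# so ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted, labels=[0, 1]).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```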
Metrics for Evaluating Classifier Performance
The accuracy of a classifier on a given test set is the percentage of test set tuples
that are correctly classified by the classifier. That is,
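$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$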
The precision and recall measures are also widely used in classification. Precision
can be thought of as a measure of exactness (i.e., what percentage of tuples
labeled as positive are actually such), whereas recall is a measure of
completeness (what percentage of positive tuples are labeled as such). If recall
seems familiar, that’s because it is the same as sensitivity (or the true positive
rate). These measures can be computed as
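$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$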
An alternative way to use precision and recall is to combine them into a single
measure. This is the approach of the F measure (also known as the F1 score or
F-score).
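It is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$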
Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification
problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms, and it helps in building fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
o Some popular examples of the Naïve Bayes algorithm are spam
filtration, sentiment analysis, and classifying articles.
The Naïve Bayes algorithm is comprised of the two words Naïve and Bayes,
which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where,
o P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
o P(B|A) is the likelihood probability: the probability of the evidence given that hypothesis A is true.
o P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
o P(B) is the marginal probability: the probability of the evidence.
Example
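As a small hedged sketch of the spam-filtration use case mentioned above, assuming a tiny invented message corpus and scikit-learn's MultinomialNB:

```python
# Naive Bayes text-classification sketch (the corpus and labels are made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at noon tomorrow",
            "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

# Represent each message as word-count features, then fit the probabilistic model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB()
model.fit(X, labels)

# Predict the class of a new message from the learned probabilities.
print(model.predict(vectorizer.transform(["free prize offer"])))
```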
Patterns are everywhere in this digital world. A pattern can either be seen physically or it
can be observed mathematically by applying algorithms.
Example: the colours on clothes, speech patterns, etc. In computer science, a pattern is
represented using vector feature values.
CURSE OF DIMENSIONALITY
Handling high-dimensional data is very difficult in practice; this is commonly known as the
curse of dimensionality. If the dimensionality of the input dataset increases, any machine
learning algorithm and model becomes more complex. As the number of features increases,
the number of samples needed also increases proportionally, and the chance of overfitting
also increases. If a machine learning model is trained on such high-dimensional data, it
becomes overfitted and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
DIMENSIONALITY REDUCTION
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features.
The higher the number of features, the harder it gets to visualize the training set and then
work on it. Sometimes, most of these features are correlated, and hence redundant. This is
where dimensionality reduction algorithms come into play. Dimensionality reduction is the
process of reducing the number of random variables under consideration, by obtaining a set
of principal variables. It can be divided into feature selection and feature extraction.
PRINCIPAL COMPONENT ANALYSIS (PCA)
This method was introduced by Karl Pearson. It works on the condition that while the data in a
higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the
data in the lower-dimensional space should be maximum.
It involves the following steps:
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of
variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some data
loss in the process. But, the most important variances should be retained by the remaining
eigenvectors.
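A minimal NumPy sketch of the steps listed above, assuming a random 100 × 5 data matrix and keeping the top 2 components (both choices are illustrative):

```python
# PCA from the covariance matrix: eigen-decomposition, keep largest eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features (made-up data)
X_centered = X - X.mean(axis=0)        # centre the data first

# Construct the covariance matrix of the data.
cov = np.cov(X_centered, rowvar=False)

# Compute eigenvalues/eigenvectors (eigh: the covariance matrix is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep the eigenvectors with the largest eigenvalues (here: the top 2).
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Project the data onto the lower-dimensional space.
X_reduced = X_centered @ components
print(X_reduced.shape)   # (100, 2)
```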
Supervised Learning vs. Unsupervised Learning
− Supervised learning algorithms are trained using labeled data; unsupervised learning
algorithms are trained using unlabeled data.
− A supervised learning model takes direct feedback to check whether it is predicting the
correct output or not; an unsupervised learning model does not take any feedback.
− A supervised learning model predicts the output; an unsupervised learning model finds
the hidden patterns in data.
− In supervised learning, input data is provided to the model along with the output; in
unsupervised learning, only input data is provided to the model.
− The goal of supervised learning is to train the model so that it can predict the output
when given new data; the goal of unsupervised learning is to find the hidden patterns
and useful insights in the unknown dataset.
− Supervised learning needs supervision to train the model; unsupervised learning does
not need any supervision.
− Supervised learning can be categorized into Classification and Regression problems;
unsupervised learning can be classified into Clustering and Association problems.
− A supervised learning model produces an accurate result; an unsupervised learning
model may give a less accurate result as compared to supervised learning.
Types of ML Classification Algorithms:
o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification
PERCEPTRON
A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers
decide whether an input, usually represented by a series of vectors, belongs to a specific class.
A perceptron is a single-layer neural network consisting of four main parts: input values,
weights and bias, net sum, and an activation function.
The process begins by taking all the input values and multiplying them by their weights. Then,
all of these multiplied values are added together to create the weighted sum. The weighted sum is
then applied to the activation function, producing the perceptron's output. The activation function
plays the integral role of ensuring the output is mapped between required values such as (0,1) or
(-1,1). It is important to note that the weight of an input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
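A minimal sketch of a single perceptron's forward pass as described above, assuming made-up input values, weights and bias, and a step activation that maps the net sum to {0, 1}:

```python
# Perceptron forward pass: weighted sum + bias, then an activation function.
import numpy as np

def step_activation(net_sum: float) -> int:
    """Map the weighted sum to the required output range, here {0, 1}."""
    return 1 if net_sum >= 0 else 0

def perceptron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> int:
    # Multiply each input by its weight, add them up (the net sum),
    # shift by the bias, and apply the activation function.
    net_sum = np.dot(inputs, weights) + bias
    return step_activation(net_sum)

x = np.array([1.0, 0.5, -0.2])   # illustrative inputs
w = np.array([0.4, -0.6, 0.9])   # illustrative weights
print(perceptron(x, w, bias=0.1))
```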
LOGISTIC REGRESSION
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or
False, etc.; but instead of giving exact values of 0 and 1, it gives probabilistic values
which lie between 0 and 1.
Logistic regression is quite similar to linear regression except in how it is used.
Linear regression is used for solving regression problems, whereas logistic
regression is used for solving classification problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o In logistic regression y can be between 0 and 1 only. Starting from the linear equation
y = b0 + b1x1 + b2x2 + … + bnxn, we therefore divide the equation by (1 − y) and take the
logarithm, which gives the log-odds (logit) form shown below.
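In the usual notation this gives:

$$\log\!\left(\frac{y}{1-y}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n, \qquad \text{equivalently} \qquad y = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}$$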
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
the dependent variable, such as "low", "Medium", or "High".
BOOSTING
Boosting is a sequential process, where each subsequent model attempts to correct the
errors of the previous model. The succeeding models are dependent on the previous
model.
In this technique, learners are trained sequentially, with early learners fitting simple
models to the data and the data then being analysed for errors. In other words, we fit
consecutive trees (on random samples), and at every step the goal is to reduce the net
error from the prior tree.
When an input is misclassified by a hypothesis, its weight is increased so that the next
hypothesis is more likely to classify it correctly. Combining the whole set at the end
converts the weak learners into a better performing model.
Let’s understand the way boosting works in the below steps.
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights. (Here, the
three misclassified blue-plus points will be given higher weights.)
7. Another model is created and predictions are made on the dataset. (This model tries to
correct the errors from the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
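A hedged sketch of this boosting procedure using scikit-learn's AdaBoostClassifier (which by default fits shallow decision-tree "stumps" as the weak learners); the synthetic dataset and parameter values are assumptions:

```python
# AdaBoost sketch: weak learners are added sequentially, with misclassified
# points receiving higher weights for the next learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The final prediction is a weighted combination of all 50 weak learners.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("Test accuracy:", booster.score(X_test, y_test))
```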
K-MEANS CLUSTERING
The K-means clustering algorithm computes the centroids and iterates until it finds the
optimal centroids. It assumes that the number of clusters is already known. It is also called a
flat clustering algorithm. The number of clusters identified from the data by the algorithm is
represented by ‘K’ in K-means.
In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the
squared distances between the data points and the centroid is as small as possible. It is to be
understood that less variation within a cluster leads to more similar data points within the
same cluster.
Step 1 − First, we need to specify the number of clusters, K, that need to be generated by this
algorithm.
Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple
words, classify the data based on the number of data points.
Step 3 − Now it will compute the cluster centroids.
Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the
assignment of data points to clusters no longer changes −
4.1 − First, the sum of squared distances between the data points and the centroids is
computed.
4.2 − Assign each data point to the cluster whose centroid is closer than the other
clusters’ centroids.
4.3 − At last, compute the centroids for the clusters by taking the average of all the data
points of each cluster.
K-means follows Expectation-Maximization approach to solve the problem. The Expectation-
step is used for assigning the data points to the closest cluster and the Maximization-step is used
for computing the centroid of each cluster.
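A minimal sketch of the above procedure using scikit-learn's KMeans, assuming synthetic "blob" data and K = 3:

```python
# K-Means sketch: choose K, then alternate assignment and centroid updates.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Steps 1-2: choose K and start from (random) initial centroids/assignments.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Steps 3-4: fit() alternates assigning points to the nearest centroid
# (Expectation step) and recomputing centroids as cluster means
# (Maximization step) until the assignments stop changing.
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)
print(labels[:10])
```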
Applications of K-Means Clustering Algorithm
Market segmentation
Document Clustering
Image segmentation
Image compression
Customer segmentation
Analyzing the trend on dynamic data
EVALUATION METRICS
Root mean square error or root mean square deviation is one of the most commonly used
measures for evaluating the quality of predictions. It shows how far predictions fall from
measured true values using Euclidean distance.
To compute RMSE, calculate the residual (difference between prediction and truth) for each data
point, square each residual, compute the mean of the squared residuals, and take the square root
of that mean. RMSE is commonly used in supervised learning applications, as RMSE uses and
needs true measurements at each predicted data point.
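With $y_i$ the measured true value of data point $i$, $\hat{y}_i$ its predicted value, and $n$ the number of data points:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$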
Mean Absolute Error (also called L1 loss) is one of the simplest yet most robust loss functions
used for regression models.
MAE takes the average of the absolute differences between the actual and the predicted
values. For a data point xi and its predicted value yi, with n being the total number of data
points in the dataset, the mean absolute error is defined as:
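$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - x_i\,\right|$$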
The coefficient of determination is the square of the correlation (r), and thus it ranges from 0
to 1.
With linear regression, the coefficient of determination is equal to the square of the
correlation between the x and y variables.
If R2 is equal to 0, then the dependent variable cannot be predicted from the independent
variable.
If R2 is equal to 1, then the dependent variable can be predicted from the independent
variable without any error.
If R2 is between 0 and 1, then it indicates the extent to which the dependent variable is
predictable. An R2 of 0.10 means that 10 percent of the variance in the y variable is
predicted from the x variable; an R2 of 0.20 means that 20 percent of the variance in y is
predicted from x, and so on.
The value of R2 shows whether the model would be a good fit for the given data set.
TRAINING AND TESTING A CLASSIFIER
Training and testing is a process through which a system gets trained and becomes
adaptable to give results in an accurate manner. Learning is the most important phase, as how
well the system performs on the data provided depends on which algorithms are used on the
data. The entire dataset is divided into two categories: one which is used in training the model,
i.e. the training set, and the other that is used in testing the model after training, i.e. the
testing set.
Training set:
The training set is used to build a model. It consists of the set of examples (e.g. images) that
are used to train the system. The training rules and algorithms used give relevant information
on how to associate input data with the output decision. The system is trained by applying
these algorithms on the dataset; all the relevant information is extracted from the data and
results are obtained. Generally, 80% of the data of the dataset is taken as training data.
Testing set:
Testing data is used to test the system. It is the set of data which is used to verify whether
the system is producing the correct output after being trained or not. Generally, 20% of
the data of the dataset is used for testing.
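A hedged sketch of this 80/20 split using scikit-learn; the iris dataset and the decision-tree model are assumptions made for illustration:

```python
# Split data into training and testing sets, train on one, evaluate on the other.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 80% of the data for training, 20% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)                            # train on the training set
print("Test accuracy:", model.score(X_test, y_test))   # verify on the testing set
```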
CROSS-VALIDATION
Cross-validation is a technique in which we train our model using the subset of the data-set
and then evaluate using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows :
1. Reserve some portion of sample data-set.
2. Using the rest data-set train the model.
3. Test the model using the reserve portion of the data-set.
Methods of Cross Validation
Validation
LOOCV (Leave One Out Cross Validation)
K-Fold Cross Validation
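A hedged sketch of K-Fold cross-validation (with K = 5) using scikit-learn; the dataset and the model are illustrative assumptions:

```python
# 5-fold cross-validation: each fold is held out once for evaluation
# while the model is trained on the remaining folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```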
ROC CURVE
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as TPR = TP / (TP + FN);
the other parameter is the False Positive Rate (FPR), defined as FPR = FP / (FP + TN).
To compute the points in an ROC curve, we could evaluate a logistic regression model many
times with different classification thresholds, but this would be inefficient. Fortunately, there's
an efficient, sorting-based algorithm that can provide this information for us, called AUC.
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-
dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).
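A hedged sketch of computing the ROC curve and AUC with scikit-learn; the synthetic dataset and the logistic regression model are assumptions:

```python
# ROC curve and AUC for a binary classifier's predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]     # predicted probability of class 1

# roc_curve sweeps the classification threshold and returns FPR/TPR pairs;
# roc_auc_score gives the area under that curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(len(thresholds), "threshold points on the curve")
print("AUC:", roc_auc_score(y_test, scores))
```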
(COST FUNCTIONS: same as the evaluation metrics above)