Classification: Decision Trees, Naive Bayes, kNN

Classification is a form of data analysis that predicts categorical class labels. There is a two-step process: 1) A classification model is constructed using a learning algorithm trained on a dataset with known labels. 2) The model is used to predict labels for new data. Decision trees are a popular classification technique. They work by testing attributes at internal nodes and predicting a class label at leaf nodes. Attributes are selected using measures like information gain, which evaluate how well an attribute separates the data into pure class partitions.

Classification Techniques

Classification??

 Classification is a form of data analysis that extracts models


describing important data classes.
 Such models are called classifiers
 They predict the categorical (discrete, unordered) class labels for the
problem at hand.
 Examples
 1. Bank loan manager wants to classify loan as “Risky” or “Safe”.
 2. Identify and count the number of cars, autos, and bikes traveling on
a road in the last 6 hours.
Classification-A two step solution

Data classification is a two-step process consisting of:


 Learning/Training Step (Supervised Learning )
 Where a classification model/classifier is constructed using a suitable algorithm by
training on a data set (Training Set) whose class labels are already known.
 The class labels are discrete valued and unordered
 Classification Step
 Once the model is ready and verified, it can be fed a new data set (Test Set) to perform the
classification.
Some Mathematical terms

 Let D be the data set with n attributes A1, A2, … An


 A tuple X of D is denoted by an n-dimensional attribute vector X = (x1, x2, …, xn)
 Each tuple X is assumed to belong to a pre-defined class as determined by
another data set attribute called class label attribute.
 The learning step of classification can also be viewed as a mapping y=f(X)
that can predict the associated class label of a given tuple X
Classification Vs Prediction

 Classification models predict categorical class labels, and
 prediction models estimate continuous-valued functions.
 For example,
 we can build a classification model, to categorize bank loan
applications as either safe or risky, or a prediction model to predict
the expenditures in dollars of potential customers on computer
equipment given their income and occupation
 Note − Regression analysis technique is mostly used for
numeric prediction.
Classification and Prediction Issues

 Data Cleaning − Noise is removed by applying smoothing techniques, and missing values are
handled by replacing them with the most commonly occurring value for the given attribute.
 Relevance Analysis − The data may contain irrelevant or redundant attributes. Correlation
analysis is used to find out whether any two given attributes are related.
 Data Transformation and reduction − The data can be transformed by these two methods.
 Normalization − Normalization method scales all values for given attribute within a small
specified range.
 Generalization − The data can also be transformed by generalizing it to higher-level concepts.
Concept hierarchies can be used for this purpose.
 Data can also be reduced by some other methods such as wavelet transformation, binning,
histogram analysis, and clustering.
Comparison of Classification and Prediction Methods

 Accuracy − The accuracy of a classifier describes how well it predicts class labels correctly;
the accuracy of a predictor describes how well it estimates the value of the predicted attribute
for new data.
 Speed − This refers to the computational cost in generating and using the classifier or
predictor.
 Robustness − It refers to the capability of classifier or predictor to make correct
predictions from given noisy data.
 Scalability − Scalability refers to the ability to construct the classifier or predictor
efficiently given large amounts of data.
 Interpretability − It refers to the extent to which the classifier or predictor is understood by humans.
 Both the predictor and the classifier follow the same two-step process: first a model is constructed,
then the model is used to predict values for unknown objects. They differ in that classification
predicts a categorical class label while prediction predicts a continuous value.
Decision Tree Induction

A Decision tree is a flowchart-like tree structure (illustrated in the sketch after this list) where


 Internal node (Non-Leaf node) denotes a test on an attribute
 Branch represents outcome of the test
 Leaf node holds a class label.
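
The minimal sketch below is not part of the original slides; it assumes scikit-learn is available and uses made-up numeric data. It trains a small tree and prints it, so the internal-node tests, branches, and leaf labels described above become visible.

```python
# A minimal sketch (toy data, scikit-learn assumed available): it shows the
# flowchart-like structure of a learned tree - attribute tests at internal
# nodes, outcomes on branches, class labels at leaves.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples: [age_in_years, income_in_thousands]
X = [[25, 30], [32, 45], [47, 60], [51, 28], [62, 52], [23, 80]]
y = ["no", "yes", "yes", "no", "yes", "no"]   # class label, e.g. buys_computer

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)  # entropy-based splits
clf.fit(X, y)

# Each "|---" level is an internal-node test; the "class:" lines are leaves.
print(export_text(clf, feature_names=["age", "income"]))
print(clf.predict([[30, 55]]))  # classify a new tuple
```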
Why are decision tree classifiers so
popular?

 The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
 Decision trees can handle multidimensional data.
 Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
 The learning and classification steps of decision tree induction are
simple and fast.
 In general, decision tree classifiers have good accuracy.
 However, successful use may depend on the data at hand.
ID3, C4.5 and CART

 ID3 (Iterative Dichotomiser): During the late 1970s and early 1980s, J. Ross
Quinlan, a researcher in machine learning, developed a decision tree algorithm
known as ID3 (Iterative Dichotomiser).
 This work expanded on earlier work on concept learning systems, described by E.
B. Hunt, J. Marin, and P. T. Stone.
 Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to
which newer supervised learning algorithms are often compared.
 In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone)
published the book Classification and Regression Trees (CART), which described
the generation of binary decision trees.
Attribute Selection Measures

― An attribute selection measure is a heuristic for selecting the splitting criterion


that “best” separates a given data partition, D, of class-labeled training tuples
into individual classes.
― Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split.
― The attribute selection measure provides a ranking for each attribute describing
the given training tuples.
― The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples.
o If the splitting attribute is continuous-valued or if we are restricted to binary trees, then,
respectively, either a split point or a splitting subset must also be determined as part of the
splitting criterion.
― The tree node created for partition D is labeled with the splitting criterion,
branches are grown for each outcome of the criterion, and the tuples are
partitioned accordingly.
Attribute Selection Measures

The popular attribute selection measures—


1. Information gain,
2. Gain ratio
3. Gini index.

The notation used herein is as follows.


 Let D, the data partition, be a training set of class-labeled tuples.
 Suppose the class label attribute has m distinct values defining m distinct
classes, Ci (for i = 1, … , m).
 Let Ci,D be the set of tuples of class Ci in D.
 Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
Information Gain (I)

 ID3 uses information gain as its attribute selection measure.


 This measure is based on pioneering work by Claude Shannon on information
theory, which studied the value or “information content” of messages.
 Let node N represent or hold the tuples of partition D. The attribute with the
highest information gain is chosen as the splitting attribute for node N.
 This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or “impurity” in these
partitions.
 Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is
found.
Information Gain (II)

 The expected information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1}^{m} p_i · log2(p_i)

 where pi is the nonzero probability that an arbitrary tuple in D belongs to
class Ci and is estimated by |Ci,D|/|D|.
 A log function to the base 2 is used, because the information is encoded in
bits.
 Info(D) is just the average amount of information needed to identify the
class label of a tuple in D.
 Note that, at this point, the information we have is based solely on the
proportions of tuples of each class. Info(D) is also known as the entropy
of D.
Information Gain (III)

Now, suppose we were to partition the tuples in D on some attribute A having v


distinct values, (a1, a2, … , av), as observed from the training data.
If A is discrete-valued, these values correspond directly to the v outcomes of a test
on A. Attribute A can be used to split D into v partitions or subsets, (D1, D2, … ,
Dv), where Dj contains those tuples in D that have outcome aj of A.
These partitions would correspond to the branches grown from node N.
Ideally, we would like this partitioning to produce an exact classification of the
tuples. That is, we would like for each partition to be pure.
However, it is quite likely that the partitions will be impure (e.g., where a partition
may contain a collection of tuples from different classes rather than from a single
class).
Information Gain (IV)

 How much more information would we still need (after the partitioning) to arrive
at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

 The ratio |Dj|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information
required to classify a tuple from D based on the partitioning by A.
 The smaller the expected information (still) required, the greater the purity of the
partitions.
Information Gain (V)

 Information gain is defined as the difference between the original information


requirement (i.e., based on just the proportion of classes) and the new
requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D)

 In other words, Gain(A) tells us how much would be gained by branching on


A. It is the expected reduction in the information requirement caused by
knowing the value of A.
 The attribute A with the highest information gain, Gain(A), is chosen as the
splitting attribute at node N.
 This is equivalent to saying that we want to partition on the attribute A that
would do the “best classification,” so that the amount of information still
required to finish classifying the tuples is minimal (i.e., minimum Info_A(D)).
Worked example (buys_computer data): m = 2 output classes; D contains 9 tuples with class "yes" and 5 with class "no".

Info(D) = − Σ_{i=1}^{m} p_i · log2(p_i) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

Computing the gain of age this way yields Gain(age) = 0.246 bits. Similarly, we can compute it for
income, student and credit_rating:
 Gain(income) = 0.029 bits,
 Gain(student) = 0.151 bits,
 Gain(credit_rating) = 0.048 bits.

 Because age has the highest information gain among the attributes, it is selected as the
splitting attribute.
 Node N is labeled with age, and branches are grown for each of the attribute’s values.
 The tuples are then partitioned accordingly, as shown in Figure
 Notice that the tuples falling into the partition for age = middle_aged all belong to the
same class. Because they all belong to class "yes," a leaf should therefore be created at
the end of this branch and labeled "yes."
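
As a rough illustration of the calculation above, the following sketch computes Info(D) and Gain(age) in Python. The per-branch class counts for age (youth: 2 yes / 3 no, middle_aged: 4 yes / 0 no, senior: 3 yes / 2 no) are the standard counts for this textbook example and are assumed here rather than taken from the slides.

```python
import math

def entropy(counts):
    """Info(D) = -sum p_i * log2(p_i) over the class counts of a partition."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(class_counts, partitions):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)."""
    total = sum(class_counts)
    info_d = entropy(class_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return info_d - info_a

# Whole data set D: 9 "yes" and 5 "no" tuples.
print(round(entropy([9, 5]), 3))                    # 0.94 bits

# Assumed class counts per age branch: youth, middle_aged, senior.
age_partitions = [[2, 3], [4, 0], [3, 2]]
print(round(info_gain([9, 5], age_partitions), 3))  # ~0.247 bits (0.246 with the slides' rounding)
```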
Information Gain-Algorithm

Information Gain is based on the decrease in entropy after a dataset is split


on an attribute. Constructing a decision tree is all about finding the attribute that
returns the highest information gain (i.e., the most homogeneous branches).
 Step 1: Calculate entropy of the target.
 Step 2: The dataset is then split on the different attributes.
 Step 3: Choose attribute with the largest information gain as the
decision node.
 Step 4(a): A branch with entropy of 0 is a leaf node.
 Step 4(b): A branch with entropy more than 0 needs further splitting.
 Step 5: The ID3 algorithm is run recursively on the non-leaf branches,
until all data is classified.
Information Gain Revisited

 Info(D) = − Σ_{i=1}^{m} p_i · log2(p_i)
This is the expected information needed to classify a tuple in D.
Also known as the entropy of D.

 Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
This is the expected information still required to classify a tuple
in D based on partitioning by attribute A.

 Gain(A) = Info(D) − Info_A(D)
This is the expected reduction in the information requirement
caused by knowing the value of A, i.e., how much would be
gained by branching on A.
What about Continuous Value

 But how can we compute the information gain of an attribute
that is continuous-valued, unlike the attributes in the example?
 For example, suppose that instead of the discretized version
of age from the example, we have the raw values for this
attribute.
What about Continuous Value?

 We first sort the values of A in increasing order.


 Typically, the midpoint between each pair of adjacent values is considered as a possible split-
point. Therefore, given v values of A, v − 1 possible splits are evaluated. For example, the
midpoint between the values ai and ai+1 of A is (ai + ai+1) / 2.

 If the values of A are sorted in advance, then determining the best split for A requires only
one pass through the values.
 For each possible split-point for A, we evaluate Info_A(D), where the number of partitions is
two, that is, v = 2 (or j = 1, 2).
 The point with the minimum expected information requirement for A is selected as the split
point for A. D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set of tuples
in D satisfying A > split point.
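
A minimal sketch of this split-point search follows, using made-up raw age values and labels: sort the values, evaluate the midpoint between each adjacent pair with Info_A(D), and keep the midpoint with the minimum expected information.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_for_split(pairs, split):
    """Info_A(D) for the binary partition A <= split vs. A > split."""
    left = [label for v, label in pairs if v <= split]
    right = [label for v, label in pairs if v > split]
    total = len(pairs)

    def part_entropy(labels):
        return entropy([labels.count(c) for c in set(labels)]) if labels else 0.0

    return (len(left) / total) * part_entropy(left) + (len(right) / total) * part_entropy(right)

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    # Candidate split points: midpoints of adjacent sorted values (v - 1 of them).
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2 for i in range(len(pairs) - 1)]
    return min(candidates, key=lambda s: info_for_split(pairs, s))

# Hypothetical raw ages with class labels.
ages = [23, 25, 31, 36, 42, 47, 55, 62]
buys = ["no", "no", "yes", "yes", "yes", "yes", "no", "no"]
print(best_split_point(ages, buys))
```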
Gain Ratio (I)

 The information gain measure is biased toward tests with many outcomes.
 That is, it prefers to select attributes having a large number of values.
 For example, consider an attribute that acts as a unique identifier such as
product ID. A split on product ID would result in a large number of
partitions (as many as there are values), each one containing just one tuple.
 Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info_product_ID(D) = 0.
 Therefore, the information gained by partitioning on this attribute is
maximal. Clearly, such a partitioning is useless for classification.
Gain Ratio (II)

 C4.5, a successor of ID3, uses an extension to information gain known as gain


ratio, which attempts to overcome this bias.
 It applies a kind of normalization to information gain using a "split
information" value defined analogously with Info(D) as

SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)

This value represents the potential information generated by splitting the training
set data, D, into v partitions corresponding to the v outcomes of a test on attribute
A.
Note that, for each outcome, it considers the number of tuples having that
outcome with respect to the total number of tuples in D. It differs from
information gain, which measures the information with respect to classification
that is acquired based on the same partitioning.
Gain Ratio(III)

 The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

 The attribute with the maximum gain ratio is selected as the splitting attribute.
 Note, however, that as the split information approaches 0, the ratio becomes
unstable.
 A constraint is added to avoid this, whereby the information gain of the test
selected must be large—at least as great as the average gain over all tests
examined.
Problem: Calculate the gain ratio for the attribute income

 Calculate the gain ratio for the income attribute, using

SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Solution: income splits the 14 training tuples into partitions of size 4 (low), 6 (medium) and 4 (high).
 4/14 = 0.2857143, 6/14 = 0.4285714
 log2(4/14) = log2(4) − log2(14) = 2 − 3.807355 = −1.807355
 log2(6/14) = log2(6) − log2(14) = 2.5849625 − 3.807355 = −1.2224
 SplitInfo_income(D) ≈ 0.516 + 0.524 + 0.516 ≈ 1.557
 Since Gain(income) = 0.029 bits, GainRatio(income) = 0.029 / 1.557 ≈ 0.019
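
A small sketch of the same computation in Python, assuming the 4/6/4 branch sizes and the Gain(income) = 0.029 bits value used above.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum |D_j|/|D| * log2(|D_j|/|D|)."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s > 0)

# income splits D into partitions of 4 (low), 6 (medium) and 4 (high) tuples.
si = split_info([4, 6, 4])
gain_income = 0.029                # from the information-gain example
print(round(si, 3))                # ~1.557
print(round(gain_income / si, 3))  # GainRatio(income) ~ 0.019
```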
Gini Index (I)

 The Gini index is used in CART.


 Gini index measures the impurity of D, a data partition or set of training
tuples, as

Gini(D) = 1 − Σ_{i=1}^{m} p_i²

 where pi is the probability that a tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. The sum is computed over m classes.
Gini Index (II)

 Gini index considers only binary splits.


 For a discrete-valued attribute A having v values (a1, a2, …, av), a
binary split is considered by examining all subsets of A, excluding the empty
set and the full set of values (2^v − 2 subsets in total).
 Each subset SA can be considered as a test of the form “A∈ SA?”
 Example: Attribute income is having three discrete values (low,
medium, high). The binary splits are
 {low, medium} and {high}
 {low, high} and {medium}
 {medium,high} and {low}
Induction of Decision tree using Gini
Index

 Let D be the training data shown earlier in Table, where


 there are nine tuples belonging to the class buys_computer = yes
and
 the remaining five tuples belong to the class buys_computer=no.
 A (root) node N is created for the tuples in D.
 We first calculate the Gini index to compute the impurity of D:

Gini(D) = 1 − Σ_{i=1}^{m} p_i² = 1 − (9/14)² − (5/14)² = 0.459
 To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute.
 Let’s start with the attribute income and consider each of the possible
splitting subsets.
 Consider the subset {low, medium}. This would result in 10 tuples in
partition D1 satisfying the condition "income ∈ {low, medium}." The
remaining four tuples of D would be assigned to partition D2.
 The Gini index value computed based on this partitioning is

Gini_{income ∈ {low, medium}}(D) = (10/14) × Gini(D1) + (4/14) × Gini(D2) = 0.443
Gini Index Calculation

 Similarly, the Gini index values for the candidate binary splits on income are:
 0.458 (for the subsets {low, high} and {medium}),
 0.450 (for the subsets {medium, high} and {low}),
 0.443 (for the subsets {low, medium} and {high}).
 Therefore, the best binary split for attribute income is on {low, medium} and {high},
because it minimizes the Gini index; the reduction in impurity (ΔGini) for income is 0.459 − 0.443 = 0.016.
 Evaluating attribute age, we obtain {youth, senior} (or {middle_aged}) as the
best split for age, with a Gini index of 0.357;
 the attributes student and credit_rating are both binary, with Gini index values of 0.367
and 0.429, respectively.
 The attribute age and splitting subset {youth, senior} (or {middle_aged}) therefore give the
overall minimum Gini index, with a reduction in impurity of 0.459 − 0.357 = 0.102.
 The binary split "age ∈ {youth, senior}" results in the maximum reduction in
impurity of the tuples in D and is returned as the splitting criterion.
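
A minimal sketch of the Gini computations above. The per-value class counts for income (low: 3 yes / 1 no, medium: 4 yes / 2 no, high: 2 yes / 2 no) are the standard counts for this textbook example and are assumed here; they reproduce the 0.459 and 0.443 values quoted on the slides.

```python
def gini(counts):
    """Gini(D) = 1 - sum p_i^2 over the class counts of a partition."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_binary_split(d1_counts, d2_counts):
    """Weighted Gini index for a binary split of D into D1 and D2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return (n1 / n) * gini(d1_counts) + (n2 / n) * gini(d2_counts)

# Whole data set: 9 "yes" and 5 "no" tuples.
print(round(gini([9, 5]), 3))                       # 0.459

# income in {low, medium}: D1 = 10 tuples (7 yes, 3 no); D2 = {high} = 4 tuples (2 yes, 2 no).
print(round(gini_binary_split([7, 3], [2, 2]), 3))  # 0.443
```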
Tree Pruning

 Sometimes, when decision trees are built, many of the branches may
reflect anomalies in the training data due to noise or outliers.
 Tree pruning is a method to remove the least-reliable branches from the
decision tree.
 Two approaches:
 Pre-pruning
 Post-pruning
Pre-pruning

 A tree is pruned by halting its construction early, by deciding


not to further split or partition.
 Criteria for halting:
 1. Tree depth
 2. Number of leaves
 3. Threshold value of the attribute selection measure
 Upon halting, the current node becomes a leaf.
 Majority voting may then be used to classify this node.
Post-pruning

 The decision tree is first fully grown, and then subtrees are removed by
cutting off their branches and replacing each subtree with a leaf.
 The leaf is labelled with the most frequent class in the subtree
being replaced.
 Criteria:
 1. Cost complexity (see the sketch after this list)
 2. Error rates
 3. Number of bits required to encode the tree
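
As a hedged illustration of cost-complexity post-pruning, the sketch below uses scikit-learn's ccp_alpha parameter (minimal cost-complexity pruning) on synthetic data; the threshold value 0.02 is arbitrary.

```python
# A minimal sketch of cost-complexity post-pruning, assuming scikit-learn is
# available; larger ccp_alpha values prune more aggressively.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree replaces less reliable subtrees with leaves, so it usually has fewer nodes.
print("unpruned nodes:", unpruned.tree_.node_count)
print("pruned nodes:  ", pruned.tree_.node_count)
```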
Advantages of decision trees

1) Able to handle both numerical and categorical data.


2) Requires little data preparation. Other techniques often require data normalization. Since trees can
handle qualitative predictors, there is no need to create dummy variables.
3) Simple to understand and interpret. Trees can also be displayed graphically in a way that is easy
for non-experts to interpret.
4) Less data cleaning required: It requires less data cleaning compared to some other modeling
techniques, and it is fairly robust to outliers and missing values.
5) Data type is not a constraint: It can handle both numerical and categorical variables.
6) Non Parametric Method- This means that decision trees have no assumptions about the space
distribution and the classifier structure.
7) Performs well with large datasets. Large amounts of data can be analysed using standard
computing resources in reasonable time.
8) Useful in data exploration: For example, when working on a problem with hundreds of
variables, a decision tree helps identify the most significant ones.
Disadvantages of decision trees

 Unstable: a small change in the data can lead to a large
change in the structure of the optimal decision tree.
 They are often relatively inaccurate. Many other predictors
perform better with similar data. This can be remedied by
replacing a single decision tree with a random forest of decision
trees, but a random forest is not as easy to interpret as a single
decision tree.
 For data including categorical variables with different numbers of
levels, information gain in decision trees is biased in favour of
attributes with more levels.
 Calculations can get very complex, if many values are uncertain
and/or if many outcomes are linked.
Bayes Classification

 It is based on Bayes’ Theorem with an assumption of independence among


predictors.
 Naive Bayes classifier assumes that the presence of a particular
feature/attribute in a class is unrelated to the presence of any other
feature/attribute.
 This is called class conditional independence.
Bayes Theorem

 Let X be a data tuple with n values for n attributes in the


dataset D. i.e. X= (x1, x2, …., xn) depicting measurements on n
attributes A1, A2, … An
 In Bayesian term, X is called evidence.
 Let H be the hypothesis that X belongs to class C.
 The classification problem using Bayes theorem is described
as the probability that hypothesis H holds given evidence X,
i.e., P(H|X)
P(H|X) = P(X|H) P(H) / P(X)

 P(H|X) is called the posterior probability of H conditioned on X, i.e., the
probability that hypothesis H holds given evidence X
 P(X|H) is the posterior probability of X conditioned on H, i.e., the
probability of observing X given that hypothesis H holds
 P(H) is the prior probability of H, probability of occurrence of
hypothesis H regardless of X
 P(X) is the prior probability of X, probability of occurrence of
X regardless of H.
Naïve Bayes Classification Method

 Let D be the training set. Each tuple in D is represented by an


n-dimensional attribute vector X= (x1, x2, …., xn) depicting
measurements on n attributes A1, A2, … An .
 Suppose there are m classes C1, C2, …, Cm. Given a tuple X,
the Naïve Bayes classifier will predict that X belongs to the
class having the highest posterior probability conditioned on X. That is, X is
assigned to class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
The class for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Calculating P(Ci)

 If class prior probabilities are not known, then it is commonly


assumed that the classes are equally likely, i.e., P(C1) = P(C2) = … = P(Cm)
 Otherwise
P(Ci)=|Ci,D|/|D|
Calculating P(X|Ci)

 With the class-conditional independence assumption,
P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
Two cases:
1. Ak is categorical: P(xk|Ci) is the number of tuples of class Ci in D having value xk for Ak, divided by |Ci,D|
2. Ak is numeric: the probability is given by the Gaussian (normal) density function
 Classify X = (age = youth, income = medium, student = yes,
credit_rating = fair)
 m = 2 classes
P(C1 = yes) = 9/14 = 0.643
P(C2 = no) = 5/14 = 0.357
P(X|C1) = P(age|C1) × P(income|C1) × P(student|C1) × P(credit_rating|C1)
= 2/9 × 4/9 × 6/9 × 6/9 = 0.044
P(X|C2) = 0.019
Finally, P(X|C1)P(C1) = 0.044 × 0.643 = 0.028 and P(X|C2)P(C2) = 0.019 × 0.357 = 0.007,
so the Naïve Bayes classifier predicts buys_computer = yes for tuple X.
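
The same calculation as a minimal Python sketch. The individual conditional probabilities for the "no" class (3/5, 2/5, 1/5, 2/5) follow the standard AllElectronics example and are assumed here; only their product, 0.019, appears on the slide.

```python
from math import prod

# Class priors from the 14 training tuples: 9 "yes", 5 "no".
priors = {"yes": 9 / 14, "no": 5 / 14}

# Assumed conditional probabilities P(x_k | Ci) for
# X = (age=youth, income=medium, student=yes, credit_rating=fair).
cond = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],   # age, income, student, credit_rating
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

# Naive Bayes: pick the class maximizing P(X|Ci) * P(Ci); the common
# denominator P(X) can be ignored.
scores = {c: prod(cond[c]) * priors[c] for c in priors}
print({c: round(v, 3) for c, v in scores.items()})      # {'yes': 0.028, 'no': 0.007}
print("predicted class:", max(scores, key=scores.get))  # -> "yes"
```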
Zero Frequency Problem in Naïve
Bayes Classification

 Laplacian Correction: Add one "pretend" tuple for each categorical
value of the attribute for which a zero frequency is encountered.
 Suppose buys_computer = yes, and D has 1000 such tuples: we have 0
tuples with income = low, 990 tuples with income = medium and 10
tuples with income = high.
Actual P(income=low)=0/1000;
modified 1/1003
Actual P(income=medium) =990/1000;
modified 991/1003
Actual P(income=high)= 10/1000;
modified 11/1003
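
A short sketch of the correction arithmetic above: add one pretend tuple per categorical value, which increases the denominator by the number of distinct values (3 here).

```python
# Observed income counts among the 1,000 buys_computer = "yes" tuples.
counts = {"low": 0, "medium": 990, "high": 10}

n = sum(counts.values())   # 1000
k = len(counts)            # 3 distinct income values

# Laplace-corrected conditional probabilities: (count + 1) / (n + k).
smoothed = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(smoothed)            # low ~ 1/1003, medium ~ 991/1003, high ~ 11/1003
```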
Advantage of Naive Bayes

 It is easy and fast to predict the class of a test data set. It also performs well in multi-
class prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs
better compared to other models like logistic regression, and it needs less
training data.
 It performs well with categorical input variables compared to numerical
variable(s). For numerical variables, a normal distribution is assumed (bell curve,
which is a strong assumption).
Disadvantage of Naïve Bayes

•When a categorical variable has a category in the test data set that was not seen
in the training data set, the model will assign it a 0 (zero) probability and will be
unable to make a prediction.
•This is often known as the "Zero Frequency" problem. To solve it, we can use a
smoothing technique. One of the simplest smoothing techniques is called
Laplace estimation.
•Naive Bayes is also known to be a poor probability estimator.
•Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is
almost impossible to get a set of predictors that are completely
independent.
Applications of Naive Bayes Algorithms

 Real time Prediction: Naive Bayes is an eager learning classifier and it is quite fast.
So, it is used for making predictions in real time.
 Multi class Prediction: we can predict the probability of multiple classes of target
variable.
 Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are
widely used in text classification (due to good results on multi-class problems and the
independence assumption) and have a higher success rate compared to many other algorithms. As a
result, they are widely used in spam filtering (identifying spam e-mail) and sentiment
analysis (in social media analysis, to find out positive and negative customer
sentiments)
 Recommendation System: Predict whether a user would like a given resource or
not
Lazy Learners: kNN Classifier

 Eager learners are those which when given a set of training tuples, will
construct a classification model before receiving the test tuples to
classify.
 Lazy learners, on the other hand, store the training tuples (or do only minor
processing) and wait until the last minute before doing any model
construction to classify test tuples.
 Lazy learners do less work when training tuples are presented and more
work when test tuples are presented.
 Lazy learners are also called as instance-based learners.
When kNN algorithm is used

 k- Nearest- Neighbor classifiers are based on learning by


analogy, that is, by comparing a given test tuple with training
tuples that are similar to it.
 A training tuple is described by n attributes.
 Each tuple represents a point in n-dimensional pattern space.
 When given an unknown test tuple, kNN classifier searches the
pattern space for k training tuples that are closest to the test tuple.
k denotes number of nearest neighbor.
Euclidean Distance

 "Closeness" is defined in terms of a distance metric, for
example Euclidean distance.
 Given two tuples X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n),
the Euclidean distance between X1 and X2 is given by

dist(X1, X2) = √( Σ_{i=1}^{n} (x1i − x2i)² )

 The unknown test tuple is assigned the most common class


among its k-nearest neighbors.
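
A minimal sketch of a kNN classifier with Euclidean distance and a majority vote among the k nearest training tuples; the two-attribute training data is made up.

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """dist(X1, X2) = sqrt(sum_i (x1_i - x2_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train_X, train_y, query, k=3):
    # Sort training tuples by distance to the query and vote among the k nearest.
    neighbors = sorted(zip(train_X, train_y), key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-attribute training tuples (already normalized to [0, 1]).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.5, 0.4)]
train_y = ["no", "no", "yes", "yes", "no"]

print(knn_classify(train_X, train_y, (0.7, 0.7), k=3))   # -> "yes"
```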
Computation of distance for Nominal
Attributes

 Suppose one of the attributes in X1 and X2 is color with


three nominal values red, green and blue.
 If color values are same in both tuples, difference is
taken as 0, otherwise it is taken as 1.
Computation of distance when values
are missing

 If value of an attribute A is missing in X1 and/or X2, we


assume the maximum possible difference.
 Map the values to the range [0, 1] (i.e., normalize)
 If A is missing in both X1 and X2, then difference is taken
as 1.
 If only one value is missing, then the difference is taken as |1 −
v′| or |0 − v′|, whichever is greater, where v′ is the
normalized value of the attribute that is present.
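
A minimal sketch of the per-attribute difference rules described on the last two slides (nominal values: 0 if identical, else 1; missing values: assume the maximum possible difference on a [0, 1]-normalized scale).

```python
import math

def attr_diff(v1, v2, nominal=False):
    """Per-attribute difference, assuming numeric values normalized to [0, 1]."""
    if nominal:
        # Nominal: 0 if both present and identical, otherwise 1 (missing counts as a mismatch).
        return 0.0 if (v1 is not None and v1 == v2) else 1.0
    if v1 is None and v2 is None:      # both missing -> maximum difference
        return 1.0
    if v1 is None or v2 is None:       # one missing -> worst case vs. the known value
        known = v2 if v1 is None else v1
        return max(abs(1 - known), abs(0 - known))
    return abs(v1 - v2)                # numeric, already normalized

def distance(x1, x2, nominal_flags):
    return math.sqrt(sum(attr_diff(a, b, f) ** 2 for a, b, f in zip(x1, x2, nominal_flags)))

# Attributes: (normalized age, color); color is nominal, and age is missing in the second tuple.
print(distance((0.4, "red"), (None, "blue"), nominal_flags=(False, True)))
```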
KNN Properties

 Lazy algorithm: no intermediate model or values are computed during training.
 The result is generated only after analysis of the stored data at query time.
 KNN Algorithm is based on feature similarity
 KNN stores the entire training dataset which it uses as its representation.
 KNN does not learn any model.
 KNN makes predictions just-in-time by calculating the similarity between an
input sample and each training instance.
Why KNN is Used

 KNN can be used for both classification and regression
predictive problems. However, it is more widely used for classification
problems in industry.
To evaluate any technique we generally look at 3 important
aspects:
 1. Ease to interpret output
 2. Calculation time
 3. Predictive Power
 It is commonly used for its ease of interpretation and low
calculation time.
Advantage of KNN Algorithm

 No assumptions about data — useful, for example, for


nonlinear data
 Simple algorithm — to explain and understand/interpret
 High accuracy (relatively) — it is pretty high but not
competitive in comparison to better supervised learning
models
 Versatile — useful for classification or regression
Disadvantage of KNN
Algorithm

 Computationally expensive — because the algorithm stores all


of the training data
 High memory requirement.
 Stores (almost all) of the training data.
 Prediction stage might be slow (with big N)
 Sensitive to irrelevant features and the scale of the data.
Regression

 Regression is a technique that models the relationship between a dependent variable "y" and the
values of an explanatory variable "x".
 The regression technique is used, for example, to understand the relationship between product price and sales.
 Suppose we have a dataset of patients having or not having heart disease. There are features like tobacco
consumption, cholesterol, alcohol consumption, type-A personality traits, adiposity value,
obesity measures, and systolic blood pressure, to name a few.
 Correlations can be found among all features and their dependencies on each other. For
example,
 Is an obese person with high adiposity prone to heart disease?
 Does cholesterol have any impact?
 What are the chances that a person's tobacco consumption is associated with heart
disease?
Linear Regression

 Linear regression is a statistical method which allows us to


study relationships between two continuous (quantitative)
variables.
 Linear regression is used when you need to describe how the
mean value of one variable depends on a second.
 Several types of linear regression are available to researchers.
 Simple linear regression
 Multiple linear regression
 Logistic regression
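
A minimal sketch of simple linear regression fitted by ordinary least squares with NumPy; the rainfall and crop-yield figures are made up.

```python
import numpy as np

# Made-up explanatory (rainfall, mm) and dependent (crop yield) observations.
x = np.array([50, 80, 100, 120, 150, 200], dtype=float)
y = np.array([1.8, 2.4, 2.9, 3.1, 3.8, 4.6])

# Fit y = b0 + b1 * x by ordinary least squares.
b1, b0 = np.polyfit(x, y, deg=1)
print(f"yield ~ {b0:.2f} + {b1:.3f} * rainfall")

# Predict the mean yield for a new rainfall value.
print(b0 + b1 * 130)
```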
Advantage of Linear
Regression

 Linear regression is an extremely simple method.


 It is very easy and intuitive to use and understand. A person
with only the knowledge of high school mathematics can
understand and use it.
 It works in most cases; even when it doesn't fit the data
exactly, we can use it to find the nature of the relationship
between the two variables.
Disadvantage of Linear
Regression

 It assumes there is a straight-line relationship between the variables,
which is sometimes incorrect. Linear regression is also very
sensitive to anomalies in the data (outliers).
 For example, suppose most of your data lies in the range 0-10. If,
for any reason, a single data item falls outside this
range, say at 15, it can significantly influence the
regression coefficients.
 When we have more parameters than the number of
samples available, the model starts to fit the noise
rather than the relationship between the variables.
Real World Examples or
Applications of Linear Regression

 Linear regression is applied to models that have two types of variables: we use
one variable to forecast the value of another. These variables are called the
explanatory variable (independent variable) and the dependent variable.
 Crop yield on rainfall: yield is the dependent variable (the output we forecast),
rainfall is the explanatory variable.
 Marks on activities: marks is the dependent variable, activities are explanatory.
 Products on sales: sales are the explanatory variables.
 Predicting house prices with increase in sizes of houses.
 Relationship between the hours of study a student puts in, w.r.t the exam results
THANK YOU
