Unit 4-1

The document discusses various estimation methods in machine learning, focusing on parametric and nonparametric approaches, including decision trees and discriminant functions. It explains the structure and functioning of decision trees, their dynamic nature, and the concepts of entropy and information gain. Additionally, it covers linear and quadratic discriminant functions, their advantages and disadvantages, and the geometric interpretation of decision boundaries for classification tasks.


Decision Tree

Parametric estimation
• A single model is defined for the entire input space (the domain of all
possible inputs).

• The parameters of this model are learned using the entire training dataset.

• Once the model is trained, the same set of parameters is applied to any
new test input, regardless of its specific location in the input space.
Nonparametric Estimation
• The input space is divided into local regions based on a distance measure
(e.g., the Euclidean norm, which measures how far points are from each
other in space).

• For each test input, a local model is created using the training data within
its corresponding region.
Instance-Based Models
• Nonparametric models that rely on training data instances directly to make
predictions (e.g., k-nearest neighbors).

• Prediction requires calculating the distance from the test input to every training
instance, which is computationally expensive, with a time complexity of O(N),
where N is the number of training instances (see the sketch below).
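A minimal sketch of this O(N) computation, assuming NumPy and a tiny illustrative dataset; the knn_predict helper is hypothetical, not part of the notes:

import numpy as np

def knn_predict(x_test, X_train, y_train, k=3):
    # Euclidean distance from the test input to every training instance: O(N)
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training instances
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny illustrative dataset
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.95, 0.9]), X_train, y_train, k=3))  # -> 1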
Decision Tree
• Applies a divide-and-conquer strategy to solve problems.

• A decision tree is a structured, hierarchical representation of data. It
branches out like a tree, where each internal node represents a decision
based on an attribute, and each leaf node represents an outcome or a
class.

• Can be used for classification (assigning labels to data) or regression
(predicting numerical values).
• Internal Decision Nodes: Represent points in the tree where decisions
are made based on a specific condition or test.

• Terminal Leaves: Represent the endpoints of the tree. These contain
the output value or class label for the given input.

• Decision Nodes: Each node applies a test function to the input data.

• Branches: Paths connecting decision nodes, corresponding to the test
outcomes.

• Leaf Node: The tree ends here, and the value in the leaf node is the
final output for the input.
• A learning algorithm takes a labeled training dataset (data where the
inputs and their corresponding outputs are known) to construct the
decision tree.

• The tree is built by splitting the data repeatedly based on attributes to minimize
errors and improve predictive accuracy (a minimal example follows this list).

• Alternatively, a rule base can be learned directly from the data, bypassing the tree structure.

• Unlike parametric models (e.g., linear regression), decision trees do not
assume any predefined distribution (e.g., Gaussian) for the class
densities (how data points are distributed within each class).
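A minimal sketch of learning a tree from labeled data, assuming scikit-learn is available; the toy dataset, criterion, and depth are illustrative choices, not prescribed by the notes:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy labeled training data: inputs and their known class labels
X = [[2.0, 3.0], [1.0, 1.0], [3.5, 2.0], [4.0, 4.5], [5.0, 1.5], [4.5, 4.0]]
y = [0, 0, 0, 1, 1, 1]

# Build the tree by repeatedly splitting on the attribute/threshold that most
# reduces impurity (no assumption about class densities is made)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

# Internal nodes hold the tests, leaves hold the predicted class
print(export_text(tree, feature_names=["x1", "x2"]))
print(tree.predict([[4.2, 3.8]]))  # class label for a new input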
Dynamic Tree Structure
• The tree is built incrementally during the training process, adding branches
and leaves based on the complexity and characteristics of the data.

• The data is split recursively at decision nodes, based on conditions that
reduce uncertainty (e.g., using measures such as Gini impurity or information
gain).

• If the data shows a complex relationship between input features and output
classes, the tree will add more splits, branches, and leaves to accurately
model those relationships.
• Entropy quantifies the amount of disorder or uncertainty in a dataset.
A lower entropy means the dataset is more homogeneous (e.g., all
data points belong to the same class), while higher entropy indicates
more heterogeneity (e.g., data points are mixed across multiple
classes).
• Information Gain (IG) is the difference between the entropy of the
dataset before the split and the weighted entropy of the subsets after
the split (see the sketch below).
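A small sketch of these two measures, assuming NumPy; the example split is illustrative:

import numpy as np

def entropy(labels):
    # Entropy of a set of class labels, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, subsets):
    # Entropy before the split minus the weighted entropy of the subsets
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # mixed classes -> entropy 1.0
split  = [np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])]
print(entropy(parent))                  # 1.0
print(information_gain(parent, split))  # ~0.19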
Linear Discrimination
• The fundamental assumption is that instances of different classes can
be separated by a linear decision boundary.

• A hyperplane in the feature space can effectively distinguish between
instances belonging to different categories.

• Set of discriminant functions for classification: gj(x), j = 1, ..., K; choose Ci if gi(x) = max_j gj(x).

• Prior probabilities: P(Cj)

• Class likelihoods: p(x | Cj)

• Posterior densities: P(Cj | x) = p(x | Cj) P(Cj) / p(x)

• Discriminant-based classification: assume a model directly for the
discriminant, bypassing the estimation of likelihoods or posteriors.

• Model for discriminant: gj(x | Φj), where Φj denotes the parameters of the model for class Cj.

• The discriminant function directly assigns class labels to input
samples without explicitly estimating likelihoods or posterior probabilities.
Inductive Bias in Discriminant-Based Classification:
• Instead of making an assumption on the form of the class densities, make an
assumption on the form of the boundaries separating classes.
Examples: Perceptron, Support Vector Machines (SVM), Logistic Regression
Linear discriminant:
• Used mainly because of its low space and time complexity, O(d), where d is the number of input dimensions.

• Before using complex models like neural networks or kernel-based SVMs, first try
a linear discriminant model to check if it performs well.
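A minimal sketch of classifying with a set of linear discriminants gj(x) = wj^T x + wj0, assuming NumPy; the weights below are illustrative placeholders, whereas in practice they would be learned (e.g., by gradient descent, discussed later):

import numpy as np

# Illustrative (not learned) parameters for K = 3 classes, d = 2 features:
# one weight vector w_j and bias w_j0 per class
W  = np.array([[ 1.0, -0.5],
               [-0.2,  0.8],
               [ 0.3,  0.3]])
w0 = np.array([0.1, -0.4, 0.0])

def classify(x):
    # Evaluate g_j(x) = w_j^T x + w_j0 for every class, O(K*d)
    g = W @ x + w0
    # Choose the class C_i with the largest g_i(x)
    return int(np.argmax(g))

print(classify(np.array([2.0, 1.0])))  # index of the predicted class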
Quadratic Discriminant Function:
• Allows for more complex decision boundaries than a linear model: gj(x) = x^T Wj x + wj^T x + wj0.

Advantages:
• Captures complex decision boundaries

Disadvantages:
• High computational cost
• Risk of overfitting
• An alternative is to map the data to a higher-dimensional space and apply a linear model.
• Instead of directly using a quadratic discriminant, preprocess the input by
adding higher-order features (see the sketch after this list).

• Original input: x = [x1, x2]^T

• New transformed variables: z1 = x1, z2 = x2, z3 = x1^2, z4 = x2^2, z5 = x1·x2

• New feature vector: z = [z1, z2, z3, z4, z5]^T, so a linear discriminant in z corresponds to a quadratic discriminant in x.

Advantages:
• Computational complexity is low
• More interpretable

Disadvantages:
• Risk of overfitting
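A minimal sketch of this preprocessing idea, assuming NumPy and scikit-learn; the quad_features helper and the toy data are illustrative, and logistic regression stands in for any linear discriminant applied in the transformed space (the same pattern works for the generalized basis functions below):

import numpy as np
from sklearn.linear_model import LogisticRegression

def quad_features(X):
    # Map x = (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Toy data whose classes are not linearly separable in the original space
X = np.array([[0.0, 0.0], [0.2, -0.1], [2.0, 2.0], [-2.0, 2.1],
              [1.9, -2.0], [-2.1, -1.9]])
y = np.array([0, 0, 1, 1, 1, 1])

# A linear discriminant in z gives a quadratic boundary in x
model = LogisticRegression().fit(quad_features(X), y)
print(model.predict(quad_features(np.array([[0.1, 0.1], [2.2, -2.2]]))))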
Generalized approach:
• Basis functions: gj(x) = Σk wjk φjk(x), where each φjk(x) is a basis function of the input.

Examples:
• Polynomial basis: φjk(x) are products of powers of the inputs (e.g., x1, x2, x1^2, x1·x2)

• Trigonometric basis: φjk(x) are sines and cosines of the inputs (e.g., sin(x1), cos(x1))
Geometry of the Linear Discriminant:
• Two classes: g(x) = g1(x) − g2(x) = w^T x + w0; choose C1 if g(x) > 0, otherwise choose C2.

• Decision Hyperplane: g(x) = 0

• The weight vector w defines the orientation of the hyperplane.

• The bias/threshold w0 defines the position of the hyperplane.
• Geometric Interpretation of the Decision Hyperplane
• Any point x in the input space can be decomposed into two components:
x = xp + r·(w/||w||), where xp is the projection of x onto the hyperplane and
r = g(x)/||w|| is the signed distance of x from the hyperplane.
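A small numeric sketch of this decomposition, assuming NumPy and illustrative values for w, w0, and x:

import numpy as np

# Illustrative hyperplane parameters and input point
w, w0 = np.array([2.0, 1.0]), -3.0
x = np.array([3.0, 2.0])

g = w @ x + w0                        # g(x) > 0 -> choose C1, g(x) < 0 -> C2
r = g / np.linalg.norm(w)             # signed distance of x from the hyperplane
x_p = x - r * w / np.linalg.norm(w)   # projection of x onto g(x) = 0

print(g, r)           # discriminant value and signed distance
print(w @ x_p + w0)   # ~0: x_p lies on the decision hyperplane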
Multiple Classes:
• Each class Cj has its own linear discriminant gj(x) = wj^T x + wj0; the input is assigned to the class Ci with the maximum gi(x).
Pairwise separation:
It uses K(K − 1)/2 linear discriminants, gij(x), one for every pair of
distinct classes (a voting sketch follows).
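A minimal sketch of pairwise (one-vs-one) classification by voting, assuming NumPy; the per-pair weights are random placeholders rather than trained discriminants:

import numpy as np
from itertools import combinations

K, d = 3, 2
rng = np.random.default_rng(0)

# One illustrative (not learned) linear discriminant g_ij per class pair:
# g_ij(x) > 0 votes for class i, otherwise for class j
pairwise = {(i, j): (rng.standard_normal(d), rng.standard_normal())
            for i, j in combinations(range(K), 2)}   # K(K-1)/2 = 3 pairs

def classify(x):
    votes = np.zeros(K, dtype=int)
    for (i, j), (w, w0) in pairwise.items():
        votes[i if w @ x + w0 > 0 else j] += 1
    return int(np.argmax(votes))   # class with the most pairwise wins

print(classify(np.array([0.5, -1.0])))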

Parametric Discrimination Revisited:
• When the class densities p(x | Cj) are Gaussian with a shared covariance matrix, the log odds log[P(C1 | x)/P(C2 | x)] is linear in x, which motivates modelling the discriminant (or the log likelihood ratio) directly as a linear function.
Gradient Descent:
Optimize the parameters of the discriminant function to minimize
classification error on the training set, by iteratively updating the weights in the
direction of the negative gradient: w ← w − η ∂E/∂w, where η is the learning rate.
Logistic discrimination
Two classes: the log likelihood ratio is assumed to be linear: log[p(x | C1)/p(x | C2)] = w^T x + w0.

• x may be composed of discrete attributes or may be a mixture of
continuous and discrete attributes.
• Using Bayes' rule, the posterior is P(C1 | x) = sigmoid(w^T x + w0).
• Sigmoid function: sigmoid(a) = 1 / (1 + exp(−a)) (see the sketch below).
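A small sketch tying the sigmoid posterior to the gradient-descent update described above, assuming NumPy; the toy data, learning rate, and iteration count are illustrative:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy two-class data; a column of ones is appended so the bias w0 is learned too
X = np.array([[0.5, 1.0], [1.0, 0.5], [3.0, 3.5], [3.5, 3.0]])
r = np.array([0.0, 0.0, 1.0, 1.0])
X_aug = np.column_stack([X, np.ones(len(X))])

w = np.zeros(X_aug.shape[1])   # [w1, w2, w0]
eta = 0.1                      # learning rate (illustrative)

for _ in range(1000):
    y = sigmoid(X_aug @ w)     # P(C1 | x) for every training input
    grad = X_aug.T @ (y - r)   # gradient of the cross-entropy error
    w -= eta * grad            # gradient-descent update: w <- w - eta * dE/dw

print(w)                             # learned [w1, w2, w0]
print(sigmoid(X_aug @ w).round(2))   # posteriors move toward the 0/1 targets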
Multiple classes: the posteriors are modeled with the softmax function, P(Ci | x) = exp(wi^T x + wi0) / Σj exp(wj^T x + wj0).
Discrimination by regression
• Probabilistic model: r = y + ε, where ε ~ N(0, σ²) and y = sigmoid(w^T x + w0); maximizing the likelihood is then equivalent to minimizing the sum of squared errors.