ALGORITHMS
THEORETICAL FRAMEWORK
The Decision Tree is a supervised machine learning algorithm that utilizes a tree-
like structure to make decisions based on input data. At each node, the model identifies
an attribute to split the data into smaller subsets, aiming to make these subsets more
homogeneous according to an evaluation criterion. The leaf nodes at the end of the tree
contain the output value, representing either a classification label or a predicted value in
regression tasks.
Working Mechanism
Information Gain: the difference in Entropy before and after splitting the data
on an attribute, representing how much the split improves the tree:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} (|S_v| / |S|) Entropy(S_v)

where Entropy(S) = -\sum_i p_i \log_2 p_i, S is the data set at the node, A is the
candidate attribute, and S_v is the subset of S in which A takes the value v.
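As a minimal sketch of the idea (the helper names and the toy labels below are illustrative, not from the original), Entropy and Information Gain can be computed as:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before a split minus the weighted entropy after it.

    `groups` is the partition of `labels` induced by one attribute.
    """
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after

# Toy example: four samples partitioned perfectly by a binary attribute.
labels = ["yes", "yes", "no", "no"]
groups = [["yes", "yes"], ["no", "no"]]
print(information_gain(labels, groups))  # 1.0: the split removes all uncertainty
```

A split that leaves every subset single-class yields the maximum possible gain, which is why the algorithm greedily picks the attribute with the highest gain at each node.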
C4.5
C4.5 is an extension of ID3 that uses Gain Ratio – a variant of Information Gain
designed to avoid bias toward attributes with many unique values. Additionally, C4.5
can handle continuous attributes by determining threshold values for splitting.
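A rough sketch of the Gain Ratio computation (function names and toy data are illustrative assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, groups):
    """C4.5's Gain Ratio: Information Gain divided by SplitInfo.

    SplitInfo is the entropy of the partition sizes themselves, so an
    attribute that shatters the data into many tiny groups is penalized.
    """
    n = len(labels)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups)
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
id_like_split = [["yes"], ["yes"], ["no"], ["no"]]  # one group per sample
binary_split = [["yes", "yes"], ["no", "no"]]
print(gain_ratio(labels, id_like_split))  # 0.5: gain 1.0 over SplitInfo 2.0
print(gain_ratio(labels, binary_split))   # 1.0: gain 1.0 over SplitInfo 1.0
```

Both splits have the same Information Gain, but the ID-like attribute is penalized by its larger SplitInfo, which is exactly the bias correction the prose describes.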
Combining weak learners into a strong model: While individual decision trees
might not perform well, their combination in a forest leads to a robust and stable model.
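As an illustrative comparison (the data set is synthetic and the scores depend on it; this is a sketch using scikit-learn, not the original experiment), a forest can be benchmarked against a single tree like so:

```python
# Compare one decision tree against a forest of 100 trees on toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

tree_score = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X, y, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                      random_state=0),
                               X, y, cv=5).mean()
print(f"single tree: {tree_score:.3f}")
print(f"forest:      {forest_score:.3f}")
```

On most data sets the averaged predictions of many decorrelated trees reduce variance, so the forest's cross-validated accuracy is typically at least as high as the single tree's.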
Working Mechanism
The Ridge Regression model minimizes a cost function that balances the trade-off
between fitting the data and keeping the model coefficients small. The cost function is
expressed as:

J(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
The first term in the cost function is the residual sum of squares (RSS), which
measures the model’s error. The second term is the penalty term, proportional to the
squared magnitudes of the coefficients, and it discourages large coefficients.
Effect of Regularization Parameter (λ)
When λ = 0, Ridge Regression reduces to ordinary least squares; as λ increases,
the coefficients are shrunk toward zero, trading a small increase in bias for a
reduction in variance.
Data Standardization: Standardize the features so that all predictors are on
the same scale. This step is essential because the Ridge penalty depends on the
magnitude of the coefficients, which in turn depends on the scale of the features.
Defining the Cost Function: Construct the cost function as the sum of the RSS
and the penalty term.
Setting the gradient of the cost function to zero yields the closed-form solution
\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y. Here, X is the design matrix, y is the
target vector, and I is the identity matrix.
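The steps above can be sketched in a few lines of NumPy (the data, the coefficient values, and the `ridge_fit` helper are illustrative assumptions; the intercept is omitted since the toy features and target are roughly centered):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends linearly on two features plus small noise.
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Step 1: standardize the features (Ridge is sensitive to feature scale).
X = (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_fit(X, y, lam):
    """Closed-form Ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge_fit(X, y, lam=0.0)      # lam = 0: ordinary least squares
beta_ridge = ridge_fit(X, y, lam=100.0)  # large lam: coefficients shrink
print(beta_ols)
print(beta_ridge)
```

With λ = 0 the estimate recovers roughly the true coefficients, while a large λ visibly shrinks them toward zero, illustrating the bias-variance trade-off the regularization parameter controls.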