
2. THEORETICAL FRAMEWORK

2.1. Decision Tree

The Decision Tree is a supervised machine learning algorithm that utilizes a tree-
like structure to make decisions based on input data. At each node, the model identifies
an attribute to split the data into smaller subsets, aiming to make these subsets more
homogeneous according to an evaluation criterion. The leaf nodes at the end of the tree
contain the output value, representing either a classification label or a predicted value in
regression tasks.

Working Mechanism

The process of constructing a decision tree involves a series of iterative steps, where, at each step, the best attribute is chosen to split the data. The best attribute is determined based on a criterion that minimizes uncertainty or enhances the purity of the data, such as Entropy, Gini Index, or Mean Squared Error.

 Entropy and Information Gain:

Entropy measures the impurity (disorder) of a dataset S and is calculated as:

Entropy(S) = − Σ_i p_i log2(p_i)

where p_i is the proportion of samples in S belonging to class i.

Information Gain is the difference in Entropy before and after splitting the data on an attribute A, representing the degree of improvement achieved by the split:

IG(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S for which attribute A takes the value v.

 Gini Index: an alternative purity measure to Entropy, calculated according to the formula:

Gini(S) = 1 − Σ_i p_i^2

 Mean Squared Error (MSE): for regression problems, the decision tree optimizes splits by minimizing the MSE:

MSE = (1/n) Σ_i (y_i − ŷ_i)^2
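To make these splitting criteria concrete, the following is a minimal Python sketch (assuming NumPy is available) that computes Entropy, the Gini Index, and Information Gain for a small toy label array; the function names and the toy data are illustrative, not part of any particular library.

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, feature_values):
    # IG = Entropy(parent) - weighted sum of Entropy(children after the split)
    total = len(labels)
    weighted_child_entropy = 0.0
    for v in np.unique(feature_values):
        child = labels[feature_values == v]
        weighted_child_entropy += (len(child) / total) * entropy(child)
    return entropy(labels) - weighted_child_entropy

# Toy example: binary labels split by a categorical feature
y = np.array([1, 1, 0, 0, 1, 0])
x = np.array(["a", "a", "b", "b", "a", "b"])
print(entropy(y), gini(y), information_gain(y, x))

In this toy case the split on x separates the classes perfectly, so the Information Gain equals the parent Entropy (1.0 for a balanced binary set).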

Popular Decision Tree Construction Algorithms

ID3 (Iterative Dichotomiser 3)


ID3 is one of the earliest algorithms for building decision trees, relying on Entropy and Information Gain. The algorithm selects the attribute with the highest Information Gain to split the data at each node.

C4.5
C4.5 is an extension of ID3 that improves upon it by using Gain Ratio – a variant
of Information Gain designed to avoid bias towards attributes with many unique values.
Additionally, C4.5 can handle continuous attributes by determining threshold values for
splitting.

CART (Classification and Regression Trees)


CART is an algorithm that uses the Gini Index to evaluate attributes in
classification problems and employs Mean Squared Error (MSE) for regression tasks. It
is widely implemented due to its flexibility and efficiency.
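As a brief illustration, the sketch below fits CART-style trees with scikit-learn (assumed available here; its tree module implements an optimized CART variant). The datasets (iris and diabetes) are chosen only for demonstration.

from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree using the Gini Index as the splitting criterion
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X_c, y_c)
print("classification accuracy (train):", clf.score(X_c, y_c))

# Regression tree minimizing squared error (MSE) at each split
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0).fit(X_r, y_r)
print("regression R^2 (train):", reg.score(X_r, y_r))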

2.2. Random Forest

Random Forest is a machine learning algorithm belonging to the ensemble learning methods, where multiple decision trees are built and combined to create a more robust model. Each tree in the Random Forest is constructed using a different dataset generated through bootstrap sampling – a method of random sampling with replacement from the original dataset.

Working Mechanism

Constructing independent decision trees: For each tree, a subset of the training data is created using bootstrap sampling. At each node in the tree, only a random subset of features is considered to select the best attribute for splitting.

Aggregating results from the trees: For classification problems, Random Forest uses majority voting, where the class predicted by the majority of the trees is chosen as the output. For regression problems, the output is the average of the predictions from all the trees.

Key Characteristics of Random Forest

Random selection of datasets and attributes at each node increases the diversity among trees, reducing the risk of overfitting.

Combining weak learners into a strong model: While individual decision trees
might not perform well, their combination in a forest leads to a robust and stable model.

Random Forest Construction Algorithm

Bootstrap Sampling: Generate multiple training datasets by randomly sampling with replacement from the original dataset.

Random Subset of Features: At each node in the tree, only a small subset of attributes is considered for splitting, minimizing the correlation among trees.

Voting or Averaging: Aggregate the outputs from the trees by voting (for classification) or averaging (for regression).
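The following is a minimal hand-rolled sketch of these three steps, assuming NumPy and scikit-learn are available; it uses scikit-learn's DecisionTreeClassifier as the base learner (with max_features="sqrt" to randomize the features considered at each node) and the iris dataset purely for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees, n_samples = 25, X.shape[0]
trees = []

# Steps 1 and 2: bootstrap sampling plus a random feature subset at each node
for i in range(n_trees):
    idx = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: majority voting over the individual tree predictions
all_preds = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("training accuracy of the hand-rolled forest:", np.mean(votes == y))

For a regression task, the final line would average the tree predictions instead of taking a majority vote; in practice, scikit-learn's RandomForestClassifier/RandomForestRegressor implement this procedure directly.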

2.3. Ridge Regression

Ridge Regression is a type of linear regression that includes a regularization term to address the problem of multicollinearity and overfitting in high-dimensional datasets. It modifies the ordinary least squares (OLS) regression by adding a penalty term, which constrains the magnitude of the model's coefficients. This ensures a more generalized model that performs well on unseen data.

Working Mechanism

The Ridge Regression model minimizes a cost function that balances the trade-off between fitting the data and keeping the model coefficients small. The cost function is expressed as:

J(β) = Σ_i (y_i − ŷ_i)^2 + λ Σ_j β_j^2

where:

y_i: actual value of the dependent variable for observation i.

ŷ_i: predicted value of the dependent variable for observation i.

β_j: coefficient of the j-th feature.

λ: regularization parameter (penalty term).

The first term in the cost function is the residual sum of squares (RSS), which measures the model's error. The second term is the penalty term, proportional to the squared magnitudes of the coefficients, which discourages large coefficients.
Effect of Regularization Parameter (λ)

When λ = 0, Ridge Regression reduces to Ordinary Least Squares, and no regularization is applied. As λ increases, the penalty term becomes more significant, forcing the coefficients to shrink closer to zero. Unlike Lasso Regression, Ridge Regression does not reduce coefficients to exactly zero, meaning all predictors remain in the model.
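This shrinkage effect can be observed with a short sketch, assuming scikit-learn is available (note that scikit-learn calls the regularization parameter alpha rather than λ); the diabetes dataset is used only as an example.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The coefficient norm shrinks as the regularization strength grows;
# as alpha approaches 0 the fit approaches ordinary least squares.
for alpha in [0.01, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"lambda={alpha:>6}: ||beta|| = {np.linalg.norm(model.coef_):.2f}")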

Key Characteristics of Ridge Regression

Multicollinearity Handling: Ridge Regression is particularly effective in scenarios where predictors are highly correlated. By adding a penalty term, it reduces the variance of the model, resulting in more stable predictions.

Feature Shrinkage: the regularization term shrinks the coefficients, which helps prevent overfitting, especially when the dataset contains noise or irrelevant features.

Solution Stability: the penalty term mitigates the matrix inversion problem in the normal equation, making the solution numerically stable even with collinear predictors.

Ridge Regression Construction Algorithm

Data Standardization: Standardize the features so that all predictors are on the same scale. This step is essential because the penalty term depends on the magnitude of the coefficients, which in turn depends on the scale of the features.

Defining the Cost Function: Construct the cost function as the sum of the RSS
and the penalty term.

Optimization: Solve the optimization problem using techniques such as gradient descent or closed-form matrix methods. The Ridge Regression solution can be expressed as:

β̂ = (XᵀX + λI)⁻¹ Xᵀy

Here, X is the design matrix, y is the target vector, and I is the identity matrix.
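A minimal NumPy sketch of this closed-form solution is shown below, assuming the features are already standardized and the intercept is handled separately; the synthetic near-collinear data and the function name are illustrative only.

import numpy as np

def ridge_closed_form(X, y, lam):
    # beta_hat = (X^T X + lambda * I)^(-1) X^T y
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)  # solve() is more stable than an explicit inverse

# Tiny synthetic example with two nearly collinear features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)  # near-collinear columns
y = 3 * X[:, 0] + rng.normal(size=100)

print(ridge_closed_form(X, y, lam=0.0))  # ill-conditioned OLS-like solution
print(ridge_closed_form(X, y, lam=1.0))  # stabilized ridge solution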

Hyperparameter Tuning: Choose an appropriate value for λ using methods such as cross-validation to balance bias and variance.

Prediction: Use the optimized coefficients to make predictions on new data.
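The last three steps (tuning, fitting, prediction) can be combined in a short scikit-learn sketch, assuming that library is available; RidgeCV selects λ (alpha in scikit-learn's naming) by cross-validation over a grid of candidate values, and the diabetes dataset is used only as an example.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize the features, then let RidgeCV pick lambda by 5-fold cross-validation
model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5))
model.fit(X_train, y_train)

print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
print("test R^2:", model.score(X_test, y_test))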
