Data Mining Notes

- Naive Bayes

- Decision tree (information gain, bootstrapping, aggregation)


- General embedding
- Random forest: making weak learners that, combined, become a strong learner
➢ We do the sampling through bootstrapping
➢ Roughly 80% of the observations end up in the bag and are used to build the tree
➢ The remaining ~20% of the observations (out of the bag) help us test that single tree
➢ We do sampling on rows and sampling on columns
➢ We end up with ~80 observations on which we build a decision tree
➢ AdaBoost: each tree corrects the previous tree ⇒ a direct derivative of random forest
➢ OOB error: out-of-bag error (very important)
➢ Feature importance is one of the strengths of random forest (how exactly is feature importance computed?)
➢ Neural network: a lot of data and a large network are required to make it work
➢ Boosting: replicating what we want the system to pay attention to
Data balancing: oversample the minority class (replicate its observations) to make it bigger, OR take the whole sample and ...? (VERY IMPORTANT). We do that because the system does not work on a 1/1000 class ratio (see the sketch below).
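A minimal scikit-learn sketch tying several of these points together (the toy dataset and parameter values are assumptions, not from the lecture): bootstrap sampling, the out-of-bag error estimate, feature importance, and one way of compensating for class imbalance.

```python
# Sketch (assumed toy data): bootstrapping, OOB error, feature importance,
# and handling class imbalance in one place.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed imbalanced binary problem (roughly 99:1 class ratio).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.99, 0.01], random_state=0)

# bootstrap=True  -> each tree trains on a bootstrap sample (rows drawn with replacement)
# oob_score=True  -> observations left out of each bag estimate the error (OOB error)
# class_weight="balanced" -> one way to compensate for imbalance
#                            (oversampling the minority class is the alternative in the notes)
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, oob_score=True,
                            class_weight="balanced", random_state=0)
rf.fit(X, y)

print("OOB accuracy (1 - OOB error):", rf.oob_score_)
print("Feature importances:", rf.feature_importances_)
```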

Random Forest:
- Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions rather than an individual model.
- Bagging: creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. For example, Random Forest.
- Boosting: combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.

Weak learners:
Random Forest steps:
1. Step 1: In the Random forest model, a subset of data points and a
subset of features is selected for constructing each decision tree.
Simply put, n random records and m features are taken from the
data set having k number of records.
2. Step 2: Individual decision trees are constructed for each sample.
3. Step 3: Each decision tree will generate an output.
4. Step 4: The final output is based on majority voting (for classification) or averaging (for regression).
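The sketch below walks through these four steps by hand on assumed toy data (the number of trees and the feature-subset size are also illustrative choices); a library class such as scikit-learn's RandomForestClassifier does all of this internally.

```python
# Illustrative sketch of Steps 1-4 (assumed toy setup): n random rows and
# m random features per tree, then majority voting across the trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
k, n_trees, m_features = X.shape[0], 25, 4

trees, feature_sets = [], []
for _ in range(n_trees):
    rows = rng.integers(0, k, size=k)                          # Step 1a: bootstrap sample of rows
    cols = rng.choice(X.shape[1], m_features, replace=False)   # Step 1b: random feature subset
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])  # Step 2: fit one tree
    trees.append(tree)
    feature_sets.append(cols)

# Steps 3-4: each tree produces an output; the forest takes the majority vote per sample.
all_preds = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
majority_vote = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, all_preds)
print("Training accuracy of the ensemble:", (majority_vote == y).mean())
```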

+ Diversity and stability
+ Good with high dimensionality
+ Parallelization

1- Bagging steps:
1. Subset selection
2. Bootstrap sampling (sampling with replacement)
3. Independent model training
4. Majority voting (choosing the most frequently predicted result)
5. Aggregation: involves combining all the results and generating the final output
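A short sketch of the same pipeline using scikit-learn's BaggingClassifier (the toy data and parameter values are assumptions); bootstrap sampling, independent model training, and vote aggregation are handled inside the class.

```python
# Sketch of bagging (assumed toy data): bootstrap samples, independently
# trained decision trees, and aggregation of their votes.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model (named base_estimator in older scikit-learn)
    n_estimators=50,                     # number of independently trained models
    bootstrap=True,                      # draw each training subset with replacement
    random_state=0,
)
print("Bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```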

2- Boosting:
➢ Uses the concept of ensemble learning.
➢ A boosting algorithm combines multiple simple models (also known as
weak learners or base estimators) to generate the final output.
➢ It is done by building a model from weak models trained in series.

Examples:
⇒ AdaBoost was the first really successful boosting algorithm that was developed
for the purpose of binary classification.
⇒ AdaBoost is an abbreviation for Adaptive Boosting and is a prevalent boosting technique that combines multiple “weak classifiers” into a single “strong classifier.”
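A sketch of AdaBoost with decision stumps (depth-1 trees) as the weak learners; the toy data and hyperparameter values are illustrative assumptions.

```python
# Sketch of AdaBoost (assumed toy data): decision stumps are combined in
# sequence into a single strong binary classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # the weak learner: a stump
                                                    # (named base_estimator in older scikit-learn)
    n_estimators=100,    # number of weak learners built in series
    learning_rate=0.5,   # how strongly each new learner corrects the previous ones
    random_state=0,
)
print("AdaBoost CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```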

Last session before midterm:
- k-fold cross-validation: we split the data into K parts and each part takes a turn as the test set, with resampling, as if you increase the number of tests to increase the accuracy
- Hyperparameter tuning: we use k-fold cross-validation a lot for this (very important for supervised learning) ⇒ the technique is called grid search cross-validation
- In scikit-learn (Python): GridSearchCV(), used to improve model performance (like bagging and boosting do); see the sketch below
- Stump: the smallest possible tree (a weak learner)
- He told us to revise with ChatGPT, it covers everything
- He will do an online session on Python for us and give us a "tutorial" from which the devoir (test) will be drawn
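A minimal grid search cross-validation sketch with scikit-learn's GridSearchCV (the model, the parameter grid, and the toy data are assumptions for illustration):

```python
# Sketch of hyperparameter tuning with k-fold cross-validation (grid search CV).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation: 5 train/test splits per candidate
    scoring="accuracy",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```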
Random Forest: An Overview
1. What is a Random Forest?
● A Random Forest is an ensemble learning method, primarily used for
classification and regression.
● It operates by constructing a multitude of decision trees at training time
and outputting the class that is the mode of the classes (in classification)
or mean prediction (in regression) of the individual trees.
2. How Does it Work?
● Bootstrap Aggregating (Bagging): Random Forests create an ensemble of
Decision Trees using bagging. Each tree is built on a bootstrap sample, a
random sample with replacement of the training data.
● Feature Randomness: When building each tree, each time a split is
considered, a random sample of features is chosen as split candidates
from the full set of features. This introduces more diversity among the
trees, and it's a key difference from a single decision tree.
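A small sketch of these two mechanisms, bagging plus per-split feature randomness, compared against a single decision tree (the dataset and settings are illustrative assumptions; exact numbers will vary):

```python
# Sketch (assumed settings): the forest bootstraps rows (bagging) and, at every
# split, considers only a random subset of the features (feature randomness).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,        # bagging: each tree trains on a bootstrap sample
    max_features="sqrt",   # feature randomness: sqrt(n_features) candidates per split
    random_state=0,
)
print("Single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```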
3. Key Characteristics
● Reduction in Overfitting: By averaging multiple trees, there is a
significant reduction in the risk of overfitting.
● Handling Unbalanced Data: Random Forests can handle unbalanced data.
They work well for classification where the classes are imbalanced.
● Feature Importance: They give insights into which features are important
in the prediction.
4. Applications
● Random Forests are versatile and can be used in various tasks including
but not limited to medical diagnosis, stock market prediction, and
e-commerce recommendation systems.
5. Limitations
● Interpretability: They are less interpretable than decision trees.
Understanding how a decision is made by looking at individual trees can
be challenging due to the ensemble nature.
● Performance Issues: For data including categorical variables with
different numbers of levels, Random Forests are biased in favor of those
attributes with more levels.
● Large Datasets: For very large datasets, the size of the trees can take up a
lot of memory. It can also be time-consuming to train.
6. Practical Considerations
● Number of Trees: The number of trees in the forest should be high
enough to achieve stable accuracy, but adding more trees beyond a
certain point does not improve performance.
● Tree Depth: Control overfitting by adjusting the depth of the trees.
Deeper trees can model more complex patterns but also increase the risk
of overfitting.
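A quick sketch of both considerations (the data, tree counts, and depth limit are assumptions): test accuracy typically stabilizes once enough trees have been added, while max_depth caps how complex each tree can become.

```python
# Sketch (assumed data): watching accuracy stabilize as trees are added, and
# limiting tree depth to control overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_trees in (10, 50, 200, 500):
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=8, random_state=0)
    rf.fit(X_train, y_train)
    # Beyond some point, adding trees no longer changes the test accuracy much.
    print(n_trees, "trees -> test accuracy:", rf.score(X_test, y_test))
```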
7. Random Forest vs. Decision Trees
● While a single decision tree can be prone to overfitting, the Random
Forest averages multiple trees, which reduces overfitting and improves
generalization.
Conclusion
Random Forest is a powerful and versatile machine learning algorithm, suitable
for a wide range of applications. It provides a good balance between prediction
accuracy and model interpretability, especially in settings where decision trees
alone might be too simplistic or prone to overfitting. However, understanding
the nuances of its parameters and the nature of your data is key to harnessing its
full potential.
