Model Engineering

1. Data splitting:
a. Import Necessary Module: First, we import the train_test_split function from the
sklearn.model_selection module.
b. Separate Features and Target Variable: Before splitting the data, we need to separate
the features (X) from the target variable (y). This ensures that we correctly assign the
predictors and the variable we're trying to predict.
c. Split the Data:
We use the train_test_split function to split the data into training and testing sets:
● X_train and y_train represent the features and target variables of the training set,
respectively.
● X_test and y_test represent the features and target variables of the testing set,
respectively.
● test_size=0.2 specifies that 20% of the data will be used for testing, and the
remaining 80% will be used for training.
● random_state=42 ensures reproducibility. It sets the random seed so that the data
split is the same every time the code is run.
To preserve the proportions of the classes in the target variable in both the training and
testing sets, we use stratified splitting. This is particularly important for classification
problems where the class distribution is imbalanced. Note that train_test_split does not
stratify by default: stratification is applied only when the target is passed to the stratify
parameter (stratify=y), as shown in the sketch below.
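
A minimal sketch of this step, assuming the dataset has already been loaded into a pandas
DataFrame named df with a target column named Target (both names are placeholders for
illustration):

from sklearn.model_selection import train_test_split

# Separate the features (X) from the target variable (y).
# "Target" is a placeholder column name.
X = df.drop(columns=["Target"])
y = df["Target"]

# 80/20 train/test split. stratify=y preserves the class proportions
# in both sets, and random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)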

2. Run the decision tree algorithm on the training data without pruning:
An unpruned decision tree is a decision tree whose growth is not limited (max_depth=None in
scikit-learn). This allows the tree to grow as large as possible, splitting the data at every
available opportunity based on the features.

Here's a breakdown of the key characteristics:

○ No Depth Restriction: The tree continues to split data points as long as there are
potential improvements in the splitting criterion (e.g., Gini impurity for classification).
○ Potentially Overfitting: Unpruned trees are susceptible to overfitting. They might
capture intricate details from the training data that don't generalize well to unseen
data. The tree becomes too complex, focusing on specific patterns in the training data
that might not be representative of the broader population.
○ High Variance: Unpruned trees can have high variance, meaning small changes in the
training data can lead to significantly different tree structures. This can make them less
reliable.
❖ Parameter settings:

1. Library:

● Choose a library that supports decision trees. This example uses scikit-learn's
DecisionTreeClassifier class in Python.

2. Disabling Pruning:

● To avoid pruning, we leave the max_depth parameter at its default value of None. With
no depth limit, the tree can grow as much as possible, which effectively disables
pre-pruning.
max_depth=None

3. Additional Notes:

● The default splitting criterion used by DecisionTreeClassifier for classification
tasks is Gini impurity (criterion="gini"). You don't need to set this explicitly in most
cases.
● While this approach avoids pruning, it can lead to overfitting. Consider
evaluating the model on unseen data (testing set) and potentially using pruning
techniques for better generalization.
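
A minimal sketch of fitting the unpruned tree under these settings, assuming X_train,
X_test, y_train, and y_test come from the splitting step above:

from sklearn.tree import DecisionTreeClassifier

# max_depth=None (the default) lets the tree grow until every leaf is
# pure or contains too few samples to split, i.e. no pre-pruning.
tree_classifier = DecisionTreeClassifier(max_depth=None, random_state=42)
tree_classifier.fit(X_train, y_train)

# Comparing training and testing accuracy reveals overfitting:
# an unpruned tree typically scores much higher on the training set.
print("Training accuracy:", tree_classifier.score(X_train, y_train))
print("Testing accuracy:", tree_classifier.score(X_test, y_test))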

3. Graphical and textual representation of the learned decision tree

In this section, we discuss the process of visualizing and understanding the learned decision tree
model through both graphical and textual representations.

Exporting the Decision Tree to a DOT File:
● We use the export_graphviz function from scikit-learn's tree module to
export the learned decision tree to a DOT file.
● The function takes various parameters such as the trained decision tree
classifier (tree_classifier), feature names, class names, and formatting options.
● The output is a textual representation of the decision tree in the DOT
language, which is a plain text graph description language.
Visualizing the Decision Tree Using Graphviz:
● After exporting the decision tree to a DOT file, we utilize the Graphviz
library to visualize the tree structure.
● Graphviz allows us to generate graphical representations of graphs
described in the DOT language.
● The graphviz.Source class is used to load the DOT file and create a graph
object.
● We can then save the graphical representation as a PDF file
(graph.render("decision_tree")) and display it in the default viewer
(graph.view()).
Generating Textual Representation:
● Additionally, we employ the export_text function from scikit-learn's tree
module to generate a textual representation of the decision tree.
● This function produces a more human-readable summary of the decision
tree's structure compared to the DOT format.
● The textual representation includes information such as feature names,
thresholds, and decision criteria.
Printing Textual Representation:
● Finally, we print the generated textual representation to the console.
● This allows for a detailed examination of the decision tree's structure and
decision-making process.

By following these steps, we obtain both graphical and textual representations of the
learned decision tree model, facilitating better understanding and interpretation of its
behavior and performance.
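
A minimal sketch of these steps, assuming the fitted classifier from step 2 is stored in
tree_classifier and the training features in X_train (the output file name decision_tree
is arbitrary):

from sklearn.tree import export_graphviz, export_text
import graphviz

# Export the learned tree to the DOT graph description language.
dot_data = export_graphviz(
    tree_classifier,
    out_file=None,  # return the DOT source as a string
    feature_names=list(X_train.columns),
    class_names=[str(c) for c in tree_classifier.classes_],
    filled=True,
    rounded=True,
)

# Render the DOT description with Graphviz, save it as a PDF,
# and open it in the default viewer.
graph = graphviz.Source(dot_data)
graph.render("decision_tree")  # writes decision_tree.pdf
graph.view()

# Generate and print a human-readable textual representation.
text_representation = export_text(tree_classifier, feature_names=list(X_train.columns))
print(text_representation)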

4. The features that are most relevant for the classification task

The most relevant features for the classification task can be identified by examining the top-level
splits or those closer to the root of the learned decision tree model. These features play a crucial role
in distinguishing between different classes. In the textual representation of the decision tree,
features used for the initial splits are indicative of their importance in the classification process.

More quantitatively, the relevance of each feature can be measured by its feature importance,
which indicates how much a given feature contributes to the decisions made by the tree.
In a decision tree, the importance of a feature is calculated as the total reduction in impurity
(Gini impurity or entropy, i.e., information gain) over all nodes where the feature is used to
split the data, with each node weighted by the proportion of samples that reach it. The higher
the importance score, the more relevant the feature is for the classification task.

In scikit-learn, these per-node contributions are summed and normalized automatically and
exposed through the feature_importances_ attribute. The following code shows how to calculate
the feature importance for the decision tree without pruning:
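
This is a minimal sketch, assuming the unpruned classifier from step 2 is stored in
tree_classifier and the feature names come from X_train:

import pandas as pd

# feature_importances_ holds the normalized total impurity reduction
# contributed by each feature across all of the splits that use it.
importances = pd.Series(tree_classifier.feature_importances_, index=X_train.columns)

# Sort so the most relevant features appear first; a feature with an
# importance of 0 was never used in any split.
print(importances.sort_values(ascending=False))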

5. How data are separated for 3 to 5 leaf nodes

To show how data are separated for 3 to 5 leaf nodes, we can look at the decision tree structure and
how it partitions the data based on various features and thresholds. Here's an example of how data
might be separated for a decision tree with 3 to 5 leaf nodes:

Three Leaf Nodes:
● The decision tree divides the data into three distinct regions or classes.
● Each leaf node represents a subset of the data that shares similar characteristics or
features.
● The splits in the tree effectively partition the feature space into three regions, each
corresponding to one of the leaf nodes.

Demonstration:
|--- JoiningYear <= 2017.50
|   |--- PaymentTier <= 2.50
|   |   |--- JoiningYear <= 2016.50
|   |   |   |--- Leaf Node 1
|   |   |--- JoiningYear > 2016.50
|   |   |   |--- Leaf Node 2
|   |--- PaymentTier > 2.50
|   |   |--- Leaf Node 3

● The decision tree divides the data into three distinct regions represented by three
leaf nodes: Leaf Node 1, Leaf Node 2, and Leaf Node 3.
● Leaf Node 1 corresponds to individuals who joined in or before 2016 and have a
payment tier less than or equal to 2.50.
● Leaf Node 2 represents individuals who joined after 2016 (but no later than 2017) and
have a payment tier less than or equal to 2.50.
● Leaf Node 3 captures individuals within the JoiningYear <= 2017.50 branch whose
payment tier is greater than 2.50.

Four Leaf Nodes:
● With four leaf nodes, the decision tree creates a more detailed partitioning of the
data compared to three leaf nodes.
● The splits in the tree result in four distinct regions or classes, allowing for more
nuanced distinctions among the data points.
● Each leaf node captures a subset of the data with similar characteristics based on the
features considered by the tree.

Demonstration:
To extend the tree to four leaf nodes, we might introduce a new split based on another
feature.
|--- JoiningYear <= 2017.50
|   |--- PaymentTier <= 2.50
|   |   |--- JoiningYear <= 2016.50
|   |   |   |--- Leaf Node 1
|   |   |--- JoiningYear > 2016.50
|   |   |   |--- Leaf Node 2
|   |--- PaymentTier > 2.50
|   |   |--- Gender <= 0.50
|   |   |   |--- Leaf Node 3
|   |   |--- Gender > 0.50
|   |   |   |--- Leaf Node 4

This tree introduces a split based on gender for individuals with a payment tier greater than
2.50, resulting in four leaf nodes.

Five Leaf Nodes:
● A decision tree with five leaf nodes further refines the partitioning of the data.
● The additional leaf node allows for even more detailed distinctions among the data
points.
● Each leaf node represents a subset of the data with specific characteristics, and the
decision tree's splits aim to maximize the homogeneity within each leaf node while
maximizing the heterogeneity between nodes.

Demonstration:
To further refine the partitioning of the data into five leaf nodes, we might introduce
additional splits or fine-tune existing ones, such as:

|--- JoiningYear <= 2017.50
|   |--- PaymentTier <= 2.50
|   |   |--- JoiningYear <= 2016.50
|   |   |   |--- Leaf Node 1
|   |   |--- JoiningYear > 2016.50
|   |   |   |--- Leaf Node 2
|   |--- PaymentTier > 2.50
|   |   |--- Gender <= 0.50
|   |   |   |--- Leaf Node 3
|   |   |--- Gender > 0.50
|   |   |   |--- Age <= 35.50
|   |   |   |   |--- Leaf Node 4
|   |   |   |--- Age > 35.50
|   |   |   |   |--- Leaf Node 5

Here, we introduced a split based on age for individuals with a payment tier greater than
2.50, leading to five leaf nodes.
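
A minimal sketch of how such trees could be produced, assuming the training data from
step 1; the max_leaf_nodes parameter caps the number of leaves, and the exact splits depend
on the data, so the printed trees may differ from the illustrative ones above:

from sklearn.tree import DecisionTreeClassifier, export_text

# Fit trees capped at 3, 4, and 5 leaf nodes and print how each one
# partitions the training data.
for n_leaves in (3, 4, 5):
    small_tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=42)
    small_tree.fit(X_train, y_train)
    print(f"--- Tree with {n_leaves} leaf nodes ---")
    print(export_text(small_tree, feature_names=list(X_train.columns)))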

6. Running the decision tree algorithm on the training data with pre-pruning
The decision tree algorithm is run on the training data with pre-pruning, where the parameter
setting for pre-pruning is max_depth=3. This means that the maximum depth of the decision tree is
limited to 3 levels.

Explanation of the thresholds used for pre-pruning:

● max_depth: The max_depth parameter is set to 3, which specifies the maximum depth of
the decision tree. This limits the number of splits that can be made during the tree-building
process. By restricting the depth of the tree, we prevent it from becoming too complex and
overfitting the training data. In this case, a maximum depth of 3 was chosen, but the
optimal value for max_depth depends on the data and the desired level of accuracy.
● Gini impurity criterion: The decision tree algorithm uses the Gini impurity criterion to
measure the impurity of a node; the higher the Gini impurity, the more mixed the classes
in that node are. At each node, the algorithm chooses the split that yields the largest
reduction in Gini impurity, weighted across the resulting child nodes. Combined with the
maximum depth, this limits the number of splits and thus controls the complexity of the
decision tree, which helps prevent overfitting.

In summary, the decision tree algorithm with pre-pruning is configured to limit the maximum
depth of the tree to 3 levels, using the Gini impurity criterion to guide the splitting process.
This balances model complexity against accuracy and typically generalizes better to unseen
data than the unpruned tree.
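
A minimal sketch of this configuration, assuming the same training and testing sets as in
the earlier steps:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: limit the tree to 3 levels. criterion="gini" is the
# default and is written out here only for clarity.
pruned_tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=42)
pruned_tree.fit(X_train, y_train)

# Comparing these scores with the unpruned tree shows the effect of
# pre-pruning on overfitting.
print("Training accuracy:", pruned_tree.score(X_train, y_train))
print("Testing accuracy:", pruned_tree.score(X_test, y_test))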
