Model Engineering
1. Data splitting:
a. Import Necessary Module: First, we import the train_test_split function from the
sklearn.model_selection module.
b. Separate Features and Target Variable: Before splitting the data, we need to separate
the features (X) from the target variable (y). This ensures that the predictors and the
variable we are trying to predict are assigned correctly.
c. Split the Data:
We use the train_test_split function to split the data into training and testing sets:
● X_train and y_train represent the features and target variables of the training set,
respectively.
● X_test and y_test represent the features and target variables of the testing set,
respectively.
● test_size=0.2 specifies that 20% of the data will be used for testing, and the
remaining 80% will be used for training.
● random_state=42 ensures reproducibility. It sets the random seed so that the data
split is the same every time the code is run.
The split used here is a stratified split: the proportions of the classes in the target variable
are preserved in both the training and the testing set. This is particularly important for
classification problems where the class distribution is imbalanced. Note that train_test_split
does not stratify automatically; stratification is enabled by passing the target variable to the
stratify parameter (stratify=y), as in the sketch below.
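A minimal sketch of this split, assuming the dataset has been loaded into a pandas DataFrame named df and that the target column is called LeaveOrNot (a hypothetical name; adjust it to the actual dataset):

import pandas as pd
from sklearn.model_selection import train_test_split

# The DataFrame df is assumed to be loaded already, e.g. with
# df = pd.read_csv("Employee.csv")   (hypothetical file name)
X = df.drop(columns=["LeaveOrNot"])  # features (target column name is an assumption)
y = df["LeaveOrNot"]                 # target variable

# 80/20 split; stratify=y preserves the class proportions in both sets,
# and random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)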
When the decision tree is grown without pruning, it has the following characteristics:
○ No Depth Restriction: The tree continues to split data points as long as there are
potential improvements in the splitting criterion (e.g., Gini impurity for classification).
○ Prone to Overfitting: Unpruned trees are susceptible to overfitting. They might
capture intricate details of the training data that don't generalize well to unseen
data. The tree becomes too complex, focusing on specific patterns in the training data
that might not be representative of the broader population.
○ High Variance: Unpruned trees can have high variance, meaning small changes in the
training data can lead to significantly different tree structures. This can make them less
reliable.
❖ Parameter settings:
1. Library:
● Choose a library that supports decision trees. This example uses scikit-learn's
DecisionTreeClassifier class in Python.
2. Disabling Pruning:
● To avoid pruning, we set the max_depth parameter to None, which places no limit
on the depth of the tree. This allows the tree to grow as much as possible,
essentially disabling depth-based pre-pruning:
max_depth=None
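As a minimal sketch (assuming X_train, y_train, X_test, and y_test come from the split in step 1), the unpruned classifier can be created and fitted as follows:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# max_depth=None (the default) places no limit on tree depth, so nodes keep
# splitting until the leaves are pure or contain too few samples to split.
unpruned_tree = DecisionTreeClassifier(max_depth=None, random_state=42)
unpruned_tree.fit(X_train, y_train)

# A large gap between training and testing accuracy is the typical symptom
# of the overfitting discussed above.
print("Train accuracy:", accuracy_score(y_train, unpruned_tree.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, unpruned_tree.predict(X_test)))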
3. Additional Notes:
In this section, we visualize the learned decision tree model through both graphical and
textual representations. Having both views makes it easier to understand and interpret the
model's structure, behavior, and performance.
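One way to produce both views with scikit-learn is sketched below, assuming the fitted model is named unpruned_tree and X holds the feature columns from step 1 (the class labels are hypothetical):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

# Graphical representation of the learned tree.
plt.figure(figsize=(20, 10))
plot_tree(
    unpruned_tree,
    feature_names=list(X.columns),
    class_names=["Stay", "Leave"],  # hypothetical class labels
    filled=True,
)
plt.show()

# Textual representation (the "|---" rules used later in this section).
print(export_text(unpruned_tree, feature_names=list(X.columns)))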
4. The features that are most relevant for the classification task
The most relevant features for the classification task can be identified by examining the top-level
splits or those closer to the root of the learned decision tree model. These features play a crucial role
in distinguishing between different classes. In the textual representation of the decision tree,
features used for the initial splits are indicative of their importance in the classification process.
The most relevant features can also be determined quantitatively by computing feature importances.
Feature importance measures how much a given feature contributes to the decisions made by the
decision tree.
In a decision tree, the importance of a feature is calculated as the sum of the (sample-weighted)
reduction in impurity (Gini impurity or entropy, depending on the splitting criterion) over all nodes
where the feature is used to split the data; scikit-learn then normalizes these scores so that they sum
to 1. The higher the importance score, the more relevant the feature is for the classification task.
The sketch below shows one way to obtain the feature importances for the decision tree
without pruning.
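A minimal sketch, assuming the unpruned model is named unpruned_tree and X holds the feature columns:

import pandas as pd

# scikit-learn exposes impurity-based importances computed during fitting:
# for each feature, the sample-weighted impurity decrease is summed over
# all nodes that split on it and normalised so the scores add up to 1.
importances = pd.Series(
    unpruned_tree.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)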
To show how the data are separated for 3 to 5 leaf nodes, we can look at the decision tree structure
and how it partitions the data based on various features and thresholds. The following examples
illustrate how the data might be separated by trees with three, four, and five leaf nodes.
Three Leaf Nodes:
Demonstration:
|--- JoiningYear <= 2017.50
|    |--- PaymentTier <= 2.50
|    |    |--- JoiningYear <= 2016.50
|    |    |    |--- Leaf Node 1
|    |    |--- JoiningYear > 2016.50
|    |    |    |--- Leaf Node 2
|    |--- PaymentTier > 2.50
|    |    |--- Leaf Node 3
● The decision tree divides the data into three distinct regions represented by three
leaf nodes: Leaf Node 1, Leaf Node 2, and Leaf Node 3.
● Leaf Node 1 corresponds to individuals who joined before or in 2016, have a
payment tier less than or equal to 2.50, and any other conditions that led to this
subset.
● Leaf Node 2 represents individuals who joined after 2016, have a payment tier less
than or equal to 2.50, and any other conditions specific to this group.
● Leaf Node 3 captures individuals with a payment tier greater than 2.50, along with
any other conditions that define this subset.
Four Leaf Nodes:
● With four leaf nodes, the decision tree creates a more detailed partitioning of the
data compared to three leaf nodes.
● The splits in the tree result in four distinct regions or classes, allowing for more
nuanced distinctions among the data points.
● Each leaf node captures a subset of the data with similar characteristics based on the
features considered by the tree.
Demonstration:
To extend the tree to four leaf nodes, we might introduce a new split based on another
feature.
|--- JoiningYear <= 2017.50
|    |--- PaymentTier <= 2.50
|    |    |--- JoiningYear <= 2016.50
|    |    |    |--- Leaf Node 1
|    |    |--- JoiningYear > 2016.50
|    |    |    |--- Leaf Node 2
|    |--- PaymentTier > 2.50
|    |    |--- Gender <= 0.50
|    |    |    |--- Leaf Node 3
|    |    |--- Gender > 0.50
|    |    |    |--- Leaf Node 4
This tree introduces a split based on gender for individuals with a payment tier greater than
2.50, resulting in four leaf nodes.
Five Leaf Nodes:
To further refine the partitioning of the data into five leaf nodes, we might introduce an
additional split or fine-tune an existing one. Here, a split based on age is introduced for
individuals with a payment tier greater than 2.50, leading to five leaf nodes.
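As an aside, one convenient way to reproduce trees of exactly this size with scikit-learn is the max_leaf_nodes parameter, which grows the tree best-first until the requested number of leaves is reached. A minimal sketch, assuming X_train, y_train, and X come from the earlier steps:

from sklearn.tree import DecisionTreeClassifier, export_text

# Fit trees constrained to 3, 4, and 5 leaf nodes and print their rules.
for n_leaves in (3, 4, 5):
    small_tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=42)
    small_tree.fit(X_train, y_train)
    print(f"--- {n_leaves} leaf nodes ---")
    print(export_text(small_tree, feature_names=list(X.columns)))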
❖ Parameter settings (decision tree with pre-pruning):
● max_depth: The max_depth parameter is set to 3, which specifies the maximum depth of
the decision tree. This limits the number of splits that can be made during the tree-building
process. By restricting the depth of the tree, we prevent it from becoming too complex and
overfitting the training data. In this case, a maximum depth of 3 was chosen, but the
optimal value for max_depth depends on the data and the desired level of accuracy.
● Gini impurity criterion: The decision tree algorithm uses the Gini impurity criterion to
measure how mixed the classes are at a node; a node whose samples all belong to a single
class has an impurity of 0. At each node, the algorithm chooses the split that yields the
largest (sample-weighted) reduction in Gini impurity in the resulting child nodes. By
combining this criterion with pre-pruning through a maximum depth, we limit the number of
splits and thus control the complexity of the decision tree, which helps prevent overfitting.
In summary, the decision tree algorithm with pre-pruning is configured to limit the maximum
depth of the tree to 3 levels, using the Gini impurity criterion to guide the splitting process. This
approach balances model complexity and accuracy, which helps the model generalize better to
unseen data. A sketch of this configuration follows.
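A minimal sketch of this configuration, assuming the training and testing sets come from the split in step 1:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Pre-pruned tree: depth capped at 3, splits chosen by the Gini impurity criterion.
pruned_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)

# Evaluate on the held-out test set to check how well the tree generalises.
print("Pre-pruned test accuracy:", accuracy_score(y_test, pruned_tree.predict(X_test)))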