Decision Tree

Decision trees are a supervised learning technique that can be used for both classification and prediction problems. They work by recursively splitting a dataset into purer subsets based on the values of predictor variables, using a measure of impurity such as the Gini index or information gain. Random forests are an ensemble method that builds many decision trees and aggregates their results, which reduces overfitting and improves accuracy compared to a single decision tree. Key steps in building a decision tree include handling missing data, converting categorical variables into numeric format, fitting the tree to training data, and evaluating it on test data.

Correlation: finds the strength of the relationship between 2 variables only.

Regression: finds the causal-effect relationship of the independent variable(s) (IDV) on the dependent variable (DV).

Linear Regression: a prediction technique.


Simple Linear Regression: 1 DV and 1 IDV; both DV and IDV are continuous.
Multiple Linear Regression: 1 DV and more than 1 IDV.

Logistic Regression: the DV is binary categorical; the IDVs can be categorical and continuous.


-> Classification technique.
Single predictor model: one IDV.
Multiple predictor model: more than one IDV.
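
To make the correlation vs. regression distinction concrete, here is a minimal sketch with made-up data (the variable names and values are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5])                 # IDV
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])       # DV

r = np.corrcoef(x, y)[0, 1]                   # correlation: strength of the relationship only
model = LinearRegression().fit(x.reshape(-1, 1), y)   # simple linear regression: effect of the IDV on the DV
print(r, model.coef_[0], model.intercept_)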

Decision Tree and Random Forest:


Classification Technique | Dependent Variable | Independent Variable        | Purpose of Algorithm
Decision Tree            | Categorical        | Categorical and continuous  | A classification technique used to classify the records in a pictorial (tree) format with the help of the Gini index.
Random Forest            | Categorical        | Categorical and continuous  | An ensemble of decision trees used to find the important variables for the decision tree.

Decision Tree: an algorithm useful for both classification and prediction.
It classifies the records in a pictorial (tree) format; the target variable is categorical and the input variables can be categorical and continuous.
A circle represents a node and a box represents a leaf.

The training data set is the past (historical) data.


Based on the Gini index we decide where to start the classification (which attribute to split on first).

Nominal -> distinct categories with no order; allows a multi-way split (more than 2 branches).

Ordinal -> categories that rank the objects. Ex: size (small, medium, large).

Continuous -> numeric variables defined on a range.

Impurity of the data:
Gini index ranges from 0 to 0.5; close to 0 -> homogeneity.
Entropy ranges from 0 to 1; 0 -> homogeneity and less impurity.
Misclassification error ranges from 0 to 0.5.
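
As a rough sketch (assuming a binary, 0/1-coded target, which is what gives the 0.5 and 1 upper bounds above), the three impurity measures can be computed like this:

import numpy as np

def impurity(labels):
    p = np.bincount(labels) / len(labels)   # class proportions at a node
    p = p[p > 0]
    gini = 1 - np.sum(p ** 2)               # 0 (pure) to 0.5 for two classes
    entropy = -np.sum(p * np.log2(p))       # 0 (pure) to 1 for two classes
    misclass = 1 - np.max(p)                # 0 (pure) to 0.5 for two classes
    return gini, entropy, misclass

print(impurity(np.array([0, 0, 1, 1])))     # maximum impurity: (0.5, 1.0, 0.5)
print(impurity(np.array([0, 0, 0, 0])))     # homogeneous node: all three measures are 0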

Binary split -> 2 branches -> ID3 (Iterative Dichotomiser)

Categorical with more than 2 values (Low, Medium, High) -> CART
Continuous (Income, Age) -> CART

Regression tree -> continuous target variable

CART -> Classification And Regression Tree
ID3 -> binary tree

Classification:
Overfitting: as the decision tree grows with more and more independent variables, it becomes difficult to classify new records -> pre-pruning: stop before the tree is fully generated.
The random forest algorithm is an ensemble of decision trees; it identifies the important IDVs.
Pruning (cutting): minimizing the number of attributes in the tree (controls the growing of the tree horizontally).

Pre-pruning (forward pruning): applied before the decision tree is generated -> we can use parameters such as max depth to stop the tree early.

Random forest: it is used to identify the important IDVs.

Post-pruning (backward pruning): applied once the tree is generated.


1. Subtree replacement -> an entire subtree is summarized and replaced by a leaf.
2. Subtree raising -> a subtree is raised and connected to the main node.
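
In scikit-learn terms, a minimal sketch of both ideas (the parameter values are illustrative, not from the notes; sklearn does not expose subtree replacement/raising directly, and its post-pruning knob is cost-complexity pruning):

from sklearn import tree

# pre-pruning: stop the tree before it fully grows
pre_pruned = tree.DecisionTreeClassifier(max_depth=8, min_samples_leaf=5)

# post-pruning: grow the tree fully, then cut it back via cost-complexity pruning
post_pruned = tree.DecisionTreeClassifier(ccp_alpha=0.01)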

Handling missing attributes:

If more than 50% of the values are missing -> we can delete the entire column.

If the IDV is continuous -> we can take the average (mean) and fill in the missing values.

If the IDV is categorical -> we take the mode and replace the missing values.
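
A minimal pandas sketch of these three rules (the file and column names assume the standard Titanic data set used later in these notes):

import pandas as pd

titanic_train = pd.read_csv("titanic_train.csv")   # assumed file name

# more than 50% missing -> delete the entire column
titanic_train = titanic_train.dropna(axis=1, thresh=len(titanic_train) // 2)

# continuous IDV -> fill with the mean
titanic_train["Age"] = titanic_train["Age"].fillna(titanic_train["Age"].mean())

# categorical IDV -> fill with the mode
titanic_train["Embarked"] = titanic_train["Embarked"].fillna(titanic_train["Embarked"].mode()[0])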

Advantages
1. Generates rules.
2. Performs classification.
3. The IDVs can be both continuous and categorical.
4. By visualization we can clearly classify the data.

Weaknesses
1. The predicted (target) variable is always categorical.
2. It is not helpful when the target variable is continuous (a regression tree is needed instead).
3. Its performance is poor when there are many categories.
4. It is not a good fit for a small amount of data.

Other topics: underfitting, missing values, costs of classification.

Python coding
1. Import the pandas library -> to load the data frame.
2. Import the NumPy package -> for storing the data in array and matrix form.
3. from sklearn import tree -> for the machine learning algorithm.
4. from sklearn import preprocessing -> used to convert text into numerical values and handle missing values.
5. Load the training data set.
6. If the missing value is continuous, replace it with the mean; if it is categorical, replace it with the mode.
7. np.where(condition, value, column) -> use value where the condition holds, otherwise keep the column's existing value.
8. Text into numerical: LabelEncoder (fits any number of categories), LabelBinarizer (2 values).
9. label_encoder = preprocessing.LabelEncoder()
10. encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])
11. fit_transform -> converts the text into numerical values.
12. Initialize the decision tree model:
tree_model = tree.DecisionTreeClassifier() -> the output is categorical
tree.DecisionTreeRegressor() -> the output is continuous
13. tree_model.fit(X = pd.DataFrame(encoded_sex), y = titanic_train["Survived"])

14. The splits are chosen based on the Gini index value.


15. To visualize the tree there is a Graphviz interface:
with open("Dtree1.dot", 'w') as f:
    tree.export_graphviz(tree_model, feature_names=["Sex"], out_file=f)
Paste the generated .dot file into webgraphviz.com to render the tree.

In the rendered tree, lesser values go to the left side and greater values go to the right side.

16. predictors = pd.DataFrame([encoded_sex, titanic_train["Pclass"]]).T

17. tree_model.fit(X=predictors, y=titanic_train["Survived"])

with open("Dtree2.dot", 'w') as f:
    tree.export_graphviz(tree_model, feature_names=["Sex", "Pclass"], out_file=f)

For more than one IDV we use .T (transpose) so that each predictor becomes a column.

Predictors: gender, passenger class.
For more than one independent variable we need to create a data frame; a consolidated sketch of steps 5-17 follows below.
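
Putting steps 5-17 together (the file and column names assume the standard Titanic data set):

import pandas as pd
from sklearn import tree, preprocessing

titanic_train = pd.read_csv("titanic_train.csv")                  # step 5: load the training data

titanic_train["Age"] = titanic_train["Age"].fillna(titanic_train["Age"].mean())   # step 6

label_encoder = preprocessing.LabelEncoder()                      # steps 8-9
encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])   # step 10: text -> numerical

predictors = pd.DataFrame([encoded_sex, titanic_train["Pclass"]]).T   # step 16: .T makes each IDV a column

tree_model = tree.DecisionTreeClassifier()                        # step 12
tree_model.fit(X=predictors, y=titanic_train["Survived"])         # step 17

with open("Dtree2.dot", "w") as f:                                # step 15: export for webgraphviz.com
    tree.export_graphviz(tree_model, feature_names=["Sex", "Pclass"], out_file=f)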

Max depth of 8: we need to control the tree.

There are 4 independent variables and 1 output variable, which is categorical (yes or no); therefore 4 * 2 = 8, so we can go for a depth of 8.
The tree will grow up to 8 levels.
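
Using scikit-learn, that cap can be applied with the max_depth parameter (reusing the predictors frame from the sketch above):

tree_model = tree.DecisionTreeClassifier(max_depth=8)       # control the tree: at most 8 levels
tree_model.fit(X=predictors, y=titanic_train["Survived"])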

Titanic_train dataset link: https://drive.google.com/file/d/1qxJEtjt_pHzb52-h_HXszQcNjt3zPEsF/view?usp=sharing

RandomForestClassifier(n_estimators = 1000, max_features =2, oob_score = True)


Based on the decision trees we can check the accuracy.

Random forest:

Survived is the dependent variable.
IDVs: gender, fare, age.

The model accuracy is higher than a single decision tree's.

Accuracy score? (see the sketch below)
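
A minimal sketch of checking that accuracy (here predictors is assumed to hold the encoded gender, fare, and age columns; the constructor call matches the one given above, and oob_score=True is what makes the out-of-bag estimate available):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_model = RandomForestClassifier(n_estimators=1000, max_features=2, oob_score=True)
rf_model.fit(X=predictors, y=titanic_train["Survived"])

print(rf_model.oob_score_)              # out-of-bag accuracy estimate
print(rf_model.feature_importances_)    # importance of each IDV
print(accuracy_score(titanic_train["Survived"], rf_model.predict(predictors)))   # training accuracy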
