Decision Tree
Decision Tree: an algorithm that is helpful for both classification and prediction
Technique
Classifies the records in a pictorial (tree) format. The target variable is
categorical; the input variables can be categorical or continuous.
In the diagram, a circle is a node and a box is a leaf.
Impurity of the data (the ranges below are for a two-class target)
Gini: 0 to 0.5
close to 0 -> homogeneity
Entropy: 0 to 1
0 -> homogeneity and less impurity
Misclassification error:
it ranges from 0 to 0.5
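A minimal sketch of the three impurity measures, to check the ranges above on a two-class example. The function names and the toy label lists are assumptions made for illustration; they do not come from any library.

```python
import math

def class_proportions(labels):
    """Return the share of each class among the labels."""
    total = len(labels)
    return [labels.count(c) / total for c in set(labels)]

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    # Entropy in bits: sum of -p * log2(p) over the classes.
    return sum(-p * math.log2(p) for p in class_proportions(labels) if p > 0)

def misclassification_error(labels):
    # Share of records that are not in the majority class.
    return 1.0 - max(class_proportions(labels))

pure = ["yes"] * 10                 # homogeneous node
mixed = ["yes"] * 5 + ["no"] * 5    # 50/50 node, the most impure two-class case

for name, node in [("pure", pure), ("mixed", mixed)]:
    print(name, gini(node), entropy(node), misclassification_error(node))
```

The homogeneous node scores 0 on all three measures, while the 50/50 node reaches the two-class maximum of each (0.5 for Gini, 1 for entropy, 0.5 for misclassification error).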
Classification:
Overfitting: the decision tree grows larger as more and more independent variables
are used, and the records become difficult to classify -> pre-pruning: limiting the
tree before it is generated.
The Random Forest algorithm is an ensemble of decision trees
Identify the important IDVs (independent variables)
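One possible way to identify the important IDVs is the `feature_importances_` attribute of a fitted scikit-learn `RandomForestClassifier`. The sketch below uses synthetic data (an assumption, not the data set from these notes) where only the first column drives the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the first column decides the target, the second is pure noise.
rng = np.random.default_rng(0)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

# A random forest is an ensemble of decision trees (50 of them here).
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# One importance score per input column; the scores sum to 1.
importances = forest.feature_importances_
print(importances)
```

The informative column should receive a clearly larger score than the noise column.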
Pruning (cutting): minimizing the number of attributes used by the tree (controls
the growth of the tree)
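A rough sketch of controlling tree growth in scikit-learn, assuming that pre-pruning is done through the `max_depth` and `min_samples_leaf` parameters of `DecisionTreeClassifier`. The data is synthetic and noisy on purpose, so that an unrestricted tree overfits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends on the first column plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# No limits: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: the limits are set before the tree is generated.
pruned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X, y)

print("full tree:  ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("pruned tree:", pruned_tree.get_depth(), "levels,", pruned_tree.get_n_leaves(), "leaves")
```

The limited tree has at most 3 levels and 8 leaves, while the unrestricted tree grows much larger on the same data.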
horizontal
If the IDV is continuous, we can take the average and use it to fill in the missing values.
If the IDV is categorical, we take the mode and use it to replace the missing values.
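The two replacement rules can be sketched with pandas `fillna`. The column names `age` and `embarked` and the four rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value in each column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 38.0],     # continuous IDV
    "embarked": ["S", "C", None, "S"],     # categorical IDV
})

# Continuous: replace missing values with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical: replace missing values with the mode (most frequent value).
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

print(df)
```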
Advantages
1. Generates rules
2. Performs classification
3. IDVs can be both continuous and categorical
4. Through visualization we can clearly see how the data is classified.
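As an illustration of the first advantage, scikit-learn's `export_text` prints the rules of a fitted tree. The six rows and the feature names `sex_code` and `pclass` below are assumptions chosen so that a single rule separates the classes.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy rows of [sex_code, pclass]; the target follows sex_code exactly.
X = [[0, 1], [0, 2], [0, 3], [1, 1], [1, 2], [1, 3]]
y = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text returns the tree as indented if/else style rules.
rules = export_text(model, feature_names=["sex_code", "pclass"])
print(rules)
```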
Weaknesses
1. The predicted variable is always categorical
2. It is not helpful for a continuous target variable.
3. Its performance drops when there are many categories.
4. It is not suited to small amounts of data
5.
Underfitting
Missing Values
Costs of Classification:
Python coding
1. Import the pandas library (import pandas as pd) -> to load the data into a data frame
2. Import the NumPy package (import numpy as np) -> to store the data in array and matrix form
3. from sklearn import tree -> for the machine learning algorithm
4. from sklearn import preprocessing -> used to convert text into numerical values and
to handle missing values
5. Load the training data set
6. If the variable with missing values is continuous, replace them with the mean; if it is
categorical, replace them with the mode
7. np.where(condition, value_if_true, value_if_false)
8. Text into numerical: LabelEncoder (it will fit for all variables),
LabelBinarizer (2 values)
9. label_encoder = preprocessing.LabelEncoder()
10. encoded_sex = label_encoder.fit_transform(titanic_train['sex'])
11. fit_transform -> converts the text into numerical values.
12. Initialize the decision tree model
tree_model = tree.DecisionTreeClassifier() -> the output is categorical
tree.DecisionTreeRegressor() -> the output is continuous
13. tree_model.fit(X = pd.DataFrame(encoded_sex), y = titanic_train["Survived"])
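The steps above can be sketched end to end as follows. The six rows are made-up stand-ins, not the real Titanic training file, and the `Age` column is an assumption added only to demonstrate `np.where`. With real data, step 5 would load the file instead (for example with `pd.read_csv`).

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing, tree

# Step 5 (stand-in): made-up rows instead of the real training file.
titanic_train = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Steps 6-7: where Age is missing use the mean, otherwise keep the value.
mean_age = titanic_train["Age"].mean()
titanic_train["Age"] = np.where(
    titanic_train["Age"].isna(), mean_age, titanic_train["Age"]
)

# Steps 8-11: convert the text column into numbers (female -> 0, male -> 1).
label_encoder = preprocessing.LabelEncoder()
encoded_sex = label_encoder.fit_transform(titanic_train["sex"])

# Steps 12-13: initialize the classifier and fit it on one predictor.
tree_model = tree.DecisionTreeClassifier(random_state=0)
tree_model.fit(X=pd.DataFrame(encoded_sex), y=titanic_train["Survived"])

predictions = tree_model.predict(pd.DataFrame(encoded_sex))
print(predictions)
```

`random_state=0` is set only to make the sketch repeatable.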
Predictors:
Gender, Passenger Class
With more than one independent variable, we need to create a data frame that holds all of them
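A small sketch of fitting on two predictors by collecting them in one data frame. The eight rows are invented for illustration, and the column name `Pclass` for passenger class is an assumption.

```python
import pandas as pd
from sklearn import preprocessing, tree

# Invented rows: gender, passenger class, and the target.
titanic_train = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 2],
    "Survived": [0, 1, 0, 0, 0, 1, 0, 0],
})

# Encode gender as numbers (female -> 0, male -> 1).
encoded_sex = preprocessing.LabelEncoder().fit_transform(titanic_train["sex"])

# More than one independent variable: put them together in a data frame.
predictors = pd.DataFrame({"sex": encoded_sex, "Pclass": titanic_train["Pclass"]})

tree_model = tree.DecisionTreeClassifier(random_state=0)
tree_model.fit(X=predictors, y=titanic_train["Survived"])

# Predict for a new record: female (0) in first class.
new_passenger = pd.DataFrame({"sex": [0], "Pclass": [1]})
prediction = tree_model.predict(new_passenger)
print(prediction)
```

The new record must use the same column names, in the same order, as the data frame used for fitting.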
Random forest
accuracy_score ?
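On the open question above: scikit-learn provides `accuracy_score` in `sklearn.metrics`, which returns the share of predictions that match the true labels. A minimal sketch with a random forest on synthetic data (the data and the 75/25 split are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the first two of three columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows to measure accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# accuracy_score(true labels, predicted labels) -> fraction that are correct.
accuracy = accuracy_score(y_test, forest.predict(X_test))
print(accuracy)
```

Measuring accuracy on held-out rows rather than on the training rows gives a fairer picture, since a fully grown tree can memorize its training data.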