Machine Learning
Introduction
Decision Trees are a supervised learning technique used for both classification and regression
problems. They work by recursively splitting the dataset into subsets based on the values of
input features. The resulting structure resembles a tree: internal nodes represent a test on a
feature, branches represent the outcomes of that test, and leaves represent the final prediction.
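To make the node, branch, and leaf structure concrete, the sketch below fits a very small tree and prints its rules as text. The toy data, the feature names `amount` and `hour`, and the labels are invented for illustration; only `DecisionTreeClassifier` and `export_text` come from Scikit-learn.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [amount, hour] -> 0 = normal, 1 = fraud
X = [[20, 10], [35, 14], [50, 9], [900, 3], [1200, 2], [800, 4]]
y = [0, 0, 0, 1, 1, 1]

# Fit a tree; a single split is enough to separate these two classes
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the fitted tree as indented if/else rules
rules = export_text(tree, feature_names=["amount", "hour"])
print(rules)

# Classify two unseen (also hypothetical) transactions
print(tree.predict([[30, 12], [1000, 3]]))
```

In the printed rules, each line containing a comparison is a test at an internal node, the two sides of the comparison are the branches, and each line containing `class:` is a leaf.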
Study Objective
• To understand the concept and working of Decision Trees.
• To build and evaluate a Decision Tree model using Python and Scikit-learn.
Scope:
• Implementation on publicly available datasets.
Real-World Application
Decision Trees are used in:
Future Trends:
• Use of Ensemble Methods like Random Forests and Gradient Boosting to improve
accuracy.
• Integration with Explainable AI (XAI) for better interpretability.
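As a rough illustration of the ensemble idea, the sketch below compares a single tree with a Random Forest and a Gradient Boosting model using 5-fold cross-validation. The synthetic dataset and all parameter values are assumptions chosen to keep the example small; it does not use the fraud dataset, and the ranking of the models may differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (not the credit card dataset)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

Both ensembles are built from many decision trees: a Random Forest averages trees trained on resampled data, while Gradient Boosting adds trees one after another to correct earlier errors.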
Conclusion:
Decision Trees are a powerful and easy-to-understand model in Machine Learning. Despite
their limitations, they serve as a fundamental building block for advanced ensemble methods.
With proper tuning and modern techniques, Decision Trees continue to have a significant role
in real-world applications.
References
1. Scikit-learn documentation: https://scikit-learn.org/
2. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien
Géron.
# This program builds a Decision Tree classifier to detect credit card fraud
# using the Kaggle Credit Card Fraud Detection dataset.
# Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
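import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
warnings.filterwarnings('ignore')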
np.random.seed(42)
try:
    df = pd.read_csv('creditcard.csv')
except FileNotFoundError:
    print("Dataset file not found. Please download the dataset from Kaggle "
          "and place it in the current directory.")
    raise SystemExit(1)
print(df.head())
print("\nSummary statistics:")
print(df.describe())
print(df.isnull().sum())
class_counts = df['Class'].value_counts()
print(class_counts)
plt.figure(figsize=(10, 6))
sns.countplot(x='Class', data=df)
plt.savefig('class_distribution.png')
plt.close()
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
dt_classifier = DecisionTreeClassifier(
    random_state=42
)
dt_classifier.fit(X_train, y_train)
print("Decision Tree training completed.")
# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)
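accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")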
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
plt.close()
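# Rank features by the importance the fitted tree assigns to them
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_classifier.feature_importances_
}).sort_values('Importance', ascending=False)
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Feature Importance')
plt.tight_layout()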
plt.savefig('feature_importance.png')
plt.close()
# Visualize the Decision Tree (limited to max_depth=3 for clarity)
plt.figure(figsize=(20, 10))
plot_tree(
    dt_classifier,
    max_depth=3,  # Limit visualization depth for clarity
    feature_names=list(X.columns),
    class_names=['Normal', 'Fraud'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Decision Tree Visualization (Limited to Depth 3)')
plt.savefig('decision_tree.png')
plt.close()
print("""
The decision tree diagram shows how the model decides whether a transaction is
fraudulent or normal.
The important features identified by the model (shown in the feature importance plot)
are the key indicators the algorithm uses to detect fraudulent transactions.
Time and transaction amount, along with certain V-features (which are PCA-transformed
features for confidentiality), play significant roles in fraud detection.
""")
OUTPUT: