MACHINE LEARNING
PROJECT ON DECISION TREES
Project Name :
Name : Suraj Gopal Aher
Roll No : 23/IT/164
Abstract : Decision Trees are among the most intuitive and widely used algorithms in
Machine Learning for both classification and regression tasks. This project explains how
Decision Trees work, where they are applied, and how they help in making accurate and
interpretable decisions based on data. The study builds a simple Decision Tree classifier
and analyzes its performance on the Kaggle Credit Card Fraud Detection dataset.
Introduction
Decision Trees are a supervised learning technique used for both classification and regression
problems. They work by recursively splitting the dataset into subsets based on the values of
input features. The resulting structure resembles a tree: internal nodes represent a test on a
feature, branches represent the outcomes of that test, and leaves represent the final prediction.
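The splitting step described above can be sketched in a few lines. This is a minimal illustration, not the scikit-learn implementation: it assumes Gini impurity as the split criterion (scikit-learn's default for classification) and a single numeric feature. The helper names gini and best_split and the toy data are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0.0 means pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, weighted_impurity) of the best split on feature x.

    Returns (None, gini(y)) if no split improves on the unsplit node.
    """
    best_t, best_imp = None, gini(y)
    values = np.unique(x)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

# Toy data (made up for illustration): small values are class 0, large are class 1.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
print(f"Best threshold: {threshold}, weighted impurity after split: {impurity}")
```

A full tree repeats this search over every feature at every node and recurses into the two resulting subsets until a stopping rule (such as a maximum depth) is reached.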
Study Objective
• To understand the concept and working of Decision Trees.
• To build and evaluate a Decision Tree model using Python and Scikit-learn.
• To explore real-world applications and future trends.
Scope & Limitations
Scope:
• Implementation on publicly available datasets.
• Use of Scikit-learn for building and visualizing Decision Trees.
Limitations:
• Limited to small datasets.
• May not generalize well on complex, large-scale problems due to overfitting.
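The overfitting limitation can be observed by comparing training and test accuracy as the tree is allowed to grow deeper. The sketch below uses a synthetic dataset from make_classification rather than the project's dataset; the sample sizes, noise level, and depth values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration; not the project's dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting. Parameters such as max_depth and min_samples_leaf limit how closely the tree can fit noise in the training data.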
Real-World Application
Decision Trees are used in:
• Medical Diagnosis: To predict diseases based on symptoms.
• Credit Risk Analysis: To evaluate loan applicants.
• Customer Segmentation: To classify customers based on purchasing behavior.
• Fraud Detection: To detect unusual transactions.
• Recommendation Systems: For product recommendations.
Challenges and Future Trends
Challenges:
• Prone to overfitting on training data.
• Can become unstable with small variations in data.
• Axis-aligned splits approximate smooth continuous relationships poorly, and a single tree can struggle with high-dimensional data.
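The instability mentioned above can be illustrated by fitting two trees on slightly different samples of the same data and measuring how often their predictions differ. This is a rough sketch on synthetic data; the bootstrap resampling and all sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(2):
    # Each tree sees a bootstrap resample: same size, drawn with replacement.
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Fraction of original points on which the two trees give different labels.
disagreement = np.mean(trees[0].predict(X) != trees[1].predict(X))
print(f"Node counts: {trees[0].tree_.node_count} vs {trees[1].tree_.node_count}")
print(f"Fraction of points where the trees disagree: {disagreement:.3f}")
```

Differences in node counts and predictions between the two trees reflect how sensitive a single tree is to the particular sample it was trained on.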
Future Trends:
• Use of Ensemble Methods like Random Forests and Gradient Boosting to improve
accuracy.
• Integration with Explainable AI (XAI) for better interpretability.
• Application in areas like autonomous vehicles and personalized medicine.
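As a rough sketch of the ensemble idea, the snippet below compares a single Decision Tree with a Random Forest using 5-fold cross-validation. The synthetic dataset and the choice of 50 trees are assumptions for illustration, and results on other data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration; not the project's dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    cv_means[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {cv_means[name]:.3f}")
```

A Random Forest averages many trees trained on resampled data and random feature subsets, which typically reduces the variance of a single tree at the cost of some interpretability.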
Conclusion
Decision Trees are a powerful and easy-to-understand model in Machine Learning. Despite
their limitations, they serve as a fundamental building block for advanced ensemble methods.
With proper tuning and modern techniques, Decision Trees continue to have a significant role
in real-world applications.
References
1. Scikit-learn documentation: https://scikit-learn.org/
2. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien
Géron.
3. Kaggle Datasets: https://www.kaggle.com/
4. Research papers and articles on Decision Trees.
CODE:
# Credit Card Fraud Detection using Decision Trees
# This program builds a Decision Tree classifier to detect credit card fraud
# using the Kaggle Credit Card Fraud Detection dataset.
# Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Set random seed for reproducibility
np.random.seed(42)
# Load the dataset
print("Loading the Credit Card Fraud dataset...")
try:
    df = pd.read_csv('creditcard.csv')
    print("Dataset successfully loaded.")
except FileNotFoundError:
    print("Dataset file not found. Please download the dataset from Kaggle "
          "and place it in the current directory.")
    print("Download link: https://www.kaggle.com/mlg-ulb/creditcardfraud")
    raise SystemExit(1)  # Stop with a non-zero exit code; the script cannot continue
# Display basic information about the dataset
print("\n=== Dataset Information ===")
print(f"Shape of the dataset: {df.shape}")
print("\nFirst 5 rows of the dataset:")
print(df.head())
print("\nSummary statistics:")
print(df.describe())
# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())
# Display class distribution (fraud vs non-fraud)
print("\n=== Class Distribution ===")
class_counts = df['Class'].value_counts()
print(class_counts)
print(f"Percentage of fraud transactions: {class_counts[1] / len(df) * 100:.4f}%")
# Visualize class distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='Class', data=df)
plt.title('Class Distribution (0: Normal, 1: Fraud)')
plt.ylabel('Count')
plt.yscale('log') # Using log scale for better visualization due to class imbalance
plt.savefig('class_distribution.png')
plt.close()
# Split the data into features and target
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("\n=== Dataset Split ===")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
# Create and train a Decision Tree Classifier
# Using class_weight='balanced' to handle imbalanced dataset
print("\n=== Training Decision Tree Classifier ===")
dt_classifier = DecisionTreeClassifier(
    max_depth=10,             # Limit depth to prevent overfitting
    min_samples_split=20,     # Minimum samples required to split an internal node
    min_samples_leaf=5,       # Minimum samples required at a leaf node
    class_weight='balanced',  # Handle class imbalance
    random_state=42
)
# Fit the model to the training data
dt_classifier.fit(X_train, y_train)
print("Decision Tree training completed.")
# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)
# Evaluate the model
print("\n=== Model Evaluation ===")
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
# Generate and display confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Normal (0)', 'Fraud (1)'],
            yticklabels=['Normal (0)', 'Fraud (1)'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
plt.close()
# Find the most important features based on the Decision Tree
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_classifier.feature_importances_
})
feature_importance = feature_importance.sort_values('Importance', ascending=False)
print("\n=== Feature Importance ===")
print(feature_importance.head(10)) # Top 10 most important features
# Visualize feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Top 10 Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.close()
# Visualize the Decision Tree (limited to max_depth=3 for clarity)
plt.figure(figsize=(20, 10))
plot_tree(
    dt_classifier,
    max_depth=3,  # Limit visualization depth for clarity
    feature_names=list(X.columns),  # Some scikit-learn versions reject a pandas Index here
    class_names=['Normal', 'Fraud'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Decision Tree Visualization (Limited to Depth 3)')
plt.savefig('decision_tree.png')
plt.close()
print("\n=== Decision Tree Interpretation ===")
print("""
The decision tree diagram shows how the model makes decisions to classify transactions as
fraudulent or normal:
1. Each node represents a decision based on a specific feature.
2. The color intensity indicates the class distribution at that node (darker = more
homogeneous).
3. The tree branches out based on feature thresholds.
4. The deeper the tree, the more complex the decision-making process.
5. Leaf nodes represent the final classification decisions.
The important features identified by the model (shown in the feature importance plot)
are the key indicators the algorithm uses to detect fraudulent transactions.
In this dataset, Time and Amount are the only untransformed features; the V-features are
PCA-transformed for confidentiality, so their importances show which components matter
but not what they represent.
""")
print("\n=== Analysis Complete ===")
print("Decision Tree model for Credit Card Fraud Detection has been successfully built and "
      "evaluated.")
print("Visualization files have been saved: class_distribution.png, confusion_matrix.png, "
      "feature_importance.png, and decision_tree.png")
OUTPUT: