
Machine Learning

Uploaded by

ahersuraj23march
Copyright
© All Rights Reserved
MACHINE LEARNING

PROJECT ON DECISION TREES


Project Name :

Name : Suraj Gopal Aher


Roll No : 23/IT/164
Abstract : Decision Trees are one of the most intuitive and widely used algorithms in
Machine Learning for both classification and regression tasks. This project explains the
working of Decision Trees, their applications, and how they help in making accurate and
interpretable decisions based on data. The study involves building a simple Decision Tree
classifier and analyzing its performance on a dataset.

Introduction
Decision Trees are a supervised learning technique used for both classification and regression
problems. They work by splitting the dataset into subsets based on the value of input features.
This structure resembles a tree, where internal nodes represent a test on a feature, branches
represent the outcome, and leaves represent the final decision.
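
The node, branch, and leaf structure described above can be sketched in a few lines of plain Python. This is only an illustration: the feature names ("amount", "hour"), the thresholds, and the labels are made up for the example and are not taken from any trained model.

```python
# A hand-written decision tree stored as nested dictionaries.
# Internal nodes hold a feature test; leaves hold the final decision.
# All feature names and thresholds here are hypothetical.
tree = {
    "feature": "amount", "threshold": 100.0,
    "left": {"label": "normal"},            # branch taken when amount <= 100
    "right": {                              # branch taken when amount > 100
        "feature": "hour", "threshold": 6.0,
        "left": {"label": "fraud"},         # hour <= 6
        "right": {"label": "normal"},       # hour > 6
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the outcome of each test."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


small_purchase = predict(tree, {"amount": 50.0, "hour": 3.0})
large_night_purchase = predict(tree, {"amount": 500.0, "hour": 3.0})
large_day_purchase = predict(tree, {"amount": 500.0, "hour": 12.0})
print(small_purchase, large_night_purchase, large_day_purchase)
```

A learned tree has the same shape; the difference is that the training algorithm chooses the features and thresholds from the data instead of a person writing them by hand.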

Study Objective
• To understand the concept and working of Decision Trees.

• To build and evaluate a Decision Tree model using Python and Scikit-learn.

• To explore real-world applications and future trends.

Scope & Limitations

Scope:
• Implementation on publicly available datasets.

• Use of Scikit-learn for building and visualizing Decision Trees.


Limitations:

• Limited to small datasets.

• May not generalize well on complex, large-scale problems due to overfitting.

Real-World Application
Decision Trees are used in:

• Medical Diagnosis: To predict diseases based on symptoms.

• Credit Risk Analysis: To evaluate loan applicants.

• Customer Segmentation: To classify customers based on purchasing behavior.


• Fraud Detection: To detect unusual transactions.
• Recommendation Systems: For product recommendations.
Challenges and Future Trends
Challenges:

• Prone to overfitting on training data.

• Can become unstable with small variations in data.

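• Approximates smooth relationships between continuous variables only through step-like, axis-aligned splits, and can struggle with high-dimensional data.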
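
The instability mentioned above can be illustrated with a small sketch. It assumes splits are scored by weighted Gini impurity (the default criterion of Scikit-learn's DecisionTreeClassifier) and uses a tiny made-up dataset with two binary features. Removing a single row changes which feature is chosen at the root.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(rows):
    """Return (feature_index, threshold, weighted_gini) of the best single split.

    Each row is (feature_0, feature_1, ..., label). Candidate thresholds are
    the midpoints between consecutive distinct feature values.
    """
    n_features = len(rows[0]) - 1
    best = None
    for f in range(n_features):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [row[-1] for row in rows if row[f] <= t]
            right = [row[-1] for row in rows if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical toy data: (feature_0, feature_1, label)
rows = [
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0),
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (0, 1, 1),
]
reduced = [row for i, row in enumerate(rows) if i != 3]  # drop the row (0, 1, 0)

root_full = best_split(rows)        # root split chosen on all nine rows
root_reduced = best_split(reduced)  # root split after removing one row
print("Root feature with all rows:", root_full[0])
print("Root feature with one row removed:", root_reduced[0])
```

Because every later split depends on the root, a change like this can reshape the whole tree, which is one reason ensembles of many trees tend to be more stable.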

Future Trends:
• Use of Ensemble Methods like Random Forests and Gradient Boosting to improve
accuracy.
• Integration with Explainable AI (XAI) for better interpretability.

• Application in areas like autonomous vehicles and personalized medicine.
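
The ensemble idea can be sketched with bagging: train several very small trees (one-split "stumps") on bootstrap samples and let them vote. This is a simplified illustration in plain Python, not how Random Forests or Gradient Boosting are implemented in Scikit-learn; the data, the function names, and the number of estimators are assumptions made for the example.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_bagged_stumps(xs, ys, n_estimators=15, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_ensemble(stumps, x):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature dataset: class 0 for small x, class 1 for large x
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
stumps = fit_bagged_stumps(xs, ys)
print(predict_ensemble(stumps, 1), predict_ensemble(stumps, 10))
```

Each stump sees a slightly different sample and may place its threshold differently; averaging the votes smooths out those individual differences.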

Conclusion
Decision Trees are a powerful and easy-to-understand model in Machine Learning. Despite
their limitations, they serve as a fundamental building block for advanced ensemble methods.
With proper tuning and modern techniques, Decision Trees continue to have a significant role
in real-world applications.

References
1. Scikit-learn documentation: https://scikit-learn.org/

2. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron.

3. Kaggle Datasets: https://www.kaggle.com/

4. Research papers and articles on Decision Trees.


CODE:

# Credit Card Fraud Detection using Decision Trees

# This program builds a Decision Tree classifier to detect credit card fraud
# using the Kaggle Credit Card Fraud Detection dataset.

# Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud

# Import necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt


import seaborn as sns
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from sklearn.metrics import confusion_matrix, classification_report

from sklearn.preprocessing import StandardScaler

import warnings

warnings.filterwarnings('ignore')

# Set random seed for reproducibility

np.random.seed(42)

# Load the dataset

print("Loading the Credit Card Fraud dataset...")

try:
    df = pd.read_csv('creditcard.csv')
    print("Dataset successfully loaded.")
except FileNotFoundError:
    print("Dataset file not found. Please download the dataset from Kaggle and place it in the "
          "current directory.")
    print("Download link: https://www.kaggle.com/mlg-ulb/creditcardfraud")
    raise SystemExit(1)  # Stop with a non-zero exit code; nothing below can run without the data

# Display basic information about the dataset


print("\n=== Dataset Information ===")

print(f"Shape of the dataset: {df.shape}")

print("\nFirst 5 rows of the dataset:")

print(df.head())

print("\nSummary statistics:")

print(df.describe())

# Check for missing values

print("\nMissing values in each column:")

print(df.isnull().sum())

# Display class distribution (fraud vs non-fraud)

print("\n=== Class Distribution ===")

class_counts = df['Class'].value_counts()
print(class_counts)

print(f"Percentage of fraud transactions: {class_counts[1] / len(df) * 100:.4f}%")

# Visualize class distribution

plt.figure(figsize=(10, 6))

sns.countplot(x='Class', data=df)

plt.title('Class Distribution (0: Normal, 1: Fraud)')


plt.ylabel('Count')
plt.yscale('log') # Using log scale for better visualization due to class imbalance

plt.savefig('class_distribution.png')

plt.close()

# Split the data into features and target

X = df.drop('Class', axis=1)

y = df['Class']

# Split the data into training and testing sets (80% training, 20% testing)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("\n=== Dataset Split ===")

print(f"Training set size: {X_train.shape[0]} samples")

print(f"Testing set size: {X_test.shape[0]} samples")

# Create and train a Decision Tree Classifier

# Using class_weight='balanced' to handle imbalanced dataset

print("\n=== Training Decision Tree Classifier ===")

dt_classifier = DecisionTreeClassifier(
    max_depth=10,             # Limit depth to reduce overfitting
    min_samples_split=20,     # Minimum samples required to split an internal node
    min_samples_leaf=5,       # Minimum samples required at a leaf node
    class_weight='balanced',  # Handle class imbalance
    random_state=42
)

# Fit the model to the training data

dt_classifier.fit(X_train, y_train)
print("Decision Tree training completed.")
# Make predictions on the test set

y_pred = dt_classifier.predict(X_test)

# Evaluate the model

print("\n=== Model Evaluation ===")

# Calculate evaluation metrics

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")

print(f"Precision: {precision:.4f}")

print(f"Recall: {recall:.4f}")

print(f"F1 Score: {f1:.4f}")

# Generate and display confusion matrix

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")

print(cm)

# Display classification report

print("\nClassification Report:")

print(classification_report(y_test, y_pred))
# Visualize the confusion matrix

plt.figure(figsize=(8, 6))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Normal (0)', 'Fraud (1)'],
            yticklabels=['Normal (0)', 'Fraud (1)'])

plt.xlabel('Predicted Label')

plt.ylabel('True Label')

plt.title('Confusion Matrix')

plt.savefig('confusion_matrix.png')

plt.close()

# Find the most important features based on the Decision Tree


feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_classifier.feature_importances_
})

feature_importance = feature_importance.sort_values('Importance', ascending=False)

print("\n=== Feature Importance ===")


print(feature_importance.head(10)) # Top 10 most important features

# Visualize feature importance

plt.figure(figsize=(12, 8))

sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))

plt.title('Top 10 Feature Importance')

plt.tight_layout()
plt.savefig('feature_importance.png')

plt.close()
# Visualize the Decision Tree (limited to max_depth=3 for clarity)

plt.figure(figsize=(20, 10))

plot_tree(
    dt_classifier,
    max_depth=3,  # Limit visualization depth for clarity
    feature_names=X.columns,
    class_names=['Normal', 'Fraud'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Decision Tree Visualization (Limited to Depth 3)')
plt.savefig('decision_tree.png')

plt.close()

print("\n=== Decision Tree Interpretation ===")

print("""

The decision tree diagram shows how the model makes decisions to classify transactions as
fraudulent or normal:

1. Each node represents a decision based on a specific feature.


2. The color intensity indicates the class distribution at that node (darker = more
homogeneous).

3. The tree branches out based on feature thresholds.


4. The deeper the tree, the more complex the decision-making process.

5. Leaf nodes represent the final classification decisions.

The important features identified by the model (shown in the feature importance plot)
are the key indicators the algorithm uses to detect fraudulent transactions.
Check the printed ranking to see which of Time, Amount, and the V-features (which are
PCA-transformed for confidentiality) matter most for this particular tree.

""")

print("\n=== Analysis Complete ===")


print("Decision Tree model for Credit Card Fraud Detection has been successfully built and evaluated.")

print("Visualization files have been saved: class_distribution.png, confusion_matrix.png, "
      "feature_importance.png, and decision_tree.png")
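
The evaluation step in the program reports accuracy, precision, recall, and F1. The sketch below shows how these follow from the four cells of a binary confusion matrix, which Scikit-learn lays out as [[TN, FP], [FN, TP]] for labels 0 and 1. The counts are made up purely for illustration and are not results from the credit card dataset; the function name is also hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Compute the four metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 1000 transactions, 5 of them fraudulent
model = summarize(tn=990, fp=5, fn=3, tp=2)

# A "model" that labels every transaction as normal
always_normal = summarize(tn=995, fp=0, fn=5, tp=0)

for name, metrics in [("model", model), ("always normal", always_normal)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

With heavily imbalanced classes, the always-normal baseline reaches very high accuracy while catching no fraud at all, which is why precision, recall, and F1 are reported alongside accuracy.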

OUTPUT:
