Phase 3 IBM

The document outlines the process of model development and evaluation, including data preparation, algorithm selection, and performance metrics. It details the implementation of various models such as Random Forest, XGBoost, and LightGBM, along with their evaluation results and insights on model performance. The conclusion emphasizes the importance of advanced data cleaning, model building, and the role of AI in optimizing performance for fraud detection.

Uploaded by

Shreya Patil

Model Development and Evaluation

In this phase, model development and evaluation begins with preparing the data: cleaning, feature
engineering, and splitting it into training and testing sets. After an appropriate algorithm is
selected, such as Random Forest or Logistic Regression, the model is trained on the training data.
Its performance is then evaluated on the test data using metrics such as accuracy, precision, recall,
and the ROC curve for classification, or MAE, MSE, and R² for regression. Cross-validation can be
used to assess how well the model generalizes, and hyperparameter tuning is applied to optimize
performance. The goal is a model that is accurate, robust, and capable of performing well on
unseen data.
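
The cross-validation and hyperparameter tuning mentioned above can be sketched as follows. This is a minimal illustration rather than the project's actual procedure: the data is synthetic (generated with make_classification), and the parameter grid values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (assumption: the real project would use its own features)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 5-fold cross-validation on the training set gives an estimate of generalization
base_model = RandomForestClassifier(n_estimators=50, random_state=42)
cv_scores = cross_val_score(base_model, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Grid search evaluates every parameter combination with cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}  # illustrative values
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Held-out test accuracy:", test_accuracy)
```

The test set is touched only once, after tuning, so the final score is not biased by the parameter search.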

Step 1: Importing necessary libraries and loading the dataset


This step imports the libraries the project needs and loads the dataset. The preprocessing steps
that follow are meant to leave the dataset free from missing values, outliers, and imbalance issues.

1.1 Importing Libraries and loading the dataset


## Importing the necessary libraries
# General libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
import lightgbm as lgb

# Preprocessing libraries
from sklearn.preprocessing import LabelEncoder

# Evaluation metrics
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
df = pd.read_csv(r"C:\Users\Downloads\archive (4)\data\dataset.csv")

1.2 Exploratory Data Analysis (EDA)


Display metadata about the dataset (column names, non-null counts, and data types).

# Display metadata about the dataset
df.info()

1.3 Calculate the average popularity

# Calculate the average popularity for each genre and keep the 15 highest
top_15_popular_genres = (
    df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False).head(15)
)

# Plot the top 15 most popular genres based on average popularity


plt.figure(figsize=(12, 6))
sns.barplot(x=top_15_popular_genres.index, y=top_15_popular_genres.values, palette='viridis')
plt.title("Top 15 Most Popular Genres (Based on Average Popularity)")
plt.xlabel("Genre")
plt.ylabel("Average Popularity")
plt.xticks(rotation=45)
plt.show()

1.4 Checking for duplicates, data preprocessing and label encoding

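# Count fully duplicated rows
duplicates = df.duplicated().sum()
print("Duplicate rows:", duplicates)

# 4) Data Preprocessing
# Fill missing values in 'artists', 'album_name', and 'track_name' with 'Unknown'.
# Assigning the result back avoids the chained-assignment warning that
# fillna(..., inplace=True) on a single column raises in recent pandas versions.
df['artists'] = df['artists'].fillna('Unknown')
df['album_name'] = df['album_name'].fillna('Unknown')
df['track_name'] = df['track_name'].fillna('Unknown')

# Label encode the target variable
le = LabelEncoder()
df['track_genre'] = le.fit_transform(df['track_genre'])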
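
Because label encoding replaces genre names with integers, predictions later have to be mapped back to names to be readable. The sketch below shows the round trip on a toy column; the genre values are hypothetical and stand in for the real track_genre column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy genre column (hypothetical values, for illustration only)
genres = pd.Series(["rock", "jazz", "pop", "jazz", "rock"])

le = LabelEncoder()
encoded = le.fit_transform(genres)       # classes are sorted alphabetically
print(list(le.classes_))                 # ['jazz', 'pop', 'rock']
print(encoded.tolist())                  # [2, 0, 1, 0, 2]

# inverse_transform maps integer codes (for example, model predictions) back to names
decoded = le.inverse_transform(encoded)
print(decoded.tolist())
```

Keeping the fitted encoder around is what makes this possible, so it is worth storing it alongside the trained model.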

Step 2: Building and Training Models


2.1 Feature Selection and Splitting the Data

Feature selection starts by separating the target variable (track_genre) from the remaining columns.
The data is then split into training and testing sets:

• Training set: 80% of the data
• Testing set: 20% of the data

##5) Feature Selection and Splitting Data


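# Features: all numeric columns except 'track_genre'.
# Assumption: text columns such as 'artists', 'album_name', and 'track_name' are left out,
# because the classifiers below cannot be fitted on raw strings.
X = df.drop(columns=['track_genre']).select_dtypes(include=['number', 'bool'])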

# Target variable
y = df['track_genre']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
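
With many genre classes, a purely random split can leave some classes under-represented in the test set. One option, shown here as a sketch on toy data rather than as the split used in this project, is to pass stratify=y so that each class keeps the same share in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 3 classes with 50 samples each (illustrative)
X = np.arange(150).reshape(-1, 1)
y = np.repeat([0, 1, 2], 50)

# stratify=y keeps the class proportions identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts)  # [40 40 40]
print(test_counts)   # [10 10 10]
```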

2.2 Model Development and Evaluation

We initialize the Random Forest classifier, train it on the training set, make predictions on the
test set, and evaluate the model on those predictions.

##6) Model Development and Evaluation


#Random Forest Model
# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model


rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)

# Evaluate the model


print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

2.3 Trying Gradient Boosting Libraries

XGBoost model - XGBoost is an optimized and scalable implementation of gradient boosting that was
developed by Tianqi Chen. It has become very popular due to its speed, accuracy, and ability to handle
large datasets with complex features.
• Initialize the XGBoost Classifier
• Train the model
• Make predictions
• Evaluate

#XGBoost Model
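# Initialize the XGBoost classifier
# (use_label_encoder is deprecated in recent XGBoost releases and is omitted;
# the target was already label encoded in step 1.4)
xgb_model = xgb.XGBClassifier(eval_metric='mlogloss', random_state=42)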

# Train the model


xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model


print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))

LightGBM - LightGBM, developed by Microsoft, is another gradient boosting framework designed to
be fast and efficient, particularly for large datasets. It focuses on faster training times and lower
memory usage.
• Initialize the LightGBM Classifier
• Train the model
• Make predictions
• Evaluate

#LightGBM
# Initialize the LightGBM classifier
lgbm_model = lgb.LGBMClassifier(random_state=42)
# Train the model
lgbm_model.fit(X_train, y_train)

# Make predictions
y_pred_lgbm = lgbm_model.predict(X_test)

# Evaluate the model


print("LightGBM Accuracy:", accuracy_score(y_test, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_test, y_pred_lgbm))
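
The imports in step 1.1 include VotingClassifier, which the steps above do not use. The sketch below shows one way such an ensemble could combine several classifiers through soft voting. It is an illustration under assumptions: the data is synthetic, and scikit-learn's GradientBoostingClassifier stands in for XGBoost and LightGBM so that the example runs without extra packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the project's dataset)
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Soft voting averages the predicted class probabilities of the base models
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
]
ensemble = VotingClassifier(estimators=estimators, voting="soft")
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, y_pred)
print("Voting ensemble accuracy:", ensemble_accuracy)
```

Soft voting requires every base model to implement predict_proba, which the classifiers used in this phase all do.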

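Trying the Elbow Method plot

# ELBOW METHOD PLOT
# Find the optimal number of clusters (k)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# train_data is not defined in the steps above. Assumption: it holds the numeric
# training features that should be clustered.
train_data = X_train.select_dtypes(include='number')

# Determine the optimal number of clusters using the Elbow Method
inertia = []
k_values = range(1, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(train_data)
    inertia.append(kmeans.inertia_)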

# Plot the Elbow Method


plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

K-Means algorithm - K-Means is a clustering algorithm that groups similar data points into K clusters
based on their features, by minimizing the squared distance between each point and its closest cluster
center. The sum of these squared distances is the inertia plotted by the Elbow Method.
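
To make the description concrete, the two alternating steps of K-Means (assign each point to its nearest center, then move each center to the mean of its points) can be sketched in plain NumPy. The data and the choice of starting centers are illustrative assumptions; scikit-learn's KMeans uses a more careful initialization.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated toy blobs (illustrative data)
points = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])

k = 2
centers = points[[0, 50]].copy()  # one starting center taken from each blob
inertia_history = []

for _ in range(10):
    # Assignment step: each point joins its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Inertia: sum of squared distances to the assigned center
    inertia_history.append(float((distances.min(axis=1) ** 2).sum()))
    # Update step: move each center to the mean of its assigned points
    centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print("Cluster sizes:", np.bincount(labels))
print("Inertia per iteration:", [round(v, 2) for v in inertia_history])
```

The inertia never increases from one iteration to the next, which is why it is a natural quantity to compare across different values of k.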

2.4 Calculations
# Calculating accuracy, precision, recall, F1 and ROC AUC for the Random Forest classifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Assumption: predicted_labels refers to the Random Forest predictions from section 2.2
predicted_labels = y_pred_rf

accuracy = accuracy_score(y_test, predicted_labels)
precision = precision_score(y_test, predicted_labels, average='weighted')
recall = recall_score(y_test, predicted_labels, average='weighted')
f1 = f1_score(y_test, predicted_labels, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision (Weighted): {precision}")
print(f"Recall (Weighted): {recall}")
print(f"F1-Score (Weighted): {f1}")

# Calculate ROC AUC Score
# For multi-class targets, the ROC AUC score is computed with a one-vs-rest scheme.
# roc_auc_score binarizes the integer labels internally; labels=rf.classes_ keeps the
# columns of y_score aligned with the classes the model was trained on.
y_score = rf.predict_proba(X_test)  # Probability estimates for each class
roc_auc = roc_auc_score(y_test, y_score, multi_class='ovr', labels=rf.classes_)
print(f"ROC AUC Score: {roc_auc}")

2.5 Recommendation system training with kmeans
# Recommendation system with kmeans
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_songs(song_name, df, num_recommendations=5):
    # Note: df is expected to have "name", "year", "artists" and "Cluster" columns, and
    # numerical_features is expected to be a list of numeric column names. Neither is
    # defined in the steps above.

    # Get the cluster of the input song
    song_cluster = df[df["name"] == song_name]["Cluster"].values[0]

    # Filter songs from the same cluster; reset the index so that row labels match
    # the row positions of the similarity matrix
    same_cluster_songs = df[df["Cluster"] == song_cluster].reset_index(drop=True)

    # Calculate similarity within the cluster
    song_index = same_cluster_songs[same_cluster_songs["name"] == song_name].index[0]
    cluster_features = same_cluster_songs[numerical_features]
    similarity = cosine_similarity(cluster_features, cluster_features)

    # Get top recommendations (the most similar entry is assumed to be the song itself
    # and is skipped)
    similar_songs = np.argsort(similarity[song_index])[-(num_recommendations + 1):-1][::-1]
    recommendations = same_cluster_songs.iloc[similar_songs][["name", "year", "artists"]]

    return recommendations
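
The cluster-then-rank idea behind the function above can be tried end to end on a toy catalogue. Everything in this sketch is hypothetical: the six songs, the two feature columns, and the simplified recommend helper, which returns only names and explicitly excludes the query song instead of relying on its position in the sorted similarities.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue with two obvious groups of songs
songs = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "danceability": [0.90, 0.85, 0.80, 0.10, 0.15, 0.20],
    "energy": [0.80, 0.90, 0.85, 0.20, 0.10, 0.15],
})
numerical_features = ["danceability", "energy"]

# Assign every song to a cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
songs["Cluster"] = kmeans.fit_predict(songs[numerical_features])

def recommend(song_name, df, num_recommendations=2):
    # Restrict the search to the query song's cluster
    cluster = df.loc[df["name"] == song_name, "Cluster"].iloc[0]
    same = df[df["Cluster"] == cluster].reset_index(drop=True)
    pos = same.index[same["name"] == song_name][0]
    # Rank the cluster's songs by cosine similarity to the query song
    sim = cosine_similarity(same[numerical_features])[pos]
    sim[pos] = -np.inf  # never recommend the query song itself
    top = np.argsort(sim)[::-1][:num_recommendations]
    return same.loc[top, "name"].tolist()

result = recommend("A", songs)
print(result)
```

Restricting the similarity computation to one cluster keeps the similarity matrix small, which matters once the catalogue has many thousands of tracks.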

Step 3: Using AI
3.1 What is AI?
AI can be used to automate tasks, gain insights from data, and make decisions or predictions, in many
cases more accurately and efficiently than humans. It can also be used to enhance customer experiences,
improve productivity, and drive innovation.

3.2 Why Use AI?


AI in model development helps unlock the potential of complex, large-scale data, automates repetitive
tasks, improves performance and decision-making, and makes the process more scalable and
adaptable. By incorporating AI, organizations can create more accurate, efficient, and intelligent
models that provide actionable insights, enhance user experiences, and drive innovation.

Step 4: Model Evaluation


4.1 Why Evaluate?
Evaluation refers to the process of assessing the performance of a machine learning model or algorithm.
This involves measuring how well the model is able to make predictions, classify data, or complete other
tasks.

Metrics Used:

1. Accuracy: Measures the overall proportion of correct predictions.
2. Precision & Recall: More informative than accuracy for imbalanced datasets.
3. ROC AUC: Evaluates how well the predicted scores separate the classes.
4. Fairness Metrics: Evaluate model bias.
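
As a small worked example of the first two metrics, the sketch below computes accuracy, precision, recall, and F1 on ten toy binary labels (illustrative values, not project results), with the underlying counts spelled out in the comments.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy binary labels (illustrative): 1 = positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Counts: TP = 2, FN = 2, FP = 1, TN = 5
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7/10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)    = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)    = 2/4
f1 = f1_score(y_true, y_pred)                # 2TP / (2TP + FP + FN) = 4/7

print(accuracy, precision, recall, f1)
```

For multi-class problems such as genre classification, the same scores are computed per class and then averaged, for example with average='weighted'.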

Step 5: Results and Insights


Comparison:

1. Random Forest Results:


• Accuracy: 0.3
• Precision (Weighted): 0.3132867132867133
• Recall (Weighted): 0.3
• F1-Score (Weighted): 0.30295652173913046
• ROC AUC Score: 0.5200066137566136
2. XGBoost Classifier
• XGBoost Accuracy: 0.31407894736842107
• Classification Report:
       precision    recall  f1-score   support
  0         0.19      0.17      0.18       213
  1         0.37      0.34      0.36       203
  ...
3. LightGBM Classifier
• LightGBM Accuracy: 0.1287280701754386
• Classification Report:
       precision    recall  f1-score   support
  0         0.06      0.04      0.05       213
  1         0.09      0.15      0.11       203

Observations:

1. Model Performance Analysis:

● Random Forest: Its ensemble approach makes it a robust and effective initial model. It reached an
accuracy and weighted recall of 0.30.
● XGBoost: The classification report gives the accuracy together with the macro avg and weighted
avg. XGBoost is a scalable gradient boosting library and reached the highest accuracy of the three
models (about 0.31).
● LightGBM: The training log shows LightGBM auto-choosing col-wise multi-threading after testing
the overhead, starting the training from an initial score, and splitting the data for as long as
splits with positive gain are found. Its accuracy (about 0.13) was the lowest of the three models.

2. Evaluation Metrics:

● Precision: Measures the proportion of correctly identified fraud cases out of all predicted fraud
cases.
● Recall: Focuses on how many actual fraud cases were detected out of the total fraud cases
present.
● F1-score: Balances precision and recall, offering a comprehensive view of model performance
in the context of fraud detection.

3. Insights on Model Accuracy:


Accuracy alone can be misleading. Even a model that reports 99% accuracy may perform poorly, due to the following factors:

● Class Imbalance: Fraud cases often constitute less than 1% of the dataset. A model predicting
all cases as "not fraud" can yield high accuracy but fail at identifying fraud effectively.
● Overfitting: High accuracy may indicate the model has overfitted the training data, reducing
its ability to generalize to unseen data.
● Evaluation Metrics: For imbalanced datasets, metrics like precision, recall,
F1-score, and ROC AUC provide deeper insights into the model’s true performance.
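
The class imbalance point can be checked with a few lines of code. The sketch below uses 1,000 toy labels with 1% positives (an illustrative assumption) and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 toy labels with 1% positives (illustrative)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts the majority class
y_pred = np.zeros(1000, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
print(accuracy)  # 0.99
print(recall)    # 0.0
```

The accuracy looks excellent even though not a single positive case is detected, which is why recall and related metrics are reported alongside it.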
Key Takeaways:

● XGBoost Classifier:
  • Regularization: Includes both L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting,
    an important feature missing in traditional gradient boosting.
  • Parallelization: Supports parallel processing for faster training by splitting tasks like
    computing gradients across multiple cores.
  • Handling Missing Data: Has built-in support for handling missing values during training,
    automatically learning the best way to deal with them.
● LightGBM:
  • Histogram-based Approach: Bins continuous features into histograms, which leads to faster
    training and lower memory consumption.
  • Leaf-wise Growth: Grows trees leaf-wise, as opposed to the level-wise growth in traditional
    gradient boosting. This often leads to deeper trees and better model performance, but it can
    be more prone to overfitting on smaller datasets.
  • Efficient for Large Datasets: Highly efficient when dealing with large datasets, and training
    can be distributed across multiple machines for even faster results.
● Metrics Comparison: A detailed comparison of metrics like precision, recall,
F1-score, and ROC AUC highlighted the trade-offs in detecting fraud. These metrics proved
more informative than accuracy for evaluating model performance on imbalanced datasets.

Conclusion:
This project underscored the critical importance of advanced data cleaning, iterative model building,
and comprehensive evaluation to handle real-world challenges. AI emerged as a valuable asset,
offering efficient optimization and reliable performance, especially in resource-constrained
environments. By integrating manual techniques with automated solutions, the project achieved a
robust and fair model that effectively addresses the complexities of fraud detection, providing
actionable insights and scalable solutions.
