Phase 3 IBM

The document outlines the process of model development and evaluation, including data preparation, algorithm selection, and performance metrics. It details the implementation of various models such as Random Forest, XGBoost, and LightGBM, along with their evaluation results and insights on model performance. The conclusion emphasizes the importance of advanced data cleaning, model building, and the role of AI in optimizing performance for fraud detection.

Uploaded by

Shreya Patil

Model Development and Evaluation

Model development and evaluation is a comprehensive process. It begins with preparing the data, which
includes cleaning, feature engineering, and splitting it into training and testing sets. After an
appropriate algorithm is selected, such as Random Forest or Logistic Regression, the model is trained
on the training data. Its performance is then evaluated on the test data using metrics such as
accuracy, precision, recall, and the ROC curve for classification, or MAE, MSE, and R² for regression.
Cross-validation can be used to assess how well the model generalizes, and hyperparameter tuning is
applied to optimize performance. The goal is a model that is accurate, robust, and performs well on
unseen data.
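The cross-validation and hyperparameter tuning mentioned above are not shown in the steps that follow. A minimal sketch of both with scikit-learn is given here. The synthetic data from make_classification and the small parameter grid are assumptions chosen for illustration only; they are not the project's dataset or its tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption, not the project's dataset)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5-fold cross-validation on the training data only
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_train, y_train, cv=5, scoring="accuracy",
)
print("CV accuracy per fold:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())

# Small, illustrative hyperparameter grid searched with 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# The held-out test set is used once, after tuning is finished
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Keeping the test set out of both the cross-validation and the grid search means the final score is an estimate on data the tuning never saw.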

Step 1: Importing necessary libraries and loading the dataset

Import the libraries required to implement the project, then load the dataset. The preprocessing that
follows aims to leave the dataset free of missing values, outliers, and imbalance issues.

1.1 Importing Libraries and loading the dataset

## Importing the necessary libraries
# General libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
import lightgbm as lgb

# Preprocessing libraries
from sklearn.preprocessing import LabelEncoder

# Evaluation metrics
from sklearn.metrics import classification_report, accuracy_score

#loading the dataset

df = pd.read_csv(r"C:\Users\Downloads\archive (4)\data\dataset.csv")

1.2 Exploratory Data Analysis (EDA)

Displaying the metadata about the dataset

# display metadata about the dataset

df.info()

1.3 Calculate the average popularity

# Calculate the average popularity for each genre

top_15_popular_genres = (
    df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False).head(15)
)

# Plot the top 15 most popular genres based on average popularity

plt.figure(figsize=(12, 6))
sns.barplot(x=top_15_popular_genres.index, y=top_15_popular_genres.values, palette='viridis')
plt.title("Top 15 Most Popular Genres (Based on Average Popularity)")
plt.xlabel("Genre")
plt.ylabel("Average Popularity")
plt.xticks(rotation=45)
plt.show()

1.4 Checking for duplicates, data preprocessing and label encoding

duplicates = df.duplicated().sum()

#4) Data Preprocessing

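# Fill missing values in 'artists', 'album_name', and 'track_name' with 'Unknown'
# (assigning the result back avoids the chained in-place fillna that newer pandas versions warn about)
df['artists'] = df['artists'].fillna('Unknown')
df['album_name'] = df['album_name'].fillna('Unknown')
df['track_name'] = df['track_name'].fillna('Unknown')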

# Label encode the target variable

le = LabelEncoder()
df['track_genre'] = le.fit_transform(df['track_genre'])
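Label encoding replaces each genre name with an integer, so predictions later come back as integers as well. The short sketch below shows how the fitted encoder maps them back to names. The toy genre values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy genre column (illustrative values only)
genres = pd.Series(["rock", "jazz", "pop", "jazz", "rock"])

le = LabelEncoder()
encoded = le.fit_transform(genres)

# classes_ holds the original names in sorted order; the position is the integer code
print(list(le.classes_))   # ['jazz', 'pop', 'rock']
print(list(encoded))       # [2, 0, 1, 0, 2]

# inverse_transform turns integer codes (for example, model predictions) back into names
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

The same `le.inverse_transform` call can be applied to a model's predicted labels to make a classification report easier to read.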

Step 2: Building and Training Models

2.1 Feature Selection and Splitting the Data

Feature selection starts by separating the target variable (track_genre) from the remaining columns.
The data is then split into training and testing sets:

• Training set: 80% of the data

• Testing set: 20% of the data

##5) Feature Selection and Splitting Data

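# Features (all columns except 'track_genre'); only numeric and boolean columns are kept,
# because the classifiers below cannot be fitted on raw text columns such as 'artists'
X = df.drop(columns=['track_genre']).select_dtypes(include=['number', 'bool'])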

# Target variable
y = df['track_genre']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
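The split above is purely random. For a multi-class target such as genre, an optional variant is a stratified split, which keeps the class proportions the same in both sets. The sketch below uses a toy target with three equally sized classes (an assumption for illustration) to show the effect of the stratify argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 3 classes with 50 samples each (illustrative only)
y = np.repeat([0, 1, 2], 50)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```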

2.2 Model Development and Evaluation

We initialize the Random Forest classifier, train it by fitting it to the training data, make
predictions on the test set, and evaluate the model based on those predictions.

##6) Model Development and Evaluation

#Random Forest Model
# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model

rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)

# Evaluate the model

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

2.3 Trying Gradient Boosting Libraries

XGBoost model - XGBoost is an optimized and scalable version of gradient boosting that was
developed by Tianqi Chen. It has become very popular due to its speed, accuracy, and ability to handle
large datasets with complex features.
• Initialize the XGBoost Classifier
• Train the model
• Make predictions
• Evaluate

#XGBoost Model
# Initialize the XGBoost classifier
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)

# Train the model

xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model

print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))

LightGBM - LightGBM, developed by Microsoft, is another gradient boosting framework designed to
be fast and efficient, particularly for large datasets. It focuses on faster training times and lower
memory usage.
• Initialize the LightGBM Classifier
• Train the model
• Make predictions
• Evaluate

#LightGBM
# Initialize the LightGBM classifier
lgbm_model = lgb.LGBMClassifier(random_state=42)
# Train the model
lgbm_model.fit(X_train, y_train)

# Make predictions
y_pred_lgbm = lgbm_model.predict(X_test)

# Evaluate the model

print("LightGBM Accuracy:", accuracy_score(y_test, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_test, y_pred_lgbm))

Trying ELBOW Method plot

#ELBOW METHOD PLOT
#to find optimal K

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

# Determine the optimal number of clusters using the Elbow Method

inertia = []
k_values = range(1, 11)

for k in k_values:
    # train_data is not defined in this section; it is assumed to be the numeric feature matrix
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(train_data)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method

plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

K-Means algorithm - K-Means is a clustering algorithm that groups similar data points into K clusters
based on their features by minimizing the squared distance between each point and its closest cluster
center.
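As a minimal sketch of how the clustering can be applied, the example below scales two features, fits K-Means, and stores the result in a "Cluster" column. It also recomputes the inertia by hand to show what the elbow plot measures. The toy table, the feature names energy and tempo, and the choice of k=2 are assumptions for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy table with two obvious groups (illustrative values only)
songs = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "energy": [0.10, 0.15, 0.12, 0.90, 0.85, 0.95],
    "tempo": [70.0, 75.0, 72.0, 150.0, 155.0, 148.0],
})
numerical_features = ["energy", "tempo"]

# Scale the features so that tempo does not dominate the distance computation
scaled = StandardScaler().fit_transform(songs[numerical_features])

# Fit K-Means and store each song's cluster label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
songs["Cluster"] = kmeans.fit_predict(scaled)
print(songs)

# Inertia is the sum of squared distances from each point to its assigned center
manual_inertia = ((scaled - kmeans.cluster_centers_[kmeans.labels_]) ** 2).sum()
print("Inertia:", kmeans.inertia_, "recomputed:", manual_inertia)
```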

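2.4 Calculations
# Calculating accuracy, precision, recall, F1-score and ROC AUC for the described
# algorithm (Random Forest classifier)
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

# predicted_labels is assumed to be the Random Forest predictions from Section 2.2
predicted_labels = y_pred_rf

accuracy = accuracy_score(y_test, predicted_labels)

precision = precision_score(y_test, predicted_labels, average='weighted')
recall = recall_score(y_test, predicted_labels, average='weighted')
f1 = f1_score(y_test, predicted_labels, average='weighted')

# Calculate ROC AUC Score

# For multi-class, we calculate the ROC AUC score using a one-vs-rest scheme
# Binarize the true labels using the classes the model was trained on
# (the original hard-coded classes=[0, 1, 2], which only fits a three-class target)
y_test_bin = label_binarize(y_test, classes=rf.classes_)
y_score = rf.predict_proba(X_test) # Get the probability estimates for each class
# Compute ROC AUC score
roc_auc = roc_auc_score(y_test_bin, y_score, multi_class='ovr')
print(f"ROC AUC Score: {roc_auc}")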
2.5 Recommendation system with K-Means
# Recommendation system with K-Means
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_songs(song_name, df, num_recommendations=5):
    # numerical_features (a list of numeric feature columns) and the "Cluster" column
    # are assumed to have been created beforehand

    # Get the cluster of the input song
    song_cluster = df[df["name"] == song_name]["Cluster"].values[0]

    # Filter songs from the same cluster; reset the index so that row labels
    # match the row positions of the similarity matrix
    same_cluster_songs = df[df["Cluster"] == song_cluster].reset_index(drop=True)

    # Calculate similarity within the cluster
    song_index = same_cluster_songs[same_cluster_songs["name"] == song_name].index[0]
    cluster_features = same_cluster_songs[numerical_features]
    similarity = cosine_similarity(cluster_features, cluster_features)

    # Get top recommendations (the last entry is skipped, as it is expected to be the input song itself)
    similar_songs = np.argsort(similarity[song_index])[-(num_recommendations + 1):-1][::-1]
    recommendations = same_cluster_songs.iloc[similar_songs][["name", "year", "artists"]]

    return recommendations

Step 3: Use of AI
3.1 What is AI?
AI can be used to automate tasks, gain insights from data, and make decisions or predictions, in some
tasks more accurately and efficiently than humans. It can also be used to enhance customer
experiences, improve productivity, and drive innovation.

3.2 Why Use AI?

AI in model development helps unlock the potential of complex, large-scale data, automates repetitive
tasks, improves performance and decision-making, and makes the process more scalable and
adaptable. By incorporating AI, organizations can create more accurate, efficient, and intelligent
models that provide actionable insights, enhance user experiences, and drive innovation.

Step 4: Model Evaluation

4.1 Why Evaluate?
Evaluation refers to the process of assessing the performance of a machine learning model or algorithm.
This involves measuring how well the model is able to make predictions, classify data, or complete other
tasks.

Metrics Used:

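1. Accuracy: Measures overall correctness, i.e. the share of all predictions that are correct.

2. Precision & Recall: Precision is the share of predicted positives that are correct; recall is the
share of actual positives that are found. Both are useful for imbalanced datasets.
3. ROC AUC: Evaluates how well the classifier separates the classes across decision thresholds.
4. Fairness Metrics: Evaluate model bias.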
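The first two metrics in the list can be worked out by hand from the counts of true and false positives and negatives. The sketch below does this for a small binary example and checks the results against scikit-learn. The ten labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up binary labels (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the four outcomes
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)                   # share of all predictions that are correct
precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)

# The hand-computed values agree with scikit-learn
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```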

Step 5: Results and Insights

Comparison:

1. Random Forest Results:

• Accuracy: 0.3
• Precision (Weighted): 0.3132867132867133
• Recall (Weighted): 0.3
• F1-Score (Weighted): 0.30295652173913046
• ROC AUC Score: 0.5200066137566136
2. XGBoost Classifier
• XGBoost Accuracy: 0.31407894736842107
• Classification Report:

       precision   recall   f1-score   support
   0        0.19     0.17       0.18       213
   1        0.37     0.34       0.36       203
   …………….
3. LightGBM Classifier
• LightGBM Accuracy: 0.1287280701754386
• Classification Report:

       precision   recall   f1-score   support
   0        0.06     0.04       0.05       213
   1        0.09     0.15       0.11       203

Observations:

1. Model Performance Analysis:

● Random Forest: Used as the initial ensemble model. It reached an accuracy and weighted recall of
0.30, and its ROC AUC of about 0.52 is only slightly above the 0.5 expected from random guessing.
● XGBoost: The classification report gives the accuracy together with the macro and weighted
averages. Its accuracy of about 0.31 was the highest of the three models.
● LightGBM: The training log shows that column-wise multi-threading was chosen automatically and
that training starts from an initial score; trees are split as long as a split with positive gain
is found. Its accuracy of about 0.13 was the lowest of the three models.

2. Evaluation Metrics:

● Precision: Measures the proportion of correctly identified fraud cases out of all predicted fraud
cases.
● Recall: Focuses on how many actual fraud cases were detected out of the total fraud cases
present.
● F1-score: Balances precision and recall, offering a comprehensive view of model performance
in the context of fraud detection.

3. Insights on Model Accuracy:

Accuracy on its own can be misleading. A model that reports, for example, 99% accuracy may still
perform poorly, due to the following factors:

● Class Imbalance: Fraud cases often constitute less than 1% of the dataset. A model predicting
all cases as "not fraud" can yield high accuracy but fail at identifying fraud effectively.
● Overfitting: High accuracy may indicate the model has overfitted the training data, reducing
its ability to generalize to unseen data.
● Evaluation Metrics: For imbalanced datasets, metrics like precision, recall, F1-score, and ROC AUC
provide deeper insights into the model’s true performance.
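The class imbalance point can be checked with a few lines of code. In the sketch below, 1% of the labels are positive and the "model" predicts the negative class for every case. The 1,000 samples and the 1% rate are assumptions chosen to match the example in the text.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1,000 cases, of which 10 (1%) are positive
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts the negative class
y_pred = np.zeros(1000, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when there are no positive predictions
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print("Accuracy:", accuracy)    # high, although no positive case is found
print("Precision:", precision)
print("Recall:", recall)
```

Accuracy comes out at 0.99 while recall is 0, which is why precision, recall, and F1-score are reported alongside accuracy for imbalanced data.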
Key Takeaways:

● XGBoost Classifier:
• Regularization: XGBoost includes both L1 (Lasso) and L2 (Ridge) regularization to prevent
overfitting, an important feature missing in traditional gradient boosting.
• Parallelization: It supports parallel processing for faster training by splitting tasks like
computing gradients across multiple cores.
• Handling Missing Data: XGBoost has built-in support for handling missing values during
training, automatically learning the best way to deal with them.
● LightGBM:
• Histogram-based Approach: LightGBM bins continuous features into histograms, which leads to
faster training and lower memory consumption.
• Leaf-wise Growth: LightGBM grows trees leaf-wise, as opposed to level-wise growth in
traditional gradient boosting. This often leads to deeper trees and better model performance,
but it can be more prone to overfitting on smaller datasets.
• Efficient for Large Datasets: It is highly efficient when dealing with large datasets and
can be distributed across multiple machines for even faster training.
● Metrics Comparison: A detailed comparison of metrics like precision, recall, F1-score, and
ROC AUC highlighted the trade-offs in detecting fraud. These metrics proved more informative than
accuracy for evaluating model performance on imbalanced datasets.
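The features named in the takeaways correspond to constructor parameters of the two classifiers. The sketch below collects them in plain dictionaries so that it runs without either library installed. The values are hypothetical starting points, not settings tuned or used in this project.

```python
# Hypothetical starting values (assumptions, not tuned on the project data)

# XGBoost: regularization and parallel training
xgb_params = {
    "reg_alpha": 0.1,     # L1 (Lasso) regularization strength
    "reg_lambda": 1.0,    # L2 (Ridge) regularization strength
    "n_jobs": -1,         # use all available CPU cores
    "random_state": 42,
}

# LightGBM: limits on leaf-wise growth to reduce overfitting on small datasets
lgbm_params = {
    "num_leaves": 31,         # maximum number of leaves per tree
    "max_depth": 8,           # cap on tree depth
    "min_child_samples": 20,  # minimum number of samples required in a leaf
    "max_bin": 255,           # number of histogram bins per feature
    "random_state": 42,
}

# Usage, assuming xgboost and lightgbm are installed and imported as xgb and lgb:
# xgb_model = xgb.XGBClassifier(**xgb_params)
# lgbm_model = lgb.LGBMClassifier(**lgbm_params)
print(xgb_params)
print(lgbm_params)
```

Lowering num_leaves or max_depth and raising min_child_samples are the usual first steps when a leaf-wise model overfits.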

Conclusion:
This project underscored the critical importance of advanced data cleaning, iterative model building,
and comprehensive evaluation to handle real-world challenges. AI emerged as a valuable asset,
offering efficient optimization and reliable performance, especially in resource-constrained
environments. By integrating manual techniques with automated solutions, the project achieved a
robust and fair model that effectively addresses the complexities of fraud detection, providing
actionable insights and scalable solutions.
