Phase 3 IBM
In this phase, the model is developed and evaluated. The process begins with preparing the data:
cleaning, feature engineering, and splitting it into training and testing sets. After an appropriate
algorithm is selected, such as Random Forest or Logistic Regression, the model is trained on the
training data. The trained model is then evaluated on the test data using metrics such as accuracy,
precision, recall, and the ROC curve for classification, or MAE, MSE, and R² for regression.
Cross-validation can be used to assess how well the model generalizes, and hyperparameter tuning is
applied to optimize performance. The goal is a model that is accurate, robust, and performs well on
unseen data.
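To make the cross-validation and hyperparameter tuning steps concrete, here is a minimal sketch using scikit-learn. The synthetic data from `make_classification`, the variable names, and the small parameter grid are all assumptions for illustration; they are not the project's dataset or its tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 300 samples, 10 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training split estimates generalization.
base_model = RandomForestClassifier(n_estimators=50, random_state=42)
cv_scores = cross_val_score(base_model, X_tr, y_tr, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Small illustrative grid (assumption); a real search would usually be wider.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)
print("Best parameters:", search.best_params_)

# The test split is touched only once, after tuning is finished.
test_accuracy = search.score(X_te, y_te)
print("Held-out accuracy: %.3f" % test_accuracy)
```

Keeping the test split out of both the cross-validation and the grid search means the final score is measured on data the tuning never saw.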
# Model libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
import lightgbm as lgb
# Preprocessing and evaluation libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
# Count duplicate rows in the dataset
duplicates = df.duplicated().sum()
We began feature selection by isolating the target variable (track_genre), then split the data into
training and testing sets.
# Target variable
y = df['track_genre']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
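One variation worth considering for a multi-class target such as genre is a stratified split, which keeps the class proportions the same in both sets. The sketch below is an assumption-laden illustration: the random feature matrix, the two made-up genre labels, and the `_s` variable names are hypothetical and do not come from the project data. The source listing does not use `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 4 numeric features, 2 genres (70/30).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array(["rock"] * 70 + ["jazz"] * 30)

# stratify=y_demo keeps the 70/30 genre ratio in both the train and test split.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_train_s.shape, X_test_s.shape)
print((y_test_s == "jazz").sum(), "jazz tracks in the test split")
```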
We initialized the Random Forest Classifier, trained it by fitting it to the training data, generated
predictions on the test set, and evaluated the model using those predictions.
# Make predictions
y_pred_rf = rf.predict(X_test)
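The listing above shows only the prediction step. The full initialize, train, predict, evaluate sequence described in the text could look roughly like the following sketch. It runs on synthetic data, and the names `rf_demo`, `X_tr`, and `X_te` as well as `n_estimators=100` are illustrative assumptions rather than the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and genre labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 1. Initialize the classifier (100 trees is an illustrative choice).
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
# 2. Train on the training split.
rf_demo.fit(X_tr, y_tr)
# 3. Predict on the held-out test split.
y_pred_demo = rf_demo.predict(X_te)
# 4. Evaluate: overall accuracy plus per-class precision, recall, and F1.
rf_accuracy = accuracy_score(y_te, y_pred_demo)
print("Accuracy: %.3f" % rf_accuracy)
print(classification_report(y_te, y_pred_demo))
```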
XGBoost model - XGBoost is an optimized, scalable implementation of gradient boosting, originally
developed by Tianqi Chen. It has become very popular due to its speed, accuracy, and ability to handle
large datasets with complex features.
• Initialize the XGBoost Classifier
• Train the model
• Make predictions
• Evaluate
#XGBoost Model
# Initialize the XGBoost classifier
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
# Train the model
xgb_model.fit(X_train, y_train)
# Make predictions
y_pred_xgb = xgb_model.predict(X_test)
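Recent XGBoost releases generally expect class labels to be integers from 0 to n_classes - 1, so string labels such as genre names need to be encoded before training. The `LabelEncoder` imported earlier can do this. The sketch below uses made-up genre names as a stand-in for `df['track_genre']`, and does not call XGBoost itself.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical genre labels standing in for df['track_genre'].
genres = ["rock", "jazz", "pop", "jazz", "rock"]

label_encoder = LabelEncoder()
# Classes are sorted alphabetically, then mapped to 0..n_classes-1.
y_encoded = label_encoder.fit_transform(genres)
print(list(label_encoder.classes_), list(y_encoded))

# After prediction, integer outputs can be mapped back to genre names.
decoded = label_encoder.inverse_transform(y_encoded)
print(list(decoded))
```

The same fitted encoder should be used for both the training labels and for decoding predictions, so that the integer-to-genre mapping stays consistent.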
#LightGBM
# Initialize the LightGBM classifier
lgbm_model = lgb.LGBMClassifier(random_state=42)
# Train the model
lgbm_model.fit(X_train, y_train)
# Make predictions
y_pred_lgbm = lgbm_model.predict(X_test)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(train_data)
    inertia.append(kmeans.inertia_)
KMeans algorithm - K-Means is a clustering algorithm that groups similar data points into K clusters
based on their features by minimizing the distance between each point and its closest cluster center.
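The loop shown earlier records the inertia (the sum of squared distances from each point to its nearest cluster center) for several values of k, which is the usual input to the elbow method for choosing K. A self-contained sketch follows. The synthetic blobs, the range of k, and `n_init=10` are assumptions for illustration, since the original `train_data` and `k_values` are not shown.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for train_data (assumption): 3 well-separated blobs.
train_data_demo, _ = make_blobs(
    n_samples=300, centers=3, cluster_std=0.8, random_state=42
)

k_values = range(1, 8)  # illustrative range of cluster counts
inertia = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(train_data_demo)
    # inertia_ is the sum of squared distances to the nearest cluster center.
    inertia.append(kmeans.inertia_)

# Inertia drops sharply up to the "elbow", then flattens out.
for k, value in zip(k_values, inertia):
    print(k, round(value, 1))
```

Plotting `inertia` against `k_values` and looking for the point where the curve bends gives a rough, visual choice of K.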
2.3 Calculations
# Calculating accuracy, precision, recall, and ROC curve for the described algorithm (Random Forest Classifier)
return recommendations
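The accuracy, precision, recall, and ROC calculation named in section 2.3 is not shown in full. One possible way to compute these for a multi-class Random Forest is sketched here. The synthetic data, the macro averaging, and the one-vs-rest ROC AUC are assumptions chosen for illustration; the project may have used different averaging or plotted the ROC curve instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)  # one probability column per class

metrics = {
    "accuracy": accuracy_score(y_te, y_hat),
    # Macro averaging weights every class equally, regardless of its size.
    "precision_macro": precision_score(y_te, y_hat, average="macro", zero_division=0),
    "recall_macro": recall_score(y_te, y_hat, average="macro", zero_division=0),
    # One-vs-rest ROC AUC needs class probabilities, not hard predictions.
    "roc_auc_ovr": roc_auc_score(y_te, y_proba, multi_class="ovr"),
}
for name, value in metrics.items():
    print("%s: %.3f" % (name, value))
```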
Step 3: Use of AI
3.1 What is AI?
AI can be used to automate tasks, gain insights from data, and make decisions or predictions, in some
tasks more accurately and efficiently than humans. It can also be used to enhance customer experiences,
improve productivity, and drive innovation.
Metrics Used:
Observations:
● Random Forest: Enhanced accuracy and recall due to its ensemble approach, demonstrating
robustness and effectiveness as an initial model.
● XGBoost: Reports accuracy along with the macro average and weighted average. It is one of the
more scalable gradient boosting libraries.
● LightGBM: The training log reports that it auto-chose col-wise multi-threading (after a short
overhead test) and started training from an initial score. It keeps splitting the data while a
split offers positive gain; accuracy and the averages are then computed on the predictions.
2. Evaluation Metrics:
● Precision: Measures the proportion of correctly identified fraud cases out of all predicted fraud
cases.
● Recall: Focuses on how many actual fraud cases were detected out of the total fraud cases
present.
● F1-score: Balances precision and recall, offering a comprehensive view of model performance
in the context of fraud detection.
● Class Imbalance: Fraud cases often constitute less than 1% of the dataset. A model predicting
all cases as "not fraud" can yield high accuracy but fail at identifying fraud effectively.
● Overfitting: Very high accuracy on the training data may indicate that the model has overfitted,
reducing its ability to generalize to unseen data.
● Evaluation Metrics: For imbalanced datasets, metrics like precision, recall,
F1-score, and ROC AUC provide deeper insights into the model’s true performance.
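The class imbalance point can be checked with a small worked example. The labels below are hypothetical (10 fraud cases in 1,000 transactions, matching the 1% figure), and the "model" simply predicts "not fraud" every time.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 10 fraud cases (1) among 1,000 transactions (1% positive).
y_true = [1] * 10 + [0] * 990
# A trivial "model" that predicts "not fraud" (0) for every transaction.
y_pred_all_negative = [0] * 1000

accuracy = accuracy_score(y_true, y_pred_all_negative)
# zero_division=0 returns 0.0 instead of warning when nothing is predicted positive.
precision = precision_score(y_true, y_pred_all_negative, zero_division=0)
recall = recall_score(y_true, y_pred_all_negative, zero_division=0)
f1 = f1_score(y_true, y_pred_all_negative, zero_division=0)

print(accuracy, precision, recall, f1)
```

Accuracy comes out at 0.99 while precision, recall, and F1 are all 0.0: the model looks excellent by accuracy alone yet detects no fraud at all.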
Key Takeaways:
Conclusion:
This project underscored the critical importance of advanced data cleaning, iterative model building,
and comprehensive evaluation to handle real-world challenges. AI emerged as a valuable asset,
offering efficient optimization and reliable performance, especially in resource-constrained
environments. By integrating manual techniques with automated solutions, the project achieved a
robust and fair model that effectively addresses the complexities of fraud detection, providing
actionable insights and scalable solutions.