RANDOM FOREST (Binary Classification)

The document describes a machine learning workflow for binary classification of honey samples using spectral data. It covers: 1) importing common Python libraries for data processing, modeling, and visualization; 2) loading a CSV dataset and splitting it into features (spectral bands) and a binary target; 3) training a random forest model on 80% of the data and evaluating its predictions on the remaining 20%. Performance is reported with accuracy, a confusion matrix, and a classification report, and a feature importance plot shows which spectral bands are most predictive.

Uploaded by

Noor Ul Haq


CODE:

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

EXPLANATION:
1. import numpy as np: This line imports the NumPy library as np. NumPy is a fundamental library
for numerical computations in Python, and it provides support for arrays and matrices, which are
commonly used in machine learning.

2. import pandas as pd: This line imports the Pandas library as pd. Pandas is another essential
library for data manipulation and analysis in Python, often used to work with structured data,
such as CSV files or data tables.

3. Comments (# data processing, CSV file I/O...): The text after each # is an inline comment that
states what the imported library is used for. Comments are ignored when the code runs.

4. import os: This line imports the os module, which provides a way to interact with the operating
system. It is used to perform file and directory operations.

5. for dirname, _, filenames in os.walk('/kaggle/input'): This line starts a loop that uses the os.walk
function to traverse the directory tree starting from the '/kaggle/input' directory. It retrieves
three values in each iteration:

 • dirname: The current directory being explored.

 • _: A list of subdirectories in the current directory (not used in this loop).

 • filenames: A list of filenames in the current directory.

6. for filename in filenames: This line starts another loop to iterate over the list of filenames
obtained in the previous step.

7. print(os.path.join(dirname, filename)): In this line, os.path.join() is used to combine the current
dirname and filename into a full path, and then print() is used to display that full path to the
console. This effectively lists all the files in the '/kaggle/input' directory and its subdirectories.
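
The directory walk described above can be tried outside Kaggle. The following is a minimal sketch that builds a small temporary directory tree and lists it the same way; the helper name list_files and the file names are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def list_files(root):
    """Return the full path of every file under root, including subdirectories."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# Build a throwaway tree so the sketch runs anywhere (file names are made up).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "honey"))
    for rel in ("readme.txt", os.path.join("honey", "samples.csv")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("placeholder\n")
    # Store paths relative to root so the output does not depend on the temp location.
    found = [os.path.relpath(p, root) for p in list_files(root)]

print(found)
```

Collecting the paths in a list instead of printing them directly makes it easy to filter them afterwards, for example to keep only the .csv files.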

CODE:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Load the dataset (replace the path with the actual file path of your dataset)
data = pd.read_csv('/kaggle/input/honey-adulteration/adulteration.csv')

# Split the data into features (X) and target (y)
X = data.iloc[:, 4:-1]  # Select spectral band columns as features (skips the first four columns and the last one)
y = data['Class']  # Target variable

# Map 'Class' to binary labels ('Clover' to 1 and others to 0)
y = (y == 'Clover').astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (optional but often recommended)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit the scaler on the training data only
X_test = scaler.transform(X_test)  # Reuse the training statistics on the test data

# Initialize and train a classification model (Random Forest in this example)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the evaluation results
print(f'Accuracy: {accuracy:.2f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_rep}')

# Plot feature importances (if applicable to your model)
feature_importances = clf.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importances)), feature_importances)
plt.xlabel('Spectral Bands')
plt.ylabel('Feature Importance')
plt.title('Feature Importance for Binary Classification')
plt.show()

EXPLANATION:
This code snippet demonstrates a workflow for building and evaluating a binary classification model
using a dataset with spectral band features. Here's a step-by-step explanation:

1. Importing Libraries:

 • Necessary libraries such as pandas, numpy, scikit-learn, and matplotlib are imported to
perform data manipulation, model building, and visualization tasks.

2. Loading the Dataset:

 • The dataset is loaded from a CSV file ('adulteration.csv') using pandas and stored in a
DataFrame named data.

3. Splitting Features and Target:

 • The features (X) are selected from the DataFrame with data.iloc[:, 4:-1], which skips the
first four columns (assumed not to be needed for modeling) and the last column. The
remaining columns are assumed to represent spectral band data.

 • The target variable (y) is extracted from the 'Class' column and converted to binary labels:
'Clover' is mapped to 1 and every other class to 0. The code therefore separates 'Clover'
from all other classes; whether 1 also means "adulterated" depends on what the 'Class'
column encodes in this dataset.

4. Splitting the Dataset:

 • The dataset is split into training and testing sets using the train_test_split function from
scikit-learn. This is a common practice for evaluating machine learning models. Here,
80% of the data is used for training, and 20% is used for testing.

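5. Standardizing Features (Optional):

 • The features are standardized using the StandardScaler from scikit-learn.
Standardization scales the features to have a mean of 0 and a standard deviation of 1,
which can help some machine learning algorithms perform better. The scaler is fitted on
the training set only and then applied to the test set, so no test-set statistics leak into
training. Tree-based models such as Random Forest are largely insensitive to feature
scaling, so the step is optional here.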

6. Initializing and Training a Classification Model (Random Forest):

 • A binary classification model is initialized using the RandomForestClassifier from
scikit-learn. This model is used to learn the relationship between spectral band features
and the binary target variable.

 • The model is trained on the standardized training data using the fit method.

7. Making Predictions:

 • The trained model is used to make predictions on the test set using the predict method.

8. Evaluating Model Performance:

 • The code calculates the accuracy of the model's predictions using the accuracy_score
function from scikit-learn.

 • The confusion matrix is computed using the confusion_matrix function, providing
information about true positives, true negatives, false positives, and false negatives.

 • A classification report is generated using the classification_report function, which
includes precision, recall, F1-score, and support for both classes.

9. Printing Evaluation Results:

 • The accuracy, confusion matrix, and classification report are printed to assess the
model's performance.

10. Plotting Feature Importances (if applicable):

 • If the model supports feature importance analysis (as Random Forest does), the code
calculates and plots feature importances. This helps understand which spectral bands
contribute most to the classification decision.
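
Step 3 above can be tried on a toy table. The sketch below is only an illustration: the metadata column names ('Brand', 'Concentration', 'Batch', 'Notes') and all values are made up, and placing 'Class' among the first four columns is an assumption, not something the original dataset confirms. Only the two band names are taken from the wavelengths mentioned in this document.

```python
import pandas as pd

# Toy table: four leading metadata columns, two band columns, one trailing column.
# Column names other than 'Class' and the two bands are hypothetical.
data = pd.DataFrame({
    "Brand": ["A", "B", "C", "D"],
    "Class": ["Clover", "Manuka", "Clover", "Acacia"],
    "Concentration": [0, 5, 10, 0],
    "Batch": [1, 1, 2, 2],
    "399.40nm": [0.11, 0.12, 0.13, 0.14],
    "404.39nm": [0.21, 0.22, 0.23, 0.24],
    "Notes": ["", "", "", ""],
})

# Same slicing as in the workflow: drop the first four columns and the last one.
X = data.iloc[:, 4:-1]

# Same mapping as in the workflow: 'Clover' becomes 1, everything else 0.
y = (data["Class"] == "Clover").astype(int)

print(list(X.columns))
print(y.tolist())
```

Printing the selected column names is a quick way to confirm that a positional slice such as 4:-1 really picks the spectral bands and nothing else.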

Overall, this code provides a complete example of a binary classification workflow, including data
preprocessing, model training, evaluation, and feature importance analysis.
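
To make the evaluation step more concrete, here is a small sketch of how the four cells of a binary confusion matrix relate to accuracy, precision, and recall. The label lists are hypothetical and are not results from the honey dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive class, 0 = negative class).
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_test, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

The hand-computed precision and recall should agree with precision_score and recall_score from scikit-learn, which are the values that classification_report prints for the positive class.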

GRAPH EXPLANATION:
The graph in the code is used to visualize the feature importances when using a Random Forest classifier
for binary classification. This visualization helps you understand which spectral bands (features) are the
most important for making classification decisions. Here's an explanation of the graph:

1. Feature Importances: In a machine learning model like Random Forest, feature importances
represent how much each feature (spectral band, in this case) contributes to the model's
predictions. scikit-learn derives them from the impurity reduction each feature achieves
across the trees and normalizes them so that they sum to 1. Higher feature importance
indicates that the feature is more influential in making classification decisions.

2. x-axis (Spectral Bands): The x-axis of the graph represents the spectral bands used as features.
Because the code plots range(len(feature_importances)), the ticks show the position of each
band (0, 1, 2, and so on) rather than its name. Each position corresponds to a specific
wavelength in the hyperspectral data, such as 399.40nm, 404.39nm, and so on. These bands
are the input features for the model.

3. y-axis (Feature Importance): The y-axis represents the feature importance scores. It quantifies
the importance of each spectral band in the classification process. Higher values indicate more
important features.

4. Bars: Each bar in the graph corresponds to a specific spectral band. The height of the bar
represents the feature importance score for that band. The taller the bar, the more important
that particular band is in making classification decisions.

5. Interpretation: By looking at this graph, you can identify which spectral bands have the most
significant impact on whether honey is classified as 'Clover' (positive class) or not. Bands with
higher feature importance are more informative for distinguishing between the two classes.

6. Usage: You can use this information to potentially reduce the number of features (spectral
bands) in your model if some bands are less important. It can also provide insights into the
underlying characteristics of the data and help focus further analysis on specific wavelengths
that are highly relevant for classification.

The graph is a valuable tool for feature selection and understanding the key factors contributing to the
model's decisions. It can guide decisions on feature engineering, model improvement, and domain-
specific insights about the dataset.
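
Since the bar chart shows only positions on the x-axis, it can help to pair each importance with its column name and sort the result. The following is a minimal sketch on synthetic data; the column names band_0 to band_4 and the rule that only band_2 drives the label are assumptions made for the example, not properties of the honey dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for spectral data: 200 samples, 5 hypothetical bands.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=[f"band_{i}" for i in range(5)])

# By construction, only band_2 carries information about the label.
y = (X["band_2"] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Attach column names to the importances and sort from most to least important.
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances.head(3))
```

Neighbouring spectral bands are often strongly correlated, in which case a random forest tends to spread importance across them, so a low bar for a single band does not by itself show that the wavelength region is uninformative.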
