
AMERICAN INTERNATIONAL UNIVERSITY-BANGLADESH
Faculty of Science and Technology

Project Cover Page


Assignment Title: Implementation of Naïve Bayes Algorithm
Assignment No: 01 Date of Submission: 16 July 2024
Course Title: Data Warehousing and Data Mining
Course Code: CSC4285 Section: A
Semester: Summer 2023-24 Course Teacher: Dr. Akinul Islam Jony

Declaration and Statement of Authorship:


1. I/we hold a copy of this Assignment/Case-Study, which can be produced if the original is lost/damaged.
2. This Assignment/Case-Study is my/our original work and no part of it has been copied from any other student’s work or
from any other source except where due acknowledgement is made.
3. No part of this Assignment/Case-Study has been written for me/us by any other person except where such collaboration has been authorized by the concerned teacher and is clearly acknowledged in the assignment.
4. I/we have not previously submitted, and am/are not currently submitting, this work for any other course/unit.
5. This work may be reproduced, communicated, compared and archived for the purpose of detecting plagiarism.
6. I/we give permission for a copy of my/our marked work to be retained by the Faculty for review and comparison,
including review by external examiners.
7. I/we understand that plagiarism is the presentation of the work, idea or creation of another person as though it is your own. It is a form of cheating and is a very serious academic offence that may lead to expulsion from the University. Plagiarized material can be drawn from, and presented in, written, graphic and visual form, including electronic data and oral presentations. Plagiarism occurs when the origin of the material used is not appropriately cited.
8. I/we also understand that enabling plagiarism is the act of assisting or allowing another person to plagiarize or to copy
my/our work.

* Student(s) must complete all details except the faculty use part.
** Please submit all assignments to your course teacher or the office of the concerned teacher.

Group Name/No.: -

No Name ID Program Signature


1 Shakibul Hasan 21-45263-2 BSc [CSE]
2 Srabone Raxit 21-45038-2 BSc [CSE]
3 Ashik Ahamed 21-45368-2 BSc [CSE]
4 Irtiza Ahsan Abir 21-45009-2 BSc [CSE]

Faculty use only


FACULTY COMMENTS

Marks Obtained

Total Marks

Assignment/Case-Study Cover; © AIUB-2020


Project Description:
The purpose of this project is to implement the Naïve Bayes algorithm on a dataset. The labeled
dataset will be preprocessed by locating missing values, correcting invalid or noisy values,
dropping columns that do not impact the target variable, and converting numerical variables to
categorical variables, as the Naïve Bayes algorithm works best with categorical data. The dataset
will be trained, and unseen samples will be used to predict the outcomes. The percentage of
successful predictions, i.e., the accuracy of the model, will be calculated.

Dataset Link:
https://www.kaggle.com/datasets/rabieelkharoua/consumer-electronics-sales-dataset

Dataset Description:
The “Predict Consumer Electronics Sales Dataset” provides insights into consumer electronics
sales and aims to analyze factors influencing purchase intent in the consumer electronics market.
The dataset consists of 9,000 samples. The attributes in this dataset include Product ID,
Product Category (e.g., Smartphones, Laptops), Product Brand (e.g., Apple, Samsung), Product
Price, Customer Age, Customer Gender (0 - Male, 1 - Female), Purchase Frequency, Customer
Satisfaction (1 to 5), and Purchase Intent (0 - No, 1 - Yes). The Product ID variable will be
discarded since it does not impact the target variable, Purchase Intent. Product Price, Customer
Age, and Purchase Frequency are numerical variables that will be converted to categorical
variables. The dataset provides valuable information for building a model to understand and predict
customer purchase intent.

Implemented Code:
The code was written in Python and run in Google Colab.

1. Import csv file


import pandas as pd
df = pd.read_csv('/content/Consumer_Electronics_Sales_Data.csv')
df.head()

The `pandas` library is imported to read the CSV file containing the dataset, which was stored in
the files section of Google Colab. The `head` function prints the first five rows of the
dataset.

2. Drop 'ProductID' column
df = df.drop('ProductID', axis=1)
df.head()

The `drop` function removes the ‘ProductID’ column (`axis=1` means column), since it has no
effect on the target variable.

3. Count missing values in each column


missing_values_count = df.isna().sum()
print(missing_values_count)

The `isna` function is used to locate missing values, and the `sum` function then counts them in
each column. As the output shows, none of the columns contain any missing values.
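
No imputation is needed here; still, had missing values been found, a minimal sketch (assuming the same DataFrame `df`, with median fill for numerical columns and most-frequent fill for the rest) might look like this:

# Hypothetical: fill any missing values before further preprocessing
for column in df.columns:
    if df[column].isna().any():
        if pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].fillna(df[column].median())   # numeric: median
        else:
            df[column] = df[column].fillna(df[column].mode()[0])  # categorical: most frequent value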

4. Categorizing the numerical variables


bins = [15, 30, 50, 70]
labels = ['Young', 'Middle-age', 'Old-age']
df['CustomerAge'] = pd.cut(df['CustomerAge'], bins=bins, labels=labels, right=False)

bins = [1, 5, 15, 20]
labels = ['Occasional', 'Regular', 'Premium']
df['PurchaseFrequency'] = pd.cut(df['PurchaseFrequency'], bins=bins, labels=labels, right=False)

bins = [1, 1000, 2000, 3000]
labels = ['Low', 'Medium', 'High']
df['ProductPrice'] = pd.cut(df['ProductPrice'], bins=bins, labels=labels, right=False)

df.head()

The `pd.cut` function is used to segment the data into the specified bins and labels. The
`right=False` parameter ensures that the bin intervals are closed on the left and open on the right,
meaning the rightmost edge of the interval is excluded from the bin.

With `right=False`, this categorizes the ‘CustomerAge’ column into Young [15, 30), Middle-age
[30, 50), and Old-age [50, 70); the ‘PurchaseFrequency’ column into Occasional [1, 5), Regular
[5, 15), and Premium [15, 20); and the ‘ProductPrice’ column into Low [1, 1000), Medium
[1000, 2000), and High [2000, 3000). Note that a value equal to the last bin edge (e.g., an age of
exactly 70) falls outside every bin and becomes NaN, so the upper edges must sit above the
maximum values in the data.
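
As a quick illustration of the `right=False` edge behavior, a minimal sketch with hypothetical ages:

ages = pd.Series([15, 29, 30, 50, 69, 70])
out = pd.cut(ages, bins=[15, 30, 50, 70],
             labels=['Young', 'Middle-age', 'Old-age'], right=False)
print(out.tolist())
# ['Young', 'Young', 'Middle-age', 'Old-age', 'Old-age', nan]
# 30 lands in [30, 50), and exactly 70 falls outside every bin.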

5. Renaming Categories
df['CustomerGender'] = df['CustomerGender'].replace({0: 'Male', 1: 'Female'})

df['CustomerSatisfaction'] = df['CustomerSatisfaction'].replace({1: 'Dissatisfied',
    2: 'Somewhat Dissatisfied', 3: 'Neutral', 4: 'Satisfied', 5: 'Very Satisfied'})

df['PurchaseIntent'] = df['PurchaseIntent'].replace({0: 'No', 1: 'Yes'})

df.head()

The `replace` function is used to rename numbered categories into more readable category names:
a) Replaced 0 with 'Male' and 1 with 'Female' in the 'CustomerGender' column.
b) Replaced 1 with 'Dissatisfied', 2 with 'Somewhat Dissatisfied', 3 with 'Neutral', 4 with
'Satisfied', and 5 with 'Very Satisfied' in the 'CustomerSatisfaction' column.
c) Replaced 0 with 'No' and 1 with 'Yes' in the 'PurchaseIntent' column.

6. Splitting dataset for ‘Train’ and ‘Test’


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for column in ['ProductCategory', 'ProductBrand', 'ProductPrice', 'CustomerAge',
               'CustomerGender', 'PurchaseFrequency', 'CustomerSatisfaction', 'PurchaseIntent']:
    df[column] = le.fit_transform(df[column])

X = df.drop('PurchaseIntent', axis=1)
y = df['PurchaseIntent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train.shape, X_test.shape

Necessary modules from the `scikit-learn` library are imported. `train_test_split` is used to split
the dataset into training and testing sets, while `LabelEncoder` is used to convert categorical
variables into numerical values.

For each column, the `fit_transform` method of LabelEncoder is applied. This method learns
unique values, assigns them numeric codes, and converts the categorical values in the column to
their corresponding numeric codes, making them suitable for machine learning algorithms that
require numeric input.
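
A minimal sketch (with hypothetical values) of what `fit_transform` does to one column:

le = LabelEncoder()
codes = le.fit_transform(['Low', 'High', 'Medium', 'Low'])
print(list(le.classes_))  # ['High', 'Low', 'Medium'] -- classes are sorted alphabetically
print(codes)              # [1 0 2 1]

Note that the codes follow alphabetical order, so ordinal labels such as 'Low' < 'Medium' < 'High' do not keep their natural order once encoded; the Gaussian model below treats these codes as plain numbers.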

The dataset is separated into features and the target variable. `X` contains all columns except
'PurchaseIntent', representing the input features for the model. `y` contains only the
'PurchaseIntent' column, which is the output or target variable the model will predict.

The `train_test_split` function is used to divide the data into training and testing subsets.
`X_train` and `y_train` are the feature and target subsets used for training the model, while
`X_test` and `y_test` are used for testing and evaluating it. `test_size=0.3` specifies that 30%
of the data is reserved for testing and 70% for training, and `random_state=42` makes the split
reproducible by fixing the seed for random number generation.

Finally, the `shape` attribute is used to display the dimensions of the training and testing sets;
with 9,000 samples, 7 feature columns, and a 30% test split, the expected shapes are (6300, 7)
and (2700, 7).

7. The Naïve Bayes algorithm


import numpy as np

def gaussian_naive_bayes(X_train, y_train, X_test):
    # Prior probability of each class: class count / total training samples
    classes, counts = np.unique(y_train, return_counts=True)
    priors = counts / len(y_train)

    # Per-class mean and standard deviation of each feature
    means = {}
    stds = {}
    for cls in classes:
        cls_data = X_train[y_train == cls]
        means[cls] = np.mean(cls_data, axis=0)
        stds[cls] = np.std(cls_data, axis=0)

    # Log-posterior (up to a constant) of each test sample under each class
    probs = []
    for cls in classes:
        class_prob = np.sum(-0.5 * ((X_test - means[cls]) ** 2) / (stds[cls] ** 2)
                            - 0.5 * np.log(2 * np.pi * (stds[cls] ** 2)), axis=1)
        probs.append(class_prob + np.log(priors[cls]))

    # Predict the class with the highest log-posterior
    y_pred = classes[np.argmax(probs, axis=0)]

    return y_pred

y_pred = gaussian_naive_bayes(X_train, y_train, X_test)

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy*100:.2f}%')

The `NumPy` library is imported for numerical operations such as array manipulation and
mathematical calculations.

A function that implements the Gaussian Naive Bayes algorithm is defined and then called.
a) To compute the prior probabilities, `np.unique(y_train, return_counts=True)` gets the unique
classes and their counts from the training labels, and `priors` divides the count of each class
by the total number of training samples.
b) The mean and standard deviation of each feature for each class in the training data are
computed.
c) Then the log-probability of each test sample belonging to each class is computed (the
underlying formula is written out after this list).
d) Finally, the predicted class for each test sample is determined by selecting the class with
the highest posterior log-probability.
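
For reference, steps (c) and (d) together compute, for each class c and test sample x, the following quantity (a sketch in standard Gaussian Naive Bayes notation, where \mu_{c,j} and \sigma_{c,j} are the per-class mean and standard deviation of feature j from step (b)):

\log P(c \mid x) \propto \log P(c) + \sum_{j}\left[-\frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2} - \frac{1}{2}\log\left(2\pi\sigma_{c,j}^2\right)\right], \qquad \hat{y} = \arg\max_{c}\, \log P(c \mid x)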

The accuracy of the model is calculated and printed: `np.mean(y_pred == y_test)` computes the
fraction of predictions that match the true labels, and `accuracy*100` converts that fraction
into a percentage.

Conclusion
The Gaussian Naive Bayes classifier achieved an accuracy of 80.15% in predicting customer
purchase intent based on consumer electronics sales data. This suggests the model captures
meaningful patterns in the data, though future work could improve results through better data
preprocessing, feature engineering, and comparison with other algorithms.
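
As a possible sanity check on the hand-written implementation, a minimal sketch comparing it against scikit-learn's `GaussianNB` on the same split (the two accuracies should be close, though not necessarily identical, since scikit-learn applies a small variance-smoothing term by default):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

clf = GaussianNB()                 # reference implementation
clf.fit(X_train, y_train)          # same training split as above
sk_pred = clf.predict(X_test)
print(f'scikit-learn accuracy: {accuracy_score(y_test, sk_pred)*100:.2f}%')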

