DWDM Mid Project
UNIVERSITY-BANGLADESH
Faculty of Science and Technology
Dataset Link:
https://www.kaggle.com/datasets/rabieelkharoua/consumer-electronics-sales-dataset
Dataset Description:
The “Predict Consumer Electronics Sales Dataset” provides insights into consumer electronics sales and aims to analyze the factors influencing purchase intent in the consumer electronics market. The dataset consists of 9,000 samples. Each instance includes Product ID, Product Category (e.g., Smartphones, Laptops), Product Brand (e.g., Apple, Samsung), Product Price, Customer Age, Customer Gender (0 = Male, 1 = Female), Purchase Frequency, Customer Satisfaction (1 to 5), and Purchase Intent (0 = No, 1 = Yes). The Product ID variable will be discarded since it has no effect on the target variable, Purchase Intent. Product Price, Customer Age, and Purchase Frequency are numerical variables that will be converted to categorical variables. The dataset provides valuable information for building a model to understand and predict customer purchase intent.
Implemented Code:
The code was written in Python using Google Colab.
The `pandas` library is imported to read the CSV file containing the dataset, which is stored in the files section of Google Colab. The `head` function is used to print the first 5 samples of the dataset.
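A minimal sketch of this step follows; the file name `consumer_electronics_sales_data.csv` is an assumption.

import pandas as pd

# Read the dataset stored in the Colab files section (file name assumed)
df = pd.read_csv('consumer_electronics_sales_data.csv')

# Print the first 5 samples of the dataset
df.head()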
2. Drop 'ProductID' column
df = df.drop('ProductID', axis=1)
df.head()
The `drop` function is used to drop the ‘ProductID’ column (`axis=1` refers to columns), which has no effect on the target variable.
The `isna` function is used to locate missing values, and the `sum` function is used to calculate the total number of missing values in each column. As can be seen, none of the variables contain any missing values.
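A sketch of the missing-value check described above:

# Count missing values in each column; all counts should be 0
df.isna().sum()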
df.head()
The `pd.cut` function is used to segment the data into the specified bins and labels. The
`right=False` parameter ensures that the bin intervals are closed on the left and open on the right,
meaning the rightmost edge of the interval is excluded from the bin.
This results in categorizing the ‘CustomerAge’ column into Young (15-30), Middle-age (31-50),
and Old-age (51-70), the ‘PurchaseFrequency’ column into Occasional (1-5), Regular (6-15), and
Premium (16-20), and the ‘ProductPrice’ column into Low (1-1000), Medium (1001-2000), and
High (2001-3000).
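The binning code can be sketched as follows; the exact bin edges are reconstructed from the ranges above, using `right=False` so each bin covers [left, right):

# Bin the numerical variables into the categories described above
df['CustomerAge'] = pd.cut(df['CustomerAge'], bins=[15, 31, 51, 71],
                           labels=['Young', 'Middle-age', 'Old-age'], right=False)
df['PurchaseFrequency'] = pd.cut(df['PurchaseFrequency'], bins=[1, 6, 16, 21],
                                 labels=['Occasional', 'Regular', 'Premium'], right=False)
df['ProductPrice'] = pd.cut(df['ProductPrice'], bins=[1, 1001, 2001, 3001],
                            labels=['Low', 'Medium', 'High'], right=False)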
5. Renaming Categories
df['CustomerGender'] = df['CustomerGender'].replace({0: 'Male', 1: 'Female'})
df.head()
The `replace` function is used to rename numbered categories into more readable category names.
a) Replaced 0 with 'Male' and 1 with 'Female' in the 'CustomerGender' column.
b) Replaced 1 with 'Dissatisfied', 2 with 'Somewhat Dissatisfied', 3 with 'Neutral', 4 with 'Satisfied', and 5 with 'Very Satisfied' in the 'CustomerSatisfaction' column.
c) Replaced 0 with 'No' and 1 with 'Yes' in the 'PurchaseIntent' column.
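Only the 'CustomerGender' replacement is shown above; the replacements in (b) and (c) can be sketched as:

df['CustomerSatisfaction'] = df['CustomerSatisfaction'].replace(
    {1: 'Dissatisfied', 2: 'Somewhat Dissatisfied', 3: 'Neutral',
     4: 'Satisfied', 5: 'Very Satisfied'})
df['PurchaseIntent'] = df['PurchaseIntent'].replace({0: 'No', 1: 'Yes'})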
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for column in ['ProductCategory', 'ProductBrand', 'ProductPrice', 'CustomerAge',
               'CustomerGender', 'PurchaseFrequency', 'CustomerSatisfaction', 'PurchaseIntent']:
    df[column] = le.fit_transform(df[column])
X = df.drop('PurchaseIntent', axis=1)
y = df['PurchaseIntent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape, X_test.shape
Necessary modules from the `scikit-learn` library are imported. `train_test_split` is used to split
the dataset into training and testing sets, while `LabelEncoder` is used to convert categorical
variables into numerical values.
For each column, the `fit_transform` method of LabelEncoder is applied. This method learns
unique values, assigns them numeric codes, and converts the categorical values in the column to
their corresponding numeric codes, making them suitable for machine learning algorithms that
require numeric input.
The dataset is separated into features and the target variable. `X` contains all columns except
'PurchaseIntent', representing the input features for the model. `y` contains only the
'PurchaseIntent' column, which is the output or target variable the model will predict.
The `train_test_split` function is used to divide the data into training and testing subsets. `X_train` and `y_train` are the feature and target subsets used for training the model, while `X_test` and `y_test` are the feature and target subsets used for testing and evaluating it. `test_size=0.3` specifies that 30% of the data should be reserved for testing and 70% used for training, and `random_state=42` ensures that the split is reproducible by setting a fixed seed for random number generation.
Finally, the `shape` attribute is used to display the shapes (i.e., dimensions) of the training and testing sets.
import numpy as np

def gaussian_naive_bayes(X_train, y_train, X_test):  # function name assumed
    # Prior probability of each class: class count / total training samples
    classes, counts = np.unique(y_train, return_counts=True)
    priors = {cls: count / len(y_train) for cls, count in zip(classes, counts)}
    # Per-class mean and standard deviation of each feature
    means = {}
    stds = {}
    for cls in classes:
        cls_data = X_train[y_train == cls]
        means[cls] = np.mean(cls_data, axis=0)
        stds[cls] = np.std(cls_data, axis=0)
    # Gaussian log-likelihood of each test sample under each class,
    # summed over features, plus the log prior
    probs = []
    for cls in classes:
        class_prob = np.sum(-0.5 * ((X_test - means[cls]) ** 2) / (stds[cls] ** 2)
                            - 0.5 * np.log(2 * np.pi * (stds[cls] ** 2)), axis=1)
        probs.append(class_prob + np.log(priors[cls]))
    # Predict the class with the highest log-probability
    y_pred = classes[np.argmax(np.array(probs), axis=0)]
    return y_pred
The `NumPy` library is imported; it is used for numerical operations such as array manipulation and mathematical calculations.
A function is defined and called that implements the Gaussian Naive Bayes algorithm.
a) To compute the prior probabilities of each class, `np.unique(y_train, return_counts=True)` gets the unique classes and their counts from the training labels, and `priors` divides the count of each class by the total number of training samples.
b) The mean and standard deviation of each feature for each class in the training data are computed.
c) Then the log-probabilities of each test sample belonging to each class are computed.
d) Finally, the predicted class for each test sample is determined by selecting the class with the highest probability.
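For reference, step (c) evaluates, for each class $c$ and test sample $x$, the log posterior up to a constant (written here in LaTeX):

\log P(c \mid x) \propto \log P(c) + \sum_{j}\left[-\tfrac{1}{2}\log\left(2\pi\sigma_{c,j}^{2}\right) - \frac{(x_{j}-\mu_{c,j})^{2}}{2\sigma_{c,j}^{2}}\right]

where $\mu_{c,j}$ and $\sigma_{c,j}$ are the mean and standard deviation of feature $j$ within class $c$; this is exactly the expression summed over features in the code above.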
The accuracy of the model is calculated and printed, where `np.mean(y_pred == y_test)` computes the fraction of predictions that match the true labels, and `accuracy*100` converts the accuracy into a percentage.
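A sketch of the call and the accuracy computation, assuming the function name used in the reconstruction above:

# Run the classifier and report accuracy as a percentage
y_pred = gaussian_naive_bayes(X_train, y_train, X_test)
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy * 100:.2f}%')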
Conclusion
The Gaussian Naive Bayes classifier achieved an accuracy of 80.15% in predicting customer purchase intent from the consumer electronics sales data. This performance suggests reasonable predictive capability, though future work could improve results through better data preprocessing, feature engineering, and comparison with other algorithms.