Naïve Bayesian Classifier and K-Means Clustering
1. Concept Explanation
The Naïve Bayesian Classifier is a probabilistic machine-learning algorithm for classification tasks. It applies Bayes' theorem under the assumption that features are conditionally independent of one another once the class label is known. Despite this simplifying assumption, the classifier performs well in practice for tasks such as spam detection, sentiment analysis, and medical diagnosis.
Assumptions:
1. Conditional Independence – Features are assumed to be independent of one another given the class label.
2. Prior Probabilities Are Used – The model relies on prior knowledge about the classes (their base rates in the training data).
P(C|X) = P(X|C) · P(C) / P(X)

Where:
C is the class label, X is the observed feature vector, P(C) is the prior, P(X|C) is the likelihood, and P(C|X) is the posterior probability of the class given the features.
For multiple features X = (X1, X2, ..., Xn), the Naïve Bayes assumption simplifies this to:

P(C|X) = P(C) · Π_{i=1}^{n} P(Xi|C) / P(X)
2. Spam Detection Example
Spam detection involves categorizing emails into spam and valid messages (ham). The primary purpose is to develop a predictive model that identifies spam emails based on the words they contain.
Classification Objective
The goal is to determine the probability of an email being spam given a set of observed
words. This is achieved using the Naïve Bayes classifier, which assumes that the presence of
each word in the email is independent of the others, given the class label.
Dataset
Consider a small dataset of emails with the presence (1) or absence (0) of specific keywords (Free, Win, Money, Offer) and the spam label:

Email  Free  Win  Money  Offer  Spam
1      1     1    0      1      1
2      0     1    1      0      0
3      1     1    1      1      1
4      0     0    1      0      0
5      1     0    1      1      1
Calculate Priors:
P(Spam) = 3/5 = 0.6
P(Not Spam) = 2/5 = 0.4
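These priors can be checked directly from the label vector; a minimal NumPy sketch, assuming labels encoded as 1 = Spam and 0 = Not Spam:

```python
import numpy as np

# Labels of the five training emails (1 = Spam, 0 = Not Spam)
y_train = np.array([1, 0, 1, 0, 1])

# Priors are the relative class frequencies in the training set
p_spam = np.mean(y_train == 1)      # 3/5
p_not_spam = np.mean(y_train == 0)  # 2/5
print(p_spam, p_not_spam)
```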
Calculate Likelihoods:
From the three spam emails (1, 3, 5) and the two non-spam emails (2, 4):
P(Free=1 | Spam) = 3/3 = 1.00
P(Win=1 | Spam) = 2/3 ≈ 0.67
P(Money=1 | Spam) = 2/3 ≈ 0.67
P(Offer=0 | Spam) = 0/3 = 0.00
P(Free=1 | Not Spam) = 0/2 = 0.00
P(Win=1 | Not Spam) = 1/2 = 0.50
P(Money=1 | Not Spam) = 2/2 = 1.00
P(Offer=0 | Not Spam) = 2/2 = 1.00
Several of these counts are zero; in practice this is handled with Laplace smoothing, which scikit-learn's BernoulliNB applies by default.
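The likelihoods can be verified mechanically by counting over the table; a short NumPy sketch (column order Free, Win, Money, Offer, matching the dataset):

```python
import numpy as np

# Feature matrix (columns: Free, Win, Money, Offer) and labels (1 = Spam)
X = np.array([[1,1,0,1], [0,1,1,0], [1,1,1,1], [0,0,1,0], [1,0,1,1]])
y = np.array([1, 0, 1, 0, 1])

# Unsmoothed P(feature = 1 | class) as per-column relative frequencies
p1_spam = X[y == 1].mean(axis=0)
p1_ham = X[y == 0].mean(axis=0)
print("P(x=1 | Spam):    ", p1_spam)
print("P(x=1 | Not Spam):", p1_ham)
```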
Compute Posteriors:
# Imports
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Training dataset (columns: Free, Win, Money, Offer)
X_train = np.array([[1,1,0,1], [0,1,1,0], [1,1,1,1], [0,0,1,0], [1,0,1,1]])
y_train = np.array([1, 0, 1, 0, 1])  # 1 = Spam, 0 = Not Spam

# Model training (BernoulliNB applies Laplace smoothing by default)
nb_model = BernoulliNB()
nb_model.fit(X_train, y_train)

# Test email, assumed to be (Free=1, Win=1, Money=1, Offer=0)
X_test = np.array([[1, 1, 1, 0]])

# Prediction
prediction = nb_model.predict(X_test)
print("Prediction:", "Spam" if prediction[0] == 1 else "Not Spam")
Output:
Prediction: Spam
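The library's answer can be reproduced by hand. BernoulliNB applies Laplace smoothing (alpha = 1) by default, which removes the zero counts; the sketch below assumes a test email with Free=1, Win=1, Money=1, Offer=0:

```python
import numpy as np

X = np.array([[1,1,0,1], [0,1,1,0], [1,1,1,1], [0,0,1,0], [1,0,1,1]])
y = np.array([1, 0, 1, 0, 1])
x_test = np.array([1, 1, 1, 0])  # assumed test email

scores = {}
for c in (0, 1):
    Xc = X[y == c]
    # Laplace-smoothed P(feature = 1 | class): (count + 1) / (n_c + 2)
    p1 = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    # Unnormalized posterior: prior times product of per-feature terms
    likelihood = np.where(x_test == 1, p1, 1 - p1).prod()
    scores[c] = (len(Xc) / len(X)) * likelihood

label = max(scores, key=scores.get)
print("Spam" if label == 1 else "Not Spam")  # Spam
```

The smoothed spam score (0.6 · 0.8 · 0.6 · 0.6 · 0.2 ≈ 0.0346) exceeds the ham score (0.4 · 0.25 · 0.5 · 0.75 · 0.75 ≈ 0.0281), which agrees with the model's output.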
1. Concept Explanation
K-Means Clustering is an unsupervised learning algorithm that groups data points through their shared features. Clustering algorithms detect natural groupings in data without predefined categories. Clustering serves multiple functions, including market segmentation, anomaly detection, and image segmentation; for example, customers can be grouped by spending behavior and income levels, which allows businesses to target specific customer groups.
1. Centroid Selection:
o K initial centroids are chosen from the data, typically at random.
2. Cluster Assignment:
o Each data point is assigned to the nearest centroid based on the Euclidean
distance.
3. Centroid Updating:
o Compute the new centroid by taking the mean of all points in the cluster.
C_k = (1/n) Σ_{i=1}^{n} x_i

where:
C_k is the centroid of cluster k, and n is the number of points x_i currently assigned to that cluster.

The assignment in step 2 uses the Euclidean distance:

d(x, C_k) = √( Σ_{j=1}^{m} (x_j − C_kj)² )

where:
x is a data point, C_kj is the j-th coordinate of centroid C_k, and m is the number of features. Steps 2 and 3 are repeated until the assignments no longer change.
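The assignment and update steps above can be sketched in NumPy; this is a minimal single-iteration illustration (not a full convergence loop), using a tiny made-up dataset:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-Means iteration: assign each point to its nearest centroid, then update."""
    # Euclidean distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # cluster assignment step
    # Centroid update step: mean of the points in each cluster
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

# Two obvious groups and two initial centroids
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans_step(X, np.array([[0.0, 0.0], [10.0, 10.0]]))
print(labels)     # [0 0 1 1]
print(centroids)  # [[ 0.   0.5] [10.  10.5]]
```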
2. Customer Segmentation
A common application of K-Means is segmenting customers by their purchasing activities and financial capability. Businesses use this approach to target marketing and offers to each segment.
Customer  Income  Spending Score
1         15      39
2         16      81
3         17      6
4         18      77
5         20      40
6         24      94
7         25      3
8         30      73
9         35      92
10        40      8
Choose three initial centroids, here customers 1, 6, and 10:
C1 = (15, 39), C2 = (24, 94), C3 = (40, 8)
Using the Euclidean distance, compute each customer's distance to the three centroids and assign the customer to the nearest centroid.
New centroid for cluster 1 (mean of its assigned customers):
C1 = ((15+17+20+25+40)/5, (39+6+40+3+8)/5) = (23.4, 19.2)
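The centroid update can be confirmed with NumPy, assuming cluster 1 holds the five customers listed above:

```python
import numpy as np

# (income, spending score) pairs of the customers assigned to cluster 1
cluster_1 = np.array([[15, 39], [17, 6], [20, 40], [25, 3], [40, 8]])

# New centroid = per-coordinate mean of the member points
c1 = cluster_1.mean(axis=0)
print(c1)  # [23.4 19.2]
```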
import numpy as np
from sklearn.cluster import KMeans

# Customer data: (income, spending score) for the 10 customers above
X = np.array([[15,39], [16,81], [17,6], [18,77], [20,40],
              [24,94], [25,3], [30,73], [35,92], [40,8]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Cluster assignments
labels = kmeans.labels_
centroids = kmeans.cluster_centers_