Shaikh Assignment 3
1. How many clusters did you use initially? Please explain why you picked that number.
Initially, I built two cluster models: one with two clusters and one with three.
I chose the three-cluster model for my analysis based on the Silhouette coefficient,
a metric that assesses clustering quality. A higher Silhouette score indicates
better separation between clusters, and the score improved when I used three
clusters instead of two.
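The comparison described above can be sketched as follows. This is a minimal illustration, not the tool or data used in the assignment: the points and both labelings are made up, and `silhouette_score` is a hypothetical helper written here with only the standard library. It uses the usual definition, where each point scores (b - a) / max(a, b), with a the mean distance to its own cluster and b the mean distance to the nearest other cluster.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it adds nothing to the sum)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Hypothetical toy data: three visually separate groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 10), (5, 11), (6, 10)]
labels_3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # three-cluster assignment
labels_2 = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # two-cluster assignment (first two groups merged)

score_3 = silhouette_score(points, labels_3)
score_2 = silhouette_score(points, labels_2)
best_k = 3 if score_3 > score_2 else 2
print(f"k=2: {score_2:.3f}  k=3: {score_3:.3f}  chosen k: {best_k}")
```

On this toy data the three-cluster labeling scores higher because merging two distant groups inflates the within-cluster distances, which is the same kind of improvement that motivated choosing three clusters.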
3. What are the characteristics of that largest cluster (from question 2)? What makes it
different from the other clusters.
Cluster_1 is the largest group, consisting of 190 customers. This cluster likely includes a mix
of CustomerTypes, a range of IndustryTypes, and various FirmSizes. Its significance lies in its
size, representing a diverse customer base with varied preferences.
4. Which cluster has the highest product quality average? Why do you think it is the highest?
Base your opinion on the cluster’s characteristics.
The cluster with the highest average product quality is Cluster_1, with an average score of
8.58. This cluster likely has the highest product quality score because it contains a significant
number of customers (82, the largest group). It’s possible that these customers are more
engaged or loyal, leading to a higher perception of product quality. This could also indicate
that the products provided to this cluster match their preferences and needs better, which
contributes to a more favorable rating of product quality.
5. Which cluster has the lowest product quality average? Why do you think it is the lowest?
Base your opinion on the cluster’s characteristics.
The cluster with the lowest average product quality is Cluster_0, with an average score of
6.83. The lower product quality score for Cluster_0 could indicate that the customers in this
group are less satisfied with the products they are receiving. This might be due to a mismatch
between the products and their expectations or needs. Another possible reason could be that
this cluster represents a customer segment that is either more critical or less engaged, leading to
lower overall satisfaction with product quality.
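The size and product quality comparisons in questions 3 to 5 come down to grouping rows by cluster label and summarizing each group. A minimal sketch, assuming each row carries a cluster label and a product quality score; the rows below are invented for illustration and are not the assignment's data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (cluster label, product quality score).
rows = [
    ("Cluster_0", 6.5), ("Cluster_0", 7.0),
    ("Cluster_1", 8.5), ("Cluster_1", 9.0), ("Cluster_1", 8.0),
    ("Cluster_2", 7.5),
]

# Collect the product quality scores belonging to each cluster.
scores_by_cluster = defaultdict(list)
for cluster, quality in rows:
    scores_by_cluster[cluster].append(quality)

# Summarize each cluster by its size and its average product quality.
profile = {
    cluster: {"size": len(scores), "mean_quality": mean(scores)}
    for cluster, scores in scores_by_cluster.items()
}

largest = max(profile, key=lambda c: profile[c]["size"])
highest = max(profile, key=lambda c: profile[c]["mean_quality"])
lowest = min(profile, key=lambda c: profile[c]["mean_quality"])

for cluster, stats in sorted(profile.items()):
    print(f"{cluster}: size={stats['size']}, mean quality={stats['mean_quality']:.2f}")
print(f"largest={largest}, highest quality={highest}, lowest quality={lowest}")
```

Reporting size and mean quality side by side makes it easy to check whether the largest cluster is also the highest-rated one, rather than assuming the two coincide.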
6. To improve your analysis, what missing dimension would you add to your data set? Why?
7. Run the cluster again but prune your variables to simplify your model.
The goal of reducing the clusters to two is to simplify the analysis and create clearer
distinctions between groups. Both Cluster_0 and Cluster_1 contain the same number of
customers, 100 each.
The PCA generated 18 components. Each component is a linear combination of the original
variables, and together the components retain the data's variability.
2. Calculate the percentage variance and the cumulative percentage variance for each component
in your output file.
3. Include in your Word document a table (screenshot will work) that includes the percentage
variance, cumulative percentage variance, and the weights for each feature.
4. Based on the answers from question 2, how many components would you keep?
a. Why did you choose that number?
b. What was the cumulative percentage variance for all those components?
The cumulative percentage variance of the 5 components I chose to keep is 80.80%.
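The percentage variance and cumulative percentage variance can be computed from the components' eigenvalues (the variance each component explains). The sketch below uses made-up eigenvalues, not the assignment's output, and assumes an 80% cumulative threshold as the rule for deciding how many components to keep; that threshold is an illustrative choice, not something stated in the assignment.

```python
from itertools import accumulate

# Hypothetical eigenvalues (variance explained by each component), largest first.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

total = sum(eigenvalues)
# Percentage variance: each component's share of the total variance.
pct_variance = [100 * ev / total for ev in eigenvalues]
# Cumulative percentage variance: running sum of those shares.
cum_pct_variance = list(accumulate(pct_variance))

# Assumed rule: keep the fewest components whose cumulative share reaches 80%.
threshold = 80.0
n_keep = next(
    i for i, cum in enumerate(cum_pct_variance, start=1) if cum >= threshold
)

print("Component  % variance  cumulative %")
for i, (pct, cum) in enumerate(zip(pct_variance, cum_pct_variance), start=1):
    print(f"PC{i:<8} {pct:>10.2f} {cum:>13.2f}")
print(f"Components kept at the {threshold:.0f}% threshold: {n_keep}")
```

The cumulative column always ends at 100% when every component is included, so the choice of how many to keep is a trade-off between a simpler model and the share of variance given up.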
5. Using your answer from question 4 and the table from question 3, which features would
you keep?
a. Why did you choose those features?
7. Calculate the percentage variance and the cumulative percentage variance for each
component.
8. Include in your Word document a table (screenshot will work) that includes the
percentage variance, cumulative percentage variance, and the weights for each feature.
9. Based on the answers from question 7, how many components would you keep?
10. Using your answer from question 9 and the table from question 8, which features would you
keep?
11. From your own experience, what features should be added to this data set to make your
analysis better?
12. Why were you NOT allowed to use Satisfaction in your PCA?