Questions For Applied Multivariate Analysis
Factor Analysis
1. Define factor analysis. What is its main objective?
2. Explain the orthogonal factor model in factor analysis.
3. Describe the significance of factor loadings in factor analysis.
4. What is meant by the estimation of factor loadings?
5. Explain the concept of factor rotation in factor analysis. Why is it used?
6. Differentiate between orthogonal and oblique rotations in factor analysis.
7. What is the purpose of eigenvalues in factor analysis?
8. How does factor analysis differ from principal component analysis?
Logistic Regression
17. Explain how logistic regression handles binary classification.
18. Define the logit function in logistic regression.
19. What are the assumptions underlying logistic regression?
20. Differentiate between logistic regression and linear regression in terms of application.
21. Why is logistic regression considered a type of generalized linear model?
25. Factor Extraction: Given the covariance matrix below for three variables, extract the first two factors using
principal component analysis. Calculate the factor loadings for each variable on the factors. Covariance matrix:
Σ = [ 2    0.5  0.8 ]
    [ 0.5  1.5  0.3 ]
    [ 0.8  0.3  1.8 ]
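A minimal NumPy sketch of the extraction (assuming the standard principal-component route, where the loading of variable i on factor j is the j-th eigenvector of Σ scaled by the square root of its eigenvalue):

import numpy as np

# Covariance matrix from the question.
S = np.array([[2.0, 0.5, 0.8],
              [0.5, 1.5, 0.3],
              [0.8, 0.3, 1.8]])

# Eigendecomposition; eigh returns eigenvalues in ascending order, so reverse.
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Loadings on the first two factors: eigenvector times sqrt(eigenvalue).
loadings = vecs[:, :2] * np.sqrt(vals[:2])
print("eigenvalues:", vals[:2])
print("loadings:\n", loadings)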
26. Factor Rotation: For the factor loading matrix below, perform a varimax rotation and obtain the rotated factor
loadings. Factor loading matrix:
L = [ 0.8  0.3 ]
    [ 0.4  0.6 ]
    [ 0.7  0.5 ]
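A sketch of the rotation in NumPy (a hand-rolled varimax using the standard SVD iteration; a library routine such as factor_analyzer's Rotator would serve equally well):

import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    # Standard SVD-based varimax iteration (Kaiser criterion).
    p, k = L.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag((Lr**2).sum(axis=0))))
        R = u @ vt
        d = s.sum()
        if d_old != 0 and d < d_old * (1 + tol):
            break
        d_old = d
    return L @ R

L = np.array([[0.8, 0.3],
              [0.4, 0.6],
              [0.7, 0.5]])
print(varimax(L))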
Logistic Regression
34. For a logistic regression model with the equation log(p/(1 − p)) = 1.5 + 0.7x1 − 0.3x2, calculate the probability p of the event occurring when x1 = 2 and x2 = 1.
35. Suppose a logistic regression model gives the following coefficients: intercept = 0.5, x1 coefficient = 1.2. Calculate
the odds ratio associated with a one-unit increase in x1.
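A quick numeric check for both questions (plain Python; the linear predictor in question 34 is 1.5 + 0.7·2 − 0.3·1 = 2.6, and the odds ratio in question 35 is e^1.2):

import math

# Question 34: p = 1 / (1 + exp(-eta)) with eta = 1.5 + 0.7*2 - 0.3*1 = 2.6.
eta = 1.5 + 0.7 * 2 - 0.3 * 1
p = 1 / (1 + math.exp(-eta))
print(f"p = {p:.4f}")                        # about 0.931

# Question 35: odds ratio for a one-unit increase in x1 is exp(coefficient).
print(f"odds ratio = {math.exp(1.2):.4f}")   # about 3.320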
Logistic Regression
40. Logistic Regression Model Fitting: Suppose you are given the following data for a binary classification problem:
X1 X2 Y
2 3 0
4 6 0
5 7 1
6 8 1
8 9 1
Fit a logistic regression model to predict Y from X1 and X2 and report the coefficient estimates.
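One possible fit, assuming scikit-learn is acceptable; with five near-separable observations the estimates depend on the solver's default L2 penalty, so treat the output as illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2, 3], [4, 6], [5, 7], [6, 8], [8, 9]])
y = np.array([0, 0, 1, 1, 1])

# scikit-learn applies L2 regularization by default (strength 1/C).
model = LogisticRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("fitted P(Y=1):", model.predict_proba(X)[:, 1])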
Factor Analysis and Factor Rotation
42. Suppose you have the following covariance matrix for four variables:
Σ = [ 4    2    1    0.5 ]
    [ 2    3    0.8  1   ]
    [ 1    0.8  2    0.6 ]
Use maximum likelihood estimation to find the loadings of the first two factors and calculate the communalities of
each variable.
43. Varimax Rotation in Factor Analysis: Given the factor loading matrix below for three factors, perform a
varimax rotation to obtain the rotated factor loadings.
L = [ 0.7  0.2  0.5 ]
    [ 0.6  0.3  0.4 ]
    [ 0.8  0.5  0.1 ]
44. Derive the Bayes discriminant function for classifying new observations and classify the point [4, 4]^T using this function.
45. Performance Evaluation in Quadratic Discriminant Analysis (QDA): Using QDA, classify a set of observations for two populations:
• Population 1: µ1 = [2, 3]^T, Σ1 = [ 1  0 ; 0  2 ]
• Population 2: µ2 = [4, 6]^T, Σ2 = [ 2  0.5 ; 0.5  1 ]
Classify the following points: [3, 4]^T, [5, 5]^T, and [2, 6]^T. Calculate and discuss the misclassification rate if known population labels are available for these points.
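A sketch of the classification step with SciPy, assuming equal priors (the question does not state them): assign each point to the population with the larger Gaussian density, which is equivalent to comparing quadratic discriminant scores.

import numpy as np
from scipy.stats import multivariate_normal

mu1, S1 = np.array([2, 3]), np.array([[1.0, 0.0], [0.0, 2.0]])
mu2, S2 = np.array([4, 6]), np.array([[2.0, 0.5], [0.5, 1.0]])

for x in np.array([[3, 4], [5, 5], [2, 6]]):
    d1 = multivariate_normal.pdf(x, mean=mu1, cov=S1)   # density under pop. 1
    d2 = multivariate_normal.pdf(x, mean=mu2, cov=S2)   # density under pop. 2
    print(x, "-> Population", 1 if d1 > d2 else 2)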
Logistic Regression
46. Multiclass Logistic Regression: Consider a dataset with three classes (Class 1, Class 2, Class 3) and predictor
variables X1 and X2 . For the following observations:
Class X1 X2
1 2 3
1 3 4
2 5 6
2 6 7
3 8 9
3 9 10
Fit a multinomial logistic regression model to predict the class label based on X1 and X2 . Provide the coefficient
estimates and interpret their meaning.
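A numerical sketch with scikit-learn (an assumption; the question does not fix a package). The six points are perfectly separated, so the unpenalized MLE diverges; scikit-learn's default L2 penalty keeps the softmax coefficients finite:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2, 3], [3, 4], [5, 6], [6, 7], [8, 9], [9, 10]])
y = np.array([1, 1, 2, 2, 3, 3])

# With more than two classes, the default lbfgs solver fits one softmax
# (multinomial) model over all classes.
model = LogisticRegression().fit(X, y)
print("intercepts:", model.intercept_)
print("coefficients (one row per class):\n", model.coef_)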
47. Regularized Logistic Regression: You are given a dataset with predictor variables X1 , X2 , and a binary response
variable Y where Y = 1 indicates success and Y = 0 indicates failure. The data is as follows:
X1 X2 Y
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
Fit a logistic regression model with L2 regularization (Ridge) to predict Y based on X1 and X2 . Discuss the effect
of the regularization parameter on the coefficient estimates.
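A sketch of the regularization effect with scikit-learn, assuming its parameterization where C is the inverse penalty strength: smaller C means heavier shrinkage of the coefficients toward zero.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])

# C = 1/lambda: decreasing C strengthens the L2 (ridge) penalty.
for C in [100, 1, 0.01]:
    model = LogisticRegression(penalty="l2", C=C).fit(X, y)
    print(f"C={C:>6}: coef={model.coef_.ravel()}, intercept={model.intercept_}")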
48. Interpreting Model Coefficients: Consider a logistic regression model with the following form:
ln(P(Y = 1) / (1 − P(Y = 1))) = −1 + 0.3X1 − 0.5X2 + 0.7X3
Using the data point (X1, X2, X3) = (4, 2, 5), calculate the probability P(Y = 1). Explain the impact of each coefficient on the predicted probability for this data point.
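A quick numeric check: the log-odds at this point are −1 + 0.3·4 − 0.5·2 + 0.7·5 = 2.7, and each coefficient multiplies the odds by e^(coefficient) per unit increase in its predictor.

import math

log_odds = -1 + 0.3 * 4 - 0.5 * 2 + 0.7 * 5              # = 2.7
print(f"P(Y=1) = {1 / (1 + math.exp(-log_odds)):.4f}")   # about 0.937

# Odds multiplier per one-unit increase in each predictor.
for name, b in [("X1", 0.3), ("X2", -0.5), ("X3", 0.7)]:
    print(f"{name}: odds x {math.exp(b):.3f}")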
Introduction to Cluster Analysis
1. Define cluster analysis and its primary goal.
2. Explain the importance of cluster analysis in data mining.
3. What are the main applications of cluster analysis?
4. Given a dataset of 10 samples with two features each, calculate the Euclidean and Manhattan distances between
all sample pairs. Based on these distances, suggest an appropriate clustering method and justify your choice.
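A sketch of the pairwise computation with SciPy, using a hypothetical 10 × 2 sample since the question leaves the data unspecified:

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((10, 2))          # hypothetical data; the question supplies none

print("Euclidean:\n", np.round(squareform(pdist(X, "euclidean")), 3))
print("Manhattan:\n", np.round(squareform(pdist(X, "cityblock")), 3))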
Proximity Measures
9. Define a proximity measure and its role in clustering.
10. Mention any two proximity measures suitable for categorical data.
11. How does the choice of proximity measure impact clustering results?
12. Consider a dataset with the following five 2-dimensional points: (2, 3), (3, 3), (4, 4), (5, 5), and (6, 8). Compute the
proximity matrix using the Euclidean distance measure. Describe the role of proximity in clustering these points
and identify pairs of points with high similarity.
Types of Clustering
13. Differentiate between hierarchical and non-hierarchical clustering.
14. What is the primary characteristic of partitional clustering?
15. Describe one advantage and one disadvantage of hierarchical clustering.
K-nearest-neighbor Classifiers
25. Describe the K-nearest-neighbor (KNN) algorithm for classification.
26. What are the main requirements for implementing KNN classifiers?
27. How does the choice of K affect the performance of a KNN classifier?
28. You have a dataset with 6 observations in two-dimensional space. Four of these belong to Class A: (1, 2), (2, 3),
(3, 1), and (3, 2); and two belong to Class B: (6, 7) and (7, 8). Use K = 3 in the K-nearest-neighbor classification
algorithm to classify a new point at (5, 5). Explain the process and show all calculations.
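A minimal sketch of the K = 3 vote in NumPy. Note that three training points lie at exactly √13 from (5, 5), so the second and third neighbors depend on how that tie is broken; a stable sort picks the earlier-listed points.

import numpy as np
from collections import Counter

train = np.array([[1, 2], [2, 3], [3, 1], [3, 2], [6, 7], [7, 8]])
labels = np.array(["A", "A", "A", "A", "B", "B"])
query = np.array([5, 5])

dists = np.linalg.norm(train - query, axis=1)
nearest = labels[np.argsort(dists, kind="stable")[:3]]
print("3 nearest:", nearest, "->", Counter(nearest).most_common(1)[0][0])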
K-medoids Clustering
29. Differentiate between K-means and K-medoids clustering.
30. What is the primary goal of the K-medoids algorithm?
31. Mention one advantage of K-medoids over K-means in terms of robustness to outliers.
32. For a dataset containing five points: (1, 1), (2, 2), (3, 3), (8, 8), and (9, 9), use the K-medoids algorithm to cluster
the data into two clusters. Choose initial medoids as (1, 1) and (9, 9), and perform one iteration of the algorithm.
Show how you calculate the cost and update the medoids.
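A sketch of the iteration in NumPy, assuming Manhattan distance for the cost (the question does not fix a metric): assign points to the nearest medoid, total the distances, then test whether any medoid swap lowers the cost.

import numpy as np

pts = np.array([[1, 1], [2, 2], [3, 3], [8, 8], [9, 9]])
medoids = [0, 4]                         # initial medoids (1, 1) and (9, 9)

def cost(meds):
    # Manhattan distance of every point to each medoid; nearest one wins.
    d = np.abs(pts[:, None, :] - pts[meds][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum(), d.argmin(axis=1)

total, assign = cost(medoids)
print("initial assignments:", assign, "cost:", total)

# One swap pass: try replacing each medoid with each non-medoid point.
for m in range(len(medoids)):
    for cand in range(len(pts)):
        if cand not in medoids:
            trial = medoids.copy()
            trial[m] = cand
            if cost(trial)[0] < total:
                medoids, total = trial, cost(trial)[0]
print("updated medoids:", pts[medoids], "cost:", total)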
Long Answer
33. Given the following dataset of points in a two-dimensional space: (2, 3), (3, 3), (6, 5), (8, 8), and (7, 5), perform
K-means clustering with K = 2. Show all steps, including initialization, assignment, and updates to centroids, and
provide the final clusters and centroids.
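A sketch of the full loop in NumPy, assuming the first two points as initial centroids (the question leaves initialization open):

import numpy as np

pts = np.array([[2, 3], [3, 3], [6, 5], [8, 8], [7, 5]], dtype=float)
centroids = pts[:2].copy()               # assumed starting centroids

for _ in range(10):
    # Assignment step: nearest centroid by Euclidean distance.
    d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # Update step: each centroid moves to the mean of its points.
    new = np.array([pts[assign == k].mean(axis=0) for k in range(2)])
    if np.allclose(new, centroids):
        break
    centroids = new

print("assignments:", assign, "\ncentroids:\n", centroids)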
34. Given a distance matrix for the following points: A(1, 2), B(2, 2), C(3, 5), D(5, 6), calculate the clustering using
agglomerative hierarchical clustering. Provide a dendrogram representation of the clusters formed.
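A sketch with SciPy, assuming single linkage since the question does not specify a criterion; the linkage matrix encodes the dendrogram.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

pts = np.array([[1, 2], [2, 2], [3, 5], [5, 6]])    # A, B, C, D
Z = linkage(pts, method="single")                   # assumed linkage criterion
print(Z)        # rows: merged cluster ids, merge distance, new cluster size

# With matplotlib available, dendrogram(Z, labels=["A", "B", "C", "D"])
# draws the tree.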
35. Consider the following dataset with two classes:
• Class 1: (1, 1), (1, 2), (2, 1)
• Class 2: (6, 5), (7, 5), (7, 6)
Using a K-nearest neighbor classifier with K = 2, classify the point (5, 4). Show all calculations for distance
measures used.
36. Given the following data points: (1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0), apply the K-medoids algorithm with
K = 2. Calculate the medoids, the assignments of points to clusters, and the total cost.
37. Consider the following two sets of data points:
• Set 1: (1, 2), (2, 3)
• Set 2: (3, 4), (5, 6)
Calculate the Euclidean and Manhattan dissimilarity measures between the two sets of data points and discuss the
differences in the results.
38. Given three points A(2, 3), B(3, 5), and C(5, 7), compute the proximity matrix using the cosine similarity measure.
Show all calculations and interpret the results.
39. Explain how agglomerative and divisive hierarchical clustering methods differ and provide a numerical example
where both methods are applied to the same dataset. Compare the clustering results in terms of the number of
clusters formed.
40. Given a clustering result from a K-means algorithm with three clusters, calculate the Silhouette Coefficient for each
point. Use the following distance values:
• Distances within the same cluster: (1, 0.5, 0.2)
• Distances to nearest cluster: (2, 2.5, 3)
Provide a conclusion based on the Silhouette Coefficients.
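A quick worked check, assuming the two triples are the mean intra-cluster distance a and the mean nearest-cluster distance b for three representative points, with s = (b − a) / max(a, b):

a = [1, 0.5, 0.2]      # mean distance to points in the same cluster
b = [2, 2.5, 3]        # mean distance to points in the nearest other cluster

for ai, bi in zip(a, b):
    print(f"a={ai}, b={bi} -> s = {(bi - ai) / max(ai, bi):.3f}")
# 0.500, 0.800, 0.933: all three points are well matched to their clusters.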
41. For the points P1(1, 2), P2(4, 6), and P3(5, 8), compute the pairwise distances using both the Euclidean and Chebyshev distance measures. Which measure suggests a stronger relationship between the points?
42. Using a dataset of 5 points: (1, 1), (1, 3), (2, 2), (8, 8), and (9, 9), run the K-means algorithm for K = 2 for two
iterations. Start with initial centroids at (1, 1) and (8, 8). Show the changes in centroids and assignments after
each iteration.
Unit 4
Short Answer
Density Search Clustering Techniques
1. Define density-based clustering. How does it differ from other clustering methods?
2. Explain the concept of core points, border points, and noise points in density-based clustering.
3. Describe the DBSCAN algorithm and its key parameters.
4. Given a dataset with coordinates of points, apply the DBSCAN algorithm with specified values for the parameters ϵ and minPts. Identify the clusters and noise points.
5. Consider a dataset where the density threshold is defined as 3 points within a radius of 2 units. Determine whether
each point forms a core point, border point, or noise point.
Fuzzy Clustering
11. Define fuzzy clustering. How does it differ from traditional clustering methods?
12. Describe the concept of membership degrees in fuzzy clustering.
13. Explain the Fuzzy C-Means algorithm and its objective function.
14. For a set of three data points, calculate the membership values for each cluster center using the Fuzzy C-means
algorithm with given cluster centers.
15. Given a dataset and two initial cluster centers, calculate one iteration of fuzzy membership values for each point.
Mixture for Categorical Data
26. What is a mixture model for categorical data? How does it differ from traditional clustering models?
27. Describe the role of the Expectation-Maximization (EM) algorithm in fitting mixture models.
28. Explain the concept of multinomial distributions in the context of categorical data mixture models.
29. Consider a dataset with two categories and apply a basic EM algorithm iteration to estimate cluster membership
probabilities.
30. Given a dataset and initial probabilities for two clusters, compute the expected counts of each category under the
current mixture model.
Long Answer
1. Density Search Clustering Techniques:
Given the following dataset of points in a two-dimensional space: (1, 2), (1, 4), (1, 0), (2, 2), (2, 3), (3, 3), apply the DBSCAN algorithm with ϵ = 1 and a minimum number of points minPts = 2. Identify the clusters formed.
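A sketch with scikit-learn, using the question's parameters; note that min_samples counts the point itself, and a label of −1 marks noise.

import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[1, 2], [1, 4], [1, 0], [2, 2], [2, 3], [3, 3]])
labels = DBSCAN(eps=1, min_samples=2).fit_predict(pts)
print(labels)   # e.g. [0 -1 -1 0 0 0]: one chain cluster, two noise points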
2. Clustering with Constraints:
Given a set of data points in a 2D space and a set of must-link and cannot-link constraints, describe how you would
modify the K-means algorithm to incorporate these constraints. Provide a hypothetical dataset and the modified
algorithm steps.
3. Fuzzy Clustering:
Given a dataset of three points, (1, 1), (1, 2), and (2, 1), and c = 2 clusters, calculate the membership values using the fuzzy C-means algorithm. Assume a fuzziness parameter m = 2.
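A sketch of one membership update in NumPy, assuming hypothetical initial centers at (1, 1) and (2, 1) since the question leaves them open; with fuzziness m = 2 the membership is the usual inverse-squared-distance ratio.

import numpy as np

pts = np.array([[1, 1], [1, 2], [2, 1]], dtype=float)
centers = np.array([[1, 1], [2, 1]], dtype=float)  # hypothetical initial centers
m = 2.0

# d[i, j] = distance from point i to center j; a small floor avoids division
# by zero when a point coincides with a center.
d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
d = np.maximum(d, 1e-9)

# u[i, j] = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
u = 1 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
print(np.round(u, 3))    # each row sums to 1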
4. Optimization Clustering Techniques:
Given the following data points: (1, 1), (1, 2), (1, 3), (10, 10), find the optimal clustering solution using K-means with
k = 2. Show your calculations for the centroid updates and final clusters.
5. Discrete Data Clustering:
Consider a dataset with the following discrete categorical attributes for 5 observations:
• Observation 1: (A, B)
• Observation 2: (A, C)
• Observation 3: (B, C)
• Observation 4: (B, D)
• Observation 5: (C, D)
Apply the K-medoids algorithm with k = 2 and show the medoids and clusters formed.
6. Mixture for Categorical Data:
Suppose you have a dataset of categorical data with the following two features: Color (Red, Blue) and Shape (Circle,
Square). Create a simple mixture model for this data and calculate the probabilities for each category under the
mixture model.
7. Latent Class Analysis:
You have responses from 100 individuals on a questionnaire with binary outcomes (Yes/No). Construct a hypothetical dataset and perform latent class analysis to identify the number of latent classes. Calculate the class probabilities and the expected frequencies for each class.
8. Mixture Models for Mixed Mode Data:
Given a dataset consisting of both continuous (age, income) and categorical (gender, occupation) features, describe
how you would implement a mixture model to cluster this data. Provide a numerical example to illustrate the
clustering process.
Short Answer
1. Define a finite mixture model and explain its purpose in cluster analysis.
2. What are the main advantages of using finite mixture densities in clustering?
3. Describe the key assumptions underlying finite mixture models in clustering.
4. What is the purpose of inference in finite mixture models?
5. Explain the difference between parameter estimation and inference in finite mixture models.
6. Describe how model selection is performed in finite mixture models.
7. List the common methods for estimating parameters in finite mixture models.
8. What is the role of the likelihood function in estimating finite mixture models?
9. Briefly explain the concept of the latent variable in the context of finite mixture models.
10. What is the purpose of the Expectation-Maximization (EM) algorithm in finite mixture models?
11. Describe the steps of the EM algorithm in the context of finite mixture models.
12. Explain why the EM algorithm is suitable for maximum likelihood estimation in mixture models.
13. Define the maximum likelihood estimation for mixtures of multivariate normal densities.
14. Describe the parameters involved in a multivariate normal mixture model.
15. Explain why multivariate normal mixtures are commonly used in clustering.
16. What is non-Gaussian model-based clustering?
17. Explain one advantage of using non-Gaussian models in clustering over Gaussian models.
18. Give an example of a non-Gaussian distribution that can be used for model-based clustering.
19. For a given data set with three clusters, calculate the finite mixture density given the following Gaussian distributions:
• Cluster 1: Mean = 2, Variance = 1, Weight = 0.3
• Cluster 2: Mean = 5, Variance = 1.5, Weight = 0.5
• Cluster 3: Mean = 8, Variance = 2, Weight = 0.2
Find the mixture density for x = 4.
20. Given a mixture model with two Gaussian components, N(µ1 = 0, σ1² = 1) and N(µ2 = 5, σ2² = 2), with weights 0.4 and 0.6 respectively, calculate the probability of data point x = 3 belonging to each component.
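A SciPy sketch covering both questions: the mixture density is the weighted sum of component densities, and the posterior probability (responsibility) of a component is its weighted density divided by that sum.

from scipy.stats import norm

# Question 19: three-component mixture density at x = 4.
w, mu, var = [0.3, 0.5, 0.2], [2, 5, 8], [1, 1.5, 2]
fx = sum(wi * norm.pdf(4, m, v**0.5) for wi, m, v in zip(w, mu, var))
print(f"mixture density at x = 4: {fx:.4f}")

# Question 20: posterior component probabilities at x = 3.
w2, mu2, var2 = [0.4, 0.6], [0, 5], [1, 2]
dens = [wi * norm.pdf(3, m, v**0.5) for wi, m, v in zip(w2, mu2, var2)]
print("posteriors:", [round(di / sum(dens), 4) for di in dens])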
21. Given a mixture of two Gaussian distributions with weights 0.6 and 0.4, means 1 and 3, and variances 1.5 and 2
respectively, estimate the mean of the entire mixture model.
22. For a data set with two clusters following Gaussian distributions N(µ1, σ1²) and N(µ2, σ2²), initialize an EM algorithm by calculating the E-step using:
• Initial means µ1 = 1, µ2 = 4
• Variances σ1² = 2, σ2² = 3
• Weights w1 = 0.5, w2 = 0.5
Compute the responsibilities for data point x = 2.
26. Suppose a data set is generated from a mixture of two multivariate normal distributions with parameters:
• Component 1: Mean vector µ1 = [2, 3], Covariance matrix Σ1 = [ 1  0 ; 0  1 ], Weight = 0.4
• Component 2: Mean vector µ2 = [5, 7], Covariance matrix Σ2 = [ 2  0.5 ; 0.5  1.5 ], Weight = 0.6
Calculate the log-likelihood of the data point x = [3, 4].
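A SciPy sketch, assuming "log-likelihood of the data point" means the log of the mixture density at x = [3, 4]:

import numpy as np
from scipy.stats import multivariate_normal

x = np.array([3, 4])
f1 = 0.4 * multivariate_normal.pdf(x, mean=[2, 3], cov=[[1, 0], [0, 1]])
f2 = 0.6 * multivariate_normal.pdf(x, mean=[5, 7], cov=[[2, 0.5], [0.5, 1.5]])
print("log-likelihood:", np.log(f1 + f2))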
27. In a mixture model with a Gaussian and a Poisson component, the Gaussian component has mean 3, variance 1,
and weight 0.7, while the Poisson component has mean 4 and weight 0.3. For a given data point x = 3, calculate
the probability density under the mixture model.
Long Answer
Finite Mixture Densities as Models for Cluster Analysis
1. A dataset consists of 100 points generated from a mixture of two Gaussian distributions with equal mixing proportions.
The first Gaussian component has mean µ1 = 5 and variance σ1² = 1, while the second has mean µ2 = 10 and variance σ2² = 2.
(a) Write the probability density function for this mixture model.
(b) Calculate the likelihood of observing the data point x = 7 under this mixture model.