ML Clustering and Regression FAQs
Q: What is K-means clustering and how does it work?
A: K-means clustering is an unsupervised learning algorithm used to partition data into K distinct
clusters. It works by
iteratively assigning each data point to the nearest centroid based on the Euclidean distance. After
the assignment, the centroids are recalculated as the mean of all points in the respective cluster.
This process of assigning and updating centroids is repeated until the centroids no longer change
significantly, indicating convergence. The goal is to minimize the within-cluster sum of squares
(WCSS), which represents the variance within each cluster. K-means is simple and fast, making it
suitable for large datasets. However, it requires the number of clusters K to be known in advance
and can struggle with non-spherical or overlapping clusters. It may also converge to a local
minimum rather than the global optimum, so results can depend on centroid initialization.
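As a quick illustration, here is a minimal K-means sketch using scikit-learn; the synthetic blob
data, the choice of K=3, and the random seed are illustrative assumptions, not part of the answer
above.

```python
# Minimal K-means sketch; data and parameters are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three well-separated, roughly spherical clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 restarts from several centroid seeds to reduce the
# chance of converging to a poor local minimum.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)  # final centroids
print(kmeans.inertia_)          # WCSS at convergence
```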
Q: What is DBSCAN and how does it work?
A: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm
that identifies clusters based on the density of data points in a region. It uses two parameters:
epsilon (eps), which defines the radius of a neighborhood around a point, and minPts, the minimum
number of points required to form a dense region. A point is classified as a core point if it has at
least minPts points within its eps-radius. Points within the neighborhood of a core point are included in the
cluster and may expand it further if they themselves are core points. Points that do not fall within any
core point's neighborhood are labeled as noise. DBSCAN is effective for discovering clusters of
arbitrary shapes and handling outliers. However, it may perform poorly on datasets with varying
densities, since a fixed eps may not be suitable across all regions.
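A minimal DBSCAN sketch with scikit-learn; the parameter values (eps=0.3, min_samples=5) and the
two-moons toy data are illustrative assumptions:

```python
# Minimal DBSCAN sketch; eps and min_samples are illustrative values.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrarily shaped clusters that
# centroid-based methods like K-means handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# A label of -1 marks points DBSCAN treated as noise.
print(set(db.labels_))
```

Lowering eps or raising min_samples makes the density requirement stricter, so more points end up
labeled as noise.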
Q: How do you choose the optimal number of clusters (K) in K-means?
A: Choosing the optimal number of clusters (K) in K-means clustering is critical for meaningful
results. One common technique is the Elbow Method, where you plot the Within-Cluster Sum of
Squares (WCSS) for different values of K and look for a 'bend' or 'elbow' in the plot, suggesting
diminishing returns in reducing WCSS. Another method is the Silhouette Score, which measures
how similar a data point is to its own cluster compared to other clusters; a higher average silhouette
score indicates better-defined clusters. The Gap Statistic compares the total intra-cluster variation
for different values of K with their expected values under null reference distributions of the data.
These methods help determine an optimal K by balancing model complexity against goodness
of fit, though they may require domain knowledge and visualization, especially in high-dimensional
spaces.
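The sketch below computes the WCSS and silhouette score across a range of K values; the synthetic
data and the K range of 2 to 8 are illustrative assumptions:

```python
# Elbow Method and Silhouette Score sketch; data and K range are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the WCSS; look for the K where it stops dropping sharply.
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}  WCSS={km.inertia_:.1f}  silhouette={sil:.3f}")
```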
Q: How does DBSCAN handle datasets with varying densities?
A: DBSCAN generally struggles with datasets that have clusters of varying densities. This is
because it relies on two fixed parameters: epsilon (eps), which defines the neighborhood radius, and
minPts, the minimum number of points required to form a dense region. In datasets with different
density regions, a single eps value may be too small for sparse clusters (causing them to be missed
or labeled as noise) and too large for dense clusters (causing them to merge together). While
DBSCAN is robust to outliers and can detect arbitrarily shaped clusters, it lacks the flexibility to
adapt to varying local densities. Advanced variants such as HDBSCAN (Hierarchical DBSCAN) are
better suited for such datasets, as they extend DBSCAN into a hierarchical clustering algorithm
that can extract clusters at different density levels.
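A small sketch contrasting the two on varying-density data; it assumes scikit-learn 1.3 or later
(which ships sklearn.cluster.HDBSCAN), and the cluster layout and parameters are illustrative:

```python
# DBSCAN vs. HDBSCAN on varying-density data; assumes scikit-learn >= 1.3.
import numpy as np
from sklearn.cluster import DBSCAN, HDBSCAN
from sklearn.datasets import make_blobs

# One tight cluster and one sparse cluster: a single eps fits neither well.
dense, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3, random_state=42)
sparse, _ = make_blobs(n_samples=200, centers=[[8, 8]], cluster_std=2.5, random_state=42)
X = np.vstack([dense, sparse])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
hdb = HDBSCAN(min_cluster_size=15).fit(X)

# HDBSCAN adapts to local density, so it typically recovers both clusters
# where a fixed eps fragments the sparse one or merges regions.
print("DBSCAN labels:", set(db.labels_))
print("HDBSCAN labels:", set(hdb.labels_))
```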
Q: What is Regression?
A: Regression is a type of supervised learning used to model the relationship between a dependent
variable and one or more independent variables. The main goal is to predict continuous output
values. For example, predicting house prices based on features like location, size, and number of
rooms. In simple linear regression, the model fits a straight line (y = mx + b) to the data. In multiple
regression, more than one feature is used. Regression analysis helps in forecasting, trend analysis,
and, under strong assumptions, inferring causal relationships between variables. It's widely used
in fields like economics. Linear regression relies on assumptions such as linearity, independence
of errors, homoscedasticity, and normality of residuals. Violating these can lead to biased
results. Regression provides
interpretability, as coefficients indicate the influence of each feature on the outcome. However, it's
sensitive to outliers and multicollinearity, so data preprocessing and diagnostic checking are crucial.
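As a sketch of simple linear regression with scikit-learn; the synthetic data, generated from
y ≈ 3x + 5 with Gaussian noise, is an illustrative assumption:

```python
# Simple linear regression sketch; the data-generating line is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 5 + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)

# coef_ approximates the slope m and intercept_ the offset b in y = mx + b.
print(model.coef_, model.intercept_)
```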
Q: What are the different types of regression techniques?
A: There are several regression techniques in machine learning, each suited to different data types
and problem settings:
1. **Linear Regression**: Models a linear relationship between features and the target.
2. **Polynomial Regression**: Captures nonlinear relationships using polynomial feature terms.
3. **Ridge Regression**: Linear regression with an L2 penalty to shrink coefficients.
4. **Lasso Regression**: Linear regression with an L1 penalty that can zero out features.
5. **ElasticNet Regression**: Combines the L1 and L2 penalties of Lasso and Ridge.
6. **Support Vector Regression (SVR)**: Uses support vector machines for regression tasks.
7. **Decision Tree Regression**: Partitions the feature space and predicts per region.
8. **Random Forest Regression**: Averages an ensemble of decision trees to reduce variance.
9. **Gradient Boosting Regression**: Builds models sequentially to correct errors of previous ones.
Choosing the right method depends on the problem type, data size, noise level, and whether the
relationship between features and target is linear or nonlinear; a brief comparison sketch follows.
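The sketch below cross-validates a few of the techniques listed above on one synthetic dataset;
the model choices, hyperparameters, and data are illustrative assumptions:

```python
# Comparing several regression techniques via cross-validated R^2;
# models, hyperparameters, and data are illustrative choices.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "SVR": SVR(kernel="rbf"),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    # 5-fold cross-validated R^2 as a quick, like-for-like comparison.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```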
Q: Which evaluation metrics are used for regression? List and explain them.
A: Several evaluation metrics are used to assess regression models. These include:
1. **Mean Absolute Error (MAE)**: Average of absolute differences between actual and predicted
values. Less sensitive to outliers than squared-error metrics.
2. **Mean Squared Error (MSE)**: Average of squared differences. Penalizes larger errors more.
3. **Root Mean Squared Error (RMSE)**: Square root of MSE. Same unit as target variable.
4. **R-squared (R²)**: Proportion of the variance in the target explained by the model.
5. **Adjusted R-squared**: Adjusted for the number of predictors, helps avoid overfitting.
6. **Mean Absolute Percentage Error (MAPE)**: Expresses error as a percentage. Useful for
business forecasting.
Each metric gives different insights. MAE and RMSE measure average errors, while R² explains
how well the features explain the variance. The choice depends on the application and on how
heavily large errors should be penalized, as the sketch below illustrates.
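A minimal sketch computing the metrics above with scikit-learn; the y_true and y_pred arrays are
made-up illustrative values:

```python
# Computing common regression metrics; y_true/y_pred are illustrative.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as the target variable
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} MAPE={mape:.3%} R2={r2:.3f}")
```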