
ML Clustering and Regression FAQs

The document provides an overview of K-means and DBSCAN clustering algorithms, explaining their mechanisms, strengths, and weaknesses. It also discusses methods for selecting the optimal number of clusters in K-means, the challenges DBSCAN faces with varying densities, and various regression methods and evaluation metrics used in machine learning. Key regression techniques include linear, polynomial, and regularization methods, with evaluation metrics such as MAE, MSE, and R-squared highlighted for assessing model performance.


Q: What is K-means clustering and how does it work?

A: K-means clustering is an unsupervised learning algorithm used to partition data into K distinct, non-overlapping subgroups or clusters. It works by randomly initializing K cluster centroids, then iteratively assigning each data point to the nearest centroid based on Euclidean distance. After the assignment, each centroid is recalculated as the mean of all points in its cluster. This process of assigning points and updating centroids is repeated until the centroids no longer change significantly, indicating convergence. The goal is to minimize the within-cluster sum of squares (WCSS), which represents the variance within each cluster. K-means is simple and fast, making it suitable for large datasets. However, it requires the number of clusters K to be known in advance and can struggle with non-spherical or overlapping clusters. It may also converge to a local minimum, depending on the initial centroid positions.
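
As a minimal sketch of this procedure, the snippet below runs scikit-learn's KMeans on synthetic data; the dataset and parameter values are illustrative assumptions, not part of the original text:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data: 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init reruns K-means with different random initial centroids,
# reducing the risk of converging to a poor local minimum
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.cluster_centers_)  # final centroids (mean of each cluster)
print(km.inertia_)          # WCSS: the quantity K-means minimizes
```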

Q: What is DBSCAN clustering and how does it work?

A: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies clusters based on the density of data points in a region. It uses two parameters: epsilon (eps), which defines the radius of a neighborhood around a point, and minPts, the minimum number of points required to form a dense region. A point is classified as a core point if it has at least minPts points within its eps-radius. Points within the neighborhood of a core point are included in the cluster and may expand it further if they are themselves core points. Points that do not fall within any core point's neighborhood are labeled as noise. DBSCAN is effective for discovering clusters of arbitrary shapes and handling outliers. However, it may perform poorly on datasets with varying densities, since a fixed eps may not be suitable across all regions.
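
A minimal sketch using scikit-learn's DBSCAN on the classic two-moons shape (the data and the eps/min_samples values here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative non-spherical data that K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples corresponds to minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Labels are cluster indices; -1 marks points classified as noise
print(np.unique(db.labels_, return_counts=True))
```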

Q: How do you choose the optimal number of clusters in K-means clustering?

A: Choosing the optimal number of clusters (K) in K-means clustering is critical for meaningful results. One common technique is the Elbow Method, where you plot the Within-Cluster Sum of Squares (WCSS) for different values of K and look for a 'bend' or 'elbow' in the plot, suggesting diminishing returns in reducing WCSS. Another method is the Silhouette Score, which measures how similar a data point is to its own cluster compared to other clusters; a higher average silhouette score indicates better-defined clusters. The Gap Statistic compares the total intra-cluster variation for different values of K with their expected values under null reference distributions of the data. These methods help determine an optimal K by balancing model complexity and goodness of fit, though they may require domain knowledge and visualization, especially in high-dimensional spaces.
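
A minimal sketch of the Elbow Method and Silhouette Score, assuming scikit-learn and illustrative synthetic data; in practice you would plot these values rather than print them:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the WCSS plotted in the elbow curve;
    # silhouette_score averages per-point cohesion vs. separation
    print(k, km.inertia_, silhouette_score(X, km.labels_))
```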

Q: Can DBSCAN clustering handle datasets with different densities?

A: DBSCAN generally struggles with datasets that have clusters of varying densities. This is because it relies on two fixed parameters: epsilon (eps), which defines the neighborhood radius, and minPts, the minimum number of points required to form a dense region. In datasets with regions of different density, a single eps value may be too small for sparse clusters (causing them to be missed or labeled as noise) and too large for dense clusters (causing them to merge). While DBSCAN is robust to outliers and can detect arbitrarily shaped clusters, it lacks the flexibility to adapt to varying local densities. Advanced variants such as HDBSCAN (Hierarchical DBSCAN) are better suited to such datasets: they extend DBSCAN into a hierarchical clustering approach, allowing clusters of varying densities to be identified more accurately.
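
As an illustrative sketch, the snippet below assumes scikit-learn 1.3+ (which ships an HDBSCAN implementation) and synthetic blobs of deliberately different densities:

```python
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3
from sklearn.datasets import make_blobs

# Illustrative data: one tight blob and one sparse blob,
# i.e. clusters of clearly different densities
X, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [6, 6]],
                  cluster_std=[0.3, 1.5], random_state=0)

# HDBSCAN needs no global eps; min_cluster_size bounds cluster size
hdb = HDBSCAN(min_cluster_size=10).fit(X)
print(set(hdb.labels_))  # -1 still marks noise points
```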

Q: What is Regression?

A: Regression is a type of supervised learning used to model the relationship between a dependent variable and one or more independent variables. The main goal is to predict continuous output values, for example predicting house prices from features like location, size, and number of rooms. In simple linear regression, the model fits a straight line (y = mx + b) to the data; in multiple regression, more than one feature is used. Regression analysis helps in forecasting, trend analysis, and studying relationships between variables, and it is widely used in fields like economics, biology, and engineering. Standard assumptions include linearity, independence, homoscedasticity, and normality of residuals; violating these can lead to biased results. Regression provides interpretability, as coefficients indicate the influence of each feature on the outcome. However, it is sensitive to outliers and multicollinearity, so data preprocessing and diagnostic checking are crucial.
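
A minimal sketch of simple linear regression with scikit-learn, recovering the slope m and intercept b from illustrative noisy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: y depends linearly on one feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
# coef_ is the slope m and intercept_ is b in y = mx + b
print(model.coef_, model.intercept_)
print(model.predict([[5.0]]))  # predict for a new input
```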

Q: What are the different regression methods in machine learning?

A: There are several regression techniques in machine learning, each suited to different data types and relationships. Common methods include:

1. **Linear Regression**: Models a linear relationship between inputs and output.
2. **Polynomial Regression**: Extends linear regression by fitting a polynomial curve.
3. **Ridge Regression**: Adds L2 regularization to linear regression to prevent overfitting.
4. **Lasso Regression**: Uses L1 regularization to shrink coefficients, enabling feature selection.
5. **Elastic Net**: Combines L1 and L2 regularization.
6. **Support Vector Regression (SVR)**: Uses support vector machines for regression tasks.
7. **Decision Tree Regression**: Builds a tree model for predictions.
8. **Random Forest Regression**: Uses an ensemble of decision trees.
9. **Gradient Boosting Regression**: Builds models sequentially to correct errors of previous ones.

Choosing the right method depends on the problem type, data size, noise, and whether the relationship is linear or non-linear.
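
As an illustrative comparison of a few of these methods, the sketch below fits plain linear regression, Ridge, and Lasso to synthetic data in which only two features matter; the data and alpha values are assumptions for demonstration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Illustrative data: only the first 2 of 10 features are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # Lasso's L1 penalty drives irrelevant coefficients to exactly 0
    print(type(model).__name__, np.round(model.coef_, 2))
```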

Q: Which evaluation metrics are used for regression? List and explain them.

A: Several evaluation metrics are used to assess regression models. These include:

1. **Mean Absolute Error (MAE)**: Average of absolute differences between actual and predicted values. Easy to interpret.
2. **Mean Squared Error (MSE)**: Average of squared differences. Penalizes larger errors more.
3. **Root Mean Squared Error (RMSE)**: Square root of MSE. Same unit as the target variable.
4. **R-squared (R²)**: Proportion of variance explained by the model. Typically ranges from 0 to 1, though it can be negative for a model that fits worse than predicting the mean.
5. **Adjusted R-squared**: Adjusted for the number of predictors, which helps avoid rewarding overfit models.
6. **Mean Absolute Percentage Error (MAPE)**: Expresses error as a percentage. Useful for business forecasting.

Each metric gives different insights: MAE and RMSE measure average error magnitude, while R² indicates how well the features explain the variance. The choice depends on the application and how errors are interpreted in the given context.
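
A minimal sketch computing these metrics with scikit-learn on illustrative predictions:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)

# Illustrative actual vs. predicted values
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))  # same unit as the target variable
print("R²  :", r2_score(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
```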
