ADS Viva
Answer:
Data Science is the process of collecting, analyzing, and interpreting large amounts of data to make better
decisions. It combines tools from statistics, computer science, and domain knowledge.
Example: Netflix uses data science to recommend shows based on your watch history.
Answer:
Statistical learning is about using statistics to learn patterns from data. It includes both supervised
learning (like prediction) and unsupervised learning (like clustering).
Example: Predicting house prices based on location and size.
Answer:
Answer:
Data visualization turns numbers into graphs and charts, making it easier to see trends, patterns, and
outliers.
Example: A line chart showing monthly sales helps a manager quickly spot which months performed
better.
Answer:
Statistics helps us understand and describe data, test hypotheses, and build models.
Example: Mean, median, and standard deviation summarize a dataset.
Answer:
Optimization finds the best solution under constraints. In ML, it's used to minimize error.
Example: In linear regression, optimization finds the best-fit line that minimizes the distance between
predicted and actual values.
Answer:
Structured thinking breaks down problems into smaller parts:
Answer:
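1. Non-negativity: Probabilities are ≥ 0
2. Normalization: The probabilities of all possible outcomes sum to 1
3. Additivity: If two events can't happen at the same time, the probability that either occurs is the sum of their probabilities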
Example: The chance of rolling a 1 or a 2 on a die = P(1) + P(2) = 1/6 + 1/6 = 1/3
Answer:
A random variable maps outcomes of a random event to numbers.
Example: When flipping a coin:
● Heads = 1
● Tails = 0
The outcome is random, but we can assign it a number.
11. What is the difference between Discrete and Continuous Random Variables?
Answer:
Answer:
It describes the likelihood of outcomes for a random variable.
Example: For a fair die, each number (1–6) has a probability of 1/6.
Answer:
They measure spread in data.
Answer:
It updates the probability of an event based on new evidence.
Example: If 1% of emails are spam, and a keyword appears more in spam, Bayes’ Theorem helps
calculate the chance an email with that keyword is actually spam.
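A minimal worked sketch of the spam example follows. The 1% prior comes from the example above; the two keyword rates (80% of spam, 5% of non-spam) are made-up numbers chosen only for illustration.

```python
def posterior_spam(prior_spam, p_kw_given_spam, p_kw_given_ham):
    """Bayes' Theorem: P(spam | keyword)."""
    # Total probability of seeing the keyword in any email.
    p_kw = p_kw_given_spam * prior_spam + p_kw_given_ham * (1 - prior_spam)
    return p_kw_given_spam * prior_spam / p_kw

# 0.01 is the prior from the example; 0.80 and 0.05 are assumed rates.
p = posterior_spam(prior_spam=0.01, p_kw_given_spam=0.80, p_kw_given_ham=0.05)
print(round(p, 3))
```

Under these assumed rates, the keyword raises the spam probability well above the 1% prior, but it stays far from certainty because spam is rare to begin with.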
Answer:
Probability of an event given another event has happened.
Example: P(Rain | Clouds) = Probability of rain given it's cloudy.
Answer:
As the number of trials increases, the sample mean gets closer to the true mean.
Example: Flip a fair coin 1,000 times. Around 50% should be heads.
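A small simulation can make this concrete. The sketch below is only an illustration: the seed and the flip counts are arbitrary choices, and the exact fractions depend on the random generator.

```python
import random

def heads_fraction(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The fraction of heads should drift toward 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, heads_fraction(n))
```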
Answer:
A way to check if a result is statistically significant.
Example: You want to check if a new drug is better than the old one. You test the difference using
hypothesis testing.
Answer:
It shows how two variables are related.
Answer:
Overfitting happens when a model learns the training data too well, including noise and outliers. This
makes it perform poorly on new/unseen data.
Example: A student memorizes past exam questions but can't answer new ones in the actual test.
Fixes: Use simpler models, cross-validation, or regularization.
Answer:
Underfitting happens when a model is too simple to capture the pattern in data. It performs badly on both
training and test data.
Example: Predicting house prices using just the number of bathrooms when many other factors (like
location) matter.
23. What is Cross-validation?
Answer:
Cross-validation is a method to estimate how well a model generalizes by splitting the data into training and test sets
multiple times.
Example: In 5-fold CV, the data is split into 5 parts; the model trains on 4 and is tested on the remaining 1. This
repeats 5 times so each part serves as the test set once, and the scores are averaged for a more reliable performance estimate.
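The splitting step can be sketched in plain Python. This is a simplified illustration, not a library implementation: `k_fold_indices` is a hypothetical helper, and it does not shuffle the data, which one would normally do first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test set exactly once, which is what lets the averaged score use all of the data.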
Answer:
Answer:
Answer:
A confounding variable influences both independent and dependent variables, creating a false
relationship.
Example: Ice cream sales and drowning deaths both increase in summer. The confounding variable is
temperature.
27. What is a Histogram and what does it show?
Answer:
A histogram is a bar chart that shows the distribution of numerical data. It groups data into bins and
shows how many values fall into each bin.
Example: A histogram of student test scores shows how many scored between 60–70, 70–80, etc.
Answer:
A box plot shows the spread and skewness of data using the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
Answer:
Matrices are used to represent and process data efficiently. Operations like multiplication, inversion, and
transpose are used in:
● Linear regression
● Neural networks
30. What is Eigenvalue and Eigenvector and why are they important?
Answer:
Answer:
Answer:
It's a bell-shaped curve where data is symmetrically distributed around the mean. Many natural processes
follow it.
Example: Human heights, test scores.
It's used in hypothesis testing and confidence intervals.
Answer:
The null hypothesis (H₀) is a statement that there is no effect or difference. We test it to see if we can
reject it in favor of the alternative hypothesis.
Example: H₀: New drug has no effect.
If results are significant, we reject H₀.
Answer:
The p-value tells us how likely we would be to get data at least as extreme as ours if the null hypothesis were true.
Answer:
A matrix is a rectangular arrangement of numbers in rows and columns. It's used to represent datasets,
transformations, and operations in machine learning.
Example: A table of student scores (rows = students, columns = subjects) is a matrix.
Answer:
The determinant is a single number that gives information about a square matrix.
● If the determinant is 0, the matrix is not invertible, so the system Ax = b has no unique solution.
Answer:
The trace is the sum of the diagonal elements of a square matrix.
It’s used in optimization and machine learning cost functions.
Answer:
Nullity is the dimension of the null space of A, i.e., the number of linearly independent solutions to the
equation Ax = 0. It equals the number of columns minus the rank.
Formula:
Nullity = Number of columns - Rank
It shows how many directions in the solution space give 0 when the matrix multiplies a vector.
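As a quick check of the formula, the sketch below uses NumPy's `np.linalg.matrix_rank` on a small matrix made up for this illustration, whose second row is twice the first.

```python
import numpy as np

# Hypothetical 3x3 matrix: row 2 is 2 * row 1, so the rank drops to 2.
A = np.array([[1, 2, 3],
              [2, 4, 6],
              [1, 0, 1]])

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank          # number of columns minus rank

# One direction that A sends to zero.
x = np.array([1, 1, -1])
print(rank, nullity, A @ x)
```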
Answer:
Answer:
In facial recognition, images are converted to matrices. PCA (which uses eigenvectors) reduces the size
of data while keeping important features, helping recognize faces efficiently.
Answer:
Matrix factorization breaks a matrix into simpler matrices that, when multiplied together, give the
original matrix.
Used in:
● Dimensionality reduction
Example: A = LU, where L is lower triangular, U is upper triangular.
9. What is LU Decomposition?
Answer:
LU stands for Lower-Upper. A matrix is split into:
● L (lower triangular)
● U (upper triangular)
This helps solve equations and find determinants efficiently.
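A minimal sketch of the idea, using the Doolittle scheme without pivoting. It assumes no zero pivot turns up; production code would use a pivoted routine from a numerical library instead. The 2×2 matrix is an arbitrary example.

```python
def lu_decompose(A):
    """Doolittle LU without pivoting: returns (L, U) with A = L * U.

    Assumes every pivot U[i][i] is non-zero.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U.
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        # Column i of L (below the diagonal).
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

A = [[4.0, 3.0],
     [6.0, 3.0]]
L, U = lu_decompose(A)

# With a unit-diagonal L, det(A) is the product of U's diagonal entries.
det = U[0][0] * U[1][1]
print(L, U, det)
```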
Answer:
QR splits a matrix into:
● Q: An orthogonal matrix
Answer:
SVD breaks a matrix A into three parts:
A = UΣVᵀ
Where:
Answer:
In recommendation engines like Netflix, SVD helps predict missing ratings by simplifying the
user-item matrix while keeping key patterns.
Answer:
Data visualization is turning data into charts, graphs, and visuals to make it easier to understand.
Example: A pie chart of expenses helps you instantly see what you spend the most on.
Answer:
Answer:
It plots points on an x-y graph to show relationships between two variables.
Example: Plotting study hours vs. exam scores to see correlation.
16. What is a Heatmap?
Answer:
A heatmap uses colors to represent values in a matrix.
Example: A correlation matrix heatmap shows which variables are strongly related.
Answer:
It reduces the number of features in data while keeping important information.
Why:
● Faster computation
● Easier to visualize
● Removes noise
Example: Reducing 1000 pixel features in image recognition to the top 50 components.
Answer:
PCA is a method that uses eigenvalues and eigenvectors to reduce data dimensions. It keeps
components with the most variation.
Example: Compressing a 100-feature dataset to its top 2 principal components for visualization.
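One way to sketch PCA is through the eigen-decomposition of the covariance matrix with NumPy. The function name `pca` and the random test data are assumptions made for this illustration; libraries such as scikit-learn typically compute PCA through SVD instead.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest ones
    components = eigvecs[:, order]
    return X_centered @ components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 3 * X[:, 0] + 0.1 * X[:, 1]   # make two features strongly related

Z, variances = pca(X, n_components=2)
print(Z.shape, variances)
```

Each eigenvalue is the variance of the data along its eigenvector, which is why keeping the largest eigenvalues keeps the most variation.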
Answer:
A correlation matrix shows how strongly variables are related. It’s often shown using a heatmap.
Example: In sales data, a correlation matrix might show that advertising spend is strongly correlated
with revenue.
20. What is the role of matplotlib and seaborn in Python for visualization?
Answer:
● matplotlib: A basic plotting library (line, bar, scatter, etc.)
● seaborn: Built on matplotlib, provides prettier, high-level charts like boxplots, heatmaps, and
regression plots.
Answer:
An inner product is a way to measure the similarity between two vectors.
For two vectors A and B, the inner product is:
A · B = a₁b₁ + a₂b₂ + ... + aₙbₙ
Answer:
It relates to the angle between two vectors.
A · B = ‖A‖ ‖B‖ cos(θ)
Answer:
Cosine similarity measures the angle between two vectors, not their size.
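A short sketch of the usual definition, cos(θ) = A · B / (‖A‖ ‖B‖), in plain Python. The vectors are arbitrary examples and the function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # different length, same direction
perpendicular = cosine_similarity([1, 0], [0, 1])         # at right angles
print(same_direction, perpendicular)
```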
Answer:
It’s the straight-line distance between two points in space.
Answer:
Also called L1 distance, it’s the total distance moved along axes.
Distance = |x₁ − x₂| + |y₁ − y₂|
Answer:
● p = 1 → Manhattan
● p = 2 → Euclidean
Can be adjusted for different situations in ML.
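The two special cases can be checked with a small sketch based on the standard Minkowski formula, (Σ |xᵢ − yᵢ|ᵖ)^(1/p). The points (0, 0) and (3, 4) are an arbitrary example.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between points a and b."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
manhattan = minkowski(a, b, 1)   # p = 1: sum of moves along each axis
euclidean = minkowski(a, b, 2)   # p = 2: straight-line distance
print(manhattan, euclidean)
```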
Answer:
It measures distance while considering correlations in data.
Useful when features have different scales or variances.
Used in: Outlier detection, clustering.
Answer:
Used to compare two sets.
Jaccard = |A ∩ B| / |A ∪ B| (size of the intersection divided by size of the union)
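With Python sets this is nearly a one-liner. The fruit baskets below are made up for illustration, and returning 1.0 for two empty sets is a convention assumed here, not something the definition fixes.

```python
def jaccard(a, b):
    """Jaccard similarity: shared items divided by all distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

basket_1 = {"apple", "banana", "cherry"}
basket_2 = {"banana", "cherry", "date"}
similarity = jaccard(basket_1, basket_2)   # 2 shared items out of 4 distinct items
print(similarity)
```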
Answer:
● Cosine Similarity: When direction matters more than magnitude (e.g., document similarity).
● Euclidean Distance: When magnitude and location both matter (e.g., physical location).
31. How are distance measures used in KNN (K-Nearest Neighbors)?
Answer:
KNN uses distance measures (usually Euclidean) to find the k closest data points and predicts based on
their labels.
Example: To classify a fruit as apple or orange based on nearest neighbors in feature space.
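The fruit example can be sketched as below. The feature values (weight in grams, a 0 to 10 redness score) are invented for this illustration, and `knn_predict` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Sort training points by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: (weight in grams, redness score 0-10)
fruits = [((150, 8), "apple"), ((160, 9), "apple"), ((170, 7), "apple"),
          ((140, 3), "orange"), ((130, 2), "orange"), ((145, 4), "orange")]

print(knn_predict(fruits, (155, 8)))
print(knn_predict(fruits, (135, 3)))
```

Because weight spans a much larger range than redness here, it dominates the distance; in practice features are usually scaled before applying KNN.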
Answer:
A table that shows the distances between every pair of points in a dataset.
Example: If you have 5 cities, the matrix shows pairwise distances between them all.
Used in: Clustering, path optimization.
Answer:
Projection means "dropping" a vector onto another vector or plane.
Used in:
Answer:
In high dimensions, distances become less meaningful (curse of dimensionality).
Answer:
Answer:
EDA is the process of analyzing datasets to summarize their main features using statistics and
visualizations.
Purpose:
● Understand patterns
● Spot anomalies
● Test assumptions
Example: Before building a model on sales data, EDA helps you understand trends and outliers.
Answer:
Structured data is organized in rows and columns (like in a spreadsheet).
Elements include:
Answer:
They show where most values lie in a dataset.
Common estimates:
● Mean (average)
Answer:
They describe how spread out the data is.
Common ones:
Answer:
Variability helps us understand risk, reliability, and differences.
Example: If two employees have the same average performance but one has high variability, that employee is
less consistent and may be unreliable.
Answer:
The expected value is the long-term average of a random variable.
Formula:
E[X] = Σ (x * P(x))
Example: In tossing a fair die, expected value = (1+2+3+4+5+6)/6 = 3.5
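The formula translates directly into code. This sketch uses `fractions.Fraction` so the die example stays exact; `expected_value` is a hypothetical helper name.

```python
from fractions import Fraction

def expected_value(distribution):
    """distribution: dict mapping each outcome x to its probability P(x)."""
    return sum(x * p for x, p in distribution.items())

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
ev = expected_value(die)
print(ev, float(ev))
```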
Answer:
Moments are values that describe shape and distribution.
Answer:
A histogram is a bar chart showing frequency distribution of numerical data.
Used to:
9. What is Skewness?
Answer:
Skewness tells whether the data is leaning left or right (not symmetrical).
Answer:
Kurtosis describes how heavy the tails of a distribution are, which is often loosely described as how peaked or flat it is.
Answer:
Use tools like:
● Histogram
● Box plot
● Density plot
Also calculate: mean, median, mode, variance, skewness, kurtosis.
Goal: Understand shape, center, spread, and outliers.
Answer:
A box plot shows:
● Outliers (dots)
Example: Box plot of test scores can highlight if a few students performed much worse or better.
13. How do we explore binary and categorical data?
Answer:
Use counts and proportions, and visualize with:
● Bar charts
● Pie charts
● Frequency tables
Example: Pie chart showing % of customers who said "yes" or "no" to buying a product.
Answer:
Covariance tells if two variables move together.
Answer:
Correlation is a standardized form of covariance (range = -1 to 1).
● 0: no relation
Answer:
Analyzing two or more variables together to see relationships.
Techniques include:
● Scatter plots
● Pair plots
● Heatmaps
Example: Exploring relationship between income, education, and job satisfaction.
Answer:
A grid of scatter plots comparing all variable pairs in a dataset.
Use it to:
● Spot trends
● Detect correlations
● Identify outliers
Tool: sns.pairplot() in Python.
Answer:
EDA helps to:
Answer:
Answer:
Outliers are data points that are significantly different from the rest.
Detection methods:
● IQR method:
Outliers: values < Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR
Example: A student's test score of 100 when most score between 50–70.
● Drop rows/columns
Answer:
Answer:
A scatter plot shows the relationship between two numeric variables.
Example: Plotting "hours studied" vs. "exam score" reveals positive correlation.
You can visually detect trends, clusters, or outliers.
Answer:
It’s a table showing the frequency distribution of two categorical variables.
Example: Crosstab of "gender" vs. "purchased" shows how many men/women made a purchase.
Answer:
A heatmap displays values or relationships using color.
Answer:
Multicollinearity = predictor variables are strongly correlated with each other.
Check using:
● Correlation matrix
● Understand distribution
● Identify relationships
Tools: Bar chart, histogram, box plot, scatter plot, line chart
Answer:
IQR (Interquartile Range) = Q3 - Q1
It measures spread of the middle 50% of the data and helps find outliers.
Example:
Q1 = 30, Q3 = 70 → IQR = 40
Outlier threshold = Q1 - 1.5×IQR = -30, Q3 + 1.5×IQR = 130
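The same arithmetic as a small sketch. Q1 = 30 and Q3 = 70 come from the example above; the list of scores is made up to show values on both sides of the thresholds.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier thresholds for the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(30, 70)

# Hypothetical values; anything outside the thresholds is flagged.
scores = [25, 40, 55, 70, 135, -40]
outliers = [x for x in scores if x < lower or x > upper]
print(lower, upper, outliers)
```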
31. What are summary statistics and how are they used in EDA?
Answer:
Summary statistics = basic metrics describing data
Answer:
Data profiling is the process of examining data quality and content:
● Null counts
● Unique values
● Distribution
● Min/Max values
Tools: pandas-profiling (since renamed ydata-profiling), Sweetviz
Answer:
Highly correlated features may give redundant information.
Removing one helps reduce model complexity and multicollinearity.
Example: Height (cm) and Height (inches) are perfectly correlated.
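The height example can be verified with a hand-written Pearson correlation. The five heights are arbitrary sample values; the conversion factor 2.54 cm per inch is exact.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]   # same information, different unit
r = pearson_r(height_cm, height_in)
print(r)
```

A correlation this close to 1 means the second column adds nothing the first does not already carry, so one of them can be dropped.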
Answer:
Use:
● Seasonal decomposition
Example: Plotting monthly sales to see if they rise every December.
35. What is a violin plot and when is it better than a box plot?
Answer:
A violin plot combines box plot and density plot.
It shows median, IQR, and full distribution shape.
Useful when: You want to see both summary stats and distribution in one chart.
Answer:
Sample bias happens when the sample does not represent the full population correctly.
Causes: Bad sampling methods, missing groups, etc.
Example: Surveying only morning gym users may miss out on night-time users.
Answer:
Selection bias is a type of sample bias that occurs when some groups are systematically excluded or
more likely to be included.
Example: If an online survey excludes those without internet access, it suffers from selection bias.
Answer:
The CLT states that:
If you take many random samples from any population with finite variance, the distribution of the sample
means will tend to be normal (bell-shaped) as the sample size increases.
Importance: It allows us to use normal distribution for hypothesis testing even if the data
isn't normal.
Example: If you repeatedly average samples of people's daily steps, the averages will form
a normal curve.
Answer:
Standard error is the standard deviation of the sample mean.
It tells us how much the sample mean is expected to vary from sample to sample around the population mean.
SE = SD / √n, where SD is the standard deviation and n is the sample size
Example: A small standard error means sample averages are close to the true mean.
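A minimal sketch of the calculation with the standard library. The sample values are invented, and the sample standard deviation (`statistics.stdev`) stands in for the unknown population value.

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
se = standard_error(sample)
print(statistics.mean(sample), se)
```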
Answer:
Bootstrapping is a method to estimate confidence intervals by:
Answer:
A range that’s likely to contain the true population parameter (like the mean), based on sample data.
Example: "The average height is 170cm ± 5cm with 95% confidence" means the interval from 165 to 175cm was
produced by a method that captures the real average in about 95% of repeated samples.
Answer:
The normal distribution is a bell-shaped symmetric curve with:
Answer:
It’s a distribution with extreme values (very high or low) that are not rare.
Right-tailed (positive): Income (few people earn a lot)
Left-tailed (negative): Some error metrics
They don’t drop off quickly like the normal curve.
Answer:
A bell-shaped curve like the normal distribution, but with heavier tails. It is used when:
Answer:
Used for binary outcomes (success/failure) repeated n times.
Answer:
Models the number of events in a fixed time or space, where events happen independently.
Example: Number of emails received per hour.
Answer:
It models the time between events in a Poisson process.
Properties:
● Skewed
● Mean = 1/λ
Example: Time between buses arriving at a stop.
Answer:
Used in survival analysis and reliability engineering.
It models time until failure of a system.
Example: Predicting the life of a machine component.
Answer:
It means using statistical or machine learning methods to find a function that explains the data.
Example: Fitting a line through a scatter plot of sales vs. advertising.
Answer:
Answer:
Use t-distribution when:
Answer:
Larger sample size → narrower confidence interval (more precise).
Smaller sample size → wider interval (less reliable).
Answer:
As you increase the number of samples, the sample mean approaches the population mean.
Example: Rolling a die many times → average approaches 3.5.
Answer:
Example:
● Binomial: 10 coin flips
● Poisson: Calls received per minute
Answer:
● Symmetric if p = 0.5
● Skewed if p ≠ 0.5
As the number of trials increases, the binomial distribution approaches normal.
Example: With 100 coin flips, the binomial curve looks bell-shaped.
22. What does the mean and variance of a binomial distribution depend on?
Answer:
Mean (μ) = n × p
Variance (σ²) = n × p × (1 - p)
Example: For 10 coin tosses (p = 0.5),
μ = 5, σ² = 2.5
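The formulas can be cross-checked by summing over the full binomial probability mass function. This sketch reuses n = 10 and p = 0.5 from the example; `binomial_pmf` is a hypothetical helper built on `math.comb`.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the definition, not from n*p and n*p*(1-p).
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, variance)
```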
Answer:
Answer:
Answer:
● Mechanical failures
● Product lifespans
Shape varies:
Answer:
Because it reflects variability of the mean, not individual data points.
SE = SD / √n
Larger samples → smaller SE → more accurate estimate
Answer:
Because it accounts for more uncertainty when sample size is small and standard deviation is
unknown.
As sample size grows, t-distribution becomes more like the normal.
28. How do you choose the right distribution for your data?
Answer:
Answer:
Skewness measures asymmetry:
● Negative skew: long left tail (e.g., exam scores with few low scores)
It affects which distribution and statistical methods to use.
Answer:
Kurtosis measures tail thickness (often loosely called "peakedness"):
Answer:
Histograms show the shape, skewness, peaks, and spread of the data.
You can visually detect if data looks:
● Normal (bell-shaped)
Answer:
It allows us to:
Answer:
Fat-tailed distributions (like Cauchy or some power laws):
Answer:
Bootstrapping:
Answer:
Simulations can help visualize and understand:
● Sampling variability
● Shape of distributions
Answer:
Hypothesis testing is a method to decide whether sample data provide enough evidence to support a claim
about a population.
You start with two statements:
Answer:
A p-value tells you how likely it would be to see sample data at least as extreme as yours if the null hypothesis were true.
Answer:
Chi-square tests are used to compare observed counts with expected counts to see if there is a
relationship between categorical variables.
Example: Testing if gender is related to choice of product color (e.g., blue or red).
Answer:
A confidence interval is a range where we believe the true population value (like the mean) lies.
Example: If we say average income is $50,000 ± $2,000 with 95% confidence, it means the interval $48,000 to
$52,000 comes from a procedure that would contain the real average in about 95% of repeated samples.
Answer:
A t-test compares the means of two groups.
Use it when:
Answer:
Answer:
If results are statistically significant, it means they would be unlikely to occur by random chance alone if the
null hypothesis were true, usually judged against a p-value threshold (e.g., 0.05).
Answer:
ANOVA checks whether three or more groups have different means.
Instead of doing multiple t-tests, ANOVA gives a single test.
Example: Comparing exam scores from 3 different schools.
Answer:
Degrees of freedom are the number of independent values that can vary when calculating a statistic.
Example: In a sample of 5 numbers with a fixed mean, only 4 values can freely change → df = 4.
Answer:
Using many t-tests increases the chance of false positives (Type I errors).
ANOVA controls this risk and provides one overall test for all group differences.
Answer:
Answer:
Answer:
Answer:
If we repeated the experiment 100 times, about 95 of those intervals would contain the true value.
It does not mean there's a 95% chance the value is in this one interval.
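This repeated-sampling reading can be illustrated with a simulation. The sketch below makes several simplifying assumptions: normally distributed data, a known standard deviation (so a z-interval with 1.96 applies), and arbitrary values for the true mean, sample size, and seed.

```python
import random
import statistics

def ci_coverage(true_mean=170.0, sigma=10.0, n=30, repeats=2000, seed=1):
    """Fraction of 95% z-intervals (sigma assumed known) that contain true_mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5
    hits = 0
    for _ in range(repeats):
        sample_mean = statistics.fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        # The interval sample_mean ± half_width contains true_mean
        # exactly when the two are within half_width of each other.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / repeats

coverage = ci_coverage()
print(coverage)
```

The printed fraction should land near 0.95: the 95% describes how often the procedure succeeds across many samples, not any single interval.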
Answer:
It’s a sequence of random values with no pattern or correlation.
Used in time-series to model random variation.
Example: Random stock market noise.
Answer:
The probability of correctly rejecting a false null hypothesis.
High power = low chance of a Type II error.
Power increases with:
● Lower variability
Answer:
Answer:
Use a paired t-test when comparing measurements from the same group before and after a change.
Example: Testing blood pressure before and after medication in the same patients.
Answer:
Answer:
A/B testing helps marketers decide which version of a product, webpage, or ad performs better by
comparing two variants. It tests hypotheses and helps optimize user experiences.
Example: Testing two versions of an email subject line to see which one generates more clicks.
Answer:
Answer:
A paired t-test compares the means of two related groups. It’s used when data points are paired or
matched (e.g., before and after measurements).
Example: Comparing a person’s weight before and after a diet program.
Answer:
A p-value < 0.05 means the observed result would be unlikely if the null hypothesis were true. At the usual
0.05 significance level, the result is statistically significant and we reject H₀.
Example: A p-value of 0.03 in a drug trial is evidence that the drug performs differently from the placebo.
25. What is the difference between confidence level and significance level?
Answer:
● Confidence level (e.g., 95%) is the long-run proportion of confidence intervals, built the same way from
repeated samples, that would contain the population parameter.
● Significance level (e.g., 0.05) is the threshold for rejecting the null hypothesis.
If the p-value < significance level, we reject H₀.
Answer:
If the p-value from ANOVA is less than 0.05, you conclude that at least one of the group means is
significantly different from the others. To determine which group differs, you would conduct post-hoc
tests like Tukey's HSD.
Example: Comparing salaries across three industries (Tech, Healthcare, and Education).
Answer:
The null hypothesis in A/B testing is that there is no difference between the two versions being tested.
Example: In a test between Version A (button red) and Version B (button green), the null hypothesis is
that the color doesn’t affect the click-through rate.
Answer:
Statistical power is the probability that a test will correctly reject the null hypothesis when it is false
(i.e., detect a true effect). High power reduces the risk of a Type II error.
Power = 1 - β (β is the probability of a Type II error).
Answer:
Larger sample sizes increase the power of a test and can make it easier to detect significant differences.
Small sample sizes may fail to detect real effects, even if they exist.
Example: A trial of a new treatment with 500 patients gives more reliable results than one with 50
patients.
Answer:
A non-parametric test doesn’t assume that data follows a specific distribution (e.g., normal distribution).
It is used when the data violates the assumptions of parametric tests like t-tests or ANOVA.
Example: Mann-Whitney U test (non-parametric alternative to the t-test).
31. What is the difference between a confidence interval and a prediction interval?
Answer:
● Confidence interval: Provides a range for the mean of a population based on a sample.
Answer:
The F-distribution is used in ANOVA to test the ratio of between-group variance to within-group variance. If the
F-statistic is large, it suggests at least one group mean differs significantly from the others.
Example: Testing if average test scores differ between three teaching methods.
Answer:
A Type I error is when you incorrectly reject a true null hypothesis (a false positive). This could lead
to wrong decisions like thinking a treatment works when it actually doesn’t.
Example: Approving a drug that has no effect.
34. What does it mean if the confidence interval for a mean includes zero?
Answer:
If the confidence interval includes zero, it suggests there’s a possibility of no effect. This would
indicate the null hypothesis cannot be rejected at a given confidence level.
Example: The interval for the difference in weight loss between two groups includes 0, so we fail to
reject the null hypothesis.
Answer:
The chi-square goodness-of-fit test compares the observed frequencies of a categorical variable to the
expected frequencies under a specific hypothesis.
Example: Testing if a die is fair by comparing the number of times each face appears to the count expected
under a 1/6 probability per face.
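The test statistic for the die example is Σ (observed − expected)² / expected. The sketch below computes only the statistic; the counts are invented, and 11.07 is the standard chi-square critical value for 5 degrees of freedom at the 0.05 level.

```python
def chi_square_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 12, 9, 11, 10, 10]        # hypothetical counts from 60 rolls
expected = [sum(observed) / 6] * 6       # a fair die: 10 per face

stat = chi_square_statistic(observed, expected)
critical = 11.07                         # df = 6 - 1 = 5, alpha = 0.05
print(stat, "reject fairness" if stat > critical else "fail to reject fairness")
```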
Answer:
A confusion matrix is a table used to evaluate the performance of a classification model. It shows the
counts of true positives, false positives, true negatives, and false negatives.
Example: For a spam email classifier, it shows how many emails were correctly marked as spam (true
positives) or incorrectly marked (false positives).
Answer:
Precision measures the proportion of positive predictions that were correct. It is calculated as:
Precision = TP / (TP + FP)
Answer:
Recall measures the proportion of actual positives that were correctly identified. It is calculated as:
Recall = TP / (TP + FN)
Answer:
Specificity measures the proportion of actual negatives correctly identified by the model.
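All three metrics follow from the four confusion-matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and specificity = TN / (TN + FP). The counts in this sketch are made up for a hypothetical spam classifier.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),     # of predicted positives, how many were right
        "recall": tp / (tp + fn),        # of actual positives, how many were found
        "specificity": tn / (tn + fp),   # of actual negatives, how many were found
    }

# Hypothetical counts for 1,000 emails.
metrics = classification_metrics(tp=40, fp=10, tn=930, fn=20)
print(metrics)
```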
Answer:
An ROC (Receiver Operating Characteristic) curve plots the true positive rate (recall) against the
false positive rate at various thresholds. It helps you understand the trade-off between sensitivity (recall)
and specificity.
Example: A model with a better ROC curve can better discriminate between classes.
Answer:
Lift measures how much better a model performs compared to random guessing. It’s the ratio of the
results predicted by the model to the baseline performance (random prediction).
Example: In direct mail marketing, if 30% of the recipients targeted by the model respond, while only
10% of randomly chosen recipients would respond, the lift is 3.
Answer:
● Global optima: The absolute best solution in the entire search space.
● Local optima: A solution that is better than nearby solutions but not necessarily the best overall.
Example: Imagine a mountain range; the highest peak is the global optimum, while smaller
peaks are local optima.
Answer:
Unconstrained optimization involves finding the optimal solution for a problem without any restrictions
or constraints.
Example: Finding the maximum profit in a business without any resource limitations.
Answer:
Least-squares optimization minimizes the sum of squared errors between observed and predicted
values, commonly used in linear regression.
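For a single feature, the least-squares line has a closed-form solution. The sketch below uses the textbook formulas for slope and intercept; the data points are chosen to lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]      # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```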
Answer:
Optimizing a machine learning model involves:
1. Hyperparameter tuning: Adjusting hyperparameters like learning rate or tree depth.
2. Feature selection: Choosing the most important features for the model.
Answer:
Gradient descent is an iterative optimization algorithm used to minimize a function (like the loss function
in machine learning). It updates model parameters in the opposite direction of the gradient to find the
minimum.
Example: In linear regression, gradient descent finds the best line by iteratively adjusting the
coefficients.
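A bare-bones sketch of gradient descent on the mean squared error of a line y = w·x + b. The learning rate, step count, and data are illustrative choices; a learning rate that is too large for the data would make the updates diverge.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step in the direction opposite to the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]      # exactly y = 2x + 1
w, b = gradient_descent(xs, ys)
print(w, b)
```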
14. What is the difference between convex and non-convex optimization problems?
Answer:
● Convex optimization: The objective function is convex, so any local minimum is also a global minimum.
This makes the problem easier to solve.
● Non-convex optimization: The objective function can have multiple local minima and saddle points, so an
algorithm may get stuck before reaching a global minimum.
Example: Convex: Linear regression. Non-convex: Neural networks.
Answer:
Constraints are conditions that limit the solutions of an optimization problem. They could be in the form
of equality or inequality restrictions.
Example: In business, constraints may be a limited budget, manpower, or materials.
Answer:
Lagrange multipliers are used to find the maximum or minimum of a function subject to equality
constraints. They help solve constrained optimization problems.
Example: Maximizing profit while production cost must equal a fixed budget.
Answer:
A convex function is one where the line segment between any two points on the graph lies on or above the
graph itself. In optimization, this property guarantees that any local minimum is also a global minimum.
Example: A simple parabolic function (e.g., f(x) = x²) is convex.
Answer:
A local minimum is a point where the function value is lower than its neighboring points, but it may not
be the lowest point overall.
Example: A wavy curve with several valleys has many local minima, but only the deepest valley is the global minimum.
Answer:
The objective function is the function that needs to be maximized or minimized in an optimization
problem. It’s the central piece in finding the optimal solution.
Example: In machine learning, the objective function could be the loss function (like mean squared
error) that needs to be minimized.
Answer:
Overfitting occurs when a model learns the noise in the training data rather than the actual pattern. This
results in a model that performs well on the training data but poorly on new data.
Example: A decision tree that perfectly classifies training data but fails on test data due to overfitting.
Answer:
Precision measures the proportion of true positives among all predicted positives. Specificity, on the
other hand, measures the proportion of true negatives among all actual negatives.
Precision: How many predicted positives are actually correct?
Specificity: How many actual negatives are correctly identified?
Example: In a disease detection model, precision focuses on the accuracy of predicted positive
diagnoses, while specificity focuses on avoiding false alarms.
Answer:
The threshold is the probability value above which the model predicts the positive class. You can choose
a threshold based on:
23. What is the F1-score and when should you use it?
Answer:
The F1-score is the harmonic mean of precision and recall, giving a balance between them. It’s useful
when you need to balance the trade-off between precision and recall, especially when dealing with
imbalanced datasets.
Example: In medical diagnostics, F1-score is useful when both false positives and false negatives carry
significant costs.
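The harmonic mean is F1 = 2 × precision × recall / (precision + recall). The sketch below uses made-up precision and recall values to show how F1 penalizes an imbalance between the two more than a plain average would.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.8, 0.6)
lopsided = f1_score(0.9, 0.1)    # arithmetic mean would be 0.5
print(balanced, lopsided)
```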
24. What is the difference between the ROC curve and Precision-Recall curve?
Answer:
● ROC Curve: Plots True Positive Rate (Recall) vs False Positive Rate. It is useful for balanced
datasets.
● Precision-Recall Curve: Plots Precision vs Recall. It’s more informative when the dataset is
imbalanced (e.g., detecting rare events like fraud).
Example: In fraud detection, a Precision-Recall curve is preferred because fraud cases are rare.
Answer:
Cross-validation helps assess the model’s performance by training and testing it on different subsets of
data, providing a more reliable estimate of how the model will generalize to new data.
Example: If you use k-fold cross-validation, the model is trained on k-1 folds and tested on the
remaining fold, rotating through all folds.
26. What is the role of the AUC-ROC curve in imbalanced classification problems?
Answer:
The AUC-ROC curve helps evaluate classifier performance in imbalanced datasets by focusing on how
well the model distinguishes between the positive and negative classes, regardless of class distribution.
Example: In predicting rare diseases, the AUC-ROC curve is useful to ensure the model isn’t biased
toward the more common negative class.
27. What is global optimization, and how does it differ from local optimization?
Answer:
Global optimization finds the best possible solution across the entire search space. Local optimization
only finds the best solution within a local region of the search space.
Example: In training deep learning models, local optimization (e.g., gradient descent) might only find
local minima, whereas global optimization seeks the overall best solution.
Answer:
A loss function measures the difference between predicted and actual values. It helps guide the
optimization process by indicating how well the model is performing.
Example: In regression, the Mean Squared Error (MSE) is a commonly used loss function to minimize
during model training.
29. What is the difference between global optima and saddle points?
Answer:
● A global optimum is the lowest (minimization problem) or highest (maximization problem)
point in the entire optimization space.
● Saddle points are points where the gradient is zero, but they are neither minima nor maxima: the function
rises in some directions and falls in others.
Example: A saddle point might appear in a neural network’s loss function where the gradient is
zero, but it’s not a minimum at all, let alone the global minimum.
31. What is stochastic gradient descent (SGD), and how is it different from regular gradient
descent?
Answer:
SGD updates the model parameters using only a single random sample (or a small batch) at each
iteration, making it computationally faster. It introduces randomness, which can help escape local optima.
Example: Training a neural network using SGD instead of batch gradient descent allows faster
convergence with large datasets.
32. What are the limitations of using only precision or recall as evaluation metrics?
Answer:
Using only precision or recall can be misleading.
● Precision ignores false negatives, and might be misleading in cases where false negatives are
important.
● Recall ignores false positives, which can be a problem when false positives are costly.
Example: In spam email detection, high recall but low precision means most spam is caught, but many
legitimate emails are also wrongly flagged as spam.
Answer:
Regularization adds a penalty term to the optimization function to prevent overfitting by reducing the
complexity of the model.
Example: In linear regression, L2 regularization (Ridge regression) adds a penalty to large coefficients to
avoid overfitting.
● L2 regularization (Ridge) discourages large coefficients but does not set them exactly to zero,
leading to a smoother model.
Example: L1 regularization is useful for feature selection, while L2 regularization is used to
prevent overfitting by shrinking coefficients.
35. How do you interpret the ROC AUC score in model evaluation?
Answer: