AI and Math_Python Multiple-Choice Questions
2. What happens if the learning rate in gradient descent is set too large?
A) Training will be very slow.
B) The algorithm may overshoot and diverge.
C) It ensures convergence to the minimum faster.
D) It has no effect on convergence.
Answer: B) The algorithm may overshoot and diverge.
Explanation: A very large learning rate causes each update to overshoot the optimum. In fact, if the learning rate exceeds a critical threshold, gradient descent will diverge (i.e., fail to converge) [2]. This can make the loss jump around or grow indefinitely instead of settling.
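For illustration, a minimal sketch (not from the source) of gradient descent on the toy objective f(w) = w^2, whose update is w ← w(1 − 2·lr), so any learning rate above 1.0 makes the iterates grow instead of shrink:

def gradient_descent(lr, steps=10, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w      # gradient of w**2 is 2*w
    return w

print(gradient_descent(lr=0.1))   # shrinks toward the minimum at 0
print(gradient_descent(lr=1.1))   # overshoots: the magnitude grows each step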
4. Why is cross-entropy loss often preferred over mean squared error (MSE) for classification with
a sigmoid output?
A) MSE leads to convex optimization, cross-entropy does not.
B) Cross-entropy simplifies to MSE in logistic regression.
C) Cross-entropy is convex in final weights, MSE with sigmoid may not be.
D) Cross-entropy always produces smaller loss values than MSE.
Answer: C) Cross-entropy is convex in final weights, MSE with sigmoid may not be.
Explanation: When using a sigmoid activation for binary classification, the cross-entropy loss is convex in the final layer’s parameters, whereas MSE combined with a sigmoid is not convex. This means MSE could get stuck in a local minimum, whereas cross-entropy provides a more direct gradient for learning [4]. Practically, cross-entropy loss tends to converge faster for classification tasks.
5. Which loss function is appropriate for a binary classification problem with a sigmoid output?
A) Mean Squared Error (MSE)
B) Hinge Loss
C) Binary Cross-Entropy (Log Loss)
D) Categorical Cross-Entropy
Answer: C) Binary Cross-Entropy (Log Loss).
Explanation: For binary classification (sigmoid output), binary cross-entropy (also called log loss) is commonly used. It measures the difference between predicted probabilities and actual binary labels [5]. MSE can be used but is less effective in this setting. Hinge loss is typical for SVMs, and categorical cross-entropy is meant for multi-class problems.
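A minimal sketch (not from the source) of binary cross-entropy for a single prediction, clipping the probability so the logarithms stay finite:

import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip the predicted probability away from 0 and 1 to avoid log(0).
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(binary_cross_entropy(1, 0.9))   # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))   # large loss: confident but wrong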
6. Which loss function is commonly used to train a (linear) Support Vector Machine (SVM)?
A) Mean Squared Error
B) Binary Cross-Entropy
C) Hinge Loss
D) Categorical Cross-Entropy
Answer: C) Hinge Loss.
Explanation: SVMs are margin-based classifiers, and their objective uses the hinge loss. Hinge loss is defined as max(0, 1 − y·f(x)) for labels y ∈ {+1, −1} [6]. This loss penalizes points within the margin or misclassified points, enforcing a margin of at least 1 for correct classifications.
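A direct translation of that formula into Python (a minimal sketch, not from the source):

def hinge_loss(y, score):
    # y is the true label in {+1, -1}; score is the raw model output f(x).
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.5))   # 0.0 -> correct and outside the margin
print(hinge_loss(+1, 0.3))   # 0.7 -> correct but inside the margin
print(hinge_loss(-1, 0.3))   # 1.3 -> misclassified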
8. In a confusion matrix for a binary classifier, what does a false positive (Type I error) represent?
A) Model predicted positive, actual negative.
B) Model predicted negative, actual positive.
C) Both predicted and actual are positive.
D) Both predicted and actual are negative.
Answer: A) Model predicted positive, actual negative.
Explanation: A false positive (FP) occurs when the model predicts the positive class (e.g., “yes” or “1”) but the true label is negative (e.g., “no” or “0”). It is indeed the case of “predicted positive/actual negative” [8].
9. What is precision in a binary classification context?
A) TP / (TP + FN)
B) TP / (TP + FP)
C) (TP + TN) / (TP + FP + TN + FN)
D) FP / (FP + TN)
Answer: B) TP / (TP + FP).
Explanation: Precision measures how many of the positively predicted instances are actually positive. It is defined as true positives divided by all predicted positives (TP + FP) [9]. A high precision means most predicted positives are correct.
12. Why might one prefer F1 score over accuracy for imbalanced classification problems?
A) F1 ignores false positives completely.
B) F1 equally weighs precision and recall, capturing performance on the minority class.
C) Accuracy is always unreliable.
D) F1 only considers true positives.
Answer: B) F1 equally weighs precision and recall, capturing performance on the minority class.
Explanation: For imbalanced datasets, accuracy can be misleading because it may be high simply by predicting the majority class. F1 score balances precision and recall, providing a single measure of a model’s accuracy on the positive class. It “gives a better sense of the classifier’s performance, especially on skewed datasets” [12].
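A minimal sketch (with made-up counts, not from the source) computing precision, recall, F1, and accuracy from confusion-matrix counts:

# Hypothetical counts for an imbalanced problem: 1000 negatives, 50 positives.
tp, fp, fn, tn = 30, 10, 20, 990

precision = tp / (tp + fp)                       # 0.75
recall = tp / (tp + fn)                          # 0.60
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)       # ~0.97 despite missing 40% of positives

print(precision, recall, round(f1, 3), round(accuracy, 3))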
13. In a spam detection task where false positives (legitimate email marked spam) are costly,
which metric should be maximized?
A) Precision
B) Recall
C) Accuracy
D) AUC-ROC
Answer: A) Precision.
Explanation: In this scenario, we want to minimize false positives (legitimate emails incorrectly flagged). Maximizing precision (TP/(TP+FP)) ensures that when we predict spam, it is indeed spam [9]. This reduces the rate of false positives.
14. In a medical test where missing a true case (false negative) is critical, which metric should be
maximized?
A) Precision
B) Recall
C) Accuracy
D) Specificity
Answer: B) Recall.
Explanation: Here false negatives are very costly (missing a sick patient). We want to maximize recall (sensitivity = TP/(TP+FN)) to catch as many true positive cases as possible [10]. A high recall means few actual positives are missed.
18. How does the ReLU activation help with the vanishing gradient problem?
A) It bounds outputs, preventing overflow.
B) Its derivative is either 0 or 1, avoiding small gradients.
C) It is non-monotonic.
D) It normalizes the input distribution.
Answer: B) Its derivative is either 0 or 1, avoiding small gradients.
Explanation: Unlike sigmoid, ReLU’s derivative is 1 for positive inputs and 0 for negative. This means positive values propagate gradients effectively without diminishing. As one source notes, using ReLU “prevents the gradient from vanishing” because the gradient does not shrink towards zero for positive inputs [16].
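A minimal sketch (not from the source) of ReLU and its derivative for a single input:

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 for negative ones (the value at x == 0 is a convention).
    return 1.0 if x > 0 else 0.0

print(relu(2.3), relu_grad(2.3))     # 2.3 1.0 -> gradient passes through unchanged
print(relu(-1.5), relu_grad(-1.5))   # 0.0 0.0 -> gradient is blocked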
19. What does the softmax function do when applied to the output of a neural network?
A) Converts raw scores to a probability distribution over classes.
B) Scales values to the range [-1,1].
C) Shifts all values by their mean.
D) Selects the highest scoring class.
Answer: A) Converts raw scores to a probability distribution over classes.
Explanation: The softmax function exponentiates each score and normalizes by the sum of exponentials, resulting in values in (0,1) that sum to 1 [17]. This makes the outputs interpretable as class probabilities in multi-class classification.
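A minimal sketch (not from the source) of a numerically stable softmax:

import math

def softmax(scores):
    # Subtracting the max score avoids overflow and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # roughly [0.659, 0.242, 0.099], each in (0, 1)
print(sum(probs))   # 1.0 up to floating-point error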
Specifically, the environment provides a reward signal and new state, guiding the agent to improve its policy [20]. The agent’s goal is to maximize cumulative reward.
24. Which ensemble technique builds models sequentially, where each new model focuses on the
errors of the previous models?
A) Bagging
B) Boosting
C) Random Subspace
D) Cross-Validation
Answer: B) Boosting.
Explanation: Boosting trains an ensemble of “weak learners” sequentially. Each new model pays more attention (higher weight) to instances the previous models misclassified, thereby iteratively correcting errors [22]. This contrasts with bagging, which builds its models independently.
27. Which part of a Generative Adversarial Network (GAN) is responsible for creating new data
samples?
A) The discriminator
B) The convolutional layer
C) The generator
D) The loss function
Answer: C) The generator.
Explanation: In a GAN, there are two neural networks: the generator creates new synthetic data (e.g., images) that resemble the training data, while the discriminator tries to distinguish real from generated samples [25]. The generator’s goal is to produce outputs so realistic that the discriminator cannot tell them apart.
Logistic regression, SVM, and CART are discriminative, modeling P(Y|X) directly. (This follows the idea that generative models capture joint probabilities [7].)
32. If a model predicts 100% of instances as the positive class in a highly imbalanced dataset,
which metric will it appear deceptively high on?
A) Precision
B) Recall
C) Accuracy
D) F1 Score
Answer: C) Accuracy.
Explanation: In imbalanced data, predicting all samples as the majority class yields high accuracy
(since that class dominates), but precision/recall on the minority class is poor. This shows accuracy
can be misleading for imbalanced problems.
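A minimal sketch (with made-up numbers, not from the source) where the positive class happens to be the overwhelming majority:

# 950 positives, 50 negatives; the model predicts positive for every instance.
y_true = [1] * 950 + [0] * 50
y_pred = [1] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)   # 0.95 -- looks strong even though the minority class is never detected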
33. Which activation function would you choose to mitigate the vanishing gradient problem in a
deep network?
A) Sigmoid
B) Hyperbolic tangent (tanh)
C) ReLU or its variants
D) Linear (identity)
Answer: C) ReLU or its variants.
Explanation: ReLU (Rectified Linear Unit) is non-saturating for positive inputs, with a constant gradient of 1. This avoids the gradient shrinkage that plagues sigmoid/tanh. As noted, replacing sigmoid with ReLU “is the simplest solution to the vanishing gradient problem” [26].
34. In a binary classification with an extremely imbalanced class distribution, which loss function
is most suitable?
A) Regular (unweighted) cross-entropy
B) Mean Squared Error
C) Weighted or focal loss variant of cross-entropy
D) Hinge loss
Answer: C) Weighted or focal loss variant of cross-entropy.
Explanation: For extreme class imbalance, one often uses weighted cross-entropy or specialized
losses like focal loss to give more importance to the minority class. A standard unweighted loss (like
regular cross-entropy or MSE) would bias toward the majority class. (Focal loss, for example, down-
weights easy examples to focus on hard ones.)
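A minimal sketch (not from the source) of a class-weighted binary cross-entropy, where errors on the rare positive class are penalized more heavily:

import math

def weighted_bce(y_true, y_pred, pos_weight=10.0, eps=1e-12):
    # pos_weight > 1 makes mistakes on the (rare) positive class cost more.
    p = min(max(y_pred, eps), 1 - eps)
    return -(pos_weight * y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(weighted_bce(1, 0.1))   # missed positive: heavily penalized (about 23.0)
print(weighted_bce(0, 0.1))   # correct negative: small loss (about 0.11)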
35. Which method is a form of regularization that encourages model weights to become sparse
(many zeros)?
A) L1 regularization
B) L2 regularization
C) Dropout
D) Batch normalization
Answer: A) L1 regularization.
Explanation: L1 regularization adds the sum of absolute weights to the loss. This has the effect of pushing many weights exactly to zero, yielding sparse solutions. In contrast, L2 regularization (sum of squares) only shrinks weights towards zero but rarely makes them exactly zero [27].
36. What effect does L2 regularization (weight decay) have on model weights?
A) It sets all weights exactly to zero.
B) It encourages weights to become small (but usually nonzero).
C) It only affects biases.
D) It increases the magnitude of weights to prevent underfitting.
Answer: B) It encourages weights to become small (but usually nonzero).
Explanation: L2 regularization adds the squared norm of weights to the loss, causing weights to decay towards zero. However, unlike L1, L2 typically produces small weights rather than exact zeros [28].
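A minimal sketch (with made-up weights and a hypothetical strength lam, not from the source) contrasting the two penalty terms that get added to the training loss:

weights = [0.0, -0.5, 1.2, 0.0, 3.0]
lam = 0.01   # hypothetical regularization strength

l1_penalty = lam * sum(abs(w) for w in weights)   # L1: encourages exact zeros (sparsity)
l2_penalty = lam * sum(w ** 2 for w in weights)   # L2: shrinks weights, rarely to exactly zero

print(l1_penalty, l2_penalty)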
37. Which scenario describes internal covariate shift that batch normalization addresses?
A) Changing distribution of inputs to hidden layers during training.
B) Data labels changing during training.
C) Overfitting due to low training error.
D) Underfitting due to insufficient model capacity.
Answer: A) Changing distribution of inputs to hidden layers during training.
Explanation: Internal covariate shift refers to the changing distribution of layer inputs as the network parameters update. Batch normalization reduces this shift by keeping layer inputs normalized (zero mean and unit variance), thus stabilizing training [29].
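A minimal sketch (not from the source) of the core normalization step for one feature across a batch, ignoring the learnable scale (gamma) and shift (beta) parameters:

def batch_norm(batch, eps=1e-5):
    # Normalize a batch of activations to zero mean and (roughly) unit variance.
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

print(batch_norm([2.0, 4.0, 6.0, 8.0]))   # roughly [-1.34, -0.45, 0.45, 1.34]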
38. Why might one use an unsupervised dimensionality reduction technique before training a
supervised model?
A) To label new data.
B) To remove noise and reduce overfitting.
C) To increase the number of features.
D) To convert categorical to numerical features.
Answer: B) To remove noise and reduce overfitting.
Explanation: Unsupervised reduction (like PCA) can compress data by capturing most variance,
potentially filtering noise and reducing model complexity. This can improve generalization and
reduce overfitting by lowering dimensionality before training a supervised model.
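A minimal sketch (assuming scikit-learn is available; the data here is random and only illustrates the pipeline shape, not a real result):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 samples, 50 noisy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels depend on only two features

# Reduce to 10 principal components before fitting the classifier.
model = make_pipeline(PCA(n_components=10), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))                  # training accuracy of the reduced model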
39. If your model is overfitting the training data, which of the following is a valid approach?
A) Remove regularization.
B) Increase model complexity (more layers, more neurons).
C) Add dropout or increase regularization (e.g., L2).
D) Use a larger learning rate to train faster.
Answer: C) Add dropout or increase regularization (e.g., L2).
Explanation: Overfitting indicates the model is too complex for the amount of data. To combat it, one can introduce or strengthen regularization (like L2 weight decay) or use dropout (randomly omitting units during training) to reduce co-adaptation [18]. Increasing model complexity or removing regularization would worsen overfitting.
Answer: B) Non-linearity enabling learning of complex functions.
Explanation: Activation functions (e.g., ReLU, tanh) introduce non-linear transformations to
neurons. This allows the network to approximate complex non-linear mappings. Without non-linear
activation, a deep network would collapse to a linear function regardless of depth.
3. What is the probability of getting exactly k successes in n independent Bernoulli trials with
success probability p?
A) p^k (1 − p)^(n−k)
B) C(n, k) p^(n−k) (1 − p)^k
C) C(n, k) p^k (1 − p)^(n−k)
D) n! / (k!(n − k)!) (without p factors)
Answer: C) C(n, k) p^k (1 − p)^(n−k).
Explanation: The probability of exactly k successes in n Bernoulli(p) trials is given by the binomial formula C(n, k) p^k (1 − p)^(n−k) [32], where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient.
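A quick check of the formula in Python (a minimal sketch, not from the source):

import math

def binomial_pmf(k, n, p):
    # P(exactly k successes in n Bernoulli(p) trials)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(2, 3, 0.5))   # 0.375, matching the coin-flip question below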
5. If a 3×3 matrix has eigenvalues 2, 3, and 4, what is its determinant?
A) 9
B) 24
C) 20
D) 6
Answer: B) 24.
Explanation: The determinant of a matrix equals the product of its eigenvalues (for an n×n matrix).
Here det(A) = 2 × 3 × 4 = 24.
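A quick numerical check (a minimal sketch, assuming NumPy is available):

import numpy as np

# A diagonal matrix with eigenvalues 2, 3, and 4; any matrix with these
# eigenvalues has the same determinant.
A = np.diag([2.0, 3.0, 4.0])
print(np.linalg.det(A))                 # 24.0 (up to floating-point error)
print(np.prod(np.linalg.eigvals(A)))    # also 24.0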
print(append_to(12))
print(append_to(42))
x = [1, 2, 3]
y = [1, 2, 3]
print(x == y)
print(x is y)
C) False then True
D) False then False
Answer: B) True then False.
Explanation: x == y checks value equality, which is True because both lists contain [1, 2, 3]. However, x is y checks object identity; x and y are two distinct list objects, so x is y is False [36].
P = [[0] * 3] * 3
P[0][0] = 5
print(P)
12. What is the result of 0.1 + 0.2 == 0.3 in Python, and why?
A) True, because 0.1 + 0.2 exactly equals 0.3.
B) False, because floating-point representations are imprecise.
C) True, because of automatic rounding.
D) False, because the == operator is broken in Python.
Answer: B) False, because floating-point representations are imprecise.
Explanation: In binary floating point (IEEE 754), numbers like 0.1 and 0.2 cannot be represented exactly. Thus 0.1 + 0.2 results in a number very close to, but not exactly, 0.3, making (0.1 + 0.2) == 0.3 evaluate to False [39].
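A minimal sketch (not from the source) showing the mismatch and the usual tolerance-based fix:

import math

print(0.1 + 0.2)                       # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                # False
print(math.isclose(0.1 + 0.2, 0.3))    # True -- compare floats with a tolerance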
13. What is the formula for the sum of the first n positive integers (1 + 2 + … + n)?
A) n(n + 1)/2
B) n^2
C) n(n − 1)/2
D) n^2 + n
Answer: A) n(n + 1)/2.
Explanation: The well-known formula for the sum of the first n integers is 1 + 2 + ⋯ + n = n(n + 1)/2. This can be derived by pairing terms or via induction.
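A quick check in Python (a minimal sketch, not from the source):

n = 100
print(sum(range(1, n + 1)))   # 5050, by direct summation
print(n * (n + 1) // 2)       # 5050, from the closed-form formula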
14. What is the probability of getting exactly 2 heads in 3 fair coin flips?
A) 0.25
B) 0.375
C) 0.5
D) 0.75
Answer: B) 0.375.
Explanation: There are C(3, 2) = 3 ways to get 2 heads out of 3 flips, and each specific outcome has probability (0.5)^3 = 0.125. Thus the probability is 3 × 0.125 = 0.375.
for i in range(5):
    pass
print(i)
A) 4
B) 5
C) 0
D) Error, since i is not defined outside the loop.
Answer: A) 4 .
Explanation: In Python, the loop variable i remains defined after the loop ends, retaining its last
value. After range(5) , the last value assigned to i was 4, so print(i) outputs 4 .
print(func(3))
print(func(3, 4))
A) 3 then 7
B) 0 then 7
C) 3 then 4
D) Error, because y has a default.
Answer: A) 3 then 7 .
Explanation: Calling func(3) uses the default y=0 , so it returns 3+0=3 . Calling func(3,4)
overrides the default, returning 3+4=7 .
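The definition of func is not shown in this excerpt; a minimal definition consistent with the described behavior (a hypothetical reconstruction, not the original) would be:

def func(x, y=0):
    # y falls back to its default of 0 when only one argument is passed.
    return x + y

print(func(3))      # 3
print(func(3, 4))   # 7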
Explanation: Calling len() on an integer (a non-sequence) raises TypeError: object of type 'int' has no len() [40], since integers do not support the len operation.
[4] neural network - Loss Function for Probability Regression - Data Science Stack Exchange
https://datascience.stackexchange.com/questions/45285/loss-function-for-probability-regression
[24] The Limitations of Perceptron: Why it Struggles with XOR | by Aryan Rusia | Medium
https://medium.com/@aryanrusia8/the-limitations-of-perceptron-why-it-struggles-with-xor-21905d31f924
[30] Binomial coefficient - Wikipedia
https://en.wikipedia.org/wiki/Binomial_coefficient
[38] Python list multiplication: [[...]]*3 makes 3 lists which mirror each other when modified - Stack Overflow
https://stackoverflow.com/questions/6688223/python-list-multiplication-3-makes-3-lists-which-mirror-each-other-when