
AI and Math/Python Multiple-Choice Questions

AI and Machine Learning MCQs


1. Which statement about gradient descent on a non-convex function is true?
A) It always finds the global minimum.
B) It may converge to a local minimum or saddle point.
C) It cannot be applied to non-convex functions.
D) It always oscillates without converging.
Answer: B) It may converge to a local minimum or saddle point.
Explanation: Gradient descent can be applied to non-convex loss functions, but unlike the convex
case, there is no guarantee of finding a global minimum. It will converge to a stationary point (which
could be a local minimum or saddle point) rather than necessarily the global minimum 1 .

2. What happens if the learning rate in gradient descent is set too large?
A) Training will be very slow.
B) The algorithm may overshoot and diverge.
C) It ensures convergence to the minimum faster.
D) It has no effect on convergence.
Answer: B) The algorithm may overshoot and diverge.
Explanation: A very large learning rate causes each update to overshoot the optimum. In fact, if the
learning rate exceeds a critical threshold, gradient descent will diverge (i.e., fail to converge) 2 . This
can make the loss jump around or grow indefinitely instead of settling.
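
As a quick illustration of the learning-rate behaviour described in questions 2 and 3, here is a minimal, self-contained sketch (not part of the original question set) of gradient descent on f(x) = x^2, whose gradient is 2x; the starting point, step counts, and rates are arbitrary choices.

# Gradient descent on f(x) = x^2 with gradient f'(x) = 2x.
def gradient_descent(lr, x0=5.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # update: x <- x - lr * f'(x)
    return x

print(gradient_descent(lr=0.1))  # moves steadily toward the minimum at x = 0
print(gradient_descent(lr=1.1))  # overshoots on every step; the iterates grow and diverge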

3. What is the effect of a very small learning rate in gradient descent?


A) It prevents convergence.
B) It makes training extremely slow and possibly stuck.
C) It causes oscillations around the optimum.
D) It has the same effect as a large learning rate.
Answer: B) It makes training extremely slow and possibly stuck.
Explanation: If the learning rate is too small, weight updates are tiny, so gradient descent converges
very slowly and may require many epochs. The model might also get stuck in flat regions or poor
local minima because updates are too small to move to better solutions 3 .

4. Why is cross-entropy loss often preferred over mean squared error (MSE) for classification with
a sigmoid output?
A) MSE leads to convex optimization, cross-entropy does not.
B) Cross-entropy simplifies to MSE in logistic regression.
C) Cross-entropy is convex in final weights, MSE with sigmoid may not be.
D) Cross-entropy always produces smaller loss values than MSE.
Answer: C) Cross-entropy is convex in final weights, MSE with sigmoid may not be.
Explanation: When using a sigmoid activation for binary classification, the cross-entropy loss is
convex in the final layer’s parameters, whereas MSE combined with a sigmoid is not convex. This
means MSE could get stuck in a local minimum, whereas cross-entropy provides a more direct
gradient for learning 4 . Practically, cross-entropy loss tends to converge faster for classification
tasks.

5. Which loss function is appropriate for a binary classification problem with a sigmoid output?
A) Mean Squared Error (MSE)
B) Hinge Loss
C) Binary Cross-Entropy (Log Loss)
D) Categorical Cross-Entropy
Answer: C) Binary Cross-Entropy (Log Loss).
Explanation: For binary classification (sigmoid output), binary cross-entropy (also called log loss) is
commonly used. It measures the difference between predicted probabilities and actual binary labels
5 . MSE can be used but is less effective in this setting. Hinge loss is typical for SVMs, and
categorical cross-entropy is for multi-class classification.
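
To make the loss concrete, here is a small hand-rolled sketch (an illustration only, not a fragment of any particular library) of binary cross-entropy over a batch of labels and predicted probabilities.

import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Average log loss over the batch; probabilities are clipped to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))  # roughly 0.28; confident correct predictions give low loss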

6. Which loss function is commonly used to train a (linear) Support Vector Machine (SVM)?
A) Mean Squared Error
B) Binary Cross-Entropy
C) Hinge Loss
D) Categorical Cross-Entropy
Answer: C) Hinge Loss.
Explanation: SVMs are margin-based classifiers, and their objective uses the hinge loss. Hinge loss
is defined as max(0, 1 – y·f(x)) for labels y ∈ {+1, -1} 6 . This loss penalizes points within the
margin or misclassified points, enforcing a margin of at least 1 for correct classifications.
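
A direct transcription of the hinge-loss formula above, shown as a small sketch with made-up scores:

def hinge_loss(y, score):
    # y is the label in {+1, -1}; score is the raw classifier output f(x).
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.5))  # 0.0 -> correctly classified with margin, no penalty
print(hinge_loss(+1, 0.3))  # 0.7 -> correct side but inside the margin
print(hinge_loss(-1, 0.3))  # 1.3 -> misclassified, larger penalty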

7. Which of the following is a characteristic of a generative model (as opposed to a discriminative


model)?
A) It models only the conditional probability P(Y|X).
B) It cannot generate new data samples.
C) It captures the joint distribution P(X, Y) and can generate new data.
D) It always has lower error than discriminative models.
Answer: C) It captures the joint distribution P(X, Y) and can generate new data.
Explanation: Generative models learn a model of the joint probability (or equivalently model P(X|Y)
and P(Y)). They can generate new synthetic data points similar to the training set. As one source
notes, “Generative models can generate new data points … They capture the joint probability and
can be used for generative tasks” 7 . Discriminative models learn only P(Y|X) and do not generate
new data.

8. In a confusion matrix for a binary classifier, what does a false positive (Type I error) represent?
A) Model predicted positive, actual negative.
B) Model predicted negative, actual positive.
C) Both predicted and actual are positive.
D) Both predicted and actual are negative.
Answer: A) Model predicted positive, actual negative.
Explanation: A false positive (FP) occurs when the model predicts the positive class (e.g., “yes” or “1”)
but the true label is negative (e.g., “no” or “0”). It is indeed the case of “predicted positive/actual
negative” 8 .

9. What is precision in a binary classification context?
A) TP / (TP + FN)
B) TP / (TP + FP)
C) (TP + TN) / (TP + FP + TN + FN)
D) FP / (FP + TN)
Answer: B) TP / (TP + FP).
Explanation: Precision measures how many of the positively predicted instances are actually
positive. It is defined as true positives divided by all predicted positives (TP + FP) 9 . A high precision
means most predicted positives are correct.

10. What is recall (sensitivity) in a binary classification context?


A) TP / (TP + FN)
B) TN / (TN + FP)
C) (TP + TN) / (TP + FP + TN + FN)
D) FP / (FP + TN)
Answer: A) TP / (TP + FN).
Explanation: Recall (also called sensitivity or true positive rate) measures how many of the actual
positive instances were correctly identified. It is defined as true positives divided by all actual
positives (TP + FN) 10 .

11. Which statement about F1 score is correct?


A) F1 is the arithmetic mean of precision and recall.
B) F1 = (Precision + Recall) / 2.
C) F1 = 2 * (Precision * Recall) / (Precision + Recall).
D) F1 only depends on accuracy and precision.
Answer: C) F1 = 2 * (Precision * Recall) / (Precision + Recall).
Explanation: The F1-score is the harmonic mean of precision and recall, given by
2 * precision * recall / (precision + recall) 11 . This combines both metrics to give
a single measure, which is especially useful for imbalanced classes.
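
The three formulas from questions 9-11 can be checked with a short sketch; the confusion-matrix counts below are hypothetical.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                            # TP / (TP + FP)
    recall = tp / (tp + fn)                               # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)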

12. Why might one prefer F1 score over accuracy for imbalanced classification problems?
A) F1 ignores false positives completely.
B) F1 equally weighs precision and recall, capturing performance on the minority class.
C) Accuracy is always unreliable.
D) F1 only considers true positives.
Answer: B) F1 equally weighs precision and recall, capturing performance on the minority class.
Explanation: For imbalanced datasets, accuracy can be misleading because it may be high simply by
predicting the majority class. F1 score balances precision and recall, providing a single measure of a
model’s accuracy on the positive class. It “gives a better sense of the classifier’s performance,
especially on skewed datasets” 12 .

13. In a spam detection task where false positives (legitimate email marked spam) are costly,
which metric should be maximized?
A) Precision
B) Recall
C) Accuracy
D) AUC-ROC
Answer: A) Precision.
Explanation: In this scenario, we want to minimize false positives (legitimate emails incorrectly
flagged). Maximizing precision (TP/(TP+FP)) ensures that when we predict spam, it is indeed spam
9 . This reduces the rate of false positives.

14. In a medical test where missing a true case (false negative) is critical, which metric should be
maximized?
A) Precision
B) Recall
C) Accuracy
D) Specificity
Answer: B) Recall.
Explanation: Here false negatives are very costly (missing a sick patient). We want to maximize recall
(sensitivity = TP/(TP+FN)) to catch as many true positive cases as possible 10 . A high recall means
few actual positives are missed.

15. What is the accuracy of a classifier?


A) TP / (TP + FP)
B) TN / (TN + FP)
C) (TP + TN) / (TP + TN + FP + FN)
D) 2 * (Precision * Recall) / (Precision + Recall)
Answer: C) (TP + TN) / (TP + TN + FP + FN).
Explanation: Accuracy measures the overall fraction of correct predictions: both true positives and
true negatives out of all predictions 13 .

16. Which activation function is defined as f(x) = max(0, x)?


A) Sigmoid
B) Tanh
C) ReLU (Rectified Linear Unit)
D) Softmax
Answer: C) ReLU (Rectified Linear Unit).
Explanation: The ReLU activation outputs the input directly if it is positive, otherwise it outputs zero
14 . In formula terms, f(x) = max(0, x). It is very popular in deep networks because it is simple and
alleviates the vanishing gradient problem.
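
As a one-line illustration of the definition f(x) = max(0, x):

def relu(x):
    return max(0.0, x)

print([relu(v) for v in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]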

17. What is a drawback of the sigmoid activation function in deep networks?


A) It outputs values outside [0,1].
B) It causes vanishing gradients for large positive inputs.
C) It is not differentiable.
D) It always produces negative outputs.
Answer: B) It causes vanishing gradients for large positive (or negative) inputs.
Explanation: Sigmoid squashes inputs to (0,1). For very large magnitude inputs, its gradient
becomes near zero, leading to the vanishing gradient problem in deep networks 15 . This makes
learning slow or stalled in deep layers.

18. How does the ReLU activation help with the vanishing gradient problem?
A) It bounds outputs, preventing overflow.
B) Its derivative is either 0 or 1, avoiding small gradients.
C) It is non-monotonic.
D) It normalizes the input distribution.
Answer: B) Its derivative is either 0 or 1, avoiding small gradients.
Explanation: Unlike sigmoid, ReLU’s derivative is 1 for positive inputs and 0 for negative. This means
positive values propagate gradients effectively without diminishing. As one source notes, using ReLU
“prevents the gradient from vanishing” because the gradient does not shrink towards zero for
positive inputs 16 .

19. What does the softmax function do when applied to the output of a neural network?
A) Converts raw scores to a probability distribution over classes.
B) Scales values to the range [-1,1].
C) Shifts all values by their mean.
D) Selects the highest scoring class.
Answer: A) Converts raw scores to a probability distribution over classes.
Explanation: The softmax function exponentiates each score and normalizes by the sum of
exponentials, resulting in values in (0,1) that sum to 1 17 . This makes the outputs interpretable as
class probabilities in multi-class classification.
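
Here is a minimal sketch of softmax; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result.

import math

def softmax(scores):
    m = max(scores)                            # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # approximately [0.659, 0.242, 0.099]
print(sum(probs))  # sums to 1 (up to floating-point rounding)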

20. What is the purpose of dropout in training neural networks?


A) To speed up matrix computations.
B) To augment data by mixing inputs.
C) To prevent overfitting by randomly omitting units.
D) To ensure deterministic outputs.
Answer: C) To prevent overfitting by randomly omitting units.
Explanation: Dropout randomly “drops” (sets to zero) a subset of neurons in each training step,
which prevents units from co-adapting. It effectively trains an ensemble of thinner networks. This
technique “significantly reduces overfitting” in large networks 18 .

21. What is batch normalization used for?


A) It adds dropout layers to a network.
B) It accelerates training by normalizing layer inputs.
C) It ensures batch sizes are equal in training.
D) It sums gradients across batches.
Answer: B) It accelerates training by normalizing layer inputs.
Explanation: Batch normalization rescales and recenters the inputs of each layer (within each mini-
batch) so they have zero mean and unit variance during training. This makes training faster and
more stable 19 .
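
A stripped-down sketch of the normalization step (the learnable scale and shift parameters, gamma and beta, are omitted here), assuming NumPy is available:

import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) of a mini-batch to zero mean and unit variance.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
print(batch_norm(batch).mean(axis=0))  # ~[0, 0]
print(batch_norm(batch).std(axis=0))   # ~[1, 1]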

22. In reinforcement learning, how does an agent learn?


A) By supervised labels from a dataset.
B) By receiving rewards from interactions with an environment.
C) By clustering data points.
D) By minimizing reconstruction error.
Answer: B) By receiving rewards from interactions with an environment.
Explanation: In RL, an agent takes actions in an environment and learns from the feedback.
Specifically, the environment provides a reward signal and new state, guiding the agent to improve
its policy 20 . The agent’s goal is to maximize cumulative reward.

23. Which ensemble method does a Random Forest primarily use?


A) Boosting (sequential).
B) Bagging (parallel).
C) Stacking.
D) AdaBoost.
Answer: B) Bagging (parallel).
Explanation: Random Forest builds many decision trees independently on bootstrap-sampled
subsets of data (and random feature subsets). This is a form of “bagging” (bootstrap aggregating)
21 . The final prediction averages (or votes) over these trees.

24. Which ensemble technique builds models sequentially, where each new model focuses on the
errors of the previous models?
A) Bagging
B) Boosting
C) Random Subspace
D) Cross-Validation
Answer: B) Boosting.
Explanation: Boosting trains an ensemble of “weak learners” sequentially. Each new model pays
more attention (higher weight) to instances the previous models misclassified, thereby iteratively
correcting errors 22 . This is opposite to bagging, which builds models independently.

25. Which of the following learning algorithms is unsupervised?


A) Support Vector Machine (SVM)
B) K-Means Clustering
C) Logistic Regression
D) Decision Tree Classifier
Answer: B) K-Means Clustering.
Explanation: K-Means is a classic unsupervised algorithm that groups unlabeled data into clusters
based on similarity 23 . The other listed methods (SVM, logistic regression, decision tree classifier)
are supervised learning algorithms.

26. What is one fundamental limitation of a single-layer perceptron?


A) It cannot perform regression.
B) It cannot solve problems that are not linearly separable.
C) It always overfits the training data.
D) It cannot classify any data beyond binary.
Answer: B) It cannot solve problems that are not linearly separable.
Explanation: A perceptron is a linear classifier (a single-layer neural network). It can only separate
data with a straight line (hyperplane). As famously noted, it cannot solve problems like XOR which
are not linearly separable 24 . This limitation motivated multi-layer networks.

27. Which part of a Generative Adversarial Network (GAN) is responsible for creating new data
samples?
A) The discriminator
B) The convolutional layer
C) The generator
D) The loss function
Answer: C) The generator.
Explanation: In a GAN, there are two neural networks: the generator creates new synthetic data
(e.g., images) that resemble the training data, while the discriminator tries to distinguish real from
generated samples 25 . The generator’s goal is to produce outputs so realistic that the discriminator
cannot tell them apart.

28. Which scenario is typical of reinforcement learning?


A) An agent learns to classify images given labeled examples.
B) An agent learns to predict future stock prices from historical data.
C) An agent learns a policy to maximize reward by trial-and-error interactions with an environment.
D) An agent compresses data into a lower-dimensional representation.
Answer: C) An agent learns a policy to maximize reward by trial-and-error interactions with an
environment.
Explanation: Reinforcement learning involves an agent that takes actions in an environment,
receiving a reward signal. The agent’s goal is to maximize cumulative reward through trial and error
20 . This distinguishes it from supervised or unsupervised paradigms.

29. What metric is given by (2 × Precision × Recall)/(Precision + Recall) ?


A) Accuracy
B) Specificity
C) F1 score
D) Dice coefficient (note: similar formula)
Answer: C) F1 score.
Explanation: This formula is exactly the F1-score (harmonic mean of precision and recall) 11 . (The
Dice coefficient in binary classification is the same formula as F1, but the standard term here is F1-
score.)

30. Why is F1-score especially useful in imbalanced datasets?


A) It only uses true negatives.
B) It balances precision and recall, capturing minority class performance.
C) It gives more weight to the majority class.
D) It is equivalent to accuracy when classes are imbalanced.
Answer: B) It balances precision and recall, capturing minority class performance.
Explanation: F1 combines precision and recall, giving insight into how well the model does on the
positive (often minority) class. For imbalanced data, focusing solely on accuracy can be misleading,
while F1 reflects performance on both false positives and false negatives 11 12 .

31. Which of the following is a generative model?


A) Logistic Regression
B) Support Vector Machine (SVM)
C) Naive Bayes
D) Decision Tree (CART)
Answer: C) Naive Bayes.
Explanation: Naive Bayes is a generative model: it models the joint distribution P(X, Y) by P(X|Y)P(Y).
Logistic regression, SVM, and CART are discriminative, modeling P(Y|X) directly. (This follows the idea
that generative models capture joint probabilities 7 .)

32. If a model predicts 100% of instances as the positive class in a highly imbalanced dataset,
which metric will it appear deceptively high on?
A) Precision
B) Recall
C) Accuracy
D) F1 Score
Answer: C) Accuracy.
Explanation: In imbalanced data, predicting all samples as the majority class yields high accuracy
(since that class dominates), but precision/recall on the minority class is poor. This shows accuracy
can be misleading for imbalanced problems.

33. Which activation function would you choose to mitigate the vanishing gradient problem in a
deep network?
A) Sigmoid
B) Hyperbolic tangent (tanh)
C) ReLU or its variants
D) Linear (identity)
Answer: C) ReLU or its variants.
Explanation: ReLU (Rectified Linear Unit) is non-saturating for positive inputs, with a constant
gradient of 1. This avoids gradient shrinkage that plagues sigmoid/tanh. As noted, replacing sigmoid
with ReLU “is the simplest solution to the vanishing gradient problem” 26 .

34. In a binary classification with an extremely imbalanced class distribution, which loss function
is most suitable?
A) Regular (unweighted) cross-entropy
B) Mean Squared Error
C) Weighted or focal loss variant of cross-entropy
D) Hinge loss
Answer: C) Weighted or focal loss variant of cross-entropy.
Explanation: For extreme class imbalance, one often uses weighted cross-entropy or specialized
losses like focal loss to give more importance to the minority class. A standard unweighted loss (like
regular cross-entropy or MSE) would bias toward the majority class. (Focal loss, for example, down-
weights easy examples to focus on hard ones.)

35. Which method is a form of regularization that encourages model weights to become sparse
(many zeros)?
A) L1 regularization
B) L2 regularization
C) Dropout
D) Batch normalization
Answer: A) L1 regularization.
Explanation: L1 regularization adds the sum of absolute weights to the loss. This has the effect of
pushing many weights exactly to zero, yielding sparse solutions. In contrast, L2 regularization (sum
of squares) only shrinks weights towards zero but rarely makes them exactly zero 27 .

36. What effect does L2 regularization (weight decay) have on model weights?
A) It sets all weights exactly to zero.
B) It encourages weights to become small (but usually nonzero).
C) It only affects biases.
D) It increases the magnitude of weights to prevent underfitting.
Answer: B) It encourages weights to become small (but usually nonzero).
Explanation: L2 regularization adds the squared norm of weights to the loss, causing weights to
decay towards zero. However, unlike L1, L2 typically produces small weights rather than exact zeros 28 .
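
The difference between the two penalties is easiest to see as the extra term each adds to the training loss; the weights, data loss, and lambda below are made-up numbers for illustration.

def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)  # sum of absolute values -> pushes weights to exactly 0

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)   # sum of squares -> shrinks weights toward 0

w = [0.5, -0.2, 0.0, 1.3]
data_loss = 0.42                               # placeholder for the unregularized loss
print(data_loss + l1_penalty(w, lam=0.01))     # 0.44
print(data_loss + l2_penalty(w, lam=0.01))     # 0.4398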

37. Which scenario describes internal covariate shift that batch normalization addresses?
A) Changing distribution of inputs to hidden layers during training.
B) Data labels changing during training.
C) Overfitting due to low training error.
D) Underfitting due to insufficient model capacity.
Answer: A) Changing distribution of inputs to hidden layers during training.
Explanation: Internal covariate shift refers to the changing distribution of layer inputs as the
network parameters update. Batch normalization reduces this shift by keeping layer inputs
normalized (zero mean and unit variance), thus stabilizing training 29 .

38. Why might one use an unsupervised dimensionality reduction technique before training a
supervised model?
A) To label new data.
B) To remove noise and reduce overfitting.
C) To increase the number of features.
D) To convert categorical to numerical features.
Answer: B) To remove noise and reduce overfitting.
Explanation: Unsupervised reduction (like PCA) can compress data by capturing most variance,
potentially filtering noise and reducing model complexity. This can improve generalization and
reduce overfitting by lowering dimensionality before training a supervised model.
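
A short sketch of this idea using scikit-learn's PCA on synthetic data (assuming scikit-learn and NumPy are installed; the shapes and component count are arbitrary):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples with 10 noisy features (synthetic)

pca = PCA(n_components=3)             # keep only the 3 highest-variance directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (100, 3) -- fewer features for the supervised model
print(pca.explained_variance_ratio_.sum())    # fraction of the total variance retained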

39. If your model is overfitting the training data, which of the following is a valid approach?
A) Remove regularization.
B) Increase model complexity (more layers, more neurons).
C) Add dropout or increase regularization (e.g., L2).
D) Use a larger learning rate to train faster.
Answer: C) Add dropout or increase regularization (e.g., L2).
Explanation: Overfitting indicates the model is too complex for the amount of data. To combat it,
one can introduce or strengthen regularization (like L2 weight decay) or use dropout (randomly
omitting units during training) to reduce co-adaptation 18 . Increasing model complexity or
removing regularization would worsen overfitting.

40. What does an activation function add to a neural network?


A) Linearity to the model.
B) Non-linearity enabling learning of complex functions.
C) Additional regularization.
D) The optimization algorithm.
Answer: B) Non-linearity enabling learning of complex functions.
Explanation: Activation functions (e.g., ReLU, tanh) introduce non-linear transformations to
neurons. This allows the network to approximate complex non-linear mappings. Without non-linear
activation, a deep network would collapse to a linear function regardless of depth.

Math and Python MCQs


1. What is the value of the combination “n choose k” (the number of ways to choose k items from
n)?
A) n!/(n−k)!
B) n!/(k!(n−k)!)
C) n^k
D) n!/((k−1)!(n−k+1)!)
Answer: B) n!/(k!(n−k)!).
Explanation: The binomial coefficient, read as "n choose k", is given by n!/(k!(n−k)!). This formula counts
the number of ways to select k elements from n without regard to order 30 .

2. Which of the following correctly states Bayes’ theorem?


A) P(A|B) = P(A) + P(B)
B) P(A|B) = P(B|A) P(A) / P(B)
C) P(A|B) = P(A ∩ B) / P(B)
D) P(A|B) = P(A ∪ B)
Answer: B) P(A|B) = P(B|A) P(A) / P(B).
Explanation: Bayes’ theorem relates conditional probabilities by P(A|B) = P(B|A) P(A) / P(B) 31 .

3. What is the probability of getting exactly k successes in n independent Bernoulli trials with
success probability p?
A) p^k (1 − p)^(n−k)
B) C(n, k) p^(n−k) (1 − p)^k
C) C(n, k) p^k (1 − p)^(n−k)
D) n!/(k!(n−k)!) (without p factors)
Answer: C) C(n, k) p^k (1 − p)^(n−k).
Explanation: The probability of exactly k successes in n Bernoulli(p) trials is given by the binomial
formula C(n, k) p^k (1 − p)^(n−k) 32 , where C(n, k) = n!/(k!(n−k)!).
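
Python's standard library can verify both formulas (math.comb requires Python 3.8 or newer):

from math import comb

def binomial_pmf(n, k, p):
    # P(exactly k successes in n independent Bernoulli(p) trials)
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(comb(5, 2))               # 10 ways to choose 2 items from 5
print(binomial_pmf(3, 2, 0.5))  # 0.375 -- matches the coin-flip question later in this set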

4. Which statement is equivalent to saying "matrix A is invertible"?


A) det(A) = 0.
B) det(A) ≠ 0.
C) Zero is an eigenvalue of A.
D) A is rectangular.
Answer: B) det(A) ≠ 0.
Explanation: A square matrix A is invertible (non-singular) if and only if its determinant is nonzero.
Equivalently, 0 is not an eigenvalue of A 33 .

5. If a 3×3 matrix has eigenvalues 2, 3, and 4, what is its determinant?
A) 9
B) 24
C) 20
D) 6
Answer: B) 24.
Explanation: The determinant of a matrix equals the product of its eigenvalues (for an n×n matrix).
Here det(A) = 2 × 3 × 4 = 24.
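
This can be checked numerically with NumPy; a diagonal matrix is the simplest matrix with the given eigenvalues, though any similar matrix would do.

import numpy as np

A = np.diag([2.0, 3.0, 4.0])              # eigenvalues 2, 3, 4
print(np.linalg.det(A))                   # ~24.0
print(np.prod(np.linalg.eigvals(A)))      # ~24.0 -- determinant equals the product of eigenvalues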

6. What is the output of the following Python code?

def append_to(element, to=[]):
    to.append(element)
    return to

print(append_to(12))
print(append_to(42))

A) [12] then [42]


B) [12] then [42, 12]
C) [12] then [12, 42]
D) [12, 42] then [42]
Answer: C) [12] then [12, 42].
Explanation: The default list to is created once and shared across function calls. On the first call it
becomes [12] . On the second call, the same list gets appended with 42, yielding [12, 42] . This
behaviour arises because Python evaluates default arguments only once at definition time 34 .
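
The usual fix for this pitfall (shown here as a sketch, not part of the original question) is to default to None and create a fresh list inside the function:

def append_to(element, to=None):
    if to is None:          # a new list is created on every call instead of one shared default
        to = []
    to.append(element)
    return to

print(append_to(12))  # [12]
print(append_to(42))  # [42]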

7. What does the expression a is b check in Python?


A) Whether a and b have the same value.
B) Whether a and b are the same object in memory.
C) Whether the contents of a and b are deeply equal.
D) Whether a and b have the same type.
Answer: B) Whether a and b are the same object in memory.
Explanation: The is operator checks identity: it returns True only if both variables point to the
exact same object 35 . (This is different from == , which checks if values are equal.)

8. Given the Python code below, what will be printed?

x = [1, 2, 3]
y = [1, 2, 3]
print(x == y)
print(x is y)

A) True then True


B) True then False
C) False then True
D) False then False
Answer: B) True then False.
Explanation: x == y checks value equality, which is True because both lists contain [1,2,3].
However, x is y checks object identity. x and y are two distinct list objects, so x is y is
False 36 .

9. What is a key difference between a Python list and tuple?


A) Lists can be used as dictionary keys, tuples cannot.
B) Lists are immutable, tuples are mutable.
C) Lists are mutable, tuples are immutable.
D) Lists support fewer operations than tuples.
Answer: C) Lists are mutable, tuples are immutable.
Explanation: In Python, lists can be modified (mutable), whereas tuples cannot be changed after
creation (immutable) 37 . This also means tuples can be used as keys in dictionaries (since they are
hashable) but lists cannot.

10. Which of these can be used as a key in a Python dictionary?


A) A list
B) A tuple
C) A dict
D) A set
Answer: B) A tuple.
Explanation: Dictionary keys must be immutable (hashable). Tuples are immutable and can serve as
keys; lists, dicts, and sets are mutable and cannot be keys. (This follows from the list-vs-tuple
immutability property 37 .)

11. What will be the output of the following code?

P = [[0] * 3] * 3
P[0][0] = 5
print(P)

A) [[5, 0, 0], [0, 0, 0], [0, 0, 0]]


B) [[5, 0, 0], [5, 0, 0], [5, 0, 0]]
C) [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
D) Error
Answer: B) [[5, 0, 0], [5, 0, 0], [5, 0, 0]] .
Explanation: Using [[0] * 3] * 3 creates three references to the same inner list. Assigning
P[0][0] = 5 therefore appears to change the first element of every row, because all three rows share the same list object 38 .

12. What is the result of 0.1 + 0.2 == 0.3 in Python, and why?
A) True, because 0.1 + 0.2 exactly equals 0.3.
B) False, because floating-point representations are imprecise.
C) True, because of automatic rounding.
D) False, because the == operator is broken in Python.
Answer: B) False, because floating-point representations are imprecise.
Explanation: In binary floating point (IEEE 754), numbers like 0.1 and 0.2 cannot be represented
exactly. Thus 0.1+0.2 results in a number very close to but not exactly 0.3, making
(0.1 + 0.2) == 0.3 evaluate to False 39 .
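
In practice, floats are compared with a tolerance rather than ==; for example, math.isclose from the standard library:

import math

print(0.1 + 0.2)                     # 0.30000000000000004
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True -- compares with a relative tolerance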

13. What is the formula for the sum of the first n positive integers (1 + 2 + … + n)?
A) n(n + 1)/2
B) n^2
C) n(n − 1)/2
D) n^2 + n
Answer: A) n(n + 1)/2.
Explanation: The well-known formula for the sum of the first n integers is 1 + 2 + ⋯ + n = n(n + 1)/2.
This can be derived by pairing terms or via induction.

14. What is the probability of getting exactly 2 heads in 3 fair coin flips?
A) 0.25
B) 0.375
C) 0.5
D) 0.75
Answer: B) 0.375.
Explanation: There are C(3, 2) = 3 ways to get 2 heads out of 3 flips, and each specific outcome has
probability (0.5)^3 = 0.125. Thus the probability is 3 × 0.125 = 0.375.

15. Given a = [1, 2] and b = a , what does a become after a += [3] ?


A) [1, 2, 3] and b is still [1, 2] .
B) [1, 2, 3] and b is also [1, 2, 3] .
C) [1, 2] and b is [1, 2, 3] .
D) Error, because b is a copy of a.
Answer: B) [1, 2, 3] and b is also [1, 2, 3] .
Explanation: b = a makes b reference the same list as a . The += [3] operation modifies the
list in place, so both a and b will reflect the change, yielding [1, 2, 3] .

16. What will be printed by the following code?

for i in range(5):
    pass
print(i)

A) 4
B) 5
C) 0
D) Error, since i is not defined outside the loop.
Answer: A) 4.
Explanation: In Python, the loop variable i remains defined after the loop ends, retaining its last
value. After range(5) , the last value assigned to i was 4, so print(i) outputs 4 .

17. In Python, what does the expression len('42') return?


A) 42
B) 2
C) Error, because '42' is not numeric.
D) 1
Answer: B) 2.
Explanation: The string '42' has two characters ('4' and '2'), so len('42') returns 2. (Calling
len() on an integer value like 42 would cause a TypeError 40 , but on the string it counts
characters.)

18. Which of the following is not a mutable data type in Python?


A) list
B) dict
C) tuple
D) set
Answer: C) tuple.
Explanation: Tuples are immutable (cannot be changed after creation) 37 . Lists, dictionaries, and
sets are all mutable.

19. What is the output of this Python code?

def func(x, y=0):
    return x + y

print(func(3))
print(func(3, 4))

A) 3 then 7
B) 0 then 7
C) 3 then 4
D) Error, because y has a default.
Answer: A) 3 then 7 .
Explanation: Calling func(3) uses the default y=0 , so it returns 3+0=3 . Calling func(3,4)
overrides the default, returning 3+4=7 .

20. In Python, what exception is raised by len(42) ?


A) ValueError
B) TypeError
C) NameError
D) No exception, it returns 2.
Answer: B) TypeError.

Explanation: Calling len() on an integer (non-sequence) raises TypeError: object of type
'int' has no len() 40 , since integers do not support the len operation.

1 optimization - Can gradient descent be applied to non-convex functions? - Cross Validated
https://stats.stackexchange.com/questions/172900/can-gradient-descent-be-applied-to-non-convex-functions

2 3 Learning Rate in Gradient Descent
https://apxml.com/courses/introduction-to-deep-learning/chapter-3-training-loss-optimization/learning-rate

4 neural network - Loss Function for Probability Regression - Data Science Stack Exchange
https://datascience.stackexchange.com/questions/45285/loss-function-for-probability-regression

5 Binary Cross Entropy/Log Loss for Binary Classification
https://www.analyticsvidhya.com/blog/2021/03/binary-cross-entropy-log-loss-for-binary-classification/

6 Hinge-loss & relationship with Support Vector Machines | GeeksforGeeks
https://www.geeksforgeeks.org/hinge-loss-relationship-with-support-vector-machines/

7 25 Generative vs Discriminative Models: Differences & Use Cases | DataCamp
https://www.datacamp.com/blog/generative-vs-discriminative-models

8 9 10 11 12 13 Understanding the Confusion Matrix in Machine Learning | GeeksforGeeks
https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

14 ReLU Activation Function in Deep Learning | GeeksforGeeks
https://www.geeksforgeeks.org/relu-activation-function-in-deep-learning/

15 16 26 Vanishing Gradient Problem: Causes, Consequences, and Solutions - KDnuggets
https://www.kdnuggets.com/2022/02/vanishing-gradient-problem.html

17 Softmax function - Wikipedia
https://en.wikipedia.org/wiki/Softmax_function

18 Dropout: A Simple Way to Prevent Neural Networks from Overfitting
https://jmlr.org/papers/v15/srivastava14a.html

19 29 Batch normalization - Wikipedia
https://en.wikipedia.org/wiki/Batch_normalization

20 Reinforcement learning - Wikipedia
https://en.wikipedia.org/wiki/Reinforcement_learning

21 22 Bagging vs Boosting in Machine Learning | GeeksforGeeks
https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/

23 K means Clustering – Introduction | GeeksforGeeks
https://www.geeksforgeeks.org/k-means-clustering-introduction/

24 The Limitations of Perceptron: Why it Struggles with XOR | by Aryan Rusia | Medium
https://medium.com/@aryanrusia8/the-limitations-of-perceptron-why-it-struggles-with-xor-21905d31f924

27 28 How L1 Regularization brings Sparsity | GeeksforGeeks
https://www.geeksforgeeks.org/how-l1-regularization-brings-sparsity/

30 Binomial coefficient - Wikipedia
https://en.wikipedia.org/wiki/Binomial_coefficient

31 Bayes' theorem - Wikipedia
https://en.wikipedia.org/wiki/Bayes%27_theorem

32 5.4.1: Binomial Distribution Formula - Statistics LibreTexts
https://stats.libretexts.org/Courses/Las_Positas_College/Math_40%3A_Statistics_and_Probability/05%3A_Discrete_Probability_Distributions/5.03%3A_Binomial_Distribution/5.4.01%3A_Binomial_Distribution_Formula

33 Invertible matrix - Wikipedia
https://en.wikipedia.org/wiki/Invertible_matrix

34 Common Gotchas — The Hitchhiker's Guide to Python
https://docs.python-guide.org/writing/gotchas/

35 36 Python Object Comparison : "is" vs "==" | GeeksforGeeks
https://www.geeksforgeeks.org/python-object-comparison-is-vs/

37 Difference Between List and Tuple in Python | GeeksforGeeks
https://www.geeksforgeeks.org/python-difference-between-list-and-tuple/

38 Python list multiplication: [[...]]*3 makes 3 lists which mirror each other when modified - Stack Overflow
https://stackoverflow.com/questions/6688223/python-list-multiplication-3-makes-3-lists-which-mirror-each-other-when

39 c - is (0.1 + 0.2) == 0.3 true or false? - Stack Overflow
https://stackoverflow.com/questions/62727051/is-0-1-0-2-0-3-true-or-false

40 Python's Mutable vs Immutable Types: What's the Difference? – Real Python
https://realpython.com/python-mutable-vs-immutable-types/
