Interview Preparation - ML Draft
1.3 SUPERVISED MACHINE LEARNING
Examples of regression applications:
Analyze the effects of marketing, pricing, and promotions on the sales of a product.
Forecast sales by analyzing the company's monthly sales over the past few years.
Predict how house prices change as house sizes increase.
Calculate causal relationships between parameters in biological systems.
In linear regression, you adjust the line's coefficients until you reach the most accurate line with a minimum error value (that is, the minimum distance between the line and all points).
Types:
1. Univariate Linear Regression — the basic form, with a single input (X).
2. Multivariate Linear Regression — the more complex form of Linear Regression. In higher dimensions, where we have more than one input (X), the line becomes a plane or a hyperplane.
$Y(X) = p_0 + p_1 X_1 + p_2 X_2 + \dots + p_n X_n$
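As a minimal sketch of fitting this model, assuming scikit-learn is available (the numbers below are made up purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Two input features (X1, X2) and one continuous target Y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 4.0, 11.0, 10.0, 15.0])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # p0 and (p1, p2) from the equation above
print(model.predict([[6.0, 6.0]]))    # predicted Y for a new point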
Multicollinearity in the dataset means that the independent variables are highly correlated with each other, and a small change in the data can cause a large change in the regression coefficients.
The logistic function (sigmoid function) is an S-shaped curve used for data discrimination between classes. It takes any real value as input and outputs a value between 0 and 1.
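For reference, the standard logistic function that produces this curve is

$\sigma(x) = \frac{1}{1 + e^{-x}}$

so any real-valued input $x$ is mapped to an output strictly between 0 and 1.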
The naïve Bayes classifier is a powerful and simple supervised machine learning algorithm. It assumes that the value of a particular feature is independent of the value of any other feature, given the class variable.
For example:
A fruit may be considered to be an apple if it is red, round, and about 10 cm in
diameter.
Features: Color, roundness, and diameter.
1. Define two classes ($C_Y$ and $C_N$) that correspond to Apple = Yes and Apple = No.
2. Compute the probability of $C_Y$ given the observation $x$:
$p(C_Y \mid x) = p(\text{Apple}=\text{Yes} \mid \text{Red}, \text{round}, \approx 10\ \text{cm})$
3. Compute the probability of $C_N$ given the observation $x$:
$p(C_N \mid x) = p(\text{Apple}=\text{No} \mid \text{Red}, \text{round}, \approx 10\ \text{cm})$
By Bayes' theorem, each posterior is proportional to the class prior times the product of the per-feature likelihoods: $p(C \mid x) \propto p(C) \prod_i p(x_i \mid C)$. For example, to calculate p(Color = Red | Apple = Yes), you are asking, "What is the probability of observing a red-colored object, given that we know it is an apple?"
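A minimal sketch of this computation in plain Python, using hypothetical priors and per-feature likelihoods rather than values estimated from real data:

# Naive Bayes for the apple example; all probabilities here are
# assumed values for illustration only.
priors = {"Yes": 0.6, "No": 0.4}
likelihoods = {
    "Yes": {"red": 0.7, "round": 0.9, "10cm": 0.8},
    "No":  {"red": 0.3, "round": 0.4, "10cm": 0.2},
}

def posterior_score(cls):
    # Naive assumption: features are independent given the class, so
    # the joint likelihood is the product of per-feature likelihoods.
    score = priors[cls]
    for feature in ("red", "round", "10cm"):
        score *= likelihoods[cls][feature]
    return score

scores = {cls: posterior_score(cls) for cls in priors}
total = sum(scores.values())   # the normalizing constant p(x)
for cls, score in scores.items():
    print(cls, score / total)  # p(class | red, round, ~10 cm)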
Entropy
It is the measure of the amount of uncertainty and randomness in a set of data for the classification task. Entropy is maximized when all outcomes have equal probabilities; an entropy of zero means that there is no randomness in this attribute.
Entropy for one attribute: $\mathrm{Entropy}(X) = -\sum_i p(x_i) \log_2 p(x_i)$
Entropy for two attributes (the weighted entropy of T after splitting on attribute X): $\mathrm{Entropy}(T, X) = \sum_{c \in X} p(c)\,\mathrm{Entropy}(c)$
$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$
Information gain
Information gain = (entropy of the distribution before the split) − (entropy of the distribution after it).
It is used for ranking the attributes or features to split on at a given node in the tree, and it defines how much information a feature provides about a class. The feature with the highest information gain is used for the first split.
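As a minimal sketch, the formulas above can be computed directly in Python (the toy labels are made up for illustration):

import numpy as np
from collections import Counter

def entropy(labels):
    # Entropy(X) = -sum_i p(x_i) * log2 p(x_i)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, feature_values):
    # Gain(T, x) = Entropy(T) - Entropy(T, x), where Entropy(T, x) is
    # the weighted entropy of the subsets created by splitting on x.
    n = len(labels)
    after = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        after += (len(subset) / n) * entropy(subset)
    return entropy(labels) - after

play    = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sun", "sun", "rain", "rain", "sun", "sun"]
print(information_gain(play, outlook))  # how much "outlook" tells us about "play"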
The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
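The same pattern works for regression trees; a minimal sketch of standard deviation reduction on made-up numbers:

import numpy as np

def std_reduction(target, feature_values):
    # SDR(T, x) = std(T) - sum over values v of p(v) * std(subset_v)
    target = np.asarray(target, dtype=float)
    weighted = 0.0
    for v in set(feature_values):
        mask = np.array([f == v for f in feature_values])
        weighted += mask.mean() * target[mask].std()
    return target.std() - weighted

hours = [25, 30, 46, 45, 52, 23, 43, 35]                      # target values
windy = [True, True, False, False, False, True, False, True]  # split attribute
print(std_reduction(hours, windy))  # larger = more homogeneous branches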
Kernel
A kernel helps us find a hyperplane in a higher-dimensional space without increasing the computational cost.
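A small worked example of this idea: the degree-2 polynomial kernel computes, in the original 2-D space, exactly the dot product that an explicit 3-D feature map would produce (the vectors are arbitrary):

import numpy as np

def phi(v):
    # Explicit feature map matching the kernel K(x, z) = (x . z)**2 in 2-D.
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print((x @ z) ** 2)     # kernel value computed in 2-D: 121.0
print(phi(x) @ phi(z))  # same value via the explicit 3-D map: 121.0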
Hyperplane
This is basically a separating line between two data classes in SVM. In Support Vector Regression, however, it is the line that is used to predict the continuous output.
$Y = wx + b$ (equation of the hyperplane)
$-a < Y - (wx + b) < +a$
Decision Boundary
A decision boundary can be thought of as a demarcation line (for simplification) on
one side of which lie positive examples and on the other side lie the negative examples. On
this very line, the examples may be classified as either positive or negative. This same
concept of SVM will be applied in Support Vector Regression as well. The equations of
the decision boundaries become:
$wx + b = +a$
$wx + b = -a$
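A minimal sketch of these ideas with scikit-learn's SVR, where the epsilon parameter plays the role of a above (the data and hyperparameters are made up for illustration):

import numpy as np
from sklearn.svm import SVR

# Noisy 1-D data around the line y = 2x + 1.
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + np.random.default_rng(0).normal(0.0, 0.2, 40)

# Points inside the tube of half-width epsilon around the hyperplane
# contribute no loss; the RBF kernel lets the model fit this tube in a
# higher-dimensional feature space without computing it explicitly.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))  # should be close to 2 * 2.5 + 1 = 6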
1.3.9 K-Nearest Neighbor
1.4.1 K-Means Algorithm
K-means clustering is an unsupervised machine learning technique. The main goal of
the algorithm is to group the data observations into k clusters, where each observation
belongs to the cluster with the nearest mean. A cluster’s center is called its centroid.
Examples of applications include:
Customer segmentation.
Image segmentation and compression.
Recommendation systems.
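A minimal sketch of k-means with scikit-learn's KMeans (the points are made up to form two obvious groups):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # the centroids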
1.4.6 LLE
t-SNE
Independent Component Analysis
Singular Value Decomposition
https://www.javatpoint.com/unsupervised-machine-learning
Perceptron
A perceptron is a single-neuron model that was the forerunner of neural networks. It is similar to linear regression: each neuron has its own bias and slope (weights).
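A minimal sketch of a single perceptron trained with the classic perceptron update rule; the AND data and learning rate are chosen only for illustration:

import numpy as np

# Learn the AND function with one neuron.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)  # slope (weights)
b = 0.0          # bias
lr = 0.1

for _ in range(20):                     # a few passes over the data
    for xi, target in zip(X, y):
        pred = int(w @ xi + b > 0)      # step activation
        w += lr * (target - pred) * xi  # move weights toward the target
        b += lr * (target - pred)

print([int(w @ xi + b > 0) for xi in X])  # -> [0, 0, 0, 1]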
Backpropagation
Backpropagation is an algorithm for training neural networks that have many layers.
It works in two phases:
1. Feedforward: inputs are propagated through the neural network to the final layer, which produces the actual output.
2. Backpropagation of error: an error value is calculated for each output neuron by comparing the desired output with the actual output. The error is then propagated backward through the weights of the network (adjusting the weights), beginning with the output neurons, through the hidden layers, and to the input layer, as a function of each weight's contribution to the error.
Backpropagation continues to be an important aspect of neural network learning.
With faster and cheaper computing resources, it continues to be applied to larger and denser
networks.
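A minimal sketch of the two phases for a single-hidden-layer network in numpy; the XOR data, architecture, and learning rate are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for _ in range(5000):
    # Phase 1 (feedforward): propagate inputs to the final layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Phase 2 (backward): compute the output error, then propagate it
    # back through the weights, layer by layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]; depends on the seed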
Types of neural networks:
Multilayer perceptron (MLP): A class of feed-forward artificial neural networks
(ANNs). It is useful in classification problems where inputs are assigned a class. It also
works in regression problems for a real-valued quantity like a house price prediction.
Convolutional neural network (CNN): Takes an image as input. It is useful for image recognition problems like facial recognition.
Recurrent neural network (RNN): It is suitable for inputs like audio and languages. It
can be used in applications like speech recognition and machine translation.
Hybrid neural network: Covers more complex neural networks, for example, in autonomous cars, which require processing images and radar data together.
1.6.2 Eclat
1.7.3 Bagging
1.7.4 Boosting
1.7.5 Regularization
1.8.4 Vanishing/exploding gradients
P1. You’re trying to classify images of cats and dogs. Plotting the images in some
transformed 2-dimensional feature space reveals the following pattern (on the left). In
some other space, images of dogs and wolves show a different pattern (on the right).
What model would you use to classify cats vs. dogs, and what would you use for dogs vs.
wolves? Why?
P2. I’m trying to fit a single hidden layer neural network to a given dataset, and I find that the
weights are oscillating a lot over training iterations (varying wildly, often swinging between
positive and negative values). What parameter do I need to tune to address this issue?
P3. When training a support vector machine, what value are you optimizing for?
P4. Lasso regression uses the L1-norm of coefficients as a penalty term, while ridge
regression uses the L2-norm. Which of these regularization methods is more likely to
result in sparse solutions, where one or more coefficients are exactly zero?
P5. When training a 10-layer neural net using backpropagation, I find that the weights for the
top 3 layers are not changing at all! The next few layers (4-6) are changing, but very
slowly. What’s going on and how do I fix this?
P6. I’ve found some data about wheat-growing regions in Europe that includes annual rainfall (R, in inches), mean altitude (A, in meters), and wheat output (O, in kg/km²). A rough analysis and some plots make me believe that output is related to the square of rainfall and the log of altitude: $O = \beta_0 + \beta_1 R^2 + \beta_2 \log_e(A)$
P7. Can I fit the coefficients (β) in my model to the data using linear regression?
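One way to sketch an answer to P6/P7: the model is linear in the coefficients β, so ordinary linear regression applies once the inputs are transformed (the measurements below are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

R = np.array([20.0, 25.0, 30.0, 35.0, 40.0])            # rainfall, inches
A = np.array([100.0, 300.0, 500.0, 800.0, 1200.0])      # altitude, meters
O = np.array([900.0, 1400.0, 2000.0, 2700.0, 3600.0])   # output, kg/km^2

# The model is linear in beta once we use R**2 and ln(A) as the features.
features = np.column_stack([R ** 2, np.log(A)])
model = LinearRegression().fit(features, O)
print(model.intercept_, model.coef_)  # beta0, (beta1, beta2)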