ML - Interview Prep
• Definition
○ Machine learning focuses on developing algorithms and models
○ that enable computers to learn from data and make predictions or
decisions
○ without being explicitly programmed.
• Machine Learning algorithms:
○ Supervised:
§ Define
□ machine learning algorithms that require labeled
training data
§ supervised learning algorithms
□ Support Vector Machines
® Facts on SVM
◊ It finds a hyperplane of dimension
(number of features - 1)
◊ With two features (Height, Weight), the
hyperplane is defined by the equation:
◊ w1 * Height + w2 * Weight - b = 0
◊ We need to find the values of w1, w2,
and b that define the hyperplane.
◊ To find the hyperplane with the
maximum margin, we minimize ||w||
subject to the constraint:
◊ yi(w1 * xi1 + w2 * xi2 - b) >= 1 for all data
points i, where yi is -1 or 1
◊ The closest points (the support vectors)
lie on one of the two margin hyperplanes:
◊ w . xi - b = 1 or w . xi - b = -1
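The margin constraint above can be checked numerically. The following is a minimal sketch, assuming hand-picked (not fitted) values for w and b and four made-up points with two features. It tests yi(w . xi - b) >= 1 for every point and computes the margin width 2 / ||w||.

```python
import math

def decision_value(w, x, b):
    """Return w . x - b for one point."""
    return sum(wj * xj for wj, xj in zip(w, x)) - b

def satisfies_margin(w, b, points, labels):
    """True if y_i * (w . x_i - b) >= 1 holds for every labeled point."""
    return all(y * decision_value(w, x, b) >= 1 for x, y in zip(points, labels))

def margin_width(w):
    """Distance between the hyperplanes w . x - b = 1 and w . x - b = -1."""
    return 2 / math.sqrt(sum(wj * wj for wj in w))

# Hypothetical values chosen by hand for illustration, not learned from data.
w, b = (1.0, 1.0), 3.0
points = [(1.0, 1.0), (3.0, 1.0), (0.0, 0.0), (4.0, 2.0)]
labels = [-1, 1, -1, 1]

print(satisfies_margin(w, b, points, labels))
print(margin_width(w))
```

Here the points (1, 1) and (3, 1) sit exactly on the two margin hyperplanes, so they play the role of support vectors.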
□ Regression
® It deals with continuous data
® It models the relationship the data represents
® It is used to find a mathematical function that
explains the relationship between the independent
variables and the dependent variable
® Types:
◊ Logistic regression:
} z = β0 + β1*X1 + β2*X2 + ... + βn*Xn
} Sigmoid function
– P(Y=1|X) = 1 / (1 + exp(-z))
◊ Multinomial logistic regression
} Softmax function
– P(Y=k|X) = softmax(z)_k = exp(z_k) / Σ_i exp(z_i)
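A minimal sketch of the sigmoid and softmax formulas above. The coefficients and input values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    """P(Y=1|X) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Return exp(z_k) / sum_i exp(z_i) for every k."""
    m = max(zs)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients beta0, beta1, beta2 and one input row X1, X2.
beta0, betas = -1.0, [0.5, 2.0]
x = [2.0, 0.0]
z = beta0 + sum(bj * xj for bj, xj in zip(betas, x))

print(sigmoid(z))
print(softmax([1.0, 2.0, 3.0]))
```

With two classes, softmax reduces to the sigmoid of the difference between the two scores, which is why binary logistic regression only needs one z.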
□ Naive Bayes
® Facts on Naive Bayes
◊ It applies Bayes' theorem with an
added "naive" assumption:
◊ given the class y, the features are
conditionally independent of each other
◊ e.g., the presence or absence of feature x1 is
independent of feature x2, given y
® The different naive Bayes classifiers
◊ Bernoulli, multinomial, Gaussian
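The independence assumption means the posterior is proportional to P(y) * P(x1|y) * P(x2|y). Below is a minimal Bernoulli-style sketch with two binary features. The class names and all probabilities are made up for illustration.

```python
def naive_bayes_posterior(priors, likelihoods, x):
    """Return P(y | x) for each class, assuming features are independent given y.

    priors:      {class: P(y)}
    likelihoods: {class: [P(x_j = 1 | y) for each feature j]}
    x:           list of 0/1 feature values
    """
    scores = {}
    for y, prior in priors.items():
        score = prior
        for p_one, xj in zip(likelihoods[y], x):
            score *= p_one if xj == 1 else (1 - p_one)
        scores[y] = score
    total = sum(scores.values())  # normalizing constant P(x)
    return {y: s / total for y, s in scores.items()}

# Hypothetical probabilities, not estimated from any dataset.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {"spam": [0.8, 0.5], "ham": [0.2, 0.5]}

posterior = naive_bayes_posterior(priors, likelihoods, [1, 0])
print(posterior)
```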
□ Random forest Classifier:
® Facts:
◊ It uses ensemble learning, combining
multiple decision trees to make its
prediction
® Workings:
◊ Random sampling
} Randomly sample a subset of the
original dataset for each decision
tree (bootstrap sampling)
} The sample is the same size as the original
dataset
} It is drawn with replacement, so it may contain duplicates
◊ Randomly select a subset of features for node
splitting
◊ Decision tree construction
} Construct each decision tree by
splitting nodes
} The split thresholds that maximize
information gain are chosen
} Classification:
– Gini impurity = 1 - (p0^2 + p1^2)
– Entropy = -p0 * log2(p0) - p1 * log2(p1)
} Regression:
– MSE = (1/n) * Σ(y - ŷ)^2
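The sampling step and the split criteria above can be sketched as follows. This is a minimal illustration of the formulas only, not a full random forest, and the function names are chosen here for readability.

```python
import math
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so duplicates may appear."""
    return [rng.choice(rows) for _ in rows]

def gini_impurity(p0, p1):
    """Gini impurity = 1 - (p0^2 + p1^2) for a two-class node."""
    return 1 - (p0 ** 2 + p1 ** 2)

def entropy(p0, p1):
    """Entropy = -p0 * log2(p0) - p1 * log2(p1), treating 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

def mse(y_true, y_pred):
    """MSE = (1/n) * sum((y - y_hat)^2), the regression criterion."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = bootstrap_sample(list(range(10)), rng)

print(sample)
print(gini_impurity(0.5, 0.5), entropy(0.5, 0.5))
print(mse([1.0, 2.0], [1.0, 4.0]))
```

Both impurity measures are 0 for a pure node and largest for a 50/50 node, so a split that lowers them the most gives the highest information gain.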
○ Unsupervised:
§ unsupervised learning algorithms
□ Clustering
◊ Hierarchical clustering
◊ Density-based clustering
◊ Fuzzy clustering, etc.
□ Anomaly detection
□ Neural networks and latent variable models
○ Reinforcement learning
• Cross-Validation
○ Split all data into
§ Training
□ Overfitting:
® Fact:
◊ The model learns the training dataset too
well, including its noise.
◊ The model does poorly on the testing
data
® Solution:
◊ Increase the training dataset
◊ Cross-validation
□ Underfitting:
® Fact:
◊ The model fails to capture the patterns in the data
◊ Poor performance on both training and
testing data
® Solution:
◊ Train longer or reduce regularization (more
training data alone rarely fixes underfitting)
◊ Hyperparameter tuning
◊ Increase model complexity
◊ Add more features
□ Loss Function
® Facts:
◊ It measures the disparity or
error between the predicted value and
the actual value for a single record
(data point).
® Examples of loss functions:
◊ Squared error (its average over the dataset is the
Mean Squared Error, MSE):
} Facts:
– Used in regression tasks
} Squared error = (predicted value - actual
value)^2
◊ Hinge loss:
} Facts:
– It is used to train
maximum-margin classifiers such as SVMs
} Hinge Loss = max(0, 1 - y * f(x))
– y = -1 or 1 (the true label)
– f(x) = predicted score
□ Cost Functions
® It aggregates the losses of all data points, e.g. as their average
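The difference between a loss (one data point) and a cost (the whole dataset) can be sketched as follows. Averaging is assumed as the aggregation here, and the sample predictions are made up.

```python
def squared_error(prediction, target):
    """Loss for one data point: (predicted value - actual value)^2."""
    return (prediction - target) ** 2

def hinge_loss(score, label):
    """Loss for one data point: max(0, 1 - y * f(x)), with label y in {-1, 1}."""
    return max(0.0, 1 - label * score)

def cost(loss_fn, predictions, targets):
    """Cost for the dataset: the average of the per-point losses."""
    losses = [loss_fn(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Made-up predictions and targets for illustration.
print(cost(squared_error, [2.0, 4.0], [1.0, 2.0]))    # this cost is the MSE
print(cost(hinge_loss, [2.0, 0.5, -1.0], [1, 1, 1]))
```

In the hinge example, the first point is correctly classified with a margin above 1 and contributes no loss, while the other two are penalized.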
§ Testing
§ Validation set
○ The dataset is divided into k equally sized folds.
○ Iteratively, for each fold:
§ The model is trained on the combination of k-1 folds
(training set).
§ The trained model is evaluated on the remaining fold
(testing set).
○ The performance metrics (e.g., accuracy, precision, recall) are
recorded for each iteration.
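The k-fold procedure above can be sketched as follows. This is a minimal version without shuffling. The `evaluate` argument is a hypothetical callback standing in for "train on these indices, test on those, return a metric". When the dataset size is not divisible by k, fold sizes differ by at most one.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, k, evaluate):
    """Record one score per fold and return the scores with their mean."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Every data point is used for testing exactly once and for training k-1 times.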
• Bias in Machine Learning
○ Biases are systematic distortions or inconsistencies in the dataset
○ The different types of bias are not mutually exclusive
○ They can intertwine and affect each other
• F1 score
○ Facts:
§ Evaluates the performance of classification
models
§ It combines precision and recall into a single value (their harmonic mean)
○ Components:
§ Precision
□ Out of the ones predicted to be positive,
□ which did it get right?
□ Precision = TP / (TP + FP)
§ Recall
□ Out of the ones that were actually positive,
□ which did it get right?
□ Recall = TP / (TP + FN)
○ F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
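A minimal sketch of the three formulas above, computed from confusion-matrix counts. The counts are made up, and the functions assume the denominators are non-zero.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP). Assumes at least one positive prediction."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN). Assumes at least one actual positive."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * (p * r) / (p + r)

# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(8, 2), recall(8, 8), f1_score(8, 2, 8))
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values: precision 0.8 and recall 0.5 give an F1 of about 0.62, below their arithmetic mean of 0.65.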
• Neural Network:
○ It is a computational model inspired by the brain
○ It consists of interconnected artificial neurons, organized in layers
○ The neurons take in inputs and process them through a series of mathematical
operations
○ Through this process it can learn patterns from data and perform
classification and regression tasks
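The "inputs processed through layers" idea can be sketched as a forward pass through one hidden layer. The weights below are hand-picked for illustration, and no training (learning of weights) is shown.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x):
    """2 inputs -> 2 hidden neurons (ReLU) -> 1 output neuron (sigmoid)."""
    hidden = dense(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu)
    output = dense(hidden, [[1.0, -2.0]], [0.0], sigmoid)
    return output[0]

print(forward([2.0, 1.0]))
```

The sigmoid output can be read as a class probability for classification. For regression, the last activation would typically be left out.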
• Ensemble learning
○ Facts:
§ It involves combining several models to create a more
powerful and robust model.
§ It also helps a model balance bias and variance
□ The bias-variance tradeoff involves a model's ability to
fit the training data (low bias) and then generalize
to unseen data (low variance)
§ Example of Ensemble learning
□ Voting:
® Independent base learners predict the class or
value
® For regression, we compute the average for the
final prediction
® For classification, we compute the mode for the
final value
□ Bagging
® It is voting with base learners trained on random samples and random
features of the dataset
□ Boosting:
® It involves a sequence of base learners
® Subsequent learners focus on the records the
previous model got wrong
® To get the aggregate prediction, the model outputs
can be weighted.
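The aggregation rules above can be sketched as follows. The base-learner predictions and weights are made up, and the weighted vote is only one simple way to combine weighted outputs.

```python
from statistics import mean, mode

def vote_regression(predictions):
    """Average the base learners' numeric predictions."""
    return mean(predictions)

def vote_classification(predictions):
    """Return the most common predicted class (the mode)."""
    return mode(predictions)

def weighted_vote(predictions, weights):
    """Sum the weight behind each class and return the heaviest class."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Made-up outputs from three hypothetical base learners.
print(vote_regression([2.0, 3.0, 4.0]))
print(vote_classification(["cat", "dog", "cat"]))
print(weighted_vote(["cat", "dog", "dog"], [0.7, 0.2, 0.2]))
```

In the last call a single high-weight learner outvotes two low-weight ones, which is how a boosted ensemble can favor its more accurate members.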
• Exploratory Data Analysis (EDA)
○ Fact
§ There is no single rule that tells us which ML
algorithm to use
§ If the target is discrete, we use classification models, like SVM
§ If the target is continuous, we use regression models
○ EDA
§ Data collection and inspection
□ We classify our features as continuous or categorical
data
§ Data visualization
§ Perform descriptive statistics
□ Mean, median, std, variance
§ Data cleaning and preprocessing
□ Handle missing data
□ Handle Outliers:
® Facts:
◊ Outliers are data points that significantly
deviate from the majority of data points
◊ They can occur due to
} Rare events
} Errors or data corruption
® Detect outliers
◊ Z-score:
} Points whose absolute z-score exceeds a
threshold (commonly 2 or 3) are outliers
◊ IQR (IQR = P75 - P25, t is commonly 1.5):
} Outliers < P25 - (t * IQR)
} Outliers > P75 + (t * IQR)
◊ Data visualization
} Box plots
} Histogram
} Scatter plot
® Handling outliers:
◊ Replace the outlier with the median or mode
◊ Transform feature: log transformation
◊ Create an additional binary feature to
indicate outlier
◊ Remove the outlier
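The z-score and IQR rules, plus median replacement, can be sketched with the standard library. The data is made up. Quartiles come from `statistics.quantiles` with its default method, and other quartile conventions give slightly different cut points.

```python
from statistics import mean, median, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]

def iqr_outliers(values, t=1.5):
    """Return values below P25 - t * IQR or above P75 + t * IQR."""
    p25, _, p75 = quantiles(values, n=4)
    iqr = p75 - p25
    low, high = p25 - t * iqr, p75 + t * iqr
    return [v for v in values if v < low or v > high]

def replace_with_median(values, outliers):
    """Replace every flagged value with the median of the data."""
    med = median(values)
    return [med if v in outliers else v for v in values]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # made-up data with one extreme value
flagged = iqr_outliers(data)

print(zscore_outliers(data, threshold=2.0))
print(flagged)
print(replace_with_median(data, flagged))
```

On this data the value 95 has a z-score of about 2.6, so it is flagged at threshold 2 but not at threshold 3. A single extreme value inflates the standard deviation, which is one reason the IQR rule is often preferred for small samples.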
□ Encoding of categorical data
§ ML algorithm selection
• Recommender System
○ Fact:
§ Predicts user interest
§ Recommend products based on predicted interest
○ Data used:
§ User ratings
§ Search engine queries
§ Purchase histories
○ Collaborative filtering vs Content-based filtering
  Collaborative filtering              | Content-based filtering
  -------------------------------------+-------------------------------------
  Leverages the preferences and        | Focuses on utilizing the features
  behaviors of similar users to make   | and characteristics of items and
  recommendations                      | user preferences
  -------------------------------------+-------------------------------------
  Only effective with large data on    | Doesn’t require large data on
  user-item interaction                | user-item interaction
  -------------------------------------+-------------------------------------
  Can capture complex user preferences | Focuses on matching item features
  and unexpected item associations     | with user preferences
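Collaborative filtering can be sketched in its simplest user-based form: find the most similar user by cosine similarity of rating vectors, then suggest items that user rated and the target has not. The ratings and user names are made up, and treating 0 as "unrated" is a simplifying assumption of this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_user(target, others):
    """Name of the user whose ratings are closest to the target's."""
    return max(others, key=lambda name: cosine_similarity(target, others[name]))

def recommend(target, others):
    """Indices of items the nearest user rated but the target has not (0 = unrated)."""
    neighbor = others[most_similar_user(target, others)]
    return [
        item for item, (mine, theirs) in enumerate(zip(target, neighbor))
        if mine == 0 and theirs > 0
    ]

# Made-up ratings for four items; each position is one item.
target = [5, 4, 0, 0]
others = {"u1": [5, 5, 4, 0], "u2": [0, 1, 0, 5]}

print(most_similar_user(target, others))
print(recommend(target, others))
```

A content-based version would instead compare item feature vectors against a profile built from the items the user already liked, so it would not need other users' ratings.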
• Normality of a dataset:
○ Fact:
§ We check whether the data is normally distributed
○ Tests
§ Histogram:
□ From the histogram plot, we can check for a symmetric
bell shape
§ Quantile-Quantile plot:
□ Plot the theoretical quantiles of a normal distribution (x-axis) against the
observed quantiles (y-axis)
□ Look for points that fall along a straight line
§ Shapiro-Wilk Test:
□ W = test statistic (between 0 and 1; values near 1 are consistent with normality)
□ Wc = critical value
□ If W < Wc, we reject the null hypothesis of normality.
□ If W ≥ Wc, we fail to reject the null hypothesis of
normality
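The Q-Q check can be sketched without a plotting library by computing the point pairs and measuring how close they are to a straight line. This is not the Shapiro-Wilk test. The plotting positions (i + 0.5) / n are one common convention and are an assumption of this sketch, and the samples are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair theoretical normal quantiles (x) with the sorted observations (y)."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))

def qq_correlation(sample):
    """Pearson correlation of the Q-Q points; near 1 means a nearly straight line."""
    xs, ys = zip(*qq_points(sample))
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed for reproducibility
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
skewed_sample = [math.exp(v) for v in normal_sample]  # right-skewed (log-normal)

print(qq_correlation(normal_sample))
print(qq_correlation(skewed_sample))
```

The normal sample gives points close to a straight line, while the skewed sample bends away from it in the upper tail and scores a lower correlation.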