Machine Learning Notes

Notes for CSE Machine Learning


1. Distance-based methods
Distance-based methods rely on measuring distances between data points to classify or predict outcomes. One popular technique is k-Nearest Neighbors (k-NN), which predicts the class of a data point by finding the k nearest data points in the feature space and assigning the most common class among those neighbors. These methods are intuitive and effective when data points are naturally clustered, but they tend to perform poorly in the presence of noise or irrelevant features. The computational cost grows with the size of the training set, since a distance to every training point must be computed for each prediction, which becomes expensive for large datasets.
2. Nearest Neighbors
k-Nearest Neighbors (k-NN) works by classifying a query point based on the majority class of its k closest neighbors in feature space. The choice of k is crucial: small values can lead to high variance and sensitivity to noise, while larger values reduce variance but may increase bias. k-NN performs well when data points are dense and feature relationships are non-linear. However, it is computationally expensive because distances must be computed for every query, which can become infeasible for large datasets.
3. Decision Trees
Decision Trees split data into subsets by asking sequential questions about feature values. They build a tree structure where each internal node represents a decision or feature split, and leaf nodes predict the class or value. Decision Trees are easy to interpret and handle non-linear relationships well but are prone to overfitting if not pruned. Pruning helps mitigate overfitting by removing unnecessary splits. They tend to work well on structured data, but their performance can degrade on noisy or continuous-valued inputs.
4. Naive Bayes
Naive Bayes classifiers assume independence between features given the class, which simplifies probability estimation. Despite this assumption, Naive Bayes performs surprisingly well in practice, especially for text classification tasks like spam detection, thanks to its efficiency and simplicity. The independence assumption is reasonable when feature dependencies are weak, so the model is less effective when dependencies are strong. Even so, with large datasets and many features, Naive Bayes remains a fast and powerful baseline.
5. Linear Models
Linear models describe relationships between input features and a dependent variable using linear equations. Linear Regression predicts continuous outcomes by minimizing the sum of squared differences between predicted and actual values. It assumes a linear relationship between predictors and the target variable and performs well when that relationship is approximately linear. Logistic Regression extends this to binary classification by modeling probabilities with a logistic function. Linear models are computationally efficient but may struggle with non-linear relationships and complex feature interactions.
6. Linear Regression
Linear Regression models the relationship between a dependent variable and independent variables by fitting a linear equation that minimizes the squared differences between predicted and actual values. It is commonly used in tasks like predicting house prices or sales forecasts. However, it is sensitive to its assumptions, such as a linear relationship between the features and the target; when this assumption is violated, the model may underperform.
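A minimal ordinary-least-squares sketch with NumPy; the data points and the true relationship they follow are made up purely for illustration:

# Ordinary least squares sketch with NumPy (illustrative data, not from the notes).
import numpy as np

# Made-up points roughly following y = 2x + 1 with noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# np.polyfit with degree 1 minimizes the sum of squared residuals (least squares).
slope, intercept = np.polyfit(x, y, 1)
print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
print("prediction at x = 6:", slope * 6 + intercept)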
7. Logistic Regression
Logistic Regression predicts binary outcomes by modeling probabilities using a logistic function. The model outputs values between 0 and 1, representing the probability of the positive class. It is widely used for tasks like spam detection, medical diagnosis, or customer churn prediction. Although computationally efficient and easy to interpret, logistic regression assumes a linear relationship between the features and the log-odds of the outcome, which can limit its performance on complex, non-linear relationships.
8. Generalized Linear Models (GLMs)
GLMs extend linear regression by allowing different distributions for the response variable, such as binary outcomes (Logistic Regression) or count data (Poisson Regression). A GLM has a linear predictor and a link function that connects the predictor to the response variable. GLMs are flexible and widely used when data distributions are non-normal or when the assumptions of ordinary linear regression do not hold. For example, logistic regression is a GLM for binary classification.
9. Support Vector Machines (SVMs)
SVMs seek the optimal hyperplane that separates classes with maximum margin. They can implicitly project data into a higher-dimensional space using kernel functions, allowing non-linear classification. SVMs are effective in high-dimensional spaces and perform well when there are clear decision boundaries. However, they can be computationally expensive because they require solving quadratic optimization problems and tuning kernel parameters.
10. Nonlinearity and Kernel Methods
Kernel methods apply functions that transform the feature space into a higher-dimensional space, where the data may become linearly separable. For instance, the Gaussian (RBF) kernel is commonly used in SVMs to handle non-linear boundaries. These methods are useful when relationships between features are complex and cannot be captured by linear models, but they require careful selection of kernel types and hyperparameters to avoid overfitting.
11. Beyond Binary Classification
Many real-world problems require classification models that predict more than two possible outcomes. Multi-class classification methods, like One-vs-All and softmax regression, extend binary classifiers to multiple classes. Structured output models, such as sequence models used in NLP and computer vision, model dependencies between multiple outputs. These models are used in applications like multi-label classification or hierarchical classification tasks.
12. Multi-class/Structured Outputs
Multi-class classification involves predicting one of more than two classes rather than a binary outcome. Techniques like softmax regression or One-vs-All extend binary classifiers to this setting. Structured output models handle dependencies between multiple outputs, such as in sequence prediction tasks (e.g., speech recognition, text generation, object detection).
13. Ranking
Ranking models are used to predict the order or relevance of items, often applied in recommendation systems and search engines. Learning-to-Rank algorithms like RankNet or RankSVM rank items by relevance rather than merely classifying them. Ranking is important when the goal is to present the most relevant content or items first.
14. Clustering
Clustering is an unsupervised learning technique that groups data points based on their similarities. K-means is a popular algorithm that partitions data into K clusters by minimizing the sum of squared distances from points to their cluster centers. Clustering is used for tasks such as customer segmentation, anomaly detection, and feature exploration. However, it struggles with noisy data and handles overlapping clusters poorly.
15. K-means
K-means partitions data into K clusters by minimizing the sum of squared distances between each point and its cluster center. It is widely used in tasks like customer segmentation, image compression, and exploratory data analysis. However, K-means is sensitive to the choice of initial centroids and does not work well with noisy or overlapping clusters.
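A minimal K-means sketch with scikit-learn; the synthetic data, K = 3, and the other settings are illustrative assumptions:

# K-means sketch with scikit-learn (synthetic blobs; all settings are illustrative).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups stands in for real features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init repeats the algorithm from different random centroids to reduce
# sensitivity to initialization; inertia_ is the within-cluster sum of squares.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_)
print("within-cluster sum of squares:", km.inertia_)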
16. Kernel K-means
Kernel K-means extends K-means by applying kernel functions to map data into higher-dimensional spaces, where complex, non-linear relationships can be captured. This makes Kernel K-means suitable for clustering when the data is not linearly separable. However, it requires careful tuning of kernel parameters.
17. Dimensionality Reduction
Dimensionality reduction techniques aim to reduce the number of features while preserving as much important information as possible. PCA (Principal Component Analysis) identifies the directions of maximum variance and projects data onto a lower-dimensional space. It is commonly used for reducing feature-space complexity and improving interpretability.
18. PCA (Principal Component Analysis)
PCA reduces the dimensionality of data by finding principal components that capture the most variance. It is widely used for data visualization, feature selection, and removing noise or redundant features. PCA is effective when data has high dimensionality and many redundant or correlated features.
19. Kernel PCA
Kernel PCA extends PCA to handle non-linear data by using kernel functions to map data into a higher-dimensional space. This allows the identification of non-linear patterns that linear PCA cannot capture. It is commonly used for tasks like image recognition, text processing, and other non-linear datasets.
20. Matrix Factorization
Matrix factorization decomposes large matrices into lower-dimensional representations,
commonly used for collaborative filtering systems like recommendation engines. It helps
capture latent factors that represent relationships between users and items, enabling
better recommendations.
21. Matrix Completion
Matrix completion estimates missing entries in partially observed matrices by leveraging
their low-rank structure. This is widely used in collaborative filtering systems to predict
missing user-item interactions and improve recommendation accuracy.
22. Generative Models
Generative models, such as mixture models and latent factor models, capture the
underlying distribution of data and can generate new data samples. They are useful for
tasks like image generation, anomaly detection, and natural language processing.
23. Evaluating Machine Learning Algorithms
Evaluating machine learning algorithms involves assessing their performance using metrics
like accuracy, precision, recall, F1-score, and cross-validation. It is crucial to select metrics
that reflect the specific requirements of the task (e.g., minimizing false positives or
maximizing recall).
24. Model Selection
Model selection involves comparing and selecting models based on performance metrics,
such as accuracy, precision, and cross-validation. The goal is to choose the model that
generalizes best to unseen data while minimizing overfitting.
25. Statistical Learning Theory
Statistical Learning Theory studies how models generalize from training data to unseen data. It explores concepts such as the bias-variance trade-off, focusing on balancing underfitting and overfitting.
26. Ensemble Methods
Ensemble methods combine multiple models (e.g., Bagging, Boosting, Random Forests) to improve accuracy by reducing variance and improving robustness. These methods aggregate multiple weak models into a stronger, more accurate model.
27. Sparse Modeling and Estimation
Sparse modeling focuses on learning models with fewer effective parameters, which helps reduce overfitting and improve interpretability. Lasso (L1) regularization drives many coefficients to exactly zero, enforcing sparsity, while Ridge (L2) regularization shrinks coefficients without zeroing them out.
28. Modeling Sequence/Time-Series Data
Models like RNNs and LSTMs capture temporal dependencies in sequential data. They are commonly used in time-series forecasting, natural language processing, and speech recognition.
29. Deep Learning & Feature Representation Learning
Deep learning involves neural networks that automatically learn hierarchical feature
representations. These models are widely used in image, text, and speech processing.
30. Scalable Machine Learning
Scalable machine learning applies distributed computing and online learning techniques to handle large datasets efficiently, particularly for big data applications.
31. Semi-supervised Learning
Semi-supervised learning leverages both labeled and unlabeled data, helping improve performance when labeled data is scarce.
32. Active Learning
Active learning focuses on selecting the most informative data points to label, reducing the need for extensive labeling.
33. Reinforcement Learning
Reinforcement learning learns optimal actions by maximizing long-term rewards, commonly applied in autonomous systems and robotics.
34. Inference in Graphical Models
Graphical models represent dependencies between variables and are useful in domains like computer vision and natural language processing.
35. Bayesian Learning and Inference
Bayesian learning incorporates prior knowledge to update beliefs about model parameters, enabling uncertainty quantification.

36. Recent Trends in Machine Learning


Recent trends include Transformers, self-supervised learning, and generative models,
driving advances in NLP, vision, and recommendation systems.
Supervised Learning:
Supervised learning is a type of machine learning where the model is
trained on labeled data. In this setting, both input data (features) and corresponding output
labels (target) are provided. The goal is to learn a mapping from inputs to outputs, enabling
the model to make predictions on unseen data.
Key Components of Supervised Learning:
1. Training Data: The model learns from a labeled dataset, which contains input-output
pairs.
2. Learning Algorithm: A mathematical function that finds patterns in the data to make
predictions. Common algorithms include regression, classification, decision trees,
SVMs, etc.
3. Prediction: After training, the model generalizes to make predictions on new, unseen
data.
4. Error Function: Measures the difference between predicted and actual outputs,
guiding the model to minimize this error (commonly done using methods like MSE for
regression or cross-entropy loss for classification).
Types:
• Regression: Predicts continuous outcomes (e.g., predicting house prices).
• Classification: Predicts categorical outcomes (e.g., spam detection, image
classification).
Supervised learning is widely used in scenarios where labeled data is abundant, such as
healthcare diagnostics, recommendation systems, and finance.
Unsupervised Learning:
Unsupervised learning involves training a model on data without explicit labels. The goal is
to find patterns, structures, or relationships in the data that do not have direct labeled
outputs.
Key Components of Unsupervised Learning:
1. Training Data: Only input features are provided; no labels are given.
2. Learning Algorithm: The model tries to understand the underlying structure, often
using clustering, dimensionality reduction, or generative models.
3. Output: The model organizes or represents data in a way that reveals hidden
structures or groupings.
4. Evaluation: Evaluation often relies on how well the model captures the data’s
structure, typically measured using metrics like Silhouette scores or reconstruction
error.
Techniques:
• Clustering: Grouping data into clusters with similar patterns (e.g., k-means, DBSCAN).
• Dimensionality Reduction: Techniques like PCA reduce the number of features while
preserving data’s important structure.
• Generative Models: Models like Autoencoders or Variational Autoencoders (VAEs)
generate new data by learning the distribution of the input data.
• Anomaly Detection: Identifies outliers or unusual patterns in the data.
Unsupervised learning is used in scenarios like customer segmentation, image compression,
data visualization, and feature extraction when labeled data is scarce or costly to obtain.
Group A (Very Short Answer Type Questions)
1. Answer any ten of the following:
(i) Fraud detection, image classification, diagnostics, and customer retention are
applications of Machine Learning algorithms.
(ii) A machine learning algorithm that builds a model based on labeled sample data is known
as Supervised Learning.
(iii) The primary goal of an unsupervised learning model is to determine the underlying
structure or distribution of the data.
(iv) Real-time decisions, game AI, learning tasks, skill acquisition, and robot navigation
are applications of Reinforcement Learning.
(v) Deep learning algorithms are generally more accurate than traditional machine learning
algorithms for image classification.
(vi) True or False: Overfitting is more likely when you have a huge amount of data to train on.
False. Overfitting is more likely when the model is too complex for the amount of data
available, leading it to memorize the training data rather than learn general patterns.
(vii) If you use an ensemble of different base models, is it necessary to tune the
hyperparameters of all base models to improve the ensemble performance? Not necessarily.
Tuning the hyperparameters of the base models can improve ensemble performance, but it is
not always necessary; the ensemble technique itself often improves performance by reducing
overfitting and increasing robustness.
(viii) The Bayes rule can be used in answering probability-related questions, such as
classification and prediction.
(ix) True or False: K-means clustering aims to partition n observations into k clusters. True.
(x) Which ensemble model helps in reducing variance? Bagging, such as Random Forest.
(xi) Write one common feature of neural network and linear regression models. Both learn
weights for the input features and make predictions from a weighted combination of them;
a neural network with no hidden layer and a linear output is equivalent to linear regression.
(xii) In a Decision Tree, a leaf node represents the final decision or prediction.
Group B (Short Answer Type Questions)
2. How to Tackle Overfitting and Underfitting?
Overfitting:
• Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add a penalty
term to the model's loss function, discouraging it from becoming too complex.
• Cross-validation: Techniques like k-fold cross-validation help to estimate the model's
performance on unseen data and identify overfitting early on.
• Early stopping: In iterative learning algorithms, stopping the training process before
the model starts to overfit the training data.
• Data augmentation: Creating additional training data by applying transformations to
the existing data, such as rotations, flips, and zooms.
Underfitting:
• Adding more features: Increasing the complexity of the model by adding more
features or using more complex models.
• Removing noise: Cleaning the data by removing irrelevant or noisy features.
• Increasing model complexity: Using more complex models, such as deep learning
models or higher-degree polynomials in regression.
3. Explain Pool-Based sampling in Active learning.
In pool-based sampling, the learning algorithm has access to a large pool of unlabeled data.
The algorithm then iteratively selects the most informative data points from the pool to be
labeled by a human expert. This is done by using a query strategy that aims to maximize the
model's expected improvement in performance. Common query strategies include
uncertainty sampling, query by committee, and expected model change.
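A minimal uncertainty-sampling sketch for pool-based active learning; the dataset, the logistic regression model, the initial labeled set, and the number of query rounds are all illustrative assumptions:

# Pool-based active learning sketch: uncertainty sampling (illustrative setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Seed the labeled set with a few examples of each class; the rest form the unlabeled pool.
labeled = [int(i) for i in np.where(y == 0)[0][:5]] + [int(i) for i in np.where(y == 1)[0][:5]]
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):                                   # five labeling rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Uncertainty sampling: pick the pool point whose predicted probability is closest to 0.5.
    probs = model.predict_proba(X[pool])[:, 1]
    most_uncertain = pool[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(most_uncertain)                   # "ask the expert" for this label
    pool.remove(most_uncertain)

print("labeled set size after querying:", len(labeled))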
4. Differentiate Between Machine Learning and Deep Learning.
Machine Learning:
• Relies on algorithms that learn from data to make predictions or decisions.
• Uses various algorithms like decision trees, support vector machines, and k-nearest
neighbors.
• Often requires feature engineering to extract relevant information from raw data.
• Limited in its ability to learn complex patterns in high-dimensional data.
Deep Learning:
• A subset of machine learning that uses artificial neural networks with multiple layers
(deep networks).
• Capable of learning complex patterns and representations directly from raw data.
• Requires large amounts of data and computational power.
• Has achieved state-of-the-art results in tasks like image recognition, natural language
processing, and speech recognition.
5. Explain How a System can play a Game of Chess using Reinforcement Learning.
In reinforcement learning, an agent learns to interact with an environment by taking
actions and receiving rewards or penalties. In the context of playing chess, the agent would
be the chess-playing system, the environment would be the chessboard and the rules of the
game, the actions would be the moves the agent makes, and the rewards would be winning
or losing the game.
The system could use a technique like Q-learning, where it learns an action-value function
that estimates the expected reward for taking a particular action in a given state. The
system would start by playing random moves and gradually improve its strategy by
exploring different moves and updating its Q-values based on the outcomes of the games.
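A tiny tabular Q-learning sketch of the update described above. Everything here is a generic placeholder (states, actions, and hyperparameters are invented); a real chess agent would need function approximation instead of a lookup table, since the chess state space is far too large:

# Tabular Q-learning update sketch (generic placeholders, not a full chess engine).
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] -> estimated long-term reward
alpha, gamma, epsilon = 0.1, 0.99, 0.1 # learning rate, discount factor, exploration rate

def choose_action(state, legal_actions):
    # Epsilon-greedy: mostly exploit the best known move, sometimes explore a random one.
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_legal_actions):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max((Q[(next_state, a)] for a in next_legal_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])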
6. Explain 'Naive' in a Naive Bayes Algorithm.
In Naive Bayes, the 'naive' assumption is that the features are conditionally independent
given the class label. This means that the algorithm assumes that the presence or absence
of one feature does not affect the probability of another feature given the class label. This
assumption simplifies the calculation of probabilities and makes the algorithm
computationally efficient. However, it may not always hold in real-world data.
Group C (Long Answer Type Questions)
7. (a) What Is a False Positive and False Negative and How are they significant with an
example?
• False Positive: A false positive occurs when a model predicts an event (e.g., a disease)
will occur, but in reality, it does not.
• False Negative: A false negative occurs when a model predicts an event will not occur,
but in reality, it does.
Example: In medical diagnosis, a false positive for a disease would mean a healthy person is
incorrectly diagnosed as having the disease, leading to unnecessary anxiety and potential
treatments. A false negative would mean a sick person is incorrectly diagnosed as healthy,
potentially delaying critical treatment.
Significance:
• False positives and false negatives have different consequences depending on the
application. In some cases, false positives are more critical (e.g., security systems),
while in others, false negatives are more critical (e.g., medical diagnosis).
• The ratio of false positives and false negatives is crucial in evaluating the performance
of a model and choosing the best model for a specific application.
7. (b) Why do you need a Confusion Matrix?
A confusion matrix is a table that summarizes the performance of a classification model. It
shows the number of true positives, true negatives, false positives, and false negatives for
each class. This information helps to:
• Visualize model performance: The confusion matrix provides a clear visual
representation of the model's accuracy, errors, and biases.
• Identify areas for improvement: By analyzing the confusion matrix, you can identify
which classes the model is struggling with and focus on improving its performance on
those classes.
• Compare different models: The confusion matrix can be used to compare the
performance of different models on the same dataset.
7. (c) For the given dataset, apply Naive Bayes' Algorithm and predict the outcome for a
car= {Red, Domestic, SUV}.
Dataset:

Color   Origin     Type    Class
Red     Domestic   SUV     Yes
Blue    Imported   Sedan   No
Green   Domestic   Sedan   Yes
...     ...        ...     ...

Naive Bayes' Algorithm:


1. Calculate prior probabilities:
o P(Yes) = (Number of "Yes" instances) / (Total number of instances)
o P(No) = (Number of "No" instances) / (Total number of instances)
2. Calculate conditional probabilities:
o P(Red|Yes), P(Domestic|Yes), P(SUV|Yes)
o P(Red|No), P(Domestic|No), P(SUV|No)
3. Calculate posterior probabilities (up to the common normalizing constant
P(Red, Domestic, SUV), which can be ignored when comparing the two classes):
o P(Yes|Red, Domestic, SUV) ∝ P(Yes) * P(Red|Yes) * P(Domestic|Yes) * P(SUV|Yes)
o P(No|Red, Domestic, SUV) ∝ P(No) * P(Red|No) * P(Domestic|No) * P(SUV|No)
4. Predict the class:
o If P(Yes|Red, Domestic, SUV) > P(No|Red, Domestic, SUV), predict "Yes"
o Otherwise, predict "No"
Note: To calculate the actual probabilities, you would need the complete dataset.
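A minimal sketch of these steps in plain Python. Because the table above is only partially given, the dataset below is a hypothetical completion: only the first three rows come from the notes, and the remaining rows are invented purely for illustration.

# Categorical Naive Bayes sketch for the car example.
# Only the first three rows come from the notes; the others are invented placeholders.
from collections import Counter

data = [
    ("Red", "Domestic", "SUV", "Yes"),
    ("Blue", "Imported", "Sedan", "No"),
    ("Green", "Domestic", "Sedan", "Yes"),
    ("Red", "Imported", "SUV", "No"),      # invented row
    ("Blue", "Domestic", "SUV", "Yes"),    # invented row
]
query = ("Red", "Domestic", "SUV")

# Prior probabilities P(class).
class_counts = Counter(row[-1] for row in data)
total = len(data)

scores = {}
for cls, cls_count in class_counts.items():
    score = cls_count / total                      # P(class)
    for i, value in enumerate(query):              # multiply by P(feature_i = value | class)
        match = sum(1 for row in data if row[-1] == cls and row[i] == value)
        score *= match / cls_count
    scores[cls] = score                            # proportional to the posterior

print(scores)
print("prediction:", max(scores, key=scores.get))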
8. (a) Explain: Sparse Modelling and estimation.
Sparse modeling is a technique used in machine learning and statistics to find solutions
with few non-zero components. This is beneficial for several reasons:
• Reduced model complexity: Sparse models are simpler and easier to interpret, making
them more understandable and less prone to overfitting.
• Improved generalization: By focusing on the most important features, sparse models
can generalize better to unseen data.
• Computational efficiency: Sparse models can be more efficient to train and use,
especially in high-dimensional data.
Estimation in sparse modeling involves techniques like L1 regularization (Lasso) and L0
regularization. These methods add a penalty term to the loss function that encourages the
model to have fewer non-zero coefficients.
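As a minimal sketch of this effect (scikit-learn's Lasso on synthetic data; the data and the alpha value are illustrative assumptions), L1 regularization drives the coefficients of irrelevant features to exactly zero:

# Lasso (L1) sketch: many fitted coefficients end up exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only features 0 and 3 actually matter; the other eight are irrelevant.
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha controls the strength of the L1 penalty
print(np.round(lasso.coef_, 2))      # coefficients for irrelevant features are ~0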
8. (b) Explain: Time series data in machine learning
Time series data is a sequence of data points collected over time. Examples include stock
prices, weather data, and sensor readings. Machine learning techniques are used to analyze
time series data for various purposes:
• Forecasting: Predicting future values of the time series.
• Anomaly detection: Identifying unusual patterns or events in the data.
• Classification: Classifying time series into different categories.
• Clustering: Grouping similar time series together.
8. (c) Explain how deep learning and feature representation learning are related.
Deep learning models, particularly deep neural networks, excel at feature representation
learning. This means they can automatically learn complex and meaningful features from
raw data. For example, in image recognition, a deep neural network can learn to detect
edges, shapes, and textures in the images, which are then used to make predictions. This
automatic feature extraction is a key advantage of deep learning compared to traditional
machine learning methods that often require manual feature engineering.
9. (a) Explain the ID3 algorithm for decision tree learning.
The ID3 (Iterative Dichotomiser 3) algorithm is a greedy algorithm used to build decision
trees. It works by recursively partitioning the data based on the feature that provides the
highest information gain. Information gain is a measure of how much the entropy
(impurity) of the data is reduced by splitting on a particular feature.
9. (b) What is the procedure for building a decision tree using ID3 with information gain and entropy?
1. Calculate the entropy of the target variable (class) for the entire dataset.
2. Calculate the information gain for each feature by splitting the data on that feature
and calculating the entropy of the resulting subsets.
3. Select the feature with the highest information gain as the root node of the decision
tree.
4. Recursively repeat steps 1-3 for each subset of the data created by the split, until a
stopping criterion is met (e.g., all nodes are pure or the depth of the tree reaches a
certain limit).
9. (c) What do you mean by information gain and entropy, and how do they relate to ID3?
• Entropy is a measure of the impurity or randomness of a dataset. A higher entropy
means the data is more mixed and less predictable.
• Information gain is the reduction in entropy achieved by splitting the data on a
particular feature. It measures how much information we gain about the target
variable by knowing the value of that feature.
In the ID3 algorithm, the feature with the highest information gain is selected at each step
to create the decision tree. This is because the goal is to find the feature that provides the
most information about the target variable, which will lead to the most accurate
predictions.
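A small sketch of the two quantities, computed for a toy split; the class labels and the split are made up purely for illustration:

# Entropy and information gain sketch (toy labels invented for illustration).
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Gain = H(parent) - weighted average of H(child subsets).
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = ["Yes", "Yes", "Yes", "No", "No"]           # class values before the split
split = [["Yes", "Yes", "Yes"], ["No", "No"]]         # subsets after splitting on a feature
print("entropy:", round(entropy(parent), 3))          # ~0.971
print("information gain:", round(information_gain(parent, split), 3))  # ~0.971 (pure children)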
10. (a) Why KNN is non-parametric?
K-Nearest Neighbors (KNN) is a non-parametric algorithm because it does not make any
assumptions about the underlying distribution of the data. Parametric models, on the other
hand, assume a specific parametric form for the data (e.g., normal distribution). This means
KNN can be used for a wider range of data types and does not require any assumptions
about the data distribution.
10. (b) Write down three Pros and Cons of KNN algorithm.
Pros:
• Simple and easy to implement: KNN is a relatively simple algorithm that is easy to
understand and implement.
• Versatile: Can be used for both classification and regression tasks.
• No training phase: KNN does not require any training phase, making it suitable for
online learning scenarios.
Cons:
• Sensitive to the choice of k: The performance of KNN is sensitive to the choice of the k
parameter.
• Computational cost: KNN can be computationally expensive for large datasets, as it
requires calculating distances to all data points.
• Curse of dimensionality: KNN can perform poorly in high-dimensional spaces, as the
distance between data points becomes less meaningful.
10. (c) What is Euclidean distance in terms of machine learning? Using Euclidean distance
find the distance between points P(3, 2) and Q(4, 1).
In machine learning, Euclidean distance is a common distance metric used to measure the
similarity between two data points. It is calculated as the square root of the sum of the
squared differences between the corresponding coordinates of the two points.
Euclidean distance between P(3, 2) and Q(4, 1):
sqrt((4 - 3)^2 + (1 - 2)^2) = sqrt(1 + 1) = sqrt(2) ≈ 1.414
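As a quick check, the same distance computed with NumPy (a minimal sketch):

# Euclidean distance between P(3, 2) and Q(4, 1) with NumPy.
import numpy as np

p = np.array([3, 2])
q = np.array([4, 1])
print(np.linalg.norm(p - q))   # sqrt(2) ≈ 1.4142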
11. (a) What is model selection? Write down the types of data and based on that which
model is used?
Model selection is the process of choosing the best model from a set of candidate models.
The choice of model depends on various factors, including the type of data, the goal of the
analysis, and the available computational resources.
Types of data and suitable models:
• Numerical data: Linear regression, support vector regression, decision trees, neural
networks
• Categorical data: Logistic regression, decision trees, support vector machines, Naive
Bayes
• Time series data: ARIMA, LSTM, Prophet
• Image data: Convolutional neural networks (CNNs)
• Text data: Recurrent neural networks (RNNs)
11. (b) Explain Cross validation with an example.
Cross-validation is a technique used to estimate the performance of a machine learning
model on unseen data. It involves splitting the data into multiple subsets (folds), training
the model on some folds, and evaluating it on the remaining fold(s). This process is
repeated multiple times, with each fold used for evaluation once. The average
performance across all folds provides a more robust estimate of the model's true
performance compared to using a single train-test split.
Example: k-fold cross-validation
1. Split the data: Divide the dataset into k equal-sized folds.
2. Train and evaluate:
o For each fold i:
▪ Use folds 1 to (i-1) and (i+1) to k for training.
▪ Use fold i for evaluation.
o Calculate the performance metric (e.g., accuracy, F1-score) on the evaluation
fold.
3. Average performance: Calculate the average performance across all k folds.
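A minimal k-fold cross-validation sketch with scikit-learn; the dataset, the logistic regression model, and k = 5 are illustrative choices:

# 5-fold cross-validation sketch with scikit-learn (illustrative choices).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is used once for evaluation while the other 4 train the model.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())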
11. (c) Explain Statistical Learning Theory.
Statistical learning theory is a framework that provides a theoretical foundation for
machine learning. It aims to understand the principles and limitations of learning from data.
Key concepts in statistical learning theory include:
• Bias-variance trade-off: This trade-off describes the balance between bias (systematic
error from overly simple assumptions) and variance (sensitivity to fluctuations in the
training data). High-bias models tend to underfit the data, while high-variance models
tend to overfit it.
• Generalization error: The expected error of a model on unseen data.
• VC dimension: A measure of the complexity of a model, which influences its ability to
generalize.
• Regularization: Techniques used to control model complexity and prevent overfitting.
1. Explain Matrix Factorization and where it is used.
Matrix factorization is a technique used to decompose a large matrix into the product of
two smaller matrices. This is often done to reduce dimensionality, identify patterns, and
improve performance in various applications. It is commonly used in:
• Recommendation systems: Recommending products, movies, or other items to users
based on their past behavior.
• Natural language processing: Analyzing text data, such as topic modeling and
document clustering.
• Collaborative filtering: Making predictions about user preferences based on the
preferences of other users.
• Image and video analysis: Dimensionality reduction and feature extraction for image
and video data.
2. Why ensemble learning is used? What is the general principle of an ensemble method
and what is bagging and boosting in ensemble method?
Ensemble learning is used to improve the performance of machine learning models by
combining the predictions of multiple models. The general principle is that the combined
model is often more accurate and robust than any individual model.
Bagging (Bootstrap Aggregating) involves creating multiple subsets of the training data by
sampling with replacement (bootstrapping). Each subset is used to train a separate model,
and the final prediction is made by averaging the predictions of all models.
Boosting is an iterative process where each subsequent model focuses on the examples
that were misclassified by the previous models. This results in a sequence of models that
are increasingly better at classifying the difficult examples.
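A minimal sketch contrasting the two with scikit-learn; the dataset and hyperparameters are illustrative assumptions, and both ensembles use decision trees as their default base models:

# Bagging vs. boosting sketch with scikit-learn (illustrative settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: decision trees trained on bootstrap samples, predictions combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: shallow trees added sequentially, each focusing on previously misclassified examples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, "mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean())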
3. Explain the Difference Between Classification and Regression?
• Classification: Predicts a categorical outcome (e.g., spam or not spam, cat or dog).
• Regression: Predicts a continuous value (e.g., house price, temperature).
4. Compare K-means and KNN Algorithms.
• K-means: Unsupervised clustering algorithm that groups data points into clusters
based on their similarity.
• KNN: Supervised classification algorithm that classifies a data point based on the
majority class of its k nearest neighbors.
5. How do we decide the value of "K" in KNN algorithm? Why is the odd value of "K"
preferable in KNN algorithm?
The value of K in KNN is typically determined through cross-validation. An odd value of K is
often preferred because it helps avoid ties when determining the majority class.
Group C
7. (a) Discuss the different types of Machine Learning?
• Supervised learning: Learning from labeled data to make predictions on new data.
• Unsupervised learning: Learning from unlabeled data to discover patterns and
structures.
• Reinforcement learning: Learning by interacting with an environment and receiving
rewards or penalties.
7. (b) What are parametric and non-parametric model?
• Parametric models: Make assumptions about the underlying data distribution and use
a fixed number of parameters.
• Non-parametric models: Do not make assumptions about the data distribution and
can use a flexible number of parameters.
7. (c) How is machine learning related to AI?
Machine learning is a subset of artificial intelligence that focuses on enabling computers to
learn from data without being explicitly programmed.
8. (a) Explain Generative Mixture model
A generative mixture model is a probabilistic model that assumes the data is generated
from a mixture of different probability distributions.
8. (b) With a proper diagram explain the steps of a generative mixture model
[Diagram omitted.] In outline, a mixture model generates each data point in two steps: first a
component z is drawn according to the mixing weights, then the observation x is sampled from
that component's distribution (e.g., a Gaussian). The mixing weights and component parameters
are typically estimated with the Expectation-Maximization (EM) algorithm, which alternates
between assigning responsibilities to components (E-step) and re-estimating the parameters
(M-step).
8. (c) Write down the steps of PCA (Principal Component Analysis)
1. Standardize the data.
2. Calculate the covariance matrix.
3. Calculate the eigenvectors and eigenvalues of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue.
5. Choose the top k eigenvectors.
6. Project the data onto the new subspace defined by the top k eigenvectors.
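The steps above as a minimal NumPy sketch; the synthetic data and the choice k = 2 are illustrative assumptions:

# PCA from scratch with NumPy, following the steps listed above (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # synthetic data: 100 samples, 5 features

# 1. Standardize the data.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix; 3. its eigenvectors and eigenvalues.
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh handles symmetric matrices

# 4. Sort by decreasing eigenvalue; 5. keep the top k eigenvectors.
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

# 6. Project the data onto the new subspace.
X_reduced = X_std @ components
print(X_reduced.shape)                              # (100, 2)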
9. (a) Explain the Confusion Matrix with Respect to Machine Learning Algorithms with an
suitable example
A confusion matrix is a table that summarizes the performance of a classification model by
showing the number of true positives, true negatives, false positives, and false negatives.
Example layout for a binary classification problem (e.g., spam detection):

                     Predicted Positive     Predicted Negative
Actual Positive      True Positive (TP)     False Negative (FN)
Actual Negative      False Positive (FP)    True Negative (TN)
9. (b) What are the advantages and disadvantages of using a decision tree?
Advantages:
• Easy to interpret and visualize.
• Can handle both categorical and numerical data.
• Can handle missing values.
Disadvantages:
• Can be prone to overfitting.
• Can be sensitive to small changes in the data.

10. (a) Explain the three techniques under supervised feature selection

Supervised feature selection aims to identify the most relevant features for a given
machine learning task, using the available labels. Common techniques include:

• Filter methods: These methods evaluate features based on their individual scores,
independent of the chosen learning algorithm. Examples include:
o Correlation-based methods: Measure the correlation between features and the
target variable.
o Information gain: Measures the reduction in entropy (uncertainty) when a
feature is used to make a decision.
• Wrapper methods: These methods evaluate feature subsets based on their
performance with a specific learning algorithm. This involves training and evaluating
the model multiple times with different feature combinations. Examples include:
o Recursive feature elimination (RFE): Iteratively removes features with the least
importance according to the model.
o Forward selection: Starts with an empty set of features and adds features one
by one based on their contribution to the model's performance.
• Embedded methods: These methods perform feature selection as part of the model
training process. Examples include:
o L1 regularization (Lasso): Adds a penalty term to the model's objective
function, which encourages the model to use fewer features.
o Decision tree-based methods: Feature importance can be estimated based on
how often a feature is used to make decisions in a decision tree.
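A minimal sketch of two of the approaches listed above, with scikit-learn: recursive feature elimination as a wrapper method and an L1-penalized model as an embedded method. The dataset and all parameter values are illustrative assumptions:

# Feature selection sketch: RFE (wrapper) and L1 penalty (embedded), illustrative settings.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Wrapper method: recursively drop the least important features per the model's coefficients.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE selected features:", [i for i, keep in enumerate(rfe.support_) if keep])

# Embedded method: an L1 penalty zeroes out coefficients of unhelpful features during training.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("L1 non-zero features:", [i for i, w in enumerate(l1_clf.coef_[0]) if abs(w) > 1e-6])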
10. (b) Explain the benefits of using feature selection in machine learning
• Improved model performance: Removing irrelevant or redundant features can lead
to better generalization and reduced overfitting.
• Reduced training time and computational cost: Fewer features mean faster training
and lower memory usage.
• Increased model interpretability: By selecting a smaller subset of features, it
becomes easier to understand which features are most important for the model's
predictions.
• Reduced dimensionality: Feature selection can be used to reduce the
dimensionality of the data, which can be beneficial for visualization and other
downstream tasks.
10. (c) Explain the curse of dimensionality
The curse of dimensionality refers to the challenges that arise when dealing with high-
dimensional data. As the number of features increases:

• Data sparsity: Data points become increasingly sparse in the feature space, making
it difficult to find meaningful patterns and relationships.
• Overfitting: Models with many features are more likely to overfit the training data,
leading to poor generalization performance.
• Computational complexity: The amount of data needed to cover the feature space grows
exponentially with the number of dimensions, and many machine learning algorithms become
much more expensive to train and evaluate as features are added.
11. (a) What is Artificial Intelligence and why do we need it?
Artificial intelligence (AI) is the field of computer science that aims to create intelligent
agents, which are systems that can reason, learn, and act autonomously. We need AI to
solve complex problems that are beyond the capabilities of humans, such as:
• Automating tasks: AI can automate repetitive and mundane tasks, freeing up human
time and resources.
• Making better decisions: AI can analyze large amounts of data and make more
informed decisions than humans.
• Developing new technologies: AI is driving innovation in areas such as robotics,
medicine, and transportation.
11. (b) What is Deep Learning, and give some of its examples that are used in real-
world?
Deep learning is a subfield of machine learning that uses artificial neural networks with
multiple layers (deep neural networks) to learn complex patterns and representations
from data. Examples of deep learning applications include:
• Image recognition: Used in self-driving cars, facial recognition systems, and medical
image analysis.
• Natural language processing: Used in chatbots, machine translation, and sentiment
analysis.
• Speech recognition: Used in voice assistants like Siri and Alexa.
11. (c) Differentiate between Artificial Intelligence, Machine Learning, and Deep
Learning
• Artificial intelligence (AI): The broad field of creating intelligent agents.
• Machine learning (ML): A subset of AI that focuses on enabling computers to learn
from data without being explicitly programmed.
• Deep learning (DL): A subset of ML that uses deep neural networks to learn
complex representations from data.
