Take It Easy
Summary: Module 1 to 5
Model paper
https://prod-files-secure.s3.us-west-2.amazonaws.com/872d74be-2f52-4a00-ad56-893819e7a4ea/be309919-b2d0-4987-b271-5dde2d5b25ff/18AI72.pdf
Qp - 1
https://prod-files-secure.s3.us-west-2.amazonaws.com/872d74be-2f52-4a00-ad56-893819e7a4ea/241296b2-82fd-4daf-ac16-49aa1d043479/Copy_of_PYQ_1.pdf
Qp - 2
https://prod-files-secure.s3.us-west-2.amazonaws.com/872d74be-2f52-4a00-ad56-893819e7a4ea/aa273b6a-f410-4f28-b322-6383884eec76/Copy_of_PYQ_2.pdf
Module 1
1. Explain gradient descent algorithm. (OR) Explain the steps required for implementing
Gradient Descent Algorithm. (10M) → Repeated (OR) Using code snippets, explain the
Gradient Descent Algorithm through utility methods. (10M)
Gradient descent is an optimization algorithm commonly used to train machine learning models and neural
networks. Training data helps these models learn over time, and the cost function within gradient descent specifically
acts as a barometer, gauging its accuracy with each iteration of parameter updates. Until the function is close to or equal
to zero, the model will continue to adjust its parameters to yield the smallest possible error.
There are three types of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
import random
import numpy as np

def initialize(dim):
    # seeds fixed for reproducibility
    np.random.seed(seed=42)
    random.seed(42)
    b = random.random()
    w = np.random.rand(dim)
    return b, w
This method initializes the bias ( b ) and weights ( w ) randomly. The dim parameter is the number of weights to be
initialized (besides the bias). The seed is set for reproducibility, and it returns the initialized bias and weights.
This method calculates the predicted values of Y ( Y_hat ) given the bias ( b ), weights ( w ), and input matrix ( X )
using matrix multiplication.
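A plausible sketch of that prediction helper (the name predict_Y is an assumption based on the description):

import numpy as np

def predict_Y(b, w, X):
    # Y_hat = b + X·w via matrix multiplication
    return b + np.matmul(X, w)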
This method computes the Mean Squared Error (MSE) as the cost function. It calculates the residuals, squares
them, sums over all records, and divides by the number of observations.
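A matching sketch of the cost helper (the name get_cost is assumed):

import numpy as np

def get_cost(Y, Y_hat):
    # residuals between actual and predicted values
    Y_resid = Y - Y_hat
    # sum of squared residuals divided by the number of observations
    return np.sum(np.matmul(Y_resid.T, Y_resid)) / len(Y_resid)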
def update_beta(b_0, w_0, db, dw, learning_rate):
    # db, dw: gradients of the cost function w.r.t. bias and weights
    b_1 = b_0 - learning_rate * db
    w_1 = w_0 - learning_rate * dw
    return b_1, w_1
This method updates the bias ( b ) and weights ( w ) based on the gradients of the cost function with respect to the
bias and weights.
return gd_iterations_df, b, w
This method runs the gradient descent algorithm, updating bias and weights iteratively. It also tracks the cost at
every 10 iterations.
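A minimal sketch of how these utilities might fit together (names and signatures follow the fragments above, and are assumptions; the cost logging every 10 iterations matches the description):

import numpy as np
import pandas as pd

def run_gradient_descent(X, Y, alpha=0.01, num_iterations=100):
    # assumes initialize, predict_Y, get_cost, and update_beta as sketched above
    b, w = initialize(X.shape[1])
    gd_iterations_df = pd.DataFrame(columns=["iteration", "cost"])
    result_idx = 0
    for iter_num in range(num_iterations):
        Y_hat = predict_Y(b, w, X)
        this_cost = get_cost(Y, Y_hat)
        # gradients of the MSE cost w.r.t. bias and weights
        db = 2 * np.sum(Y_hat - Y) / len(Y)
        dw = 2 * np.dot(Y_hat - Y, X) / len(Y)
        b, w = update_beta(b, w, db, dw, alpha)
        if iter_num % 10 == 0:  # track cost every 10 iterations
            gd_iterations_df.loc[result_idx] = [iter_num, this_cost]
            result_idx += 1
    return gd_iterations_df, b, w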
These utility methods collectively implement the Gradient Descent Algorithm for Linear Regression, including
initialization, prediction, cost calculation, and parameter updates. The last method, run_gradient_descent , orchestrates
the entire process for a specified number of iterations.
2. Discuss the steps for building a machine learning model.
Steps for building a machine learning model:
1. Identify the features (independent variables) and the outcome variable (dependent variable) in the dataset.
2. Split the dataset into training and test sets using the train_test_split function from the sklearn.model_selection module.
3. Initialize the model and fit it to the training data using the fit method. Example code:

linreg = LinearRegression()
linreg.fit(X_train, y_train)
print("Intercept:", linreg.intercept_)
print("Coefficients:", linreg.coef_)

4. Use the predict method to make predictions on the test set. Example code:

y_pred = linreg.predict(X_test)

5. Measure accuracy using metrics such as Mean Absolute Percentage Error (MAPE) or Root Mean Square Error (RMSE).
6. Calculate the R-squared value to understand the amount of variance explained by the model.

These steps provide a structured approach to building, validating, and measuring the accuracy of a machine learning model.
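A compact end-to-end sketch of these steps on synthetic data (the dataset, split ratio, and seed are assumptions for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# toy data: y is roughly linear in the two features
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R-squared:", r2_score(y_test, y_pred))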
3. Contrast the features of the Receiver Operating Characteristic (ROC) curve and the Area Under ROC (AUC) score in a Logistic Regression model with code snippets.
Receiver Operating Characteristic Curve (ROC) and Area Under ROC (AUC) Score are widely used metrics for
evaluating the performance of classification models, including Logistic Regression. Let's contrast the features of ROC
and AUC with code snippets.
ROC Curve
1. Definition:
The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) at various thresholds.
2. Implementation:
The ROC curve is generated using the draw_roc_curve() method, which utilizes the roc_curve() function from
sklearn.metrics .
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
AUC Score
1. Definition:
AUC is a single numeric value representing the area under the ROC curve.
It quantifies the overall performance of the model, with higher AUC values indicating better discrimination
between positive and negative classes.
2. Implementation:
The AUC score is computed using the roc_auc_score() function from sklearn.metrics, while the curve itself is plotted inside draw_roc_curve() using matplotlib's plt.plot() on the FPR and TPR values.
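A sketch of what such a draw_roc_curve() helper might look like (the helper name comes from the notes; the body is an assumed implementation):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def draw_roc_curve(model, test_X, test_y):
    # predicted probability of the positive class
    pred_prob = model.predict_proba(test_X)[:, 1]
    fpr, tpr, thresholds = roc_curve(test_y, pred_prob)
    auc_score = roc_auc_score(test_y, pred_prob)
    plt.plot(fpr, tpr, label="ROC curve (AUC = %.2f)" % auc_score)
    plt.plot([0, 1], [0, 1], "k--")  # chance diagonal
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()
    return fpr, tpr, thresholds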
In summary, the ROC curve provides a visual representation of the model's performance at different thresholds, and
the AUC score condenses this information into a single metric. The provided code snippets showcase how to
calculate these metrics and plot the ROC curve for a Logistic Regression model.
4. In the context of the ARIMA model, explain the following:
i) Dickey-Fuller Test. ii) Forecast and Measure Accuracy.
Dickey-Fuller Test
1. Purpose: The Dickey-Fuller test is employed to assess the stationarity of a time series, a crucial assumption for
ARIMA modeling.
2. Test Hypotheses:
Null hypothesis (H0): the time series has a unit root, i.e., it is non-stationary. Alternative hypothesis (H1): the time series is stationary.
3. Implementation:
The adfuller function from statsmodels.tsa.stattools is used to conduct the Dickey-Fuller test.
4. Interpretation:
If the p-value is less than 0.05 (common significance level), the null hypothesis is rejected, indicating that the
time series is stationary.
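A minimal sketch of running the test (the series here is synthetic; a random walk should fail to reject the null hypothesis):

from statsmodels.tsa.stattools import adfuller
import numpy as np

series = np.random.normal(0, 1, 100).cumsum()  # random walk: non-stationary by construction
adf_stat, p_value = adfuller(series)[:2]
print("ADF statistic:", adf_stat, "p-value:", p_value)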
Forecast and Measure Accuracy
Forecasting future values is performed using an ARIMA model, presumably with lag 1.
The forecast is obtained for specific time points, in this case months 31 to 37.
The forecasted demand values for months 31 to 37 are given as an array: [480.15, 497.71, 506.01, 509.93, …]
Mean Absolute Percentage Error (MAPE) is utilized to evaluate the accuracy of the forecast.
The MAPE formula calculates the average percentage error between actual and forecasted values.
The reported MAPE value of 19.12% suggests the average percentage difference between actual and
forecasted values for the given ARIMA model with lag 1.
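A hedged sketch of that forecasting step (the demand series, model order, and horizon are assumptions; the data here is a synthetic placeholder):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# synthetic monthly demand for months 1-30
demand = pd.Series(480 + np.cumsum(np.random.normal(1, 5, 30)))
model = ARIMA(demand, order=(1, 0, 0)).fit()  # lag-1 autoregressive term, assumed order
forecast = model.predict(start=30, end=36)    # months 31 to 37
print(forecast)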
In summary, the Dickey-Fuller test is employed to check stationarity, and forecast accuracy is assessed using MAPE
after forecasting with an ARIMA model.
5. Explain the components of time series data. (OR) In the context of forecasting, describe the components of time-series data and explain the concept of decomposing a time series. (10M) → Repeated
Components of Time-Series Data
Trend Component (Tt)
Definition: The trend component represents the consistent long-term upward or downward movement of the
data.
Example: In the context of demand for a product, if the demand shows a steady increase over several years,
it indicates a positive trend.
Seasonal Component (St)
Definition: The seasonal component represents repetitive upward or downward movements (fluctuations) from the trend that occur within a calendar year at fixed intervals.
Example: Seasonal fluctuations can be observed in demand for certain products during specific times of the
year, such as increased demand for winter clothing in the winter season.
Cyclical Component (Ct)
Definition: The cyclical component is the fluctuation around the trend line at random intervals, driven by
macro-economic changes such as recession, unemployment, etc. Cyclical fluctuations have repetitive
patterns with a time between repetitions of more than a year.
Example: A cyclical downturn in demand for luxury goods during an economic recession.
Irregular Component (It)
Definition: The irregular component represents random, uncorrelated changes or white noise in the data. It
follows a normal distribution with a mean value of 0 and constant variance.
Example: Unpredictable spikes or drops in demand that cannot be attributed to trend, seasonality, or cyclical
patterns.
Decomposing Time Series
The process of decomposing a time series involves separating it into its individual components, namely trend,
seasonal, cyclical, and irregular components. The goal is to analyze and understand each component separately,
which can aid in making more accurate forecasts. The decomposition is typically done using mathematical techniques
or statistical methods.
The decomposition equation is often expressed as:
Y_t = T_t + S_t + C_t + I_t
Trend Component (Tt): Identifying and estimating the trend helps understand the overall direction of the data
over the long term.
Seasonal Component (St): Isolating the seasonal component allows for the identification of repetitive patterns
within a calendar year.
Cyclical Component (Ct): Analyzing the cyclical component helps in recognizing broader economic trends and
fluctuations.
Irregular Component (It): Examining the irregular component helps identify random variations in the data that
are not explained by trend, seasonality, or cyclical patterns.
Forecasting methods such as moving average, exponential smoothing, and ARIMA leverage the understanding of
these components to make predictions about future values. It is essential to choose an appropriate forecasting
technique based on the characteristics of the time series data and the nature of its components.
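A short sketch of decomposition with statsmodels (the series is synthetic and the period of 12 months is an assumption):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(48)
# synthetic series: trend + yearly seasonality + noise
series = pd.Series(10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 1, 48))
seasonal_decompose(series, model="additive", period=12).plot()
plt.show()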
6. Explain the moving average technique to forecast the future value of time series data.
(OR) Explain the Moving Average Model and the method to calculate forecast accuracy.
(10M) → Repeated
Moving Average (MA) processes are regression models that utilize past residuals to predict future values within a time-
series dataset. Expressing a moving average process with a lag of 1 as:
Y_{t+1} = a_1·e_t + e_{t+1}, where e_t denotes the residual (error) at time t.
The generalization of this model to q lags involves determining the appropriate value of q, or the number of lags, based
on specific criteria outlined by Yaffee and McGee (2000):
1. The auto-correlation values are significant for the initial q lags and then decline to zero.
2. The partial auto-correlation values decay gradually toward zero.
Forecast accuracy measures help evaluate the performance of forecasting models such as Moving Average (MA)
models. Here are some common methods for calculating forecast accuracy:
1. Mean Absolute Percentage Error (MAPE)
Formula: MAPE = (1/n) · Σ_{i=1}^{n} (|Actual_i − Forecast_i| / |Actual_i|) × 100
Description: MAPE expresses forecast accuracy as a percentage of the absolute difference between actual and
forecasted values relative to the actual values. A lower MAPE indicates better accuracy.
2. Mean Absolute Error (MAE)
Formula: MAE = (1/n) · Σ_{i=1}^{n} |Actual_i − Forecast_i|
Description: MAE represents the average absolute difference between actual and forecasted values. It gives equal weight to all errors.
3. Root Mean Square Error (RMSE)
Formula: RMSE = √[(1/n) · Σ_{i=1}^{n} (Actual_i − Forecast_i)²]
Description: RMSE is the square root of the average squared differences between actual and forecasted values. It penalizes larger errors more than smaller errors.
4. Forecast Bias
Formula: Bias = (1/n) · Σ_{i=1}^{n} (Actual_i − Forecast_i)
Description: Forecast bias measures the average tendency of the forecasts to be either too high (positive bias)
or too low (negative bias).
In the provided code snippets, the MAPE is explicitly calculated using the get_mape function for the Moving Average (MA)
model with lag 1. The MAPE value is then reported as 17.8%, indicating the percentage error in the forecast.
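A sketch of the get_mape helper referenced above (assumed implementation):

import numpy as np

def get_mape(actual, predicted):
    actual, predicted = np.array(actual), np.array(predicted)
    # average absolute percentage error, rounded to two decimals
    return np.round(np.mean(np.abs((actual - predicted) / actual)) * 100, 2)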
7. Illustrate the KNN algorithm with an example.
The K-Nearest Neighbors (KNN) algorithm is a non-parametric and lazy learning algorithm used for regression and
classification problems. It classifies new observations by comparing them with the training data and finding similar
neighbors. Here's an illustration of the KNN algorithm using the bank marketing dataset
1. Finding Neighbors
Observations in the training set that are similar to the new observation are considered neighbors.
The number of neighbors (K) to be considered for classifying a new observation is a parameter that can be set.
The class for the new observation is predicted to be the same class as the majority of its neighbors.
2. Distance Metrics
The Euclidean distance between two observations O1 = (X11, X12) and O2 = (X21, X22) is:
D(O1, O2) = √[(X11 − X21)² + (X12 − X22)²]
Other distance metrics such as Minkowski distance, Jaccard Coefficient, and Gower’s distance can also be
used.
3. Implementation in Python
The scikit-learn library provides the KNeighborsClassifier algorithm for classification problems.
Key parameters include n_neighbors (number of neighbors), metric (distance metric), and weights (weighting
scheme).
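A minimal sketch with synthetic data standing in for the bank marketing dataset (data generation and parameter values are assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=42)  # synthetic stand-in
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)  # K = 5 neighbors, default Minkowski metric
knn.fit(train_X, train_y)
pred_y = knn.predict(test_X)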
4. Accuracy Evaluation
ROC AUC score and the ROC curve are commonly used to evaluate KNN accuracy.
5. Confusion Matrix
It includes true positives, true negatives, false positives, and false negatives.
6. Classification Report
Precision, recall, and F1-score for each class are summarized in the classification report.

from sklearn import metrics
print(metrics.classification_report(test_y, pred_y))
7. Hyperparameter Tuning
The optimal number of neighbors (K) can be found through hyperparameter tuning using GridSearch in scikit-
learn.
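A hedged sketch of that tuning step (the grid range and scoring choice are assumptions; train_X/train_y as in the earlier sketch):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": list(range(1, 16))}  # candidate K values, assumed range
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="roc_auc")
grid.fit(train_X, train_y)
print("Best K:", grid.best_params_["n_neighbors"])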
8. With respect to the Moving Average model, discuss the different methods for calculating Forecast Accuracy.
Moving Average (MA) processes are regression models that utilize past residuals to predict future values within a time-
series dataset. Expressing a moving average process with a lag of 1 as:
Y_{t+1} = a_1·e_t + e_{t+1}, where e_t denotes the residual (error) at time t.
The generalization of this model to q lags involves determining the appropriate value of q, or the number of lags, based
on specific criteria outlined by Yaffee and McGee (2000):
1. The auto-correlation values are significant for the initial q lags and then decline to zero.
2. The partial auto-correlation values decay gradually toward zero.
Forecast accuracy measures help evaluate the performance of forecasting models such as Moving Average (MA)
models. Here are some common methods for calculating forecast accuracy:
1. Mean Absolute Percentage Error (MAPE)
Formula: MAPE = (1/n) · Σ_{i=1}^{n} (|Actual_i − Forecast_i| / |Actual_i|) × 100
Description: MAPE expresses forecast accuracy as a percentage of the absolute difference between actual and
forecasted values relative to the actual values. A lower MAPE indicates better accuracy.
2. Mean Absolute Error (MAE)
Formula: MAE = (1/n) · Σ_{i=1}^{n} |Actual_i − Forecast_i|
Description: MAE represents the average absolute difference between actual and forecasted values. It gives equal weight to all errors.
3. Root Mean Square Error (RMSE)
Formula: RMSE = √[(1/n) · Σ_{i=1}^{n} (Actual_i − Forecast_i)²]
Description: RMSE is the square root of the average squared differences between actual and forecasted values.
It penalizes larger errors more than smaller errors.
4. Forecast Bias
Formula: Bias = (1/n) · Σ_{i=1}^{n} (Actual_i − Forecast_i)
Description: Forecast bias measures the average tendency of the forecasts to be either too high (positive bias)
or too low (negative bias).
In the provided code snippets, the MAPE is explicitly calculated using the get_mape function for the Moving Average (MA)
model with lag 1. The MAPE value is then reported as 17.8%, indicating the percentage error in the forecast.
9. Discuss the advantages and limitations of MAPE as a forecast accuracy measure.
Interpretability
MAPE is expressed as a percentage, making it easy to understand and communicate to both technical and non-
technical audiences.
Stakeholders can easily grasp the idea that, for example, a MAPE of 10% means the average forecast error is
10% of the actual values.
Scale Independence
MAPE is scale-independent, meaning it can be used to compare forecast accuracy across different datasets or
time series, regardless of the magnitude of the values.
This makes MAPE a versatile measure that can be applied to various domains and industries.
Symmetry
MAPE treats overestimation and underestimation symmetrically. Both types of errors contribute equally to the
overall accuracy measure.
This symmetry is sometimes seen as an advantage when evaluating the overall performance of a forecasting
model.
MAPE is frequently used in real-world forecasting applications, and many forecasting software packages provide
MAPE as a standard metric for evaluating model performance.
Limitations
MAPE has some limitations. It can be sensitive to extreme values or outliers in the data.
Despite its popularity, it's essential to note that no single accuracy measure is universally suitable for all situations.
Depending on the characteristics of the data and the specific goals of the forecasting task, other metrics like Mean
Absolute Error (MAE), Root Mean Squared Error (RMSE), or others might also be considered. The choice of accuracy
measure should align with the specific requirements and characteristics of the forecasting problem at hand.
10. In the context of machine learning, explain the Bias-Variance Trade-off with an example. (10M)
The Bias-Variance Trade-off is a fundamental concept in machine learning that helps us understand the sources of
error in a predictive model. It involves finding the right balance between bias and variance to achieve a model that
generalizes well to unseen data.
Suppose you are working on a regression problem where you want to predict housing prices based on various features
such as square footage, number of bedrooms, and location. You collect a dataset containing information about different
houses, including their features and actual selling prices.
You decide to use a simple linear regression model, assuming a linear relationship between the square footage
and the price of the house.
However, the true relationship may be more complex, involving non-linear dependencies. As a result, your model
may struggle to capture the nuances of the data.
This is an example of high bias or underfitting, as the model is too simplistic to represent the underlying patterns
in the housing price data.
Recognizing the limitations of the linear model, you decide to use a high-degree polynomial regression model.
This model has the flexibility to fit the training data very closely, capturing intricate details.
The model performs exceptionally well on the training dataset, but when you apply it to new, unseen houses, it
fails to generalize. The predictions are highly sensitive to small changes in the training data.
This is an example of high variance or overfitting, as the model is too complex and captures noise in the training
data rather than the true underlying relationship.
Trade-off
To find the right balance, you experiment with different models of varying complexity (polynomial degrees in this
case).
You train models with degrees ranging from 1 to 10 and evaluate their performance on both the training and test
datasets.
After evaluating the models, you observe that a polynomial regression model with a degree of 3 achieves a good
balance. It captures the non-linear patterns in the data without being overly complex.
This model generalizes well to new houses and doesn't suffer from the issues of underfitting or overfitting.
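A short sketch of that experiment on synthetic "housing" data (the data-generating function, ranges, and units are assumptions):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
sqft = rng.uniform(0.5, 3.5, 200).reshape(-1, 1)  # square footage in thousands
price = 50 + 40 * sqft.ravel() + 10 * sqft.ravel() ** 2 + rng.normal(0, 10, 200)
X_train, X_test, y_train, y_test = train_test_split(sqft, price, random_state=42)

for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}  train MSE={train_err:.1f}  test MSE={test_err:.1f}")

Low degrees should show high error on both sets (underfitting), while very high degrees drive the training error down but let the test error rise again (overfitting).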
Module 2
1. Show how the evaluation problem and learning problem issues are addressed by the Hidden Markov Model. (OR) Discuss the problems in the Hidden Markov method. (OR) Define the Hidden Markov Model and illustrate any two central issues addressed by it. → Repeated (10M)
Hidden Markov Models (HMMs) are widely used in various applications, including speech recognition and activity
recognition, to address both evaluation and learning problems. Let's discuss how HMM addresses these issues based on
the provided resource.
The evaluation problem in HMM involves determining the probability of observing a given sequence of visible states (V^T) given a particular model. This problem is addressed by using the forward algorithm.
The forward algorithm is a recursive procedure that calculates the probability of being in a particular state at
each time step.
The decoding problem aims to determine the most likely sequence of hidden states that the machine traversed while generating the visible states (V^T). Essentially, it identifies the most probable state sequence given the observed sequence.
A trellis diagram, composed of a matrix of nodes, is employed for solving the decoding problem. Each column in
the diagram represents possible states at a specific time. The first column corresponds to time instant 0, and
subsequent columns depict states at different intervals. This graphical representation aids in visualizing state
changes over time, particularly useful for Hidden Markov Models (HMM) with varying numbers of hidden states.
In the learning phase of Hidden Markov Models (HMM), the objective is to estimate the transition probability (a_ij) and emission probability (b_jk). To achieve this, a set of known sequences is employed for training.
a3(4) represents the probability of the machine being in state w3 and generating the sequence (V1, V3, V1, V5).
b3(4) signifies the probability of the machine in state w3 generating the next three visible states.
In summary, HMMs provide a comprehensive framework to address evaluation, decoding, and learning problems in
temporal pattern recognition applications. The combination of forward and backward algorithms, along with the Viterbi
algorithm, enables efficient computation and parameter estimation for HMMs.
2. Using the K-Medoids algorithm, cluster the following dataset of 6 objects (shown in the table below) for K = 2.
Solution
3. List and Explain applications of Clustering as well as requirements of Clustering. (OR)
Discuss the applications and requirements of clustering with example. (10M)
Applications of Clustering
Business Intelligence
1. Target Marketing: Marketers use cluster analysis to discover and categorize groups based on purchasing
patterns for better target marketing.
2. Market Segmentation: Clustering helps in dividing the market into segments, aiding in product positioning
and new product development.
Pattern Recognition
1. Grouping Similar Patterns: Clustering methods group similar patterns into clusters, assisting in identifying
patterns with higher similarity within clusters.
Image Processing
1. Segmentation: Clustering is applied in image processing to segment images into areas with similar
attributes, aiding in object identification.
2. Applications: Used in various areas such as analysis of remotely sensed images, traffic system monitoring,
and fingerprint recognition.
Bioinformatics
Taxonomies: Clustering techniques are used to derive plant and animal taxonomies and categorize genes
with similar functionalities.
Biological Systematics: Helps in studying the diversification of living forms and relationships among living
things based on similar characteristics.
Web Technology
Document Classification: Clustering assists in classifying documents on the web for effective information
delivery.
Search Engines
Improving Search Results: Clustering algorithms contribute to the success of search engines like Google by
providing more accurate and faster search results.
Text Mining
High-Quality Information Extraction: Clustering in text mining helps extract high-quality information from text,
including sentiment analysis and document summarization.
Requirements of Clustering
Scalability
Independence of Results: Clustering algorithms should provide similar results regardless of the size of the
dataset, ensuring scalability for large databases.
Handling Various Data Types: Clustering algorithms should be designed to handle numeric as well as other
data types like nominal, binary, and ordinal, as well as complex data types such as graphs, sequences,
images, and documents.
Detecting Non-Spherical Clusters: Clustering algorithms should be capable of detecting clusters with
arbitrary shapes, as real-world data may exhibit diverse and non-spherical cluster shapes.
Parameter Sensitivity: Clustering algorithms should not heavily rely on domain knowledge for input
parameters, as this can affect the quality of clustering and burden the user.
Dealing with Noise: Clustering algorithms should be able to handle noise in real-world data, including
attribute noise introduced by measurement tools and random errors.
Incremental Clustering
Accommodating New Data: Some clustering algorithms should support incremental updates to the database
without the need to recompute the clustering from scratch.
Order Independence: Clustering algorithms should be insensitive to the order in which data objects are
presented to ensure robustness and reliability.
Handling Numerous Dimensions: Clustering algorithms should effectively handle high-dimensional datasets,
providing accurate results for datasets with numerous dimensions or attributes.
Handling Constraints:
Must-Link and Cannot-Link Constraints: Clustering algorithms should handle constraints such as must-link
and cannot-link constraints, ensuring that instances specified in the constraints are appropriately clustered or
not clustered together.
User-Friendly Results: Clustering results should be interpretable and usable for users, tied with specific
semantic interpretations and applications that can make practical use of the information retrieved after
clustering.
4. For the given set of points, identify the clusters using the Agglomerative Clustering algorithm (complete link); use Euclidean distance and draw the final clusters formed. → Repeated
Agglomerative clustering is a bottom-up hierarchical clustering method that starts with individual data points and
gradually merges them into larger clusters. The process continues until all data points belong to a single cluster or a
specified number of clusters is reached. The key idea is to iteratively merge the closest clusters based on a distance
metric until the desired number of clusters is achieved.
Program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# generate synthetic data with three clusters
X, y = make_blobs(n_samples=150, centers=3, random_state=42)

# agglomerative clustering with scikit-learn
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# dendrogram showing the hierarchy of merges
linked = linkage(X, 'ward')
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Hierarchical Clustering Dendrogram")
plt.show()
In this example, we first generate synthetic data with three clusters using the make_blobs function. Then, we perform
agglomerative clustering with the AgglomerativeClustering class from scikit-learn. Finally, we visualize the data, the
clustering result, and the dendrogram, which illustrates the hierarchy of cluster merges.
Flowchart
The agglomerative algorithm is carried out in three steps:
1. Compute the proximity (distance) matrix for all data points.
2. Merge the two closest clusters.
3. Update the proximity matrix and repeat until a single cluster (or the desired number of clusters) remains.
Problem-example
Solution
5. Explain any two types of clustering methods. (10M)
Clustering is a type of unsupervised machine learning technique that involves grouping similar data points or objects into
subsets known as clusters. The primary goal of clustering is to organize and discover inherent patterns, structures, or
relationships within a dataset without using predefined labels or target values.
Hard Clustering: Assigns each data point to exactly one cluster.
Soft Clustering: Assigns a probability or likelihood to each data point for being in multiple clusters.
1. Partitioning Method
Definition: Division of a database into k partitions, where each partition represents a cluster.
Soft Clustering Note: In soft clustering, an object can belong to two clusters.
Algorithmic Steps: Create an initial set of k partitions, then improve the partitioning iteratively by relocating objects from one group to another.
Criteria for Good Partitioning: Objects in the same cluster are close, while those in different clusters are
far.
2. Hierarchical Method
Definition: Alternative to partitioning clustering, does not require pre-specifying the number of clusters.
Approaches:
Agglomerative (bottom-up): Objects start in separate groups, merging close ones iteratively.
Divisive (top-down): All objects start in the same cluster, continuously splitting until termination.
3. Density-Based Methods
Concepts:
Density reachability: A point is density reachable if it is within distance ε (eps) of another point that has sufficiently many neighbors.
Density connectivity: Points p and q are density-connected if there's a point r with sufficient neighbors,
forming a chaining process.
4. Grid-Based Methods
Approach: Concerned with the value space around data points rather than the points themselves.
Steps: Quantize the object space into a finite number of cells to form a grid structure, then perform the clustering operations on the grid cells rather than on individual data points.
6. Illustrate how the K-means clustering method is used to assign the data points to
different clusters. (10M)
K-means clustering is a partitioning method that aims to group data points into distinct clusters based on their similarity.
The algorithm iteratively refines cluster assignments until convergence. Here is an illustration of the steps involved in the
K-means clustering method:
Example
Let's consider a dataset X = {x1, x2, x3, ... xn} and aim to partition it into c clusters. The steps are as follows:
1. Initialization:
Randomly select c cluster centers from the data points. Let V = {v1, v2, ...vc} represent these initial cluster centers.
2. Assignment:
Calculate the distance between each data point xi and each cluster center vj. Assign xi to the cluster with the
minimum distance.
3. Update:
Recalculate each cluster center as the mean of the data points assigned to it:

v_i = (1/c_i) · Σ_{j=1}^{c_i} x_ij

where c_i is the number of data points in the i-th cluster.
4. Repeat:
Repeat steps 2 and 3 until convergence. If no data point is reassigned to a different cluster, stop. Otherwise,
continue.
2. Efficiency:
Relatively efficient with computational complexity O(tknd), where n is the number of data objects, k is the
number of clusters, d is the number of attributes, and t is the number of iterations.
3. Well-Separated Data: Gives the best result when the clusters in the dataset are distinct or well separated.
4. Representation Dependence: K-means is not invariant to nonlinear transformations, so different representations of the data yield different outcomes.
5. Local Optima:
Provides local optima of the squared error function, not guaranteed to find the global optimum.
6. Limited Applicability:
Applicable only when the mean is defined, hence not suitable for categorical data.
Program
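A minimal k-means program for this illustration (synthetic data via make_blobs; the cluster count and seed are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate synthetic data with three clusters
X, y = make_blobs(n_samples=150, centers=3, random_state=42)

# fit k-means and obtain cluster assignments
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# visualize the clusters and their centers
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red')
plt.title("K-Means Clustering")
plt.show()

Here make_blobs generates three synthetic clusters, KMeans assigns each point to its nearest center, and the final cluster centers are overlaid on the scatter plot.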
Problem-example
Solution
8. Using the K-means clustering algorithm, solve the problem for two clusters of the 6 objects shown in the table below; tabulate all the assignments.
Solution
Module 3
1. Illustrate the association rule mining concept with an example. Discuss its pros and
cons. (OR) With the code snippets discuss the ways of applying Association Rules. (OR)
Briefly explain the steps involved in applying association rules. (10M) → Repeated
Association rules are patterns or relationships identified within a dataset that reveal how items are frequently associated
or co-occur. Specifically, association rule mining seeks to discover interesting relationships between variables in large
datasets. The most common application of association rule mining is in the context of transactional data, such as
customer purchase histories.
Example
Code snippets
1. Data Preparation
Collect transactional data: Gather data that represents transactions or baskets of items, such as customer
purchases.
Represent data: Organize the data into a suitable format, where each row corresponds to a transaction, and
items are listed for each transaction.
Example Code:
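A toy example of such a transaction list (item names invented for illustration):

all_txns = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["beer", "diaper"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
]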
Create a matrix where each row represents a transaction, each column represents an item, and the values
are binary indicating whether an item is present in a transaction.
Example Code:
from mlxtend.preprocessing import OnehotTransactions  # newer mlxtend versions expose TransactionEncoder instead

one_hot_encoding = OnehotTransactions()
one_hot_txns = one_hot_encoding.fit(all_txns).transform(all_txns)
Use the Apriori algorithm to find frequent itemsets, which are combinations of items that occur together
frequently in the transactions.
Example Code:
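A sketch using mlxtend's apriori (the support threshold and the DataFrame wrapping, with attribute names as in mlxtend's encoder API, are assumptions):

import pandas as pd
from mlxtend.frequent_patterns import apriori

one_hot_txns_df = pd.DataFrame(one_hot_txns, columns=one_hot_encoding.columns_)
frequent_itemsets = apriori(one_hot_txns_df, min_support=0.01, use_colnames=True)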
Set additional metrics thresholds such as confidence and lift to filter out uninteresting rules.
Example Code:
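A sketch of the rule-filtering step (the confidence and lift thresholds are assumptions):

from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules = rules[rules.lift > 1.0]  # keep only positively associated rules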
Pros
1. Simple and Interpretable: Association rules are straightforward and easy to understand. The "if-then" format makes them interpretable for both technical and non-technical stakeholders.
2. Applicability: Widely used in various industries, such as retail, healthcare, finance, and more, for tasks like
market basket analysis, recommendation systems, and fraud detection.
3. Reveals Hidden Patterns: Helps identify hidden patterns and relationships within large datasets that may
not be immediately apparent.
4. Decision Support: Provides valuable insights for decision-making, allowing businesses to optimize product
placement, create effective marketing strategies, and improve customer experience.
Cons
1. Limited to Binary Data: Association rule mining typically deals with binary data (item present or not), which may oversimplify complex relationships or preferences.
2. Doesn't Consider Sequential Patterns: Traditional association rule mining doesn't consider the order or
sequence of item occurrences. For some applications, the order of items may be crucial.
3. Quality of Rules: The quality of rules depends on the choice of metrics and thresholds, and the results may
vary based on these parameters.
4. Spurious Associations: It may discover associations that are statistically significant but lack meaningful or
logical interpretations, leading to potentially spurious rules.
5. Scalability: For large datasets with many items, the number of possible itemsets can grow exponentially,
making the computation of rules computationally expensive.
2. List and explain the importance of words in a Bag-of-Words (BoW) Model. (OR) Explain the Bag-of-Words (BoW) model with a suitable example. (10M) → Repeated
The Bag-of-Words (BoW) model is a representation technique used in natural language processing and text analytics. It
involves creating a dictionary of all the words present in a corpus and then representing each document as a vector
based on the occurrence of words in the document. There are different ways to identify the importance of words in a
BoW model, and three common vector models are discussed: Count Vector Model, Term Frequency Vector Model, and
Term Frequency-Inverse Document Frequency (TF-IDF) Model.
1. Count Vector Model
In this model, the importance of words is determined by counting their occurrences in each document. The count vector represents the frequency of each word in the document.
2. Term Frequency Vector Model
Term Frequency (TF) is calculated as the frequency of each term in the document divided by the total number of words in the document.
It provides a normalized representation of word frequency, making it suitable for comparing documents of
different lengths.
3. Term Frequency-Inverse Document Frequency (TF-IDF) Model
TF-IDF measures the importance of a word in a document relative to its frequency in the entire corpus.
It increases proportionally with the number of times a word appears in a document but is reduced by the
word's frequency in the corpus.
TF-IDF is effective in identifying words that are both frequent in a document and unique to that document.
The count vectors represent the frequency of each word in the respective documents, providing a numerical
representation for further analysis or machine learning tasks.
The process of creating count vectors involves using a CountVectorizer in Python, as shown in the provided code
snippets. Additionally, the code demonstrates the importance of handling low-frequency words, removing stop words,
and applying stemming to create more meaningful and concise representations of the documents in the BoW model.
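A minimal CountVectorizer sketch on two toy documents (the documents are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["really like IPL", "never like IPL"]
vectorizer = CountVectorizer()
count_vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # corpus vocabulary
print(count_vectors.toarray())             # count vector per document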
Given the IDF values for each term (x1, x2, x3, x4, x5):

Term | IDF
I | 0.693
really | 1.098
never | 1.098
like | 0.693
IPL | 0.693
Assuming the terms (x1, x2, x3, x4, x5) represent words (I, really, never, like, IPL), and 'y' represents the
sentiment:
The values in the table represent the TF-IDF for each term in the respective documents.
Term Frequency Vector Model Example
Consider two documents:
The values in the table represent the Term Frequency (TF) for each term in the respective documents.
3. Write a note on User - Based Similarity Algorithm and Finding the Best Model.
In user-based collaborative filtering, the similarity between users is computed based on their interactions with items
(movies in this case). The algorithm aims to find users with similar preferences and recommend items liked by those
similar users. Here's a step-by-step guide on implementing a user-based similarity algorithm using the MovieLens
dataset:
1. Loading the Dataset

import pandas as pd
2. Calculating Cosine Similarity between Users
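A hedged sketch covering the loading and similarity steps (the file name and column names follow the common MovieLens layout and are assumptions):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("ratings.csv")  # assumed file; columns userId, movieId, rating, timestamp
rating_mat = ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
user_sim_df = pd.DataFrame(cosine_similarity(rating_mat),
                           index=rating_mat.index, columns=rating_mat.index)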
3. Finding Movies Common to Two Users

def get_user_similar_movies(user1, user2):
    # movies rated by both users (assumes the 'ratings' DataFrame loaded above)
    common_movies = ratings[ratings.userId == user1].merge(
        ratings[ratings.userId == user2], on="movieId", how="inner")
    return common_movies

# Example usage
common_movies = get_user_similar_movies(2, 338)
Finding the Best Model
Implement different collaborative filtering algorithms (user-based, item-based, matrix factorization) and compare their performance.
Adjust the algorithm and parameters based on the evaluation results to find the most effective collaborative filtering
model for the MovieLens dataset.
4. Discuss the two variations of collaborative filtering. (10M)
Collaborative filtering is a recommendation technique based on the notion of similarity or distance between users. It
operates on the idea that if two users have similar preferences and have rated common items similarly, their preferences
are likely to be similar in the future. There are two variations of collaborative filtering:
User-Based Similarity
Definition: This variation finds K similar users based on the common items they have bought and rated.
Methodology: The similarity or distance between users is calculated using the ratings they have given to
common items. Common similarity measures include Jaccard coefficient, cosine similarity, Euclidean distance,
and Pearson correlation.
Example: In the provided example, users Rahul, Purvi, and Gaurav are represented in a Euclidean space based
on their ratings for two common books, "Into Thin Air" and "Missoula." The Euclidean distance between users
helps identify the similarity, and recommendations can be made based on similar users.
Item-Based Similarity
Definition: This variation finds K similar items based on common users who have bought those items.
Methodology: Similarity between items is determined by analyzing the preferences of users who have bought
and rated those items. If two items have been bought and rated similarly by many users, they are considered
similar.
Comparison to K-Nearest Neighbors (KNN): Both user-based and item-based collaborative filtering algorithms
share similarities with the K-Nearest Neighbors algorithm discussed in Chapter 6.
The example uses the MovieLens dataset to find similar users based on common movies they have watched
and rated.
The dataset includes user IDs, movie IDs, ratings, and timestamps.
The ratings are loaded into a DataFrame, and a pivot table is created with users as rows, movies as columns,
and ratings as values.
Cosine similarity is then calculated between users using the sklearn library.
The resulting similarity matrix is used to find similar users for each user.
One challenge is the "cold start problem," where new users have no or limited purchase and rating history. User
similarity relies on historical data, making it ineffective for new users until they provide enough data.
Item-Based Similarity is introduced as an alternative to address the cold start problem, as it focuses on
relationships between items rather than users.
5. With an example, explain the methods of Item-Based similarity algorithm in
collaborative filtering. (10M)
Item-Based Similarity in collaborative filtering is a recommendation technique that focuses on finding similarities between
items based on user behavior. It assumes that if users have liked or rated two items similarly, there is a strong
relationship between those items. Here, we'll explain the methods of Item-Based Similarity using an example.
We start by loading the MovieLens dataset, which includes information about users, movies, and their ratings.
import pandas as pd

ratings = pd.read_csv("ratings.csv")  # assumed MovieLens file with userId, movieId, rating columns
We create a matrix where rows represent movies, columns represent users, and the values represent user
ratings.
ratings_mat = ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)
Next, we calculate the similarity between items using a similarity measure such as cosine similarity.
Cosine similarity between two items i and j is calculated as the cosine of the angle between their rating vectors.
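A sketch of that computation, reusing ratings_mat from the pivot above:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_mat)
item_similarity_df = pd.DataFrame(item_sim, index=ratings_mat.index, columns=ratings_mat.index)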
4. Making Recommendations
To recommend movies to a user, we identify movies that are similar to the ones the user has already liked.
We find the top-N similar items and recommend those to the user.
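A hypothetical recommend_movies helper consistent with the usage below (the scoring scheme, similarity weighted by the user's ratings, is an assumption):

def recommend_movies(user_ratings, item_similarity_df, top_n=5):
    # score each movie by its similarity to the user's rated movies, weighted by rating
    scores = item_similarity_df[user_ratings.index].dot(user_ratings)
    scores = scores.drop(user_ratings.index)  # exclude movies already rated
    return scores.sort_values(ascending=False).head(top_n)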
# Example: Recommend movies for a user who liked movieId 1 and movieId 50
user_ratings = pd.Series([5, 4], index=[1, 50])
recommended_movies = recommend_movies(user_ratings, item_similarity_df)
The recommendation system can be evaluated using metrics like precision, recall, or mean squared error.
The algorithm can be iteratively improved by incorporating user feedback and adjusting the similarity
calculation.
In this example, the Item-Based Similarity algorithm uses the patterns of user ratings to identify similar movies and
recommend them to users who have shown interest in certain items. The cosine similarity is just one method, and
other similarity measures can be employed based on the characteristics of the data and the recommendation system.
6. Using the code snippets discuss the challenges of text analytics. (10M) (OR) Using the
code snippets, explain the challenges of text analytics. (10M) → Repeated
Text analytics, also known as text mining or natural language processing (NLP), is the process of extracting valuable
information and insights from unstructured text data. Unstructured text data includes a wide range of sources such as
books, articles, social media posts, reviews, emails, and more. Text analytics involves applying various computational
techniques and algorithms to analyze, interpret, and derive meaningful patterns or knowledge from this unstructured text.
Code Snippet
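A hedged sketch of that loading step (the file name, separator, and column names are assumptions):

import pandas as pd

train_ds = pd.read_csv("sentiment_train.tsv", sep="\t", names=["sentiment", "text"])
print(train_ds.head())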
This code demonstrates the process of loading unstructured text data into a structured pandas DataFrame for
further analysis.
2. Data Pre-processing Challenges
Text data requires extensive pre-processing before applying machine learning algorithms. Algorithms like
regression, classification, or clustering work best when the data is cleaned and prepared. Cleaning text data
involves tasks such as tokenization, stemming, and removing stop words.
Code Snippet
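A plausible exploration snippet (assuming the train_ds DataFrame loaded above):

train_ds.info()                           # column types and non-null counts
print(train_ds.sentiment.value_counts())  # class balance
print(train_ds.isnull().sum())            # missing values per column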
This code snippet emphasizes the importance of exploring the dataset, understanding its structure, and checking
for missing values or inconsistencies as part of the pre-processing.
Code Snippet
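A sketch of that feature-extraction step (the vocabulary cap is an assumption):

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(max_features=1000)
train_ds_features = count_vectorizer.fit_transform(train_ds.text)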
This code snippet introduces the concept of the Bag-of-Words (BoW) model as a way to extract features from text
data.
Code Snippet
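A minimal sketch of that visualization (assuming seaborn is available):

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x="sentiment", data=train_ds)
plt.show()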
This code snippet visually represents the distribution of positive and negative sentiments in the dataset, a crucial
step in sentiment classification.
8. Using a code snippet, explain the steps required to build a Naive-Bayes model for sentiment classification. (10M)
Naive Bayes is a family of probabilistic classification algorithms based on Bayes' theorem with the assumption of
independence among features. The "naive" in Naive Bayes comes from the assumption that the features used to
describe an observation are mutually independent, given the class label.
To build a Naïve-Bayes model for sentiment classification, you can follow the steps outlined in the provided
resource. Below is a code snippet that demonstrates these steps:
from sklearn.model_selection import train_test_split

# Assuming you have a dataset named train_ds with features and sentiment columns
train_X, test_X, train_y, test_y = train_test_split(
    train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)
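A sketch of the remaining steps (BernoulliNB is one plausible choice for binary bag-of-words features; MultinomialNB or GaussianNB are alternatives):

from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

nb_clf = BernoulliNB()
nb_clf.fit(train_X, train_y)
pred_y = nb_clf.predict(test_X)
print(metrics.classification_report(test_y, pred_y))
print(metrics.confusion_matrix(test_y, pred_y))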
This code assumes you have a dataset named train_ds with features and sentiment columns. Make sure to replace
the dataset and column names accordingly. The steps involve splitting the dataset, building the Naïve–Bayes model,
making predictions, and evaluating the model's accuracy using a classification report and confusion matrix
visualization.
Module 4
1. Explain different types of activation functions for processing a node in Neural networks. (7M) (OR) Discuss the different types of activation functions of Neural networks along with their features. (10M) (OR) Explain any two activation functions. (5M) → Repeated
Activation functions play a crucial role in neural networks by introducing non-linearity into the model, enabling it to learn
complex patterns and relationships. In the context of neural networks, activation functions are applied to the output of a
node or neuron. The provided resource describes various types of activation functions, categorized into bipolar and
unipolar activation functions, as well as the identity function and the ramp function.
Bipolar Binary Function
The bipolar binary function introduces a binary decision based on the sign of the net input: if the net input is positive, the neuron output is +1, and if it is negative, the output is −1. This function is suitable for tasks where the network needs to make clear-cut decisions with positive and negative outcomes.

f(net) = { +1, net > 0
           −1, net < 0
Bipolar Continuous Function
The bipolar continuous function provides a smooth transition between positive and negative outputs, controlled by the parameter λ. As λ increases, the function approaches a step-like behavior similar to the bipolar binary function. This smooth transition allows for a more gradual adjustment of weights during training, contributing to smoother convergence.
The term "bipolar" signifies that both positive and negative responses are generated.
Bipolar continuous: f(net) ≜ 2 / (1 + exp(−λ·net)) − 1

Unipolar continuous: f(net) ≜ 1 / (1 + exp(−λ·net))

Unipolar binary: f(net) = { 1, net > 0
                            0, net < 0
Note: The unipolar binary function is the limit of the unipolar continuous function as λ approaches infinity.
Identity Function
The identity function is linear, preserving the input as the output. It is typically used in the input layer to
maintain the original features without introducing non-linearity. This function is useful when the raw input
values are meaningful and should be directly passed to the next layer.
In symbols: f(x) = x for every input x.
Ramp Function
The ramp function is a piecewise linear function that introduces a gradual increase in output as the input grows
beyond 0. It is useful when a smooth transition is needed, particularly in cases where the input can have a wide
range of values. The ramp function helps capture and emphasize the variations in the positive range.
f(x) = { 0, if x < 0
         x, if 0 ≤ x ≤ 1
         1, if x > 1

That is, the ramp takes the value 0 for x < 0, x for 0 ≤ x ≤ 1, and 1 for x > 1.
2. Explain the Learning process involved in the neural network that responds to a
stimulus correctly. (5M)
The learning process in a neural network involves adjusting the values of network parameters to respond correctly to a
stimulus. There are three main categories of learning: supervised learning, unsupervised learning, and reinforcement
learning.
Supervised Learning
An analogy is given using a child learning to sing. Initially, the child doesn't know how to sing and learns by
imitating a singer.
Each input vector is associated with a desired output, forming a training pair.
During training, the input vector is fed into the network, producing an actual output. The error signal is generated
by comparing this output with the desired output.
The network adjusts its weights based on the error signal to make the actual output closer to the desired output.
The process is depicted in a block diagram where the error signal influences the adjustment of network
parameters.
Unsupervised Learning
Unsupervised learning occurs without the guidance of a teacher, akin to a fish learning to swim without explicit
instruction.
Inputs of a similar category are grouped together by the network without predefined training.
The network forms clusters of similar input patterns during the training process.
When a new input is applied, the network classifies it into a particular cluster or forms a new cluster if it doesn't
belong to any existing cluster.
There is no external feedback to validate the correctness of the output; instead, the network discovers patterns
through self-organization.
Reinforcement Learning
Reinforcement learning is similar to supervised learning, but only critic information (an evaluative signal), rather than the exact desired output, is available.
The learning process involves extracting useful information from this critic signal.
An error signal is generated by comparing the actual output with the desired output, similar to supervised
learning.
The reinforcement model is illustrated in a block diagram where both error signal and reinforcement signal
contribute to the learning process.
3. Solve XOR function using McCulloch-Pitts neuron. (8M) (OR) Solve XOR function using McCulloch-Pitts neuron. (OR) Construct XOR function using McCulloch-Pitts neuron (10M) → Repeated
Solution
4. Derive the Backpropagation rule considering the training rule for Output Unit weights
and Training Rule for Hidden Unit weights (8M) (OR) Write and explain the back propagation
algorithm. (10M) → Repeated
Backpropagation rule considering the training rule for Output Unit weights and Training Rule for hidden unit
weights
The backpropagation rule is used to train neural networks by adjusting the weights and biases in order to minimize
the error between the network's output and the expected output. The rule is derived using the chain rule of calculus
and involves two main steps: the training rule for output unit weights and the training rule for hidden unit weights.
Training Rule for Output Unit Weights
Denote the output of hidden unit j as a_j, the target value of output unit k as t_k, and the actual output of the network as a_k. The error E is given by:

E = (1/2) · Σ_k (t_k − a_k)²
The change in weight Δw_kj is given by the negative gradient of the error with respect to the weight, scaled by a learning rate ε:

Δw_kj = −ε · ∂E/∂w_kj
By the chain rule:

∂E/∂w_kj = (∂E/∂a_k) · (∂a_k/∂net_k) · (∂net_k/∂w_kj)
where net_k is the weighted sum of the inputs to the output unit. The derivative of E with respect to a_k is −(t_k − a_k); the derivative of a_k with respect to net_k is a_k(1 − a_k) (assuming a logistic activation function); and the derivative of net_k with respect to w_kj is the hidden-unit output a_j. Substituting these results back into the equation for Δw_kj, we get:

Δw_kj = ε · (t_k − a_k) · a_k · (1 − a_k) · a_j

This is the training rule for the output unit weights.
Training Rule for Hidden Unit Weights

Δw_ji = −ε · ∂E/∂w_ji
where net_j is the weighted sum of the inputs to the hidden unit. The derivative of E with respect to net_k is −(t_k − a_k) · a_k · (1 − a_k); the derivative of net_k with respect to a_j is w_kj; the derivative of a_j with respect to net_j is a_j(1 − a_j); and the derivative of net_j with respect to w_ji is simply the input x_i.
Substituting these results back into the equation for Δw_ji, we get:

Δw_ji = ε · [Σ_k (t_k − a_k) · a_k · (1 − a_k) · w_kj] · a_j · (1 − a_j) · x_i
This is the training rule for the hidden unit weights.
In summary, the backpropagation rule involves computing the gradient of the error with respect to the
weights, and then adjusting the weights in the direction that decreases the error. The training rules for the
output unit weights and the hidden unit weights are derived using the chain rule of calculus.
Backpropagation algorithm
The backpropagation algorithm is a method used in training neural networks. It calculates the gradient of the loss
function with respect to the weights of the network, which is then used to update the weights and minimize the loss.
Here is a simplified version of the backpropagation algorithm:
1. Initialize the weights: Start by initializing the weights to small random values.
2. Feedforward: For each input in the training set, compute the output of the network. This is done by passing the input through each layer of the network and applying the activation function.
3. Compute the output error: For each output unit, calculate the error as the difference between the target output and the actual output of the network.
4. Backpropagate the error: For each hidden unit, compute the error by summing the errors of the output units it is connected to, weighted by the corresponding weights.
5. Update the weights: Adjust the weights in the direction that decreases the error. This is done by subtracting a fraction of the gradient from the current weights. The fraction is determined by the learning rate.
6. Repeat: Repeat steps 2-5 until the stopping condition is met (e.g., the error is below a certain threshold, or a maximum number of iterations has been reached).
This algorithm is typically used in conjunction with an optimization method such as gradient descent or stochastic gradient descent to perform the weight updates. The backpropagation algorithm is efficient and makes it possible to train multi-layer networks.
https://www.youtube.com/watch?v=Ilg3gGewQ5U
https://www.youtube.com/watch?v=URJ9pP1aURo
5. Derive the Gradient Descent Rule and explain the importance of Stochastic Gradient
Descent. (6M) (OR) Derive the Gradient Descent Rule and explain the conditions in which
Gradient Descent is applied. (10M) → Repeated
Derivation of the Gradient Descent Rule
The key idea behind gradient descent is to use the gradient of the error function to guide the search through the
hypothesis space of weight vectors to find the weights that best fit the training data.
We can define the training error E(w) of a hypothesis (weight vector) as:

E(w) = (1/2) · Σ_{d∈D} (t_d − o_d)²
where D is the set of training examples, t_d is the target output for example d, and o_d is the actual output for example d.
To find the direction of steepest descent, we take the derivative of E(w) with respect to the weight vector:
∇E(w) = ∂E/∂w = −Σ_{d∈D} (t_d − o_d) · x_d
The gradient ∇E(w) gives the direction of steepest increase in error; the negative gradient therefore points in the direction of steepest decrease. This leads to the gradient descent update rule:

w ← w + Δw,  where  Δw = −η · ∇E(w) = η · Σ_{d∈D} (t_d − o_d) · x_d
where ηis the learning rate. This rule updates each weight in proportion to the negative of the gradient, gradually
descending along the error surface to find the minimum.
Importance of Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of gradient descent that is used when the dataset is large and it is
computationally expensive to compute the gradient of the function for the entire dataset.
SGD works by randomly sampling a subset of the data and computing the gradient of the function for that subset.
This approximation of the gradient is then used to update the current estimate of the minimum.
SGD is often used in practice because it can be much faster than gradient descent, especially for large datasets.
However, SGD can also be more noisy than gradient descent, and it can sometimes converge to a local minimum
rather than the global minimum.
Gradient descent can be applied whenever (1) the hypothesis space contains continuously parameterized hypotheses (such as weight vectors) and (2) the error can be differentiated with respect to those hypothesis parameters. If these conditions are not met, gradient descent may not converge to the global minimum, or it may converge very slowly.
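A minimal sketch of the stochastic (per-example) update for a linear unit, on synthetic data with a known target function (all values here are invented for illustration):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
t = X @ np.array([2.0, -1.0]) + 0.5      # targets from a known linear function
w, b, eta = np.zeros(2), 0.0, 0.05       # weights, bias, learning rate

for epoch in range(200):
    for x_d, t_d in zip(X, t):           # stochastic: update after every single example
        o_d = w @ x_d + b
        w += eta * (t_d - o_d) * x_d
        b += eta * (t_d - o_d)
print(w, b)  # should approach [2.0, -1.0] and 0.5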
6. Prove the population evolution and the schema theorem in the context of genetic algorithms. (6M)
Population Evolution
The population evolves over generations in a genetic algorithm based on the selection, crossover, and mutation
operators. Specifically:
Selection chooses fitter individuals in the population to pass on to the next generation. This causes the average
fitness of the population to increase over generations.
Crossover combines parts of two parent individuals to produce new offspring. This allows beneficial traits from
different individuals to be combined.
Mutation randomly changes some individuals. This introduces new diversity into the population.
Schema Theorem
The schema theorem characterizes how the number of instances of a schema (pattern) s changes over time:
Let m(s, t) = number of instances of schema s at generation t,
p_c = crossover probability,
p_m = mutation probability.
Further, let f(s,t) be the average fitness of the instances of schema s at generation t, f̄(t) the average fitness of the whole population, d(s) the defining length of s, o(s) the order of s, and l the string length. The schema theorem states:

E[m(s, t+1)] ≥ m(s,t) · (f(s,t) / f̄(t)) · [1 − p_c · d(s)/(l−1)] · (1 − p_m)^o(s)
This shows that schemas with above-average fitness, f(s,t) > f̄(t), tend to increase in the population. Short, low-order schemas with small o(s) and d(s) are less disrupted by crossover and mutation.
So the schema theorem proves fit, short, low-order schemas receive exponentially increasing trials in a genetic
algorithm. This allows efficient parallel search through the space of schemas.
7. Describe the evolution of neural networks. (5M)
The evolution of neural networks can be summarized in the following table:
Year | Evolution/Concept | Scientist(s) | Description
1871–73 | Reticular theory | Joseph von Gerlach | The nervous system is a single continuous network.
1943 | McCulloch–Pitts neuron | McCulloch and Pitts | Simplified computational model of a neuron.
8. Explain the genetic operators (crossover and mutation) used in genetic algorithms. (10M)

1. Crossover
This operator generates two new offspring from two parent strings by copying selected bits from each parent. The bit
at position i in each offspring is copied from the bit at position i in one of the two parents. The choice of which parent
contributes the bit for position i is determined by an additional string called the crossover mask. There are different
types of crossover operations, such as single-point crossover, two-point crossover, and uniform crossover. In single-
point crossover, the crossover mask is always constructed so that it begins with a string containing n contiguous 1s,
followed by the necessary number of 0s to complete the string. This results in offspring in which the first n bits are
contributed by one parent and the remaining bits by the second parent. In two-point crossover, offspring are created
by substituting intermediate segments of one parent into the middle of the second parent string. Uniform crossover
combines bits sampled uniformly from the two parents[1].
2. Mutation
This operator produces small random changes to the bit string by choosing a single bit at random, then changing its
value. Mutation is often performed after crossover has been applied. It helps to maintain diversity within the
population and prevent premature convergence on poor solutions[1].
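A minimal sketch of these operators on bit strings, represented here as Python lists (the function names are illustrative, not from the text):

import random

def crossover(parent1, parent2, mask):
    # Bit i of offspring 1 comes from parent1 where the mask bit is 1, else from parent2;
    # offspring 2 makes the complementary choice.
    child1 = [a if m else b for a, b, m in zip(parent1, parent2, mask)]
    child2 = [b if m else a for a, b, m in zip(parent1, parent2, mask)]
    return child1, child2

def single_point_mask(length, n):
    # n contiguous 1s followed by 0s: the first n bits come from one parent,
    # the remaining bits from the other
    return [1] * n + [0] * (length - n)

def uniform_mask(length):
    # mask bits sampled uniformly at random, mixing bits from both parents
    return [random.randint(0, 1) for _ in range(length)]

def mutate(individual):
    # flip a single randomly chosen bit
    i = random.randrange(len(individual))
    return individual[:i] + [1 - individual[i]] + individual[i + 1:]

For example, crossover([1, 1, 1, 1], [0, 0, 0, 0], single_point_mask(4, 2)) yields ([1, 1, 0, 0], [0, 0, 1, 1]).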
These operators are used to generate new candidate solutions in the population, allowing the genetic algorithm to
explore the solution space. The crossover operator combines the information from two parent solutions to produce new
offspring, while the mutation operator introduces random changes to maintain diversity in the population.
9. Illustrate the genetic programming with suitable example. (10M)
https://www.youtube.com/watch?v=YG589P3LzGw
Genetic programming (GP) is a technique in artificial intelligence that evolves a population of programs, starting from unfit (usually random) programs and moving toward programs fit for a particular task, by applying operations analogous to natural genetic processes. These operations include selection of the fittest programs for reproduction, crossover, replication, and mutation, with fitness usually measured as proficiency at the desired task.
Let's illustrate this with an example of a simple symbolic regression problem. The goal is to evolve a program that can predict the output of a mathematical function, given the input. For simplicity, let's assume the function is y = x² + x + 1, and we want to evolve a program that can predict 'y' given 'x'.
1. Initialization: We start by creating a population of random programs. In this case, a program is a mathematical
expression composed of operators (+, -, *, /) and variables (x). For example, one program might be "x + x", another
might be "x * x", and so on.
2. Fitness Evaluation: Each program in the population is evaluated for its fitness, i.e., how well it solves the problem. In this case, the fitness of a program is determined by how closely its output matches the output of the function y = x² + x + 1 for a range of 'x' values. The closer the match, the higher the fitness.
3. Selection: Programs are selected for reproduction based on their fitness. The higher the fitness, the higher the
chance of being selected. This is analogous to the principle of "survival of the fittest" in natural evolution.
4. Crossover: Two programs are selected and a point is chosen within each program. The parts of the programs after
these points are swapped to create two new programs. For example, if the parents are "x * x" and "x + 1", the
offspring might be "x * 1" and "x + x".
5. Mutation: A program is selected and a point is chosen within the program. The part of the program after this point is
replaced with a randomly generated part. For example, "x * x" might mutate to "x * 1".
6. Replacement: The least fit programs in the population are replaced with the new programs created by crossover
and mutation.
7. Termination: The process is repeated (from step 2) until a program with a satisfactory level of fitness is found, or a
predefined number of generations have been produced.
This example illustrates the basic process of genetic programming. In practice, GP can be used to evolve much more complex programs, and the fitness function, selection method, crossover and mutation operators, and termination condition can all be tailored to the specific problem at hand.
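As an illustrative sketch of how the programs in this example might be represented and scored (the tree encoding, operator set, and fitness definition below are assumptions; division is omitted to sidestep divide-by-zero handling):

import random

TARGET = lambda x: x ** 2 + x + 1                # the function the programs should learn
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def random_program(depth=2):
    # Grow a random expression tree over {+, -, *}, the variable x, and the constant 1
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', 1])
    op = random.choice(list(OPS))
    return (op, random_program(depth - 1), random_program(depth - 1))

def run(program, x):
    # Recursively evaluate the expression tree at a given input x
    if program == 'x':
        return x
    if isinstance(program, int):
        return program
    op, left, right = program
    return OPS[op](run(left, x), run(right, x))

def fitness(program, xs=range(-5, 6)):
    # Higher is better: negative sum of squared errors against the target function
    return -sum((run(program, x) - TARGET(x)) ** 2 for x in xs)

Crossover would then swap randomly chosen subtrees between two such trees, and mutation would replace a randomly chosen subtree with a freshly generated one.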
10. Describe the power of perceptron. (5M) (OR) Explain the concept of Perceptron with a
neat diagram (10M) → Repeated
A perceptron represents a hyperplane decision surface in the n-dimensional space of instances: it outputs 1 for instances on one side of the hyperplane and -1 for instances on the other side.
Perceptrons can represent all primitive boolean functions (AND, OR, NAND, NOR).
Every boolean function can be represented by a network of interconnected units based on these primitives.
The rule converges to a weight vector that correctly classifies all training examples, provided the training
examples are linearly separable and a sufficiently small learning rate is used.
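For instance, AND and OR can each be realized by a single unit; the weight values below are just one valid choice among many:

import numpy as np

def perceptron(x, w, b):
    # Output 1 on the positive side of the hyperplane w.x + b = 0, and -1 otherwise
    return 1 if np.dot(w, x) + b > 0 else -1

w_and, b_and = np.array([1.0, 1.0]), -1.5  # fires only for input (1, 1)
w_or, b_or = np.array([1.0, 1.0]), -0.5    # fires for every input except (0, 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, 'AND:', perceptron(np.array(x), w_and, b_and),
          'OR:', perceptron(np.array(x), w_or, b_or))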
Module 5
1. Prove the K-nearest neighbor algorithm for approximating a discrete - valued function
with pseudocode. (10M) (OR) Explain K nearest neighbor algorithm in detail. (10M)
https://youtu.be/wTF6vzS9fy4
The k-Nearest Neighbor (k-NN) algorithm is a type of instance-based learning method. It assumes that all instances
correspond to points in an n-dimensional space. The nearest neighbors of an instance are defined in terms of the
standard Euclidean distance. The algorithm can be used to approximate both discrete-valued and real-valued target functions.
Text book
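For a discrete-valued target function, training simply stores the examples, and the prediction for a query point xq is the most common value among its k nearest neighbors: f̂(xq) = argmax over v of the number of neighbors xi with f(xi) = v. A minimal sketch (function and variable names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    # Standard Euclidean distance from the query to every stored training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest training instances
    nearest = np.argsort(dists)[:k]
    # Discrete-valued target: return the majority class among the neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]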
2. Suppose hypothesis h commits r = 10 errors over a sample of n = 65 independently
drawn examples, then solve the following
(i) What is the variance and standard deviation for number of true error rate errorD(h)?
(ii) What is the 90% confidence interval (two-sided) for the true error rate?
(iii) What is the 95% one-sided interval (i.e., what is the upper bound U such that errorD(h)
≤ U with 95% confidence)?
(i) The variance and standard deviation for the number of errors

The number of errors follows a binomial distribution, whose variance is given by Var(X) = np(1 − p), where n is the number of trials (here, the number of examples, 65) and p is the probability of error (here, the sample error rate, 10/65 ≈ 0.154).

So the variance is Var(X) = 65 × (10/65) × (1 − 10/65) ≈ 8.462.

The standard deviation is the square root of the variance, so SD(X) = √8.462 ≈ 2.909. (The corresponding standard deviation of the error rate itself is √(p(1 − p)/n) ≈ 0.045.)

(ii) The 90% confidence interval (two-sided) for the true error rate

The confidence interval for a binomial proportion is given by p ± z·√(p(1 − p)/n), where z is the z-score for the desired confidence level (for a two-sided 90% confidence level, z = 1.645).

So the 90% confidence interval is 10/65 ± 1.645 × √((10/65)(1 − 10/65)/65) ≈ 0.154 ± 0.074 = [0.080, 0.227].

(iii) The 95% one-sided interval (the upper bound U such that errorD(h) ≤ U with 95% confidence)

For a one-sided bound at the 95% confidence level, we use the one-tailed z-score, which is 1.645 (the same z as the two-sided 90% interval, since the one-sided 95% bound coincides with its upper limit).

So the upper bound is U = 10/65 + 1.645 × √((10/65)(1 − 10/65)/65) ≈ 0.227.
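These quantities can be double-checked with a few lines of Python:

import math

r, n, z = 10, 65, 1.645          # errors, sample size, z for 90% two-sided (= 95% one-sided)
p = r / n                        # sample error rate, about 0.154
var_count = n * p * (1 - p)      # variance of the number of errors, about 8.462
se = math.sqrt(p * (1 - p) / n)  # standard error of the error rate, about 0.045
print('sd of error count =', round(math.sqrt(var_count), 3))               # ~2.909
print('90% CI = [', round(p - z * se, 3), ',', round(p + z * se, 3), ']')  # ~[0.080, 0.227]
print('95% one-sided upper bound =', round(p + z * se, 3))                 # ~0.227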
3. What is reinforcement learning and develop reinforcement learning problem with neat
diagram. (10M) (OR) Describe reinforcement learning. Discuss how it differs from other
function approximation tasks. (10M)
Reinforcement learning is a type of machine learning that addresses how an autonomous agent that senses and acts in
its environment can learn to choose optimal actions to achieve its goals. This problem covers a wide range of tasks such
as learning to control a mobile robot, optimizing operations in factories, and learning to play board games. The agent
performs actions in its environment, and a trainer may provide a reward or penalty to indicate the desirability of the
resulting state. The agent's task is to learn from this indirect, delayed reward, to choose sequences of actions that
produce the greatest cumulative reward[1].
A reinforcement learning problem involves an agent, an environment, and a goal. The agent has a set of sensors to
observe the state of its environment and a set of actions it can perform to alter this state. The agent's task is to learn a
control strategy, or policy, for choosing actions that achieve its goals. The goals of the agent can be defined by a reward
function that assigns a numerical value—an immediate payoff—to each distinct action the agent may take from each
distinct state. The agent's task is to perform sequences of actions, observe their consequences, and learn a control
policy that maximizes the reward accumulated over time by the agent[1].
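The interaction just described can be summarized as a loop. A schematic sketch (env and agent are illustrative stand-ins for the environment and the learner, not an API from the text):

def run_episode(env, agent, max_steps=100):
    s = env.reset()                    # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        a = agent.choose_action(s)     # policy: state -> action
        s_next, r, done = env.step(a)  # act, receive reward, observe the new state
        agent.learn(s, a, r, s_next)   # update from the indirect, delayed reward
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward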
Reinforcement learning differs from other function approximation tasks in several important respects:
1. Delayed reward
In reinforcement learning, training information is not available in the form of pairs of current state and optimal action.
Instead, the trainer provides only a sequence of immediate reward values as the agent executes its sequence of
actions. The agent, therefore, faces the problem of temporal credit assignment: determining which of the actions in
its sequence are to be credited with producing the eventual rewards[1].
2. Exploration
In reinforcement learning, the agent influences the distribution of training examples by the action sequence it
chooses. This raises the question of which experimentation strategy produces the most effective learning. The
learner faces a tradeoff in choosing whether to favor exploration of unknown states and actions (to gather new
information), or exploitation of states and actions that it has already learned will yield high reward (to maximize its
cumulative reward)[1].
3. Partially observable states
In many practical situations, sensors provide only partial information. For example, a robot with a forward-pointing
camera cannot see what is behind it. In such cases, it may be necessary for the agent to consider its previous
observations together with its current sensor data when choosing actions, and the best policy may be one that
chooses actions specifically to improve the observability of the environment[1].
4. Life-long learning
Unlike isolated function approximation tasks, reinforcement learning often requires that the agent learn several
related tasks within the same environment, using the same sensors. This setting raises the possibility of using
previously obtained experience or knowledge to reduce sample complexity when learning new tasks.
4. Interpret the Q function and Solve Q Learning Algorithm assuming deterministic
rewards and actions with an example. (10M) (OR) Discuss Q learning concept and write its
algorithm. (10M) (OR) With an illustrative example explain Q learning method. (10M)
https://www.youtube.com/watch?v=J3qX50yyiU0
Q learning is a model-free reinforcement learning algorithm used to find the optimal policy for a given environment. It
does this by learning an action-value function, which gives the expected utility of taking a given action in a given state
and following the optimal policy thereafter. The Q function, denoted as $$ Q(s, a) $$, represents the maximum
discounted cumulative reward that can be achieved starting from state $$ s $$, taking action $$ a $$, and thereafter
following the optimal policy[1].
Q Learning Algorithm
The Q learning algorithm involves the following steps:
1. Initialize the Q-values Q(s, a) arbitrarily for all state-action pairs.
2. Observe the current state s.
3. Select an action a using a policy derived from Q (e.g., ϵ-greedy).
4. Execute the action a, receive the immediate reward r, and observe the new state s′.
5. Update the Q-value for the state-action pair (s, a) using the formula: Q(s, a) ← r + γ·max_{a′} Q(s′, a′), where γ is the discount factor and max_{a′} Q(s′, a′) is the estimated optimal future value.
6. Set s ← s' and repeat the process until a termination condition is met (e.g., a certain number of episodes or
convergence of Q-values).
The algorithm assumes deterministic rewards and actions, meaning the outcome of each action is predictable and the
reward is consistent.
Illustrative Example
Imagine a simple grid world where an agent can move up, down, left, or right. The goal is to reach a specific location
on the grid. The agent receives a reward of zero for each move and a positive reward when it reaches the goal. The
Q learning algorithm would proceed as follows:
1. The Q-values for all state-action pairs are initialized (for example, to zero).
2. From its current state, the agent selects an action to execute (initially at random, since all Q-values are equal).
3. After taking the action, the agent observes the reward and the new state.
4. The agent updates the Q-value for the state-action pair based on the observed reward and the maximum Q-value
of the new state.
5. This process continues until the agent has sufficiently learned the Q-values to navigate optimally to the goal.
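A minimal sketch of this example (the grid size, goal reward, and pure-random exploration below are illustrative simplifications):

import random

ROWS, COLS, GOAL = 3, 3, (2, 2)  # states are (row, col) cells; the goal is one corner
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
GAMMA = 0.9

def step(s, a):
    # Deterministic transition: move one cell, staying inside the grid
    r = min(max(s[0] + ACTIONS[a][0], 0), ROWS - 1)
    c = min(max(s[1] + ACTIONS[a][1], 0), COLS - 1)
    s_next = (r, c)
    reward = 100 if s_next == GOAL else 0  # zero reward except at the goal
    return s_next, reward

states = [(r, c) for r in range(ROWS) for c in range(COLS)]
Q = {(s, a): 0.0 for s in states for a in ACTIONS}

for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        a = random.choice(list(ACTIONS))  # explore at random for simplicity
        s_next, reward = step(s, a)
        # Deterministic Q-learning update: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = reward + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next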
As the agent explores the environment, the Q-values are updated, and the agent learns to predict the value of each
action in each state. Over time, the Q-values converge to the optimal values, which represent the best possible action
the agent can take in each state to maximize its cumulative reward[1].
In summary, Q learning is a powerful algorithm for learning optimal policies in environments with deterministic actions
and rewards. It does not require a model of the environment and can be used in a wide range of applications, from
game playing to robotics.
5. Illustrate how the estimating accuracy is useful in evaluating a learned hypothesis
(10M)
Estimating the accuracy of a learned hypothesis is crucial for several reasons:
1. Decision Making: It helps in deciding whether the hypothesis is reliable enough to be used in practice. For example,
in medical treatment effectiveness studies, accurate estimation of hypothesis accuracy is vital for making informed
decisions[1].
2. Learning Process: It is an integral part of many learning methods. For instance, when post-pruning decision trees to
prevent overfitting, the accuracy of the pruned versus unpruned tree must be evaluated[1].
However, estimating accuracy from a limited sample of data raises two key difficulties:
Bias: The accuracy observed over the training examples may not be a good estimator for future examples since
the hypothesis was derived from these examples, leading to an optimistically biased estimate[1].
Variance: Even with an unbiased test set, the measured accuracy can vary from the true accuracy depending on
the makeup of the test examples. The smaller the test set, the greater the expected variance[1].
Importance of Accuracy Estimation
Understanding the accuracy of a hypothesis allows us to:
Determine the probable error in this accuracy estimate, which is essential for setting realistic expectations and
understanding the confidence in the hypothesis[1].
Two measures of error are distinguished in this evaluation:
Sample Error: The fraction of a data sample that the hypothesis misclassifies.
True Error: The probability that the hypothesis will misclassify a randomly drawn instance from the entire unknown distribution[1].
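Of the two, only the sample error is directly computable from data; a one-line sketch (h is any hypothesis implemented as a function from instances to labels):

def sample_error(h, X, y):
    # fraction of the sample that hypothesis h misclassifies
    return sum(h(x) != t for x, t in zip(X, y)) / len(y)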
In summary, estimating the accuracy of a learned hypothesis is essential for making informed decisions, guiding the
learning process, and setting realistic expectations about the performance of the hypothesis on future data.
Understanding the potential bias and variance in the estimate, as well as using statistical methods to calculate
confidence intervals, are key components of this evaluation process.
6. Write a note on:
i) Mean and Variance.
ii) Estimators, Bias and Variance. (10M)
i) Mean and Variance
The mean is a measure of central tendency, representing the average value of a set of data. It is calculated by
summing all the values in the dataset and dividing by the number of values. The variance, on the other hand, is a
measure of dispersion, indicating how much the values in the dataset deviate from the mean. It is calculated by
taking the average of the squared differences from the mean. A high variance indicates that the data points are
spread out from the mean and from each other, while a low variance indicates that the data points tend to be close to
the mean and to each other.
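A quick numerical illustration with made-up data:

import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = data.mean()                      # (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 5.0
variance = ((data - mean) ** 2).mean()  # average squared deviation from the mean = 4.0
std = np.sqrt(variance)                 # 2.0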
ii) Estimators, Bias and Variance
Bias is the difference between the expected (or average) prediction of our model and the correct value which we
are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the
model, leading to high error on both training and test data.
Variance, on the other hand, is the variability of model prediction for a given data point. A model with high
variance pays a lot of attention to training data and does not generalize well on the data it hasn't seen before. As
a result, such models perform very well on training data but have high error rates on test data.
The bias-variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-bias, low-variance models underfit: they have a consistently high error rate on both the training set and the test set. Low-bias, high-variance models overfit: they achieve a low error rate on the training set, but their error rate increases sharply on the test set.
7. In the context to Machine learning explain Bias - Variance Trade-off with example. (10M)
The Bias-Variance Tradeoff is a fundamental concept in machine learning that describes the relationship between a
model's complexity, the accuracy of its predictions, and its ability to generalize to unseen data[2].
Bias refers to the difference between the average prediction of a model and the correct value it is trying to predict. A
model with high bias pays little attention to the training data and oversimplifies the model, leading to errors due to
incorrect assumptions in the learning algorithm. This is known as underfitting[3].
Variance, on the other hand, refers to the degree to which the estimate of the target function varies when using different
training sets. High variance can result from an algorithm modeling the random noise in the training data, leading to errors
due to sensitivity to small fluctuations in the training set. This is known as overfitting[1][2].
The Bias-Variance Tradeoff refers to the property of models where an increase in one component (bias or variance)
tends to result in a decrease in the other. In other words, lowering a model’s bias leads to an increase in its variance and
vice versa. This relationship is due to the complexity of the model: a more complex model will have low bias and high
variance, while a less complex model will have high bias and low variance[1][3].
For example, consider a model trying to predict house prices based on various features. A high bias model might
oversimplify the problem and only consider the number of rooms in the house, leading to consistent but inaccurate
predictions. On the other hand, a high variance model might consider too many features, including irrelevant ones such
as the color of the house, leading to very accurate predictions on the training data but poor performance on unseen data.
The goal in machine learning is to find a good balance between bias and variance, minimizing the total error. An optimal
balance of bias and variance would neither overfit nor underfit the model[3]. This balance can be adjusted in specific
algorithms by modifying parameters.
8. Explain Locally Weighted Linear Regression (LWLR).
In LWLR, given a new query instance, an approximation is constructed that fits the training examples in the
neighborhood surrounding the query instance. This approximation is then used to calculate the estimated target
value for the query instance. A different local approximation will be calculated for each distinct query instance.
The target function is approximated near the query point using a linear function. The coefficients of this linear
function are found using methods such as gradient descent to minimize the error in fitting the function to a given set
of training examples. The error criterion is redefined to emphasize fitting the local training examples. Three possible
criteria are: minimizing the squared error over just the nearest neighbors, minimizing the squared error over the
entire set of training examples while weighting the error of each training example by some decreasing function of its
distance from the query point, or a combination of the two. The contribution of each instance to the weight update is
multiplied by the distance penalty, and the error is summed over only the nearest training examples.
The literature on LWLR contains a broad range of alternative methods for distance weighting the training examples,
and a range of methods for locally approximating the target function. In most cases, the target function is
approximated by a constant, linear, or quadratic function. More complex functional forms are not often found
because the cost of fitting more complex functions for each query instance is prohibitively high, and these simple
approximations model the target function quite well over a sufficiently small subregion of the instance space.
Locally Weighted Linear Regression (LWLR) is a non-parametric regression algorithm that estimates the relationship
between a dependent variable and one or more independent variables. It is a type of kernel regression that uses a
weighted linear regression model to predict the value of the dependent variable at a given point.
The weights in LWLR are determined by a kernel function, which assigns higher weights to data points that are closer to
the point being predicted. This allows LWLR to capture local patterns and relationships in the data, which can be useful
when the relationship between the variables is non-linear or changes over time.
LWLR is a relatively simple algorithm to implement and can be used to solve a variety of regression problems. However,
it can be computationally expensive when the number of data points is large.
The basic steps of the algorithm are:
1. Choose a kernel function. Common choices include the Gaussian kernel and the Epanechnikov kernel.
2. Determine the bandwidth of the kernel. The bandwidth controls the size of the neighborhood around each data point
that is used to fit the linear regression model.
3. For each data point, fit a linear regression model to the data points within the neighborhood.
4. Use the linear regression model to predict the value of the dependent variable at the given point.
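A compact sketch of these four steps, using a Gaussian kernel and a closed-form weighted least-squares fit (the names and the bandwidth parameter tau are illustrative):

import numpy as np

def lwlr_predict(x_query, X, y, tau=1.0):
    # Steps 1-2: Gaussian kernel weights; tau is the bandwidth controlling the neighborhood
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    Xb = np.hstack([np.ones((len(X), 1)), X])  # add an intercept column
    # Step 3: weighted least squares, theta = (X^T W X)^+ X^T W y
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    # Step 4: predict at the query point with the locally fitted linear model
    return np.concatenate(([1.0], np.atleast_1d(x_query))) @ theta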
LWLR can be applied to regression problems such as:
Predicting the price of a house based on its square footage and location.
Forecasting the sales of a product based on its price and advertising budget.
Estimating the risk of a loan applicant based on their credit score and debt-to-income ratio.
https://www.youtube.com/watch?v=38kNPkeGoR4 https://www.youtube.com/watch?v=to_LPkV1bnI