
Take it easy

Created @July 22, 2023 11:09 AM

Status Open

Last Read @January 29, 2024 6:59 PM

Summary: Module 1 to 5

Model paper

https://prod-files-secure.s3.us-west-2.amazonaws.com/872d74be-2f52-4a00-ad56-893819e7a4ea/be309919-b2d0-4987-b271-5dde2d5b25ff/18AI72.pdf

Qp - 1

https://prod-files-secure.s3.us-west-2.amazonaws.com/872d74be-2f52-4a00-ad56-893819e7a4ea/241296b2-82fd-4daf-ac16-49aa1d043479/Copy_of_PYQ_1.pdf

Qp - 2

https://prod-files-secure.s3.us-west-2.amazonaws.com/872d74be-2f52-4a00-ad56-893819e7a4ea/aa273b6a-f410-4f28-b322-6383884eec76/Copy_of_PYQ_2.pdf

Module 1
1. Explain gradient descent algorithm. (OR) Explain the steps required for implementing
Gradient Descent Algorithm. (10M) → Repeated (OR) Using code snippets, explain the
Gradient Descent Algorithm through utility methods. (10M)
Gradient descent is an optimization algorithm commonly used to train machine learning models and neural
networks. Training data helps these models learn over time, and the cost function within gradient descent
acts as a barometer, gauging the model's accuracy with each iteration of parameter updates. Until the cost function is close
or equal to zero, the model continues to adjust its parameters to yield the smallest possible error.

There are three types of gradient descent algorithms:

i) Batch gradient descent

ii) Stochastic gradient descent

iii) Mini-batch gradient descent

The utility methods for applying Gradient Descent Algorithm:


Method 1: Random Initialization of the Bias and Weights

import random
import numpy as np

def initialize(dim):
    np.random.seed(seed=42)
    random.seed(42)
    b = random.random()
    w = np.random.rand(dim)
    return b, w

This method initializes the bias ( b ) and weights ( w ) randomly. The dim parameter is the number of weights to be
initialized (besides the bias). The seed is set for reproducibility, and it returns the initialized bias and weights.

Method 2: Predict Y Values from the Bias and Weights

def predict_Y(b, w, X):
    return b + np.matmul(X, w)

This method calculates the predicted values of Y ( Y_hat ) given the bias ( b ), weights ( w ), and input matrix ( X )
using matrix multiplication.

Method 3: Calculate the Cost Function — MSE

def get_cost(Y, Y_hat):
    Y_resid = Y - Y_hat
    return np.sum(np.matmul(Y_resid.T, Y_resid)) / len(Y_resid)

This method computes the Mean Squared Error (MSE) as the cost function. It calculates the residuals, squares
them, sums over all records, and divides by the number of observations.

Method 4: Update the Bias and Weights

def update_beta(x, y, y_hat, b_0, w_0, learning_rate):
    db = (np.sum(y_hat - y) * 2) / len(y)
    dw = (np.dot((y_hat - y), x) * 2) / len(y)
    b_1 = b_0 - learning_rate * db
    w_1 = w_0 - learning_rate * dw
    return b_1, w_1

This method updates the bias ( b ) and weights ( w ) based on the gradients of the cost function with respect to the
bias and weights.

Method 5: Finding the Optimal Bias and Weights

import pandas as pd

def run_gradient_descent(X, Y, alpha=0.01, num_iterations=100):
    b, w = initialize(X.shape[1])
    gd_iterations_df = pd.DataFrame(columns=['iteration', 'cost'])
    result_idx = 0
    for each_iter in range(num_iterations):
        Y_hat = predict_Y(b, w, X)
        this_cost = get_cost(Y, Y_hat)
        prev_b = b
        prev_w = w
        b, w = update_beta(X, Y, Y_hat, prev_b, prev_w, alpha)
        if each_iter % 10 == 0:
            gd_iterations_df.loc[result_idx] = [each_iter, this_cost]
            result_idx += 1
    return gd_iterations_df, b, w

This method runs the gradient descent algorithm, updating bias and weights iteratively. It also tracks the cost at
every 10 iterations.

These utility methods collectively implement the Gradient Descent Algorithm for Linear Regression, including
initialization, prediction, cost calculation, and parameter updates. The last method, run_gradient_descent , orchestrates
the entire process for a specified number of iterations.
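As a quick check that the pieces fit together, here is a minimal usage sketch on synthetic data (the data values and hyperparameters are illustrative, not from the notes):

np.random.seed(42)
X = np.random.rand(200, 2)
Y = 10 + X @ np.array([3.0, 5.0]) + np.random.randn(200) * 0.1

# Run 500 iterations of gradient descent with learning rate 0.1
gd_iterations_df, b, w = run_gradient_descent(X, Y, alpha=0.1, num_iterations=500)
print("Final bias:", b)
print("Final weights:", w)
print(gd_iterations_df.tail())  # the tracked cost should decrease toward zero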
2. Discuss the steps for building a machine learning model.
Steps for Building Machine Learning Models are:

1. Identify Features and Outcome Variable

Identify the features (independent variables) and outcome variable (dependent variable) in the dataset.

2. Split the Dataset into Training and Test Sets

Use the train_test_split function from the sklearn.model_selection module to split the dataset into training and
test sets.

Example code:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    sales_df[["TV", "Radio", "Newspaper"]],
    sales_df.Sales, test_size=0.3,
    random_state=42
)

3. Build the Model Using the Training Set

Choose a machine learning model (e.g., Linear Regression).

Initialize the model and fit it to the training data using the fit method.

Example code:

from sklearn.linear_model import LinearRegression


linreg = LinearRegression()
linreg.fit(X_train, y_train)

4. Predict Outcome Variable Using the Test Set

Use the predict method to make predictions on the test set.

Example code:

y_pred = linreg.predict(X_test)

5. Compare Predicted and Actual Values, and Measure Accuracy

Create a DataFrame to store actual, predicted, and residual values.

Use metrics such as Mean Absolute Percentage Error (MAPE) or Root Mean Square Error (RMSE) to measure
accuracy.

Example code (for RMSE):

import numpy as np
from sklearn import metrics

mse = metrics.mean_squared_error(y_test, y_pred)
rmse = round(np.sqrt(mse), 2)
print("RMSE:", rmse)

6. Additional Metrics (Optional)

Calculate R-squared value to understand the amount of variance explained by the model.

Example Linear Regression Model

Initialize model, fit to training data, and obtain model parameters.

linreg = LinearRegression()
linreg.fit(X_train, y_train)
print("Intercept:", linreg.intercept_)
print("Coefficients:", linreg.coef_)

Make predictions on the test set.

y_pred = linreg.predict(X_test)

Measure accuracy using RMSE and R-squared.

rmse = round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 2)
# Note: R-squared here is computed on the training set (variance explained in training)
r2 = metrics.r2_score(y_train, linreg.predict(X_train))
print("RMSE:", rmse)
print("R Squared:", r2)

These steps provide a structured approach to building, validating, and measuring the accuracy of a machine learning
model.
3. Contrast the features of Receiver Operating Curve (ROC) and Area Under ROC (AUC)
score in Logistic regression model with code snippets.
Receiver Operating Characteristic Curve (ROC) and Area Under ROC (AUC) Score are widely used metrics for
evaluating the performance of classification models, including Logistic Regression. Let's contrast the features of ROC
and AUC with code snippets.

Receiver Operating Characteristic Curve (ROC)


1. Description:

The ROC curve is a graphical representation of the trade-off between true positive rate (sensitivity) and false
positive rate (1-specificity) at various thresholds.

It helps visualize the model's performance across different decision thresholds.

2. Implementation:

The ROC curve is generated using the draw_roc_curve() method, which utilizes the roc_curve() function from
sklearn.metrics .

import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

fpr, tpr, thresholds = metrics.roc_curve(test_results_df.actual,
                                         test_results_df.chd_1,
                                         drop_intermediate=False)

Area Under ROC (AUC) Score


1. Description:

AUC is a single numeric value representing the area under the ROC curve.

It quantifies the overall performance of the model, with higher AUC values indicating better discrimination
between positive and negative classes.

2. Implementation:

AUC is calculated using the roc_auc_score() method from sklearn.metrics .

auc_score = metrics.roc_auc_score(test_results_df.actual, test_results_df.chd_1)
round(float(auc_score), 2)

Plotting ROC Curve


Description:

The ROC curve is plotted using the draw_roc_curve() method, which uses the plt.plot() function from
matplotlib to visualize the FPR and TPR.

plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

In summary, the ROC curve provides a visual representation of the model's performance at different thresholds, and
the AUC score condenses this information into a single metric. The provided code snippets showcase how to
calculate these metrics and plot the ROC curve for a Logistic Regression model.
4. In context to ARIMA Model, explain the following
i) Dicky-Fuller Test. ii) Forecast and Measure Accuracy.
Dickey-Fuller Test
1. Purpose: The Dickey-Fuller test is employed to assess the stationarity of a time series, a crucial assumption for
ARIMA modeling.

2. Test Hypotheses:

Null Hypothesis (H0): The time series is non-stationary (b = 1).


Alternative Hypothesis (HA): The time series is stationary (b < 1).
3. Test Implementation:

The adfuller function from statsmodels.tsa.stattools is used to conduct the Dickey-Fuller test.

The test returns a test statistic and a p-value.

4. Interpretation:

If the p-value is less than 0.05 (common significance level), the null hypothesis is rejected, indicating that the
time series is stationary.
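A minimal sketch of running the test (assuming a pandas Series named demand; the variable name is illustrative):

from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test on the series
adf_result = adfuller(demand)
print("Test statistic:", adf_result[0])
print("p-value:", adf_result[1])
# p-value < 0.05 -> reject H0 -> treat the series as stationary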

Forecast and Measure Accuracy


1. Forecasting:

Forecasting future values is performed using an ARIMA model, presumably with lag 1.

The forecast is obtained for specific time points, in this case, months 31 to 37.

2. Forecasted Values (Example):

The forecasted demand values for months 31 to 37 are given as an array: [480.15, 497.71, 506.01, 509.93, 511.78, 512.66, 513.07].

3. Measure Accuracy (MAPE):

Mean Absolute Percentage Error (MAPE) is utilized to evaluate the accuracy of the forecast.

The MAPE formula calculates the average percentage error between actual and forecasted values.

The reported MAPE value of 19.12% suggests the average percentage difference between actual and
forecasted values for the given ARIMA model with lag 1.
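A sketch of how such a forecast might be produced with statsmodels (the series name vimana_df.demand, the training window, and the lag-1 order are assumptions based on the surrounding text):

from statsmodels.tsa.arima.model import ARIMA

# Fit an ARIMA model with AR lag 1 on the first 30 months of demand
arima_model = ARIMA(vimana_df.demand[0:30], order=(1, 0, 0)).fit()

# Forecast months 31 to 37 (index positions 30 to 36)
forecast_31_37 = arima_model.predict(start=30, end=36)
print(forecast_31_37)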

In summary, the Dickey-Fuller test is employed to check stationarity, and forecast accuracy is assessed using MAPE
after forecasting with an ARIMA model.

5. Explain the components of time series data. (OR) In context to forecasting, describe the
components of time-series data and explain the concept of decomposing time series. (10M)
→ Repeated
Components of Time-Series Data
Trend Component (Tt)

Definition: The trend component represents the consistent long-term upward or downward movement of the
data.

Example: In the context of demand for a product, if the demand shows a steady increase over several years,
it indicates a positive trend.

Seasonal Component (St)

Definition: The seasonal component represents repetitive upward or downward movements (fluctuations)
from the trend that occur within a calendar year at fixed intervals.

Example: Seasonal fluctuations can be observed in demand for certain products during specific times of the
year, such as increased demand for winter clothing in the winter season.

Cyclical Component (Ct)

Definition: The cyclical component is the fluctuation around the trend line at random intervals, driven by
macro-economic changes such as recession, unemployment, etc. Cyclical fluctuations have repetitive
patterns with a time between repetitions of more than a year.

Example: A cyclical downturn in demand for luxury goods during an economic recession.

Irregular Component (It)

Definition: The irregular component represents random, uncorrelated changes or white noise in the data. It
follows a normal distribution with a mean value of 0 and constant variance.

Example: Unpredictable spikes or drops in demand that cannot be attributed to trend, seasonality, or cyclical
patterns.
Decomposing Time Series
The process of decomposing a time series involves separating it into its individual components, namely trend,
seasonal, cyclical, and irregular components. The goal is to analyze and understand each component separately,
which can aid in making more accurate forecasts. The decomposition is typically done using mathematical techniques
or statistical methods.
The decomposition equation is often expressed as:

$Y_t = T_t + S_t + C_t + I_t$

Trend Component (Tt): Identifying and estimating the trend helps understand the overall direction of the data
over the long term.

Seasonal Component (St): Isolating the seasonal component allows for the identification of repetitive patterns
within a calendar year.

Cyclical Component (Ct): Analyzing the cyclical component helps in recognizing broader economic trends and
fluctuations.

Irregular Component (It): Examining the irregular component helps identify random variations in the data that
are not explained by trend, seasonality, or cyclical patterns.

Forecasting methods such as moving average, exponential smoothing, and ARIMA leverage the understanding of
these components to make predictions about future values. It is essential to choose an appropriate forecasting
technique based on the characteristics of the time series data and the nature of its components.
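A minimal decomposition sketch using statsmodels (the series name and seasonal period are assumptions):

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumes a monthly series; period=12 captures within-year seasonality
decomposition = seasonal_decompose(demand_series, model='additive', period=12)
decomposition.plot()
plt.show()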

6. Explain the moving average technique to forecast the future value of time series data.
(OR) Explain the Moving Average Model and the method to calculate forecast accuracy.
(10M) → Repeated

Moving Average (MA) processes are regression models that utilize past residuals to predict future values within a time-
series dataset. Expressing a moving average process with a lag of 1 as:

$Y_{t+1} = a_1 e_t + e_{t+1}$

The generalization of this model to q lags involves determining the appropriate value of q, or the number of lags, based
on specific criteria outlined by Yaffee and McGee (2000):

1. The auto-correlation values are significant for the initial q lags and then decline to zero.

2. The partial auto-correlation function (PACF) exhibits an exponential decrease.

Forecast accuracy measures help evaluate the performance of forecasting models such as Moving Average (MA)
models. Here are some common methods for calculating forecast accuracy:

1. Mean Absolute Percentage Error (MAPE)


Formula: $\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\text{Actual}_i - \text{Forecast}_i|}{|\text{Actual}_i|} \times 100$
Description: MAPE expresses forecast accuracy as a percentage of the absolute difference between actual and
forecasted values relative to the actual values. A lower MAPE indicates better accuracy.
In the given code snippet:

forecast_31_37 = ma_model.predict(30, 36)
get_mape(vimana_df.demand[30:], forecast_31_37)

The MAPE for the MA model with lag 1 is calculated as 17.8%.
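The get_mape helper itself is not reproduced in the notes; a minimal sketch consistent with the MAPE formula above might be:

import numpy as np

def get_mape(actual, predicted):
    # Mean absolute percentage error, returned as a percentage
    actual, predicted = np.array(actual), np.array(predicted)
    return round(np.mean(np.abs((actual - predicted) / actual)) * 100, 2)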

2. Mean Absolute Error (MAE)


Formula: $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\text{Actual}_i - \text{Forecast}_i|$

Description: MAE represents the average absolute difference between actual and forecasted values. It gives
equal weight to all errors.

3. Root Mean Squared Error (RMSE)


Formula: $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{Actual}_i - \text{Forecast}_i)^2}$

Description: RMSE is the square root of the average squared differences between actual and forecasted values.
It penalizes larger errors more than smaller errors.

4. Forecast Bias
Formula: $\text{Bias} = \frac{1}{n} \sum_{i=1}^{n} (\text{Actual}_i - \text{Forecast}_i)$

Description: Forecast bias measures the average tendency of the forecasts to be either too high (positive bias)
or too low (negative bias).

In the provided code snippets, the MAPE is explicitly calculated using the get_mape function for the Moving Average (MA)
model with lag 1. The MAPE value is then reported as 17.8%, indicating the percentage error in the forecast.
7. Illustrate the KNN algorithm with an example.
The K-Nearest Neighbors (KNN) algorithm is a non-parametric and lazy learning algorithm used for regression and
classification problems. It classifies new observations by comparing them with the training data and finding similar
neighbors. Here's an illustration of the KNN algorithm using the bank marketing dataset.

1. Finding Neighbors

Observations in the training set that are similar to the new observation are considered neighbors.

The number of neighbors (K) to be considered for classifying a new observation is a parameter that can be set.

The class for the new observation is predicted to be the same class as the majority of its neighbors.

2. Distance Metrics

Neighbors are found by computing the distance between observations.

Euclidean distance is a widely used distance metric, given by:

$D(O_1, O_2) = \sqrt{(X_{11} - X_{21})^2 + (X_{12} - X_{22})^2}$

Other distance metrics such as Minkowski distance, Jaccard Coefficient, and Gower’s distance can also be
used.

3. Implementation in Python

The scikit-learn library provides the KNeighborsClassifier algorithm for classification problems.

Key parameters include n_neighbors (number of neighbors), metric (distance metric), and weights (weighting
scheme).

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
knn_clf.fit(train_X, train_y)

4. Accuracy Evaluation

ROC AUC score and the ROC curve are commonly used to evaluate KNN accuracy.

_, _, _, _ = draw_roc_curve(knn_clf, test_X, test_y)

5. Confusion Matrix

Confusion matrix provides detailed information about the model's performance.

It includes true positives, true negatives, false positives, and false negatives.

6. Classification Report

Precision, recall, and F1-score for each class are summarized in the classification report.

print(metrics.classification_report(test_y, pred_y))

7. Hyperparameter Tuning

The optimal number of neighbors (K) can be found through hyperparameter tuning using GridSearch in scikit-
learn.
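A hedged sketch of such a search (the grid values and scoring choice are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search a small grid of K values with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(3, 15, 2))}
grid = GridSearchCV(KNeighborsClassifier(metric='minkowski'),
                    param_grid, cv=5, scoring='roc_auc')
grid.fit(train_X, train_y)
print("Best K:", grid.best_params_['n_neighbors'])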
8. With respect to the Moving Average model, discuss the different methods for calculating
Forecast Accuracy.
Moving Average (MA) processes are regression models that utilize past residuals to predict future values within a time-
series dataset. Expressing a moving average process with a lag of 1 as:

$Y_{t+1} = a_1 e_t + e_{t+1}$

The generalization of this model to q lags involves determining the appropriate value of q, or the number of lags, based
on specific criteria outlined by Yaffee and McGee (2000):

1. The auto-correlation values are significant for the initial q lags and then decline to zero.

2. The partial auto-correlation function (PACF) exhibits an exponential decrease.

Forecast accuracy measures help evaluate the performance of forecasting models such as Moving Average (MA)
models. Here are some common methods for calculating forecast accuracy:

1. Mean Absolute Percentage Error (MAPE)


Formula: $\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\text{Actual}_i - \text{Forecast}_i|}{|\text{Actual}_i|} \times 100$
Description: MAPE expresses forecast accuracy as a percentage of the absolute difference between actual and
forecasted values relative to the actual values. A lower MAPE indicates better accuracy.

In the given code snippet:

forecast_31_37 = ma_model.predict(30, 36)
get_mape(vimana_df.demand[30:], forecast_31_37)

The MAPE for the MA model with lag 1 is calculated as 17.8%.

2. Mean Absolute Error (MAE)


Formula: $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\text{Actual}_i - \text{Forecast}_i|$

Description: MAE represents the average absolute difference between actual and forecasted values. It gives
equal weight to all errors.

3. Root Mean Squared Error (RMSE)

Formula: $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{Actual}_i - \text{Forecast}_i)^2}$

Description: RMSE is the square root of the average squared differences between actual and forecasted values.
It penalizes larger errors more than smaller errors.

4. Forecast Bias
Formula: $\text{Bias} = \frac{1}{n} \sum_{i=1}^{n} (\text{Actual}_i - \text{Forecast}_i)$

Description: Forecast bias measures the average tendency of the forecasts to be either too high (positive bias)
or too low (negative bias).

In the provided code snippets, the MAPE is explicitly calculated using the get_mape function for the Moving Average (MA)
model with lag 1. The MAPE value is then reported as 17.8%, indicating the percentage error in the forecast.

9. Discuss the most popular accuracy measure of forecasting.


The most popular accuracy measure for forecasting is arguably the Mean Absolute Percentage Error (MAPE). MAPE
is widely used because it provides a clear and interpretable representation of forecast accuracy in percentage terms. It is
particularly favored in business and industry due to its intuitive interpretation and ease of communication to non-technical
stakeholders. Here are some key characteristics and considerations regarding MAPE:

Interpretability

MAPE is expressed as a percentage, making it easy to understand and communicate to both technical and non-
technical audiences.

Stakeholders can easily grasp the idea that, for example, a MAPE of 10% means the average forecast error is
10% of the actual values.

Scale Independence

MAPE is scale-independent, meaning it can be used to compare forecast accuracy across different datasets or
time series, regardless of the magnitude of the values.

This makes MAPE a versatile measure that can be applied to various domains and industries.

Symmetry

MAPE treats overestimation and underestimation symmetrically. Both types of errors contribute equally to the
overall accuracy measure.

This symmetry is sometimes seen as an advantage when evaluating the overall performance of a forecasting
model.

Commonly Used in Practice

MAPE is frequently used in real-world forecasting applications, and many forecasting software packages provide
MAPE as a standard metric for evaluating model performance.

Its popularity is due in part to its simplicity and ease of calculation.

Limitations

MAPE has some limitations. It can be sensitive to extreme values or outliers in the data.

It may not be well-defined when actual values are close to zero.

Despite its popularity, it's essential to note that no single accuracy measure is universally suitable for all situations.
Depending on the characteristics of the data and the specific goals of the forecasting task, other metrics like Mean
Absolute Error (MAE), Root Mean Squared Error (RMSE), or others might also be considered. The choice of accuracy
measure should align with the specific requirements and characteristics of the forecasting problem at hand.
10. In the context to Machine learning explain Bias - Variance Trade-off with example.
(10M)
The Bias-Variance Trade-off is a fundamental concept in machine learning that helps us understand the sources of
error in a predictive model. It involves finding the right balance between bias and variance to achieve a model that
generalizes well to unseen data.
Suppose you are working on a regression problem where you want to predict housing prices based on various features
such as square footage, number of bedrooms, and location. You collect a dataset containing information about different
houses, including their features and actual selling prices.

High Bias (Underfitting)

You decide to use a simple linear regression model, assuming a linear relationship between the square footage
and the price of the house.

However, the true relationship may be more complex, involving non-linear dependencies. As a result, your model
may struggle to capture the nuances of the data.

This is an example of high bias or underfitting, as the model is too simplistic to represent the underlying patterns
in the housing price data.

High Variance (Overfitting)

Recognizing the limitations of the linear model, you decide to use a high-degree polynomial regression model.
This model has the flexibility to fit the training data very closely, capturing intricate details.

The model performs exceptionally well on the training dataset, but when you apply it to new, unseen houses, it
fails to generalize. The predictions are highly sensitive to small changes in the training data.

This is an example of high variance or overfitting, as the model is too complex and captures noise in the training
data rather than the true underlying relationship.

Trade-off

To find the right balance, you experiment with different models of varying complexity (polynomial degrees in this
case).

You train models with degrees ranging from 1 to 10 and evaluate their performance on both the training and test
datasets.

Optimal Model Complexity

After evaluating the models, you observe that a polynomial regression model with a degree of 3 achieves a good
balance. It captures the non-linear patterns in the data without being overly complex.

This model generalizes well to new houses and doesn't suffer from the issues of underfitting or overfitting.
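A compact sketch of this degree-sweep experiment (synthetic data; the noise level and degree range are illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic non-linear data: price vs. square footage
np.random.seed(42)
X = np.random.rand(200, 1) * 10
y = 50 + 20 * X[:, 0] - 1.5 * X[:, 0] ** 2 + np.random.randn(200) * 5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # High bias: both errors high; high variance: low train error, high test error
    print(f"degree={degree}: train MSE={train_err:.1f}, test MSE={test_err:.1f}")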

Module 2
1. Show that how evaluation problem and learning problem issues are addressed by
Hidden Markov Model. (OR) Discuss the problems in Hidden Markov method. (OR) Define
Hidden Markov Model and illustrate any two central issues addressed by Hidden Markov
model. → Repeated (10M)
Hidden Markov Models (HMMs) are widely used in various applications, including speech recognition and activity
recognition, to address both evaluation and learning problems. Let's discuss how HMM addresses these issues based on
the provided resource.

Evaluation Problem in HMM

The evaluation problem in HMM involves determining the probability of observing a given sequence of visible
states ($V^T$) given a particular model ($\theta$). This problem is addressed by using the forward algorithm, which
computes the probability of observing a sequence up to a certain point in time.

The forward algorithm is a recursive procedure that calculates the probability of being in a particular state at
each time step.

In the provided resource, the evaluation problem is expressed as summing, over all possible hidden-state sequences, the probability of the visible sequence given each hidden sequence: $P(V^T) = \sum_{r} P(V^T \mid \omega_r^T)\, P(\omega_r^T)$.

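A minimal sketch of the forward recursion (notation follows the notes: transition probabilities $a_{ij}$, emission probabilities $b_{jk}$; the arrays here are illustrative assumptions):

import numpy as np

def forward(A, B, pi, obs):
    # A: (N, N) transition matrix a_ij
    # B: (N, M) emission matrix b_jk
    # pi: (N,) initial state distribution
    # obs: sequence of observed symbol indices
    alpha = pi * B[:, obs[0]]               # initialisation: alpha_0(j)
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]  # recursive step of the forward pass
    return alpha.sum()                      # P(V^T | model)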
Decoding Problem in HMM

The decoding problem aims to determine the most likely or probable sequence of hidden states that the machine
traversed while generating the visible states ($V^T$). Essentially, it identifies the most probable state sequence given the
observed sequence.

A trellis diagram, composed of a matrix of nodes, is employed for solving the decoding problem. Each column in
the diagram represents possible states at a specific time. The first column corresponds to time instant 0, and
subsequent columns depict states at different intervals. This graphical representation aids in visualizing state
changes over time, particularly useful for Hidden Markov Models (HMM) with varying numbers of hidden states.

Learning Problem in HMM

In the learning phase of Hidden Markov Models (HMM), the objective is to estimate the transition probability
($a_{ij}$) and emission probability ($b_{jk}$). To achieve this, a set of known sequences is employed for
training.

Example Scenario: Consider a visible state sequence $\langle V_1, V_3, V_1, V_5, V_7, V_2, V_0 \rangle$. In this context:

$\alpha_3(4)$ represents the probability of the machine being in state $\omega_3$ and generating the sequence $V_1, V_3, V_1, V_5$.

$\beta_3(4)$ signifies the probability of the machine in state $\omega_3$ generating the next three visible states.

In summary, HMMs provide a comprehensive framework to address evaluation, decoding, and learning problems in
temporal pattern recognition applications. The combination of forward and backward algorithms, along with the Viterbi
algorithm, enables efficient computation and parameter estimation for HMMs.

2. Using the K-Medoids Algorithm, solve the problem for the following dataset of 6 objects, as
shown in the table below, into clusters for K=2.

Note: Randomly select 2 medoids as cluster centers.


The k-medoids algorithm is a clustering algorithm that is similar to the k-means algorithm but with some key differences.
It belongs to the family of partitional clustering algorithms, and its goal is to partition a dataset into k clusters, where the
number of clusters (k) is predefined. The primary distinction between k-means and k-medoids lies in the way they define
and update cluster centers.

Solution
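The data table and worked solution in the original notes were images that did not survive extraction. As a stand-in, here is a minimal k-medoids sketch on hypothetical 2-D points (all values are illustrative):

import numpy as np

# Hypothetical 6 objects with two attributes each
X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2], [6, 4]])
medoids = [0, 4]  # randomly chosen initial medoids (K=2)

def manhattan(a, b):
    return np.abs(a - b).sum()

for _ in range(10):  # iterate until the medoids stop changing
    # Assignment step: attach each point to its nearest medoid
    labels = np.array([np.argmin([manhattan(x, X[m]) for m in medoids]) for x in X])
    # Update step: within each cluster, the member with the lowest total
    # distance to its fellow members becomes the new medoid
    new_medoids = []
    for k in range(2):
        members = np.where(labels == k)[0]
        costs = [sum(manhattan(X[i], X[j]) for j in members) for i in members]
        new_medoids.append(int(members[np.argmin(costs)]))
    if new_medoids == medoids:
        break
    medoids = new_medoids

print("Final medoids (row indices):", medoids)
print("Cluster assignments:", labels.tolist())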

3. List and Explain applications of Clustering as well as requirements of Clustering. (OR)
Discuss the applications and requirements of clustering with example. (10M)
Applications of Clustering

Business Intelligence

1. Target Marketing: Marketers use cluster analysis to discover and categorize groups based on purchasing
patterns for better target marketing.

2. Market Segmentation: Clustering helps in dividing the market into segments, aiding in product positioning
and new product development.

Pattern Recognition

1. Grouping Similar Patterns: Clustering methods group similar patterns into clusters, assisting in identifying
patterns with higher similarity within clusters.

Image Processing

1. Segmentation: Clustering is applied in image processing to segment images into areas with similar
attributes, aiding in object identification.

2. Applications: Used in various areas such as analysis of remotely sensed images, traffic system monitoring,
and fingerprint recognition.

Bioinformatics

Taxonomies: Clustering techniques are used to derive plant and animal taxonomies and categorize genes
with similar functionalities.

Biological Systematics: Helps in studying the diversification of living forms and relationships among living
things based on similar characteristics.

Web Technology

Document Classification: Clustering assists in classifying documents on the web for effective information
delivery.

Search Engines

Improving Search Results: Clustering algorithms contribute to the success of search engines like Google by
providing more accurate and faster search results.

Text Mining

High-Quality Information Extraction: Clustering in text mining helps extract high-quality information from text,
including sentiment analysis and document summarization.

Requirements of Clustering

Scalability

Independence of Results: Clustering algorithms should provide similar results regardless of the size of the
dataset, ensuring scalability for large databases.

Dealing with Different Types of Attributes

Handling Various Data Types: Clustering algorithms should be designed to handle numeric as well as other
data types like nominal, binary, and ordinal, as well as complex data types such as graphs, sequences,
images, and documents.

Discovery of Clusters with Arbitrary Shape

Detecting Non-Spherical Clusters: Clustering algorithms should be capable of detecting clusters with
arbitrary shapes, as real-world data may exhibit diverse and non-spherical cluster shapes.

Avoiding Domain Knowledge to Determine Input Parameters:

Parameter Sensitivity: Clustering algorithms should not heavily rely on domain knowledge for input
parameters, as this can affect the quality of clustering and burden the user.

Handling Noisy Data

Dealing with Noise: Clustering algorithms should be able to handle noise in real-world data, including
attribute noise introduced by measurement tools and random errors.

Incremental Clustering

Accommodating New Data: Some clustering algorithms should support incremental updates to the database
without the need to recompute the clustering from scratch.

Insensitivity to Input Order

Order Independence: Clustering algorithms should be insensitive to the order in which data objects are
presented to ensure robustness and reliability.

Handling High-Dimensional Data

Handling Numerous Dimensions: Clustering algorithms should effectively handle high-dimensional datasets,
providing accurate results for datasets with numerous dimensions or attributes.

Handling Constraints:

Must-Link and Cannot-Link Constraints: Clustering algorithms should handle constraints such as must-link
and cannot-link constraints, ensuring that instances specified in the constraints are appropriately clustered or
not clustered together.

Interpretability and Usability:

User-Friendly Results: Clustering results should be interpretable and usable for users, tied with specific
semantic interpretations and applications that can make practical use of the information retrieved after
clustering.
4. For the given set of points, identify the clusters using the Agglomerative Clustering
Algorithm (Complete Link); use Euclidean distance and draw the final clusters formed. →
Repeated
Agglomerative clustering is a bottom-up hierarchical clustering method that starts with individual data points and
gradually merges them into larger clusters. The process continues until all data points belong to a single cluster or a
specified number of clusters is reached. The key idea is to iteratively merge the closest clusters based on a distance
metric until the desired number of clusters is achieved.

Program

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Synthetic Data")
plt.show()

# Perform agglomerative clustering
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
labels = model.fit_predict(X)

# Visualize the clustering result
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering Result")
plt.show()

# Plot the dendrogram to illustrate the hierarchy
linked = linkage(X, 'ward')
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Hierarchical Clustering Dendrogram")
plt.show()

In this example, we first generate synthetic data with three clusters using the make_blobs function. Then, we perform
agglomerative clustering with the AgglomerativeClustering class from scikit-learn. Finally, we visualize the data, the
clustering result, and the dendrogram, which illustrates the hierarchy of cluster merges.

Flowchart
The agglomerative algorithm is carried out in three steps:

1. Convert object attributes to distance matrix.


2. Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).

3. Repeat until number of clusters is one.

Merge two closest clusters.

Update distance matrix.

Problem-example

Solution

5. Explain any two types of clustering methods. (10M)
Clustering is a type of unsupervised machine learning technique that involves grouping similar data points or objects into
subsets known as clusters. The primary goal of clustering is to organize and discover inherent patterns, structures, or
relationships within a dataset without using predefined labels or target values.

Hard Clustering vs. Soft Clustering

Hard Clustering: Each data point completely belongs to a single cluster.

Soft Clustering: Assigns a probability or likelihood to each data point for being in multiple clusters.

Clustering Algorithms Classification

1. Partitioning Method

Definition: Division of a database into k partitions, where each partition represents a cluster.

Soft Clustering Note: In soft clustering, an object can belong to two clusters.

Algorithmic Steps:

Initial partitioning creation.

Iterative relocation technique for improving partitioning.

Criteria for Good Partitioning: Objects in the same cluster are close, while those in different clusters are
far.

2. Hierarchical Method

Definition: Alternative to partitioning clustering, does not require pre-specifying the number of clusters.

Result: Tree-based representation (dendrogram) of objects.

Approaches:

Agglomerative (bottom-up): Objects start in separate groups, merging close ones iteratively.

Divisive (top-down): All objects start in the same cluster, continuously splitting until termination.

3. Density-Based Methods

Algorithm: Density-based spatial clustering of applications with noise (DBSCAN); a brief usage sketch follows this list.

Concepts:

Density reachability: A point is density reachable if it's within distance e from another point with sufficient
neighbors.

Density connectivity: Points p and q are density-connected if there's a point r with sufficient neighbors,
forming a chaining process.

4. Grid-Based Methods

Approach: Concerned with the value space around data points rather than the points themselves.

Steps:

Create a grid structure, partitioning the data space.

Calculate cell density for each cell.

Sort cells by density.

Identify cluster centers.

Traverse neighbor cells.
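As referenced in the density-based method above, a minimal DBSCAN sketch (the eps and min_samples values are illustrative):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-spherical clusters that partitioning methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighbourhood radius; min_samples: neighbours required for a core point
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("Cluster labels found:", set(db.labels_))  # -1 marks noise points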

6. Illustrate how the K-means clustering method is used to assign the data points to
different clusters. (10M)
K-means clustering is a partitioning method that aims to group data points into distinct clusters based on their similarity.
The algorithm iteratively refines cluster assignments until convergence. Here is an illustration of the steps involved in the
K-means clustering method:

Example
Let's consider a dataset X = {x1, x2, x3, ... xn} and aim to partition it into c clusters. The steps are as follows:

1. Initialization:
Randomly select c cluster centers from the data points. Let V = {v1, v2, ...vc} represent these initial cluster centers.

2. Assignment:
Calculate the distance between each data point xi and each cluster center vj. Assign xi to the cluster with the
minimum distance.

3. Update Cluster Centers:


Recalculate the cluster centers using the formula:

$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_{ij}$

where $c_i$ represents the number of data points in the ith cluster.

4. Repeat:
Repeat steps 2 and 3 until convergence. If no data point is reassigned to a different cluster, stop. Otherwise,
continue.
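A minimal scikit-learn sketch of this assign-and-update loop (synthetic data; parameter values are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters corresponds to c; fit() runs the assignment/update loop to convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("First 10 assignments:", kmeans.labels_[:10])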

Advantages of K-means Clustering

1. Speed and Robustness:

Fast, robust, and easy to understand.

2. Efficiency:

Relatively efficient with computational complexity O(tknd), where n is the number of data objects, k is the
number of clusters, d is the number of attributes, and t is the number of iterations.

3. Effective for Well-Separated Data:

Gives the best result when datasets are distinct or well separated.

Disadvantages of K-means Clustering

1. Requires Prior Specification:

Requires prior knowledge of the number of clusters.

2. Sensitive to Initial Centers:

Randomly choosing cluster centers can lead to different results.

3. Not Suitable for Overlapping Data:

Unable to handle highly overlapping data.

4. Sensitive to Data Representation:

Not invariant to non-linear transformations: different representations of the data can produce different outcomes.

5. Local Optima:

Provides local optima of the squared error function, not guaranteed to find the global optimum.

6. Limited Applicability:

Applicable only when the mean is defined, hence not suitable for categorical data.

7. Sensitivity to Noise and Outliers:

Unable to handle noisy data and outliers effectively.


7. Explain the agglomerative clustering method. Demostrate using program. (OR) Explain
Agglomerative algorithm with flow chart and steps required to process it. (10M) →
Repeated
Agglomerative clustering is a bottom-up hierarchical clustering method that starts with individual data points and
gradually merges them into larger clusters. The process continues until all data points belong to a single cluster or a
specified number of clusters is reached. The key idea is to iteratively merge the closest clusters based on a distance
metric until the desired number of clusters is achieved.

Program

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Synthetic Data")
plt.show()

# Perform agglomerative clustering
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
labels = model.fit_predict(X)

# Visualize the clustering result
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering Result")
plt.show()

# Plot the dendrogram to illustrate the hierarchy
linked = linkage(X, 'ward')
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Hierarchical Clustering Dendrogram")
plt.show()

In this example, we first generate synthetic data with three clusters using the make_blobs function. Then, we perform
agglomerative clustering with the AgglomerativeClustering class from scikit-learn. Finally, we visualize the data, the
clustering result, and the dendrogram, which illustrates the hierarchy of cluster merges.

Flowchart
The agglomerative algorithm is carried out in three steps:

1. Convert object attributes to distance matrix.


2. Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).

3. Repeat until number of clusters is one.

Merge two closest clusters.

Update distance matrix.

Problem-example

Solution

8. Using the K-means clustering algorithm, solve the problem for two clusters of 6 objects as
shown in the table below; tabulate all the assignments.

Solution

Module 3
1. Illustrate the association rule mining concept with an example. Discuss its pros and
cons. (OR) With the code snippets discuss the ways of applying Association Rules. (OR)
Briefly explain the steps involved in applying association rules. (10M) → Repeated
Association rules are patterns or relationships identified within a dataset that reveal how items are frequently associated
or co-occur. Specifically, association rule mining seeks to discover interesting relationships between variables in large
datasets. The most common application of association rule mining is in the context of transactional data, such as
customer purchase histories.

Example

Code snippets

1. Data Preparation

Collect transactional data: Gather data that represents transactions or baskets of items, such as customer
purchases.

Represent data: Organize the data into a suitable format, where each row corresponds to a transaction, and
items are listed for each transaction.
Example Code:

# Loading and reading the dataset
all_txns = []
with open('groceries.csv') as f:
    content = f.readlines()

txns = [x.strip() for x in content]
for each_txn in txns:
    all_txns.append(each_txn.split(','))

2. Encoding the Transactions

Convert the transactional data into a one-hot-encoded format.

Create a matrix where each row represents a transaction, each column represents an item, and the values
are binary indicating whether an item is present in a transaction.

Example Code:

import pandas as pd
from mlxtend.preprocessing import OnehotTransactions

one_hot_encoding = OnehotTransactions()
one_hot_txns = one_hot_encoding.fit(all_txns).transform(all_txns)
# Wrap in a DataFrame so columns carry item names (used by apriori below)
one_hot_txns_df = pd.DataFrame(one_hot_txns, columns=one_hot_encoding.columns_)

3. Generating Frequent Itemsets

Use the Apriori algorithm to find frequent itemsets, which are combinations of items that occur together
frequently in the transactions.

Set a minimum support threshold to filter out infrequent itemsets.

Example Code:

from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(one_hot_txns_df, min_support=0.02, use_colnames=True)

4. Generating Association Rules

Use the frequent itemsets to generate association rules.

Set additional metrics thresholds such as confidence and lift to filter out uninteresting rules.

Example Code:

from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

5. Analysis and Interpretation

Examine the generated rules, including support, confidence, and lift.

Identify interesting and actionable rules for business decisions.

Example Code (printing top 10 rules based on confidence):

top_10_rules = rules.sort_values('confidence', ascending=False).head(10)

Pros and Cons

Pros of Association Rule Mining

1. Simple and Interpretable: Association rules are straightforward and easy to understand. The "if-then"
format makes them interpretable for both technical and non-technical stakeholders.

2. Applicability: Widely used in various industries, such as retail, healthcare, finance, and more, for tasks like
market basket analysis, recommendation systems, and fraud detection.

3. Reveals Hidden Patterns: Helps identify hidden patterns and relationships within large datasets that may
not be immediately apparent.

4. Decision Support: Provides valuable insights for decision-making, allowing businesses to optimize product
placement, create effective marketing strategies, and improve customer experience.

Cons of Association Rule Mining

1. Limited to Binary Data: Association rule mining typically deals with binary data (item present or not), which
may oversimplify complex relationships or preferences.

2. Doesn't Consider Sequential Patterns: Traditional association rule mining doesn't consider the order or
sequence of item occurrences. For some applications, the order of items may be crucial.

3. Quality of Rules: The quality of rules depends on the choice of metrics and thresholds, and the results may
vary based on these parameters.

4. Spurious Associations: It may discover associations that are statistically significant but lack meaningful or
logical interpretations, leading to potentially spurious rules.

5. Scalability: For large datasets with many items, the number of possible itemsets can grow exponentially,
making the computation of rules computationally expensive.
2. List and Explain importance of words in a Bag-of-Words (BoW) Model (OR) Explain the
Bag-of-Words (Bow) model with suitable example.(10M) → Repeated
The Bag-of-Words (BoW) model is a representation technique used in natural language processing and text analytics. It
involves creating a dictionary of all the words present in a corpus and then representing each document as a vector
based on the occurrence of words in the document. There are different ways to identify the importance of words in a
BoW model, and three common vector models are discussed: Count Vector Model, Term Frequency Vector Model, and
Term Frequency-Inverse Document Frequency (TF-IDF) Model.

Importance of Words in a BoW Model


1. Count Vector Model:

In this model, the importance of words is determined by counting their occurrences in each document.

The count vector represents the frequency of each word in the document.

It is suitable for tasks where the emphasis is on the occurrence of words.

2. Term Frequency Vector Model:

Term Frequency (TF) is calculated as the frequency of each term in the document divided by the total number
of words in the document.

It provides a normalized representation of word frequency, making it suitable for comparing documents of
different lengths.

TF emphasizes the relative importance of words within a document.

3. Term Frequency-Inverse Document Frequency (TF-IDF) Model:

TF-IDF measures the importance of a word in a document relative to its frequency in the entire corpus.

It increases proportionally with the number of times a word appears in a document but is reduced by the
word's frequency in the corpus.

TF-IDF is effective in identifying words that are both frequent in a document and unique to that document.

Example - Count Vector Model


Consider two documents:

1. Document 1 (positive sentiment): "I really really like IPL."

2. Document 2 (negative sentiment): "I never like IPL."

Count Vector for Document 1 and Document 2:

| Documents                | I | really | never | like | IPL | Sentiment |
|--------------------------|---|--------|-------|------|-----|-----------|
| I really really like IPL | 1 | 2      | 0     | 1    | 1   | 1         |
| I never like IPL         | 1 | 0      | 1     | 1    | 1   | 0         |

The count vectors represent the frequency of each word in the respective documents, providing a numerical
representation for further analysis or machine learning tasks.

The process of creating count vectors involves using a CountVectorizer in Python, as shown in the provided code
snippets. Additionally, the code demonstrates the importance of handling low-frequency words, removing stop words,
and applying stemming to create more meaningful and concise representations of the documents in the BoW model.
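Those CountVectorizer snippets are not reproduced in the notes; a minimal sketch for the two example documents (note that sklearn's default tokenizer lowercases and drops one-letter tokens such as "I"):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really really like IPL.", "I never like IPL."]
vectorizer = CountVectorizer()
count_vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(count_vectors.toarray())             # one count vector per document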

Other Examples ( TF and TF - IDF)


Term Frequency-Inverse Document Frequency (TF-IDF) Example

Given the IDF values for each term (x1, x2, x3, x4, x5):

| Terms  | IDF Values |
|--------|------------|
| I      | 0.693      |
| really | 1.098      |
| never  | 1.098      |
| like   | 0.693      |
| IPL    | 0.693      |

TF-IDF Vector for Document 1 and Document 2

Assuming the terms (x1, x2, x3, x4, x5) represent the words (I, really, never, like, IPL), and 'y' represents the
sentiment:

| Documents                | I      | really | never  | like   | IPL    | y |
|--------------------------|--------|--------|--------|--------|--------|---|
| I really really like IPL | 0.1386 | 0.4394 | 0.0    | 0.1386 | 0.1386 | 1 |
| I never like IPL         | 0.1732 | 0.0    | 0.2746 | 0.1732 | 0.1732 | 0 |

The values in the table represent the TF-IDF for each term in the respective documents.
Term Frequency Vector Model Example
Consider two documents:

1. Document 1 (positive sentiment): "I really really like IPL."

2. Document 2 (negative sentiment): "I never like IPL."

Term Frequency (TF) Vector for Document 1 and Document 2:

Assuming the terms (x1, x2, x3, x4, x5) represent the words (I, really, never, like, IPL), and 'y' represents the
sentiment:

| Documents                | I    | really | never | like | IPL  | y |
|--------------------------|------|--------|-------|------|------|---|
| I really really like IPL | 0.2  | 0.4    | 0     | 0.2  | 0.2  | 1 |
| I never like IPL         | 0.25 | 0      | 0.25  | 0.25 | 0.25 | 0 |

The values in the table represent the Term Frequency (TF) for each term in the respective documents.
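For completeness, a minimal TF-IDF sketch with scikit-learn (sklearn's smoothed IDF and normalisation differ slightly from the hand-computed values above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I really really like IPL.", "I never like IPL."]
tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(tfidf_vectors.toarray().round(4))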
3. Write a note on User - Based Similarity Algorithm and Finding the Best Model.
In user-based collaborative filtering, the similarity between users is computed based on their interactions with items
(movies in this case). The algorithm aims to find users with similar preferences and recommend items liked by those
similar users. Here's a step-by-step guide on implementing a user-based similarity algorithm using the MovieLens
dataset:

1. Loading the Dataset

import pandas as pd

# Load the ratings dataset
rating_df = pd.read_csv("ml-latest-small/ratings.csv")

# Drop the timestamp column
rating_df.drop('timestamp', axis=1, inplace=True)

# Create a pivot table with users as rows and movies as columns
user_movies_df = rating_df.pivot(index='userId', columns='movieId', values='rating').fillna(0)

2. Calculating Cosine Similarity between Users

import numpy as np
from sklearn.metrics import pairwise_distances

# Calculate cosine similarity between users
user_sim = 1 - pairwise_distances(user_movies_df.values, metric='cosine')

# Set diagonal values to 0 so a user is not most similar to themselves
np.fill_diagonal(user_sim, 0)

# Create a DataFrame for user similarities
user_sim_df = pd.DataFrame(user_sim, index=rating_df.userId.unique(), columns=rating_df.userId.unique())

3. Filtering Similar Users

# Find most similar users for each user


most_similar_users = user_sim_df.idxmax(axis=1)

4. Finding Common Movies of Similar Users

# Function to get common movies between two users
def get_user_similar_movies(user1, user2):
    common_movies = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2],
        on='movieId',
        how='inner'
    ).merge(movies_df, on='movieId')
    return common_movies

# Example usage
common_movies = get_user_similar_movies(2, 338)

Finding the Best Model


The effectiveness of the user-based similarity algorithm can be evaluated using various metrics such as precision,
recall, and Mean Squared Error (MSE). Additionally, cross-validation techniques can be employed to assess the
model's performance on unseen data. The choice of the best model depends on the specific use case and the
chosen evaluation metric.

Considerations for finding the best model:

Split the dataset into training and testing sets.

Implement different collaborative filtering algorithms (user-based, item-based, matrix factorization) and compare
their performance.

Tune hyperparameters for better results.

Evaluate the model's performance using appropriate metrics.

Adjust the algorithm and parameters based on the evaluation results to find the most effective collaborative filtering
model for the MovieLens dataset.
4. Discuss the two variations of collaborative filtering. (10M)
Collaborative filtering is a recommendation technique based on the notion of similarity or distance between users. It
operates on the idea that if two users have similar preferences and have rated common items similarly, their preferences
are likely to be similar in the future. There are two variations of collaborative filtering:

User-Based Similarity

Definition: This variation finds K similar users based on the common items they have bought and rated.

Methodology: The similarity or distance between users is calculated using the ratings they have given to
common items. Common similarity measures include Jaccard coefficient, cosine similarity, Euclidean distance,
and Pearson correlation.

Example: In the provided example, users Rahul, Purvi, and Gaurav are represented in a Euclidean space based
on their ratings for two common books, "Into Thin Air" and "Missoula." The Euclidean distance between users
helps identify the similarity, and recommendations can be made based on similar users.

Item-Based Similarity

Definition: This variation finds K similar items based on common users who have bought those items.

Methodology: Similarity between items is determined by analyzing the preferences of users who have bought
and rated those items. If two items have been bought and rated similarly by many users, they are considered
similar.

Comparison to K-Nearest Neighbors (KNN): Both user-based and item-based collaborative filtering algorithms
share similarities with the K-Nearest Neighbors algorithm discussed in Chapter 6.

Implementation Example (User-Based Similarity)

The example uses the MovieLens dataset to find similar users based on common movies they have watched
and rated.

The dataset includes user IDs, movie IDs, ratings, and timestamps.

The ratings are loaded into a DataFrame, and a pivot table is created with users as rows, movies as columns,
and ratings as values.

Cosine similarity is then calculated between users using the sklearn library.

The resulting similarity matrix is used to find similar users for each user.

Challenges with User-Based Similarity

One challenge is the "cold start problem," where new users have no or limited purchase and rating history. User
similarity relies on historical data, making it ineffective for new users until they provide enough data.

Item-Based Similarity is introduced as an alternative to address the cold start problem, as it focuses on
relationships between items rather than users.
5. With an example, explain the methods of Item-Based similarity algorithm in
collaborative filtering. (10M)
Item-Based Similarity in collaborative filtering is a recommendation technique that focuses on finding similarities between
items based on user behavior. It assumes that if users have liked or rated two items similarly, there is a strong
relationship between those items. Here, we'll explain the methods of Item-Based Similarity using an example.

Example: Movie Recommendations


Let's consider a MovieLens dataset where users provide ratings for various movies. The goal is to recommend
movies to a user based on the similarity of movies they have liked. We will focus on the Item-Based Similarity
algorithm to achieve this.

1. Loading the Dataset

We start by loading the MovieLens dataset, which includes information about users, movies, and their ratings.

import pandas as pd

# Load MovieLens dataset


movies_df = pd.read_csv("ml-latest-small/movies.csv")
ratings_df = pd.read_csv("ml-latest-small/ratings.csv")

2. Creating Item-User Matrix

We create a matrix where rows represent movies, columns represent users, and the values represent user
ratings.

item_user_matrix = ratings_df.pivot(index='movieId', columns='userId', values='rating').fillna(0)

3. Calculating Item Similarity

Next, we calculate the similarity between items using a similarity measure such as cosine similarity.

Cosine similarity between two items i and j is calculated as the cosine of the angle between their rating vectors.

from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between items


item_similarity = cosine_similarity(item_user_matrix)  # rows are movies, so this is item-item similarity
item_similarity_df = pd.DataFrame(item_similarity, index=item_user_matrix.index, columns=item_user_matrix.index)

4. Making Recommendations

To recommend movies to a user, we identify movies that are similar to the ones the user has already liked.

We find the top-N similar items and recommend those to the user.

def recommend_movies(user_ratings, item_similarity_df, n=5):
    # Total similarity of every item to the items the user has already rated
    similar_items = item_similarity_df[user_ratings.index].sum(axis=1)
    similar_items = similar_items.sort_values(ascending=False)
    # Drop items the user has already rated and keep the top-n
    recommended_movies = similar_items[~similar_items.index.isin(user_ratings.index)].head(n)
    return recommended_movies

# Example: Recommend movies for a user who liked movieId 1 and movieId 50
user_ratings = pd.Series([5, 4], index=[1, 50])
recommended_movies = recommend_movies(user_ratings, item_similarity_df)

5. Evaluation and Iteration:

The recommendation system can be evaluated using metrics like precision, recall, or mean squared error.

The algorithm can be iteratively improved by incorporating user feedback and adjusting the similarity
calculation.

In this example, the Item-Based Similarity algorithm uses the patterns of user ratings to identify similar movies and
recommend them to users who have shown interest in certain items. The cosine similarity is just one method, and
other similarity measures can be employed based on the characteristics of the data and the recommendation system.
6. Using the code snippets discuss the challenges of text analytics. (10M) (OR) Using the
code snippets, explain the challenges of text analytics. (10M) → Repeated
Text analytics, also known as text mining or natural language processing (NLP), is the process of extracting valuable
information and insights from unstructured text data. Unstructured text data includes a wide range of sources such as
books, articles, social media posts, reviews, emails, and more. Text analytics involves applying various computational
techniques and algorithms to analyze, interpret, and derive meaningful patterns or knowledge from this unstructured text.

1. Unstructured Nature of Text Data


Challenge Explanation
Text data is unstructured, making it more challenging to analyze compared to structured data. Extracting
meaningful insights from unstructured text requires extensive pre-processing.

Code Snippet

# Loading data using pandas


import pandas as pd
train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)

This code demonstrates the process of loading unstructured text data into a structured pandas DataFrame for
further analysis.
2. Data Pre-processing Challenges
Challenge Explanation
Text data requires extensive pre-processing before applying machine learning algorithms. Algorithms like
regression, classification, or clustering work best when the data is cleaned and prepared. Cleaning text data
involves tasks such as tokenization, stemming, and removing stop words.

Code Snippet

# Exploratory Data Analysis - inspecting the dataset's structure and missing values

train_ds.info()

This code snippet emphasizes the importance of exploring the dataset, understanding its structure, and checking
for missing values or inconsistencies as part of the pre-processing.

3. Feature Extraction Challenges


Challenge Explanation
Unlike structured data, text data does not have explicit features. Feature extraction is crucial for building machine
learning models on text data.

Code Snippet

# Text Pre-processing - Bag-of-Words (BoW) model


from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_ds['text'])

This code snippet introduces the concept of the Bag-of-Words (BoW) model as a way to extract features from text
data.

4. Sentiment Classification Challenges


Challenge Explanation
Sentiment analysis on text data involves classifying sentiments (positive or negative) expressed in reviews.
Building a classification model requires labeled data and proper representation of text features.

Code Snippet

# Exploring sentiment distribution using count plot


import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
plt.figure(figsize=(6,5))
ax = sn.countplot(x='sentiment', data=train_ds)

This code snippet visually represents the distribution of positive and negative sentiments in the dataset, a crucial
step in sentiment classification.

5. Limited Insights from Text


Challenge Explanation
Extracting meaningful insights from text data may be limited due to the subjective and contextual nature of
language. Understanding sentiments may not capture the full complexity of user opinions.
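
The source gives no snippet for this challenge, but a small sketch illustrates the point: in a unigram bag-of-words the token "not" is detached from "good", so "good" and "not good" share the same "good" feature, while bigrams retain some of that lost context.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was not good"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(unigrams.get_feature_names_out())  # "not" appears as an isolated token, detached from "good"
print(bigrams.get_feature_names_out())   # includes the "not good" bigram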
7. Explain matrix factorization technique. (10M)
Matrix factorization is a matrix decomposition technique used in collaborative filtering for recommendation systems. The
primary goal is to break down a large user-item matrix into two lower-dimensional matrices, which when multiplied
together, approximate the original matrix. This technique is especially useful in recommendation systems where users
rate or interact with items (e.g., movies, products).
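
A minimal sketch of the idea, assuming a small user-item matrix R where 0 marks a missing rating; the latent dimension k, learning rate, and regularization strength are illustrative choices. Stochastic gradient descent factors R into user factors P and item factors Q so that P @ Q.T approximates the observed ratings and fills in the missing ones.

import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

n_users, n_items = R.shape
k = 2                            # number of latent factors
rng = np.random.default_rng(42)
P = rng.random((n_users, k))     # user-factor matrix
Q = rng.random((n_items, k))     # item-factor matrix

lr, reg = 0.01, 0.02
for _ in range(5000):
    for u, i in zip(*R.nonzero()):               # iterate over observed ratings only
        err = R[u, i] - P[u] @ Q[i]              # prediction error for this rating
        P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step on the user factors
        Q[i] += lr * (err * P[u] - reg * Q[i])   # gradient step on the item factors

print(np.round(P @ Q.T, 2))  # approximates R; the zero entries become predictions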

8. Using code snippet, explain the steps required to build Naive-Bayes model for
sentiment classifictaion. (10M)
Naive Bayes is a family of probabilistic classification algorithms based on Bayes' theorem with the assumption of
independence among features. The "naive" in Naive Bayes comes from the assumption that the features used to
describe an observation are mutually independent, given the class label.

To build a Naïve-Bayes model for sentiment classification, you can follow the steps outlined in the provided
resource. Below is a code snippet that demonstrates these steps:

# Step 1: Split the Dataset


from sklearn.model_selection import train_test_split

# Assuming you have a dataset named train_ds with features and sentiment columns
train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)

# Step 2: Build Naïve–Bayes Model


from sklearn.naive_bayes import BernoulliNB

# Create a Naïve–Bayes classifier


nb_clf = BernoulliNB()

# Train the model using the training set


nb_clf.fit(train_X.toarray(), train_y)

# Step 3: Make Prediction on Test Case


# Predict the sentiments of the test dataset
test_ds_predicted = nb_clf.predict(test_X.toarray())

# Step 4: Finding Model Accuracy


from sklearn import metrics

# Print the classification report


print(metrics.classification_report(test_y, test_ds_predicted))

# Optional: Visualize Confusion Matrix


import seaborn as sn
import matplotlib.pyplot as plt

# Generate confusion matrix


cm = metrics.confusion_matrix(test_y, test_ds_predicted)

# Visualize confusion matrix as a heatmap


sn.heatmap(cm, annot=True, fmt=".2f", cmap="Blues")
plt.show()

This code assumes you have a dataset named train_ds with features and sentiment columns. Make sure to replace
the dataset and column names accordingly. The steps involve splitting the dataset, building the Naïve–Bayes model,
making predictions, and evaluating the model's accuracy using a classification report and confusion matrix
visualization.

Module 4
1. Explain different types of activation functions for processing a node in Neural
networks. (7M) (OR) Discuss the different types of activation functions of Neural networks
algon with its features. (10M) (OR) Explain any two activation function. (5M) → Repeated
Activation functions play a crucial role in neural networks by introducing non-linearity into the model, enabling it to learn
complex patterns and relationships. In the context of neural networks, activation functions are applied to the output of a
node or neuron. The provided resource describes various types of activation functions, categorized into bipolar and
unipolar activation functions, as well as the identity function and the ramp function.

1. Bipolar Activation Functions


Bipolar Binary Function

The bipolar binary function introduces a binary decision based on the sign of the net input. If the net input is positive, the neuron output is +1, and if it is negative, the output is -1. This function is suitable for tasks where the network needs to make clear-cut decisions with positive and negative outcomes.

$$f(net) = \begin{cases} +1, & net > 0 \\ -1, & net < 0 \end{cases}$$


Bipolar Continuous Function
The bipolar continuous function provides a smooth transition between positive and negative outputs, controlled by the steepness parameter $\lambda$. As $\lambda$ increases, the function approaches a step-like behavior similar to the bipolar binary function. This smooth transition allows for a more gradual adjustment of weights during training, contributing to smoother convergence.

$$f(net) = \frac{2}{1 + \exp(-\lambda \cdot net)} - 1$$

$\lambda > 0$ controls the steepness of the sigmoid-like function; as $\lambda$ increases, the function approaches the bipolar binary function.

The term "bipolar" signifies that both positive and negative responses are generated.

2. Unipolar Activation Functions


Unipolar Continuous Function
The unipolar continuous function is a modification of the bipolar continuous function, focusing only on positive outputs. It produces values between 0 and 1, making it suitable for tasks where the network needs to provide a probability-like output or when dealing with non-symmetric patterns in the data.

$$f(net) = \frac{1}{1 + \exp(-\lambda \cdot net)}$$

Unipolar Binary Function

The unipolar binary function is a simplified version of the unipolar continuous function where the output is either 1 or 0, depending on the sign of the net input. This function is useful in binary classification tasks, where the network needs to make a clear decision between two classes.

$$f(net) = \begin{cases} 1, & net > 0 \\ 0, & net < 0 \end{cases}$$

Note: The unipolar binary function is the limit of the unipolar continuous function as $\lambda$ approaches infinity.

3. Other Activation Functions

Identity Function
The identity function is linear, preserving the input as the output. It is typically used in the input layer to maintain the original features without introducing non-linearity. This function is useful when the raw input values are meaningful and should be directly passed to the next layer.

$$f(x) = x \quad \text{for all } x$$

Ramp Function
The ramp function is a piecewise linear function that introduces a gradual increase in output as the input grows beyond 0. It is useful when a smooth transition is needed, particularly in cases where the input can have a wide range of values. The ramp function helps capture and emphasize the variations in the positive range.

$$f(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$
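
These functions are straightforward to implement; the sketch below uses NumPy, with lam standing in for the steepness parameter λ.

import numpy as np

def bipolar_binary(net):
    return np.where(net >= 0, 1.0, -1.0)

def bipolar_continuous(net, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

def unipolar_continuous(net, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * net))

def unipolar_binary(net):
    return np.where(net > 0, 1.0, 0.0)

def ramp(x):
    return np.clip(x, 0.0, 1.0)   # 0 below 0, x on [0, 1], 1 above 1

net = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(unipolar_continuous(net, lam=2.0))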

2. Explain the Learning process involved in the neural network that responds to a
stimulus correctly. (5M)
The learning process in a neural network involves adjusting the values of network parameters to respond correctly to a
stimulus. There are three main categories of learning: supervised learning, unsupervised learning, and reinforcement
learning.

Supervised Learning

In supervised learning, a teacher provides guidance during the learning process.

An analogy is given using a child learning to sing. Initially, the child doesn't know how to sing and learns by
imitating a singer.

Each input vector is associated with a desired output, forming a training pair.

During training, the input vector is fed into the network, producing an actual output. The error signal is generated
by comparing this output with the desired output.

The network adjusts its weights based on the error signal to make the actual output closer to the desired output.

The process is depicted in a block diagram where the error signal influences the adjustment of network
parameters.

Unsupervised Learning

Unsupervised learning occurs without the guidance of a teacher, akin to a fish learning to swim without explicit
instruction.

Inputs of a similar category are grouped together by the network without predefined training.

The network forms clusters of similar input patterns during the training process.

When a new input is applied, the network classifies it into a particular cluster or forms a new cluster if it doesn't
belong to any existing cluster.

There is no external feedback to validate the correctness of the output; instead, the network discovers patterns
through self-organization.

Reinforcement Learning

Reinforcement learning is similar to supervised learning, but only critical information is available.

The process involves extracting real information from the critical information available.

An error signal is generated by comparing the actual output with the desired output, similar to supervised
learning.

Additionally, a reinforcement signal is utilized to guide the adjustment of network parameters.

The reinforcement model is illustrated in a block diagram where both error signal and reinforcement signal
contribute to the learning process.

3. Solve XOR function using McCulloch-Pitts neuron. (8M) (OR) Solve XOR function using
McCulloch-Pitts neuron. (OR) Construct XOR function usng Mc Culloch-Pitts neuron (10M)
→ Repeated
Solution

XOR cannot be realized by a single McCulloch-Pitts neuron because it is not linearly separable, so it is decomposed into linearly separable parts:

y1 = x1 AND NOT x2, y2 = x2 AND NOT x1, y = y1 OR y2

Hidden layer (both units use threshold θ = 1):

Unit y1 receives weight +1 from x1 and -1 from x2, so it fires only for (x1, x2) = (1, 0).

Unit y2 receives weight -1 from x1 and +1 from x2, so it fires only for (x1, x2) = (0, 1).

Output layer:

Unit y receives weight +1 from y1 and +1 from y2 with threshold θ = 1, so it fires when either hidden unit fires (OR).

Verification: (0, 0) gives y1 = 0, y2 = 0, so y = 0; (0, 1) gives y2 = 1, so y = 1; (1, 0) gives y1 = 1, so y = 1; (1, 1) gives net input 0 to both hidden units, so y1 = y2 = 0 and y = 0. The network therefore computes XOR.
4. Derive the Backpropagation rule considering the training rule for Output Unit weights
and Training Rule for Hidden Unit weights (8M) (OR) Write and explain the back propagation
algorithm. (10M) → Repeated
Backpropagation rule considering the training rule for Output Unit weights and Training Rule for hidden unit
weights

The backpropagation rule is used to train neural networks by adjusting the weights and biases in order to minimize
the error between the network's output and the expected output. The rule is derived using the chain rule of calculus
and involves two main steps: the training rule for output unit weights and the training rule for hidden unit weights.

Training Rule for Output Unit Weights


Let $w_{kj}$ denote the weight from hidden unit $j$ to output unit $k$, $a_j$ the activation of hidden unit $j$, $t_k$ the target value, and $a_k$ the output of the network. The error $E$ is given by

$$E = \frac{1}{2} \sum_k (t_k - a_k)^2$$

The change in weight $\Delta w_{kj}$ is given by the negative gradient of the error with respect to the weight, scaled by a learning rate $\epsilon$:

$$\Delta w_{kj} = -\epsilon \frac{\partial E}{\partial w_{kj}}$$

Applying the chain rule, we get:

$$\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}$$

where $net_k$ is the weighted sum of the inputs to the output unit. The derivative of $E$ with respect to $a_k$ is $-(t_k - a_k)$, the derivative of $a_k$ with respect to $net_k$ is $a_k(1 - a_k)$ (assuming a logistic activation function), and the derivative of $net_k$ with respect to $w_{kj}$ is simply $a_j$.

Substituting these results back into the equation for $\Delta w_{kj}$, we get:

$$\Delta w_{kj} = \epsilon (t_k - a_k) a_k (1 - a_k) a_j$$

This is the training rule for the output unit weights.

Training Rule for Hidden Unit Weights


The training rule for the hidden unit weights is a bit more complex because these weights affect the output only indirectly. Let $w_{ji}$ denote the weight from input unit $i$ to hidden unit $j$.

The change in weight $\Delta w_{ji}$ is given by:

$$\Delta w_{ji} = -\epsilon \frac{\partial E}{\partial w_{ji}}$$

Applying the chain rule, we get:

$$\frac{\partial E}{\partial w_{ji}} = \sum_k \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial a_j} \frac{\partial a_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}$$

where $net_j$ is the weighted sum of the inputs to the hidden unit. The derivative of $E$ with respect to $net_k$ is $-(t_k - a_k) a_k (1 - a_k)$, the derivative of $net_k$ with respect to $a_j$ is $w_{kj}$, the derivative of $a_j$ with respect to $net_j$ is $a_j(1 - a_j)$, and the derivative of $net_j$ with respect to $w_{ji}$ is simply the input $x_i$.

Substituting these results back into the equation for $\Delta w_{ji}$, we get:

$$\Delta w_{ji} = \epsilon \sum_k (t_k - a_k) a_k (1 - a_k) w_{kj} \, a_j (1 - a_j) x_i$$

This is the training rule for the hidden unit weights.

In summary, the backpropagation rule involves computing the gradient of the error with respect to the
weights, and then adjusting the weights in the direction that decreases the error. The training rules for the
output unit weights and the hidden unit weights are derived using the chain rule of calculus.

Backpropagation algorithm

The backpropagation algorithm is a method used in training neural networks. It calculates the gradient of the loss
function with respect to the weights of the network, which is then used to update the weights and minimize the loss.
Here is a simplified version of the backpropagation algorithm:

1. Initialize the weights: Start by initializing the weights to small random values[1].

2. Feedforward: For each input in the training set, compute the output of the network. This is done by passing the
input through each layer of the network and applying the activation function[1].

3. Compute the output error: For each output unit, calculate the error as the difference between the target output
and the actual output of the network[1].

4. Backpropagate the error: For each hidden unit, compute the error by summing the errors of the output units it
is connected to, weighted by the corresponding weights[1].

5. Update the weights: Adjust the weights in the direction that decreases the error. This is done by subtracting a
fraction of the gradient from the current weights. The fraction is determined by the learning rate[1].

6. Repeat: Repeat steps 2-5 until the stopping condition is met (e.g., the error is below a certain threshold, a
maximum number of iterations has been reached, etc.)[1].

This algorithm is typically used in conjunction with an optimization method such as gradient descent or stochastic
gradient descent to perform the weight updates[4]. The backpropagation algorithm is efficient and makes it possible
to train multi-layer networks
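
A minimal sketch of steps 1-6 for a 2-2-1 network learning XOR, using logistic units and batch updates; the learning rate, epoch count, and initialization are illustrative, and convergence depends on the random seed.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1.0, (2, 2)); b1 = np.zeros(2)   # step 1: initialize the weights
W2 = rng.normal(0, 1.0, (2, 1)); b2 = np.zeros(1)
eta = 0.5

for epoch in range(20000):
    a1 = sigmoid(X @ W1 + b1)                   # step 2: feedforward
    a2 = sigmoid(a1 @ W2 + b2)
    delta2 = (t - a2) * a2 * (1 - a2)           # step 3: output error
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)    # step 4: backpropagate the error
    W2 += eta * a1.T @ delta2                   # step 5: update the weights
    b2 += eta * delta2.sum(axis=0)
    W1 += eta * X.T @ delta1
    b1 += eta * delta1.sum(axis=0)

print(np.round(a2.ravel(), 2))  # should approach [0, 1, 1, 0]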

https://www.youtube.com/watch?v=Ilg3gGewQ5U

https://www.youtube.com/watch?v=URJ9pP1aURo

5. Derive the Gradient Descent Rule and explain the importance of Stochastic Gradient
Descent. (6M) (OR) Derive the Gradient Descent Rule and explain the conditions in which
Gradient Descent is applied. (10M) → Repeated
Derivation of the Gradient Descent Rule
The key idea behind gradient descent is to use the gradient of the error function to guide the search through the
hypothesis space of weight vectors to find the weights that best fit the training data.

We can define the training error $E(w)$ of a hypothesis (weight vector) as:

$$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where $D$ is the set of training examples, $t_d$ is the target output for example $d$, and $o_d$ is the actual output for example $d$ given weight vector $w$.

To find the direction of steepest descent, we take the derivative of $E(w)$ with respect to the weight vector. For a linear unit with $o_d = w \cdot x_d$, this gives:

$$\nabla E(w) = \frac{\partial E}{\partial w} = -\sum_{d \in D} (t_d - o_d) x_d$$

The gradient $\nabla E(w)$ gives the direction of steepest increase in error, so the negative gradient points in the direction of steepest decrease. This leads to the gradient descent update rule:

$$w \leftarrow w - \eta \nabla E(w) = w + \eta \sum_{d \in D} (t_d - o_d) x_d$$

where $\eta$ is the learning rate. This rule updates each weight in proportion to the negative of the gradient, gradually descending along the error surface to find the minimum.
Importance of Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of gradient descent that is used when the dataset is large and it is
computationally expensive to compute the gradient of the function for the entire dataset.

SGD works by randomly sampling a subset of the data and computing the gradient of the function for that subset.

This approximation of the gradient is then used to update the current estimate of the minimum.

SGD is often used in practice because it can be much faster than gradient descent, especially for large datasets.

However, SGD can also be more noisy than gradient descent, and it can sometimes converge to a local minimum
rather than the global minimum.
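
A small sketch of the stochastic version of the rule derived above: instead of summing over all of $D$, the weights are nudged after each example $d$ using $\Delta w = \eta (t_d - o_d) x_d$. The toy linear data is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]    # bias column plus one feature
t = 3.0 + 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)   # noisy linear targets

w = np.zeros(2)
eta = 0.05
for epoch in range(50):
    for d in rng.permutation(len(X)):   # visit training examples in random order
        o = w @ X[d]                    # linear unit output
        w += eta * (t[d] - o) * X[d]    # stochastic gradient step

print(w)  # converges near the true parameters [3.0, 2.0]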

Conditions for Applying Gradient Descent


Gradient descent can be applied to any function that is differentiable. However, there are some conditions that must
be met in order for gradient descent to converge to the global minimum. These conditions are:

The function must be convex.

The gradient of the function must be Lipschitz continuous.

The learning rate must be small enough.

If these conditions are not met, then gradient descent may not converge to the global minimum, or it may converge
very slowly.
6. Prove the population evolution and the schema theorem incontext to genetic algorithm
(6M)
Population Evolution
The population evolves over generations in a genetic algorithm based on the selection, crossover, and mutation
operators. Specifically:

Selection chooses fitter individuals in the population to pass on to the next generation. This causes the average
fitness of the population to increase over generations.

Crossover combines parts of two parent individuals to produce new offspring. This allows beneficial traits from
different individuals to be combined.

Mutation randomly changes some individuals. This introduces new diversity into the population.

Schema Theorem
The schema theorem characterizes how the number of instances of a schema (pattern) $s$ changes over time. Define:

$m(s, t)$ = number of instances of schema $s$ at generation $t$

$\hat{f}(s, t)$ = average fitness of the instances of $s$ at generation $t$

$\bar{f}(t)$ = average fitness of the population at generation $t$

$p_c$ = crossover probability

$p_m$ = mutation probability

$o(s)$ = number of fixed bits (the order) of $s$

$d(s)$ = distance between the outermost fixed bits (the defining length) of $s$

$l$ = length of the bit strings

Then the expected number of instances of $s$ at the next generation satisfies:

$$E[m(s, t+1)] \geq \frac{\hat{f}(s,t)}{\bar{f}(t)} \, m(s,t) \left(1 - p_c \frac{d(s)}{l-1}\right) (1 - p_m)^{o(s)}$$

This shows that schemas with above-average fitness, $\hat{f}(s,t) > \bar{f}(t)$, tend to increase in the population, and that short, low-order schemas with small $o(s)$ and $d(s)$ are less disrupted by crossover and mutation.

So the schema theorem proves that fit, short, low-order schemas receive exponentially increasing trials in a genetic algorithm. This allows efficient parallel search through the space of schemas.

7. Describe the evolution of neural networks. (5M)
The evolution of neural networks can be summarized in the following table:

Year | Theory | Inventor | Features
1871–73 | Reticular theory | Joseph von Gerlach | The nervous system is a single continuous network.
1888–91 | Neuron doctrine | Santiago Ramon y Cajal | Proposed that the nervous system is actually made up of discrete individual cells forming a network.
1891 | Neuron doctrine accepted | Heinrich Wilhelm Gottfried von Waldeyer-Hartz | Consolidation of the neuron doctrine: nerve cells are individual cells interconnected through synapses, later confirmed by the electron microscope.
1943 | McCulloch–Pitts neuron | McCulloch and Pitts | Simplified mathematical model of a neuron.
1957–58 | Perceptron | Frank Rosenblatt | The perceptron may be able to learn, make decisions, and translate languages.
1965–68 | Multilayer perceptron | Ivakhnenko et al. | Though perceptrons were an advance on the McCulloch–Pitts model, they had their own limitations; a multilayered network of neurons with hidden layer(s) can approximate any continuous function to any desired precision.
1960–70 | Backpropagation (popularized by Rumelhart et al. in 1986) | Rumelhart et al. | A method for training neural networks by adjusting weights based on the error obtained in the previous epoch (a complete pass through the training dataset).
2006 | Unsupervised pre-training | Hinton and Salakhutdinov | Unsupervised pre-training, used in training very deep learners.

8. Discuss any two genetic operators (5M)


The two most common genetic operators used in genetic algorithms are crossover and mutation[1].

1. Crossover

This operator generates two new offspring from two parent strings by copying selected bits from each parent. The bit
at position i in each offspring is copied from the bit at position i in one of the two parents. The choice of which parent
contributes the bit for position i is determined by an additional string called the crossover mask. There are different
types of crossover operations, such as single-point crossover, two-point crossover, and uniform crossover. In single-
point crossover, the crossover mask is always constructed so that it begins with a string containing n contiguous 1s,
followed by the necessary number of 0s to complete the string. This results in offspring in which the first n bits are
contributed by one parent and the remaining bits by the second parent. In two-point crossover, offspring are created
by substituting intermediate segments of one parent into the middle of the second parent string. Uniform crossover
combines bits sampled uniformly from the two parents[1].

2. Mutation

This operator produces small random changes to the bit string by choosing a single bit at random, then changing its
value. Mutation is often performed after crossover has been applied. It helps to maintain diversity within the
population and prevent premature convergence on poor solutions[1].

These operators are used to generate new candidate solutions in the population, allowing the genetic algorithm to
explore the solution space. The crossover operator combines the information from two parent solutions to produce new
offspring, while the mutation operator introduces random changes to maintain diversity in the population.
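
A minimal sketch of the two operators on bit strings; the crossover point is drawn at random and the mutation rate is an illustrative choice.

import random

def single_point_crossover(parent1, parent2):
    # Swap the tails of the two parents after a random crossover point
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(bits, rate=0.01):
    # Flip each bit independently with the given probability
    return "".join(str(1 - int(b)) if random.random() < rate else b for b in bits)

random.seed(1)
child1, child2 = single_point_crossover("1100110011", "0011001100")
print(child1, child2)
print(mutate(child1, rate=0.2))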
9. Illustrate the genetic programming with suitable example. (10M)

https://www.youtube.com/watch?v=YG589P3LzGw

Genetic programming (GP) is a technique in artificial intelligence that evolves programs, starting from a population of
unfit (usually random) programs, fit for a particular task by applying operations analogous to natural genetic processes.
The operations include selection of the fittest programs for reproduction (crossover), replication and/or mutation
according to a predefined fitness measure, usually proficiency at the desired task.

Let's illustrate this with an example of a simple symbolic regression problem. The goal is to evolve a program that can
predict the output of a mathematical function, given the input. For simplicity, let's assume the function is $y = x^2 + x + 1$, and we want to evolve a program that can predict 'y' given 'x'.
1. Initialization: We start by creating a population of random programs. In this case, a program is a mathematical
expression composed of operators (+, -, *, /) and variables (x). For example, one program might be "x + x", another
might be "x * x", and so on.

2. Fitness Evaluation: Each program in the population is evaluated for its fitness, i.e., how well it solves the problem.
In this case, the fitness of a program is determined by how closely its output matches the output of the function $y = x^2 + x + 1$ for a range of 'x' values. The closer the match, the higher the fitness.
3. Selection: Programs are selected for reproduction based on their fitness. The higher the fitness, the higher the
chance of being selected. This is analogous to the principle of "survival of the fittest" in natural evolution.

4. Crossover: Two programs are selected and a point is chosen within each program. The parts of the programs after
these points are swapped to create two new programs. For example, if the parents are "x * x" and "x + 1", the
offspring might be "x * 1" and "x + x".

5. Mutation: A program is selected and a point is chosen within the program. The part of the program after this point is
replaced with a randomly generated part. For example, "x * x" might mutate to "x * 1".

6. Replacement: The least fit programs in the population are replaced with the new programs created by crossover
and mutation.

7. Termination: The process is repeated (from step 2) until a program with a satisfactory level of fitness is found, or a
predefined number of generations have been produced.

This example illustrates the basic process of genetic programming. In practice, GP can be used to evolve much more
complex programs, and the fitness function, selection method, crossover and mutation operators, and termination
condition can all be tailored to the specific problem at hand
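
A deliberately simplified sketch of this loop for the target $y = x^2 + x + 1$, with programs represented as small expression trees; the operator set, tree depth, population size, and the crude subtree crossover and mutation are all illustrative choices.

import random, operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_tree(depth=3):
    # A leaf is either the variable x or the constant 1.0
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", 1.0])
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree):
    # Squared error against the target; lower is better
    return sum((evaluate(tree, x) - (x * x + x + 1)) ** 2 for x in range(-5, 6))

def crossover(a, b):
    # Crude subtree crossover: graft a subtree of b into a
    if isinstance(a, tuple) and isinstance(b, tuple):
        return (a[0], b[1], a[2])
    return a

def mutate(tree):
    # Replace the whole tree, or recurse into the left branch
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree(2)
    return (tree[0], mutate(tree[1]), tree[2])

random.seed(0)
population = [random_tree() for _ in range(200)]
for generation in range(50):
    population.sort(key=fitness)                  # selection: fittest first
    if fitness(population[0]) == 0:
        break
    survivors = population[:100]                  # keep the fittest half
    children = [crossover(random.choice(survivors), random.choice(survivors)) for _ in range(50)]
    mutants = [mutate(random.choice(survivors)) for _ in range(50)]
    population = survivors + children + mutants   # replacement

print(population[0], fitness(population[0]))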

10. Describe the power of perceptron. (5M) (OR) Describe the power of perceptron. (OR)
Explain the concept of Perceptron with a neat diagram (10M) → Repeated

Representational Power of Perceptrons


Perceptrons can represent hyperplane decision surfaces in the n-dimensional space of instances.

They output 1 for instances on one side of the hyperplane and -1 for instances on the other side.

Perceptrons can represent all primitive boolean functions (AND, OR, NAND, NOR).

Every boolean function can be represented by a network of interconnected units based on these primitives.

Networks of perceptrons can represent a rich variety of functions.


The Perceptron Training Rule
The perceptron training rule is an iterative algorithm that modifies the weights of a perceptron to minimize the
number of misclassifications on a set of training examples.

The rule converges to a weight vector that correctly classifies all training examples, provided the training
examples are linearly separable and a sufficiently small learning rate is used.

Module 5
1. Prove the K-nearest neighbor algorithm for approximating a discrete - valued function
with pseudocode. (10M) (OR) Explain K nearest neighbor algorithm in detail. (10M)

https://youtu.be/wTF6vzS9fy4

The k-Nearest Neighbor (k-NN) algorithm is a type of instance-based learning method. It assumes that all instances
correspond to points in an n-dimensional space. The nearest neighbors of an instance are defined in terms of the
standard Euclidean distance. The algorithm can be used to approximate both discrete-valued and real-valued target
functions.

KNN_Algorithm(training_data, test_data, K):

1. Load the training_data and test_data
2. Choose the value of K
3. For each point in test_data:
   a. Calculate the distance to all points in the training_data
   b. Sort the calculated distances in ascending order
   c. Select the first K distances from the sorted list
   d. Determine the classes of the K points corresponding to these K distances
   e. Assign the class that occurs most frequently among the K classes to the test point
4. End
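
The pseudocode translates into a short runnable sketch; the toy 2-D dataset, the Euclidean distance, and the majority vote via Counter are illustrative choices.

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    distances = np.linalg.norm(train_X - query, axis=1)  # step a: distances to all training points
    nearest = np.argsort(distances)[:k]                  # steps b-c: first K after sorting
    votes = [train_y[i] for i in nearest]                # step d: classes of the K neighbours
    return Counter(votes).most_common(1)[0][0]           # step e: majority class

train_X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, np.array([2, 2]), k=3))  # -> "A"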

2. Suppose hypothesis h commits r = 10 errors over a sample of n = 65 independently
drawn examples, then solve the following

(i) What is the variance and standard deviation for number of true error rate errorD(h)?

(ii) What is the 90% confidence interval (two-sided) for the true error rate?

(iii) What is the 95% one-sided interval (i.e., what is the upper bound U such that errorD(h)
≤ U with 95% confidence)?

(iv) What is the 90% one-sided interval? (10M)

(i) The variance and standard deviation of errorD(h)

With $n = 65$ examples and $r = 10$ errors, the sample error is $errorS(h) = r/n = 10/65 \approx 0.154$. The number of errors follows a Binomial distribution, so the variance of the error count is $np(1-p) = 65 \times 0.154 \times 0.846 \approx 8.46$ (standard deviation $\approx 2.91$). The error rate estimator itself has variance

$$\sigma^2 = \frac{p(1-p)}{n} = \frac{0.154 \times 0.846}{65} \approx 0.0020$$

and standard deviation $\sigma \approx \sqrt{0.0020} \approx 0.045$.

(ii) The 90% confidence interval (two-sided) for the true error rate

The N% confidence interval is $errorS(h) \pm z_N \sqrt{\frac{errorS(h)(1 - errorS(h))}{n}}$, where $z_N$ is the z-value for the desired confidence level ($z = 1.64$ for 90%).

So the 90% confidence interval is $0.154 \pm 1.64 \times 0.045 \approx 0.154 \pm 0.073$, i.e. approximately $[0.081, 0.227]$.

(iii) The 95% one-sided interval (the upper bound U such that errorD(h) ≤ U with 95% confidence)

A one-sided 95% bound uses the z-value of the two-sided 90% interval, $z = 1.64$:

$$U = 0.154 + 1.64 \times 0.045 \approx 0.227$$

so $errorD(h) \leq 0.227$ with 95% confidence.

(iv) The 90% one-sided interval

A one-sided 90% bound uses the z-value of the two-sided 80% interval, $z = 1.28$:

$$U = 0.154 + 1.28 \times 0.045 \approx 0.211$$

so $errorD(h) \leq 0.211$ with 90% confidence.

3. What is reinforcement learning and develop reinforcement learning problem with neat
diagram. (10M) (OR) Describe reinforcement learning. Discuss how it differs from other
function approximation tasks. (10M)

Reinforcement learning is a type of machine learning that addresses how an autonomous agent that senses and acts in
its environment can learn to choose optimal actions to achieve its goals. This problem covers a wide range of tasks such
as learning to control a mobile robot, optimizing operations in factories, and learning to play board games. The agent
performs actions in its environment, and a trainer may provide a reward or penalty to indicate the desirability of the
resulting state. The agent's task is to learn from this indirect, delayed reward, to choose sequences of actions that
produce the greatest cumulative reward[1].

A reinforcement learning problem involves an agent, an environment, and a goal. The agent has a set of sensors to
observe the state of its environment and a set of actions it can perform to alter this state. The agent's task is to learn a
control strategy, or policy, for choosing actions that achieve its goals. The goals of the agent can be defined by a reward
function that assigns a numerical value—an immediate payoff—to each distinct action the agent may take from each
distinct state. The agent's task is to perform sequences of actions, observe their consequences, and learn a control
policy that maximizes the reward accumulated over time by the agent[1].

Reinforcement learning differs from other function approximation tasks in several important respects

1. Delayed reward

In reinforcement learning, training information is not available in the form of pairs of current state and optimal action.
Instead, the trainer provides only a sequence of immediate reward values as the agent executes its sequence of
actions. The agent, therefore, faces the problem of temporal credit assignment: determining which of the actions in
its sequence are to be credited with producing the eventual rewards[1].

2. Exploration

In reinforcement learning, the agent influences the distribution of training examples by the action sequence it
chooses. This raises the question of which experimentation strategy produces the most effective learning. The
learner faces a tradeoff in choosing whether to favor exploration of unknown states and actions (to gather new
information), or exploitation of states and actions that it has already learned will yield high reward (to maximize its
cumulative reward)[1].

3. Partially observable states

In many practical situations, sensors provide only partial information. For example, a robot with a forward-pointing
camera cannot see what is behind it. In such cases, it may be necessary for the agent to consider its previous
observations together with its current sensor data when choosing actions, and the best policy may be one that
chooses actions specifically to improve the observability of the environment[1].

4. Life-long learning

Unlike isolated function approximation tasks, reinforcement learning often requires that the agent learn several
related tasks within the same environment, using the same sensors. This setting raises the possibility of using
previously obtained experience or knowledge to reduce sample complexity when learning new tasks.
4. Interpret the Q function and Solve Q Learning Algorithm assuming deterministic
rewards and actions with an example. (10M) (OR) Discuss Q learning concept and write its
algorithm. (10M) (OR) With an illustrative example explain Q learning method. (10M)

https://www.youtube.com/watch?v=J3qX50yyiU0

Q learning is a model-free reinforcement learning algorithm used to find the optimal policy for a given environment. It does this by learning an action-value function, which gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. The Q function, denoted $Q(s, a)$, represents the maximum discounted cumulative reward that can be achieved starting from state $s$, taking action $a$, and thereafter following the optimal policy[1].

Q Learning Algorithm
The Q learning algorithm involves the following steps:

1. Initialize the Q-values $Q(s, a)$ arbitrarily for all state-action pairs.

2. Observe the current state $s$.

3. Select an action $a$ using a policy derived from Q (e.g., $\epsilon$-greedy).

4. Execute the action $a$, receive the immediate reward $r$, and observe the new state $s'$.

5. Update the Q-value for the state-action pair $(s, a)$ using the formula $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$, where $\gamma$ is the discount factor and $\max_{a'} Q(s', a')$ is the estimated optimal future value.

6. Set $s \leftarrow s'$ and repeat the process until a termination condition is met (e.g., a certain number of episodes or convergence of Q-values).

The algorithm assumes deterministic rewards and actions, meaning the outcome of each action is predictable and the
reward is consistent.

Illustrative Example
Imagine a simple grid world where an agent can move up, down, left, or right. The goal is to reach a specific location
on the grid. The agent receives a reward of zero for each move and a positive reward when it reaches the goal. The
Q learning algorithm would proceed as follows:

1. Initialize Q-values to zero.

2. At each state, the agent selects an action (e.g., move right).

3. After taking the action, the agent observes the reward and the new state.

4. The agent updates the Q-value for the state-action pair based on the observed reward and the maximum Q-value
of the new state.

5. This process continues until the agent has sufficiently learned the Q-values to navigate optimally to the goal.

As the agent explores the environment, the Q-values are updated, and the agent learns to predict the value of each
action in each state. Over time, the Q-values converge to the optimal values, which represent the best possible action
the agent can take in each state to maximize its cumulative reward[1].
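
A minimal sketch of the deterministic update rule on a one-dimensional corridor of six states: the agent moves left or right, state 5 is the goal with reward 100, all other moves give reward 0, and $\gamma = 0.9$. The random exploration policy is an illustrative choice.

import numpy as np

n_states, gamma = 6, 0.9
Q = np.zeros((n_states, 2))   # step 1: Q-values for actions 0 = left, 1 = right

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != 5:
        a = rng.integers(2)                           # explore with random actions
        s_next = max(s - 1, 0) if a == 0 else s + 1   # deterministic transition
        r = 100 if s_next == 5 else 0                 # reward only at the goal
        Q[s, a] = r + gamma * Q[s_next].max()         # deterministic Q update
        s = s_next

print(np.round(Q, 1))  # optimal values decay by a factor of gamma per step from the goal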

In summary, Q learning is a powerful algorithm for learning optimal policies in environments with deterministic actions
and rewards. It does not require a model of the environment and can be used in a wide range of applications, from
game playing to robotics.
5. Illustrate how the estimating accuracy is useful in evaluating a learned hypothesis
(10M)
Estimating the accuracy of a learned hypothesis is crucial for several reasons:

1. Decision Making: It helps in deciding whether the hypothesis is reliable enough to be used in practice. For example,
in medical treatment effectiveness studies, accurate estimation of hypothesis accuracy is vital for making informed
decisions[1].

2. Learning Process: It is an integral part of many learning methods. For instance, when post-pruning decision trees to
prevent overfitting, the accuracy of the pruned versus unpruned tree must be evaluated[1].

Challenges in Estimating Accuracy


When data is limited, estimating the future accuracy of a hypothesis presents two main difficulties:

Bias: The accuracy observed over the training examples may not be a good estimator for future examples since
the hypothesis was derived from these examples, leading to an optimistically biased estimate[1].

Variance: Even with an unbiased test set, the measured accuracy can vary from the true accuracy depending on
the makeup of the test examples. The smaller the test set, the greater the expected variance[1].
Importance of Accuracy Estimation
Understanding the accuracy of a hypothesis allows us to:

Estimate how well it will classify future instances.

Determine the probable error in this accuracy estimate, which is essential for setting realistic expectations and
understanding the confidence in the hypothesis[1].

Sample Error and True Error


To evaluate a hypothesis, we distinguish between:

Sample Error: The fraction of a data sample that the hypothesis misclassifies.

True Error: The probability that the hypothesis will misclassify a randomly drawn instance from the entire
unknown distribution[1].

Confidence Intervals for Hypothesis Accuracy


For discrete-valued hypotheses, statistical theory provides a way to estimate the true error based on observed
sample error. For example, with a 95% confidence level, the true error is expected to lie within an interval around the
sample error, calculated using a specific formula involving the number of examples and the number of errors
observed[1].
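
Concretely, for a hypothesis tested on $n$ independently drawn examples, the standard N% two-sided interval is:

$$errorD(h) \approx errorS(h) \pm z_N \sqrt{\frac{errorS(h)\,(1 - errorS(h))}{n}}$$

where $z_N$ is approximately 1.28 for 80%, 1.64 for 90%, 1.96 for 95%, and 2.58 for 99% confidence.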

Basics of Sampling Theory


Understanding sampling theory and statistics is crucial for evaluating hypotheses and learning algorithms. It provides
a framework for issues like overfitting and the relationship between generalization and the number of training
examples[1].

Error Estimation and Binomial Proportions


The deviation between sample error and true error depends on the sample size. This is a problem of estimating the
proportion of a population that exhibits some property based on a random sample. The Binomial distribution is often
used to model this, and for large sample sizes, it can be approximated by a Normal distribution[1].

In summary, estimating the accuracy of a learned hypothesis is essential for making informed decisions, guiding the
learning process, and setting realistic expectations about the performance of the hypothesis on future data.
Understanding the potential bias and variance in the estimate, as well as using statistical methods to calculate
confidence intervals, are key components of this evaluation process.
6. Write a note on:
i) Mean and Variance.
ii) Estimators, Bias and Variance. (10M)
i) Mean and Variance

The mean is a measure of central tendency, representing the average value of a set of data. It is calculated by
summing all the values in the dataset and dividing by the number of values. The variance, on the other hand, is a
measure of dispersion, indicating how much the values in the dataset deviate from the mean. It is calculated by
taking the average of the squared differences from the mean. A high variance indicates that the data points are
spread out from the mean and from each other, while a low variance indicates that the data points tend to be close to
the mean and to each other.
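
Concretely, for a sample of $n$ values $x_1, \ldots, x_n$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

with the standard deviation being the square root of the variance.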

ii) Estimators, Bias and Variance

Bias and variance are two properties of estimators.

Bias is the difference between the expected (or average) prediction of our model and the correct value which we
are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the
model, leading to high error on both training and test data.

Variance, on the other hand, is the variability of model prediction for a given data point. A model with high
variance pays a lot of attention to training data and does not generalize well on the data it hasn't seen before. As
a result, such models perform very well on training data but have high error rates on test data.

The bias-variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-bias, low-variance models underfit: they have a high error rate on both the training set and the test set. Low-bias, high-variance models overfit: they have a low error rate on the training set but a high error rate on the test set.
7. In the context to Machine learning explain Bias - Variance Trade-off with example. (10M)
The Bias-Variance Tradeoff is a fundamental concept in machine learning that describes the relationship between a
model's complexity, the accuracy of its predictions, and its ability to generalize to unseen data[2].

Bias refers to the difference between the average prediction of a model and the correct value it is trying to predict. A
model with high bias pays little attention to the training data and oversimplifies the model, leading to errors due to
incorrect assumptions in the learning algorithm. This is known as underfitting[3].
Variance, on the other hand, refers to the degree to which the estimate of the target function varies when using different
training sets. High variance can result from an algorithm modeling the random noise in the training data, leading to errors
due to sensitivity to small fluctuations in the training set. This is known as overfitting[1][2].

The Bias-Variance Tradeoff refers to the property of models where an increase in one component (bias or variance)
tends to result in a decrease in the other. In other words, lowering a model’s bias leads to an increase in its variance and
vice versa. This relationship is due to the complexity of the model: a more complex model will have low bias and high
variance, while a less complex model will have high bias and low variance[1][3].

For example, consider a model trying to predict house prices based on various features. A high bias model might
oversimplify the problem and only consider the number of rooms in the house, leading to consistent but inaccurate
predictions. On the other hand, a high variance model might consider too many features, including irrelevant ones such
as the color of the house, leading to very accurate predictions on the training data but poor performance on unseen data.
The goal in machine learning is to find a good balance between bias and variance, minimizing the total error. An optimal
balance of bias and variance would neither overfit nor underfit the model[3]. This balance can be adjusted in specific
algorithms by modifying parameters.

8. Explain Locally Weighted Linear Regression algorithm. (10M)


Locally Weighted Linear Regression (LWLR) is a variant of linear regression that constructs an explicit approximation
to the target function over a local region surrounding a query point. It uses nearby or distance-weighted training
examples to form this local approximation. The term "locally weighted regression" is used because the function is
approximated based only on data near the query point, it's "weighted" because the contribution of each training
example is weighted by its distance from the query point, and it's called "regression" because this is the term used
widely in the statistical learning community for the problem of approximating real-valued functions.

In LWLR, given a new query instance, an approximation is constructed that fits the training examples in the
neighborhood surrounding the query instance. This approximation is then used to calculate the estimated target
value for the query instance. A different local approximation will be calculated for each distinct query instance.

The target function is approximated near the query point using a linear function. The coefficients of this linear
function are found using methods such as gradient descent to minimize the error in fitting the function to a given set
of training examples. The error criterion is redefined to emphasize fitting the local training examples. Three possible
criteria are: minimizing the squared error over just the nearest neighbors, minimizing the squared error over the
entire set of training examples while weighting the error of each training example by some decreasing function of its
distance from the query point, or a combination of the two. The contribution of each instance to the weight update is
multiplied by the distance penalty, and the error is summed over only the nearest training examples.

The literature on LWLR contains a broad range of alternative methods for distance weighting the training examples,
and a range of methods for locally approximating the target function. In most cases, the target function is
approximated by a constant, linear, or quadratic function. More complex functional forms are not often found
because the cost of fitting more complex functions for each query instance is prohibitively high, and these simple
approximations model the target function quite well over a sufficiently small subregion of the instance space.

Locally Weighted Linear Regression (LWLR) is a non-parametric regression algorithm that estimates the relationship
between a dependent variable and one or more independent variables. It is a type of kernel regression that uses a
weighted linear regression model to predict the value of the dependent variable at a given point.

The weights in LWLR are determined by a kernel function, which assigns higher weights to data points that are closer to
the point being predicted. This allows LWLR to capture local patterns and relationships in the data, which can be useful
when the relationship between the variables is non-linear or changes over time.

LWLR is a relatively simple algorithm to implement and can be used to solve a variety of regression problems. However,
it can be computationally expensive when the number of data points is large.

Here are the steps involved in LWLR:

1. Choose a kernel function. Common choices include the Gaussian kernel and the Epanechnikov kernel.

2. Determine the bandwidth of the kernel. The bandwidth controls the size of the neighborhood around each data point
that is used to fit the linear regression model.

3. For each data point, fit a linear regression model to the data points within the neighborhood.

4. Use the linear regression model to predict the value of the dependent variable at the given point.
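
A minimal sketch of these four steps with a Gaussian kernel; the bandwidth tau and the noisy sine-wave data are illustrative choices.

import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    # Steps 1-2: Gaussian kernel weights, with bandwidth tau
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Step 3: weighted least-squares fit around the query point
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    # Step 4: predict at the query point
    return np.array([1.0, x_query]) @ theta

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 80))
y = np.sin(x) + rng.normal(0, 0.1, 80)
X = np.c_[np.ones_like(x), x]   # intercept column plus the feature

print(lwlr_predict(3.0, X, y, tau=0.5))  # close to sin(3.0) ≈ 0.141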

LWLR can be used to solve a variety of regression problems, including:

Predicting the price of a house based on its square footage and location.

Forecasting the sales of a product based on its price and advertising budget.

Estimating the risk of a loan applicant based on their credit score and debt-to-income ratio.

https://www.youtube.com/watch?v=38kNPkeGoR4 https://www.youtube.com/watch?v=to_LPkV1bnI
