
Data Mining Numericals


1.

A confusion matrix is a table used to evaluate the performance of a
classification model, showing the actual vs. predicted values. For a binary
classification problem, the confusion matrix is usually a 2x2 matrix with
the following structure:

| | Predicted Positive (P) | Predicted Negative (N) |
|---------------------|------------------------|------------------------|
| Actual Positive (A) | True Positive (TP) | False Negative (FN) |
| Actual Negative (B) | False Positive (FP) | True Negative (TN) |

Now, let's use a sample confusion matrix and calculate accuracy from it.

Example Confusion Matrix:

| | Predicted Positive (P) | Predicted Negative (N) |
|---------------------|------------------------|------------------------|
| Actual Positive (A) | 50 (TP) | 10 (FN) |
| Actual Negative (B) | 5 (FP) | 100 (TN) |

Definitions:
- True Positives (TP): 50 (Correctly predicted positive cases)
- False Negatives (FN): 10 (Actual positive cases but predicted negative)
- False Positives (FP): 5 (Actual negative cases but predicted positive)
- True Negatives (TN): 100 (Correctly predicted negative cases)

Accuracy Calculation:

Accuracy is the ratio of correctly predicted instances (both positives and
negatives) to the total number of instances. It is calculated using the formula:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Substitute the values from the confusion matrix:

\[ \text{Accuracy} = \frac{50 + 100}{50 + 100 + 5 + 10} = \frac{150}{165} \approx 0.9091 \]

So, the accuracy of the model is approximately 0.9091, or 90.91%.
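
The accuracy calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `accuracy` and the variable names are illustrative choices, not part of the original notes.

```python
# Counts taken from the example confusion matrix above.
tp, fn, fp, tn = 50, 10, 5, 100


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the fraction of all instances that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


acc = accuracy(tp, tn, fp, fn)
print(f"accuracy = {acc:.4f}")  # accuracy = 0.9091
```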

2. To predict the target class for a given data point using logistic regression,
we need the following:

1. Logistic regression model: This includes the learned weights (coefficients)
for each feature and the intercept (bias term).

2. The logistic regression formula:
The decision rule for logistic regression is based on the sigmoid function,
which outputs a probability between 0 and 1.

The model's prediction is based on the logit \(z\) (a linear combination of
the features):

\[ z = w_0 + w_1 \cdot x_1 + w_2 \cdot x_2 \]

where:
- \(w_0\) is the intercept (bias term),
- \(w_1, w_2\) are the weights for features \(x_1\) and \(x_2\),
- \(x_1\) and \(x_2\) are the feature values for the data point.

The output probability \(p\) is given by the sigmoid function:

\[ p = \frac{1}{1 + e^{-z}} \]

3. Threshold: Typically, the decision threshold is 0.5. If \(p \geq 0.5\), the model
predicts class 1 (positive class), and if \(p < 0.5\), the model predicts class 0
(negative class).

---
Step-by-step calculation:

To make the prediction, we would need to know the model coefficients
(weights) and the intercept for the logistic regression model that was trained.

Since these values (coefficients and intercept) are not provided, the example
below uses hypothetical values for the weights and intercept:

Example:
- Suppose the learned coefficients for the logistic regression model are:
  - Intercept: \(w_0 = -4\)
  - Weight for \(x_1\): \(w_1 = 0.5\)
  - Weight for \(x_2\): \(w_2 = 0.4\)

- The data point provided is:
  - \(x_1 = 2.7810836\)
  - \(x_2 = 2.550537003\)

Now, let's calculate the logit \(z\) and the predicted probability \(p\).

1. Calculate \(z\) (logit):

\[ z = w_0 + w_1 \cdot x_1 + w_2 \cdot x_2 \]
\[ z = -4 + 0.5 \cdot 2.7810836 + 0.4 \cdot 2.550537003 \]
\[ z = -4 + 1.3905418 + 1.0202148 \]
\[ z = -1.5892434 \]

2. Calculate the predicted probability \(p\) using the sigmoid function:

\[ p = \frac{1}{1 + e^{-z}} \]
\[ p = \frac{1}{1 + e^{1.5892434}} = \frac{1}{1 + 4.900} = \frac{1}{5.900} \approx 0.169 \]

3. Make the prediction:

Since \(p \approx 0.169\), which is less than the threshold of 0.5, the predicted
class is 0 (negative class).

---
Conclusion:
For the given data point (\(x_1 = 2.7810836\), \(x_2 = 2.550537003\)), using this
logistic regression model with the given weights and intercept, the predicted
target class is 0.
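
The same prediction can be reproduced in code. The sketch below assumes the hypothetical intercept and weights used in the worked example (-4, 0.5, 0.4); the helper names `sigmoid` and `predict` are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, intercept, threshold=0.5):
    """Return (logit, probability, class label) for one data point."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    p = sigmoid(z)
    return z, p, int(p >= threshold)


# Hypothetical coefficients from the worked example, not a trained model.
z, p, label = predict([2.7810836, 2.550537003], weights=[0.5, 0.4], intercept=-4.0)
print(f"z = {z:.4f}, p = {p:.4f}, class = {label}")
```

Changing `threshold` shows how the predicted class depends on the chosen cut-off rather than on the probability alone.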

3. Converting text into numerical vectors is a critical step in text mining, as
machine learning algorithms typically require numerical data as input.
Here are the key steps involved in converting text into numerical vectors:

1. Text Preprocessing:
Before you convert text into numerical vectors, it's important to preprocess the
text to remove unnecessary elements and standardize the text.
Common preprocessing steps:
- Lowercasing: Convert all text to lowercase to ensure uniformity (e.g., "Dog"
and "dog" are treated as the same word).
- Removing punctuation: Remove punctuation marks like commas, periods,
etc., as they don't contribute to the meaning in many cases.
- Tokenization: Split text into individual words or tokens (e.g., "I love
machine learning" becomes ["I", "love", "machine", "learning"]).
- Removing stop words: Stop words are common words like "the", "and", "is",
"in" that are often removed since they don't carry much meaning.
- Stemming or Lemmatization: Reduce words to their root form. Stemming strips
suffixes with simple rules (e.g., "running" becomes "run") and is more
aggressive, while lemmatization converts words to their base dictionary form
and can handle irregular forms (e.g., "better" becomes "good").

2. Choosing a Vectorization Method:
After preprocessing, the next step is to transform the tokens (words) into
numerical representations. There are several ways to do this:

A. Bag-of-Words (BoW):
The simplest approach to convert text to a numerical vector is using the
Bag-of-Words model, which represents each document as a vector of word
frequencies.
- Create a vocabulary: Identify all unique words (tokens) from the entire text
corpus (a set of documents).
- Create vectors: For each document, create a vector where each dimension
corresponds to a word in the vocabulary. The value in each dimension is the
frequency of that word in the document.

Example:
- Corpus: ["I love machine learning", "Machine learning is fun"]
- Vocabulary: ["I", "love", "machine", "learning", "is", "fun"]
- "I love machine learning" → [1, 1, 1, 1, 0, 0]
- "Machine learning is fun" → [0, 0, 1, 1, 1, 1]

Pros:
- Simple and easy to implement.
- Works well for problems where word counts matter (e.g., document
classification).
Cons:
- High dimensionality (if the vocabulary is large, the vectors can become very
sparse and large).
- Does not capture word order or semantics.
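
A minimal Bag-of-Words sketch for the two-sentence corpus above is shown here. It assumes lowercasing and whitespace tokenization as the only preprocessing, so the token "I" appears as "i"; the helper names `tokenize` and `bow_vector` are illustrative.

```python
from collections import Counter

corpus = ["I love machine learning", "Machine learning is fun"]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


# Build the vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in corpus:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)


def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [bow_vector(doc) for doc in corpus]
print(vocabulary)  # ['i', 'love', 'machine', 'learning', 'is', 'fun']
print(vectors)     # [[1, 1, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1]]
```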

B. Term Frequency-Inverse Document Frequency (TF-IDF):
TF-IDF is an improvement over BoW. It considers both the frequency of a
word in a document (TF) and the inverse frequency of the word across all
documents (IDF). Words that are common across many documents are given
less importance, and rare words are given higher weight.

The formula for TF-IDF is:

\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \]

- TF(t, d): Term Frequency = number of times term \(t\) appears in document
\(d\).
- IDF(t): Inverse Document Frequency = \(\log\left(\frac{N}{df(t)}\right)\),
where \(N\) is the total number of documents, and \(df(t)\) is the number of
documents containing term \(t\).

Example:
- Corpus: ["I love machine learning", "Machine learning is fun"]
- "machine" appears in both documents, so \(\text{IDF} = \log(2/2) = 0\) and its
TF-IDF weight is 0 in both documents.
- "love" appears in only one document, so \(\text{IDF} = \log(2/1) \approx 0.693\)
(natural logarithm) and its TF-IDF weight in the first document is
\(1 \times 0.693 = 0.693\).

Pros:
- Reduces the influence of common words (e.g., "is", "the", etc.).
- Weighs more informative words higher.

Cons:
- Still a bag-of-words approach, so it does not capture word order or
semantics.
- Same dimensionality as BoW (one dimension per vocabulary word), so the
vectors remain large and sparse, though the weights are often more informative.

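
The TF-IDF formula can be sketched directly from its definition. The code below assumes raw counts for TF, the natural logarithm for IDF, and lowercased whitespace tokens; library implementations often use smoothed or normalized variants, so their numbers may differ. All names are illustrative.

```python
import math
from collections import Counter

corpus = ["I love machine learning", "Machine learning is fun"]
docs = [doc.lower().split() for doc in corpus]
n_docs = len(docs)

# Sorted vocabulary so that vector positions are reproducible.
vocabulary = sorted({token for doc in docs for token in doc})

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Unsmoothed IDF, as in the formula above: log(N / df(t)).
idf = {term: math.log(n_docs / df[term]) for term in vocabulary}


def tfidf_vector(tokens):
    """TF-IDF weights for one tokenized document, in vocabulary order."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocabulary]


for doc in docs:
    print([round(weight, 3) for weight in tfidf_vector(doc)])
```

Terms shared by both documents ("machine", "learning") receive weight 0 here, which illustrates how unsmoothed IDF removes words that do not help tell documents apart.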
C. Word Embeddings:
Word embeddings represent words as dense vectors in a continuous vector
space. They capture semantic relationships between words, so words with
similar meanings are placed close to each other in the vector space.
- Pre-trained word embeddings (e.g., Word2Vec, GloVe, FastText): These are
models that have been trained on large corpora to generate dense word vectors
that capture the meanings of words. Each word is represented as a fixed-length
vector, usually of size 100 to 300.
- You can use these pre-trained embeddings or train your own embeddings on
your specific corpus.

Example:
- "king" → [0.12, 0.98, -0.35, ...]
- "queen" → [0.11, 0.97, -0.33, ...]

Pros:
- Captures semantic similarity between words (e.g., "king" is similar to
"queen").
- Reduces dimensionality compared to BoW and TF-IDF.
Cons:
- Requires more computational resources to train.
- Embeddings are typically not interpretable.

D. Sentence/Document Embeddings:
Instead of using word-level embeddings, you can represent entire sentences or
documents using sentence embeddings (e.g., Doc2Vec, Universal Sentence
Encoder, BERT, GPT).
- These embeddings take into account the context and semantics of the entire
sentence or document, not just individual words.

Pros:
- Captures sentence-level meaning and context.
- Works well for downstream tasks like text classification, sentiment analysis,
etc.

Cons:
- More complex to implement and train.
- Higher computational cost, especially for models like BERT.

3. Dimensionality Reduction (Optional):


After vectorizing the text, you may end up with very high-dimensional vectors
(e.g., for a large vocabulary in BoW or TF-IDF). Dimensionality reduction
techniques like Principal Component Analysis (PCA), t-SNE, or Latent
Semantic Analysis (LSA) can help reduce the size of the feature vectors while
retaining important information.

4. Vector Normalization (Optional):


In some cases, it may be helpful to normalize or scale the vectors to ensure
that each feature has a similar range, especially for distance-based models (e.g.,
KNN, SVM). Common techniques include min-max scaling or standardization
(zero mean, unit variance).

---

Summary of Steps:

1. Preprocessing: Clean and prepare the text (lowercasing, tokenization, stop
word removal, stemming/lemmatization).
2. Vectorization: Convert text into numerical vectors using methods like:
- Bag-of-Words (BoW)
- TF-IDF
- Word Embeddings (e.g., Word2Vec, GloVe)
- Sentence/Document Embeddings (e.g., BERT, GPT)
3. Optional: Apply dimensionality reduction or normalization.

Once you have converted the text into numerical vectors, you can use these
vectors as input for machine learning models to perform text mining tasks like
classification, clustering, sentiment analysis, and more.

4. In the context of Markov chains or PageRank-style algorithms, a
teleportation probability matrix refers to the probabilities of transitioning
from one state to another, including a probability of teleporting (jumping)
to a random state, which can be thought of as a kind of reset in the system.
Teleportation is used to ensure that the random walk does not get stuck in a
dead end or in a closed loop of states, because every state remains reachable
from every other state with some probability.

Problem Setup:
- Teleportation probability \( a = 0.25 \) means that with a probability of 0.25,
you teleport to a state chosen uniformly at random.
- The remaining probability \( 1 - a \) is the probability that you will follow the
regular transition rules, depending on the system you're working with (e.g., a
Markov chain transition matrix).

Steps to construct the teleportation probability matrix:

1. Identify the Transition Matrix: Let's say you have an existing transition
matrix \( P \) representing the probability of transitioning from one state to
another (for a system with \( n \) states).

2. Adjust for Teleportation: The teleportation matrix \( T \) introduces a uniform
random jump to any state. With teleportation probability \( a \), the teleportation
matrix is constructed such that:
- With probability \( a \), the system jumps to any state randomly.
- With probability \( 1 - a \), it follows the normal transition probabilities given
by \( P \).

3. Construct the Teleportation Matrix: The teleportation matrix \( T \) is then
computed as a weighted average of the original transition matrix \( P \) and a
uniform matrix (representing random jumps):

\[ T = (1 - a) \cdot P + a \cdot Q \]

Where:
- \( P \) is the original transition matrix (assuming rows are normalized so that the
sum of each row equals 1).
- \( Q \) is the uniform matrix, where each entry is \( \frac{1}{n} \), assuming there
are \( n \) states (this ensures that the teleportation probability is uniformly
distributed).

Example with a 3-state system:

Let's assume we have a system with 3 states and the following transition matrix
\( P \):

\[ P = \begin{bmatrix}
0.5 & 0.25 & 0.25 \\
0.3 & 0.4 & 0.3 \\
0.2 & 0.3 & 0.5
\end{bmatrix} \]

And the teleportation probability \( a = 0.25 \).

Step 1: Construct the uniform matrix \( Q \):

The uniform matrix \( Q \) for a 3-state system (where each state has an equal
chance of being selected) is:

\[ Q = \begin{bmatrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
\end{bmatrix}
\approx \begin{bmatrix}
0.3333 & 0.3333 & 0.3333 \\
0.3333 & 0.3333 & 0.3333 \\
0.3333 & 0.3333 & 0.3333
\end{bmatrix} \]

Step 2: Calculate the teleportation matrix \( T \):

Now, combine the original transition matrix \( P \) with the uniform matrix \( Q \)
using the teleportation probability \( a = 0.25 \):

\[ T = (1 - 0.25) \cdot P + 0.25 \cdot Q = 0.75 \cdot P + 0.25 \cdot Q \]

Substitute the values of \( P \) and \( Q \):

\[ T = 0.75 \cdot \begin{bmatrix}
0.5 & 0.25 & 0.25 \\
0.3 & 0.4 & 0.3 \\
0.2 & 0.3 & 0.5
\end{bmatrix}
+ 0.25 \cdot \begin{bmatrix}
0.3333 & 0.3333 & 0.3333 \\
0.3333 & 0.3333 & 0.3333 \\
0.3333 & 0.3333 & 0.3333
\end{bmatrix} \]

Calculate the result for each element:

\[ T = \begin{bmatrix}
0.75 \cdot 0.5 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.25 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.25 + 0.25 \cdot 0.3333 \\
0.75 \cdot 0.3 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.4 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.3 + 0.25 \cdot 0.3333 \\
0.75 \cdot 0.2 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.3 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.5 + 0.25 \cdot 0.3333
\end{bmatrix} \]

\[ T = \begin{bmatrix}
0.375 + 0.0833 & 0.1875 + 0.0833 & 0.1875 + 0.0833 \\
0.225 + 0.0833 & 0.3 + 0.0833 & 0.225 + 0.0833 \\
0.15 + 0.0833 & 0.225 + 0.0833 & 0.375 + 0.0833
\end{bmatrix} \]

Final Teleportation Matrix \( T \):

\[ T = \begin{bmatrix}
0.4583 & 0.2708 & 0.2708 \\
0.3083 & 0.3833 & 0.3083 \\
0.2333 & 0.3083 & 0.4583
\end{bmatrix} \]

This is the teleportation probability matrix, where each element represents the
probability of transitioning from one state to another, including the probability
of teleporting (randomly jumping to another state) with probability 0.25.
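
The construction of the teleportation matrix can be sketched in plain Python, using nested lists for the matrices. The function name `teleportation_matrix` is an illustrative choice; it assumes that every row of the input matrix already sums to 1.

```python
def teleportation_matrix(P, a):
    """Blend a row-stochastic matrix P with uniform jumps of probability a."""
    n = len(P)
    # Each entry is (1 - a) * P[i][j] + a * (1 / n).
    return [[(1 - a) * P[i][j] + a / n for j in range(n)] for i in range(n)]


P = [
    [0.5, 0.25, 0.25],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]

T = teleportation_matrix(P, a=0.25)
for row in T:
    # Every row of T should still sum to 1, up to floating-point error.
    print([round(value, 4) for value in row], "row sum:", round(sum(row), 4))
```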

5. To draw a box plot for the given data points:


Data points:
1090, 1560, 560, 780, 990, 670, 510, 490, 380, 880

We will first compute the five-number summary, which includes the following:

1. Minimum: The smallest value in the data set.
2. First Quartile (Q1): The median of the lower half of the data (25th
percentile).
3. Median (Q2): The middle value of the data (50th percentile).
4. Third Quartile (Q3): The median of the upper half of the data (75th
percentile).
5. Maximum: The largest value in the data set.

Step 1: Sort the Data

Sort the data in ascending order:

[ 380, 490, 510, 560, 670, 780, 880, 990, 1090, 1560 ]

Step 2: Compute the Five-Number Summary

1. Minimum: The smallest value in the sorted data is 380.

2. Maximum: The largest value in the sorted data is 1560.

3. Median (Q2): The median is the middle value. Since there are 10 data points
(an even number), the median is the average of the 5th and 6th values.

The 5th and 6th values are 670 and 780, so:

\[ \text{Median} = \frac{670 + 780}{2} = \frac{1450}{2} = 725 \]

4. First Quartile (Q1): Q1 is the median of the lower half of the data (values
before the median). The lower half is:
[ 380, 490, 510, 560, 670 ]
The median of this set is the 3rd value, which is 510.

5. Third Quartile (Q3): Q3 is the median of the upper half of the data (values
after the median). The upper half is:
[ 780, 880, 990, 1090, 1560 ]
The median of this set is the 3rd value, which is 990.

Five-Number Summary:
- Minimum = 380
- Q1 = 510
- Median (Q2) = 725
- Q3 = 990
- Maximum = 1560

Step 3: Draw the Box Plot


We can now use the five-number summary to construct the box plot.

1. Draw a number line to scale (covering the range of the data from 380 to
1560).
2. Plot the five numbers:
- Draw a vertical line at each of the five summary points (minimum, Q1,
median, Q3, and maximum).
3. Box: Draw a box from Q1 (510) to Q3 (990), with a vertical line inside the
box at the median (725).
4. Whiskers: Draw lines (whiskers) from the box to the minimum (380) and
maximum (1560).
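5. Outliers: Values more than 1.5 times the interquartile range (IQR) below Q1
or above Q3 are considered outliers. Here IQR = 990 - 510 = 480, so the fences
are 510 - 1.5 × 480 = -210 and 990 + 1.5 × 480 = 1710. Every data point, including
the maximum of 1560, lies between these fences, so there are no outliers and the
whiskers extend all the way to the minimum and maximum.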
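
The five-number summary and the 1.5 × IQR outlier check can be verified with a short script. This sketch assumes the same quartile convention as the worked example (Q1 and Q3 are the medians of the lower and upper halves); other conventions, such as interpolated percentiles, can give slightly different quartiles. The function name `five_number_summary` is illustrative.

```python
from statistics import median

data = [1090, 1560, 560, 780, 990, 670, 510, 490, 380, 880]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For an odd number of points, the middle value is left out of both halves.
    upper = s[half:] if n % 2 == 0 else s[half + 1:]
    return s[0], median(lower), median(s), median(upper), s[-1]


minimum, q1, q2, q3, maximum = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(minimum, q1, q2, q3, maximum)
print("IQR:", iqr, "fences:", lower_fence, upper_fence, "outliers:", outliers)
```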

Sketch of the Box Plot:


Here is a textual representation of how the box plot would look:

```
380   510        725          990                        1560
|------[==========|============]----------------------------|
Min    Q1       Median         Q3                         Max
```

Explanation:
- The box spans from Q1 (510) to Q3 (990).
- The line inside the box represents the median (725).
- The whiskers extend from the min (380) to Q1 (510) and from Q3 (990) to the
max (1560).

6. To apply the K-Nearest Neighbors (K-NN) algorithm and find the class
label for the query data point (13, 15), we'll follow the steps of the
algorithm. The key components of K-NN are:

1. Choose K: The number of nearest neighbors to consider.
2. Distance Measure: We'll use the Euclidean distance to calculate the distance
between the query point and each data point in the training set.
3. Identify K Nearest Neighbors: Find the K closest points in the training set to
the query point.
4. Assign a Class Label: Based on the majority class of the K nearest neighbors,
assign a class label to the query point.

Step-by-step Process:

Step 1: Data Setup


We need a dataset with labeled points. Here's a hypothetical dataset (training
data) with class labels:

| X1 | X2 | Class |
|-----|-----|-------|
| 10 | 12 | A |
| 15 | 14 | B |
| 13 | 15 | A |
| 16 | 17 | B |
| 14 | 16 | A |
| 12 | 13 | B |
| 11 | 11 | A |

Query Point: (13, 15)

For this example, let's choose \( K = 3 \), meaning we'll consider the 3 nearest
neighbors to the query point (13, 15).

Step 2: Calculate Euclidean Distances

We will use the Euclidean distance formula to calculate the distance between
the query point (13, 15) and each of the points in the dataset:

\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]

Where:
- \((x_1, y_1)\) is the query point (13, 15),
- \((x_2, y_2)\) is a point in the dataset.

Let’s calculate the distances for each training point:

1. Distance to (10, 12):

\[ d = \sqrt{(13 - 10)^2 + (15 - 12)^2} = \sqrt{3^2 + 3^2} = \sqrt{9 + 9} = \sqrt{18} \approx 4.24 \]

2. Distance to (15, 14):

\[ d = \sqrt{(13 - 15)^2 + (15 - 14)^2} = \sqrt{(-2)^2 + 1^2} = \sqrt{4 + 1} = \sqrt{5} \approx 2.24 \]

3. Distance to (13, 15) (same point as the query):

\[ d = \sqrt{(13 - 13)^2 + (15 - 15)^2} = \sqrt{0^2 + 0^2} = 0 \]

4. Distance to (16, 17):

\[ d = \sqrt{(13 - 16)^2 + (15 - 17)^2} = \sqrt{(-3)^2 + (-2)^2} = \sqrt{9 + 4} = \sqrt{13} \approx 3.61 \]

5. Distance to (14, 16):

\[ d = \sqrt{(13 - 14)^2 + (15 - 16)^2} = \sqrt{(-1)^2 + (-1)^2} = \sqrt{1 + 1} = \sqrt{2} \approx 1.41 \]

6. Distance to (12, 13):

\[ d = \sqrt{(13 - 12)^2 + (15 - 13)^2} = \sqrt{1^2 + 2^2} = \sqrt{1 + 4} = \sqrt{5} \approx 2.24 \]

7. Distance to (11, 11):

\[ d = \sqrt{(13 - 11)^2 + (15 - 11)^2} = \sqrt{2^2 + 4^2} = \sqrt{4 + 16} = \sqrt{20} \approx 4.47 \]

Step 3: Sort the Distances


Now, we sort the distances in ascending order:

| Point | Distance | Class |
|-------|----------|-------|
| (13, 15) | 0 | A |
| (14, 16) | 1.41 | A |
| (15, 14) | 2.24 | B |
| (12, 13) | 2.24 | B |
| (16, 17) | 3.61 | B |
| (10, 12) | 4.24 | A |
| (11, 11) | 4.47 | A |

Step 4: Identify the K Nearest Neighbors


We have chosen \( K = 3 \), so the 3 nearest neighbors are:

1. (13, 15): Class A (distance 0)
2. (14, 16): Class A (distance 1.41)
3. (15, 14): Class B (distance 2.24)

The points (15, 14) and (12, 13) are tied at distance 2.24. Both belong to
class B, so the result does not depend on which of them is taken as the third
neighbor.

Step 5: Assign the Class Label

The class labels for the 3 nearest neighbors are A, A, and B. The majority class
is A.

Final Classification

Using the K-NN algorithm with \( K = 3 \) and Euclidean distance, the class label
for the query point (13, 15) is A.

Conclusion:
- Query Point: (13, 15)
- K = 3
- Class Label: A
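
The K-NN steps above can be sketched in a few lines of Python, using the same hypothetical training set. The function name `knn_predict` is illustrative; `math.dist` (Python 3.8+) computes the Euclidean distance, and ties in distance are broken by the order of the training data because Python's sort is stable.

```python
import math
from collections import Counter

# Hypothetical training data from the table above: ((X1, X2), class label).
training = [
    ((10, 12), "A"),
    ((15, 14), "B"),
    ((13, 15), "A"),
    ((16, 17), "B"),
    ((14, 16), "A"),
    ((12, 13), "B"),
    ((11, 11), "A"),
]


def knn_predict(query, data, k):
    """Return the majority label among the k nearest points, plus those points."""
    by_distance = sorted(data, key=lambda item: math.dist(query, item[0]))
    neighbors = by_distance[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors


label, neighbors = knn_predict((13, 15), training, k=3)
for point, cls in neighbors:
    print(point, cls, round(math.dist((13, 15), point), 2))
print("Predicted class:", label)
```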

Problem Setup:

Let's consider a simple feedforward neural network with one hidden layer
(containing a single neuron) and a single output neuron, with no bias terms. We
want to apply the backpropagation algorithm to find the weight updates for the
weights \( \mathbf{v}^1 \) (input to hidden layer) and \( \mathbf{w}^1 \) (hidden
layer to output layer).

Given:
- Inputs: \( x_1 = 0.4 \), \( x_2 = -0.7 \)
- True Output (target): \( y_{\text{true}} = 0.1 \)
- Initial weights:
  - For the hidden layer (\( \mathbf{v}^1 \)): Weights between the inputs and the
hidden neuron, say \( v_1 = 0.5 \) and \( v_2 = -0.3 \) (for simplicity).
  - For the output layer (\( \mathbf{w}^1 \)): Weight between the hidden neuron and
the output neuron, say \( w = 0.8 \).
- Activation function: The sigmoid function is used for both the hidden layer and
the output layer.

Steps for Backpropagation:


Step 1: Feedforward Pass
In the feedforward pass, the inputs are propagated forward through the network
to compute the activations.

1.1 Compute Hidden Layer Activations

For the hidden neuron \( h_1 \), which receives both inputs:

\[ z_1 = v_1 x_1 + v_2 x_2 = (0.5)(0.4) + (-0.3)(-0.7) = 0.2 + 0.21 = 0.41 \]

Apply the sigmoid activation function:

\[ h_1 = f(z_1) = \frac{1}{1 + e^{-z_1}} = \frac{1}{1 + e^{-0.41}} \approx \frac{1}{1 + 0.664} \approx 0.60 \]

1.2 Compute Output Layer Activation

For the output neuron \( y_{\text{pred}} \), the weighted sum \( z_{\text{output}} \)
is:

\[ z_{\text{output}} = w h_1 = (0.8)(0.60) = 0.48 \]

Apply the sigmoid activation function to get the predicted output:

\[ y_{\text{pred}} = f(z_{\text{output}}) = \frac{1}{1 + e^{-z_{\text{output}}}} = \frac{1}{1 + e^{-0.48}} \approx \frac{1}{1 + 0.619} \approx 0.62 \]

---

Step 2: Compute the Loss (Error)


Now, calculate the error (loss) using the squared error loss function with a
factor of one half. For a single data point, this is:

\[ L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2 \]

\[ L = \frac{1}{2} (0.62 - 0.1)^2 = \frac{1}{2} (0.52)^2 = \frac{1}{2} \times 0.2704 = 0.1352 \]

---

Step 3: Backward Pass (Compute Gradients)

The goal of backpropagation is to compute the gradients of the loss with respect
to the weights in the network (this example has no bias terms). We'll apply the
chain rule of calculus.

3.1 Compute the Gradient of the Loss with Respect to Output Weights \( w \)

First, we compute the gradient of the loss with respect to the output weight \( w \).
Using the chain rule:

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial w} \]

1. Gradient of loss with respect to output:

\[ \frac{\partial L}{\partial y_{\text{pred}}} = y_{\text{pred}} - y_{\text{true}} = 0.62 - 0.1 = 0.52 \]

2. Gradient of output with respect to weight \( w \):

\[ \frac{\partial y_{\text{pred}}}{\partial w} = h_1 \cdot f'(z_{\text{output}}) \]

Where \( f'(z_{\text{output}}) = y_{\text{pred}}(1 - y_{\text{pred}}) \) is the
derivative of the sigmoid function. So:

\[ f'(z_{\text{output}}) = 0.62(1 - 0.62) = 0.62 \times 0.38 = 0.2356 \]

Now, the gradient of the loss with respect to \( w \) is:

\[ \frac{\partial L}{\partial w} = 0.52 \cdot 0.60 \cdot 0.2356 \approx 0.0735 \]

3.2 Compute the Gradient of the Loss with Respect to Hidden Weights \( v_1 \)
and \( v_2 \)

Next, compute the gradients for the hidden layer weights \( v_1 \) and \( v_2 \).
Again, we use the chain rule:

\[ \frac{\partial L}{\partial v_1} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial z_{\text{output}}} \cdot \frac{\partial z_{\text{output}}}{\partial h_1} \cdot \frac{\partial h_1}{\partial v_1} \]

We already computed \( \frac{\partial L}{\partial y_{\text{pred}}} = 0.52 \). Now,
let's compute the other parts:

1. Gradient of the output with respect to its weighted input:

\[ \frac{\partial y_{\text{pred}}}{\partial z_{\text{output}}} = f'(z_{\text{output}}) = 0.2356 \]

2. Gradient of the weighted input with respect to the hidden activation \( h_1 \):

\[ \frac{\partial z_{\text{output}}}{\partial h_1} = w = 0.8 \]

3. Gradient of the hidden activation with respect to \( v_1 \):

\[ \frac{\partial h_1}{\partial v_1} = h_1(1 - h_1) \cdot x_1 = 0.60(1 - 0.60)(0.4) = 0.24 \times 0.4 = 0.096 \]

Now, the gradient of the loss with respect to \( v_1 \) is:

\[ \frac{\partial L}{\partial v_1} = 0.52 \cdot 0.2356 \cdot 0.8 \cdot 0.096 \approx 0.0094 \]

Similarly, for \( v_2 \), the last factor uses \( x_2 \) in place of \( x_1 \):

\[ \frac{\partial h_1}{\partial v_2} = h_1(1 - h_1) \cdot x_2 = 0.24 \times (-0.7) = -0.168 \]

\[ \frac{\partial L}{\partial v_2} = 0.52 \cdot 0.2356 \cdot 0.8 \cdot (-0.168) \approx -0.0165 \]

---
Step 4: Update the Weights

Now that we have the gradients, we can update the weights using gradient
descent:

4.1 Update the Output Weights \( w \)

Assume the learning rate \( \eta = 0.1 \):

\[ w = w - \eta \cdot \frac{\partial L}{\partial w} = 0.8 - 0.1 \cdot 0.0735 = 0.8 - 0.00735 = 0.79265 \]

4.2 Update the Hidden Weights \( v_1 \) and \( v_2 \)

For \( v_1 \):

\[ v_1 = v_1 - \eta \cdot \frac{\partial L}{\partial v_1} = 0.5 - 0.1 \cdot 0.0094 = 0.5 - 0.00094 = 0.49906 \]

For \( v_2 \):

\[ v_2 = v_2 - \eta \cdot \frac{\partial L}{\partial v_2} = -0.3 - 0.1 \cdot (-0.0165) = -0.3 + 0.00165 = -0.29835 \]

Summary of Updated Weights:

- Updated output weight \( w \approx 0.79265 \)
- Updated hidden weights \( v_1 \approx 0.49906 \), \( v_2 \approx -0.29835 \)

Conclusion:

In this example, we followed the backpropagation algorithm to compute the
gradients for the weights \( v_1 \) and \( v_2 \) (input to hidden layer), and \( w \)
(hidden to output layer). We then updated the weights using gradient descent.
The learning rate was set to 0.1, and the weights were adjusted accordingly.
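
One forward and backward pass through this small network can be sketched as follows. The code assumes the same setup as the example (two inputs, one sigmoid hidden neuron, one sigmoid output neuron, no biases, half squared error loss, learning rate 0.1). It keeps full floating-point precision instead of rounding the activations to two decimals, so its results can differ from hand-computed values in the third or fourth decimal place. Variable names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Inputs, target, initial weights, and learning rate from the example.
x1, x2 = 0.4, -0.7
y_true = 0.1
v1, v2 = 0.5, -0.3   # input -> hidden weights
w = 0.8              # hidden -> output weight
eta = 0.1

# Forward pass.
z1 = v1 * x1 + v2 * x2
h1 = sigmoid(z1)
z_out = w * h1
y_pred = sigmoid(z_out)
loss = 0.5 * (y_pred - y_true) ** 2

# Backward pass (chain rule).
delta_out = (y_pred - y_true) * y_pred * (1 - y_pred)  # dL/dz_out
grad_w = delta_out * h1                                # dL/dw
delta_h = delta_out * w * h1 * (1 - h1)                # dL/dz1
grad_v1 = delta_h * x1                                 # dL/dv1
grad_v2 = delta_h * x2                                 # dL/dv2

# Gradient descent update.
w_new = w - eta * grad_w
v1_new = v1 - eta * grad_v1
v2_new = v2 - eta * grad_v2

print(f"h1 = {h1:.4f}, y_pred = {y_pred:.4f}, loss = {loss:.4f}")
print(f"grad_w = {grad_w:.4f}, grad_v1 = {grad_v1:.4f}, grad_v2 = {grad_v2:.4f}")
print(f"w = {w_new:.5f}, v1 = {v1_new:.5f}, v2 = {v2_new:.5f}")
```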

7. Numerical Example of Simple Linear Regression

In simple linear regression, we model the relationship between two variables
\(X\) (independent variable) and \(Y\) (dependent variable) with a linear equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:
- \(Y\) is the dependent variable (target),
- \(X\) is the independent variable (input),
- \(\beta_0\) is the intercept,
- \(\beta_1\) is the slope of the regression line,
- \(\epsilon\) is the error term (random noise).

The goal of linear regression is to find the best-fit line that minimizes the sum
of squared errors (SSE), or equivalently, the mean squared error (MSE).

Steps to Perform Linear Regression:

Step 1: Gather Data

Let’s assume we have the following dataset of (X) and (Y) values:

| \(X\) (independent) | \(Y\) (dependent) |
|---------------------|-------------------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
| 4 | 6 |
| 5 | 8 |

Step 2: Calculate the Mean of \(X\) and \(Y\)

First, we calculate the means of \(X\) and \(Y\):

\[ \overline{X} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \]

\[ \overline{Y} = \frac{2 + 3 + 5 + 6 + 8}{5} = \frac{24}{5} = 4.8 \]

Step 3: Calculate the Slope (\(\beta_1\))

The formula for the slope \(\beta_1\) is:

\[ \beta_1 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2} \]

Now, let's calculate each part of the formula:

- \( X_1 - \overline{X} = 1 - 3 = -2 \), \( Y_1 - \overline{Y} = 2 - 4.8 = -2.8 \) →
Product: \( (-2)(-2.8) = 5.6 \)
- \( X_2 - \overline{X} = 2 - 3 = -1 \), \( Y_2 - \overline{Y} = 3 - 4.8 = -1.8 \) →
Product: \( (-1)(-1.8) = 1.8 \)
- \( X_3 - \overline{X} = 3 - 3 = 0 \), \( Y_3 - \overline{Y} = 5 - 4.8 = 0.2 \) →
Product: \( (0)(0.2) = 0 \)
- \( X_4 - \overline{X} = 4 - 3 = 1 \), \( Y_4 - \overline{Y} = 6 - 4.8 = 1.2 \) →
Product: \( (1)(1.2) = 1.2 \)
- \( X_5 - \overline{X} = 5 - 3 = 2 \), \( Y_5 - \overline{Y} = 8 - 4.8 = 3.2 \) →
Product: \( (2)(3.2) = 6.4 \)

Now sum these products:

\[ \sum (X_i - \overline{X})(Y_i - \overline{Y}) = 5.6 + 1.8 + 0 + 1.2 + 6.4 = 15 \]

Next, calculate the sum of squared deviations of \(X\) from its mean:

- \( (X_1 - \overline{X})^2 = (-2)^2 = 4 \)
- \( (X_2 - \overline{X})^2 = (-1)^2 = 1 \)
- \( (X_3 - \overline{X})^2 = (0)^2 = 0 \)
- \( (X_4 - \overline{X})^2 = (1)^2 = 1 \)
- \( (X_5 - \overline{X})^2 = (2)^2 = 4 \)

Sum these squared values:

\[ \sum (X_i - \overline{X})^2 = 4 + 1 + 0 + 1 + 4 = 10 \]

Now, calculate the slope \(\beta_1\):

\[ \beta_1 = \frac{15}{10} = 1.5 \]

Step 4: Calculate the Intercept (\(\beta_0\))

The formula for the intercept \(\beta_0\) is:

\[ \beta_0 = \overline{Y} - \beta_1 \overline{X} \]

Substitute the values:

\[ \beta_0 = 4.8 - (1.5 \times 3) = 4.8 - 4.5 = 0.3 \]

Step 5: Write the Equation of the Line

Now that we have both the slope and the intercept, we can write the equation of
the regression line:

\[ Y = \beta_0 + \beta_1 X \]
\[ Y = 0.3 + 1.5X \]

Step 6: Make Predictions

We can use this equation to predict values of \(Y\) for new values of \(X\). For
example:

- If \(X = 6\), the predicted \(Y\) is:

\[ Y = 0.3 + 1.5(6) = 0.3 + 9 = 9.3 \]

- If \(X = 7\), the predicted \(Y\) is:

\[ Y = 0.3 + 1.5(7) = 0.3 + 10.5 = 10.8 \]

Final Model:
The equation of the regression line is:

\[ Y = 0.3 + 1.5X \]

Where:
- \(\beta_0 = 0.3\) is the intercept,
- \(\beta_1 = 1.5\) is the slope.

This means that for every unit increase in \(X\), the predicted \(Y\) increases by
1.5 units, and when \(X = 0\), the predicted \(Y\) is 0.3.
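
The least-squares formulas for the slope and intercept can be sketched in plain Python, using the same five data points. The function name `fit_simple_linear_regression` is an illustrative choice.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 6, 8]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations and sum of squared deviations of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


beta_0, beta_1 = fit_simple_linear_regression(xs, ys)
print(f"Y = {beta_0:.1f} + {beta_1:.1f}X")

for x_new in (6, 7):
    print(x_new, round(beta_0 + beta_1 * x_new, 1))
```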

Summary of the Steps:

1. Calculate means of \(X\) and \(Y\).
2. Compute the slope (\(\beta_1\)) using the formula.
3. Compute the intercept (\(\beta_0\)).
4. Write the regression equation.
5. Use the equation to make predictions for new data.

This is a simple linear regression model that predicts the dependent variable \(Y\)
based on the independent variable \(X\).

8. Steps in ID3 Algorithm (Iterative Dichotomiser 3):

The ID3 algorithm is a decision tree algorithm used for classification
tasks. It builds a decision tree by recursively splitting the data based on the
attribute that maximizes information gain at each step.
Here are the steps of the ID3 algorithm:

1. Start with the Whole Dataset:


Begin with the entire dataset as the root of the tree.

2. Check for Stopping Conditions:


- If all instances in the dataset belong to the same class, create a leaf node with
that class label.
- If no attributes remain to split on (i.e., all attributes have been used), create a
leaf node with the majority class of the instances.

3. Calculate Information Gain for Each Attribute:


For each attribute in the dataset, compute its information gain using the
Entropy concept. The attribute that provides the highest information gain is
selected for the current node.

4. Split the Dataset:


Split the dataset into subsets based on the values of the selected attribute.

5. Recursively Build the Tree:


For each subset created in step 4, repeat the process recursively:
- Calculate the information gain for the remaining attributes.
- Split the data further until the stopping conditions are met.

6. Assign Class Labels to Leaf Nodes:


Once all splits are made, the leaf nodes are assigned class labels based on the
majority class of instances in each subset.

Information Gain and Entropy Calculation:

1. Calculate Entropy for a Dataset:
Entropy is a measure of impurity or uncertainty in the dataset. The formula for
entropy is:

\[ \text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \]

Where:
- \( c \) is the number of classes.
- \( p_i \) is the proportion of elements in class \( i \) in the dataset.

2. Calculate Entropy for Each Attribute:
For each attribute \( A \), calculate the entropy after splitting the dataset based on
the attribute values. This is the weighted average of the entropy of each subset
of the dataset.

\[ \text{Entropy}_{\text{attribute}}(A) = \sum_{v \in A} \frac{|S_v|}{|S|} \text{Entropy}(S_v) \]

Where:
- \( v \) represents a value of attribute \( A \),
- \( S_v \) is the subset of data for which \( A = v \),
- \( |S| \) is the total number of instances in the dataset,
- \( |S_v| \) is the number of instances in subset \( S_v \).

3. Compute Information Gain:
The information gain for an attribute is the reduction in entropy achieved by
splitting the data on that attribute:

\[ \text{Information Gain}(A) = \text{Entropy}(S) - \text{Entropy}_{\text{attribute}}(A) \]

Where:
- \( \text{Entropy}(S) \) is the entropy of the whole dataset,
- \( \text{Entropy}_{\text{attribute}}(A) \) is the weighted average of the entropy
after the split.

The attribute with the highest information gain is chosen as the best attribute for
splitting the dataset at that step in building the decision tree.
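
The two formulas above combine into a short count-based helper, which mirrors how the calculation is usually done by hand. This is a sketch under assumptions: the function names and the argument layout (a list of class counts for the whole set, plus one list of class counts per attribute value) are choices made for this example.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return sum(-(n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(parent_counts, child_counts):
    """Entropy(S) minus the weighted average entropy of the subsets S_v."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


# Hypothetical split that separates the two classes perfectly.
print(information_gain([4, 4], [[4, 0], [0, 4]]))  # 1.0
# Hypothetical split that leaves every subset as mixed as the parent.
print(information_gain([4, 4], [[2, 2], [2, 2]]))  # 0.0
```

A gain of 1.0 bit is the maximum for a balanced two-class set; a gain of 0.0 means the attribute tells us nothing about the class.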

Example: Computing Information Gain for Attributes

Let’s go through a small example to compute the information gain for different
attributes and choose the best attribute for the root of the decision tree.

Example Dataset:

| Weather  | Temperature | Humidity | PlayTennis (Class) |
|----------|-------------|----------|--------------------|
| Sunny    | Hot         | High     | No                 |
| Sunny    | Hot         | High     | No                 |
| Overcast | Hot         | High     | Yes                |
| Rainy    | Mild        | High     | Yes                |
| Rainy    | Cool        | Normal   | Yes                |
| Rainy    | Cool        | Normal   | No                 |
| Overcast | Cool        | Normal   | Yes                |
| Sunny    | Mild        | High     | No                 |
| Sunny    | Cool        | Normal   | Yes                |
| Rainy    | Mild        | Normal   | Yes                |
| Sunny    | Mild        | Normal   | Yes                |
| Overcast | Mild        | High     | Yes                |
| Overcast | Hot         | Normal   | Yes                |
| Rainy    | Mild        | High     | No                 |

Step-by-Step Calculation:

1. Calculate the Entropy for the Whole Dataset:

- There are 14 instances, and the class distribution is:
  - Yes: 9 instances
  - No: 5 instances

The entropy for the whole dataset is:

\[ \text{Entropy}(S) = -\left( \frac{9}{14} \log_2 \frac{9}{14} + \frac{5}{14} \log_2 \frac{5}{14} \right) \]

Calculating this:

\[ \text{Entropy}(S) = -\left( 0.6429 \times (-0.6374) + 0.3571 \times (-1.4854) \right) \approx 0.4098 + 0.5305 \]

\[ \text{Entropy}(S) \approx 0.9403 \approx 0.94 \]

2. Calculate the Information Gain for Each Attribute:

We’ll compute the information gain for each attribute by following these steps:

a) Information Gain for "Weather":


The possible values for "Weather" are: Sunny, Overcast, and Rainy. We split the
dataset based on these values and calculate the weighted average of entropy for
each subset.

- Subset for "Sunny" (5 instances): [No, No, No, Yes, Yes]
  - Class distribution: 2 "Yes", 3 "No"
  - Entropy = 0.971

- Subset for "Overcast" (4 instances): [Yes, Yes, Yes, Yes]
  - Class distribution: 4 "Yes", 0 "No"
  - Entropy = 0 (since all are "Yes")

- Subset for "Rainy" (5 instances): [Yes, Yes, No, Yes, No]
  - Class distribution: 3 "Yes", 2 "No"
  - Entropy = 0.971

Now, we calculate the weighted average of entropy for "Weather":

\[ \text{Entropy}_{\text{Weather}} = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 \approx 0.3468 + 0 + 0.3468 \approx 0.6935 \]

Now, we can calculate the information gain for "Weather":

\[ \text{Information Gain}(\text{Weather}) = 0.9403 - 0.6935 \approx 0.247 \]

b) Information Gain for "Temperature":

The possible values for "Temperature" are: Hot, Mild, and Cool. We calculate
the entropy for each subset and then compute the information gain for
"Temperature".

- Subset for "Hot" (4 instances): [No, No, Yes, Yes]
  - Entropy = 1 (because there is a 50/50 split between "Yes" and "No")

- Subset for "Mild" (6 instances): [Yes, No, Yes, Yes, Yes, No]
  - Entropy = 0.918 (since 4 "Yes" and 2 "No")

- Subset for "Cool" (4 instances): [Yes, No, Yes, Yes]
  - Entropy = 0.811 (since 3 "Yes" and 1 "No")

Now, we calculate the weighted average entropy for "Temperature":

\[ \text{Entropy}_{\text{Temperature}} = \frac{4}{14} \times 1 + \frac{6}{14} \times 0.918 + \frac{4}{14} \times 0.811 \approx 0.2857 + 0.3936 + 0.2318 = 0.9111 \]

Now, we calculate the information gain for "Temperature":

\[ \text{Information Gain}(\text{Temperature}) = 0.9403 - 0.9111 \approx 0.029 \]

c) Information Gain for "Humidity":

The possible values for "Humidity" are: High and Normal. We calculate the
entropy for each subset and then compute the information gain for "Humidity".

- Subset for "High" (7 instances): [No, No, Yes, Yes, No, Yes, No]
  - Entropy = 0.985 (since 3 "Yes" and 4 "No")

- Subset for "Normal" (7 instances): [Yes, No, Yes, Yes, Yes, Yes, Yes]
  - Entropy = 0.592 (since 6 "Yes" and 1 "No")
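
Now, we calculate the weighted average entropy for "Humidity":

\[ \text{Entropy}_{\text{Humidity}} = \frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592 \approx 0.4926 + 0.2958 \approx 0.7885 \]

Now, we calculate the information gain for "Humidity":

\[ \text{Information Gain}(\text{Humidity}) = 0.9403 - 0.7885 \approx 0.152 \]

"Weather" gives the highest information gain of the three attributes, so it is
chosen as the root of the decision tree.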
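
The hand calculation can be cross-checked by computing the gains directly from the 14-row example table. The sketch below is illustrative only: the tuple layout, the names `DATA`, `ATTRIBUTES`, `entropy` and `information_gain` are assumptions made for this example, and the rows are copied from the table in the order given.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ("Weather", "Temperature", "Humidity")
# Each row is (Weather, Temperature, Humidity, PlayTennis), copied from the table.
DATA = [
    ("Sunny", "Hot", "High", "No"),
    ("Sunny", "Hot", "High", "No"),
    ("Overcast", "Hot", "High", "Yes"),
    ("Rainy", "Mild", "High", "Yes"),
    ("Rainy", "Cool", "Normal", "Yes"),
    ("Rainy", "Cool", "Normal", "No"),
    ("Overcast", "Cool", "Normal", "Yes"),
    ("Sunny", "Mild", "High", "No"),
    ("Sunny", "Cool", "Normal", "Yes"),
    ("Rainy", "Mild", "Normal", "Yes"),
    ("Sunny", "Mild", "Normal", "Yes"),
    ("Overcast", "Mild", "High", "Yes"),
    ("Overcast", "Hot", "Normal", "Yes"),
    ("Rainy", "Mild", "High", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Gain from splitting `rows` on the attribute stored at index `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])  # class label is the last field
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Best root attribute:", max(gains, key=gains.get))
```

Because the code keeps full floating-point precision, its results can differ in the last printed digit from a hand calculation that rounds intermediate values.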

9. The disjunctive sum (also known as the maximum operation) of two fuzzy
sets combines them by taking, for each element, the larger of its two
membership values. It is commonly used in fuzzy logic and fuzzy set theory
to represent the union of two fuzzy sets.

Definition of Disjunctive Sum (Maximum Operation)

Given two fuzzy sets \( A \) and \( B \) with membership functions \( \mu_A(x) \) and
\( \mu_B(x) \), the disjunctive sum \( A \cup B \) (or simply \( A + B \)) is a fuzzy set
with a membership function \( \mu_{A \cup B}(x) \) defined as:

\[ \mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x)) \]

That is, for each element \( x \), the membership value in the resulting fuzzy set is
the maximum of the membership values of \( x \) in sets \( A \) and \( B \).

Note that several fuzzy-set textbooks reserve the name "disjunctive sum" for the
exclusive-OR combination \( (A \cap \bar{B}) \cup (\bar{A} \cap B) \); the maximum
operation used here is the standard fuzzy union.

Steps to Compute the Disjunctive Sum of Two Fuzzy Sets

Let’s go through the process step by step with an example.

Example:

Suppose we have two fuzzy sets ( A ) and ( B ) defined as follows:

- Fuzzy Set \( A \): Membership function \( \mu_A(x) \)

| \( x \) | 1 | 2 | 3 | 4 | 5 |
|---------|---|---|---|---|---|
| \( \mu_A(x) \) | 0.3 | 0.6 | 1.0 | 0.8 | 0.5 |

- Fuzzy Set \( B \): Membership function \( \mu_B(x) \)

| \( x \) | 1 | 2 | 3 | 4 | 5 |
|---------|---|---|---|---|---|
| \( \mu_B(x) \) | 0.7 | 0.4 | 0.9 | 0.6 | 1.0 |

Now, to compute the disjunctive sum \( A \cup B \), we need to take the maximum
of the membership values for each element.

Step 1: Compute the Disjunctive Sum for Each \( x \)

- For \( x = 1 \), \( \mu_A(1) = 0.3 \) and \( \mu_B(1) = 0.7 \), so:

\[ \mu_{A \cup B}(1) = \max(0.3, 0.7) = 0.7 \]

- For \( x = 2 \), \( \mu_A(2) = 0.6 \) and \( \mu_B(2) = 0.4 \), so:

\[ \mu_{A \cup B}(2) = \max(0.6, 0.4) = 0.6 \]

- For \( x = 3 \), \( \mu_A(3) = 1.0 \) and \( \mu_B(3) = 0.9 \), so:

\[ \mu_{A \cup B}(3) = \max(1.0, 0.9) = 1.0 \]

- For \( x = 4 \), \( \mu_A(4) = 0.8 \) and \( \mu_B(4) = 0.6 \), so:

\[ \mu_{A \cup B}(4) = \max(0.8, 0.6) = 0.8 \]

- For \( x = 5 \), \( \mu_A(5) = 0.5 \) and \( \mu_B(5) = 1.0 \), so:

\[ \mu_{A \cup B}(5) = \max(0.5, 1.0) = 1.0 \]

Step 2: Construct the Resulting Fuzzy Set

Now, we can construct the fuzzy set \( A \cup B \) using the maximum
membership values we calculated:

| \( x \) | 1 | 2 | 3 | 4 | 5 |
|---------|---|---|---|---|---|
| \( \mu_{A \cup B}(x) \) | 0.7 | 0.6 | 1.0 | 0.8 | 1.0 |

So, the disjunctive sum of \( A \) and \( B \) is the fuzzy set:

\[ A \cup B = \{(1, 0.7), (2, 0.6), (3, 1.0), (4, 0.8), (5, 1.0)\} \]
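
The element-by-element maximum above is easy to reproduce in code. The sketch below is only an illustration: representing a fuzzy set as a Python dict from element to membership value, and the function name `fuzzy_union`, are assumptions made for this example. An element missing from one set is treated as having membership 0 in it.

```python
def fuzzy_union(mu_a, mu_b):
    """Pointwise maximum of two membership functions given as {element: degree} dicts."""
    elements = sorted(set(mu_a) | set(mu_b))
    # An element absent from one set is treated as having membership 0.0 there.
    return {x: max(mu_a.get(x, 0.0), mu_b.get(x, 0.0)) for x in elements}


A = {1: 0.3, 2: 0.6, 3: 1.0, 4: 0.8, 5: 0.5}
B = {1: 0.7, 2: 0.4, 3: 0.9, 4: 0.6, 5: 1.0}
print(fuzzy_union(A, B))
# {1: 0.7, 2: 0.6, 3: 1.0, 4: 0.8, 5: 1.0}
```
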

Summary

- The disjunctive sum of two fuzzy sets, as defined here, is the maximum of the
membership values for each element.
- For each element \( x \), we calculate \( \mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x)) \).
- The resulting fuzzy set assigns each element the larger of its two membership
values.
