Data Mining Numericals
Now, let's use a sample confusion matrix and calculate accuracy from it.
Definitions:
- True Positives (TP): 50 (Correctly predicted positive cases)
- False Negatives (FN): 10 (Actual positive cases but predicted negative)
- False Positives (FP): 5 (Actual negative cases but predicted positive)
- True Negatives (TN): 100 (Correctly predicted negative cases)
Accuracy Calculation:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{50 + 100}{50 + 100 + 5 + 10} = \frac{150}{165} \approx 0.909 \;(90.9\%) \]
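A quick way to check this arithmetic is a few lines of Python (the variable names are just illustrative):
```python
# Confusion-matrix counts from the example above.
TP, FN, FP, TN = 50, 10, 5, 100

# Accuracy = correct predictions / all predictions.
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Accuracy: {accuracy:.4f}")   # ~0.9091
```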
2. To predict the target class for a given data point using logistic regression,
we need the following:
1. Learned parameters: the intercept and the feature weights of the trained model.
2. The sigmoid (logistic) function, which converts the linear combination of the
features into a probability:
\[ p = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}} \]
where:
- \( w_0 \) is the intercept (bias term),
- \( w_1, w_2 \) are the weights for features \( x_1 \) and \( x_2 \),
- \( x_1 \) and \( x_2 \) are the feature values for the data point.
3. Threshold: Typically, the decision threshold is \( 0.5 \). If \( p \geq 0.5 \), the model
predicts class 1 (positive class), and if \( p < 0.5 \), the model predicts class 0
(negative class).
---
Step-by-step calculation:
Since these values (coefficients and intercept) are not provided, I'll walk you
through an example using hypothetical values for the weights and intercept:
Example:
- Suppose the learned coefficients for the logistic regression model are:
  - Intercept: \( w_0 = -4 \)
  - Weight for \( x_1 \): \( w_1 = 0.5 \)
  - Weight for \( x_2 \): \( w_2 = 0.4 \)
For the data point \( (x_1 = 2.7810836,\; x_2 = 2.550537003) \):
\[ z = w_0 + w_1 x_1 + w_2 x_2 = -4 + 0.5(2.7810836) + 0.4(2.550537003) \approx -1.589 \]
\[ p = \frac{1}{1 + e^{1.589}} \approx 0.17 < 0.5 \]
---
Conclusion:
For the given data point \( (x_1 = 2.7810836,\; x_2 = 2.550537003) \), using this
logistic regression model with the given weights and intercept, the predicted
target class is 0.
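The whole prediction can be reproduced with a short Python sketch, using the hypothetical weights assumed above:
```python
import math

def predict_class(x1, x2, w0, w1, w2, threshold=0.5):
    """Logistic-regression prediction for a two-feature data point."""
    z = w0 + w1 * x1 + w2 * x2          # linear combination of the features
    p = 1.0 / (1.0 + math.exp(-z))      # sigmoid gives P(class = 1)
    return p, (1 if p >= threshold else 0)

# Hypothetical weights and the data point from the example above.
p, label = predict_class(2.7810836, 2.550537003, w0=-4.0, w1=0.5, w2=0.4)
print(f"p = {p:.4f}, predicted class = {label}")   # p ~ 0.17, class 0
```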
1. Text Preprocessing:
Before you convert text into numerical vectors, it's important to preprocess the
text to remove unnecessary elements and standardize the text.
Common preprocessing steps:
- Lowercasing: Convert all text to lowercase to ensure uniformity (e.g., "Dog"
and "dog" are treated as the same word).
- Removing punctuation: Remove punctuation marks like commas, periods,
etc., as they don't contribute to the meaning in many cases.
- Tokenization: Split text into individual words or tokens (e.g., "I love
machine learning" becomes ["I", "love", "machine", "learning"]).
- Removing stop words: Stop words are common words like "the", "and", "is",
"in" that are often removed since they don't carry much meaning.
- Stemming or Lemmatization: Reduce words to their root form. Stemming is
more aggressive and simply chops word endings (e.g., "running" becomes "run"),
while lemmatization converts words to their base dictionary form (e.g., "better"
becomes "good").
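A minimal Python sketch of these preprocessing steps (the stop-word list here is a tiny illustrative set; stemming/lemmatization would need a library such as NLTK or spaCy and is omitted):
```python
import re

# Tiny stop-word list for illustration; real pipelines use larger lists.
STOP_WORDS = {"the", "and", "is", "in", "a", "an", "of", "to", "i"}

def preprocess(text):
    text = text.lower()                                 # lowercasing
    text = re.sub(r"[^\w\s]", "", text)                 # remove punctuation
    tokens = text.split()                               # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]   # drop stop words

print(preprocess("I love Machine Learning, and it is fun!"))
# ['love', 'machine', 'learning', 'it', 'fun']
```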
A. Bag-of-Words (BoW):
The simplest approach to convert text to a numerical vector is using the Bag-
of-Words model, which represents each document as a vector of word
frequencies.
- Create a vocabulary: Identify all unique words (tokens) from the entire text
corpus (a set of documents).
- Create vectors: For each document, create a vector where each dimension
corresponds to a word in the vocabulary. The value in each dimension is the
frequency of that word in the document.
Example:
- Corpus: ["I love machine learning", "Machine learning is fun"]
- Vocabulary: ["I", "love", "machine", "learning", "is", "fun"]
- "I love machine learning" → [1, 1, 1, 1, 0, 0]
- "Machine learning is fun" → [0, 0, 1, 1, 1, 1]
Pros:
- Simple and easy to implement.
- Works well for problems where word counts matter (e.g., document
classification).
Cons:
- High dimensionality (if the vocabulary is large, the vectors can become very
sparse and large).
- Does not capture word order or semantics.
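A minimal sketch of the BoW example above in plain Python (the vocabulary here is sorted alphabetically, so the vector ordering differs from the hand-worked example, but the counts are the same):
```python
corpus = ["I love machine learning", "Machine learning is fun"]

# Build the vocabulary from all lowercased tokens in the corpus.
docs = [doc.lower().split() for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})

# Each document becomes a vector of word counts over the vocabulary.
vectors = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)     # ['fun', 'i', 'is', 'learning', 'love', 'machine']
print(vectors)   # [[0, 1, 0, 1, 1, 1], [1, 0, 1, 1, 0, 1]]
```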
B. TF-IDF (Term Frequency-Inverse Document Frequency):
TF-IDF extends the Bag-of-Words idea by weighting each word's frequency in a
document (TF) by how rare that word is across the whole corpus (IDF), so very
common words receive lower weights.
Example:
- Corpus: ["I love machine learning", "Machine learning is fun"]
- TF-IDF can then be used to adjust the frequency based on how common the
word is across the corpus.
Pros:
- Reduces the influence of common words (e.g., "is", "the", etc.).
- Weighs more informative words higher.
Cons:
- Still a bag-of-words approach, so it does not capture word order or
semantics.
- Higher dimensionality than BoW, though it is often more informative.
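Assuming scikit-learn is available, TF-IDF vectors for the same two-document corpus can be sketched like this (exact weights depend on the library's tokenization and normalization settings):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning", "Machine learning is fun"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())   # vocabulary terms
print(X.toarray().round(3))                 # TF-IDF weights per document
```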
C. Word Embeddings:
Word embeddings represent words as dense vectors in a continuous vector
space. They capture semantic relationships between words, so words with
similar meanings are placed close to each other in the vector space.
- Pre-trained word embeddings (e.g., Word2Vec, GloVe, FastText): These are
models that have been trained on large corpora to generate dense word vectors
that capture the meanings of words. Each word is represented as a fixed-length
vector, usually of size 100 to 300.
- You can use these pre-trained embeddings or train your own embeddings on
your specific corpus.
Example:
- "king" → [0.12, 0.98, -0.35, ...]
- "queen" → [0.11, 0.97, -0.33, ...]
Pros:
- Captures semantic similarity between words (e.g., "king" is similar to
"queen").
- Reduces dimensionality compared to BoW and TF-IDF.
Cons:
- Requires more computational resources to train.
- Embeddings are typically not interpretable.
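As a rough sketch, and assuming gensim 4.x is installed, a toy Word2Vec model could be trained like this (a real model needs a far larger corpus; the hyperparameters are illustrative):
```python
from gensim.models import Word2Vec

# Toy tokenized corpus, purely for illustration.
sentences = [["i", "love", "machine", "learning"],
             ["machine", "learning", "is", "fun"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

vec = model.wv["machine"]                          # dense 50-dimensional vector
print(vec[:5])
print(model.wv.similarity("machine", "learning"))  # cosine similarity of two words
```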
D. Sentence/Document Embeddings:
Instead of using word-level embeddings, you can represent entire sentences or
documents using sentence embeddings (e.g., Doc2Vec, Universal Sentence
Encoder, BERT, GPT).
- These embeddings take into account the context and semantics of the entire
sentence or document, not just individual words.
Pros:
- Captures sentence-level meaning and context.
- Works well for downstream tasks like text classification, sentiment analysis,
etc.
Cons:
- More complex to implement and train.
- Higher computational cost, especially for models like BERT.
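Assuming the sentence-transformers package is installed, a sentence-embedding sketch might look like this (the model name is one commonly available pre-trained checkpoint, used here only as an example):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["I love machine learning", "Machine learning is fun"]
embeddings = model.encode(sentences)   # one dense vector per sentence

print(embeddings.shape)                # e.g. (2, 384) for this model
```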
---
Summary of Steps:
Once you have converted the text into numerical vectors, you can use these
vectors as input for machine learning models to perform text mining tasks like
classification, clustering, sentiment analysis, and more.
Problem Setup:
- Teleportation Probability: \( a = 0.25 \) means that with probability 0.25 you
teleport to a uniformly random state.
- The remaining probability \( 1 - a = 0.75 \) is the probability that you follow the
regular transition rules of the system you're working with (e.g., a Markov chain
transition matrix).
1. Identify the Transition Matrix: Let's say you have an existing transition
matrix ( P ) representing the probability of transitioning from one state to
another (for a system with ( n ) states).
2. Combine with teleportation: The teleportation-adjusted transition matrix
\( T \) is a weighted mix of the original matrix and a uniform jump matrix:
\[ T = (1 - a)\,P + a\,Q \]
Where:
- \( P \) is the original transition matrix (assuming rows are normalized so that the
sum of each row equals 1).
- \( Q \) is the uniform matrix, where each entry is \( \frac{1}{n} \), assuming there
are \( n \) states (this ensures that the teleportation probability is uniformly
distributed).
Example with a 3-state system:
Let's assume we have a system with 3 states and the following transition matrix
( P ):
\[
P = \begin{bmatrix}
0.5 & 0.25 & 0.25 \\
0.3 & 0.4 & 0.3 \\
0.2 & 0.3 & 0.5
\end{bmatrix}
\]
The uniform matrix ( Q ) for a 3-state system (where each state has an equal
chance of being selected) is:
\[
Q = \begin{bmatrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
\end{bmatrix}
= \begin{bmatrix}
0.3333 & 0.3333 & 0.3333 \\
0.3333 & 0.3333 & 0.3333 \\
0.3333 & 0.3333 & 0.3333
\end{bmatrix}
\]
\[
T = \begin{bmatrix}
0.75 \cdot 0.5 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.25 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.25 + 0.25 \cdot 0.3333 \\
0.75 \cdot 0.3 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.4 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.3 + 0.25 \cdot 0.3333 \\
0.75 \cdot 0.2 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.3 + 0.25 \cdot 0.3333 & 0.75 \cdot 0.5 + 0.25 \cdot 0.3333
\end{bmatrix}
\]
\[
T = \begin{bmatrix}
0.375 + 0.0833 & 0.1875 + 0.0833 & 0.1875 + 0.0833 \\
0.225 + 0.0833 & 0.3 + 0.0833 & 0.225 + 0.0833 \\
0.15 + 0.0833 & 0.225 + 0.0833 & 0.375 + 0.0833
\end{bmatrix}
\]
\[
T = \begin{bmatrix}
0.4583 & 0.2708 & 0.2708 \\
0.3083 & 0.3833 & 0.3083 \\
0.2333 & 0.3083 & 0.4583
\end{bmatrix}
\]
This is the teleportation probability matrix, where each element represents the
probability of transitioning from one state to another, including the probability
of teleporting (randomly jumping to another state) with probability 0.25.
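The same computation in a short NumPy sketch (reproducing the matrices above):
```python
import numpy as np

a = 0.25                                   # teleportation probability
P = np.array([[0.5, 0.25, 0.25],
              [0.3, 0.4,  0.3 ],
              [0.2, 0.3,  0.5 ]])          # original transition matrix
n = P.shape[0]
Q = np.full((n, n), 1.0 / n)               # uniform teleportation matrix

T = (1 - a) * P + a * Q                    # teleportation-adjusted matrix
print(T.round(4))
# [[0.4583 0.2708 0.2708]
#  [0.3083 0.3833 0.3083]
#  [0.2333 0.3083 0.4583]]
```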
We will first compute the five-number summary (minimum, first quartile, median,
third quartile, maximum) for the following sorted data:
[ 380, 490, 510, 560, 670, 780, 880, 990, 1090, 1560 ]
1. Minimum: The smallest value is 380.
2. Maximum: The largest value is 1560.
3. Median (Q2): The median is the middle value. Since there are 10 data points
(an even number), the median is the average of the 5th and 6th values.
The 5th and 6th values are 670 and 780, so:
\[ \text{Median} = \frac{670 + 780}{2} = \frac{1450}{2} = 725 \]
4. First Quartile (Q1): Q1 is the median of the lower half of the data (values
before the median). The lower half is:
[ 380, 490, 510, 560, 670 ]
The median of this set is the 3rd value, which is 510.
5. Third Quartile (Q3): Q3 is the median of the upper half of the data (values
after the median). The upper half is:
[ 780, 880, 990, 1090, 1560 ]
The median of this set is the 3rd value, which is 990.
Five-Number Summary:
- Minimum = 380
- Q1 = 510
- Median (Q2) = 725
- Q3 = 990
- Maximum = 1560
1. Draw a number line to scale (covering the range of the data from 380 to
1560).
2. Plot the five numbers:
- Draw a vertical line at each of the five summary points (minimum, Q1,
median, Q3, and maximum).
3. Box: Draw a box from Q1 (510) to Q3 (990), with a vertical line inside the
box at the median (725).
4. Whiskers: Draw lines (whiskers) from the box to the minimum (380) and
maximum (1560).
5. Outliers: Values that fall more than 1.5 times the interquartile range (IQR)
below Q1 or above Q3 are marked as outliers. Here, IQR = 990 - 510 = 480, so the
fences are 510 - 720 = -210 and 990 + 720 = 1710; every value lies within this
range, so there are no outliers.
```
Min              Q1    Median    Q3                  Max
380 |------------[ 510 |  725  | 990 ]----------------| 1560
```
Explanation:
- The box spans from Q1 (510) to Q3 (990).
- The line inside the box represents the median (725).
- The whiskers extend from the min (380) to Q1 (510) and from Q3 (990) to the
max (1560).
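A small Python sketch that reproduces the five-number summary using the median-of-halves convention from the worked example (library quantile functions may interpolate quartiles slightly differently):
```python
def five_number_summary(data):
    """Five-number summary using the median-of-halves (Tukey-style) convention."""
    x = sorted(data)
    n = len(x)

    def median(vals):
        m = len(vals)
        mid = m // 2
        return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

    lower = x[:n // 2]            # values below the median position
    upper = x[(n + 1) // 2:]      # values above the median position
    return x[0], median(lower), median(x), median(upper), x[-1]

data = [380, 490, 510, 560, 670, 780, 880, 990, 1090, 1560]
print(five_number_summary(data))  # (380, 510, 725, 990, 1560)
```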
6. To apply the K-Nearest Neighbors (K-NN) algorithm and find the class
label for the query data point (13, 15), we'll follow the steps of the
algorithm. The key components of K-NN are a distance metric (Euclidean
distance here), the number of neighbors \( K \), and a majority vote among
the \( K \) nearest neighbors. The training data is:
| X1 | X2 | Class |
|-----|-----|-------|
| 10 | 12 | A |
| 15 | 14 | B |
| 13 | 15 | A |
| 16 | 17 | B |
| 14 | 16 | A |
| 12 | 13 | B |
| 11 | 11 | A |
For this example, let’s choose \( K = 3 \), meaning we’ll consider the 3 nearest
neighbors to the query point (13, 15). Using Euclidean distance, the nearest
training points are (13, 15) with class A (distance 0), (14, 16) with class A
(distance ≈ 1.41), and then (15, 14) and (12, 13), both with class B (distance
≈ 2.24, tied). Whichever tied point is taken as the third neighbor, the majority
vote is 2 A's to 1 B, so the predicted class is A.
Conclusion:
- Query Point: (13, 15)
- K = 3
- Class Label: A
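A minimal K-NN sketch in Python that reproduces this result (Euclidean distance, majority vote; the helper name is illustrative):
```python
import math
from collections import Counter

# Training data from the table above: (x1, x2, class label).
train = [(10, 12, "A"), (15, 14, "B"), (13, 15, "A"), (16, 17, "B"),
         (14, 16, "A"), (12, 13, "B"), (11, 11, "A")]

def knn_predict(query, train, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    dists = sorted(
        (math.dist(query, (x1, x2)), label) for x1, x2, label in train
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

print(knn_predict((13, 15), train, k=3))   # 'A'
```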
Problem Setup:
Given:
- Inputs: \( x_1 = 0.4 \), \( x_2 = -0.7 \)
- True Output (target): \( y_{\text{true}} = 0.1 \)
- Initial weights:
  - For the hidden layer (\( \mathbf{v}^1 \)): weights between inputs and hidden
layer neurons, say \( v_1 = 0.5 \) and \( v_2 = -0.3 \) (for simplicity).
  - For the output layer (\( \mathbf{w}^1 \)): weights between hidden neurons and
the output neuron, say \( w = 0.8 \).
- Activation function: Sigmoid function is used for both the hidden layer and
output layer.
For the hidden neuron \( h_1 \), the weighted input is:
\[ z_1 = v_1 x_1 + v_2 x_2 = (0.5)(0.4) + (-0.3)(-0.7) = 0.2 + 0.21 = 0.41 \]
Apply the sigmoid activation function:
\[ h_1 = f(z_1) = \frac{1}{1 + e^{-z_1}} = \frac{1}{1 + e^{-0.41}} \approx \frac{1}{1 + 0.663} \approx 0.60 \]
---
The goal of backpropagation is to compute the gradients of the loss with respect
to the weights and biases in the network. We’ll apply the chain rule of calculus.
3.1 Compute the Gradient of the Loss with Respect to the Output Weight \( w \)
First, we compute the gradient of the loss with respect to the output weight \( w \).
Using the chain rule:
\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial z_{\text{output}}} \cdot \frac{\partial z_{\text{output}}}{\partial w} \]
3.2 Compute the Gradient of the Loss with Respect to Hidden Weights ( v_1 )
and ( v_2 )
Next, compute the gradients for the hidden layer weights ( v_1 ) and ( v_2 ).
Again, we use the chain rule:
\[ \frac{\partial L}{\partial v_1} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial z_{\text{output}}} \cdot \frac{\partial z_{\text{output}}}{\partial h_1} \cdot \frac{\partial h_1}{\partial v_1} \]
---
Step 4: Update the Weights
Now that we have the gradients, we can update the weights using gradient
descent:
4.1 Update the Output Weight \( w \)
The output weight is updated with the standard gradient-descent rule
\( w^{\text{new}} = w - \eta \cdot \frac{\partial L}{\partial w} \), using the gradient from Step 3.1.
4.2 Update the Hidden Weights \( v_1 \) and \( v_2 \)
For \( v_1 \):
\[ v_1^{\text{new}} = v_1 - \eta \cdot \frac{\partial L}{\partial v_1} = 0.5 - 0.1 \cdot 0.0123 = 0.5 - 0.00123 = 0.49877 \]
For \( v_2 \):
\[ v_2^{\text{new}} = v_2 - \eta \cdot \frac{\partial L}{\partial v_2} = -0.3 - 0.1 \cdot (-0.0086) = -0.3 + 0.00086 = -0.29914 \]
Summary of Updated Weights:
- \( v_1 \approx 0.49877 \)
- \( v_2 \approx -0.29914 \)
- \( w \) is updated analogously using its gradient from Step 3.1.
Conclusion:
One full forward pass, backward pass, and gradient-descent update nudges each
weight slightly in the direction that reduces the loss; repeating this process over
many examples and epochs trains the network.
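A compact Python sketch of the full forward pass, backward pass, and update for this tiny network, assuming a squared-error loss \( L = \tfrac{1}{2}(y_{\text{pred}} - y_{\text{true}})^2 \) and the learning rate \( \eta = 0.1 \) used above; because the worked example does not state its loss convention exactly, the printed gradients may differ slightly from the numbers quoted above:
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs, target, initial weights and learning rate from the example (biases omitted).
x1, x2, y_true = 0.4, -0.7, 0.1
v1, v2, w = 0.5, -0.3, 0.8
eta = 0.1

# Forward pass: one hidden neuron, one output neuron, sigmoid activations.
z1 = v1 * x1 + v2 * x2          # 0.41
h1 = sigmoid(z1)                # ~0.60
y_pred = sigmoid(w * h1)

# Backward pass, assuming squared-error loss L = 0.5 * (y_pred - y_true)^2.
dL_dy = y_pred - y_true
dy_dz = y_pred * (1 - y_pred)   # sigmoid derivative at the output
dh_dz1 = h1 * (1 - h1)          # sigmoid derivative at the hidden unit

grad_w = dL_dy * dy_dz * h1
grad_v1 = dL_dy * dy_dz * w * dh_dz1 * x1
grad_v2 = dL_dy * dy_dz * w * dh_dz1 * x2

# Gradient-descent update.
w -= eta * grad_w
v1 -= eta * grad_v1
v2 -= eta * grad_v2
print(round(v1, 5), round(v2, 5), round(w, 5))
```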
The simple linear regression model is:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
Where:
- \( Y \) is the dependent variable (target),
- \( X \) is the independent variable (input),
- \( \beta_0 \) is the intercept,
- \( \beta_1 \) is the slope of the regression line,
- \( \epsilon \) is the error term (random noise).
The goal of linear regression is to find the best-fit line that minimizes the sum
of squared errors (SSE), or equivalently, the mean squared error (MSE).
Steps to Perform Linear Regression:
Let’s assume we have a small dataset of \( X \) and \( Y \) values, and that applying
the least-squares formulas
\[ \beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad \beta_0 = \bar{Y} - \beta_1 \bar{X} \]
to that data gives a slope of 1.5 and an intercept of 0.3. With both the slope and
the intercept, we can write the equation of the regression line:
\[ Y = \beta_0 + \beta_1 X \]
\[ Y = 0.3 + 1.5X \]
We can use this equation to predict values of \( Y \) for new values of \( X \). For
example, for \( X = 2 \): \( Y = 0.3 + 1.5(2) = 3.3 \).
Final Model:
The equation of the regression line is:
\[ Y = 0.3 + 1.5X \]
Where:
- \( \beta_0 = 0.3 \) is the intercept,
- \( \beta_1 = 1.5 \) is the slope.
This means that for every unit increase in (X), (Y) increases by 1.5 units, and
when (X = 0), (Y = 0.3).
This is a simple linear regression model that predicts the dependent variable (Y)
based on the independent variable (X).
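A short least-squares sketch in Python; since the original dataset is not shown, the \( (X, Y) \) values below are hypothetical and chosen so that the fit matches the coefficients above:
```python
import numpy as np

# Hypothetical (X, Y) pairs purely for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.8, 3.3, 4.8, 6.3, 7.8])

# Closed-form least-squares estimates for simple linear regression.
x_mean, y_mean = X.mean(), Y.mean()
beta_1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"Y = {beta_0:.2f} + {beta_1:.2f} * X")   # Y = 0.30 + 1.50 * X
print(beta_0 + beta_1 * 6.0)                    # predict Y for a new X value
```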
Entropy measures the impurity of a dataset \( S \):
\[ \text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \]
Where:
- \( c \) is the number of classes.
- \( p_i \) is the proportion of elements in class \( i \) in the dataset.
Information gain measures the reduction in entropy obtained by splitting on an
attribute \( A \):
\[ \text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v) \]
Where:
- \( v \) represents a value of attribute \( A \),
- \( S_v \) is the subset of data for which \( A = v \),
- \( |S| \) is the total number of instances in the dataset,
- \( |S_v| \) is the number of instances in subset \( S_v \).
The attribute with the highest information gain is chosen as the best attribute for
splitting the dataset at that step in building the decision tree.
Let’s go through a small example to compute the information gain for different
attributes and choose the best attribute for the root of the decision tree.
Example Dataset:
Consider a dataset of 14 instances with a "Yes"/"No" class label and attributes
including Temperature and Humidity; the class proportions used below are
0.6429 "Yes" and 0.3571 "No".
Step-by-Step Calculation:
1. Entropy of the full dataset:
\[ \text{Entropy}(S) = -\left( 0.6429 \log_2 0.6429 + 0.3571 \log_2 0.3571 \right) \]
Calculating this:
\[ \text{Entropy}(S) = -\left( 0.6429 \times (-0.637) + 0.3571 \times (-1.485) \right) \approx 0.94 \]
We’ll compute the information gain for each attribute by following these steps:
The possible values for "Temperature" are: Hot, Mild, and Cool. We calculate
the entropy for each subset and then compute the information gain for
"Temperature".
- Subset for "Hot" (4 instances): [No, No, Yes, Yes]
  - Entropy = 1 (because there is a 50/50 split between "Yes" and "No")
- Subset for "Cool" (6 instances): [Yes, No, Yes, Yes, Yes, Yes]
  - Entropy ≈ 0.65 (since 5 "Yes" and 1 "No")
- The "Mild" subset is handled the same way, and the weighted average of the
subset entropies gives the information gain for "Temperature".
The possible values for "Humidity" are: High and Normal. We calculate the
entropy for each subset and then compute the information gain for "Humidity".
- Subset for "High" (7 instances): [No, No, Yes, Yes, No, Yes, Yes]
- Entropy = 0.98 (since 4 "Yes" and 3 "No")
- Subset for "Normal" (7 instances): [Yes, Yes, No, Yes, Yes, Yes, Yes]
- Entropy = 0.59 (since 6 "Yes" and 1 "No")
Now, we calculate the weighted average entropy for "Humidity":
\[ \text{Entropy}_{\text{Humidity}}(S) = \frac{7}{14}(0.98) + \frac{7}{14}(0.59) \approx 0.785 \]
so the information gain is
\[ \text{Gain}(S, \text{Humidity}) = 0.94 - 0.785 \approx 0.155 \]
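A small Python sketch of the entropy and information-gain computation for the Humidity split, taking \( \text{Entropy}(S) = 0.94 \) from above (small differences from the hand-rounded values are expected):
```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(subsets):
    """Weighted average entropy of the subsets produced by an attribute split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

# "Humidity" split from the example above.
high   = ["No", "No", "Yes", "Yes", "No", "Yes", "Yes"]    # 4 Yes, 3 No
normal = ["Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes"]  # 6 Yes, 1 No

entropy_S = 0.94                                   # overall entropy from the notes
gain_humidity = entropy_S - weighted_entropy([high, normal])

print(round(entropy(high), 3), round(entropy(normal), 3))  # ~0.985, ~0.592
print(round(gain_humidity, 3))                             # ~0.152
```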
9. The disjunctive sum (also known as the maximum operation) of two fuzzy
sets is a way to combine two fuzzy sets by taking the maximum
membership value for each element. It is commonly used in fuzzy logic and
fuzzy set theory to represent the union of two fuzzy sets.
Given two fuzzy sets \( A \) and \( B \) with membership functions \( \mu_A(x) \) and
\( \mu_B(x) \), the disjunctive sum \( A \cup B \) (or simply \( A + B \)) is a fuzzy set
with a membership function \( \mu_{A \cup B}(x) \) defined as:
\[ \mu_{A \cup B}(x) = \max\left( \mu_A(x), \mu_B(x) \right) \]
That is, for each element ( x ), the membership value in the resulting fuzzy set is
the maximum of the membership values of ( x ) in sets ( A ) and ( B ).
Example:
Fuzzy set \( A \):
| \( x \) | 1 | 2 | 3 | 4 | 5 |
|---------|-----|-----|-----|-----|-----|
| \( \mu_A(x) \) | 0.3 | 0.6 | 1.0 | 0.8 | 0.5 |
Fuzzy set \( B \):
| \( x \) | 1 | 2 | 3 | 4 | 5 |
|---------|-----|-----|-----|-----|-----|
| \( \mu_B(x) \) | 0.7 | 0.4 | 0.9 | 0.6 | 1.0 |
Now, to compute the disjunctive sum \( A \cup B \), we take the maximum of the
membership values for each element:
- \( x = 1 \): \( \max(0.3, 0.7) = 0.7 \)
- \( x = 2 \): \( \max(0.6, 0.4) = 0.6 \)
- \( x = 3 \): \( \max(1.0, 0.9) = 1.0 \)
- \( x = 4 \): \( \max(0.8, 0.6) = 0.8 \)
- \( x = 5 \): \( \max(0.5, 1.0) = 1.0 \)
Now, we can construct the fuzzy set ( A cup B ) using the maximum
membership values we calculated:
| \( x \) | 1 | 2 | 3 | 4 | 5 |
|---------|-----|-----|-----|-----|-----|
| \( \mu_{A \cup B}(x) \) | 0.7 | 0.6 | 1.0 | 0.8 | 1.0 |
- Disjunctive sum of two fuzzy sets is the maximum of the membership values
for each element.
- For each element \( x \), we calculate \( \mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x)) \).
- The resulting fuzzy set contains the maximum membership value for each
element in the union of the two fuzzy sets.
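A minimal Python sketch of the max-based union, using the membership values from the example:
```python
# Membership values from the example above, keyed by element x.
mu_A = {1: 0.3, 2: 0.6, 3: 1.0, 4: 0.8, 5: 0.5}
mu_B = {1: 0.7, 2: 0.4, 3: 0.9, 4: 0.6, 5: 1.0}

# Disjunctive sum (max-based union): take the larger membership per element.
mu_union = {x: max(mu_A[x], mu_B[x]) for x in mu_A}

print(mu_union)   # {1: 0.7, 2: 0.6, 3: 1.0, 4: 0.8, 5: 1.0}
```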