Data Science IV
1) Supervised Learning
Supervised learning is a machine learning method in which we provide labeled sample data to the system in order to train it, and on that basis it predicts the output.
The system creates a model using the labeled data to understand the dataset and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. A typical example of supervised learning is spam filtering.
Supervised learning can be grouped further in two categories of
algorithms:
o Classification
o Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.
The machine is trained on a set of data that has not been labeled, classified, or categorized, and the algorithm must act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or into groups of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result; the machine tries to find useful insights from huge amounts of data. It can be further classified into two categories of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each correct action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the maximum reward points, and in doing so it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
What is a dataset?
A dataset is a collection of data in which data is arranged in some order.
A dataset can contain any data from a series of an array to a database
table. The table below shows an example of a dataset (each row gives a country, an age, a salary, and a Yes/No label; the blank salary in the last row illustrates a missing value):

Country    Age    Salary     Label
India      38     48000      No
France     48     65000      No
Germany    40     (missing)  Yes
For each attribute of the dataset, we follow step 2 of the pseudocode and compute its information gain, starting with the first attribute, Outlook.
At the node considered here, the attribute with the maximum information gain is Wind, so the decision tree built so far is split on Wind.
True positives and true negatives are the observations that are correctly predicted and are therefore shown in green; we want to minimize false positives and false negatives, so they are shown in red. These terms can be a bit confusing, so let's take each term one by one and understand it fully.
True Positives (TP) - These are the correctly predicted positive
values which means that the value of actual class is yes and the value
of predicted class is also yes. E.g. if actual class value indicates that
this passenger survived and predicted class tells you the same thing.
True Negatives (TN) - These are the correctly predicted negative
values which means that the value of actual class is no and value of
predicted class is also no. E.g. if actual class says this passenger did
not survive and predicted class tells you the same thing.
False positives and false negatives occur when the actual class contradicts the predicted class.
False Positives (FP) – When actual class is no and predicted class is
yes. E.g. if actual class says this passenger did not survive but
predicted class tells you that this passenger will survive.
False Negatives (FN) – When the actual class is yes but the predicted class is no. E.g. the actual class value indicates that this passenger survived, but the predicted class tells you that the passenger will die.
Once you understand these four parameters, we can calculate Accuracy, Precision, Recall and F1 score.
Accuracy - Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy then our model is best. Accuracy is a great measure, but only when you have symmetric datasets where the counts of false positives and false negatives are almost the same. Otherwise, you have to look at other parameters to evaluate the performance of your model. For our model, we got 0.803, which means our model is approximately 80% accurate.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all passengers labeled as survived, how many actually survived? High precision relates to a low false positive rate. We got a precision of 0.788, which is pretty good.
Precision = TP / (TP + FP)
Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all the observations in the actual class (yes). The question recall answers is: of all the passengers that truly survived, how many did we label as survived? We got a recall of 0.631, which is good for this model as it is above 0.5.
Recall = TP / (TP + FN)
F1 score - F1 Score is the weighted average of Precision and Recall.
Therefore, this score takes both false positives and false negatives into
account. Intuitively it is not as easy to understand as accuracy, but F1
is usually more useful than accuracy, especially if you have an uneven
class distribution. Accuracy works best if false positives and false
negatives have similar cost. If the cost of false positives and false
negatives are very different, it’s better to look at both Precision and
Recall. In our case, F1 score is 0.701.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
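To make these four formulas concrete, here is a minimal Python sketch (written for these notes, not taken from the original source); the TP/TN/FP/FN counts are invented for illustration.

```python
# Minimal sketch: computing the four metrics from assumed confusion-matrix
# counts. The counts below are made-up example values.
TP, TN, FP, FN = 60, 100, 16, 24

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1_score  = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1_score:.3f}")
```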
So, whenever you build a model, this article should help you to figure
out what these parameters mean and how good your model has
performed.
As we can see from the above graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of a regression model is to find the best-fit line; here we have not obtained a best fit, so the model will generate prediction errors.
As we can see from the above diagram, the model is unable to capture
the data points present in the plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the
machine learning models to achieve the goodness of fit. In statistics
modeling, it defines how closely the result or predicted values match the
true values of the dataset.
The model with a good fit is between the underfitted and overfitted model,
and ideally, it makes predictions with 0 errors, but in practice, it is difficult
to achieve it.
As when we train our model for a time, the errors in the training data go
down, and the same happens with test data. But if we train the model for
a long duration, then the performance of the model may decrease due to
the overfitting, as the model also learn the noise present in the dataset.
The errors in the test dataset start increasing, so the point, just before the
raising of errors, is the good point, and we can stop here for achieving a
good model.
There are two other methods by which we can get a good point for our model,
which are the resampling method to estimate model accuracy and validation
dataset.
What is a Model Parameter?
A model parameter is a configuration variable that is internal to the model and whose
value can be estimated from data.
They are required by the model when making predictions.
Their values define the skill of the model on your problem.
They are estimated or learned from data.
They are often not set manually by the practitioner.
They are often saved as part of the learned model.
Parameters are key to machine learning algorithms. They are the part of the model that
is learned from historical training data.
In classical machine learning literature, we may think of the model as the hypothesis and
the parameters as the tailoring of the hypothesis to a specific set of data.
Often model parameters are estimated using an optimization algorithm, which is a type
of efficient search through possible parameter values.
Statistics: In statistics, you may assume a distribution for a variable, such as a
Gaussian distribution. Two parameters of the Gaussian distribution are the mean
(mu) and the standard deviation (sigma). This holds in machine learning, where
these parameters may be estimated from data and used as part of a predictive
model.
Programming: In programming, you may pass a parameter to a function. In this
case, a parameter is a function argument that could have one of a range of
values. In machine learning, the specific model you are using is the function and
requires parameters in order to make a prediction on new data.
Whether a model has a fixed or variable number of parameters determines whether it
may be referred to as “parametric” or “nonparametric“.
Some examples of model parameters include:
The weights in an artificial neural network.
The support vectors in a support vector machine.
The coefficients in a linear regression or logistic regression.
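As a small hedged illustration (a scikit-learn sketch with an invented toy dataset), the coefficient and intercept of a linear regression are model parameters estimated from the data, not set by the practitioner.

```python
# Sketch: model parameters (coefficient and intercept) are learned from data
# by an optimization routine. The tiny dataset is invented for the example.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one input feature
y = np.array([2.1, 3.9, 6.2, 8.1])           # roughly y = 2x

model = LinearRegression().fit(X, y)

# These learned values are the model parameters.
print("coefficient:", model.coef_)     # slope estimated from the data
print("intercept:  ", model.intercept_)
```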
What is a Model Hyperparameter?
A model hyperparameter is a configuration that is external to the model and whose value
cannot be estimated from data.
They are often used in processes to help estimate model parameters.
They are often specified by the practitioner.
They can often be set using heuristics.
They are often tuned for a given predictive modeling problem.
We cannot know the best value for a model hyperparameter on a given problem. We
may use rules of thumb, copy values used on other problems, or search for the best
value by trial and error.
When a machine learning algorithm is tuned for a specific problem, such as when you are using a grid search or a random search, you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions.
Many models have important parameters which cannot be directly estimated from the
data. For example, in the K-nearest neighbor classification model … This type of model
parameter is referred to as a tuning parameter because there is no analytical formula
available to calculate an appropriate value.
Model hyperparameters are often referred to as model parameters which can make
things confusing. A good rule of thumb to overcome this confusion is as follows:
If you have to specify a model parameter manually then
it is probably a model hyperparameter.
Some examples of model hyperparameters include:
The learning rate for training a neural network.
The C and sigma hyperparameters for support vector machines.
The k in k-nearest neighbors.
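A minimal sketch of hyperparameter tuning (assuming scikit-learn and the Iris dataset, both arbitrary choices): a grid search selects the hyperparameter k of k-nearest neighbors by cross-validation.

```python
# Sketch of hyperparameter tuning with a grid search (scikit-learn);
# the dataset and the candidate values of k are arbitrary choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}   # k is the hyperparameter
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```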
Feature Vector
Feature: a single measurable property of the data, e.g. age, name, height, or weight; in a relational table, every column is a feature.
Feature Vector
A feature vector is a vector that stores the features for a particular
observation in a specific order.
For example, Alice is 26 years old and she is 5' 6" tall. Her feature vector
could be [26, 5.5] or [5.5, 26] depending on your choice of how to order the
elements. The order is only important insofar as it is consistent
It is the representation of a particular row in a relational table: each row is a feature vector, and row n is the feature vector for the nth sample.
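A tiny sketch of the idea (Bob's values are invented): a feature vector is an ordered array of feature values, and stacking such rows gives the data matrix.

```python
# Minimal sketch: Alice's feature vector as described above.
# The order [age, height] is an arbitrary but consistent choice.
import numpy as np

alice = np.array([26, 5.5])       # [age in years, height in feet]
bob   = np.array([31, 6.0])       # another (invented) observation

# A dataset is then a matrix whose rows are feature vectors.
data = np.vstack([alice, bob])
print(data.shape)                 # (2, 2): 2 samples, 2 features
```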
Feature Set: the set of features used to predict the output variable.
Example: to predict the age of a particular person we need to know the year of birth. Here the feature set = {year of birth}.
Normally good feature set can be identified using expert domain knowledge
or mathematical approach.
Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a new data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
Below are some points to remember while selecting the value of K in the K-NN algorithm: there is no particular way to determine the best value of K, so a commonly preferred starting value is K = 5, while a very low value such as K = 1 or K = 2 can be noisy and sensitive to outliers.
Advantages of the K-NN algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
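A hedged sketch of K-NN in practice (the dataset, split, and K = 5 are arbitrary choices, not from the original notes):

```python
# Sketch of K-NN classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 nearest neighbours
knn.fit(X_train, y_train)

# Each new point is assigned the majority class among its 5 neighbours.
print("predicted class of first test point:", knn.predict(X_test[:1]))
print("test accuracy:", round(knn.score(X_test, y_test), 3))
```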
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they may be points other than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
o Let's take the number of clusters K = 2, to identify the dataset and to put the points into different clusters. It means we will try to group these data points into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. Here we select the two points shown below as K points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this using the mathematics we have studied for calculating the distance between two points, and we draw a median line between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 (blue) centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o Next, we will reassign each data point to its new closest centroid. For this, we repeat the same process of finding a median line. The median will be as in the image below:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line, so these three points will be assigned to new centroids.
o As we have new centroids, we again draw the median line and reassign the data points, giving the image below:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our clustering has converged. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
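The same procedure can be run with a library implementation; below is a hedged scikit-learn sketch with K = 2 on randomly generated two-variable data (the data is invented, not the M1/M2 example above).

```python
# Sketch of K-means with scikit-learn and K = 2.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of points in a two-variable plane.
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print("final centroids:\n", kmeans.cluster_centers_)
print("cluster of each point:", kmeans.labels_)
```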
What is Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of input features of a dataset. It is commonly used for obtaining a better-fit predictive model while solving classification and regression problems.
DECISION BOUNDARY
The goal of logistic regression, is to figure out some way to split the
datapoints to have an accurate prediction of a given observation’s class
using the information present in the features.
Let’s suppose we define a line that describes the decision boundary. So, all
of the points on one side of the boundary shall have all the datapoints
belong to class A and all of the points on one side of the boundary shall
have all the datapoints belong to class B.
S(z) = 1 / (1 + e^(-z))
p >= 0.5 : class = A
p < 0.5 : class = B
If our threshold is 0.5 and our prediction function returns 0.7, we classify this observation as belonging to class A. If the prediction is 0.2, we classify the observation as belonging to class B.
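A minimal sketch of this decision rule (the helper names sigmoid and classify are my own, not from the notes):

```python
# Squash a score z through the sigmoid, then compare the probability
# with the 0.5 threshold to pick a class.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def classify(p: float, threshold: float = 0.5) -> str:
    return "A" if p >= threshold else "B"

for p in (0.7, 0.2):
    print(f"p = {p} -> class {classify(p)}")

print("sigmoid(0.0) =", sigmoid(0.0))   # 0.5, i.e. exactly on the boundary
```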
Characteristics of Perceptron
Geometric Interpretation
(Steps 1-6 of the geometric interpretation were shown as a sequence of figures in the original notes.)
Perceptron Algorithm :
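Since the pseudocode itself is not reproduced above, here is a hedged sketch of the classic perceptron learning rule (the AND-gate data, learning rate, and number of passes are assumptions made for illustration).

```python
# Sketch of the perceptron learning rule on AND-gate data.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND labels

w = np.zeros(2)      # weights
b = 0.0              # bias
lr = 0.1             # learning rate

for _ in range(20):                       # a few passes over the data
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        error = target - pred             # 0, +1 or -1
        w += lr * error * xi              # update only on mistakes
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [(1 if np.dot(w, xi) + b > 0 else 0) for xi in X])
```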
Feature selection is a way of selecting the subset of the most relevant features from the
original features set by removing the redundant, irrelevant, or noisy features.
Numerical input, numerical output:
o Pearson's correlation coefficient (for linear correlation).
o Spearman's rank coefficient (for non-linear correlation).
Numerical input, categorical output:
o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).
Categorical input, numerical output:
o Kendall's rank coefficient (linear).
o ANOVA correlation coefficient (nonlinear).
Categorical input, categorical output:
o Chi-Squared test (contingency tables).
o Mutual Information.
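A hedged SciPy sketch of these measures (all arrays and the contingency table are invented examples):

```python
# Correlation and association tests used in filter-based feature selection.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.3, 7.9, 10.2])

print("Pearson r:   ", stats.pearsonr(x, y)[0])    # linear correlation
print("Spearman rho:", stats.spearmanr(x, y)[0])   # rank correlation
print("Kendall tau: ", stats.kendalltau(x, y)[0])  # another rank measure

# Numerical input, categorical output: one-way ANOVA across group means.
group_a = [5.1, 4.9, 5.4]
group_b = [6.8, 7.1, 6.5]
print("ANOVA result:", stats.f_oneway(group_a, group_b))

# Categorical input, categorical output: chi-squared test on a 2x2 table.
table = np.array([[20, 15], [10, 30]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print("Chi-squared:", chi2, "p-value:", p)
```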
Pruning
Pruning is a data compression technique in machine learning and search
algorithms that reduces the size of decision trees by removing sections of the tree
that are non-critical and redundant to classify instances. Pruning reduces the
complexity of the final classifier, and hence improves predictive accuracy by the
reduction of overfitting.
One of the questions that arises in a decision tree algorithm is the optimal size of the
final tree. A tree that is too large risks overfitting the training data and poorly
generalizing to new samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell when a tree algorithm
should stop because it is impossible to tell if the addition of a single extra node will
dramatically decrease error. This problem is known as the horizon effect. A common
strategy is to grow the tree until each node contains a small number of instances
then use pruning to remove nodes that do not provide additional information. [1]
Pruning should reduce the size of a learning tree without reducing predictive
accuracy as measured by a cross-validation set. There are many techniques for tree
pruning that differ in the measurement that is used to optimize performance.
Now, since you have an idea of what is feature scaling. Let us explore
what methods are available for doing feature scaling. Of all the methods
available, the most common ones are:
Normalization
Normalization (min-max scaling) rescales a feature to the range [0, 1]:
x_new = (x - min(x)) / (max(x) - min(x))
Here, max(x) and min(x) are the maximum and the minimum values of the feature respectively.
Feature standardization makes the values of each feature in the data have zero mean and unit variance. The general method of calculation is to determine the distribution mean and standard deviation for each feature and calculate the new data point by the following formula:
x_new = (x - mean(x)) / std(x)
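A small NumPy sketch of both formulas (the feature values are invented):

```python
# Min-max normalization and standardization of one feature.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling) to [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance.
x_std = (x - x.mean()) / x.std()

print("normalized:   ", x_norm)
print("standardized: ", x_std)
print("mean, std after standardization:", x_std.mean(), x_std.std())
```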
Models can be evaluated using multiple metrics. However, the right choice of evaluation metric is crucial and often depends upon the problem being solved. A clear understanding of a wide range of metrics helps the evaluator choose an appropriate match between the problem statement and a metric.
Classification metrics
                 Actual 0                 Actual 1
Predicted 0      True Negatives (TN)      False Negatives (FN)
Predicted 1      False Positives (FP)     True Positives (TP)
Accuracy
Accuracy is the simplest metric and can be defined as the number of test
cases correctly classified divided by the total number of test cases.
It can be applied to most generic problems but is not very useful when it
comes to unbalanced datasets.
For instance, if we are detecting frauds in bank data, the ratio of fraud to
non-fraud cases can be 1:99. In such cases, if accuracy is used, the model
will turn out to be 99% accurate by predicting all test cases as non-fraud.
The 99% accurate model will be completely useless.
If a model is poorly trained such that it predicts all the 1000 (say) data points as non-frauds, it will miss the 10 fraud data points. If accuracy is measured, it will show that the model correctly predicts 990 data points, and thus it will have an accuracy of (990/1000)*100 = 99%!
Therefore, for such a case, a metric is required that can focus on the ten
fraud data points which were completely missed by the model.
Precision
Precision tells us how many of the cases predicted as positive are actually positive (TP / (TP + FP), as defined earlier).
Recall
Recall tells us the number of positive cases correctly identified out of the
total number of positive cases.
Going back to the fraud problem, the recall value will be very useful in
fraud cases because a high recall value will indicate that a lot of fraud
cases were identified out of the total number of frauds.
F1 Score
It is useful in cases where both recall and precision can be valuable – like
in the identification of plane parts that might require repairing. Here,
precision will be required to save on the company’s cost (because plane
parts are extremely expensive) and recall will be required to ensure that
the machinery is stable and not a threat to human lives.
Regression metrics
MSE is a simple metric that calculates the difference between the actual
value and the predicted value (error), squares it and then provides the
mean of all the errors.
MSE is very sensitive to outliers and will show a very high error value even
if a few outliers are present in the otherwise well-fitted model predictions.
RMSE is the root of MSE and is beneficial because it helps to bring down
the scale of the errors closer to the actual values, making it more
interpretable.
RMSLE (root mean squared logarithmic error) applies the error calculation to log-transformed values:
RMSLE = sqrt( mean( (log(1 + y) - log(1 + x))^2 ) )
where x is the actual value and y is the predicted value. This helps to scale down the effect of the outliers by downplaying the higher error rates with the log function. Also, RMSLE helps to capture a relative error (by comparing all the error values) through the use of logs.
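A small NumPy sketch of MSE, RMSE, and RMSLE (the actual and predicted values are invented):

```python
# Regression error metrics on a tiny made-up example.
import numpy as np

actual    = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.5, 7.0, 12.0])

mse  = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)
# RMSLE: errors are taken on log(1 + value), which dampens large outliers
# and measures relative rather than absolute error.
rmsle = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

print(f"MSE:   {mse:.4f}")
print(f"RMSE:  {rmse:.4f}")
print(f"RMSLE: {rmsle:.4f}")
```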
In machine learning, there is always a need to test the stability of the model; we cannot judge a model based only on how well it fits the training dataset. For this purpose, we reserve a particular sample of the dataset that was not part of the training dataset, and we test our model on that sample before deployment. This complete process comes under cross-validation, and it is somewhat different from the general train/test split.
There are some common methods that are used for cross-validation.
These methods are given below:
In the validation set approach, we divide our input dataset into a training set and a test (validation) set, with each subset getting 50% of the dataset.
But it has one big disadvantage: since we use only 50% of the data to train the model, the model may miss important information in the dataset, and it also tends to give an underfitted model.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data: if there are n data points in the original input dataset, then n - p data points are used as the training set and the p data points as the validation set. This process is repeated for all possible samples, and the average error is calculated to know the effectiveness of the model.
When p = 1, this becomes leave-one-out cross-validation, with the following properties:
o The bias is minimal, as all the data points are used.
o The process is executed n times, so the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we iteratively test against a single data point.
K-Fold Cross-Validation
K-fold cross-validation divides the input dataset into k groups (folds) of equal size; in each iteration one fold is used as the test set and the remaining k - 1 folds are used for training, and the results of the k runs are averaged. Compare this with the simpler splits below:
o Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio such as 70:30 or 80:20. It provides high variance, which is one of its biggest disadvantages.
o Training Data: The training data is used to train the model,
and the dependent variable is known.
o Test Data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the results. It can be used if we want to optimize a model that has been trained on the training dataset for the best performance. It is more efficient than a single train/test split, as every observation is used for both training and testing.
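A hedged scikit-learn sketch of k-fold cross-validation (the dataset, the model, and k = 5 are arbitrary choices):

```python
# k-fold cross-validation: each observation is used once for testing
# and k-1 times for training.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```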
Hypothesis Testing?
Any data science project starts with exploring the data. When we perform
an analysis on a sample through exploratory data analysis and inferential
statistics, we get information about the sample. Now, we want to use this
information to predict values for the entire population.
1. Formulate a Hypothesis
2. Determine the significance level
3. Determine the type of test
4. Calculate the Test Statistic values and the p values
5. Make Decision
One of the key steps to do this is to formulate the below two hypotheses:
o Null hypothesis (H0): there is no significant effect or difference; the observed result is due to chance.
o Alternative hypothesis (H1): there is a significant effect or difference.
Type of predictor variable    Distribution type    Desired Test       Attributes
Categorical                   NA                   Chi-Square test    Test of independence; Goodness of fit
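A hedged SciPy sketch of the chi-square test of independence (the contingency table and the 0.05 significance level are assumptions made for illustration):

```python
# Chi-square test of independence on a small hypothetical 2x2 table.
from scipy.stats import chi2_contingency

observed = [[30, 10],
            [20, 40]]          # purely hypothetical counts

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square statistic:", round(chi2, 3))
print("p-value:", round(p_value, 4))

alpha = 0.05                   # chosen significance level
if p_value < alpha:
    print("Reject the null hypothesis: the variables appear dependent.")
else:
    print("Fail to reject the null hypothesis.")
```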
1) Type 1 Error – This occurs when the null hypothesis is true but we reject it. The probability of a type I error is denoted by alpha (α). The type 1 error rate is also known as the level of significance of the hypothesis test.
2) Type 2 Error – This occurs when the null hypothesis is false but we fail to reject it. The probability of a type II error is denoted by beta (β).
Debugging Learning Algorithms,
Models are being deployed on increasingly larger tasks and datasets, and
the more the scale grows, the more important it is to debug your model.
Dimension error
Variable
Data goes through a long process starting from preparation, cleaning, and
more. In this process, developers often get confused or forget correct data
variables. So, to stay on the correct path, it’s good practice to use a data
flow diagram before architecting our models. This will help us find
the correct data variable names, model flow, and expected results.
To figure out whether our data contains predictive information, try with humans first: if humans cannot predict from the data (image or text), then our ML model will not make any difference either. Feeding it more inputs still will not help, and chances are that the model will lose accuracy.
External data, let’s say data found on the internet or open-sourced, can be
useful. Once you collect that data and label it, you can then use it for
training. It can also be used for many other tasks. Just like external data,
we can also use an external model which was trained by another person
and reuse it for our task.
Using a high-quality but small dataset is the best way to train a simple
model. Sometimes, when you use large training data sets you can waste
too many resources and money.
Hyperparameter Tuning
Verification strategy
With verification strategies, we can find issues that aren’t related to the
actual model. We can verify the integrity of a model (i.e. verifying that it
hasn’t been changed or corrupted), or if the model is correct and
maintainable. Many practices have evolved for verification, like automated
generation of test data sequences, running multiple analyses with
different sets of input values, and performing validation checks when
importing data into a file.
Understanding the Bias-Variance Tradeoff
So let’s start with the basics and see how they make difference to our
machine learning Models.
Bias
Bias is the difference between the average prediction of our model and the
correct value which we are trying to predict. Model with high bias pays
very little attention to the training data and oversimplifies the model. It
always leads to high error on training and test data.
Variance
Variance is the variability of a model's predictions for a given data point, i.e. how much the predictions would change if the model were trained on different samples of the data; a model with high variance pays too much attention to the training data and does not generalize well to unseen data.
If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias.
So we need to find the right/good balance without overfitting and
underfitting the data.
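A small NumPy sketch (with invented sine-plus-noise data) that fits a very simple and a very flexible polynomial to the same sample, so the training and test errors of a high-bias and a high-variance model can be compared:

```python
# Compare a degree-1 fit (few parameters, more bias) with a degree-10 fit
# (many parameters, more variance) on noisy data from a sine curve.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 3, 20))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 20)
x_test = np.linspace(0, 3, 50)
y_test = np.sin(x_test)

for degree in (1, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```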
For a univariate function, convexity means that the line segment connecting two points on the function's curve lies on or above the curve (it does not cross it); if it does cross, the function has a local minimum which is not a global one. Mathematically, for two points x₁, x₂ lying on the function's curve this condition is expressed as:
f(λx₁ + (1 - λ)x₂) ≤ λf(x₁) + (1 - λ)f(x₂)
where λ denotes a point's location on the section line and its value has to be between 0 (left point) and 1 (right point); e.g. λ = 0.5 means a location in the middle.
Let’s stop here for a moment. We see that the first derivative equal zero at
x=0 and x=1.5. This places are candidates for function’s extrema
(minimum or maximum )— the slope is zero there. But first we have to
check the second derivative first.
The value of this expression is zero for x=0 and x=1. These locations are
called an inflexion point — a place where the curvature changes sign —
meaning it changes from convex to concave or vice-versa. By analysing
this equation we conclude that :
Now we see that point x=0 has both first and second derivate equal to
zero meaning this is a saddle point and point x=1.5 is a global minimum.
Let’s look at the graph of this function. As calculated before a saddle point
is at x=0 and minimum at x=1.5.
The upside-down triangle is the so-called nabla symbol, read "del". To better understand how to calculate the gradient, let's do a hand calculation for an exemplary 2-dimensional function below.
so consequently:
If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.
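A minimal sketch of gradient descent on a simple convex function (the function f(x) = x² - 4x + 4, the starting point, and the learning rate are assumptions, not the example used in the notes):

```python
# Plain gradient descent on f(x) = x^2 - 4x + 4, whose minimum is at x = 2.
def f(x):
    return x ** 2 - 4 * x + 4

def grad_f(x):
    return 2 * x - 4          # derivative of f

x = 10.0                      # starting point
learning_rate = 0.1           # too large a value can make this diverge

for step in range(50):
    x = x - learning_rate * grad_f(x)

print("x after 50 steps:", round(x, 4))   # close to the minimizer x = 2
print("f(x):", round(f(x), 6))
```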
The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
So to separate these data points, we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as in the image below:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle of radius 1:
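A hedged sketch of this kernel-style trick (the circular toy data is invented): adding the feature z = x² + y² makes the two classes separable by a linear SVM in the lifted space.

```python
# Points that are not linearly separable in (x, y) become separable after
# adding the feature z = x^2 + y^2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Inner class: points near the origin; outer class: points on a ring.
angles = rng.uniform(0, 2 * np.pi, 40)
inner = rng.normal(0, 0.5, (40, 2))
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
X = np.vstack([inner, outer])
y = np.array([0] * 40 + [1] * 40)

# Add the third dimension z = x^2 + y^2 and fit a *linear* SVM in 3-D.
z = (X ** 2).sum(axis=1, keepdims=True)
X3 = np.hstack([X, z])

clf = SVC(kernel="linear").fit(X3, y)
print("training accuracy in the lifted space:", clf.score(X3, y))
```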
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes
Credit_rating = Fair)
Training set:
AGE      INCOME    STUDENT    CREDIT_RATING    BUYS_COMPUTER
<=30 HIGH NO FAIR NO
<=30 HIGH NO EXCELLENT NO
31..40 HIGH NO FAIR YES
>40 MEDIUM NO FAIR YES
>40 LOW YES FAIR YES
>40 LOW YES EXCELLENT NO
31..40 LOW YES EXCELLENT YES
<=30 MEDIUM NO FAIR NO
<=30 LOW YES FAIR YES
>40 MEDIUM YES FAIR YES
<=30 MEDIUM YES EXCELLENT YES
31..40 MEDIUM NO EXCELLENT YES
31..40 HIGH YES FAIR YES
>40 MEDIUM NO EXCELLENT NO
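The prediction for X can be worked out directly from the counts in this table. Below is a hedged Python sketch (written for these notes) that encodes the training set and compares P(X|C)·P(C) for the two classes.

```python
# Naive Bayes by counting: estimate P(C) and P(attribute value | C) from the
# table above, then compare P(X|C) * P(C) for the query X.
rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

for c in ("yes", "no"):
    class_rows = [r for r in rows if r[4] == c]
    prior = len(class_rows) / len(rows)            # P(C)
    likelihood = 1.0
    for i, value in enumerate(X):                  # naive independence assumption
        likelihood *= sum(r[i] == value for r in class_rows) / len(class_rows)
    print(f"P(X|{c}) * P({c}) = {likelihood * prior:.4f}")
# The larger product gives the predicted class (here: buys_computer = 'yes').
```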
Advantages
Easy to implement
Disadvantages
It assumes class-conditional independence of the attributes, which rarely holds in practice and can reduce accuracy.
Multi-Layer Networks
MLP networks are used in a supervised learning setting. A typical learning algorithm for MLP networks is called the back-propagation algorithm.
Input values
X1=0.05
X2=0.10
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
To find the value of H1 we first multiply the input value from the weights
as
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
To calculate the outputs of H1 and H2, we apply the sigmoid function to these values: sigmoid(0.3775) = 0.593269992 and sigmoid(0.3925) = 0.596884378.
To find the value of y1, we multiply these hidden-layer outputs by the weights:
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
To calculate the final result of y1 we apply the sigmoid function: sigmoid(1.10590597) = 0.75136507. Similarly, for y2:
y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
To calculate the final result of y2 we again apply the sigmoid function: sigmoid(1.2249214) = 0.772928465.
Our target values are 0.01 and 0.99; the outputs for y1 and y2 do not match the target values T1 and T2.
Now we will find the total error, which is simply the sum of the squared differences between the target outputs and the actual outputs:
E_total = Σ ½(target - output)² = ½(0.01 - 0.75136507)² + ½(0.99 - 0.772928465)² = 0.274811083 + 0.023560026 = 0.298371109
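A short Python sketch that reproduces this forward pass and the total error (only the forward computation is shown; the backward pass is not reproduced here):

```python
# Forward pass of the two-input, two-hidden, two-output example above.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

h1 = sigmoid(x1 * w1 + x2 * w2 + b1)    # hidden-layer outputs
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)    # output-layer outputs
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

e_total = 0.5 * (t1 - y1) ** 2 + 0.5 * (t2 - y2) ** 2
print(f"out_H1 = {h1:.9f}, out_H2 = {h2:.9f}")
print(f"out_y1 = {y1:.9f}, out_y2 = {y2:.9f}")
print(f"E_total = {e_total:.9f}")
```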
Transaction-id    Items bought
10                A, B, D
20                A, C, D
30                A, D, E
40                B, E, F
50                B, C, D, E, F
An item is considered a frequent item if its occurrence in the transactions is greater than or equal to the minimum threshold support.
For the association rules, the support and confidence are given below:
Apriori Algorithm:
Method:
Pseudo-code:
Step 1: self-joining Lk
Step 2: pruning
Example of candidate generation:
Self-joining: L3 * L3. For example, if L3 = {abc, abd, acd, ace, bcd}, joining abc with abd gives abcd, and joining acd with ace gives acde.
Pruning: acde is removed because its subset ade is not in L3, so
C4 = {abcd}
A small example of a frequent itemset and the rules generated from it:
Itemset      sup
{B, C, E}    2
Rules such as E → CB, B → E and E → B can be generated from this itemset.
Hence these rules can be used for basket data analysis, cross-marketing, catalog design, and sale campaign analysis.
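A compact, hedged sketch of the level-wise Apriori procedure (self-join then prune) applied to the transaction table above with a minimum support of 2; this is an illustrative implementation, not the pseudo-code from the original notes.

```python
# Level-wise frequent-itemset mining in the spirit of Apriori.
from itertools import combinations

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
min_support = 2

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent single items.
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 1
while level:
    print(f"L{k}:", [(sorted(s), support(s)) for s in level])
    # Self-join: combine frequent k-itemsets into (k+1)-candidates,
    # then prune candidates that have an infrequent k-subset.
    candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(sub) in level for sub in combinations(c, k))}
    level = [c for c in candidates if support(c) >= min_support]
    k += 1
```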
Association rules
• An association rule analyzes and predicts customer behavior.
• Association rules are like if/then statements.
Example:
Bread → butter
Buys {onions, potatoes} → buys tomatoes
Parts of an Association Rule:
Bread → butter [20%, 45%]
Bread : antecedent
Butter : consequent
20% : support
45% : confidence
Support and Confidence:
A → B
Support denotes the probability that a transaction contains both A and B.
Confidence denotes the probability that a transaction containing A also contains B.
Example
Consider a supermarket with 100 total transactions, of which 20 contain bread.
So, 20/100 × 100 = 20%, which is the support.
Of those 20 transactions, 9 also contain butter.
So, 9/20 × 100 = 45%, which is the confidence.
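A tiny sketch that reproduces the arithmetic of this example:

```python
# Support and confidence for the bread -> butter example, computed exactly
# as in the worked numbers above.
total_transactions = 100
bread = 20                 # transactions containing bread
bread_and_butter = 9       # of those, transactions that also contain butter

support = bread / total_transactions * 100          # as computed in the example
confidence = bread_and_butter / bread * 100
print(f"support = {support:.0f}%, confidence = {confidence:.0f}%")   # 20%, 45%
```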
Classification of Association Rule:
Single-Dimensional Association Rule
Bread → butter
Dimension : buying
Multidimensional Association Rule
With 2 or more predicates or dimensions, e.g.
Occupation (I.T.), Age (>22) → buys (laptop)
Hybrid Association Rule
Time (5 o'clock), buys (tea) → buys (biscuits)
Applications where the association rules are used:
Web Usage Mining
Banking
Bio Informatics
Market based Analysis
Credit / debit card analysis
Product clustering
Catalog design
Clustering applications and requirements.
Clustering:
• Partitioning the data into subclasses
• It is grouping of similar objects
• Partitioning of data based on similarity
Eg: library
Here the books are grouped by subject, author etc.
Cluster Analysis:
• Cluster: a collection of data objects
– Here objects similar to one another are kept within the same
cluster
– Dissimilar objects are kept in other clusters
• Cluster analysis
– It is the process of finding similarities between data according
to the characteristics found in the data and grouping similar
data objects into clusters.
• Clustering is Unsupervised learning, it means there are no
predefined classes.
Raw data → clustering algorithm → clusters of data
Different Representations Of Clustering
APPLICATIONS OF CLUSTERING
• MARKET RESEARCH
• WWW
• PATTERN RECOGNITION
• IMAGE PROCESSING
• DATA MINING
EXAMPLES OF CLUSTERING
Search Engine
Social Network Analysis
Genetics
Marketing
Requirements of Clustering in Data Mining
Scalability
We need highly scalable clustering algorithms to deal with large
databases.
Ability to deal with different kinds of attributes
Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
Data matrix (two modes): the data is represented as an n × p matrix, where row i contains the p measurements for object i:

    x_11  ...  x_1f  ...  x_1p
    ...
    x_i1  ...  x_if  ...  x_ip
    ...
    x_n1  ...  x_nf  ...  x_np

Dissimilarity matrix (one mode): an n × n table whose entry d(i, j) is the dissimilarity between objects i and j, with d(i, i) = 0; only the lower triangle needs to be stored:

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    :        :        :
    d(n,1)   d(n,2)   ...   0
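A small SciPy sketch showing how a dissimilarity (here Euclidean distance) matrix is derived from a data matrix (the 4 × 2 data matrix is invented):

```python
# From data matrix to dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

data_matrix = np.array([[1.0, 2.0],
                        [2.0, 0.0],
                        [4.0, 4.0],
                        [5.0, 1.0]])     # n = 4 objects, p = 2 variables

# pdist returns the condensed lower triangle; squareform expands it to n x n.
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(np.round(dissimilarity_matrix, 2))
```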
Type of data in clustering analysis
Interval-valued variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables:
Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, and latitude and longitude coordinates. To standardize measurements so that all variables have equal weight, the mean absolute deviation and the z-score are computed as:

    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

    z_if = (x_if - m_f) / s_f

where m_f = (1/n)(x_1f + x_2f + ... + x_nf) is the mean of variable f.
The advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small, so the outliers remain detectable.
Binary Variables
A binary variable has only two states: 0 or 1, where 0 means that the
variable is absent, and 1 means that it is present.
d(i, j) = (p - m) / p
where p is the total number of variables and m is the number of variables on which objects i and j have matching values.
0rdinal variables:
Ordinal variables have a meaningful order.
Eg: rank, satisfaction.
Here the differences between values may mislead us.
Eg: the difference between ranks 1 and 2 and between ranks 2 and 3 is not necessarily the same, and the difference between "very satisfied", "satisfied" and "not satisfied" is not the same.
Like nominal data, for ordinal data we compute frequencies; occasionally a mean is computed as well.
Dissimilarity matrices can be built for categorical/ordinal and for nominal variables.
Ratio / Interval variables:
They are also called quantitative, scale, or parametric variables
Eg. no. of customers, age, weight etc.
The values may be discrete, or continuous
Eg: no. of customers =20 (discrete)
no of stores = 3 ( discrete)
weight = 2.5kg (continuous)
Vector Objects
To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function. There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure:
s(x, y) = (x · y) / (||x|| ||y||)
where x · y is the dot product of the vectors and ||x|| is the length (Euclidean norm) of x.
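A minimal NumPy sketch of this cosine measure (the two vectors are invented examples):

```python
# Cosine similarity between two vectors.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 3.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 1.0, 0.0])
print("s(x, y) =", round(cosine_similarity(x, y), 4))
```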
Partitioning Methods
K-Means Algorithm
Refer Unit - I
k-Medoids Algorithm
Instead of taking the mean value of the objects in a cluster as a reference
point, we can pick actual objects to represent the clusters, using one
representative object per cluster. Each remaining object is clustered with
the representative object to which it is the most similar. The partitioning
method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as
E = Σ_{j=1}^{k} Σ_{p ∈ C_j} |p - o_j|
where E is the sum of the absolute error for all objects in the data set;
p is the point in space representing a given object in cluster C_j ;
and o_j is the representative object of C_j . In general, the algorithm
iterates until, eventually, each representative object is actually the
medoid, or most centrally located object, of its cluster. This is the basis of
the k-medoids method for grouping n objects into k clusters.
Comparing against k-means: k-means requires that all the data lie in a Euclidean space, with dissimilarity measured by Euclidean distance. However, not every dataset meets this requirement, for instance categorical features: you cannot exactly tell how far apart an apple and a pear are, so the Euclidean distance formula cannot be applied. For k-medoids, you only need to care about the degree of dissimilarity, which is represented by a dissimilarity matrix. Also, the k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
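A hedged sketch of the k-medoids idea: given a dissimilarity matrix and a current choice of medoids, assign each object to its nearest medoid and compute the absolute-error criterion E (a full PAM implementation would additionally try swapping medoids to lower E; the matrix and medoid choice below are invented).

```python
# One assignment step of k-medoids on an invented dissimilarity matrix.
import numpy as np

D = np.array([[0, 2, 6, 7, 8],
              [2, 0, 5, 6, 7],
              [6, 5, 0, 1, 2],
              [7, 6, 1, 0, 2],
              [8, 7, 2, 2, 0]], dtype=float)   # symmetric, zero diagonal

medoids = [0, 3]                                # assumed current medoids
assignment = np.argmin(D[:, medoids], axis=1)   # nearest medoid per object
E = D[np.arange(len(D)), np.array(medoids)[assignment]].sum()

print("medoid of each object:", [medoids[a] for a in assignment])
print("absolute error E:", E)
```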
Hierarchical approach for clustering:
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This is a bottom-up approach: each object starts in its own cluster, and the closest clusters are merged step by step until all objects are in one cluster (or a termination condition holds).
Divisive Approach
This is a top-down approach: all objects start in one cluster, which is split recursively until each object forms its own cluster (or a termination condition holds).
Hierarchical Clustering
Find the next edge with the lowest weight and highlight it; continue selecting the lowest-weight edges until all nodes are in the same tree. The finished minimum spanning tree for this example looks like this:
Step 2: Find all of the edges that go to un-highlighted nodes. For this
example, node C has three edges with weights 1, 2, and 3. Highlight the
edge with the lowest weight. For this example, that’s 1.
Step 3: Highlight the node you just reached (in this example, that’s node
A).
Step 4: Look at all of the nodes highlighted so far (in this example, that's A and C). Highlight the edge with the lowest weight (in this example, that's the edge with weight 2).
Note: if you have more than one edge with the same weight, pick a random one.
Step 5: Highlight the node you just reached.
Step 6: Highlight the edge with the lowest weight. Choose from all of the
edges that:
1. Come from all of the highlighted nodes.
2. Reach a node that you haven’t highlighted yet
Step 7: Repeat steps 5 and 6 until you have no more un-highlighted
nodes. For this particular example, the specific steps remaining are:
a. Highlight node E.
b. Highlight edge 3 and then node D.
c. Highlight edge 5 and then node B.
d. Highlight edge 6 and then node F.
e. Highlight edge 9 and then node G.
The finished graph is shown at the bottom right of this image:
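A hedged Python sketch of Prim's algorithm as described in these steps (the weighted graph with nodes A-G is an invented example, not the graph from the original figure):

```python
# Prim's algorithm: grow the tree from a start node by repeatedly taking the
# lowest-weight edge that reaches a not-yet-highlighted node.
import heapq

graph = {
    "A": {"C": 1, "B": 5},
    "B": {"A": 5, "D": 4, "F": 6},
    "C": {"A": 1, "D": 3, "E": 2},
    "D": {"C": 3, "B": 4, "G": 9},
    "E": {"C": 2, "F": 7},
    "F": {"B": 6, "E": 7},
    "G": {"D": 9},
}

start = "C"
visited = {start}
edges = [(w, start, v) for v, w in graph[start].items()]
heapq.heapify(edges)
mst = []

while edges and len(visited) < len(graph):
    weight, u, v = heapq.heappop(edges)      # lowest-weight edge available
    if v in visited:
        continue                             # would create a cycle, skip
    visited.add(v)
    mst.append((u, v, weight))
    for nxt, w in graph[v].items():
        if nxt not in visited:
            heapq.heappush(edges, (w, v, nxt))

print("minimum spanning tree edges:", mst)
print("total weight:", sum(w for _, _, w in mst))
```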