Machine_Learning_II
The course
discusses different learning paradigms: supervised, unsupervised and semi-supervised models,
generative and discriminative learning, parametric/non-parametric learning, frequentist and
Bayesian methods. Topics covered also include decision trees, ensemble methods, neural
networks and deep learning, reinforcement learning, and topics in machine learning theory. The
course discusses issues in large-scale machine learning. Concepts are discussed in the context
of applications such as collaborative filtering, autonomous navigation, intrusion detection, text
and web data processing, and recommender systems.
Text book:
Course Contents:
1. Evaluating ML Models
2. Generative vs. Discriminative Learning
3. Different learning paradigms: supervised, unsupervised, and semi-supervised.
4. Density Estimation and Anomaly Detection
5. Graphical models
6. Reinforcement Learning
7. Large-Scale Machine Learning
Evaluating ML models
• Machine Learning involves constructing mathematical models to understand data and
make accurate predictions on new, unseen data.
• The objective is not just to create models but to build high-quality models that
demonstrate strong predictive capabilities.
• Performance metrics are essential tools for evaluating model effectiveness, allowing us
to determine how well a model generalizes to new data and to measure its reliability.
• These metrics help assess the predictive power of the model, ensuring that it is not only
accurate on the training data but also performs well in real-world applications.
• By using performance metrics, we can compare different models, identify areas for
improvement, and refine the model to achieve optimal performance.
• Ultimately, these evaluations ensure that the model is robust, reliable, and capable of
making meaningful predictions.
• Type I Error (False Positive): Occurs when the model incorrectly predicts a
positive outcome. For example, in a medical test for a disease, a Type I error would
mean predicting that a patient has the disease when they do not.
• Type II Error (False Negative): Occurs when the model fails to predict a positive
outcome. In the same medical context, a Type II error would mean predicting that a
patient does not have the disease when they actually do.
• False positives and false negatives are not equally problematic, depending on the
application. For instance, in tumor detection, a Type II error (failing to detect a
tumor) could have more severe consequences than a Type I error (falsely
identifying a tumor).
Based on the confusion matrix, several metrics can be extracted to assess the different aspects
of the model.
Main metrics: The following metrics are commonly used to assess the performance of
classification models.
| Metric | Formula | Interpretation |
|------------------|-----------------------------------------------------------|----------------------------------------------------------------------|
| Macro average | $\frac{1}{n}\sum_{i=1}^{n} M_i$ | Averages the metric equally across all $n$ classes. |
| Weighted average | $\frac{\sum_{i=1}^{n} n_i \cdot M_i}{\sum_{i=1}^{n} n_i}$ | Averages the metric across all classes, weighted by class size $n_i$. |
So, is it possible to have perfect recall (sensitivity) with a specificity of zero? What would that
mean?
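For intuition (a quick sketch, not part of the original notes): a degenerate classifier that predicts the positive class for every sample reaches perfect recall while its specificity drops to zero.

```python
import numpy as np
from sklearn.metrics import recall_score

# Ground truth: a mix of positives (1) and negatives (0)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
# A degenerate classifier that always predicts the positive class
y_pred = np.ones_like(y_true)

recall = recall_score(y_true, y_pred)                    # 1.0: every positive is caught
specificity = recall_score(y_true, y_pred, pos_label=0)  # 0.0: no negative is ever recognized
print(recall, specificity)
```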
Example 1: Let's load the Iris dataset and fit a KNN classification model.
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load dataset
iris = load_iris()
# Consider only two classes of the Iris dataset: samples 0-59 and 45-99 (labels 0 and 1)
X_class1 = iris.data[0:60]
X_class2 = iris.data[45:100]
X = np.vstack((X_class1, X_class2))
y_class1 = iris.target[0:60]
y_class2 = iris.target[45:100]
y = np.hstack((y_class1, y_class2))
# Split data
xtrain, xtest, ytrain, ytest = train_test_split(X, y, train_size=0.9)
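The classifier fitting and evaluation step that produces the output below is not shown; a minimal sketch of what it could look like (the value k = 3 is an assumption):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Fit a KNN classifier and evaluate it on the held-out test split
knn = KNeighborsClassifier(n_neighbors=3)  # k is assumed; the original value is not given
knn.fit(xtrain, ytrain)
ypred = knn.predict(xtest)

print("Confusion Matrix:")
print(confusion_matrix(ytest, ypred))
print("Classification Report:")
print(classification_report(ytest, ypred))
```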
Confusion Matrix:
[[4 2]
[1 5]]
Classification Report:
precision recall f1-score support
accuracy 0.75 12
macro avg 0.76 0.75 0.75 12
weighted avg 0.76 0.75 0.75 12
Regression metrics
• Accuracy is not applicable for regression models. Instead, the performance of a
regression model is typically evaluated using error metrics.
– Total Sum of Squares (TSS): This metric quantifies the total variation in the
dependent variable by measuring how far each observed value deviates from
the sample mean. A higher TSS indicates greater variability in the data.
– Explained Sum of Squares (ESS): ESS reflects the proportion of the total
variation that is accounted for by the regression model. A higher ESS signifies
that the model effectively captures the underlying patterns in the data.
– Residual Sum of Squares (RSS): RSS measures the variation in the model’s
errors, providing insight into how well the model fits the data. A lower RSS
indicates better model performance and greater explainability of the data.
| Metric | Formula |
|---------------------------|--------------------------------------------------------|
| Total sum of squares | $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ |
| Explained sum of squares | $SS_{exp} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ |
| Residual sum of squares | $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ |
• Coefficient of Determination ($ R^2 $): The $ R^2 $ value measures the
proportion of variance in the dependent variable that can be predicted from the
independent variables. It is defined as:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
• Main Metrics: The following metrics are frequently employed to assess the
performance of regression models, taking into account the number of predictors $n$
utilized in the model, where:
– $ SS_{res} $: Residual sum of squares.
– $ \hat{\sigma}^2 $: Estimated variance of the residuals.
– $ n $: Number of predictors (independent variables).
– $ m $: Total number of observations.
– $ L $: Likelihood of the model.
• Mallow's $C_p$ is a widely used criterion for model selection, aimed at identifying the
model that offers the best predictive performance.
• Adjusted R² is another useful metric that adjusts the R² value for the number of
predictors in the model. Unlike R², which never decreases with the addition of new
variables, Adjusted R² will decrease if irrelevant or redundant predictors are added
to the model, making it a more reliable measure for assessing model quality and
predictive accuracy.
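For reference, the standard definition of Adjusted R², using $m$ observations and $n$ predictors as above, is:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(m - 1)}{m - n - 1}$$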
Example: This example illustrates the evaluation of linear regression models using various
performance metrics, focusing on the impact of adding predictors and including a "useless"
predictor.
# Create and train the multiple linear regression model with two predictors
model_multiple = LinearRegression()
model_multiple.fit(X_train, y_train)
# Create and train the multiple linear regression model with three predictors (including the useless predictor)
model_with_useless = LinearRegression()
model_with_useless.fit(X_train_with_useless, y_train)
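The data preparation and the metric comparison surrounding this snippet are not shown; a self-contained sketch of the same experiment (the synthetic data and the adjusted R² helper are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def adjusted_r2(r2, n_obs, n_pred):
    # Standard adjusted R^2 with n_obs observations and n_pred predictors
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_pred - 1)

rng = np.random.default_rng(0)
m = 200
x1, x2 = rng.normal(size=m), rng.normal(size=m)
useless = rng.normal(size=m)                            # predictor unrelated to the target
y = 3 * x1 - 2 * x2 + rng.normal(scale=0.5, size=m)

X2 = np.column_stack([x1, x2])
X3 = np.column_stack([x1, x2, useless])
X2_train, X2_test, X3_train, X3_test, y_train, y_test = train_test_split(X2, X3, y, random_state=0)

for name, Xtr, Xte, p in [("two predictors", X2_train, X2_test, 2),
                          ("plus useless predictor", X3_train, X3_test, 3)]:
    model = LinearRegression().fit(Xtr, y_train)
    r2 = r2_score(y_test, model.predict(Xte))
    print(name, "R2:", round(r2, 4), "adjusted R2:", round(adjusted_r2(r2, len(y_test), p), 4))
```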
Model selection
• To evaluate a model, we usually split the available data into three parts: a training set to
fit candidate models, a validation set to compare them, and a test set for the final assessment.
• Once the model is chosen, it is trained on the entire dataset and tested on the unseen
test set.
Cross-validation:
• A method used to select a model that does not rely too much on the initial training set.
• The two commonly used types are k-fold cross-validation (train on k−1 folds, validate on
the remaining fold) and leave-p-out cross-validation (train on all but p observations, validate on those p).
In both cases, the error is averaged over the k folds/parts and is called the cross-validation error.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Split the dataset into training and testing (optional for demonstration)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Define the cross-validation splitter and the model (5 folds assumed)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(random_state=42)
fold_accuracies = []
# Perform cross-validation
for fold, (train_index, val_index) in enumerate(kf.split(X_train)):
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]
    # Fit on the training folds and predict on the validation fold
    model.fit(X_train_fold, y_train_fold)
    y_val_pred = model.predict(X_val_fold)
    # Calculate accuracy
    accuracy = accuracy_score(y_val_fold, y_val_pred)
    fold_accuracies.append(accuracy)
print("Mean cross-validation accuracy:", np.mean(fold_accuracies))
2. Interpretable Models
• Simple Models: Models where the relationship between inputs and outputs is direct and
transparent, allowing easy tracing of predictions.
• Rule-Based Models: These make decisions based on logical conditions, making it clear
why certain predictions are made.
• Additive Models: These evaluate each input's effect independently, making the
contribution of each feature easy to understand.
4. Explainability Techniques
• Feature Importance: Quantifies which features most influence predictions.
• Local Explanations: Explains individual predictions by approximating the complex model
locally.
• Visual Tools: Plots and heatmaps help show the effect of input features on the
predictions.
Techniques
1. Feature Importance: Ranks the input features by their influence on the model’s
predictions, providing insight into which factors are most important.
2. Partial Dependence Plots (PDPs): Show how changing one feature impacts the
model’s predictions, while keeping other features constant, giving a global view of
feature influence.
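As a quick illustration of a PDP (a sketch using scikit-learn's inspection module, not part of the original code; requires scikit-learn ≥ 1.0):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# Partial dependence of the predicted probability of class 2 on two features
PartialDependenceDisplay.from_estimator(model, data.data, features=[2, 3],
                                        feature_names=data.feature_names, target=2)
plt.show()
```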
Example 1: Here's an example of using LIME (local explanations) with a Random Forest classifier.
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import lime
import lime.lime_tabular
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Train the model to be explained
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
# Build the LIME explainer and explain a single test instance
explainer = lime.lime_tabular.LimeTabularExplainer(X_train, feature_names=data.feature_names,
                                                   class_names=data.target_names, discretize_continuous=True)
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=4)
explanation.show_in_notebook()
<IPython.core.display.HTML object>
Example 2: Here's an example of using Feature Importance with a Random Forest classifier.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Fit the model and collect its impurity-based feature importances
model = RandomForestClassifier(random_state=42).fit(X, y)
feature_df = pd.DataFrame({'feature': data.feature_names,
                           'importance': model.feature_importances_}).sort_values('importance')
# print(feature_df)
# Plot the ranked importances
feature_df.plot.barh(x='feature', y='importance', legend=False)
plt.xlabel('importance')
plt.show()
# Example: a simple neural network on MNIST (the architecture and training settings are assumptions;
# only the preprocessing and evaluation steps appeared in the original cell)
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
# Load dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Preprocess data
X_train, X_test = X_train / 255.0, X_test / 255.0  # Normalize pixel values
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)  # One-hot encode labels
# Build and train the model
model = Sequential([Flatten(input_shape=(28, 28)),
                    Dense(128, activation='relu'),
                    Dense(10, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.2, verbose=2)
# Make predictions
y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)
y_true = np.argmax(y_test, axis=1)
# Plot a few misclassified test digits
incorrect_indices = np.where(y_pred != y_true)[0]
for i in range(5):
    idx = incorrect_indices[i]
    plt.subplot(1, 5, i + 1)
    plt.imshow(X_test[idx], cmap='gray')
    plt.title(f"True: {y_true[idx]}\nPred: {y_pred[idx]}")
    plt.axis('off')
plt.show()
Epoch 1/5
750/750 - 4s - 6ms/step - accuracy: 0.9072 - loss: 0.3313 -
val_accuracy: 0.9490 - val_loss: 0.1805
Epoch 2/5
750/750 - 3s - 5ms/step - accuracy: 0.9577 - loss: 0.1476 -
val_accuracy: 0.9609 - val_loss: 0.1308
Epoch 3/5
750/750 - 3s - 4ms/step - accuracy: 0.9701 - loss: 0.1037 -
val_accuracy: 0.9689 - val_loss: 0.1066
Epoch 4/5
750/750 - 6s - 8ms/step - accuracy: 0.9769 - loss: 0.0788 -
val_accuracy: 0.9721 - val_loss: 0.0954
Epoch 5/5
750/750 - 3s - 4ms/step - accuracy: 0.9821 - loss: 0.0620 -
val_accuracy: 0.9728 - val_loss: 0.0912
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
Diagnostics
It is worth mentioning that high precision, accuracy, or any other metric does not necessarily
reflect the true performance of the model; instead, it might reflect a state of overfitting.
To get more insight into overfitting, it is fundamental to understand the roles of variance and
bias:
• Bias: the difference between the expected prediction and the correct model (generally,
the difference between the average prediction and the target value).
• Variance: the variability of the model prediction for given data points.
The relationship between Bias/variance can be summarized as follows: The simpler the model,
the higher the bias, and the more complex the model, the higher the variance.
The following table gives real cases of underfitting and overfitting and some possible remedies for
these situations.
Regularization:
• The regularization procedure aims to prevent the model from overfitting the data and thus
deals with high-variance issues.
• It reduces variance at the cost of introducing some bias.
• Decreasing the model's variability amounts to decreasing its complexity, that is, the
effective number of predictors.
• This is done by penalizing predictors whose coefficients are far from zero, shrinking them
toward (or exactly to) zero.
The commonly used regularization techniques are LASSO ($\ell_1$ penalty), Ridge ($\ell_2$ penalty), and Elastic Net (a combination of both).
Example:
In the following example, we will see how regularization may dramatically increase the
regression performance.
import numpy as np
import matplotlib.pyplot as plt
# Building and fitting the Linear Regression model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X, y = generate_dataset(B, 50)
linearModel = LinearRegression()
linearModel.fit(poly_features, y)
0.6364593734835868
200 : 0.6653968262887501
400 : 0.6711444170343008
600 : 0.673506823781417
800 : 0.6746800126019875
1000 : 0.6753051013366153
1200 : 0.6756383882406153
1400 : 0.6758019669489553
1600 : 0.6758609933584792
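The helper generate_dataset and the regularized fits that produce the scores above are not shown; a self-contained sketch of the same idea, comparing plain polynomial regression with Ridge at increasing regularization strength (all settings assumed):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic cubic data with noise (stand-in for generate_dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = 0.1 + 0.2 * X[:, 0] + 0.3 * X[:, 0]**2 - 0.4 * X[:, 0]**3 + rng.normal(scale=2.0, size=50)

# Deliberately over-parameterized polynomial features to provoke overfitting
poly = PolynomialFeatures(degree=10, include_bias=False)
X_train, X_test, y_train, y_test = train_test_split(poly.fit_transform(X), y, random_state=0)

print("OLS test R2:", LinearRegression().fit(X_train, y_train).score(X_test, y_test))
for alpha in [200, 400, 600, 800, 1000]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, ":", ridge.score(X_test, y_test))
```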
1. Model Selection
• Purpose of Model Selection:
The goal is to identify the model that best fits the problem by comparing multiple
models on a validation dataset. This involves assessing their predictive performance,
robustness, and suitability for the task at hand.
2. Hyperparameter Tuning
Hyperparameters are parameters that define the structure of the model and influence how it
learns. They are set before the training process (e.g., the number of layers in a neural network,
the learning rate, or the maximum depth of a decision tree).
• Grid Search: A systematic method for hyperparameter tuning, where all possible
combinations of a predefined set of hyperparameters are tested. This approach can
be computationally expensive but guarantees an exhaustive search.
• Random Search: Instead of testing all possible combinations, random search
samples hyperparameters randomly from a distribution. It is more efficient than
grid search when searching large hyperparameter spaces.
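Random search is not illustrated in the grid-search example that follows; a sketch of what it might look like with scikit-learn's RandomizedSearchCV (the parameter distributions are assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

data = load_iris()
param_distributions = {
    'n_estimators': randint(50, 300),   # sample the number of trees
    'max_depth': randint(2, 10),        # sample the tree depth
}
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions, n_iter=20, cv=5,
                                   scoring='accuracy', random_state=42)
random_search.fit(data.data, data.target)
print(random_search.best_params_)
```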
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Split the data and define the model and hyperparameter grid
# (the grid values below are assumptions; 3 x 4 x 3 x 3 = 108 combinations, matching the results table)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42)
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [2, 4, 6, None],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=5, scoring='accuracy',
                           n_jobs=-1, verbose=2)
# Fit GridSearchCV
grid_search.fit(X_train, y_train)
# Print the best hyperparameters
print("Best Hyperparameters:")
print(grid_search.best_params_)
# Best model
best_model = grid_search.best_estimator_
Classification Report:
precision recall f1-score support
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
std_test_score
0 0.048562
1 0.035635
2 0.035635
3 0.048562
4 0.035635
.. ...
103 0.048562
104 0.048562
105 0.048562
106 0.048562
107 0.048562
Informally:
A generative model could generate new photos of animals that look like real animals, while a
discriminative model could tell a dog from a cat.
• Generative approach: learn each language and then determine which language the
speech belongs to.
• Discriminative approach: determine the linguistic differences without learning any
language, a much easier task!
Example:
• Suppose we have the following data in the form (x, y): (1,0), (1,0), (2,0), (2,1). Then $p(x, y)$ is:

|       | y = 0 | y = 1 |
|-------|-------|-------|
| x = 1 | 1/2   | 0     |
| x = 2 | 1/4   | 1/4   |

and $p(y \mid x)$ is:

|       | y = 0 | y = 1 |
|-------|-------|-------|
| x = 1 | 1     | 0     |
| x = 2 | 1/2   | 1/2   |
• The distribution $p(y \mid x)$ is the natural distribution for classifying a given example x into
a class y (discriminative).
• $p(x, y)$ (generative) can be transformed into $p(y \mid x)$ by applying Bayes' rule and then
used for classification: $p(x, y) = p(x)\, p(y \mid x)$.
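A tiny numerical check of the two tables above (a sketch, not part of the original notes):

```python
# Joint distribution p(x, y) built from the four samples (1,0), (1,0), (2,0), (2,1)
p_xy = {(1, 0): 2/4, (1, 1): 0/4, (2, 0): 1/4, (2, 1): 1/4}

# Bayes rule: p(y | x) = p(x, y) / p(x), with p(x) = sum over y of p(x, y)
for x in (1, 2):
    p_x = sum(p_xy[(x, y)] for y in (0, 1))
    print(x, {y: p_xy[(x, y)] / p_x for y in (0, 1)})
# -> {0: 1.0, 1: 0.0} for x=1 and {0: 0.5, 1: 0.5} for x=2, matching p(y|x) above
```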
The differences between Discriminative and Generative are summarized in the following table:
| Generative | Discriminative |
|---|---|
| Models classes via pdfs and prior probabilities | Directly estimates posterior probabilities, with no attempt to model the underlying probability distributions |
| Can generate synthetic data points | Dedicated to classifying new data, which grants better performance |
| A full probabilistic model of all variables | Provides a model only for the target variables that we want to predict |
| Hard to estimate distributions accurately | Easier to tune |
| Popular models: Gaussians, Naïve Bayes, Mixtures of multinomials, Mixtures of Gaussians, etc. | Logistic regression, SVMs, neural networks, Nearest neighbor, etc. |
• In practice, generative models are most popular when we have phenomena that are well
approximated by the normal distribution, and we have a lot of sample points, so we can
approximate the shape of the distribution well.
• Advantage of (1 & 2): $P(Y \mid X)$ tells us the probability that the guess is wrong [this is
something SVMs don't do].
• In a 2-class problem, we can incorporate an asymmetrical loss function instead of the prior $\pi_C$. In
a multi-class problem, it gets more difficult.
• The decision rule is: predict $C_1$ if $Q_{C_1}(x) - Q_{C_2}(x) > 0$, and $C_2$ otherwise.
• The decision function is quadratic; the Bayes decision boundary is $Q_{C_1}(x) - Q_{C_2}(x) = 0$.
• The fundamental assumption (for LDA) is that all the Gaussians have the same variance $\sigma^2$:

$$Q_{C_1}(x) - Q_{C_2}(x) = \frac{(\mu_{C_1} - \mu_{C_2}) \cdot x}{\sigma^2} - \frac{\|\mu_{C_1}\|^2 - \|\mu_{C_2}\|^2}{2\sigma^2} + \ln \pi_{C_1} - \ln \pi_{C_2}$$

• Note that the quadratic terms in $Q_{C_1}$ and $Q_{C_2}$ cancel each other out.
• Now we obtain a linear classifier: choose the class C that maximizes the following linear
discriminant function, which works for any number of classes:

$$\frac{\mu_C \cdot x}{\sigma^2} - \frac{\|\mu_C\|^2}{2\sigma^2} + \ln \pi_C$$

• In the case of 2 classes, the decision boundary is $w \cdot x + \alpha = 0$ and the posterior is
$P(Y = C_1 \mid X = x) = s\!\left(Q_{C_1}(x) - Q_{C_2}(x)\right)$.
• The logistic function is the right Gaussian divided by the sum of the Gaussians.
• notice that even if the Gaussians are 2D, the logistic still looks 1D.
• In the case of more than two classes, the LDA decision boundaries form a classical
Voronoi diagram if the priors $\pi_C$ are equal.
• Classification predicts a class (discrete) for a given point x, whereas regression
predicts some numerical value (continuous) for a point x.
• QDA and LDA don't just perform classification; they also estimate the probability that a
given label for a sample x is correct, which means that they implicitly do regression.
• To perform regression we:
a. Choose a form of regression function (hypothesis) $h(x; p)$ with parameter(s) p [for
instance, a decision function in classification; e.g., linear, quadratic, logistic in x].
b. Choose a cost function (objective function) to optimize, usually based on a loss
function; e.g., risk = expected loss.
• Some regression functions:
– (1) linear: $h(x; w, \alpha) = w \cdot x + \alpha$
– (2) polynomial
– (3) logistic: $h(x; w, \alpha) = s(w \cdot x + \alpha)$; recall that the logistic function is $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$
Logistic expression
• So the logistic function seems to be a natural form for modeling certain probabilities.
• If we want to model posterior probabilities, sometimes we use LDA.
• Alternatively, we could skip fitting Gaussians to points, and instead just try to directly fit
a logistic function to a set of probabilities.
Some loss functions: let z be the prediction of h ( x ) and y be the true label.
The optimization algorithm and its speed depend crucially on which parts the regression method
is composed of.
– In matrix form, we append the fictitious dimension (a column of ones) to X and α to w:

$$w \cdot x + \alpha = \begin{bmatrix} x_1 & \dots & x_n & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \\ \alpha \end{bmatrix}$$

– to minimize $ RSS(w) = w^T X'^T X' w - 2 y^T X' w + y^T y $
– we set $ \triangledown RSS = 2X'^T X' w - 2X'^T y = 0 $
– this implies $ w = (X'^T X')^{-1} X'^T y $
pros:
cons:
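A minimal NumPy sketch of the normal-equation solution derived above (the toy data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Append the fictitious dimension (a column of ones) so alpha becomes the last weight
X_prime = np.hstack([X, np.ones((100, 1))])

# w = (X'^T X')^{-1} X'^T y; solving the linear system is preferred over an explicit inverse
w = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)
print(w)   # approximately [2, -1, 0.5, 3]
```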
Logistic Regression
• Logistic regression function (3)+logistic loss function (C)+cost function (a).
• Fits “probabilities” in range (0,1)
• Usually used for classification. The $y_i$'s can be probabilities, but in most applications they're
all 0 or 1.
• Although both utilize Logistic function, QDA and LDA are generative models, whereas,
Logistic Regression is a discriminative one.
• With LDA, we have seen that in classification the posterior probabilities are often
modeled well by a logistic function. The question arises: why not just fit a logistic
function directly to the data, skipping the Gaussians?
Suppose that we have a data matrix X and a weight vector w that include the fictitious dimension
(i.e., a column of ones and α are X's and w's last components, respectively).
import numpy as np
import matplotlib.pyplot as plt
# Logistic loss (standard definition assumed): L(z, y) = -y ln z - (1 - y) ln(1 - z)
def L(z, y):
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))
z = np.arange(0.01, 1, 0.01)
L0 = L(z, 0.1); L04 = L(z, 0.4); L07 = L(z, 0.7)
plt.plot(z, L0, label='y = 0.1'), plt.plot(z, L04, label='y = 0.4'), plt.plot(z, L07, label='y = 0.7')
plt.legend();
• Let $s_i = s(x_i \cdot w)$
clf = LogisticRegression(random_state=0).fit(X, y)
• A 2018 paper by Soudry et al. shows that gradient descent applied to logistic regression
eventually converges to the maximum margin classifier
• However, the convergence is extremely slow.
• In practice, logistic regression will usually find a linear separator reasonably quickly, but it
is not a practical algorithm for maximizing the margin in a reasonable amount of time.
where $w'$ is the vector w with the last component α replaced by 0. Although the
matrix X has a fictitious dimension, we DON'T penalize α.
• You clearly notice that we add a regularization term (i.e., penalty term) for
shrinkage to encourage a small $\|w'\|$. Why?
– It guarantees that the normal system always has a unique solution.
– Standard least-squares, on the other hand, yields singular normal equations
(an infinite number of solutions) when the sample points lie on a common
hyperplane in feature space, e.g., when $d > n$.
• The left plot of the above figure presents the quadratic form of a positive semidefinite cost
function associated with least-squares regression.
• You may notice that it has an infinite number of minima.
• In such cases, the regression problem is said to be ill-posed. To obtain a positive
definite quadratic form (right image), which has a unique minimum, we add a small penalty
term.
• The term "regularization" implies that we are turning an ill-posed problem into a well-
posed problem.
linearModel = LinearRegression()
linearModel.fit(poly_features, y)
linearModel = LinearRegression()
linearModel.fit(poly_features2, y)
plt.subplot(1, 2, 2)
# print('multinomial_2:', linearModel.coef_)
plt.scatter(X, y)
x = np.arange(X.min(), X.max(), 0.1)
y_x = [np.power(a, np.arange(1, len(linearModel.coef_) + 1)).dot(linearModel.coef_) + linearModel.intercept_ for a in x]
plt.plot(x, y_x, c='r');
var = int((np.diff(y_x)**2).sum())
bias = np.abs(y - [np.power(a, np.arange(1, len(linearModel.coef_) + 1)).dot(linearModel.coef_) + linearModel.intercept_ for a in X]).sum()
plt.title("Low variance:" + str(var) + ', and High bias:' + str(int(bias)));
• The solution to the normal system lies where a red isocontour just touches a blue
isocontour.
• As λ increases, the solution will occur at a more outer red isocontour and a more
inner blue isocontour.
• To minimize the cost function, setting $ \triangledown J = 0 $ gives the normal
equations: $(X^T X + \lambda I')\, w = X^T y$
• $I'$ here refers to the identity matrix with its bottom-right entry set to zero. We do this to
avoid penalizing the bias term α.
• Algorithm:
– Solve for w .
– Increase λ for more regularization and a smaller $\|w'\|$.
– Tuning the variance/bias of ridge regression: the variance comes from the noise term,
$\mathrm{Var}(\beta_{ridge}) = \mathrm{Var}\!\left((X^T X + \lambda I')^{-1} X^T e\right)$,
where e is the noise (our data model is $y = Xv + e$, with $\mathrm{Var}(e) = \sigma^2 I$).
– As $\lambda \to \infty$, variance $\to 0$ and bias increases.
• The error function Err(x) is the sum of $\mathrm{Bias}^2$, variance, and the irreducible error $\sigma^2$:
$$Err(x) = \mathrm{Bias}^2 + \mathrm{Var}(\beta_{ridge}) + \sigma^2$$
• For the bias-variance trade-off, the test error as a function of λ is a U-shaped curve. We
find the bottom by cross-validation.
• Ideally, features should be "normalized" to have the same variance.
• To use an asymmetric penalty, the identity matrix $I'$ must be replaced with another
diagonal matrix.
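A NumPy sketch of the ridge normal equations with the modified identity $I'$ described above (the toy data and λ value are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.0]) + 4.0 + rng.normal(scale=0.2, size=50)

X_prime = np.hstack([X, np.ones((50, 1))])   # fictitious dimension for the bias alpha
lam = 1.0

# I' is the identity with its bottom-right entry zeroed so alpha is not penalized
I_prime = np.eye(X_prime.shape[1])
I_prime[-1, -1] = 0.0

w = np.linalg.solve(X_prime.T @ X_prime + lam * I_prime, X_prime.T @ y)
print(w)
```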
For Lasso Regression, the cost function employs an $\ell_1$ penalty (the least absolute shrinkage and selection
operator).
Supervised learning
• It is when we use an algorithm to learn the mapping function f(X) = y from the input X
to the output y, where X and y are known beforehand.
• It is called supervised because learning from the training dataset can be thought of as
a teacher supervising the learning process.
Semi-supervised learning
• It is when a large amount of input data X comes with only a few labels y.
• Many real-world machine learning problems fall into this area.
• This is because it is expensive and time-consuming to label large amounts of data, as it
may require domain experts.
• On the other hand, unlabeled data is cheap and easy to collect and store.
• Both supervised and unsupervised techniques can be utilized (a minimal sketch follows this list):
a. unsupervised techniques discover and learn the structure of the data;
b. supervised techniques make best-guess predictions for the unlabeled data, and that
data is fed back into the supervised learning algorithm as training data;
c. the final model is used to make predictions on new, unseen data.
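The workflow above can be sketched with scikit-learn's SelfTrainingClassifier; the dataset, base classifier, and confidence threshold below are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pretend most training labels are unknown: unlabeled samples are marked with -1
rng = np.random.default_rng(0)
y_semi = y_train.copy()
y_semi[rng.random(len(y_semi)) < 0.8] = -1

# Self-training: the base classifier labels the unlabeled points it is confident about,
# then is retrained on the enlarged labeled set
base = SVC(probability=True, random_state=0)
model = SelfTrainingClassifier(base, threshold=0.9).fit(X_train, y_semi)
print("Test accuracy:", model.score(X_test, y_test))
```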
Reinforcement Learning
Principal Component Analysis (PCA)
The main goal is to find k directions that capture most of the variation of the sample points $X \in \mathbb{R}^d$,
where $k \ll d$.
Why?..
Let X be the n × d design matrix, as the 5 × 4 example in the code below shows; what do you notice?
• The mean of the design matrix is $\mu_X = [11.2, 11.8, 9.7, 13.2]$ and it represents the
center of each variable:
$$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
import numpy as np
X = np.array([[12, 13, 11, 14],[8, 8.5, 10, 13],[12, 13, 9, 14],[16,
16,8.5, 13],[8, 8.5, 10, 12]])
X.mean(axis=0)
array([11.2, 11.8, 9.7, 13.2])
• The centered data is the matrix $\tilde{X}$ where $\tilde{X}_{ij} = x_{ij} - \mu_j$ for $i = 1..n$, $j = 1..d$.
• Calculate the sum of each column of $\tilde{X}$; what do you conclude?
X_hat = X - X.mean(axis=0)
print(X_hat)
print(X_hat.sum(axis=0))  # each column of the centered matrix sums to (numerically) zero
• All the forthcoming calculations are done using the centered matrix $\tilde{X}$; the original X is no
longer needed.
• The idea of PCA is that we pick the best direction w , then project all the data onto w
so we can analyze it in a one-dimensional space.
• Those directions span a subspace, and we want to project points orthogonally onto
the subspace.
• This would be an easy task if the directions are orthogonal (orthonormal) to each
other with length of 1.
• Given orthonormal directions $v_1, \dots, v_k$: $\tilde{x} = \sum_{i=1}^{k} (x \cdot v_i)\, v_i$
• Using the MLE, we are assuming that the data are independently sampled from a
multivariate normal distribution with mean vector $\mu$ and variance-covariance matrix
$$\hat{\Sigma} = \frac{1}{n} X^T X$$
• The PCA algorithm is as follows:
a. Center X.
b. Normalize X [optional]: only if the units of measurement of the features differ.
c. Compute the eigenvectors/eigenvalues of $\hat{\Sigma}$:
– using the equation $Av = \lambda v \Rightarrow (A - \lambda I)v = 0$,
– as shown by Cramer's rule, the non-trivial solutions are given by $\det(A - \lambda I) = 0$.
d. Choose k based on the variability the eigenvalues grant: $\%\text{ of variability} = \frac{\sum_{i=d-k+1}^{d} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$.
e. Pick the eigenvectors $v_{d-k+1}, \dots, v_d$.
f. Compute the k principal coordinates $x \cdot v_i$ of each training/test point.
g. One can reverse to the original space by multiplying the principal coordinates by $v_i^T$.
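A NumPy sketch of these steps applied to the small matrix X above (k = 2 is assumed; the exact settings behind the recovered data shown below are not given):

```python
import numpy as np

X = np.array([[12, 13, 11, 14], [8, 8.5, 10, 13], [12, 13, 9, 14],
              [16, 16, 8.5, 13], [8, 8.5, 10, 12]])

# a. Center the data
mu = X.mean(axis=0)
Xc = X - mu

# c. Eigen-decomposition of the (scaled) covariance matrix
cov = Xc.T @ Xc / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues returned in ascending order

# d-e. Keep the k eigenvectors with the largest eigenvalues
k = 2
V = eigvecs[:, -k:]

# f. Principal coordinates, then g. map back to the original space
coords = Xc @ V
recovered = coords @ V.T + mu
print(np.round(recovered, 1))
```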
original data:
[[12. 13. 11. 14. ]
[ 8. 8.5 10. 13. ]
[12. 13. 9. 14. ]
[16. 16. 8.5 13. ]
[ 8. 8.5 10. 12. ]]
recovered data:
[[11.9 13. 10.7 14.3]
[ 7.9 8.6 10.2 12.8]
[12.3 12.9 9.6 13.4]
[15.9 16.1 8.3 13.2]
[ 8. 8.4 9.7 12.4]]
MAE loss:
0.71 %
• PCA can be performed by finding a direction w that maximizes the sample variance of the projected
data.
• In other words, when the data is projected down, it must stay as spread out as
possible.
• So, the question is: how do we choose the orientation that grants the
aforementioned conditions?
• To solve the problem above, we resort to the Rayleigh quotient $r(x) = \frac{x^T A x}{x^T x}$ [details
here].
$$\operatorname*{argmax}_{w}\; \mathrm{Var}\big(\{\tilde{X}_1, \tilde{X}_2, \dots, \tilde{X}_n\}\big) = \frac{1}{n}\sum_{i=1}^{n}\left(\tilde{X}_i \cdot \frac{w}{\|w\|}\right)^2 = \frac{\|\tilde{X} w\|^2}{n\,\|w\|^2} = \frac{w^T \tilde{X}^T \tilde{X} w}{n\; w^T w}$$
• Of all eigenvectors, the above objective function yields $v_d$, which achieves the maximum
variance $\lambda_d / n$.
• It can be seen as a sort of least-squares linear regression, with one subtle but
important change: PCA measures the projection distances perpendicular to the subspace rather
than vertically.
• In both methods, however, the goal is to minimize a sum of squared distances.
PCA can be used for various tasks including noise removal, feature extraction, and data
compression. In the following code, PCA is used for image compression; the original image is
compressed at different ratios and plotted (the compression loop itself is sketched after the code):
# !wget "https://media.istockphoto.com/id/1141529240/vector/simple-apple-in-flat-style-vector-illustration.jpg?s=612x612&w=0&k=20&c=BTUl_6mGduAMWaGT9Tcr4X6n2IfK4M3HH-KCsr-Hrgs=" -o "image.jpg"
from PIL import Image
import numpy
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
plt.figure(figsize=[20, 12])
img = Image.open('/content/simple-apple-in-flat-style-vector-illustration.jpg?s=612x612&w=0&k=20&c=BTUl_6mGduAMWaGT9Tcr4X6n2IfK4M3HH-KCsr-Hrgs=')
array_origin = numpy.array(img)
stacked_arrays = array_origin.reshape(array_origin.shape[0] * 3, array_origin.shape[1], -1).squeeze()
revers_stacked = stacked_arrays.reshape(array_origin.shape[0], array_origin.shape[1], -1)
plt.subplot(2, 3, 1);
plt.imshow(img);
origin_size = stacked_arrays.shape[0] * stacked_arrays.shape[1]
plt.title("Original image, size =" + str(origin_size) + " octets")
• Fraud detection in finance, rare-event detection in network traffic, visual inspection of
buildings and road monitoring, and defect detection in production lines are some common
problems.
• For a comprehensive survey of anomaly detection techniques, check out this paper.
## Kernel
• In statistics, the kernel of a pdf or pmf is the form of the pdf or pmf in which any factors that
are not functions of any of the variables in the domain are omitted.
– Let $K_{h_\lambda}(X_0, X)$ be a kernel; it can be written as
$$K_{h_\lambda}(X_0, X) = D\!\left(\frac{\|X - X_0\|}{h_\lambda(X_0)}\right)$$
– $ X, X_{0} \in \mathbb{R}^{p} $
• For many distributions, the kernel can be written in closed form. An example is the
normal distribution which has the following probability density function:
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
• Kernels are used in kernel density estimation to estimate random variables' density
functions (i.e., smoothing), or in kernel regression to estimate the conditional
expectation of a random variable.
## Nonparametric statistics
• It is the branch of statistics that is not based only on parametrized families of probability
distributions (e.g., mean and variance).
• Choosing non-parametric methods for estimating a density function stems from a lack
of prior information about the PDF that corresponds to the data.
• If we take, for instance, maximum likelihood estimation (MLE) and Bayesian parameter
estimation (BPE), we need to estimate the value of a parameter $\hat{\theta}$ that maximizes the
likelihood function:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{k=1}^{n} p(X_k \mid \theta)$$
• Different kernels can be used to smooth the distribution. In the example below,
Tophat and Gaussian kernels are used.
Mathematically,
$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
where
– K is a kernel, used to calculate the scores;
– h is a bandwidth parameter that is responsible for smoothness (choosing a higher
value of h yields a smoother distribution); x is the point at which the density is estimated and $x_i$ is a point
from the sample dataset.
• As was mentioned above K is a kernel where we have multiple choices e.g.,
Gaussian, Tophat, Epanechnikov, etc.
Example
• In this example, a dataset containing 200 univariate samples has been generated.
• The number of bins used for the histogram of the same data has a disproportionate
effect on the resulting visualization.
• One can expect major confusion of samples, especially at the bin boundaries.
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
X, y = make_blobs(n_samples=200, centers=2, n_features=1, random_state=16, cluster_std=[2, 0.8])
plt.figure(figsize=[8, 4])
plt.subplot(1, 2, 1); plt.hist(X, bins=30);
plt.subplot(1, 2, 2); plt.hist(X, bins=3);
# a significant difference based on the number of bins
• The kernel effect, via KDE, can be used to smooth the resulting distribution instead of
using histograms.
from sklearn.neighbors import KernelDensity
import numpy as np
from sklearn.model_selection import GridSearchCV
# Fit a Gaussian KDE to the data (the bandwidth value is assumed; it could also be tuned with GridSearchCV)
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
plt.figure(figsize=[12, 4])
plt.subplot(1, 2, 1); plt.hist(X, bins=30); plt.title('data histogram')
plt.subplot(1, 2, 2);
plt.plot(np.linspace(X.min(), X.max(), 50),
         kde.score_samples(np.linspace(X.min(), X.max(), 50).reshape(-1, 1)));
plt.title('Gaussians scores');
• A natural measure is the mean square error at the estimation point x, defined by
$MSE_x(\hat{p}_{KDE}) = E\big[(\hat{p}_{KDE}(x) - p(x))^2\big] = \mathrm{bias}^2 + \mathrm{variance}$.
• This expression is an example of the bias-variance dilemma of statistics: the bias can
be reduced at the expense of the variance, and vice versa.
• The solution is to assume a standard density function and find the value of the
bandwidth that minimizes the integral of the square error (MISE):
$$\hat{h} = \operatorname*{argmin}_{h} MSE_x(\hat{p}_{KDE})$$
Gaussian kernel smoother
• The Gaussian kernel is one of the most widely used kernels for density estimation
and anomaly detection. It is expressed with this formula:
$$K_h(x, x_i) = \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right)$$
#Generate data
B = [0.1, 0.2, 0.3, -0.4] # [beta0, beta1, beta2, beta3]
X, y = generate_dataset(B, 50)
h = [1,6,20]
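Neither generate_dataset nor the smoothing step is shown here; a self-contained sketch of Gaussian kernel smoothing (Nadaraya-Watson) for the three bandwidths, on assumed synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data from a cubic polynomial with noise (stand-in for generate_dataset)
B = [0.1, 0.2, 0.3, -0.4]
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 50)
y = B[0] + B[1] * X + B[2] * X**2 + B[3] * X**3 + rng.normal(scale=1.0, size=50)

def gaussian_kernel(x, xi, h):
    return np.exp(-((x - xi) ** 2) / (2 * h ** 2))

grid = np.linspace(X.min(), X.max(), 200)
plt.scatter(X, y, s=10, label='data')
for h in [1, 6, 20]:
    # Nadaraya-Watson estimate: kernel-weighted average of the observed y values
    weights = gaussian_kernel(grid[:, None], X[None, :], h)
    y_smooth = (weights * y).sum(axis=1) / weights.sum(axis=1)
    plt.plot(grid, y_smooth, label=f'h = {h}')
plt.legend(); plt.show()
```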
• Formally, $h_m(X_0) = \|X_0 - X_{[m]}\|$
# load data
import os
from matplotlib.image import imread
from PIL import Image
img_dir="/content/dataset"
all_files=os.listdir(img_dir)
data_path = [os.path.join(img_dir + "/" + i) for i in all_files]
k=0
data = []
plt.figure(figsize=[12,4])
for i in data_path:
k=k+1
plt.subplot(1,6,k)
data.append(imread(i))
plt.imshow(data[k-1])
plt.show()
# Extract features
from sklearn import decomposition
from skimage.color import rgb2hsv
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity
# Build one feature vector per image (mean HSV values; this feature choice is an assumption)
features = np.array([rgb2hsv(im[:, :, :3]).mean(axis=(0, 1)) for im in data])
X_std = StandardScaler().fit_transform(features)
pca = decomposition.PCA(n_components=2)
X_std_pca = pca.fit_transform(X_std)
# Score each image with a Gaussian KDE and flag the lowest-density one as the outlier
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_std_pca)
scores = kde.score_samples(X_std_pca)
outlier_index = np.argmin(scores)
plt.imshow(data[outlier_index]);
plt.title('The outlier image is:');
Graphical models
ref1 ref2
• Probabilistic graphical models are graphs in which nodes represent random variables,
and the arcs represent conditional independence assumptions.
• They provide an abstract and compact representation of joint probability distributions.
• A graphical model can be either Undirected or Directed
– Undirected graphical models (called Markov Random Fields or Markov networks) have a simple
definition of independence: two (sets of) nodes A and B are conditionally
independent given a third node (set) C if all paths between the nodes in A and B
are separated by a node (set) in C:
$$A \perp B \mid C \;\Leftrightarrow\; P(A, B \mid C) = P(A \mid C)\, P(B \mid C)$$
– Directed (i.e., called Bayesian Networks or Belief Networks) independency takes
into account the directionality of the arcs (more complicated).
• Nodes may hold categorical (e.g., multinomial distributions) or continuous values (e.g.,
Gaussian distribution).
• For a discrete node with continuous parents, logistic/softmax distribution can be used.
• Using multinomials, Gaussians, and the softmax distribution, we can have a rich toolbox
for making complex models.
Example: Consider this example, in which all nodes are binary (True(T) or False(F)).
• By definition, the joint probability of all the nodes in the graph above factorizes as
$P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{parents}(X_i))$,
e.g., $P(T, F, F, T) = 0.5 \times 0.9 \times 0.2 \times 0.2 = 0.018$,
whereas $P(T, F, F, F) = 0.5 \times 0.9 \times 0.8 \times 1 = 0.36$.
## Inference
• The most common task we wish to solve using Bayesian networks is probabilistic
inference.
• It consists in evaluating the probability distribution over some set of variables, given the
values of another set of variables.
• For example, how can we compute $p(A \mid C = c)$? Assuming each variable is binary, a naive
method of calculation is:
– $p(A, C = c) = \sum_{B, D, E} p(A, B, C = c, D, E)$ .......... [16 terms]
Example:
• consider the water sprinkler network, and suppose we observe the fact that the grass is
wet.
• Either it is raining, or the sprinkler is on.
–
$$P(C_1 = T \mid C_3 = T) = \frac{p(C_1 = T, C_3 = T)}{p(C_3 = T)} = \frac{\sum_{C_0} p(C_0, C_1 = T, C_3 = T)}{p(C_3 = T)} = \frac{0.5 \times 0.1 \times 0.9 + 0.5 \times 0.5 \times 0.9}{0.6945} = 0.4$$
– $P(C_2 = T \mid C_3 = T) = 0.70$
• It is more likely that the grass is wet because it is raining: the likelihood ratio is
$0.7079 / 0.4298 = 1.647$.
More efficient method:
• Bottom-up reasoning (called diagnostic) is when we move from effects to
causes (e.g., what is the cause of the wet grass (effect)?).
• Top-down reasoning (called causal) is when we move from causes to effects (e.g., the probability
that the grass will be wet given that it is cloudy).
• Bayes nets can be used for both types of reasoning.
Factor graph propagation
• Algorithmically and implementationally, it’s often easier to convert directed and
undirected graphs into factor graphs, and run factor graph propagation.
$$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(x_4 \mid x_2) \equiv f_1(x_1, x_2)\, f_2(x_2, x_3)\, f_3(x_2, x_4)$$
– where $\mathbf{x}_f$ denotes the variables that factor f depends on, and $\mathbf{x}_f \setminus x$ is all
variables neighboring factor f except x.
• If a variable has only one factor as a neighbor, it can initiate message propagation.
• Once a variable has received all messages from its neighboring factor nodes, it can
compute its probability by multiplying all the messages and renormalizing:
$$p(x) \propto \prod_{h \in n(x)} \mu_{h \to x}(x)$$
• DBN is a Bayesian network extended with additional mechanisms that are capable
of modeling influences over time.
• The temporal extension of Bayesian networks does not mean that the network
structure or parameters changes dynamically, but that a dynamic system is
modeled.
• HMM has one discrete hidden node and one discrete or continuous observed node
per slice.
Topologies:
• This restriction leads to what is known as a left-right HMM (commonly used for
sequential modeling).
• A linear topology is one in which transitions are only permitted to the current state
and the next state.
• If transitions to any state at any time exist, it is known as ergodic.
HMM: Parameters and Training A HMM is completely determined by the following parameters:
• Initial state distribution vector 𝞹 of size n : The probability of starting in each state.
• Transition probability matrix A of size n × n: How likely is to transit to each state, given
some current state.
• Emission probability distributions B (one distribution per state, over the m possible observations): the
probability of generating an observation $o_t$, given some current state $s_t$.
Example:
transition_probability = {
'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
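These tables match the classic weather example; a small Viterbi decoding sketch that uses them (the state/observation lists and the initial distribution below are assumptions consistent with that example):

```python
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}   # assumed initial distribution

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of the most likely path ending in state s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1] + [s]) for prev in states)
            V[t][s] = (prob, path)
    return max(V[-1].values())

prob, path = viterbi(observations, states, start_probability,
                     transition_probability, emission_probability)
print(path, prob)   # most likely hidden weather sequence and its probability
```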
• An HMM algorithm may consist of one or more of these steps: Forward, Backward, and
Update.
• The six common problems [link, link] that can be solved using HMMs are the filtering,
smoothing, forecasting, evaluating, decoding, and learning problems.
– The evaluating, filtering, and forecasting problems can be solved using the forward
algorithm.
– The smoothing problem can be solved using the forward algorithm together with the backward
algorithm.
– The decoding problem can be solved using the Viterbi algorithm; the learning
problem, solved through MLE, uses the forward algorithm to calculate the likelihood.
• In order to learn the aforementioned parameters $\theta = (\pi, A, B)$, the model must be trained
on labeled samples.
– The time-independent stochastic transition matrix $A = \{a_{ij}\} = P(X_t = j \mid X_{t-1} = i)$.
– The initial state distribution (i.e., when t = 1) is given by $\pi_i = P(X_1 = i)$.
– The probability of a certain observation $y_i$ at time t for state $X_t = j$ is given by
$b_j(y_i) = P(Y_t = y_i \mid X_t = j)$.
• The Baum-Welch algorithm, which is an application of the Expectation-Maximization
algorithm to HMMs, can be used to tune these parameters:
$$\theta^{*} = \operatorname*{arg\,max}_{\theta} P(Y \mid \theta)$$
• Step 1 (Forward): Initialize
– $\alpha_i(1) = \pi_i\, b_i(y_1)$,
– $\alpha_i(t+1) = b_i(y_{t+1}) \sum_{j=1}^{N} \alpha_j(t)\, a_{ji}$.
– Because the likelihood of all observations, $L_T$, can be calculated, we can apply the
maximum likelihood method to estimate the unknown parameters: $L(\theta) = P(X \mid \theta)$.
– If the prior distribution of the parameters is given, you can also apply the MAP
method $\operatorname{argmax}\,[P(\theta) \cdot P(X \mid \theta)]$.
• This step can also be used to solve the filtering problem by:
$$p_t(i) \equiv p(S_t = i \mid Y_1, \dots, Y_t) = \alpha_t(i) / L_t$$
• We can also solve the forecasting problem, because the h-step-ahead prediction of the
state probability can be calculated via filtering.
– $\beta_i(T) = 1$,
– $\beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1)\, a_{ij}\, b_j(y_{t+1})$.
– The smoothing problem is solved, because we can calculate the probability of the
current state given all past and future observations:
$$\gamma_i(t) = P(X_t = i \mid Y, \theta) = \frac{P(X_t = i, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$$
–
$$\xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid Y, \theta) = \frac{P(X_t = i, X_{t+1} = j, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(y_{t+1})}{\sum_{k=1}^{N}\sum_{w=1}^{N} \alpha_k(t)\, a_{kw}\, \beta_w(t+1)\, b_w(y_{t+1})},$$
• $a_{ij}^{*} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$: the expected number of transitions from state i to state j compared to the expected total
number of transitions away from state i.
• $b_i^{*}(v_k) = \frac{\sum_{t=1}^{T} \delta_{y_t, v_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$, where $\delta_{y_t, v_k} = \begin{cases} 1 & \text{if } y_t = v_k, \\ 0 & \text{otherwise.} \end{cases}$
Example:
• The new estimate for the $S_1$ to $S_2$ transition is now $\frac{0.22}{2.4234} = 0.0908$.
• Likewise, calculate the other transition probabilities and normalize, so they add to 1.
• Estimate the new emission matrix. For example, consider the probability of observing E given
that the state is $S_1$, i.e., $P(E \mid S_1)$:
• The new estimate for the E emission from $S_1$ is now $\frac{0.2394}{0.2730} = 0.8769$.
• Repeat for N from $S_1$, and for N and E from $S_2$, and normalize.
• To estimate the initial probabilities we assume all sequences start with the hidden state
S1 and calculate the highest probability and then repeat for S2. Again we then normalize
to give an updated initial vector.
Finally we repeat these steps until the resulting probabilities converge satisfactorily.
• Therefore, the probability of a certain observation $y_i$ at time t for state $X_t = j$ is given
by the mean and covariance parameters ($B \equiv \{\mu_i, \Sigma_i\}_{i=1,\dots,K}$) of a multivariate Gaussian
instead.
• That is, the parameter set of the Gaussian HMM is $ \theta = (\pi, A, B) $.
• In a Gaussian mixture HMM, the observation probability distribution is a Gaussian mixture
distribution:
$$Y_t \mid X_t \sim GM\big(\{w_{X_t,1}, \dots, w_{X_t,M}\}, \{\mu_{X_t,1}, \dots, \mu_{X_t,M}\}, \{\Sigma_{X_t,1}, \dots, \Sigma_{X_t,M}\}\big)$$
Example 1:
The following example shows how to train an HMM and use it to forecast the future.
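The data-preparation step that produces the binary series below is not shown; a sketch of how it might be built (the ticker 'GC=F', the period, and the yfinance/hmmlearn imports are assumptions):

```python
import numpy as np
import yfinance as yf
from hmmlearn import hmm

# Download gold futures prices and convert daily moves into a binary rise/fall series
prices = yf.download('GC=F', period='2y')['Close'].to_numpy().ravel()
stat_data = (np.diff(prices) > 0).astype(int)  # 1 = price rose, 0 = price fell
stat_data[:10]
```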
[*********************100%***********************] 1 of 1 completed
array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
n_fits = 500
train = stat_data[0:-100]
val = stat_data[-100:]
best_score = best_model = None
# try n fits to avoid local minima
for idx in range(n_fits):
    model = hmm.CategoricalHMM(n_components=2, init_params='se', n_iter=500, random_state=idx)
    model.transmat_ = np.array([np.random.dirichlet([0.7, 0.3]),
                                np.random.dirichlet([0.3, 0.7])])
    model.fit(train.reshape(-1, 1))
    # keep the model with the best validation score
    score = model.score(val.reshape(-1, 1))
    if best_score is None or score > best_score:
        best_model, best_score = model, score

# Is it more probable for gold to rise or to fall over the next three days successively?
if (best_model.score(np.concatenate([stat_data[-20:-4], np.array([1, 1, 1])]).reshape(-1, 1)) >
        best_model.score(np.concatenate([stat_data[-20:-4], np.array([0, 0, 0])]).reshape(-1, 1))):
    print('-The model predicts that Gold price will rise')
else:
    print('-The model predicts that Gold price will fall')