Machine_Learning_II

This document outlines a machine learning course covering various learning paradigms, model evaluation techniques, and performance metrics for classification and regression models. It includes discussions of supervised, unsupervised, and semi-supervised learning, as well as practical applications and examples using Python libraries. Key topics include ROC curves, confusion matrices, and metrics such as accuracy, precision, recall, and R-squared for assessing model performance.

Course Description

This is the second course in the machine learning sequence.

The course
discusses different learning paradigms: supervised, unsupervised and semi-supervised models,
generative and discriminative learning, parametric/non-parametric learning, frequentist and
Bayesian methods. Topics covered also include decision trees, ensemble methods, neural
networks and deep learning, reinforcement learning, and topics in machine learning theory. The
course discusses issues in large-scale machine learning. Concepts are discussed in the context
of applications such as collaborative filtering, autonomous navigation, intrusion detection, text
and web data processing, and recommender systems.

Textbooks:

1. Pattern Recognition and Machine Learning, Christopher Bishop, Springer.
2. Pattern Classification, 2nd Ed., Richard O. Duda, Peter E. Hart, David G. Stork.
3. Applied Predictive Modeling, Max Kuhn and Kjell Johnson.

Course Contents:

1. Evaluating ML Models
2. Generative vs. Discriminative Learning
3. Different learning paradigms: supervised, unsupervised, and semi-supervised.
4. Density Estimation and Anomaly Detection

5. Graphical models
6. Reinforcement Learning
7. Large-Scale Machine Learning

Evaluating ML models
• Machine Learning involves constructing mathematical models to understand data and
make accurate predictions on new, unseen data.
• The objective is not just to create models but to build high-quality models that
demonstrate strong predictive capabilities.
• Performance metrics are essential tools for evaluating model effectiveness, allowing us
to determine how well a model generalizes to new data and to measure its reliability.
• These metrics help assess the predictive power of the model, ensuring that it is not only
accurate on the training data but also performs well in real-world applications.
• By using performance metrics, we can compare different models, identify areas for
improvement, and refine the model to achieve optimal performance.
• Ultimately, these evaluations ensure that the model is robust, reliable, and capable of
making meaningful predictions.

Metrics for Classification


• In classification tasks, results are often summarized using a confusion matrix,
which provides a comprehensive view of model performance by categorizing
predictions into four distinct groups based on their true and predicted labels:
– True Positives (TP): Correctly predicted positive instances.
– True Negatives (TN): Correctly predicted negative instances.
– False Positives (FP): Incorrectly predicted positive instances (also known as Type
I errors).
– False Negatives (FN): Incorrectly predicted negative instances (also known as
Type II errors).

• Type I Error (False Positive): Occurs when the model incorrectly predicts a
positive outcome. For example, in a medical test for a disease, a Type I error would
mean predicting that a patient has the disease when they do not.

• Type II Error (False Negative): Occurs when the model fails to predict a positive
outcome. In the same medical context, a Type II error would mean predicting that a
patient does not have the disease when they actually do.

• False positives and false negatives are not equally problematic; their relative cost
depends on the application. For instance, in tumor detection, a Type II error (failing to
detect a tumor) could have more severe consequences than a Type I error (falsely
identifying a tumor).
Based on the confusion matrix, several metrics can be extracted to assess the different aspects
of the model.

Main metrics: The following metrics are commonly used to assess the performance of
classification models.

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall performance of the model |
| Precision | TP / (TP + FP) | Accuracy of positive predictions |
| Recall (Sensitivity) | TP / (TP + FN) | Coverage of actual positive samples |
| Specificity | TN / (TN + FP) | Coverage of actual negative samples |
| F1 Score | 2TP / (2TP + FP + FN) | Harmonic mean of precision and recall; useful for unbalanced classes |
| Macro Average | (1/n) Σᵢ₌₁ⁿ Mᵢ | Averages metric Mᵢ across all n classes, treating each class equally |
| Weighted Macro Average | Σᵢ (nᵢ · Mᵢ) / Σᵢ nᵢ | Averages the metric across all classes, weighted by class size nᵢ |

So, is it possible to have perfect recall (sensitivity) with a specificity of zero? What would that mean?
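As a quick illustration, here is a minimal sketch of how these metrics follow from the four confusion-matrix counts; the counts below are made up for illustration. (Note that a classifier that labels every sample positive has FN = TN = 0, which gives perfect recall but zero specificity.)

TP, TN, FP, FN = 40, 45, 5, 10  # hypothetical counts, for illustration only

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # sensitivity
specificity = TN / (TN + FP)
f1 = 2 * TP / (2 * TP + FP + FN)

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, "
      f"Recall: {recall:.3f}, Specificity: {specificity:.3f}, F1: {f1:.3f}")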

ROC (Receiver Operating Characteristic):


• The Receiver Operating Characteristic (ROC) curve is a graphical representation that
illustrates the relationship between the True Positive Rate (TPR) and the False Positive
Rate (FPR) as the decision threshold is varied.
• This curve is widely used to visualize the trade-offs between sensitivity (recall) and
specificity for all possible thresholds in a diagnostic test or combination of tests.
• The area under the ROC curve (AUC or AUROC) quantifies the overall ability of the test to
discriminate between positive and negative classes, with higher values indicating better
performance.
• The key metrics related to the ROC and AUC are summarized in the table below:

| Metric | Formula | Equivalent |
| --- | --- | --- |
| True Positive Rate (TPR) | TP / (TP + FN) | Recall, sensitivity |
| False Positive Rate (FPR) | FP / (TN + FP) | 1 − specificity |

Example: 1- Let's load the Iris dataset and fit it to a KNN classification model.

iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
iris = load_iris()

# Select two overlapping slices so that the two classes are not perfectly
# separable: indices 0-59 span all of class 0 plus ten class-1 points, and
# indices 45-99 span five class-0 points plus all of class 1
X_class1 = iris.data[0:60]
X_class2 = iris.data[45:100]
X = np.vstack((X_class1, X_class2))

y_class1 = iris.target[0:60]
y_class2 = iris.target[45:100]

# Set labels to {0, 1}
y_class1 = np.zeros_like(y_class1)
y_class2 = np.ones_like(y_class2)

y = np.hstack((y_class1, y_class2))

# Split data
xtrain, xtest, ytrain, ytest = train_test_split(X, y, train_size=0.9)

# Train KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(xtrain, ytrain)

# Compute ROC curve and ROC area
fpr, tpr, _ = roc_curve(ytest, knn.predict_proba(xtest)[:, 1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) for Two Classes from the Iris Dataset')
plt.legend(loc="lower right")
plt.show()
2- Evaluate the trained model via a confusion matrix.

from sklearn.metrics import confusion_matrix, classification_report

# Train a 1-NN model and get binary predictions on the test set
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(xtrain, ytrain)
ytest_pred = knn.predict(xtest)

# Calculate confusion matrix
conf_matrix = confusion_matrix(ytest, ytest_pred)

# Print confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

# Print classification report
print("\nClassification Report:")
print(classification_report(ytest, ytest_pred))

# Plot confusion matrix
plt.figure(figsize=(5, 4))
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
classes = [0, 1]
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes)
plt.yticks(tick_marks, classes)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Confusion Matrix:
[[4 2]
 [1 5]]

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.67      0.73         6
           1       0.71      0.83      0.77         6

    accuracy                           0.75        12
   macro avg       0.76      0.75      0.75        12
weighted avg       0.76      0.75      0.75        12
Regression metrics
• Accuracy is not applicable for regression models. Instead, the performance of a
regression model is typically evaluated using error metrics.

• Basic Metrics: In the context of a regression model $f$, several foundational metrics are utilized to gauge its performance:

– Total Sum of Squares (TSS): This metric quantifies the total variation in the
dependent variable by measuring how far each observed value deviates from
the sample mean. A higher TSS indicates greater variability in the data.

– Explained Sum of Squares (ESS): ESS reflects the proportion of the total
variation that is accounted for by the regression model. A higher ESS signifies
that the model effectively captures the underlying patterns in the data.

– Residual Sum of Squares (RSS): RSS measures the variation in the model’s
errors, providing insight into how well the model fits the data. A lower RSS
indicates better model performance and greater explainability of the data.

Formulas:

$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

$SS_{explained} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• Coefficient of Determination ($R^2$): The $R^2$ value measures the proportion of variance in the dependent variable that can be predicted from the independent variables. It is defined as:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

• Main Metrics: The following metrics are frequently employed to assess the
performance of regression models, taking into account the number of predictors n
utilized in the model:

– $ SS_{res} $: Residual sum of squares.
– $ \hat{\sigma}^2 $: Estimated variance of the residuals.
– $ n $: Number of predictors (independent variables).
– $ m $: Total number of observations.
– $ L $: Likelihood of the model.
• Mallow's $C_p$ is a widely used criterion for model selection, aimed at identifying the model that offers the best predictive performance.

• The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two powerful metrics that provide insights into model performance while taking into account the complexity of the models (a small numerical sketch follows this list).

• Adjusted R² is another useful metric that adjusts the R² value for the number of predictors in the model. Unlike R², which never decreases with the addition of new variables, Adjusted R² will decrease if irrelevant or redundant predictors are added to the model, making it a more reliable measure for assessing model quality and predictive accuracy.
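To make AIC and BIC concrete, here is a minimal sketch assuming the common Gaussian-likelihood forms of these criteria (conventions differ slightly across references, often by additive constants); the RSS values and sample sizes are taken from the example that follows.

import numpy as np

def aic_bic(rss, m, n):
    # Gaussian-likelihood forms (up to additive constants):
    # AIC = m ln(RSS/m) + 2n, BIC = m ln(RSS/m) + n ln(m),
    # with m observations and n predictors (the document's notation)
    aic = m * np.log(rss / m) + 2 * n
    bic = m * np.log(rss / m) + n * np.log(m)
    return aic, bic

# RSS values from the example below, evaluated on 20 test observations
print(aic_bic(rss=14.51, m=20, n=2))  # two real predictors
print(aic_bic(rss=14.39, m=20, n=7))  # plus five useless columns

Because the penalty terms grow with the number of predictors n, the model stuffed with useless predictors scores worse on both criteria even though its RSS is marginally smaller.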
Example: This example illustrates the evaluation of linear regression models using various
performance metrics, focusing on the impact of adding predictors and including a "useless"
predictor.

# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate random data for demonstration with two predictors
np.random.seed(0)
X1 = 2 * np.random.rand(100, 1)  # First predictor
X2 = 3 * np.random.rand(100, 1)  # Second predictor
y = 4 + 3 * X1 + 2 * X2 + np.random.randn(100, 1)  # Linear relationship with noise

# Combine the predictors into one feature matrix
X = np.hstack((X1, X2))

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the multiple linear regression model with two predictors
model_multiple = LinearRegression()
model_multiple.fit(X_train, y_train)

# Make predictions on the test set with two predictors
predictions_multiple = model_multiple.predict(X_test)

# Calculate R^2 with two predictors
r2_multiple = r2_score(y_test, predictions_multiple)

# Add useless predictors (random noise and rescaled copies of it) to the X matrix
useless_predictor = np.random.randn(100, 1)
X_with_useless = np.hstack((X, useless_predictor, useless_predictor * 6,
                            useless_predictor / 2, useless_predictor,
                            useless_predictor))

# Split the updated data into training and test sets
X_train_with_useless, X_test_with_useless, y_train, y_test = train_test_split(
    X_with_useless, y, test_size=0.2, random_state=42)

# Create and train the model with seven predictors (including the useless ones)
model_with_useless = LinearRegression()
model_with_useless.fit(X_train_with_useless, y_train)

# Make predictions on the test set with seven predictors
predictions_with_useless = model_with_useless.predict(X_test_with_useless)

# Calculate R^2 with seven predictors (including the useless ones)
r2_with_useless = r2_score(y_test, predictions_with_useless)

# Calculate the number of observations (n) and number of predictors (k)
n = len(y_test)
k_multiple = X.shape[1]
k_with_useless = X_with_useless.shape[1]

# Calculate adjusted R-squared for both models
adjusted_r2_multiple = 1 - ((1 - r2_multiple) * (n - 1) / (n - k_multiple - 1))
adjusted_r2_with_useless = 1 - ((1 - r2_with_useless) * (n - 1) / (n - k_with_useless - 1))

# Calculate the mean of the observed values
mean_y = np.mean(y_test)

# Calculate Total Sum of Squares (TSS)
tss = np.sum((y_test - mean_y) ** 2)

# Calculate Explained Sum of Squares (ESS)
ess_with_useless = np.sum((predictions_with_useless - mean_y) ** 2)
ess_multiple = np.sum((predictions_multiple - mean_y) ** 2)

# Calculate Residual Sum of Squares (RSS)
rss_with_useless = np.sum((y_test - predictions_with_useless) ** 2)
rss_multiple = np.sum((y_test - predictions_with_useless) ** 2) if False else np.sum((y_test - predictions_multiple) ** 2)

# Print the results
print(f"Total Sum of Squares (TSS): {tss:.2f}")
print(f"Explained Sum of Squares (ESS) with Two Predictors: {ess_multiple:.2f}")
print(f"Residual Sum of Squares (RSS) with Two Predictors: {rss_multiple:.2f}")
print(f"Explained Sum of Squares (ESS) with Seven Predictors (including useless predictors): {ess_with_useless:.2f}")
print(f"Residual Sum of Squares (RSS) with Seven Predictors (including useless predictors): {rss_with_useless:.2f}")
print(f"R^2 with Two Predictors: {r2_multiple:.2f}")
print(f"Adjusted R^2 with Two Predictors: {adjusted_r2_multiple:.2f}")
print(f"R^2 with Seven Predictors (including useless predictors): {r2_with_useless:.2f}")
print(f"Adjusted R^2 with Seven Predictors (including useless predictors): {adjusted_r2_with_useless:.2f}")

Total Sum of Squares (TSS): 75.85
Explained Sum of Squares (ESS) with Two Predictors: 73.92
Residual Sum of Squares (RSS) with Two Predictors: 14.51
Explained Sum of Squares (ESS) with Seven Predictors (including useless predictors): 72.84
Residual Sum of Squares (RSS) with Seven Predictors (including useless predictors): 14.39
R^2 with Two Predictors: 0.81
Adjusted R^2 with Two Predictors: 0.79
R^2 with Seven Predictors (including useless predictors): 0.81
Adjusted R^2 with Seven Predictors (including useless predictors): 0.70

Model selection
• To evaluate a model properly, the available data is usually split into three parts: a training set, a validation set (used for model selection), and a test set.
• Once the model is chosen, it is trained on the entire training data and tested on the unseen test set.

Cross-validation:

• A method used to select a model that does not rely too heavily on the initial training split.
• The most common variants are k-fold cross-validation (train on k - 1 folds, validate on the remaining fold) and leave-p-out cross-validation (validate on every possible subset of p points).

In both cases, the error is then averaged over the k folds/parts and is named the cross-validation error.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing (optional for demonstration)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Define 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# List to store accuracy scores
fold_accuracies = []

# Perform cross-validation
for fold, (train_index, val_index) in enumerate(kf.split(X_train)):
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Train the model
    model.fit(X_train_fold, y_train_fold)

    # Predict on validation set
    y_val_pred = model.predict(X_val_fold)

    # Calculate accuracy
    accuracy = accuracy_score(y_val_fold, y_val_pred)
    fold_accuracies.append(accuracy)

    # Print results for this fold
    print(f"Fold {fold + 1}: Accuracy = {accuracy:.4f}")

# Print overall results
print(f"\nMean Accuracy: {np.mean(fold_accuracies):.4f}")
print(f"Standard Deviation of Accuracy: {np.std(fold_accuracies):.4f}")

Fold 1: Accuracy = 0.9048
Fold 2: Accuracy = 0.9048
Fold 3: Accuracy = 0.9048
Fold 4: Accuracy = 0.8571
Fold 5: Accuracy = 0.9524

Mean Accuracy: 0.9048
Standard Deviation of Accuracy: 0.0301
The variation in accuracy, with a difference of approximately 9.53% between the best and worst
folds, underscores why relying on a single train-test split to evaluate a model can be misleading.

Interpretable Models and Explainability


1. Definitions
• Interpretable Model: A model that is easy to understand and provides clear insights into
how input features influence predictions.
• Explainability (Post-Hoc Interpretation): Techniques used to explain complex models,
clarifying how decisions are made even when the model itself is hard to understand.

2. Interpretable Models
• Simple Models: Models where the relationship between inputs and outputs is direct and
transparent, allowing easy tracing of predictions.
• Rule-Based Models: These make decisions based on logical conditions, making it clear
why certain predictions are made.
• Additive Models: These evaluate each input's effect independently, making the
contribution of each feature easy to understand.

3. Challenges with Complex Models


• Limited Interpretability: Complex models capture intricate patterns but are harder to
explain.
• Non-Linear Relationships: Interactions between features make it difficult to isolate the
effect of individual features.
• Aggregated Predictions: Some models combine multiple components, complicating
their interpretation.

4. Explainability Techniques
• Feature Importance: Quantifies which features most influence predictions.
• Local Explanations: Explains individual predictions by approximating the complex model
locally.
• Visual Tools: Plots and heatmaps help show the effect of input features on the
predictions.

Techniques
1. Feature Importance: Ranks the input features by their influence on the model’s
predictions, providing insight into which factors are most important.

2. Partial Dependence Plots (PDPs): Show how changing one feature impacts the model's predictions while keeping other features constant, giving a global view of feature influence (a brief sketch follows this list).

3. SHAP (SHapley Additive exPlanations): Uses game theory to assign an importance value to each feature, explaining the contribution of each feature to a specific prediction.

4. LIME (Local Interpretable Model-Agnostic Explanations): Creates a simple model locally around individual predictions to explain how the complex model arrived at that result.

5. Counterfactual Explanations: Provide insights into how to change input values to achieve a different prediction, offering actionable guidance on altering outcomes.

6. Surrogate Models: Use a simpler, interpretable model to approximate the behavior of a complex model, making it easier to understand its decisions.
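As a brief sketch of technique 2, recent versions of scikit-learn provide PartialDependenceDisplay in sklearn.inspection; the feature indices chosen below are arbitrary, for illustration only.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data.data, data.target)

# Show how the predicted probability of class 0 changes with the
# first two features, averaging out the others
PartialDependenceDisplay.from_estimator(
    model, data.data, features=[0, 1],
    feature_names=data.feature_names, target=0)
plt.show()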
Example 1: Here's an example using LIME (Local Interpretable Model-Agnostic Explanations) to
explain a prediction.

# !pip install lime scikit-learn

import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import lime
import lime.lime_tabular

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Use LIME to explain a prediction
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,
    feature_names=data.feature_names,
    class_names=data.target_names,
    discretize_continuous=True)

# Choose an instance from the test set to explain
instance = X_test[0]
print(f"Instance to predict: {instance}")
print(f"True label: {data.target_names[y_test[0]]}")

# Predict the class
prediction = model.predict([instance])
print(f"Predicted class: {data.target_names[prediction[0]]}")

# Explain the model's prediction
exp = explainer.explain_instance(instance, model.predict_proba, num_features=2)
exp.show_in_notebook(show_table=True, show_all=False)

Instance to predict: [6.1 2.8 4.7 1.2]
True label: versicolor
Predicted class: versicolor

<IPython.core.display.HTML object>

Example 2: Here's an example of using Feature Importance with a Random Forest classifier.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Train a random forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importance
feature_importances = model.feature_importances_
feature_names = data.feature_names

# Create a DataFrame to view feature importances
feature_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# print(feature_df)

# Plot feature importance
plt.figure(figsize=(8, 2))
plt.barh(feature_df['Feature'], feature_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.title('Feature Importance from Random Forest')
plt.gca().invert_yaxis()  # Display the most important feature at the top
plt.show()
Combining Quantitative and Qualitative Approaches in Model Interpretability
In the realm of Interpretable Models and Explainability, it is crucial to employ both quantitative
and qualitative methods for a comprehensive understanding of model performance. Relying
solely on quantitative approaches can provide useful metrics but may not fully capture the
nuances of model behavior. Here’s why integrating qualitative analysis is important:

1. Complementary Nature of Approaches


• Quantitative Methods:
– Provide Metrics: Quantitative approaches offer essential performance metrics
such as accuracy, precision, recall, and F1 scores, which help evaluate the overall
effectiveness of the model.
– Identify Trends: They reveal general performance trends and aggregate statistics
that are useful for initial model assessment.
• Qualitative Methods:
– Deep Dive into Errors: Qualitative analysis involves detailed inspection of specific
cases where the model made incorrect predictions, offering insights into why
these errors occurred.
– Contextual Understanding: By visually comparing erroneous predictions with the
actual classes, you can understand the model’s behavior on individual cases,
identifying if errors are due to specific features or patterns.

2. Benefits of Combining Approaches


• Identifying Patterns: Qualitative analysis can uncover patterns or systematic issues in
model errors that may not be apparent from quantitative metrics alone.
• Improving Interpretability: Manual inspection helps to understand whether the model's
predictions align with human intuition and domain knowledge, providing deeper
interpretability.
• Actionable Insights: It provides actionable insights for refining the model, such as
adjusting features or retraining, based on observed discrepancies.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Preprocess data
X_train, X_test = X_train / 255.0, X_test / 255.0  # Normalize pixel values
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)  # One-hot encode labels

# Build a simple neural network model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile and train the model
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, validation_split=0.2, batch_size=64, verbose=2)

# Make predictions
y_pred_prob = model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)
y_true = np.argmax(y_test, axis=1)

# Identify erroneous samples
incorrect_indices = np.where(y_pred != y_true)[0]

# Visualize some erroneous samples
num_samples = 6
plt.figure(figsize=(10, 6))

for i in range(num_samples):
    idx = incorrect_indices[i]
    plt.subplot(1, num_samples, i + 1)
    plt.imshow(X_test[idx], cmap='gray')
    plt.title(f"True: {y_true[idx]}\nPred: {y_pred[idx]}")
    plt.axis('off')

plt.show()

Epoch 1/5
750/750 - 4s - 6ms/step - accuracy: 0.9072 - loss: 0.3313 - val_accuracy: 0.9490 - val_loss: 0.1805
Epoch 2/5
750/750 - 3s - 5ms/step - accuracy: 0.9577 - loss: 0.1476 - val_accuracy: 0.9609 - val_loss: 0.1308
Epoch 3/5
750/750 - 3s - 4ms/step - accuracy: 0.9701 - loss: 0.1037 - val_accuracy: 0.9689 - val_loss: 0.1066
Epoch 4/5
750/750 - 6s - 8ms/step - accuracy: 0.9769 - loss: 0.0788 - val_accuracy: 0.9721 - val_loss: 0.0954
Epoch 5/5
750/750 - 3s - 4ms/step - accuracy: 0.9821 - loss: 0.0620 - val_accuracy: 0.9728 - val_loss: 0.0912
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step

Diagnostics
It is worth mentioning that high precision, accuracy, or any other metric does not necessarily reflect the true performance of the model; it might instead reflect a state of overfitting. To get more insight into overfitting, it is fundamental to understand the roles of variance and bias:

• Bias: the difference between the expected prediction and the correct model (generally the difference between the average prediction and the target value).

• Variance: the variability of the model prediction for given data points.

The relationship between bias and variance can be summarized as follows: the simpler the model, the higher the bias; the more complex the model, the higher the variance.

The following gives real cases of underfitting and overfitting and some possible remedies.

Regularization:

• The regularization procedure aims to keep the model from overfitting the data and thus deals with high-variance issues.
• It reduces variance at the cost of introducing some bias.
• Decreasing the model's variability decreases the model's complexity, that is, the effective number of predictors.
• This is done by penalizing predictors (pushing their coefficients toward 0) if they stray too far from zero, thus enforcing them to be close or equal to zero.

The commonly used regularization techniques are summed up below:

| Technique | Penalty | Effect |
| --- | --- | --- |
| LASSO | $\lambda \|w\|_1$ | Shrinks some coefficients exactly to zero (variable selection) |
| Ridge | $\lambda \|w\|_2^2$ | Shrinks coefficients toward zero without eliminating variables |
| Elastic Net | $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$ | Tradeoff between the two |

Example:

In the following example, we will see how regularization can increase the regression performance on unseen test data.

import numpy as np
import matplotlib.pyplot as plt

def generate_dataset(B, n):
    # Polynomial y = sum_i B[i] * X^i plus noise
    e = np.random.normal(-15, 15, n)
    X = 2 - 3 * np.random.normal(0, 1, n)
    y = 0
    for i in range(len(B)):
        y += B[i] * X**i
    y += e
    return X, y

B = [0.1, 0.2, 0.3, -0.4]  # [beta0, beta1, beta2, beta3]

X, y = generate_dataset(B, 50)
plt.scatter(X, y, s=15)

<matplotlib.collections.PathCollection at 0x7f25af937450>
# Building and fitting the Linear Regression model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X, y = generate_dataset(B, 50)

# Generate new test data
X2, y2 = generate_dataset(B, 20)

poly = PolynomialFeatures(degree=8, include_bias=False)

# The test data must go through the same degree-8 transform as the training data
poly_features = poly.fit_transform(X.reshape(-1, 1))
poly_features2 = poly.fit_transform(X2.reshape(-1, 1))

linearModel = LinearRegression()
linearModel.fit(poly_features, y)

# Evaluating the Linear Regression model
print(linearModel.score(poly_features2, y2))

0.6364593734835868

from sklearn.linear_model import Ridge

# List to maintain the different test scores
scores_ridge = []
# List to maintain the different values of alpha
alpha = []

# Loop to compute the test score for increasing regularization strengths
for i in range(1, 9):
    ridgeModel = Ridge(alpha=i * 200)
    ridgeModel.fit(poly_features, y)
    score = ridgeModel.score(poly_features2, y2)
    scores_ridge.append(score)
    alpha.append(i * 200)

# Print the score obtained for each value of alpha
for i in range(0, len(alpha)):
    print(str(alpha[i]) + ' : ' + str(scores_ridge[i]))

200 : 0.6653968262887501
400 : 0.6711444170343008
600 : 0.673506823781417
800 : 0.6746800126019875
1000 : 0.6753051013366153
1200 : 0.6756383882406153
1400 : 0.6758019669489553
1600 : 0.6758609933584792

Model Selection and Hyperparameter Tuning


Choosing the right model and fine-tuning its parameters is critical for building effective machine
learning systems. This process helps balance the trade-offs between model complexity,
performance, and generalization.

1. Model Selection
• Purpose of Model Selection:
The goal is to identify the model that best fits the problem by comparing multiple
models on a validation dataset. This involves assessing their predictive performance,
robustness, and suitability for the task at hand.

2. Hyperparameter Tuning
Hyperparameters are parameters that define the structure of the model and influence how it
learns. They are set before the training process (e.g., the number of layers in a neural network,
the learning rate, or the maximum depth of a decision tree).

• Grid Search: A systematic method for hyperparameter tuning, where all possible combinations of a predefined set of hyperparameters are tested. This approach can be computationally expensive but guarantees an exhaustive search.
• Random Search: Instead of testing all possible combinations, random search samples hyperparameters randomly from a distribution. It is more efficient than grid search when searching large hyperparameter spaces (see the sketch after this list).
• Bayesian Optimization: This method builds a probabilistic model of the objective function and uses it to select the most promising hyperparameters. It is more efficient than grid or random search and can lead to better results with fewer trials.
• Early Stopping: During training, the model's performance is monitored on a validation set. If performance no longer improves, training is halted early to prevent overfitting.
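Before the grid search example, here is a short sketch of Random Search using scikit-learn's RandomizedSearchCV; the sampling distributions and the n_iter value are illustrative choices.

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

data = load_iris()
param_dist = {
    'n_estimators': randint(50, 300),   # sampled from a distribution, not enumerated
    'max_depth': [None, 10, 20, 30],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20, cv=5, random_state=42)
search.fit(data.data, data.target)
print(search.best_params_)

With n_iter=20, only 20 sampled configurations are evaluated, instead of every cell of a grid; this is what makes random search cheaper on large hyperparameter spaces.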
Example: Here's an example demonstrating how to perform hyperparameter tuning using Grid
Search with a Random Forest classifier and the Iris dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = RandomForestClassifier(random_state=42)

# Define the hyperparameters grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)
# Print the best hyperparameters
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Best model
best_model = grid_search.best_estimator_

# Predict and evaluate
y_pred = best_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Optionally, view all results
results = pd.DataFrame(grid_search.cv_results_)
print("\nGrid Search Results:")
print(results[['params', 'mean_test_score', 'std_test_score']])

Fitting 5 folds for each of 108 candidates, totalling 540 fits

Best Hyperparameters:
{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Grid Search Results:
                                                params  mean_test_score  std_test_score
0    {'max_depth': None, 'min_samples_leaf': 1, 'mi...         0.933333        0.048562
1    {'max_depth': None, 'min_samples_leaf': 1, 'mi...         0.942857        0.035635
2    {'max_depth': None, 'min_samples_leaf': 1, 'mi...         0.942857        0.035635
3    {'max_depth': None, 'min_samples_leaf': 1, 'mi...         0.933333        0.048562
4    {'max_depth': None, 'min_samples_leaf': 1, 'mi...         0.942857        0.035635
..                                                 ...              ...             ...
103  {'max_depth': 30, 'min_samples_leaf': 4, 'min_...         0.933333        0.048562
104  {'max_depth': 30, 'min_samples_leaf': 4, 'min_...         0.933333        0.048562
105  {'max_depth': 30, 'min_samples_leaf': 4, 'min_...         0.933333        0.048562
106  {'max_depth': 30, 'min_samples_leaf': 4, 'min_...         0.933333        0.048562
107  {'max_depth': 30, 'min_samples_leaf': 4, 'min_...         0.933333        0.048562

[108 rows x 3 columns]

Decision Theory: Generative and Discriminative Models

Definitions

Informally:

• Generative models can generate new data instances.
• Discriminative models discriminate between different kinds of data instances.

A generative model could generate new photos of animals that look like real animals, while a discriminative model could tell a dog from a cat.

If the task is to determine the language that someone is speaking:

• Generative approach: learn each language, then determine which language the speech belongs to.
• Discriminative approach: determine the linguistic differences without learning any language, a much easier task!

Formally, given a set of data instances X and a set of labels Y:

• Generative models capture the joint probability $p(X, Y)$, or just $p(X)$ if there are no labels.
• Discriminative models capture the conditional probability $p(Y | X)$.

Example:

• Suppose we have the following data in the form (x, y): (1,0), (1,0), (2,0), (2,1). Then $p(x, y)$ is:

      y=0   y=1
x=1   1/2   0
x=2   1/4   1/4

and $p(y | x)$ is:

      y=0   y=1
x=1   1     0
x=2   1/2   1/2

• The distribution $p(y | x)$ is the natural distribution for classifying a given example x into a class y (discriminative).
• $p(x, y)$ (generative) can be transformed into $p(y | x)$ by applying Bayes' rule and then used for classification: $p(x, y) = p(x)\,p(y | x)$, so $p(y | x) = p(x, y) / p(x)$.
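A tiny NumPy sketch reproducing the worked example: normalizing each row of the joint table $p(x, y)$ by $p(x)$ recovers the conditional table $p(y | x)$.

import numpy as np

# Joint distribution p(x, y) from the example above:
# rows are x = 1, 2; columns are y = 0, 1
p_xy = np.array([[1/2, 0.0],
                 [1/4, 1/4]])

# p(y | x) = p(x, y) / p(x): normalize each row by its sum
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x
print(p_y_given_x)
# [[1.  0. ]
#  [0.5 0.5]]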

The differences between Discriminative and Generative are summarized in the following table:

| Generative | Discriminative |
| --- | --- |
| Models classes via pdfs and prior probabilities | Directly estimates posterior probabilities with no attempt to model the underlying probability distributions |
| Can generate synthetic data points | Dedicated to classifying new data, which grants better performance |
| A full probabilistic model of all variables | Provides a model only for the target variables that we want to predict |
| Hard to estimate distributions accurately | Easier to tune |
| Popular models: Gaussians, Naïve Bayes, mixtures of multinomials, mixtures of Gaussians, etc. | Logistic regression, SVMs, neural networks, nearest neighbor, etc. |
• In practice, generative models are most popular when we have phenomena that are well
approximated by the normal distribution, and we have a lot of sample points, so we can
approximate the shape of the distribution well.

The three ways to build classifiers

Classifiers can be characterized by PDFs and priors, by posteriors, or by neither.

1. Generative models (e.g., LDA)
   • Assume sample points come from probability distributions, different for each class.
   • Guess the form of the distributions.
   • For each class C, fit distribution parameters to the class-C points, giving $f(X | Y = C)$ [likelihood].
   • For each class C, estimate $P(Y = C)$ [prior].
   • Bayes' theorem gives $P(Y | X)$.
   • Pick the class C that maximizes $P(Y = C | X = x)$ [posterior probability] (equivalently, maximizes $P(X = x | Y = C)\,P(Y = C)$).
2. Discriminative models (e.g., logistic regression)
   • Model $P(Y | X)$ directly.
3. Find the decision boundary (e.g., SVM)
   • Model r(x) directly (no posterior).

• Advantage of (1 & 2): $P(Y | X)$ tells you the probability that the guess is wrong. [This is something SVMs don't do.]
• Advantage of (1): you can diagnose outliers: $P(X)$ is very small.
• Disadvantages of (1): it is often hard to estimate distributions accurately, and real distributions rarely match standard ones.

Gaussian Discriminant Analysis (QDA and LDA)

1. Gaussian Discriminant Analysis is a generative technique that rests on one fundamental assumption: each class comes from a normal (Gaussian) distribution, $X \sim \mathcal{N}(\mu, \sigma^2)$, with density

$f(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \exp\left(-\frac{\|x - \mu\|^2}{2\sigma^2}\right)$

where $x$ and $\mu$ are vectors, $\sigma$ is a scalar, and $d$ is the dimension.

2. For each class C, we estimate the mean $\mu_C$, the variance $\sigma_C^2$, and the prior $\pi_C = P(Y = C)$.

3. Given x, the Bayes decision rule $r^*(x)$ predicts the class C that maximizes $f_C(x) = f(X = x | Y = C)\,\pi_C$.

4. $\ln(w)$ is a monotonically increasing function for $w > 0$, so the former is equivalent to maximizing

$Q_C(x) = \ln\left((\sqrt{2\pi})^d f_C(x)\right) = -\frac{\|x - \mu_C\|^2}{2\sigma_C^2} - d \ln \sigma_C + \ln \pi_C$

In a 2-class problem, we can incorporate an asymmetrical loss function instead of the prior $\pi_C$. In a multi-class problem, this gets more difficult.

Quadratic Discriminant Analysis (QDA)

• For simplicity, suppose that we have two classes $C_1$ and $C_2$; then pick the class with the bigger posterior probability:

$r^*(x) = \begin{cases} C_1 & \text{if } Q_{C_1}(x) - Q_{C_2}(x) > 0 \\ C_2 & \text{otherwise} \end{cases}$

• The decision function is quadratic in x. The Bayes decision boundary is $Q_{C_1}(x) - Q_{C_2}(x) = 0$.
• So far, we have worked with a Gaussian distribution in which $x$, $\mu$, and $\sigma$ reduce to scalars; the same treatment applies equally well in multiple dimensions.
• In the case of anisotropic Gaussian distributions, the variance becomes a vector describing the variability in each direction.
• One should know that QDA works very naturally with more than 2 classes: the feature space gets partitioned into regions.
• One might not be satisfied with just knowing how each point is classified.
• One of the great things about QDA is that it allows us to determine the probability that a classification is correct.
• To recover posterior probabilities in the 2-class case, use Bayes' rule: $P(Y = C_1 | X = x) = s(Q_{C_1}(x) - Q_{C_2}(x))$, where $s$ is the logistic (sigmoid) function.

Linear Discriminant Analysis (LDA)

• LDA is a variant of QDA with linear decision boundaries.
• It's less likely to overfit than QDA.
• The fundamental assumption is that all the Gaussians have the same variance $\sigma^2$.
• The equations simplify nicely in this case:

$Q_{C_1}(x) - Q_{C_2}(x) = \frac{(\mu_{C_1} - \mu_{C_2}) \cdot x}{\sigma^2} - \frac{\|\mu_{C_1}\|^2 - \|\mu_{C_2}\|^2}{2\sigma^2} + \ln \pi_{C_1} - \ln \pi_{C_2}$

• You should note that the quadratic terms in $Q_{C_1}$ and $Q_{C_2}$ canceled each other out.
• Now we obtain a linear classifier: choose the C that maximizes the following linear discriminant function, which works for any number of classes:

$\frac{\mu_C \cdot x}{\sigma^2} - \frac{\|\mu_C\|^2}{2\sigma^2} + \ln \pi_C$

• In the case of 2 classes, the decision boundary is $w \cdot x + \alpha = 0$ and the posterior is $P(Y = C_1 | X = x) = s(Q_{C_1}(x) - Q_{C_2}(x))$.
• The logistic function is the right Gaussian divided by the sum of the Gaussians.
• Notice that even if the Gaussians are 2D, the logistic still looks 1D.
• In the case of more than two classes, the LDA decision boundaries form a classical Voronoi diagram if the priors $\pi_C$ are equal.
• In that case, all the Gaussians have the same width.
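As a minimal sketch of both methods, scikit-learn ships LDA and QDA estimators; note that these fit full covariance matrices, a slightly more general model than the isotropic Gaussians assumed above.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(Xtr, ytr)
    print(f"{name} accuracy: {clf.score(Xte, yte):.3f}")
    # Posterior probabilities P(Y = C | X = x) for the first test point
    print(f"{name} posteriors: {clf.predict_proba(Xte[:1]).round(3)}")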


Likelihood of a Gaussian: A reminder

• We have already seen in ML 1 how to maximize the likelihood of some distributions, including the Normal.
• Given sample points $X_1, X_2, \ldots, X_n$, let's find the best-fit Gaussian.
• If we generate a random point from a normal distribution, what is the probability that it will be exactly at $X_1$? [Regardless of the answer, we're going to use the "likelihood" anyway.]
• The likelihood of generating these points is $L(\mu, \sigma; X_1, \ldots, X_n) = f(X_1)\,f(X_2) \cdots f(X_n)$, which needs to be maximized; this is equivalent to maximizing the log likelihood.
• Setting the partial derivatives with respect to the desired variables to zero, we obtain the sample mean and (for the isotropic d-dimensional Gaussian) the average squared distance from the mean per coordinate:

$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \hat{\sigma}^2 = \frac{1}{dn} \sum_{i=1}^{n} \|X_i - \hat{\mu}\|^2$
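A small numerical check of these closed-form estimates, using synthetic points whose true parameters are known.

import numpy as np

# Maximum-likelihood fit of an isotropic Gaussian to points X_1..X_n in R^d
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(loc=2.0, scale=1.5, size=(n, d))   # true mu = 2, sigma = 1.5

mu_hat = X.mean(axis=0)                            # (1/n) sum_i X_i
sigma2_hat = ((X - mu_hat) ** 2).sum() / (d * n)   # (1/(dn)) sum_i ||X_i - mu_hat||^2

print(mu_hat)               # close to [2, 2, 2]
print(np.sqrt(sigma2_hat))  # close to 1.5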

Regression: Least-Squares Linear and Logistic Regression

As we have learned before:

• Classification predicts a class (discrete) for a given point x, whereas regression predicts some numerical value (continuous) for a point x.
• QDA and LDA don't just perform classification; they also estimate the probability that a given label for a sample x is correct, which means they implicitly do regression.
• To perform regression we:
  a. Choose a form of regression function (hypothesis) $h(x; p)$ with parameter(s) p (the analogue of a decision function in classification); e.g., linear, quadratic, or logistic in x.
  b. Choose a cost function (objective function) to optimize, usually based on a loss function; e.g., risk = expected loss.
• Some regression functions:
  – (1) linear: $h(x; w, \alpha) = w \cdot x + \alpha$
  – (2) polynomial
  – (3) logistic: $h(x; w, \alpha) = s(w \cdot x + \alpha)$; recall that the logistic function is $s(\gamma) = \frac{1}{1 + e^{-\gamma}}$
Logistic expression

• The logistic function is an interesting choice.
• Recall that LDA produces a posterior probability function of exactly this form: $P(Y = C_1 | X = x) = s(Q_{C_1}(x) - Q_{C_2}(x))$.
• So the logistic function seems to be a natural form for modeling certain probabilities.
• If we want to model posterior probabilities, sometimes we use LDA.
• Alternatively, we could skip fitting Gaussians to points and instead just try to directly fit a logistic function to a set of probabilities.
Some loss functions, where z is the prediction h(x) and y is the true label:

• (A) $L(z, y) = (z - y)^2$: squared error
• (B) $L(z, y) = |z - y|$: absolute error
• (C) $L(z, y) = -y \ln z - (1 - y) \ln(1 - z)$: logistic loss, aka cross-entropy, for $y \in [0, 1]$, $z \in (0, 1)$

Some cost functions to minimize:

• (a) $J(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)$: mean loss. (Leave out the $\frac{1}{n}$ for the sum of losses.)
• (b) $J(h) = \max_{i=1}^{n} L(h(x_i), y_i)$: maximum loss.
• (c) $J(h) = \sum_{i=1}^{n} \omega_i L(h(x_i), y_i)$: weighted sum, which regards some points as more important than others.
• (d) $J(h) = (a), (b), \text{or } (c) + \lambda \|w\|^2$: $\ell_2$ penalized/regularized cost, or Ridge.
• (e) $J(h) = (a), (b), \text{or } (c) + \lambda \|w\|_1$: $\ell_1$ penalized/regularized cost, or Lasso.

By combining a regression function + loss function + cost function, we get a regression model (method) ready to fit the data. Some famous regression methods:

| Regression method | Parts | Description |
| --- | --- | --- |
| Least-squares linear regression | (1)+(A)+(a) | quadratic cost; minimize w/ calculus |
| Weighted least-squares linear regression | (1)+(A)+(c) | quadratic cost; minimize w/ calculus |
| Ridge regression | (1)+(A)+(d) | quadratic cost; minimize w/ calculus |
| Lasso regression | (1)+(A)+(e) | quadratic program |
| Logistic regression | (3)+(C)+(a) | convex cost; minimize w/ gradient descent |
| Least absolute deviations | (1)+(B)+(a) | linear program |
| Chebyshev criterion | (1)+(B)+(b) | linear program |

The optimization algorithm and its speed depend crucially on which parts the regression method is composed of.

Least-Squares Linear Regression

• Least-Squares Linear Regression is a combination of the linear regression function (1) + the squared loss function (A) + the cost function (a).
• So the cost function to be minimized is $J(h) = \frac{1}{n} \sum_{i=1}^{n} (x_i \cdot w + \alpha - y_i)^2$.
• Note that $x \cdot w + \alpha$ can be written as $x' \cdot w'$, where $x'$ is the vector x with an additional entry of value 1, and $w'$ is the vector w with an additional entry $\alpha$:

$x' \cdot w' = \begin{bmatrix} x_1 & \ldots & x_n & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \\ \alpha \end{bmatrix}$

• Now the objective function becomes $RSS(w') = \min_{w'} \|X' w' - y\|^2$, for the residual sum of squares:
  – To minimize $RSS(w') = w'^T X'^T X' w' - 2 y^T X' w' + y^T y$,
  – we set $\triangledown RSS = 2 X'^T X' w' - 2 X'^T y = 0$,
  – which implies $w' = (X'^T X')^{-1} X'^T y$.
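A minimal NumPy sketch of this derivation, with synthetic data for illustration; it solves the normal equations with np.linalg.solve rather than forming the inverse explicitly, which is the numerically preferable route.

import numpy as np

# Least squares via the normal equations, with the fictitious dimension
# (a column of ones) appended so alpha is learned as the last entry of w'
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 4 + 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.standard_normal(50)

X_prime = np.hstack([X, np.ones((50, 1))])   # X' = [X | 1]
w_prime = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)
print(w_prime)   # approximately [3, 2, 4]: w_1, w_2, alpha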

Pros:

• Easy to compute; just solve a linear system.
• Unique and stable solution.

Cons:

• Very sensitive to outliers, because errors are squared!
• Fails if $X'^T X'$ is singular, which means that the problem has multiple solutions (it is underconstrained).

Logistic Regression

• Logistic regression function (3) + logistic loss function (C) + cost function (a).
• Fits "probabilities" in the range (0, 1).
• Usually used for classification. The $y_i$'s can be probabilities, but in most applications they're all 0 or 1.
• Although all three utilize the logistic function, QDA and LDA are generative models, whereas logistic regression is a discriminative one.
• With LDA, we have seen that in classification, the posterior probabilities are often modeled well by a logistic function. The question arises: why not just fit a logistic function directly to the data, skipping the Gaussians?

Suppose that we have a data matrix X and a weight vector w including the fictitious dimension (i.e., a column of ones is X's last column and $\alpha$ is w's last component).

• Then we need to find the w that minimizes the following:

$J(w) = \sum_{i=1}^{n} L(s(x_i \cdot w), y_i) = -\sum_{i=1}^{n} \left[ y_i \ln s(x_i \cdot w) + (1 - y_i) \ln(1 - s(x_i \cdot w)) \right]$

So, let's plot the loss L(z, y) for y = 0.1, 0.4, and 0.7.


import matplotlib.pyplot as plt
import numpy as np

# L(z, y) = -y ln z - (1 - y) ln(1 - z)
def L(z, y):
    return -y * np.log(z) - (1 - y) * np.log(1 - z)

z = np.arange(0.01, 1, 0.01)
L01 = L(z, 0.1); L04 = L(z, 0.4); L07 = L(z, 0.7)
plt.plot(z, L01, label='y = 0.1')
plt.plot(z, L04, label='y = 0.4')
plt.plot(z, L07, label='y = 0.7')
plt.legend();

• As expected, each curve is minimized at z equal to its corresponding y value, and the loss functions are convex.
• Since J(w) is convex, it can be minimized by gradient descent.
• To perform gradient descent, we need to compute derivatives.
• Let $s_i = s(x_i \cdot w)$.

To update the weights w, we may resort to:

• The batch gradient descent rule: $w_{i+1} = w_i + \epsilon\, X^T (y - s(Xw))$,
• or stochastic gradient descent: $w_{i+1} = w_i + \epsilon\, (y_j - s(x_j \cdot w))\, x_j$, shuffling the points into random order and processing them one by one. It is used with very large n, and sometimes it converges before visiting all the points!
• The last technique looks a lot like the perceptron learning rule; the only difference is the $s_i$ part.
• It should be mentioned that logistic regression separates linearly separable points (a minimal sketch of the batch rule follows).
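Here is a minimal sketch of the batch rule on a toy separable problem; the step size eps, the iteration count, and the synthetic data are arbitrary illustrative choices.

import numpy as np

def s(g):
    g = np.clip(g, -30, 30)          # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-g))

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, plus a fictitious dimension for alpha
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.hstack([np.zeros(50), np.ones(50)])

w = np.zeros(3)
eps = 0.1
for _ in range(1000):
    w += eps * X.T @ (y - s(X @ w))   # batch rule: w <- w + eps X^T (y - s(Xw))

print(w)                              # on separable data ||w|| keeps growing
print(((s(X @ w) > 0.5) == y).mean()) # training accuracy: 1.0

The scikit-learn example below fits the same kind of model with a library implementation.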
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import numpy as np

data, classes = make_classification(n_classes=2, n_samples=40,
                                    n_clusters_per_class=1,
                                    n_features=2, n_redundant=0)
plt.scatter(data[:, 0], data[:, 1], c=classes)

# Fit the logistic regression on the generated data
clf = LogisticRegression(random_state=0).fit(data, classes)

# Plot the predicted probability as a surface
Blues = plt.get_cmap('Blues')
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
positions = np.vstack([X.ravel(), Y.ravel()])
crs = clf.predict_proba(positions.transpose())
plt.scatter(positions[0], positions[1], c=Blues(crs[:, 0]), alpha=1)

# Plot the points
plt.scatter(data[:, 0], data[:, 1], c=classes)

<matplotlib.collections.PathCollection at 0x1cd4f266bc0>
• A 2018 paper by Soudry et al. shows that gradient descent applied to logistic regression eventually converges to the maximum margin classifier.
• However, the convergence is extremely slow.
• In practice, logistic regression will usually find a linear separator reasonably quickly,
• but it's not a practical algorithm for maximizing the margin in a reasonable amount of time.

Shrinkage: Ridge and Lasso, and Subset Selection

• In statistics, shrinkage is the reduction in the effects of sampling variation.
• This idea is complementary to overfitting.
• Shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.

Ridge Regression (Tikhonov Regularization):

• We have seen earlier that (1) + (A) + (d) gives us the $\ell_2$ penalized/regularized cost, or Ridge:

$\underset{w}{\operatorname{argmin}}\; J(h) = \|Xw - y\|^2 + \lambda \|w'\|^2$

where $w'$ is the vector w with the last component $\alpha$ replaced by 0. Although the matrix X has a fictitious dimension, we DON'T penalize $\alpha$.

• You will notice that we add a regularization term (i.e., a penalty term) for shrinkage, to encourage a small $\|w'\|$. Why?
  – It guarantees that the normal system always has a unique solution.
  – Standard least-squares, on the other hand, yields singular normal equations (an infinite number of solutions) when the sample points lie on a common hyperplane in feature space, e.g., when $d > n$.
• The left plot of the figure referenced above presents the quadratic form of a positive semidefinite cost function associated with least-squares regression; it has an infinite number of minima.
• In such cases, the regression problem is said to be ill-posed.
• To obtain a positive definite quadratic form (right image), which has a unique minimum, we add a small penalty term.
• The term "regularization" implies that we are turning an ill-posed problem into a well-posed problem.

How is this important in machine learning?

• To reduce overfitting we need to reduce the variance.
• Assume that for given data X, we found that $500x_1 + 0.5x_2$ is the best fit for well-separated points with $y_i \in \{0, 1\}$.
• Because of the large coefficients/weights, small changes in $x_1$ cause big changes in y, which is a sure sign of overfitting.
• A large variance (overfitting) implies that the problem is likely ill-posed, even though technically it might be well-posed.
• The solution to such a problem is to penalize large weights and thereby reduce the variance.
# Building and fitting the Linear Regression model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=12, include_bias=False)
poly2 = PolynomialFeatures(degree=2, include_bias=False)

poly_features = poly.fit_transform(X.reshape(-1, 1))
poly_features2 = poly2.fit_transform(X.reshape(-1, 1))

linearModel = LinearRegression()
linearModel.fit(poly_features, y)

# Plot data and the degree-12 regression result
plt.figure(figsize=[16, 8])
plt.subplot(1, 2, 1)
plt.scatter(X, y)
x = np.arange(X.min(), X.max(), 0.1)
y_x = [np.power(a, np.arange(1, len(linearModel.coef_) + 1)).dot(linearModel.coef_)
       + linearModel.intercept_ for a in x]
plt.plot(x, y_x, c='r')
var = int((np.diff(y_x) ** 2).sum())
bias = np.abs(y - [np.power(a, np.arange(1, len(linearModel.coef_) + 1)).dot(linearModel.coef_)
                   + linearModel.intercept_ for a in X]).sum()
plt.title("High variance:" + str(var) + ', and low bias:' + str(int(bias)))

# Fit and plot the degree-2 model for comparison
linearModel = LinearRegression()
linearModel.fit(poly_features2, y)

plt.subplot(1, 2, 2)
plt.scatter(X, y)
x = np.arange(X.min(), X.max(), 0.1)
y_x = [np.power(a, np.arange(1, len(linearModel.coef_) + 1)).dot(linearModel.coef_)
       + linearModel.intercept_ for a in x]
plt.plot(x, y_x, c='r')
var = int((np.diff(y_x) ** 2).sum())
bias = np.abs(y - [np.power(a, np.arange(1, len(linearModel.coef_) + 1)).dot(linearModel.coef_)
                   + linearModel.intercept_ for a in X]).sum()
plt.title("Low variance:" + str(var) + ', and High bias:' + str(int(bias)))

• In the following description of weight space $\{\beta\}$, $\hat{\beta}$ (read as $\hat{w}$) is the least-squares solution.
• The red ellipses are the isocontours of $\|Xw - y\|^2$.
• The isocontours of $\|w'\|^2$ are circles centered at the origin (blue).
• The solution to the normal system lies where a red isocontour just touches a blue isocontour.
• As $\lambda$ increases, the solution occurs at a more outer red isocontour and a more inner blue isocontour.
• This process helps to reduce overfitting.

Variance and bias tradeoff

• To minimize the cost function, setting $\triangledown J = 0$ gives the normal equations: $(X^T X + \lambda I')\,w = X^T y$.
• $I'$ here refers to the identity matrix with its bottom-right entry set to zero. We do this to avoid penalizing the bias term $\alpha$.
• Algorithm:
  – Solve for w.
  – Increase $\lambda$ for more regularization and a smaller $\|w'\|$.
  – Tune the variance/bias of ridge regression: with the data modeled as $y = Xv + e$, where e is the noise, the variance of the ridge estimate comes from the term $(X^T X + \lambda I')^{-1} X^T e$.
  – As $\lambda \to \infty$, the variance $\to 0$ and the bias increases.
• The error function Err(x) is the sum of $\mathrm{Bias}^2$, the variance, and the irreducible error $\sigma^2$:

$\mathrm{Err}(x) = \mathrm{Bias}^2 + \mathrm{Var}(\beta_{ridge}) + \sigma^2$

• For the bias-variance trade-off, the test error as a function of $\lambda$ is a U-shaped curve. We find the bottom by cross-validation.
• Ideally, features should be "normalized" to have the same variance.
• To use an asymmetric penalty, the identity matrix $I'$ must be replaced with another diagonal matrix.

For Lasso Regression, the cost function uses an $\ell_1$ penalty ("least absolute shrinkage and selection operator"):

$\underset{w}{\operatorname{argmin}}\; J(h) = \|Xw - y\|^2 + \lambda \|w'\|_1$
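A quick sketch of the contrast between the two penalties, with illustrative data and alpha values: the $\ell_1$ penalty drives the irrelevant coefficients exactly to zero (variable selection), while the $\ell_2$ penalty only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.standard_normal(100)  # only 2 features matter

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_.round(3))  # all 6 shrunk, none zero
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(3))  # 4 irrelevant ones at 0.0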

Learning paradigms: supervised, unsupervised, and semi-supervised
Brief reminder
• ML can be applied to almost any problem.
• it has been used to detect cancer, predict traffic patterns, match people up, recognize
faces (and facial expressions), caption images, and much, much more.
• Depending on the problem at hand, different machine learning techniques are used to
yield an effective solution.
• The four main paradigms in machine learning are supervised, unsupervised, semi-supervised, and reinforcement learning.

Supervised Machine Learning

• The majority of practical machine learning uses supervised learning.
• It is when an algorithm learns the mapping function $f(X) = y$ from the input X to the output y, where X and y are known beforehand.
• It is called supervised because learning from the training dataset can be thought of as a teacher supervising the learning process.
• Learning stops when the algorithm achieves an acceptable level of performance (a low error rate).
Unsupervised Machine Learning

• Used when we only have input data X and no corresponding output variables.
• The goal is to model the underlying structure (i.e., distribution) of the data in order to learn more about the data itself.
• It is called unsupervised learning because there are no correct answers and therefore no teacher.

Semi-Supervised Machine Learning

• Used when a large amount of input data X is available but only a few labeled outputs y.
• Many real-world machine learning problems fall into this area.
• This is because it is expensive and time-consuming to label large amounts of data, as it may require domain experts.
• On the other hand, unlabeled data is cheap and easy to collect/store.
• Both supervised and unsupervised techniques can be utilized (a minimal sketch follows this list):
   a. unsupervised techniques discover and learn the structure of the data.
   b. supervised techniques make best-guess predictions for the unlabeled data, then feed that data back into the supervised learning algorithm as training data.
   c. use the final model to make predictions on new, unseen data.
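A minimal sketch of steps (a)–(c), assuming made-up two-cluster data and only two labeled points; the dataset, models, and cluster count are illustrative assumptions, not a prescribed recipe:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlab = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])  # cheap unlabeled pool
X_lab = np.array([[0.0, 0.0], [4.0, 4.0]])  # only two expensive labeled points
y_lab = np.array([0, 1])

# (a) unsupervised: discover the structure of the data
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_unlab)
print("cluster sizes:", np.bincount(clusters))

# (b) supervised: best-guess pseudo-labels for the unlabeled data, fed back as training data
pseudo = LogisticRegression().fit(X_lab, y_lab).predict(X_unlab)
clf_final = LogisticRegression().fit(np.vstack([X_lab, X_unlab]),
                                     np.concatenate([y_lab, pseudo]))

# (c) use the final model on new, unseen data
print(clf_final.predict([[1.0, 1.0], [3.5, 3.5]]))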

Reinforcement Learning

• The training of machine learning models to make a sequence of decisions.

• The agent learns to achieve a goal in an uncertain, potentially complex environment.
• It employs trial and error (rewards or penalties) to come up with a solution to the problem.
• The goal is to maximize the total reward.
• Although the reward policy is set beforehand, the model is given no hints or suggestions for how to solve the problem.
• It's up to the model to figure out how to perform the task to maximize the reward, starting from totally random trials and finishing with sophisticated tactics and superhuman skills.
##Unsupervised Learning and Principal Components Analysis

• In unsupervised learning, we have sample points but no labels/classes and nothing to predict. The goal is to discover the structure within the data.
• Some examples can be found in:
   – Clustering: partition data into groups of similar/nearby points.
   – Dimensionality reduction: data often lies near a low-dimensional subspace (or manifold) in feature space; a matrix that models the data has a low-rank approximation.
   – Density estimation: the task of fitting a continuous distribution to discrete data; e.g., fitting Gaussians to sample points is density estimation.

Principal Components Analysis

The main goal is to find k directions that capture most of the variation of the sample points $X \in \mathbb{R}^d$, where $k \ll d$.

Why?

• Reducing dimensions makes computations cheaper, e.g., regression.

• Sometimes used to reduce overfitting in learning algorithms by removing irrelevant dimensions.
• Finding a small basis for representing variations in complex things, e.g., faces, genes.

Let X be the n × d design matrix shown in the table below (5 × 4). What do you notice?

|        | Math | Physics | English | French |
|--------|------|---------|---------|--------|
| Khaled | 12   | 13      | 11      | 14     |
| Manar  | 8    | 8.5     | 10      | 13     |
| Fateh  | 12   | 13      | 9       | 14     |
| Saif   | 16   | 16      | 8.5     | 13     |
| Ines   | 8    | 8.5     | 10      | 12     |

• The above table (matrix) contains 5 rows (samples) and 4 columns (variables).

• The fourth sample, as an instance, is $x_4 = [16, 16, 8.5, 13]$.

• The second variable is $y_2 = [13, 8.5, 13, 16, 8.5]$.

• The mean of the design matrix is $\mu_x = [11.2, 11.8, 9.7, 13.2]$ and it represents the center of each variable:

$$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

import numpy as np
X = np.array([[12, 13, 11, 14], [8, 8.5, 10, 13], [12, 13, 9, 14], [16, 16, 8.5, 13], [8, 8.5, 10, 12]])
X.mean(axis=0)

array([11.2, 11.8, 9.7, 13.2])

• The centered data is the matrix $\tilde{X}$ where $\tilde{X}_{ij} = x_{ij} - \mu_j$ for $i = 1..n$, $j = 1..d$.
• Calculate the sum of each column of $\tilde{X}$. What do you conclude?
X_hat = X - X.mean(axis=0)
print(X_hat)

[[ 0.8 1.2 1.3 0.8]


[-3.2 -3.3 0.3 -0.2]
[ 0.8 1.2 -0.7 0.8]
[ 4.8 4.2 -1.2 -0.2]
[-3.2 -3.3 0.3 -1.2]]

• All forthcoming calculations are done using the centered matrix $\tilde{X}$; X is no longer needed.

• Let w be a unit vector. The orthogonal projection of the point x onto the vector w is

$$\tilde{x} = \left(x \cdot \frac{w}{\|w\|}\right)\frac{w}{\|w\|} = (x \cdot w)\,w \quad \text{(if } w \text{ is unit)}$$

• Unit vectors are of use when length is not relevant.

• The idea of PCA is that we pick the best direction w, then project all the data onto w so we can analyze it in a one-dimensional space.

• Indeed, if we project from d dimensions to just one, we lose a lot of information.

• Therefore, we pick several directions instead of one.

• Those directions span a subspace, and we want to project points orthogonally onto the subspace.

• This is an easy task if the directions are orthogonal (orthonormal) to each other with length 1.

• Given orthonormal directions $v_1, \ldots, v_k$: $\tilde{x} = \sum_{i=1}^{k} (x \cdot v_i)\,v_i$

• Usually, we only want the k principal coordinates (k < d); we don't want to project points back to $\mathbb{R}^d$.
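A minimal numeric sketch of this projection formula, using two hand-picked orthonormal directions in $\mathbb{R}^3$ (the vectors are illustrative assumptions):

import numpy as np

x = np.array([3.0, 1.0, 2.0])
V = np.array([[1.0, 0.0, 0.0],   # v_1
              [0.0, 1.0, 0.0]])  # v_2: rows are orthonormal directions

coords = V @ x        # the k principal coordinates (x . v_i)
x_tilde = coords @ V  # x_tilde = sum_i (x . v_i) v_i, back in R^d
print(coords)         # [3. 1.]
print(x_tilde)        # [3. 1. 0.]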
How it works
• $X^T X$ is a square, symmetric, positive semi-definite, d × d matrix.
• As it is symmetric, its eigenvalues are real and its eigenvectors are orthogonal to each other (proofs here).
• Let $0 \le \lambda_1 \le \lambda_2 \le \ldots \le \lambda_d$ be its eigenvalues, sorted.
• Let $v_1, v_2, \ldots, v_d$ be the corresponding orthogonal unit eigenvectors (principal components).
• Then, the most important principal components will be the ones with the greatest eigenvalues.

##PCA derivation (1)

The first version of PCA performs the following:

1. Fit a Gaussian to data with maximum likelihood estimation.


2. Choose k Gaussian axes of greatest variance.

• Using the MLE, we are assuming that the data are independently sampled from a multivariate normal distribution with mean vector μ and variance-covariance matrix

$$\hat{\Sigma} = \frac{1}{n} X^T X$$

• The PCA algorithm is as follows:
   a. Center X.
   b. Normalize X [optional]: only if the units of measurement of the features differ.
   c. Compute the eigenvectors/eigenvalues of $\hat{\Sigma}$:
      – using the equation $Av = \lambda v \Rightarrow (A - \lambda I)v = 0$,
      – as shown by Cramer's rule, the nontrivial solutions are given by $\det(A - \lambda I) = 0$.
   d. Choose k based on the variability the eigenvalues grant: $\%\text{ of variability} = \frac{\sum_{i=d-k+1}^{d} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$.
   e. Pick the eigenvectors $v_{d-k+1}, \ldots, v_d$.
   f. Compute the k principal coordinates $x \cdot v_i$ of each training/test point.
   g. We can reverse to the original space by multiplying the principal coordinates by $v_i^T$.

Example: continue with the previous centered data

# 3. Compute X'X and its eigenvectors/eigenvalues

S = np.dot(X_hat.transpose(), X_hat)
e_values, e_vectors = np.linalg.eig(S)
index = e_values.argsort()[::-1]  # sort in decreasing order
e_values = e_values[index]
e_vectors = e_vectors[:, index]

# 4. Choose k based on the variability the eigenvalues grant
variability = np.cumsum(e_values) * 100 / np.sum(e_values)
print("variability (%): ", variability)
# 5. pick the first two eigenvectors (var = 98.5%)
v_best = e_vectors[:, 0:2]
# 6. Compute the k principal coordinates x·vi of each training/test point
principal_coordinate = np.dot(X_hat, v_best)

import matplotlib.pyplot as plt


plt.figure(figsize=[8,4])
plt.subplot(1, 2, 1)
plt.scatter(principal_coordinate[:,0], principal_coordinate[:,1])
plt.subplot(1, 2, 2)
plt.plot(e_values.cumsum())

variability (%): [ 94.43097378 98.50348738 99.93198946 100. ]


# back to original data

original_recovered = np.matrix(np.dot(principal_coordinate, v_best.transpose())) + X.mean(axis=0)

print('original data: \n', X)

print('\n recovered data:\n', np.around(original_recovered, 1))
print('\n reconstruction error (1/SSE):\n', np.round(np.power(X - original_recovered, 2).sum()**-1, 2))

original data:
[[12. 13. 11. 14. ]
[ 8. 8.5 10. 13. ]
[12. 13. 9. 14. ]
[16. 16. 8.5 13. ]
[ 8. 8.5 10. 12. ]]

recovered data:
[[11.9 13. 10.7 14.3]
[ 7.9 8.6 10.2 12.8]
[12.3 12.9 9.6 13.4]
[15.9 16.1 8.3 13.2]
[ 8. 8.4 9.7 12.4]]

reconstruction error (1/SSE):
0.71

##PCA derivation (2)

• PCA can be performed by finding a direction w that maximizes the sample variance of the projected data.
• In other words, when the data is projected down, it must stay as spread out as possible.
• So, the question is: how to choose the orientation of the support that grants the aforementioned conditions?

• To solve the problem above, we resort to the Rayleigh quotient $r(x) = \frac{x^T A x}{x^T x}$ [details here].

• We therefore need to solve the following:

$$\underset{w}{\operatorname{argmax}}\ \mathrm{Var}(\{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n\}) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i \cdot \frac{w}{\|w\|}\right)^2 = \frac{\|Xw\|^2}{n\,\|w\|^2} = \frac{w^T X^T X w}{n\, w^T w}$$

• If w is an eigenvector $v_i$ of $X^T X$, the Rayleigh quotient equals $\lambda_i$.

• Of all eigenvectors, the above objective function yields $v_d$, which achieves the maximum variance $\lambda_d / n$.

   – The question here is how to obtain a second eigenvector.

• The solution is to pick a second direction that is orthogonal to the best direction $v_d$, and, subject to that constraint, maximizes the sample variance.
• The same goes for the third direction, and so on.

PCA derivation (3)


• The main aim is to find a direction w that minimizes the mean squared projection distance.

• It can be seen as a sort of least-squares linear regression, with one subtle but important change.

• Instead of measuring the error in a fixed vertical direction, it is measured in a direction orthogonal to the principal component direction we choose.

• In both methods, however, the goal is to minimize the sum of the squares of the projection distances.

$$\underset{w}{\operatorname{argmin}} \sum_{i=1}^{n} \|X_i - \tilde{X}_i\|^2 = \sum_{i=1}^{n} \left\|X_i - \frac{X_i w}{\|w\|^2}\, w\right\|^2 = \sum_{i=1}^{n} \left[\|X_i\|^2 - \left(\frac{X_i w}{\|w\|}\right)^2\right]$$

which is: constant − n × (variance from PCA derivation 2).

• Minimizing the mean squared projection distance means maximizing the variance.

• From this point, the same reasoning as in PCA derivation 2 is employed.


Example:

PCA can be used for various tasks including noise removal, feature extraction, and data compression. In the following code, PCA is used for image compression. The original image is compressed at different ratios and plotted as follows:

# !wget "https://media.istockphoto.com/id/1141529240/vector/simple-apple-in-flat-style-vector-illustration.jpg?s=612x612&w=0&k=20&c=BTUl_6mGduAMWaGT9Tcr4X6n2IfK4M3HH-KCsr-Hrgs=" -O "image.jpg"
from PIL import Image
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

plt.figure(figsize=[20, 12])
img = Image.open('image.jpg')
array_origin = numpy.array(img)
# stack the 3 color channels vertically so PCA sees a single 2-D matrix
stacked_arrays = array_origin.reshape(array_origin.shape[0] * 3, array_origin.shape[1], -1).squeeze()

plt.subplot(2, 3, 1)
plt.imshow(img)
origin_size = stacked_arrays.shape[0] * stacked_arrays.shape[1]
plt.title("Original image, size =" + str(origin_size) + " bytes")

# compress using PCA with a decreasing number of components

i = 2
for n_comp in [100, 50, 20, 10, 5]:
    pca = PCA(n_components=n_comp)
    array_compressed = pca.fit_transform(stacked_arrays)
    reversed_array = pca.inverse_transform(array_compressed)
    revers_stacked = reversed_array.reshape(array_origin.shape[0], array_origin.shape[1], -1)
    revers_stacked = numpy.abs(revers_stacked).astype(int)
    revers_stacked[revers_stacked > 255] = 255
    plt.subplot(2, 3, i)
    new_size = array_compressed.shape[0] * array_compressed.shape[1] + pca.components_.shape[0] * pca.components_.shape[1]
    plt.title("compressed image, size =" + str(new_size) + " bytes (ratio: " + str(int((1 - new_size/origin_size) * 100)) + "%)")
    plt.imshow(revers_stacked)
    i = i + 1
plt.show()
Density Estimation and Anomaly
• Anomaly detection (i.e., outlier detection or novelty detection) is the identification of rare items, events, or observations which deviate significantly from the majority of the data.

• Let a univariate Gaussian distribution be given, and suppose T is a set of generated observations. A point $\theta \in T$ is an outlier if and only if its z-score $z(\theta) = \frac{\theta - \mu}{\sigma}$ is greater than a pre-selected threshold (a minimal sketch follows below).

• Fraud detection in finance, rare event detection in network traffic, visual image inspection for buildings and road monitoring, and defect detection in production lines are some common problems.

• For a comprehensive survey of anomaly detection techniques, check this paper out: paper
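A minimal z-score sketch of the outlier rule above, on made-up Gaussian observations with one injected anomaly (the threshold of 3 is a common but arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(loc=10, scale=2, size=1000)  # generated observations
T = np.append(T, 25.0)                      # inject an anomalous point

mu, sigma = T.mean(), T.std()
z = np.abs(T - mu) / sigma                  # |z-score| of every point
threshold = 3.0                             # pre-selected threshold
print("outliers:", T[z > threshold])        # the injected point (and any natural 3-sigma points) are flagged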
##Kernel

• In statistics, the kernel of a pdf or pmf is the form of the pdf or pmf in which any factors that are not functions of any of the variables in the domain are omitted.
   – Let $K_{h_\lambda}(X_0, X)$ be a kernel; it can be written as:

$$K_{h_\lambda}(X_0, X) = D\left(\frac{\|X - X_0\|}{h_\lambda(X_0)}\right)$$

   – $X, X_0 \in \mathbb{R}^p$

   – $\|.\|$ is the Euclidean norm

   – $h_\lambda(X_0)$ is a parameter (kernel radius)

   – $D(t)$ takes a positive real value that is inversely proportional to the distance between $X$ and $X_0$.

• For many distributions, the kernel can be written in closed form. An example is the normal distribution, which has the following probability density function:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• The associated kernel is:

$$p(x \mid \mu, \sigma^2) \propto e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• Kernels are used in kernel density estimation to estimate random variables' density
functions (i.e., smoothing), or in kernel regression to estimate the conditional
expectation of a random variable.

##Nonparametric statistics

• It is the branch of statistics that is not based solely on parametrized families of probability distributions (e.g., mean and variance).
• Choosing non-parametric methods for estimating a density function stems from a lack of prior information about the PDF that corresponds to the data.
• If we take, for instance, maximum likelihood estimation (MLE) and Bayesian parameter estimation (BPE), we need to estimate the value of a parameter $\hat{\theta}$ that maximizes the likelihood function,

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \prod_{k=1}^{n} p(X_k \mid \theta)$$

   – If we apply Bayes' rule to $p(X \mid \theta)$ to obtain the conditional distribution of θ given the data X:

$$P(X \mid \theta) = \frac{P(\theta \mid X) \cdot P(X)}{P(\theta)}$$

   – we can clearly see that both of these methods depend strongly on knowledge of the conditional distributions of the data.
• It is based on either being distribution-free or having a specified distribution with unspecified parameters.
• Nonparametric tests are often used when the assumptions of parametric tests are violated.
• A kernel is a weighting function used in non-parametric estimation techniques.
##Kernel Density Estimation (KDE) [ref]

• KDE is an unsupervised learning technique that helps to estimate the PDF of a random variable.

• KDE is a non-parametric statistical method.

• It is related to a histogram, but with a data smoothing technique.

• Different kernels can be used to smooth the distribution. In the example below, Tophat and Gaussian kernels are used.

Mathematically,

• Let $X = (x_1, x_2, \ldots, x_n)$ be independent and identically distributed samples drawn from some univariate distribution.

• The density ƒ at any given point x is unknown.

• The main goal is estimating the shape of ƒ.

• Its kernel density estimator is given by:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where
   – K is a kernel, used to calculate the scores.
   – h is a bandwidth parameter that is responsible for smoothness (choosing a higher value for h yields a smoother distribution). x is a given estimation point and $x_i$ is a point from the sample dataset.
• As mentioned above, K is a kernel for which we have multiple choices, e.g., Gaussian, Tophat, Epanechnikov, etc. A direct implementation of this estimator is sketched below.
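This minimal sketch implements the estimator above with a Gaussian kernel on made-up bimodal samples; the bandwidth and data are illustrative assumptions:

import numpy as np

def kde(x, samples, h):
    # f_hat_h(x) = 1/(n*h) * sum_i K((x - x_i)/h), with K the standard normal pdf
    u = (x - samples[:, None]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=0) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])
grid = np.linspace(-6, 6, 200)
density = kde(grid, samples, h=0.5)
print(density.sum() * (grid[1] - grid[0]))  # approximately 1: a valid density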
Example

• In this example, a dataset containing 200 univariate samples is generated.
• The histogram of the same data with a different number of bins has a disproportionate effect on the resulting visualization.
• One can expect major confusion of samples to occur, especially at the bin boundaries.
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
X, y = make_blobs(n_samples=200, centers=2, n_features=1,
random_state=16, cluster_std = [2, 0.8])

plt.figure(figsize=[8,4])
plt.subplot(1,2,1);plt.hist(X, bins=30);
plt.subplot(1,2,2);plt.hist(X, bins=3);
# a significant difference based on the number of bins

• The kernel effect, via KDE, can be used to smooth the resulting distribution instead of using histograms.
from sklearn.neighbors import KernelDensity
import numpy as np
from sklearn.model_selection import GridSearchCV

# use grid search cross-validation to optimize the bandwidth
# (param grid values must be lists/iterables of candidates)

params = {"bandwidth": np.logspace(-1, 1, 20), 'kernel': ['gaussian']}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(X, y)

print("best bandwidth: {0}".format(grid.best_estimator_.bandwidth))


# use the best estimator to compute the kernel density estimate
kde = grid.best_estimator_

plt.figure(figsize=[12,4])
plt.subplot(1,2,1);plt.hist(X, bins=30); plt.title('data histogram')
plt.subplot(1,2,2);
plt.plot(np.linspace(X.min(),X.max(), 50),
kde.score_samples(np.linspace(X.min(),X.max(), 50).reshape(-1, 1)));
plt.title('Gaussians scores');

#if we estimate the score of an outlier x = -15, whereas the min of


data is -10
print(kde.score_samples([[-15]]))

best bandwidth: 0.5455594781168519


[-39.51611853]

Choosing the bandwidth


• We would like to find a value of the smoothing parameter that minimizes the error between the estimated density and the true density.

• A natural measure is the mean square error at the estimation point x, defined by:

$$\mathrm{MSE}_x(P_{KDE}) = E\left[(P_{KDE}(x) - P(x))^2\right]$$

• This expression is an example of the bias-variance dilemma of statistics: the bias can be reduced at the expense of the variance, and vice versa.

• The bias-variance dilemma applied to bandwidth selection simply means that:

   – A large bandwidth will reduce the differences among the estimates of $P_{KDE}(x)$ for different data sets (the variance), but it will increase the bias of $P_{KDE}(x)$ with respect to the true density $P(x)$.
   – A small bandwidth will reduce the bias of $P_{KDE}(x)$, at the expense of a larger variance in the estimates of $P_{KDE}(x)$.

• The natural way of choosing the smoothing parameter is to plot out several curves and choose the estimate that is most in accordance with one's prior (subjective) ideas.

• However, this method is not practical in pattern recognition since we typically have high-dimensional data.

• The solution is to assume a standard density function and find the value of the bandwidth that minimizes the mean integrated square error (MISE):

$$\hat{h} = \underset{h}{\operatorname{argmin}}\ \mathrm{MISE}(P_{KDE})$$
Gaussian kernel smoother
• The Gaussian kernel is one of the most widely used kernels for density estimation and anomaly detection.
• It is expressed with this formula:

$$K_h(x, x_i) = \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$

# Generate data (generate_dataset is defined earlier in the notebook)
B = [0.1, 0.2, 0.3, -0.4]  # [beta0, beta1, beta2, beta3]
X, y = generate_dataset(B, 50)

# apply the kernel smoother on the distribution

sort_index = np.argsort(y)

h = [1, 6, 20]

plt.figure(figsize=[16, 4])

plt.subplot(2, 2, 1); plt.hist(y, bins=15); plt.title("Histogram")
i = 2
for h_0 in h:
    smoothed = []
    for y_0 in np.linspace(y.min(), y.max(), 50):
        smoothed.append((1/len(y)) * np.sum(np.exp(-(y - y_0)**2 / (2*h_0**2))))
    smoothed = np.array(smoothed)
    plt.subplot(2, 2, i); plt.plot(np.linspace(y.min(), y.max(), 50), smoothed)
    i = i + 1

Nearest neighbor smoother


• For each point $X_0$, take the m nearest neighbors.

• Estimate the value of $Y(X_0)$ by averaging the values of these neighbors.

• Formally,

$$h_m(X_0) = \|X_0 - X_{[m]}\|$$

where $X_{[m]}$ is the m-th closest neighbor to $X_0$. A minimal sketch follows.
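The sketch below applies this m-nearest-neighbor smoother to made-up noisy 1-D data; m and the data-generating function are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, 200))
y = np.sin(X) + rng.normal(scale=0.3, size=200)  # noisy observations of sin(x)

def nn_smooth(x0, X, y, m=15):
    idx = np.argsort(np.abs(X - x0))[:m]  # indices of the m closest neighbors of x0
    return y[idx].mean()                  # estimate Y(x0) by averaging them

grid = np.linspace(0, 10, 100)
y_hat = np.array([nn_smooth(x0, X, y) for x0 in grid])  # the smoothed curve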

Let's do a real anomaly detection example.

# load data
import os
from matplotlib.image import imread
from PIL import Image

img_dir = "/content/dataset"
all_files = os.listdir(img_dir)
data_path = [os.path.join(img_dir + "/" + i) for i in all_files]
k = 0
data = []
plt.figure(figsize=[12, 4])
for i in data_path:
    k = k + 1
    plt.subplot(1, 6, k)
    data.append(imread(i))
    plt.imshow(data[k-1])
plt.show()

# Extract features
from sklearn import decomposition, datasets
from skimage.color import rgb2hsv
from sklearn.preprocessing import StandardScaler

data_hsv = [rgb2hsv(rgb_img)[:,:,0] for rgb_img in data]


features = [np.histogram(X, bins=5)[0][1:] for X in data_hsv]
# features
std_slc = StandardScaler()
X_std = std_slc.fit_transform(features)

pca = decomposition.PCA(n_components=2)
X_std_pca = pca.fit_transform(X_std)

Apply KDE and show the outlier


params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(X_std_pca)
kde = grid.best_estimator_

scores = kde.score_samples(X_std_pca)
outlier_index = np.argmin(scores)

plt.imshow(data[outlier_index]);
plt.title('The outlier image is:');

Graphical models
ref1 ref2

• Probabilistic graphical models are graphs in which nodes represent random variables, and the arcs represent conditional independence assumptions.
• They provide an abstract and compact representation of joint probability distributions.
• A graphical model can be either undirected or directed:
   – Undirected graphical models (called Markov random fields or Markov networks) have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given a third node (set) C if all paths between the nodes in A and B are separated by a node (set) in C.

$$A \perp B \mid C \iff P(A, B \mid C) = P(A \mid C)\, P(B \mid C)$$

   – In directed graphical models (called Bayesian networks or belief networks), independence takes into account the directionality of the arcs (more complicated).
• Nodes may hold categorical values (e.g., multinomial distributions) or continuous values (e.g., Gaussian distribution).
• For a discrete node with continuous parents, a logistic/softmax distribution can be used.
• Using multinomials, Gaussians, and the softmax distribution, we have a rich toolbox for making complex models.

Directed Acyclic Graphical Models (DAG)

• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:

$$p(A, B, C, D, E) = p(A)\, p(B)\, p(C \mid A, B)\, p(D \mid B, C)\, p(E \mid C, D)$$

• In general:

$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_{pa(i)})$$

where $pa(i)$ are the parents of node $X_i$.

• Therefore, the conditional probability distribution (CPD) for each node must be specified in advance.

Example: Consider this example, in which all nodes are binary (True (T) or False (F)).

• By definition, the joint probability of all the nodes in the graph above is

$$P(C_0, C_1, C_2, C_3) = P(C_0) \cdot P(C_1 \mid C_0) \cdot P(C_2 \mid C_0) \cdot P(C_3 \mid C_1, C_2)$$

e.g., $P(T, F, F, T) = 0.5 \times 0.9 \times 0.2 \times 0.2 = 0.018$,

whereas $P(T, F, F, F) = 0.5 \times 0.9 \times 0.8 \times 1 = 0.36$. A minimal code sketch of this factorization follows.
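This sketch evaluates the factorization in code; the CPT numbers below are hypothetical stand-ins (the figure's probability tables are not reproduced here), chosen so the first worked value above is reproduced:

# hypothetical CPTs: each entry is P(node = T | parents)
p_c0 = 0.5
p_c1 = {True: 0.1, False: 0.5}   # P(C1=T | C0)
p_c2 = {True: 0.8, False: 0.2}   # P(C2=T | C0)
p_c3 = {(True, True): 0.99, (True, False): 0.9,
        (False, True): 0.9, (False, False): 0.2}  # P(C3=T | C1, C2)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(c0, c1, c2, c3):
    # P(C0,C1,C2,C3) = P(C0) P(C1|C0) P(C2|C0) P(C3|C1,C2)
    return (bernoulli(p_c0, c0) * bernoulli(p_c1[c0], c1)
            * bernoulli(p_c2[c0], c2) * bernoulli(p_c3[(c1, c2)], c3))

print(joint(True, False, False, True))  # 0.018, matching P(T, F, F, T) above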

##Inference

• The most common task we wish to solve using Bayesian networks is probabilistic inference.
• It consists of evaluating the probability distribution over some set of variables, given the values of another set of variables.
• For example, how can we compute $p(A \mid C = c)$? Assume each variable is binary; a naive method of calculation is:
   – $p(A, C = c) = \sum_{B, D, E} p(A, B, C = c, D, E)$ .......... [16 terms]
   – $p(C = c) = \sum_{A} p(A, C = c)$ ........ [2 terms]
   – $p(A \mid C = c) = \frac{p(A, C = c)}{p(C = c)}$ ...........[2 terms]
   – Total: 16 + 2 + 2 = 20 terms

Example:

• Consider the water sprinkler network, and suppose we observe the fact that the grass is wet.
• Either it is raining, or the sprinkler is on.

$$P(C_1 = T \mid C_3 = T) = \frac{p(C_1 = T, C_3 = T)}{p(C_3 = T)} = \frac{\sum_{C_0, C_2} p(C_0, C_1 = T, C_2, C_3 = T)}{p(C_3 = T)} = \frac{0.5 \times 0.1 \times 0.9 + 0.5 \times 0.5 \times 0.9 + \cdots}{0.6945} = 0.4298$$

   – $P(C_2 = T \mid C_3 = T) = 0.7079$
• It is more likely that the grass is wet because it is raining: the likelihood ratio is $0.7079 / 0.4298 = 1.647$.
More efficient method:

$$p(A, C = c) = \sum_{B, D, E} p(A)\, p(B)\, p(C = c \mid A, B)\, p(D \mid B, C = c)\, p(E \mid C = c, D)$$

$$= \sum_{B} p(A)\, p(B)\, p(C = c \mid A, B) \sum_{D} p(D \mid B, C = c) \sum_{E} p(E \mid C = c, D)$$

$$= \sum_{B} p(A)\, p(B)\, p(C = c \mid A, B) \qquad \text{.........................[4 terms]}$$

• Total: 4 + 2 + 2 = 8 terms


• In the above example, notice that the two causes "compete" to "explain" the observed data 'wet grass'.
• Hence Sprinkler and Rain become conditionally dependent given their common child, WetGrass (even though they are marginally independent).
• For example:
   – Suppose the grass is wet, but we also know that it is raining.
   – Then the posterior probability that the sprinkler is on goes down:
   – $Pr(C_1 = 1 \mid C_3 = 1, C_2 = 1) = 0.1945$
• This phenomenon is called "explaining away" (either event alone is sufficient to explain the evidence on $C_3$).

Top-down vs bottom-up reasoning:

• Bottom-up reasoning (i.e., diagnostic) is when moving from effects to causes (e.g., what is the cause of the wet grass (effect)?).
• Top-down reasoning (i.e., causal) is when moving from causes to effects (e.g., the probability that the grass will be wet given that it is cloudy).
• Bayes nets can be used for both types of reasoning.
Factor graph propagation
• Algorithmically and implementationally, it is often easier to convert directed and undirected graphs into factor graphs, and run factor graph propagation.

$$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(x_4 \mid x_2) \equiv f_1(x_1, x_2)\, f_2(x_2, x_3)\, f_3(x_2, x_4)$$

• The joint probability distribution is written as a product of factors.

   – Consider a vector of variables $x = (x_1, \ldots, x_n)$:

$$p(x) = p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_j f_j(x_{S_j})$$

• where Z is the normalisation constant,

• $S_j$ denotes the subset of $1, \ldots, k$ which participate in the factor $f_j$,
• $x_{S_j} = \{x_i : i \in S_j\}$,
• open circles are for variable nodes $x_i$ and filled dots for factor nodes $f_j$.

So, how does propagation proceed in factor graphs?

• Let $n(x)$ denote the set of factor nodes that are neighbors of $x$.

• Let $n(f)$ denote the set of variable nodes that are neighbors of $f$.

• Then, probabilities are computed by propagating messages from variable nodes to factor nodes and vice versa.

   – Message from variable x to factor f:

$$\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x)$$

   – Message from factor f to variable x:

$$\mu_{f \to x}(x) = \sum_{\mathbf{x} \setminus x} \left( f(\mathbf{x}) \prod_{y \in n(f) \setminus \{x\}} \mu_{y \to f}(y) \right)$$

   – where $\mathbf{x}$ are the variables that the factor f depends on, and $\mathbf{x} \setminus x$ is all variables neighboring factor f except x.

• If a variable has only one factor as a neighbor, it can initiate message propagation.

• Once a variable has received all messages from its neighboring factor nodes, it computes its probability by multiplying all the messages and renormalising:

$$p(x) \propto \prod_{h \in n(x)} \mu_{h \to x}(x)$$
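A minimal sum-product sketch on the chain $p(x) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)$, with made-up binary tables; it computes the marginal of $x_2$ by multiplying its two incoming factor messages:

import numpy as np

# made-up conditional probability tables (all variables binary)
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.3],   # rows: x1, cols: x2
                          [0.4, 0.6]])
p_x3_given_x2 = np.array([[0.9, 0.1],
                          [0.5, 0.5]])

f1 = p_x1[:, None] * p_x2_given_x1  # factor f1(x1, x2) = p(x1) p(x2|x1)
f2 = p_x3_given_x2                  # factor f2(x2, x3) = p(x3|x2)

# leaf variables x1 and x3 initiate propagation with all-ones messages
mu_x1_to_f1 = np.ones(2)
mu_x3_to_f2 = np.ones(2)

# factor-to-variable messages: sum out every neighbor except the target
mu_f1_to_x2 = (f1 * mu_x1_to_f1[:, None]).sum(axis=0)
mu_f2_to_x2 = (f2 * mu_x3_to_f2[None, :]).sum(axis=1)

# marginal of x2: product of all incoming messages, renormalized
p_x2 = mu_f1_to_x2 * mu_f2_to_x2
p_x2 /= p_x2.sum()
print(p_x2)  # equals p_x1 @ p_x2_given_x1 = [0.58, 0.42]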

##Hidden Markov Models (HMMs)


• The Hidden Markov Model (HMM) is the simplest kind of Dynamic Bayesian Network (DBN).

• A DBN is a Bayesian network extended with additional mechanisms that are capable of modeling influences over time.

• The temporal extension of Bayesian networks does not mean that the network structure or parameters change dynamically, but that a dynamic system is modeled.

• A DBN is a model of a stochastic process.

• HMM has one discrete hidden node and one discrete or continuous observed node
per slice.

• Inference in hidden Markov models and linear Gaussian state-space models is computed as follows:

$$p(Q_1, \ldots, Q_N, Y_1, \ldots, Y_N) = p(Q_1)\, p(Y_1 \mid Q_1) \prod_{t=2}^{N} \left[ p(Q_t \mid Q_{t-1})\, p(Y_t \mid Q_t) \right]$$

   – In HMMs, the states Q are discrete.

   – In linear Gaussian SSMs (state-space models), the states are real Gaussian vectors.
   – Both HMMs and SSMs can be represented as singly connected DAGs (at most one path between u and v).
   – The forward-backward algorithm in hidden Markov models (HMMs) and the Kalman smoothing algorithm in SSMs are both instances of belief propagation / factor graph propagation.
• As it appears, the structure and parameters are assumed to repeat as the model is
unrolled further.

• Therefore, to specify a DBN, we need to define the intra-slice topology (within a


slice), the inter-slice topology (between two slices), as well as the parameters for
the first two slices. (Such a two-slice temporal Bayes net is often called a 2TBN.)

Topologies:

• Normally, the natural ordering of time needs to be preserved by preventing the HMM from transitioning to previous states.

• This restriction leads to what is known as a left-right HMM (commonly used for sequential modeling).

• A linear topology is one in which transitions are only permitted to the current state and the next state.
• If transitions to any state at any time exist, the HMM is known as ergodic.

HMM: Parameters and Training. An HMM is completely determined by the following parameters:

• Initial state distribution vector π of size n: the probability of starting in each state.
• Transition probability matrix A of size n × n: how likely it is to transition to each state, given some current state.
• Emission probability distributions B (one distribution per state): the probability of generating an observation $o_t$, given some current state $s_t$.

Example:

states = ('Rainy', 'Sunny')

observations = ('walk', 'shop', 'clean')

start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

transition_probability = {
'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
}

emission_probability = {
'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}

• An HMM algorithm may consist of one or more of these steps: Forward, Backward, and Update.
• The six common problems [link, link] that can be solved using HMMs are the filtering, smoothing, forecasting, evaluating, decoding, and learning problems.
   – The evaluating, filtering, and forecasting problems can be solved using the forward algorithm.
   – The smoothing problem can be solved using the forward algorithm and the backward algorithm.
   – The decoding problem can be solved using the Viterbi algorithm; the learning problem, solved through MLE, can use the forward algorithm to calculate the likelihood.
• In order to learn the aforementioned parameters θ = (π, A, B), the model must be trained on labeled samples.
   – The time-independent stochastic transition matrix is $A = \{a_{ij}\} = P(X_t = j \mid X_{t-1} = i)$.
   – The initial state distribution (i.e., when t = 1) is given by $\pi_i = P(X_1 = i)$.
   – The probability of a certain observation $y_i$ at time t for state $X_t = j$ is given by $b_j(y_i) = P(Y_t = y_i \mid X_t = j)$.
• The Baum-Welch algorithm, which is an application of the Expectation-Maximization algorithm to HMMs, can be used to tune these parameters:

$$\theta^* = \underset{\theta}{\operatorname{arg\,max}}\ P(Y \mid \theta)$$

where $Y = (Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T)$ is an observation sequence.

Algorithm: For simplicity, let's consider the outcomes as univariate variables.

• Step 1: Initialize

Set θ = (π, A, B) to random initial conditions. They can also be set from prior information, if available, to speed up training and steer toward a desired local maximum.

• Step 2: Forward procedure

Let $\alpha_i(t) = P(Y_1 = y_1, \ldots, Y_t = y_t, X_t = i \mid \theta)$ be the probability of seeing the observations $y_1, y_2, \ldots, y_t$ and being in state i at time t. It is estimated recursively:

   – $\alpha_i(1) = \pi_i\, b_i(y_1)$,
   – $\alpha_i(t+1) = b_i(y_{t+1}) \sum_{j=1}^{N} \alpha_j(t)\, a_{ji}$.

• This step can be used to solve the evaluating problem via:

$$L_t \equiv p(Y_1, \ldots, Y_t) = \sum_{j=1}^{N} \alpha_t(j)$$

   – Because the likelihood of all observations, $L_T$, can be calculated, we can apply the maximum likelihood method to estimate the unknown parameters via $L(\theta) = P(Y \mid \theta)$.
   – If the prior distribution of the parameter is given, we can also apply the MAP method: $\operatorname{argmax}[P(\theta) \cdot P(Y \mid \theta)]$.
• This step can also be used to solve the filtering problem via:

$$p_t(i) \equiv p(X_t = i \mid Y_1, \ldots, Y_t) = \alpha_t(i) / L_t$$

• We can also solve the forecasting problem, because the h-step-ahead prediction of the state probability can be calculated via filtering. A minimal sketch of this forward procedure, on the Rainy/Sunny model above, follows.
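This sketch uses the Rainy/Sunny parameters from the earlier example; the observation sequence is an arbitrary assumption:

import numpy as np

# states: 0 = Rainy, 1 = Sunny; observations: 0 = walk, 1 = shop, 2 = clean
pi = np.array([0.6, 0.4])          # initial state distribution
A = np.array([[0.7, 0.3],          # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],     # emission probabilities per state
              [0.6, 0.3, 0.1]])

obs = [0, 1, 2]                    # walk, shop, clean

alpha = pi * B[:, obs[0]]          # alpha_i(1) = pi_i * b_i(y_1)
for y in obs[1:]:
    alpha = B[:, y] * (alpha @ A)  # alpha_i(t+1) = b_i(y_{t+1}) * sum_j alpha_j(t) a_ji

L = alpha.sum()                    # evaluating: likelihood of the whole sequence
print("P(walk, shop, clean) =", L)
print("filtering P(X_T = i | Y) =", alpha / L)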

• Step 3: Backward procedure

Let $\beta_i(t) = P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid X_t = i, \theta)$, that is, the probability of the ending partial sequence $y_{t+1}, \ldots, y_T$ given starting state i at time t. It is calculated recursively:

   – $\beta_i(T) = 1$,
   – $\beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1)\, a_{ij}\, b_j(y_{t+1})$.
   – The smoothing problem is then solved, because we can calculate the probability of the current state given all past and future observations:

$$\gamma_i(t) = P(X_t = i \mid Y, \theta) = \frac{P(X_t = i, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$$

• Step 4: Update. Calculate the temporary variables, according to Bayes' theorem:

$$\gamma_i(t) = P(X_t = i \mid Y, \theta) = \frac{P(X_t = i, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$$

$$\xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid Y, \theta) = \frac{P(X_t = i, X_{t+1} = j, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(y_{t+1})}{\sum_{k=1}^{N} \sum_{w=1}^{N} \alpha_k(t)\, a_{kw}\, \beta_w(t+1)\, b_w(y_{t+1})}$$

The parameters of the hidden Markov model θ can now be updated:

• $\pi_i^* = \gamma_i(1)$,

the expected frequency spent in state i at time 1.

• $a_{ij}^* = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$,

the expected number of transitions from state i to state j, compared to the expected total number of transitions away from state i.

• $b_i^*(v_k) = \frac{\sum_{t=1}^{T} \delta_{y_t v_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$, where $\delta_{y_t v_k} = \begin{cases} 1 & \text{if } y_t = v_k, \\ 0 & \text{otherwise,} \end{cases}$

the expected fraction of time the output observation has been equal to $v_k$ while in state i.

These steps are repeated iteratively until a desired level of convergence is reached.

Example:

• A farmer collects chicken eggs at noon every day.

• Egg laying depends on some unknown factors that are hidden.
• The chicken is always in one of two states, which depends on the state of the previous day and influences whether the chicken lays eggs.
• The state at the initial starting point, the transition probabilities between the two states, and the probability that the chicken lays an egg given a particular state are all unknown.
• First, the transition and emission matrices are randomly set.
• Suppose we have the following set of observations over days (E = eggs, N = no eggs):

N, N, N, N, N, E, E, N, N, N

• This gives a set of observed transitions between days (Y):

NN, NN, NN, NN, NE, EE, EN, NN, NN

• Estimate a new transition matrix by maximizing observation probabilities given the initial θ (i.e., $P(Y, X \mid \theta)$). For example, the probability of the sequence NN with the state being $S_1$ then $S_2$ is given by:

$$P(Y_1 = N, Y_2 = N, X_1 = S_1, X_2 = S_2) = P(X_1 = S_1) \cdot P(Y_1 = N \mid X_1 = S_1) \cdot P(X_2 = S_2 \mid X_1 = S_1) \cdot P(Y_2 = N \mid X_2 = S_2)$$

• To calculate the new probability for the transition $S_1 \to S_2$, this quantity is summed over all observed pairs and normalized:

• The new estimate for the $S_1$ to $S_2$ transition is now $\frac{0.22}{2.4234} = 0.0908$.
• Likewise, calculate the other transition probabilities and normalize so they add to 1.

• Estimate the new emission matrix. For example, consider the probability of an observation pair NE given that the E is emitted from $S_1$.

• The new estimate for E coming from $S_1$ is now $\frac{0.2394}{0.2730} = 0.8769$.
• Repeat for N coming from $S_1$, and for N and E coming from $S_2$, then normalize.

• To estimate the initial probabilities, we assume all sequences start with the hidden state $S_1$, calculate the highest probability, and then repeat for $S_2$. Again, we then normalize to give an updated initial vector.

Gaussian Hidden Markov Model

• In a Gaussian HMM, the observation probability distribution is a normal distribution:

$$Y_t \mid X_t \sim N(\mu_{X_t}, \Sigma_{X_t})$$

• Therefore, the probability of a certain observation $y_i$ at time t for state $X_t = j$ is given by the mean and covariance parameters ($B \equiv \{\mu_i, \Sigma_i\}_{i=1,\ldots,K}$) of a multivariate Gaussian instead.
• That is, the parameter set of the Gaussian HMM is $\theta = (\pi, A, B)$.
• In a Gaussian mixture HMM, the observation probability distribution is a Gaussian mixture distribution:

$$Y_t \mid X_t \sim GM(\{w_{X_t,1}, \ldots, w_{X_t,M}\},\ \{\mu_{X_t,1}, \ldots, \mu_{X_t,M}\},\ \{\Sigma_{X_t,1}, \ldots, \Sigma_{X_t,M}\})$$
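A minimal Gaussian HMM sketch with hmmlearn on made-up 1-D data from two regimes with different means (the data and hyperparameters are illustrative assumptions); fit runs Baum-Welch (EM) and predict runs Viterbi decoding:

import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 100),
                    rng.normal(5.0, 1.0, 100)]).reshape(-1, 1)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(X)  # EM (Baum-Welch) estimation of (pi, A, B)

print("estimated means:", model.means_.ravel())  # close to 0 and 5
print("decoded states:", model.predict(X)[:10])  # Viterbi state sequence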

Example 1:
The following example shows how to train an HMM and use it to forecast the future.

• Collect Gold stock market price history


# !pip install yfinance
import yfinance as yf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Get the data for Gold

data = yf.download('GC=F', '2020-01-01', '2022-12-12')
# stationarize the price series (day-over-day differences)
stat_data = (data.Close.values[1:] - data.Close.values[:-1]).astype(int)
# replace it with categorical observations (0 = fall, 1 = rise)
stat_data[stat_data < 0] = 0
stat_data[stat_data > 0] = 1
stat_data[0:10]

[*********************100%***********************] 1 of 1 completed

array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])

• Train HMM model and predict forecast


import numpy as np
from hmmlearn import hmm

n_fits = 500
train = stat_data[0:-100]
val = stat_data[-100:]
best_score = None
# try n fits to avoid local minima
for idx in range(n_fits):
    model = hmm.CategoricalHMM(n_components=2, init_params='se', n_iter=500, random_state=idx)
    model.transmat_ = np.array([np.random.dirichlet([0.7, 0.3]),
                                np.random.dirichlet([0.3, 0.7])])
    model.fit(train.reshape(-1, 1))
    score = model.score(val.reshape(-1, 1))
    if best_score is None or score > best_score:
        best_score = score
        best_model = model

# Is it more probable for gold to rise or fall on the next three days successively?
if (best_model.score(np.concatenate([stat_data[-20:-4], np.array([1, 1, 1])]).reshape(-1, 1)) >
        best_model.score(np.concatenate([stat_data[-20:-4], np.array([0, 0, 0])]).reshape(-1, 1))):
    print('-The model predicts that Gold price will rise')
else:
    print('-The model predicts that Gold price will fall')

print('-What truly happened:', stat_data[-4:])

-The model predicts that Gold price will rise


-What truly happened: [1 1 1 1]
