
MACHINE LEARNING NOTES

LECTURE 1- INTRODUCTION AND PREPROCESSING
I. PYTHON
- NUMPY
- NumPy is one of the fundamental packages for scientific computing. It contains functionality for multi-dimensional arrays, high-level mathematical functions and pseudorandom number generators.
- SCIPY
- A collection of functions for scientific computing. It provides advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions and statistical distributions.
- MATPLOTLIB
- The primary scientific plotting library, providing functions for making publication-quality visualizations. You can show figures directly in the browser using the "%matplotlib notebook" or "%matplotlib inline" commands.

- PANDAS
- A Python library for data wrangling and analysis. It is built around a data structure called the DataFrame (a table, similar to an Excel spreadsheet). pandas provides a great range of methods to modify and operate on this table (e.g. SQL-like queries and joins of tables).

- SCIKIT-LEARN
-Machine learning library for the Python programming language.
II. What is ML?
- Short answer
• Machine : Easy part
• Learning : Hard part
- To solve problems on a computer, algorithms (sequences of instructions to transform input into output) are necessary
- Tasks without predefined algorithms (such as distinguishing spam from legitimate emails) require ML
- ML involves automatically extracting algorithms from example data.
- In simple terms, it's teaching a computer to do a task using data
- Methods that extract knowledge from data
- Closely related to statistics and optimization
- Machine learning is focused on prediction

II.1 Why ML?


- In the past, intelligent applications relied on hand-coded rules using "if" and "else" decisions (rule-based systems).
- Limitations of rule-based systems: decision logic is specific to a single domain and task, making system adaptation difficult. Designing rules requires deep domain expertise, hindering scalability and flexibility.
- Challenges with hand-coded rules: not suitable for tasks like face detection due to the complexity of human perception versus computer representation.
- We need ML because:
- The volume of data collected grows daily
- Data is cheap and abundant but knowledge is expensive and scarce
- Machine Learning: computers learn from data to aid knowledge discovery
- It can solve problems related to speech, vision, recognition and robotics
- It eliminates the need for manual rule creation by learning patterns from data.
- It offers a more scalable, adaptable, and efficient approach compared to traditional hand-coding, especially for complex tasks where human expertise may be insufficient to define decision rules.
- The model can be descriptive (for understanding the data and gaining knowledge from it), predictive, or both.
- There are 2 types of data:
- Structured data: highly organized, made up mostly of tables with rows and columns that define their meaning (e.g. Excel spreadsheets and relational databases).
- Unstructured data: everything else - emails, text messages, text files, audio files, voicemails, video files, images, illustrations, memes, etc. To make sense of it we need knowledge discovery.

II.2 An ML example
- We want a computer to distinguish between a Greyhound and a Labrador. Thus, we create a dataset by collecting examples of those breeds (photos of them) and we describe their breed-specific features.
A) Association learning
- Association learning focuses on uncovering associations, correlations, or
patterns within a dataset.
- The primary goal is to identify relationships between variables or items that
frequently occur together.
- Association learning methods typically involve mining large datasets to find
rules or patterns that indicate associations.
- Association learning finds applications in various domains, including retail for
market basket analysis, recommendation systems, web usage mining,
bioinformatics, and more.
- In retail, it helps understand customer purchasing behaviour and optimize
product placement or cross-selling strategies.
B) Classification
- Identifying which category an object belongs to
- Example+applications : Credit scoring involves predicting the risk associated
with a loan application. Attributes like income, savings, profession, age, and
past financial history are considered. The task is to classify applicants as low-
risk or high-risk based on these attributes. Applications of Classification:
optical character recognition, face recognition, medical diagnosis, speech
recognition, spam detection, outlier detection, etc.
- Algs- SVM, nearest neighbours, random forests, etc.
- Discrete output (eg. color, gender, yes/no, etc.)
- E.g. “Will you pass this course?”
- Supervised learning

- The model trained from the data defines a decision boundary that separates
the data

C) Regression
- In regression, the goal is to predict a continuous output (value of an attribute),
based on input attributes.
- Example: Predicting the price of a used car based on attributes like brand,
year, mileage, etc.
- Application: drug response, stock prices, navigation of a mobile robot,
optimization tasks, response surface design, etc.
- Algs: SVR, nearest neighbour, random forest, etc.
- Continuous output (e.g. temp, age, distance, salary, etc.)
- E.g. “How many points will you get in the exam?”

- The model fits the data to describe the relation between 2 features or
between a feature (e.g., height) and the label (e.g., yes/no)

D) Clustering
- A method for unsupervised learning aimed at finding clusters or groupings
within input data without any predefined labels
- Helps identify outliers, customers who deviate significantly from others in their
group.
- Applications: customer segmentation, grouping experiment outcomes, etc.
- Algorithms: k-means, feature selection, non-negative matrix factorization, etc.

III. Main types of ML


- Supervised, unsupervised and reinforcement learning
- Additional- semi-supervised (never-ending lang models) and active learning
A) Supervised learning

- Where the alg is trained on a labelled dataset


- Algs learn from input/output pairs provided by a "teacher."
- Often used to automate manual labour (like assigning objects to a certain class)
- Alg generalizes from known examples to make predictions on new data
- Examples- identifying handwritten digits on envelopes, determining tumor
benignity from medical images, detecting fraudulent activity in credit card
transactions, what will I buy next week?, self-driving cars, etc.
- Classification, detection, segmentation, regression, etc.

B) Unsupervised Learning

- Algorithms work with only input data, without explicit instructions on what to do with it; there is no known output data
- This means that there’s no labels/categories for the alg to learn from
- Useful for discovering hidden relationships, patterns, grouping similar data,
reducing the dimensionality of the input space, etc.
- Examples: identifying topics in blog posts, discovering trending ‘X’ topics,
grouping data into clusters, outlier detection, segmenting customers into
groups based on preferences (clustering again), detecting abnormal access
patterns on a website, etc.
- Clustering

C) Reinforcement learning

- The system learns to make a sequence of decisions to achieve a goal.


- An agent learns to make decisions by interacting with the environment.
Reasoning under uncertainty to make decisions.
- It takes actions in the environment, receives feedback, and learns to optimize
its behaviour in order to maximize its reward.
- The output is a sequence of actions. The primary focus is on learning a policy-
a sequence of actions that lead to the desired outcome.
- Example: game playing, robot navigation, self-driving cars
- Challenges: it becomes challenging when the system has limited or unreliable
sensory information
IV. Terminology and Workflow
IV.1 Workflow
IV.2 Terminology

- Model- an equation that links the values of some features to the


predicted value of the target variable; finding the equation (and
coefficients in it) is called ‘building a model’ (see also ‘fitting a model’).
- Score functions/Fit statistics/Score metrics – measures of how well the model
fits the data.
- Feature selection – reducing the number of predictors by selecting the
important ones (dimensionality reduction).
- Feature extraction – reducing the number of predictors by means of a
mathematical operation (e.g., PCA).
- ML tasks: prediction, learning something previously unknown.

IV.3 DummyClassifier and DummyRegressor


- They don't generate any insight about the data
- Simple baselines to compare against other more complex
classifiers/regressors
- DummyClassifier
• classifies the given data using only simple strategies; most-frequent,
uniform, constant.
- DummyRegressor
• makes predictions using simple strategies; mean, median.
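A minimal scikit-learn sketch of both baselines (the tiny toy data here is made up purely for illustration):

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Toy data: 10 samples, 2 features, imbalanced labels (eight 0s, two 1s)
X = np.random.rand(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Baseline classifier: always predicts the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy_clf.predict(X[:3]))   # [0 0 0]
print(dummy_clf.score(X, y))      # accuracy of the majority-class baseline (0.8 here)

# Baseline regressor: always predicts the mean of the training targets
y_reg = np.arange(1.0, 11.0)      # 1.0 ... 10.0
dummy_reg = DummyRegressor(strategy="mean").fit(X, y_reg)
print(dummy_reg.predict(X[:3]))   # [5.5 5.5 5.5]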
V. Types of data
A) Images
- Computers work with numbers
- Images are arrays of numbers (RGB values for each pixel)

B) Text
- Words/ Letters need to be converted in a format computers can understand

VI. Preprocessing
- Feature extraction and normalization
- Preparing the raw data and making it suitable for an ML model

VI.1 Scaling data


- With few exceptions, ML algorithms don't perform well when the input numerical attributes have very different scales.
A) Standard scaler
- z-scores (standard scores): mean of 0 and standard deviation of 1
- common method in data normalization (good for non-skewed data)

B) Robust scaler
- Same as Standard Scaler, but with median instead of mean and
IQR instead of SD.
- Better for skewed data
- Deals better with outliers
C) MinMax scaler
- Shifts data to an interval set by xmin and xmax

D) Normalizer
- Doesn't work by feature (column) but by row
- Each row of data is rescaled so that its norm becomes 1
- Compute the norm of the vector (square root of the sum of the squared elements)
- Divide each element by the norm
- Used only when the direction of the data matters
- Helpful for histograms
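A minimal sketch comparing the four scalers above in scikit-learn (the two-column toy matrix is made up; each transformer follows the usual fit/transform pattern):

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# Toy feature matrix with very different column scales
X = np.array([[1.0,  100.0],
              [2.0,  200.0],
              [3.0, 1000.0]])

print(StandardScaler().fit_transform(X))  # z-scores: mean 0, std 1 per column
print(RobustScaler().fit_transform(X))    # uses median and IQR per column (robust to outliers)
print(MinMaxScaler().fit_transform(X))    # rescales each column to the [0, 1] interval
print(Normalizer().fit_transform(X))      # rescales each ROW to unit (L2) norm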
VI.2 Ways to scale data
A) Univariate transformations
- Examples of univariate transformations: logarithmic, geometric, power, etc.
- Most ML models perform best with Gaussian distributed data (bell curve)
- Methods to transform data to Gaussian include Box-Cox transform and
Yeo-Johnson transform
-Parameters can be automatically estimated so that skewness is minimized
and variance is stabilized.
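A small sketch of these transforms as implemented in scikit-learn's PowerTransformer (the skewed toy data is generated here only for illustration; the transform parameter is estimated automatically, as described above):

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed, strictly positive toy data (log-normal samples)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# Box-Cox requires strictly positive data; Yeo-Johnson also handles zeros/negatives
bc = PowerTransformer(method="box-cox")
yj = PowerTransformer(method="yeo-johnson")   # the default method

X_bc = bc.fit_transform(X)
X_yj = yj.fit_transform(X)
print(bc.lambdas_, yj.lambdas_)   # estimated transform parameters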

B) Binning
- Separate feature values into n categories (e.g., equally spaced over the
range of values)
- Replace all values within a category by a single value, e.g., mean.
- Effective for models with few parameters, such as regression, but not
effective for models with many parameters, such as decision trees.
VII. Guiding principles in ML
VII.1 Measuring classification success
- How “predictive” are the models we have learnt?
- New data is probably not exactly the same as the training data
-What happens if we overfit our data?

VII.2 Training and test set


- The data used to build the model is split into two parts: the training set and the
test set. To avoid overfitting: build a classifier using the training set and
evaluate it using the test set.
- The training set is used to train the model
- The test set is used to evaluate its performance.
- It's essential to use separate data for training and testing to ensure the
model's ability to generalize to new, unseen data.
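A minimal sketch of the split in scikit-learn (the iris dataset and a k-NN classifier, covered later, are used only as placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data as the test set (stratify keeps class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:",  clf.score(X_test, y_test))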

VII.3 Subject-wise splitting

A) Cross-validation
- To evaluate (test) your model’s ability to predict new data
- Detect overfitting or selection bias

- Techniques:
- K-fold cross validation
- Leave one out (K-fold cross validation to the extreme)
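A short sketch of both techniques with scikit-learn (dataset and model are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

# 5-fold cross-validation: 5 train/test splits, 5 accuracy scores
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())

# Leave-one-out: k-fold taken to the extreme (one fold per sample)
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(loo_scores.mean())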

LECTURE 2- FEATURE ENGINEERING


- Covered in last lec:
-Scaling
-Binning
-Univariate transformations
- Covered in this lec:
-Missing value imputation

I. Preprocessing- Missing value imputation


- In real world datasets, missing input values are very common
- No standard encoding (blank, 0, “NA”, NaN, Null, ...)
- Imputation: replacing missing value with estimate for that value
- Mean / median
- KNN
- Model-driven
- Iterative
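A minimal sketch of these imputation strategies in scikit-learn (the toy matrix with NaN entries is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing entries encoded as NaN
X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [7.0,    np.nan],
              [8.0,    9.0]])

# Mean / median imputation (computed column-wise)
print(SimpleImputer(strategy="mean").fit_transform(X))
print(SimpleImputer(strategy="median").fit_transform(X))

# KNN imputation: missing values estimated from the most similar complete rows
print(KNNImputer(n_neighbors=2).fit_transform(X))

# The iterative, model-driven variant is IterativeImputer, which currently needs
# "from sklearn.experimental import enable_iterative_imputer" before importing it.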

II. Feature selection


- Why select features?
- Avoid overfitting
- Faster prediction and training
- Less storage for model and dataset
- Strategies
- Univariate statistics
- Model-based selection
- Iterative selection
II.1 Univariate statistics
- Look at each feature individually, looking at one var at a time
- Features will be removed if they do not have a significant relationship
with the target
- It involves examining characteristics of individual variables independently.
- Features that are significant only in combination with another feature
(interaction) will be removed. Univariate analysis does not consider
relationships between variables, unlike multivariate analysis.
- Selecting features with highest confidence is related to ANOVA (from
statistics).
- It's like zooming in on one aspect of the data to understand it better without
considering other variables.
- Examples: calculating averages, determining minimum and maximum values,
and assessing variability or spread within the data, etc.
- Pick statistic, check p-values!
- f_regression, f_classif, chi2 in scikit-learn

- Mutual information (as implemented here) is also univariate, but doesn't assume a linear model (like the F statistics do). Can be used with SelectKBest etc.
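A minimal sketch of univariate selection with SelectKBest (the breast cancer dataset bundled with scikit-learn and k=10 are arbitrary choices):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the best univariate test statistic
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
print(selector.pvalues_[:5])           # p-values of the first few features

# Mutual information: univariate, but does not assume a linear relation
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print(mi_selector.get_support()[:10])  # boolean mask of kept features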

II.2 Model-based Feature selection


- Get best fit for a particular model, involves selecting features based on their
importance or relevance to a predictive model.
- Ideally: exhaustive search over all possible combinations
- Exhaustive is infeasible (and has multiple testing issues)
- Use heuristics in practice
- Typically involves training a machine learning model on the entire set of
features.
- Importance of each feature is then assessed using techniques such as feature
importance scores or coefficients.
- Features with the highest importance scores are kept, while less important features are discarded.
- Helps in improving model performance by focusing on the most informative features and reducing overfitting.
- Common techniques: decision trees, random forests, or linear models for
feature selection, etc.
- It's important to validate the selected features on a separate validation set to ensure generalization to unseen data.
A) Model based (Single fit)
- Model-based single-fit selection trains a machine learning model once and keeps the features that are most important to that model.
- The model learns patterns and relationships between features and the target
variable simultaneously.
- This approach is suitable when there's no need for feature selection or when
all features are expected to contribute to the predictive performance.
- It can simplify the modelling process by not requiring explicit feature selection
steps.
- However, it may lead to overfitting if the model is too complex relative to the
amount of data available.
- Algs: Lasso, other linear models, tree-based models
- Multivariate – linear models assume linear relation
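A minimal sketch of model-based (single fit) selection via scikit-learn's SelectFromModel; the random forest, dataset, and median threshold are arbitrary illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Fit one model, keep only features whose importance exceeds the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median")
selector.fit(X, y)

print(X.shape, "->", selector.transform(X).shape)   # roughly half the features kept
print(selector.get_support()[:10])                  # mask of selected features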

B) Iterative model-based selection


- Iterative model-based selection of features involves a process where features
are selected or eliminated iteratively based on their importance to the model.
- Typically starts with an initial set of features and involves iteratively training a
model, evaluating feature importance, and updating the feature set.
- Process continues until a stopping criterion is met.
- Techniques like recursive feature elimination (RFE) or forward/backward
selection are commonly used for iterative feature selection.
- Forward/Backward selection:
- Forward: start with a single feature, find the most important feature, add it, and iterate.
- Backward: fit the model, find the least important feature, remove it, and iterate.
- RFE (recursive feature elimination):
- Repeatedly train the model and remove the least important features until the desired number of features is reached.
- Iterative model-based feature selection helps in identifying the most
informative features while potentially reducing overfitting and improving model
interpretability.
- Care should be taken to validate the selected features and the resulting model
to ensure robustness and generalization to unseen data.
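A minimal sketch of recursive feature elimination with scikit-learn's RFE (dataset, estimator, and the target of 10 features are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling helps the linear model converge

# Repeatedly fit the model and drop the least important feature
# until only n_features_to_select remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=10, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the 10 selected features
print(rfe.ranking_)   # 1 = selected; higher numbers were eliminated earlier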


III. Categorical variables


- Traditional assumptions of data: a 2D array of floating-point
numbers (continuous), with each column representing a
continuous feature.
- However, many real-world applications involve (discrete)
categorical features, which are discrete and not numeric.
- Unlike continuous features, categorical features lack a natural
order.
- Representation of data significantly affects machine learning model performance; categorical features therefore need to be encoded appropriately.
- Often necessary to represent categorical features as numbers.
- One Hot encoding
- Count-based encoding
A) One hot encoding (dummy variables)
- Replaces a categorical variable with one or more binary
features, where each feature can have values 0 or 1.
- Each category gets its own feature, and exactly one feature is
1 for each data point (hence, "one-hot" or "one-out-of-N"
encoding).
- Each data point will have a 1 in the corresponding feature for
its category and 0s elsewhere.
- It is important that the math used by machine learning models is not affected by the encoding (simply using 1, 2, 3, ... would impose a spurious order on the categories)
- Adding one feature for each category (feature encodes whether a
sample belongs to this category or not)

- all colours are equally distant from each other


⁃ Implementation:
⁃ Using pandas: The get_dummies function transforms
categorical columns into one-hot encoded columns automatically.
⁃ Example: data_dummies = pd.get_dummies(data)
⁃ Continuous features remain untouched, while categorical
features are expanded into new binary features.
⁃ Ensure consistency between training and test sets: Call
get_dummies on a DataFrame containing both training and test
data to ensure consistent encoding across both sets.
⁃ Mismatch in categories between training and test sets can lead
to incorrect model behavior. It's essential to have the same set of
dummy features in both sets to maintain consistent semantics.
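A tiny sketch of one-hot encoding with pandas (the DataFrame and its column names are made up):

import pandas as pd

# Toy DataFrame with one continuous and one categorical feature
data = pd.DataFrame({
    "age":   [25, 32, 47],
    "color": ["red", "blue", "red"],
})

data_dummies = pd.get_dummies(data)
print(data_dummies)
# The continuous column "age" is untouched; "color" expands into indicator columns
# color_blue / color_red (0/1 or True/False depending on the pandas version).
# To keep training and test sets consistent, call get_dummies on a DataFrame that
# contains both, or use scikit-learn's OneHotEncoder(handle_unknown="ignore"),
# which remembers the categories seen during fit.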
B) Count-based encoding
- Numbers Can Encode Categoricals:
- Categorical variables are often encoded as integers for storage efficiency or
due to the way data is collected.
-For example, in the adult dataset, workclass categories might be
encoded as 0, 1, 2, etc., instead of strings like "Private".
- Encoding as integers doesn't imply the variable should be treated as
continuous; it depends on the semantics of the variable.
- If there's no inherent order between categories (as in the workclass example),
the variable should be treated as discrete.
- For other cases, such as ratings, whether to treat an integer feature as
continuous or discrete depends on the task and the machine learning
algorithm used.
- For high-cardinality (highly unique) categorical features:
-Example: countries
-Instead of 50 one-hot variables, replace label with the value of a variable
aggregated over that label.
- For regression:
- “people in this state have an average response of y”
- Binary classification:
-“people in this state have likelihood p for class 1”
- Multiclass:
-One feature per class: probability distribution
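A hedged sketch of this kind of encoding with pandas (column names and values are invented; in practice the aggregates should be computed on the training set only, to avoid leakage):

import pandas as pd

# Made-up example: a high-cardinality categorical column ("state")
# and a binary target ("bought")
df = pd.DataFrame({
    "state":  ["CA", "CA", "NY", "NY", "NY", "TX"],
    "bought": [1,    0,    1,    1,    0,    0],
})

# Replace each state label by the likelihood of class 1 within that state
state_rate = df.groupby("state")["bought"].mean()
df["state_encoded"] = df["state"].map(state_rate)
print(df)
# CA -> 0.5, NY -> 0.666..., TX -> 0.0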

IV. Working with images


A) Digital images
- The values are all discrete and integers.
- Can be considered as a large array of discrete dots, each dot has a
brightness associated with it.
- These dots are called picture elements – pixels
B) Arrays and images

- Images are represented as matrices (e.g. numpy arrays)


- Can be written as a function f(x,y)
- Types of images: Binary Images, Grayscale Images and Color Images
- Binary images (1-bit images):
- Each pixel is either black or white.
- Only two possible values for each pixel (0, 1).
- Only need one bit per pixel.
- Grayscale images:
- Each pixel is a shade of gray, normally from 0 (black) to 255 (white).
- Each pixel can be represented by eight bits, or exactly one byte.
- Other grayscale ranges are used, but they are generally a power of 2 (e.g. 2^2 = 4, 2^6 = 64).
- Color images:
- A stack of multiple matrices, representing the multiple channel values for each pixel.
- E.g. an RGB color is described by the amount of red, green and blue in it.
IV.1 Accuracy

- It measures the proportion of correctly classified instances among all


instances in the dataset

⁃ Problems with accuracy:


- imbalanced classes lead to hard-to-interpret accuracy: Accuracy can be
misleading when dealing with imbalanced datasets, where one class significantly
outnumbers the other(s). In such cases, a model that predicts the majority class for
all instances can achieve high accuracy but fail to detect minority classes, which may
be more important.
- Doesn't consider the types of errors made by the model. It treats all
misclassifications equally, whether they are false positives or false negatives.
- In some applications, misclassifications may have different costs. For
example, in healthcare, a false negative (missing a disease) could be more costly
than a false positive (misdiagnosis). Accuracy alone may not reflect the real-world
implications of model errors.
-Can be influenced by how the data is preprocessed and by the choice of
thresholds. For instance, preprocessing techniques like feature scaling or encoding
can impact the performance of models and, consequently, accuracy.
- Doesn't capture the confidence level of predictions. A model with high accuracy
may still provide uncertain or low-confidence predictions, which could be problematic
in critical applications where confidence is essential.
- To address these limitations, it's crucial to consider additional evaluation
metrics and techniques, such as precision, recall, F1-score, ROC curves, and
confusion matrices.
IV.2 Precision, recall, F-score
- Precision: measures the proportion of TP predictions among all positive predictions made by the model. It focuses on the accuracy of positive predictions and answers the question: "Of all instances predicted as positive, how many are actually positive?"
- Recall: measures the proportion of TP predictions among all actual positive instances in the dataset. It focuses on the model's ability to capture all positive instances: "Of all actual positive instances, how many did the model correctly predict as positive?"
- F-score: a balance between precision and recall, making it useful for situations where you want to consider both FP and FN. It combines precision and recall into a single metric and is useful when there is an imbalance between classes.
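A small sketch computing these metrics with scikit-learn on made-up binary labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (toy example)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall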
V. Transforming text data
- Most ML algs prefer to work with numbers
-So far encountered 3 types of feats: fixed number of features, continuous,
categorical
- Fourth type of features- text feats
- Text data is represented as strings of characters, and the length of the strings can vary. It can be words, sentences or entire documents and varies highly: punctuation, different forms of the same word, typos, capitalized letters, etc.

V.1 Representing text data


- Different types each requiring a different approach for processing and
analysis:
- Categorical Data: from a fixed list of options. For example, if you collect data
on people's favorite colors through a survey with predefined choices. These can
be identified by examining the range of unique values and their frequency
distribution. It's important to consolidate similar entries (e.g., correcting
misspellings) to ensure consistency.
- Free Strings Mapped to Categories: When respondents provide their own
responses in a text field, such as naming their favorite color, the resulting strings
may vary widely due to spelling errors, synonyms, or creative responses.
-Structured String Data: addresses, names, dates, telephone numbers, and
other identifiers. While they have some underlying structure, parsing and processing
them can be challenging and context-dependent. Handling this type of data often
requires domain-specific knowledge and is not easily automated.
-Text Data: consists of phrases or sentences and includes sources like
tweets, chat logs, reviews, literary works, and articles. Each document in the dataset
is considered a data point, and the entire collection is called a corpus. Analyzing text
data involves techniques from information retrieval and natural language processing,
such as tokenization, stemming, lemmatization, and vectorization, to extract
meaningful insights from the textual content.
- A sentence can be broken down into individual words.

- Each word is represented as a categorical variable (e.g., using one-hot


encoding)
- One-hot encoded vectors are possibly:
- very large (tens of thousands of words in a general vocabulary)
- very sparse (vectors are all “0”s with a single “1”)
- Problems
- Concatenating all word vectors results in massive vectors.
-Sentences have unequal length, which is unsuitable for most ML methods.

V.2 Bag-of-words representation


- Use a single vector of length equal to the size of the vocabulary.
- Each component -is the number of times that word appears in the sample
(e.g., in a sentence or document).
- Useful for sentiment analysis (e.g. recognizing words such as “great”,
“terrible” ...)
- Most common technique to numerically represent text
- Represents each sentence or document as a vector with a value for each
word in the vocabulary.
-Binary: word present or absent in the document
-Count: how often the word appears in the document
- Popular approach: Term Frequency x Inverse Document Frequency (TF-
IDF)
- Term Frequency: TF(t, d) = (number of times term t appears in document d) / (number of terms in document d)
- Inverse Document Frequency: IDF(t) = log(N / n), where N is the number of documents and n is the number of documents the term t has appeared in. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low, which has the effect of highlighting words that are distinct.
- The TF-IDF value of a term is TF-IDF(t, d) = TF(t, d) * IDF(t)
- Example of calculating TF-IDF for a term in a document (see the sketch below).
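A minimal sketch with scikit-learn's vectorizers on a made-up three-document corpus; note that TfidfVectorizer uses a smoothed variant of the IDF formula above, so the exact numbers differ slightly:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "great plot and great acting",
]

# Bag of words: one column per vocabulary word, word counts per document
counts = CountVectorizer()
X_counts = counts.fit_transform(corpus)
print(counts.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF weighting: words that appear everywhere (e.g. "the") get downweighted
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))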

Advantages:
- Highly interpretable: each word is an independent feature.
- Simple method.
- A fairly effective approach for some applications.
Limitations:
- All structure is lost! Crucial information may be lost.
- Misspellings: "machine" and "machnie" will be counted as different words.
- Some expressions consist of several words, e.g. a product review containing the word "worth": what if the review said "not worth" vs "definitely worth"?

V.3 Text Data Preprocessing


- The bag-of-words representation is a simple yet effective way to represent
text data for machine learning tasks. Here's how it works:
- Tokenization: Each document in the corpus is split into individual words or
tokens. This process typically involves removing punctuation and splitting the
text based on whitespace- convert sentences to words
- Removing stop words
- Vocabulary Building: A vocabulary is created by collecting all unique words
(tokens) from the entire corpus. Each word is assigned a unique identifier,
often in alphabetical order. (Stemming—words are reduced to a root by
removing inflection through dropping unnecessary characters, usually a suffix.
Lemmatization—another approach to remove inflection by determining the
part of speech and utilizing a detailed database of the language.)
- Encoding: For each document, a vector is created to represent the frequency
of each word in the vocabulary. Each element in the vector corresponds to a
word in the vocabulary, and its value represents the number of times that
word appears in the document.
A) Tokenization
- Process of breaking a stream of textual content up into words, terms,
symbols, or some other meaningful elements called tokens.
• The list of tokens turns into input for in additional processing including
parsing or text mining.
• Tokenization can swap out sensitive data
• E.g. Typically payment card or bank account numbers—with a randomized
number in the same format
B) Stemming and Lemmatization
-Stemming—words are reduced to a root by removing inflection through
dropping unnecessary characters, usually a suffix.
• The stemmed form of "studies" is: studi
• The stemmed form of "studying" is: study
• Lemmatization—another approach to remove inflection by determining the part of speech and utilizing a detailed database of the language
• The lemmatized form of "studies" is: study
• The lemmatized form of "studying" is: study
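A small sketch reproducing the example above with NLTK (this assumes nltk is installed and the WordNet data has been downloaded; exact stemmer output can vary between implementations):

# Requires: pip install nltk, then nltk.download("wordnet") for the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying"]:
    print(word,
          "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # pos="v" treats the word as a verb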
C) Restricting the vocab
Removing unnecessary punctuation, tags
• Removing stop words—frequent words such as ”the”, ”is”, etc. that have low
semantic content

• Removing infrequent words


Words that appear only once or twice might not be helpful:
• Restrict vocabulary size to only most frequent words (for less features)
LECTURE 3 – KNN

I. Classification

I.1 Classifiers create decision boundaries


- Classifiers are trained on the dataset (labelled data points) and automatically
“draw” a decision boundary between the two classes

- The decision boundary can be a straight line (“stiff”) or a wiggly line (“flexible”)

- The decision boundary is considered to be a model of the separation between


the two classes

- The model is induced from the data

- The complexity of the model is proportional to the wigglyness of the decision


boundary

- Model trained from the data defines a decision boundary that separates the
data

- Classifiers create decision boundaries


- Decision boundaries can be linear (straight line, 'stiff') or non-linear (wiggly line, 'flexible')

(Figure: example of a classification decision boundary and a regression fit)

I.2 Nearest-neighbour classifier


- Given a set of labeled instances (training set), new instances (test
set) are classified according to their nearest labeled neighbour

I.3 K-NN classifier


- k-NN algorithm predicts the class of a data point by considering the majority
class among its k nearest neighbours
- For binary classification, it calculates the distances between the new data
point and all training data points, selects the closest one, and assigns the
label of that point to the new data point.
- For multiple classes, it counts the occurrences of each class among the k
nearest neighbours and assigns the most common class to the new data
point.

- The hyperparameter k represents the number of labeled neighbours to

- consider

- Test points are assigned the majority label of the k nearest neighbours

- Special cases:

- k = N: since all data points are considered, the predicted label for a test point will always be the majority label of all data points. Equivalent to a majority classifier.

- Ties: in case of a tie between predicted labels, there are different


possibilities. The most common one is random selection from the tied labels.
II. K-Nearest Neighbours Classification
II.1 K-NN- hypothesis space (1 neighbour)
- Decision boundaries between classes are formed by the boundaries around individual points in the feature space, where each point acts as a prototype for its class, and new instances are classified based on their proximity to these prototypes
- Decision Boundaries:
- decision boundaries are formed around individual
points in the feature space.
-Each data point in the feature space acts as a
prototype for its class, defining the boundary around
it.
- Prototype-Based Classification:
-Each data point in the training set serves as a
prototype or representative of its class.
-The decision boundaries are determined by the
proximity of test instances to these prototypes.
- Classification Process:
-When a new instance needs to be classified, the
algorithm looks at the nearest neighbor (prototype) in
the feature space.
-The class label of the nearest neighbor is assigned to
the new instance.
- Proximity-based Classification:
-The classification decision is made solely based on the
proximity or distance between the new instance and
the prototypes.
-Instances closer to a prototype are more likely to be
assigned to the same class as that prototype.
- Flexibility and Adaptability:
-This approach allows for flexible decision boundaries
that adapt to the distribution of data points in the
feature space.
-It is particularly useful when dealing with non-linear
decision boundaries or complex data distributions.
- Overall, by considering only one neighbor, the k-NN
algorithm forms decision boundaries based on the local
structure of the data, making it a simple yet effective
method for classification tasks, especially in scenarios
where the underlying data distribution is not well-defined or
highly non-linear.

II.2 Influence of k on the decision boundary


- More neighbours lead to a smoother (straighter) boundary
- The label/class of a point lying exactly on the decision boundary is ambiguous
II.3 Weights in k-NN
- Weights determine how much ‘influence’ each neighbour has on one
another
- Distance-weighting: each neighbor has a weight which is based on its
distance to the data point to be classified
-Inverse distance weighting – each point has a weight equal to the
inverse of its distance to the point to be classified (neighboring points have
a higher vote)
-Inverse of the square of the distance
-Kernel functions (Gaussian kernel, tricube kernel)

- If we change the distance function, the results will change.

- Implication: with distance weighting, k=n is no longer equivalent to a


majority based classifier.
II.4 Computing distance in k-NN
- Different ways to define the distance function

- Euclidean distance (straight line)

- Manhattan distance (distance between projections on the axis)

- Difference between Euclidean and Manhattan distance
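A short numpy sketch of both distances for a pair of made-up 2-D points:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: sqrt(3^2 + 4^2) = 5
manhattan = np.sum(np.abs(a - b))           # sum of per-axis differences: 3 + 4 = 7
print(euclidean, manhattan)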

II.5 k determines model complexity


- The model in k-NN is the decision boundary that separates the classes

- (In regression, the model is the line that fits the data)
- Smaller k leads to more complex decision boundaries
- k too low -> danger of overfitting, high complexity
- k too high -> danger of underfitting, low complexity
II.6 How to determine model complexity?
- Depends on complexity of the separation between the classes

- Start with the simplest model (large k in k-NN), and increase complexity
(smaller k)

II.7 How to choose k?


- Typically odd for an even number of classes (e.g., 1, 3, 5, 7, ...)

- As you decrease k, accuracy might increase, but so does the model complexity

- In other words, a small value of k is likely to lead to overfitting (fitting


“noise”)

- A rule of thumb used by some data-miners:


II.8 Dividing our data to tune hypeparameters

II.9 Implementing KNN in sci-kit learn
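A hedged sketch covering II.8-II.10: split off a validation set to tune k, and keep the test set for a single final evaluation (dataset and candidate k values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split twice: train / validation (to pick k) / test (final evaluation only)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)

best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9]:
    score = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Retrain on train+validation with the chosen k, report test accuracy once
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("best k:", best_k, "test accuracy:", final.score(X_test, y_test))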

II.10 Overfitting the test set

III. Nearest Centroid (NC)


- Many points are scattered around and we want to group them into different categories. Each category represents a group of points that are similar to each other in some way. Instead of looking at all points individually, we choose one representative point for each category: this is the centroid.

III.1 Nearest shrunken centroid
- After calculating the nearest centroids, the algorithm applies a shrinkage step. Shrinking essentially pulls the centroids towards the overall mean of all points in the dataset (this way the centroids are regularized and prevented from being overly influenced by outliers/noisy data points)
- ‘Shrinks’ each of the class centroids towards the overall centroid for all classes by an amount we call a threshold

- Nearest centroid classification

- -Takes a new sample, and compares it to each of these class


centroids. The class whose centroid it is closest to, in squared distance, is
the predicted class for that new sample.

- Nearest shrunken centroid classification

- "shrinks" each of the class centroids toward the overall centroid for all
classes by an amount we call the threshold . This shrinkage consists of
moving the centroid towards zero by threshold, setting it equal to zero if it
hits zero. For example if threshold was 2.0, a centroid of 3.2 would be
shrunk to 1.2, a centroid of -3.4 would be shrunk to -1.4, and a centroid of
1.2 would be shrunk to zero.

- After shrinking the centroids, the new sample is classified by the usual
nearest centroid rule, but using the shrunken class centroids.
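A minimal sketch of both classifiers in scikit-learn (dataset and the shrink_threshold value are arbitrary illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # centroid distances are scale-sensitive
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

plain = NearestCentroid().fit(X_train, y_train)
shrunk = NearestCentroid(shrink_threshold=0.5).fit(X_train, y_train)  # shrunken centroids

print("nearest centroid:         ", plain.score(X_test, y_test))
print("nearest shrunken centroid:", shrunk.score(X_test, y_test))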

III.2 KNN vs NC

IV. KNN regression


- k-NN classification combines the discrete predictions of k-neighbours,

- k-NN regression combines continuous predictions

- k-NN regression fits the best line between the neighbors

- In regression with k-NN, the algorithm predicts continuous values by


considering the target values of the k nearest neighbors of a given data
point.
- The prediction for a new data point is either the target value of its nearest
neighbor (in the case of one nearest neighbor) or the average (mean) of
the target values of the k nearest neighbors.
- Model performance is evaluated using the R^2 score, also known as the
coefficient of determination.
- The R^2 score measures the goodness of the regression model fit. A score of 1 indicates a perfect prediction, a score of 0 corresponds to a constant model that predicts the mean of the training set responses, and the score can even be negative for models that do worse than that.
- Overall, the k-nearest neighbors algorithm for regression provides a simple
yet effective method for predicting continuous values, with flexibility in
adjusting model complexity through the number of neighbors parameter.
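A small sketch of k-NN regression and its R^2 score on a made-up noisy sine curve, showing how k changes the fit:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Noisy 1-D toy regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 20]:
    reg = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    # score() returns the R^2 coefficient of determination
    print(f"k={k:2d}  train R^2={reg.score(X_train, y_train):.2f}  "
          f"test R^2={reg.score(X_test, y_test):.2f}")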
IV.1 KNN - Advantages vs disadvantages
Advantages:
- Ease of understanding: kNN is simple and intuitive, easy to understand and implement.
- Baseline performance: it often provides reasonable performance, serving as a good baseline method for comparison.
- Fast training: building the k-NN model is usually fast, especially for small to moderate-sized datasets.
- The cost of the learning process is zero.
- No assumptions about the characteristics of the concepts to learn have to be made.
- Complex concepts can be learned by local approximation using simple procedures.
Disadvantages:
- Slow prediction: prediction can be slow.
- Scale: preprocess the data, especially when features have different scales, to avoid bias in the distance calculations.
- Poor performance with many features: k-NN may not perform well on datasets with a high number of features, particularly when most features are sparse (i.e., have many zero values).
- The model cannot be interpreted (there is no description of the learned concepts).
- It is computationally expensive to find the k nearest neighbours when the dataset is very large.
- Performance depends on the number of dimensions (curse of dimensionality).

IV.2 Curse of dimensionality and overfitting


IV.3 Does adding features improve classification?
- This example suggests that by adding (informative) features,
classification is improved. This is often the case, but...

- Adding new features increases the volume of feature space


exponentially

- For instance, if each feature has 10 different values:
1 feature: 10 possible feature values
2 features: 100 possible feature values
3 features: 1000 possible feature values
LECTURE 4- LINEAR REGRESSION

I. Regression
- In machine learning, supervised learning to predict continuous outputs
(we’ll call these y) is called regression.
- What do you need to predict outputs ?
-Continuous or categorical input features (we'll call these x);
-Training examples: many x for which y is known (e.g. many people of
whom we know the height, predicting the housing prices);
-A model, a function that represents the relationship between x and y;
-A cost function, which tells us how well our model approximates the
training examples;
-Optimization, a way of finding parameters for the model while minimizing
the loss function.
- An analytical method gives one unique closed-form solution (not always feasible); iterative optimization does not give an exact solution but approximates one.
I.1 Linear regression (aka ordinary least squares)
⁃ Simplest and classic linear method for regression.
⁃ Finds parameters w and b minimizing mean squared error
between predictions and true regression targets y on the training
set.
⁃ Mean squared error is the sum of squared differences between
predictions and true values.
⁃ No parameters to control model complexity.

⁃ Given an input feature x we would like to predict an output y


⁃ In linear regression we assume that x and y are related by the equation y = m·x + e, where m is a parameter and e represents measurement or other noise

- We do not want to model the error term (it is random and not something we want to account for)
- Goal: estimate m from training data for x and y
- Most common approach: minimize the least squares error (LSE)

- Why Least Squares?


- minimizes the squared distance between measurements and regression
line
- easy to compute (even by hand)

I.2 Solving linear regression with least squares minimization
- Take the derivative of the sum of squared deviations with respect to m
- Set the first derivative to 0 to find the coefficient m: d/dm Σ(yᵢ − m·xᵢ)² = −2 Σ xᵢ(yᵢ − m·xᵢ) = 0, which gives m = (Σ xᵢyᵢ) / (Σ xᵢ²)

I.3 Bias/Intercept
- If the line does not pass through the origin ...

- Introduce a bias (intercept) term (b)


- Parameter b is the average of the differences between the values of y and their estimates (m*x).

I.4 Finding m and b
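A sketch of finding m and b on made-up data, first with the closed-form least-squares formulas and then with scikit-learn's ordinary least squares:

import numpy as np

# Toy data roughly following y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

# Closed-form least-squares estimates for slope m and intercept b
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean
print(m, b)                      # should come out near 2 and 1

# The same fit with scikit-learn's ordinary least squares
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(lr.coef_[0], lr.intercept_)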


I.5 Multi-dimensional inputs

I.6 Calibration: linear or polynomial fit


- Curve fitting: finding a mathematical function (constructing a curve) that best fits a series of data points; "smoothing" means not looking for an exact fit but for a curve that fits the data approximately

 Linear: y = ax + b (first degree polynomial) – fits a curve to 2 points
 Quadratic: y = ax^2 + bx + c (second degree polynomial) – fits a curve to 3 points
 Cubic: y = ax^3 + bx^2 + cx + d (third degree polynomial) – fits a curve to 4 points
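A short numpy sketch of fitting polynomials of increasing degree to made-up data points:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 4.2, 8.8, 17.1])   # made-up, roughly quadratic data

line  = np.polyfit(x, y, deg=1)   # [a, b]       for y = a*x + b
quad  = np.polyfit(x, y, deg=2)   # [a, b, c]    for y = a*x^2 + b*x + c
cubic = np.polyfit(x, y, deg=3)   # [a, b, c, d] for y = a*x^3 + b*x^2 + c*x + d
print(line, quad, cubic, sep="\n")

# Evaluate a fitted polynomial at new points
print(np.polyval(quad, [5.0, 6.0]))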
I.7 Linear vs polynomial fit?
Linear fit:
- Assumes a linear relationship between input features and the target variable.
- A straight line is fitted to the data.
- Simple and easy to interpret.
Polynomial fit:
- Assumes a polynomial relationship between input features and the target variable.
- Can capture more complex patterns in the data compared to a straight line.
- The degree of the polynomial (d) determines the complexity of the model.
- Higher degree polynomials can fit the training data more closely but may lead to overfitting.
- Can capture more intricate relationships but is prone to overfitting, especially with higher degree polynomials.
I.9 Best fit vs perfect fit
- Perfect fit – Goes through all points in the data.

- The best fit may not be the perfect fit.

- The best fit should give the best predictive value


I.10 Measuring fit
- Fit – accuracy of a predictive model; the extent to which predicted values of a
target variable are close to the observed values of that variable
- For regression models, the fit can be expressed as R2 (the percentage of
variance explained by the model).

I.11 Overfitting
- Machine learning is so effective in finding the best fit that it is likely to
construct a complex model that would never generalize to unseen data.
- However, a complex model that reduces prediction error and yields a
better fit also models noise.

I.12 Overfitting vs underfitting


- The relation between the complexity of the induced model and
underfitting and overfitting is a crucial notion in data mining

Overfitting:
 The induced model is too complex: it also tries to fit the noise in the data
 Performs better on the training set than on the validation set
Underfitting:
 The induced model is not complex (flexible) enough to model the data
 Performs badly on both the training and the validation set

I.13 Validation set


- Tuning hyper-parameters:

 Never use test data for tuning the hyper-parameters


 We can divide the set of training examples into two disjoint sets: training and
validation
 Use the first set (i.e., training) to estimate the coefficients m for different
values of hyperparameter(s) (degree of the polynomial)
 Use the second set (i.e., validation) to estimate the best degree of the
polynomial, by evaluating how well the classifier does on this second set
 Then, test how well it generalizes to unseen data

I.14 How to overcome overfitting in regression

Two strategies: reduce the model complexity, or use regularization.
Regularization:
- Reduction of the magnitude of the coefficients
- Why?
 We do not want overfitting, hence we limit the variation of the parameters to prevent extreme fits to the training set.
 In a way, we limit the contribution of ineffective parameters to make our function simpler.
I.15 Ridge regression
⁃ Ridge regression is a linear model for regression.
⁃ Predictions are made using the same formula as ordinary least
squares (OLS).
⁃ In addition to predicting well on the training data, it fits an
additional constraint.
⁃ Constraint: Magnitude of coefficients (w) should be as small as
possible, implying each feature has minimal effect on the outcome
while still predicting well.
⁃ This constraint is an example of regularization, specifically L2
regularization.
⁃ Implemented in linear_model.Ridge in scikit-learn.
Parameter alpha:
⁃ Alpha controls the trade-off between simplicity of the model
(small coefficients) and training set performance.
⁃ Increasing alpha forces coefficients toward zero, decreasing
training set performance but potentially improving generalization.
⁃ Optimal alpha depends on the dataset.
Effect of Alpha:
⁃ alpha=10: Coefficients restricted, lower training and test set
scores.
⁃ alpha=0.1: Coefficients less restricted, higher training and
test set scores.
⁃ Optimal alpha balances model complexity and performance.
Coefficient Magnitudes:
- Higher alpha leads to smaller coefficient magnitudes,
indicating a more restricted model.
- Ridge coefficients generally smaller than those of
LinearRegression.
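A hedged sketch comparing ordinary least squares with Ridge at two alpha values (the diabetes dataset bundled with scikit-learn is used here, so the exact scores differ from the lecture's example):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [LinearRegression(), Ridge(alpha=0.1), Ridge(alpha=10)]:
    model.fit(X_train, y_train)
    name = type(model).__name__ + (f"(alpha={model.alpha})" if hasattr(model, "alpha") else "")
    # Larger alpha shrinks the coefficients more, trading training fit for simplicity
    print(f"{name:22s} train R^2={model.score(X_train, y_train):.2f} "
          f"test R^2={model.score(X_test, y_test):.2f} "
          f"max |coef|={abs(model.coef_).max():.1f}")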
I.16 Lasso
⁃ An alternative to Ridge for
regularizing linear regression.
⁃ Like Ridge, it restricts
coefficients to be close to zero,
but using L1 regularization.
⁃ L1 regularization leads to
some coefficients being exactly
zero, providing automatic feature
selection.
⁃ Implemented in linear_model.Lasso in scikit-learn.
Parameter alpha:
⁃ Alpha controls the strength of regularization, with higher
values pushing coefficients closer to zero.
⁃ Lower alpha allows for a more complex model, potentially
improving performance.
Effect of Alpha:
- alpha=0.01: Improved performance, using 33 features.
- alpha=0.0001: Overfitting, similar to LinearRegression, using
94 features.
Coefficient Magnitudes:
⁃ Coefficients shrink toward zero as alpha increases.
⁃ Many coefficients become exactly zero for higher alpha values.
Visualization:
⁃ Plotting coefficient magnitudes for different alpha values
shows the effect of regularization.
⁃ Higher alpha leads to smaller magnitudes and more zero
coefficients.
⁃ Ridge and Lasso solutions differ in terms of sparsity of
coefficients.
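A similar sketch for Lasso, counting how many coefficients stay non-zero as alpha changes (again on the scikit-learn diabetes data, so the feature counts differ from the 33/94 quoted above):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [1.0, 0.1, 0.01]:
    lasso = Lasso(alpha=alpha, max_iter=100_000).fit(X_train, y_train)
    n_used = np.sum(lasso.coef_ != 0)       # L1 drives some coefficients exactly to zero
    print(f"alpha={alpha:5}: test R^2={lasso.score(X_test, y_test):.2f}, "
          f"features used: {n_used}/{X.shape[1]}")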

Comparison with Ridge:


⁃ Ridge is usually preferred over Lasso as the default choice.
⁃ Lasso is useful when expecting only a few important features
or when interpretability is important.
⁃ ElasticNet combines penalties of Lasso and Ridge, providing a
balance between the two approaches.
LECTURE 5- LOGISTIC REGRESSION AND
NEURAL NETWORKS
I. Introduction & basic terms
- Parameters : variables learnt (found) during training, e.g weights (w)
- Hyperparameters: variables whose value is set before the training
process begins (e.g regularization parameters (alpha), number of
neighbors (k))
- Loss function/error: what you are trying to minimize for a single training example to achieve your objective (e.g. squared loss)
- Cost function: the average of your loss function over the entire training set (e.g. mean squared error)
- Objective function: any function that you optimize during training (e.g. maximum likelihood, divergence between classes).
- A loss function is part of a cost function, which is a type of objective function.
- Decision Boundary (for classification):
• A single line/contour which separates data points into regions
• What is the output label at the boundary?
I.1 How do we find the parameters w?

A) Gradient Descent
- To find optimal values for w:
• an iterative optimization algorithm that operates over a loss landscape (cost function)

⁃ Follow the slope of the gradient with respect to W to reach the minimum cost

- Gradient descent for linear regression:


- Cost function : average of your loss functions over the entire training set

(e.g. mean square error)


- Gradient of cost function: The direction of your steps to achieve your

objective
- Learning rate: The size of steps took in any direction

- Gradient descent: learning rate


- The learning rate is a crucial hyperparameter in gradient descent that
determines the size of the steps taken towards the minimum of the loss
function. Here's how the learning rate impacts gradient descent:
- Learning Rate Too Small: If the learning rate (η) is too small, convergence may be slow. In each iteration, the parameter updates are tiny, and it may take a large number of iterations to reach the minimum. This can result in longer training times and increased computational costs.
- Learning Rate Too Large: Conversely, if the learning rate is too large, gradient descent may overshoot the minimum, oscillate, or even diverge instead of converging.
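A minimal numpy sketch of gradient descent for linear regression on made-up data (learning rate and number of steps are arbitrary choices):

import numpy as np

# Toy data for y = 2x + 1 with noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0          # parameters to learn
eta = 0.1                # learning rate (step size)

for step in range(500):
    y_hat = w * x + b
    error = y_hat - y
    # Gradients of the mean squared error cost with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction opposite to the gradient
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)              # should approach 2 and 1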

II. Logistic Regression


II.1 Regression for classification
- In some cases, we can use linear regression to determine an
appropriate boundary
- Linear regression : output is a linear function of features
- Linear classification
- decision boundary is a linear function of the input
- Logistic Regression (classifier)
- Linear Support Vector Machines
II.2 Logistic regression

- Logistic regression is a statistical method used for binary


classification tasks, where the goal is to predict the probability
of a binary outcome based on one or more predictor variables.
Despite its name, logistic regression is a classification
algorithm rather than a regression algorithm.
Here's an overview of logistic regression:
- Objective: Logistic regression aims to model the probability
that a given input data point belongs to a particular class.
- Decision Boundary: In binary classification, a decision
boundary is learned from the training data. For logistic
regression, the decision boundary is a hyperplane that
separates the feature space into regions where each class is
predicted.
- Training: The vector parameters (weights)are learned by
maximizing the likelihood function or minimizing the logistic
loss function using optimization techniques like gradient
descent.

- Assumes a particular functional form (a sigmoid) is applied to the linear function of the data
- The sigmoid is defined as: σ(z) = 1 / (1 + e^(−z))


- One parameter per data dimension (feature) and the bias
- Features can be discrete or continuous
- Output of the model between 0 and 1

II.3 Probabilistic interpretation


- Logistic Regression :
• One parameter per data dimension (feature) and the bias
• Features can be discrete or continuous
• Output of the model between 0 and 1 (which can be used to model

class probability)

- If we have two classes, can you compute the probability of the second class from that of the first? (The two class probabilities sum to 1.)

II.4 Decision Boundary for Logistic Regression

II.5 Making predictions with Logistic regression
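A minimal scikit-learn sketch of making predictions with logistic regression (dataset chosen arbitrarily); predict_proba exposes the sigmoid outputs as class probabilities:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

print(clf.predict(X_test[:3]))        # hard class labels (0 or 1)
print(clf.predict_proba(X_test[:3]))  # probabilities: P(class 0), P(class 1) per sample
print(clf.score(X_test, y_test))      # accuracy on the held-out test set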


II.6 Solution with Logistic regression
II.7 Loss functions
- Our goal in training is to find the best set of weights and biases that
minimizes the loss function.

• Sum of Squared loss for regression


• Cross entropy loss for classification

- Loss func for logistic regression:


- Why don’t we use sum of squares error as our cost function in
logistic regression?
- We can still use it but it is not convex anymore since we have the
sigmoid function.
- Rather, we use a logarithmic (cross-entropy) loss: L(y, ŷ) = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ], where ŷ is the predicted probability

II.8 Entropy
- Entropy – level of uncertainty
- Exercise for computing entropy

- Cross entropy

 Also known as logarithmic loss, log loss or logistic loss


 Predicted class probability compared to actual class for output of 0
or 1
 Score calculated penalizes probability based on how far it is from
actual value
 Penalty is logarithmic in nature.
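A tiny sketch (with made-up probabilities) computing the cross-entropy both from its definition and with scikit-learn:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.05])   # predicted P(class 1) for each sample

# Cross-entropy (log loss) from its definition: the penalty grows
# logarithmically as the predicted probability moves away from the true label
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(manual)
print(log_loss(y_true, y_prob))   # same value via scikit-learn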
II.9 Algorithm for logistic regression

II.10 Gradient descent for logistic regression

III. Regularization
- “Regularization is any modification to a learning algorithm that is intended to
reduce its generalization error but not its training error”
- Similar to other data estimation problems, we may not have enough samples
to learn good models for logistic regression classification
- One way to overcome this is to ‘regularize’ the model, impose additional
constraints on the parameters we are fitting

- By adding a prior on w

III.1 L1 vs L2 regularization
- L1 (Lasso): encourages sparsity
- Squared L2 (Ridge): encourages small weights

III.2 Regularization in linear regression


Ridge regression:
- Adds the 'squared magnitude' of the coefficients as a penalty term.
- If alpha = 0, we get back the original linear regression.
- If alpha is very large, too much weight is added to the penalty term and it will lead to underfitting.
Lasso regression:
- Also known as the least absolute shrinkage and selection operator.
- Adds the 'absolute value of magnitude' of the coefficients as a penalty term.
- If alpha = 0 we get back the original linear regression.
- If alpha is too large, almost all the coefficients become 0, which leads to underfitting.
- For example, normal distribution; zero mean and identity covariance

- The log likelihood becomes

- This prior pushes the parameters/coefficients (m) towards zero (why is this a good idea?) – because when we include this prior, the gradient gains an extra shrinkage term (for a Gaussian prior, a term proportional to the coefficients themselves)

- The parameter α controls the importance (strength) of the regularization, and it is a hyperparameter
- How do we decide the best value of \alpha (or a hyper-parameter in
general)?
- There are many other ways to regularize logistic regression
• The Gaussian model leads to an L2 regularization (we are trying to minimize the squared value of w)

• Another popular regularization is L1, which tries to minimize |w|
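A hedged sketch of L2- vs L1-regularized logistic regression in scikit-learn (dataset and C value are arbitrary; note that scikit-learn's C is roughly the inverse of the alpha used above, so a smaller C means stronger regularization):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# L2 shrinks all weights; L1 pushes some weights exactly to zero (sparsity)
print("L2: max |w| =", np.abs(l2.coef_).max(), " zero weights:", np.sum(l2.coef_ == 0))
print("L1: max |w| =", np.abs(l1.coef_).max(), " zero weights:", np.sum(l1.coef_ == 0))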

IV. Multi-class classification


IV.1 Multiclass regression
LECTURE 6 – NEURAL NETWORKS
I. Neural Networks
I.1 Conventional ML

I.3 Feature extraction


- For different tasks we need to extract different features to achieve higher accuracy

I.4 Limitations of linear classifiers


- For many decision problems, if we restrict ourselves to linear models we need a lot of decision boundaries
- Linear classifiers make decisions based on linear combinations of the features, but many decisions involve non-linear functions of the input
- We cannot divide the plane with a straight line; we need a curve or something more flexible
I.5 Highly non-linear funcs
- The positive and negative cases cannot be separated by a line or plane (view
fig. above)

I.6 Linear vs non-linear funcs


Linear methods:
⁃ Assume a linear relationship between input features and the target variable.
⁃ Examples include linear regression, logistic regression, and linear SVMs.
⁃ These methods are relatively simple and interpretable.
⁃ They work well when the relationship between variables is linear or can be approximated as linear.
⁃ However, they may struggle to capture complex, non-linear patterns in the data.
⁃ Exponents in the equation: x always appears to the power of 1 (e.g. y = mx + b); the slope is constant.
Non-linear methods:
⁃ Can capture complex relationships between input features and the target variable.
⁃ Examples include neural networks, decision trees, SVMs with non-linear kernels, and kernel methods.
⁃ These methods are more flexible and can model highly non-linear relationships in the data.
⁃ They are capable of learning intricate patterns and structures, making them suitable for tasks with complex data.
⁃ However, they are often more computationally expensive and may require more data to train effectively.
⁃ They can sometimes be less interpretable than linear methods, especially in deep learning architectures.
⁃ Exponents in the equation: x appears with a power greater than 1 at least once; the slope is always changing.

I.7 Non-linear models with complex features


- The parts highlighted in red in the figure are learned by the algorithm itself; the parameters we have to set ourselves are called hyperparameters.
- CNN for computer vision tasks

I.8 Constructing non-linear classifiers


- Goal: To construct non-linear discriminative classifiers that utilize functions of
input variables
- Neural Network approach
• Use a large number of simpler (activation) functions
• The functions are fixed (Gaussian, sigmoid, polynomial basis functions); the activation functions can be varied between layers (hence the name deep learning)
• Optimization involves linear combinations of these fixed functions

(Figure: a simple feed-forward network; blue: input layer, orange: hidden layer, red: output layer)
I.9 Inspiration for NNs
(Figure: brain areas; occipital cortex: vision, posterior/inferior temporal cortex: object recognition, PFC: cognitive decisions, motor cortex: actions)
- At each layer representations get more complex


- Our brain has ~10^11 neurons
- Each neuron communicates with (is connected to) ~10^4 other neurons (the learning capacity comes from those connections)

I.10 ANNs
- Neural networks define functions of the inputs (hidden features), computed by
neurons

- Artificial neurons are called units

- Neurons carry info; they either fire or not


- The synapses that pass outputs between neurons correspond to the weights in an ANN

I.11 Logistic regression as a perceptron


⁃ The perceptron consists of inputs, each associated with a
connection weight.
⁃ The output is calculated as the weighted sum of inputs plus a
bias term, often represented as the dot product of the input vector
and weight vector, augmented to include the bias weight.
⁃ To handle non-linear relationships or tasks like classification
requiring posterior probabilities, a threshold function or sigmoid
function can be applied to the output (to capture the neuron
"fire or not" paradigm).
⁃ For classification tasks with more than two outputs, multiple
perceptrons can be used, each associated with its weight vector.
⁃ In a neural network context, perceptrons represent local
functions of inputs and weights.
⁃ For classification tasks needing posterior probabilities, a two-
stage process is used: weighted sums are calculated and then
passed through a sigmoid/softmax to obtain the probabilities.
⁃ Perceptrons can also be used to approximate polynomial
functions or learn non-linear relationships by incorporating hidden
layers in multilayer perceptrons.

In summary, the perceptron serves as a versatile


computational unit capable of implementing linear and non-
linear functions, making it a foundational element in neural
networks for various machine learning tasks.

- In essence a single-layer perceptron does just what logistic
regression does: it takes the weighted combination of its inputs plus a
bias term, passes it through an activation function and produces an
output.
- The difference is that we add additional layers; that's why we call
it a multilayer perceptron. The layers that we add in
between the input and output layers are called hidden
layers, and the number of layers we add is a hyperparameter
(decided by us).

To optimize the weights: gradient descent.
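A minimal NumPy sketch of exactly that (illustrative only, with made-up data): a single perceptron that behaves like logistic regression, i.e. a weighted sum of the inputs plus a bias, a sigmoid activation, and gradient-descent updates of the weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # 100 examples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # a linearly separable target

w = np.zeros(2)                                # connection weights
b = 0.0                                        # bias term
lr = 0.1                                       # learning rate (a hyperparameter)

for epoch in range(200):
    z = X @ w + b                              # weighted combination of inputs
    p = sigmoid(z)                             # "fire or not" as a probability
    grad_w = X.T @ (p - y) / len(y)            # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                           # gradient descent update
    b -= lr * grad_b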

I.12 NN architecture (Multi-layer perceptron)


- Each unit computes its value based on linear combination of values of units
that point into it, and an activation function

⁃ MLPs consist of an input layer, one or more hidden layers, and an output
layer.
⁃ Each layer contains perceptrons (or units), with connections (weights)
between them.
⁃ While theoretically, MLPs can have multiple hidden layers, in practice,
networks with one hidden layer are more common due to simplicity.
⁃ However, in certain cases, adding more hidden layers might improve the
network's ability to capture intricate patterns in the data.
⁃ In summary, multilayer perceptrons overcome the limitations of single-layer
perceptrons by introducing non-linear activation functions and multiple hidden layers,
enabling them to approximate non-linear functions and solve more complex machine
learning problems.
⁃ We optimize our hyperparameters through trial and error on the validation set.
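A hedged sketch of a multilayer perceptron using scikit-learn's MLPClassifier (values are illustrative); the number and size of the hidden layers, the activation function and the training settings are the hyperparameters we pick ourselves and check on held-out data.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # non-linearly separable data
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# One hidden layer with 10 units; more/larger layers increase capacity (and overfitting risk).
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                    max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_val, y_val))   # generalization estimate on unseen data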

I.13 Representational power


- A neural network with at least one hidden layer is a universal approximator (it can represent any function).

- Adding more hidden layers and neurons increases complexity => risk of overfitting.

- The capacity of the network increases with more hidden units and more hidden layers.

- Generalization performance is important; that's why we test on unseen data and see how our algorithm generalizes.
- Low generalization performance indicates overfitting.
I.14 NN components

II. Training the NNs


Training a neural network typically involves two main stages:
forward propagation and backpropagation. Let's break down each
stage:
⁃ Forward Propagation (feedforward):
⁃ In this stage, the input data is fed into the network, and the
activations are computed layer by layer until the output layer is
reached.
⁃ The input data is multiplied by the weights of the connections
between layers and passed through activation functions to produce
the output of each neuron.
⁃ Forward propagation follows these steps:
Input data is fed into the input layer.
The input is multiplied by the weights and passed through
activation functions in each layer to generate the output of each
neuron.
This process continues until the output layer is reached, and
the final output of the network is produced.
The output is compared with the actual target values to
compute the loss or error.
⁃ Backpropagation:
⁃ Backpropagation is the process of updating the weights of the
neural network to minimize the error between the predicted output
and the actual target values.
⁃ It involves computing the gradients of the loss function with
respect to the weights of the network using the chain rule of
calculus and adjusting the weights in the opposite direction of the
gradient to minimize the loss.
⁃ Backpropagation follows these steps:
Error gradients are computed by calculating the derivative of
the loss function with respect to the output of the network and
propagating these gradients backward through the network.
The gradients are used to update the weights of the network
using an optimization algorithm such as gradient descent.
This process iterates over multiple epochs, with each epoch
consisting of one forward pass followed by one backward pass
through the network.
The goal is to train the network to minimize the loss function
and improve its performance on unseen data.
- These two stages, forward propagation and backpropagation,
are repeated iteratively until the network converges to a set of
weights that minimize the loss function and produce accurate
predictions on the training data. Additionally, techniques such
as regularization, dropout, and batch normalization can be
applied to improve the training process and prevent
overfitting.

- Forward pass: performs inference

- Backward pass: performs learning
• Change weights and biases to reduce the error
• A routine to compute the gradient
• Uses the chain rule on the derivative of the loss function

- Once optimization converges, the training error is minimal; too many epochs can lead to overfitting.
II.1 Activation function
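For reference, a small NumPy sketch of the most common activation functions (the slide shows them as a figure; the exact set there may differ):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # squashes values to (0, 1)

def tanh(z):
    return np.tanh(z)                   # squashes values to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)           # 0 for negative inputs, identity otherwise

def softmax(z):
    e = np.exp(z - np.max(z))           # numerically stable softmax for multi-class outputs
    return e / e.sum()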
II.2 Backpropagation

- Loss can be measured

- Propagating the error back and updating (adjusting) our weights and biases to
minimize loss.

- How do we find the appropriate amount to adjust?

- Compute the derivative of the loss function with respect to weights and biases

- The derivative of the function is the slope of the function

- Gradient descent: updating the weights and biases by increasing or reducing them, stepping in the direction that decreases the loss.

- Back-propagation: an efficient method for computing gradients needed to


perform gradient-based optimization of the weights in a multi-layer network
• Given any error function E, we just need to derive the gradients
of the activation functions

- Forward Pass:
o The algorithm begins with a forward pass, where input data is fed into
the network, and activations are computed layer by layer until the
output layer is reached.
o The output of each neuron is computed as a weighted sum of its inputs
passed through an activation function (e.g., sigmoid function for hidden
layers).
- Error Computation:
o After the forward pass, the error or loss is computed between the
predicted output and the actual target values.
o For example, in the case of nonlinear regression, the error is typically
computed using a loss function such as mean squared error (MSE).
- Backward Pass (Backpropagation):
o The error gradients with respect to the parameters (weights) of the
network are computed using the chain rule of calculus.
o The error is propagated backward through the network, starting from
the output layer towards the input layer.
o At each layer, the error gradients are computed for the weights
connecting that layer to the previous layer.
- Weight Update:
o The weights of the network are updated using an optimization
algorithm, typically gradient descent or one of its variants.
o The weights are adjusted in the opposite direction of the gradient to
minimize the error.
o The magnitude of the weight update is determined by the learning rate
(η) parameter, which controls the step size during optimization.
- Batch Learning:
o In batch learning, weight updates are accumulated over all patterns in
the training set, and the weights are updated once after a complete
pass over the training set (epoch).
o This process iterates over multiple epochs until convergence, where
the error is minimized to an acceptable level.
- Online Learning:
o Alternatively, online learning updates the weights after each individual
pattern, implementing stochastic gradient descent.
o Online learning can converge faster but requires careful tuning of the
learning rate parameter and randomization of the training data order.

• In summary, backpropagation is a fundamental algorithm for


training multilayer perceptrons, allowing them to learn complex
non-linear relationships in data by iteratively adjusting their
weights to minimize error. It combines forward pass, error
computation, backward pass, and weight update stages to
optimize the network's performance on the training data.
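To make the forward pass / backward pass / weight update loop concrete, here is a compact NumPy sketch for a network with one hidden layer (an illustrative toy, not the lecture's code; it uses sigmoid activations and the cross-entropy error, so the output-layer gradient simplifies to p - y):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))                          # a batch of inputs
y = ((X[:, 0] * X[:, 1]) > 0).astype(float)[:, None]  # a non-linear (XOR-like) target

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)         # input -> hidden weights
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)         # hidden -> output weights
lr = 0.5                                              # learning rate

for epoch in range(2000):
    # Forward pass (inference)
    h = sigmoid(X @ W1 + b1)                          # hidden activations
    p = sigmoid(h @ W2 + b2)                          # network output
    # Backward pass (learning): chain rule, layer by layer
    d_out = (p - y) / len(y)                          # gradient at the output layer
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * h * (1 - h)              # error propagated through the sigmoid
    dW1 = X.T @ d_hid
    db1 = d_hid.sum(axis=0)
    # Weight update (gradient descent)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1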

III. Deep Learning


- A multilayer perceptron, or neural network is typically considered deep
when it has multiple (hidden) layers

III.1 Deep NN (DL)


- The advantage of a neural network with multiple layers is that it can learn a hierarchical
feature representation:

MLP:
⁃ Structure: an MLP consists of multiple layers of neurons, including an input layer, one or more hidden layers, and an output layer. Each neuron in one layer is connected to every neuron in the subsequent layer.
⁃ Fully Connected Layers: in an MLP, each neuron in a layer is connected to every neuron in the next layer, creating a dense network.
⁃ Parameter Sharing: there is no parameter sharing between different parts of the input data. Each weight parameter is unique to its connection.
⁃ Global Information Processing: an MLP processes the entire input data globally, without considering local patterns or structures. It treats each input feature equally.
⁃ Applicability: MLPs are suitable for tasks where the input data exhibits simple relationships and does not have spatial or temporal dependencies, such as tabular data or basic pattern recognition tasks.

CNN:
⁃ Structure: a CNN consists of alternating convolutional layers and pooling layers, followed by one or more fully connected layers.
⁃ Convolutional Layers: convolutional layers apply convolution operations to the input data, allowing them to capture local patterns or features. Each filter in a convolutional layer learns to detect specific features.
⁃ Parameter Sharing: CNNs use parameter sharing, where a small set of learnable filters (kernels) is applied across the entire input data to extract features. This significantly reduces the number of parameters and improves model efficiency.
⁃ Local Information Processing: CNNs process input data in a local and hierarchical manner, capturing local features and gradually combining them to learn global representations. They exploit the spatial locality and hierarchical structure present in the data.
⁃ Spatial Hierarchies: CNNs are particularly effective for tasks involving spatial data such as images, where local patterns and spatial relationships are crucial for understanding the content. They automatically learn hierarchical representations of the input data, starting from low-level features (edges, textures) to high-level features (objects, scenes).

- In summary, while MLPs are versatile and effective for


general-purpose machine learning tasks, CNNs excel in
tasks involving spatial data by leveraging parameter
sharing, local information processing, and spatial
hierarchies to learn complex features and patterns from
images, videos, and other spatial data formats.

III.2 A CNN for computer vision


III.3 The convolution layers
- Connect each hidden unit to a small input patch and share the weights across space. This is called a convolution layer, and the network is a convolutional network.

- the convolutional layer applies a set of filters to the input data, extracting
meaningful features and hierarchically learning representations of the
input, making it well-suited for tasks involving spatial data like images.

III.4 The Pooling layer


- By “pooling” (e.g., taking max) filter responses at different locations we gain
robustness to the exact spatial location of features.

- the pooling layer downsamples the input feature maps, reducing their spatial
dimensions while preserving essential information, thereby improving
computational efficiency and promoting translation invariance in CNNs.

III.5 Convolutional layer


- Filters its input to reveal patterns

- the convolutional layer applies filters to the input data to extract features
hierarchically, making it well-suited for tasks involving spatial data like images.
It helps CNNs learn relevant representations of the input data while efficiently
managing the model's parameters.
III.6 Max pooling layer
- Finds the maximum locally which indicates generally the maximum response
of the filtering from the previous layer.
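A toy NumPy sketch of what one convolution filter and one max-pooling step compute (for intuition only; real CNN layers have many filters and are implemented with optimized libraries):

import numpy as np

def conv2d(image, kernel):
    """Slide a small filter over the image and record its response at each location."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Keep only the maximum response in each size x size block (downsampling)."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

image = np.random.rand(8, 8)                         # a tiny grayscale "image"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # responds to vertical edges
features = conv2d(image, edge_filter)                # convolution layer output
pooled = max_pool2d(features, size=2)                # pooling layer output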
LECTURE 7- Support Vector Machines
I. What’s SVM
- Support Vector Machine (SVM) and later generalized kernel machines offer a
different approach for linear classification and regression.
- SVMs adhere to Vapnik’s principle, prioritizing simplicity and efficiency by
focusing on learning the discriminant rather than estimating complex
probabilities directly.
- After training, the parameter of the linear model, the weight vector, can be
expressed in terms of a subset of the training set known as support vectors.
- Support vectors are instances close to the decision boundary, aiding in
knowledge extraction and providing an estimate of generalization error.
- The output of the model is expressed as a sum of influences of support
vectors, determined by kernel functions that measure similarity between data
instances.
- Kernel functions allow representation beyond traditional vector-based
methods, accommodating various data types such as graphs.
- Kernel-based algorithms are formulated as convex optimization problems,
allowing for analytical solutions without the need for heuristics like learning
rates or convergence checks.
- Hyperparameters are still required for model selection, but the optimization
process is more straightforward.
- The discussion typically starts with classification and then extends to
regression, outlier detection, and dimensionality reduction, with a common
quadratic program template to maximize separability or margin of instances
while maintaining smoothness of solution.
- Supervised ML algorithm for regression and classification
- can generate linear decision boundaries for linearly separable data
- variants exist for non-linear decision boundaries

- A Support Vector Machine (SVM) is a discriminative classifier


formally defined by a separating hyperplane.

- Given labelled training data (supervised learning), the algorithm


outputs an optimal hyperplane which categorizes new examples.

- In two dimensional space this hyperplane is a line dividing a plane in


two parts where in each class lay in either side.

II. Linking SVM and Logistic regression


II.1 Cost function of logistic regression
- The cost function of logistic regression is a measure of how well the logistic
regression model fits the training data. It quantifies the difference between the
predicted probabilities and the actual labels of the training examples.
- Here's an explanation without equations:
- Prediction: In logistic regression, we make predictions using a logistic
function (also known as a sigmoid function). This function maps the linear
combination of input features and model parameters (weights and bias) to a
probability between 0 and 1, representing the likelihood that the output
belongs to a particular class (e.g., 0 or 1).
- Cost Calculation: The cost function evaluates how well the model's
predictions match the actual labels in the training data. For each training
example, it penalizes the model based on the difference between the
predicted probability and the true label. If the model's prediction is close to the
true label, the cost is low; otherwise, it is high.
- Cross-Entropy Loss: The most common cost function used in logistic
regression is the cross-entropy loss. It calculates the difference between the
predicted probability and the actual label for each training example, taking into
account both the cases where the true label is 0 and 1. The overall cost is the
average of these individual differences across all training examples.
- Minimization: During training, the goal is to minimize this cost function by
adjusting the model parameters (weights and bias). Optimization algorithms
like gradient descent are used to iteratively update the parameters in a
direction that reduces the cost, ultimately leading to a logistic regression
model that fits the training data well.
- In summary, the cost function of logistic regression measures how well the
model's predictions align with the actual labels in the training data, guiding the
training process towards finding the optimal set of parameters for accurate
predictions.

- The cost function of a Support Vector Machine (SVM) is typically associated


with the optimization problem formulated to find the decision boundary that
maximizes the margin between different classes while minimizing
classification errors.
- Here's an explanation of the cost function of an SVM:
- Margin Maximization: SVM aims to find the hyperplane that best separates
the classes in the feature space. The margin is defined as the distance
between the hyperplane and the closest data points (support vectors) from
each class. The larger the margin, the better the generalization ability of the
model.
- Hinge Loss: The cost function used in SVM is based on hinge loss, which
penalizes misclassifications. Hinge loss encourages correct classification of
data points by imposing a penalty on data points that lie on the wrong side of
the decision boundary or within the margin.
- Cost of Misclassification: For each data point, the cost of misclassification
increases linearly with the distance from the correct side of the margin. Data
points correctly classified or lying within the margin contribute zero loss to the
cost function.
- Regularization Term: Additionally, SVMs often include a regularization term,
which balances between maximizing the margin and minimizing the
classification errors. This term controls the trade-off between model
complexity and training error, preventing overfitting.
- Objective Function: The overall objective function of an SVM is to minimize
the hinge loss while maximizing the margin and incorporating the
regularization term. This optimization problem is typically solved using convex
optimization techniques.
- In summary, the cost function of an SVM is designed to find the decision
boundary that maximizes the margin between classes while minimizing
classification errors, with a regularization term to control model complexity.
The optimization process seeks to strike a balance between maximizing the
margin and reducing misclassifications.

SVM:
⁃ Objective: SVMs are primarily used for binary classification tasks, aiming to find a decision boundary that maximizes the margin between classes.
⁃ Cost Function: the cost function in SVM is based on the hinge loss. It penalizes misclassifications and encourages correct classification by imposing a penalty on data points that lie on the wrong side of the decision boundary or within the margin.
⁃ Formulation: mathematically, the cost function of SVM can be expressed as the sum of the hinge loss and a regularization term. The regularization term helps control the trade-off between maximizing the margin and minimizing classification errors, preventing overfitting.
⁃ Optimization: the goal is to minimize the overall cost function by adjusting the model parameters (weights and bias) using convex optimization techniques.
⁃ Simpler:
⁃ What it does: SVMs find the best line or boundary to separate different classes in the data.
⁃ Cost Function Behavior: it cares about how far data points are from the boundary and penalizes misclassifications. It wants to make sure that data points are on the correct side of the boundary.
⁃ Goal: minimize errors in classifying points while maximizing the gap between the classes.

Logistic regression:
o Objective: logistic regression is used for binary classification tasks, predicting the probability that an instance belongs to a particular class.
o Cost Function: the cost function in logistic regression is the negative log-likelihood or the cross-entropy loss. It quantifies the difference between the predicted probabilities and the actual binary labels. It penalizes misclassifications, with higher penalties for more confident wrong predictions.
o Formulation: mathematically, the cost function of logistic regression measures the likelihood of the observed data under the logistic regression model.
o Optimization: the goal is to minimize the cross-entropy loss by adjusting the model parameters (weights and bias) using optimization algorithms like gradient descent or its variants.
- Simpler:
o What it does: logistic regression predicts the probability that an instance belongs to a particular class.
o Cost Function Behavior: it looks at how well the predicted probabilities match the actual classes. It penalizes more confident wrong predictions more heavily.
o Goal: minimize the difference between predicted probabilities and actual classes.
- In summary, while both SVM and logistic regression aim to solve binary
classification tasks, their cost functions differ in their formulations and
optimization objectives. SVMs focus on maximizing the margin between
classes, while logistic regression focuses on minimizing the difference
between predicted probabilities and actual labels.
- Simpler: In essence, SVMs focus on separating classes with a clear margin,
while logistic regression focuses on estimating probabilities and adjusting
predictions based on how confident they are.
II.2 SVM a large margin classifier
⁃ In SVM, the goal is to find a decision boundary (hyperplane) that maximizes the margin between the classes. This concept of a "large margin" relates to ensuring that there is a substantial gap between the decision boundary and the closest data points from each class.
⁃ For the positive class (y=1), we want the decision boundary to be such that θ^T x is greater than or equal to 1, ensuring that positive instances are correctly classified and lie on the correct side of the decision boundary.
⁃ For the negative class (y=0), we want θ^T x to be less than or equal to -1, ensuring that negative instances are correctly classified and lie on the correct side of the decision boundary.
⁃ Mathematically, this translates to minimizing the sum of squares of the θ values while satisfying these constraints. This optimization problem aims to find θ values that maximize the margin while correctly classifying all instances.


⁃ The representation where the positive class is labeled as 1 and the negative class as -1 aligns with this formulation. It signifies that positive instances should have a higher value of θ^T x (greater than or equal to 1) and negative instances should have a lower value of θ^T x (less than or equal to -1), ensuring correct classification and a large margin between the classes.
- Simpler (from what I understood):
⁃ In SVM, we want to draw a line (or a plane) that separates different classes of
data.
⁃ We want this line to have as much space as possible between the classes,
which we call the margin.
⁃ For one class, we want the line to be at least 1 unit away. For the other class,
we want it to be at least 1 unit away but in the opposite direction.

⁃ Mathematically, we find the best line by adjusting some values (represented by θ) to make this margin as big as possible while still correctly separating the classes.
⁃ We also want to minimize the changes to these values, so we sum up their
squares to keep them small.
⁃ When we label one class as 1 and the other as -1, it's like saying "be at least
1 unit away on this side for class 1 and on the other side for class -1." This helps in
finding the best line that separates the classes well.
II.3 Linear SVMs: binary classification problem
- Linear Support Vector Machines (SVMs) are used for binary
classification problems, where the goal is to separate a dataset into
two classes using a straight line (or hyperplane in higher
dimensions). The SVM algorithm finds the optimal hyperplane that
maximizes the margin, which is the distance between the
hyperplane and the closest data points from each class. By
maximizing this margin, linear SVMs effectively classify new data
points by determining which side of the hyperplane they belong to.
- The SVM’s objective is to find a hyperplane, which is a decision
boundary that separates the two classes in the feature space with
the largest margin. The margin refers to the distance between the
hyperplane and the closest data points from each class, also called
support vectors. In the image, the solid line represents the decision
boundary, and the dotted lines on either side of the solid line
represent the margin.
- Ideally, a larger margin translates to better generalization on
unseen data. This is because the SVM is less likely to be swayed by
slight variations or noise in the data points.
- Here are some key points to remember about SVMs for binary
classification:
- SVMs find a hyperplane that maximizes the margin between the two
classes.
- The data points closest to the hyperplane are the support vectors
and are critical for defining the decision boundary.
- A larger margin generally leads to better performance on unseen
data

Issue here: where do we draw the boundary between the classes?
When C (the regularization parameter) is not large, the model is not very sensitive to outliers.

II.4 SVM with one feature


- The pic below depicts a scenario where a Support Vector Machine (SVM)
classification model might not perform well.
- The data points show the mass of mice, with red dots representing non-obese
mice and light green dots representing obese mice. The decision boundary, a
line separating the two classes, is established based on the mass. A new
observation, is located above the decision boundary but closer to the cluster
of non-obese mice.
- In this instance, the threshold – the mass value on the decision line – isn’t a
strong predictor for classifying this new observation. The model might classify
the new observation as obese even though it appears closer to the non-obese
mice in mass.
- Here’s why the SVM might struggle here:
- Limited Features: The model is likely only considering mass as a single
feature to distinguish between obese and non-obese mice.
- Data Overlap: There might be inherent overlap in mass between obese and
non-obese mice. Even though the model separates the data points based on
mass, there will always be some ambiguity, especially for observations on the
borderline.
- Typically, for SVM classification to be more effective, more features that are
relevant to class distinction are used. For instance, alongside mass, features
like genetic markers or diet information could improve the model’s ability to
distinguish between the two classes.
Summarized: the picture shows an SVM model that might misclassify a new data point. It uses just weight to classify obese (green) vs. non-obese (red) mice. But weight might not be a perfect measure: the new point is closer to non-obese mice in weight, but the model might predict obese based on the decision line. SVMs work better with more features that clearly distinguish the classes.

- Imagine you have a classification problem, like separating obese (green) from
non-obese (red) mice based on weight. A perfect separator would be a line
that keeps all red on one side and all green on the other.
- But what if the data is messy? Maybe some obese mice have a weight closer
to non-obese mice. A strict line (hard margin) won't work here.
- A soft margin SVM allows for some mistakes. It introduces an imaginary
"buffer zone" around the separation line. Data points can fall within this zone
or even on the wrong side of the line, but they are penalized for being there.
- The more a data point deviates from the correct side, the bigger the
penalty. This helps the SVM find a balance between a good separation line
and allowing for some errors in messy data.
- Think of it as a more forgiving version of the regular SVM, allowing for some
outliers without completely failing.
II.5 Max margin classification- Hard margin SVM
- a Support Vector Machine (SVM) classification model might not perform well
using a hard margin. Here, hard margin refers to a strict separation between
two classes of data points.
- The data points show the mass of mice, with red dots representing non-obese
mice and light green dots representing obese mice. The decision boundary, a
solid line separating the two classes, is established based on the mass. A
new observation,represented by a question mark, is located above the
decision boundary but closer to the cluster of non-obese mice.
- In this instance, the threshold – the mass value on the decision line – isn’t a
strong predictor for classifying this new observation. The model might classify
the new observation as obese even though it appears closer to the non-obese
mice in mass.
- Here’s why the SVM might struggle here:
- Limited Features: The model is likely only considering mass as a single
feature to distinguish between obese and non-obese mice.
- Data Overlap: There might be inherent overlap in mass between obese and
non-obese mice. Even though the model separates the data points based on
mass, there will always be some ambiguity, especially for observations on the
borderline.
- Typically, for SVM classification to be more effective, more features that are
relevant to class distinction are used. For instance, alongside mass, features
like genetic markers or diet information could improve the model’s ability to
distinguish between the two classes.
- Basically hard margin SVM aims to find a hyperplane that perfectly separates
classes without allowing any misclassifications, which works well only when
the data is linearly separable; however, it's sensitive to outliers and doesn't
generalize well. On the other hand, soft margin SVM introduces a margin of
tolerance, allowing for some misclassifications to find a more robust
hyperplane, especially when the data is not perfectly separable. It
incorporates slack variables to handle misclassifications and balances
between maximizing the margin and minimizing errors using a regularization
parameter C
The maximal margin in SVM refers to the largest possible distance between the decision boundary (hyperplane) and the nearest data point of any class. The goal of SVM is to find the hyperplane that maximizes this margin, as it leads to better generalization and robustness of the classifier, especially in cases of new or unseen data.

II.6 Linear SVMs


- SVMs are a type of supervised learning algorithm used for classification
tasks. In binary classification, an SVM attempts to categorize data points into
two distinct classes.
- The SVM’s objective is to find a hyperplane, which is a decision boundary that
separates the two classes in the feature space with the largest margin. The
margin refers to the distance between the hyperplane and the closest data
points from each class, also called support vectors. In the image, the solid line
represents the decision boundary, and the dotted lines on either side of the
solid line represent the margin.
- Ideally, a larger margin translates to better generalization on unseen
data. This is because the SVM is less likely to be swayed by slight variations
or noise in the data points.
- Here are some key points to remember about SVMs for binary classification:
- SVMs find a hyperplane that maximizes the margin between the two classes.
- The data points closest to the hyperplane are the support vectors and are
critical for defining the decision boundary.
- A larger margin generally leads to better performance on unseen data.

- Instead of fitting all the points, we focus on the boundary points


- • Goal : learn a boundary that leads to the largest margin (buffer) from points
on both sides
- Subset of vectors that support (determine boundary) are called the support
vectors
The graph illustrates a two-dimensional feature space, where each data point is represented by a coordinate based on two features. These features could be anything measurable that helps distinguish between the classes. For instance, in image recognition, features might be pixel intensities or color values.

II.7 What is a support vector?


- Support vectors are data points from the training set that lie closest to the
decision boundary (hyperplane) between different classes. In SVMs, these
support vectors are crucial because they determine the decision boundary
and the margin. Additionally, SVM predictions are based only on these
support vectors rather than the entire training set, making SVM
computationally efficient and memory-saving, especially for large datasets.
- an analogy: Imagine a border between two countries. The support vectors
would be like the most important guard posts along the border, strategically
placed to ensure a clear separation

- After training an SVM, a support vector is any instance located on the margin

- The decision boundary is entirely determined by the support vectors.

- SVMs compute the predictions only involves the support vectors, not the
whole training set.

II.8 Linear SVM classifier- Decision Function


- In a linear SVM classifier, to predict the class of a new instance x, you calculate the decision function value θ(x) = θ0 + θ1·x1 + … + θn·xn. If θ(x) is greater than or equal to 1, x is in the positive class; if it's less than or equal to -1, it's in the negative class. The margin, or space between the classes, gets bigger as the weight vector θ gets smaller.

- Imagine the decision function as a seesaw with a pivot point at the bias term
(θ₀). Each feature in the data point (x) contributes its weight (θₙ) to one side of
the seesaw. The decision function calculates the total weight on each side. If
the positive side outweighs the negative side by a certain threshold (θ'(x) ≥
1), the data point is classified as positive.Conversely, if the negative side
outweighs by a threshold (θ'(x) ≤ -1), it's classified as negative. A smaller
weight vector makes the seesaw more sensitive, requiring data points to be
further away from the center (decision boundary) to tip the balance decisively.

II.9 Linear SVMs


- goal of a linear SVM is to find the best hyperplane that separates different
classes in the input space. This hyperplane is determined by a linear
combination of the input features, represented by the equation w^T x + b = 0, where w is the weight vector, x is the input vector, and b is the bias
term. The hyperplane maximizes the margin, which is the distance between
the hyperplane and the nearest data points from each class, known as
support vectors. Linear SVMs aim to find the optimal hyperplane that not only
separates the classes but also maximizes this margin, providing better
generalization and robustness to unseen data.

Inputs between the margins are of


unknown class

Decision boundary – set of points on


the plane where decision function = 0

Advantages:
• Accuracy
• Works well on smaller, cleaner datasets
• Can be more efficient because it uses a subset of training points

Disadvantages:
• Not suited to larger datasets, as the training time with SVMs can be high
• Less effective on noisier datasets with overlapping classes
• Linear SVMs have a linear decision boundary
• Originally designed as a two-class classifier

II.10 Training a Margin-Based classifier


- Training a margin-based classifier, such as Support Vector Machines (SVMs),
involves optimizing the decision function to maximize the margin between
data points and the separating hyperplane. This optimization process aims to
find parameters (𝜃) that correctly classify training examples while maximizing
this margin. It typically involves optimization techniques like gradient descent
or quadratic programming to iteratively refine the parameters until the desired
margin is achieved.

II.11 Hard Margin SVMs


 A hard-margin SVM does not allow any margin violations; it aims to find a
hyperplane that perfectly separates the classes without any misclassification.

 with a hard-margin SVM, points within the margin cannot be classified

 a hard-margin SVM will fail to find a solution when the data is not linearly
separable.

II.12 Hard margin SVMs- the issues

- The image depicts one of the shortcomings of hard margin SVMs. Hard
margin SVMs aim to identify a hyperplane that perfectly separates the data
points belonging to different classes, with the largest margin possible. This
margin is the distance between the hyperplane and the closest data points
from each class, also known as the support vectors [2].
- The left graph in the image shows a scenario where a hard margin SVM can
successfully classify the data. The data points are linearly separable, meaning
a clear dividing line (hyperplane) can be drawn between the two classes. The
hard margin SVM finds this hyperplane and maximizes the margin by placing
it equidistant from the closest data points of each class (support vectors).
- However, the right graph illustrates an issue with hard margin SVMs – they
are sensitive to outliers. In this case, a single outlier data point disrupts the
possibility of finding a perfect separation. The presence of this outlier forces
the hyperplane to be positioned closer to one class, reducing the overall
margin. In severe cases, outliers can entirely prevent a hard margin SVM from
finding a separating hyperplane [2].
- Soft margin SVMs address this issue by allowing for some misclassification
during training. This enables the model to handle outliers and data that is not
perfectly linearly separable [1].
- In short: Hard margin SVMs aim for perfect separation of data, which is great
when it works. But, they struggle with outliers that mess up the clean
separation and reduce the margin between classes. Soft margin SVMs are
more flexible and can handle these outliers by allowing some misclassification
during training.

II.13 Linear SVM- soft margin SVM


- Soft-margin SVMs address scenarios where data points are not perfectly
separable by a hyperplane. They introduce slack variables to tolerate some
misclassifications and margin violations. These variables measure how much
each data point can deviate from the margin. The hyperparameter C controls
the balance between maximizing the margin and minimizing classification
errors: smaller C values prioritize a larger margin with more violations, while
larger C values penalize violations for a smaller margin

- A soft-margin SVM allows for some misclassifications and margin violation


- The objective function relates to the margin size and the classification error.
- It has a hyperparameter C that controls the trade-off between maximizing the margin and minimizing the classification error. A small C allows a larger margin with more margin violations; a large C penalizes margin violations, leading to a smaller margin.
The code (in the figure) trains a linear SVM classifier to distinguish the Iris-Virginica class (represented by 1) from the other two Iris flower classes (represented by 0) using the petal length and petal width features.
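A hedged reconstruction of what such code typically looks like with scikit-learn's LinearSVC (the exact code on the slide may differ, e.g. in the value of C):

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                     # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)    # 1 = Iris-Virginica, 0 = the other classes

svm_clf = Pipeline([
    ("scaler", StandardScaler()),               # scale features before fitting the SVM
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))            # classify a new flower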

The image depicts the effect of the cost parameter (C) on a linear SVM classifier. A higher cost parameter (C=100) enforces the margin more strictly (fewer margin violations) but can be sensitive to outliers, while a lower cost parameter (C=1) allows more margin violations during training, producing a wider margin that can better handle outliers.

- The image shows two scatter plots of the relationship
between petal length and petal width of Iris flowers from two Iris species: Iris-
Versicolor and Iris-Virginica. The text labels C=100 and C=1 correspond to
the cost parameter used in training a Support Vector Machine (SVM)
classifier.
- The goal of a linear SVM classifier is to find a hyperplane that separates the
data points belonging to different classes with the maximum margin. The
margin is the distance between the hyperplane and the closest data points
from each class, also known as the support vectors.
- The left graph (C=100) shows a scenario where a linear SVM can achieve a
good separation between the two Iris species.The data points are almost
linearly separable, meaning a clear dividing line (hyperplane) can be drawn
between the two classes. The SVM finds this hyperplane and maximizes the
margin by placing it equidistant from the closest data points of each class
(support vectors).
- The right graph (C=1) illustrates the impact of the cost parameter (C) on the
SVM classifier. A lower cost parameter allows for more misclassifications
during training. In this case, the cost parameter is set very low (C=1), which
makes the SVM classifier more flexible and allows it to tolerate some overlap
between the two Iris species data points. This increases the model's ability to
handle outliers but can also reduce the margin between the classes.

II.14 Training SVM classifiers


II.15 Linear SVM or Logistic regression

Linear SVM:
⁃ Use Cases: linear SVMs are effective when dealing with high-dimensional data and situations where the classes are well separated.
⁃ Advantages: they work well in high-dimensional spaces, are memory-efficient, and can handle large datasets effectively.
⁃ Strengths: effective in cases where the number of features is larger than the number of samples. They also find the maximum-margin hyperplane, which tends to generalize better to unseen data.
⁃ Weaknesses: less probabilistic interpretation compared to logistic regression. Computationally expensive for very large datasets.
⁃ In short: use when working with high-dimensional data, when classes are well separated, or when you need a model with good generalization performance.

Logistic regression:
⁃ Use Cases: logistic regression is suitable for problems where interpretability of results is important and when probabilistic outputs are needed.
⁃ Advantages: provides probabilistic interpretation of outcomes, making it easier to interpret and explain predictions. It's also computationally less expensive compared to SVMs.
⁃ Strengths: can handle linearly separable as well as non-linearly separable data with appropriate feature engineering. It's particularly useful when feature importance is of interest.
⁃ Weaknesses: less effective in high-dimensional spaces with many features. Prone to overfitting if the number of features is large compared to the number of samples.
⁃ In short: use when interpretability of results is crucial, when you need probabilistic outputs, or when feature importance is a consideration.

- In summary, choose linear SVMs for their generalization performance and effectiveness in high-dimensional spaces, while opting for logistic regression when interpretability and probabilistic interpretation are important.

III. Kernel SVMs

The image shows a scatter plot of a dataset. The data points are arranged in a way that is not linearly separable, meaning a straight line cannot perfectly divide the data into two distinct classes. This is a challenge for linear SVMs, which are designed to find separating hyperplanes (decision boundaries) between classes.

- Thus, we introduce what is know as the ‘kernel trick’. The kernel trick is a
technique used in machine learning, particularly with support vector machines
(SVMs), to handle non-linearly separable data. Instead of explicitly mapping
the data to a higher-dimensional space using basis functions, the kernel trick
allows us to compute the dot product of the mapped data points in the original
space. This is achieved by defining a kernel function that calculates the
similarity between pairs of data points directly in the original input space.

- Kernel SVMs introduce a concept called the "kernel trick." This trick involves transforming the data points from their original feature space into a higher-dimensional space.
- In this higher-dimensional space, the data points might become linearly
separable, allowing the SVM to find a separating hyperplane.

- The kernel function essentially acts as a bridge between the original feature
space and the higher-dimensional space,without explicitly performing the
transformation itself. This keeps the computational cost lower.
In simpler terms, the kernel trick enables us to use a linear model in a high-dimensional space without explicitly calculating the transformed features. This is advantageous because it avoids the computational overhead of mapping the data to higher dimensions. The kernel function computes the similarity between data points, and by replacing the dot product with this kernel function, we can effectively work in a high-dimensional space without explicitly transforming the data.

 The important parameters


• the regularization parameter C,
• the choice of the kernel, and the kernel-specific parameters..

 The RBF kernel has only one parameter, gamma, which is the
inverse of the width of the Gaussian kernel. gamma and C both
control the complexity of the model,

 C and gamma should be adjusted together.

Advantages:
• Allow for complex decision boundaries, even if the data has only a few features.
• Work well on low-dimensional and high-dimensional data (i.e., few and many features).

Disadvantages:
• Do not scale very well with the number of samples. Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
• Require careful preprocessing of the data and tuning of the parameters.
• SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.

Still, it might be worth trying SVMs, particularly if all of your features represent measurements in similar units (e.g., all are pixel intensities) and they are on similar scales.

III.1 Overlapping classifications

There are two desired outcomes: the drug works and


the drug doesn't work. The dosage (mg) is on the horizontal axis.Ideally, we would
like a clear separation between the two classes based on dosage. However, there
might be an overlap region where the outcome is uncertain – the dosage might be
too low or too high, leading to the drug not working.

Soft Margin SVM:


⁃ SVM (Support Vector Machine) is a machine learning algorithm used for
classification tasks. It aims to identify a hyperplane (decision boundary) that
separates the data points belonging to different classes.
⁃ A hard margin SVM seeks a perfect separation between classes, which can
be problematic with real-world data that may not be perfectly separable.
⁃ Soft margin SVMs address this by allowing for some misclassification during
training. This enables the model to handle outliers and data that is not perfectly
linearly separable.
Features:
⁃ In machine learning, features are the characteristics or attributes of the data
points used by the model to make predictions. The image refers to a scenario with
two features, which could represent any measurable qualities relevant to the
classification task.
Threshold:
- The threshold refers to the decision boundary (hyperplane) that the SVM
classifier creates to separate the data points into classes. In the image, the threshold
is depicted as a blue line.
Margins:
⁃ The margins are the spaces between the threshold (hyperplane) and the
closest data points from each class, also known as support vectors. The wider the
margin, the better the separation between the classes and potentially, the better the
generalization of the model to unseen data.
Overall:

The picture shows a simplified illustration of a soft-margin SVM with two features. It highlights the concept of the threshold (decision boundary) separating the classes and the margins that provide a buffer for potential misclassifications.

III.2 SVMs- similarity functions


- Adding features computed using a similarity function is a technique used in
Support Vector Machines (SVMs) to handle nonlinear problems. This involves
measuring how much each instance resembles a specific landmark using the
similarity function. By incorporating these similarity-based features into the
SVM model, it becomes capable of effectively dealing with nonlinearities in
the data, enabling more accurate classification or regression.
III.3 Parameters for Gaussian RBF (radial basis
functions) kernels
- The Gaussian RBF (Radial Basis Function) kernel in Support Vector
Machines (SVMs) has a parameter called gamma.
- Gamma: It defines how far the influence of a single training example reaches,
with low values meaning 'far' and high values meaning 'close'. When gamma
is low, the 'reach' of the influence is far, leading to a smoother decision
boundary. Conversely, when gamma is high, the influence is confined to
nearby points, resulting in a more complex and wiggly decision boundary.

- In summary, tuning gamma is crucial because it balances the model's bias-


variance tradeoff. High gamma values can lead to overfitting, while low values
can lead to underfitting. It's often optimized through techniques like grid
search or cross-validation.
III.4 SVM hyperparameters
In SVMs, the choice of hyperparameters depends on the kernel used:
⁃ Linear SVM:
⁃ C (regularization): Controls the trade-off between maximizing the margin and
minimizing the classification error. Higher values of C allow for fewer margin
violations, potentially leading to overfitting, while lower values allow for more
violations, potentially leading to underfitting.
⁃ Polynomial SVM:
⁃ C (regularization): Same as in linear SVMs.
⁃ d (polynomial degree): Determines the degree of the polynomial kernel.
Higher degrees can capture more complex relationships but may lead to overfitting if
not regularized properly.
⁃ RBF (Radial Basis Function) SVM:
⁃ C (regularization): Similar to linear SVMs, it controls the trade-off between
maximizing the margin and minimizing the classification error.
⁃ gamma (width of kernel): Controls the influence of a single training example,
with low values leading to smoother decision boundaries and high values leading to
more complex decision boundaries that may be prone to overfitting.
These hyperparameters are typically tuned using techniques like grid search or
cross-validation to find the combination that results in the best model performance
on validation data.
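A minimal sketch (assuming scikit-learn) of tuning C and gamma together for an RBF-kernel SVM with grid search and cross-validation, with feature scaling in a pipeline since SVMs are sensitive to feature scales:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),       # careful preprocessing
                 ("svc", SVC(kernel="rbf"))])
param_grid = {"svc__C": [0.1, 1, 10, 100],           # C and gamma adjusted together
              "svc__gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))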

IV. SVM regression


- In summary, kernel machines for regression allow us to fit nonlinear models to
data while considering both smoothness and error, making them versatile
tools for various regression tasks.
- Linear Model: We start with a basic model that's a straight line. It's like
drawing a line through points on a graph.
- Error Handling: Instead of getting upset about every mistake, we allow for a
little bit of error. This helps our model handle noisy data better.
- Slack Variables: We give our model some flexibility by allowing it to make
small mistakes here and there, as long as it doesn't make too many.
- Optimization: We find the best line that fits our data while also considering
the errors and flexibility we've allowed for. It's like finding the best compromise
between fitting the data perfectly and being flexible.
- Support Vectors: These are the points that our model pays the most
attention to. They help define our line and determine how much error we're
willing to accept.
- Kernelization: Sometimes, our data isn't just a straight line; it's more
complicated. So, we use special tricks called kernels to handle this complexity
and find a better fit.
- Alternative Formulation: There's another way to think about our model that's
a bit different but still gets the job done. It's like having different routes to get
to the same destination.
- Overall, we're trying to find the best line (or curve) that fits our data while
allowing for some mistakes and being flexible enough to handle different
patterns in the data.

- The goal is to find a function f(x) that has at most ε deviation from the actually obtained targets y_i for all the training data, and at the same time is as flat as possible.

- In other words,we do not care about errors as long as they are less
than ε, but will not accept any deviation larger than this.
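A short sketch of ε-insensitive SVM regression using scikit-learn's SVR on synthetic data (the values of C and epsilon are illustrative):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # a noisy non-linear target

# epsilon sets the width of the "tube": errors smaller than epsilon are ignored.
svr = SVR(kernel="rbf", C=10, epsilon=0.1)
svr.fit(X, y)
y_pred = svr.predict(X)
print(len(svr.support_))          # number of support vectors defining the fit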

V. Multiclass classification
V.1 Reduction to Binary classification
⁃ Standard Approach: If we have 4 classes, we can treat each class
separately and train a classifier to distinguish it from all the others. This results in 4
binary classifiers.
⁃ One-vs-Rest : Here, we train one classifier for each class, considering it as
the positive class and all other classes as the negative class.
⁃ One-vs-One: We create binary classifiers for each pair of classes. So, for 4
classes, we'd have 6 binary classifiers: 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, and 3 vs
4.
V.2 Prediction with one vs rest
• Class with the highest score

To make a prediction for a new instance: we apply all the trained classifiers to the instance. The class predicted by the classifier with the highest confidence (highest score) is assigned to the instance. If multiple classifiers predict with the same highest confidence, we can use a tie-breaking rule or consider the prediction ambiguous.
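A brief sketch (assuming scikit-learn) of both reductions; for 3 classes, one-vs-rest trains 3 binary classifiers and one-vs-one trains the 3 pairwise classifiers (k(k-1)/2 in general):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)        # 3 classes

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))              # 3 binary classifiers (one per class)

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovo.estimators_))              # 3 pairwise classifiers for 3 classes

# OvR picks the class whose classifier gives the highest score;
# OvO picks the class that wins the most pairwise votes.
print(ovr.predict(X[:1]), ovo.predict(X[:1]))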

V.3 One-class Classification


⁃ One-class classification, also known as outlier or novelty
detection, is particularly useful when we have examples from
only one class (e.g., positive examples) and very few or no
examples from other classes (e.g., negative examples).
⁃ Here's why it's valuable:
⁃ Limited Negative Examples: In many real-world scenarios,
obtaining examples from all possible classes can be
challenging or impractical. One-class classification allows us to
learn from the available positive examples without needing
negative examples.
⁃ Outlier Detection: One-class classification is often used to
identify outliers or novel instances in a dataset. These are data
points that significantly deviate from the norm or are unlike
the majority of the data. By learning a boundary around the
positive examples, the model can effectively identify such
outliers.
⁃ Unbalanced Data: In datasets where one class is significantly
more prevalent than others, traditional classification methods
may struggle due to the class imbalance. One-class
classification avoids this issue by focusing solely on learning
the characteristics of the positive class.
⁃ Anomaly Detection: It's also useful in anomaly detection,
where the goal is to identify rare events or patterns that
deviate from normal behavior. For example, detecting
fraudulent transactions in financial data or identifying unusual
patterns in network traffic.
⁃ Simplicity and Efficiency: Since one-class classification only
requires examples from one class, it simplifies the learning
process and often leads to more efficient models, especially in
scenarios where negative examples are scarce or difficult to
obtain.
⁃ Overall, one-class classification provides a valuable tool for
dealing with situations where traditional classification methods
may not be applicable or effective due to the absence of
negative examples or class imbalances.

V.3 One-class SVMs


⁃ One-class kernel machines, an extension of support vector machines, are
used for a specific type of unsupervised learning called one-class
classification. Rather than trying to estimate a full density function, these
machines aim to find a boundary that separates regions of high density from
regions of low density, which can be useful for outlier detection or novelty
detection.
⁃ Here's how they work:
⁃ Boundary Definition: Imagine we have a sphere with a center and a radius.
We want to enclose as much high-density data as possible within this sphere
while keeping the radius as small as possible.
⁃ Training: We define slack variables for instances that fall outside the sphere
(outliers). We also introduce a smoothness measure that is related to the
radius of the sphere. Then, we train the model to find the smallest radius that
still encapsulates most of the high-density data.
⁃ Mathematical Formulation: We formulate the problem as an optimization
task, where we minimize the radius of the sphere while penalizing instances
that fall outside it (slack variables).
⁃ Dual Formulation: By transforming the problem into its dual form, we can
solve it more efficiently. We maximize a function with respect to Lagrange
multipliers subject to certain constraints.
⁃ Support Vectors: Instances that contribute to the boundary of the sphere
(either lying on the boundary or outside it) are called support vectors. They
play a crucial role in defining the boundary.
⁃ Prediction: To determine if a new instance is an outlier, we check if its
distance from the center of the sphere exceeds the radius. If it does, it's
considered an outlier.
⁃ Kernel Trick: We can use kernel functions to define boundaries of arbitrary
shapes instead of just spheres. This allows us to capture more complex
relationships in the data.
⁃ Alternative Formulations: There are alternative formulations of one-class
kernel machines, such as the ν-SVM type, which offer different perspectives
on the problem.
⁃ In summary, one-class kernel machines provide a way to detect outliers or
novel instances by defining boundaries that encapsulate regions of high
density in the data. They are useful in situations where we have limited
information about the data distribution and want to identify unusual patterns.
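In symbols, the enclosing-sphere view described above is usually written as the following optimization problem (standard support-vector-data-description notation, assumed here rather than copied from the slides):

$$\min_{R,\,a,\,\xi}\; R^2 + C\sum_i \xi_i \quad \text{s.t.}\quad \|x_i - a\|^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0,$$

where $a$ is the center of the sphere, $R$ its radius, $\xi_i$ the slack variables for instances falling outside it, and $C$ the penalty parameter that trades off radius against slack.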
How OC-SVM works:
⁃ Positive Examples: OC-SVM is trained using only positive
examples, assuming that these examples represent the target class
or normal behavior.
⁃ Max-Margin Hyperplane: The goal of OC-SVM is to find the
hyperplane that maximizes the margin between the positive
examples and the origin (representing the absence of the target
class or outliers). This hyperplane serves as the decision boundary.
⁃ Separation: By maximizing the margin between the positive
examples and the origin, OC-SVM effectively separates the normal
instances from potential outliers. Instances lying on the positive side
of the hyperplane are considered normal, while those on the
negative side are classified as outliers.
⁃ Robustness: OC-SVM is robust to outliers because it focuses
solely on learning the characteristics of the positive class and
ignores the negative class. This makes it particularly useful for
outlier detection tasks where negative examples are either
unavailable or not representative.
⁃ Overall, OC-SVM provides a powerful approach for detecting
outliers or anomalies in datasets, especially when only positive
examples are available for training. It leverages the principles of
SVMs to learn a decision boundary that separates normal instances
from potential outliers, making it a valuable tool in various
applications such as fraud detection, intrusion detection, and quality
control.
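A hedged usage sketch with scikit-learn's OneClassSVM (synthetic data; the nu and gamma values are arbitrary assumptions):

```python
# Outlier detection with a one-class SVM: train on "normal" points only,
# then flag test points that fall outside the learned boundary.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)               # positive-class (normal) examples only
X_test = np.vstack([0.3 * rng.randn(20, 2),      # more normal points
                    rng.uniform(-4, 4, (5, 2))]) # a few obvious outliers

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)
pred = oc_svm.predict(X_test)   # +1 = inlier (normal), -1 = outlier
print(pred)
```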
LECTURE 8- NAIVE BAYES AND DECISION TREES
I. Brief intro to probabilities
I.1 Basic notations
- Random variable
• Refers to an element/event whose status is unknown, e.g. X = rain tomorrow
• x = “it will rain tomorrow” is a possible outcome
- Domain
• The set of values a random variable can take:
• A = the stock market movement (up or down): binary
• A = {1, 2, 3, 4, 5, 6} for rolling a fair six-sided die; the domain consists of the possible outcomes
• A = the weight of a randomly selected person from a population: continuous
I.2 Probability distributions
I.3 Priors
- Degree of belief in an event in the absence of any other information
• E.g.
• P(rain tomorrow) = 0.8
• P(no rain tomorrow) = 0.2
I.4 Conditional probability and Joint probability
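The formulas for this part are not preserved in the notes; the standard definitions are:

$$P(A, B) = P(A \cap B), \qquad P(A \mid B) = \frac{P(A, B)}{P(B)} \;\; (P(B) > 0), \qquad P(A, B) = P(A \mid B)\,P(B).$$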
I.5 Chain rule
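The standard form (the original slide content is not preserved in the notes):

$$P(X_1, X_2, \dots, X_n) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2)\cdots P(X_n \mid X_1, \dots, X_{n-1})$$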
I.6 Bayes rule
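The standard form (the original slide content is not preserved in the notes):

$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)}, \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$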
I.7 Classification problem
II. Naïve Bayes classifiers
- Naïve Bayes Classifier:
 Supervised machine learning algorithm.
 Primarily used for classification tasks, like text classification.
 Utilizes principles of probability for classification.
- Generative Learning Algorithm:
 Part of a family of generative learning algorithms.
 Seeks to model the distribution of inputs for a given class or category.
- Key Characteristics:
 Assumes independence among features (hence "naïve").
 Doesn't learn which features are most important to differentiate
between classes.
- Advantages:
 Simple and efficient, especially for large datasets.
 Can perform well in practice, even with the naive independence
assumption.
- Comparison with Discriminative Classifiers:
- Unlike discriminative classifiers (e.g., logistic regression), it doesn't
directly learn the decision boundary between classes.
- Focuses on modeling the joint distribution of features and class labels.
- Efficiency and Generalization:
- Naive Bayes classifiers are fast to train due to their
simple parameter estimation process.
- However, they may not perform as well in terms of
generalization compared to linear classifiers like Logistic
Regression and LinearSVC.
- Types of Naive Bayes Classifiers:
- GaussianNB: Suitable for continuous data.
- BernoulliNB: Assumes binary data.
- MultinomialNB: Assumes count data, commonly used in
text classification.
- Prediction:
- Predictions are made by comparing a data point to the
statistics for each class, selecting the best matching
class.
- Parameterization:
- MultinomialNB and BernoulliNB have a single parameter,
alpha, controlling model complexity.
- Alpha adds virtual data points to the dataset, resulting in
smoothing of statistics.
- Performance is relatively robust to alpha settings, but
tuning it may improve accuracy.
- Usage and Performance:
- GaussianNB is suitable for very high-dimensional data.
- MultinomialNB and BernoulliNB are commonly used for
sparse count data like text.
- MultinomialNB often outperforms BernoulliNB on datasets
with many nonzero features.
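A minimal sketch of the three scikit-learn variants (the datasets and the toy count matrix below are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# GaussianNB on continuous features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
print("GaussianNB test accuracy:", gnb.score(X_test, y_test))

# Bernoulli / Multinomial NB on a toy document-term matrix (counts are made up)
X_counts = np.array([[2, 0, 1, 0],
                     [0, 3, 0, 1],
                     [1, 0, 2, 0],
                     [0, 1, 0, 2]])
y_text = np.array([0, 1, 0, 1])
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y_text)                   # count data
bnb = BernoulliNB(alpha=1.0).fit((X_counts > 0).astype(int), y_text)   # presence/absence
print(mnb.predict([[1, 0, 1, 0]]), bnb.predict([[1, 0, 1, 0]]))
```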
II.1 Naïve Bayes classifiers for continuous values
⁃ Continuous Data in Naïve Bayes:
 While Naïve Bayes traditionally assumes discrete data, real-world data
often contains continuous features like height, weight, etc.
 Continuous features are those that can take any real value within a
certain range.
⁃ Gaussian Naïve Bayes:
 For continuous data, a Gaussian model is commonly employed.
 Assumes that each class is associated with a Gaussian (normal)
distribution of the continuous features.
⁃ Specifically, it assumes that the observed input vector X is generated from a
normal distribution.
⁃ Characteristics:
 Assumes that the continuous features in each class are independently
and identically distributed according to a Gaussian distribution.
 Parameters of the Gaussian distribution (mean and variance) are
estimated from the training data.
⁃ Applications:
 Suitable for tasks involving continuous features such as height, weight,
levels of genes in cells, brain activity, etc.
 Widely used in various fields including healthcare, finance, and natural
language processing where continuous data is prevalent.
⁃ MultinomialNB and GaussianNB:
 Compute different statistics for each class.
 MultinomialNB considers the average value of each
feature for each class.
 GaussianNB stores the average value and standard
deviation of each feature for each class.
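In formula form, the Gaussian assumption means the class-conditional likelihood of feature $x_j$ for class $c$ is (standard form; the subscript notation is ours):

$$p(x_j \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{jc}^2}}\exp\!\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right),$$

with the per-class mean $\mu_{jc}$ and variance $\sigma_{jc}^2$ estimated from the training data.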
II.2 Bernoulli Naïve Bayes
Bernoulli Naïve Bayes is another variant of the Naïve Bayes algorithm, specifically
designed for binary feature vectors. Here's a closer look:
⁃ Bernoulli Naïve Bayes:
 Variant of Naïve Bayes classifier.
 Suited for binary feature vectors, where features represent presence or
absence of a particular attribute.
 Often used in text classification tasks, such as spam filtering, sentiment
analysis, etc., where features may represent the presence or absence
of certain words in a document.
 Counts the occurrences of nonzero features for each
class.
 Calculation of feature counts for each class.
⁃ Probability Model:
 Assumes that features are binary-valued (0 or 1).
 Estimates the probability of each feature being 1 (present) or 0
(absent) given each class.
 Computes class conditional probabilities using the Bernoulli
distribution.
⁃ Assumptions:
 Assumes features are conditionally independent given the class, just
like the standard Naïve Bayes algorithm.
 This means that the presence or absence of one feature does not
provide any information about the presence or absence of any other
feature, given the class label.
⁃ Parameter Estimation:
 Parameters (probabilities) are estimated from the training data using
maximum likelihood estimation (MLE) or smoothed estimates like
Laplace smoothing to handle unseen features.
⁃ Applications:
 Commonly used in text classification tasks, especially when dealing
with binary feature representations, like bag-of-words or binary term
frequency-inverse document frequency (TF-IDF) vectors.
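A hedged text-classification sketch with binary bag-of-words features (the tiny corpus and labels are made-up assumptions):

```python
# Bernoulli NB on presence/absence word features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

docs = ["win money now", "meeting at noon", "cheap money win", "project meeting notes"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# binary=True keeps only presence/absence of each word, matching the Bernoulli model
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB(alpha=1.0))
model.fit(docs, labels)
print(model.predict(["free money"]))   # likely classified as spam on this toy corpus
```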
III. Decision Trees
⁃ Internal nodes correspond to attributes (features)
⁃ Leaves correspond to the classification outcome (output)
⁃ Branching is determined by the best attribute value
III.1 Building a decision tree
⁃ Learning a decision tree involves finding the sequence of if/else questions that
lead to the correct answer most efficiently.
⁃ In the context of machine learning, these questions are referred to as "tests"
and are used to partition the dataset.
⁃ Tests on continuous data typically involve comparing feature values to a
threshold, like "Is feature i larger than value a?"
⁃ The algorithm searches through all possible tests to find the one that best
separates the classes.
⁃ The first test (split) is chosen based on the feature that provides the most
information about the target variable.
⁃ Subsequent tests are chosen by recursively searching for the best split in
each region of the dataset.
⁃ This process creates a binary tree of decisions, where each node represents
a test.
⁃ The recursive partitioning continues until each leaf of the tree contains only
one target value (pure leaf).
⁃ Predictions for new data points are made by traversing the tree from the root
to a leaf and predicting the majority target in that region.
⁃ Decision trees can also be used for regression tasks, where predictions are
made by finding the leaf corresponding to the new data point and outputting the
mean target value of the training points in that leaf.
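A minimal sketch of fitting and inspecting a decision tree classifier (dataset and parameters are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))

# The learned sequence of if/else tests can be printed as text
print(export_text(tree, feature_names=load_iris().feature_names))
```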
III.2 Controlling complexity of decision trees
⁃ Overfitting is a common issue in decision trees, where the model becomes
highly complex and memorizes the training data.
⁃ Pure leaves in the tree indicate 100% accuracy on the training set but may
lead to poor generalization.
⁃ Strategies to prevent overfitting include pre-pruning and post-pruning.
⁃ Pre-pruning involves stopping the tree's growth before it perfectly fits the
training data.
⁃ Common criteria for pre-pruning include limiting the maximum depth of the
tree, the maximum number of leaves, or requiring a minimum number of points in a
node to continue splitting.
⁃ Scikit-learn implements pre-pruning in decision trees but not post-pruning.
⁃ Pre-pruning can be evaluated by examining the accuracy on both the training
and test sets.
⁃ Limiting the maximum depth of the tree, such as setting max_depth=4, can
decrease overfitting and improve generalization performance.
⁃ While limiting the depth may reduce training set accuracy, it often leads to
better performance on unseen data, as demonstrated by the improvement in test set
accuracy.
III.3 Strengths and weaknesses

Strengths:
⁃ Decision trees can be easily visualized and understood, particularly for smaller trees, making them interpretable by non-experts.
⁃ They are invariant to the scaling of the data: each feature is processed separately and splits don't depend on scaling, so no preprocessing such as normalization or standardization of features is needed.
⁃ Decision trees can handle features that are on completely different scales, or a mix of binary and continuous features, effectively.

Weaknesses:
⁃ Even with pre-pruning, decision trees tend to overfit the training data, leading to poor generalization performance on unseen data.
⁃ They may not perform as well as ensemble methods, such as random forests or gradient boosting, in many applications where improved generalization performance is crucial.
III.4 Model complexity
⁃ The complexity of the model induced by a decision tree is determined by the
depth of the tree
⁃ Increasing the depth of the tree increases the number of decision boundaries
and may lead to overfitting
⁃ Pre-pruning and post-pruning
⁃ Limit tree size (pick one):
• max_depth
• max_leaf_nodes
• min_samples_split
Pre-pruning, a common strategy to prevent overfitting in decision
trees, involves stopping the growth of the tree before it perfectly fits
the training data. This is crucial because allowing the tree to grow
until all leaves are pure can lead to highly complex models that
overfit the training data.
There are two primary pre-pruning strategies:
⁃ Limiting tree depth: By setting a maximum depth for the
tree (max_depth), we restrict the number of consecutive questions
that can be asked during tree construction. This prevents the tree
from becoming arbitrarily deep and complex. For example, setting
max_depth=4 allows only four consecutive questions to be asked
during tree construction.
⁃ Limiting the number of leaves: Another approach is to limit
the maximum number of leaves in the tree (max_leaf_nodes). This
strategy prevents the tree from growing too many branches,
thereby controlling its complexity.
Implementing pre-pruning in scikit-learn's DecisionTreeClassifier
involves specifying these parameters during model instantiation. For
instance, setting max_depth=4 in the DecisionTreeClassifier
constructor limits the depth of the tree to four levels.
⁃ Let's look at the effect of pre-pruning on the Breast Cancer
dataset. Initially, without pre-pruning, the decision tree achieves
perfect accuracy (100%) on the training set but may not generalize
well to unseen data, as indicated by the slightly lower accuracy on
the test set (93.7%). However, by applying pre-pruning with
max_depth=4, we observe a decrease in training set accuracy
(98.8%), but an improvement in test set accuracy (95.1%). This
demonstrates how pre-pruning can help prevent overfitting and
improve the generalization performance of the decision tree model.
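A hedged sketch of the pre-pruning comparison described above (it mirrors the Breast Cancer example; the exact accuracies depend on the split and library version):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Unpruned tree: grown until all leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Pre-pruned tree: at most four consecutive questions
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print("full tree   - train: %.3f  test: %.3f"
      % (full_tree.score(X_train, y_train), full_tree.score(X_test, y_test)))
print("max_depth=4 - train: %.3f  test: %.3f"
      % (pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test)))
```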
IV. Decision tree regression
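The worked example originally shown for this section is not preserved; a minimal regression-tree sketch on synthetic data (all values are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)   # noisy sine wave

# A regression tree predicts the mean target of the training points in each leaf,
# so its predictions form a piecewise-constant function of the input.
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[1.0], [2.5], [4.0]]))
```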
LECTURE 9- Ensemble Methods
I. Bias and Variance
II. Ensemble models
Ensemble models are a powerful technique in machine learning where multiple models are combined to improve predictive performance. Instead of relying on a single model's predictions, ensemble methods use a group of models to make more accurate predictions. Here are some common types of ensemble methods:
⁃ Bagging (Bootstrap Aggregating): In bagging, multiple instances of the
same base model are trained on different subsets of the training data, typically
sampled with replacement (bootstrap sampling). Then, predictions are combined by
averaging (for regression) or voting (for classification).
⁃ Random Forest: Random Forest is an extension of bagging that specifically
uses decision trees as the base models. It introduces randomness in the training
process by selecting a random subset of features at each split, which helps to
decorrelate the trees and reduce overfitting.
⁃ Boosting: Boosting algorithms like AdaBoost, Gradient Boosting, and
XGBoost sequentially train weak learners, where each subsequent learner focuses
on the mistakes made by the previous ones. Boosting gives more weight to
misclassified instances, forcing subsequent models to concentrate on them.
⁃ Stacking (Stacked Generalization): Stacking combines the predictions of
multiple models by training a meta-model on their outputs. Instead of simple
averaging or voting, stacking learns how to best combine the base models'
predictions.
⁃ Voting: In voting ensembles, multiple models make predictions on the same
dataset, and the final prediction is made based on the collective "vote" of all models.
This can be done through majority voting for classification tasks or averaging for
regression tasks.
⁃ Bayesian Model Averaging: This approach considers model uncertainty by
weighting each model's prediction by its estimated posterior probability, given the
observed data.
⁃ Ensemble methods are widely used in practice because they often improve
predictive performance, reduce overfitting, and increase model robustness.
However, they can be computationally expensive and require careful tuning of
hyperparameters.
Simple models are used as building blocks for designing more complex models by combining several of them.

• How should we combine these models?
⁃ Voting
⁃ Bagging: train many models on bootstrapped data, then take the average (e.g. Random Forests) → less variance
⁃ Boosting: given a weak model, run it multiple times on (reweighted) training data, then let the learned classifiers vote (e.g. Gradient Boosting) → less bias
⁃ Stacking: train many models in parallel and combine them by training a meta-model to output a prediction based on the different weak models' predictions → less bias
II.1 Voting
⁃ Build different models
• Classifiers that are most “sure” will vote with more conviction
• Classifiers will be most “sure” about a particular part of the space
• Average the result
⁃ More models are better – if they are not correlated.
⁃ Also works with neural networks
⁃ You can average any models as long as they provide calibrated (“good”) probabilities.
⁃ Scikit-learn: VotingClassifier
The passage delves into the concept of voting as a method to combine the outputs
of multiple classifiers in ensemble learning. Here are the key points discussed:
⁃ Combination via Voting:
⁃ Voting involves taking a linear combination of the outputs of different
classifiers
⁃ Combination Rules:
- There are various combination rules besides simple weighted averaging, such
as median, minimum, maximum, and product rules. Each rule has its characteristics,
like robustness to outliers or pessimistic/optimistic behavior.
⁃ Normalization:
- Combination rules often require the outputs of classifiers to be normalized to
the same scale before aggregation.
⁃ Voting in Classification:
- In classification, voting involves choosing the class with the maximum number
of votes (plurality voting) or more than half of the votes (majority voting).
- Weighted voting schemes can be used if classifiers provide additional
information such as posterior probabilities.
⁃ Voting in Regression:
- In regression, voting typically involves averaging or taking the median of the
outputs of base regressors. Median is more robust to noise than the average.
⁃ Determining Weights:
• can be determined based on the accuracies of classifiers on a
separate validation set or learned from the data.
- Bayesian Interpretation:
Voting schemes can be seen as approximations under a Bayesian framework,
with weights representing prior model probabilities and model decisions
approximating model-conditional likelihoods.
- Voting for Error Reduction:
Voting reduces variance and error by averaging over multiple noisy models,
assuming that the noise functions added by individual models are uncorrelated with
zero mean.
This smoothing effect in the functional space acts as a regularizer, decreasing
variance while potentially offsetting bias introduced by individual models.
In essence, voting provides a simple yet effective way to combine the outputs of
multiple classifiers to improve predictive performance in both classification and
regression tasks.
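A hedged sketch of soft voting with scikit-learn's VotingClassifier (the base models and dataset are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=5000)),
                ("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft")   # average predicted probabilities; voting="hard" = majority vote
vote.fit(X_train, y_train)
print("voting ensemble test accuracy:", vote.score(X_test, y_test))
```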
III. Bagging (Bootstrap aggregation)
⁃ Generic way to build “slightly different” models
 Draw bootstrap samples from the dataset (as many as there are in the dataset, with repetition)
 Implemented in BaggingClassifier, BaggingRegressor
⁃ Fits several independent models and “averages” their predictions in order to obtain a model with a lower variance.
 Fitting fully independent models requires too much data
⁃ Creates multiple bootstrap samples
• Each new bootstrap sample will act as another independent dataset
• Fit weak learners for each sample
• Aggregate them (average the output)
⁃ Regression: simple average
⁃ Classification problem:
• simple majority vote (hard voting)
• highest average probability (soft voting)

Bagging, which stands for bootstrap aggregating, is a technique used in ensemble learning. Here's a breakdown of the key points:
⁃ Bagging Overview:
⁃ Bagging is a voting method used in ensemble learning.
⁃ It involves training multiple base learners (often decision trees) on slightly
different training sets generated through bootstrapping.
⁃ Bootstrapping involves randomly sampling instances with replacement from
the original training set to create multiple similar but slightly different samples.
⁃ Each base learner is trained on one of these bootstrapped samples.
⁃ Unstable Algorithms:
⁃ An algorithm is considered unstable if small changes in the training set lead to
large differences in the learned model, indicating high variance.
⁃ Unstable algorithms include decision trees and multi-layer perceptrons.
⁃ Bagging Process:
⁃ Bagging addresses the instability of base learners by using bootstrapping to
generate diverse training sets.
⁃ It averages the predictions of multiple base learners during testing, reducing
variance and improving generalization performance.
⁃ Regression and Classification:
⁃ Bagging can be applied to both regression and classification tasks.
⁃ In regression, taking the median instead of the average when combining
predictions can enhance robustness.
⁃ Stability and Correlation:
⁃ Stability of an algorithm refers to the degree of correlation between different
runs of the algorithm on resampled versions of the same dataset.
⁃ Bagging aims to reduce correlation between base learners by generating
diverse training sets through bootstrapping.
⁃ Bootstrap Size:
⁃ If the original training set is large, generating smaller bootstrapped sets (with
size N' < N) can help create more diverse base learners, as excessively similar
bootstrapped sets may lead to highly correlated models.
Overall, bagging is an effective technique for reducing variance and improving the
robustness of unstable learning algorithms, making it a valuable tool in machine
learning.
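A minimal bagging sketch with BaggingClassifier (the parameter values are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single tree :", single.score(X_test, y_test))
print("bagged trees:", bag.score(X_test, y_test))
```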
III.1 Bootstrapping
⁃ Generating samples of size B (called bootstrap samples) from an initial
dataset of size N by randomly drawing with replacement B observations.
Bootstrapping is a powerful resampling technique used in statistics and machine
learning to estimate the uncertainty of a model's performance or parameters. Unlike
traditional cross-validation methods that partition the dataset into training and
validation sets, bootstrapping involves generating multiple samples from a single
dataset by drawing instances with replacement.
In the context of machine learning, bootstrapping is commonly used in bagging, a
technique where multiple models are trained on different bootstrap samples of the
original dataset and their predictions are combined to improve robustness and
accuracy.
Here's how bootstrapping works:
⁃ Sampling with replacement: To create a bootstrap sample, N instances are
randomly drawn from the original dataset of size N, allowing for the possibility
of selecting the same instance multiple times (with replacement).
⁃ Overlap and dependency: Bootstrap samples may overlap more than
samples generated by cross-validation methods, resulting in more
dependence between the samples. This can affect the estimates obtained
from bootstrapping.
⁃ Estimation of error: The probability of picking an instance from the original
dataset in a single draw is 1/N, and the probability of not picking it is 1 - 1/N.
After N draws, the probability of not picking a particular instance is
approximately e^(-1), which is about 0.368. Consequently, approximately
63.2% of the instances are included in the bootstrap sample, while the
remaining 36.8% are not used for training.
⁃ Replication for robust estimates: To address the pessimistic error
estimates caused by not training on a portion of the data, bootstrapping
involves replicating the process multiple times (generating multiple bootstrap
samples) and averaging the results. By repeating the process, the average
behavior of the model's performance can be assessed more accurately.
⁃ Overall, bootstrapping is particularly useful for small datasets where traditional cross-validation methods may not be feasible. It provides a reliable way to estimate uncertainty and assess the stability of a model's performance.
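A quick numerical check of the ≈63.2% figure mentioned above (illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
N = 10_000
sample = rng.randint(0, N, size=N)           # bootstrap sample: N draws with replacement
unique_fraction = len(np.unique(sample)) / N
print(unique_fraction)                       # close to 1 - e^(-1) ≈ 0.632
print(1 - np.exp(-1))
```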
III.2 Boosting
⁃ Fits multiple weak learners sequentially and adaptively
⁃ Each model in the sequence is fitted giving more importance to observations in the dataset that were badly handled by the previous models in the sequence.
⁃ Bagging: aims to reduce variance
⁃ Boosting: aims to reduce bias

Boosting is a machine learning ensemble technique that aims to improve the
performance of weak learners by combining them into a strong learner. Unlike
bagging, where base learners are generated independently, boosting actively tries to
create complementary base learners by focusing on the mistakes of the previous
learners. The original boosting algorithm, proposed by Schapire in 1990, combines
multiple weak learners to produce a strong learner with arbitrarily small error
probability.
Here's how boosting works:
⁃ Training process:
⁃ The training set is divided into multiple subsets.
⁃ A base learner (classifier or regressor) is trained on each subset sequentially.
⁃ Each subsequent learner focuses on the instances that were misclassified by
the previous learners, along with a portion of correctly classified instances to
maintain diversity.
⁃ The process continues until a predefined stopping criterion is met or until a
specified number of base learners are trained.
⁃ Testing process:
⁃ During testing, each base learner produces its prediction for a given instance.
⁃ The final prediction is made by combining the predictions of all base learners,
typically using a weighted voting scheme where the weights are proportional to the
accuracy of each base learner.
⁃ AdaBoost algorithm:
⁃ AdaBoost, short for Adaptive Boosting, is one of the most popular boosting
algorithms.
⁃ It modifies the probabilities of drawing instances during training based on the
error rates of the previous base learners.
⁃ Instances that are misclassified by the previous base learners are given
higher probabilities of being selected for training the next base learner.
⁃ AdaBoost iteratively trains a sequence of weak learners, each focusing on
correcting the errors made by the previous ones.
⁃ Generalization and performance:
⁃ The success of boosting relies on its ability to increase the margin between
different classes, similar to support vector machines.
⁃ Boosting is effective when there is sufficient training data, and the base
learners are weak but not too weak. It is susceptible to noise and outliers in the data.
⁃ AdaBoost has been generalized to regression tasks, where it modifies the
training process to handle continuous target variables.
Overall, boosting is a powerful technique for improving the accuracy of machine
learning models, especially when combined with weak learners and used on
datasets with sufficient training samples.
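A hedged AdaBoost sketch using shallow trees (“stumps”) as weak learners (all parameter values are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each stump focuses on the instances misclassified (reweighted) by its predecessors
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```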
III.3 Stacking
 Bagging and Boosting mainly consider homogeneous weak learners.
 Stacking
- learns several different (heterogeneous) weak learners
- combines the base models by training a meta-model to output predictions based on the multiple predictions returned by the weak models
 Classification problem example
• Choose some weak learners: a KNN classifier, logistic regression, and an SVM
• Choose a neural network as the meta-model
• Output of the 3 weak learners = input to the neural network
• Output of the neural network = final prediction
- Fitting a stacking ensemble
- Steps:
o Split the training data in two folds
o Choose L weak learners and fit them to data of the first fold
o For each of the L weak learners, make predictions for observations in
the second fold
o Fit the meta-model on the second fold, using predictions made by the
weak learners as inputs
- Limitation: Only half of the data to train the base models and half of the data
to train the meta-model.
- Solution: “k-fold cross-training” approach (similar to what is done in k-fold
cross-validation) such that all the observations can be used to train the meta-
model.
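A hedged sketch of the stacking example above (KNN + logistic regression + SVM combined by a neural-network meta-model; all parameter values are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("lr", LogisticRegression(max_iter=5000)),
                ("svm", SVC())],
    final_estimator=MLPClassifier(max_iter=2000, random_state=0),
    cv=5)   # out-of-fold predictions of the base models train the meta-model
stack.fit(X_train, y_train)
print("stacking test accuracy:", stack.score(X_test, y_test))
```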
IV. Random Forests
Random Forest is a popular ensemble learning method primarily used for classification and regression tasks. It's an extension of
bagging that builds a multitude of decision trees during training and
merges their predictions to improve accuracy and reduce overfitting.
Here's how Random Forest works:
⁃ Bootstrapping: Random samples (with replacement) are
drawn from the training data. These samples are used to train
individual decision trees.
⁃ Random Feature Selection: At each node of the decision
tree, only a random subset of features is considered for splitting.
This randomness helps decorrelate the trees, making the ensemble
more robust and less prone to overfitting.
⁃ Growing Decision Trees: Each decision tree is grown to its
maximum depth without pruning. This means the trees are allowed
to become quite complex, capturing intricate patterns in the data.
⁃ Voting or Averaging: For classification tasks, the predictions
from each tree are aggregated through majority voting. For
regression tasks, the predictions are typically averaged.
Random Forests offer several advantages:
⁃ Robustness to Overfitting: By combining multiple decision
trees and introducing randomness, Random Forests are less prone
to overfitting compared to individual decision trees.
⁃ Good Performance: Random Forests often provide
competitive performance compared to other algorithms across a
wide range of datasets.
⁃ Feature Importance: They offer a natural way to estimate
feature importance, which can be helpful for feature selection and
understanding the underlying data.
However, they also have some limitations:
- Interpretability: While Random Forests provide insights into
feature importance, they are not as interpretable as simpler models
like linear regression.
- Computationally Expensive: Training multiple decision trees
and aggregating their predictions can be computationally expensive,
especially for large datasets.
- Overall, Random Forests are a versatile and powerful tool in
the machine learning toolbox, widely used for both classification and
regression tasks.
⁃ Strong learners composed of multiple trees can be called “forests”.
⁃ Trees can be shallow (few levels) or deep (many levels, if not fully grown).
⁃ Shallow trees have less variance but higher bias; they are a better choice for sequential methods (boosting).
⁃ Deep trees have low bias but high variance; they are a better choice for bagging, which is mainly focused on reducing variance.
⁃ The random forest approach is a bagging method where deep trees, fitted on bootstrap samples, are combined to produce an output with lower variance.
⁃ It introduces additional randomness in feature selection.
IV.1 Classification and regression with random forests
Random Forests are versatile and can be used for both classification and regression
tasks.
Classification with Random Forests:
In classification tasks, Random Forests predict the class label of the input data.
Here's how the process typically works:
⁃ Training Phase:
⁃ Randomly select subsets of the training data (with replacement) to train
multiple decision trees.
⁃ At each node of the decision tree, a random subset of features is considered
for splitting.
⁃ Grow each tree to its maximum depth or until a stopping criterion is met.
⁃ Repeat this process to create an ensemble of decision trees.
⁃ Prediction Phase:
⁃ For a new input instance, the Random Forest aggregates the predictions of all
decision trees.
⁃ In classification, this is usually done by majority voting: the class predicted by
the majority of trees is selected as the final prediction.
Regression with Random Forests:
In regression tasks, Random Forests predict a continuous output value based on the
input features. Here's how it's done:
⁃ Training Phase:
- Similar to classification, Random Forests build an ensemble of decision trees
using bootstrapped samples and random feature selection.
- Each decision tree predicts a continuous output value.
⁃ Prediction Phase:
- For a new input instance, the Random Forest aggregates the predictions of all
decision trees.
- In regression, this is typically done by averaging the output values predicted
by all trees.
Advantages of Using Random Forests for Classification and Regression:
- Accuracy: Random Forests often provide high accuracy in both classification
and regression tasks, thanks to their ability to capture complex patterns in the data.
- Robustness: They are less prone to overfitting compared to individual
decision trees, making them suitable for a wide range of datasets.
- Feature Importance: Random Forests offer a natural way to estimate feature
importance, which can be useful for understanding the underlying relationships in the
data.
- Overall, Random Forests are a powerful and versatile technique that can be
applied to various machine learning tasks, including classification and regression,
with impressive results.
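A minimal random forest sketch for classification and regression (the datasets and parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: trees vote (scikit-learn averages the per-tree class probabilities)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("classification test accuracy:", rf.score(X_test, y_test))
print("largest feature importance:", rf.feature_importances_.max())

# Regression: predictions are averaged over the trees
Xr, yr = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
rfr = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print("regression R^2 on training data:", rfr.score(Xr, yr))
```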
Lecture 10- Model evaluation; Learning with imbalanced data
I. Generalization performance
- A model should always be evaluated on independent test data.
• A model's performance on unseen data will give us the generalization performance of the model.
- Focus on supervised learning methods (evaluation of unsupervised methods is more qualitative)
Refers to how well an agent can extend its learned knowledge to new, unseen
situations. Here's a closer look at the factors influencing generalization performance:
- Quality of Function Approximation: The choice of function approximation
method (such as neural networks, decision trees, or radial basis functions)
greatly influences generalization performance. A good approximation method
should capture the underlying structure of the problem space, allowing for
accurate estimation of Q-values or state values across different states and
actions.
- Representation of States and Actions: The representation of states and
actions plays a crucial role in generalization. Effective feature representation
can highlight similarities between different states and actions, enabling the
agent to generalize its knowledge more effectively. Proper feature engineering
or representation learning techniques can enhance generalization
performance.
- Training Data: The quality and diversity of the training data also impact
generalization. A diverse training set that covers a wide range of states and
actions allows the agent to learn robust policies that generalize well to unseen
situations. Insufficient or biased training data may lead to poor generalization
performance.
- Regularization: Regularization techniques, such as weight decay or dropout
in neural networks, can help prevent overfitting and improve generalization
performance. By discouraging overly complex models, regularization
techniques promote simpler models that generalize better to unseen data.
- Exploration Strategy: The exploration strategy employed by the agent during
training can influence generalization. Balancing between exploration and
exploitation allows the agent to gather diverse experiences, which can help in
learning more robust policies that generalize well.
- Task Complexity: The complexity of the reinforcement learning task also
affects generalization performance. More complex tasks may require more
sophisticated function approximation methods and richer feature
representations to achieve good generalization.
- Hyperparameter Tuning: Proper tuning of hyperparameters, such as
learning rate, batch size, and network architecture, is essential for achieving
optimal generalization performance. Hyperparameters significantly impact the
training dynamics and the ability of the agent to generalize its learned
knowledge.
- In summary, achieving good generalization performance in reinforcement learning requires a combination of suitable function approximation methods,
effective representation of states and actions, diverse training data,
regularization techniques, appropriate exploration strategies, consideration of
task complexity, and careful hyperparameter tuning. By addressing these
factors, researchers and practitioners can develop reinforcement learning
agents that generalize well to new and unseen situations.
Inductive Bias and Model Selection:
⁃ Inductive bias refers to the set of assumptions made by a learning algorithm
to facilitate the learning process. This bias helps to constrain the solution space and
enables learning from limited data.
⁃ Model selection involves choosing the appropriate hypothesis class H that
best matches the underlying complexity of the data. The goal is to select a model
with an inductive bias that aligns well with the true data-generating process.
⁃ Generalization and Overfitting:
⁃ Generalization refers to the ability of a learned model to perform well on new,
unseen data.
⁃ Overfitting occurs when a model learns to fit the training data too closely,
capturing noise or irrelevant patterns that do not generalize well to new data.
⁃ Underfitting arises when a model is too simplistic to capture the underlying
structure of the data, leading to poor performance even on the training set.
⁃ The Triple Trade-Off:
⁃ In model selection, there's a trade-off among three factors: the complexity of
the model (hypothesis class), the amount of training data available, and the
generalization error on new data.
⁃ Increasing the complexity of the model initially reduces the generalization
error, but beyond a certain point, it leads to overfitting and increased generalization
error.
⁃ Cross-Validation and Test Sets:
⁃ Cross-validation involves partitioning the available data into training and
validation sets. Different hypothesis classes are evaluated based on their
performance on the validation set.
⁃ A separate test set, not used during model selection, is essential for
estimating the model's performance on unseen data accurately.
⁃ Test sets help to prevent overfitting to the validation set and provide a more
unbiased estimate of the model's generalization performance.
⁃ Statistical Considerations:
⁃ The text highlights the importance of considering randomness in the data and
experimental setup when evaluating models.
⁃ Statistical analysis is crucial for drawing reliable conclusions from
experimental results and assessing the significance of observed differences in model
performance.
In summary, effective model selection involves finding the right balance between
model complexity and generalization performance while considering statistical
considerations such as cross-validation and test sets. By carefully navigating these
trade-offs, machine learning practitioners can develop models that generalize well to
new data and make reliable predictions in real-world scenarios.
I.1 Evaluating a training set + test set
- Easiest evaluation process
- Split the data
• a training set to fit the model on
• a test set to evaluate the fitted model ← evaluates generalization performance
- Problems
• How big should the training and test set be?
• How do we know if the test set is exceptionally different from the training data?
• How do we know whether the model is overfitting the data?
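A sketch of this simplest protocol (the split ratio and model are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)   # hold out 25% as the test set

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("generalization estimate (test accuracy):", model.score(X_test, y_test))
```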
I.2 Sources of error: Bias and Variance
- Suppose we train a model on a random sample of the training data multiple times, and then look at the performance on the test set.
 Bias: how far, on average, are the model's predictions from the correct value?
 Variance: how far apart are the model's predictions?

(Figure: every dot is the performance of a model trained on a random sample of the training data.)
Low bias & low variance: we predict correctly and our prediction is not noisy ← sweet spot
Low bias & high variance: we predict correctly on average, but our prediction is noisy ← the model is overfitting the training data
High bias & low variance: we predict incorrectly, but our prediction is not noisy ← the model is underfitting the training data
High bias & high variance: we predict incorrectly and our prediction is noisy ← worst of everything
I.3 Recap from regression - Underfitting & Overfitting
- Underfitting: Our model's complexity is not sufficient to represent the intrinsic pattern of our data. We can recognize underfitting when we have a low performance score on the training set.
- Overfitting: Our model's complexity is extremely high, which makes it incapable of generalizing to unseen data. We can recognize overfitting when we have a high performance score on the training set but a very low performance score on the validation set.
I.4 Sources of error: Bias + variance + irreducible error
- The error of a model can be decomposed into:
 Irreducible error (how noisy is the data itself): What is the variance of the target around its true mean?
 Reducible error
• Bias²: How much does the average of the estimate deviate from the true mean?
• Variance: What is the deviation of the estimates around their mean?
II. Validation and tuning
This section gives an overview of evaluating supervised machine learning models, with a focus on
regression and classification tasks. Let's break down the key points:
⁃ Training and Test Set Split:
⁃ The dataset is typically divided into a training set and a test set using
techniques like train_test_split function from scikit-learn.
⁃ The training set is used to train the model, while the test set is used to
evaluate its performance on unseen data.
⁃ Model Evaluation using score Method:
⁃ After training the model, its performance is evaluated on the test set using the
score method.
⁃ For classification tasks, this often involves computing the fraction of correctly
classified samples.
⁃ The focus is on how well the model generalizes to new, unseen data rather
than its performance on the training set.
⁃ Introduction of Cross-Validation:
⁃ Cross-validation is introduced as a more robust method to assess
generalization performance.
⁃ It involves partitioning the dataset into multiple subsets (folds), training the
model on several combinations of these subsets, and then averaging the results.
⁃ Cross-validation helps to provide a more reliable estimate of a model's
performance by reducing the variance associated with a single train-test split.
⁃ Evaluation Beyond Default Measures:
⁃ The chapter discusses methods to evaluate classification and regression
performance beyond the default measures provided by the score method.
⁃ For example, in addition to accuracy for classification, one might consider
precision, recall, F1-score, etc.
⁃ For regression, metrics like mean absolute error (MAE), mean squared error
(MSE), and R-squared (R2) might be used.
⁃ Grid Search for Parameter Tuning:
⁃ Grid search is introduced as a method for adjusting the parameters of
supervised models to achieve the best generalization performance.
⁃ It involves specifying a grid of hyperparameters and exhaustively searching
through all possible combinations using cross-validation.
⁃ Grid search helps to find the optimal hyperparameters for a given model and
dataset.
II.1 Cross-Validation
- Splitting the data into a training and test set multiple times
- The most commonly used version is k-fold cross-validation
- k is the number of partitions of the data
- Each partition serves as the test set once, while all other partitions serve as the training set
Cross-validation is a robust statistical method for evaluating the generalization performance of machine learning models. Let's elaborate on the process of k-fold
cross-validation:
⁃ Data Splitting:
⁃ In k-fold cross-validation, the dataset is divided into k subsets, or folds, of
approximately equal size.
⁃ Model Training and Evaluation:
⁃ The training and evaluation process is repeated k times, with each fold
serving as the test set once and the remaining folds as the training set.
⁃ For each iteration, a model is trained on the training set (composed of k-1
folds) and then evaluated on the test set (the fold that was left out).
⁃ Accuracy Computation:
⁃ After each iteration, the accuracy (or any other evaluation metric) of the model
is computed based on its performance on the test set.
⁃ Collection of Accuracy Values:
⁃ At the end of the k iterations, a collection of k accuracy values is obtained,
one for each fold.
⁃ Aggregation of Results:
⁃ The final evaluation metric is typically computed by averaging the k accuracy
values obtained from the individual iterations.
The illustration in the Figure demonstrates the process visually, showing how the
dataset is partitioned into folds and how each fold is used as a test set in turn.
Additionally, it's worth noting that the order of the data within each fold doesn't
necessarily correspond to consecutive parts of the dataset. Instead, the folds are
typically selected randomly to ensure that each fold represents a similar distribution
of the data.
Overall, k-fold cross-validation provides a more reliable estimate of a model's
performance by leveraging multiple train-test splits, thereby reducing the variance
associated with a single train-test split.
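A hedged sketch of k-fold cross-validation with cross_val_score (k=5 and the model/dataset are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```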
Benefits:
- Leaves less to luck: if we get a very good or bad training set by chance, this will show in the results → the performance will be an outlier.
- Shows how sensitive the model is to the training data set. High variance means high sensitivity to the training data.
- Reduced bias in evaluation:
o With a single random split, there's a chance of obtaining unrealistically high or low test set scores due to random variation in the data.
o Cross-validation ensures that each example is used for both training and testing, providing a more reliable estimate of a model's generalization performance.
- Sensitivity analysis:
o Cross-validation provides information about how sensitive the model is to the selection of the training dataset.
o By observing the range of performance across different folds, we can gain insights into the model's behavior in various scenarios.
- Efficient use of data:
o Cross-validation allows for more effective utilization of the available data.
o With each iteration, a larger portion of the dataset is used for training, resulting in potentially more accurate models.

Disadvantages:
- Increased computational cost:
o Cross-validation requires training k models instead of just one, making it computationally more expensive.
o The computational cost of cross-validation scales linearly with the number of folds (k).
- Simple cross-validation can result in class imbalance between training and test sets.
- No model output:
o Cross-validation does not produce a final model that can be directly applied to new data.
o Instead, it provides evaluations of how well a given algorithm will generalize when trained on a specific dataset.
II.2 Stratified cross-validation
- Makes sure that there is no class imbalance in the different folds
Stratified k-fold cross-validation is an enhancement over simple k-fold cross-validation, especially in scenarios where the distribution of classes in the dataset is
imbalanced. Here's a breakdown of its benefits and usage:
⁃ Issue with Simple k-Fold Cross-Validation:
⁃ In some cases, simple k-fold cross-validation may not be suitable, especially
when the dataset is ordered by class labels and the class distribution is imbalanced.
⁃ For example, in a dataset like the iris dataset, simple k-fold cross-validation
may lead to unrepresentative splits where one fold contains samples from only one
class, resulting in inaccurate evaluation metrics.
⁃ Stratified k-Fold Cross-Validation:
⁃ Stratified k-fold cross-validation addresses this issue by ensuring that each
fold maintains the same class distribution as the original dataset.
⁃ It divides the dataset into folds such that the proportion of samples from each
class is preserved in each fold.
⁃ This ensures that each fold represents a balanced distribution of classes,
providing more reliable estimates of the model's generalization performance.
⁃ Usage in Classification Tasks:
⁃ Stratified k-fold cross-validation is commonly used in classification tasks,
where class imbalance is a common challenge.
⁃ It helps to prevent scenarios where certain classes are poorly represented in
the training or test sets, leading to biased evaluation results.
⁃ Regression Tasks:
⁃ In regression tasks, standard k-fold cross-validation is typically used by
default in scikit-learn.
⁃ While it's technically feasible to use a stratified approach for regression, it's
not commonly used and may not provide significant benefits in most cases.
In summary, using stratified k-fold cross-validation, especially in classification tasks,
helps to ensure that the evaluation of the model's performance is more
representative and reliable, particularly in scenarios with imbalanced class
distributions.
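A sketch of requesting stratified folds explicitly (note: scikit-learn already uses stratified folds by default for classifiers when cv is an integer; the shuffle and random_state settings are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
# Each fold preserves the class proportions of the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores)
```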
II.3 Leave-one-out cross validation (LOO)
- k-fold cross-validation, where k = N and N = the number of items in the dataset
- Very time consuming
- Generates predictions given the maximal available data
- Can be useful to find out which items are regular and irregular from the point-of-view of the dataset.

Leave-one-out cross-validation (LOO) is a cross-validation method where each fold consists of a single sample from the dataset. Here's a breakdown of LOO and its
usage:
⁃ Methodology:
⁃ In LOO, for each split, one data point is held out as the test set, while the
remaining data points are used as the training set.
⁃ This process is repeated for each data point in the dataset, resulting in as
many iterations as there are samples in the dataset.
⁃ Time Complexity:
⁃ LOO can be very time-consuming, especially for large datasets, as it requires
training a model and evaluating it for each sample in the dataset.
⁃ The computational cost of LOO scales linearly with the size of the dataset.
⁃ Performance Evaluation:
⁃ LOO often provides better estimates of model performance on small datasets
compared to other cross-validation methods.
⁃ By leaving out only one sample at a time, LOO utilizes all available data for
both training and testing, potentially leading to more accurate estimates of the
model's generalization performance.
⁃ Usage:
⁃ LOO is particularly useful when dealing with small datasets where maximizing
the use of available data for training and evaluation is essential.
⁃ It can provide a more comprehensive assessment of the model's performance
by evaluating it on every data point in the dataset
In summary, while LOO can be computationally expensive, it offers a robust method
for estimating model performance, especially on small datasets, by utilizing all
available data points for both training and testing.
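A minimal LOO sketch (the model and dataset are assumptions; note this fits one model per sample):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print("number of splits:", len(scores))   # equals the number of samples
print("mean accuracy:", scores.mean())
```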
II.4 Shuffle-split cross-validation
- Controls test-size, training-size, and number of iterations
- Also stratified variant available
Shuffle-split cross-validation is a flexible strategy for cross-validation that offers
control over the size of training and test sets as well as the number of iterations.
Here's a breakdown of shuffle-split cross-validation and its usage:
⁃ Methodology:
⁃ In shuffle-split cross-validation, each split randomly samples a specified
number of points for the training set (train_size) and a specified number of points for
the test set (test_size), ensuring that the training and test sets are disjoint.
⁃ This process is repeated for a specified number of iterations (n_splits),
resulting in multiple training-test splits.
⁃ Flexibility:
⁃ Shuffle-split cross-validation provides flexibility in controlling the size of
training and test sets independently, allowing experimentation with different
proportions of data for training and testing.
⁃ It also allows for subsampling of the data by specifying train_size and
test_size settings that don't necessarily add up to one.
⁃ Usage:
⁃ To perform shuffle-split cross-validation in scikit-learn, you can use the
ShuffleSplit class.
⁃ Specify parameters such as test_size, train_size, and n_splits to customize
the cross-validation process according to your requirements.
⁃ Use the cross_val_score function to compute evaluation scores for each
iteration of shuffle-split cross-validation.
⁃ Example:
⁃ In the provided code snippet, shuffle-split cross-validation is applied using
scikit-learn's ShuffleSplit class with a test size of 50% and a train size of 50%, for
10 iterations.
⁃ The cross_val_score function is then used to compute the accuracy scores
for each iteration of shuffle-split cross-validation.
⁃ Stratified Variant:
⁃ For classification tasks, there's a stratified variant of shuffle-split cross-
validation called StratifiedShuffleSplit.
⁃ This variant ensures that each split maintains the same class distribution as
the original dataset, providing more reliable results for classification tasks.

In summary, shuffle-split cross-validation offers flexibility and control over the cross-
validation process, allowing for experimentation with different data proportions and
subsampling strategies. It can be particularly useful for large datasets and when fine-
tuning model parameters.
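The code snippet referenced above is not included in these notes; a minimal reconstruction under the stated settings (50%/50% split, 10 iterations; the model and dataset are assumptions) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
shuffle_split = ShuffleSplit(test_size=0.5, train_size=0.5, n_splits=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=shuffle_split)
print("cross-validation scores:", scores)
```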
II.5 Cross-validation with groups
- In cases where groups in the data are relevant to the learning problem
o E.g. emotion recognition: if the goal is to classify emotions from unknown persons, then it is best to split the data so that the persons occurring in the training and test sets are different.
o Common in medical applications where generalization to new patients is important
Cross-validation with groups is a valuable technique when dealing with datasets
where samples are related or grouped in some way. This scenario commonly arises
in various domains such as medical research or speech recognition, where data from
the same individual or speaker needs to be partitioned carefully to ensure unbiased
evaluation of the model's generalization performance. Here's a breakdown of cross-
validation with groups and its implementation using scikit-learn's GroupKFold:
⁃ Scenario:
⁃ In certain datasets, samples are related or grouped in some manner. For
instance, in medical applications, multiple samples may come from the same patient,
or in speech recognition, multiple recordings may be available from the same
speaker.
⁃ The goal is to ensure that during cross-validation, samples from the same
group (e.g., patient or speaker) are not split between the training and test sets to
accurately evaluate the model's ability to generalize to new groups.
⁃ GroupKFold:
⁃ The GroupKFold class in scikit-learn facilitates cross-validation with groups.
It takes an array of group labels as an argument, indicating which group each
sample belongs to.
⁃ The groups array ensures that samples belonging to the same group are kept
together during cross-validation, preventing leakage of information between training
and test sets.
⁃ Implementation:
⁃ In the provided example, a synthetic dataset with 12 data points is created
using make_blobs. The groups array specifies which group each data point
belongs to.
⁃ The GroupKFold object is instantiated with the desired number of splits
(n_splits), and the cross_val_score function is used to compute cross-validation
scores while considering the group labels.
⁃ The scores array contains the cross-validation scores for each fold.
⁃ Visualization:
⁃ The splits generated by GroupKFold ensure that each group is entirely in
either the training set or the test set for each fold. This is illustrated in Figure 5-4.
⁃ Flexibility:
⁃ Scikit-learn offers various other splitting strategies for cross-validation,
catering to different use cases. However, KFold, StratifiedKFold, and GroupKFold
are the most commonly used ones.

In summary, cross-validation with groups is essential for ensuring unbiased


evaluation when dealing with grouped data. It helps assess the model's performance
in scenarios where samples are related, ensuring that the model can generalize
effectively to new groups not seen during training.
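A hedged reconstruction of the GroupKFold example described above (the exact group labels are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# 12 synthetic points; 'groups' says which group (e.g. patient) each point belongs to
X, y = make_blobs(n_samples=12, random_state=0)
groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]

# Samples from the same group never appear in both the training and the test fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=GroupKFold(n_splits=3))
print(scores)
```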
II.6 Tuning
- Tuning: Improving model’s generalization performance by adjusting parameter
values
- Simple grid search: try all possible combinations of chosen parameter values
- Grid search with cross-validation: use cross-validation to evaluate the performance of each combination of parameter values.
- Danger! Optimizing on the test set means that the test set is not independent anymore → requires another final test set
This is a crucial insight into the importance of properly splitting the data into training,
validation, and test sets, especially when performing model selection and parameter
tuning. Here's a breakdown of the key points:
⁃ Overly optimistic evaluation: Selecting the best model based on
performance on the test set can lead to overly optimistic estimates of the model's
generalization performance. This is because the test set was used to adjust the
parameters, making it no longer independent for evaluation.
⁃ Threefold split(as shown in the figure): To address this issue, it's
recommended to split the data into three sets: the training set to build the model, the
validation set to select the parameters, and the test set to evaluate the performance
of the selected parameters.
⁃ Implementation: After splitting the data into training/validation and test sets,
a loop is used to iterate over different combinations of parameters (e.g., gamma and
C for an SVM model). For each combination, an SVM model is trained on the training
data and evaluated on the validation set. The parameters leading to the best
performance on the validation set are selected.
⁃ Final evaluation: Once the best parameters are determined using the
validation set, a final model is trained on the combined training and validation sets.
This model is then evaluated on the test set to obtain an unbiased estimate of its
performance on unseen data.
⁃ Importance of separate test set: Keeping a separate test set ensures that
the final evaluation is unbiased and provides a realistic estimate of how well the
model generalizes to new data. Using the test set for any exploratory analysis or
model selection can lead to information leakage and overly optimistic results.
By following this approach, practitioners can make more informed decisions about
model selection and parameter tuning while ensuring reliable estimates of model
performance on unseen data.
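A minimal sketch of this threefold-split procedure, using the iris data purely as a stand-in dataset and an SVM as in the description above:

```python
# Sketch: train/validation/test split for manual parameter selection
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
# split off a final test set, then split the rest into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)

best_score, best_params = 0, {}
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        score = SVC(gamma=gamma, C=C).fit(X_train, y_train).score(X_valid, y_valid)
        if score > best_score:
            best_score, best_params = score, {"gamma": gamma, "C": C}

# rebuild on training + validation data, evaluate once on the untouched test set
final_svm = SVC(**best_params).fit(X_trainval, y_trainval)
print(best_params, final_svm.score(X_test, y_test))
```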
This figure illustrates the process of parameter selection and model evaluation using
grid search with cross-validation.
⁃ Grid Search with Cross-Validation Steps: The figure outlines the steps
involved in grid search with cross-validation. It begins with the definition of the
parameter grid, which specifies the hyperparameters and their respective values to
be evaluated.
⁃ Cross-Validation Evaluation: Each combination of hyperparameters is
evaluated using cross-validation. The dataset is divided into multiple folds, and the
model is trained and validated on different subsets of the data. The mean
performance across all folds is computed for each parameter combination.
⁃ Selection of Best Parameters: The parameter combination that yields the
highest mean performance score during cross-validation is selected as the optimal
set of hyperparameters.
⁃ Retraining with Optimal Parameters: Finally, a new model is trained using
the entire training dataset and the optimal hyperparameters. This model is then
evaluated on the test set to assess its generalization performance.
⁃ Visualization: The visualization likely includes representations of the
parameter grid, cross-validation process, and the selection of the best parameters
based on mean performance scores.

The following is a comprehensive overview of using grid search with cross-validation to tune hyperparameters and select the best model. Here's a breakdown of the key points discussed:
⁃ Grid search with cross-validation: Grid search with cross-validation is a
method used to find the best combination of hyperparameters for a machine learning
algorithm. It systematically evaluates the performance of the model for each
combination of hyperparameters using cross-validation.
⁃ Parameter grid: The parameter grid specifies the hyperparameters and their
corresponding values to be searched. In this example, the parameter grid includes
values for C and gamma for the SVM model.
⁃ GridSearchCV: The GridSearchCV class from scikit-learn is used to perform
grid search with cross-validation. It takes the model, parameter grid, and the cross-
validation strategy as inputs.
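A short sketch of GridSearchCV with the kind of parameter grid described above (the value ranges and dataset are illustrative):

```python
# Sketch: grid search with cross-validation, then one final test-set evaluation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)          # cross-validates every combination, then refits

print(grid_search.best_params_)            # best parameter combination
print(grid_search.best_score_)             # its mean cross-validation score
print(grid_search.score(X_test, y_test))   # unbiased estimate on the held-out test set
```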

The heatmap visualization shown in Figure 5-8 provides a comprehensive view of


the mean cross-validation scores as a function of the hyperparameters C and
gamma. Here's a detailed discussion of the heat map and its implications:
⁃ Heat Map Interpretation: Each point in the heat map represents one
combination of hyperparameters (C and gamma). The color of each point encodes
the mean cross-validation accuracy, with lighter colors indicating higher accuracy
and darker colors indicating lower accuracy.
⁃ Sensitivity to Parameters: The heat map demonstrates the sensitivity of the
Support Vector Classifier (SVC) model to the settings of the parameters C and
gamma. It reveals that small changes in these parameters can lead to significant
variations in model performance.
⁃ Parameter Importance: Both parameters, C and gamma, play a crucial role
in determining the model's accuracy. Adjusting these parameters can lead to
substantial improvements or deteriorations in performance, as evident from the wide
range of accuracy values observed across different parameter settings.
⁃ Impact of Parameter Ranges: The chosen ranges for the parameters are
wide enough to encompass significant changes in the model's performance. This
indicates that the selected ranges cover the region where the optimal parameter
values lie, as opposed to being limited to the edges of the plot.
The heat map visualization provides valuable insights into the relationship between
hyperparameters and model performance. It helps in identifying the regions of the
parameter space where the model performs well and where it underperforms. This
information can guide further refinement of the parameter grid and help in achieving
better model performance.
In summary, analyzing the results of cross-validation through visualizations like heat
maps enables informed decisions regarding parameter tuning, ultimately leading to
the selection of the best-performing model.

II. Binary classification


II.1 Metrics for Binary classification
Choosing the correct metrics is crucial and here are some metrics commonly used in
binary classification and how they can help provide a more nuanced understanding
of model performance in the context of coronavirus testing:
⁃ Accuracy: This metric measures the overall correctness of the model's
predictions. In the case of coronavirus testing, accuracy tells us the proportion of all
diagnoses (both positive and negative) that are correct. While accuracy is a
commonly used metric, it may not be suitable in scenarios where class imbalance
exists, such as when the number of positive cases is much lower than negative
cases. In such cases, accuracy can be misleading and may not adequately reflect
the model's performance.
⁃ Precision: Precision focuses on the correctness of positive predictions made
by the model. In the context of coronavirus testing, precision indicates the proportion
of positive test results that are truly positive. A high precision score means that the
model is making fewer false positive predictions, which is crucial in medical
diagnostics as it ensures that patients are not incorrectly diagnosed with the disease
when they do not have it.
⁃ Recall (Sensitivity): Recall measures the ability of the model to correctly
identify positive cases out of all actual positive cases. In the context of coronavirus
testing, recall tells us the proportion of infected individuals who are correctly
identified by the test. A high recall score means that the model is effectively
capturing most of the positive cases, minimizing false negatives. This is particularly
important in medical diagnostics to ensure that infected individuals are not missed
during testing.
⁃ F1 Score: The F1 score is the harmonic mean of precision and recall. It
provides a balance between the two metrics and is especially useful when there is
an uneven class distribution or when both false positives and false negatives are
equally important. In the case of coronavirus testing, the F1 score would provide a
single metric that balances the trade-off between correctly identifying positive cases
and minimizing false positives.
In conclusion, while accuracy is a common metric, it may not be sufficient for
evaluating models in scenarios with class imbalance or when different types of errors
have varying consequences. Precision, recall, and the F1 score offer complementary
insights into the performance of the model and should be considered alongside
accuracy, especially in critical applications like medical diagnostics.

The image shows how binary classification metrics are calculated from four basic counts: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These counts are defined in a confusion matrix, which is a table that shows the number of correct and incorrect predictions made by a classification model.
⁃ True Positive (TP) is the number of items correctly classified as positive.
⁃ True Negative (TN) is the number of items correctly classified as negative.
⁃ False Positive (FP) is the number of items incorrectly classified as positive.
⁃ False Negative (FN) is the number of items incorrectly classified as negative.
The text in the image shows how Accuracy, Precision, Recall, and the F1 Score are calculated using these values.
⁃ Accuracy is the most common metric used in binary classification, though it
can be misleading in some cases. It is simply the ratio of correct predictions to total
predictions. In the image, it is calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision measures the ratio of true positives to the total number of predicted positives. In other words, it tells you how many of the items you predicted as positive actually were positive. The image shows it being calculated as:
Precision = TP / (TP + FP)
⁃ Recall measures the ratio of actual positives you captured to the total number
of actual positives. In other words, it tells you what proportion of the actual positives
were identified by your model. The image shows it being calculated as:
Recall = TP / (TP + FN)
• F1 Score is the harmonic mean of precision and recall. It is a way of combining these two metrics into a single score. A value of 1 means perfect precision and recall, while a value of 0 means that precision or recall (or both) is zero. The image shows it being calculated as:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
These metrics can be used to evaluate the performance of a binary classification
model. A good model will have high values for all of these metrics. However, in some
cases, it may be more important to optimize for one metric over another. For example, if you are trying to classify emails as spam or not spam and the priority is catching every spam message, a high recall matters more (so that no spam emails are missed) than a high precision (since some legitimate emails may be accidentally flagged as spam).
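A small sketch of these formulas with scikit-learn, using made-up labels and predictions:

```python
# Sketch: confusion-matrix counts and the metrics derived from them
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                     # 4 4 1 1
print(accuracy_score(y_true, y_pred))     # (TP + TN) / (TP + TN + FP + FN)
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```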

II.2 Problems with accuracy


⁃ Accuracy may not fully represent the performance of a model, especially in
real-world applications like medical diagnostics.
⁃ Different types of prediction errors, such as false positives and false
negatives, have varying consequences.
⁃ False negatives, where serious conditions are missed, can be more harmful
than false positives, leading to potentially severe health consequences.
⁃ Assigning dollar values to prediction errors allows for a more meaningful
evaluation of model performance in terms of real-world costs.
⁃ Cost-sensitive learning enables decision-makers to assess models based on
the actual financial implications of different types of errors, facilitating informed
decision-making in practical scenarios.
in many real-world applications, especially those with significant consequences like
medical diagnostics, the costs associated with different types of errors vary.
Understanding and quantifying these costs is crucial for evaluating model
performance effectively.
In the example of cancer screening, false positives and false negatives have
different implications:
⁃ False Positives: These occur when a healthy patient is incorrectly classified
as having cancer. The consequences include additional medical tests, potential
psychological distress for the patient, and increased healthcare costs. While false
positives are undesirable, they are generally less severe than false negatives.
⁃ False Negatives: These occur when a patient with cancer is incorrectly
classified as healthy. The consequences are much more serious and can include
delayed treatment, progression of the disease, and potentially fatal outcomes. False
negatives are particularly concerning in medical diagnostics, as they can lead to
missed opportunities for intervention and potentially harm the patient.
Assigning dollar values to false positives and false negatives allows for a more
nuanced evaluation of model performance. This approach, known as cost-sensitive
learning, enables decision-makers to weigh the costs of different types of errors and
make informed decisions about which model to use.
For example, in the context of cancer screening, the cost of a false negative (missing
a cancer diagnosis) could be assigned a much higher dollar value than the cost of a
false positive (unnecessary additional testing). By quantifying the costs associated
with each type of error, decision-makers can optimize model performance based on
the specific requirements and constraints of the application.
In commercial applications, where business impact is a key consideration,
measuring prediction errors in terms of dollars can provide valuable insights for
making strategic decisions. By focusing on minimizing the total cost of errors rather
than simply maximizing accuracy, organizations can develop more effective and
cost-efficient predictive models.

- Imbalanced classes lead to hard-to-interpret accuracy:


- In all of the cases below, accuracy is 90%

This image depicts problems with accuracy in binary classification tasks. It shows a
confusion matrix which compares the number of predicted labels to the number of
true labels.

In the ideal scenario, all the values would be on the diagonal, meaning the model
perfectly predicts all positive and negative labels. In the image, we can see that this
is not the case. There are values off the diagonal, which means the model is making
mistakes.

Specifically, the model is making both False Positive (FP) and False Negative (FN)
errors.

 False Positives (FP) are instances where the model predicted a positive
label, but the true label was negative. In the figure, the model predicted 10
positives that were actually negative.
 False Negatives (FN) are instances where the model predicted a negative
label, but the true label was positive. In the figure, the model predicted 10
negatives that were actually positive.

These errors will cause the model to have a lower accuracy than it potentially could.

There are ways to improve the accuracy of a binary classification model. Here are a
few:

 Collect more data: Training a model on more data can help it to better learn
the patterns in the data and improve its ability to classify new data points.
 Use a different model architecture: Some model architectures are better
suited for binary classification tasks than others. Experimenting with different
architectures can help you to find one that works well for your specific task.
 Tune the hyperparameters of your model: The hyperparameters of a model
are the settings that control how it learns. Tuning these hyperparameters can
help you to improve the performance of your model.

By using these techniques, the accuracy of a binary classification model can be improved and the number of FP and FN errors reduced.
The image shows a confusion matrix, which is a table that compares the number of
predicted labels to the number of true labels, for three different models
(y_pred_1, y_pred_2, y_pred_3) used in a binary classification task.

 Model y_pred_1 has the worst performance among the three models. It has
10 False Negatives (FN) and 0 True Positives (TP). This means the model did
not identify any of the actual positive labels, and incorrectly classified 10
positive labels as negative.
 Model y_pred_2 also has some FN errors (10), but it also correctly classified
some of the positive labels (90). It also made some False Positive (FP) errors
(10).
 Model y_pred_3 has the best performance among the three models. It has
the highest number of True Positives (TP) (98) and the lowest number of
errors (FN=2, FP=0).

Overall, the confusion matrix shows that all three models are making classification
errors. However, model y_pred_3 performs the best out of the three with the highest
accuracy.


II.3 Goal setting!


⁃ What do I want? What do I care about?

⁃ Determine the primary objective: Identify what specific


performance metric is most important for the application,
such as precision, recall, or a combination of both.
⁃ Can I assign costs to the confusion matrix?

⁃ Define the desired outcomes: Consider what aspects of


model performance are of utmost concern and align with
the overall goal. For example, prioritize minimizing false
negatives to avoid missing critical diagnoses.

⁃ Evaluate costs associated with prediction errors: Assign


monetary values to each type of error in the confusion
matrix, such as assigning a cost of $10 for false positives
and $100 for false negatives.

⁃ What guarantees do we want to give?

⁃ Establish performance guarantees: Define the level of


confidence or assurance desired in the model's
predictions, considering the potential consequences of
both false positives and false negatives. This ensures that
the model meets specific criteria for reliability and
effectiveness in real-world applications.

II.4 Precision-Recall Curve


- Changing the threshold that is used to make a classification decision in a
model is a way to adjust the trade-off of precision and recall for a given
classifier.

- Precision – Recall Curve

- Looks at all possible thresholds or all possible trade-offs of precisions and


recalls at once.

- Used when developing a new model and it is not yet entirely clear what operating point/threshold to use.
o Allows analysis of the trade-off between precision and
recall for different classification thresholds.
o Each point on the curve corresponds to a specific
threshold of the decision function.
o Starts at the top-left corner (low threshold, high recall,
low precision) and moves towards the bottom-right
corner (high threshold, low recall, high precision).
o Closer proximity to the upper-right corner indicates
better classifier performance.
- Operating Point:
o Refers to setting a specific threshold to meet a desired
precision or recall value.
o Helps in making performance guarantees and aligning
with business objectives.
- Average Precision:
o Computed as the area under the precision-recall curve.
o Provides a summary measure of classifier performance
across all possible thresholds.
o Ranges from 0 (worst) to 1 (best), with higher values
indicating better performance.
- Comparison:
o Precision-recall curves allow detailed comparison of
classifiers' performance across various thresholds.
o Average precision score provides a quantitative measure
of classifier performance, aiding in automatic model
comparison.
- Example:
o Comparison between SVM and random forest classifiers
using precision-recall curves revealed nuanced
differences in performance at different operating points.
o Average precision scores showed similar performance
between the classifiers, contrary to the results obtained
from the F1-score.

- The closer the curve is to the upper-right corner, the better the classifier.
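A brief sketch of computing a precision-recall curve and the average precision for an SVM, on an illustrative imbalanced synthetic dataset:

```python
# Sketch: precision-recall pairs over all thresholds of the decision function
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)  # ~90/10 split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=0.05).fit(X_train, y_train)
scores = svc.decision_function(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, scores)
plt.plot(precision, recall, label="SVC")    # each point corresponds to one threshold
plt.xlabel("precision"); plt.ylabel("recall")

print(average_precision_score(y_test, scores))  # area under the curve, between 0 and 1
```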

II.5 ROC Curve


- Changing the threshold that is used to make a classification decision in a model also adjusts the trade-off between the true positive rate and the false positive rate for a given classifier.

- Receiver operating characteristics (ROC)


• Instead of reporting precision and recall, it shows the false positive rate (FPR) against the true positive rate (TPR).

⁃ Receiver Operating Characteristic (ROC) Curve:


⁃ Analyzes classifier behavior at different thresholds by plotting the True
Positive Rate (TPR) against the False Positive Rate (FPR).
⁃ TPR is synonymous with recall, while FPR is the fraction of false positives out
of all negative samples.
⁃ Ideal curve is close to the top-left corner, indicating high TPR and low FPR.
⁃ Helps in choosing an optimal operating point for the classifier.

- For the ROC curve, the ideal curve is close to the top left: you want a
classifier that produces a high recall while keeping a low false positive rate.

II.6 ROC AUC


⁃ Area Under the ROC Curve (AUC):
⁃ Represents the integral of the ROC curve, ranging from 0 to 1.
⁃ Provides a single measure of classifier performance, with higher values
indicating better performance.
⁃ AUC of 0.5 indicates random guessing, while closer to 1 signifies better
performance.
⁃ Comparison:
⁃ AUC is particularly useful for evaluating classifiers on imbalanced datasets,
where accuracy may be misleading.
⁃ Allows comparison of different classifiers' performance without being affected
by class imbalances.
⁃ Example:
⁃ Comparison between SVM and random forest classifiers using ROC curves
reveals differences in performance at various thresholds.
⁃ AUC scores indicate that the random forest performs better than the SVM,
providing insights not captured by accuracy alone.
⁃ Recommendation:
⁃ AUC is recommended for evaluating models on imbalanced data, as it offers a
more meaningful measure of classifier performance.
⁃ Adjusting the decision threshold may be necessary to obtain useful
classification results from models with high AUC.
⁃ Area under ROC Curve
⁃ For a random prediction: 0.5
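A minimal sketch of the ROC curve and its AUC for the same kind of setup (synthetic imbalanced data, SVM decision scores):

```python
# Sketch: ROC curve (FPR vs TPR) and the area under it
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = SVC(gamma=0.05).fit(X_train, y_train).decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, scores)
print(roc_auc_score(y_test, scores))   # 1.0 = perfect ranking, 0.5 = random guessing
```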

II.7 ROC curves vs Precision-Recall curves


ROC:
- ROC curves are appropriate when the observations are balanced between each class.
- Analyze classifier behavior by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds.
- TPR is also known as recall, representing the proportion of actual positive samples correctly identified as positive.
- FPR indicates the proportion of actual negative samples incorrectly classified as positive.
- Useful for evaluating classifiers across different thresholds and determining optimal operating points.
- Ideal curve is close to the top-left corner, indicating high TPR and low FPR.
- Often used when the class distribution is balanced.
Precision-recall:
- Precision-recall curves are appropriate for imbalanced datasets.
- Plot precision against recall at different classification thresholds.
- Precision measures the accuracy of positive predictions, while recall measures the proportion of actual positives correctly classified.
- Helpful for evaluating classifiers on imbalanced datasets, where positive samples are rare.
- Focuses on the trade-off between precision and recall, rather than false positive and true positive rates.
- Useful when the positive class is of particular interest and class imbalance is present.
- Ideal curve is close to the top-right corner, indicating high precision and high recall simultaneously.
⁃ ROC curves focus on the trade-off between true positive and false positive
rates, whereas precision-recall curves focus on the trade-off between precision and
recall.
⁃ ROC curves are suitable for balanced datasets, while precision-recall curves
are more appropriate for imbalanced datasets.
⁃ The choice between the two depends on the specific characteristics of the
dataset and the goals of the classification task.

III. Multiclass-classification metrics


⁃ Accuracy:
⁃ Fraction of correctly classified examples.
⁃ Not ideal for imbalanced datasets.
⁃ Limited interpretation in multiclass classification, especially
with imbalanced class distributions.
III.1 Confusion matrix
⁃ Provides detailed information about true and false classifications for each
class.
⁃ Rows represent true labels, columns represent predicted labels.
⁃ Useful for understanding the types of errors made by the classifier.

⁃ Classification Report:
⁃ Computes precision, recall, and F1-score for each class.
⁃ Precision: Proportion of true positive predictions out of all positive predictions.
⁃ Recall: Proportion of true positive predictions out of all actual positives.
⁃ F1-score: Harmonic mean of precision and recall, provides a balance between
the two.
⁃ Commonly used to evaluate multiclass classification models.
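A short sketch of the confusion matrix and classification report for a multiclass problem (the digits dataset and logistic regression are just illustrative choices):

```python
# Sketch: per-class precision, recall and F1 via classification_report
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

pred = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, pred))       # rows = true labels, columns = predictions
print(classification_report(y_test, pred))  # precision, recall, F1 and support per class
```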
III.2 Micro and macro F1
- Macro-average F1: Average F1 scores over classes (“all classes are equally
important”)

- Weighted F1: Mean of the per-class f-scores, weighted by their support.


(“bigger classes are important”)

- Micro-average F1: Make one binary confusion matrix over all classes, then
compute recall, precision once (“all samples are equally important”)

o Derives binary F-scores per class, treating each class as the positive
class and others as negative.
o Can be averaged using "macro," "weighted," or "micro" averaging
strategies:
 "Macro" averaging: Unweighted mean of per-class F-scores.
 "Weighted" averaging: Mean of per-class F-scores, weighted by
class support.
 "Micro" averaging: Computes precision, recall, and F-score
using total counts over all classes.
- Helps to assess model performance across all classes.
- Comparison:
o Accuracy, confusion matrix, and classification report offer insights into
overall and per-class performance.
o Multiclass F-score provides a comprehensive measure of model
performance across all classes, considering both precision and recall.
o Choice of evaluation metric depends on the specific characteristics of
the dataset and the goals of the classification task, particularly
considering class distribution and the importance of individual classes.
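The averaging strategies above map directly onto the average parameter of scikit-learn's f1_score; a minimal sketch, reusing the illustrative digits setup:

```python
# Sketch: macro, weighted and micro averaging of the multiclass F1 score
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)
pred = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict(X_test)

print(f1_score(y_test, pred, average="macro"))     # all classes equally important
print(f1_score(y_test, pred, average="weighted"))  # per-class scores weighted by support
print(f1_score(y_test, pred, average="micro"))     # all samples equally important
```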

III.3 Picking metrics


- Accuracy rarely what you want

- Problems are rarely balanced

- Find the right criterion for the task

- OR pick one arbitrarily, but at least think about it

- Emphasis on recall or precision?

- Which classes are the important ones?

III.4 Metrics for regression


- Built-in standard metrics

o R2 (Coefficient of Determination)-easy to
understand scale:
 Measures the proportion of the variance in the
dependent variable (target) that is predictable from
the independent variables (features).

 Ranges from 0 to 1, where 1 indicates perfect


prediction and 0 indicates no improvement over simply predicting the mean (note that R2 can be negative on test data for models that do worse than the mean).

 Intuitive interpretation: the higher the R2 value, the


better the model fits the data.

o Mean Squared Error (MSE)-easy to relate to input:

 Calculates the average of the squared differences


between predicted and actual values.

 Larger errors are magnified due to squaring, making


MSE sensitive to outliers.

 Useful for penalizing large errors, but less intuitive


than R2 for interpretation.

o Mean Absolute Error (MAE), median absolute error-


more robust:

 Calculates the average of the absolute differences


between predicted and actual values.

 Provides a more intuitive measure of error


compared to MSE, as it is expressed in the same units as the target variable (whereas MSE is in squared units).

 Less sensitive to outliers compared to MSE.

⁃ Choosing a Metric:
⁃ R2 is commonly used as it provides a clear indication of how
well the model explains the variance in the target variable.
⁃ MSE and MAE are useful for understanding the magnitude of
errors but may not provide as direct a measure of model fit as R2.
⁃ Business decisions may sometimes be based on MSE or MAE,
particularly if specific cost functions are involved.
⁃ The choice of metric ultimately depends on the specific
requirements of the problem and the preferences of stakeholders.
⁃ When using “scoring” use “neg_mean_squared_error” etc
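A small sketch of these regression metrics and of the negated scoring names, on a synthetic regression task:

```python
# Sketch: R2, MSE and MAE for a fitted regressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred = Ridge().fit(X_train, y_train).predict(X_test)
print(r2_score(y_test, pred))              # proportion of variance explained
print(mean_squared_error(y_test, pred))    # penalizes large errors heavily
print(mean_absolute_error(y_test, pred))   # same units as the target, more robust

# error metrics are exposed as negated scores so that "higher is better" holds
print(cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error"))
```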
IV.Imbalanced data
- Sources
• Asymmetric cost
• Asymmetric data

- Approaches

- Change the data

• Add samples
• Remove samples

- Change the training procedure

- Imbalanced datasets occur when one class is significantly more frequent than
others (e.g., click-through prediction where most impressions don't result in
clicks).
- Accuracy can be misleading for imbalanced data. A simple model that always
predicts the majority class can achieve high accuracy without being
informative.
- Example: Classifying digit "9" vs all others in the digits dataset creates a 9:1
imbalance.
- A DummyClassifier predicting only the majority class ("not nine") achieves
nearly 90% accuracy, highlighting the limitations of accuracy.
- Other classifiers like DecisionTreeClassifier might not show significant
improvement over the dummy model based on accuracy alone.
- Alternative metrics are needed to evaluate models on imbalanced datasets
effectively. These metrics should penalize models that simply predict the
majority class.
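A minimal sketch of the "nine vs rest" baseline described above, using scikit-learn's DummyClassifier:

```python
# Sketch: a majority-class dummy baseline on an imbalanced (roughly 9:1) task
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
y = digits.target == 9          # True only for the digit "9" -> ~9:1 imbalance
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=0)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(dummy.score(X_test, y_test))   # close to 0.90 without learning anything useful
```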
IV.1 Random undersampling
- Technique: Random undersampling involves removing data points from the
majority class randomly.
- Goal: This technique aims to balance the class distribution in a dataset with a
heavily represented majority class.
- Process: It typically involves removing data points from the majority class
until the desired balance is achieved (often aiming for a class ratio close to
1:1).
- Advantages:
o Speed: Random undersampling is a very fast technique as it simply
removes data points, reducing training time.
o Efficiency: In some cases, it can lead to efficient training, especially
when dealing with large datasets where the majority class dominates.
- Disadvantages:
o Data Loss: A major drawback is the loss of potentially valuable data
from the majority class. This discarded data might contain useful
information for the model.
o Information Loss: Removing data points can lead to a loss of
information about the majority class distribution, potentially affecting
the model's generalizability.
- In summary, random undersampling is a quick and easy approach to address
imbalanced datasets, but it comes at the cost of potentially discarding
valuable data.

IV.2 Random Oversampling


- Technique: Random oversampling involves duplicating data points from the
minority class randomly.
- Goal: Similar to undersampling, it aims to balance the class distribution in a
dataset with a heavily represented majority class.
- Process: It involves randomly selecting data points from the minority class
and adding copies of those points to the training data. This process is
repeated until the desired balance is achieved (often aiming for a class ratio
close to 1:1).
- Advantages:
o Balances Classes: Effectively increases the representation of the
minority class, ensuring the model is trained on a more balanced
dataset.
o Preserves Information: No data is discarded, so all the information
from the minority class is retained for training.
- Disadvantages:
o Slow Training: Random oversampling can significantly increase the
size of the training data due to data duplication. This can lead to slower
training times, especially for large datasets.
o Overfitting Risk: Duplicating minority class data points can increase
the model's focus on those specific data points, potentially leading to
overfitting and poor performance on unseen data.
- In essence, random oversampling addresses class imbalance by replicating
minority class data, but it comes at the cost of potentially slower training and
increased risk of overfitting.
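A sketch of both resampling strategies, assuming the third-party imbalanced-learn package (imblearn) is installed; the dataset is synthetic:

```python
# Sketch: random undersampling vs random oversampling with imbalanced-learn
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print(Counter(y))                    # roughly 900 majority vs 100 minority samples

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))              # majority class randomly reduced towards 1:1

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))               # minority class randomly duplicated towards 1:1
```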

IV.3 Edited nearest neighbours


- Edited Nearest Neighbors (ENN): Targeting Noisy Data in Imbalanced
Datasets
- Origin: Edited Nearest Neighbors (ENN) emerged as a technique to reduce
the size of datasets for K-Nearest Neighbors (KNN) algorithms.
- Core Idea: It identifies and removes data points from the majority class that
are likely noisy or misclassified based on their nearest neighbors.
- Two removal strategies, both targeting majority-class samples that a KNN vote would misclassify:
- Mode: A data point is removed if the majority (mode) of its k nearest neighbors belongs to a different class, i.e., if a KNN majority vote would misclassify it.
- All: A data point is removed if any of its k nearest neighbors belongs to a different class. This is the stricter variant: it removes more points and cleans the region around the decision boundary more aggressively.
- Benefits:
- Reduced Training Time: By removing potentially noisy data, ENN can lead to
faster training times for KNN models.
- Improved Classification: Removing data points that confuse the KNN
algorithm can potentially improve its classification accuracy.
- Boundary Cleaning: ENN can help "clean up" the decision boundary by
removing outliers from the majority class that might mislead the model.
- Limitations:
- Data Loss: Similar to undersampling techniques, ENN discards data
points, which can lead to information loss.
- Parameter Dependence: The effectiveness of ENN depends on the chosen
value of k (number of nearest neighbors).
- Overall, ENN is a valuable technique for handling imbalanced datasets,
particularly when dealing with noisy data in the majority class. However, it's
crucial to consider the potential loss of information and the impact of the
chosen k parameter.

IV.4 SMOTE (Synthetic Minority Oversampling


technique)
- SMOTE: Synthetic Sampling for Imbalanced Data
- SMOTE (Synthetic Minority Oversampling Technique) is a popular technique
for addressing imbalanced datasets.Unlike random oversampling, which
simply duplicates existing minority class data points, SMOTE generates
synthetic data points.
- Key Idea: SMOTE focuses on creating new data points for the minority class
by interpolating between existing minority class samples.
- Process:
- Identify Minority Class: SMOTE first identifies the minority class in the dataset.
- Select Minority Sample: It randomly selects a data point from the minority
class.
- Find Nearest Neighbors: The algorithm then identifies the k nearest neighbors
(other minority class data points) for the chosen sample.
- Synthetic Sample Generation: SMOTE randomly selects one of the k nearest
neighbors. Next, it calculates the difference between the chosen minority
sample and its selected neighbor in feature space (considering each feature
value).
- Interpolation: A random number between 0 and 1 is generated. This value is
multiplied by the difference vector calculated in step 4.
- Adding the Synthetic Point: Finally, the original minority class sample is added
to the difference vector scaled by the random number. This creates a new
synthetic data point that lies along the line segment connecting the original
sample and its neighbor in feature space.
- Benefits:
- Balances Classes: SMOTE effectively increases the representation of the
minority class without simply duplicating existing data.
- Preserves Information: Unlike oversampling, SMOTE leverages existing data
to create new points, potentially capturing valuable information about the
minority class distribution.
- Improves Performance: Studies have shown that SMOTE can lead to
improved performance of various classification algorithms on imbalanced
datasets.
- Limitations:
- Overfitting Risk: Similar to oversampling, SMOTE can increase the model's
focus on specific areas of the feature space, potentially leading to
overfitting. Techniques like SMOTE-ENN can be used to mitigate this risk.
- Borderline Issues: Generated synthetic data points might lie on or near the
decision boundary, potentially impacting the model's ability to generalize to
unseen data.
- Overall, SMOTE is a powerful technique for handling imbalanced datasets by
creating synthetic data points for the minority class. However, it's essential to
be aware of the potential overfitting risk and consider combining SMOTE with
other techniques to address it.
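A short sketch of SMOTE (and the SMOTE-ENN combination mentioned above), again assuming the imbalanced-learn package is installed:

```python
# Sketch: SMOTE oversampling, optionally followed by ENN cleaning
from collections import Counter
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_sm))       # minority class grown with synthetic interpolated samples

X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
print(Counter(y_se))       # SMOTE followed by edited-nearest-neighbours cleaning
```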

LECTURE11- Dimensionality Reduction


- A type of unsupervised learning
I. Curse of dimensionality
- The curse of dimensionality poses challenges in high-dimensional spaces,
affecting data analysis and modeling.
- Nonparametric methods like histograms are particularly vulnerable due to
sparse data density.
- As dimensions increase, the number of bins required grows exponentially,
often resulting in empty or sparsely populated bins.
- The concept of proximity or "closeness" between data points becomes less
meaningful in higher dimensions.
- Euclidean distance loses discriminative power, making it challenging to
determine meaningful relationships between data points.
- Solutions include adjusting parameters like bandwidth in kernel density
estimation or using dimensionality reduction techniques.
- Incorporating domain knowledge or priors can help guide analysis and
alleviate issues associated with high-dimensional data.
- The image depicts a graph illustrating the relationship between the number of features (dimensions) and classifier performance. It suggests that classifier performance suffers as the number of features increases, a phenomenon known as the curse of dimensionality.
- Understanding the Curse of Dimensionality
- Imagine searching for a specific book in a library. With a small library (few
dimensions), you can easily find the book by browsing a limited number of
shelves (features). However, in a massive library (high
dimensions), searching becomes significantly more challenging as the
number of shelves (features) explodes.
- Similarly, in machine learning, analyzing data with many features (high
dimensionality) poses challenges:
- Data Sparsity: As the number of dimensions increases, data points become
spread out more thinly in the feature space. This sparsity makes it harder to
identify patterns and relationships between features.
- Increased Computational Cost: Training machine learning models on high-
dimensional data requires significantly more computation and memory
resources.
- Overfitting: With many features, models can easily overfit the training
data, memorizing specific details rather than learning generalizable patterns.
- The Impact on Classifier Performance
- The graph shows that classifier performance, often measured by accuracy or
precision-recall, tends to peak at a certain number of features and then
degrades as the number of features continues to increase. This highlights the
detrimental effect of the curse of dimensionality on classification tasks.
- Mitigating the Curse of Dimensionality
- Several techniques can help alleviate the curse of dimensionality:
- Feature Selection: Selecting a subset of relevant features can improve
performance by focusing the model on the most informative data.
- Dimensionality Reduction: Techniques like Principal Component Analysis
(PCA) can project data into a lower-dimensional space while preserving
essential information.
- Regularization: Regularization algorithms penalize models for having too
complex decision boundaries, reducing the risk of overfitting in high
dimensions.
- By understanding and addressing the curse of dimensionality, you can
improve the effectiveness of machine learning models when dealing with high-
dimensional data.
I.2 Curse of dimensionality and overfitting

⁃ The curse of dimensionality and overfitting are closely related phenomena in


machine learning.
⁃ In a classification task like distinguishing between cats and dogs, with only 10
instances (images), the curse of dimensionality becomes apparent.
⁃ Initially, with just one feature (e.g., average amount of red color in the image),
classification may be challenging due to limited information.
⁃ Adding a second feature (e.g., average amount of green color) increases the
dimensionality but also provides more information for classification.
⁃ Similarly, a third feature (e.g., average amount of blue color) further increases
dimensionality and provides even more discriminatory information.
⁃ In three dimensions (three features), it becomes possible to achieve perfect
separation of cats and dogs using a decision boundary (a plane).
⁃ However, as the number of features increases relative to the number of
instances, the risk of overfitting also increases.
⁃ Overfitting occurs when a model captures noise or random fluctuations in the
training data rather than the underlying patterns, leading to poor generalization
performance on unseen data.
⁃ In high-dimensional spaces, models can become overly complex and fit to the
noise present in the training data, making them less effective at generalizing to new
instances.
⁃ Techniques such as regularization, cross-validation, and feature selection can
help mitigate overfitting and improve the generalization ability of models in high-
dimensional spaces.
I.3 Adding features improves selection?
- The example highlights the double-edged sword of adding features in
classification:
- Improved Classification (Up to a Point): Adding informative features can
provide more data points for the classifier to learn from. This can lead to
better separation between classes and improved classification accuracy.
- However, there's a catch...
- Curse of Dimensionality: adding features increases the dimensionality of the feature space, and the number of possible feature-value combinations grows exponentially:
o Think of it like this: With 1 feature, data points can have 10 different
values, visualized as 10 points on a line.
o Adding another feature creates a 2D space. Now, each data point can
have 10 possible values for each feature,resulting in 10 x 10 = 100
possible combinations (imagine a grid).
o With 3 features, the space becomes 3D, and the number of possible
combinations explodes to 10 x 10 x 10 = 1000.
- This exponential growth in dimensionality leads to challenges:
- Data Sparsity: Data points become spread out more thinly in the higher-
dimensional space, making it harder to find patterns and relationships.
- Increased Computational Cost: Training models on high-dimensional data
requires significantly more computation and memory resources.
- Overfitting Risk: With many features, models can easily overfit the training
data, memorizing specific details rather than learning generalizable patterns.
- Finding the Right Balance:
- The key is to find a balance between adding informative features and
managing the curse of dimensionality. Here are some strategies:
- Feature Selection: Techniques like filter methods or wrapper methods can
help identify the most relevant features and discard redundant ones.
- Dimensionality Reduction: Techniques like PCA can project data into a lower-
dimensional space while preserving essential information.
- Regularization: Regularization algorithms penalize models for having too
complex decision boundaries, reducing the risk of overfitting in high
dimensions.
- By carefully considering the trade-offs and applying appropriate
techniques, you can leverage the benefits of adding features while mitigating
the curse of dimensionality for effective classification.

I.4 Complex vs Simple models

I.5 High-dimensional vs Low-dimensional feature


space
- A simple classification model in a high-dimensional space e.g., a
linear decision boundary (plane) in 3D
- Corresponds to a complex classification model in low-dimensional
space e.g., non-linear decision boundaries in 2D
- Overfitting is associated with (too) complex models
- Hence, too many features may lead to overfitting too

I.6 Trade-off in machine learning


- Informative features: We want to increase the number of features to put all
the relevant information in the classifier
- Curse of dimensionality: We want to decrease the number of features to
avoid the curse of dimensionality
- Machine learning algorithms should optimise the trade-off between
informative features and curse of dimensionality by means of
dimensionality reduction techniques
I.7 Dimensionality reduction: Two different ways
- Reasons for Dimensionality Reduction:
- However, several practical considerations motivate feature selection and
extraction:
- Reduced Complexity: Most learning algorithms become more complex with
increasing input dimensions (d). Reducing d simplifies the model, leading to
faster training, lower memory requirements, and potentially easier
interpretation.
- Computational Efficiency: Extracting unnecessary features can be
computationally expensive. Reducing the feature set saves resources for both
training and inference (making predictions on new data).
- Improved Model Robustness: Simpler models with fewer features tend to be
more robust, especially with limited data. They are less susceptible to
overfitting and noise in the data.
- Knowledge Extraction and Visualization: When data can be explained with
fewer features, it becomes easier to understand the underlying process and
relationships between variables. Additionally, lower-dimensional data can be
visualized more effectively for exploring patterns and identifying outliers.
- Two Main Approaches:
- The passage introduces two main approaches for dimensionality reduction:
o Feature Selection:
- Focuses on identifying a subset of the original features (k out of d) that are
most informative for the task.
- Discards the remaining (d-k) features, essentially selecting the most relevant
ones.
- Subset selection is a common approach for feature selection.
o Feature Extraction:
- Aims to create a new set of k features that are combinations of the original d
features.
- These new features are often more informative than the original ones.
- Feature extraction techniques can be supervised (utilizing labels) or
unsupervised (not relying on labels).
- Examples of Feature Extraction Techniques:
- Principal Component Analysis (PCA): This unsupervised method finds new
features (principal components) that capture the maximum variance in the
data.
- Linear Discriminant Analysis (LDA): This supervised method projects data
onto a lower-dimensional space while maximizing the separation between
classes.
- Additional Techniques:
- The passage briefly mentions other techniques, including Factor Analysis
(FA), Multidimensional Scaling (MDS), Isometric Feature Mapping (Isomap),
and Locally Linear Embedding (LLE). These techniques offer various ways to
project data into a lower-dimensional space, either linearly or non-linearly.
- Overall, understanding and applying feature selection and extraction
techniques are crucial aspects of machine learning, especially when dealing
with high-dimensional data. They can improve model performance,
interpretability, and computational efficiency.
- Feature selection:
• Keeping the most relevant variables from the original dataset • E.g. Random
Forests, Decision Trees,
• Removing features with too many missing values
- Dimensionality reduction:
- Finding a smaller set of new variables, each being a combination of the input
variables, containing basically the same information as the input variables
- Unsupervised approach
- E.g Principal Component Analysis, Nonnegative rank factorization, t-SNE (t
Stochastic Nearest Neighbour)

II. Principal Component Analysis (PCA)


- PCA operates on the features without labels = unsupervised learning
- Each feature is an axis of an N-dimensional coordinate system
• Two features: XY coordinate system
- PCA rotates the XY coordinate system
- PCA is an unsupervised method that projects data from a high-dimensional
space to a lower-dimensional space while preserving as much of the
information as possible.
- When to Use PCA:
o PCA is beneficial when dealing with high-dimensional data where many
features might be redundant or correlated.
o It can improve the efficiency of algorithms by reducing computational
costs associated with high dimensionality.
- Determining the Number of Components:
o The number of principal components (k) to retain is crucial. Techniques
like scree plots and eigenvalue analysis help identify the components
capturing most of the variance.

Benefits:
- Reduced complexity: Lower-dimensional data requires less computation and storage.
- Improved interpretability: Visualizing data in a lower-dimensional space can reveal patterns and relationships more easily.
- Potential for better classification: By removing redundant features, PCA can lead to more accurate classification models (especially when dealing with the curse of dimensionality).
Limitations:
- PCA is sensitive to outliers, which can influence the calculation of eigenvectors.
- It assumes linear relationships between features. Non-linear relationships might not be captured effectively.
Overall, PCA is a powerful tool for dimensionality reduction in machine learning. By
understanding its concepts and applications, you can leverage it to improve the
performance and interpretability of your models.
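A minimal sketch of PCA with scikit-learn; scaling to unit variance first is a common (assumed) preprocessing choice, and the breast-cancer dataset is only illustrative:

```python
# Sketch: project scaled data onto the first two principal components
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)   # zero mean, unit variance

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)       # rotate, then keep the first two new axes

print(X_pca.shape)                        # (569, 2): samples in the reduced space
print(pca.components_.shape)              # (2, 30): one row per principal component
print(pca.explained_variance_ratio_)      # variance captured by each component
```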

II.1 PCA example


- The Scenario:
- We have two features describing a website:
o Number of inbound links (Feature 1)
o Google PageRank (Feature 2)
- The Problem:
o These features are likely correlated. A website with many inbound links
is also likely to have a high PageRank.This redundancy can be
problematic for machine learning algorithms, as they might give too
much weight to these correlated features.
- How PCA Can Help:
o PCA can identify a new set of features (principal components) that
capture the essential information from the original features.
o In this example, the first principal component might represent the
overall importance of the website (combining inbound links and
PageRank).
- Correlation between Inbound Links and Principal Component 1: The data
shows a connection between the number of links a webpage receives
(inbound links) and its score on the first principal component (PC1) from a
PCA analysis.This suggests that inbound links are a significant factor
influencing the overall data spread.
- Focus on Remaining Principal Components: The slide title "(highest
component removed)" implies the graphs depict the data after removing the
most informative component (PC1). This would shift the focus to the
remaining principal components, which capture less overall data variation
compared to PC1.

II.1 Computing PCA


 PCA is computed in these steps:
o Taking the Whole Dataset Ignoring Class Labels:
 This step is relevant if dealing with a supervised learning task where your data
has features (independent variables) and a target variable (dependent
variable, often called the class label).
 For PCA, we focus only on the features (independent variables) to understand
the underlying structure of the data itself, ignoring the class labels.
 Compute the Mean and Covariance:
o Mean: Calculate the average value for each feature across all data
points. This helps center the data around its average values.
o Covariance: This matrix captures the relationship between each pair
of features in the data. A positive covariance indicates features tend to
move together, while a negative covariance suggests they move in
opposite directions.
 Center the Data (Subtract Mean):
o Subtract the mean vector (calculated in step 2) from each data
point. This removes the bias caused by the original data's mean values
and allows us to focus on the variations around the mean. (Optional: Scaling to Unit Variance: In practice, it's often beneficial to also scale each feature to have a unit variance (standard deviation of 1). This ensures all features contribute equally to the PCA analysis, especially when features have different measurement scales.)
o Obtain Eigenvectors and Eigenvalues:
 Two methods can be used:
o Eigenvalue Decomposition: Decompose the covariance matrix (or
correlation matrix if features are scaled) to obtain eigenvectors and
eigenvalues.
o Singular Value Decomposition (SVD): This is a more general
technique applicable to non-square matrices.It can also be used on the
centered data matrix directly.
 Eigenvectors: These represent the directions of greatest variance in the
data. They define the new axes (principal components) in the transformed
feature space.
 Eigenvalues: These represent the amount of variance captured by each
eigenvector. The eigenvalues are in descending order, with the first
eigenvector corresponding to the direction of greatest variance.
o Project onto the New Feature Space:
 To reduce dimensionality, you can choose a subset of
eigenvectors (typically the top few with the highest
eigenvalues). These eigenvectors will form the basis of the
new, lower-dimensional feature space.
 Project the centered data points onto the chosen eigenvectors to
obtain their principal component scores. These scores represent
the data points' positions in the transformed, lower-dimensional
space.

- Sort eigenvalues in descending order


- Choose the k eigenvectors that correspond to the k largest eigenvalues where
k is the number of dimensions of the new feature subspace (k≤d). d = original
number of dimensions.
- Construction of the projection matrix (W) that will be used to transform the
data
- Projection matrix is a matrix of our concatenated top k eigenvectors
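A compact sketch of the steps above in NumPy, on illustrative random data (rows are samples, columns are features):

```python
# Sketch: center, covariance, eigen-decomposition, projection onto top-k eigenvectors
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # illustrative data, d = 5 features

X_centered = X - X.mean(axis=0)                  # subtract the mean of each feature
cov = np.cov(X_centered, rowvar=False)           # d x d covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)           # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]                # sort eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                               # projection matrix: top-k eigenvectors
X_projected = X_centered @ W                     # data in the k-dimensional subspace
print(X_projected.shape)                         # (100, 2)
```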

II.2 PCA in higher dimensions


- For datasets with more than 2 features, PCA rotates the coordinate
system in such a way that:
- the projection of the data on the first principal component (new
axis) has the largest variance,
- the projection of the data on the second principal component (new
axis) has the one- but-largest variance,
- and so forth...
- If the variation in the data is associated with relevance for
classification (or regression), the most relevant features are
captured by the first principal components (and the rest captures
noise)
- Retaining the first principal components and throwing away the rest
effectively reduces the dimensionality

II.3 PCA for feature extraction

II.4 How many features (principal


components) to keep?
- No fixed rule that defines how many features should be used in a
classification problem.
- This depends on
• the amount of training data available,
• the complexity of the decision boundaries, and • the type of classifier used
- Total Explained Variance: can be used to decide the number of features.

II.5 Total explained variance

- The first k principal components capture the highest share of the total variance.


- We try to choose k big enough to make the lost information — the
variance of the last p−k components — sufficiently small.
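A small sketch of using the total explained variance to pick k with scikit-learn's PCA (the 95% target is an arbitrary illustrative choice):

```python
# Sketch: choose the smallest k whose cumulative explained variance reaches 95%
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA().fit(X_scaled)                              # keep all components first
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.95)) + 1             # first index reaching 95%
print(k, cumulative[k - 1])

# alternatively, let PCA pick k for a target variance fraction directly
print(PCA(n_components=0.95).fit(X_scaled).n_components_)
```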

III. Non-negative Matrix Factorization


- What it Does:
o Decomposes non-negative data matrices (e.g., images, documents)
into lower-dimensional matrices.
o Focuses on data where negative values don't make sense
(intensities, frequencies, amounts).
o Reduces dimensionality for easier analysis and potentially better model
performance.
- The Process:
o Takes a data matrix V (e.g., image pixels).
o Factorizes V into W * H:
 W (basis matrix): holds basis vectors representing "parts" of the
data (m rows, k columns).
 H (coefficient matrix): stores weights for combining basis vectors
to reconstruct data points (k rows, n columns).
o Iteratively refines W and H to best approximate V using a cost function.
- Benefits:
o Interpretable basis vectors due to non-negative values (e.g., "edges" in
images).
o Parts-based representation: data points as combinations of basis
vectors.
o Applications in image analysis, text mining, recommendation
systems, music analysis, etc.
- Considerations:
o Not ideal for data with inherent negative values.
o User-defined number of basis vectors (k) can affect results.
o No guaranteed unique factorization (different starting points might lead
to slightly different solutions).
o Can only be applied to non-negative data
o Whether components are interpretable is hit or miss
o Non-convex optimization, requires initialization
o Can be slow on large datasets
o Not orthogonal
- Why NMF?
o Data points are composed into positive sums
o Positive weights can be easier to interpret
o No “cancellation” like in PCA
o No sign ambiguity like in PCA

- Can be viewed as “soft clustering”: each point is positive linear


combination of weights.

III.1 Matrix Factorization


- X is the data matrix (n rows, p columns)
- A is the basis matrix (n rows, k columns)
- B is the coefficient matrix (k rows, p columns)
- This depicts a matrix factorization process where a data matrix X is
approximated by the product of two lower-dimensional matrices, A and B.
- The specific type of matrix factorization used in the image is not explicitly
mentioned, but it could be Non-negative Matrix Factorization (NMF) which is
commonly used for analyzing data matrices where the values are all non-
negative.
- Latent factorisation, also sometimes called matrix factorization, is a family of
techniques used to uncover hidden patterns or structures within a dataset. It
achieves this by decomposing a data matrix into a product of lower-
dimensional matrices that capture these underlying factors.
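A brief sketch of NMF in scikit-learn; the Olivetti faces dataset is just an illustrative non-negative data matrix, and in the notation above W corresponds to A and components_ to B:

```python
# Sketch: factorize a non-negative data matrix X into W (n x k) and H (k x p)
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

X = fetch_olivetti_faces().data            # 400 face images, pixel values in [0, 1]

nmf = NMF(n_components=15, init="nndsvd", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                   # per-sample weights over the 15 components
H = nmf.components_                        # non-negative basis vectors ("parts")

print(W.shape, H.shape)                    # (400, 15) and (15, 4096)
print(((W @ H - X) ** 2).mean())           # reconstruction error of the approximation
```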

III.2 Latent space


- Latent space contains a hidden compressed representation of the data
- Contains a simpler representation of our images than the pixel space
- Latent space, in the context of machine learning and dimensionality reduction
techniques like latent factorization, refers to a lower-dimensional space that
captures the essential characteristics of the original, higher-dimensional
data. It's essentially a compressed representation where similar data points
are positioned closer together.
- Here's a breakdown to understand latent space better:
- Concept:
- Imagine a high-dimensional space where each data point is represented by a vector with numerous features. Visualizing this space can be challenging for humans as the dimensionality increases.
- Latent space transformation techniques like NMF or PCA aim to project this
high-dimensional data onto a lower-dimensional space. This lower-
dimensional space is the latent space.
- The key is that this transformation preserves the important relationships
between data points. Points that were close together in the original space
remain close in the latent space, even though they are represented by fewer
dimensions.

- Benefits:
- Dimensionality Reduction: Working with a lower-dimensional latent space allows for easier analysis, visualization, and manipulation of the data.
- Efficiency: Machine learning algorithms often perform better when dealing
with less data. Latent spaces can significantly reduce the computational cost
of processing complex datasets.
- Uncovering Hidden Structure: The process of finding the latent space can
reveal underlying patterns or relationships within the data that might not be
obvious in the original high-dimensional space.

- Applications:
- Recommendation Systems: Latent spaces can be used to model user
preferences and item characteristics, enabling recommendation systems to
suggest relevant items to users based on their past behavior.
- Image and Text Analysis: In image processing, latent spaces can be used for tasks like image compression, denoising, and object recognition. Similarly, in text analysis, latent spaces can help identify topics in documents or group similar documents together.
- Anomaly Detection: Deviations from the expected patterns in the latent
space can indicate anomalies or outliers in the data.

- Things to Consider:
- Choosing the Right Technique: Different techniques like NMF and PCA
have their strengths and weaknesses depending on the data type and desired
outcome.
- Interpretability: While latent spaces capture important
information, interpreting the specific meaning of each dimension in the latent
space can be challenging.

- Understanding latent space is crucial for grasping how techniques like latent
factorization work and how they help us unlock the hidden potential within
complex datasets.

III.3 Linear interpolation in img space


- Linear Interpolation Explained
- Linear interpolation is a mathematical method used to estimate the value of a data point between two known data points. In the context of digital images, it's a technique for approximating the color or intensity of a pixel that falls between existing pixels.
- Image Resizing Example
- The lecture slide shows a "Start image", a set of "Interpolated images", and an "End image". The same idea underlies a common application of linear interpolation: resizing images.
- When you resize an image digitally, you're essentially changing the number of
pixels in the image.
- If you enlarge an image (increase the number of pixels), linear interpolation
creates new pixels by estimating the color or intensity values based on the
surrounding pixels in the original image.

- Here's a simplified breakdown of the process for a single pixel being interpolated:
- Identify surrounding pixels: The algorithm identifies the four pixels nearest
to the new pixel's location in the original image.
- Calculate weights: Based on the new pixel's distance from each surrounding
pixel, weights are assigned to each neighbor. Pixels closer to the new pixel
will have higher weights.
- Average weighted values: The color or intensity values of the surrounding
pixels are averaged, weighted by the calculated weights from step 2. This
average value becomes the color or intensity of the new pixel.
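- A minimal sketch of that weighted-average idea for a single new pixel (assuming a 2-D grayscale NumPy array img and fractional coordinates x, y inside the image; names are illustrative):

import numpy as np

def bilinear_interpolate(img, x, y):
    # 1. identify the four surrounding pixels
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, img.shape[1] - 1)
    y1 = min(y0 + 1, img.shape[0] - 1)
    # 2. weights based on the distance to each neighbour
    wx, wy = x - x0, y - y0
    # 3. weighted average of the four neighbouring intensities
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom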

- Limitations of Linear Interpolation


- While linear interpolation is a simple and efficient method, it has limitations:
- Blurring: Linear interpolation can cause blurring in the resized image,
especially when enlarging an image significantly. This is because the newly
created pixels are essentially averages of the surrounding pixels, leading to a
loss of sharp edges and details.
- Not ideal for all images: Linear interpolation works best for images with
gradual color changes. For images with sharp edges or complex patterns, it
may not produce the best results.

III.4 PCA vs NMF (non-negative matrix factorization)

PCA:
- Works on any data type (numerical).
- Components (principal components) can have negative and positive values, making interpretation less straightforward.
- Focuses on capturing the most variance in the data, regardless of the underlying additive structure.
- Broadly applicable for dimensionality reduction, feature extraction, and anomaly detection.
- General dimensionality reduction: PCA is a good starting point.
- Principal components are orthogonal; PCA minimizes squared reconstruction loss.
- Sparse PCA: components are orthogonal & sparse.
- Intuition: explaining a face by stretching and shrinking a "generic face" template; the components can have positive or negative values for stretching or shrinking.

NMF:
- Designed for non-negative data (e.g., image pixels, word frequencies).
- Components (basis vectors) are non-negative, offering easier interpretability in terms of the original data.
- Aims to identify parts or basis elements that can be additively combined to reconstruct the data.
- Particularly useful for tasks like image compression, topic modeling in documents, music source separation, and analyzing recommender systems.
- Non-negative data with interpretable parts: NMF is a strong choice.
- Latent (hidden) representation; latent features are non-negative.
- Intuition: explaining a face as a combination of features like "eyes," "nose," and "mouth"; each feature (basis vector) contributes positively to the final image.
IV. Dimensionality Reduction for Data
Visualization: Manifold Learning
IV.1 Manifold Learning
- PCA for Visualization (Limitations):
- PCA is a common approach for dimensionality reduction, often used to create
scatter plots for visualization.
- However, its effectiveness for visualization can be limited, as seen with the
"Labeled Faces in the Wild" dataset (not shown here).
- PCA works by rotating the data and discarding components with lower variance. While this captures the most important information, it might not always preserve the relationships between data points in a way that is well suited for visualization in a low-dimensional space (like a 2D scatter plot).
- Manifold learning algorithms offer a more powerful approach for visualization
compared to PCA.
- They aim to find a lower-dimensional representation of the data that better
preserves the relationships between data points in the original high-
dimensional space.

- This can lead to more informative and visually separable clusters in the lower-
dimensional space.
- Allow for much more complex mappings and often provide better
visualizations.
- Learn underlying “manifold” structure, use for dimensionality reduction
IV.2 Pros and cons
- For visualization only
- Axes don’t correspond to anything in the input space.
- Often can’t transform new data.
- Pretty pictures!

IV.3 t-distributed stochastic neighbour embedding (t-SNE)
- a specific manifold learning algorithm.
- t-SNE aims to find a two-dimensional representation of the data that
preserves the distances between points as best as possible, focusing on
keeping close neighbors close and distant points far apart.
- This can be particularly useful for visualizing datasets where the data points
lie on a lower-dimensional manifold (a curved surface) within a higher-
dimensional space.

- Starts with a random embedding


- Iteratively updates points to make “close” points close.
- Global distances are less important, neighbourhood counts.
- Good for getting coarse view of topology.
- Can be good for finding interesting data points.
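- A minimal sketch of t-SNE in scikit-learn (assuming a feature matrix X, e.g. flattened images; the perplexity value is just a common default choice):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)   # note: no transform() for new data afterwards

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=5)
plt.title("t-SNE embedding")
plt.show()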
LECTURE 12- CLUSTERING
- Another type of unsupervised learning

I. Clustering
- An alternative to parametric methods for density estimation.
- Parametric methods assume a known distribution for the data, like Gaussian.
- Clustering relaxes the assumption of a single distribution and allows for a
mixture of distributions.
- This is useful when the data has multiple groups, like different writing styles
for the digit 7.
- Semiparametric approach assumes a parametric model for each group in the
data.
- Nonparametric approach, makes no assumptions about the data structure.
⁃ Parametric approach:
⁃ Assumes sample comes from a known distribution.
⁃ Typically assumes a parametric family, like Gaussian.
⁃ Problem reduces to estimating a small number of parameters.
⁃ Limitations of parametric approach:
⁃ Rigid parametric models can introduce bias.
⁃ Not suitable for applications where data doesn't fit assumptions.
⁃ Introduction to clustering:
⁃ Offers a semiparametric approach for situations where strict parametric
assumptions don't hold.
⁃ Allows for learning mixture parameters from data.
⁃ Discusses probabilistic modeling, vector quantization, and hierarchical
clustering.
⁃ However, Clustering can be a hard problem for a number of reasons, even
though the concept itself seems straightforward:

o Determining the number of clusters: There's often no inherent "correct"


number of clusters in a dataset. It depends on the data itself and the
goal of the clustering.
o Data shape and distance metrics: Clustering algorithms often make assumptions about the data, like being spherical or having well-defined clusters. Real-world data can be messy and not fit these assumptions perfectly. Choosing the right distance metric to measure similarity between data points is also crucial.
o Evaluation: Unlike classification where you have clear labels, evaluating the quality of clustering is subjective. There's no single metric to definitively say a clustering is good or bad.
⁃ Clustering Overview:
⁃ Clustering involves partitioning a dataset into groups called clusters.
⁃ The objective is to ensure similarity within clusters and dissimilarity between
clusters.
⁃ k-Means Clustering:
⁃ Simple and widely-used clustering algorithm.
⁃ Alternates between two steps: assigning data points to the nearest cluster
center and updating cluster centers based on the mean of assigned points.
⁃ Stops when cluster assignments no longer change significantly.
⁃ Algorithm Steps:
⁃ Initialization: Randomly select K data points as initial cluster centers.
⁃ Assignment Step: Assign each data point to the closest cluster center.
⁃ Update Step: Recalculate cluster centers as the mean of all data points
assigned to each cluster.
⁃ Convergence: Repeat steps 2 and 3 until cluster assignments stabilize.
⁃ Visualization:
⁃ Cluster centers represented as triangles, data points as circles, with colors
indicating cluster membership.
⁃ Boundaries of cluster centers illustrate how data is partitioned into clusters.
⁃ Implementation with scikit-learn:
⁃ Instantiate the KMeans class and specify the number of clusters.
⁃ Fit the model to the data using the fit method.
⁃ Cluster Memberships:
⁃ Each data point is assigned a cluster label, accessible via the labels_
attribute.
⁃ Predict method can be used to assign cluster labels to new data points.
⁃ Interpretation:
⁃ Clustering is similar to classification but lacks ground truth labels.
⁃ Cluster labels are arbitrary and don't have inherent meaning.
⁃ Interpretation relies on examining the characteristics of data points within
each cluster.
⁃ Cluster Centers:
⁃ Cluster centers are representative points of each cluster, stored in the
cluster_centers_ attribute.
⁃ Displayed as triangles in visualizations.
⁃ Number of Clusters:
⁃ The number of clusters K needs to be specified a priori.
⁃ Different values of K can result in different cluster assignments.
⁃ Visualization of Different Cluster Configurations:
⁃ Varying the number of clusters results in different cluster assignments.
⁃ Plotting cluster assignments for different values of K illustrates how the data is
grouped into clusters.

I.2 Goals of clustering


- Data exploration
• Are there coherent groups?
• How many groups are there?
- Data partitioning
• Divide data by group before further processing.
- Unsupervised feature extraction
• Derive features from clusters or cluster distances.
- E.g. clustering techniques:
o K-means, Hierarchical Clustering, Density-Based Techniques, Gaussian Mixture Models

I.3 K-means clustering


- K-means clustering separates samples into k groups of equal variance

- Requires number of clusters to be specified

⁃ Definition: K-means clustering is a popular unsupervised machine learning


algorithm used for partitioning a dataset into K distinct, non-overlapping clusters.
Each cluster is represented by its centroid, which is the mean of all data points
assigned to the cluster.
⁃ Objective: The main goal of k-means clustering is to minimize the within-
cluster variance, also known as inertia or distortion. It aims to find clusters that are
compact and well-separated from each other.
⁃ Algorithm:
⁃ Initialization: Randomly initialize K centroids.
⁃ Assignment Step: Assign each data point to the nearest centroid, forming K
clusters.
⁃ Update Step: Recalculate the centroids by taking the mean of all data points
assigned to each cluster.
⁃ Convergence: Repeat steps 2 and 3 until convergence, i.e., when the
centroids no longer change significantly or a predefined number of iterations is
reached.
⁃ Distance Metric: Euclidean distance is commonly used to measure the
similarity between data points and centroids. However, other distance metrics like
Manhattan distance or cosine similarity can also be used depending on the nature of
the data.
⁃ Number of Clusters (K): The number of clusters K needs to be specified
beforehand, which can be challenging as it often requires domain knowledge or trial
and error. Various techniques, such as the elbow method or silhouette score, can
help determine the optimal K.
⁃ Initialization Methods:
⁃ Random initialization: Randomly select K data points as initial centroids.
⁃ K-means++ initialization: Select initial centroids that are far apart from each
other to improve convergence speed and clustering quality.
⁃ Pros:
⁃ Simple and easy to implement.
⁃ Efficient for large datasets.
⁃ Scales well with the number of data points.
⁃ Cons:
⁃ Sensitive to initial centroids, may converge to suboptimal solutions.
⁃ Assumes clusters are spherical and of equal size, which may not hold true for
all datasets.
⁃ Requires the number of clusters K to be specified a priori.
⁃ Applications:
⁃ Customer segmentation in marketing.
⁃ Image compression and color quantization.
⁃ Document clustering in natural language processing.
⁃ Anomaly detection in cybersecurity.
⁃ Recommender systems in e-commerce
Interpretation: After clustering, the centroids can be interpreted as representative
points of each cluster, aiding in understanding the characteristics and differences
between clusters.

I.4 K-means algorithm


1. Choose the number of clusters, K

2. Randomly choose initial positions of K centroids


3. Assign each of the points to the "nearest centroid" (depending on the distance
measure)

4. Recompute centroid positions

5. If solution converges -> Stop, else go to step 3.
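- A minimal sketch of these steps with scikit-learn (assuming feature matrices X and X_new; variable names are illustrative):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)                        # steps 2-5: initialize, assign, update, repeat

labels = kmeans.labels_              # cluster membership of the training points
centers = kmeans.cluster_centers_    # final centroid positions
new_labels = kmeans.predict(X_new)   # assign new points to the nearest centroid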

I.5 Objective function for K-means

- K-means finds a local minimum of the sum of squared distances between points and their assigned cluster centers (see the objective below).
- New data points can be assigned cluster membership based on the existing cluster centers.
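- In standard notation (with \mu_k the centroid of cluster C_k), the objective K-means locally minimizes is

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2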
I.6 Restriction of Cluster Shapes
- Clusters are Voronoi-diagrams of centers.

- Clusters are always convex in space

I.7 Computational Properties


- By default K-means in sklearn does 10 random restarts with
different initializations.
- For large datasets, K-means initialization may take much longer
than clustering.

- Consider using random initialization (init="random"), in particular for MiniBatchKMeans
o Uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function (partial_fit)

I.8 MiniBatchKMeans
- Mini-batches are subsets of the input data, randomly sampled in each training iteration.
- Algorithm:
1. Draw samples randomly from the dataset to form a mini-batch, and assign each to its nearest centroid.
2. Update the centroids using a convex combination of the average of the samples in the mini-batch and the previous samples assigned to that centroid.
3. Perform steps 1 and 2 until convergence or for a fixed number of iterations.
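- A minimal sketch of MiniBatchKMeans with partial_fit (assuming the data is available as an iterable of sample chunks called data_chunks, which is an illustrative name):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=5, random_state=0)
for chunk in data_chunks:      # each chunk is one randomly drawn mini-batch
    mbk.partial_fit(chunk)     # assign the mini-batch and update the centroids

labels = mbk.predict(X)        # final cluster assignments for the full dataset X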

I.9 Feature extraction using K-means


- Cluster membership → categorical feature
- Cluster distances → continuous feature
- Examples:
• partitioning low-dimensional space (similar to using basis functions)
• extracting features from high-dimensional spaces
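- A minimal sketch of both kinds of derived features (assuming a feature matrix X):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

membership = kmeans.labels_      # categorical feature: cluster id of each sample
distances = kmeans.transform(X)  # continuous features: distance to each of the 10 centers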

II. Hierarchical clustering


- Hierarchical clustering is a technique that focuses solely on the similarities
between instances in a dataset, without relying on any specific assumptions
about the underlying data distribution or structure. The primary objective is to
group together instances that are more similar to each other while keeping
dissimilar instances in separate groups.

- Hierarchical clustering: a series of partitions, from a single cluster containing all the data points down to N clusters containing 1 data point each.

II.1 Agglomerative Hierarchical clustering


Agglomerative clustering is a hierarchical clustering technique that iteratively merges the most similar clusters until a stopping criterion is met, typically the desired number of clusters. A breakdown of the key points about agglomerative clustering:
⁃ Basic Principle:
⁃ Each data point initially forms its own cluster.
⁃ At each iteration, the two most similar clusters are merged based on a chosen
linkage criterion.
⁃ The process continues until the specified number of clusters is reached.
⁃ Linkage Criteria:
⁃ Ward: Minimizes the within-cluster variance increase when merging clusters,
often resulting in equally sized clusters.
⁃ Average: Merges clusters with the smallest average distance between all
their points.
⁃ Complete: Merges clusters with the smallest maximum distance between
their points.
⁃ Progression of Clustering:
⁃ Initially, each data point is its own cluster.
⁃ At each step, the two closest clusters are merged.
⁃ The process continues until the desired number of clusters is achieved.
⁃ Performance in scikit-learn:
⁃ Implemented in scikit-learn as AgglomerativeClustering.
⁃ Use fit_predict method to build the model and get cluster memberships on
the training set.
⁃ Unlike k-means, agglomerative clustering cannot make predictions for new
data points.
⁃ Visualization:
⁃ Dendrogram representation illustrates the hierarchical structure of clusters.
⁃ Each level in the dendrogram represents a possible clustering solution.
⁃ Choosing the Number of Clusters:
⁃ The number of clusters needs to be specified beforehand.
⁃ Techniques such as dendrogram visualization or silhouette scores can aid in
determining the optimal number of clusters.

Overall, a dendrogram visualizes the process of hierarchical clustering, where data points are merged together based on their similarity to form a hierarchy of clusters.
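- A minimal sketch of the scikit-learn usage described above, plus a SciPy dendrogram of the same data (assuming a small feature matrix X):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)    # unlike k-means, there is no predict() for new points

linkage_matrix = ward(X)       # pairwise merge history used to draw the dendrogram
dendrogram(linkage_matrix)
plt.show()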

II.2 Agglomerative Hierarchical Clustering Techniques


- Start with N independent clusters: {P1}, {P2},...,{PN}

- Find the two closest (most similar) clusters, and join them

- Repeat step 2 until all points belong to the same cluster

- Dendrograms visualize the arrangement of the clusters produced

- First figure
- Data points are numbered 0 to 11 at the bottom.
- The dendrogram shows how these points are merged into clusters.
- For example, points 1 and 4 are merged first, followed by points 6 and 9, and
so on.
- The longest branches (marked by a dashed line) indicate merging very
dissimilar clusters into 3 groups.

- The topmost level shows the two final clusters.

- Each row shows a dendrogram, which is a tree-like structure that depicts how
data points are hierarchically clustered. In the dendrogram, vertical lines
represent data points, and horizontal lines show merging steps that combine
clusters. The shorter the horizontal line, the more similar the merged clusters
are.

- The rightmost column of the dendrogram shows how many data points are in
each cluster.

- Second figure

- A dendrogram is a tree-like diagram used to visualize hierarchical clustering.


- It shows how data points are grouped into a hierarchy of clusters.
- Circles represent data points, with lines connecting them indicating how
similar the clusters they belong to are.
- Shorter lines show more similar clusters, while longer lines show less similar
clusters.
- The top of the dendrogram shows the number of clusters at the highest level
(e.g., "3" ).

- You can "cut" the dendrogram at any level to define a specific number of
clusters.

II.3 Agglomerative Hierarchical Techniques


- Single linkage clustering: distance between the closest pairs of points
- Complete linkage clustering: distance between the farthest pairs of points
- Average linkage clustering: mean distance of all mixed pairs of points
- Ward (default in sklearn): minimizes the sum of squared differences within all clusters; leads to more equally sized clusters.
II.4 Agglomerative Hierarchical Clustering

- The figure shows a hierarchical clustering dendrogram, which is indeed


influenced by two main factors.
- The two factors influencing the hierarchical clustering dendrogram are:
- Distance metric: This metric determines how similarity or dissimilarity
between data points is measured. Common distance metrics include
Euclidean distance, Manhattan distance, and Jaccard distance. The chosen
metric will affect how the data points are grouped together and the overall
structure of the dendrogram.
- Number of clusters: This is the desired number of final clusters in the data.
There isn’t always a predefined optimal number of clusters, and it can be
determined using various methods like the silhouette coefficient or the elbow
method.

- The dendrogram itself doesn’t tell the optimal number of clusters. It shows all
possible mergings of data points into clusters at different distances. You can
choose a cutoff point on the dendrogram based on your chosen criteria to
identify the desired number of clusters
II.5 Pros and Cons
Pros:
⁃ Topology Flexibility: Hierarchical clustering can accommodate various types of input topologies, including those defined by graphs such as neighborhood graphs. This flexibility allows for the clustering of data with complex relationships and structures.
⁃ Efficiency with Sparse Connectivity: Hierarchical clustering algorithms can be efficient when dealing with datasets characterized by sparse connectivity, where only a subset of data points are connected. This makes it suitable for data with irregular or non-uniform distributions.
⁃ Holistic View: Hierarchical clustering provides a holistic view of the data by capturing the hierarchical relationships between clusters at different levels of granularity. This comprehensive perspective can aid in understanding the underlying structure of the data and can assist in making informed decisions about the number of clusters.

Cons:
⁃ Imbalanced Cluster Sizes: Depending on the chosen linkage criteria, hierarchical clustering algorithms may lead to imbalanced cluster sizes. Some linkage methods tend to favor the formation of clusters with uneven numbers of data points, which can impact the interpretability and usability of the clustering results.
⁃ Computational Complexity: Hierarchical clustering algorithms can be computationally intensive, especially for large datasets or when using certain linkage criteria that involve pairwise distance computations. This complexity can result in longer processing times and may limit scalability for very large datasets.
⁃ Subjectivity in Interpretation: The hierarchical nature of clustering can sometimes make it challenging to interpret the results objectively. Deciding on the appropriate level of granularity and determining the optimal number of clusters can be subjective and may require expert judgment, potentially leading to biased interpretations.
Overall, while hierarchical clustering offers advantages such as flexibility in handling
various data topologies and providing a holistic view of the data, it also has
limitations related to computational complexity, potential for imbalanced cluster
sizes, and subjectivity in interpretation. These factors should be carefully considered
when applying hierarchical clustering methods in practice.

II. Density-Based Clustering Methods


II.1 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

- DBSCAN is a density-based spatial clustering of applications with noise


algorithm, which is a type of clustering algorithm used to identify groups
(clusters) within a dataset.
- The two factors that influence the results of DBSCAN are:
- eps: This parameter defines the radius of a neighborhood around a data
point. Points that are within this radius are considered neighbors.
- min_samples: This parameter determines the minimum number of points that
must be within the eps radius of a point for it to be considered a core
point. Core points are points that are densely surrounded by other points.

- The accompanying figure shows four plots, each giving the result of DBSCAN with a different combination of eps and min_samples values. The points are colored according to their cluster assignment:
- Solid colors: Points that belong to clusters
- White: Noise points (points that are not assigned to any cluster)
- Large markers: Core samples
- Small markers: Boundary points (points that are within the eps radius of a
core point but are not core points themselves)
- Here's a breakdown of each plot:
- Top left (eps=1.0, min_samples=2): Most of the points are classified as
noise because the eps value is too small and there are not enough points
within that radius to be considered core points.
- Top right (eps=1.5, min_samples=2): Two clusters are formed, but there are
still some noise points.
- Bottom left (eps=2.0, min_samples=2): All points are now clustered, but
some clusters have merged together. This is because the eps value is too
large, causing more points to be considered neighbors.
- Bottom right (eps=3.0, min_samples=2): All points are again clustered, but
this time into a single cluster. This is because the eps value is very
large, causing all points to be considered neighbors.

- As you can see, the choice of eps and min_samples can significantly affect
the clustering results. It is important to find the right balance between these
parameters to capture the desired number and shapes of clusters in your
data.

- Density = number of sample points within a specified radius r (epsilon)
- Core point: sample with more than a specified number of points (min_samples) within epsilon (includes samples inside the cluster)
- Border point: has fewer than min_samples within epsilon, but is in the neighborhood of a core point
- Noise point: any point that is not a core point or a border point.

- Finds core samples of high density and expands clusters from them.

- A sample is a "core sample" if more than min_samples points lie within epsilon - a "dense region"

- Steps:

1. Start with a core sample


2. Recursively find neighbors that are core-samples and add
to cluster.

3. Also add samples within epsilon that are not core


samples (but don’t recurse)

4. If no other points are “reachable”, pick another core


sample, start new cluster.

5. Remaining points are labeled outliers

- Allows complex cluster shapes

- Can detect outliers

- Needs two parameters to adjust; epsilon is hard to pick (it can be chosen based on the desired number of clusters, though).

- Can learn arbitrary cluster shapes

- Limitations

o Varying densities

o High-dimensional data
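- A minimal sketch of DBSCAN in scikit-learn (assuming a scaled feature matrix X); noise points receive the label -1 and core samples can be inspected separately:

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                         # cluster ids per sample; -1 marks noise
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True   # True only for core samples
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)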

Pros:
⁃ Automatic Determination of Clusters: DBSCAN does not require the user to specify the number of clusters beforehand. Instead, it identifies clusters based on the density of data points in the feature space. This makes it particularly useful for datasets where the number of clusters is not known a priori.
⁃ Ability to Capture Complex Shapes: Unlike some other clustering algorithms, DBSCAN can identify clusters of arbitrary shapes and sizes. This flexibility allows it to effectively handle datasets with non-linear boundaries and irregularly shaped clusters.
⁃ Robust to Noise: DBSCAN can distinguish between points that belong to clusters and outliers, or noise points. Points that do not belong to any cluster are labeled as noise, providing a clear indication of data points that do not fit well into any cluster.
⁃ Scalability: Despite being slightly slower than some other clustering algorithms like k-means, DBSCAN is still scalable and can handle relatively large datasets efficiently.

Cons:
⁃ Sensitive to Parameters: The performance of DBSCAN can be sensitive to the choice of its two main parameters: eps (epsilon) and min_samples. Selecting appropriate values for these parameters can require some experimentation and domain knowledge.
⁃ Difficulty with Varying Density: DBSCAN may struggle with datasets where the density of data points varies significantly across the feature space. In such cases, choosing a suitable value for the epsilon parameter can be challenging, as it needs to capture the density variations appropriately.
⁃ Memory and Computational Requirements: While DBSCAN is generally efficient, it can be memory-intensive for large datasets, especially when dealing with high-dimensional data. Additionally, the computational complexity of DBSCAN may increase significantly as the dataset size grows.
⁃ Handling High-Dimensional Data: Like many clustering algorithms, DBSCAN's performance can degrade in high-dimensional spaces due to the curse of dimensionality. Preprocessing techniques such as dimensionality reduction may be needed to mitigate this issue.

Despite these limitations, DBSCAN remains a popular choice for clustering tasks,
especially when dealing with datasets where the number of clusters is not known in
advance or when clusters exhibit complex shapes and densities.

II.2 DBSCAN: Core, Border and Noise Points

- Here’s a breakdown of the key aspects of the image:


- Parameters:
o eps (epsilon): This parameter defines the radius of a neighborhood
around a data point. Points that fall within this radius are considered
neighbors of the data point.
o min_samples: This parameter determines the minimum number of
points that must be within the eps radius of a data point to be
considered a core point. Core points are points that are densely
surrounded by other points.

- Types of points:
o Solid colors: Represent points that belong to clusters.
o White: These are noise points, which means they are not assigned to
any cluster.
o Large markers: These indicate core points.
o Small markers: These represent boundary points, which are points
that are within the eps radius of a core point but are not core points
themselves.

- The image shows four different plots, each representing the result of running
DBSCAN with a different combination of eps and min_samples values. By
looking at the way the points are clustered and colored in each plot, we can
see the impact of these different parameter settings.
- Here’s a more detailed explanation of each plot:
- Top left (eps=1.0, min_samples=2): In this plot, most of the data points are
classified as noise (white) because the eps value (1.0) is too small. This
means that the neighborhood radius around each point is very small, and
there aren’t enough points within that radius to satisfy the minimum samples
requirement (2) to be considered a core point.
- Top right (eps=1.5, min_samples=2): Here, we can see the formation of two
distinct clusters (solid colors), but there are still some noise points (white)
present. This is because while increasing the eps value (to 1.5) allows more
points to be considered neighbors, it’s not enough to capture all the dense
regions in the data.
- Bottom left (eps=2.0, min_samples=2): In this plot, all the data points are
clustered (solid colors), but some clusters appear to have merged
together. This is because the eps value (increased to 2.0) is now too
large, causing too many points to be considered neighbors, even those that
belong to separate dense regions.
- Bottom right (eps=3.0, min_samples=2): Here, all the data points are again
clustered, but this time into a single large cluster (solid color). This is because
the eps value (3.0) is very large, causing almost all points to be considered
neighbors, effectively ignoring the presence of multiple dense regions in the
data.

- By observing these plots, we can see how the choice of eps and min_samples
can significantly affect the clustering results in DBSCAN. It’s important to find
the right balance between these parameters to identify the desired number
and shapes of clusters within your data.
- Key points about the image:
o Points that are part of clusters are shown in solid colors.
o Noise points, which are points not assigned to any cluster, are shown
in white.
o Large markers represent core points, which are densely surrounded by
other points.
o Small markers represent boundary points, which are located close to
core points but aren’t considered core points themselves.
o As the value of eps increases (left to right), more points are included in
clusters, potentially causing multiple clusters to merge into one.
o As the value of min_samples increases (top to bottom), fewer points
are classified as core points, and more points are labeled as noise.

- Overall, the figure depicts how the selection of eps and min_samples can
influence the results of DBSCAN clustering

III. Mixture Models


- Generative model – find p(X).

- Mixture model assumption:
• Data is a mixture of a small number of known distributions.
• Each mixture component distribution can be learned "simply".
• Each point comes from one particular component.
- We learn the component parameters and the weights of the components.

III.1 Gaussian Mixture Models


- Each component is created by a Gaussian distribution

- There is a multinomial distribution over the components

- Non-convex optimization.

- Alternately assign points to components and compute mean and


variance (EM algorithm).

- Initialized with K-means, random restarts.
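- A minimal sketch of a Gaussian mixture model in scikit-learn (assuming a feature matrix X), showing the soft assignments and the density estimate:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                          # EM: alternate soft assignment and parameter updates

hard_labels = gmm.predict(X)        # most likely component per point (clustering)
soft_probs = gmm.predict_proba(X)   # probability of each component per point
log_density = gmm.score_samples(X)  # how "likely" each point is under the fitted model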

III.2 Goals of Mixture Models


- Create parametric density model.

- Allows for testing how “likely” a new point is.

- Clustering (each component is one cluster).

III.3 GMM vs K-Means

GMM:
⁃ Assumption of Distribution: GMM assumes that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. It allows for clusters of different shapes and sizes by modeling the data as a combination of these Gaussian distributions.
⁃ Soft Clustering: GMM performs soft clustering, which means that it assigns a probability to each data point belonging to each cluster rather than assigning it to a single cluster. This makes GMM more flexible in capturing the uncertainty and overlap between clusters.
⁃ Cluster Shape: GMM can model clusters with different shapes, including elongated or elliptical shapes, by adjusting the covariance matrix of the Gaussian distributions.
⁃ Complexity: GMM is more computationally complex compared to k-means, especially when dealing with high-dimensional data or a large number of clusters.
⁃ Number of Clusters: GMM does not require specifying the number of clusters a priori, but it needs to be initialized with an initial guess of the number of components. Techniques like the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) can be used to estimate the optimal number of components.

K-Means:
⁃ Assumption of Similarity: k-means assumes that the data can be partitioned into k clusters, each represented by its centroid. It minimizes the sum of squared distances between data points and their respective cluster centroids.
⁃ Hard Clustering: k-means performs hard clustering, meaning that each data point is assigned to exactly one cluster, with no notion of uncertainty or probability.
⁃ Cluster Shape: k-means assumes that clusters are spherical and isotropic, which means that it may struggle with clusters of non-spherical shapes or varying sizes.
⁃ Simplicity: k-means is computationally simpler and more efficient compared to GMM, making it suitable for large datasets and high-dimensional data.
⁃ Number of Clusters: k-means requires specifying the number of clusters (k) beforehand, which can be a drawback if the optimal number of clusters is not known in advance. Techniques like the elbow method or silhouette score can help in choosing an appropriate value for k.

⁃ Use GMM when the underlying distribution of the data is not well
approximated by spherical clusters, and when there is uncertainty or overlap
between clusters.
⁃ Use k-means when the data is well-separated and can be partitioned into
spherical clusters, and when computational efficiency is a concern.

III.4 How do we evaluate the clustering result?


- Elbow plot

- Silhouette Coefficient

- And many others

A) Elbow plot

- An elbow plot is a graphical tool used in k-means clustering to determine the optimal number of clusters (k) for your data. It visualizes the explained variance or within-cluster sum of squares (WCSS) across different values of k.
- X-axis: Represents the number of clusters (k)
- Y-axis: Represents the explained variance (often a percentage) or WCSS
- The Elbow: Ideally, the plot should have a distinct "elbow" shape. The
number of clusters corresponding to the point where the curve sharply bends
downwards is considered the optimal k value.
- Here's why the elbow helps us choose k:
- Initially, as we increase k (number of clusters): the WCSS will decrease significantly (equivalently, the explained variance will increase). This is because splitting the data into more clusters captures more of the data's variance within each cluster.
- After a certain point: The decrease in WCSS (or explained variance)
becomes less significant with each additional cluster. This indicates that
adding more clusters isn't providing much benefit in terms of explaining the
data's variance.

- The elbow point essentially represents the point where the additional benefit of creating new clusters starts to diminish. It's a trade-off between:
- Capturing more variance: More clusters might capture more specific
patterns in the data.
- Model complexity: Having too many clusters can lead to overfitting and a
less generalizable model.

- Here are some additional points to consider about elbow plots:


- The "elbow" can sometimes be subjective, and the optimal k value might not
always be very clear-cut.
- Elbow plots work best when the data has well-defined clusters.
- Other methods like silhouette analysis can be used alongside elbow plots to
strengthen the choice of k.

- Computes the sum of squared distance (SSE) between data points and their
assigned clusters’ centroids.

- Pick the desired # of clusters at the spot where the SSE starts to flatten out and forms an elbow.
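- A minimal sketch of an elbow plot using scikit-learn's inertia_ attribute (the SSE described above), assuming a feature matrix X:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster SSE (inertia)")
plt.show()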

B) Silhouette Coefficient

- The silhouette coefficient is a metric used to evaluate the quality of clustering


results in various clustering algorithms, including k-means clustering. It considers both the cohesion (how similar points are within a cluster) and the separation (how dissimilar points are between clusters) of the data points.
- Here's a breakdown of the silhouette coefficient:
- Range: It ranges from -1 to +1.
o Values closer to +1: Indicate a good clustering, where a point is well-
matched to its assigned cluster and poorly matched to points in
neighboring clusters.
o Values closer to 0: Suggest that clusters might be overlapping or
indistinct.
o Values closer to -1: Indicate that a point may be misclassified and
placed in the wrong cluster.

- Interpretation:
- A higher average silhouette coefficient across all data points indicates a better
clustering solution.
- It helps identify potential outliers or misclassified points that might have a low
silhouette coefficient.

- Here are some additional points about the silhouette coefficient:


- It's a versatile metric that can be used with various clustering algorithms.
- It's not perfect and can be computationally expensive for large datasets.
- It's often used alongside other methods like elbow plots to get a more
comprehensive picture of clustering quality.

- A function S that measures the separation between two clusters, c1 and c2.

- How can we measure the goodness of a clustering C = c1, ... cl, using the
separation function S?

- For an individual point, i


• a = average distance of i to the points in the same cluster
• b = average distance of i to the points in the closest cluster
- For each sample:
• Compute the average distance to all data points in the same cluster (ai).
• Compute the average distance to all data points in the closest other cluster (bi).
• Compute the coefficient si = (bi − ai) / max(ai, bi).
- The coefficient can take values in the interval [-1, 1].
• If close to 0: the sample is very close to the neighboring clusters.
• If close to 1: the sample is far away from the neighboring clusters.
• If close to -1: the sample has probably been assigned to the wrong cluster.
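- A minimal sketch of computing the silhouette coefficient with scikit-learn (assuming a feature matrix X and cluster labels produced by any clustering algorithm):

from sklearn.metrics import silhouette_samples, silhouette_score

avg_score = silhouette_score(X, labels)     # mean coefficient over all samples
per_sample = silhouette_samples(X, labels)  # one coefficient per data point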
