https://www.simplilearn.com/tutorials/machine-learning-tutorial

Machine learning: the branch of AI, based on the concept that machines and systems
can analyze and understand data, and learn from it and make decisions with minimal
to zero human intervention.
Most industries and businesses working with massive amounts of data have recognized
the value of machine learning technology. By culling insights from this data,
businesses are able to work more efficiently and gain an advantage over others.
The point is to learn from data

feature vector: an n-sized vector of numbers that represents an object. For example,
if you are analyzing apples you have color, size, flavour, etc.
samples: the items to process
feature space, aka input space, aka feature set, refers to the set of all possible
features that can be used to represent a data point. It is the collection of features
used to characterize your data: the n dimensions where our variables live.
The feature space is a vector space that represents all the possible combinations
of features that can be used to describe a data point. Each feature corresponds to
a dimension in this space, and the value of each dimension corresponds to the value
of the corresponding feature for a particular data point.
labeled data: data with a known classification result

MLlib is a ML library for Spark. It has 2 API groups:


- primary APIs, the original ones
- high-level APIs, useful to create ML pipelines

Steps are:
1) data ingestion
2) cleaning and transformation
3) model training
4) model testing
5) model deployment and integration

What are Tensors?


In mathematics and computer science, a tensor is a generalization of scalars,
vectors, and matrices that can represent higher-dimensional data. Tensors can be
thought of as multi-dimensional arrays or data structures that organize and store
data of various ranks and dimensions.
- Scalar == Rank 0 Tensor
- Vector == Rank 1 Tensor
- Matrix == Rank 2 Tensor
- Higher-Rank Tensors == Rank N Tensor
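
A minimal sketch of these ranks with NumPy (assuming only the numpy library; the
arrays are made up for illustration). Each object's ndim attribute is its rank:

import numpy as np

scalar = np.array(5.0)                        # rank 0 tensor
vector = np.array([1.0, 2.0, 3.0])            # rank 1 tensor
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # rank 2 tensor
rank3 = np.zeros((2, 3, 4))                   # rank 3 tensor (a stack of matrices)

for t in (scalar, vector, matrix, rank3):
    print(t.ndim, t.shape)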

k-nearest neighbors:
is a basic example. The k-nearest neighbor (k-NN) algorithm is a non-parametric
machine learning algorithm used for classification and regression. The "k" in k-NN
refers to the number of nearest neighbors to be considered for making a prediction.
The k-NN algorithm works by finding the k nearest data points in the training set
to a new data point and using the majority class of those k neighbors to make a
prediction about the class of the new data point

Supervised learning:
features + labels. You know what the data is. It contains classification and regression.
Do you want to predict a category? Then it is classification.
Do you want to predict the value of a stock? If it is just UP or DOWN then it is again
classification, else regression.
Do you want to predict a quantity? Then it is regression.
Do you want to find out if there is anything strange, like credit card fraud? Then it
is anomaly detection.

Unsupervised learning:
features but no labels.
Unsupervised learning enables systems to identify patterns within datasets that are
otherwise unlabeled or unclassified. There are numerous applications of unsupervised
learning, with some common examples including recommendation systems, product
segmentation, dataset labeling, customer segmentation, and similarity detection.
Do you want to find structure in unexplored data? Then it is clustering.
Clustering and Association are slightly different:
1) Clustering -> split objects into clusters that are similar inside the same cluster
and different across clusters
2) Association -> get the probability of co-occurrence of 2 objects in the same
collection. What is the probability of "if you bought A on Amazon you'll also buy B"?

Reinforcement learning:
there is the concept of feedback; the machine can learn from errors.
Reinforcement learning is also frequently used in different types of machine
learning applications. Some common applications of reinforcement learning
include industrial automation, self-driving car technology, applications that use
Natural Language Processing, robotics manipulation, and more. Reinforcement
learning is used in AI across a wide range of industries, including finance,
healthcare, engineering, and gaming.

Classification:
- decision tree -> you know what it is
- naive bayes -> based on conditional probability, good if the complexity of the
input is high
Let's do a spam mail example (a small worked example follows this list):
1) we create a histogram of words with how common they are in normal messages.
Imagine that the probability of seeing the word "dear" GIVEN it is a normal message
is 0.34
2) we create another histogram with the words of scam messages, for example the
probability of "money" GIVEN it is scam is 0.44, and "dear" is 0.02
3) now we receive a message starting with "dear friend". We have to decide the
probability that it is scam or normal given that we see the word "dear". With Bayes'
theorem you can invert the relation, computing P(A|B) from P(B|A).

- random forest
- logistic regression
- KNN -> k nearest neighbors.. you know what it is
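
Here is the worked example mentioned above, as a small Python sketch. The likelihoods
come from the histograms in the text; the priors (how common scam is overall) are
made-up assumptions just for illustration:

p_dear_given_normal = 0.34   # from the "normal messages" histogram
p_dear_given_scam = 0.02     # from the "scam messages" histogram
p_normal = 0.7               # assumed prior, not from the text
p_scam = 0.3                 # assumed prior, not from the text

# Bayes' theorem: P(scam | "dear") = P("dear" | scam) * P(scam) / P("dear")
p_dear = p_dear_given_scam * p_scam + p_dear_given_normal * p_normal
p_scam_given_dear = p_dear_given_scam * p_scam / p_dear
p_normal_given_dear = p_dear_given_normal * p_normal / p_dear

print(p_scam_given_dear, p_normal_given_dear)
# With these numbers the message is far more likely to be normal than scam.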

Regression
- linear regression -> to estimate linear relations between variables. Here a
variable is dependent on one or more independent variables. Example in 2D: draw a
line that minimizes the squared distance from the points

Clustering - used a lot also in search engines


- kmeans

Feature selection:
we have a lot of data, suppose too much. Not all of it is really useful.
Feature selection can:
- reduce overfitting (avoid learning noise)
- improve accuracy
- reduce training time
Feature selection
- Supervised
Supervised feature selection is a type of feature selection technique that takes
into account the target variable or output label in the selection process. It
involves selecting a subset of input features that are most relevant to the target
variable or output label, based on their predictive power and ability to improve
the performance of a machine learning model.
- intrinsic -> a mix of wrapper and filter. Examples are lasso regularization and
decision trees
- wrapper method -> we train the model on different subsets of features to decide
which ones work best. Examples are recursive feature elimination and genetic
algorithms
- filter method -> filter columns based on how much they are CORRELATED to the
output. Examples are the Pearson coefficient, chi-squared, and ANOVA (see the
sketch after this list)
- Unsupervised
Here we don't use the label but we check if and how the variables are correlated
- PCA (more information later in this doc)
- others
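
A minimal sketch of a filter method with scikit-learn (assuming its built-in breast
cancer toy dataset; k=10 is an arbitrary choice). SelectKBest keeps the features most
related to the label according to the ANOVA F-score:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the highest ANOVA F-score with respect to the label.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # (569, 30) -> (569, 10)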

dark side of feature selection, both supervised and unsupervised:


sometimes you just lose information

in general based on kind of input and output you can select your feature selection
model. And guess what? In the end you try multiple ones and you see what happens

Linear regression:
Numericals: are numbers
Categoricals: Color, Type, etc
find the right variables that influence the result
residual error: the distance between the line and the actual point. You are actually
more interested in the squared residual errors summed up
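
A minimal sketch of fitting a line and summing the squared residuals, assuming some
made-up 2D points purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = LinearRegression().fit(x, y)
predictions = model.predict(x)

residuals = y - predictions        # distance between the line and each point
ssr = np.sum(residuals ** 2)       # squared residual errors summed up
print(model.coef_, model.intercept_, ssr)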

Categorical encoding:
Categorical encoding is the process of transforming categorical data, which
represents qualitative or nominal data, into numerical data that can be used for
machine learning models.
Categorical data refers to data that has distinct categories, such as colors,
genders, or cities. In contrast, numerical data represents quantities, such as age
or height, and can be used in mathematical operations.
There are several methods of categorical encoding, including one-hot encoding,
ordinal encoding, and binary encoding. One-hot encoding involves creating a binary
column for each category, where a 1 is placed in the corresponding column and 0s
are placed in the other columns. Ordinal encoding assigns a numerical value to each
category, such as assigning a value of 1 to the category that is considered the
smallest or the most important. Binary encoding involves converting the categorical
variable into binary digits, which can reduce the number of columns required for
the encoding (for example, in binary encoding "red" (1) is 001, "green" (2) is 010,
and "blue" (3) is 011). Binary encoding can be useful
for categorical features with a large number of categories, as it can reduce the
dimensionality of the data and avoid the high cardinality problem associated with
one-hot encoding. However, it may not be appropriate for categorical features with
a small number of categories.
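
A minimal sketch of one-hot and ordinal encoding with pandas, assuming a toy "color"
column; the integer values in the ordinal map are an arbitrary choice for illustration:

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: map each category to an integer.
ordinal = df["color"].map({"red": 1, "green": 2, "blue": 3})

print(one_hot)
print(ordinal)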

Classification:
multilabel classification: where each sample has a set of labels

Logistic regression:
is similar to linear regression, but the result is not a number, it is only true or
false!
Logistic regression is a statistical technique used to model the relationship
between a binary dependent variable and one or more independent variables. It is a
type of generalized linear model that uses a logistic function to model the
probability of the binary outcome.
In logistic regression, the dependent variable is binary, meaning it can take only
two values, 0 or 1. The independent variables can be either continuous or
categorical. The logistic regression model estimates the probability of the binary
outcome (1 or 0) for a given set of independent variables.
The logistic function used in logistic regression is also known as the sigmoid
function. It takes any input value and returns an output between 0 and 1, which
represents the probability of the binary outcome.
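
A minimal sketch of the sigmoid function and a binary classifier, assuming a tiny
made-up dataset with one feature:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any value to a probability between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-3), sigmoid(0), sigmoid(3))   # ~0.047, 0.5, ~0.953

# Toy binary problem: the label is 1 when the single feature is "large".
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))   # probabilities of class 0 and class 1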

Confusion matrix:
A confusion matrix is a table that summarizes the performance of a classification
algorithm by comparing the actual and predicted values of the target variable. it
could contain:
                       Actual positive        Actual negative
Predicted positive     True Positive (TP)     False Positive (FP)
Predicted negative     False Negative (FN)    True Negative (TN)
the confusion matrix is a simple and effective tool for evaluating the performance
of a classification algorithm and can provide valuable insights into the strengths
and weaknesses of the model.

Dummies:
Dummy variables are categorical variables that are converted into numerical
variables in order to be used as inputs in machine learning models. This conversion
is necessary because many machine learning models can only take numerical inputs.

SVM support vector machines:


the point here is to find a hyperplane in an N-dimensional space that separates the
classes
Support Vector Machines (SVMs) are a type of machine learning algorithm that can be
used for classification, regression, and outlier detection tasks.
At a high level, SVMs work by finding the best possible boundary (also known as a
hyperplane) between different classes of data. This boundary is selected in a way
that maximizes the margin, which is the distance between the boundary and the
nearest data points of each class.
In case of outliers we have a problem :) So what do we do?
If we allow only a single rigid boundary, called a hard margin, we cannot do anything.
But imagine having two soft margins around it that tolerate some misclassified points.
Spoiler: if you have non-linear data and you try to use a plane, that is a problem. But
in sklearn you can actually pass as a param not only "kernel = linear" but also
others. Here is a case where it is very important to set the configs. In general "rbf",
which is the default, is actually very smart (double ring example)
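
A minimal sketch of the double ring case, assuming scikit-learn's make_circles toy
dataset: a linear kernel struggles, while the default rbf kernel separates the rings
easily:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))
# Expect the linear kernel to score near chance and rbf near 1.0.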

In general neural networks do not like big numbers

Accuracy vs Precision vs Recall vs F1-score


Precision measures the proportion of true positives (correctly classified or
retrieved relevant items) among all the items that are classified or retrieved by
the system. In other words, precision measures the accuracy of the system in
identifying only the relevant items. Mathematically, precision is calculated as:
precision = true positives / (true positives + false positives)
In simple words: How many of the positives you gave me are really positives?

IMPORTANT:
Usually lower precision is more costly for the business than lower recall

On the other hand, recall measures the proportion of true positives among all the
relevant items in the data set. In other words, recall measures the completeness of
the system in identifying all the relevant items. Mathematically, recall is
calculated as:
recall = true positives / (true positives + false negatives)
In simple words: How many of the real positives you actually found?

A good system should have a high precision and a high recall. However, there is
often a trade-off between precision and recall, and the optimal balance between the
two will depend on the specific application and the priorities of the users.

Consider a program where you recognize dogs:


Upon processing a picture which contains 10 cats and 12 dogs, the program
identifies 8 dogs. Of the 8 elements identified as dogs, only 5 actually are dogs
(true positives), while the other 3 are cats (false positives). 7 dogs were missed
(false negatives), and 7 cats were correctly excluded (true negatives). The
program's precision is then 5/8 (true positives / selected elements) while its
recall is 5/12 (true positives / relevant elements).

Accuracy: Accuracy measures the overall correctness of the model's predictions,
calculated as the number of correct predictions divided by the total number of
predictions. Why accuracy is not always a good indicator:
It is a simple and intuitive metric, but it can be misleading in cases where the
dataset is imbalanced (i.e., one class has a much larger number of samples than the
other). In such cases, the model may achieve high accuracy by simply predicting the
majority class for all samples, even though it performs poorly on the minority
class.

F1 score: The F1 score is a harmonic mean of precision and recall, which gives
equal weight to both metrics. It is a useful metric for imbalanced datasets, as it
balances the trade-off between precision and recall. It is calculated as 2 x
(precision x recall) / (precision + recall).
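
A quick check of the dog/cat example above with scikit-learn, encoding dog as 1 and
cat as 0 (12 dogs of which 5 are found, 10 cats of which 3 are wrongly called dogs):

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1] * 12 + [0] * 10                      # 12 real dogs, 10 real cats
y_pred = [1] * 5 + [0] * 7 + [1] * 3 + [0] * 7    # 5 TP, 7 FN, 3 FP, 7 TN

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # 5/8 = 0.625
print("recall:   ", recall_score(y_true, y_pred))      # 5/12 ~ 0.417
print("f1:       ", f1_score(y_true, y_pred))          # 0.5
print("accuracy: ", accuracy_score(y_true, y_pred))    # 12/22 ~ 0.545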

When to use what in classification?


1) Logistic Regression - Good for binary classification problems where the outcome
is either 0 or 1. It can also handle multi-class classification problems.
2) Decision Trees - Good for handling both categorical and numerical data, and can
be used for both binary and multi-class classification problems. Decision trees are
often used for feature selection and data exploration.
3) Random Forest - Good for handling large datasets and high-dimensional data. It
can handle both binary and multi-class classification problems, and is often used
for feature selection (check how much that variable is important in a random
forest).
4) Naive Bayes - Good for handling text classification problems such as spam
filtering, sentiment analysis, and topic classification. It works well with small
datasets and can handle both binary and multi-class classification problems.
5) Support Vector Machines (SVM) - Good for handling both linear and nonlinear
classification problems, and can handle both binary and multi-class classification
problems. SVMs are often used for image classification and text classification.
6) k-Nearest Neighbors (KNN) - Good for handling small datasets and can handle both
binary and multi-class classification problems. KNN is often used for
recommendation systems and text classification.

ML split in 2:
1) supervised
1.1) classification
1.2) regression
2) unsupervised
2.1) clustering
2.2) association

Logistic regression is not actually regression despite the name, it is classification!
Linear regression is regression for real :)

In logistic regression you round: if the probability is > 0.5 the output is 1, else it is 0

a perfect confusion matrix has all numbers in the diagonal and 0 everywhere else

2 types of errors:
1) Reducible -> you can work on it
1.1) Bias
1.2) Variance
2) Irreducible -> you cannot change it

Error in machine learning: how accurately you can make predictions on the data you
trained on and also on new data; the difference between predicted and actual values.
Total error: 1) + 2) together

MSE = mean square error


MSE = Variance + Bias^2 + irreducible error
The squaring of the bias term has a mathematical justification, as it ensures that
the bias term is always non-negative. Moreover, squaring the bias term puts more
weight on larger errors, which is appropriate since large errors can have a greater
impact on the overall performance of the model than small errors.

Bias vs Variance:
the bias-variance tradeoff is an important concept to consider when developing
models. The tradeoff refers to the relationship between two types of errors that
can occur in a model: bias and variance.
Bias:
Bias refers to the error that occurs when a model is not able to capture the true
underlying pattern in the data, leading to a systematic under- or overestimation of
the target variable. High bias models are typically oversimplified and may miss
important patterns or relationships in the data. With high bias the model will be
very bad on both the training and the testing data.

Variance:
Variance, on the other hand, refers to the error that occurs when a model is too
complex and is overfitting the data, meaning it captures noise and randomness in
the training data as well as the underlying pattern. High variance models tend to
have good performance on the training data but perform poorly on new data. With high
variance it will be very good on training data but bad on test data.

The tradeoff between bias and variance arises because reducing one typically
increases the other. For example, a simple linear model with few parameters may
have high bias but low variance, while a complex model with many parameters may
have low bias but high variance. Finding the right balance between bias and
variance is critical to developing models that generalize well to new data.

To achieve this balance, researchers use techniques such as cross-validation and
regularization, which help to tune model complexity and reduce both bias and
variance.

cross-validation:
Cross-validation is a technique used to evaluate the performance of a machine
learning model by splitting the available data into two or more subsets: one for
training the model and another for testing the model. The goal of cross-validation
is to ensure that the model is able to generalize well to new data that it has not
seen during training.
The simplest form of cross-validation is called "k-fold cross-validation." In this
method, the data is divided into k equally sized folds, and the model is trained on
k-1 of the folds, with the remaining fold used for testing. This process is
repeated k times, with each fold used once as the validation data. The results from
each iteration of cross-validation are then averaged to provide an estimate of the
model's performance.
k-fold is a type of cross-validation
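
A minimal k-fold cross-validation sketch, assuming scikit-learn's iris toy dataset and
a logistic regression model; k=5 is an arbitrary choice:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 folds, each used once for testing
print(scores, scores.mean())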

leave-one-out cross-validation (LOOCV) -> k is actually the amount of data, so
groups of 1

nested cross-validation -> good to select hyperparams. Model selection without
nested CV uses the same data to tune model parameters and evaluate model
performance. Information may thus “leak” into the model and overfit the data, we
want to avoid it.
This technique involves two nested loops of cross-validation, with the outer loop
being used to assess the performance of the model on a held-out test set, and the
inner loop being used to tune the hyperparameters of the model using a validation
set.
Steps:
The nested cross-validation process involves the following steps:
1) Outer loop: Divide the dataset into training and testing sets. The testing set
is set aside and will not be used until the final evaluation of the model. The
training set is further divided into k subsets for k-fold cross-validation.
2) Inner loop: For each fold in the k-fold cross-validation, divide the training
set into training and validation sets. The model is trained on the training set,
and the hyperparameters are tuned using the validation set.
Specifically, for each fold in the k-fold cross-validation, we split the training
data into two sets: a training set and a validation set. The model is then trained
on the training set and evaluated on the validation set.
The goal of this step is to find the hyperparameters that result in the best
performance on the validation set. This is done by systematically trying different
hyperparameter values for each fold and selecting the combination that results in
the highest validation score.
3) Use the optimal hyperparameters found in step 2 to train the model on the
entire training set.
4) Evaluate the performance of the model on the held-out test set.
5) Repeat steps 1-4 for different combinations of hyperparameters.
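
A minimal nested cross-validation sketch, assuming an RBF SVM whose C and gamma are
tuned in the inner loop; the grid values and fold counts are arbitrary illustration
choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Inner loop: GridSearchCV tunes the hyperparameters on each training split.
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

# Outer loop: evaluates the tuned model on held-out folds it never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())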

What is stratified cross-validation?


Stratified k-fold cross-validation is a variation of the k-fold cross-validation
technique that ensures that each fold of the dataset contains roughly the same
proportion of samples from each class as the entire dataset.
In standard k-fold cross-validation, the data is randomly divided into k folds.
This can be problematic if the dataset contains class imbalance, where one or more
classes are significantly underrepresented in the data.
Stratified k-fold cross-validation addresses this issue by ensuring that each fold
has approximately the same proportion of samples from each class as the entire
dataset
In machine learning, a "class" refers to a category or group that a sample or
instance belongs to.

Cross-validation
Cross-validation is a technique used in machine learning and statistics to evaluate
the performance of a predictive model. It involves dividing a dataset into two or
more subsets, called "folds," and using one subset as the training data to fit the
model while using the other subset as the testing data to evaluate its performance.
The most common form of cross-validation is k-fold cross-validation, where the
dataset is divided into k equal-sized subsets, or folds. The model is trained on
k-1 folds and evaluated on the remaining fold, and this process is repeated k times,
with each fold being used as the testing set once. The results from each fold can
then be averaged to provide an estimate of the model's overall performance.

what could be a good value for k?


The choice of k in k-fold cross-validation depends on the size of your dataset and
the computational resources available to you. In general, larger values of k will
result in a more accurate estimate of the model's performance, but with increased
computational cost.
A common value for k is 10, which involves splitting the data into 10 equally sized
folds. However, if your dataset is small, you may want to use a smaller value of k
to ensure that each fold contains enough data for reliable testing. On the other
hand, if your dataset is very large, you may be able to use a larger value of k
without sacrificing too much computational efficiency.

time series cross-validation -> here it does not make sense to randomly select data..
you have to keep the chronology, using a time window

regularization:
Regularization is a technique used to prevent overfitting, which occurs when a
model becomes too complex and fits the training data too closely, leading to poor
generalization to new data
Regularization achieves this by adding a penalty term to the model's objective
function that encourages simpler models. There are two common types of
regularization:
1) L1 regularization (Lasso): This adds a penalty to the absolute value of the
model's coefficients, effectively encouraging some coefficients to be set to zero.
This can be useful for feature selection, as it can help identify which features
are most important for the model's performance.
2) L2 regularization (Ridge): This adds a penalty to the squared value of the
model's coefficients, which has the effect of shrinking the coefficients towards
zero. This can help to prevent overfitting by reducing the influence of less
important features.

In L1 regularization, the penalty term added to the cost function is proportional
to the absolute values of the model's weights, also known as the L1 norm (you know,
if the cost is proportional to sum of absolute values of the weights it will try to
reduce the weights). The effect of this penalty term is that it encourages the
model to shrink the weights of less important features towards zero, effectively
removing them from the model. This is in contrast to L2 regularization, which uses
the squared values of the weights (L2 norm) to shrink them towards zero.
However, L1 regularization has some drawbacks as well. It tends to produce sparse
models, meaning that many of the weights in the model are exactly zero

L1 vs L2:
it depends on the problem:
L1 is good with high dimensionality or redundancy, L1 is better if you want to
select important features.
L2 is good when you want to limit weights
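
A minimal Lasso (L1) vs Ridge (L2) sketch, assuming a toy regression dataset where
only a few features are informative; the alpha values are arbitrary:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to zero (sparse model);
# L2 only shrinks them towards zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))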

Decision tree is good in both classification and regression


DT is prone to overfitting if it is too complex/deep
DT has low bias, which is very good, but high variance (this is related to
overfitting)

Entropy:
Intuitively, entropy is high when the data is very mixed and uncertain, and low
when the data is very pure and predictable. For example, a dataset with only one
class label (i.e., all the data points belong to the same class) would have an
entropy of 0, because there is no uncertainty about the class label. On the other
hand, a dataset with an equal number of data points in each class would have a high
entropy, because there is a lot of uncertainty about the class label.

Information gain:
In the context of decision trees, information gain is a measure of the reduction in
entropy (or increase in purity) that results from splitting a dataset on a
particular input feature. Information gain is used to determine which input feature
to split on at each node of the tree.

Leaf node:
In the context of decision trees, a leaf node (also called a terminal node) is a
node that does not have any child nodes. In other words, a leaf node is a node that
represents a final classification or decision in the decision tree.

The DT decides which condition to split on by optimizing the information gain
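
A minimal sketch of entropy and information gain for a binary label, using made-up
label counts and a made-up candidate split purely for illustration:

import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # perfectly mixed -> entropy 1.0
left = np.array([1, 1, 1, 0])                 # one side of a candidate split
right = np.array([1, 0, 0, 0])                # the other side

n = len(parent)
info_gain = entropy(parent) - (len(left) / n * entropy(left)
                               + len(right) / n * entropy(right))
print(entropy(parent), info_gain)   # 1.0 and about 0.19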

Random forest:
fast to train and low overfitting (because the forest is big)
very accurate for big databases
good for estimating missing data
basically means "a lot of decision trees"

Naive Bayes classifier


P(A|B) = [P(B|A) * P(A)] / P(B)

Advantages of NBC (naive bayes classifier)


- not sensitive to irrelevant features
- simple
- need less data
- fast and scalable

Confusion matrix: is only about classification?


No, a confusion matrix is not only about classification. While the confusion matrix
is most commonly used in the evaluation of classification models, it can also be
used in other machine learning tasks such as regression, clustering, and anomaly
detection.

In general, a confusion matrix is a table that is used to evaluate the performance
of a machine learning model by comparing its predicted output to the true output.
It shows the number of correct and incorrect predictions made by the model, as well
as the different types of errors that the model makes.

In the case of classification models, a confusion matrix will show the number of
true positives, true negatives, false positives, and false negatives for each class
in the dataset. However, for other machine learning tasks, the confusion matrix may
have a different interpretation or format.

For example, in regression tasks, the confusion matrix may show the mean squared
error, mean absolute error, and other metrics that evaluate the model's ability to
predict a continuous output. In clustering tasks, the confusion matrix may show the
number of points that are correctly or incorrectly assigned to each cluster.

Overall, the confusion matrix is a useful tool for evaluating the performance of a
machine learning model, regardless of the task at hand.

KNN: supervised, classification, very simple


selecting the right K is the key to make it work

If k is too low (e.g., k=1), the model may be too sensitive to noise and outliers
in the data. In this case, the model may overfit the training data and perform
poorly on new, unseen data. Additionally, if the dataset is imbalanced, the model
may favor the majority class and ignore the minority class, leading to poor
performance on the minority class.

If k is too high (e.g., k=n, where n is the size of the training set), the model
may be too biased and not capture the underlying patterns in the data. In this
case, the model may underfit the training data and perform poorly on both the
training and test data. Additionally, if the dataset is noisy, the model may
include irrelevant samples in its predictions, leading to poor performance.

In an N-sized db, sqrt(N) is a good idea for the value of K. Make it odd, so you
cannot have a draw.
Usually bias and underfitting come together
Usually variance and overfitting come together
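
A minimal k-NN sketch, assuming scikit-learn's iris toy dataset; k is set to an odd
number near sqrt(n), following the rule of thumb above:

import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k = int(math.sqrt(len(X_train)))
if k % 2 == 0:
    k += 1          # make it odd so you cannot have a draw

clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
print(k, clf.score(X_test, y_test))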

K-means: is about clustering, unsupervised


K is the number of clusters
Each cluster has a centroid, which is the mean of all the data points assigned to
that cluster.
Here are the steps involved in the K-means algorithm:
1) Initialization: Initialize k centroids randomly.
2) Assignment: Assign each data point to the nearest centroid. Euclidean
distance is typically used to calculate the distance between data points and
centroids.
3) Recalculation: Calculate the mean of all data points assigned to each
centroid and move the centroid to the mean.
4) Repeat steps 2 and 3 until convergence: Repeat steps 2 and 3 until the
centroids no longer move or a maximum number of iterations is reached.
5) Output: The final output is the k clusters with their respective centroids.

Good when:
When the number of clusters is known or can be estimated.
When the data is relatively dense and spherical in shape.
When the data is relatively simple and there are no clear outliers.

Bad when:
When the number of clusters is unknown and needs to be estimated.
When the data is not well-separated and has overlapping clusters.
When the data is not spherical in shape.
When the data has clear outliers.

When it comes to bias and variance, K-means is generally a low-bias algorithm since
it assumes nothing about the underlying distribution of the data. However, it can
be prone to high variance since the results can depend on the initial placement of
the centroids, and the algorithm can converge to a suboptimal solution. To mitigate
this, multiple initializations with different starting points can be used to find a
more robust solution.

K-means is also prone to overfitting since it can assign data points to clusters
even if they do not truly belong there. This can lead to clusters that do not have
clear distinctions or are highly variable. On the other hand, K-means can also be
prone to underfitting if the number of clusters is too small, leading to
oversimplified clusters that do not capture the true complexity of the data

Clustering:
1) Hierarchical clustering is a tree structure like mammal -> canid -> dog and so on.
So you don't need to know in advance how many clusters there are
1.1) agglomerative clustering: bottom up, you start with small clusters and
you merge similar ones together
1.2) divisive clustering: top down, as you can imagine
2) Partitional clustering
2.1) k-means: each object goes to exactly one cluster
2.2) fuzzy c-means: each object goes in one or more clusters

Distance measure
In the context of clustering, a distance measure is a way to quantify the
similarity or dissimilarity between two data points. The choice of distance measure
can have a significant impact on the clustering results and should be selected
based on the characteristics of the data and the desired clustering outcome.
There are several commonly used distance measures in clustering, including:
1) Euclidean distance: This is the most widely used distance measure and is
defined as the square root of the sum of the squared differences between the
corresponding elements of two data points.
2) Manhattan distance: Also known as city-block distance or L1 distance, this
measure calculates the absolute differences between the corresponding elements of
two data points and sums them up.
3) Cosine distance: This measure calculates the cosine of the angle between two
vectors and is commonly used for text data or high-dimensional data.
4) Hamming distance: This measure is used for binary data and calculates the
number of bits that differ between two binary vectors.
5) Jaccard distance: This measure is used for categorical data and calculates
the dissimilarity between two sets based on the size of their intersection and
union.

How do you find how many clusters in k-means? With the elbow method:
SSD is the sum of squared distances (SSD) of data points to their nearest centroid. We
try with k equal to 1 up to N, so we execute k-means lots of times. Usually SSD
shrinks as k grows. When it starts shrinking less than usual, so the graph shows
an elbow shape, then that is a good k value
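
A minimal elbow-method sketch, assuming a toy blob dataset; in scikit-learn the SSD
described above is exposed as the fitted model's inertia_ attribute:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The drop in inertia flattens around k=4: that is the elbow.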

Dimensionality reduction:
we actually want to reduce the dataset columns removing the useless infos
Dimensionality reduction is the process of reducing the number of features or
variables in a dataset while preserving the important information contained in the
original data. It is a common technique used in machine learning and data analysis
to handle high-dimensional data and to improve the efficiency and effectiveness of
machine learning models. It also improves training cost and time, and it also reduces
redundancy

There are two main types of dimensionality reduction techniques: feature selection
and feature extraction.
1) Feature selection: This approach selects a subset of the original features that
are most relevant to the prediction task.
2) Feature extraction: This approach transforms the original features into a lower-
dimensional space that preserves most of the important information. Principal
Component Analysis (PCA) is a popular method for feature extraction that projects
the data onto a new space of orthogonal components that capture the most variation
in the data.

PCA:
we want to reduce the dimensionality of the dataset while keeping almost all the info
It is a mathematical technique that uses linear algebra to transform a set of
correlated variables into a set of uncorrelated variables called principal
components. We love uncorrelated variables
The main idea behind PCA is to identify the directions of maximum variation in the
data and project the data onto these directions. The first principal component is
the direction of maximum variation, followed by the second principal component, and
so on. The principal components are orthogonal to each other, meaning that they are
uncorrelated.

Terminology of PCA:
view -> the perspective through which the points are viewed
dimensions -> how many columns
linear combination -> In mathematics, a linear combination is a combination
of two or more variables or functions that are multiplied by constants and added
together. In other words, it is a sum of scalar multiples of vectors, functions, or
other mathematical objects. For example, suppose we have two vectors u and v in a
vector space. A linear combination of u and v is a vector of the form au + bv,
where a and b are scalars (real numbers or complex numbers). The vector au + bv is
obtained by scaling u by a and scaling v by b, and then adding the resulting
vectors.
principal components (PC) -> new variables constructed as linear combinations
of existing ones
perpendicular distance -> In algebra, the perpendicular distance between a
point and a line is the shortest distance between the point and the line, and it is
measured along a line that is perpendicular (or orthogonal) to the given line.
projections -> the perpendicular distance between principal components and
datapoints
basis -> In linear algebra, a base (or basis) is a set of linearly
independent vectors that can be used to represent any other vector in a given
vector space. Any vector in that space can be written as a linear combination of
the basis vectors
orthogonal -> In mathematics, orthogonal refers to the concept of
perpendicularity, or the relationship between two objects that are at a right angle
to each other. More generally, two objects are said to be orthogonal if they are
independent or unrelated to each other. For example, in geometry, two vectors are
orthogonal if their dot product is zero, which means that they are perpendicular to
each other. In linear algebra, a set of vectors is orthogonal if each pair of
vectors in the set is orthogonal. A set of orthogonal vectors is often used as a
basis for a vector space because it is easy to calculate projections and other
operations using orthogonal vectors.

PCA is an unsupervised algorithm

Variance vs covariance vs correlation vs standard deviation:


- variance -> In statistics, variance is a measure of how spread out a set of
data is. It is a numerical value that indicates the average squared deviation of
individual data points from the mean of the data set. The variance of a random
variable X is denoted by Var(X).
Var(X) = (1/n) * ∑(Xi - μ)^2
where Xi is the ith data point, μ is the mean of the data set, and n is the total
number of data points.
The variance of a variable is a non-negative value, which means that it can take
any value greater than or equal to zero.
- covariance -> In statistics, covariance is a measure of how two variables
change or vary together. It is a numerical value that indicates the degree to which
two variables are linearly related.
Covariance is calculated using the formula:
cov(X,Y) = (1/n) * ∑(Xi - μx)(Yi - μy)
The minimum and maximum values of covariance are unbounded, which means that they
can be infinitely large or small.
- correlation -> In statistics, correlation is a measure of the relationship
between two variables. It is a numerical value that indicates the degree to which
two variables are linearly related.
Correlation is calculated using the formula:
correlation = cov(X,Y) / (σx * σy)
where cov(X,Y) is the covariance between the two variables X and Y, and σx and σy
are their respective standard deviations.
correlation is between -1 and 1
- Standard deviation -> In statistics, standard deviation is a measure of the
amount of variation or dispersion of a set of values from their mean. It is a
numerical value that indicates how much, on average, each value in a dataset
deviates from the mean of the dataset.
The formula for standard deviation is:
σ = √(Σ(xi - μ)² / n)
The minimum value of standard deviation is zero. This occurs when all values in the
dataset are identical, meaning there is no variation or deviation from the mean.
The maximum value of standard deviation is unbounded,
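
A minimal sketch of the four quantities above with NumPy, assuming two small made-up
variables x and y (bias=True makes the covariance use the 1/n formula from the text):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

print("variance of x:          ", np.var(x))                 # mean squared deviation
print("standard deviation of x:", np.std(x))                 # square root of the variance
print("covariance of x, y:     ", np.cov(x, y, bias=True)[0, 1])
print("correlation of x, y:    ", np.corrcoef(x, y)[0, 1])   # always between -1 and 1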

Variance vs Standard Deviation:


Variance is the average squared deviations from the mean, while standard deviation
is the square root of this number. Both measures reflect variability in a
distribution, but their units differ: Standard deviation is expressed in the same
units as the original values (e.g., minutes or meters).

important properties of PCA


1) dimensions reduce or stay equals, never rise
2) PC are orthogonal
3) the priority of PC decreases when their number increases

PCA has 4 steps:


1) standardisation
take each variable, subtract the mean, divide the result for the standard
deviation.
standardization refers to the process of scaling the input data so that each
variable has zero mean and unit variance before performing the PCA analysis. This
is also known as Z-score normalization.
The reason for standardizing the data in PCA is to ensure that each variable
contributes equally to the principal components. If the variables are not scaled,
those with larger variances will tend to dominate the first few principal
components, while those with smaller variances will have little influence. This can
result in a biased or misleading representation of the data.
2) covariance matrix computation
is used to express correlation between any of two variables
3) eigenvalues and eigenvectors
these are values extracted from the covariance matrix; they are responsible for
generating the new variables
An eigenvector is a vector that, when multiplied by a particular square matrix,
results in a scalar multiple of the original vector. In other words, if A is a
square matrix, and v is a non-zero vector, then v is an eigenvector of A if there
exists a scalar λ such that:
Av = λv
The scalar λ is called the eigenvalue corresponding to the eigenvector v.
Eigenvectors and eigenvalues are always found in pairs: for every eigenvalue there
is at least one eigenvector, and vice versa.
Eigenvalues and eigenvectors are important because they can be used to decompose a
matrix into simpler components. For example, a diagonalizable matrix can be written
as:
A = PDP^-1
where P is a matrix whose columns are eigenvectors of A, D is a diagonal matrix
whose diagonal entries are the corresponding eigenvalues, and P^-1 is the inverse
of P.
4) feature vectors
this is just a matrix where the columns are the eigenvectors we are deciding to
keep
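
A minimal sketch of the standardize-then-PCA pipeline described above, assuming
scikit-learn's iris toy dataset and keeping 2 principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Step 1: standardization (zero mean, unit variance for each column).
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: PCA computes the covariance structure, its eigenvectors,
# and projects the data onto the top principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X.shape, "->", X_pca.shape)
print(pca.explained_variance_ratio_)   # share of variance kept by each PC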

Cost functions == loss function == objective function:


It answers the question: how bad is our model performing? we have to answer this
question in each step of iteration
It quantifies the difference between the predicted output of the model and the true
output (also known as the ground truth) for a set of training examples.
For example, in a regression problem, the mean squared error (MSE) is a common
cost function, while in a classification problem, the cross-entropy loss is often
used.

For a NN a good optimizer to reduce the loss function could be gradient descent
Gradient descent is a popular optimization algorithm used in machine learning and
other fields to minimize a cost function. It is a first-order iterative
optimization algorithm that works by iteratively adjusting the parameters of a
model in the direction of steepest descent of the cost function.

The basic idea of gradient descent is to compute the gradient of the cost function
with respect to the parameters of the model, and then adjust the parameters in the
direction of the negative gradient. This process is repeated until a minimum of the
cost function is reached or a stopping criterion is met.
There are several variations of gradient descent, including batch gradient descent,
stochastic gradient descent, and mini-batch gradient descent. Batch gradient
descent computes the gradient using the entire training dataset, while stochastic
gradient descent computes the gradient using one training example at a time. Mini-
batch gradient descent is a compromise between the two
Gradient descent is a widely used optimization algorithm due to its simplicity and
effectiveness. However, it can be slow to converge in certain cases, and can get
stuck in local minima of the cost function
alpha is the learning rate of gradient descent. You can take big steps and miss lower
places, or take small steps and need more time to train

Here's how the gradient descent works:


- Initialization: The algorithm starts by initializing the values of the
parameters randomly or with some predefined values.
- Forward Propagation: The model makes predictions using the current values of
the parameters and calculates the cost or loss function.
- Backward Propagation: The algorithm calculates the gradient of the cost
function with respect to each parameter. This gradient represents the direction and
magnitude of the steepest descent. It tells us how much we need to adjust each
parameter to reduce the cost function.
- Parameter Update: The algorithm updates the values of the parameters by moving
them in the direction of the negative gradient. The magnitude of the update is
controlled by the learning rate, which is a hyperparameter that determines the step
size at each iteration.
- Repeat: Steps 2-4 are repeated until the cost function converges to a minimum
or until a maximum number of iterations is reached.
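
A minimal gradient descent sketch for simple linear regression (one weight w and one
bias b, MSE loss), assuming made-up data generated from roughly y = 3x + 2:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3 * x + 2 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0      # initialization
alpha = 0.1          # learning rate

for step in range(2000):
    y_hat = w * x + b                   # forward propagation: predictions
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)     # backward: gradient of MSE w.r.t. w
    grad_b = 2 * np.mean(error)         # gradient of MSE w.r.t. b
    w -= alpha * grad_w                 # parameter update (negative gradient)
    b -= alpha * grad_b

print(w, b)   # should end up close to 3 and 2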

More details about optimizers and loss functions:

Optimizer:
In the context of machine learning, an optimizer is an algorithm or method used to
adjust the parameters (weights and biases) of a model during the training process
in order to minimize the error or loss function
Possible optimizer:
- Gradient Descent: This is one of the most fundamental optimization algorithms. It
involves computing the gradient of the loss function with respect to the model
parameters and moving in the direction that reduces the loss. There are variations
of gradient descent, including batch gradient descent (updating using the entire
dataset), stochastic gradient descent (updating using one data point at a time),
and mini-batch gradient descent (updating using a small subset of the dataset).
- Adam (Adaptive Moment Estimation): Adam is an adaptive optimization algorithm
that adjusts the learning rate for each parameter based on the past gradients
- RMSProp (Root Mean Square Propagation): RMSProp adjusts the learning rate for
each parameter based on the magnitude of recent gradients. It helps prevent
vanishing or exploding gradients by normalizing the update step.
- Momentum: Momentum adds a fraction of the previous update to the current update,
which helps the optimization process move more smoothly
- Adagrad (Adaptive Gradient Algorithm): Adagrad adapts the learning rate for each
parameter based on the historical gradients.

Loss functions (different from optimizers):


A loss function, also known as a cost function or objective function, quantifies
how well the predictions made by a machine learning model match the true values.
The goal during training is to minimize this loss function.
Instead an optimizer is an algorithm or method that determines how the model's
parameters (weights and biases) should be adjusted in order to minimize the loss
function. So they work together but they are different

Different types of problems, such as classification, regression, and more
specialized tasks, require different loss functions. Some of them are:
- Regression Loss Functions
- Mean Squared Error (MSE): average squared difference between predicted and
actual values
- Mean Absolute Error (MAE): average ABSOLUTE (not squared) difference between
predicted and actual values
- Classification Loss Functions:
- Binary Cross-Entropy Loss: the difference between predicted probabilities
and true binary labels.
- Categorical Cross-Entropy Loss: the difference between predicted class
probabilities and true one-hot encoded labels.
- Hinge Loss (SVM Loss): for SVM, It encourages correct classification by
penalizing misclassifications
- Sequence-to-Sequence Loss Functions:
- Sequence Cross-Entropy Loss: Used in sequence generation tasks, measuring
the difference between predicted sequences and target sequences.
actually lots more of them

More practically, here are points 2 and 3 in the case of a neural network.
Consider a simple neural network with one hidden layer, consisting of
an input layer, a hidden layer with n neurons, and an output layer with one neuron.
The network takes as input a vector x of m features and predicts a scalar value y.
Here's how the forward propagation and backward propagation steps work in this
case:

2) Forward Propagation:
The input layer receives the input vector x.
The hidden layer calculates the weighted sum of the inputs for each neuron,
applies a non-linear activation function to the result, and outputs a vector h of
size n.
The output layer calculates the weighted sum of the hidden layer activations,
applies a final activation function (such as a sigmoid or a softmax), and outputs a
scalar value y_hat, which is the predicted value.

3) Backward Propagation:
The output layer calculates the derivative of the cost function with respect to
y_hat, which is typically based on the difference between the predicted value and
the actual value of y.
The hidden layer calculates the derivative of the cost function with respect to
the activations of the hidden layer neurons, which is obtained by multiplying the
derivative of the output layer with respect to the hidden layer activations by the
weights connecting the output layer to the hidden layer.
The input layer calculates the derivative of the cost function with respect to
the inputs, which is obtained by multiplying the derivative of the hidden layer
activations with respect to the inputs by the weights connecting the hidden layer
to the input layer.

This process is repeated for multiple epochs (passes over the entire dataset) until
the cost function converges or a maximum number of epochs is reached.

In a NN the loss surface is non-convex with many local minima, so gradient descent
can end up in different minima and it is easier to get stuck in a local minimum.
The gradients are computed one layer at a time, starting from the last layers and
moving backwards; that's why it's called backpropagation
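
A minimal numerical sketch of the forward and backward passes above for a network
with one hidden layer (sigmoid activations, squared-error loss), using tiny made-up
data; all names and sizes are arbitrary illustration choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 3, 4                      # m input features, n hidden neurons
x = rng.normal(size=m)           # one training example
y = 1.0                          # its target value

W1 = rng.normal(size=(n, m)); b1 = np.zeros(n)   # hidden layer parameters
W2 = rng.normal(size=n);      b2 = 0.0           # output layer parameters
alpha = 0.1

for step in range(1000):
    # Forward propagation
    h = sigmoid(W1 @ x + b1)          # hidden layer activations
    y_hat = sigmoid(W2 @ h + b2)      # predicted scalar output
    loss = (y_hat - y) ** 2

    # Backward propagation (chain rule, from the output layer backwards)
    d_z2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # through the output sigmoid
    d_W2 = d_z2 * h
    d_b2 = d_z2
    d_h = d_z2 * W2                   # gradient flowing into the hidden layer
    d_z1 = d_h * h * (1 - h)          # through the hidden sigmoid
    d_W1 = np.outer(d_z1, x)
    d_b1 = d_z1

    # Parameter update
    W2 -= alpha * d_W2; b2 -= alpha * d_b2
    W1 -= alpha * d_W1; b1 -= alpha * d_b1

print(loss)   # should be close to zero after training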

Reinforcement learning:
Reinforcement learning is a subfield of machine learning that is concerned with how
agents can learn to make decisions and take actions in an environment in order to
maximize a reward signal. It is a type of learning that is inspired by how humans
and animals learn from the consequences of their actions.
In reinforcement learning, an agent interacts with an environment by taking actions
and receiving feedback in the form of rewards or penalties. The goal of the agent
is to learn a policy, which is a mapping from states to actions, that maximizes the
expected cumulative reward over time.

Markov Decision Process (MDP) -> A Markov Decision Process (MDP) is a mathematical
framework used to model decision-making problems in situations where the outcome
depends on both random events and the actions taken by a decision maker or agent.
An MDP is defined by a set of states, actions, rewards, and a transition function.
The states represent the different possible states of the system, the actions
represent the available actions that the decision maker can take, and the rewards
represent the outcome of taking an action in a particular state. The transition
function defines the probability of transitioning from one state to another state,
given a particular action.
The key assumption of an MDP is that the future state of the system depends only on
the current state and the current action taken by the decision maker, and is
independent of the past history of the system. This is known as the Markov
property,

The reinforcement learning process typically involves the following components:


Environment: The environment represents the external system or world that the
agent interacts with. It can be modeled as a Markov Decision Process (MDP), which
consists of a set of states, a set of actions, a transition function that
determines the next state given the current state and action, and a reward function
that assigns a reward to each state-action pair.
Agent: The agent is the learning algorithm that interacts with the environment.
It observes the current state of the environment, selects an action based on its
current policy, receives a reward from the environment, and updates its policy
based on the observed reward and state transitions.
Policy: The policy is the strategy that the agent uses to select actions based
on the observed state of the environment. The goal of the agent is to learn a
policy that maximizes the expected cumulative reward over time.
Reward function: The reward function assigns a reward to each state-action pair
based on the observed behavior of the agent. The goal of the agent is to learn a
policy that maximizes the expected cumulative reward over time.

GPT-3 was not trained using reinforcement learning directly. Instead, it was pre-
trained using an unsupervised learning approach called language modeling, which
involves predicting the next word in a sequence of text given the preceding words.
However, reinforcement learning has been used in conjunction with GPT-3 to fine-
tune the model for specific tasks. This involves training the model on a specific
task using a reward signal that is provided by the task environment, such as a
score or accuracy metric.
Overall, while GPT-3 was not directly trained using reinforcement learning, it has
been used in conjunction with reinforcement learning to fine-tune the model for
specific tasks and improve its performance.

What is the fine-tuning of a model?


"Fine-tuning" a model refers to the process of taking a pre-trained machine
learning model and training it further on a new, smaller dataset to improve its
performance on a specific task. This is often done when a pre-trained model already
has some level of expertise in the task domain, but needs to be adapted to a
specific use case.
Fine-tuning typically involves freezing some or all of the pre-trained layers of
the model and adding one or more new layers on top that are specific to the task.
The weights of the pre-trained layers are kept fixed during training, while the
weights of the new layers are updated to better fit the task-specific data.

Different kinds of reinforced learning:


Model-based and model-free are two different approaches to reinforcement learning.
1) Model-based Reinforcement Learning: In model-based reinforcement learning,
the agent learns a model of the environment, including the state space, action
space, and transition probabilities between states. The agent then uses this model
to plan future actions based on predicted outcomes. This approach requires more
computational resources and is more complex, but can be more efficient in certain
situations where the environment is well-defined and predictable.
2) Model-free Reinforcement Learning: In model-free reinforcement learning, the
agent does not build a model of the environment, but instead learns a value
function or a policy directly from experience. This approach does not require a
model of the environment, making it simpler and more computationally efficient, but
can be less sample efficient in certain environments.
2.1) Value-based methods: These methods involve learning the value
function, which estimates the expected long-term reward of each state or state-
action pair. Examples of value-based methods include Q-learning and SARSA.
2.2) Policy-based methods: These methods involve learning the policy
directly, which maps states to actions. Examples of policy-based methods include
policy gradients and actor-critic methods.

Q-learning
Q-learning is a popular algorithm in reinforcement learning that enables an agent
to learn to make decisions based on rewards it receives from its environment. It is
a model-free algorithm that uses a value-based approach to estimate the optimal
action-value function (Q-function) of the environment.
In Q-learning, the agent learns to estimate the expected reward of taking a
particular action in a particular state. The Q-value of a state-action pair is
defined as the expected sum of discounted future rewards that the agent will
receive by following a specific policy from that state. The optimal policy can then
be found by selecting the action with the highest Q-value in each state. q-value is
"how good is this action in this state?"
The Q-learning algorithm involves iteratively updating the Q-values based on the
agent's experiences in the environment. At each time step, the agent takes an
action in the current state and observes the resulting reward and next state. The
Q-value for the current state-action pair is then updated using the Bellman
equation:
The Bellman equation in the context of Q-learning is a fundamental equation that
describes how the Q-values for a given state and action can be updated based on the
observed reward and the maximum Q-value for the next state.

In mathematical terms, the Bellman equation for Q-learning is:

Q(s, a) = (1 - alpha) * Q(s, a) + alpha * [r + gamma * max(Q(s', a'))]

where:
Q(s, a) is the Q-value of taking action a in state s
alpha is the learning rate, which determines the extent to which newly acquired
information overrides old information
r is the reward obtained by taking action a in state s
gamma is the discount factor, which determines the relative importance of
immediate and future rewards
max(Q(s', a')) is the maximum Q-value for any action a' in the next state s'
s' is the next state reached after taking action a in state s.
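
A minimal tabular sketch of this update rule (the state/action counts and the
hyperparameter values are arbitrary assumptions for illustration):

import numpy as np

n_states, n_actions = 10, 4          # assumed sizes of a small discrete problem
alpha, gamma = 0.1, 0.99             # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # Q-table initialized to zero

def q_update(s, a, r, s_next):
    # Bellman update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target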

Q-learning can be a suitable algorithm for many reinforcement learning problems,
but there are situations where it may not be the best choice. Here are some factors
to consider when deciding whether to use Q-learning:
1) Discrete state and action space: Q-learning is best suited for problems where
the state space is discrete and the action space is finite. If the state space is
continuous or the action space is infinite, other algorithms such as deep
reinforcement learning may be more appropriate.
2) Sparse rewards: Q-learning can struggle with problems where the rewards are
sparse or delayed, meaning that the agent only receives a reward after many
actions. In these situations, the agent may have difficulty learning the correct
action values and may need to use additional techniques such as exploration
strategies or reward shaping.
3) Exploration vs. exploitation: Q-learning is a model-free algorithm that relies
on exploration to learn the optimal policy. If exploration is not handled
correctly, the agent may get stuck in a suboptimal policy. Strategies such as
epsilon-greedy or softmax exploration can help balance exploration and
exploitation.
4) Complex environments: Q-learning may not be the best choice for problems with
complex or dynamic environments. In these situations, other algorithms such as
model-based reinforcement learning or Monte Carlo tree search may be more
appropriate.

Actually in Q-learning some randomness is a good idea, to explore more innovative
paths. In epsilon-greedy selection, epsilon is the threshold: draw a random number
between 0 and 1, and if it falls below epsilon the agent picks a random action
(explore); otherwise it picks the action with the highest Q-value (exploit).
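
A sketch of epsilon-greedy action selection over a Q-table (epsilon = 0.1 is just an
assumed value):

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    # With probability epsilon explore (pick a random action),
    # otherwise exploit (pick the action with the highest Q-value)
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))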

In reinforcement learning, exploration and exploitation are two fundamental
strategies used by an agent to interact with an environment and learn an optimal
policy.
1) Exploration refers to the process of selecting actions that are not yet known to
be optimal in order to gather more information about the environment. During
exploration, the agent may choose actions at random. The goal of exploration is to
discover new state-action pairs and to refine the agent's estimate of the optimal
policy.
2) Exploitation, on the other hand, refers to the process of selecting actions that
are known to be optimal based on the agent's current policy. During exploitation,
the agent chooses actions that have the highest expected reward based on its
current knowledge of the environment. The goal of exploitation is to maximize the
cumulative reward

The balance between exploration and exploitation is a key challenge in
reinforcement learning. If the agent explores too much, it may spend too much time
in suboptimal actions and fail to converge to an optimal policy. If the agent
exploits too much, it may get stuck in a suboptimal policy and fail to discover
better policies. Therefore, a good exploration strategy is essential for effective
reinforcement learning.

Overfitting -> you are learning noise, you are bad at predicting new things.
Related to variance
Underfitting -> you are not learning the pattern. Related to bias

Math for ML:


Data can be (qualitative division):
- Nominal -> country, gender etc. Categories without order here
- Ordinal -> salary range, movie rating, education level etc. Categories with order
here
One key difference between nominal and ordinal data is the level of measurement.
Nominal data are at the lowest level of measurement, while ordinal data are at a
higher level. This means that ordinal data not only have the properties of nominal
data, but they also have an inherent order or ranking of the categories.

Data can be (quantitative division):


- discrete
- continuous

Linear equation -> no powers, order 1

transpose a matrix -> flip around the diagonal. Notation is A -> Aᵀ, where the T is
written at exponent (superscript) level

inverse matrix -> A^-1


The inverse of a matrix is a matrix that, when multiplied by the original matrix,
results in the identity matrix. The identity matrix is a square matrix with ones on
the main diagonal and zeros elsewhere
If A is not invertible, then it is said to be singular.

vector -> 1 dimension matrix (a row)


in a vector we have a value and a direction

For a vector v with components (v₁, v₂, ..., vn), the magnitude is calculated as:
|v| = √(v₁² + v₂² + ... + vn²)

To get the direction of a vector, you can use the concept of unit vectors. A unit
vector is a vector with a magnitude of 1. You can obtain the unit vector in the
direction of a given vector v by dividing the vector by its magnitude:
u = v/|v|
The resulting unit vector u will have the same direction as the original vector v,
but with a magnitude of 1.
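
For example, with NumPy:

import numpy as np

v = np.array([3.0, 4.0])
magnitude = np.linalg.norm(v)   # sqrt(3^2 + 4^2) = 5.0
u = v / magnitude               # unit vector [0.6, 0.8]: same direction, length 1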

An eigenvector is a vector that, when multiplied by a particular square matrix,
results in a scalar multiple of the original vector. In other words, if A is a
square matrix, and v is a non-zero vector, then v is an eigenvector of A if there
exists a scalar λ such that:
Av = λv
The scalar λ is called the eigenvalue corresponding to the eigenvector v.
Eigenvectors and eigenvalues are always found in pairs: for every eigenvalue there
is at least one eigenvector, and vice versa.
Eigenvalues and eigenvectors are important because they can be used to decompose a
matrix into simpler components. For example, a diagonalizable matrix can be written
as:
A = PDP^-1, where D is a diagonal matrix of eigenvalues and the columns of P are
the corresponding eigenvectors.

In simple terms, an eigenvector of a matrix is a non-zero vector that, when
multiplied by the matrix, results in a scaled version of the original vector. The
scaling factor is called the eigenvalue, which can be either real or complex. In
other words, when a matrix is multiplied by one of its eigenvectors, the vector
only changes in magnitude, but not in direction.
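
A quick NumPy check of Av = λv (the matrix A is an arbitrary example):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]              # first eigenvector (stored as a column)
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True: A v equals lambda * v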

Probabilistic sampling is a method of sampling where each member of a population
has a known probability of being included in the sample. This type of sampling is
widely used in statistics and research to make inferences about a population based
on a sample.
There are several types of probabilistic sampling methods, including:
- Simple random sampling: every member of the population has an equal chance of
being selected (you can imagine..)
- Stratified random sampling: In this method, the population is divided into
groups or strata based on certain characteristics, and then a simple random sample
is selected from each stratum. This is useful when the population has different
subgroups that vary in important characteristics.
- Cluster sampling: In this method, the population is divided into clusters, and
then a simple random sample of clusters is selected. All members of the selected
clusters are then included in the sample
- Multi-stage sampling: In this method, a combination of sampling techniques is
used. The population is divided into clusters, and then a sample of clusters is
selected using cluster sampling. Within each selected cluster, a sample of
individuals is then selected using another sampling technique such as simple random
sampling

Stratified and cluster sampling may look similar, but bear in mind that groups
created in cluster sampling are heterogeneous, so the individual characteristics in
the cluster vary. In contrast, groups created in stratified sampling are
homogeneous, as units share characteristics.
Relatedly, in cluster sampling you randomly select entire groups and include all
units of each group in your sample. However, in stratified sampling, you select
some units of all groups and include them in your sample. In this way, both methods
can ensure that your sample is representative of the target population.
Cluster sampling is a type of probabilistic sampling method that is used when the
population is large and geographically dispersed. In this method, the population is
first divided into clusters, which are groups of individuals who share some common
characteristic or are located in a particular geographic area. The clusters can be
based on various criteria, such as geography, age, income, or occupation.
Once the clusters have been identified, a sample of clusters is selected at random,
using a simple random sampling or other probabilistic method. Then, all individuals
within the selected clusters are included in the sample. This differs from
stratified sampling, where only a subset of individuals within each stratum are
included in the sample.
2 types of statistics:
1) descriptive -> just gives some info like mean, mode, median...
1.1) measures of central tendency -> mean, mode, median
1.2) measures of spread -> variance, range, standard deviation etc
2) inferential -> Inferential statistics is a subfield of statistics that involves
making inferences about a population based on a sample of data.

Hypothesis testing: calculate how probable a hypothesis is, given the samples you
have

The central limit theorem is a fundamental concept in statistics that describes the
behavior of the sample means of a large number of independent and identically
distributed random variables. It states that regardless of the underlying
distribution of the individual random variables, as the sample size increases, the
distribution of the sample means approaches a normal distribution, with a mean
equal to the population mean and a standard deviation equal to the population
standard deviation divided by the square root of the sample size.

As a general rule of thumb, a sample size of at least 30 is often considered large
enough for the central limit theorem to hold.
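
A small simulation sketch of the theorem (the skewed exponential distribution and the
sample size of 30 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of size 30 drawn from a skewed (exponential) distribution
sample_means = rng.exponential(scale=2.0, size=(10_000, 30)).mean(axis=1)
print(sample_means.mean())  # close to the population mean (2.0)
print(sample_means.std())   # close to 2.0 / sqrt(30), as the theorem predicts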

Hypothesis testing is a statistical method used to determine whether an observed
effect or relationship in a sample of data is likely to reflect a real effect or
relationship in the population, or whether it is simply due to chance.
In hypothesis testing, we start by formulating a null hypothesis, which states that
there is no effect or relationship in the population, and an alternative
hypothesis, which states that there is a real effect or relationship in the
population.

To conduct a hypothesis test, we begin by collecting a sample of data and
calculating a test statistic, which is a numerical value that summarizes the
evidence in the sample. We then compare the test statistic to a critical value or
p-value, which represents the probability of obtaining a test statistic as extreme
or more extreme than the observed value, assuming that the null hypothesis is true.

If the p-value is very small (e.g., less than 0.05), we reject the null hypothesis
and conclude that there is sufficient evidence to support the alternative
hypothesis. If the p-value is not small, we fail to reject the null hypothesis and
conclude that the observed effect or relationship is not statistically significant.

In statistics, a population is a group of individuals, objects, or events that
share a common characteristic, while a sample is a subset of the population that is
selected for study.

The t-value (also known as the t-statistic) is a statistical test that measures the
difference between the mean of a sample and the mean of the population, relative to
the variability in the sample. It is used in hypothesis testing to determine
whether the mean of a sample is statistically different from the population mean.

The t-value is calculated by dividing the difference between the sample mean and
the population mean by the standard error of the mean. The formula for calculating
the t-value is:

t = (x̄ - μ) / (s / √n)
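
In practice this is usually computed with a library; a sketch with SciPy (the sample
values and the hypothesized population mean of 100 are made up):

import numpy as np
from scipy import stats

sample = np.array([102.1, 98.4, 101.3, 99.8, 103.2, 100.9])
t_value, p_value = stats.ttest_1samp(sample, popmean=100)  # H0: population mean = 100
print(t_value, p_value)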
Permutations refer to the number of ways that a set of objects can be arranged in a
specific order, and can be calculated with or without repetition. In permutations
without repetition, each object can only be used once, and the order in which the
objects are arranged matters. For example, if we have three different letters, A,
B, and C, there are 6 possible permutations without repetition: ABC, ACB, BAC, BCA,
CAB, CBA.
On the other hand, in permutations with repetition, the same object can be used
more than once, and the order in which the objects are arranged matters. For
example, if we have three letters, A, B, and C, and we allow repetition, there are
27 possible permutations: AAA, AAB, AAC, ABA, ABB, ABC, ACA, ACB, ACC, BAA, BAB,
BAC, BBA, BBB, BBC, BCA, BCB, BCC, CAA, CAB, CAC, CBA, CBB, CBC, CCA, CCB, CCC.

Combinations, on the other hand, refer to the number of ways that a set of objects
can be selected from a larger set, and the order in which they are selected does
not matter. Like permutations, combinations can also be calculated with or without
repetition.

A z-score (also known as a standard score) is a statistical measure that represents
the number of standard deviations a data point is away from the mean of a
distribution. It is a way to standardize data so that it can be easily compared
across different datasets or samples.

To calculate the z-score for a data point, you subtract the mean of the
distribution from the data point and then divide the result by the standard
deviation of the distribution. The formula for calculating the z-score is:

z = (x - μ) / σ

How many combinations and permutations are there?

Combinations:
Without repetition: n! / (r! * (n-r)!)
With repetition: (n + r - 1)! / (r! * (n-1)!)

Permutations:
Without repetition: n! / (n-r)!
With repetition: n^r
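
These formulas map directly onto Python's standard library (n = 3, r = 2 are example
values, like picking 2 of the letters A, B, C above):

from math import comb, factorial, perm

n, r = 3, 2
print(comb(n, r))            # combinations without repetition: 3
print(factorial(n + r - 1) // (factorial(r) * factorial(n - 1)))  # with repetition: 6
print(perm(n, r))            # permutations without repetition: 6
print(n ** r)                # permutations with repetition: 9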

How to split dataset for training, validation, test data?


80/10/10 could be good, or 70/15/15, or 50/25/25. There are different opinions.

To start: https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-interview-questions
An optimization problem is a type of mathematical problem where the goal is to find
the best possible solution among a set of possible solutions, often subject to
constraints. The objective is to maximize or minimize an objective function that
measures the quality of the solution.

Bayesian optimizer
A Bayesian optimizer is a type of optimization algorithm that uses Bayesian
inference to efficiently search for the optimum of an objective function. It is
often used in machine learning, hyperparameter tuning, and other optimization
problems.
The algorithm maintains a probabilistic model of the objective function and uses it
to choose the next set of parameters to evaluate. The model is updated as new
evaluations are made, allowing the algorithm to refine its search over time.

Bayesian inference is a statistical method used to update our beliefs or knowledge
about an uncertain event or parameter based on new data or evidence. It involves
the use of Bayes' theorem.
In Bayesian inference, we start with an initial belief or prior probability
distribution about an uncertain event or parameter. As we collect new data or
evidence, we update our belief using Bayes' theorem to obtain a posterior
probability distribution, which reflects the probability of the event or parameter
given both the prior belief and the new data.

Let's say we have a dataset of images and we want to train a convolutional neural
network (CNN) to classify them. However, we need to find the optimal
hyperparameters for the CNN, including the number of convolutional layers, the
number of filters in each layer, the learning rate, and the batch size.
We can use a Bayesian optimizer to efficiently search for the optimal set of
hyperparameters by iteratively evaluating the model with different parameter
combinations. The optimizer will use the results of previous evaluations to update
its probabilistic model and guide the search towards promising regions of the
parameter space.

Here's how the optimization process might look:


1) Define the search space for the hyperparameters, including the range of
values for each parameter. This is obvious, you have to decide the range where to
search
2) Choose an initial set of hyperparameters to evaluate, using a method such as
random sampling. Also this one is expected isn't it?
3) Train the CNN with the chosen hyperparameters and evaluate its performance on
a validation set. In step 3, we typically train the model on a subset of the
training data using the selected hyperparameters. The purpose of this step is to
obtain an estimate of the model's performance on new, unseen data. A SUBSET, YOU
ARE NOT RICH KIDDO
4) Use the performance result to update the probabilistic model of the objective
function. HERE IS THE INTERESTING POINT: We use the performance result to update the
probabilistic model of the objective function. Specifically, we update the prior
distribution of the hyperparameters using Bayes' theorem, taking into account the
likelihood of the data given the hyperparameters.
5) Use the probabilistic model to choose the next set of hyperparameters to
evaluate, balancing exploration and exploitation.
6) Repeat steps 3-5 until a stopping criterion is met, such as a maximum number
of evaluations or a convergence criterion.
7) Return the hyperparameters that yield the best performance on the validation
set.
By using a Bayesian optimizer, we can efficiently search the high-dimensional space
of hyperparameters and find a set that leads to the best performance on the
validation set, without the need for an exhaustive search. This can save time and
resources and help us build better machine learning models.

STEP 4 IN DETAIL:
In step 4, we use the performance result obtained in step 3 to update the
probabilistic model of the objective function. Specifically, we update the prior
distribution of the hyperparameters using Bayes' theorem, taking into account the
likelihood of the data given the hyperparameters.

Bayes' theorem states that the posterior distribution of a parameter given data is
proportional to the product of the likelihood of the data given the parameter and
the prior distribution of the parameter.
Actually it states P(A | B) = P(B | A) * P(A) / P(B), but here we are adapting it
to our situation
Bayes' theorem can be used to update our beliefs about the probability of an event
based on new evidence. We can use the theorem to calculate the posterior
probability of the event, which is the probability of the event occurring given the
new evidence.

Explanation of the terms:


posterior: the updated probability distribution of the hyperparameters given
the performance result and the prior distribution
likelihood: the probability of observing the performance result given the
hyperparameters
prior: the prior probability distribution of the hyperparameters before
observing the performance result

To compute the posterior distribution, we first need to specify a prior
distribution for each hyperparameter. This can be a uniform distribution, a
Gaussian distribution, or any other distribution that reflects our prior beliefs
about the hyperparameters. For example, we may assume that the regularization
parameter follows a log-uniform distribution, while the kernel type follows a
categorical distribution.

Once we have the prior distribution, we can compute the likelihood of the
performance result given the hyperparameters. This can be done using a surrogate
model or a proxy function that approximates the true objective function. The
surrogate model can be a Gaussian process, a neural network, or any other model
that is computationally efficient and can approximate the objective function
accurately.

how do we go from P(A | B) = P(B | A) * P(A) / P(B) to posterior ∝ likelihood ×
prior? (∝ means "proportional to")
easy:
- A represents the hyperparameters of the model
- B represents the observed performance of the model with the hyperparameters A
- P(A) is the prior distribution of the hyperparameters A
- P(B | A) is the likelihood function, which represents the probability of
observing performance B given the hyperparameters A
- P(B) is the marginal likelihood, which is the probability of observing
performance B integrated over all possible hyperparameters. IS A CONSTANT NUMBER,
WE CAN NEGLECT IT

So
from P(A | B) = P(B | A) * P(A) / P(B)
to P(A | B) ∝ P(B | A) * P(A)
because the ∝ symbol means "proportional to", which means that we are neglecting
the normalization constant P(B) for simplicity.

and after you can do:


from P(A | B) ∝ P(B | A) * P(A)
to posterior ∝ likelihood * prior
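
A tiny numerical example of Bayes' theorem itself (the disease/test numbers are
invented purely for illustration):

prior = 0.01                              # P(disease)
likelihood = 0.95                         # P(positive test | disease)
evidence = 0.95 * 0.01 + 0.05 * 0.99      # P(positive test), the normalization constant
posterior = likelihood * prior / evidence # P(disease | positive test)
print(posterior)                          # ~0.16: posterior ∝ likelihood * prior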

Boosting, bagging, and stacking are popular ensemble learning techniques in the
context of machine learning.

- Boosting:
Boosting is an ensemble learning technique that aims to improve the performance of
weak learners (often simple models) by combining them into a strong learner. The
process involves sequentially training a series of weak models and giving more
weight to the misclassified instances in each iteration. The final prediction is
made by aggregating the predictions of all weak models. AdaBoost (Adaptive
Boosting) is one of the most well-known algorithms for boosting.

- Bagging:
Bagging stands for Bootstrap Aggregating. It is an ensemble learning technique that
involves training multiple independent models on different subsets of the training
data, created by bootstrapping (sampling with replacement). Each model in the
ensemble makes its prediction, and the final prediction is determined by averaging
(in the case of regression) or majority voting (in the case of classification) over
the predictions of all models. Random Forest is a famous algorithm that uses bagging.

- Stacking:
Stacking, also known as stacked generalization, is an ensemble learning technique
that combines multiple models by using another model (meta-model) to learn how to
best combine the predictions of the base models. The process involves training
multiple base models on the same dataset, using their predictions as input features
for the meta-model. The meta-model then learns how to weigh and combine the base
model predictions to produce the final ensemble prediction. Stacking can be a
powerful technique when used appropriately, but it requires more computational
resources and careful tuning compared to boosting and bagging.
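
A sketch of all three with scikit-learn on a toy dataset (the specific estimators and
parameter values are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

boosting = AdaBoostClassifier(n_estimators=50).fit(X, y)      # boosting of weak trees
bagging = RandomForestClassifier(n_estimators=100).fit(X, y)  # bagging of trees
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("rf", RandomForestClassifier())],
    final_estimator=LogisticRegression()).fit(X, y)           # meta-model on top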

Neural networks:
A Neural Network (NN) is a specific architecture inspired by the human brain's
neural structure. It is a complex network of interconnected artificial neurons that
process and transmit information. It excels at recognizing patterns and making
predictions based on learned representations from the data.
Types:
- Feedforward Neural Networks: They are the simplest type of neural networks, where
data flows in a unidirectional manner, moving from the input layer to the output
layer without any feedback loops.
- Recurrent Neural Networks (RNNs): These networks have connections that allow
feedback loops, making them suitable for sequential data and tasks that require
memory.
- Convolutional Neural Networks (CNNs): CNNs are particularly well-suited for image
and video analysis, using convolutional layers to automatically detect patterns.
- Generative Adversarial Networks (GANs): GANs are composed of two neural networks,
namely a generator and a discriminator, collaborating to produce lifelike data,
commonly employed for generating images and videos.
- Long Short-Term Memory Networks (LSTMs): A type of RNN that can retain
information for an extended period, making them effective in tasks involving
sequential data.

More about CNN:


what is a convolutional layer?
A convolutional layer is one of the key building blocks of a CNN. It is designed to
automatically and adaptively learn spatial hierarchies of features from the input
data. In computer vision tasks, such as image recognition, the convolutional layer
is used to detect various patterns and features within an image.
The main idea behind a convolutional layer is to slide a set of small filters (also
known as kernels) across the input data and perform convolution operations. Each
filter represents a specific pattern that the model is trying to detect in the
input. During the convolution process, the filter scans the input and creates a
feature map, which highlights regions in the input that match the pattern
represented by the filter. By using multiple filters, the convolutional layer can
simultaneously detect multiple features.
The process of training a CNN involves updating the weights of these filters during
the training process, allowing the model to learn and adapt to different patterns
present in the data. After training, the convolutional layers become feature
extractors

What is deep learning?


In deep learning, neural networks are organized in layers, and these layers are
referred to as "deep" because they can have many hidden layers between the input
and output layers. These hidden layers allow the network to learn hierarchical
representations of the data, capturing increasingly complex features as the
information propagates through the network.

What are transformers?


In the context of machine learning, transformers refer to a type of deep learning
model architecture that has gained significant popularity and success in various
natural language processing (NLP) tasks.
The transformer architecture is designed to process sequential data, such as
sentences or text, by employing the concept of self-attention. Self-attention
allows the model to weigh the importance of different words or tokens in the input
sequence when making predictions for each word/token

Key components of the transformer architecture include:


- Self-Attention Mechanism: The self-attention mechanism calculates attention
weights between all pairs of words/tokens in the input sequence. These weights
represent the importance of each word/token with respect to all others.
- Multi-Head Attention: Transformers often use multiple attention heads to capture
different aspects of the relationships between words/tokens. Multi-head attention
allows the model to learn multiple attention patterns and provides richer
representations.
- Encoder and Decoder Stacks: Transformers are typically composed of a stack of
encoder layers and, in some cases, decoder layers. The encoder processes the input
sequence and produces context-aware representations for each word/token. In
sequence-to-sequence tasks (like machine translation), the decoder uses these
representations to generate the output sequence
- Positional Encoding: Since transformers do not inherently encode the order of
words/tokens in the input sequence, positional encodings are added to provide
information about their positions.

In the context of Machine Learning (ML), data repositories and data warehouses
serve different purposes when it comes to managing and storing data:
- Data Repository:
A data repository is a general term used to describe a centralized location or
storage system where various types of data are stored. It is often used as an
umbrella term for any collection of data, whether structured, semi-structured, or
unstructured, that is organized and stored in a way that facilitates access and
retrieval. A data repository can include raw data, processed data, metadata, and
more. It's a more generic concept and can encompass various types of data storage
systems. General storage for various types of data, often without specific
analytical or reporting requirements. Can store diverse data types, including raw,
processed, and experimental data. Might have varying levels of data quality and
consistency. Used for data storage and sometimes as a stepping stone for data
preprocessing before analysis.

- Data Warehouse:
A data warehouse is a specific type of data repository that is designed to support
business intelligence (BI) and analytical processing. It is a centralized, large-
scale data storage system that consolidates data from various sources across an
organization. Designed to support analytical processing, reporting, and decision-
making by providing a structured and optimized environment for querying and
analyzing data. Typically stores structured data. Focuses on ensuring data quality,
consistency, and accuracy for reliable analysis. Used for business intelligence,
reporting, and advanced analytics

In short: in a data repository there is chaos, in a data warehouse there is order

Data pipelines:
Data pipelines refer to a series of processes that involve the collection,
transformation, and movement of data from various sources to a destination, often
with the goal of making the data ready for analysis, reporting, or other purposes
A data pipeline is not limited to moving data solely from a data repository to a
data warehouse, although that can be a common use case. A data pipeline can involve
various stages and components, including data extraction, transformation,
processing, and loading, as well as movement between different storage systems

ETL:
ETL stands for "Extract, Transform, Load," and it refers to a process commonly used
in data engineering to move data from source systems to a destination system, often
a data warehouse, while applying transformations along the way.
- Extract:
In the extraction phase, data is collected from various sources, which can include
databases, APIs, flat files, logs, and more
- Transform:
The transformation phase involves cleaning, enriching, and structuring the
extracted data so that it aligns with the requirements of the destination system.
Data may need to be cleaned of inconsistencies, duplicated entries, and errors.
Additionally, data transformations might involve aggregating, joining, pivoting,
and applying calculations to prepare the data for analysis
- Load:
In the loading phase, the transformed data is loaded into the target destination.
Loading data often includes considerations about how frequently new data needs to
be added (batch or real-time), how to handle updates and inserts, and how to
maintain data integrity.

Data lake:
A data lake is a centralized repository that allows organizations to store and
manage large volumes of structured, semi-structured, and unstructured data at any
scale. Unlike traditional databases or data warehouses, which often require data to
be structured before storing, data lakes accommodate raw data in its original
format. This flexibility makes data lakes particularly well-suited for storing
diverse and heterogeneous data sources, such as logs, sensor data, social media
posts, images, videos, and more. This is even less structured than a data
repository, and also more general purpose

aws S3 is a good place where to put a data lake and also a data repository

AWS Glue:
AWS Glue is a managed extract, transform, and load (ETL) service provided by Amazon
Web Services (AWS). It's designed to help users easily prepare and transform data
for analytics, reporting, and other purposes, without requiring extensive manual
coding or infrastructure management. AWS Glue simplifies the process of building
and managing ETL pipelines by providing tools for data cataloging, data
transformation, and data movement.
- Glue can read from S3
- Glue has crawlers: Crawlers discover and catalog metadata from various (usually
more than one) data sources such as Amazon S3, databases, and data warehouses.
Crawlers analyze the structure and schema of data: they figure out, for example from
a csv, what the schema is. The wizard in the console is magic and allows you to set
up roles and other things as well. You can decide when to run the crawler (on
demand, every day, etc.). You also have to set where to put the crawled data, and it
will create the corresponding table in the data catalog.
- glue has data catalog: a centralized metadata repository known as the AWS Glue
Data Catalog. This catalog stores metadata information about various data sources,
databases, tables, and transformations. INTERESTING: This catalog can also be
integrated with Amazon Athena. By cataloging your data in Glue, you can make it
easier for Athena to discover and query the data using SQL.
- Glue has ETL jobs:
AWS Glue allows users to create ETL jobs using a visual interface. These jobs
define the data transformations needed to convert raw data into a suitable format
for analysis or reporting. AWS Glue generates code (Python or Scala) based on the
visual ETL job configuration. Typical example of a job: Spark, but it is Spark with
basically zero code: you do everything with the interface. ETL capabilities can be
used to clean, transform, and prepare your data before querying it with Athena.

Batch Processing:
Batch processing involves collecting and processing a large amount of data in
predefined batches or chunks. Data is accumulated over a period of time, and then
processing is triggered on the collected batch.
Stream Processing:
Stream processing involves continuously processing and analyzing data as it arrives
in real-time. Data is processed piece by piece as it flows, allowing for quick
insights and immediate actions

AWS Data Pipeline:


AWS Data Pipeline is a web service provided by Amazon Web Services (AWS) that
allows you to automate the movement and transformation of data between different
AWS services and on-premises data sources. It provides a way to create, schedule,
and manage data workflows, making it easier to move data between various systems,
perform transformations, and execute data processing tasks. Often it moves things
from S3 to RDS or the other way around.
Also here you literally do everything with the console. For the different steps of
the pipeline you literally use the UI to select which of the possible ones to use
and to configure it. The possible steps are:
- CopyActivity:
This activity is used to copy data from a source location to a destination
location. It can be used to move data between Amazon S3 buckets, databases, and
other storage systems.
- HiveActivity:
Hive activities are used for executing Hive queries on data stored in Hadoop
Distributed File System (HDFS) or Amazon EMR clusters.
- PigActivity:
Similar to HiveActivity, PigActivity is used for executing Pig scripts on HDFS or
Amazon EMR clusters.
- EmrActivity:
This activity allows you to run custom Amazon EMR (Elastic MapReduce) cluster steps
as part of your pipeline.
- DataNode:
DataNode activities are used to define data nodes within a pipeline, specifying
data locations and formats for inputs and outputs of activities.
- An InputDataNode specifies the input data for an activity
- An OutputDataNode specifies the output data for an activity.
- Schedule:
A schedule activity specifies when a pipeline should start or stop. It can be used
to define the timing for the execution of the pipeline's activities.
- Resource:
Resource activities define resources required for activities, such as Amazon EC2
instances or Amazon EMR clusters.

AWS Batch:
AWS Batch is a managed service provided by Amazon Web Services (AWS) that enables
you to easily run batch computing workloads on the AWS Cloud. It simplifies the
process of provisioning, managing, and scaling compute resources for batch
processing tasks
Key features and components of AWS Batch include:
- Job Unit:
a task you have to do
- Job Definitions:
AWS Batch allows you to define job definitions that encapsulate the parameters and
configuration required for a specific batch processing task. Job definitions
include information about the Docker image to be used, resource requirements, and
other details.
- Job Queues:
Job queues are used to organize and manage the execution of batch jobs. You can
create multiple job queues with different compute environments and priority levels
- Compute Environments:
A compute environment is a set of compute resources that AWS Batch uses to execute
your batch jobs. These resources can be either EC2 instances or Fargate containers
- Security and Access Control:
AWS Batch supports AWS Identity and Access Management (IAM) for role-based access
control and security.
- Monitoring and Logging:
AWS Batch provides monitoring and logging capabilities, including integration with
Amazon CloudWatch for tracking resource utilization, job status, and other metrics.

A Jupyter Notebook is an interactive computing environment that enables users to
create and share documents containing live code, equations, visualizations,
explanatory text, and more. It's widely used in data science, research, and
education to develop and share code-driven narratives and analyses
Bernoulli distribution
The Bernoulli distribution is a discrete probability distribution that models a
random experiment with two possible outcomes: success (often denoted as "1") and
failure (often denoted as "0").

Binomial distribution
The binomial distribution is a discrete probability distribution that describes the
number of successes in a fixed number of independent Bernoulli trials (experiments
with two possible outcomes: success and failure)

The Bernoulli distribution and the binomial distribution are closely related but
different:
The Bernoulli distribution describes the outcome of a single binary experiment or
trial, where there are only two possible outcomes: success (1) or failure (0). It
models a single event with a fixed probability of success. The binomial
distribution describes the number of successes in a fixed number of independent
Bernoulli trial

Uniform distribution
The uniform distribution, often referred to as the continuous uniform distribution,
is a probability distribution that describes a continuous random variable with
outcomes that are equally likely within a specified interval. In simple terms, it
represents a situation where all values within a given range have the same
probability of occurring.

The normal distribution, also known as the Gaussian distribution, is one of the
most important and widely used probability distributions in statistics. It
describes the distribution of a continuous random variable that is symmetrically
distributed around its mean, creating the characteristic bell-shaped curve.

The Poisson distribution is a discrete probability distribution that describes the
number of events that occur within a fixed interval of time or space, given a known
average rate of occurrence. The average number of events in the interval is known a
priori, but the exact moment of each event is not. All events are independent.

The exponential distribution is a continuous probability distribution that models
the time between events in a process where events occur at a constant average rate
(so the time between the events in the Poisson distribution). It's often used to
describe the time between occurrences.

Time series data typically exhibits various components:


- Trend: Long-term movement or pattern in the data, indicating overall growth or
decline.
- Seasonality: Repeating patterns or fluctuations that occur at regular intervals
(e.g., daily, weekly, yearly).
- Cyclic Patterns: Long-term oscillations that are not as regular as seasonality
and may not have fixed periods.
- Irregular or Residual Component: Random fluctuations that cannot be explained by
the other components.
- Autocorrelation measures the correlation between a time series and a lagged
version of itself. This is important for identifying repeating patterns and
relationships.

Amazon QuickSight:
Amazon QuickSight is a cloud-based business intelligence (BI) and data
visualization service provided by Amazon Web Services (AWS). It enables users to
create interactive and insightful visualizations, dashboards, and reports from
various data sources.
QuickSight supports a wide range of data sources, including AWS services like
Amazon Redshift, Amazon RDS, Amazon S3, as well as on-premises databases,
spreadsheets, and third-party services
QuickSight includes built-in data preparation capabilities that allow you to clean,
transform, and shape your data

INTERESTING: you can ask questions in plain English to QuickSight: you can use
Natural Language Query (NLQ) in Amazon QuickSight to ask questions about your data
using plain English. Natural Language Query is a feature that allows you to
interact with your data and create visualizations by typing questions in a
conversational manner, without needing to write SQL queries or use complex tools.
For example, you can ask questions like:
"Show me total sales by product category."
"What are the sales trends over the last year?"
"How many orders were placed by each customer?"
"What is the average rating by product?"

AWS EMR
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform provided by
Amazon Web Services (AWS). It's designed to simplify and accelerate the processing
of large amounts of data using popular open-source frameworks such as Apache
Hadoop, Apache Spark, Apache Hive, Apache HBase, and more
- EMR provides a managed cluster environment where you can easily provision,
configure, and scale clusters to process large datasets. You can choose from a
variety of cluster configurations based on your needs.
- EMR allows you to run distributed data processing frameworks like Hadoop and
Spark, enabling you to process data in parallel across a cluster of virtual
machines (nodes).
- EMR can seamlessly integrate with various data storage services on AWS, including
Amazon S3, Amazon DynamoDB, Amazon RDS, and more
- EMR can integrate with other AWS services like AWS Glue, Amazon Kinesis, Amazon
Redshift

Dimensionality Dilemma:
In high-dimensional spaces, several issues can occur that complicate the analysis
and modeling of the data. Here are some key aspects of the Dimensionality Dilemma:
- if the number of dimensions increases, the computational resources and time
required for processing and analyzing the data also increase significantly
- In high-dimensional spaces, data points tend to be sparsely distributed
- With a large number of dimensions, there is a risk of overfitting.
- As the number of dimensions increases, the amount of data required to accurately
represent the underlying distribution or relationships becomes larger. This can
make it challenging to gather sufficient data to build accurate models.
- Visualizing data beyond three dimensions becomes challenging for humans

To address the Dimensionality Dilemma, various techniques are used:


- Feature Selection: Selecting the most informative features and discarding
irrelevant or redundant
- Dimensionality Reduction: Techniques like Principal Component Analysis (PCA)
- Regularization: Techniques like L1 regularization (Lasso) can encourage the model
to use fewer features by penalizing large coefficients.
- Domain Knowledge: Leveraging domain knowledge can help identify which features
are truly relevant

SageMaker:
Amazon SageMaker is a fully managed machine learning service provided by Amazon Web
Services (AWS). It aims to simplify the process of building, training, and
deploying machine learning models at scale. SageMaker provides a comprehensive set
of tools and capabilities that cater to various stages of the machine learning
lifecycle, making it easier for data scientists and developers to develop and
deploy machine learning models.
You have:
- End-to-End Workflow: SageMaker offers a complete workflow for machine learning,
including data preprocessing, model training, hyperparameter tuning, model
evaluation, and deployment.
- Managed Jupyter Notebooks: SageMaker provides Jupyter notebooks that are pre-
configured with the necessary libraries
- Built-In Algorithms: SageMaker includes a collection of built-in algorithms for
common machine learning tasks like classification, regression, clustering, etc
- Custom Algorithms: You can also bring your own custom algorithms and frameworks,
allowing you to develop models using popular libraries like TensorFlow, PyTorch etc
- AutoML: SageMaker provides AutoML capabilities, such as AutoML Autopilot, which
automates the process of selecting features, algorithms, and hyperparameters to
build optimal models

SageMaker feature store:


It's designed to simplify the process of storing, sharing, and managing features
(also known as attributes or variables) used in machine learning models. It is
where you put features to later analyze them. It can also take care of features
updates

SageMaker Ground Truth


Amazon SageMaker Ground Truth is a managed data labeling service provided by Amazon
Web Services (AWS) as part of the Amazon SageMaker platform. It's designed to help
data scientists and machine learning practitioners create high-quality labeled
training datasets for machine learning models. Ground Truth streamlines the process
of labeling data by combining human labeling with automated labeling, reducing the
time and effort required to prepare labeled training data.
Really, its interfaces for labeling things look a lot like captchas

Amazon Mechanical Turk:


like half a million contractors who will label things if you pay them

Simple data imputation:


Simple data imputation is a method used to handle missing or incomplete data by
replacing missing values with estimated or imputed values based on certain
assumptions or techniques.

Binning (aka Bucketing):


Refers to the process of grouping continuous numerical data into discrete intervals
or bins. Binning is used to transform continuous data into categorical data, which
can simplify analysis, visualization, and modeling. It's particularly useful when
you want to capture trends or patterns in the data that might not be apparent when
dealing with individual data points.
Example: you have an age column and you make it have only "young", "average" and
"old" values
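
For example with pandas (the bin edges and labels are arbitrary):

import pandas as pd

ages = pd.Series([12, 25, 47, 63, 81])
binned = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "average", "old"])
print(binned.tolist())  # ['young', 'young', 'average', 'old', 'old']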

Data Transformation:
Logarithmic Data Transformation:
In the context of data transformation, a logarithmic transformation involves
applying the logarithm function to the values of a dataset. The logarithm function
helps to compress or expand the scale of data, which can be particularly useful
when dealing with data that spans a wide range of values or when the data
distribution is skewed.

Square Root Data Transformation:


In the context of data transformation, a square root transformation involves taking
the square root of the values in a dataset. This transformation is used to achieve
a more balanced distribution of values, particularly when dealing with data that
has positive skewness (a long tail on the right side of the distribution) or when
the data values span a wide range.

Reciprocal Data Transformation:


In the context of data transformation, a reciprocal transformation involves taking
the reciprocal (the multiplicative inverse) of the values in a dataset. The
reciprocal of a number "x" is denoted as 1/x. This transformation is used to deal
with data that is highly skewed towards small values and aims to stretch the data
distribution to make it more spread out.

Data Scaling:
Data scaling, also known as feature scaling or normalization, is a preprocessing
step in data analysis and machine learning that involves transforming the numerical
features of a dataset to a common scale. Scaling is essential when features have
different ranges, units, or magnitudes, as many machine learning algorithms perform
better or converge faster when the data is on a similar scale.
Types:
- Normalization -> make it from 0 to 1
- Standard scaling -> z-score normalization, is a data scaling technique used in
data preprocessing to transform numerical features into a standardized distribution
with a mean of 0 and a standard deviation of 1. X_standardized = (X - X_mean) /
X_standard_deviation
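
Both are one-liners in scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])
X_normalized = MinMaxScaler().fit_transform(X)      # rescaled to the [0, 1] range
X_standardized = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1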

Text feature engineering involves transforming raw text data into numerical
features that can be used for machine learning and natural language processing
(NLP) tasks. Commonly, the first steps in text feature engineering include:
- Tokenization: Tokenization is the process of splitting a text document into
individual words or tokens. It's a crucial step as it breaks down the text into
manageable units that can be further processed. Tokenization can also include
handling punctuation and special characters.
- Lowercasing: Converting all text to lowercase helps ensure that words with
different capitalizations are treated as the same word.
- Stopword Removal: Stopwords are common words like "and," "the," "is," and "in"
that might not contribute much to the meaning of the text.
- Stemming and Lemmatization: Stemming and lemmatization are techniques for
reducing words to their base or root form. This helps to group related words
together and reduce the vocabulary size. Stemming involves removing suffixes or
prefixes, while lemmatization involves converting words to their dictionary form.
- Handling Special Characters and Numbers: Depending on the task, you might need to
decide how to handle special characters, numbers, URLs, and other non-textual
elements in the text data.
- Removing HTML Tags and Formatting: If working with web data, you might need to
remove HTML tags, scripts, and other formatting elements from the text.
- Dimensionality Reduction: Text data often results in high-dimensional feature
spaces. Dimensionality reduction techniques like Principal Component Analysis (PCA)
or t-SNE can help reduce the dimensionality while preserving important information
- Encoding Categorical Features: If your text data includes categorical features
(e.g., author names, categories), you'll need to encode them into numerical
representations using techniques like one-hot encoding or label encoding.

Term Frequency-Inverse Document Frequency (TF-IDF)
In the context of text feature engineering, TF-IDF stands for Term Frequency-
Inverse Document Frequency. It is a widely used technique for converting text data
into numerical features that can be used in various natural language processing
(NLP) tasks, such as text classification, clustering, and information retrieval.
The TF-IDF formula consists of two components:
- Term Frequency (TF): This component measures how often a term appears in a
document. It's calculated as the ratio of the number of times the term appears in
the document to the total number of terms in the document
- Inverse Document Frequency (IDF): This component measures the rarity of a term
across the entire corpus. It's calculated as the logarithm of the ratio of the
total number of documents in the corpus to the number of documents containing the
term.
The final TF-IDF score for a term in a document is obtained by multiplying the TF
and IDF components:
TF-IDF = TF * IDF

Key points about TF-IDF:


- High TF-IDF values indicate terms that are important to a specific document but
not common across the entire corpus.
- Low TF-IDF values indicate either unimportant terms or terms that are common
across many documents.
- TF-IDF is sparse: Each document's TF-IDF vector is sparse, containing mostly
zeros, as most words are not present in a given document.
- TF-IDF can be used to represent documents as vectors, where each dimension
represents a unique term in the corpus.
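
A sketch with scikit-learn's TfidfVectorizer (the three toy documents are made up;
get_feature_names_out assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())  # the vocabulary (one dimension per term)
print(X.toarray().round(2))                # mostly zeros, as expected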

Bag of Words (BoW)


The Bag of Words (BoW) model is a fundamental concept in natural language
processing (NLP) and text mining. It's a simple yet effective way to represent text
data as numerical features that can be used for various machine learning tasks. The
BoW model disregards the order and structure of words in a document and focuses
solely on the frequency of words in the text.
Ideas:
- Vocabulary Creation: A vocabulary is created by collecting all unique words
(terms) from the entire corpus of documents. Each word in the vocabulary becomes a
dimension in the feature space.
- Term Frequency Count: For each document in the corpus, the frequency of each word
in the vocabulary is counted. This creates a vector of word frequencies for each
document. Each element in the vector represents the count of a specific word in the
document.
- Sparsity: Since most documents contain only a subset of the vocabulary words, the
resulting document vectors are typically sparse, with many zero values.
- Order Independence: The BoW model treats documents as unordered collections of
words. The order of words and grammar structure are not considered, which can lead
to the loss of some semantic meaning.
- Word Frequency Matters: The BoW model assumes that the frequency of words in the
text reflects their importance. However, common words like "the" and "and" may have
high frequencies but might not be semantically significant.
- Word Variations: The BoW model doesn't differentiate between different word forms
or word variations (stemming and lemmatization can be applied to address this).

N-grams:
An extension of the BoW model that works with more than 1 term.
N-grams are contiguous sequences of "n" items (usually words, characters, or
tokens) extracted from a text document. N-grams are used in natural language
processing (NLP) and text analysis to capture local patterns, relationships between
adjacent words, and context within a sequence of words.
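
Bag of words and n-grams side by side with scikit-learn's CountVectorizer (the toy
sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran"]
bow = CountVectorizer()                        # unigrams only: plain bag of words
print(bow.fit_transform(docs).toarray())       # word counts per document

bigrams = CountVectorizer(ngram_range=(1, 2))  # unigrams plus bigrams like "the cat"
bigrams.fit(docs)
print(bigrams.get_feature_names_out())         # includes 'the cat', 'cat sat', 'cat ran'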

Things that ML cannot do:


1) Describe the cluster after it found it
2) Anomaly detection
3) work on data without patterns
4) explain "why"
5) work when data are not enough
6) art (?)
7) sarcasm detection
8) generalization

Precision:
True positives / (true positives + false positives). AKA how many of the things I
found positive are positive for real?

Recall:
True positives / (true positives + false negatives). AKA how many of the real
positives I actually found as positives

F1 Score:
The F1 score is a commonly used metric in machine learning for evaluating the
performance of classification models, especially when dealing with imbalanced
datasets.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
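
A sketch with scikit-learn (the labels are made up):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 0.75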

Gradient Boosting:

XGBoost:
XGBoost, which stands for "Extreme Gradient Boosting," is a powerful and widely
used machine learning algorithm for supervised learning tasks, particularly in the
domain of structured data and tabular datasets. XGBoost is an ensemble learning
method that combines the strengths of decision trees and gradient boosting
techniques to achieve high predictive accuracy and generalization performance.
Points:
- Gradient Boosting Framework: XGBoost is based on the gradient boosting framework,
which is a technique for building an ensemble of weak learners (usually decision
trees) to create a strong predictive model. It trains successive trees to correct
the errors of the previous ones.
- Decision Trees: XGBoost uses decision trees as base learners. These trees are
shallow, which helps reduce overfitting and makes them computationally efficient.
- Regularization: XGBoost introduces regularization techniques to prevent
overfitting and improve model generalization. L1 and L2
- Custom Loss Functions: XGBoost allows users to define custom loss functions for
specific problems, providing flexibility for tackling a wide range of tasks.
- Handling Missing Data: XGBoost can handle missing data directly during training,
reducing the need for explicit imputation.
- Parallel and Distributed Computing: XGBoost is designed for efficiency and can
make use of parallel and distributed computing to speed up training and prediction
processes.
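
A minimal sketch with the xgboost Python package (the toy dataset and the
hyperparameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data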

RecordIo format:
RecordIO is the name for a set of binary data exchange formats. The basic idea is
to divide the data into individual chunks, called ‘records’, and then to prepend to
every record its length in bytes, followed by the data.
Since there is no formal specification of the RecordIO format, there tend to be
slight incompatibilities between RecordIO implementations
In the protobuf recordIO format, SageMaker converts each observation in the dataset
into a binary representation as a set of 4-byte floats, then loads it in the
protobuf values field. If you are using Python for your data preparation, we
strongly recommend that you use these existing transformations. There are lots of
RecordIO formats; the one used in AWS is Apache MXNet's mxnet.recordio.
RecordIO is more space efficient than CSV.
If you need a human-readable and widely supported format for data exchange and
analysis, CSV is a good choice. On the other hand, if you're dealing with large
datasets, complex data structures, or require efficient data serialization and
streaming for machine learning, then RecordIO might be more suitable. It's also
worth noting that there are other formats like JSON, Parquet, and Avro that offer
different trade-offs in terms of readability, efficiency, and features.

Minibatch size:
In the context of machine learning, a "minibatch" refers to a subset of the
training data that is used to update the model's parameters during each iteration
of the training process. Training a machine learning model involves updating its
parameters (weights and biases) based on the differences between the model's
predictions and the actual target values in the training dataset. This process is
typically guided by an optimization algorithm, such as gradient descent. Using the
entire training dataset for each update can be computationally expensive and slow,
especially when dealing with large datasets. To address this, the concept of
minibatching is introduced. Instead of using the entire dataset, training is
performed on smaller subsets of data, known as minibatches.

Learning rate:
The learning rate is a hyperparameter that plays a crucial role in the optimization
process of training machine learning models. It determines the step size or the
magnitude of parameter updates during each iteration of the optimization algorithm.
A high learning rate results in larger steps, which can lead to faster convergence
but risks overshooting the optimal solution. A low learning rate takes smaller
steps, which can lead to more accurate convergence but might be slower.

Momentum:
Momentum helps the optimization process move more smoothly and efficiently towards
the optimal solution by introducing a form of inertia or memory to the parameter
updates.
Momentum introduces the concept of "velocity" to this process. Instead of simply
updating the parameters based on the current gradient, momentum involves keeping
track of the previous parameter updates and using them to influence the current
update. This creates a smoothing effect that helps overcome oscillations and
obstacles in the optimization landscape.
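
To make both the learning rate and momentum concrete, here is a small sketch of gradient descent with momentum on a simple quadratic function (the function and the hyperparameter values are illustrative):

import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * ||w||^2, whose minimum is at w = 0
    return w

w = np.array([5.0, -3.0])       # initial parameters
velocity = np.zeros_like(w)     # "memory" of previous updates
learning_rate = 0.1             # step size of each update
momentum = 0.9                  # how much of the previous velocity is kept

for step in range(100):
    velocity = momentum * velocity - learning_rate * grad(w)
    w = w + velocity            # parameters move along the accumulated velocity

print(w)  # should be close to the optimum [0, 0]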

Freeze layers (fine-tuning):


The "freeze layer" pattern is a technique commonly used in transfer learning when
fine-tuning deep neural networks. Transfer learning involves taking a pre-trained neural
network, which has already been trained on a large dataset, and adapting it to a new task
or dataset that might be smaller or related but not identical. Freezing means keeping the
weights of selected (usually early) pre-trained layers fixed, so that only the remaining
layers are updated on the new data.
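
A minimal PyTorch sketch of the pattern (the tiny network here is only a stand-in for a real pre-trained backbone): the pre-trained layers are frozen and only the new head is trained.

import torch
import torch.nn as nn

# Stand-in for a pre-trained feature extractor (in practice, a real pre-trained network)
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
# New task-specific head, randomly initialized
head = nn.Linear(32, 5)

# Freeze the backbone: its weights will not receive gradient updates
for param in backbone.parameters():
    param.requires_grad = False

model = nn.Sequential(backbone, head)
# Only pass the trainable (head) parameters to the optimizer
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(16, 128)                 # dummy batch
labels = torch.randint(0, 5, (16,))
loss = nn.CrossEntropyLoss()(model(x), labels)
loss.backward()                          # gradients flow only into the head
optimizer.step()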

Semantic segmentation
Semantic segmentation is a computer vision task in machine learning that involves
classifying each pixel in an image into one of several predefined classes or
categories. The goal of semantic segmentation is to partition an image into
meaningful regions or segments, where each segment corresponds to a specific object
or region of interest in the scene.
In other words, semantic segmentation goes beyond traditional image classification,
which assigns a single label to an entire image. Instead, semantic segmentation
provides a detailed understanding of the content of an image by assigning a label
to each pixel, indicating the category of the object or material present at that
pixel location.

How do you optimize hyperparameters?


When deciding hyperparameters for machine learning models, you have several options
for optimization techniques.
- Grid Search:
In grid search, you define a grid of possible hyperparameter values for each
hyperparameter you want to tune. The algorithm systematically tries all possible
combinations of values within the grid and evaluates the performance of the model
using cross-validation. Grid search is straightforward and easy to implement, but
it can be computationally expensive
- Random Search:
Random search involves randomly selecting hyperparameter values from predefined
ranges for a certain number of iterations. While random search doesn't cover all
possible combinations like grid search, it's often more computationally efficient
- Bayesian Optimization:
Bayesian optimization uses probabilistic models to model the relationship between
hyperparameters and model performance. It iteratively selects the next set of
hyperparameters based on the uncertainty in the model's performance. This approach
can be efficient in high-dimensional spaces and is particularly useful when
evaluating the model is time-consuming.
- Hyperband:
Hyperband combines random search with a bandit-based strategy, dynamically
allocating resources to promising hyperparameter configurations while discarding
unpromising ones. It's effective when you have limited computational resources and
want to quickly identify good-performing configurations.
- Genetic Algorithms:
Genetic algorithms are inspired by evolutionary processes (selection, crossover, and
mutation), evolving a population of hyperparameter configurations over generations.
- Particle Swarm Optimization:
Particle Swarm Optimization (PSO) is inspired by the social behavior of birds and
fish. It involves a population of particles that move through the hyperparameter
space to find optimal configurations by following the best-performing particles.
- Gradient-Based Optimization:
Some libraries and tools allow you to optimize hyperparameters using gradient-based
methods. These methods require defining a differentiable performance metric and
using optimization algorithms similar to those used for training models.
- Automated Hyperparameter Tuning Libraries:
There are various libraries and platforms designed specifically for automated
hyperparameter tuning, such as Hyperopt, Optuna, and Ray Tune. These implement most of
the above strategies and more.

Good idea: limit tuning to only a subset of the most important parameters, and limit each
one to a narrow range, or you'll spend your life on it (see the scikit-learn sketch below).
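
A minimal sketch of grid search and random search with scikit-learn, restricted to a couple of important parameters and narrow ranges as suggested above (the model and the ranges are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Only a small subset of parameters, each with a limited range
param_grid = {"n_estimators": [100, 200], "max_depth": [4, 8, None]}

grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print("grid search best:", grid.best_params_)

# Random search samples a fixed number of configurations from the same space
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=4, cv=3, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_)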

Seq2Seq:
A Seq2Seq (Sequence-to-Sequence) algorithm, also known as an Encoder-Decoder model,
is a type of architecture used in machine learning for tasks involving sequences,
such as natural language processing, machine translation, text summarization, and
more.
The core idea of a Seq2Seq model involves two main components:
- Encoder: The encoder is responsible for processing the input sequence and
transforming it into a fixed-size context vector or representation. It's typically
implemented using recurrent neural networks (RNNs), such as LSTM (Long Short-Term
Memory) or GRU (Gated Recurrent Unit), which are capable of capturing sequential
information.
- Decoder: The decoder takes the context vector produced by the encoder and
generates the output sequence step by step. It's also implemented using RNNs, where
each step involves predicting the next element in the sequence based on the context
vector and the previously generated elements. The decoder can also use attention
mechanisms to focus on different parts of the input sequence during each decoding
step.

So, the steps are:


- Encoding: The input sequence is fed into the encoder RNN. The encoder processes
the input sequence step by step, updating its hidden state at each step. The final
hidden state or context vector represents the encoded information of the entire
input sequence.
- Decoding: The decoder RNN takes the context vector as its initial hidden state
and generates the output sequence. It starts with a special start-of-sequence token
and generates one token at a time. At each step, the decoder's hidden state and the
previously generated token are used to predict the next token. This process
continues until an end-of-sequence token is generated or a predefined maximum
length is reached.
- Attention Mechanism: Attention mechanisms allow the decoder to pay different
levels of attention to different parts of the input sequence during each decoding
step.
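
A stripped-down PyTorch sketch of the encoder/decoder idea described above (GRU-based, no attention; vocabulary sizes and dimensions are made up). It also shows where the embedding layers discussed next fit in:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # tokens -> dense vectors
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        _, hidden = self.rnn(self.embedding(src_tokens))
        return hidden                                        # fixed-size context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)         # scores over the vocabulary

    def forward(self, prev_tokens, hidden):
        output, hidden = self.rnn(self.embedding(prev_tokens), hidden)
        return self.out(output), hidden

# One encode/decode step on dummy data
encoder, decoder = Encoder(vocab_size=100), Decoder(vocab_size=100)
src = torch.randint(0, 100, (2, 7))          # batch of 2 input sequences, length 7
context = encoder(src)                       # encoding
start_token = torch.zeros(2, 1, dtype=torch.long)
logits, _ = decoder(start_token, context)    # first decoding step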

Embedding layer (usually pre-trained):


The embedding layer plays a crucial role in transforming discrete tokens (words or
symbols) into continuous vector representations (from sparse to dense), allowing
the model to work with the input and output sequences in a continuous space (so
that similar words are in similar vectors in the embedding space). These embeddings
capture semantic relationships and contribute to the model's ability to understand
and generate meaningful sequences. The embedding layer is used in both the encoder
and the decoder of a Seq2Seq architecture to bridge the gap between discrete tokens
and continuous computations.

Beam size:
In the context of Seq2Seq algorithms, the "beam size" refers to a parameter used in
beam search, a decoding technique commonly employed in sequence generation tasks,
including machine translation and text generation.
It is a non-greedy algorithm: instead of greedily choosing the token with the
highest probability at each decoding step, beam search maintains a list of multiple
candidate sequences, called the "beam." The beam size determines the number of
candidate sequences to consider at each step.
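
A toy sketch of beam search over a made-up next-token probability function, just to show how the beam keeps several candidate sequences instead of one greedy choice (in a real Seq2Seq model the probabilities come from the decoder):

import math

def next_token_probs(sequence):
    # Made-up stand-in for a decoder step: returns {token: probability}
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def beam_search(beam_size=3, max_len=4):
    # Each beam entry is (sequence, cumulative log-probability)
    beam = [(["<sos>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            if seq[-1] == "<eos>":          # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            for token, prob in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the beam_size best candidates
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam

for seq, score in beam_search():
    print(seq, round(score, 3))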

Metrics for Seq2Seq:


- Accuracy: Accuracy is a commonly used metric in classification tasks, but it's
less common in Seq2Seq tasks where the output sequences can have different lengths
and the focus is on generating coherent and meaningful sequences rather than exact
matches. In Seq2Seq tasks, accuracy might refer to token-level accuracy, where you
calculate the percentage of correctly predicted tokens in the generated sequence
compared to the reference sequence.
- Perplexity: Perplexity is a metric often used to measure the quality of language
models. It quantifies how well a probability distribution (in this case, the
predicted sequence) predicts a given sample (the reference sequence). Lower
perplexity indicates better model performance. Perplexity is calculated using the
cross-entropy loss and the number of predicted tokens. It provides a measure of how
well the model's predicted probabilities align with the true distribution of the
data.
- BLEU: BLEU is a widely used metric for evaluating the quality of machine-
generated text, especially in machine translation. It compares the generated
sequence against one or more reference (human-generated) sequences and computes a
score based on the overlap of n-grams (continuous sequences of words) between the
generated and reference sequences. BLEU ranges from 0 to 1, where a higher BLEU
score indicates better alignment with the reference sequences.
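
A short example of computing sentence-level BLEU with NLTK (assuming nltk is installed; the sentences are made up, and smoothing is applied because they are short):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]     # human reference (tokenized)
candidate = ["the", "cat", "is", "on", "the", "mat"]      # generated output (tokenized)

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(score)  # between 0 and 1; higher means closer to the reference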

Object2Vec -> so innovative that ChatGPT has never heard of it


Object2Vec is a highly customizable multi-purpose algorithm that can learn
embeddings of pairs of objects. The embeddings are learned in a way that it
preserves their pairwise similarities - Similarity is user-defined: users need to
provide the algorithm with pairs of objects that they define as similar (1) or
dissimilar (0); alternatively, the users can provide a real-valued similarity score
for each object pair.
The learned embeddings can be used to compute nearest neighbors of objects, as well
as to visualize natural clusters of related objects in the embedding space. In
addition, the embeddings can also be used as features of the corresponding objects
in downstream supervised tasks such as classification or regression
A pair could be user-action, sentence-sentence, label-sequence, etc., so it is very
general purpose.

DeepAR forecasting -> an AWS algorithm


DeepAR is a forecasting algorithm. It is designed for time series forecasting,
which involves predicting future values based on historical observations of a
variable over time. DeepAR is particularly well-suited for scenarios where the time
series data exhibit complex patterns and dependencies.
It combines the strengths of autoregressive models and recurrent neural networks
(RNNs) to provide accurate and probabilistic forecasts. It's especially useful when
dealing with long time series and when you need to capture seasonality, trends, and
other temporal patterns.

Model Architecture: DeepAR employs a recurrent neural network architecture, typically
using long short-term memory (LSTM) units. LSTMs are well-suited for handling sequences
with long-range dependencies, which is common in time series data.

Training: The algorithm is trained on historical time series data, learning to capture
the patterns and relationships within the data. During training, the model generates
probabilistic forecasts, allowing it to capture uncertainty in its predictions.

Quantile Loss: DeepAR uses a quantile loss function during training, which
encourages the model to produce probabilistic forecasts at different quantiles
(percentiles). This means that DeepAR provides not only point forecasts but also
prediction intervals, allowing you to estimate the uncertainty of the forecast.

Context Length:
In the context of the DeepAR forecasting algorithm, the "context length" parameter
is an important setting that influences how the algorithm uses historical data for
making predictions. The context length determines how many past time steps
(observations) the algorithm considers when generating forecasts for future time
steps.
Prediction length:
The prediction length specifies the number of future time steps for which you want
the algorithm to generate forecasts

Don't exaggerate with the context and prediction lengths: large values create a big model
that needs a huge amount of data.
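
As a hedged illustration, the context and prediction lengths are exposed as hyperparameters of the SageMaker DeepAR algorithm; below is the kind of dictionary you would pass (values are illustrative, and the extra names time_freq and epochs are assumptions to be checked against the current DeepAR documentation):

# Illustrative DeepAR hyperparameters; keep context_length and prediction_length modest
deepar_hyperparameters = {
    "time_freq": "D",            # daily data
    "context_length": "30",      # how many past time steps the model looks at
    "prediction_length": "14",   # how many future time steps to forecast
    "epochs": "100",
}
# These would typically be passed to a SageMaker Estimator via
# estimator.set_hyperparameters(**deepar_hyperparameters)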

Random cut forest:


Random Cut Forest (RCF) is an anomaly detection algorithm designed to identify outliers
and anomalies in high-dimensional data.

Latent Dirichlet Allocation (LDA):


Unsupervised, not deep learning based
LDA is a probabilistic generative model used in natural language processing (NLP)
and topic modeling. It's designed to uncover the latent topics that underlie a
collection of documents. The fundamental idea behind LDA is that each document is a
mixture of various topics, and each topic is a distribution over words. LDA helps
identify the topics present in a large corpus of text documents and how those
topics are distributed across the documents.

Neural topic model:


A Neural Topic Model is a type of topic modeling approach that leverages neural
network architectures to discover latent topics within a collection of text
documents. Traditional topic models like Latent Dirichlet Allocation (LDA) are
probabilistic graphical models, while neural topic models integrate neural network
components to learn topic representations from data in a more flexible and
potentially more powerful manner.

Factorization Machine (FM):


A Factorization Machine (FM) is a machine learning algorithm that's particularly
effective for handling high-dimensional sparse data.
It can be viewed as an extension of a linear model that adds pairwise interaction terms.
The primary motivation behind Factorization Machines is to model pairwise
interactions between features, which is a limitation of traditional linear models.
These pairwise interactions are important for capturing non-linear relationships
and dependencies in data.
It can be used for classification and regression, and it is a supervised algorithm.
In practice:
Imagine you're trying to predict whether a person will like a movie based on
certain features like the movie's genre, the age of the viewer, and the director:
- Pairs of Features:
A Factorization Machine looks at pairs of features to understand how they interact
and contribute to the prediction. In our movie example, it's not just about the
genre or the viewer's age alone—it's about how the genre and age together influence
the viewer's preference.
- Latent Factors:
The magic happens with latent factors. Think of these as hidden traits or
characteristics associated with features. For example, each genre might have a
hidden factor related to its popularity, and each viewer might have a hidden factor
related to their preferences. These factors help capture complex relationships
between features.
- Factorization:
The Factorization Machine breaks down these latent factors into smaller components
and then combines them back together in a way that highlights their interactions.
It uses math to factorize these factors and calculate how they contribute to the
overall prediction.
- Learning from Data:
To make predictions, the Factorization Machine uses a learning process. It takes a
bunch of data where you know the features and the actual outcomes (liked or not
liked), and it adjusts its understanding of the latent factors to minimize the
difference between its predictions and the actual outcomes.
Predictions:
Once the Factorization Machine has learned from the data, it can predict whether a
new person will like a movie based on their age, the movie's genre, and other
features. It considers not just the features individually, but how they combine and
interact through those latent factors.
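
The scoring function can be written compactly. Here is a minimal NumPy sketch of an FM prediction, using the standard reformulation of the pairwise-interaction term (the weights here are random stand-ins, not learned values):

import numpy as np

def fm_predict(x, w0, w, V):
    # w0: global bias, w: linear weights (n_features,),
    # V: latent factors (n_features, k), one k-dimensional vector per feature
    linear = w0 + w @ x
    # Pairwise interactions sum_{i<j} <v_i, v_j> x_i x_j, computed in O(n*k)
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.random(n_features)   # e.g. encoded features for one (user, movie) example
score = fm_predict(x, w0=0.1, w=rng.normal(size=n_features),
                   V=rng.normal(size=(n_features, k)))
print(score)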

IP Insights: another AWS-only algorithm


Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the
usage patterns for IPv4 addresses. It is designed to capture associations between
IPv4 addresses and various entities, such as user IDs or account numbers. You can
use it to identify a user attempting to log into a web service from an anomalous IP
address, for example

Parallelism:
Parallelism in machine learning refers to the practice of executing tasks
concurrently in order to speed up the training or inference process of machine
learning models. Parallelism takes advantage of modern hardware architectures, such
as multi-core CPUs or distributed computing environments, to process data more
efficiently. There are two main types of parallelism commonly used in machine
learning: model parallelism and data parallelism.

- Model Parallelism:
Model parallelism involves dividing a machine learning model into different parts
or layers and distributing those parts across different compute resources. Each
resource (such as a CPU or GPU) is responsible for computing the forward and
backward passes for a specific section of the model. This approach is useful when a
single device does not have enough memory to store and compute the entire model.
Model parallelism is commonly used in large models where certain layers are memory-
intensive. For example, in deep neural networks, convolutional layers may require
more memory than fully connected layers

- Data Parallelism:
Data parallelism involves distributing the data used for training across multiple
compute resources. Each resource processes a subset of the data, computes
gradients, and updates the model's parameters. The updates are then combined to
update the global model. Data parallelism is effective when dealing with large
datasets that can be divided into smaller batches.
In deep learning, data parallelism is particularly useful. Neural networks are
trained using gradient-based optimization techniques, and data parallelism allows
each resource to compute gradients for different data batches simultaneously. These
gradients are then averaged or combined to update the model's parameters.

Data parallelism is truly parallel; model parallelism is only partially parallel. So the
best approach is to use data parallelism if your model fits on one device, and model
parallelism otherwise.

Epoch: An epoch is a single pass through the entire training dataset.


Batch: A batch is a subset of the training data that is used to update the model's
parameters at each iteration during an epoch

Autopilot:
SageMaker Autopilot refers to a set of automated machine learning (AutoML) capabilities.
AutoPilot automates many of the steps involved in the machine learning lifecycle,
including data preprocessing, feature engineering, model selection, hyperparameter
tuning, and model evaluation. It uses advanced algorithms to automatically select
the best performing model architecture and hyperparameters.
It works only with CSV and JSON.

Amazon SageMaker Debugger is a feature of Amazon SageMaker that is designed to help you
identify and resolve issues related to model training by providing real-time monitoring,
analysis, and insights into the training process.
You can see:
- Real-Time Monitoring: SageMaker Debugger continuously monitors the training
process of your machine learning models in real time. It captures various types of
data, including tensors (intermediate values during computations) and gradients,
while the model is being trained.
- Built-In Rules: Debugger comes with a set of built-in rules that analyze the
captured data to detect common training issues such as overfitting, vanishing
gradients, and more.
- Interactive Visualizations: Debugger provides interactive visualizations that
allow you to explore the captured data, tensors, gradients, and other relevant
metrics. These visualizations make it easier to diagnose issues and understand the
training behavior.
- Local Mode: You can use SageMaker Debugger's local mode to test and debug
training jobs on your local machine before scaling up to the cloud.

Sagemaker experiments:
To organize, track, and manage their machine learning experiments in a structured
and systematic way.
Machine learning experiments involve training and evaluating different models with
various hyperparameters, data preprocessing techniques, and configurations.
SageMaker Experiments provides a framework to keep track of these experiments,
making it easier to understand what works, what doesn't, and how to improve your
models.
Interesting: Autopilot actually uses experiments

SageMaker clarify:
SageMaker Clarify is designed to help machine learning practitioners detect and
mitigate bias and explainability issues in their models, ensuring fairness and
transparency in the deployment of machine learning systems.
Bias and explainability are crucial aspects of responsible AI, as they help ensure
that models are not making unfair or biased predictions and that the decisions made
by models are understandable and explainable to users.
It can do:
- Bias Detection:
Clarify helps you identify potential sources of bias in your machine learning
models by analyzing and measuring bias across different groups defined by sensitive
attributes (such as gender or ethnicity).
- Explainability:
Clarify provides tools to explain how predictions are made by your models. It
calculates feature importance scores to help you understand which features have the
most influence on predictions.

Amazon Comprehend:
Amazon Comprehend is a natural language processing (NLP) service provided by Amazon
Web Services (AWS). It offers a range of capabilities for analyzing and
understanding text data, making it easier to extract valuable insights, sentiments,
entities, and relationships from text documents. Amazon Comprehend is designed to
help developers and businesses process large amounts of text data efficiently and
derive meaningful information from it. Can do:
- Sentiment Analysis:
Comprehend can determine the sentiment of a piece of text, whether it's positive,
negative, neutral, or mixed. This is particularly useful for understanding customer
feedback, social media posts, and reviews.
- Entity Recognition:
The service can identify and classify entities mentioned in text, such as names of
people, organizations, dates, locations, and more. This helps in extracting
structured information from unstructured text.
- Keyphrase Extraction:
It can extract important keywords or phrases from a document, summarizing its main
topics or themes.

Amazon Kendra:
Amazon Kendra is an artificial intelligence (AI)-powered search and information
retrieval service offered by Amazon Web Services (AWS). It's designed to make it
easier for organizations to build powerful and highly accurate search solutions for
their internal and external content. Kendra is particularly focused on addressing
the challenges of enterprise search, allowing users to find relevant information
quickly and efficiently across a wide range of data sources.

Amazon Transcribe:
Amazon Transcribe is an automatic speech recognition (ASR) service provided by
Amazon Web Services (AWS). It's designed to convert spoken language into written
text, making it easier to process, analyze, and search audio content. Amazon
Transcribe is widely used for various applications, including transcribing customer
service calls, generating subtitles for videos, converting spoken content into text
for analysis, and more.

Amazon Polly -> the opposite of Transcribe


Amazon Polly is a cloud service provided by Amazon Web Services (AWS) that enables
developers to convert text into lifelike speech using advanced deep learning
technologies. Polly is designed to create natural-sounding speech in multiple
languages and voices, making it easier to add speech capabilities to applications,
services, and products.

Amazon Rekognition:
Amazon Rekognition is a machine learning service provided by Amazon Web Services
(AWS) that offers a range of image and video analysis capabilities. Rekognition
uses advanced computer vision techniques to analyze visual content, extract
information, and identify objects, faces, scenes, and more from images and videos.
It is really cheap!

Amazon Personalize:
Amazon Personalize uses advanced machine learning algorithms to analyze historical
user data, such as browsing history, purchase history, and engagement patterns. It
then generates personalized recommendations or predictions based on this data. The
service is designed to be scalable and easy to integrate into various applications,
websites, and platforms.

Amazon Forecast -> like DeepAR but more general


Amazon Forecast is a machine learning service provided by Amazon Web Services (AWS).
It's designed to help businesses and developers create accurate forecasts and
predictions based on historical data. Amazon Forecast is particularly useful for
demand forecasting, inventory planning, resource allocation, and other time-series
forecasting scenarios.
It can do:
- Time Series Forecasting: Amazon Forecast is specifically designed to handle time
series data, which involves predicting future values based on historical patterns.
More or less like DeepAR forecasting.
- Automated Machine Learning: The service uses automated machine learning
techniques to identify the best forecasting models for your data
- Data Import and Preparation: Amazon Forecast supports various data formats and
helps users preprocess and clean their historical data before creating forecasts.
- Forecast Accuracy Evaluation: The service provides tools to evaluate the accuracy
of the generated forecasts using metrics such as Mean Absolute Error (MAE) and Root
Mean Squared Error (RMSE).

How are DeepAR forecasting and AWS Forecast different?


- Logic behind:
DeepAR is a forecasting algorithm that is based on deep learning techniques, while
Forecast is more general and incorporates a range of forecasting algorithms, including
traditional statistical methods as well as machine learning algorithms like DeepAR,
to automatically select the best model for a given dataset.
- Use cases: DeepAR is well-suited for use cases where capturing complex temporal
dependencies is crucial. It can be effective for scenarios involving irregular
patterns, long-term trends, and unusual events. Amazon Forecast is a versatile
service suitable for a wide range of industries and applications. It's particularly
useful for businesses that need accurate forecasts but might not have deep
expertise in machine learning. It can handle demand forecasting, inventory
planning, financial planning, and more.

Amazon Textract:
Amazon Textract is a machine learning service provided by Amazon Web Services (AWS)
that is designed to extract structured data from scanned documents, forms, and
images containing text. It uses advanced optical character recognition (OCR)
technology along with machine learning algorithms to analyze documents and extract
valuable information, such as text, tables, and forms, which can then be used for
various business processes and applications.
It can do:
- Document Analysis:
Amazon Textract can analyze a variety of documents, including invoices, contracts,
receipts, forms, and more. It can identify text, tables, and even key-value pairs.
- Integration:
Amazon Textract provides APIs that allow developers to integrate the service into
their applications, workflows, or systems.
- Data Output Formats:
Extracted data can be output in various formats, including JSON and CSV, making it
easy to integrate with downstream processes.
- Scalable

AWS Panorama:
AWS Panorama is a service provided by Amazon Web Services (AWS) that is designed to
bring computer vision (CV) capabilities to your existing on-premises cameras. It's
essentially an edge computing solution that allows you to deploy computer vision
models and applications directly onto your cameras, enabling real-time analysis and
inference without sending video data to the cloud.

Amazon Fraud Detector:


Amazon Fraud Detector is a service provided by Amazon Web Services (AWS) that uses
machine learning and advanced analytics to help businesses detect and prevent
online fraud. It's designed to assist organizations in identifying potentially
fraudulent activities and transactions in real-time, allowing them to take
appropriate actions to mitigate risk and protect their customers.

Amazon Lex:
Amazon Lex is a service provided by Amazon Web Services (AWS) that allows you to
build conversational interfaces, commonly known as chatbots or conversational
agents, using natural language understanding (NLU) and natural language processing
(NLP) capabilities. Amazon Lex enables you to create interactive and intelligent
chatbots that can engage in text or voice-based conversations with users, providing
information, answering questions, and performing tasks. It is powered by the same deep
learning technology as Alexa.

Recommendation systems:
Recommendation systems, also known as recommender systems, are a subset of machine
learning techniques that aim to predict and suggest items that a user might be
interested in
Here are some of the main types:
- Content-Based Filtering: Content-based filtering recommends items based on the
characteristics of the items themselves and the user's past preferences. It relies
on feature extraction from the items and then compares these features to the user's
preferences. If the extracted features match the user's historical preferences,
those items are recommended. It considers only the current user, not their similarity to
other users, SO IT SCALES BETTER
- Collaborative Filtering, with 2 subtypes
- User-Based Collaborative Filtering: This approach recommends items to a
user based on the preferences and behaviors of users who are similar to them. It
identifies users with similar item preferences and suggests items that those
similar users have liked.
- Item-Based Collaborative Filtering: In this approach, recommendations are
made by identifying similar items based on user preferences. If a user likes or
interacts with certain items, the system recommends other items that are similar to
those the user has shown interest in.
- Hybrid Approaches: Hybrid recommendation systems combine multiple techniques to
improve recommendation accuracy.
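
A tiny NumPy sketch of item-based collaborative filtering on a made-up user-item rating matrix: compute item-item cosine similarity, then score unseen items for a user:

import numpy as np

# Rows = users, columns = items; 0 means "not rated" (made-up data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Item-item cosine similarity
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
similarity = (ratings.T @ ratings) / (norms.T @ norms)

# Score items for user 0 as a similarity-weighted sum of their existing ratings
user = ratings[0]
scores = similarity @ user
scores[user > 0] = -np.inf          # don't re-recommend items already rated
print("recommend item:", int(np.argmax(scores)))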

Protocol Buffers, often referred to as Protobuf, is a binary serialization format and
language-agnostic interface description language (IDL) developed by Google. It's designed
to efficiently serialize structured data for communication between systems and for data
storage.

Guess what? When you use SageMaker you still have to respect all the security practices
you follow in the other services.
SageMaker notebooks and training/inference containers have internet access by default;
consider whether you really need it. If not, you can disable it and route outbound
traffic through a NAT instead.

How do you deploy in SageMaker? Easy, it is just one line of code; you just need to know
which EC2 instance type you prefer (a hedged sketch follows below).
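
A minimal sketch of that one line with the SageMaker Python SDK, assuming 'estimator' is an Estimator that has already been trained with estimator.fit(...) and that the instance type shown is only an example:

# 'estimator' is assumed to be an already-trained sagemaker Estimator
predictor = estimator.deploy(
    initial_instance_count=1,      # number of instances behind the endpoint
    instance_type="ml.m5.large",   # the EC2 instance type you prefer
)
# predictor.predict(payload) can then be called against the real-time endpoint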

BlazingText:
BlazingText refers to an algorithm and implementation for training word embeddings
and text classification models efficiently.
Word embeddings are numerical representations of words that capture semantic
relationships between words based on their context in a given corpus of text.
BlazingText provides a highly optimized implementation of the Word2Vec algorithm,
which is commonly used for creating word embeddings. This implementation is
designed to scale well across multiple compute instances, making it suitable for
processing large datasets and training word embeddings quickly.

Good ideas:
Transforming columns using logarithms is a common technique in machine learning
when dealing with data that spans a wide range of values or exhibits exponential
growth. Why? because:
- Normalization: Logarithmic transformations can help normalize data that is highly
skewed or has a large dynamic range. This can make the data more suitable for
algorithms that assume a normal distribution or for models that are sensitive to
the scale of input features.
- Stabilizing Variance: Logarithmic transformations can help stabilize the variance
of a variable. When the variance of a feature grows with its magnitude, this is
problematic for some models.
- Linearization: Sometimes, taking the logarithm of a variable can transform it
into a form that appears more linear.
- Noise Reduction: Logarithmic transformations can help reduce the impact of
outliers or extreme values.
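
A small NumPy sketch showing the effect on a skewed, wide-range feature (log1p is used so that zeros are handled safely):

import numpy as np

# A feature spanning several orders of magnitude (e.g. income, page views)
values = np.array([0, 10, 100, 1_000, 10_000, 100_000], dtype=float)

log_values = np.log1p(values)   # log(1 + x): defined at 0, compresses large values
print(values)
print(np.round(log_values, 2))  # the dynamic range is now much smaller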

Cross-validation vs train-validation split of the dataset


Using either cross-validation or a train-validation split depends on your goals,
the size of your dataset, and the nature of your machine learning project. Both
approaches have their advantages and use cases:
1) Train-Validation Split:
- Advantages: This approach is simpler to set up and manage. It's often used when
the dataset is large enough to allow for a meaningful split into training and
validation sets without sacrificing too much data for training.
- Use Cases: Train-validation splits are commonly used when you have enough data
available, and you want a simple and quick way to estimate how well your model is
likely to perform on unseen data. This approach is suitable for scenarios where
computational resources might be limited.
2) Cross-Validation:
- Advantages: Cross-validation provides a more robust estimate of model
performance by using multiple splits of the data into training and validation sets.
It helps reduce the risk of overfitting and provides a better understanding of how
the model generalizes across different subsets of the data.
- Use Cases: Cross-validation is particularly useful when you have a limited
amount of data and want to make the most of it for both training and validation.
It's also beneficial when you want a more accurate estimate of your model's
performance that accounts for potential variations in the data splitting process.
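
A short scikit-learn sketch contrasting the two approaches on the same model (the dataset and model are only illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 1) Single train-validation split: one quick estimate
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_val, y_val))

# 2) 5-fold cross-validation: a more robust estimate from multiple splits
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("cross-validation accuracy:", scores.mean(), "+/-", scores.std())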

You should aim for the AWS Certified Machine Learning - Specialty certification.

https://www.youtube.com/watch?
v=chHWvBhAysQ&list=PLWsnB2XBNJzJGQ_8ZHMSSXSK7xY9XCDXe&index=14

https://www.youtube.com/watch?v=i_LwzRVP7bg
