ML
Machine learning: the branch of AI based on the idea that machines and systems
can analyze and understand data, learn from it, and make decisions with minimal
to zero human intervention.
Most industries and businesses working with massive amounts of data have recognized
the value of machine learning technology. By culling insights from this data,
businesses are able to work more efficiently and gain an advantage over others.
The point is to learn from data
feature vector: an n-sized vector of numbers that represents an object. For example,
if you are analyzing apples you have color, size, flavour, etc.
samples: the items to process
feature space, aka input space, aka feature set, refers to the set of all possible
features that can be used to represent a data point. It is the collection of features
used to characterize your data: the n dimensions where our variables live.
The feature space is a vector space that represents all the possible combinations
of features that can be used to describe a data point. Each feature corresponds to
a dimension in this space, and the value of each dimension corresponds to the value
of the corresponding feature for a particular data point.
labeled data: data with a known classification result
Steps are:
1) data ingestion
2) cleaning and transformation
3) model training
4) model testing
5) model deployment and integration
k-nearest neighbors:
is a basic example. The k-nearest neighbor (k-NN) algorithm is a non-parametric
machine learning algorithm used for classification and regression. The "k" in k-NN
refers to the number of nearest neighbors to be considered for making a prediction.
The k-NN algorithm works by finding the k nearest data points in the training set
to a new data point and using the majority class of those k neighbors to make a
prediction about the class of the new data point
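A minimal k-NN sketch with scikit-learn; the toy points and k=3 are just assumptions
for illustration:

from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [5, 5], [6, 5]]  # toy feature vectors
y_train = [0, 0, 1, 1]                      # known labels
knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 neighbors
knn.fit(X_train, y_train)
print(knn.predict([[5, 6]]))                # majority class of the 3 nearest points -> [1]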
Supervised learning:
features + labels. You know what each sample is. It contains classification and regression.
Do you want to predict a category? Then it is classification.
Do you want to predict the value of a stock? If it is just UP or DOWN then it is again
classification, else regression.
Do you want to predict a quantity? Then it is regression.
Do you want to find out if there is anything strange, like a credit card fraud? Then
it is anomaly detection.
Unsupervised learning:
features but no labels.
Unsupervised learning enables systems to identify patterns within datasets that are
otherwise unlabeled or unclassified. There are numerous applications of unsupervised
learning, with some common examples including recommendation systems, product
segmentation, dataset labeling, customer segmentation, and similarity detection.
Do you want to find structure in unexplored data? Then it is clustering.
Clustering and association are slightly different:
1) Clustering -> split objects into clusters that are similar within the same cluster
and different across clusters
2) Association -> get the probability of co-occurrence of 2 objects in the same
collection. What is the probability of "if you bought A on Amazon you'll also buy
B"?
Reinforcement learning:
there is the concept of feedback, the machine can learn from errors.
Reinforcement learning is also frequently used in different types of machine
learning applications. Some common examples
include industrial automation, self-driving car technology, applications that use
Natural Language Processing, robotics manipulation, and more. Reinforcement
learning is used in AI in a wide range of industries, including finance,
healthcare, engineering, and gaming.
Classification:
- decision tree -> you know what it is
- naive bayes -> based on conditional probability, good if the complexity of the input
is high
Let's do a spam mail example (a tiny numeric version appears right after this list):
1) we create a histogram of words with how common they are in normal messages.
Imagine that the probability of seeing the word "dear" GIVEN it is a normal message is
0.34
2) we create another histogram with the words of scam messages,
for example the probability of "money" GIVEN it is scam is 0.44, and "dear" is 0.02
3) now we receive a message starting with "dear friend".
We have to decide the probability that it is scam or normal given that we have
the word "dear". And with Bayes' theorem you can invert the relation P(A|B) into
P(B|A).
- random forest
- logistic regression
- KNN -> k-nearest neighbors... you know what it is
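A tiny worked version of the "dear friend" example above; only the conditional
probabilities come from the notes, the 50/50 priors are an assumption:

p_dear_given_normal = 0.34
p_dear_given_scam = 0.02
p_normal, p_scam = 0.5, 0.5                  # hypothetical prior class probabilities
# Bayes' theorem: P(scam | dear) = P(dear | scam) * P(scam) / P(dear)
p_dear = p_dear_given_scam * p_scam + p_dear_given_normal * p_normal
print(p_dear_given_scam * p_scam / p_dear)   # ~0.056, so "dear" points towards normal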
Regression
- linear regression -> to estimate linear relations between variables. Here a
variable is dependent on one or more independent variables. Example in 2D: draw a
line that minimizes the squared distance from the points
Feature selection:
we have a lot of data, maybe too much. Not all of it is really useful.
Feature selection can:
- reduce overfitting (avoid learn noise)
- improve accuracy
- reduce training time
Feature selection
- Supervised
Supervised feature selection is a type of feature selection technique that takes
into account the target variable or output label in the selection process. It
involves selecting a subset of input features that are most relevant to the target
variable or output label, based on their predictive power and ability to improve
the performance of a machine learning model.
- intrinsic -> a mix of wrapper and filter. Examples: lasso regularization and
decision trees
- wrapper method -> we split the data into some overlapping subsets, we train
the model with each subset to decide which features are better. Examples: recursive
feature elimination and genetic algorithms
- filter method -> filter columns based on how much they are CORRELATED to the
output. Examples: Pearson coefficient, chi-squared, ANOVA.
- Unsupervised
Here we don't use the label but we check if and how the variables are correlated
- PCA (information later in this doc)
- others
in general, based on the kind of input and output, you can select your feature
selection method. And guess what? In the end you try multiple ones and see what happens.
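A hedged sketch of a filter method with scikit-learn (ANOVA F-test on the iris
dataset, both of which are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 features most related to the label
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())                      # boolean mask of the selected columns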
Linear regression:
Numericals: are numbers
Categoricals: Color, Type, etc
find the right variables that influence the result
residual error: the distance between the line and the actual point. You are actually
more interested in the squared residual errors summed up
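A minimal linear regression sketch; the toy points are an assumption, the idea is
just to show the fitted line and the summed squared residuals:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([2.1, 3.9, 6.2, 8.1])
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
print(np.sum(residuals ** 2))         # sum of squared residual errors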
Categorical encoding:
Categorical encoding is the process of transforming categorical data, which
represents qualitative or nominal data, into numerical data that can be used for
machine learning models.
Categorical data refers to data that has distinct categories, such as colors,
genders, or cities. In contrast, numerical data represents quantities, such as age
or height, and can be used in mathematical operations.
There are several methods of categorical encoding, including one-hot encoding,
ordinal encoding, and binary encoding. One-hot encoding involves creating a binary
column for each category, where a 1 is placed in the corresponding column and 0s
are placed in the other columns. Ordinal encoding assigns a numerical value to each
category, such as assigning a value of 1 to the category that is considered the
smallest or the most important. Binary encoding involves converting the categorical
variable into binary digits, which can reduce the number of columns required for
the encoding (for example, "red" (1) becomes 001 in binary, "green" (2) becomes 010,
"blue" (3) becomes 011). Binary encoding can be useful
for categorical features with a large number of categories, as it can reduce the
dimensionality of the data and avoid the high cardinality problem associated with
one-hot encoding. However, it may not be appropriate for categorical features with
a small number of categories.
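A small sketch of one-hot vs ordinal encoding with pandas; the tiny DataFrame and
the ordinal mapping are assumptions:

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
one_hot = pd.get_dummies(df["color"], prefix="color")         # one binary column per category
ordinal = df["color"].map({"red": 1, "green": 2, "blue": 3})  # manual ordinal mapping
print(one_hot)
print(ordinal)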
Classification:
multilabel classification: where each sample has a set of labels
Logistic regression:
is similar to linear regression, but the result is not a number, it is only true or
false!
Logistic regression is a statistical technique used to model the relationship
between a binary dependent variable and one or more independent variables. It is a
type of generalized linear model that uses a logistic function to model the
probability of the binary outcome.
In logistic regression, the dependent variable is binary, meaning it can take only
two values, 0 or 1. The independent variables can be either continuous or
categorical. The logistic regression model estimates the probability of the binary
outcome (1 or 0) for a given set of independent variables.
The logistic function used in logistic regression is also known as the sigmoid
function. It takes any input value and returns an output between 0 and 1, which
represents the probability of the binary outcome.
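A quick sketch of the sigmoid (logistic) function: any input gets squashed into a
probability between 0 and 1.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.982, very likely class 1
print(sigmoid(-4))   # ~0.018, very likely class 0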
Confusion matrix:
A confusion matrix is a table that summarizes the performance of a classification
algorithm by comparing the actual and predicted values of the target variable. It
typically contains:
True Positive (TP) False Positive (FP)
False Negative (FN) True Negative (TN)
the confusion matrix is a simple and effective tool for evaluating the performance
of a classification algorithm and can provide valuable insights into the strengths
and weaknesses of the model.
Dummies:
Dummy variables are categorical variables that are converted into numerical
variables in order to be used as inputs in machine learning models. This conversion
is necessary because many machine learning models can only take numerical inputs.
IMPORTANT:
Usually lower precision is more costly for the business than lower recall.
Precision measures the proportion of true positives among all the items flagged as
positive: precision = true positives / (true positives + false positives).
In simple words: how many of the things you flagged are actually positive?
On the other hand, recall measures the proportion of true positives among all the
relevant items in the data set. In other words, recall measures the completeness of
the system in identifying all the relevant items. Mathematically, recall is
calculated as:
recall = true positives / (true positives + false negatives)
In simple words: how many of the real positives did you actually find?
A good system should have a high precision and a high recall. However, there is
often a trade-off between precision and recall, and the optimal balance between the
two will depend on the specific application and the priorities of the users.
F1 score: The F1 score is a harmonic mean of precision and recall, which gives
equal weight to both metrics. It is a useful metric for imbalanced datasets, as it
balances the trade-off between precision and recall. It is calculated as 2 x
(precision x recall) / (precision + recall).
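Precision, recall and F1 from raw counts; the counts are made up purely for
illustration:

tp, fp, fn = 40, 10, 20
precision = tp / (tp + fp)                            # how many flagged positives are real
recall = tp / (tp + fn)                               # how many real positives were found
f1 = 2 * (precision * recall) / (precision + recall)
print(precision, recall, round(f1, 3))                # 0.8 0.666... 0.727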
ML is split in 2:
1) supervised
1.1) classification
1.2) regression
2) unsupervised
2.1) clustering
2.2) association
a perfect confusion matrix has all numbers in the diagonal and 0 everywhere else
2 types of errors:
1) Reducible -> you can work on it
1.1) Bias
1.2) Variance
2) Irreducible -> you cannot change it
Error in machine learning: how accurately you can make predictions, both on the data
you trained on and on new data. It is the difference between predicted and actual
values.
Total error: 1) + 2) together
Bias vs Variance:
the bias-variance tradeoff is an important concept to consider when developing
models. The tradeoff refers to the relationship between two types of errors that
can occur in a model: bias and variance.
Bias:
Bias refers to the error that occurs when a model is not able to capture the true
underlying pattern in the data, leading to a systematic under- or overestimation of
the target variable. High bias models are typically oversimplified and may miss
important patterns or relationships in the data. With high bias, the model will be
bad on both training and testing data.
Variance:
Variance, on the other hand, refers to the error that occurs when a model is too
complex and is overfitting the data, meaning it captures noise and randomness in
the training data as well as the underlying pattern. High variance models tend to
have good performance on the training data but perform poorly on new data. With high
variance, the model will be very good on training data but bad on test data.
The tradeoff between bias and variance arises because reducing one typically
increases the other. For example, a simple linear model with few parameters may
have high bias but low variance, while a complex model with many parameters may
have low bias but high variance. Finding the right balance between bias and
variance is critical to developing models that generalize well to new data.
cross-validation:
Cross-validation is a technique used to evaluate the performance of a machine
learning model by splitting the available data into two or more subsets: one for
training the model and another for testing the model. The goal of cross-validation
is to ensure that the model is able to generalize well to new data that it has not
seen during training.
The simplest form of cross-validation is called "k-fold cross-validation." In this
method, the data is divided into k equally sized folds, and the model is trained on
k-1 of the folds, with the remaining fold used for testing. This process is
repeated k times, with each fold used once as the validation data. The results from
each iteration of cross-validation are then averaged to provide an estimate of the
model's performance.
k-fold is a type of cross-validation
time series cross-validation -> here it does not make sense to randomly select data:
you have to keep the chronology, using a time window
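A 5-fold cross-validation sketch with scikit-learn; the model and dataset are
illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate of the model's performance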
regularization:
Regularization is a technique used to prevent overfitting, which occurs when a
model becomes too complex and fits the training data too closely, leading to poor
generalization to new data
Regularization achieves this by adding a penalty term to the model's objective
function that encourages simpler models. There are two common types of
regularization:
1) L1 regularization (Lasso): This adds a penalty to the absolute value of the
model's coefficients, effectively encouraging some coefficients to be set to zero.
This can be useful for feature selection, as it can help identify which features
are most important for the model's performance.
2) L2 regularization (Ridge): This adds a penalty to the squared value of the
model's coefficients, which has the effect of shrinking the coefficients towards
zero. This can help to prevent overfitting by reducing the influence of less
important features.
L1 vs L2:
it depends on the problem:
L1 is good with high dimensionality or redundancy, L1 is better if you want to
select important features.
L2 is good when you want to limit weights
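A hedged L1 vs L2 sketch with scikit-learn; the synthetic dataset and alpha=1.0 are
arbitrary assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: some coefficients are driven exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: coefficients are shrunk towards zero
print(lasso.coef_)
print(ridge.coef_)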
Entropy:
Intuitively, entropy is high when the data is very mixed and uncertain, and low
when the data is very pure and predictable. For example, a dataset with only one
class label (i.e., all the data points belong to the same class) would have an
entropy of 0, because there is no uncertainty about the class label. On the other
hand, a dataset with an equal number of data points in each class would have a high
entropy, because there is a lot of uncertainty about the class label.
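A minimal sketch of the entropy formula H = -sum(p * log2(p)) over the class
proportions:

import math

def entropy(class_counts):
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([10, 0]))  # 0.0 -> pure, no uncertainty
print(entropy([5, 5]))   # 1.0 -> maximally mixed (two classes)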
Information gain:
In the context of decision trees, information gain is a measure of the reduction in
entropy (or increase in purity) that results from splitting a dataset on a
particular input feature. Information gain is used to determine which input feature
to split on at each node of the tree.
Leaf node:
In the context of decision trees, a leaf node (also called a terminal node) is a
node that does not have any child nodes. In other words, a leaf node is a node that
represents a final classification or decision in the decision tree.
Random forest:
fast to train and low overfitting (because the forest is big)
very accurate for big datasets
good at estimating missing data
basically means "a lot of decision trees"
In the case of classification models, a confusion matrix will show the number of
true positives, true negatives, false positives, and false negatives for each class
in the dataset. However, for other machine learning tasks, the confusion matrix may
have a different interpretation or format.
For example, in regression tasks you would instead look at metrics such as mean
squared error and mean absolute error, which evaluate the model's ability to
predict a continuous output. In clustering tasks, a similar table may show the
number of points that are correctly or incorrectly assigned to each cluster.
Overall, the confusion matrix is a useful tool for evaluating the performance of a
machine learning model, regardless of the task at hand.
If k is too low (e.g., k=1), the model may be too sensitive to noise and outliers
in the data. In this case, the model may overfit the training data and perform
poorly on new, unseen data. Additionally, if the dataset is imbalanced, the model
may favor the majority class and ignore the minority class, leading to poor
performance on the minority class.
If k is too high (e.g., k=n, where n is the size of the training set), the model
may be too biased and not capture the underlying patterns in the data. In this
case, the model may underfit the training data and perform poorly on both the
training and test data. Additionally, if the dataset is noisy, the model may
include irrelevant samples in its predictions, leading to poor performance.
In an N-sized dataset, sqrt(N) is a good starting value for k. Make it odd, so you
cannot have a tie.
Usually bias and underfitting come together
Usually variance and overfitting come together
K-means is good when:
When the number of clusters is known or can be estimated.
When the data is relatively dense and spherical in shape.
When the data is relatively simple and there are no clear outliers.
K-means is bad when:
When the number of clusters is unknown and needs to be estimated.
When the data is not well-separated and has overlapping clusters.
When the data is not spherical in shape.
When the data has clear outliers.
When it comes to bias and variance, K-means is generally a low-bias algorithm since
it assumes nothing about the underlying distribution of the data. However, it can
be prone to high variance since the results can depend on the initial placement of
the centroids, and the algorithm can converge to a suboptimal solution. To mitigate
this, multiple initializations with different starting points can be used to find a
more robust solution.
K-means is also prone to overfitting since it can assign data points to clusters
even if they do not truly belong there. This can lead to clusters that do not have
clear distinctions or are highly variable. On the other hand, K-means can also be
prone to underfitting if the number of clusters is too small, leading to
oversimplified clusters that do not capture the true complexity of the data
Clustering:
1) Hierarchical clustering is a tree structure like mammal -> canid -> dog and so on.
So you don't need to know in advance how many clusters there are
1.1) agglomerative clustering bottom up, you start with small clusters and
you merge together similar ones
1.2) divisive clustering, top down, as you can imagine
2) Partitional clustering
2.1) k-means: each object goes to exactly one cluster
2.2) fuzzy c-means: each object goes in one or more clusters
Distance measure
In the context of clustering, a distance measure is a way to quantify the
similarity or dissimilarity between two data points. The choice of distance measure
can have a significant impact on the clustering results and should be selected
based on the characteristics of the data and the desired clustering outcome.
There are several commonly used distance measures in clustering, including:
1) Euclidean distance: This is the most widely used distance measure and is
defined as the square root of the sum of the squared differences between the
corresponding elements of two data points.
2) Manhattan distance: Also known as city-block distance or L1 distance, this
measure calculates the absolute differences between the corresponding elements of
two data points and sums them up.
3) Cosine distance: This measure calculates the cosine of the angle between two
vectors and is commonly used for text data or high-dimensional data.
4) Hamming distance: This measure is used for binary data and calculates the
number of bits that differ between two binary vectors.
5) Jaccard distance: This measure is used for categorical data and calculates
the dissimilarity between two sets based on the size of their intersection and
union.
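A hedged sketch of these distance measures using scipy; the example vectors are
arbitrary (note that scipy's hamming returns the fraction, not the count, of
differing positions):

from scipy.spatial import distance

a, b = [1, 0, 2], [2, 1, 0]
print(distance.euclidean(a, b))                # square root of the summed squared differences
print(distance.cityblock(a, b))                # Manhattan / L1 distance
print(distance.cosine(a, b))                   # 1 - cosine similarity of the two vectors
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of positions that differ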
How do you find how many clusters for k-means? With the elbow method:
SSD is the sum of squared distances of data points to their nearest centroid. We
try k from 1 to N, so we execute k-means lots of times. Usually SSD shrinks as k
grows. When it starts shrinking less than before, so the graph draws an elbow shape,
that is a good k value
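An elbow-method sketch with scikit-learn: run KMeans for several k and watch the SSD
(inertia_) shrink; the blob dataset with 4 true clusters is an assumption:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ = sum of squared distances to nearest centroid
# the drop flattens out around k=4: that is the elbow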
Dimensionality reduction:
we actually want to reduce the dataset columns removing the useless infos
Dimensionality reduction is the process of reducing the number of features or
variables in a dataset while preserving the important information contained in the
original data. It is a common technique used in machine learning and data analysis
to handle high-dimensional data and to improve the efficiency and effectiveness of
machine learning models. It also improves training cost and time, and reduces
redundancy.
There are two main types of dimensionality reduction techniques: feature selection
and feature extraction.
1) Feature selection: This approach selects a subset of the original features that
are most relevant to the prediction task.
2) Feature extraction: This approach transforms the original features into a lower-
dimensional space that preserves most of the important information. Principal
Component Analysis (PCA) is a popular method for feature extraction that projects
the data onto a new space of orthogonal components that capture the most variation
in the data.
PCA:
we want to reduce the dimensionality of the dataset while keeping almost all the info
It is a mathematical technique that uses linear algebra to transform a set of
correlated variables into a set of uncorrelated variables called principal
components. We love uncorrelated variables
The main idea behind PCA is to identify the directions of maximum variation in the
data and project the data onto these directions. The first principal component is
the direction of maximum variation, followed by the second principal component, and
so on. The principal components are orthogonal to each other, meaning that they are
uncorrelated.
Terminology of PCA:
view -> the perspective through which the points are viewed
dimensions -> how many columns
linear combination -> In mathematics, a linear combination is a combination
of two or more variables or functions that are multiplied by constants and added
together. In other words, it is a sum of scalar multiples of vectors, functions, or
other mathematical objects. For example, suppose we have two vectors u and v in a
vector space. A linear combination of u and v is a vector of the form au + bv,
where a and b are scalars (real numbers or complex numbers). The vector au + bv is
obtained by scaling u by a and scaling v by b, and then adding the resulting
vectors.
principal components (PC) -> new variables constructed as linear combinations
of existing ones
perpendicular distance -> In algebra, the perpendicular distance between a
point and a line is the shortest distance between the point and the line, and it is
measured along a line that is perpendicular (or orthogonal) to the given line.
projections -> the perpendicular distance between principal components and
datapoints
basis -> In linear algebra, a base (or basis) is a set of linearly
independent vectors that can be used to represent any other vector in a given
vector space. Any vector in that space can be written as a linear combination of
the basis vectors
orthogonal -> In mathematics, orthogonal refers to the concept of
perpendicularity, or the relationship between two objects that are at a right angle
to each other. More generally, two objects are said to be orthogonal if they are
independent or unrelated to each other. For example, in geometry, two vectors are
orthogonal if their dot product is zero, which means that they are perpendicular to
each other. In linear algebra, a set of vectors is orthogonal if each pair of
vectors in the set is orthogonal. A set of orthogonal vectors is often used as a
basis for a vector space because it is easy to calculate projections and other
operations using orthogonal vectors.
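A PCA sketch with scikit-learn: project the 4-dimensional iris data onto 2 principal
components (the dataset and n_components=2 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # each row is now a 2-dimensional point
print(pca.explained_variance_ratio_)   # share of total variation kept by each component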
For a NN a good optimizer to reduce the loss function could be gradient descent
Gradient descent is a popular optimization algorithm used in machine learning and
other fields to minimize a cost function. It is a first-order iterative
optimization algorithm that works by iteratively adjusting the parameters of a
model in the direction of steepest descent of the cost function.
The basic idea of gradient descent is to compute the gradient of the cost function
with respect to the parameters of the model, and then adjust the parameters in the
direction of the negative gradient. This process is repeated until a minimum of the
cost function is reached or a stopping criterion is met.
There are several variations of gradient descent, including batch gradient descent,
stochastic gradient descent, and mini-batch gradient descent. Batch gradient
descent computes the gradient using the entire training dataset, while stochastic
gradient descent computes the gradient using one training example at a time. Mini-
batch gradient descent is a compromise between the two
Gradient descent is a widely used optimization algorithm due to its simplicity and
effectiveness. However, it can be slow to converge in certain cases, and can get
stuck in local minima of the cost function
alpha is the learning rate of gradient descent. You can take big steps and miss lower
places, or take small steps and need more time to train
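A minimal gradient-descent sketch on the toy cost function f(w) = (w - 3)^2; the
starting point and learning rate are arbitrary:

alpha = 0.1
w = 0.0                        # arbitrary starting point
for step in range(50):
    gradient = 2 * (w - 3)     # derivative of the cost function at w
    w = w - alpha * gradient   # move against the gradient
print(w)                       # converges towards the minimum at w = 3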
Optimizer:
In the context of machine learning, an optimizer is an algorithm or method used to
adjust the parameters (weights and biases) of a model during the training process
in order to minimize the error or loss function
Possible optimizer:
- Gradient Descent: This is one of the most fundamental optimization algorithms. It
involves computing the gradient of the loss function with respect to the model
parameters and moving in the direction that reduces the loss. There are variations
of gradient descent, including batch gradient descent (updating using the entire
dataset), stochastic gradient descent (updating using one data point at a time),
and mini-batch gradient descent (updating using a small subset of the dataset).
- Adam (Adaptive Moment Estimation): Adam is an adaptive optimization algorithm
that adjusts the learning rate for each parameter based on the past gradients
- RMSProp (Root Mean Square Propagation): RMSProp adjusts the learning rate for
each parameter based on the magnitude of recent gradients. It helps prevent
vanishing or exploding gradients by normalizing the update step.
- Momentum: Momentum adds a fraction of the previous update to the current update,
which helps the optimization process move more smoothly
- Adagrad (Adaptive Gradient Algorithm): Adagrad adapts the learning rate for each
parameter based on the historical gradients.
Forward and backward propagation, more practically, in the case of a neural network:
Let's consider a simple neural network with one hidden layer, consisting of
an input layer, a hidden layer with n neurons, and an output layer with one neuron.
The network takes as input a vector x of m features and predicts a scalar value y.
Here's how the forward propagation and backward propagation steps work in this
case:
2) Forward Propagation:
The input layer receives the input vector x.
The hidden layer calculates the weighted sum of the inputs for each neuron,
applies a non-linear activation function to the result, and outputs a vector h of
size n.
The output layer calculates the weighted sum of the hidden layer activations,
applies a final activation function (such as a sigmoid or a softmax), and outputs a
scalar value y_hat, which is the predicted value.
3) Backward Propagation:
The output layer calculates the derivative of the cost function with respect to
y_hat, which is typically based on the difference between the predicted value and
the actual value of y.
The hidden layer calculates the derivative of the cost function with respect to
the activations of the hidden layer neurons, which is obtained by multiplying the
derivative of the output layer with respect to the hidden layer activations by the
weights connecting the output layer to the hidden layer.
The input layer calculates the derivative of the cost function with respect to
the inputs, which is obtained by multiplying the derivative of the hidden layer
activations with respect to the inputs by the weights connecting the hidden layer
to the input layer.
This process is repeated for multiple epochs (passes over the entire dataset) until
the cost function converges or a maximum number of epochs is reached.
In a NN the loss surface is non-convex with many local minima, so gradient descent
can more easily get stuck in a local minimum.
The gradients of the loss are computed one layer at a time, starting from the last
one and moving backwards: that's why it's called backpropagation
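A tiny numpy sketch of one forward and one backward pass for a 1-hidden-layer
network; the shapes, the sigmoid activation and the squared-error loss are
assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                          # input vector (m = 3 features)
y = np.array([[1.0]])                                # target value
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # hidden layer (n = 4 neurons)
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # output layer
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# forward propagation
h = sigmoid(W1 @ x + b1)                   # hidden activations
y_hat = sigmoid(W2 @ h + b2)               # predicted value
loss = 0.5 * (y_hat - y) ** 2

# backward propagation (chain rule, starting from the output layer)
d_out = (y_hat - y) * y_hat * (1 - y_hat)  # gradient at the output pre-activation
dW2 = d_out @ h.T
d_hidden = (W2.T @ d_out) * h * (1 - h)    # gradient pushed back into the hidden layer
dW1 = d_hidden @ x.T

lr = 0.1                                   # one gradient-descent step on each layer
W2 -= lr * dW2; b2 -= lr * d_out
W1 -= lr * dW1; b1 -= lr * d_hidden
print(loss.item())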
Reinforcement learning:
Reinforcement learning is a subfield of machine learning that is concerned with how
agents can learn to make decisions and take actions in an environment in order to
maximize a reward signal. It is a type of learning that is inspired by how humans
and animals learn from the consequences of their actions.
In reinforcement learning, an agent interacts with an environment by taking actions
and receiving feedback in the form of rewards or penalties. The goal of the agent
is to learn a policy, which is a mapping from states to actions, that maximizes the
expected cumulative reward over time.
Markov Decision Process (MDP) -> A Markov Decision Process (MDP) is a mathematical
framework used to model decision-making problems in situations where the outcome
depends on both random events and the actions taken by a decision maker or agent.
An MDP is defined by a set of states, actions, rewards, and a transition function.
The states represent the different possible states of the system, the actions
represent the available actions that the decision maker can take, and the rewards
represent the outcome of taking an action in a particular state. The transition
function defines the probability of transitioning from one state to another state,
given a particular action.
The key assumption of an MDP is that the future state of the system depends only on
the current state and the current action taken by the decision maker, and is
independent of the past history of the system. This is known as the Markov
property.
GPT-3 was not trained using reinforcement learning directly. Instead, it was pre-
trained using an unsupervised learning approach called language modeling, which
involves predicting the next word in a sequence of text given the preceding words.
However, reinforcement learning has been used in conjunction with GPT-3 to fine-
tune the model for specific tasks. This involves training the model on a specific
task using a reward signal that is provided by the task environment, such as a
score or accuracy metric.
Overall, while GPT-3 was not directly trained using reinforcement learning, it has
been used in conjunction with reinforcement learning to fine-tune the model for
specific tasks and improve its performance.
Q-learning
Q-learning is a popular algorithm in reinforcement learning that enables an agent
to learn to make decisions based on rewards it receives from its environment. It is
a model-free algorithm that uses a value-based approach to estimate the optimal
action-value function (Q-function) of the environment.
In Q-learning, the agent learns to estimate the expected reward of taking a
particular action in a particular state. The Q-value of a state-action pair is
defined as the expected sum of discounted future rewards that the agent will
receive by following a specific policy from that state. The optimal policy can then
be found by selecting the action with the highest Q-value in each state. q-value is
"how good is this action in this state?"
The Q-learning algorithm involves iteratively updating the Q-values based on the
agent's experiences in the environment. At each time step, the agent takes an
action in the current state and observes the resulting reward and next state. The
Q-value for the current state-action pair is then updated using the Bellman
equation:
The Bellman equation in the context of Q-learning is a fundamental equation that
describes how the Q-values for a given state and action can be updated based on the
observed reward and the maximum Q-value for the next state.
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max(Q(s', a')) - Q(s, a) ]
where:
Q(s, a) is the Q-value of taking action a in state s
alpha is the learning rate, which determines the extent to which newly acquired
information overrides old information
r is the reward obtained by taking action a in state s
gamma is the discount factor, which determines the relative importance of
immediate and future rewards
max(Q(s', a')) is the maximum Q-value for any action a' in the next state s'
s' is the next state reached after taking action a in state s.
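A single Q-learning update step implementing the equation above; the table size,
the transition and the hyperparameters are made-up illustrations:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9         # learning rate and discount factor
s, a, r, s_next = 0, 1, 1.0, 2  # observed transition: action a in state s gave reward r, landed in s_next
Q[s, a] = Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
print(Q[s, a])                  # 0.1 after this first update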
Overfitting -> you are learning noise, you are bad at predicting new things.
Related to variance
Underfitting -> you are not learning the pattern. Related to Bias
For a vector v with components (v₁, v₂, ..., vn), the magnitude is calculated as:
|v| = √(v₁² + v₂² + ... + vn²)
To get the direction of a vector, you can use the concept of unit vectors. A unit
vector is a vector with a magnitude of 1. You can obtain the unit vector in the
direction of a given vector v by dividing the vector by its magnitude:
u = v/|v|
The resulting unit vector u will have the same direction as the original vector v,
but with a magnitude of 1.
Stratified and cluster sampling may look similar, but bear in mind that groups
created in cluster sampling are heterogeneous, so the individual characteristics in
the cluster vary. In contrast, groups created in stratified sampling are
homogeneous, as units share characteristics.
Relatedly, in cluster sampling you randomly select entire groups and include all
units of each group in your sample. However, in stratified sampling, you select
some units of all groups and include them in your sample. In this way, both methods
can ensure that your sample is representative of the target population.
Cluster sampling is a type of probabilistic sampling method that is used when the
population is large and geographically dispersed. In this method, the population is
first divided into clusters, which are groups of individuals who share some common
characteristic or are located in a particular geographic area. The clusters can be
based on various criteria, such as geography, age, income, or occupation.
Once the clusters have been identified, a sample of clusters is selected at random,
using a simple random sampling or other probabilistic method. Then, all individuals
within the selected clusters are included in the sample. This differs from
stratified sampling, where only a subset of individuals within each stratum are
included in the sample.
2 types of statistic:
1) descriptive -> just give some info like mean, mode, median...
1.1) measure of central tendences -> mean, mode, median
1.2) measure of spread -> variance, range, standard deviation etc
2) inferential -> Inferential statistics is a subfield of statistics that involves
making inferences about a population based on a sample of data.
Hypothesis testing: calculate how probable a hypothesis is given the samples you
have
The central limit theorem is a fundamental concept in statistics that describes the
behavior of the sample means of a large number of independent and identically
distributed random variables. It states that regardless of the underlying
distribution of the individual random variables, as the sample size increases, the
distribution of the sample means approaches a normal distribution, with a mean
equal to the population mean and a standard deviation equal to the population
standard deviation divided by the square root of the sample size.
If the p-value is very small (e.g., less than 0.05), we reject the null hypothesis
and conclude that there is sufficient evidence to support the alternative
hypothesis. If the p-value is not small, we fail to reject the null hypothesis and
conclude that the observed effect or relationship is not statistically significant.
The t-value (also known as the t-statistic) is a statistical test that measures the
difference between the mean of a sample and the mean of the population, relative to
the variability in the sample. It is used in hypothesis testing to determine
whether the mean of a sample is statistically different from the population mean.
The t-value is calculated by dividing the difference between the sample mean and
the population mean by the standard error of the mean. The formula for calculating
the t-value is:
t = (x̄ - μ) / (s / √n)
Permutations refer to the number of ways that a set of objects can be arranged in a
specific order, and can be calculated with or without repetition. In permutations
without repetition, each object can only be used once, and the order in which the
objects are arranged matters. For example, if we have three different letters, A,
B, and C, there are 6 possible permutations without repetition: ABC, ACB, BAC, BCA,
CAB, CBA.
On the other hand, in permutations with repetition, the same object can be used
more than once, and the order in which the objects are arranged matters. For
example, if we have three letters, A, B, and C, and we allow repetition, there are
27 possible permutations: AAA, AAB, AAC, ABA, ABB, ABC, ACA, ACB, ACC, BAA, BAB,
BAC, BBA, BBB, BBC, BCA, BCB, BCC, CAA, CAB, CAC, CBA, CBB, CBC, CCA, CCB, CCC.
Combinations, on the other hand, refer to the number of ways that a set of objects
can be selected from a larger set, and the order in which they are selected does
not matter. Like permutations, combinations can also be calculated with or without
repetition.
To calculate the z-score for a data point, you subtract the mean of the
distribution from the data point and then divide the result by the standard
deviation of the distribution. The formula for calculating the z-score is:
z = (x - μ) / σ
Combinations:
Without repetition: n! / (r! * (n-r)!)
With repetition: (n + r - 1)! / (r! * (n-1)!)
Permutations:
Without repetition: n! / (n-r)!
With repetition: n^r
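A counting sketch matching the formulas above (math.comb and math.perm need Python
3.8+); the numbers are just examples:

import math

n, r = 3, 3
print(math.perm(n, r))          # permutations without repetition: 3! = 6 (ABC, ACB, ...)
print(n ** r)                   # permutations with repetition: 3^3 = 27
print(math.comb(5, 2))          # combinations without repetition: 5! / (2! * 3!) = 10
print(math.comb(5 + 2 - 1, 2))  # combinations with repetition: (n + r - 1)! / (r! * (n - 1)!) = 15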
To start:
https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-interview-questions
An optimization problem is a type of mathematical problem where the goal is to find
the best possible solution among a set of possible solutions, often subject to
constraints. The objective is to maximize or minimize an objective function that
measures the quality of the solution.
Bayesian optimizer
A Bayesian optimizer is a type of optimization algorithm that uses Bayesian
inference to efficiently search for the optimum of an objective function. It is
often used in machine learning, hyperparameter tuning, and other optimization
problems.
The algorithm maintains a probabilistic model of the objective function and uses it
to choose the next set of parameters to evaluate. The model is updated as new
evaluations are made, allowing the algorithm to refine its search over time.
Let's say we have a dataset of images and we want to train a convolutional neural
network (CNN) to classify them. However, we need to find the optimal
hyperparameters for the CNN, including the number of convolutional layers, the
number of filters in each layer, the learning rate, and the batch size.
We can use a Bayesian optimizer to efficiently search for the optimal set of
hyperparameters by iteratively evaluating the model with different parameter
combinations. The optimizer will use the results of previous evaluations to update
its probabilistic model and guide the search towards promising regions of the
parameter space.
STEP 4 IN DETAILS:
In step 4, we use the performance result obtained in step 3 to update the
probabilistic model of the objective function. Specifically, we update the prior
distribution of the hyperparameters using Bayes' theorem, taking into account the
likelihood of the data given the hyperparameters.
Bayes' theorem states that the posterior distribution of a parameter given data is
proportional to the product of the likelihood of the data given the parameter and
the prior distribution of the parameter.
Actually it states P(A | B) = P(B | A) * P(A) / P(B), but here we are adapting it
to our situation
Bayes' theorem can be used to update our beliefs about the probability of an event
based on new evidence. We can use the theorem to calculate the posterior
probability of the event, which is the probability of the event occurring given the
new evidence.
Once we have the prior distribution, we can compute the likelihood of the
performance result given the hyperparameters. This can be done using a surrogate
model or a proxy function that approximates the true objective function. The
surrogate model can be a Gaussian process, a neural network, or any other model
that is computationally efficient and can approximate the objective function
accurately.
So
from P(A | B) = P(B | A) * P(A) / P(B)
to P(A | B) ∝ P(B | A) * P(A)
because the ∝ symbol means "proportional to", which means that we are neglecting
the normalization constant P(B) for simplicity.
Boosting, bagging, and stacking are popular ensemble learning techniques in the
context of machine learning.
- Boosting:
Boosting is an ensemble learning technique that aims to improve the performance of
weak learners (often simple models) by combining them into a strong learner. The
process involves sequentially training a series of weak models and giving more
weight to the misclassified instances in each iteration. The final prediction is
made by aggregating the predictions of all weak models. AdaBoost (Adaptive
Boosting) is one of the most well-known algorithms for boosting.
- Bagging:
Bagging stands for Bootstrap Aggregating. It is an ensemble learning technique that
involves training multiple independent models on different subsets of the training
data, created by bootstrapping (sampling with replacement). Each model in the
ensemble makes its prediction, and the final prediction is determined by averaging
(in the case of regression) or majority voting (in the case of classification) over
the predictions of all models. Random Forest is a famous algorithm that uses bagging.
- Stacking:
Stacking, also known as stacked generalization, is an ensemble learning technique
that combines multiple models by using another model (meta-model) to learn how to
best combine the predictions of the base models. The process involves training
multiple base models on the same dataset, using their predictions as input features
for the meta-model. The meta-model then learns how to weigh and combine the base
model predictions to produce the final ensemble prediction. Stacking can be a
powerful technique when used appropriately, but it requires more computational
resources and careful tuning compared to boosting and bagging.
Neural networks:
A Neural Network (NN) is a specific architecture inspired by the human brain's
neural structure. It is a complex network of interconnected artificial neurons that
process and transmit information. It excels at recognizing patterns and making
predictions based on learned representations from the data.
Types:
- Feedforward Neural Networks: They are the simplest type of neural networks, where
data flows in a unidirectional manner, moving from the input layer to the output
layer without any feedback loops.
- Recurrent Neural Networks (RNNs): These networks have connections that allow
feedback loops, making them suitable for sequential data and tasks that require
memory.
- Convolutional Neural Networks (CNNs): CNNs are particularly well-suited for image
and video analysis, using convolutional layers to automatically detect patterns.
- Generative Adversarial Networks (GANs): GANs are composed of two neural networks,
namely a generator and a discriminator, collaborating to produce lifelike data,
commonly employed for generating images and videos.
- Long Short-Term Memory Networks (LSTMs): A type of RNN that can retain
information for an extended period, making them effective in tasks involving
sequential data.
- Data Warehouse:
A data warehouse is a specific type of data repository that is designed to support
business intelligence (BI) and analytical processing. It is a centralized, large-
scale data storage system that consolidates data from various sources across an
organization. Designed to support analytical processing, reporting, and decision-
making by providing a structured and optimized environment for querying and
analyzing data. Typically stores structured data. Focuses on ensuring data quality,
consistency, and accuracy for reliable analysis. Used for business intelligence,
reporting, and advanced analytics
Data pipelines:
Data pipelines refer to a series of processes that involve the collection,
transformation, and movement of data from various sources to a destination, often
with the goal of making the data ready for analysis, reporting, or other purposes
A data pipeline is not limited to moving data solely from a data repository to a
data warehouse, although that can be a common use case. A data pipeline can involve
various stages and components, including data extraction, transformation,
processing, and loading, as well as movement between different storage systems
ETL:
ETL stands for "Extract, Transform, Load," and it refers to a process commonly used
in data engineering to move data from source systems to a destination system, often
a data warehouse, while applying transformations along the way.
- Extract:
In the extraction phase, data is collected from various sources, which can include
databases, APIs, flat files, logs, and more
- Transform:
The transformation phase involves cleaning, enriching, and structuring the
extracted data so that it aligns with the requirements of the destination system.
Data may need to be cleaned of inconsistencies, duplicated entries, and errors.
Additionally, data transformations might involve aggregating, joining, pivoting,
and applying calculations to prepare the data for analysis
- Load:
In the loading phase, the transformed data is loaded into the target destination.
Loading data often includes considerations about how frequently new data needs to
be added (batch or real-time), how to handle updates and inserts, and how to
maintain data integrity.
A data lake is a centralized repository that allows organizations to store and
manage large volumes of structured, semi-structured, and unstructured data at any
scale. Unlike traditional databases or data warehouses, which often require data to
be structured before storing, data lakes accommodate raw data in its original
format. This flexibility makes data lakes particularly well-suited for storing
diverse and heterogeneous data sources, such as logs, sensor data, social media
posts, images, videos, and more. This is even less structured than a data
repository, and also more general purpose.
AWS S3 is a good place to put a data lake and also a data repository
AWS Glue:
AWS Glue is a managed extract, transform, and load (ETL) service provided by Amazon
Web Services (AWS). It's designed to help users easily prepare and transform data
for analytics, reporting, and other purposes, without requiring extensive manual
coding or infrastructure management. AWS Glue simplifies the process of building
and managing ETL pipelines by providing tools for data cataloging, data
transformation, and data movement.
- Glue can read from S3
- Glue has crawlers: Crawlers discover and catalog metadata from various (usually
more than one) data sources such as Amazon S3, databases, and data warehouses.
Crawlers analyze the structure and schema of data. They find out, for example from a
CSV, what the schema is. The wizard in the console is magic and allows you to set up
roles and other things too. You can decide when to run the crawler (it can be on
demand, every day, etc.). You also have to set where to put the crawled data, and
it will create the corresponding table in the data catalog.
- glue has data catalog: a centralized metadata repository known as the AWS Glue
Data Catalog. This catalog stores metadata information about various data sources,
databases, tables, and transformations. INTERESTING: This catalog can also be
integrated with Amazon Athena. By cataloging your data in Glue, you can make it
easier for Athena to discover and query the data using SQL.
- Glue has ETL jobs:
AWS Glue allows users to create ETL jobs using a visual interface. These jobs
define the data transformations needed to convert raw data into a suitable format
for analysis or reporting. AWS Glue generates code (Python or Scala) based on the
visual ETL job configuration. Typical example of a job: Spark, but it is a Spark with
basically zero code, you do everything with the interface. ETL capabilities can be
used to clean, transform, and prepare your data before querying it with Athena
Batch Processing:
Batch processing involves collecting and processing a large amount of data in
predefined batches or chunks. Data is accumulated over a period of time, and then
processing is triggered on the collected batch.
Stream Processing:
Stream processing involves continuously processing and analyzing data as it arrives
in real-time. Data is processed piece by piece as it flows, allowing for quick
insights and immediate actions
AWS Batch:
AWS Batch is a managed service provided by Amazon Web Services (AWS) that enables
you to easily run batch computing workloads on the AWS Cloud. It simplifies the
process of provisioning, managing, and scaling compute resources for batch
processing tasks
Key features and components of AWS Batch include:
- Job Unit:
a task you have to do
- Job Definitions:
AWS Batch allows you to define job definitions that encapsulate the parameters and
configuration required for a specific batch processing task. Job definitions
include information about the Docker image to be used, resource requirements, and
other details.
- Job Queues:
Job queues are used to organize and manage the execution of batch jobs. You can
create multiple job queues with different compute environments and priority levels
- Compute Environments:
A compute environment is a set of compute resources that AWS Batch uses to execute
your batch jobs. These resources can be either EC2 instances or Fargate containers
- Security and Access Control:
AWS Batch supports AWS Identity and Access Management (IAM) for role-based access
control and security.
- Monitoring and Logging:
AWS Batch provides monitoring and logging capabilities, including integration with
Amazon CloudWatch for tracking resource utilization, job status, and other metrics.
Binomial distribution
The binomial distribution is a discrete probability distribution that describes the
number of successes in a fixed number of independent Bernoulli trials (experiments
with two possible outcomes: success and failure)
The Bernoulli distribution and the binomial distribution are closely related but
different:
The Bernoulli distribution describes the outcome of a single binary experiment or
trial, where there are only two possible outcomes: success (1) or failure (0). It
models a single event with a fixed probability of success. The binomial
distribution describes the number of successes in a fixed number of independent
Bernoulli trials.
Uniform distribution
The uniform distribution, often referred to as the continuous uniform distribution,
is a probability distribution that describes a continuous random variable with
outcomes that are equally likely within a specified interval. In simple terms, it
represents a situation where all values within a given range have the same
probability of occurring.
The normal distribution, also known as the Gaussian distribution, is one of the
most important and widely used probability distributions in statistics. It
describes the distribution of a continuous random variable that is symmetrically
distributed around its mean, creating the characteristic bell-shaped curve.
Amazon QuickSight:
Amazon QuickSight is a cloud-based business intelligence (BI) and data
visualization service provided by Amazon Web Services (AWS). It enables users to
create interactive and insightful visualizations, dashboards, and reports from
various data sources.
QuickSight supports a wide range of data sources, including AWS services like
Amazon Redshift, Amazon RDS, Amazon S3, as well as on-premises databases,
spreadsheets, and third-party services
QuickSight includes built-in data preparation capabilities that allow you to clean,
transform, and shape your data
INTERESTING: you can ask questions in plain English to QuickSight: you can use
Natural Language Query (NLQ) in Amazon QuickSight to ask questions about your data
using plain English. Natural Language Query is a feature that allows you to
interact with your data and create visualizations by typing questions in a
conversational manner, without needing to write SQL queries or use complex tools.
For example, you can ask questions like:
"Show me total sales by product category."
"What are the sales trends over the last year?"
"How many orders were placed by each customer?"
"What is the average rating by product?"
AWS EMR
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform provided by
Amazon Web Services (AWS). It's designed to simplify and accelerate the processing
of large amounts of data using popular open-source frameworks such as Apache
Hadoop, Apache Spark, Apache Hive, Apache HBase, and more
- EMR provides a managed cluster environment where you can easily provision,
configure, and scale clusters to process large datasets. You can choose from a
variety of cluster configurations based on your needs.
- EMR allows you to run distributed data processing frameworks like Hadoop and
Spark, enabling you to process data in parallel across a cluster of virtual
machines (nodes).
- EMR can seamlessly integrate with various data storage services on AWS, including
Amazon S3, Amazon DynamoDB, Amazon RDS, and more
- EMR can integrate with other AWS services like AWS Glue, Amazon Kinesis, Amazon
Redshift
Dimensionality Dilemma:
In high-dimensional spaces, several issues can occur that complicate the analysis
and modeling of the data. Here are some key aspects of the Dimensionality Dilemma:
- if the number of dimensions increases, the computational resources and time
required for processing and analyzing the data also increase significantly
- In high-dimensional spaces, data points tend to be sparsely distributed
- With a large number of dimensions, there is a risk of overfitting.
- As the number of dimensions increases, the amount of data required to accurately
represent the underlying distribution or relationships becomes larger. This can
make it challenging to gather sufficient data to build accurate models.
- Visualizing data beyond three dimensions becomes challenging for humans
SageMaker:
Amazon SageMaker is a fully managed machine learning service provided by Amazon Web
Services (AWS). It aims to simplify the process of building, training, and
deploying machine learning models at scale. SageMaker provides a comprehensive set
of tools and capabilities that cater to various stages of the machine learning
lifecycle, making it easier for data scientists and developers to develop and
deploy machine learning models.
You have:
- End-to-End Workflow: SageMaker offers a complete workflow for machine learning,
including data preprocessing, model training, hyperparameter tuning, model
evaluation, and deployment.
- Managed Jupyter Notebooks: SageMaker provides Jupyter notebooks that are pre-
configured with the necessary libraries
- Built-In Algorithms: SageMaker includes a collection of built-in algorithms for
common machine learning tasks like classification, regression, clustering, etc
- Custom Algorithms: You can also bring your own custom algorithms and frameworks,
allowing you to develop models using popular libraries like TensorFlow, PyTorch etc
- AutoML: SageMaker provides AutoML capabilities, such as SageMaker Autopilot, which
automates the process of selecting features, algorithms, and hyperparameters to
build optimal models
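As a rough sketch of what this workflow looks like with the SageMaker Python SDK (the image URI, IAM role, and S3 paths below are placeholders, and the hyperparameter is just an example):

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",       # placeholder: built-in or custom algorithm image
    role="<execution-role-arn>",            # placeholder: IAM role SageMaker assumes
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/output/",    # placeholder bucket for model artifacts
    sagemaker_session=sagemaker.Session(),
    hyperparameters={"max_depth": "5"},     # example algorithm hyperparameter
)
estimator.fit({"train": "s3://<bucket>/train/"})   # launches a managed training job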
Data Transformation:
Logarithmic Data Transformation:
In the context of data transformation, a logarithmic transformation involves
applying the logarithm function to the values of a dataset. The logarithm function
helps to compress or expand the scale of data, which can be particularly useful
when dealing with data that spans a wide range of values or when the data
distribution is skewed.
Data Scaling:
Data scaling, also known as feature scaling or normalization, is a preprocessing
step in data analysis and machine learning that involves transforming the numerical
features of a dataset to a common scale. Scaling is essential when features have
different ranges, units, or magnitudes, as many machine learning algorithms perform
better or converge faster when the data is on a similar scale.
Types:
- Normalization (min-max scaling) -> rescale each feature into the range 0 to 1
- Standard scaling -> also called z-score normalization; a data scaling technique used in
data preprocessing to transform numerical features into a standardized distribution
with a mean of 0 and a standard deviation of 1: X_standardized = (X - X_mean) /
X_standard_deviation
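A minimal sketch of both scalings with scikit-learn (the toy column is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])    # one toy feature with a wide range

print(MinMaxScaler().fit_transform(X))           # normalization: values mapped into [0, 1]
print(StandardScaler().fit_transform(X))         # z-score: mean 0, standard deviation 1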
Text feature engineering involves transforming raw text data into numerical
features that can be used for machine learning and natural language processing
(NLP) tasks. Commonly, the first steps in text feature engineering include:
- Tokenization: Tokenization is the process of splitting a text document into
individual words or tokens. It's a crucial step as it breaks down the text into
manageable units that can be further processed. Tokenization can also include
handling punctuation and special characters.
- Lowercasing: Converting all text to lowercase helps ensure that words with
different capitalizations are treated as the same word.
- Stopword Removal: Stopwords are common words like "and," "the," "is," and "in"
that might not contribute much to the meaning of the text.
- Stemming and Lemmatization: Stemming and lemmatization are techniques for
reducing words to their base or root form. This helps to group related words
together and reduce the vocabulary size. Stemming involves removing suffixes or
prefixes, while lemmatization involves converting words to their dictionary form.
- Handling Special Characters and Numbers: Depending on the task, you might need to
decide how to handle special characters, numbers, URLs, and other non-textual
elements in the text data.
- Removing HTML Tags and Formatting: If working with web data, you might need to
remove HTML tags, scripts, and other formatting elements from the text.
- Dimensionality Reduction: Text data often results in high-dimensional feature
spaces. Dimensionality reduction techniques like Principal Component Analysis (PCA)
or t-SNE can help reduce the dimensionality while preserving important information
- Encoding Categorical Features: If your text data includes categorical features
(e.g., author names, categories), you'll need to encode them into numerical
representations using techniques like one-hot encoding or label encoding.
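A minimal sketch of the first steps (tokenization, lowercasing, stopword removal) in plain Python; the tiny stopword list is only illustrative:

import re

STOPWORDS = {"and", "the", "is", "in", "a", "of"}    # illustrative list, not exhaustive

def preprocess(text):
    text = text.lower()                               # lowercasing
    tokens = re.findall(r"[a-z0-9]+", text)           # crude tokenization, drops punctuation
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The quick brown fox is in the garden."))
# -> ['quick', 'brown', 'fox', 'garden']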
Term Frequency-Inverse Document Frequency
In the context of text feature engineering, TF-IDF stands for Term Frequency-
Inverse Document Frequency. It is a widely used technique for converting text data
into numerical features that can be used in various natural language processing
(NLP) tasks, such as text classification, clustering, and information retrieval.
The TF-IDF formula consists of two components:
- Term Frequency (TF): This component measures how often a term appears in a
document. It's calculated as the ratio of the number of times the term appears in
the document to the total number of terms in the document
- Inverse Document Frequency (IDF): This component measures the rarity of a term
across the entire corpus. It's calculated as the logarithm of the ratio of the
total number of documents in the corpus to the number of documents containing the
term.
The final TF-IDF score for a term in a document is obtained by multiplying the TF
and IDF components:
TF-IDF = TF * IDF
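A quick sketch with scikit-learn's TfidfVectorizer (the three toy documents are made up; note that scikit-learn uses a smoothed variant of the IDF formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))                # TF-IDF weight of each term in each document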
N-grams:
An extension of the Bag-of-Words (BoW) model that works with more than one term.
N-grams are contiguous sequences of "n" items (usually words, characters, or
tokens) extracted from a text document. N-grams are used in natural language
processing (NLP) and text analysis to capture local patterns, relationships between
adjacent words, and context within a sequence of words.
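For example, scikit-learn's CountVectorizer can extract unigrams and bigrams together (the sentence is arbitrary):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))   # keep both 1-grams and 2-grams
vec.fit(["machine learning is fun"])
print(vec.get_feature_names_out())
# includes unigrams like 'machine' and bigrams like 'machine learning'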
Precision:
Precision = true positives / (true positives + false positives). AKA: of the things I
flagged as positive, how many are actually positive?
Recall:
Recall = true positives / (true positives + false negatives). AKA: of the real
positives, how many did I actually find as positives?
F1 Score:
The F1 score is a commonly used metric in machine learning for evaluating the
performance of classification models, especially when dealing with imbalanced
datasets.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
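A tiny sketch with scikit-learn's metrics (the label arrays are made up):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # actual labels
y_pred = [1, 0, 1, 0, 0, 0]   # model predictions

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 2 / (2 + 0) = 1.0
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 2 / (2 + 2) = 0.5
print(f1_score(y_true, y_pred))          # 2 * (1.0 * 0.5) / (1.0 + 0.5) ≈ 0.67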
Gradient Boosting:
XGBoost:
XGBoost, which stands for "Extreme Gradient Boosting," is a powerful and widely
used machine learning algorithm for supervised learning tasks, particularly in the
domain of structured data and tabular datasets. XGBoost is an ensemble learning
method that combines the strengths of decision trees and gradient boosting
techniques to achieve high predictive accuracy and generalization performance.
Points:
- Gradient Boosting Framework: XGBoost is based on the gradient boosting framework,
which is a technique for building an ensemble of weak learners (usually decision
trees) to create a strong predictive model. It trains successive trees to correct
the errors of the previous ones.
- Decision Trees: XGBoost uses decision trees as base learners. These trees are
shallow, which helps reduce overfitting and makes them computationally efficient.
- Regularization: XGBoost introduces regularization techniques to prevent
overfitting and improve model generalization. L1 and L2
- Custom Loss Functions: XGBoost allows users to define custom loss functions for
specific problems, providing flexibility for tackling a wide range of tasks.
- Handling Missing Data: XGBoost can handle missing data directly during training,
reducing the need for explicit imputation.
- Parallel and Distributed Computing: XGBoost is designed for efficiency and can
make use of parallel and distributed computing to speed up training and prediction
processes.
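A minimal sketch with the xgboost Python package (the dataset is synthetic and the hyperparameter values are arbitrary):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,     # number of boosted trees
    max_depth=4,          # keep the trees shallow, as described above
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    reg_lambda=1.0,       # L2 regularization
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))    # accuracy on held-out data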
RecordIO format:
RecordIO is the name for a set of binary data exchange formats. The basic idea is
to divide the data into individual chunks, called ‘records’, and then to prepend to
every record its length in bytes, followed by the data.
Since there is no formal specification of the RecordIO format, there tend to be
slight incompatibilities between RecordIO implementations
In the protobuf recordIO format, SageMaker converts each observation in the dataset
into a binary representation as a set of 4-byte floats, then loads it in the
protobuf values field. If you are using Python for your data preparation, we
strongly recommend that you use these existing transformations. There are several
RecordIO formats; the one used in AWS is Apache MXNet's recordio.
RecordIO is more space efficient than CSV.
If you need a human-readable and widely supported format for data exchange and
analysis, CSV is a good choice. On the other hand, if you're dealing with large
datasets, complex data structures, or require efficient data serialization and
streaming for machine learning, then RecordIO might be more suitable. It's also
worth noting that there are other formats like JSON, Parquet, and Avro that offer
different trade-offs in terms of readability, efficiency, and features.
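A sketch of the numpy-to-protobuf-RecordIO conversion using the helper the SageMaker Python SDK provides in sagemaker.amazon.common (the toy arrays are made up):

import io
import numpy as np
import sagemaker.amazon.common as smac

features = np.random.rand(100, 10).astype("float32")      # toy feature matrix
labels = np.random.randint(0, 2, 100).astype("float32")   # toy binary labels

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)   # numpy -> protobuf RecordIO
buf.seek(0)
# buf can now be uploaded to S3 (e.g. with boto3) and used as training input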
Minibatch size:
In the context of machine learning, a "minibatch" refers to a subset of the
training data that is used to update the model's parameters during each iteration
of the training process. Training a machine learning model involves updating its
parameters (weights and biases) based on the differences between the model's
predictions and the actual target values in the training dataset. This process is
typically guided by an optimization algorithm, such as gradient descent. Using the
entire training dataset for each update can be computationally expensive and slow,
especially when dealing with large datasets. To address this, the concept of
minibatching is introduced. Instead of using the entire dataset, training is
performed on smaller subsets of data, known as minibatches.
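A bare-bones sketch of drawing minibatches during one epoch (pure numpy, no framework; the batch size is arbitrary):

import numpy as np

X = np.random.rand(1000, 5)   # toy training set
y = np.random.rand(1000)
batch_size = 32               # minibatch size (hyperparameter)

indices = np.random.permutation(len(X))    # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    X_batch, y_batch = X[batch], y[batch]
    # ...compute gradients on this minibatch and update the parameters...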
Learning rate:
The learning rate is a hyperparameter that plays a crucial role in the optimization
process of training machine learning models. It determines the step size or the
magnitude of parameter updates during each iteration of the optimization algorithm.
A high learning rate results in larger steps, which can lead to faster convergence
but risks overshooting the optimal solution. A low learning rate takes smaller
steps, which can lead to more accurate convergence but might be slower.
Momentum:
Momentum helps the optimization process move more smoothly and efficiently towards
the optimal solution by introducing a form of inertia or memory to the parameter
updates.
Momentum introduces the concept of "velocity" to this process. Instead of simply
updating the parameters based on the current gradient, momentum involves keeping
track of the previous parameter updates and using them to influence the current
update. This creates a smoothing effect that helps overcome oscillations and
obstacles in the optimization landscape.
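The classic momentum update as a sketch (w is a parameter vector and grad its gradient; the random gradients, learning rate, and momentum coefficient are placeholders):

import numpy as np

w = np.zeros(5)                # parameters
velocity = np.zeros_like(w)    # the "memory" of previous updates
learning_rate = 0.01           # step size
beta = 0.9                     # momentum coefficient: how much of the past update persists

for _ in range(100):
    grad = np.random.randn(5)                            # stand-in for a real gradient
    velocity = beta * velocity - learning_rate * grad    # accumulate inertia
    w = w + velocity                                     # move along the smoothed direction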
Semantic segmentation
Semantic segmentation is a computer vision task in machine learning that involves
classifying each pixel in an image into one of several predefined classes or
categories. The goal of semantic segmentation is to partition an image into
meaningful regions or segments, where each segment corresponds to a specific object
or region of interest in the scene.
In other words, semantic segmentation goes beyond traditional image classification,
which assigns a single label to an entire image. Instead, semantic segmentation
provides a detailed understanding of the content of an image by assigning a label
to each pixel, indicating the category of the object or material present at that
pixel location.
Good idea: limit hyperparameter tuning to a small subset of important parameters, and
limit each one to a reasonable range, or you'll spend your life on it.
Seq2Seq:
A Seq2Seq (Sequence-to-Sequence) algorithm, also known as an Encoder-Decoder model,
is a type of architecture used in machine learning for tasks involving sequences,
such as natural language processing, machine translation, text summarization, and
more.
The core idea of a Seq2Seq model involves two main components:
- Encoder: The encoder is responsible for processing the input sequence and
transforming it into a fixed-size context vector or representation. It's typically
implemented using recurrent neural networks (RNNs), such as LSTM (Long Short-Term
Memory) or GRU (Gated Recurrent Unit), which are capable of capturing sequential
information.
- Decoder: The decoder takes the context vector produced by the encoder and
generates the output sequence step by step. It's also implemented using RNNs, where
each step involves predicting the next element in the sequence based on the context
vector and the previously generated elements. The decoder can also use attention
mechanisms to focus on different parts of the input sequence during each decoding
step.
Beam size:
In the context of Seq2Seq algorithms, the "beam size" refers to a parameter used in
beam search, a decoding technique commonly employed in sequence generation tasks,
including machine translation and text generation.
It's a non-greedy algorithm: instead of greedily choosing the token with the
highest probability at each decoding step, beam search maintains a list of multiple
candidate sequences, called the "beam." The beam size determines the number of
candidate sequences to consider at each step.
DeepAR:
Quantile Loss: DeepAR uses a quantile loss function during training, which
encourages the model to produce probabilistic forecasts at different quantiles
(percentiles). This means that DeepAR provides not only point forecasts but also
prediction intervals, allowing you to estimate the uncertainty of the forecast.
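The quantile (pinball) loss has a simple closed form; a sketch, with q = 0.9 as an arbitrary example:

import numpy as np

def quantile_loss(y_true, y_pred, q):
    # penalizes under-prediction with weight q and over-prediction with weight (1 - q)
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# with q = 0.9, under-predicting costs 9x more than over-predicting by the same amount
print(quantile_loss(np.array([10.0, 20.0]), np.array([8.0, 25.0]), 0.9))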
Context Length:
In the context of the DeepAR forecasting algorithm, the "context length" parameter
is an important setting that influences how the algorithm uses historical data for
making predictions. The context length determines how many past time steps
(observations) the algorithm considers when generating forecasts for future time
steps.
Prediction length:
The prediction length specifies the number of future time steps for which you want
the algorithm to generate forecasts
Don't exaggerate with the context and prediction lengths, because that creates a big
model that needs a huge amount of data.
Parallelism:
Parallelism in machine learning refers to the practice of executing tasks
concurrently in order to speed up the training or inference process of machine
learning models. Parallelism takes advantage of modern hardware architectures, such
as multi-core CPUs or distributed computing environments, to process data more
efficiently. There are two main types of parallelism commonly used in machine
learning: model parallelism and data parallelism.
- Model Parallelism:
Model parallelism involves dividing a machine learning model into different parts
or layers and distributing those parts across different compute resources. Each
resource (such as a CPU or GPU) is responsible for computing the forward and
backward passes for a specific section of the model. This approach is useful when a
single device does not have enough memory to store and compute the entire model.
Model parallelism is commonly used in large models where certain layers are memory-
intensive. For example, in deep neural networks, convolutional layers may require
more memory than fully connected layers
- Data Parallelism:
Data parallelism involves distributing the data used for training across multiple
compute resources. Each resource processes a subset of the data, computes
gradients, and updates the model's parameters. The updates are then combined to
update the global model. Data parallelism is effective when dealing with large
datasets that can be divided into smaller batches.
In deep learning, data parallelism is particularly useful. Neural networks are
trained using gradient-based optimization techniques, and data parallelism allows
each resource to compute gradients for different data batches simultaneously. These
gradients are then averaged or combined to update the model's parameters.
Data parallelism is truly parallel; model parallelism is only partially parallel. So the
best choice is data parallelism if your model fits on one device, and model parallelism
otherwise.
Autopilot:
SageMaker Autopilot refers to a set of automated machine learning capabilities.
AutoPilot automates many of the steps involved in the machine learning lifecycle,
including data preprocessing, feature engineering, model selection, hyperparameter
tuning, and model evaluation. It uses advanced algorithms to automatically select
the best performing model architecture and hyperparameters.
It works only with CSV and JSON
SageMaker Experiments:
Used to organize, track, and manage machine learning experiments in a structured and
systematic way.
Machine learning experiments involve training and evaluating different models with
various hyperparameters, data preprocessing techniques, and configurations.
SageMaker Experiments provides a framework to keep track of these experiments,
making it easier to understand what works, what doesn't, and how to improve your
models.
Interesting: Autopilot actually uses experiments
SageMaker Clarify:
SageMaker Clarify is designed to help machine learning practitioners detect and
mitigate bias and explainability issues in their models, ensuring fairness and
transparency in the deployment of machine learning systems.
Bias and explainability are crucial aspects of responsible AI, as they help ensure
that models are not making unfair or biased predictions and that the decisions made
by models are understandable and explainable to users.
It can do:
- Bias Detection:
Clarify helps you identify potential sources of bias in your machine learning
models by analyzing and measuring bias across different groups defined by sensitive
attributes (such as gender or ethnicity).
- Explainability:
Clarify provides tools to explain how predictions are made by your models. It
calculates feature importance scores to help you understand which features have the
most influence on predictions.
AWS Comprehend:
Amazon Comprehend is a natural language processing (NLP) service provided by Amazon
Web Services (AWS). It offers a range of capabilities for analyzing and
understanding text data, making it easier to extract valuable insights, sentiments,
entities, and relationships from text documents. Amazon Comprehend is designed to
help developers and businesses process large amounts of text data efficiently and
derive meaningful information from it. Can do:
- Sentiment Analysis:
Comprehend can determine the sentiment of a piece of text, whether it's positive,
negative, neutral, or mixed. This is particularly useful for understanding customer
feedback, social media posts, and reviews.
- Entity Recognition:
The service can identify and classify entities mentioned in text, such as names of
people, organizations, dates, locations, and more. This helps in extracting
structured information from unstructured text.
- Keyphrase Extraction:
It can extract important keywords or phrases from a document, summarizing its main
topics or themes.
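With boto3, sentiment analysis is a single call, sketched below (region and credentials are assumed to be configured; the text is made up):

import boto3

comprehend = boto3.client("comprehend")
response = comprehend.detect_sentiment(
    Text="The delivery was fast and the product works great!",   # made-up review
    LanguageCode="en",
)
print(response["Sentiment"])        # e.g. POSITIVE
print(response["SentimentScore"])   # confidence score per class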
Amazon Kendra:
Amazon Kendra is an artificial intelligence (AI)-powered search and information
retrieval service offered by Amazon Web Services (AWS). It's designed to make it
easier for organizations to build powerful and highly accurate search solutions for
their internal and external content. Kendra is particularly focused on addressing
the challenges of enterprise search, allowing users to find relevant information
quickly and efficiently across a wide range of data sources.
Amazon Transcribe:
Amazon Transcribe is an automatic speech recognition (ASR) service provided by
Amazon Web Services (AWS). It's designed to convert spoken language into written
text, making it easier to process, analyze, and search audio content. Amazon
Transcribe is widely used for various applications, including transcribing customer
service calls, generating subtitles for videos, converting spoken content into text
for analysis, and more.
Amazon Rekognition:
Amazon Rekognition is a machine learning service provided by Amazon Web Services
(AWS) that offers a range of image and video analysis capabilities. Rekognition
uses advanced computer vision techniques to analyze visual content, extract
information, and identify objects, faces, scenes, and more from images and videos.
It is really cheap!
Amazon Personalize:
Amazon Personalize uses advanced machine learning algorithms to analyze historical
user data, such as browsing history, purchase history, and engagement patterns. It
then generates personalized recommendations or predictions based on this data. The
service is designed to be scalable and easy to integrate into various applications,
websites, and platforms.
Amazon Textract:
Amazon Textract is a machine learning service provided by Amazon Web Services (AWS)
that is designed to extract structured data from scanned documents, forms, and
images containing text. It uses advanced optical character recognition (OCR)
technology along with machine learning algorithms to analyze documents and extract
valuable information, such as text, tables, and forms, which can then be used for
various business processes and applications.
It can do:
- Document Analysis:
Amazon Textract can analyze a variety of documents, including invoices, contracts,
receipts, forms, and more. It can identify text, tables, and even key-value pairs.
- Integration:
Amazon Textract provides APIs that allow developers to integrate the service into
their applications, workflows, or systems.
- Data Output Formats:
Extracted data can be output in various formats, including JSON and CSV, making it
easy to integrate with downstream processes.
- Scalable
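A sketch of the API call with boto3 (the bucket and document name are placeholders):

import boto3

textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "<my-bucket>", "Name": "invoice.png"}},   # placeholders
    FeatureTypes=["TABLES", "FORMS"],   # also extract tables and key-value pairs
)
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])            # the detected lines of text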
AWS Panorama:
AWS Panorama is a service provided by Amazon Web Services (AWS) that is designed to
bring computer vision (CV) capabilities to your existing on-premises cameras. It's
essentially an edge computing solution that allows you to deploy computer vision
models and applications directly onto your cameras, enabling real-time analysis and
inference without sending video data to the cloud.
Amazon Lex:
Amazon Lex is a service provided by Amazon Web Services (AWS) that allows you to
build conversational interfaces, commonly known as chatbots or conversational
agents, using natural language understanding (NLU) and natural language processing
(NLP) capabilities. Amazon Lex enables you to create interactive and intelligent
chatbots that can engage in text or voice-based conversations with users, providing
information, answering questions, and performing tasks. It uses the same model as
Alexa
Recommendation systems:
Recommendation systems, also known as recommender systems, are a subset of machine
learning techniques that aim to predict and suggest items that a user might be
interested in
Here are some of the main types:
- Content-Based Filtering: Content-based filtering recommends items based on the
characteristics of the items themselves and the user's past preferences. It relies
on feature extraction from the items and then compares these features to the user's
preferences. If the extracted features match the user's historical preferences,
those items are recommended. It considers only this user, not their similarity to other
users, SO IT SCALES BETTER
- Collaborative Filtering, with 2 subtypes
- User-Based Collaborative Filtering: This approach recommends items to a
user based on the preferences and behaviors of users who are similar to them. It
identifies users with similar item preferences and suggests items that those
similar users have liked.
- Item-Based Collaborative Filtering: In this approach, recommendations are
made by identifying similar items based on user preferences. If a user likes or
interacts with certain items, the system recommends other items that are similar to
those the user has shown interest in.
- Hybrid Approaches: Hybrid recommendation systems combine multiple techniques to
improve recommendation accuracy.
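To make the item-based collaborative filtering variant above concrete, a toy sketch using cosine similarity between the item columns of a user-item matrix (the ratings are made up):

import numpy as np

# rows = users, columns = items, values = made-up ratings (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# similarity of item 0 to every item, computed from how users rated them
sims = [cosine(ratings[:, 0], ratings[:, j]) for j in range(ratings.shape[1])]
print(np.round(sims, 2))   # item 1 comes out as the most similar to item 0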
Guess what? When you use SageMaker you still have to respect all the security practices
you use with the other services.
SageMaker notebooks and training/inference containers have internet access by default;
consider whether you really need it. If not, you can disable it and route traffic
through a NAT instead.
How do you deploy in SageMaker? Easy, it's just one line of code; you only need to know
which EC2 instance type you want.
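Something like this, assuming estimator is an already-trained estimator from the SageMaker Python SDK (as in the sketch in the SageMaker section) and some_input is a payload in the format the model expects:

# deploy the trained model behind a managed HTTPS endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

result = predictor.predict(some_input)   # run real-time inference against the endpoint
predictor.delete_endpoint()              # tear the endpoint down when done to stop paying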
BlazingText:
BlazingText refers to an algorithm and implementation for training word embeddings
and text classification models efficiently.
Word embeddings are numerical representations of words that capture semantic
relationships between words based on their context in a given corpus of text.
BlazingText provides a highly optimized implementation of the Word2Vec algorithm,
which is commonly used for creating word embeddings. This implementation is
designed to scale well across multiple compute instances, making it suitable for
processing large datasets and training word embeddings quickly.
Good ideas:
Transforming columns using logarithms is a common technique in machine learning
when dealing with data that spans a wide range of values or exhibits exponential
growth. Why? because:
- Normalization: Logarithmic transformations can help normalize data that is highly
skewed or has a large dynamic range. This can make the data more suitable for
algorithms that assume a normal distribution or for models that are sensitive to
the scale of input features.
- Stabilizing Variance: Logarithmic transformations can help stabilize the variance
of a variable. When the variance of a feature increases as its magnitude increases
it is bad for some models
- Linearization: Sometimes, taking the logarithm of a variable can transform it
into a form that appears more linear.
- Noise Reduction: Logarithmic transformations can help reduce the impact of
outliers or extreme values.
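In practice it's often one line with np.log1p, which handles zeros more gracefully than np.log (the column is a made-up example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 1_200_000]})   # toy skewed column
df["income_log"] = np.log1p(df["income"])    # log(1 + x): compresses the extreme value
print(df)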
https://www.youtube.com/watch?
v=chHWvBhAysQ&list=PLWsnB2XBNJzJGQ_8ZHMSSXSK7xY9XCDXe&index=14
https://www.youtube.com/watch?v=i_LwzRVP7bg