LFD 1
Artificial Intelligence (AI)
Goals: Achieve human-like intelligence, understand and solve complex problems, adapt
to new situations.
A method within AI: ML focuses on the ability of machines to learn from data without
explicit programming.
Goals: Identify patterns, make predictions, improve performance over time with more
data.
A subfield of ML: DL uses artificial neural networks inspired by the human brain to learn
from large amounts of data, particularly unstructured data like images and text.
Goals: Solve complex problems requiring pattern recognition, feature extraction, and
representation.
Analogies:
The Language Learning Analogy:
AI: Learning a new language in general, the ability to communicate and understand
information in a different way.
ML: Learning vocabulary and basic grammar, acquiring the building blocks of language
through repeated exposure and practice.
The Navigation Analogy:
AI: Finding your way in a city, the overall ability to navigate and reach your destination.
ML: Using a basic map with directions, a straightforward approach to getting from point A to point B.
Machine Learning Families: Supervised Learning, Unsupervised Learning, Reinforcement Learning, and Deep Learning.
1. Supervised Learning: In this family of algorithms, models are trained on labeled data, where the output is known, to learn the relationship between the input features and the output variable.
o Examples:
Classification: Predicting a category (e.g., spam vs. not spam, image recognition).
o Common algorithms:
Linear regression
Logistic regression
Neural networks
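To make this concrete, here is a minimal scikit-learn sketch of supervised classification; the dataset and settings are illustrative, not from the notes:

    # Minimal supervised-learning sketch: labeled data in, learned mapping out.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)              # features X, known labels y
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=200)       # one of the common algorithms
    model.fit(X_train, y_train)                    # learn the input -> output relationship
    print(model.score(X_test, y_test))             # accuracy on unseen data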
4. Deep Learning: This family of algorithms is a subset of machine learning that uses deep neural networks to model complex relationships between inputs and outputs. Deep learning has achieved state-of-the-art results in various domains such as image classification, speech recognition, and natural language processing.
5. Other Families:
Machine Learning Cycle
Stages: Data -> Training -> Prediction -> compare with the Actual value.
(Example setup: each product row has features x1, x2, x3, and the model F(X) outputs a prediction.)
Drop? Columns we may drop: uniformly distributed variables, variables with missing values, and variables that do not affect the prediction.
Outliers: remove the rows that contain outliers from the dataset.
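One common recipe for this (an assumption here; the notes don't specify a rule) is the IQR method in pandas:

    # Drop rows with outliers using the IQR rule (toy data; 500 is the outlier).
    import pandas as pd

    df = pd.DataFrame({"price": [22, 32, 28, 30, 500]})
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df_clean = df[mask]                            # rows without outliers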
Data Preprocessing
Variables fall into two groups: Numerical Variables and Categorical Variables.
Transformations for numerical variables:
1) Log Transformation
2) Square root Transformation
3) Reciprocal Transformation
4) Exponential Transformation
5) Box-Cox Transformation
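A quick sketch of these five transformations with numpy/scipy on made-up positive values:

    # The five listed transformations, applied to toy data.
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    log_x    = np.log(x)              # 1) log transformation
    sqrt_x   = np.sqrt(x)             # 2) square-root transformation
    recip_x  = 1.0 / x                # 3) reciprocal transformation
    exp_x    = np.exp(x)              # 4) exponential transformation
    boxcox_x, lam = stats.boxcox(x)   # 5) Box-Cox (requires positive values)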
Modeling
The model function for linear regression (algorithm):
f(x^(i)) = w * x^(i) + b
Linear regression maps a linear relationship between the input variables and the target variable.
i : the row (example) index
x : the feature
w : weight (slope)
b : intercept (bias)
ID  version  price
1   4        32
2   3        22
f(x^(1)) = w * x^(1) + b    (prediction for row 1)
f(x^(2)) = w * x^(2) + b    (prediction for row 2)
With three features per product (x1, x2, x3) and weights w1, w2, w3:
f(x) = w1 * x1 + w2 * x2 + w3 * x3 + b    (prediction for one row)
(Figure: the red line is the prediction, y-hat = f(x), plotted against the data points.)
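A tiny numpy sketch of this prediction step, with an assumed weight and bias (not fitted values):

    # f(x) = w*x + b on the two rows of the table above (w and b are made up).
    import numpy as np

    X = np.array([[4.0], [3.0]])      # feature "version" for the two rows
    w = np.array([5.0])               # assumed weight, for illustration only
    b = 8.0                           # assumed bias
    y_hat = X @ w + b                 # predictions: f(x^(i)) = w * x^(i) + b
    print(y_hat)                      # [28., 23.] vs. actual prices [32, 22]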
Gradient descent
Iteratively change the values of w and b to decrease the cost function, improving the prediction f(x).
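A minimal gradient-descent sketch for this model with an MSE cost, using the toy table above:

    # Gradient descent for f(x) = w*x + b with mean squared error.
    import numpy as np

    x = np.array([4.0, 3.0])
    y = np.array([32.0, 22.0])
    w, b, lr = 0.0, 0.0, 0.01         # start values and learning rate

    for _ in range(1000):
        y_hat = w * x + b
        err = y_hat - y
        dw = 2 * np.mean(err * x)     # dJ/dw
        db = 2 * np.mean(err)         # dJ/db
        w -= lr * dw                  # step against the gradient
        b -= lr * db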
Overfitting and Underfitting
Overfitting usually happens on the training data when the model is too complex (many features -> high complexity); we then use techniques to make the model generalize.
In real life, models are more complex than this (non-linear relationships).
Logistic regression applies a non-linear transformation and predicts the probability of an event with a binary outcome (two possible values, e.g., Yes or No, 1 or 0); the output is a probability.
An activation function is a more general concept for adding non-linearity (transforming a linear relationship into a non-linear one).
A probability threshold (usually 0.5) converts the probability into a binary decision, 0 or 1 (used for classification).
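A small sketch of the sigmoid activation plus the probability threshold (toy scores):

    # Sigmoid turns linear scores into probabilities; a threshold picks the class.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.array([-2.0, 0.3, 4.0])   # linear scores w.x + b (made up)
    p = sigmoid(z)                   # probabilities in (0, 1)
    y = (p >= 0.5).astype(int)       # binary decision with a 0.5 threshold
    print(p, y)                      # -> [0.119 0.574 0.982] [0 1 1]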
SVM
Support Vector Machine (algorithm): used in classification problems where the goal is to divide the data points into two or more categories based on their features, by finding a hyperplane (a boundary that divides the data) that separates the different classes.
The hyperplane has a margin, and the SVM tries to push the margin out to widen the clean boundary.
We can increase the dimensions/features of the data, but this may cause the curse of dimensionality (overfitting), because we increase the number of dimensions/features.
When the data cannot be separated by a line, we increase the dimensions and separate it with a plane.
Linearly Separable Data:
For example, in a binary classification problem with two or more features, linearly separable data might look like two distinct clusters on a scatter plot with a clear gap between them.
Support Vector Machines are usually used for classification; they can be used as regressors, but that is not common.
Support Vector Machines (SVMs) are often used in image classification problems, where the
goal is to classify images into one of several pre-defined categories.
SVMs can be used for both binary and multi-class classification problems in image analysis.
A kernel is an equation used to work in higher dimensions without causing problems such as longer training time (it is computed via a dot product).
The kernel function determines how the input data (features) is transformed from its original dimension into a higher-dimensional space, so data that is not linearly separable might become linearly separable (effective at capturing complex relationships).
Common types of kernel functions:
1. Linear Kernel:
linear decision boundary (two classes)
2. Polynomial Kernel:
non-linear decision boundary that fits well on datasets with non-linearly separable classes; parameters c and d control the shift and degree of the polynomial
3. RBF (Radial Basis Function) Kernel:
Parameters of RBF:
1) C
controls how much the SVM tries to avoid misclassifying the training data
a larger C means a higher penalty for misclassification, so a more complex decision boundary (fits the training data better, higher training accuracy)
a smaller C means a lower penalty for misclassification, so a simpler decision boundary (generalizes better, but lower training accuracy)
2) gamma
controls the shape of the decision boundary: how much influence each training example has on it
a larger gamma means a narrower RBF kernel (each example has a small region of influence); the boundary can capture the local shape of the data because it is more sensitive to individual data points (more complex), which leads to a more accurate model on the training data but can result in overfitting
a smaller gamma means a wider RBF kernel (each example has a large region of influence); the boundary captures the global shape of the data (and could underfit)
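A brief scikit-learn sketch of an RBF-kernel SVM where these C and gamma trade-offs apply (values are illustrative, not tuned):

    # RBF-kernel SVM on non-linearly separable toy data.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(noise=0.2, random_state=0)
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # C and gamma as discussed above
    clf.fit(X, y)
    print(clf.score(X, y))                      # training accuracy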
Hyperparameter tuning tries many combinations to identify the best C and gamma for the model. Common approaches:
1) Cross-validation
2) Grid search
3) Random search
4) Bayesian optimization
Note: accuracy is not the only metric used to evaluate classification, and it can be one of the worst.
Cross-Validation
1) Split the training data into multiple subsets (training and validation sets)
2) The model is trained on some subsets and validated on the remaining one
3) This process is repeated multiple times (folds)
4) Performance metrics are averaged across the folds (e.g., the average score for each C & gamma combination)
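A sketch of grid search combined with cross-validation to pick C and gamma (the grid values are arbitrary examples):

    # Grid search over C and gamma with 5-fold cross-validation.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_moons(noise=0.2, random_state=0)
    grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.5, 1.0]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)  # 5 folds
    search.fit(X, y)                                      # scores averaged per fold
    print(search.best_params_, search.best_score_)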
Variable Names:
1) Camel case: myVariableName
2) Pascal case: MyVariableName
3) Snake case: my_variable_name
Neural networks are used for various tasks, including pattern recognition, classification,
regression, and other machine learning tasks
Applications of neural networks, such as image recognition, natural language processing, and
autonomous vehicles
The depth and complexity of neural networks can vary, ranging from simple models like the
perceptron to more complex architectures like convolutional neural networks (CNNs) and
recurrent neural networks (RNNs)
Perceptron is the simplest form (architecture) of a neural network (the foundational concept
in neural networks)
A neuron takes multiple input signals (numerical values that represent various features or aspects of the input data).
A perceptron typically consists of a single layer with one or more artificial neurons. It takes input values, applies weights, adds a bias, and outputs a binary value (0 or 1) based on an activation function.
The perceptron is the basic building block; by connecting perceptrons in layers, we form neural networks.
Artificial neurons:
Also known as perceptrons, these are the basic computational units in a neural network. A neuron takes inputs, applies weights to those inputs, sums them up, adds a bias, and then passes the result through an activation function to produce an output (for classification, an activation function is applied).
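A minimal numpy sketch of one artificial neuron / perceptron step (weights and bias are made up):

    # One neuron: weighted sum + bias, then a step activation (0 or 1).
    import numpy as np

    def step(z):                      # activation: outputs 0 or 1
        return (z >= 0).astype(int)

    x = np.array([0.5, -1.0, 2.0])    # input signals (features)
    w = np.array([0.4, 0.3, 0.8])     # weights (assumed for illustration)
    b = -0.5                          # bias
    z = np.dot(w, x) + b              # weighted sum plus bias
    print(step(z))                    # perceptron output: 1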
The connections (edges) between neurons in adjacent layers are formed by the weights.
Each neuron in one layer is connected to every neuron in the next layer.
Weights determine the strength and sign (positive or negative) of the connection between two
neurons.
Larger weights amplify the influence of the input on the connected neuron, while smaller
weights reduce it
- Used when dealing with problems that involve non-linear mappings or complex data relationships.
Explicit memory of past inputs: the ability of a model to retain and utilize information from past inputs (historical information encountered during the training process) in its decision-making. It involves the model learning and remembering patterns, dependencies, or relationships in the training data, which may include sequences or temporal patterns.
Radial basis functions: mathematical functions used both as activation functions in neural networks (specifically RBF networks) and as kernel functions in support vector machines for non-linear mapping of input data. They are especially useful for problems that involve non-linear relationships.
Neurons are connected in layers (structure of a neural network):
1) Input layer: neurons in this layer receive the initial input data; each connection to the next layer has its own weight
2) One or more hidden layers: process the input data through interconnected artificial neurons, applying weights and biases and passing the results through activation functions; hidden layers allow the network to capture intricate relationships and non-linear patterns in the input data
3) Output layer : Neurons in this layer produce the final output of the network
The connections between neurons have associated weights that are adjusted during the
training process
Benefits of hidden layers:
- Feature Extraction
- Hierarchical Learning
- Representation Learning
- Enhanced Expressiveness
- Generalization
Hidden layers create a non-linear relationship between input and output and capture more complex relationships (hidden layers act as feature engineering).
The ReLU (rectified linear unit) function is used between the input layer and hidden layers, and between hidden layers and the output layer; the sigmoid function is used in the output layer.
The Softmax activation function can be used in the output layer for multi-class classification problems. It takes a vector of raw scores (logits) as input and converts them into probabilities, so the output values are in the range (0, 1) and the probabilities across all classes sum to 1.
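A numpy sketch of the three activation functions just mentioned (ReLU, sigmoid, softmax) on toy logits:

    # The activations discussed above, applied to made-up logits.
    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z))     # subtract max for numerical stability
        return e / e.sum()

    logits = np.array([2.0, 1.0, -1.0])
    print(relu(logits), sigmoid(logits), softmax(logits))
    print(softmax(logits).sum())      # probabilities sum to 1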
The training process of a neural network involves both feedforward and backpropagation
steps
Feedforward: Information flows in one direction, from the input layer through one or more
hidden layers to the output layer (Primarily used for making predictions or classifications)
The chain rule: used to calculate the gradients of the error with respect to the parameters of the network when the error is propagated backward through the network during training.
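A tiny PyTorch sketch showing autograd applying the chain rule during backpropagation (one neuron, toy numbers):

    # Autograd computes dLoss/dw and dLoss/db via the chain rule.
    import torch

    w = torch.tensor(0.5, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    x, y = torch.tensor(2.0), torch.tensor(3.0)

    y_hat = w * x + b                 # feedforward
    loss = (y_hat - y) ** 2           # squared error
    loss.backward()                   # backpropagation via the chain rule
    print(w.grad, b.grad)             # dL/dw = 2*(y_hat-y)*x, dL/db = 2*(y_hat-y)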
Single Layer Perceptron (SLP)
The simplest form of a neural network. It consists of an input layer and an output layer only, with no hidden layers. Neurons in the input layer represent the features of the input data; neurons in the output layer produce the network's output. Used for linearly separable problems (2 classes).
The training process involves adjusting the weights and biases of the perceptron to correctly classify the input data (via the backpropagation algorithm).
Optimization algorithms:
1) Gradient descent (linear regression)
2) Backpropagation (neural networks)
Adam optimizer: one of the most widely used optimization algorithms in deep learning models.
Overfitting usually happens when using many neurons (many features) because this creates high complexity. Increasing the features or neurons in the network can increase accuracy, but we should apply regularization to avoid overfitting.
Four things we can monitor during training: training accuracy, training loss, validation accuracy, and validation loss.
Accuracy does not care whether the predicted probability was 0.6, 0.8, or 1; it only counts whether the final prediction matches the actual value.
Loss Functions
Also known as cost functions or objective functions, like MSE in linear regression (the loss penalizes the model based on its outputs, e.g., predicted probabilities).
The choice of the type of loss function depends on the type of problem being solved (e.g.,
regression or classification)
Common types of loss functions and their roles:
1) Mean Squared Error (MSE) commonly used for regression tasks
2) Mean Absolute Error (MAE) commonly used for regression tasks
3) Binary Cross-Entropy Loss (Log Loss) Commonly used for binary classification
problems
4) Categorical Cross-Entropy Loss Used for multi-class classification problems
5) Hinge Loss Commonly used for support vector machines (SVM) and some types of
binary classification tasks
6) Huber Loss A combination of MSE and MAE, Huber loss is less sensitive to outliers
than MSE, It is often used in regression tasks where the presence of outliers may affect
model performance
For example, when p = 0.8 and the true label is 1, there is still a loss for the remaining 0.2 gap; the model is penalized on it so that future predictions improve, adjusting its weights and biases (minimizing the binary cross-entropy loss).
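Checking that example numerically, a binary cross-entropy sketch in numpy (note the loss value is -log(0.8) ~ 0.22, while 0.2 is the probability gap):

    # Binary cross-entropy for a single prediction.
    import numpy as np

    def bce(y_true, p):
        return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    print(bce(1, 0.8))    # ~0.223: small but non-zero, so the model is still penalized
    print(bce(1, 0.99))   # ~0.010: near-perfect prediction, near-zero loss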
Very fast, but a low-level interface (you have to implement everything yourself; not beginner-friendly).
PyTorch is now used in many AI applications such as ChatGPT. PyTorch is a framework with a set of classes, each with its own behavior; we inherit some of these classes and use their functions.
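A minimal sketch of that inheritance pattern: subclassing nn.Module and training one step with the Adam optimizer (toy data):

    # Inherit PyTorch's nn.Module class; train one step with Adam.
    import torch
    from torch import nn

    class TinyNet(nn.Module):                     # inherit the framework's class
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(3, 1)          # weights + bias, like w.x + b

        def forward(self, x):
            return torch.sigmoid(self.layer(x))   # probability output

    model = TinyNet()
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.BCELoss()                        # binary cross-entropy

    x = torch.randn(8, 3)                         # toy batch
    y = torch.randint(0, 2, (8, 1)).float()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()                               # backpropagation
    opt.step()                                    # Adam update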
Confusion Matrix
A confusion matrix is a tool in machine learning that allows developers and data scientists to evaluate and improve the performance of classification models by providing valuable metrics such as accuracy, precision, recall, F1-score, and more.
It can help identify the types of errors the model is making: false positives, false negatives, true positives, and true negatives.
Classification Report
- Precision is sensitive to the number of false positives and is useful when the cost of false positives is high; high precision indicates that the model has a low rate of false positives
- Recall is sensitive to the number of false negatives and is useful when the cost of missing positive instances is high
- Support represents the number of actual occurrences of the class in the specified dataset
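A short scikit-learn sketch producing both a confusion matrix and a classification report (toy labels):

    # Confusion matrix and classification report on made-up predictions.
    from sklearn.metrics import confusion_matrix, classification_report

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(confusion_matrix(y_true, y_pred))        # rows: actual, columns: predicted
    print(classification_report(y_true, y_pred))   # precision, recall, f1-score, support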
Explainability
Ability to understand and interpret the decisions and predictions made by
a machine learning model
SHAP values:
Directly associated with the feature domain; they provide insight into the impact of each feature on the model's predictions.
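A brief sketch of computing SHAP values with the third-party shap package (the model and data here are illustrative choices, not from the notes):

    # SHAP values for a tree-based model: per-feature impact on each prediction.
    import shap
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)     # explainer for tree-based models
    shap_values = explainer.shap_values(X)    # impact of each feature per prediction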