Lecture 5 - Feature Extraction, Model Building & Evaluation
Model building
Evaluation
Practice
https://www.kaggle.com/code/benmendelson04/house-prices-ml#All-features-in-train-data-have-0-missing-values
Score: 0.18367
Feature extraction
Feature extraction is the process of identifying and selecting the most
important information or characteristics from a data set.
It’s like distilling the essential elements, helping to simplify and highlight the
key aspects while filtering out less significant details.
Feature extraction vs. feature engineering
Feature engineering is the process of reworking a data set to improve the
training of a machine learning model. By adding, deleting, combining, or
mutating the data within the data set, data scientists expertly tailor the training
data to ensure the resulting machine learning model will fulfill its business use
case.
Feature extraction vs. feature selection
Feature selection is closely related. Where feature extraction and feature
engineering involve creating new features, feature selection is the process of
choosing which features are most likely to enhance the quality of your prediction
variable or output.
Feature extraction transforms raw data, with image files being a typical use case, into numerical features that are compatible with machine learning algorithms.
Why feature extraction is necessary
Reduces redundant data
Feature extraction cuts through the noise, removing redundant and unnecessary data. This frees machine learning programs to
focus on the most relevant data.
The most accurate machine learning models are those developed using only the data required to train the model to its intended
business use. Including peripheral data negatively impacts the model’s accuracy.
Including training data that doesn’t directly contribute to solving the business problem bogs down the learning process. Models
trained on highly relevant data learn more quickly and make more accurate predictions.
Pruning out peripheral data boosts speed and efficiency. With less data to sift through, compute resources aren’t dedicated to
processing tasks that aren’t generating additional value.
Common Feature Extraction Techniques
Feature selection:
- Correlation (a sketch using pandas follows below)
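A minimal sketch of correlation-based selection with pandas; the toy columns and the 0.5 threshold are illustrative assumptions, not from the lecture:

```python
import pandas as pd

# Toy housing-style table; illustrative values only.
df = pd.DataFrame({
    "area":  [50, 80, 120, 200, 65],
    "rooms": [2, 3, 4, 6, 2],
    "age":   [30, 10, 5, 1, 25],
    "price": [100, 180, 260, 450, 130],
})

# Correlate every feature with the target and keep the strongest ones.
corr = df.corr(numeric_only=True)["price"].drop("price")
selected = corr[corr.abs() > 0.5].index.tolist()

print(corr.sort_values(ascending=False))
print("Selected features:", selected)
```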
Common Feature Extraction Techniques
Tabular data:
- Dimensionality Reduction: PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis); a PCA sketch follows below
- Autoencoder
https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/
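A minimal PCA sketch with scikit-learn; the iris dataset is used here purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features down to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```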
Common Feature Extraction Techniques
Text data:
- TF-IDF, CountVectorizer, CBOW (a TF-IDF sketch follows below)
- Embedding vectors
Image data:
- Histogram of oriented gradients (HOG)
- Scale-invariant feature transform (SIFT)
- Convolutional neural networks (CNNs)
Audio data:
- Short-time Fourier transform (STFT), mel-frequency cepstral coefficients (MFCCs), or spectral features
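A minimal TF-IDF sketch with scikit-learn; the three-document corpus is an illustrative assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (not from the lecture).
docs = [
    "the house has three rooms",
    "the house price depends on the area",
    "rooms and area drive the price",
]

# Turn raw text into a sparse TF-IDF feature matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                           # (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())
```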
Model building
Split the dataset
Choose algorithm
Split the dataset
Splitting a dataset is a crucial step in the machine learning pipeline, typically done
to create separate sets for training, validation, and testing.
The goal is to train the model on one subset, tune its parameters on another, and
evaluate its performance on a completely unseen set.
Train-Test Split
Description: The dataset is divided into two parts: a training set and a testing set.
Ratio: Common ratios are 80-20 or 70-30, with the majority for training.
Use Case: Suitable for large datasets where a separate validation set is not
needed.
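A minimal sketch with scikit-learn's train_test_split; iris is a stand-in dataset and the 80-20 ratio matches the slide:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80-20 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```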
Train-Validation-Test Split
Description: The dataset is split into 3 parts: training, validation, and testing sets.
Use Case: Ideal for fine-tuning model parameters on the validation set before final
evaluation on the test set.
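A sketch of a three-way split done as two consecutive train_test_split calls; the 60/20/20 ratios are an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out the test set (20%), then split the rest into
# train (60% overall) and validation (20% overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 90 30 30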
Cross-Validation
Description: The dataset is divided into multiple subsets, and the model is trained
and validated multiple times, each time using a different subset for validation and
the remaining for training.
Use Case: Useful for small datasets to ensure that the model's performance is not
highly dependent on a particular random split.
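A minimal sketch with scikit-learn's cross_val_score; logistic regression and iris are stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```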
Cross-Validation variations
Leave-One-Out Cross-Validation (LOOCV):
Description: A type of cross-validation where a single instance is used for validation and the rest are used for training. This process is repeated for each instance in the dataset.
Use Case: Common in situations with very limited data, but can be computationally expensive.
K-Fold Cross-Validation:
Description: The dataset is divided into k subsets (folds), and the model is trained and validated k times, each time using a different fold for validation and the remaining folds for training.
Use Case: Balances the trade-off between computation time and reliable performance estimation. A sketch comparing k-fold and LOOCV follows this list.
Leave-P-Out Cross-Validation:
Description: Similar to LOOCV, but p instances are left out for validation in each iteration.
Use Case: A compromise between LOOCV and k-fold cross-validation, suitable for moderately sized datasets.
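A minimal sketch comparing the two scikit-learn splitters (iris and logistic regression are stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-fold: k=5 is a common default.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# LOOCV: one instance per validation fold, so 150 fits for iris.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```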
Special cases
Stratified Split:
Description: Ensures that the proportion of classes in each subset mirrors the original dataset. Useful for imbalanced datasets.
Use Case: Maintains the class distribution in training, validation, and test sets.
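A minimal sketch: passing stratify=y to train_test_split preserves the class proportions (iris is a stand-in):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions in train and test
# the same as in the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_train), np.bincount(y_test))  # [40 40 40] [10 10 10]
```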
Special cases
Time-Based Split:
Description: For time-ordered data, the split follows chronology: train on earlier observations and validate/test on later ones, so the model never sees the future during training.
Use Case: Time series forecasting and any task where a random shuffle would leak future information into the training set.
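A minimal sketch with scikit-learn's TimeSeriesSplit; the 20-point series is illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative time-ordered observations.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Each fold trains on the past and validates on the block that follows it.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```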
Choose algorithm
Choosing the right algorithm for training a machine learning model is a crucial
decision that can significantly impact the model's performance.
The choice depends on various factors, including the nature of the problem, the
characteristics of the data, and the specific requirements of the task.
Choose algorithm
Type of problem: Different algorithms are better suited to different types of problems.
E.g.: If you are trying to classify data into discrete categories, you might consider
using a classification algorithm, such as logistic regression or support vector machines
(SVMs). If you are trying to predict a continuous outcome, you might consider using a
regression algorithm, such as linear regression or random forest.
Size and complexity of the data: Some algorithms are better suited to large, complex
datasets, while others are more suitable for smaller, simpler datasets.
E.g.: Decision trees and random forests are generally good for handling large
datasets (many features), while logistic regression and SVMs may be more suitable for
smaller datasets.
Choose algorithm
Speed and scalability: If you are working with a large dataset or need to make
predictions in real-time, you may want to choose an algorithm that is fast and scalable.
Decision trees and random forests are generally fast to train, but may not scale well
to very large datasets (many samples). On the other hand, linear regression and logistic
regression are usually fast to train and scale well to large datasets.
Interpretability: If you need to be able to explain and interpret the results of your model,
you may want to choose an algorithm that is more interpretable, such as linear
regression or logistic regression. More complex algorithms, such as random forests and
deep neural networks, may be more accurate but are often more difficult to interpret.
Typical algorithms
Tune hyperparameters
Hyperparameter tuning is a crucial step in optimizing the performance of machine learning models.
Hyperparameters are external configuration settings for the model that are not learned from the data but must be set before training.
E.g.: the learning rate, the regularization strength, the number of trees in a random forest, or k in k-nearest neighbors.
Hyperparameter tuning techniques
Grid Search: defines a grid of hyperparameter values and trains the model with all possible combinations of values.
Process:
Define a range of values for each hyperparameter.
Specify all possible combinations using a grid.
Train the model for each combination.
Evaluate performance and select the best set of hyperparameters.
Pros: Exhaustive search that guarantees finding the best combination within the grid.
Cons: Computationally expensive, especially for large grids.
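A minimal sketch with scikit-learn's GridSearchCV; the SVC grid is an illustrative assumption:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```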
Hyperparameter tuning techniques
Random Search: samples hyperparameter values randomly from predefined ranges.
Process:
Specify a range for each hyperparameter.
Randomly sample combinations from the hyperparameter space.
Train the model for each combination.
Evaluate performance and select the best set of hyperparameters.
Pros: More computationally efficient than grid search.
Cons: May miss the optimal values.
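A minimal sketch with RandomizedSearchCV; the loguniform range for C and the 10-trial budget are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 10 random draws from the distributions instead of a full grid.
param_dist = {"C": loguniform(1e-2, 1e2), "kernel": ["linear", "rbf"]}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=42)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```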
Grid search vs. Random search
Hyperparameter tuning techniques
Gradient-Based Optimization: tunes hyperparameters with gradient-based optimization methods.
Process:
Compute the gradient of the objective function with respect to
hyperparameters.
Update hyperparameter values using gradient descent or related algorithms.
Repeat until convergence.
Pros: Effective for differentiable hyperparameters.
Cons: Not applicable to discrete hyperparameters.
Hyperparameter tuning techniques
Bayesian Optimization: builds a probabilistic surrogate model of the objective (e.g., a Gaussian process) and uses an acquisition function to decide which hyperparameters to try next. Sample-efficient, so well suited to models that are expensive to train. A sketch follows below.
Genetic Algorithms: evolve a population of hyperparameter sets over generations through selection, crossover, and mutation, keeping the best-performing candidates.
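As one illustration (the lecture names no library), the third-party Optuna package implements a Bayesian-style optimizer (TPE) behind a simple API; a minimal sketch, assuming Optuna is installed:

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(trial):
    # The sampler proposes values informed by previous trials.
    c = trial.suggest_float("C", 1e-2, 1e2, log=True)
    kernel = trial.suggest_categorical("kernel", ["linear", "rbf"])
    return cross_val_score(SVC(C=c, kernel=kernel), X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```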
Evaluation
Evaluating machine learning models is a crucial step to assess their performance
and generalization capability. The choice of evaluation metrics depends on the
type of problem you are solving (classification, regression, clustering, etc.).
Evaluation Metrics for Classification
Accuracy: Ratio of correctly predicted instances to the total instances.
Precision: Proportion of true positive predictions among all positive predictions.
Recall (Sensitivity): Proportion of true positive predictions among all actual
positives.
F1 Score: Harmonic mean of precision and recall.
Area Under the ROC Curve (AUC-ROC): Measures the area under the Receiver
Operating Characteristic curve.
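A minimal sketch computing these metrics with scikit-learn; the breast-cancer dataset and logistic regression are stand-ins, not from the lecture:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # scores are needed for AUC-ROC

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("auc-roc  :", roc_auc_score(y_test, y_prob))
```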
Evaluation Metrics for Classification
Confusion Matrix: A matrix that shows the counts of true positive, true negative,
false positive, and false negative predictions. Particularly useful for binary and
multiclass classification problems.
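A minimal sketch with toy labels; scikit-learn's confusion_matrix puts actual classes on rows and predicted classes on columns:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[3 1]
                                         #  [1 3]]
```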
Evaluation Metrics for Classification
ROC Curve and Precision-Recall Curve:
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
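A minimal sketch of how the two curves are computed with scikit-learn; the toy scores are illustrative and plotting is left out:

```python
from sklearn.metrics import precision_recall_curve, roc_curve

# Toy labels and predicted scores for illustration.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55]

fpr, tpr, _ = roc_curve(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)

print(list(zip(fpr, tpr)))  # points that trace the ROC curve
```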
Evaluation Metrics for Regression
Mean Absolute Error (MAE): Average absolute difference between predicted and
true values.
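A minimal sketch with toy values (illustrative numbers):

```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# MAE = (1/n) * sum(|y_true - y_pred|)
print(mean_absolute_error(y_true, y_pred))  # 0.5
```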
Homework
https://www.kaggle.com/competitions/playground-series-s4e1/overview