Week 6. Data Preparation and Transformation
2. Data Understanding
   Collect initial data → Describe data → Explore data → Verify data quality
3. Data Preparation
   Select data → Clean data → Construct data → Integrate data → Format data
4. Modeling
   Select modeling techniques → Generate test design → Build model → Assess model
5. Evaluation
   Evaluate results → Review process → Determine next steps
6. Deployment
   Plan deployment → Plan monitoring and maintenance → Produce final report → Review project

https://aws.amazon.com/what-is/data-mining/
What is Machine Learning (ML)?
Machine learning (ML) is a branch of artificial intelligence that
enables computers to learn patterns from data and make
predictions without explicit programming.
Machine Learning Algorithms
Supervised Learning
  Regression: Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression
  Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes
[Diagram: features split into qualitative (categorical) and quantitative (numerical), with binary as a sub-type]
Types of Features
Feature type | Feature sub-type | Definition                                       | Example
Categorical  | Nominal          | Labelled variables with no quantitative value    | Cloud provider: AWS, MS, Google
Categorical  | Ordinal          | Adds the sense of order to the labelled variable | Job title: junior data scientist, senior data scientist, chief data scientist
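As a sketch of the distinction above, nominal and ordinal features can be represented with pandas categoricals (the column values here are illustrative, taken from the table's examples):

```python
import pandas as pd

# Nominal: labels with no inherent order
provider = pd.Categorical(["AWS", "Google", "MS", "AWS"])

# Ordinal: labels with an explicit order, from junior to chief
title = pd.Categorical(
    ["senior data scientist", "junior data scientist", "chief data scientist"],
    categories=[
        "junior data scientist",
        "senior data scientist",
        "chief data scientist",
    ],
    ordered=True,
)

print(provider.ordered)  # False: nominal, no order defined
print(title.ordered)     # True: ordinal, comparisons are meaningful
print(title.min())       # the "smallest" label under the declared order
```

With `ordered=True`, pandas allows order-aware operations such as `min()` and comparisons, which are meaningless for the nominal feature.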
• Some ML libraries, such as scikit-learn, may not accept string values for your categorical features.
• The distribution of your variable may not be optimal for your algorithm.
• Your ML algorithm may be affected by the scale of your data.
• Some observations of your variable may be missing information that you will have to fix; these are known as missing values.
• You may find outlier values of your variable that can add bias to your model.
• Your variable may store different types of information, and you may only be interested in some of it (for example, a date variable can yield the day of the week or the week of the month).
• You might want to find a mathematical representation for a text variable.
• …
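Two of the issues above (missing values and extracting information from a date variable) can be sketched in pandas; the DataFrame and the median-fill strategy here are purely illustrative:

```python
import pandas as pd

# Illustrative data: one missing value and a raw date column
df = pd.DataFrame({
    "price": [10.0, None, 14.0, 12.0],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02",
                            "2024-01-06", "2024-01-07"]),
})

# Fill the missing price with the column median (one common strategy
# among several: mean, constant, model-based imputation, ...)
df["price"] = df["price"].fillna(df["price"].median())

# Extract the day of the week (0 = Monday) from the date variable
df["day_of_week"] = df["date"].dt.dayofweek

print(df)
```

The right imputation strategy depends on the variable and the model; the median is simply robust to outliers.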
Dealing with categorical
features
Transforming nominal features: Label Encoding
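A minimal sketch of label encoding with scikit-learn's `LabelEncoder` (the provider values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative nominal feature with string values
providers = ["AWS", "Google", "MS", "AWS", "Google"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(providers)

print(list(encoder.classes_))  # learned classes, in sorted order
print(list(encoded))           # each label replaced by its integer code
```

Note that label encoding imposes an arbitrary numeric order on nominal labels, which some models may misread as a ranking; one-hot encoding is a common alternative for nominal features.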
The figure shows how different scales of the variables can change the hyperplane projection used by k-means clustering.
Why is Normalization important and when is it
unnecessary?
• Why is Normalization Important?
• Normalization is crucial for algorithms that rely on numerical
calculations, such as:
• Neural networks and linear regression, which use weighted sums of
input variables. Without normalization, large feature values can dominate
smaller ones, leading to unstable optimizations.
• Distance-based algorithms like K-nearest neighbors (KNN) and k-
means clustering, where different feature scales can distort distance
calculations and clustering results.
• When is Normalization Unnecessary?
• Some machine learning models, like decision trees, do not rely on feature
magnitudes but instead evaluate the predictive power of each feature
(e.g., through entropy or information gain). In such cases, normalization
does not impact model performance.
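To make the scale issue concrete, here is a small illustrative example: with one feature in the tens of thousands and another in [0, 1], Euclidean distance is dominated almost entirely by the large-scale feature (the income/score values and min-max ranges are assumptions for illustration):

```python
import math

# Two points: (income in dollars, score in [0, 1])
a = (50_000, 0.10)
b = (50_100, 0.90)

# Raw Euclidean distance: the income difference (100) swamps the
# score difference (0.8), even though the score differs far more
# in relative terms.
raw = math.dist(a, b)

# Rescaled by hand, assuming dataset-wide min/max of 50,000-50,100
# for income and 0-1 for score: income maps to 0 and 1, score is
# already in [0, 1] and stays put.
a_scaled = (0.0, 0.10)
b_scaled = (1.0, 0.90)
scaled = math.dist(a_scaled, b_scaled)

print(round(raw, 3))     # ~100.003: almost entirely the income term
print(round(scaled, 3))  # ~1.281: both features now contribute
```

This is exactly why KNN and k-means results change (often drastically) depending on whether the features were normalized first.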
How is Normalization Applied?
• A common approach is the Min-Max Scaler, which scales values
between 0 and 1, or any other specified range. The formula for
Min-Max normalization is:

X_norm = (X − X_min) / (X_max − X_min)

• where:
• X is the original value,
• X_min is the minimum value of the feature,
• X_max is the maximum value of the feature.
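The formula above corresponds to scikit-learn's `MinMaxScaler`; a minimal sketch on illustrative single-feature data:

```python
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature data: X_min = 10, X_max = 50
X = [[10.0], [20.0], [30.0], [50.0]]

scaler = MinMaxScaler()  # default output range is [0, 1]
X_scaled = scaler.fit_transform(X)

# Each value becomes (X - 10) / (50 - 10)
print(X_scaled.ravel())  # approximately [0, 0.25, 0.5, 1]
```

Passing `feature_range=(a, b)` to the constructor scales to any other range, per the "or any other specified range" note above.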
Why Use Standardization?
• Standardization rescales features using the z-score:

z = (X − µ) / σ

• where:
• X is the original value,
• µ is the mean of the dataset,
• σ is the standard deviation of the dataset.
• Standardization is beneficial when:
• Data follows a normal distribution, as many machine learning models
(such as logistic regression, support vector machines, and PCA) perform
better when features are normally distributed.
• Identifying outliers, since standardized values (z-scores) indicate how far
a data point is from the mean in terms of standard deviations.
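Standardization corresponds to scikit-learn's `StandardScaler`; the sketch below uses illustrative data with one extreme value to show how z-scores flag points far from the mean:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data with one extreme value (40)
X = np.array([[10.0], [12.0], [11.0], [13.0], [40.0]])

scaler = StandardScaler()
z = scaler.fit_transform(X)

# z-scores: how many standard deviations each point lies from the
# mean; the extreme value stands out with the largest z-score.
print(z.ravel().round(2))
```

After standardization the feature has mean 0 and standard deviation 1, which is what distance- and gradient-based models benefit from.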