
Machine Learning

Unit-I
Dr. Jagan. T
Professor
Department of ECE, GRIET
Introduction to Machine Learning
• Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling
computers to learn from data without being explicitly programmed. It involves algorithms
and statistical models that allow computers to improve their performance on a specific task
over time, based on the data they are exposed to.
Key Concepts in Machine Learning:
• Algorithms: These are sets of rules and procedures that guide the learning process.
Different algorithms are suited for different types of data and tasks.
• Data: Machine learning models require data to learn from. This data can be labeled (for
supervised learning) or unlabeled (for unsupervised learning).
• Training: The process of feeding data to the algorithm to enable it to learn patterns and
relationships.
• Model: The output of the training process, representing the learned patterns and used for
making predictions or decisions on new data.
Types of Machine Learning

• Supervised Learning: The algorithm learns from labeled data, where the correct
output is provided. Examples include classification (e.g., spam detection) and
regression (e.g., predicting house prices).

• Unsupervised Learning: The algorithm learns from unlabeled data, identifying patterns and structures without explicit guidance. Examples include clustering (e.g., customer segmentation) and dimensionality reduction (e.g., feature extraction).

• Reinforcement Learning: The algorithm learns through trial and error, receiving
rewards or penalties for its actions. It's often used in robotics and game playing.
Applications of Machine Learning:

• Image Recognition: Identifying objects or faces in images.
• Natural Language Processing: Understanding and generating human language.
• Recommendation Systems: Suggesting products or content based on user preferences.
• Fraud Detection: Identifying potentially fraudulent transactions.
• Medical Diagnosis: Assisting in disease diagnosis and treatment planning.
Etc.
Benefits of Machine Learning:

• Automation: Automating tasks that traditionally require human intelligence.
• Improved Accuracy: Making more accurate predictions and decisions compared to traditional methods.
• Personalization: Tailoring experiences to individual users.
• Insights: Discovering hidden patterns and insights in data.


Challenges of Machine Learning:

• Data Requirements: Machine learning models often require large amounts of high-quality data.
• Complexity: Developing and deploying machine learning models can be complex.
• Bias: Machine learning models can inherit biases present in the data.
• Interpretability: Understanding why a machine learning model makes certain decisions can be challenging.
The Future of Machine Learning:

• Machine learning is a rapidly evolving field with tremendous potential. Ongoing research is focused on areas such as:
• Deep Learning: Using artificial neural networks with multiple layers to learn complex patterns.
• Explainable AI: Developing methods to make machine learning decisions more transparent and understandable.
• Federated Learning: Training machine learning models on decentralized data sources without sharing the data itself.
Introduction - What is Feature Engineering?

• The process of selecting, transforming, and creating features from raw data.
• A crucial step in machine learning, often more impactful than
algorithm selection.
• Aims to improve the performance of machine learning models.
• Bridges the gap between raw data and model understanding.
Introduction - What is Feature Engineering?

A data pipeline for feature creation is typically visualized as a series of connected boxes, each representing a distinct step in the process: raw data ingestion from various sources, data cleaning, transformation (including feature engineering), and finally the creation of new, refined features ready for machine learning modeling. Arrows indicate the data flow between stages and highlight the key transformations applied at each step, such as binning, scaling, or one-hot encoding.
Introduction - What is Feature Engineering?
Introduction - What is Feature Engineering?

Key elements in a feature creation data pipeline visualization:


Data Source:
•Boxes representing the initial raw data sources, like databases, APIs, or files.
Data Ingestion:
•A box signifying the process of extracting data from sources and loading it into the
pipeline.
Data Cleaning:
•A stage where data is cleaned and preprocessed, including handling missing values,
outliers, and formatting inconsistencies.
Introduction - What is Feature Engineering?

Feature Engineering:
•A set of boxes representing different feature engineering techniques, like binning,
scaling, creating interaction features, or applying domain-specific transformations.
Feature Selection:
•A box indicating the process of choosing the most relevant features for the model.
Data Transformation:
•Boxes showing operations like one-hot encoding, normalization, or log
transformation.
Feature Storage:

•A box representing where the final engineered features are stored, like a data
warehouse or a dedicated feature store.
Why Bother with Feature Engineering?

•Improved Model Performance: Better features lead to more accurate and robust models.
•Faster Training: Relevant features can reduce training time.
•Enhanced Interpretability: Engineered features can make models easier to understand.
•Handles Missing Data: Feature engineering can address missing values effectively.
•Addresses Outliers: Techniques can mitigate the impact of outliers.
•Image: A graph showing improved model performance with feature engineering.
Types of Feature Engineering Techniques
• (Brief explanations and examples for each):
• Data Cleaning: Handling missing values, outliers, and inconsistencies. (e.g.,
imputation, outlier removal)
• Feature Scaling: Normalization and standardization. (e.g., Min-Max scaling, Z-
score normalization)
• Encoding Categorical Variables: One-hot encoding, label encoding. (e.g.,
converting colors to numerical representations)
• Creating New Features: Polynomial features, interaction features, domain-specific
features. (e.g., creating BMI from height and weight)
• Feature Transformation: Log transformation, square root transformation. (e.g.,
handling skewed data)
• Feature Selection: Identifying the most relevant features. (e.g., filter methods,
wrapper methods, embedded methods)
• Image: Icons representing each technique.
Data Cleaning

• Missing Value Imputation: Mean/median/mode imputation, K-NN imputation.
• Outlier Handling: Identifying and treating outliers using box plots, IQR, etc.
• Handling Inconsistent Data: Correcting typos, inconsistencies in formatting.
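
As a quick illustration (not part of the original slides), here is a minimal Python sketch of these cleaning steps, assuming pandas and scikit-learn are available; the column names and values are made up for the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): a numeric column with a missing value and an extreme
# outlier, plus a categorical column with a missing value.
df = pd.DataFrame({
    "income": [42_000.0, None, 58_000.0, 61_000.0, 1_000_000.0],
    "city": ["Hyderabad", "Hyderabad", None, "Delhi", "Delhi"],
})

# Missing value imputation: median for the numeric column
# (KNNImputer is scikit-learn's option for K-NN imputation),
# and the mode (most frequent value) for the categorical column.
df["income"] = SimpleImputer(strategy="median").fit_transform(df[["income"]]).ravel()
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Outlier handling with the IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df_clean)
```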


Data Cleaning
Feature scaling

• Feature scaling is one of the most important data preprocessing steps in machine learning. Algorithms that compute distances between features are biased towards numerically larger values if the data is not scaled.
• Tree-based algorithms are fairly insensitive to the scale of the features. Feature scaling also helps machine learning and deep learning algorithms train and converge faster.
• Normalization and Standardization are the most popular feature scaling techniques, and at the same time the most frequently confused ones.
Scaling Up: Feature Scaling

•Normalization (Min-Max Scaling): Scales features to a range between 0 and 1.

•Standardization (Z-score Normalization): Scales features to have a mean of 0 and a standard deviation of 1.

Normalization (Min-Max Scaling):
This is used to transform features onto a similar scale. Each new value is calculated as

X' = (X − Xmin) / (Xmax − Xmin)

• This scales the range to [0, 1] or sometimes [-1, 1].
• Geometrically speaking, the transformation squishes the n-dimensional data into an n-dimensional unit hypercube.
• Normalization is useful when there are no outliers, as it cannot cope with them. Usually, we would scale age and not income, because only a few people have high incomes but age is close to uniform.
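
A minimal NumPy/scikit-learn sketch of Min-Max scaling (the age/income values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25.0, 40_000.0],
              [32.0, 52_000.0],
              [47.0, 150_000.0]])      # columns: age, income

# By hand, per column: X' = (X - Xmin) / (Xmax - Xmin)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Equivalent with scikit-learn
X_scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))   # True: both map every column to [0, 1]
```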
Standardization or Z-Score Normalization

• Standardization or Z-Score Normalization is the transformation of features by subtracting the mean and dividing by the standard deviation. The result is often called the Z-score:

Z = (X − μ) / σ
Standardization or Z-Score Normalization
Standardization can be helpful in cases where the data follows a Gaussian distribution. However, this does not necessarily have to be true.

• Geometrically speaking, it translates the mean vector of the original data to the origin and squishes or expands the points so that the standard deviation becomes 1.

• We are only shifting the mean and rescaling the standard deviation; a normal distribution stays normal, so the shape of the distribution is not affected.
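
And a matching sketch for standardization, using the same illustrative matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 40_000.0],
              [32.0, 52_000.0],
              [47.0, 150_000.0]])

# By hand, per column: Z = (X - mu) / sigma
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent with scikit-learn (StandardScaler also uses the population std, ddof=0)
Z_scaled = StandardScaler().fit_transform(X)

print(Z_scaled.mean(axis=0).round(6))   # ~[0, 0]
print(Z_scaled.std(axis=0))             # [1, 1]
```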
Encoding Categorical Variables
• One-Hot Encoding: Creates binary
columns for each category.
• Label Encoding: Assigns a unique integer
to each category.
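
A minimal pandas/scikit-learn sketch of both encodings, using the colour example mentioned earlier (the column name is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer per category (implies an ordering, so use with care
# for nominal variables; it is fine for target labels).
df["color_label"] = LabelEncoder().fit_transform(df["color"])

print(one_hot)
print(df)
```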
Creating New Features

•Polynomial Features: Creating higher-degree versions of existing features.
•Interaction Features: Combining multiple features.
•Domain-Specific Features: Leveraging domain knowledge to create
relevant features.
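
A minimal sketch of feature creation, using the BMI example from earlier plus scikit-learn's PolynomialFeatures (units and values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"height_m": [1.60, 1.75, 1.82],
                   "weight_kg": [55.0, 72.0, 95.0]})

# Domain-specific feature: BMI = weight / height^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Polynomial and interaction features up to degree 2 on the original columns
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["height_m", "weight_kg"]])

print(poly.get_feature_names_out())   # includes the interaction term height_m*weight_kg and the squares
print(X_poly.shape)                   # (3, 5)
```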
Feature Transformation

• Transforming Data: Feature Transformation

• Log Transformation: Useful for handling skewed data.
• Square Root Transformation: Another technique for dealing with skewed data.
Feature Transformation
• Log Transformation:

By taking the logarithm of a data set, large values are compressed, effectively reducing the impact of extreme outliers and making the distribution appear more symmetrical. This is particularly useful when data is skewed to the right, with a few very large values.

• Square Root Transformation:

This transformation has a milder effect on the data compared to the log transformation, making it suitable for situations where the skewness is not as extreme. It can also be applied to data with zero values, unlike the log transformation.
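
A minimal NumPy sketch of both transforms on a right-skewed feature (the values are illustrative):

```python
import numpy as np

income = np.array([20_000, 25_000, 30_000, 42_000, 1_200_000], dtype=float)

log_income = np.log1p(income)    # log(1 + x): also defined at zero, unlike log(x)
sqrt_income = np.sqrt(income)    # milder compression than the log transform

# The huge value dominates far less after transforming.
print(income.max() / income.min())            # ~60
print(log_income.max() / log_income.min())    # ~1.4
print(sqrt_income.max() / sqrt_income.min())  # ~7.7
```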


Feature Transformation

• Transforming Data: Feature Transformation

• Log transformation is a widely used technique to handle skewed data, while square root transformation can also be used to address skewness, particularly when dealing with positively skewed data; however, log transformation is often considered more effective for heavily skewed distributions, especially when the large values significantly skew the data.
Feature Transformation
Feature Selection
•Filter Methods: Using statistical measures to rank features.
•Wrapper Methods: Evaluating subsets of features.
•Embedded Methods: Feature selection integrated into the model training process.
Feature Selection
•In feature selection:
•"Filter Methods" evaluate features individually based on statistical measures like correlation, ranking them by their apparent relevance to the target variable, while
•"Wrapper Methods" test different combinations of features by training a model on each subset to find the best-performing set, and
•"Embedded Methods" incorporate feature selection directly within the model training process, allowing the model to learn which features are most important during training itself.
Feature Selection
Key points about each method:
•Filter Methods:
•Pros: Fast, computationally efficient, easy to interpret.
•Cons: May miss important feature interactions, as features are evaluated
independently.
•Examples: Chi-square test, correlation coefficient, mutual information.

•Wrapper Methods:
•Pros: Can potentially find optimal feature subsets for a specific model.
•Cons: Computationally expensive, can overfit to the chosen model.
•Examples: Forward selection, backward elimination, recursive feature elimination
(RFE).
Feature Selection

Key points about each method:

•Embedded Methods:
•Pros: Combines advantages of filter and wrapper methods; less computationally intensive than pure wrapper methods.
•Cons: May be model-specific, requiring careful selection of the model to use.
•Examples: Lasso regression (L1 regularization), decision trees, Random Forests. (A short sketch of all three families follows below.)
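
As a quick illustration (not from the slides), here is a minimal scikit-learn sketch of one method from each family on a synthetic classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter: rank features individually by a univariate statistic (ANOVA F-score).
filter_sel = SelectKBest(score_func=f_classif, k=4).fit(X, y)

# Wrapper: recursive feature elimination around a chosen model.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# Embedded: L1 (Lasso-style) regularization pushes unhelpful coefficients to zero.
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
).fit(X, y)

print(filter_sel.get_support())    # boolean mask of the kept features
print(wrapper_sel.get_support())
print(embedded_sel.get_support())
```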
Dimensionality:

• Dimensionality reduction is a powerful tool for dealing with high-dimensional data. By reducing the number of features, it can improve model performance, computational efficiency, and data visualization.

• What is Dimensionality Reduction?

• Dimensionality reduction is a technique that reduces the number of features in a dataset while preserving its important information. It's like summarizing a long article into a few key points - you lose some details, but you keep the main idea.
Dimensionality:
• Why is it Important?
• The Curse of Dimensionality: When dealing with high-dimensional data, the data becomes
sparse, making it harder to find meaningful patterns. This can lead to overfitting, where the
model performs well on the training data but poorly on new data.

• Computational Efficiency: Training models on high-dimensional data can be computationally expensive and time-consuming. Reducing the number of features can significantly speed up the training process.

• Improved Model Performance: By removing irrelevant or redundant features, dimensionality reduction can improve the accuracy and generalization ability of machine learning models.

• Data Visualization: It's often easier to visualize data in 2D or 3D. Dimensionality reduction
can help project high-dimensional data into a lower-dimensional space for visualization.
Dimensionality:

• How Does it Work?
• There are various techniques for dimensionality reduction, including:
• Feature Selection: Choosing a subset of the most relevant features
and discarding the rest.
• Feature Extraction: Creating new features by combining or
transforming the original features.
Dimensionality:

Common Techniques:
• Principal Component Analysis (PCA): A linear technique that finds
the directions of maximum variance in the data and projects the data
onto those directions.
• Linear Discriminant Analysis (LDA): A technique that finds the
linear combinations of features that best separate different classes.
Dimensionality:

Principal Component Analysis (PCA) is an unsupervised technique that identifies the directions of greatest variance within a dataset and projects the data onto these directions to reduce dimensionality.
Dimensionality:

• Principal Component Analysis (PCA) is a linear dimensionality reduction technique designed to extract a new set of variables from an existing high-dimensional dataset.

• Its primary goal is to reduce the dimensionality of the data while preserving as much variance as possible.

• PCA is an unsupervised algorithm that creates linear combinations of the original features, known as principal components.

• These components are calculated such that the first one captures the maximum variance in the dataset, while each subsequent component explains the remaining variance without being correlated with the previous ones.
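
A minimal scikit-learn sketch of PCA on the classic Iris data (4 features reduced to 2 principal components); the quoted variance ratios are approximate:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)        # roughly [0.73, 0.23]
print(X_2d.shape)                           # (150, 2)
```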
Dimensionality:

Linear Discriminant Analysis (LDA) is a supervised technique that finds linear combinations of features that best separate different classes in a dataset, focusing on maximizing class separability rather than overall variance.
Dimensionality:

• Linear discriminant analysis (LDA) is an approach used in supervised machine learning to solve multi-class classification problems.

• LDA separates multiple classes with multiple features through data dimensionality reduction.

• This technique is important in data science as it helps optimize machine learning models.
PCA:
Advantages and Disadvantages of Principal Component Analysis
Advantages of Principal Component Analysis
1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues when original features are highly correlated.
2. Noise Reduction: Eliminates components with low variance (assumed to be noise), enhancing data clarity.
3. Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones deviate significantly in the reduced space.
Dimensionality:
Disadvantages of Principal Component Analysis
1. Interpretation Challenges: The new components are combinations of original variables, which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data before application, or results may be misleading.
3. Information Loss: Reducing dimensions may lose some important information if too few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are linear, and may struggle with non-linear data.
5. Computational Complexity: Can be slow and resource-intensive on very large datasets.
6. Risk of Overfitting: Using too many components or working with a small dataset might lead to models that don't generalize well.
PCA: Example
Let's understand how it works in simple terms:
Imagine you’re looking at a messy cloud of data points (like stars in the sky) and want to
simplify it. PCA helps you find the “most important angles” to view this cloud so you don’t
miss the big patterns. Here’s how it works, step by step:

Step 1: Standardize the Data

Make sure all features (e.g., height, weight, age) are on the same scale. Why? A feature like “salary” (ranging 0–100,000) could otherwise dominate “age” (0–100). Standardizing our dataset ensures that each variable has a mean of 0 and a standard deviation of 1:

Z = (X − μ) / σ

Here, μ is the mean of the independent features, μ = {μ1, μ2, ⋯, μm},
•and σ is the standard deviation of the independent features, σ = {σ1, σ2, ⋯, σm}.
PCA: Example
Step 2: Find Relationships
Calculate how features move together using a covariance matrix. Covariance measures the strength of joint variability between two or more variables, indicating how much they change in relation to each other. For two features x1 and x2, the (sample) covariance is

cov(x1, x2) = Σ (x1i − x̄1)(x2i − x̄2) / (n − 1)

The value of covariance can be positive, negative, or zero.

•Positive: As x1 increases, x2 also increases.
•Negative: As x1 increases, x2 decreases.
•Zero: No direct relation.


PCA: Example
Step 3: Find the “Magic Directions” (Principal Components)
•PCA identifies new axes (like rotating a camera) where the data spreads out the most:

• 1st Principal Component (PC1): The direction of maximum variance (most spread).
• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1, and so on.
•These directions are calculated using eigenvectors (the math tools that find these axes), and their importance is ranked by eigenvalues (how much variance each captures).
For a square matrix A, an eigenvector X (a non-zero vector) and its corresponding eigenvalue λ (a scalar) satisfy:
AX = λX
PCA: Example
This means:
•When A acts on X, it only stretches or shrinks X by the scalar λ.
•The direction of X remains unchanged (hence, eigenvectors define the “stable directions” of A).

It can also be written as:
AX − λX = 0
(A − λI)X = 0
where I is the identity matrix of the same shape as matrix A.

The above conditions hold only if (A − λI) is non-invertible (i.e., a singular matrix). That means
∣A − λI∣ = 0
This determinant equation is called the characteristic equation.
•Solving it gives the eigenvalues λ,
•and the corresponding eigenvectors can then be found using the equation AX = λX.
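
A minimal NumPy sketch of this step: build a covariance matrix from two correlated features and read off its eigenvalues and eigenvectors (synthetic data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:, 1] = 0.8 * X[:, 0] + 0.3 * X[:, 1]   # make the second feature depend on the first

A = np.cov(X, rowvar=False)               # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(A)      # eigh: eigen-decomposition for symmetric matrices

order = np.argsort(eigvals)[::-1]         # largest eigenvalue first -> that eigenvector is PC1
print(eigvals[order])                     # variance captured along PC1, PC2
print(eigvecs[:, order])                  # columns are the principal directions
```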
PCA: Example

Step 4: Pick the Top Directions & Transform Data

•Keep only the top 2–3 directions (or enough to capture ~95% of the variance).
•Project the data onto these directions to get a simplified, lower-dimensional version.
PCA is an unsupervised learning algorithm, meaning it doesn’t require prior knowledge of
target variables.
It’s commonly used in exploratory data analysis and machine learning to simplify datasets
without losing critical information.
We know this all sounds complicated, so let's look at it again with the help of a visual example where the x-axis (Radius) and y-axis (Area) represent two original features in the dataset.
PCA: Example
PCA: Example

Principal Components (PCs):

•PC₁ (First Principal Component): The direction along which the data has the maximum variance. It captures the most important information.
•PC₂ (Second Principal Component): The direction orthogonal (perpendicular) to PC₁. It captures the remaining variance but is less significant.
PCA: Example
Now, the red dashed lines indicate the spread (variance) of data along different directions. The variance along PC₁ is greater than along PC₂, which means that PC₁ carries more useful information about the dataset.

•The data points (blue dots) are projected onto PC₁, effectively reducing the dataset from two dimensions (Radius, Area) to one dimension (PC₁).
•This transformation simplifies the dataset while retaining most of the original variability.

The image visually explains why PCA selects the direction with the highest variance (PC₁). By removing PC₂, we reduce redundancy while keeping essential information. The transformation helps in data compression, visualization, and improved model performance.
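
Putting the four steps together, here is a minimal NumPy sketch that mirrors the Radius/Area example above (the data is synthetic, generated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
radius = rng.uniform(1.0, 10.0, size=200)
area = np.pi * radius ** 2 + rng.normal(scale=5.0, size=200)   # strongly related to radius
X = np.column_stack([radius, area])

# Step 1: standardize each feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# Step 3: eigen-decomposition; sort directions by how much variance they capture.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep only PC1 and project the 2-D data onto it.
X_pc1 = Z @ eigvecs[:, :1]
print("variance explained by PC1:", eigvals[0] / eigvals.sum())   # close to 1 here
print(X_pc1.shape)                                                # (200, 1)
```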
Linear Discriminant Analysis in Machine Learning

When working with high-dimensional datasets it is important to apply dimensionality reduction techniques to make data exploration and modeling more efficient.

One such technique is Linear Discriminant Analysis (LDA), which helps in reducing the dimensionality of data while retaining the most significant features for classification tasks.

It works by finding the linear combinations of features that best separate the classes in the dataset. In this unit we will learn about it and how to implement it in Python (a short sketch appears at the end of this section).
Linear Discriminant Analysis in Machine Learning

Maximizing Class Separability: The Role of LDA

Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis, is a supervised classification technique that helps separate two or more classes by converting a higher-dimensional data space into a lower-dimensional space.

It is used to identify a linear combination of features that best separates the classes within a dataset.
Linear Discriminant Analysis in Machine Learning

For example, suppose we have two classes that need to be separated efficiently. Each class may have multiple features, and using a single feature to classify them may result in overlapping.

To solve this, LDA is used, as it combines multiple features to improve classification accuracy.
Linear Discriminant Analysis in Machine Learning

Core Assumptions of LDA

For LDA to perform effectively, certain assumptions are made:
1. Gaussian Distribution: Data within each class should follow a Gaussian distribution.
2. Equal Covariance Matrices: Covariance matrices of the different classes should be equal.
3. Linear Separability: A linear decision boundary should be sufficient to separate the classes.
Linear Discriminant Analysis in Machine Learning

For example, when data points belonging to two classes are plotted and they are not linearly separable, LDA will attempt to find a projection that maximizes class separability.
Linear Discriminant Analysis in Machine Learning

The image shows an example where the classes (black and green circles) are not linearly separable. LDA attempts to separate them using the red dashed line.

It uses both axes (X and Y) to generate a new axis in such a way that it maximizes the distance between the means of the two classes while minimizing the variation within each class.

This transforms the dataset into a space where the classes are better separated.
Linear Discriminant Analysis in Machine Learning

• After transforming the data points along a new axis, LDA maximizes the class separation. This new axis allows for clearer classification by projecting the data along a line that enhances the distance between the means of the two classes.
Linear Discriminant Analysis in Machine Learning
• Perpendicular distance between the decision boundary and the data points helps
us to visualize how LDA works by reducing class variation and increasing
separability.

• After generating this new axis using the above-mentioned criteria, all the data
points of the classes are plotted on this new axis and are shown in the figure
given below.
Linear Discriminant Analysis in Machine Learning
• It shows how LDA creates a new axis to project the data and separate the two
classes effectively along a linear path.

• But it fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases we use non-linear discriminant analysis.
Linear Discriminant Analysis in Machine Learning
Advantages of LDA
•Simple and computationally efficient.

•Works well even when the number of features is much larger than the number of training
samples.
•Can handle multicollinearity.

Disadvantages of LDA
•Assumes Gaussian distribution of data which may not always be the case.

•Assumes equal covariance matrices for different classes which may not hold in all datasets.

•Assumes linear separability which is not always true.

•May not always perform well in high-dimensional feature spaces.


Linear Discriminant Analysis in Machine Learning

Applications of LDA

1. Face Recognition: It is used to reduce the high-dimensional feature space of pixel values in face recognition applications, helping to identify faces more efficiently.
2. Medical Diagnosis: It classifies disease severity as mild, moderate, or severe based on patient parameters, helping in decision-making for treatment.
3. Customer Identification: It can help identify customer segments most likely to purchase a specific product based on survey data.
Linear Discriminant Analysis in Machine Learning

Linear Discriminant Analysis (LDA) is a technique for dimensionality reduction that not only simplifies high-dimensional data but also enhances the performance of models by maximizing class separability.

By converting data into a lower-dimensional space, it helps us improve the accuracy of classification tasks.
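
As promised above, here is a minimal Python sketch of LDA with scikit-learn, used both as a supervised dimensionality-reduction step and as a classifier on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)   # at most (n_classes - 1) components
X_train_2d = lda.fit_transform(X_train, y_train)   # 4 features -> 2 discriminant axes

print("projected shape:", X_train_2d.shape)        # (112, 2) with the default split
print("test accuracy:", lda.score(X_test, y_test)) # LDA also works as a classifier
```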
