
MODULE 2

CHAPTER 2
UNDERSTANDING DATA – 2
Contents
• Bivariate Data and Multivariate Data.
• Multivariate Statistics.
• Essential Mathematics for Multivariate Data.
• Feature Engineering and Dimensionality Reduction
Techniques.
Introduction
• Bivariate data involves two variables and examines their relationship.
• It helps identify trends, correlations, and potential causes.
• We will analyze temperature and sweater sales data to demonstrate this.
• The dataset pairs each temperature (°C) with the corresponding sweater sales (in
thousands).
• A scatter plot visualizes the relationship between temperature and
sweater sales.
• It shows how sales decrease as temperature increases, indicating a
negative correlation.
• The line chart also demonstrates the negative relationship between
temperature and sweater sales.
• As the temperature increases, the sales of sweaters consistently
decrease.
Bivariate Statistics
• Covariance measures the joint variability of two random variables.
• It indicates the direction of the relationship:
• - Positive covariance: Variables increase together.
• - Negative covariance: One variable increases while the other decreases.
• Correlation measures the strength and direction of a linear relationship between
variables.
• Positive correlation: Both variables move together.
• Negative correlation: Variables move in opposite directions.
• Zero correlation: No linear relationship.
• The Pearson correlation coefficient is defined as:
• r = COV(X, Y) / (σX * σY)
• Where σX and σY are the standard deviations of X and Y.
• Covariance and correlation help in understanding the relationship between
variables.
• A high positive correlation (≈0.984) indicates a strong linear relationship in the
given data.
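A minimal sketch of these two measures in Python, using NumPy and a small made-up temperature/sales sample (the values are illustrative only, not the dataset from the slides):

```python
import numpy as np

# Hypothetical temperature (°C) and sweater sales (thousands) -- illustrative only
temperature = np.array([5, 10, 15, 20, 25, 30])
sales = np.array([50, 42, 35, 28, 20, 12])

# Covariance: joint variability of the two variables
cov_xy = np.cov(temperature, sales, ddof=1)[0, 1]

# Pearson correlation: r = COV(X, Y) / (σX * σY)
r = cov_xy / (temperature.std(ddof=1) * sales.std(ddof=1))

print(f"covariance = {cov_xy:.2f}, correlation = {r:.3f}")
# np.corrcoef(temperature, sales)[0, 1] gives the same r directly
```

For this sample the correlation is strongly negative, matching the sweater-sales example; the ≈0.984 value quoted above refers to the positively correlated data in the slides.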
Multivariate Statistics
• Definition: Analysis of more than two observable variables.
• Importance: Helps analyze datasets with thousands of measurements.
• Examples: Regression analysis, PCA, path analysis.
• Example: a table with columns Id, Attribute 1, Attribute 2, and Attribute 3.
• Mean vector: (2, 7.5, 1.33)
• Variance: Covariance matrix.
• Mean vector = centroid; Variance = dispersion matrix.
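A short sketch of computing the centroid and dispersion matrix with NumPy, on a hypothetical three-attribute table (the values below are illustrative, not the table referenced above):

```python
import numpy as np

# Hypothetical 3-attribute dataset (rows = observations), illustrative only
X = np.array([[1.0, 6.0, 1.0],
              [2.0, 8.0, 1.5],
              [3.0, 8.5, 1.5]])

mean_vector = X.mean(axis=0)           # the centroid
cov_matrix  = np.cov(X, rowvar=False)  # the dispersion (covariance) matrix

print("mean vector:", mean_vector)
print("covariance matrix:\n", cov_matrix)
```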
Heatmap
• Definition: Visual representation of a 2D matrix.
• Color significance: Darker for larger values, lighter for smaller.
• Use case: Traffic data analysis to distinguish heavy and low traffic regions.
• Image: Heatmap for patient data.
Pairplot
• Definition: Visual technique for
multivariate data.
• Use case: Identify relationships and
correlations among variables.
• Structure: Pairwise scatter plots in a
matrix format.
• Figure 2.14: Pairplot for random data.
• Shows relationships among three
variables.
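A minimal sketch of a pairplot, assuming seaborn and pandas are available and using randomly generated data in place of the figure's dataset:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random data for three variables, mirroring the idea of Figure 2.14
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["var1", "var2", "var3"])

sns.pairplot(df)   # pairwise scatter plots, with distributions on the diagonal
plt.show()
```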
Essential Mathematics for Multivariate Data
• Machine learning involves mathematical concepts from Linear Algebra, Statistics,
Probability, and Information Theory. Linear algebra plays a crucial role as it is the
mathematics of data, dealing with linear equations, vectors, matrices, vector
spaces, and transformations.
• Importance of Linear Algebra in Machine Learning
• Linear Algebra is fundamental to many areas, including machine learning,
scientific applications, and data science.
• It plays a significant role in dealing with linear equations, vectors, matrices,
vector spaces, and transformations.
• These mathematical concepts are necessary for handling and processing
multivariate data efficiently.
Linear Systems and Gaussian Elimination
• A linear system of equations consists of equations with unknown
variables. If Ax = y and A is invertible, then x = A⁻¹y.
Gaussian elimination solves these systems efficiently without explicitly computing A⁻¹.
Gaussian Elimination Steps
• Step 1: Convert the System into an Augmented Matrix
• Step 2: Convert to Upper Triangular Form (Forward Elimination)
• The goal is to get 1s in the diagonal and 0s below the diagonal.
• Perform row operations to create leading ones and eliminate terms below them.
• The allowed operations:
• Swapping rows if necessary (to bring a nonzero pivot to the diagonal).
• Multiplying a row by a nonzero scalar to make the pivot 1.
• Subtracting a multiple of one row from another to create zeros below the pivot.
• Example:
For a system with three variables:
• Make the first pivot element 1 (if needed, swap rows).
• Subtract multiples of the first row from lower rows to eliminate the first column.
• Make the second pivot element 1.
• Subtract multiples of the second row from rows below to eliminate the second column.
• Repeat until an upper triangular matrix (zeros below diagonal) is obtained.
• Step 3: Convert to Reduced Row Echelon Form (Optional)
• Further modify the matrix to get 1s in the diagonal and 0s above and below it.
• This step isn't always necessary for solving the system, but it makes the
solution clearer.
• Step 4: Solve Using Back-Substitution
• Start from the last equation (bottom row) and solve for the last variable.
• Substitute the known values into the previous rows to find the remaining
variables.
• Continue until all variables are found.
• Step 1: Convert to an Augmented Matrix
• Step 2: Normalize the First Row (Make the first pivot 1)
• Divide row 1 by 2
• Step 3: Eliminate Below the First Pivot
• Subtract 4 × (Row 1) from Row 2:
• Step 4: Normalize the Second Pivot (Make it 1)
• Divide Row 2 by -5:
• Step 5: Eliminate Above the Second Pivot
• Subtract 2 × (Row 2) from Row 1:
• Step 6: Read the Solution
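The matrices for the worked example above are not reproduced here, so the sketch below applies the same row operations to a hypothetical 2×2 system (2x + 4y = 10, 4x + 3y = 5), chosen only so the pivots match the steps described; NumPy is assumed:

```python
import numpy as np

def gauss_jordan(A, y):
    """Solve Ax = y by Gauss-Jordan elimination (partial pivoting omitted for clarity)."""
    n = len(y)
    M = np.hstack([A.astype(float), y.reshape(-1, 1)])   # augmented matrix [A | y]
    for i in range(n):
        M[i] = M[i] / M[i, i]                  # normalize the pivot row (pivot -> 1)
        for j in range(n):
            if j != i:
                M[j] = M[j] - M[j, i] * M[i]   # eliminate entries above/below the pivot
    return M[:, -1]                            # the last column now holds the solution

# Hypothetical system consistent with the slide's row operations:
#   2x + 4y = 10
#   4x + 3y = 5
A = np.array([[2, 4], [4, 3]])
y = np.array([10, 5])
print(gauss_jordan(A, y))      # [-1.  3.]
print(np.linalg.solve(A, y))   # same result, for verification
```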
Matrix decompositions
• Reducing a matrix to its constituent parts for complex matrix operations.
• Also known as matrix factorization methods.
• Helps in simplifying problems related to data science, engineering, and applied
mathematics.
• Matrix decomposition involves breaking down a matrix into its constituent parts.
• Helps in performing complex matrix operations efficiently.
• Also known as matrix factorization methods.
Eigen Decomposition
• The most common matrix decomposition technique.
• Reduces a matrix into eigenvalues and eigenvectors.
• Representation (for a symmetric matrix, e.g., a covariance matrix):
• A = Q Λ Q^T
• Q: Matrix of eigenvectors
• Λ (Lambda): Diagonal matrix of eigenvalues
• Q^T: Transpose of Q
• For a general diagonalizable matrix, the decomposition is A = Q Λ Q⁻¹.
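A small sketch with NumPy, using a symmetric 2×2 matrix so that A = Q Λ Q^T holds exactly:

```python
import numpy as np

# A small symmetric matrix (e.g., a covariance matrix)
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

eigvals, Q = np.linalg.eigh(A)   # eigh is appropriate for symmetric matrices
Lam = np.diag(eigvals)

# Reconstruct A from its eigen decomposition
A_rebuilt = Q @ Lam @ Q.T
print(np.allclose(A, A_rebuilt))  # True
```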
LU Decomposition
• One of the simplest matrix decompositions.
• Expresses a matrix A as:
• A = LU
• L: Lower triangular matrix
• U: Upper triangular matrix
• Performed using Gaussian elimination.
• Steps:
1. Augment an identity matrix to A.
2. Apply row operations using Gaussian elimination.
3. Extract L (lower triangular) and U (upper triangular) matrices.
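A brief sketch using SciPy's LU routine; note that scipy.linalg.lu also returns a permutation matrix P (from partial pivoting), so the identity checked is A = P L U:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 3.0],
              [4.0, 9.0]])

# P: permutation, L: lower triangular, U: upper triangular
P, L, U = lu(A)
print(np.allclose(A, P @ L @ U))  # True
```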
Machine Learning and Importance of
Probability and Statistics
• Machine learning is linked with statistics and probability.
• Statistics is the heart of machine learning.
• Without statistics, data analysis is difficult.
• Probability is essential for machine learning.
• Data can be assumed to be generated by a probability distribution.
• ML datasets have multiple distributions.
• Knowledge of probability distribution and random variables is crucial.
• Experiments in ML involve hypothesis and model construction.
• ML has many models based on hypothesis testing.
• Evaluating models involves hypothesis testing and significance analysis.
• Probability theory links with ML through:
• Hypothesis testing
• Model evaluation
• Sampling theory for dataset construction
• Probability and statistics are fundamental in ML.
• Help in model construction, evaluation, and data interpretation.
• Essential for understanding ML concepts and improving accuracy.
What is a Probability Distribution?
• A probability distribution summarizes the probability associated with a variable’s
events.
• It is a parameterized mathematical function.
• Describes the relationship between observations in a sample space.
• Types of Probability Distributions
1. Discrete probability distribution
2. Continuous probability distribution
• Probability distributions help in modeling uncertainties in data.
• Both discrete and continuous distributions are crucial in machine learning.
• Understanding these concepts aids in statistical modeling and decision-making.
Continuous Probability Distributions
• Represents events of a continuous random variable.
• Summarized by the Probability Density Function (PDF).
• The PDF gives the relative likelihood (density) of observing a particular value.
• The Cumulative Distribution Function (CDF) computes the probability of an
observation ≤ a given value.
• These distributions apply to continuous random variables.
• Examples include Normal, Rectangular (Uniform), and Exponential
distributions.
Normal Distribution
• Also known as Gaussian distribution or bell-shaped curve.
• Most common distribution function.
• Characterized by mean (μ) and standard deviation (σ).
• Mean, median, and mode are the same.
• Z-score normalization is commonly used.
• PDF of Normal Distribution
• Formula:
• f(x; μ, σ²) = (1 / √(2πσ²)) * e^(-(x-μ)² / (2σ²))
• Describes the shape of the normal distribution.
• Used in statistical tests and hypothesis testing.
Rectangular (Uniform) Distribution
• Also known as uniform distribution.
• Equal probabilities for all values in range [a, b].
• Formula (probability density):
• f(x) = 1 / (b - a), for a ≤ x ≤ b (0 otherwise)
Exponential Distribution
• Used to describe time between events in a Poisson process.
• Special case of the Gamma distribution with shape parameter equal to 1.
• Formula:
• f(x, λ) = λe^(-λx), for x ≥ 0 and λ > 0
• Mean and standard deviation are both equal to β = 1 / λ.
• Continuous distributions are widely used in statistics and machine learning.
• Normal, uniform, and exponential distributions help model real-world data.
• Understanding these distributions aids in statistical modeling and
probability analysis.
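A short sketch of these three continuous distributions using scipy.stats; note SciPy's parameterization (the uniform uses loc=a, scale=b−a, and the exponential uses scale = 1/λ):

```python
from scipy.stats import norm, uniform, expon

# Normal(mu=0, sigma=1): PDF at x=0 and CDF at x=1.96
print(norm.pdf(0, loc=0, scale=1), norm.cdf(1.96))

# Uniform on [a, b] = [2, 10]: density is 1 / (b - a) = 0.125 inside the interval
print(uniform.pdf(5, loc=2, scale=8))

# Exponential with rate lambda = 2: mean = 1 / lambda = 0.5
lam = 2.0
print(expon.pdf(1.0, scale=1/lam), expon.mean(scale=1/lam))
```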
Discrete Probability Distributions
• The discrete equivalent of PDF is called Probability Mass Function
(PMF).
• Used for discrete random variables (e.g., number of heads in coin
tosses).
• The PMF gives the probability of each discrete value directly; its shape describes
the distribution.
• The CDF computes the probability of an observation ≤ a given value.
• For continuous variables, by contrast, the probability of an exact value cannot be
read directly from the PDF; it is computed as an area under the curve.
• Examples:
• Binomial Distribution
• Used for binary outcomes (success/failure).
• Formula: P(X = k) = (n choose k) * p^k * (1-p)^(n-k)
• Mean: μ = np, Variance: σ² = np(1-p)
• Poisson Distribution
• Models event occurrences over time.
• Formula: P(X = x) = (e^(-λ) * λ^x) / x!
• Mean: λ
• Standard deviation: sqrt(λ)
• Bernoulli Distribution
• Single binary outcome (0 or 1).
• Mean: p, Variance: p(1-p)
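A brief sketch of these discrete distributions with scipy.stats, using arbitrary example parameters:

```python
from scipy.stats import binom, poisson, bernoulli

# Binomial: probability of k=3 successes in n=10 trials with p=0.5
print(binom.pmf(3, n=10, p=0.5))                         # (10 choose 3) * 0.5^3 * 0.5^7
print(binom.mean(n=10, p=0.5), binom.var(n=10, p=0.5))   # np and np(1-p)

# Poisson with rate lambda = 4: probability of exactly 2 events
print(poisson.pmf(2, mu=4))                              # e^-4 * 4^2 / 2!

# Bernoulli with p = 0.3
print(bernoulli.pmf(1, p=0.3), bernoulli.var(p=0.3))     # p and p(1-p)
```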
Density Estimation
• Estimating the density function from observed data.
• - Two types:
• Parametric Density Estimation: Assumes a known distribution.
• Non-Parametric Density Estimation: No assumption about distribution.
Maximum Likelihood Estimation (MLE)
• A probabilistic framework for estimating distribution parameters.
• Likelihood function: L(X; θ) = Π p(xi; θ)
• Maximizing the log-likelihood function is preferred:
• max Σ log p(xi; θ)
• Used in predictive modeling and regression problems.
• If Gaussian distribution is assumed, MLE leads to:
• max Π (1 / sqrt(2πσ²)) * e^(-(yi - h(xi; β))² / (2σ²))
• SGD (Stochastic Gradient Descent) is often used for optimization.
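A minimal numerical sketch of MLE for an assumed Gaussian model: the negative log-likelihood is minimized with scipy.optimize (the data below are simulated purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical sample assumed to come from a Gaussian distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_likelihood(params):
    mu, sigma = params
    # Maximizing sum of log p(xi; theta) == minimizing the negative log-likelihood
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                  method="L-BFGS-B", bounds=[(None, None), (1e-6, None)])
print(result.x)                  # close to the closed-form Gaussian MLEs below
print(data.mean(), data.std())   # sample mean and (biased) std
```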
Gaussian Mixture Model and Expectation-
Maximization (EM) Algorithm
• Clustering is an important task in machine learning.
• MLE framework is useful for model-based clustering.
• A model assumes data is generated by a distribution with parameters.
• Mixture models involve multiple distributions.
• Gaussian Mixture Model (GMM) is used when Gaussian distributions
are involved.
What is the EM Algorithm?
• Estimates MLE in presence of latent/missing variables.
• Example: Dataset with boys' and girls' weights (latent gender).
• Boys' weights may follow one Gaussian, girls' another.
• Gender is a latent variable, influencing weight distribution.
• EM estimates Probability Density Functions (PDF) when latent
variables exist.
• Stages of the EM Algorithm
• Expectation (E) Stage:
• Estimate expected PDF and parameters for each latent variable.
• Maximization (M) Stage:
• Optimize parameters using MLE function.
• Iterative process continues until latent variables fit probability
distributions.

• GMM assumes data is generated by multiple Gaussian distributions.


• EM is an iterative method for MLE estimation with latent variables.
• Two main steps: Expectation and Maximization.
• Used widely for clustering and density estimation.
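A short sketch of GMM-based clustering with scikit-learn, which runs EM internally; the two-group "weights" data below are simulated to mirror the boys/girls example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical weights: two latent groups, each roughly Gaussian
rng = np.random.default_rng(0)
weights = np.concatenate([rng.normal(45, 4, 200),
                          rng.normal(65, 5, 200)]).reshape(-1, 1)

# GaussianMixture fits the mixture via EM (alternating E-step and M-step)
gmm = GaussianMixture(n_components=2, random_state=0).fit(weights)

print(gmm.means_.ravel())         # estimated component means
print(gmm.covariances_.ravel())   # estimated component variances
labels = gmm.predict(weights)     # most likely latent component for each sample
```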
Parzen Window
• Given 'n' samples, X = {x₁, x₂, ..., xₙ}.
• Samples are drawn independently (IID distribution).
• Region R covers 'k' samples out of 'n'.
• The probability that a sample falls in region R is approximated by p ≈ k/n.
• The density estimate is: p(x) = (k/n) / V.
• V is the volume of region R.
• If R is a hypercube centered at x, and h is its side length:
• V = h² for a 2D square, V = h³ for a 3D cube (in general, V = hᵈ in d dimensions).
• Parzen Window Function
• The window function:
• φ((xᵢ - x)/h) = 1 if |xᵢ - x|/h ≤ 1/2 (in each coordinate), otherwise 0.
• Indicates if the sample is inside the region.
• Used in probability density estimation.
• Parzen Probability Density Function
• Given by p(x) = (k/n) / V.
• Can be rewritten as:
• p(x) = (1/n) Σ (1/Vₙ) φ((xᵢ - x)/h).
• Window function can be replaced by other functions.
• If Gaussian function is used, it becomes the Gaussian density
function.
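A minimal 1-D sketch of the Parzen-window estimate p(x) = (k/n)/V with a hypercube window, using simulated samples:

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen-window density estimate at x (1-D case) with a hypercube window of side h."""
    n = len(samples)
    u = np.abs((samples - x) / h)
    k = np.sum(u <= 0.5)   # phi = 1 when a sample falls inside the window
    V = h                  # window "volume" in 1-D (h^d in d dimensions)
    return k / (n * V)

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 1000)
print(parzen_density(0.0, samples, h=0.5))  # roughly the standard normal density at 0 (~0.40)
```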
KNN Estimation
• Another non-parametric density estimation method.
• The parameter k is determined.
• Based on k-nearest neighbors.
• The density estimate at x is p(x) = (k/n) / V, where V is the volume of the smallest
region around x that contains its k nearest neighbors.
• Parzen Window and KNN are non-parametric methods for density
estimation.
• Parzen Window uses a window function to estimate density.
• KNN determines density based on the k-nearest neighbors.
• Both methods are useful for unsupervised learning and classification.
Feature Engineering and Dimensionality
Reduction Techniques
• Features are attributes in machine learning.
• Feature engineering involves selecting important features to improve
model performance.
• Two main problems:
• 1. Feature Transformation: Creating new features (e.g., BMI from
height & weight).
• 2. Feature Selection: Choosing a subset of features to improve
efficiency.
Feature Subset Selection
• Reduces dataset size by removing irrelevant features.
• Constructs a minimal set of attributes for machine learning.
• High-dimensional datasets suffer from 'curse of dimensionality'.
• If a dataset has 'n' attributes, there are 2ⁿ possible subsets, making
selection complex.
• Solution: Remove non-contributing components to reduce
dimensionality.
Feature Removal Criteria
• Features can be removed based on:
• 1. **Feature Relevancy:** Features should be relevant for classification.
• - Example: A mole on a face helps in detection more than common
features like a nose.
• - Measured using mutual information, correlation, and distance measures.
• 2. **Feature Redundancy:** Some features provide duplicate information.
• - Example: Age can be derived from the Date of Birth column in a
database.
Feature Selection Procedure
• Steps for feature selection:
• 1. Generate all possible subsets.
• 2. Evaluate subsets based on model performance.
• 3. Select the optimal subset.
• - **Filter-based selection** uses statistical measures (e.g., correlation,
entropy).
• - **Wrapper-based selection** uses classifiers to determine best
features (computationally expensive).
Feature engineering
• Feature engineering improves model efficiency and accuracy.
• Feature transformation creates new meaningful attributes.
• Feature selection removes redundant/irrelevant attributes to reduce
complexity.
• Various selection methods include filter-based and wrapper-based
techniques.
Algorithms
• Stepwise Forward Selection
• Starts with an empty set of attributes.
• Each attribute is tested for statistical significance.
• The best attribute is added to the reduced set iteratively.
• The process continues until a good subset of attributes is obtained.
• Stepwise Backward Elimination
• Starts with a complete set of attributes.
• At each stage, the worst attribute is removed.
• The process continues until an optimal subset is achieved.
• Combined Approach
• A combination of forward selection and backward elimination.
• Allows adding the best attribute and removing the worst attribute
simultaneously.
• Balances feature selection for optimal model performance.
Principal Component Analysis
• PCA (Principal Component Analysis) is a technique used for dimensionality
reduction.
• It transforms a set of measurements into a new set of features.
• The goal is to capture maximum variance with fewer components.
• Principal Component Analysis (PCA) or KL Transform is used to transform
a dataset into a new set of features that exhibit high information packing
properties.
• Reduces redundancy by eliminating correlated information.
• Provides a compact representation of data with reduced dimensions.
Mathematical Formulation
• Mean vector: m_x = E{x}
• Covariance matrix: C = E{(x - m_x)(x - m_x)^T}
• The covariance matrix is symmetric, and its eigenvalues and eigenvectors are computed.
PCA Algorithm Steps
1. Obtain the target dataset x.
2. Compute the mean vector and subtract it from the dataset.
3. Compute the covariance matrix C.
4. Calculate eigenvalues and eigenvectors of C.
5. Select the eigenvectors corresponding to the highest eigenvalues.
6. Obtain the feature vector and perform transformation.
7. The transformed data can be recovered if needed.
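A compact sketch of these steps with NumPy, on randomly generated data (the function and variable names are illustrative):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigen decomposition of the covariance matrix (rows of X = samples)."""
    X_centered = X - X.mean(axis=0)                  # subtract the mean vector
    C = np.cov(X_centered, rowvar=False)             # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues, largest first
    components = eigvecs[:, order[:n_components]]    # feature vector of top eigenvectors
    return X_centered @ components, eigvals[order]   # transformed data, sorted eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_reduced, eigvals = pca(X, n_components=2)
print(X_reduced.shape)   # (100, 2)
print(eigvals)           # candidates for a scree plot
```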
Principal Components Selection
• Eigenvectors corresponding to the highest eigenvalues are chosen.
• Feature vector is formed with selected eigenvectors.
• Transformation is applied to obtain a lower-dimensional representation.
• Scree plot helps visualize the most important principal components.
• Advantages of PCA
• Reduces dimensionality of large datasets.
• Eliminates irrelevant and redundant attributes.
• Helps in visualization of high-dimensional data.
• Enhances computational efficiency in machine learning models.
Exercise problem
LDA (Linear Discriminant Analysis) Algorithm
• LDA is a feature reduction technique like PCA.
• It projects high-dimensional data to a lower dimension (a line).
• LDA is used for data classification.
• Given two classes, LDA finds the means of both classes.
• The mean of class c1 and class c2 is computed as:
• μ1 = (1 / N1) * Σ xi (for class c1)
• μ2 = (1 / N2) * Σ xi (for class c2)
• The goal of LDA is to optimize:
• J(V) = (V^T * σB * V) / (V^T * σW * V)
• V is the linear projection.
• σB is the between-class scatter matrix.
• σW is the within-class scatter matrix.
• Between-class scatter matrix:
• σB = N1 (μ1 - μ)(μ1 - μ)^T + N2 (μ2 - μ)(μ2 - μ)^T
• Within-class scatter matrix:
• σW = Σ (xi - μ1)(xi - μ1)^T + Σ (xi - μ2)(xi - μ2)^T
• Eigenvalue Equation
• To maximize J(V), solve:
• σB * V = λσW * V OR σW^(-1) * σB * V = λV
• The projection vector V is computed as:
• V = σW^(-1) * (μ1 - μ2)
• The transformation of x is given by:
• y = V^T * x
• Like in PCA, the largest eigenvalues can be retained to have
projections.
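A minimal two-class sketch in NumPy computing V = σW⁻¹(μ1 − μ2) and the projection y = Vᵀx, on simulated class data:

```python
import numpy as np

def lda_projection(X1, X2):
    """Two-class LDA: projection vector V = Sw^-1 (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix
    Sw = ((X1 - mu1).T @ (X1 - mu1)) + ((X2 - mu2).T @ (X2 - mu2))
    return np.linalg.inv(Sw) @ (mu1 - mu2)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class c1
X2 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))   # class c2
V = lda_projection(X1, X2)

# Project samples onto the line: y = V^T x
y1, y2 = X1 @ V, X2 @ V
print(V, y1.mean(), y2.mean())   # projected class means are well separated
```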
SVD (Singular Value Decomposition)
Algorithm
• Singular Value Decomposition (SVD) is a useful decomposition
technique.
• A given matrix A can be decomposed as: A = U S Vᵀ.
• Here, U and V are orthogonal matrices, and S is a diagonal matrix.
• SVD is widely used in dimensionality reduction, noise reduction, and
compression.
Procedure for SVD Decomposition
• 1. For a given matrix A, compute A Aᵀ and Aᵀ A.
• 2. Find the eigenvalues of A Aᵀ (the nonzero eigenvalues are the same as those of Aᵀ A).
• 3. Sort the eigenvalues in descending order and store the corresponding eigenvectors
of A Aᵀ in U.
• 4. Compute the square roots of the eigenvalues and store them diagonally in S
(these are the singular values).
• 5. Compute the eigenvectors of Aᵀ A and store them in V.
• Understanding Matrices U, S, and V
• U: Contains the left singular vectors (orthogonal matrix).
• S: Contains singular values (diagonal matrix).
• V: Contains the right singular vectors (orthogonal matrix).
• Thus, A = U S Vᵀ.
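A short sketch with NumPy's SVD routine, verifying A = U S Vᵀ and showing a rank-1 approximation (the compression idea):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds the singular values
S = np.diag(s)

print(np.allclose(A, U @ S @ Vt))   # True: A = U S Vᵀ

# Rank-1 approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.round(A1, 2))
```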
• Applications of SVD
• Image Compression
• Dimensionality Reduction (PCA)
• Noise Reduction
• Data Analysis and Pattern Recognition
Thank you
