Module 2 ML Chapter 2
CHAPTER 2
UNDERSTANDING DATA – 2
Contents
• Bivariate Data and Multivariate Data.
• Multivariate Statistics.
• Essential Mathematics for Multivariate Data.
• Feature Engineering and Dimensionality Reduction Techniques.
Introduction
• Bivariate data involves two variables and examines their relationship.
• It helps identify trends, correlations, and potential causes.
• We will analyze temperature and sweater-sales data to demonstrate this.
• Consider a dataset of temperatures (°C) and the corresponding sweater sales (in thousands).
• A scatter plot visualizes the relationship between temperature and
sweater sales.
• It shows how sales decrease as temperature increases, indicating a
negative correlation.
• The line chart also demonstrates the negative relationship between
temperature and sweater sales.
• As the temperature increases, the sales of sweaters consistently
decrease.
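A minimal sketch of these two plots in Python with matplotlib. The numbers below are made-up illustrative values, not the textbook dataset:

```python
import matplotlib.pyplot as plt

temperature = [5, 10, 15, 20, 25, 30]     # °C (hypothetical values)
sales = [30, 25, 20, 15, 10, 5]           # sweater sales, thousands (hypothetical)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(temperature, sales)           # scatter plot: negative trend
ax1.set(xlabel="Temperature (°C)", ylabel="Sales (thousands)", title="Scatter plot")
ax2.plot(temperature, sales, marker="o")  # line chart of the same relationship
ax2.set(xlabel="Temperature (°C)", ylabel="Sales (thousands)", title="Line chart")
plt.tight_layout()
plt.show()
```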
Bivariate Statistics
• Covariance measures the joint variability of two random variables.
• It indicates the direction of the relationship:
• Positive covariance: Variables increase together.
• Negative covariance: One variable increases while the other decreases.
• Correlation measures the strength and direction of a linear relationship between
variables.
• Positive correlation: Both variables move together.
• Negative correlation: Variables move in opposite directions.
• Zero correlation: No linear relationship.
• The Pearson correlation coefficient is defined as:
• r = COV(X, Y) / (σX * σY)
• Where σX and σY are the standard deviations of X and Y.
• Covariance and correlation help in understanding the relationship between
variables.
• A high positive correlation (e.g., ≈ 0.984) indicates a strong linear relationship in the data.
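A short sketch of computing covariance and the Pearson coefficient with NumPy, reusing the hypothetical temperature/sales values from above:

```python
import numpy as np

x = np.array([5, 10, 15, 20, 25, 30], dtype=float)   # hypothetical temperatures
y = np.array([30, 25, 20, 15, 10, 5], dtype=float)   # hypothetical sales

cov_xy = np.cov(x, y)[0, 1]                           # sample covariance COV(X, Y)
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # r = COV(X, Y) / (σX * σY)
print(cov_xy, r)                     # r is -1.0: this toy data is perfectly linear
print(np.corrcoef(x, y)[0, 1])       # same value via NumPy's built-in
```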
Multivariate Statistics
• Definition: Analysis of more than two observable variables.
• Importance: Helps analyze datasets with thousands of measurements.
• Examples: Regression analysis, PCA, path analysis.
• Example: a data table with columns Id, Attribute 1, Attribute 2, and Attribute 3.
• Mean vector of the example: (2, 7.5, 1.33).
• The multivariate counterpart of variance is the covariance matrix.
• The mean vector is also called the centroid; the covariance matrix is also called the dispersion matrix.
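A sketch of computing the centroid and the dispersion matrix with NumPy. The three observations below are hypothetical, chosen only so the column means come out to (2, 7.5, 1.33) as quoted above:

```python
import numpy as np

# Rows = observations; columns = Attribute 1..3 (hypothetical values).
X = np.array([[1.0, 6.0, 1.0],
              [2.0, 7.5, 1.5],
              [3.0, 9.0, 1.5]])

mean_vector = X.mean(axis=0)           # centroid: [2.0, 7.5, 1.33...]
cov_matrix = np.cov(X, rowvar=False)   # dispersion (covariance) matrix
print(mean_vector)
print(cov_matrix)
```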
Heatmap
• Definition: Visual representation of a 2D matrix.
• Color significance: Darker for larger values, lighter for smaller.
• Use case: Traffic data analysis to distinguish heavy and low traffic regions.
• Figure: Heatmap for patient data.
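A minimal heatmap sketch with seaborn; a random matrix stands in here for the patient data from the figure:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((10, 10))            # 2-D matrix of values in [0, 1)
sns.heatmap(data, cmap="viridis")      # darker/lighter cells encode magnitude
plt.title("Heatmap of a random 2-D matrix")
plt.show()
```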
Pairplot
• Definition: Visual technique for multivariate data.
• Use case: Identify relationships and correlations among variables.
• Structure: Pairwise scatter plots in a matrix format.
• Figure 2.14: Pairplot for random data, showing relationships among three variables.
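A minimal pairplot sketch with seaborn on random data for three variables, in the spirit of Figure 2.14:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["var1", "var2", "var3"])
sns.pairplot(df)   # pairwise scatter plots; the diagonal shows each variable's distribution
plt.show()
```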
Essential Mathematics for Multivariate Data
• Machine learning involves mathematical concepts from linear algebra, statistics, probability, and information theory.
• Linear algebra plays a crucial role as the mathematics of data: it deals with linear equations, vectors, matrices, vector spaces, and transformations.
• Importance of Linear Algebra in Machine Learning
• Linear algebra is fundamental to machine learning, scientific applications, and data science.
• These concepts are necessary for handling and processing multivariate data efficiently.
Linear Systems and Gaussian Elimination
• A linear system of equations consists of equations in unknown variables. If Ax = y, then x = A⁻¹y (matrix "division" is not defined; the inverse plays that role). Gaussian elimination solves such systems efficiently without explicitly forming A⁻¹.
Gaussian Elimination Steps
• Step 1: Convert the System into an Augmented Matrix
• Step 2: Convert to Upper Triangular Form (Forward Elimination)
• The goal is to get 1s in the diagonal and 0s below the diagonal.
• Perform row operations to create leading ones and eliminate terms below them.
• The allowed operations:
• Swapping rows if necessary (to bring a nonzero pivot to the diagonal).
• Multiplying a row by a nonzero scalar to make the pivot 1.
• Subtracting a multiple of one row from another to create zeros below the pivot.
• Example:
For a system with three variables:
• Make the first pivot element 1 (if needed, swap rows).
• Subtract multiples of the first row from lower rows to eliminate the first column.
• Make the second pivot element 1.
• Subtract multiples of the second row from rows below to eliminate the second column.
• Repeat until an upper triangular matrix (zeros below diagonal) is obtained.
• Step 3: Convert to Reduced Row Echelon Form (Optional)
• Further modify the matrix to get 1s in the diagonal and 0s above and below it.
• This step isn't always necessary for solving the system, but it makes the
solution clearer.
• Step 4: Solve Using Back-Substitution
• Start from the last equation (bottom row) and solve for the last variable.
• Substitute the known values into the previous rows to find the remaining
variables.
• Continue until all variables are found.
Worked Example (two equations in two unknowns)
• Step 1: Convert to an Augmented Matrix
• Step 2: Normalize the First Row (Make the first pivot 1)
• Divide row 1 by 2
• Step 3: Eliminate Below the First Pivot
• Subtract 4 × (Row 1) from Row 2:
• Step 4: Normalize the Second Pivot (Make it 1)
• Divide Row 2 by -5:
• Step 5: Eliminate Above the Second Pivot
• Subtract 2 × (Row 2) from Row 1:
• Step 6: Read the Solution
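A sketch of the procedure in Python with NumPy. The coefficient matrix is the one implied by the row operations above; the right-hand side values are made-up stand-ins, since the original numbers are not shown:

```python
import numpy as np

def gaussian_elimination(A, y):
    """Sketch of Gaussian elimination with back-substitution.
    Assumes every pivot is nonzero (no row swaps needed)."""
    A = A.astype(float).copy()
    y = y.astype(float).copy()
    n = len(y)
    for i in range(n):                     # forward elimination
        y[i] /= A[i, i]
        A[i] /= A[i, i]                    # make the pivot 1
        for j in range(i + 1, n):          # zero out entries below the pivot
            y[j] -= A[j, i] * y[i]
            A[j] -= A[j, i] * A[i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):         # back-substitution, bottom row first
        x[i] = y[i] - A[i, i + 1:] @ x[i + 1:]
    return x

# Coefficients reconstructed from the worked steps (divide row 1 by 2,
# subtract 4×row 1, divide row 2 by -5); hypothetical right-hand side.
A = np.array([[2.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 7.0])
print(gaussian_elimination(A, y))          # [1. 1.]
print(np.linalg.solve(A, y))               # cross-check with NumPy
```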
Matrix decompositions
• Matrix decomposition breaks a matrix down into constituent parts so that complex matrix operations can be performed more efficiently.
• Also known as matrix factorization methods.
• Simplifies problems in data science, engineering, and applied mathematics.
Eigen Decomposition
• The most common matrix decomposition technique.
• Factors a matrix into its eigenvalues and eigenvectors.
• Representation:
• A = Q Λ Q⁻¹ (when A is symmetric, Q is orthogonal, so A = Q Λ Q^T)
• Q: Matrix of eigenvectors
• Λ (Lambda): Diagonal matrix of eigenvalues
• Q^T / Q⁻¹: Transpose / inverse of Q
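A quick NumPy check of the decomposition on a small symmetric matrix (chosen arbitrarily here):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])               # symmetric, so A = Q Λ Q^T holds

eigvals, Q = np.linalg.eigh(A)           # eigh is for symmetric matrices
Lam = np.diag(eigvals)                   # Λ as a diagonal matrix
print(np.allclose(A, Q @ Lam @ Q.T))     # True: reconstruction A = Q Λ Q^T
```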
LU Decomposition
• One of the simplest matrix decompositions.
• Expresses a matrix A as:
• A = LU
• L: Lower triangular matrix
• U: Upper triangular matrix
• Performed using Gaussian elimination.
• Steps:
1. Write A = IA, pairing A with an identity matrix.
2. Apply row operations (Gaussian elimination) to reduce A to upper triangular form, recording the elimination multipliers in the identity factor.
3. The identity factor becomes L (lower triangular) and the reduced matrix becomes U (upper triangular).
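A sketch using SciPy's LU routine on an arbitrary small matrix. Note that SciPy returns an extra permutation matrix P to account for any row swaps:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 4.0],
              [4.0, 3.0]])
P, L, U = lu(A)                     # factorization with partial pivoting
print(np.allclose(A, P @ L @ U))    # True: A = P L U (A = L U when no swaps occur)
print(L)                            # lower triangular factor
print(U)                            # upper triangular factor
```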
Machine Learning and the Importance of Probability and Statistics
• Machine learning is linked with statistics and probability.
• Statistics is the heart of machine learning.
• Without statistics, data analysis is difficult.
• Probability is essential for machine learning.
• Data can be assumed to be generated by a probability distribution.
• ML datasets have multiple distributions.
• Knowledge of probability distribution and random variables is crucial.
• Experiments in ML involve hypothesis and model construction.
• ML has many models based on hypothesis testing.
• Evaluating models involves hypothesis testing and significance analysis.
• Probability theory links with ML through:
• Hypothesis testing
• Model evaluation
• Sampling theory for dataset construction
• Probability and statistics are fundamental in ML.
• Help in model construction, evaluation, and data interpretation.
• Essential for understanding ML concepts and improving accuracy.
What is a Probability Distribution?
• A probability distribution summarizes the probability associated with a variable’s
events.
• It is a parameterized mathematical function.
• Describes the relationship between observations in a sample space.
• Types of Probability Distributions
1. Discrete probability distribution
2. Continuous probability distribution
• Probability distributions help in modeling uncertainties in data.
• Both discrete and continuous distributions are crucial in machine learning.
• Understanding these concepts aids in statistical modeling and decision-making.
Continuous Probability Distributions
• Represents events of a continuous random variable.
• Summarized by the Probability Density Function (PDF).
• The PDF gives the density (relative likelihood) at a value; probabilities are obtained as areas under the PDF curve.
• The Cumulative Distribution Function (CDF) computes the probability of an observation ≤ a given value.
• These distributions apply to continuous random variables.
• Examples include Normal, Rectangular (Uniform), and Exponential
distributions.
Normal Distribution
• Also known as Gaussian distribution or bell-shaped curve.
• Most common distribution function.
• Characterized by mean (μ) and standard deviation (σ).
• Mean, median, and mode are the same.
• Z-score normalization is commonly used.
• PDF of Normal Distribution
• Formula:
• f(x, μ, σ²) = (1 / √(2πσ²)) * e^(-(x-μ)² / (2σ²))
• Describes the shape of the normal distribution.
• Used in statistical tests and hypothesis testing.
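A short scipy.stats sketch confirming that the library's normal PDF matches the formula above (μ and σ chosen arbitrarily):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-4, 4, 9)
pdf = norm.pdf(x, loc=mu, scale=sigma)            # the bell-shaped density
manual = (np.exp(-(x - mu) ** 2 / (2 * sigma**2))
          / np.sqrt(2 * np.pi * sigma**2))        # the formula above
print(np.allclose(pdf, manual))                   # True
print(norm.cdf(1.96, loc=mu, scale=sigma))        # P(X ≤ 1.96) ≈ 0.975
```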
Rectangular (Uniform) Distribution
• Also known as uniform distribution.
• Equal probabilities for all values in range [a, b].
• Formula:
• f(x) = 1 / (b - a), for a ≤ x ≤ b (for a continuous variable, P(X = x) itself is 0; probabilities come from intervals)
Exponential Distribution
• Used to describe time between events in a Poisson process.
• Special case of the Gamma distribution with its shape parameter fixed at 1.
• Formula:
• f(x, λ) = λe^(-λx), for x ≥ 0 and λ > 0
• Mean and standard deviation are both equal to β = 1 / λ
• Continuous distributions are widely used in statistics and machine learning.
• Normal, uniform, and exponential distributions help model real-world data.
• Understanding these distributions aids in statistical modeling and
probability analysis.
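A short scipy.stats sketch of the uniform and exponential densities (parameter values chosen arbitrarily); note that SciPy parameterizes both by loc/scale rather than a, b, and λ:

```python
from scipy.stats import uniform, expon

a, b = 2.0, 5.0
# SciPy's uniform lives on [loc, loc + scale], so loc=a and scale=b-a.
print(uniform.pdf(3.0, loc=a, scale=b - a))   # 1/(b-a) = 1/3

lam = 2.0
# SciPy's expon uses scale = 1/λ.
print(expon.pdf(0.5, scale=1 / lam))          # λ e^(-λx) = 2 e^(-1) ≈ 0.7358
print(expon.mean(scale=1 / lam))              # mean = 1/λ = 0.5
```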
Discrete Probability Distributions
• The discrete equivalent of PDF is called Probability Mass Function
(PMF).
• Used for discrete random variables (e.g., number of heads in coin
tosses).
• The PMF gives the probability of each discrete value directly and shows the shape of the distribution.
• The CDF computes the probability of an observation ≤ a given value.
• (For continuous variables, by contrast, the probability of an event cannot be read off directly; it is computed as the area under the PDF curve.)
• Examples:
• Binomial Distribution
• Used for binary outcomes (success/failure).
• Formula: P(X = k) = (n choose k) * p^k * (1-p)^(n-k)
• Mean: μ = np, Variance: σ² = np(1-p)
• Poisson Distribution
• Models event occurrences over time.
• Formula: P(X = x) = (e^(-λ) * λ^x) / x!
• Mean: λ
• Standard deviation: sqrt(λ)
• Bernoulli Distribution
• Single binary outcome (0 or 1).
• Mean: p, Variance: p(1-p)
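A short scipy.stats sketch of all three distributions with arbitrary parameter values, checking the moments quoted above:

```python
from scipy.stats import binom, poisson, bernoulli

n, p = 10, 0.5
print(binom.pmf(3, n, p))                  # P(X = 3) for 10 fair coin tosses
print(binom.mean(n, p), binom.var(n, p))   # np = 5.0, np(1-p) = 2.5

lam = 4.0
print(poisson.pmf(2, lam))                 # P(X = 2) = e^(-λ) λ² / 2!
print(poisson.std(lam))                    # sqrt(λ) = 2.0

print(bernoulli.mean(0.3), bernoulli.var(0.3))   # p = 0.3, p(1-p) = 0.21
```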
Density Estimation
• Estimating the density function from observed data.
• Two types:
• Parametric Density Estimation: Assumes a known distribution.
• Non-Parametric Density Estimation: No assumption about distribution.
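A sketch of non-parametric density estimation using SciPy's Gaussian kernel density estimator on synthetic data (no distribution is assumed; the true Gaussian here only generates the sample):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)   # observed data

kde = gaussian_kde(sample)        # non-parametric estimate of the density
grid = np.linspace(0.0, 10.0, 5)
print(kde(grid))                  # estimated density at the grid points
```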
Maximum Likelihood Estimation (MLE)
• A probabilistic framework for estimating distribution parameters.
• Likelihood function: L(X; θ) = Π p(xi; θ)
• Maximizing the log-likelihood function is preferred:
• max Σ log p(xi; θ)
• Used in predictive modeling and regression problems.
• If a Gaussian distribution is assumed, MLE leads to:
• max Π (1 / sqrt(2πσ²)) * e^(-(yi - h(xi; β))² / (2σ²))
• Taking logarithms, this is equivalent to minimizing Σ (yi - h(xi; β))², i.e., least squares.
• SGD (Stochastic Gradient Descent) is often used for the optimization.
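A sketch of Gaussian MLE on synthetic data. For this simple model the maximizer has a closed form (the sample mean and the ddof=0 sample variance), so no iterative optimizer is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=1000)   # synthetic observations

# MLE for a Gaussian: μ̂ = sample mean, σ̂² = sample variance with ddof=0.
mu_hat = data.mean()
sigma2_hat = data.var(ddof=0)

# Maximized log-likelihood  Σ log p(xi; μ̂, σ̂²).
log_likelihood = np.sum(-0.5 * np.log(2 * np.pi * sigma2_hat)
                        - (data - mu_hat) ** 2 / (2 * sigma2_hat))
print(mu_hat, sigma2_hat, log_likelihood)          # estimates near 3.0 and 2.25
```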
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm
• Clustering is an important task in machine learning.
• MLE framework is useful for model-based clustering.
• A model assumes data is generated by a distribution with parameters.
• Mixture models involve multiple distributions.
• Gaussian Mixture Model (GMM) is used when Gaussian distributions
are involved.
What is the EM Algorithm?
• Estimates MLE in presence of latent/missing variables.
• Example: Dataset with boys' and girls' weights (latent gender).
• Boys' weights may follow one Gaussian, girls' another.
• Gender is a latent variable, influencing weight distribution.
• EM estimates Probability Density Functions (PDF) when latent
variables exist.
• Stages of the EM Algorithm
• Expectation (E) Stage:
• Estimate expected PDF and parameters for each latent variable.
• Maximization (M) Stage:
• Optimize parameters using MLE function.
• The E and M stages repeat until the parameter estimates converge and the latent variables fit the probability distributions.
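A sketch of fitting a GMM with scikit-learn, which runs EM internally. The two-component synthetic weight data loosely mirrors the boys/girls example above; all parameter values are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two latent groups with unobserved membership: weights drawn from
# two different Gaussians (hypothetical means/spreads).
weights = np.concatenate([rng.normal(45, 5, 200),
                          rng.normal(60, 6, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)   # fit via EM
gmm.fit(weights)
print(gmm.means_.ravel())         # recovered component means (~45 and ~60)
print(gmm.predict(weights[:5]))   # most likely latent component per point
```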