Module - 02 Machine Learning (BCS602)
Module-2
Bivariate Data
Bivariate data involves two variables, and the goal of bivariate analysis is to explore the relationship
between them.
This relationship can help in making comparisons, identifying causes, and guiding further exploration of the data.
The aim of bivariate analysis is therefore to find relationships, including possible cause-and-effect links, between the two variables.
Consider Table 2.3, which records the temperature in a shop and the corresponding sales of sweaters.
Scatter Plot
A scatter plot displays the paired values of the two variables as points on a 2D graph. It reveals several aspects of the relationship:
Strength: Indicates how closely the data points fit a pattern or trend.
Shape: Helps in identifying the type of relationship (linear, quadratic, etc.).
Direction: Shows whether the relationship is positive, negative, or neutral.
Outliers: Helps identify any points that deviate significantly from the trend.
Scatter plots are often used in the exploratory phase of data analysis before calculating correlation
coefficients or fitting regression models.
Bivariate Statistics
There are various statistical measures to describe the relationship between two variables.
Covariance measures the joint variability of two random variables. It tells you whether an
increase in one variable results in an increase or decrease in the other variable.
Mathematically, the covariance between two variables X and Y, with means x̄ and ȳ over N observations, is defined as:
Cov(X, Y) = (1/N) Σ (xᵢ − x̄)(yᵢ − ȳ), where the sum runs over all N paired observations.
Covariance values:
Positive covariance: As one variable increases, the other variable also increases.
Negative covariance: As one variable increases, the other variable decreases.
Zero covariance: No linear relationship between the variables.
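The following is a minimal sketch, assuming NumPy is available and using made-up temperature/sales values, of computing covariance both from the definition above and with np.cov:

```python
import numpy as np

# Hypothetical data: shop temperature (X) and sweater sales (Y)
X = np.array([25, 22, 18, 15, 10], dtype=float)
Y = np.array([10, 15, 25, 35, 50], dtype=float)

# Covariance from the definition: mean of the products of deviations
cov_manual = np.mean((X - X.mean()) * (Y - Y.mean()))

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(X, Y).
# bias=True uses the 1/N (population) form to match the manual computation.
cov_np = np.cov(X, Y, bias=True)[0, 1]

print(cov_manual, cov_np)  # both negative: sales rise as temperature falls
```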
Correlation
While covariance measures the direction of the relationship, correlation quantifies both the strength and the direction of the relationship between two variables.
The Pearson correlation coefficient is the covariance scaled by the standard deviations of the two variables, r = Cov(X, Y) / (σX σY), and it always lies between −1 and +1.
Unlike covariance, correlation is dimensionless, meaning it is not affected by the units of the variables.
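As a brief sketch, reusing the hypothetical temperature/sales arrays from the previous example and assuming NumPy, the following shows that rescaling a variable changes the covariance but leaves the correlation unchanged:

```python
import numpy as np

X = np.array([25, 22, 18, 15, 10], dtype=float)  # temperature in Celsius
Y = np.array([10, 15, 25, 35, 50], dtype=float)  # sweater sales

r = np.corrcoef(X, Y)[0, 1]                      # Pearson correlation (close to -1 here)

# Changing units (Celsius -> Fahrenheit) rescales the covariance ...
X_f = X * 9 / 5 + 32
print(np.cov(X, Y)[0, 1], np.cov(X_f, Y)[0, 1])  # different values

# ... but the correlation is unchanged, because it is dimensionless.
print(r, np.corrcoef(X_f, Y)[0, 1])              # identical values
```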
Multivariate Statistics
Multivariate data refers to data that involves more than two variables, and in machine learning, most
datasets are multivariate.
Multivariate analysis can involve multiple dependent (response) variables as well as multiple independent variables, and is used for analyzing more complex data scenarios. Common multivariate techniques include:
Regression Analysis
Principal Component Analysis (PCA)
Path Analysis
The mean vector is used to represent the mean of multiple variables, and the covariance matrix
represents the variance and relationships among all variables.
The mean vector is also known as the centroid, while the covariance matrix is also referred to as
the dispersion matrix.
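As a small sketch, assuming NumPy and a made-up three-variable dataset, the mean vector (centroid) and covariance (dispersion) matrix can be computed as follows:

```python
import numpy as np

# Hypothetical multivariate data: rows are observations, columns are variables
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 3.0, 5.0],
                 [3.0, 5.0, 7.0],
                 [4.0, 6.0, 9.0]])

mean_vector = data.mean(axis=0)          # centroid: one mean per variable
cov_matrix = np.cov(data, rowvar=False)  # variances on the diagonal,
                                         # covariances off the diagonal
print(mean_vector)
print(cov_matrix)
```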
Regression Analysis:
Used to model the relationship between multiple independent variables and a dependent variable.
Factor Analysis:
Used to identify underlying latent variables (factors) that explain the observed variables.
Heatmap
A heatmap is a graphical representation of a matrix of values in which each value is shown as a color, with color intensity indicating magnitude.
Applications:
Heatmaps are useful for visualizing complex data like traffic patterns or patient health data, where
you can easily identify regions of higher or lower values.
Example:
In vehicle traffic data, regions with heavy traffic are highlighted with dark colors, making it easy
to spot problem areas.
A pairplot (or scatter matrix) is a matrix of scatter plots that shows relationships between every
pair of variables in a multivariate dataset.
This method allows you to visually examine correlations or relationships between variables.
A random matrix of three columns is chosen and the relationships among the columns are plotted as a pairplot (or scatter matrix), as shown in Figure 2.14.
Visual Layout: Each scatter plot in the matrix shows the relationship between two
variables.
Usefulness: By examining the pairplot, you can easily identify patterns,
correlations, or clusters among the variables.
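A minimal sketch, assuming pandas, seaborn, and matplotlib are installed, of building a pairplot from a random three-column matrix as described above:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random matrix with three columns, as in the example above
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

# One scatter plot for every pair of columns; histograms on the diagonal
sns.pairplot(df)
plt.show()
```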
In the realm of machine learning and multivariate data analysis, several mathematical concepts are
foundational.
These include concepts from Linear Algebra, Statistics, Probability, and Optimization. Below is
an overview of essential mathematical tools that are necessary for understanding and working
with multivariate data.
Linear Algebra
Linear algebra is crucial in machine learning as it provides the tools for dealing with data in the
form of vectors and matrices. Here's a breakdown of important topics:
Vectors: A vector is an ordered list of numbers. It can represent data points or features
of an observation in a multivariate dataset.
o Dot product and cross product are used to compute projections and angles
between vectors.
Matrices: A matrix is a 2D array of numbers. In machine learning, matrices often
represent data where rows are instances and columns are features.
o Matrix multiplication allows the transformation of data and is used in
various algorithms like linear regression, neural networks, and more.
Eigenvalues and Eigenvectors: These are important for dimensionality reduction
techniques such as Principal Component Analysis (PCA). They are used to transform
data into a new basis that captures the most variance.
Determinants and Inverses: The determinant of a matrix tells us if the matrix is
invertible (non-singular). The inverse of a matrix is used to solve linear systems of
equations.
Singular Value Decomposition (SVD): This is a factorization method used in PCA and
other dimensionality reduction techniques to decompose a matrix into singular values
and vectors.
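The following sketch, assuming NumPy and a small made-up matrix, illustrates the linear-algebra operations listed above:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(np.dot(v, w))            # dot product: projections and angles
print(np.cross(v, w))          # cross product of 3-D vectors

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[1.0, 0.0],
              [2.0, 1.0]])
print(A @ B)                   # matrix multiplication (data transformation)

eigvals, eigvecs = np.linalg.eig(A)   # eigenvalues/eigenvectors (basis of PCA)
print(eigvals)

print(np.linalg.det(A))        # non-zero determinant => A is invertible
print(np.linalg.inv(A))        # inverse: used to solve linear systems A x = b

U, S, Vt = np.linalg.svd(A)    # singular value decomposition
print(S)
```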
Statistics
Statistics is key to understanding the relationships between different variables in multivariate data.
Key concepts include:
Mean and Variance: Measures of central tendency (mean) and spread (variance) are
essential to understanding the distribution of each variable.
Covariance: Covariance measures the relationship between two variables. A positive
covariance indicates that as one variable increases, the other tends to increase.
Correlation: Correlation is a normalized measure of covariance that indicates the
strength and direction of the relationship between two variables.
Multivariate Normal Distribution: Many machine learning algorithms assume that the
data follows a multivariate normal distribution, which extends the idea of the normal
distribution to more than one variable (a brief sampling sketch follows this list).
Principal Component Analysis (PCA): PCA is used to reduce the dimensionality of the
dataset while retaining as much variance as possible. It uses eigenvectors and
eigenvalues to identify the principal components.
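To illustrate the multivariate normal distribution mentioned above, here is a short sketch, assuming NumPy, that draws correlated samples from a chosen mean vector and covariance matrix:

```python
import numpy as np

mean = np.array([0.0, 0.0])        # mean vector (centroid)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])       # covariance matrix with strong positive correlation

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=1000)

# The sample statistics should be close to the chosen parameters
print(samples.mean(axis=0))
print(np.cov(samples, rowvar=False))
```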
Probability
Probability theory underpins the concept of uncertainty, which is inherent in real-world data:
Bayes' Theorem: This theorem describes the probability of an event based on prior
knowledge of related events. It is fundamental to algorithms like Naive Bayes and
Bayesian Inference (a numeric sketch follows this list).
Markov Chains: These are used for modeling systems that undergo transitions from
one state to another with a certain probability, without memory of previous states.
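A tiny sketch of Bayes' theorem with made-up numbers for a hypothetical diagnostic test, computing the posterior P(disease | positive) from a prior and the likelihoods:

```python
# Hypothetical numbers, for illustration only
p_disease = 0.01            # prior P(D)
p_pos_given_disease = 0.95  # likelihood P(+ | D)
p_pos_given_healthy = 0.05  # false-positive rate P(+ | not D)

# Total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D | +) = P(+ | D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161
```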
Optimization
Optimization is key to finding the best model for multivariate data. Many machine learning
algorithms are formulated as optimization problems, typically minimizing a loss function, for example with gradient descent.
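A minimal gradient-descent sketch in plain NumPy, using made-up data, that fits a one-variable linear model by repeatedly stepping against the gradient of the mean squared error:

```python
import numpy as np

# Made-up data generated from y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.1 * rng.normal(size=x.size)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.5          # learning rate (step size)

for _ in range(1000):
    error = w * x + b - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should be close to 2 and 1
```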
Multivariate Analysis
Scatter Plots: A scatter plot can be used to visualize the relationship between two
variables. For multivariate data, pair plots or scatter matrices are used to examine the
relationships between all pairs of variables.
Heatmaps: Used to visualize correlation matrices or covariance matrices, where color
intensity represents the strength of the relationship.
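A short sketch, assuming pandas, seaborn, and matplotlib, of a heatmap of a correlation matrix as described above:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])
df["D"] = 0.9 * df["A"] + 0.1 * rng.normal(size=100)  # make D correlated with A

corr = df.corr()  # correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()        # stronger relationships appear as more intense colors
```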
Dimensionality Reduction
Dimensionality reduction is used to reduce the number of variables in a dataset while maintaining the
essential information:
Principal Component Analysis (PCA): A technique that reduces the dimensionality of the
dataset by projecting the data onto a set of orthogonal axes (principal components) that
explain the most variance.
t-SNE: A technique for dimensionality reduction that is well-suited for visualizing high-
dimensional data in 2D or 3D space.
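A brief sketch, assuming scikit-learn and its bundled iris dataset, of PCA and t-SNE reducing a 4-dimensional dataset to 2 dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features

# PCA: project onto the 2 orthogonal directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)     # variance captured by each component

# t-SNE: non-linear embedding suited to 2-D/3-D visualization
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_pca.shape, X_tsne.shape)         # both (150, 2)
```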
Feature engineering and dimensionality reduction are critical steps in machine learning
workflows.
They ensure that models are not only accurate but also efficient, interpretable, and scalable.
1. Feature Engineering
Feature engineering involves creating, modifying, or selecting features (variables) from raw data
to improve the performance of machine learning models.
1. Feature Creation
2. Feature Transformation
o Normalization: Scaling values to a specific range, typically [0,1].
o Standardization: Transforming features to have a mean of 0 and a
standard deviation of 1.
o Log Transformation: Reducing the impact of large values by applying the log
function.
o Power Transformation: Stabilizing variance by applying functions like
square root or exponential transformations.
3. Handling Missing Values
o Imputation: Filling missing values with statistical measures (mean,
median, mode) or predictions from models.
o Dropping Features or Rows: Removing features or samples with excessive
missing data.
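A compact sketch, assuming NumPy, scikit-learn, and a small made-up feature matrix, of the transformation and imputation steps listed above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer

X = np.array([[1.0,  200.0],
              [2.0,  np.nan],
              [3.0, 4000.0],
              [4.0,  800.0]])

# Imputation: fill the missing value with the column median
X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Normalization: scale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X_filled)

# Standardization: zero mean and unit standard deviation per feature
X_std = StandardScaler().fit_transform(X_filled)

# Log transformation: damp the effect of very large values (log1p handles 0 safely)
X_log = np.log1p(X_filled)

print(X_norm, X_std, X_log, sep="\n")
```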
Dimensionality Reduction
Dimensionality reduction aims to reduce the number of features while preserving as much relevant
information as possible.
It helps combat issues like overfitting, high computational costs, and the curse of dimensionality.
5. Feature Agglomeration
o Purpose: Groups features with similar characteristics (hierarchical
clustering for features).
o Combines redundant features into a single representative feature.
7. Factor Analysis
o Purpose: Identifies underlying latent variables (factors) that explain
observed variables.
o Assumes that observed data is influenced by a smaller number of
unobservable factors.
Applications: Effective for small datasets where computational cost isn’t a concern.
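A short sketch, assuming scikit-learn and its iris dataset, of the two techniques above: feature agglomeration merging similar features and factor analysis extracting latent factors:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import FeatureAgglomeration
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)                 # 4 original features

# Feature agglomeration: hierarchically cluster the 4 features into 2 groups
X_agglo = FeatureAgglomeration(n_clusters=2).fit_transform(X)

# Factor analysis: explain the observed features with 2 latent factors
X_fa = FactorAnalysis(n_components=2, random_state=0).fit_transform(X)

print(X_agglo.shape, X_fa.shape)                  # both (150, 2)
```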
Pipeline Integration:
Many machine learning frameworks (e.g., scikit-learn) support building pipelines where feature
engineering and dimensionality reduction steps are automated.
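A minimal scikit-learn pipeline sketch, with an illustrative (not prescribed) choice of steps, chaining standardization, PCA, and a classifier so the preprocessing is applied automatically during fitting and prediction:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),            # feature transformation
    ("reduce", PCA(n_components=2)),        # dimensionality reduction
    ("model", LogisticRegression(max_iter=200)),
])

pipe.fit(X_train, y_train)                  # every step is fitted in order
print(pipe.score(X_test, y_test))           # accuracy on unseen data
```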
Hybrid Methods:
Feature engineering and dimensionality reduction techniques can also be combined. For example:
o Combine PCA with feature selection to reduce noise and retain relevant
features.
o Use autoencoders to generate compact features, then apply supervised learning
techniques.
Applications
Text Data:
o Use TF-IDF for feature creation and Latent Semantic Analysis (LSA) for
dimensionality reduction (see the sketch after this list).
Image Data:
Genomic Data:
Sensor Data:
o Combine Fourier transforms for feature extraction and PCA for dimensionality
reduction.
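To illustrate the text-data case above, here is a small sketch, assuming scikit-learn and a few made-up documents, of TF-IDF feature creation followed by Latent Semantic Analysis via truncated SVD:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "machine learning uses data",
    "deep learning is a branch of machine learning",
    "sweater sales fall as temperature rises",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse TF-IDF feature matrix

# LSA: truncated SVD on the TF-IDF matrix reduces it to 2 latent topics
lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(tfidf)

print(tfidf.shape, topics.shape)                # e.g. (3, 14) -> (3, 2)
```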
Best Practices
Understand Data: Always begin with exploratory data analysis (EDA) to understand feature
importance and relationships.
Avoid Over-Reduction: Ensure that dimensionality reduction techniques retain sufficient information to
build an accurate model.
Evaluate: Continuously evaluate feature engineering and dimensionality reduction using cross-
validation.
Chapter – 02
A learning system is a computational system that uses algorithms to learn from data or experiences to
improve its performance over time.
The design of such systems focuses on the following essential steps:
Choosing a Training Experience
The first step in building a learning system is selecting the type of training experience it will use to
learn. This involves determining the source of data and how it will be used.
Direct Experience:
The system is explicitly provided with examples of board states and their correct moves.
Example: In a chess game, the system is given specific board states and the optimal moves for
those states.
Indirect Experience:
Instead of explicit guidance, the system is provided with sequences of moves and their
results.
Example: The system observes the outcome (win or loss) of different move
sequences and learns to optimize its strategy.
In supervised training, a supervisor labels all valid moves for a given board state.
In the absence of a supervisor, the system uses self-play or exploration to learn. For example, a
chess agent can play games against itself and identify successful moves.
o For reliable performance, training samples must cover a wide range of scenarios.
o If the training data and testing data have similar distributions, the system's
performance will be better.
Determining the Target Function
The target function represents the knowledge the system needs to learn.
It specifies the goal of the learning system and what it is trying to predict or optimize.
Once the target function is defined, the next step is deciding how to represent it. The representation
depends on the complexity of the problem and the available computational resources.
Common Representations:
Lookup Tables:
Used for simple problems where all possible states and actions can be enumerated.
Example: A small chessboard with a limited number of moves.
Mathematical Functions:
For complex systems, models like neural networks, decision trees, or support vector machines
are used to approximate the target function.
Example: Using a neural network to predict the best chess moves based on board states.
Function Approximation
In most real-world problems, the target function is too complex to be represented exactly. Instead,
an approximation of the target function is learned.
Approaches to Approximation:
Parametric Models:
Models with a fixed number of parameters (e.g., linear regression, neural networks).
Non-Parametric Models:
Models that adapt their complexity to the amount of data (e.g., k-nearest neighbors, decision trees).
Learning Algorithms:
o Algorithms like gradient descent, reinforcement learning, or evolutionary algorithms are used
to optimize the parameters of the function.
o Example: In a chess game, reinforcement learning allows the agent to learn by trial and
error, optimizing its strategy over time.
Example: Designing a Chess Learning System
Experience:
Use a combination of self-play (indirect experience) and historical game data (direct experience).
Target Function:
Define the target function as selecting the best move M given the board state B, i.e., ChooseMove : B → M.
Representation:
Use a deep neural network to represent the target function, where inputs are board states and
outputs are move probabilities.
Function Approximation:
Train the neural network using reinforcement learning, with rewards based on the outcome of
games played by the system.
Concept learning is a strategy in machine learning that involves acquiring abstract knowledge or
inferring general concepts from the given training data.
It enables the learner to generalize from specific training examples and classify objects or
instances based on common, relevant features.
Concept learning is the process of abstraction and generalization from data. It involves:
Categorization:
o For example, humans classify animals like elephants, cats, or dogs based on specific
distinguishing features.
Boolean-Valued Function:
o Each concept or category learned is represented as a Boolean function that returns true or
false:
True for positive examples that belong to the category.
False for negative examples that do not belong to the category.
Example: Learning the concept "elephant".
Input: A set of labeled examples of animals, marked as positive (elephants) or negative (not elephants).
Output: Target concept for an elephant, e.g., "has a trunk," "has tusks," and "large size."
Testing: The learned concept is used to classify new, unseen instances as elephants or not.
The concept-learning process involves:
Training:
o The learner observes a set of labeled examples (positive and negative instances).
o It identifies common, relevant features from the positive examples and contrasts them
with negative examples.
Hypothesis Formation:
o The learner forms a hypothesis (a candidate description of the concept) that is consistent with the observed examples.
Generalization:
o The hypothesis is generalized so that it correctly classifies unseen instances, not just the training examples.
A machine learning model abstracts a training dataset and makes predictions on unseen data.
Training: Involves feeding training data into a machine learning algorithm, tuning parameters, and
generating a predictive model.
Goals: Selecting the right model, training effectively, reducing training time, and achieving high
performance on unseen data.
Types of Parameters:
Model Parameters: Learnable directly from training data (e.g., regression coefficients,
decision tree splits, neural network weights).
Hyperparameters: Not learned from the training data; they are set before training and tuned
externally (e.g., learning rate, tree depth, k in k-nearest neighbors).
Dataset Splitting: Dividing the data into a training set used to fit the model and a test set used to
evaluate it (sometimes with a separate validation set for tuning).
Error Types:
o Training Error (In-sample Error): Error when the model is tested on training data.
o Test Error (Out-of-sample Error): Error when predicting on unseen test data.
Loss Function: Measures prediction error. Example: Mean Squared Error (MSE)—a smaller
value indicates higher accuracy.
Algorithm Selection: Choose a model suitable for the problem and dataset.
Challenges:
Resampling Methods
o Random Train/Test Splits: Randomly split the data for training and testing.
o K-fold Cross-Validation: Split data into k parts, train on k-1 folds, and test on the
remaining fold.
o Stratified K-fold: Ensures each fold contains a proportionate distribution of class labels.
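A quick sketch, assuming scikit-learn and its iris dataset, of k-fold and stratified k-fold cross-validation as described above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# 5-fold CV: train on 4 folds, test on the remaining fold, rotate 5 times
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())

# Stratified folds keep the class proportions the same in every fold
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=strat).mean())
```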
Precision-Recall Curve: Plots precision against recall at different classification thresholds; particularly
useful for evaluating models on imbalanced datasets.
Scoring Models: Combine model performance and complexity into a single score. Example: the
Minimum Description Length (MDL) principle, which selects the simplest model, i.e., the one requiring the fewest bits to represent both the data and the predictions.