Module-2
Bivariate Data
Bivariate data involves two variables, and the goal of bivariate analysis is to explore the relationship between them. Such a relationship can help in making comparisons, identifying possible causes, and guiding further exploration of the data.
Consider the following Table 2.3, which records the temperature in a shop and the corresponding sales of sweaters.
Scatter Plot
A scatter plot graphs paired observations of two variables and reveals several aspects of their relationship:
Strength: Indicates how closely the data points fit a pattern or trend.
Shape: Helps in identifying the type of relationship (linear, quadratic, etc.).
Direction: Shows whether the relationship is positive, negative, or neutral.
Outliers: Helps identify any points that deviate significantly from the trend.
Scatter plots are often used in the exploratory phase of data analysis before calculating
correlation coefficients or fitting regression models.
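As a minimal sketch, the following Python snippet draws such a scatter plot; the temperature and sales values are hypothetical stand-ins, since the actual figures from Table 2.3 are not reproduced here.

import matplotlib.pyplot as plt

# Hypothetical temperature (degrees C) and sweater-sales values
temperature = [5, 8, 12, 15, 18, 22, 25, 28]
sales = [95, 88, 74, 60, 48, 35, 22, 15]

plt.scatter(temperature, sales)
plt.xlabel("Temperature in the shop (degrees C)")
plt.ylabel("Sweaters sold")
plt.title("Scatter plot of temperature vs. sweater sales")
plt.show()

The downward-sloping cloud of points indicates a strong negative, roughly linear relationship: sweater sales fall as the temperature rises.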
Bivariate Statistics
There are various statistical measures to describe the relationship between two
variables.
Covariance
Covariance measures the joint variability of two random variables. It tells you whether
an increase in one variable results in an increase or decrease in the other variable.
Mathematically, the covariance between two variables X and Y is defined as
Cov(X, Y) = (1/N) Σ (xi - x̄)(yi - ȳ),
where x̄ and ȳ are the means of X and Y and N is the number of paired observations.
Covariance values:
Positive covariance: As one variable increases, the other variable also increases.
Negative covariance: As one variable increases, the other variable decreases.
Zero covariance: No linear relationship between the variables.
Correlation
While covariance indicates the direction of the relationship, correlation quantifies both the strength and the direction of the relationship between two variables. The Pearson correlation coefficient normalizes covariance by the standard deviations of the two variables, r = Cov(X, Y) / (σX σY), so r always lies between -1 and +1.
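A minimal sketch of both measures, computed directly from their definitions with NumPy on hypothetical paired observations:

import numpy as np

# Hypothetical paired observations (e.g., shop temperature vs. sweater sales)
x = np.array([5, 8, 12, 15, 18, 22, 25, 28], dtype=float)
y = np.array([95, 88, 74, 60, 48, 35, 22, 15], dtype=float)

# Covariance: mean of the products of deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson correlation: covariance normalized by the standard deviations
corr_xy = cov_xy / (x.std() * y.std())

print("Covariance :", cov_xy)    # negative: y decreases as x increases
print("Correlation:", corr_xy)   # close to -1: strong negative linear relationship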
Multivariate Statistics
Multivariate data refers to data that involves more than two variables, and in machine
learning, most datasets are multivariate.
Multivariate analysis can involve multiple dependent (response) variables and is often used for analyzing more complex data scenarios. Common multivariate techniques include:
Regression Analysis
Principal Component Analysis (PCA)
Path Analysis
The mean vector is used to represent the mean of multiple variables, and the covariance
matrix represents the variance and relationships among all variables.
The mean vector is also known as the centroid, while the covariance matrix is also
referred to as the dispersion matrix.
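A minimal sketch of computing the mean vector (centroid) and covariance matrix (dispersion matrix) with NumPy, using a small hypothetical data matrix with one row per sample:

import numpy as np

# Hypothetical multivariate data: 5 samples, 3 variables (one row per sample)
X = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 2.0],
              [4.0, 8.0, 2.5],
              [5.0, 9.0, 3.0],
              [6.0, 12.0, 4.0]])

mean_vector = X.mean(axis=0)          # centroid: mean of each variable
cov_matrix = np.cov(X, rowvar=False)  # variances on the diagonal, covariances off-diagonal

print("Mean vector (centroid):\n", mean_vector)
print("Covariance matrix (dispersion matrix):\n", cov_matrix)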
Regression Analysis: Models how one or more response variables depend on a set of predictor variables.
Factor Analysis: Explains the correlations among observed variables in terms of a smaller number of unobserved latent factors.
Heatmap
Applications:
Heatmaps are useful for visualizing complex data like traffic patterns or patient health
data, where you can easily identify regions of higher or lower values.
Example:
In vehicle traffic data, regions with heavy traffic are highlighted with dark colors, making
it easy to spot problem areas.
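A minimal sketch of such a heatmap using Matplotlib; the traffic counts are randomly generated stand-ins for illustration only:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical traffic counts for 5 regions over 7 days
traffic = np.random.randint(10, 100, size=(5, 7))

plt.imshow(traffic, cmap="hot", aspect="auto")
plt.colorbar(label="Vehicles per hour")
plt.xlabel("Day")
plt.ylabel("Region")
plt.title("Heatmap of vehicle traffic")
plt.show()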
A pairplot (or scatter matrix) is a matrix of scatter plots that shows relationships
between every pair of variables in a multivariate dataset.
A random matrix with three columns is chosen, and the pairwise relationships among the columns are plotted as a pairplot (or scatter matrix), as shown in Figure 2.14.
Visual Layout: Each scatter plot in the matrix shows the relationship between two
variables.
Usefulness: By examining the pairplot, you can easily identify patterns,
correlations, or clusters among the variables.
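A minimal sketch of the idea behind Figure 2.14 using pandas' scatter_matrix on a random three-column matrix (the column names are placeholders):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Random matrix with three columns, as in the Figure 2.14 description
data = pd.DataFrame(np.random.randn(100, 3), columns=["col1", "col2", "col3"])

# Scatter matrix: every pair of columns plotted against each other,
# with a histogram of each column on the diagonal
scatter_matrix(data, diagonal="hist", figsize=(6, 6))
plt.show()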
In the realm of machine learning and multivariate data analysis, several mathematical
concepts are foundational.
These include concepts from Linear Algebra, Statistics, Probability, and Optimization.
Below is an overview of essential mathematical tools that are necessary for
understanding and working with multivariate data.
Linear Algebra
Linear algebra is crucial in machine learning as it provides the tools for dealing with data in the form of vectors and matrices; key topics include vectors, matrices, matrix operations, and eigenvalues and eigenvectors (used, for example, in PCA).
Statistics
Mean and Variance: Measures of central tendency (mean) and spread (variance)
are essential to understanding the distribution of each variable.
Covariance: Covariance measures the relationship between two variables. A
positive covariance indicates that as one variable increases, the other tends to
increase.
Correlation: Correlation is a normalized measure of covariance that indicates the
strength and direction of the relationship between two variables.
Multivariate Normal Distribution: Many machine learning algorithms assume that
the data follows a multivariate normal distribution, which extends the idea of
normal distribution to more than one variable.
Principal Component Analysis (PCA): PCA is used to reduce the dimensionality of
the dataset while retaining as much variance as possible. It uses eigenvectors
and eigenvalues to identify the principal components.
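A minimal sketch of PCA via the eigenvalues and eigenvectors of the covariance matrix, on randomly generated data with three correlated features (the data and the choice of two components are illustrative assumptions):

import numpy as np

# Hypothetical data: 100 samples, 3 features, with feature 3 correlated with feature 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

# 1. Centre the data and compute the covariance matrix
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)

# 2. Eigen-decompose the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by explained variance and project onto the top 2
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
X_reduced = X_centred @ components

print("Explained variance ratio:", eigvals[order[:2]] / eigvals.sum())
print("Reduced shape:", X_reduced.shape)   # (100, 2)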
Probability
Probability provides the framework for reasoning about uncertainty in data and predictions, through concepts such as random variables, probability distributions, conditional probability, and Bayes' theorem.
Optimization
Optimization is key to finding the best model for multivariate data. Many machine
learning algorithms are formulated as optimization problems.
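A minimal sketch of one such optimization problem: fitting a linear model by gradient descent on the mean squared error, using randomly generated data and an illustrative learning rate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)   # parameters to learn
lr = 0.1          # learning rate (step size)

for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
    w -= lr * grad                          # step against the gradient

print("Learned weights:", w)   # approximately [2, -1]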
Multivariate Analysis
Scatter Plots: A scatter plot can be used to visualize the relationship between two
variables. For multivariate data, pair plots or scatter matrices are used to examine
the relationships between all pairs of variables.
Heatmaps: Used to visualize correlation matrices or covariance matrices, where
color intensity represents the strength of the relationship.
Feature Engineering and Dimensionality Reduction
Feature engineering and dimensionality reduction are critical steps in machine learning
workflows.
They ensure that models are not only accurate but also efficient, interpretable, and
scalable.
1. Feature Engineering
1. Feature Creation
2. Feature Transformation (a small numeric sketch of these transformations follows this list)
o Normalization: Scaling values to a specific range, typically [0,1].
o Standardization: Transforming features to have a mean of 0 and a
standard deviation of 1.
o Log Transformation: Reducing the impact of large values by applying the
log function.
o Power Transformation: Stabilizing variance by applying functions like
square root or exponential transformations.
3. Handling Missing Values
o Imputation: Filling missing values with statistical measures (mean,
median, mode) or predictions from models.
o Dropping Features or Rows: Removing features or samples with excessive
missing data.
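A minimal numeric sketch of the transformations above, together with a simple mean imputation; the feature values are hypothetical and the choices of transformation are illustrative:

import numpy as np

# Hypothetical feature with a large outlier and a missing value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

# Imputation: fill the missing value with the mean of the observed values
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)

# Normalization: rescale to the [0, 1] range
x_norm = (x_imputed - x_imputed.min()) / (x_imputed.max() - x_imputed.min())

# Standardization: zero mean, unit standard deviation
x_std = (x_imputed - x_imputed.mean()) / x_imputed.std()

# Log transformation: compress the influence of the large value
x_log = np.log1p(x_imputed)

print(x_norm, x_std, x_log, sep="\n")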
Dimensionality Reduction
Dimensionality reduction lowers the number of input features while retaining as much useful information as possible. It helps combat issues like overfitting, high computational costs, and the curse of dimensionality.
5. Feature Agglomeration
o Purpose: Groups features with similar characteristics (hierarchical
clustering for features).
o Combines redundant features into a single representative feature.
7. Factor Analysis
o Purpose: Identifies underlying latent variables (factors) that explain
observed variables.
o Assumes that observed data is influenced by a smaller number of
unobservable factors.
Pipeline Integration:
Many machine learning frameworks (e.g., scikit-learn) support building pipelines where
feature engineering and dimensionality reduction steps are automated.
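A minimal sketch of such a pipeline in scikit-learn, chaining scaling, PCA, and a classifier; the choice of the Iris dataset, two components, and logistic regression are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling, dimensionality reduction, and the model chained in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))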
Hybrid Methods:
For example:
o Combine PCA with feature selection to reduce noise and retain relevant
features.
o Use autoencoders to generate compact features, then apply supervised
learning techniques.
Applications
Text Data:
o Use TF-IDF for feature creation and Latent Semantic Analysis (LSA) for dimensionality reduction (see the sketch after this list).
Image Data:
Genomic Data:
Sensor Data:
o Combine Fourier transforms for feature extraction and PCA for dimensionality
reduction.
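A minimal sketch of the text-data workflow mentioned above, TF-IDF followed by LSA via truncated SVD in scikit-learn; the toy corpus and the choice of two latent components are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus
docs = [
    "machine learning extracts patterns from data",
    "deep learning is a branch of machine learning",
    "traffic data can be visualised with heatmaps",
    "sweater sales fall as shop temperature rises",
]

# TF-IDF feature creation followed by LSA (truncated SVD) for dimensionality reduction
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print("TF-IDF shape:", tfidf.shape)   # documents x vocabulary terms
print("LSA shape   :", lsa.shape)     # documents x 2 latent topics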
Best Practices
Understand Data: Always begin with exploratory data analysis (EDA) to understand
feature importance and relationships.
Chapter – 02
A learning system is a computational system that uses algorithms to learn from data or
experiences to improve its performance over time.
The first step in building a learning system is selecting the type of training experience it
will use to learn. This involves determining the source of data and how it will be used.
Direct Experience:
The system is explicitly provided with examples of board states and their correct
moves.
Example: In a chess game, the system is given specific board states and the optimal
moves for those states.
Indirect Experience:
Instead of explicit guidance, the system is provided with sequences of moves and
their results.
Example: The system observes the outcome (win or loss) of different move
sequences and learns to optimize its strategy.
In supervised training, a supervisor labels all valid moves for a given board state.
In the absence of a supervisor, the system uses self-play or exploration to learn. For
example, a chess agent can play games against itself and identify successful moves.
o For reliable performance, training samples must cover a wide range of scenarios.
o If the training data and testing data have similar distributions, the system's
performance will be better.
The target function represents the knowledge the system needs to learn.
It specifies the goal of the learning system and what it is trying to predict or optimize.
Once the target function is defined, the next step is deciding how to represent it. The
representation depends on the complexity of the problem and the available
computational resources.
Common Representations:
Lookup Tables:
Used for simple problems where all possible states and actions can be enumerated.
Example: A small chessboard with a limited number of moves.
Mathematical Functions:
For complex systems, models like neural networks, decision trees, or support vector
machines are used to approximate the target function.
Example: Using a neural network to predict the best chess moves based on board
states.
Function Approximation
Approaches to Approximation:
Parametric Models:
Models with a fixed number of parameters (e.g., linear regression, neural networks).
Non-Parametric Models:
Models that adapt their complexity to the amount of data (e.g., k-nearest neighbors,
decision trees).
Learning Algorithms: Optimization procedures (such as gradient descent or reinforcement learning) that adjust the chosen representation to fit the training experience.
Example: Designing a Chess-Playing Learning System
Training Experience:
Use a combination of self-play (indirect experience) and historical game data (direct
experience).
Target Function:
Define the target function as a function F that selects the best move M for a given board state B, i.e., F(B) = M.
Representation:
Use a deep neural network to represent the target function, where inputs are board states and outputs are move probabilities.
Function Approximation:
Train the neural network using reinforcement learning, with rewards based on the
outcome of games played by the system.
Concept Learning
Concept learning enables the learner to generalize from specific training examples and classify objects or instances based on common, relevant features. It is the process of abstraction and generalization from data, and it involves:
Categorization:
o For example, humans classify animals like elephants, cats, or dogs based on
specific distinguishing features.
Boolean-Valued Function: The target concept can be represented as a boolean-valued function that returns true for positive instances of the concept and false for negative instances.
Example: Learning the concept "elephant."
Input: A set of animal instances, each labeled as a positive or negative example of the concept.
Training:
o The learner observes a set of labeled examples (positive and negative instances).
o It identifies common, relevant features from the positive examples and contrasts
them with negative examples.
Hypothesis Formation: The learner forms a candidate description (hypothesis) of the concept from the identified features.
Generalization: The hypothesis is generalized from the specific training examples so that it also covers unseen instances of the concept.
Output: Target concept for an elephant, e.g., "has a trunk," "has tusks," and "large size."
Testing: New animal instances are classified based on the learned concept.
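A minimal sketch of this elephant example, assuming boolean features and a simplified Find-S-style generalization (intersecting the features of the positive examples); the feature names and instances are hypothetical:

# Positive and negative training instances, described by sets of boolean features
positives = [
    {"has_trunk", "has_tusks", "large_size", "grey"},
    {"has_trunk", "has_tusks", "large_size", "brown"},
]
negatives = [
    {"has_tusks", "small_size"},    # e.g., a warthog
    {"large_size", "has_horn"},     # e.g., a rhinoceros
]

# Generalization: keep only the features shared by every positive example
hypothesis = set.intersection(*positives)
print("Learned concept:", hypothesis)   # {'has_trunk', 'has_tusks', 'large_size'}

# Contrast with negatives: no negative example should satisfy the hypothesis
assert not any(hypothesis <= neg for neg in negatives)

# Testing: classify a new instance by checking whether it satisfies the hypothesis
new_animal = {"has_trunk", "has_tusks", "large_size", "grey"}
print("Is it an elephant?", hypothesis <= new_animal)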
Training: Involves feeding training data into a machine learning algorithm, tuning
parameters, and generating a predictive model.
Goals: Selecting the right model, training effectively, reducing training time, and
achieving high performance on unseen data.
Types of Parameters:
Model Parameters: Learned directly from the training data (e.g., regression coefficients, decision tree splits, neural network weights).
Hyperparameters: Set before training rather than learned from the data (e.g., learning rate, tree depth, number of neighbors in k-NN).
Dataset Splitting: The data is divided into training and test sets (and often a validation set) so that performance can be estimated on data the model has not seen.
Error Types:
o Training Error (In-sample Error): Error when the model is tested on training data.
o Test Error (Out-of-sample Error): Error when predicting on unseen test data.
Loss Function: Measures prediction error. Example: Mean Squared Error, MSE = (1/n) Σ (yi - ŷi)², where a smaller value indicates higher accuracy.
Algorithm Selection: Choose a model suitable for the problem and dataset.
Challenges:
Approaches:
Resampling Methods
Random Train/Test Splits: Randomly split the data for training and testing.
o K-fold Cross-Validation: Split data into k parts, train on k-1 folds, and test on the
remaining fold.
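A minimal sketch of k-fold cross-validation with scikit-learn; the Iris dataset, the decision tree model, and k = 5 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold,
# repeating so that every fold is used once for testing
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())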
Precision-Recall Curve: Plots precision against recall at different decision thresholds, which is especially useful for evaluating classifiers on imbalanced data.
Scoring Models: Combine model performance and complexity into a single score.
Minimum Description Length (MDL): Selects the simplest model, the one requiring the fewest bits to represent both the data and the predictions.