
Module 2 Notes

Prepared By
Indu K S
Asst. Professor, Dept. of ISE, TOCE, Bangalore
Module 2
• Understanding Data – 2: Bivariate Data and Multivariate
Data,
• Multivariate Statistics,
• Essential Mathematics for Multivariate Data,
• Feature Engineering and Dimensionality Reduction
Techniques.
• Basic Learning Theory: Design of Learning System,
• Introduction to Concept of Learning,
• Modelling in Machine Learning.

Chapter-2 (2.6-2.8, 2.10), Chapter-3 (3.3, 3.4, 3.6)


2.6 BIVARIATE DATA AND MULTIVARIATE DATA
• Bivariate data involves two variables.
• Bivariate analysis deals with causes of and relationships between the two variables.
• The aim is to find relationships among the data.
Consider the following Table 2.3, with data of the temperature in a shop and sales of
sweaters.
Scatter plots and line graphs are used to visualize bivariate data.
Scatter plot
A scatter plot is a 2D graph showing the relationship between two variables.
It is a plot of the explanatory variable against the response variable. The scatter plot (refer
Figure 2.11) indicates the strength, shape, direction and the presence of outliers.
2.6.1 Bivariate Statistics
Covariance and Correlation are examples of bivariate statistics.

1. Covariance
Covariance is a measure of the joint variability of two random variables, say X and Y.
Generally, random variables are represented in capital letters.
It is written as covariance(X, Y) or COV(X, Y) and is used to measure how the two
dimensions vary together. The formula for the covariance of the observed pairs (xᵢ, yᵢ) is:

COV(X, Y) = (1/N) Σ (xᵢ − E[X]) (yᵢ − E[Y]), summed over i = 1 to N.
2. Correlation
The Pearson correlation coefficient normalizes covariance by the product of the standard
deviations of X and Y:

Correlation(X, Y) = COV(X, Y) / (σX σY)

Its value always lies between −1 and +1, which makes it easier to interpret than covariance.
2.7 MULTIVARIATE STATISTICS
• In machine learning, almost all datasets are multivariate.
• Multivariate analysis deals with more than two observable variables.
• Multivariate data is like bivariate data but may have more than two
dependent variables.
• Some of the multivariate analyses are regression analysis, principal
component analysis, and path analysis.

The mean of multivariate data is a mean vector; for the three example attributes, the mean
vector is (2, 7.5, 1.33).
The variance of multivariate data becomes the covariance matrix. The mean
vector is called the centroid, and the covariance matrix is called the dispersion matrix.
Multivariate data has three or more variables.
The aims of multivariate analysis are broader than those of bivariate analysis; common
techniques include regression analysis, factor analysis, and multivariate analysis of variance (MANOVA).
Heatmap
• A heatmap is a graphical representation of data where individual values in a matrix are
represented as colors.
• It is commonly used in data science and machine learning to visualize correlations
between variables, feature importance, or distributions in datasets.
• A heatmap takes a 2D matrix as input and colours each cell.
• It is like a table, but instead of numbers, colours are used to indicate the values.
• Darker colours indicate larger values and lighter colours indicate smaller values.
Understanding Correlation Heatmaps
A correlation heatmap is a special type of heatmap that visualizes the correlation between variables
in a dataset. It helps identify relationships between features.
Correlation values range from -1 to 1:
● +1 → Strong positive correlation (one increases, the other also increases).
● -1 → Strong negative correlation (one increases, the other decreases).
● 0 → No correlation.

Example Multivariate Dataset


Let's create a simple multivariate dataset (10 samples, 4 features) and compute the correlation
matrix. Then, we'll visualize it using a heatmap.
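Below is a minimal sketch of how such a dataset and heatmap could be produced in Python. The feature names (Age, Height, Weight, Blood Pressure) and the sample values are hypothetical, chosen only so that Height/Weight and Age/Blood Pressure come out strongly correlated, roughly in line with the interpretation that follows.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical multivariate dataset: 10 samples, 4 features
data = pd.DataFrame({
    "Age":           [25, 32, 47, 51, 62, 29, 41, 56, 38, 45],
    "Height_cm":     [160, 172, 168, 175, 158, 180, 165, 170, 178, 162],
    "Weight_kg":     [55, 72, 68, 78, 54, 85, 62, 70, 80, 58],
    "BloodPressure": [115, 120, 128, 132, 140, 118, 125, 136, 122, 127],
})

# Correlation matrix (Pearson correlation by default)
corr = data.corr()

# Heatmap: warm colours = strong positive, cool colours = strong negative
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap of the example dataset")
plt.show()
```

The exact correlation values depend on the sample data used; with real data, the same two steps (data.corr() followed by sns.heatmap) are all that is needed.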
Interpretation of the Correlation Heatmap
● The correlation matrix shows how strongly each variable is related to the others.
● Dark red areas indicate a strong positive correlation, while dark blue areas indicate a strong negative
correlation.
● Key insights from this dataset:
○ Height & Weight have a strong positive correlation (~0.97), meaning taller people tend to weigh
more.
○ Age & Blood Pressure are also positively correlated (~0.98), suggesting blood pressure increases
with age.
○ Other relationships have moderate correlations.
• The advantage of this method is that humans perceive colours well.
• So, through colour shading, larger values can be perceived easily.
For example, in vehicle traffic data, heavy-traffic regions can be differentiated from
low-traffic regions through a heatmap.
• In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the
X-axis shows weights and the Y-axis shows patient counts. The dark-coloured regions
highlight patients’ weights versus patient counts by health status.
Pairplot
• Pairplot or scatter matrix is a data visual technique for multivariate
data.
• A scatter matrix consists of several pair-wise scatter plots of
variables of the multivariate data.
• All the results are presented in a matrix format. By visual
examination of the chart, one can easily find relationships among
the variables such as correlation between the variables.
• A random matrix of three columns is chosen and the relationships
of the columns is plotted as a pairplot (or scattermatrix) as shown
below in Figure 2.14.
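A minimal sketch of how such a pairplot (scatter matrix) could be generated, assuming a random matrix of three columns as described above:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical random matrix with three columns
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["col1", "col2", "col3"])

# Pairplot / scatter matrix: pairwise scatter plots off the diagonal,
# histograms of each individual column on the diagonal
sns.pairplot(df)
plt.show()
```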
Another example
Interpretation of the Pairplot
● The scatterplots show the relationships between each pair of variables.
● The diagonal histograms show the distribution of each individual variable.
● Key observations:
○ Study Hours & Exam Score: Strong positive correlation (students who
study more tend to score higher).
○ Study Hours & Sleep Hours: Negative correlation (students who study
more sleep less).
○ Stress Level & Sleep Hours: Negative correlation (less sleep leads to higher
stress).
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE
DATA
Gaussian Elimination Method
Gaussian elimination is a systematic method for solving systems of linear equations by
transforming the augmented matrix into row echelon form (upper triangular form) using row
operations.
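As an illustration, here is a minimal Gaussian elimination sketch (no pivoting or singularity checks) for a small system Ax = b; the example matrix and vector are chosen arbitrarily:

```python
import numpy as np

def gaussian_elimination(A, b):
    # Build the augmented matrix [A | b]
    aug = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    n = len(b)
    # Forward elimination: reduce to upper triangular (row echelon) form
    for i in range(n):
        for j in range(i + 1, n):
            factor = aug[j, i] / aug[i, i]
            aug[j, i:] -= factor * aug[i, i:]
    # Back substitution on the upper triangular system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i, -1] - aug[i, i + 1:n] @ x[i + 1:]) / aug[i, i]
    return x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(gaussian_elimination(A, b))   # [0.8 1.4], matches np.linalg.solve(A, b)
```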
Probability Distributions
The probability distribution gives the probability of each outcome of a random
experiment or event.
Example: Rolling a Die 🎲
If you roll a fair six-sided die, the possible outcomes are: 1, 2, 3, 4, 5, 6.
Probability of an event = (number of ways the event can occur) / (total number of possible
outcomes), so each face of a fair die has probability 1/6.
Exponential Distribution

It describes how long you have to wait for something to happen.
Imagine you are waiting for a bus that arrives randomly every 10 minutes (on average).
● Sometimes, it comes earlier, sometimes later.
● The Exponential Distribution models how long you will wait.
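A small sketch using SciPy, under the assumption above that the bus arrives on average every 10 minutes (so the mean, or scale, of the Exponential distribution is 10 minutes):

```python
from scipy.stats import expon

# Bus arrives on average every 10 minutes => mean (scale) = 10
mean_wait = 10

# Probability of waiting at most 5 minutes: 1 - e^(-5/10)
print(expon.cdf(5, scale=mean_wait))    # ~0.39
```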
3.1.2 Discrete Probability Distributions
Binomial, Poisson, and Bernoulli distributions fall under this category
Binomial Distribution: models the number of times an event happens in a fixed number of independent trials,
where each trial has only two possible outcomes:

● Success (e.g., heads in a coin flip)


● Failure (e.g., tails in a coin flip)

Example: Imagine you flip a coin 5 times.

● The probability of getting heads (H) = 0.5


● The probability of getting tails (T) = 0.5
● The number of heads you get follows a Binomial Distribution.
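Continuing the coin example, a short sketch that computes Binomial probabilities with SciPy (5 flips, P(heads) = 0.5, as stated above):

```python
from scipy.stats import binom

n, p = 5, 0.5                 # 5 independent flips, P(heads) = 0.5

# Probability of exactly 3 heads: C(5,3) * 0.5^5
print(binom.pmf(3, n, p))     # 0.3125

# Probability of at most 2 heads
print(binom.cdf(2, n, p))     # 0.5
```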
Poisson Distribution
The Poisson Distribution is used to model the number of times an event occurs in a fixed time or
space interval, where events happen randomly and independently at a constant rate.
Example: Customers Arriving at a Bank
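A minimal sketch of the bank example, assuming (hypothetically) that customers arrive at an average rate of 4 per hour:

```python
from scipy.stats import poisson

lam = 4                         # assumed average arrivals per hour

# Probability that exactly 2 customers arrive in a given hour
print(poisson.pmf(2, lam))      # ~0.147

# Probability that more than 6 customers arrive in an hour
print(1 - poisson.cdf(6, lam))  # ~0.11
```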
Bernoulli Distribution
The Bernoulli Distribution models a situation where there are only two possible outcomes:
1. Success (1) – The event happens (e.g., getting heads in a coin flip).
2. Failure (0) – The event does not happen (e.g., getting tails in a coin flip).
It is the simplest probability distribution and is used when an event happens only once.

Example: Flipping a Coin
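A tiny sketch of the coin-flip Bernoulli example (success = heads, with p = 0.5):

```python
from scipy.stats import bernoulli

p = 0.5                        # probability of success (heads)

print(bernoulli.pmf(1, p))     # P(success) = 0.5
print(bernoulli.pmf(0, p))     # P(failure) = 0.5
```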


Density Estimation
Density estimation is a way to figure out the shape of a dataset when we don’t
know its exact distribution. It helps us understand how data is spread out and can be
used to detect unusual (anomalous) points(like outliers).
Ex: Imagine you have a bag of marbles of different sizes and colors, but you don’t
know how many of each type exist in the bag.
● You take out a few marbles (observed data) and count them.
● Based on this sample, you estimate how many of each type exist in the whole
bag (density estimation).
Let there be a set of observed values x1,x2,…,xn from a larger set of data whose
distribution is not known.
Density estimation is the problem of estimating the density function from an
observed data.
How Does It Work?
1. We collect observed values (e.g., heights of people, temperatures, sales data).
2. We estimate the density function, which tells us how frequently different values appear.
3. We use this function to check new data points:
○ If a new point fits well within the estimated density, it is normal.
○ If it is far from the expected density, it is an anomaly (outlier).
The estimated density function, denoted as p(x), can be used to evaluate the density directly for any
unknown data point, say xt, as p(xt).
If p(xt) is greater than or equal to a threshold ϵ, then xt is treated as normal data. Else, if p(xt) < ϵ,
it is categorized as anomaly data (an outlier).
There are two types of density estimation methods:
• Parametric density estimation and
• Non-parametric density estimation.
1. Parametric Density Estimation
Parametric density estimation assumes that the data follows a known probability
distribution (such as Normal, Exponential, or Poisson) and estimates the parameters of
that distribution.
The probability density function (PDF) is represented as: p(x∣Θ)
where Θ represents the set of parameters of the chosen distribution.
Maximum Likelihood Estimation(MLE)
MLE is a method used in statistics and machine learning to find the best parameters
(like mean and variance) for a probability distribution that best explains the given data.
Ex: Imagine you are a detective trying to figure out which suspect committed a crime.
You analyze clues (data) and try to find the most likely suspect (parameters).
Similarly, in MLE, we analyze data and try to find the most likely values for
parameters of a probability distribution.
How Does MLE Work?
1. Start with a probability distribution
○ We assume the data follows a known distribution, like Normal (Gaussian), Binomial, or Exponential.
2. Define the Likelihood Function

○ This function tells us how likely it is to see the observed data given some
parameter values.
3. Maximize the Likelihood

○ We adjust the parameters so that the likelihood of seeing our data is as high as
possible.
○ This gives us the best estimate of the parameters.
Maximum Likelihood Estimation (MLE) is a method used in parametric density
estimation to estimate the parameters of a probability distribution by maximizing the
likelihood function. It aims to find the parameter values that make the observed data
most probable.
In parametric density estimation, we assume that the data follows a specific probability
distribution (e.g., Normal, Exponential, Binomial) with unknown parameters. MLE helps
us determine these parameters.
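As a concrete sketch (assuming the data is modelled by a Normal distribution, for which the MLE has a closed form), the maximum-likelihood estimates are simply the sample mean and the variance computed with a divisor of n:

```python
import numpy as np

# Hypothetical observations assumed to come from a Normal distribution
data = np.array([4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.1])

mu_mle = data.mean()                        # MLE of the mean
var_mle = ((data - mu_mle) ** 2).mean()     # MLE of the variance (divides by n, not n-1)

print(mu_mle, var_mle)
```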
The relevance of this theory of MLE for machine learning is that
MLE can solve the problem of predictive modelling.
If one assumes that the regression problem can be framed as
predicting an output y given an input x, then for p(y∣x) the MLE
framework can be applied as: max over h of Σ log p(yᵢ ∣ xᵢ, h),
where the sum runs over the training examples i.
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm

In machine learning, clustering is one of the important tasks. MLE framework is quite useful for
designing model-based methods for clustering data.
1. Gaussian Mixture Model (GMM)
A Gaussian Mixture Model (GMM) is a soft clustering algorithm that assumes data points
come from multiple Gaussian (bell-shaped) distributions. Instead of assigning each point to
one specific cluster (like K-Means), GMM gives a probability score for a data point belonging
to different clusters.
Example:
Imagine you have height and weight data for people from three different countries, but you
don’t know which country each person belongs to.
● A GMM would assume the data comes from three different Gaussian distributions.
● Instead of saying "this person is definitely from country A", GMM will say "this person
has a 70% chance of being from country A, 20% from B, and 10% from C."
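A hedged sketch of soft clustering with a GMM using scikit-learn. The height/weight data below is synthetic, standing in for the three-country example; GaussianMixture fits the model with the EM algorithm described in the next section.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic height (cm) / weight (kg) data drawn from three hypothetical groups
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([160, 55], [5, 4], size=(50, 2)),   # group A
    rng.normal([170, 70], [5, 5], size=(50, 2)),   # group B
    rng.normal([180, 85], [6, 6], size=(50, 2)),   # group C
])

# Fit a 3-component Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: probability of the first point belonging to each component
print(gmm.predict_proba(X[:1]))   # three probabilities that sum to 1 (values vary)
```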
2. Expectation-Maximization (EM) Algorithm
The EM algorithm is one algorithm that is commonly used for estimating the
MLE in the presence of latent or missing variables.

The EM algorithm is a smart way for a computer to learn hidden patterns
in data, especially when some information is missing or unknown. It helps
us estimate the best parameters for a model when direct calculation is
difficult.
Ex1: Imagine you are in a dark room with different objects, but you can’t
see them clearly. You try to guess what they are by touching them
(Expectation step), then use that guess to improve your understanding
(Maximization step). You keep repeating this until you’re confident about
what’s in the room.
Ex2: Guessing Student Heights in a Classroom
● Suppose you have a group of students, but you only know their weights, not
their heights.
● The EM algorithm first guesses heights based on weight (E-step).
● Then, it adjusts the guessed heights to make them fit the data better (M-step).
● It keeps improving these guesses until the best estimates are found.
Generally, there can be many unspecified distributions with different sets
of parameters. The EM algorithm has two stages:
Expectation (E) Stage – Guess the missing or hidden values based on
current estimates. In this stage, the expected PDF and its parameters are
estimated for each latent variable.

Maximization (M) Stage – Update the model’s parameters (like mean
and variance) to fit the guessed data better. In this stage, the parameters are
optimized using the MLE function.
This process is iterative, and the iteration is continued till all the latent
variables are fitted by probability distributions effectively along with the
parameters.
2. Non-parametric Density Estimation
Non-parametric density estimation is a way to figure out how data is
distributed without assuming a specific shape (like a normal or binomial
distribution). It lets the data itself decide the shape.
Ex: Imagine you are a chef trying to guess the most popular dish in a
restaurant. Instead of assuming a fixed menu, you observe what people
order and count the most frequent dishes. Over time, you get an estimate of
what people prefer.
Similarly, in non-parametric density estimation, we observe data points
and estimate how they are spread out—without assuming a specific shape.
Parzen Window
The Parzen Window is a non-parametric method used to estimate the probability
density function (PDF) of a dataset. It helps us understand how data is distributed without
assuming a fixed shape (like normal or exponential distributions).
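A minimal Parzen-window style sketch with a Gaussian kernel, using SciPy's gaussian_kde (the 1-D sample values are arbitrary):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Observed 1-D samples whose distribution is unknown
samples = np.array([1.2, 1.9, 2.1, 2.4, 3.0, 3.1, 3.3, 4.8, 5.0, 5.1])

# Kernel density estimate: each sample contributes a small Gaussian "window"
kde = gaussian_kde(samples)

# Estimated density at a few query points
print(kde([2.0, 3.0, 10.0]))   # density at 10.0 should be close to zero
```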
KNN Estimation
• The KNN estimation is another non-parametric density estimation method.
• Here, the initial parameter k is fixed and, based on it, the k nearest neighbours of a
query point are determined.
• The probability density function estimate is the average of the values that are
returned by the neighbours (one common form is sketched below).
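A hedged 1-D sketch of a common k-NN density estimate, p(x) ≈ k / (n · 2·r_k), where r_k is the distance from x to its k-th nearest sample (this standard formulation is used here purely for illustration):

```python
import numpy as np

def knn_density(x, samples, k=3):
    # Distance from x to every sample, sorted ascending
    dists = np.sort(np.abs(samples - x))
    r_k = dists[k - 1]                       # distance to the k-th nearest neighbour
    return k / (len(samples) * 2 * r_k)      # k / (n * width of the 1-D window)

samples = np.array([1.2, 1.9, 2.1, 2.4, 3.0, 3.1, 3.3, 4.8, 5.0, 5.1])
print(knn_density(2.0, samples))    # higher density: many samples nearby
print(knn_density(10.0, samples))   # lower density: far from the data
```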
2.10 FEATURE ENGINEERING AND DIMENSIONALITY
REDUCTION TECHNIQUES
Features are attributes. Feature engineering is about determining the subset
of features that form an important part of the input that improves the
performance of the model in machine learning.
Feature engineering deals with two problems –
1.Feature Transformation and
2.Feature Selection.
Feature transformation: is extraction of features and creating new features
that may be helpful in increasing performance.
Ex: The height and weight may give a new attribute called Body Mass Index
(BMI).
Feature subset selection: is another important aspect of feature engineering that focuses
on selection of features to reduce the time but not at the cost of reliability.
The subset selection reduces the dataset size by removing irrelevant features and
constructs a minimum set of attributes for machine learning.
• If the dataset has n attributes, then time complexity is extremely high as n dimensions
need to be processed for the given dataset.
• For n attributes, there are 2ⁿ possible subsets.
• If the value of n is high, the problem becomes intractable (difficult). This is called the
‘curse of dimensionality’.
• As the number of dimensions increases, the time complexity increases. The
remedy is to remove components that do not contribute much.
• This results in the reduction of dimensionality.
• Typically, the feature subset selection problem uses a greedy approach: at each step it
makes the locally optimal (best) choice, hoping that this leads to a globally optimal solution.
The features can be removed based on two aspects:
1. Feature relevancy: Some features contribute more for classification than other
features. Ex: a mole on the face can help in face detection more than common
features like the nose.
In simple words, the features should be relevant. The relevancy of the features can
be determined based on information measures such as mutual information,
correlation-based features like correlation coefficient and distance measures.

2. Feature redundancy: Some features are redundant.

Ex: when a database table has a field called Date of Birth, the Age field is
redundant, since age can be computed easily from the date of birth. Removing the
Age column reduces the dimensionality by one.
So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection
Filter-based selection:
Uses statistical measures for assessing features. In this approach, no learning
algorithm is used. Correlation and information gain measures like mutual
information and entropy are all examples of this approach.
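A small illustrative sketch of filter-based selection with scikit-learn, scoring features by mutual information and keeping the top two (the Iris dataset is used here only as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature with mutual information and keep the 2 best;
# no learning algorithm is involved in the scoring itself
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)    # mutual information score of each feature
print(X_reduced.shape)     # (150, 2)
```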
Wrapper-based methods:
Use classifiers to identify the best features. These are selected and evaluated
by the learning algorithms. This procedure is computationally intensive but
has superior performance.
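For comparison, a sketch of a wrapper-based method using recursive feature elimination (RFE), where a classifier is used to rank and drop the weakest features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The wrapped classifier is trained repeatedly; the weakest feature is
# eliminated at each round until 2 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected
```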
2.10.1 Stepwise Forward Selection
This procedure starts with an empty set of attributes. Every time, an
attribute is tested for statistical significance for best quality and is added to
the reduced set. This process is continued till a good reduced set of attributes
is obtained.
2.10.2 Stepwise Backward Elimination
This procedure starts with a complete set of attributes. At every stage, the
procedure removes the worst attribute from the set, leading to the reduced
set.
Combined Approach
Both the forward and backward methods can be combined so that, at each step, the
procedure adds the best attribute and removes the worst one.
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors are fundamental concepts in linear algebra, widely used in machine
learning, PCA, computer vision, and deep learning.
Ex: Imagine you have a rubber sheet (a 2D surface), and you stretch or shrink it in different directions.
● Eigenvectors are the special directions in which the stretching happens.
● Eigenvalues tell you how much the stretching (or shrinking) happens in those directions.

Formal Definition:
For a square matrix A, an eigenvector v and its corresponding eigenvalue λ satisfy:

Av=λv
This means:
● When you multiply matrix A with vector v, it only scales the vector (does not change its
direction).
● The scaling factor is the eigenvalue λ.
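A short numerical sketch with NumPy that verifies the defining property Av = λv for an arbitrary 2×2 matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Eigen decomposition: columns of 'vectors' are the eigenvectors
values, vectors = np.linalg.eig(A)
print(values)            # eigenvalues, here 4.0 and 2.0 (order may vary)

# Verify A @ v == lambda * v for the first eigenpair
v, lam = vectors[:, 0], values[0]
print(np.allclose(A @ v, lam * v))   # True
```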
2.10.3 Principal Component Analysis
PCA (Principal Component Analysis) is a method used in machine learning and statistics to
reduce the number of variables in a dataset while keeping the most important information.
This leads to a reduced and compact set of features. Basically, this elimination is made
possible because of the information redundancies. This compact representation is of a reduced
dimension.
Ex: Imagine you have a big collection of books and want to organize them efficiently.
● Instead of sorting them by every small detail (title, author, genre, year, pages, price,
etc.),
● You pick only the most important factors (genre and author) to classify them.
PCA does something similar—it reduces the number of features (variables) in a dataset but
keeps the most important patterns.
The PCA algorithm is as follows:

1. The target dataset x is obtained.

2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is x − m. The objective
of this process is to transform the dataset to zero mean.

3. The covariance of dataset x is obtained. Let it be C.

4. Eigenvalues and eigenvectors of the covariance matrix are calculated.

5. The eigenvector of the highest eigenvalue is the principal component of the dataset. The eigenvalues are arranged
in a descending order. The feature vector is formed with these eigenvectors in its columns.

Feature vector = {eigenvector 1, eigenvector 2, ..., eigenvector n}


6. Obtain the transpose of the feature vector. Let it be A

7. PCA transform is y=A×(x−m), where x is the input dataset, m is the mean, and A is the transpose of the feature
vector.

The original data can be retrieved as x = Aᵀ × y + m, since the eigenvectors are orthonormal and hence the inverse of A is its transpose.
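A compact sketch of these steps in NumPy on a small hypothetical 2-feature dataset; because all components are kept here, the reconstruction is exact:

```python
import numpy as np

# Hypothetical dataset: rows = samples, columns = features
x = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

m = x.mean(axis=0)                 # step 2: mean
X = x - m                          # zero-mean (adjusted) dataset

C = np.cov(X, rowvar=False)        # step 3: covariance matrix

eigvals, eigvecs = np.linalg.eig(C)      # step 4: eigenvalues / eigenvectors

order = np.argsort(eigvals)[::-1]        # step 5: sort by descending eigenvalue
A = eigvecs[:, order].T                  # step 6: transpose of the feature vector

y = (A @ X.T).T                    # step 7: PCA transform y = A (x - m)

x_back = (A.T @ y.T).T + m         # reconstruction x = A^T y + m
print(np.allclose(x_back, x))      # True (all components kept)
```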
