Module-2 Notes-BCS602

Machine Learning (Visvesvaraya Technological University)


MACHINE LEARNING- MODULE-2

MODULE 2
Understanding Data – 2: Bivariate Data and Multivariate Data, Multivariate Statistics, Essential
Mathematics for Multivariate Data, Feature Engineering and Dimensionality Reduction Techniques.

Basic Learning Theory: Design of Learning System, Introduction to Concept of Learning, Modelling in
Machine Learning.

2.6 BIVARIATE DATA AND MULTIVARIATE DATA


• Bivariate data involves two variables.
• Bivariate analysis deals with the relationship between the two variables, which can be used for comparisons, for finding causes, and for further exploration.
• A scatter plot is used to visualise bivariate data.
• The scatter plot indicates the strength, shape, and direction of the relationship, and the presence of outliers. It is useful in exploratory analysis before calculating a correlation coefficient or fitting a regression curve.
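As an illustration, the following is a minimal Python sketch of a scatter plot for bivariate data; the temperature and sales values are made-up numbers, not data from these notes.

import matplotlib.pyplot as plt

temperature = [18, 21, 24, 27, 30, 33, 36]          # hypothetical X variable
ice_cream_sales = [60, 70, 80, 95, 110, 125, 140]   # hypothetical Y variable

plt.scatter(temperature, ice_cream_sales)            # each point is one (x, y) observation
plt.xlabel("Temperature")
plt.ylabel("Ice-cream sales")
plt.title("Scatter plot of bivariate data")
plt.show()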


Bivariate Statistics:
Covariance:
• Covariance is a measure of the joint variability of two random variables, say X and Y.
• Random variables are represented in capital letters. Covariance is written as COV(X, Y) and is used to measure how two dimensions vary together:
  COV(X, Y) = (1/N) Σᵢ (xᵢ − E(X)) (yᵢ − E(Y))
• Here xᵢ and yᵢ are data values from X and Y, E(X) and E(Y) are the mean values of X and Y, and N is the number of data points.

• Example:
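Since the worked example in the notes is given as a figure, here is a minimal sketch of the covariance formula on two small illustrative arrays (assumed values, not the notes' example):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 5, 4, 7])

# COV(X, Y) = (1/N) * sum((x - E(X)) * (y - E(Y)))   (population covariance)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)
print(np.cov(x, y, bias=True)[0, 1])   # same value computed by NumPy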


Correlation
• It is the most common test for determining an association between two phenomena.
• It measures the strength and direction of a linear relationship between two variables x and y.
• The correlation coefficient indicates the relationship between dimensions using its sign.
• The sign is more important than the actual value.
  o If the value is positive, it indicates that the dimensions increase together.
  o If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
  o If the value is zero, it indicates that both dimensions are independent of each other.
  o If two dimensions are highly correlated, it is better to remove one of them, as it is a redundant dimension.
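A minimal sketch of the Pearson correlation coefficient, r = COV(X, Y) / (σX σY), on the same illustrative arrays as above:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 5, 4, 7])

r = np.corrcoef(x, y)[0, 1]   # value lies between -1 and +1
print(r)                      # positive sign: the two dimensions increase together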

2.7 MULTIVARIATE STATISTICS


• Multivariate data involves observations on more than two variables.
• Multivariate analysis is similar to bivariate analysis but may have more than two dependent variables.
Heatmap:
• It is a graphical representation of a 2D matrix.
• It takes a matrix as input and colours it.
• Darker colours indicate larger values and lighter colours indicate smaller values.
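A minimal sketch of a heatmap, assuming the seaborn library is available; the 5 x 5 matrix is random and purely illustrative:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

matrix = np.random.rand(5, 5)                     # any 2D matrix can be used as input
sns.heatmap(matrix, annot=True, cmap="Blues")     # darker cells correspond to larger values
plt.show()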


Pairplot:
• A pair plot or scatter matrix is a visualisation technique for multivariate data.
• It consists of several pairwise scatter plots of the variables of the multivariate data.
• All the plots are presented in a matrix format, revealing relationships among the variables, such as correlation between the variables.
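A minimal sketch of a pair plot, assuming seaborn is available and its bundled iris dataset can be loaded:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")     # multivariate data: four numeric variables plus a class label
sns.pairplot(iris, hue="species")   # matrix of pairwise scatter plots reveals correlations
plt.show()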


2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA:


• Linear algebra is a branch of mathematics that is used in many scientific and other mathematical subjects.
• It deals with linear equations, vectors, matrices, vector spaces, and linear transformations.

1. Linear Systems and Gaussian Elimination for Multivariate Data


• A linear system of equations is a group of equations with unknown variables.
• Let Ax = y. The solution is x = A⁻¹y; this holds provided A is non-singular (invertible).

• For a system of n equations in n unknown variables, the system can be written in matrix form as Ax = y.


• If there is a unique solution, the system is called consistent independent. If there are infinitely many solutions, the system is called consistent dependent. If there are no solutions and the equations are contradictory, the system is called inconsistent.
• For solving a large system of equations, Gaussian elimination can be used.
• The procedure for applying Gaussian elimination is to form the augmented matrix [A | y], reduce it to row-echelon form using row operations, and then obtain the solution by back substitution.

• Row operations applied:


o Swapping the rows
o Multiplying or dividing a row by a constant
o Replacing a row by adding or subtracting a multiple of another row to it
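A minimal sketch of Gaussian elimination using the row operations listed above (partial pivoting added for numerical stability); the 3 x 3 system is an assumed example, and numpy.linalg.solve is used only as a cross-check:

import numpy as np

def gaussian_elimination(A, y):
    A = A.astype(float).copy()
    y = y.astype(float).copy()
    n = len(y)
    # Forward elimination: reduce A to upper-triangular form using row operations
    for i in range(n):
        pivot = i + np.argmax(np.abs(A[i:, i]))
        A[[i, pivot]], y[[i, pivot]] = A[[pivot, i]], y[[pivot, i]]   # swap rows
        for j in range(i + 1, n):
            factor = A[j, i] / A[i, i]
            A[j, i:] -= factor * A[i, i:]    # subtract a multiple of row i from row j
            y[j] -= factor * y[i]
    # Back substitution on the upper-triangular system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
y = np.array([8.0, -11.0, -3.0])
print(gaussian_elimination(A, y))   # expected solution: [2, 3, -1]
print(np.linalg.solve(A, y))        # cross-check with NumPy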


2. Matrix decompositions:
• Matrix decomposition is a way of reducing a matrix into its constituent parts, such as eigen values and eigen vectors.
• A matrix A can be decomposed using its eigen values and eigen vectors as:
  o A x = λ x, where λ is an eigen value and x is the corresponding eigen vector of A.
  o A = Q Λ Q⁻¹, where Q is the matrix of eigen vectors and Λ is the diagonal matrix of eigen values.
• LU decomposition:
  o A matrix A can be decomposed as A = LU.
  o L is a lower triangular matrix and U is the upper triangular matrix.


• From the worked example, it can be observed that the first matrix is L, the lower triangular matrix, whose entries are the multipliers (3, 3, 2/3) used in the reduction of the equations.
• The second matrix is U, the upper triangular matrix, whose entries are the values of the reduced matrix obtained by Gaussian elimination.
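A minimal sketch of both decompositions using NumPy/SciPy on an assumed 2 x 2 matrix (SciPy's LU routine also returns a permutation matrix P from row swaps):

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# Eigen decomposition: A v = lambda v
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)

# LU decomposition: A = P L U
P, L, U = lu(A)
print(L)   # lower triangular; off-diagonal entries are the elimination multipliers
print(U)   # upper triangular; the matrix after Gaussian elimination
print(np.allclose(P @ L @ U, A))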

3. Machine Learning and Importance of Probability and Statistics

• Statistics is used to analyse data, and any data will generate a probability distribution.
Probability Distributions:
• The probability distribution of a variable, say X, summarises the probability associated with X's events.
• A distribution is a parameterised mathematical function that describes the relationship between the observations in the sample space.
• Probability distributions are of two types
a. Discrete Probability Distribution
b. Continuous Probability Distribution
• The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarised by a Probability Density Function (PDF), which gives the likelihood of observing an instance.
• The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value.
• For a continuous variable, the probability of a single exact outcome cannot be obtained directly; it is computed as the area under the PDF curve over a small interval around the specific outcome.
Types of continuous probability distribution
1.Normal Distribution:
• It is a continuous probability distribution, also known as the Gaussian distribution or the bell-shaped curve distribution.
• Data tends to be around a central value with no bias to the left or right.
• Examples: heights of students, blood pressure of a population, marks scored in a class.
• The normal distribution with mean μ and standard deviation σ is given as:
  f(x) = (1 / (σ √(2π))) exp( −(x − μ)² / (2σ²) )
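A minimal sketch of the normal distribution using SciPy, with assumed values μ = 170 and σ = 10 (e.g., heights of students in cm); it also shows the "area under a small interval" idea from the CDF discussion above:

import numpy as np
from scipy.stats import norm

mu, sigma = 170.0, 10.0
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)

pdf = norm.pdf(x, loc=mu, scale=sigma)     # bell-shaped density values
cdf = norm.cdf(x, loc=mu, scale=sigma)     # P(X <= x)
samples = norm.rvs(loc=mu, scale=sigma, size=1000)

# Probability of observing a height near 175 cm = area under the PDF over a small interval
print(norm.cdf(175.5, mu, sigma) - norm.cdf(174.5, mu, sigma))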


Density estimation is the problem of estimating the density function from observed data. The estimated density function, denoted p(x), can be used to evaluate the density directly for any unknown data point, say xt, as p(xt).
There are two types of density estimation methods: Parametric density estimation and Non-Parametric density estimation.
1. Parametric Density Estimation: it assumes that the data is drawn from a known probability distribution, which can be estimated as p(x | Ɵ), where Ɵ is the parameter of the distribution.


i) Maximum Likelihood Estimation (MLE): it is a framework that can be used for density estimation. It involves formulating a function called the likelihood function, which is the conditional probability of observing the observed samples given the distribution function and its parameters. If the observations are X = {x1, x2, ....., xn}, then density estimation is the problem of choosing a PDF with suitable parameters to describe the data.
  o MLE treats this as a search or optimisation problem, where the probability should be maximised over the joint probability of X and its parameter Ɵ.
  o This is expressed as p(X; Ɵ), where X = {x1, x2, ....., xn}. The likelihood of observing the data is given as a function L(X; Ɵ). The objective of MLE is to maximise this function: max L(X; Ɵ).
  o The joint probability of this problem can be restated as the sum of log-probabilities: L(X; Ɵ) = Σᵢ log p(xᵢ; Ɵ).
  o Equivalently, its negative, −Σᵢ log p(xᵢ; Ɵ), can be minimised; this is called the negative log-likelihood function.
  o For regression, the MLE framework can be applied to predict the output y given the input x, i.e. to p(y | x). The objective becomes maximising Σᵢ log p(yᵢ | xᵢ; β), where β is the regression coefficient and xᵢ is a given sample.
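A minimal sketch of MLE by minimising the negative log-likelihood, assuming the data comes from a one-dimensional Gaussian; the true parameters (5, 2) are made up, and the estimates should come out close to the sample mean and standard deviation:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=500)      # observed samples x1..xn

def negative_log_likelihood(params):
    mu, log_sigma = params                         # optimise log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(X, loc=mu, scale=sigma))

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                           # close to X.mean() and X.std()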


ii) Gaussian Mixture Model and Expectation Maximization(EM) algorithm


o The MLE framework is used for designing a model-based method for clustering data. A model is a statistical method, and the data is assumed to be generated by a distribution model with a parameter Ɵ. There may be many distributions involved, and that is why it is called a mixture model. Gaussians are normally assumed for the data, and this mixture model is then called a Gaussian Mixture Model (GMM).
o The EM algorithm is used for estimating the parameters in the presence of latent or missing variables. As an example of latent variables, assume the dataset includes the weights of boys and girls, where boys' weights are slightly higher than girls' weights. The larger weights are generated by a Gaussian distribution with one set of parameters, while the girls' weights are generated with another set of parameters. Gender influences the weight but is not directly present or observable; this type of variable is called a latent variable.
o The EM algorithm has two stages:
  1. Expectation (E) stage: in this stage, the expected PDF and its parameters are estimated for each latent variable.
  2. Maximization (M) stage: in this stage, the parameters are optimised using the MLE function. The process is iterative, and iterations continue until all the latent variables are fitted effectively by the probability distributions along with their parameters.
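A minimal sketch of a two-component GMM fitted with EM (via scikit-learn), mirroring the boys/girls weight example; all weights and group sizes are assumed, synthetic numbers:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
girls = rng.normal(loc=55.0, scale=5.0, size=300)        # one latent group
boys = rng.normal(loc=68.0, scale=6.0, size=300)         # the other latent group
weights = np.concatenate([girls, boys]).reshape(-1, 1)   # gender itself is not observed

gmm = GaussianMixture(n_components=2, random_state=0)    # EM iterations run inside fit()
gmm.fit(weights)
print(gmm.means_.ravel())        # estimated component means, roughly 55 and 68
print(gmm.predict(weights[:5]))  # most likely latent component for the first samples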

Non-Parametric Density Estimation: it can be generative or discriminative.


There are two methods:
1. Parzen window: it is a generative estimation method that finds p(Ɵ | x) as a posterior probability.
  o Let there be n samples X = {x1, x2, ....., xn}.
  o The samples are drawn independently; they are said to be independent and identically distributed (i.i.d.). Let R be the region that covers 'k' samples out of the total 'n' samples; the probability of the region is p = k/n, and the density estimate is given by p(x) = k / (nV),
  o where V is the volume of the region R. If R is a hypercube centred at x and h is the side length of the hypercube, the volume V is h² for a 2D square and h³ for a 3D cube.
  o The Parzen window function φ((x − xi)/h) is given as follows: it equals 1 if the sample xi falls within the hypercube of side h centred at x, and 0 otherwise.


  o Thus the window function indicates whether a sample is inside the region or not.
  o The Parzen probability density function estimate is the normalised count of samples falling inside the window: p(x) = (1/n) Σᵢ (1/V) φ((x − xᵢ)/h).
2. KNN estimation: it is another non-parametric density estimation method. The initial parameter k is determined, and based on it the k nearest neighbours are found. The estimated probability density is the average of the values returned by the neighbours.
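A minimal one-dimensional sketch of both non-parametric estimators, written directly from the relation p(x) ≈ k / (nV); the standard-normal samples are assumed test data:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=200)           # n i.i.d. samples

def parzen_density(x, samples, h=0.5):
    # Fix the window size h, count the samples inside it, then divide by n * V
    inside = np.abs(samples - x) <= h / 2.0
    return inside.sum() / (len(samples) * h)

def knn_density(x, samples, k=10):
    # Fix k; the volume grows until it contains k neighbours (in 1D, V = 2 * k-th distance)
    distances = np.sort(np.abs(samples - x))
    return k / (len(samples) * 2.0 * distances[k - 1])

print(parzen_density(0.0, samples))   # roughly the standard normal density at 0 (about 0.4)
print(knn_density(0.0, samples))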

2.9 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES


• Feature engineering and dimensionality reduction are critical steps in machine learning workflows.
• They ensure that models are not only accurate but also efficient, interpretable, and scalable.
1. Feature Engineering:
a. Feature engineering involves creating, modifying, or selecting features (variables) from
raw data to improve the performance of machine learning models.
b. Feature engineering deals with two problems:
i. Feature Transformation: it is the extraction of features and the creation of new features that may be helpful in improving performance. For example, from height and weight a new attribute called body mass index (BMI) can be created.
ii. Feature Selection: it focuses on selecting a subset of features to reduce training time, but not at the cost of reliability. It reduces the data size by removing irrelevant features and constructs a minimum set of attributes for machine learning.
o Filter Methods: Using statistical tests (e.g., correlation, chi-square) to select features.
o Wrapper Methods: Selecting features based on the performance of a model (e.g.,
recursive feature elimination).
o Embedded Methods: Feature selection integrated into model training (e.g.,
regularization methods like LASSO).
iii. Features can be removed based on two aspects:
  o Feature relevance: a feature should be relevant. Relevance can be determined using information measures such as mutual information, correlation-based measures such as the correlation coefficient, and distance measures. Some features contribute more to classification than

other features; for example, a mole on the face can help in face detection more than a common feature such as the nose.
  o Feature Redundancy: some features are redundant. For example, when a database table has a field called date of birth, the age field is redundant, as age can be computed easily from the date of birth. Removing the age column leads to a reduction of one dimension.
  o The procedure for feature subset selection is:
    i. Generate all possible subsets of features.
    ii. Evaluate each subset using model performance.
    iii. Evaluate the results and select the optimal feature subset.
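A minimal sketch of a filter-style selection using mutual information as the relevance measure, with scikit-learn's bundled iris dataset assumed as illustrative data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=2)   # keep the 2 most relevant features
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)                 # relevance score of each feature
print(X.shape, "->", X_reduced.shape)   # (150, 4) -> (150, 2)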

2. Dimensionality Reduction Algorithms


i. Stepwise forward selection:
  o It starts with an empty set of attributes.
  o At every step, each remaining attribute is tested for statistical significance, and the best-quality attribute is added to the reduced set.
  o The process continues until a good reduced set of attributes is obtained.
ii. Stepwise backward elimination:
  o This procedure starts with the complete set of attributes.
  o At every stage, the procedure removes the worst attribute from the set and produces the reduced set.
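A minimal sketch of both stepwise strategies as wrapper methods, assuming scikit-learn's SequentialFeatureSelector (available from version 0.24) and a logistic-regression model chosen only for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=2, direction="backward").fit(X, y)

print(forward.get_support())    # features added one at a time starting from an empty set
print(backward.get_support())   # features removed one at a time starting from the full set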
iii. Principal Component Analysis(PCA):
o PCA is a statistical technique introduced by the mathematician Karl Pearson in 1901. It is also known as the Karhunen-Loeve (KL) transform.
o It is used to transform a given set of measurement to a new set of features so that the features
exhibit high information packing properties.
o It works by transforming high-dimensional data into a lower-dimensional space while
maximizing the variance (or spread) of the data in the new space. This helps preserve the most
important patterns and relationships in the data.
o The PCA algorithm works as follows:
  1. The target dataset x is obtained.
  2. The mean is subtracted from the dataset to transform it into a zero-mean dataset.
     Mean: mx = E(x), where E is the expected value over the population and x is a random vector of the form x = (x1, x2, ...., xn)ᵀ.


  3. The covariance matrix of the dataset x is obtained: C = E{(x − mx)(x − mx)ᵀ}.
     For M random vectors, when M is large enough, the mean vector and the covariance matrix can be calculated as:
     mx ≈ (1/M) Σₖ xₖ and C ≈ (1/M) Σₖ xₖxₖᵀ − mxmxᵀ

  4. Eigen values and eigen vectors of the covariance matrix are calculated. An eigen vector is a non-zero vector that remains in the same direction after a linear transformation, scaled by its corresponding eigen value.
  5. The eigen vector with the highest eigen value is the principal component of the dataset. The eigen values are arranged in descending order, and the feature vector is formed with the corresponding eigen vectors as its columns. Feature vector = {eigenvector1, eigenvector2, ...., eigenvectorn}.
  6. Obtain the transpose of the feature vector; let it be A. The mapping of vectors x to y using the transformation is described as y = A(x − mx). This transform is called the Karhunen-Loeve or Hotelling transform.
  7. Thus the PCA transform is y = A(x − mx), where x is the input, mx is the mean, and A is the transpose of the feature vector matrix.
  8. The original data can be retrieved using the formula given below:
     Original data x = A⁻¹ y + mx = Aᵀ y + mx (since A is orthogonal, A⁻¹ = Aᵀ).

• A scree plot is a visualisation technique used to show how much each principal component contributes, and hence how many components are important.
• For example, for a randomly selected dataset with 246 attributes, PCA is applied and its scree plot indicates that only 6 out of the 246 attributes are important.


Example: let the data points be (2, 6) and (1, 7). Apply PCA and find the transformed data. Then apply the inverse transform and verify that the original data is recovered, showing that PCA works. A sketch of this computation is given below.
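A minimal sketch of the PCA steps in NumPy, assuming the two data points are (2, 6) and (1, 7); the inverse transform is verified at the end:

import numpy as np

X = np.array([[2.0, 6.0],
              [1.0, 7.0]])                # assumed data points, one per row

mean = X.mean(axis=0)                     # step 2: mean vector m_x
Xc = X - mean                             # zero-mean data
C = (Xc.T @ Xc) / X.shape[0]              # step 3: covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)      # step 4: eigen values / eigen vectors
order = np.argsort(eigvals)[::-1]         # step 5: descending order of eigen values
A = eigvecs[:, order].T                   # step 6: transpose of the feature vector

Y = (A @ Xc.T).T                          # step 7: y = A (x - m_x)
X_back = (A.T @ Y.T).T + mean             # step 8: original data = A^T y + m_x
print(Y)
print(np.allclose(X_back, X))             # True: the original data is recovered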


iv. Linear Discriminant Analysis (LDA)


o Purpose: Similar to PCA but focuses on maximizing class separability in supervised learning
tasks.
o Projects data onto a lower-dimensional space while maintaining class distinction.
Applications: Often used in classification problems.
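A minimal sketch of LDA as a supervised projection, again assuming scikit-learn and its iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_lda = lda.fit_transform(X, y)                    # projection that maximises class separability
print(X_lda.shape)                                 # (150, 2)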

v. Singular Value Decomposition (SVD):


Procedure:
1. For a given matrix A, find AAᵀ.
2. Find the eigen values of AAᵀ.
3. Sort the eigen values in descending order and pack the corresponding eigen vectors as the matrix U.
4. Arrange the square roots of the eigen values (the singular values) along the diagonal of a matrix S.


5. Find the eigen values and eigen vectors of AᵀA, and pack the eigen vectors as a matrix called V.
6. Thus A = USVᵀ, where U and V are orthogonal matrices.

Examples:
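As an illustrative example, here is a minimal NumPy sketch on an assumed 2 x 2 matrix, cross-checked against the eigen-value-based procedure described above:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Direct SVD: A = U S V^T, with U and V orthogonal
U, S, Vt = np.linalg.svd(A)
print(S)                                     # singular values in descending order

# Cross-check: singular values are the square roots of the eigen values of A A^T
eigvals = np.linalg.eigvalsh(A @ A.T)
print(np.sqrt(np.sort(eigvals)[::-1]))

print(np.allclose(U @ np.diag(S) @ Vt, A))   # reconstruction of A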
