
Module 2- Machine Learning (BCS602)

Module 2
Understanding Data
Bivariate and Multivariate data, Multivariate statistics, Essential mathematics for Multivariate data,
Overview of hypothesis, Feature engineering and dimensionality reduction techniques. Basics of Learning
Theory: Introduction to learning and its types, Introduction to computational learning theory, Design of a
learning system, Introduction to concept learning. Similarity-based learning: Introduction to similarity or
instance-based learning, Nearest-Neighbour learning, Weighted k-Nearest-Neighbour algorithm.

CHAPTER 2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables. Bivariate analysis deals with causes and relationships; the aim is
to find relationships between the two variables. Consider Table 2.3, which records the temperature in
a shop and the corresponding sales of sweaters.

Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be
used in comparisons, in finding causes, and in further explorations. To do that, a graphical display of the data is
necessary. One such graphical method is the scatter plot.

A scatter plot is used to visualize bivariate data. It is useful for plotting two variables, with or without nominal
variables, to illustrate trends and to show differences. It is a plot between the explanatory and response
variables: a 2D graph showing the relationship between two variables. Line graphs are similar to scatter
plots. The line chart for the sales data is shown in Figure 2.12.

2.6.1 Bivariate Statistics


Covariance and Correlation are examples of bivariate statistics. Covariance is a measure of the joint variability
of two random variables, say X and Y. Generally, random variables are represented in capital letters. It is defined
as covariance(X, Y) or COV(X, Y) and is used to measure how the two variables vary together. The formula
for the covariance of the observations (xi, yi) is:

COV(X, Y) = (1/N) Σ (xi − E(X)) (yi − E(Y)), summed over i = 1 … N

Here, xi and yi are data values from X and Y. E(X) and E(Y) are the mean values of the xi and yi. N is the number
of given data points. Also, COV(X, Y) is the same as COV(Y, X).

If the given attributes are X = (x1, x2, …, xN) and Y = (y1, y2, …, yN), then the Pearson correlation coefficient,
denoted r, is given as:

r = COV(X, Y) / (σX σY)

where σX and σY are the standard deviations of X and Y.
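
As a quick illustration, here is a minimal NumPy sketch of these two formulas. The temperature and sales figures below are invented placeholders, not the values from Table 2.3:

import numpy as np

# Hypothetical temperature (°C) and sweater-sales figures, for illustration only
temperature = np.array([5, 10, 15, 20, 25, 30], dtype=float)
sales = np.array([300, 250, 190, 140, 90, 40], dtype=float)

n = len(temperature)
# Covariance: average of the products of deviations from the means
cov_xy = np.sum((temperature - temperature.mean()) * (sales - sales.mean())) / n
# Pearson correlation: covariance divided by the product of standard deviations
r = cov_xy / (temperature.std() * sales.std())
print(cov_xy, r)   # r close to -1 indicates a strong negative relationship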

2.7 MULTIVARIATE STATISTICS


In machine learning, almost all datasets are multivariate. Multivariate data involves three or more
observable variables, and often thousands of measurements are collected for one or more subjects. The aims
of multivariate analysis are broader than those of bivariate analysis; they include regression analysis,
factor analysis and multivariate analysis of variance.

Heatmap
A heat map is a graphical representation of data where individual values are represented by colors.
Heat maps are often used in data analysis and visualization to show patterns, density, or intensity of
data points in a two-dimensional grid.
Example: Let's consider a heat map to display the average temperatures (in °C) across different regions in
a country over a week. Each cell in the heat map will represent a temperature for a specific region on a
specific day. This is useful to quickly identify trends, such as higher temperatures in certain regions or
specific days with unusual weather patterns. The color gradient (from blue to red) indicates the
temperature range: cooler colors represent lower temperatures, while warmer colors represent higher
temperatures.
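
A minimal sketch of such a heat map using pandas and seaborn (libraries not named in the text; the region names and temperature values below are invented for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical average temperatures (°C) per region per day
data = pd.DataFrame(
    [[31, 33, 35, 34, 32, 30, 29],
     [24, 25, 27, 26, 25, 23, 22],
     [18, 19, 21, 22, 20, 19, 17]],
    index=["North", "Central", "South"],
    columns=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# 'coolwarm' maps low values to blue and high values to red,
# matching the cool-to-warm gradient described above
sns.heatmap(data, annot=True, cmap="coolwarm")
plt.title("Average temperature (°C) by region and day")
plt.show()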

Deepa S, Asst. Professor, Dept of CSE, RNSIT 2


Module 2- Machine Learning (BCS602)

Pairplot
A pairplot, or scatter matrix, is a data visualization technique for multivariate data. A scatter matrix consists
of several pair-wise scatter plots of the variables of the multivariate data. A random matrix of three columns
is chosen and the relationships between the columns are plotted as a pairplot (scatter matrix), as shown in Figure 2.14.
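
A minimal sketch using seaborn, mirroring the random three-column matrix mentioned above (the column names are arbitrary):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Random matrix with three columns, as in the example
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

# Pairplot: pair-wise scatter plots of every column against every other,
# with the distribution of each column on the diagonal
sns.pairplot(df)
plt.show()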

2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA

Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics,
Probability and Information theory. The subsequent sections discuss important aspects of linear algebra
and probability.

2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data


A linear system of equations is a group of equations with unknown variables. Let Ax = y; then, provided A is
invertible (non-singular), the solution x is given as x = A^(-1) y. The logic can be extended to a system of n
equations with n unknown variables: if A is the n × n coefficient matrix and y = (y1, y2, …, yn), then the unknown
vector x can be computed as x = A^(-1) y.

If there is a unique solution, then the system is called consistent independent. If there are multiple
solutions, then the system is called consistent dependent. If there are no solutions and the equations are
contradictory, then the system is called inconsistent.

For solving a large system of equations, Gaussian elimination can be used. The
procedure for applying Gaussian elimination is given as follows:
1. Write the given coefficient matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate a21 in the second row using the row operation
R2 ← R2 − (a21/a11) R1, where R2 is the 2nd row and (a21/a11) is called the multiplier.
The same logic is used to eliminate the coefficients of x1 in all the other rows.
4. Repeat the same logic for the remaining rows and columns and reduce the matrix to row echelon form.
The last unknown variable is then obtained directly from the last row as xn = yn/ann (using the entries of the reduced matrix).
5. The remaining unknown variables can then be found by back-substitution as:
xi = (yi − Σ aij xj) / aii, where the sum runs over j = i+1 … n, again using the entries of the reduced matrix.

To facilitate the application of the Gaussian elimination method, the following row operations are
applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it

These concepts are illustrated in Example 2.8.
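
Example 2.8 itself is not reproduced here; the following is a minimal NumPy sketch of the procedure above, applied to an invented 3 × 3 system:

import numpy as np

def gaussian_elimination(A, y):
    """Solve Ax = y by forward elimination followed by back-substitution."""
    A = A.astype(float).copy()
    y = y.astype(float).copy()
    n = len(y)
    # Forward elimination: zero out the entries below each pivot
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]          # the multiplier a_ik / a_kk
            A[i, k:] -= m * A[k, k:]
            y[i] -= m * y[k]
    # Back-substitution on the upper-triangular system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
y = np.array([1.0, 2.0, 5.0])
print(gaussian_elimination(A, y))      # should match np.linalg.solve(A, y)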

2.8.2 Matrix Decomposition


It is often necessary to decompose a matrix into its constituent parts so that complex matrix operations can be
performed more easily. If Q is the matrix whose columns are the eigenvectors of a symmetric matrix A, then A can be decomposed as:

A = Q Λ Q^T

where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues and Q^T is the transpose of matrix Q.

LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into
two matrices: A = LU. Here, L is a lower triangular matrix and U is an upper triangular matrix. The
decomposition can be done using the Gaussian elimination method discussed in the previous section. First,
an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are
applied to reduce the given matrix and obtain the matrices L and U. Example 2.9 illustrates the application of
Gaussian elimination to obtain L and U.

Now, it can be observed that the first matrix is L, the lower triangular matrix, whose entries are the
multipliers used in the reduction of the equations above (such as 3, 3 and 2/3).
The second matrix is U, the upper triangular matrix, whose entries are the values of the matrix reduced
by Gaussian elimination.
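
A minimal sketch using SciPy (the 3 × 3 matrix is made up; note that scipy.linalg.lu also returns a permutation matrix P because practical implementations pivot rows, so in general A = PLU):

import numpy as np
from scipy.linalg import lu

A = np.array([[3.0, 1.0, 2.0],
              [6.0, 3.0, 4.0],
              [3.0, 1.0, 5.0]])

# P is a permutation matrix (from partial pivoting), L is lower triangular
# with a unit diagonal, U is upper triangular
P, L, U = lu(A)
print(L)
print(U)
print(np.allclose(A, P @ L @ U))   # True: the factors reconstruct A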

Introduction to Machine Learning and Probability/Statistics

• Importance: Machine learning relies heavily on statistics and probability to make predictions and analyze data.
• Statistics in ML: Key for understanding data patterns, measuring relationships, and quantifying uncertainties.

Probability Distributions

• Definition: A probability distribution describes the likelihood of various outcomes for a variable X.
• Types:

o Discrete Probability Distributions: For countable events (e.g., binomial, Poisson).


o Continuous Probability Distributions: For measurable events on a continuum (e.g., normal,
exponential).

Continuous Probability Distributions

1. Normal Distribution (Gaussian Distribution)

• Shape: Bell curve, symmetric around the mean.


• Characteristics: Defined by mean μ and standard deviation σ.
• Probability Density Function (PDF): f(x) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²))

• Applications: Common in natural data (e.g., heights, exam scores).


• Z-score: Standardizes data points: Z = (X − μ) / σ
2. Uniform Distribution (Rectangular Distribution)

• Definition: Equal probability for all outcomes within range [a,b].


• PDF: f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise

3. Exponential Distribution

• Definition: Models the time between events in a Poisson process, with PDF f(x) = λ e^(−λx) for x ≥ 0.
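
A minimal sketch using scipy.stats to evaluate and sample these continuous distributions (the parameter values are arbitrary illustrations):

import numpy as np
from scipy import stats

# Normal distribution with mean 170 and standard deviation 10 (e.g., heights in cm)
normal = stats.norm(loc=170, scale=10)
print(normal.pdf(180))          # density at x = 180
print((180 - 170) / 10)         # z-score of x = 180

# Uniform distribution on [a, b] = [0, 5]; scipy parameterizes it as (loc, scale) = (a, b - a)
uniform = stats.uniform(loc=0, scale=5)
print(uniform.pdf(2.5))         # 1 / (b - a) = 0.2 everywhere inside [0, 5]

# Exponential distribution with rate lambda = 0.5; scale = 1 / lambda
expon = stats.expon(scale=1 / 0.5)
print(expon.pdf(1.0))           # lambda * exp(-lambda * x)

samples = normal.rvs(size=1000, random_state=0)   # draw random samples
print(samples.mean(), samples.std())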

Discrete Probability Distributions

1 Binomial Distribution

• Definition: For a fixed number of independent trials, each with two outcomes (success/failure).
• Probability of k successes in n trials: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where p is the probability of success in a single trial.

2 Poisson Distribution

• Definition: Models the number of events occurring in a fixed interval of time.
• PMF: P(X = k) = (λ^k · e^(−λ)) / k!, where λ is the average number of events per interval.

3 Bernoulli Distribution

• Definition: Models a single trial with two outcomes (success/failure).
• Probability Mass Function (PMF): P(X = x) = p^x · (1 − p)^(1 − x) for x ∈ {0, 1}, where p is the probability of success.
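
A minimal sketch using scipy.stats for the discrete distributions above (the parameter values are arbitrary):

from scipy import stats

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(stats.binom.pmf(k=3, n=10, p=0.5))

# Poisson: probability of k = 2 events when the average rate is lambda = 4 per interval
print(stats.poisson.pmf(k=2, mu=4))

# Bernoulli: probabilities of failure (0) and success (1) with p = 0.3
print(stats.bernoulli.pmf(0, p=0.3), stats.bernoulli.pmf(1, p=0.3))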

Density Estimation

• Goal: Estimate the probability density function (PDF) of data.


• Types:
o Parametric Density Estimation: Assumes a known distribution (e.g., Gaussian)
and estimates parameters.
o Non-Parametric Density Estimation: Does not assume a fixed distribution (e.g.,
Parzen window, k-Nearest Neighbors)

Parametric Density Estimation

1 Maximum Likelihood Estimation (MLE)

• Definition: A method for estimating the parameters of a distribution by maximizing the likelihood function.
• Likelihood Function: L(θ) is the probability of the observed data as a function of the parameter θ; MLE chooses the θ that maximizes L(θ).
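
A minimal sketch of MLE for a Gaussian, where the maximizers of the (log-)likelihood have closed forms: the sample mean and the biased sample standard deviation. The data here are randomly generated for illustration:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic data

# For a Gaussian, maximizing the likelihood L(theta) over (mu, sigma)
# gives the sample mean and the biased sample standard deviation
mu_mle = data.mean()
sigma_mle = data.std(ddof=0)
print(mu_mle, sigma_mle)        # should be close to the true values 5.0 and 2.0

# Log-likelihood of the data under the estimated parameters
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma_mle**2)
                 - (data - mu_mle)**2 / (2 * sigma_mle**2))
print(log_lik)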

Gaussian Mixture Model (GMM) and Expectation-Maximization (EM) Algorithm

• GMM: A probabilistic model assuming the data are generated from a mixture of Gaussian distributions.
• EM Algorithm:
o E-Step: Using the current parameter estimates, compute the expected values of the latent variables (the responsibility of each Gaussian component for each data point).
o M-Step: Re-estimate the parameters using MLE, based on the responsibilities from the E-step.
• Iteration: Repeat the two steps until convergence.
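
A minimal sketch using scikit-learn's GaussianMixture, which runs EM internally (the two-cluster data are synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians
data = np.concatenate([rng.normal(0.0, 1.0, size=(200, 1)),
                       rng.normal(6.0, 1.5, size=(200, 1))])

# Fit a 2-component GMM; fit() alternates E- and M-steps until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())        # estimated component means (near 0 and 6)
print(gmm.weights_)              # estimated mixing proportions
print(gmm.predict([[5.0]]))      # most likely component for a new point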

Non-Parametric Density Estimation Methods

1 Parzen Window

• Definition: A non-parametric technique that estimates the PDF based on local samples.
• Example: Uses a kernel function like Gaussian around each data point.

2 k-Nearest Neighbors (KNN)

• Definition: Estimates density by considering the k closest neighbours of a point.
• Application: Frequently used in classification tasks.
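
A minimal sketch of Parzen-window (kernel) density estimation using scikit-learn, with a Gaussian kernel placed around each data point (the data are synthetic):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # synthetic samples

# Gaussian kernel around every data point; the bandwidth controls the smoothing
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(data)

grid = np.linspace(-4, 4, 9).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))   # score_samples returns log-density
print(density)                               # roughly follows the standard normal PDF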

2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES

Features are attributes. Feature engineering is about determining the subset of features that form
an important part of the input and improve the performance of a model, be it a classification model or any
other model in machine learning.

Feature engineering deals with two problems – feature transformation and feature selection.
Feature transformation is the extraction of features and the creation of new features that may be helpful in
increasing performance. For example, height and weight may be combined to give a new attribute called Body Mass Index (BMI).

Feature subset selection is another important aspect of feature engineering. It focuses on selecting a subset of
the features in order to reduce training time, but not at the cost of reliability.

The features can be removed based on two aspects:

1. Feature relevancy – Some features contribute more to classification than others. For
example, a mole on the face can help more in face detection than a common feature such as the nose. In simple
words, the features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table already has a field called
Date of Birth, then an Age field is redundant, as age can easily be computed from the date of birth.
So, the procedure is:
1. Generate all possible subsets of features
2. Evaluate each subset and the resulting model performance
3. Evaluate the results and pick the optimal feature subset

Filter-based selection uses statistical measures for assessing features. In this approach, no learning
algorithm is used. Correlation and information-gain measures such as mutual information and entropy are
examples of this approach.

Wrapper-based methods use classifiers to identify the best features. The features are selected and evaluated by
the learning algorithm itself. This procedure is computationally intensive but generally has superior performance.
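
A minimal sketch of filter-based selection using scikit-learn's mutual-information score (the Iris dataset is used purely as a convenient example):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature with mutual information and keep the two best;
# no learning algorithm is involved in the scoring itself
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)          # mutual information of each feature with the class
print(X_reduced.shape)           # (150, 2): only the two highest-scoring features remain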

2.10.1 Stepwise Forward Selection


This procedure starts with an empty set of attributes. At each step, the attribute that is most statistically
significant (of best quality) is tested and added to the reduced set. This process is continued until a good
reduced set of attributes is obtained.

2.10.2 Stepwise Backward Elimination


This procedure starts with a complete set of attributes. At every stage, the procedure removes the worst
attribute from the set, leading to the reduced set.
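
A minimal wrapper-style sketch of both procedures using scikit-learn's SequentialFeatureSelector (again on the Iris dataset for illustration; the choice of classifier is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Stepwise forward selection: start empty, greedily add the best attribute
forward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                    direction="forward").fit(X, y)
print(forward.get_support())     # mask of the selected features

# Stepwise backward elimination: start with all attributes, remove the worst
backward = SequentialFeatureSelector(knn, n_features_to_select=2,
                                     direction="backward").fit(X, y)
print(backward.get_support())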

2.10.3 Principal Component Analysis


The idea of principal component analysis (PCA), or the KL transform, is to transform a given set of
measurements into a new set of features so that the features exhibit high information-packing properties.
This leads to a reduced and compact set of features. Consider a group of random vectors of the form:

x = (x1, x2, …, xn)^T

The mean vector of the set of random vectors is defined as:

m = E{x}

The operator E refers to the expected value of the population. This is calculated theoretically using the
probability density functions (PDF) of the elements xi and the joint probability density functions between
the elements xi and xj. From this, the covariance matrix can be calculated as:

C = E{(x − m)(x − m)^T}

Let A be the matrix whose rows are the eigenvectors of C, arranged so that the first row corresponds to the
largest eigenvalue. The mapping of the vectors x to y using this transformation can now be described as:

y = A(x − m)

This transform is also called the Karhunen-Loeve or Hotelling transform. The original vector x
can now be reconstructed as follows:

x = A^T y + m

If only the K largest eigenvalues (and their eigenvectors, packed as A_K) are used, the recovered information would be:

x ≈ A_K^T y + m

The PCA algorithm is as follows:


1. The target dataset x is obtained
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is X – m.
The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The eigen
values are arranged in a descending order. The feature vector is formed with these eigen vectors in
its columns.
Feature vector = {eigen vector1, eigen vector2, … , eigen vectorn}
6. Obtain the transpose of feature vector. Let it be A.
7. PCA transform is y = A × (x – m), where x is the input dataset, m is the mean, and A is the transpose
of the feature vector.
The original data can be retrieved using the formula given below:

originalData = A^T × y + m

The new data y is a dimensionally reduced matrix that represents the original data.

From the scree plot in Figure 2.15, one can infer the relevance of the attributes: only 6 out of 246 attributes
are important, and the first attribute is far more important than all the others.
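
A minimal NumPy sketch following the algorithm above (the small 2-D dataset is invented for illustration; only the top principal component is kept):

import numpy as np

# Hypothetical 2-D dataset: rows are samples, columns are attributes
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

m = X.mean(axis=0)                   # step 2: mean vector
Xc = X - m                           # zero-mean (adjusted) dataset
C = np.cov(Xc, rowvar=False)         # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C) # step 4: eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]    # step 5: sort eigenvalues in descending order
A = eigvecs[:, order].T              # step 6: transpose of the feature vector

k = 1                                # keep only the principal component
y = A[:k] @ Xc.T                     # step 7: PCA transform y = A (x - m)
X_recovered = (A[:k].T @ y).T + m    # approximate reconstruction of the original data
print(y.T)
print(X_recovered)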

2.10.4 Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of LDA
is to project higher-dimensional data onto a line (lower-dimensional space). LDA is also used to classify the
data. Let there be two classes, c1 and c2, and let m1 and m2 be the means of the patterns of the two classes.
The means of the classes c1 and c2 can be computed as:

m1 = (1/N1) Σ x over x ∈ c1 and m2 = (1/N2) Σ x over x ∈ c2

where N1 and N2 are the numbers of patterns in c1 and c2.

The aim of LDA is to optimize (maximize) the Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix.
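
A minimal sketch of two-class Fisher LDA with NumPy, computing the projection direction w that maximizes this criterion (the two small clusters are synthetic):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic patterns for two classes c1 and c2
c1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
c2 = rng.normal(loc=[4, 3], scale=1.0, size=(50, 2))

m1, m2 = c1.mean(axis=0), c2.mean(axis=0)          # class means

# Within-class scatter matrix S_W (sum of the two class scatters)
S_W = (c1 - m1).T @ (c1 - m1) + (c2 - m2).T @ (c2 - m2)

# For two classes, the optimal direction is proportional to S_W^-1 (m1 - m2)
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)

# Project the 2-D data onto the line defined by w
print(c1 @ w)
print(c2 @ w)                                       # the two classes separate along w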

2.10.5 Singular Value Decomposition


Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be the given
matrix; then the matrix A can be decomposed as:

A = U S V^T

Here, A is the given matrix of dimension m × n, U is an orthogonal matrix of dimension m × n, S is the
diagonal matrix of singular values of dimension n × n, and V is an orthogonal matrix of dimension n × n. The
procedure for finding the decomposition matrices is given as follows:
1. For the given matrix A, find AA^T.
2. Find the eigenvalues and eigenvectors of AA^T.
3. Sort the eigenvalues in descending order and pack the corresponding eigenvectors as the matrix U.
4. Arrange the square roots of the eigenvalues along the diagonal. This diagonal matrix is S.
5. Find the eigenvalues and eigenvectors of A^TA, sort them in the same order, and pack the eigenvectors as the
matrix V.
Thus, A = USV^T. Here, U and V are orthogonal matrices. The columns of U and V are the left and right
singular vectors, respectively. SVD is useful in compression, as one can decide to retain only the k largest
singular components instead of the original matrix A:

A ≈ U_k S_k V_k^T

Based on the choice of retention, the compression can be controlled.
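
A minimal NumPy sketch of SVD-based compression, keeping only the largest singular components (the matrix is a random example):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                        # example m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True

k = 2                                        # retain only the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of A
print(np.linalg.norm(A - A_k))               # reconstruction error under this compression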

CHAPTER 3 - BASICS OF LEARNING THEORY


3.3 DESIGN OF A LEARNING SYSTEM

3.4 INTRODUCTION TO CONCEPT LEARNING

Concept learning is a learning strategy that involves acquiring abstract knowledge or inferring a general
concept based on the given training samples. It aims to derive a category or classification from the data,
facilitating abstraction and generalization. In machine learning, concept learning is about finding a function
that categorizes or labels instances correctly based on the observed features.

3.4.1 Representation of a Hypothesis

A hypothesis, denoted by h, is an approximation of the target function f. It represents the relationship
between the independent attributes (input features) and the dependent attribute (output or label) of the
training instances. The hypothesis acts as the predicted model that maps inputs to outputs.
In concept learning, each hypothesis is represented as a conjunction (AND combination) of attribute
conditions in the antecedent part, defining specific constraints on attributes that an instance must satisfy
to be classified as positive.
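
A minimal sketch of this representation in Python, using the common convention where '?' accepts any value and a None constraint accepts nothing (the attribute names and values are invented):

# A hypothesis is a tuple of constraints, one per attribute.
# '?' means "any value is acceptable"; None means "no value is acceptable".
ATTRIBUTES = ["Sky", "Temperature", "Humidity", "Wind"]

def matches(hypothesis, instance):
    """Return True if the instance satisfies every attribute constraint (a conjunction)."""
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

h = ("Sunny", "Warm", "?", "?")               # constrains Sky and Temperature only
print(matches(h, ("Sunny", "Warm", "High", "Strong")))   # True
print(matches(h, ("Rainy", "Warm", "High", "Strong")))   # False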

3.4.2 Hypothesis Space

Hypothesis space is the set of all possible hypotheses that approximate the target function f.

The subset of the hypothesis space that is consistent with all observed training instances is
called the version space.
3.4.3 Heuristic Space Search

Heuristic search is a search strategy that finds an optimized hypothesis/solution to a
problem by iteratively improving the hypothesis/solution based on a given heuristic
function or a cost measure.

3.4.4 Generalization and Specialization

Searching the Hypothesis Space

There are two ways of learning a hypothesis that is consistent with all training instances from
the large hypothesis space.

1. Specialization – General to Specific learning
2. Generalization – Specific to General learning

Generalization – Specific to General Learning: This learning methodology searches through the
hypothesis space for an approximate hypothesis by generalizing the most specific hypothesis.

Specialization – General to Specific Learning: This learning methodology searches through the
hypothesis space for an approximate hypothesis by specializing the most general hypothesis.
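
The specific-to-general strategy is the idea behind the Find-S algorithm of Section 3.4.5. As a concrete illustration (a minimal sketch, not reproduced from the text, with invented attribute values): start from the most specific hypothesis and generalize it just enough to cover each positive training example.

def find_s(examples):
    """Find the maximally specific hypothesis consistent with the positive examples."""
    n_attrs = len(examples[0][0])
    h = [None] * n_attrs                  # most specific hypothesis: accepts nothing
    for instance, label in examples:
        if label != "yes":                # Find-S ignores negative examples
            continue
        for i, value in enumerate(instance):
            if h[i] is None:              # first positive example: copy its values
                h[i] = value
            elif h[i] != value:           # conflicting value: generalize to '?'
                h[i] = "?"
    return h

training = [(("Sunny", "Warm", "Normal", "Strong"), "yes"),
            (("Sunny", "Warm", "High", "Strong"), "yes"),
            (("Rainy", "Cold", "High", "Strong"), "no")]
print(find_s(training))                   # ['Sunny', 'Warm', '?', 'Strong']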

3.4.5 Hypothesis Space Search by Find-S Algorithm

Limitations of Find-S Algorithm

3.4.6 Version Spaces

List-Then-Eliminate Algorithm

Candidate Elimination Algorithm

The diagrammatic representation of deriving the version space is shown below:

Deriving the Version Space
