
MACHINE LEARNING (BCS602)

MODULE 2
CHAPTER 1
Understanding Data: 2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables. Bivariate analysis deals with causes and relationships between them; the aim is to find relationships among the data. Consider Table 2.3, which lists the temperature in a shop and the corresponding sales of sweaters.

Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be used in comparisons, finding causes, and further exploration. To do that, a graphical display of the data is necessary. One such method is the scatter plot, which is used to visualize bivariate data. It is useful for plotting two variables with or without nominal variables, to illustrate trends and to show differences. It is a plot between the explanatory and response variables, drawn as a 2D graph showing the relationship between the two variables.

Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.
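The following is a minimal matplotlib sketch of a scatter plot and a line chart for such bivariate data. The temperature and sweater-sales values are assumed for illustration; the actual values of Table 2.3 are not reproduced here.

```python
# Scatter plot and line chart of hypothetical temperature vs. sweater-sales data
import matplotlib.pyplot as plt

temperature = [5, 10, 15, 20, 25, 30]           # explanatory variable (assumed values)
sweater_sales = [300, 250, 200, 150, 100, 60]   # response variable (assumed values)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(temperature, sweater_sales)                 # scatter plot of the bivariate data
ax1.set(xlabel="Temperature", ylabel="Sweater sales", title="Scatter plot")
ax2.plot(temperature, sweater_sales, marker="o")        # line chart of the same data
ax2.set(xlabel="Temperature", ylabel="Sweater sales", title="Line chart")
plt.tight_layout()
plt.show()
```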

2.6.1 BIVARIATE STATISTICS
Covariance and correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented in capital letters. It is written as covariance(X, Y) or COV(X, Y) and measures how the two dimensions vary together. For specific observations x and y with means X̄ and Ȳ, the covariance is computed as:

COV(X, Y) = (1/N) Σ (xi − X̄)(yi − Ȳ)

The covariance between X and Y in the worked example is 12. Covariance can be normalized to a value between −1 and +1 by dividing it by the product of the standard deviations of X and Y; the result is called the Pearson correlation coefficient.
Sometimes, N − 1 can be used instead of N. In that case, the covariance of that example becomes 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining an association between two phenomena. It measures the strength and direction of the linear relationship between the x and y variables.
1. If the value is positive, the two dimensions increase together.
2. If the value is negative, one dimension increases while the other decreases.
3. If the value is zero, the two dimensions are uncorrelated, i.e., there is no linear relationship between them.
If two dimensions are highly correlated, it is better to remove one of them, as it is redundant. If the given attributes are X = (x1, x2, …, xN) and Y = (y1, y2, …, yN), then the Pearson correlation coefficient, denoted r, is given as:

r = COV(X, Y) / (σX · σY) = Σ (xi − X̄)(yi − Ȳ) / √( Σ (xi − X̄)² · Σ (yi − Ȳ)² )

EXAMPLE PROBLEM ON COV and CORR
Given two datasets:
X = [2, 4, 6, 8], Y = [1, 3, 2, 5]. Find the covariance between X and Y.
Solution:
Step 1: Compute the Means
The mean of X is:
• X̄ = (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5
The mean of Y is:
• Ȳ = (1 + 3 + 2 + 5) / 4 = 11 / 4 = 2.75
Step 2: Compute the Covariance
COV(X, Y) = (1/N) Σ (xi − X̄)(yi − Ȳ)
= [(2 − 5)(1 − 2.75) + (4 − 5)(3 − 2.75) + (6 − 5)(2 − 2.75) + (8 − 5)(5 − 2.75)] / 4
= (5.25 − 0.25 − 0.75 + 6.75) / 4 = 11 / 4 = 2.75

This positive covariance indicates that as X increases, Y tends to increase as well.


Step 3: Compute the Correlation Coefficient
r = COV(X, Y) / (σX · σY) = 2.75 / (√5 × √2.1875) ≈ 2.75 / (2.236 × 1.479) ≈ 0.832
Since r is close to 1, this indicates a strong positive correlation between X and Y: as X increases, Y tends to increase as well.
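A short NumPy check of this worked example, assuming the same data X = [2, 4, 6, 8] and Y = [1, 3, 2, 5]:

```python
import numpy as np

X = np.array([2, 4, 6, 8])
Y = np.array([1, 3, 2, 5])

# Population covariance (divide by N, as in the example above)
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))   # 2.75
# Pearson correlation coefficient (np.std uses N by default)
r = cov_xy / (X.std() * Y.std())                    # ~0.832

print(cov_xy, r)
print(np.corrcoef(X, Y)[0, 1])                      # same r via NumPy's built-in
```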
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate data involves three or more variables, and often thousands of measurements are collected for one or more subjects.
The aims of multivariate analysis are broader than those of bivariate analysis. Typical multivariate techniques are regression analysis, factor analysis and multivariate analysis of variance (MANOVA).

Heatmap
Heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it: darker colours indicate larger values and lighter colours indicate smaller values.
The advantage of this method is that humans perceive colours well, so large values can be spotted quickly through colour shading.
For example, in vehicle traffic data, heavy-traffic regions can be distinguished from low-traffic regions using a heatmap. In Figure 2.13, patient data highlighting weight and health status is plotted: the X-axis shows weights, the Y-axis shows patient counts, and the dark regions highlight patients' weights versus patient counts for each health status.
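A minimal heatmap sketch with seaborn. The 6×8 matrix of counts below is hypothetical; the patient-weight data of Figure 2.13 is not reproduced here.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
matrix = rng.integers(0, 100, size=(6, 8))   # hypothetical counts

sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues")  # darker cells = larger values
plt.xlabel("Weight bin")
plt.ylabel("Health status")
plt.show()
```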

Pairplot
Pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. A random matrix of three columns is chosen and the relationships between the columns are plotted as a pairplot (or scatter matrix), as shown below in Figure 2.14.
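A minimal pairplot (scatter-matrix) sketch with seaborn, using a random three-column DataFrame as in the description of Figure 2.14; the column names are arbitrary.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["col1", "col2", "col3"])

sns.pairplot(df)   # pair-wise scatter plots with histograms on the diagonal
plt.show()
```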

2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA


Machine learning involves many mathematical concepts from the domain of Linear algebra,
Statistics, Probability and Information theory. The subsequent sections discuss important
aspects of linear algebra and probability.
2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data
A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution x is given as:

x = A⁻¹ y

This holds as long as A is invertible (in the scalar case, A ≠ 0). The logic can be extended to a set of N equations with 'n' unknown variables: if A is the coefficient matrix and y = (y1, y2, …, yn)ᵀ, then the unknown vector x can be obtained as x = A⁻¹y.

If there is a unique solution, the system is called consistent independent. If there are infinitely many solutions, the system is called consistent dependent. If there are no solutions, that is, the equations are contradictory, the system is called inconsistent. For solving a large number of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is as follows:
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate the entries below it in the first column (a21, a31, …) using row operations. The same logic is used to remove the first-column entries of all the remaining rows.


4. Repeat the same logic for the remaining columns and reduce the matrix to row echelon form. Then, the last unknown variable is obtained directly from the last row as:

xn = y′n / a′nn

5. The remaining unknown variables can then be found by back-substitution as:

xi = ( y′i − Σ_{j>i} a′ij xj ) / a′ii

This part is called backward substitution. To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping two rows
2. Multiplying or dividing a row by a non-zero constant
3. Replacing a row by adding or subtracting a multiple of another row to it
Example Problems:
GAUSSIAN ELIMINATION
1. Given the following system of equations:
2x + 3y − z = 5
4x + y + 2z = 6
−2x + 5y + z = 4
Find x, y, z.
Step 1: Convert the Equations into an Augmented Matrix
We write only the coefficients and constants in matrix form:

[  2   3  −1 |  5 ]
[  4   1   2 |  6 ]
[ −2   5   1 |  4 ]

Step 2: Make the First Element (Pivot) 1
To make the first element 1, we divide the first row by 2 (R1 → R1 / 2).

Step 3: Make the First Column Below the Pivot Zero

We want to make the numbers below the first pivot (the 1 in row 1, column 1) zero:
R2 → R2 − 4R1
R3 → R3 + 2R1

Step 4: Make the Second Pivot 1


To make the second pivot 1, divide Row 2 by -5:

Step 5: Make the Second Column Below the Pivot Zero


To make the second column below the pivot 0, apply R3 → R3 − 8R2.

Step 6: Make the Third Pivot 1


To make the third pivot 1, divide Row 3 by 6.4:

Now, we have an upper triangular matrix, and we can solve for z, y, and x using back-
substitution.
Step 7: Back-Substitution
We now solve for z, y, and x one by one.
From Row 3:
z = 0.40625
From Row 2:
y − 0.8(0.40625) = 0.8, so y = 0.8 + 0.325 = 1.125
From Row 1:
x + 1.5(1.125) − 0.5(0.40625) = 2.5, so x = 2.5 − 1.6875 + 0.203125 ≈ 1.0156
Final Answer:
x ≈ 1.0156, y = 1.125, z = 0.40625
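A quick NumPy check of the solution obtained by Gaussian elimination above:

```python
import numpy as np

A = np.array([[ 2, 3, -1],
              [ 4, 1,  2],
              [-2, 5,  1]], dtype=float)
y = np.array([5, 6, 4], dtype=float)

x = np.linalg.solve(A, y)
print(x)   # [1.015625 1.125    0.40625 ] -> x ~ 1.0156, y = 1.125, z = 0.40625
```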
2. Given the following system:
X + Y = 5, 2X − Y = 1. Find X and Y.

Step 1: Convert to an Augmented Matrix

[ 1   1 | 5 ]
[ 2  −1 | 1 ]

Step 2: Make the first pivot element 1. The first pivot is already 1.
Step 3: Make the element below the pivot 0:
R2 → R2 − 2R1

Step 4: Make the second pivot 1


Divide Row 2 by -3 to get 1:

Step 5: Make the element above the second pivot 0


Subtract R2 from Row 1:

Step 6: Read the solution, From the final matrix:


X = 2, Y = 3
These concepts are illustrated in Example 2.8.
2.8.2 Matrix Decomposition
It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be performed. Using eigen decomposition, the matrix A can be decomposed as:

A = Q Λ Qᵀ

where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues, and Qᵀ is the transpose of matrix Q.
LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two matrices:
A = LU
Here, L is a lower triangular matrix and U is an upper triangular matrix.
The decomposition can be done using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix to the matrices L and U.

Example 2.9 illustrates the application of Gaussian elimination to get LU.
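A minimal LU decomposition sketch using SciPy, applied here to the coefficient matrix of the earlier worked example (the matrix choice is illustrative only). Note that scipy.linalg.lu returns P, L, U with A = P @ L @ U, where P is a permutation matrix arising from the row swaps used for numerical stability.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[ 2.0, 3.0, -1.0],
              [ 4.0, 1.0,  2.0],
              [-2.0, 5.0,  1.0]])

P, L, U = lu(A)
print(L)                            # lower triangular (unit diagonal)
print(U)                            # upper triangular
print(np.allclose(P @ L @ U, A))    # True
```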

2.8.3 Machine Learning and Importance of Probability and Statistics


Machine learning is closely linked with statistics and probability. Like linear algebra, statistics is at the heart of machine learning; without statistics, data cannot be analysed or interpreted meaningfully.
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X’s
events. Distribution is a parameterized mathematical function. In other words, distribution is a
function that describes the relationship between the observations in a sample space.
Consider a set of data. The data is said to follow a distribution if it obeys a mathematical
function that characterizes that distribution. The function can be used to calculate the
probability of individual observations.
Probability distributions are of two types:
1.Discrete probability distribution
2.Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution.
1. Continuous Probability Distributions – Normal, rectangular (uniform) and exponential distributions fall under this category.
I. Normal Distribution – Normal distribution is a continuous probability distribution.
This is also known as gaussian distribution or bell-shaped curve distribution. It is the
most common distribution function. The shape of this distribution is a typical bell-
shaped curve. In normal distribution, data tends to be around a central value with no
bias on left or right. The heights of the students, blood pressure of a population, and

marks scored in a class can be approximated using a normal distribution. The PDF of the normal distribution is given as:

f(x) = (1 / (σ√(2π))) · e^( −(x − μ)² / (2σ²) )

Here, μ is the mean and σ is the standard deviation. The normal distribution is characterized by two parameters – mean and variance. One important concept associated with the normal distribution is the z-score, computed as:

z = (x − μ) / σ

This is useful for normalizing the data.
II. Rectangular Distribution – This is also known as the uniform distribution. It has equal probability for all values in the range [a, b]. The uniform density is given as follows:

f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise.
III. Exponential Distribution – This is a continuous probability distribution used to describe the time between events in a Poisson process. The exponential distribution is a special case of the Gamma distribution with its shape parameter fixed at 1. It is helpful in modelling the time until an event occurs. The PDF is given as follows:

f(x; λ) = λ e^(−λx) for x ≥ 0, and 0 otherwise, where λ is the rate parameter.
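The three continuous distributions above can be evaluated with scipy.stats; the parameter values below are assumed example values only.

```python
from scipy.stats import norm, uniform, expon

# Normal: mean mu = 0, standard deviation sigma = 1 (assumed)
print(norm.pdf(0.5, loc=0, scale=1))       # PDF value at x = 0.5
print((60 - 70) / 5)                       # z-score for x = 60, mu = 70, sigma = 5 -> -2.0

# Rectangular (uniform) on [a, b] = [2, 6]: scale is the width b - a
print(uniform.pdf(3, loc=2, scale=4))      # 1/(b - a) = 0.25 inside the interval

# Exponential with rate lambda = 2: scipy parameterizes by scale = 1/lambda
print(expon.pdf(1.0, scale=1/2))           # lambda * exp(-lambda * x) = 2*e^-2
```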

2.Discrete Distribution
Binomial, Poisson, and Bernoulli distributions fall under this category.
I. Binomial Distribution – The binomial distribution is often encountered in machine learning. Each trial has only two outcomes, success or failure; such a trial is called a Bernoulli trial. The objective of this distribution is to find the probability of getting exactly k successes out of n trials. The number of ways of choosing k successes out of n trials is:

C(n, k) = n! / ( k! (n − k)! )

If p is the probability of success and (1 − p) the probability of failure, then the probability of a particular sequence of k successes and n − k failures is p^k (1 − p)^(n − k). Combining both, the PMF of the binomial distribution is:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)


Here, p is the probability of success, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is np and its variance is np(1 − p).
II. Poisson Distribution – Another important and very useful distribution. Given an interval of time, it models the probability of observing a given number of events k in that interval, where the mean rate λ is the expected number of events per interval. Examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office. The PMF of the Poisson distribution is given as follows:

P(X = k) = (λ^k · e^(−λ)) / k!

III. Bernoulli Distribution – This distribution models an experiment whose outcome is binary. The outcome is positive (1) with probability p and negative (0) with probability 1 − p. The PMF of this distribution is given as:

P(X = x) = p^x · (1 − p)^(1 − x), for x ∈ {0, 1}
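The three discrete distributions can be evaluated with scipy.stats; the parameter values below (n, p, λ) are assumed example values only.

```python
from scipy.stats import binom, poisson, bernoulli

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(binom.pmf(3, n=10, p=0.5))       # C(10,3) * 0.5^3 * 0.5^7 ~ 0.1172
print(binom.mean(n=10, p=0.5))         # mean n*p = 5.0

# Poisson: probability of k = 2 events when the mean rate is lambda = 4
print(poisson.pmf(2, mu=4))            # 4^2 * e^-4 / 2! ~ 0.1465

# Bernoulli: a single binary trial with success probability p = 0.3
print(bernoulli.pmf(1, p=0.3))         # 0.3
print(bernoulli.pmf(0, p=0.3))         # 0.7
```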

IV. Density Estimation


Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is not known. Density estimation is the problem of estimating the density function from the observed data.
There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation.

V. Parametric Density Estimation


It assumes that the data comes from a known probability distribution whose parameters can be estimated from the observed sample. The maximum likelihood method is a parametric estimation method.

VI. Maximum Likelihood Estimation


For a sample of observations, one can estimate the probability distribution. This is called
density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that
can be used for density estimation.


This involves formulating a function called likelihood function which is the conditional
probability of observing the observed samples and distribution function with its parameters.
For example, if the observations are X = {x1, x2, … , xn}, then density estimation is the
problem of choosing a PDF with suitable parameters to describe the data.

MLE treats this problem as a search or optimization problem: find the parameter θ that maximizes the likelihood, i.e., the joint probability of observing X under the distribution with parameter θ.

If one assumes that the regression problem can be framed as predicting output y given input x,
then for p(y/x), the MLE framework can be applied as:

Here, h is the linear regression model. If a Gaussian noise model is assumed (a common assumption, since much real-world data is approximately Gaussian), then MLE can be stated as:

Here, b is the regression coefficient and xi is the given sample. One can maximize this function, or equivalently minimize the negative log-likelihood, to obtain a solution to the linear regression problem. Eq. (2.37) yields the same answer as the least-squares approach.
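The equivalence between Gaussian MLE and least squares can be checked numerically. The sketch below is illustrative only (it is not the textbook's Eq. (2.37)); the data, the noise level, and the helper name neg_log_likelihood are assumptions for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)   # synthetic data

def neg_log_likelihood(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                 # keep sigma positive
    residuals = y - (b0 + b1 * x)
    # Gaussian negative log-likelihood (up to an additive constant)
    return 0.5 * np.sum((residuals / sigma) ** 2) + x.size * np.log(sigma)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0]).x
lsq = np.polyfit(x, y, deg=1)                 # least-squares fit, returns [b1, b0]

print(mle[:2])          # MLE estimates of (b0, b1)
print(lsq[::-1])        # least-squares (b0, b1) -- essentially identical
```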

Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm :

Generally, the data may come from several unspecified distributions with different sets of parameters. A Gaussian Mixture Model (GMM) assumes that the data is generated by a mixture of Gaussian components whose parameters are unknown; the Expectation-Maximization (EM) algorithm estimates these parameters. The EM algorithm has two stages:
1. Expectation (E) stage – In this stage, the expected PDF and its parameters are estimated for each latent variable.
2. Maximization (M) stage – In this stage, the parameters are optimized using the MLE function.
This process is iterative, and the iterations continue until all the latent variables are fitted effectively by probability distributions along with their parameters.
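A minimal Gaussian Mixture Model sketch using scikit-learn, whose fit() method runs the EM iterations internally. The two-cluster data below is synthetic and only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, size=(200, 2)),
                       rng.normal(5, 1, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0)   # E and M steps run inside fit()
gmm.fit(data)

print(gmm.means_)            # estimated component means (near [0, 0] and [5, 5])
print(gmm.weights_)          # mixing proportions (near 0.5 each)
labels = gmm.predict(data)   # most likely component for each point
```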


KNN Estimation: KNN estimation is another non-parametric density estimation method. The parameter k is fixed first, and for a query point its k nearest neighbours are determined. The density estimate at that point is then obtained from these k neighbours; a common form is p(x) ≈ k / (N · V), where N is the total number of samples and V is the volume of the region enclosing the k nearest neighbours.
2.9 OVERVIEW OF HYPOTHESIS
Data collection alone is not enough. Data must be interpreted to give a conclusion. The
conclusion should be a structured outcome. This assumption of the outcome is called a
hypothesis.
• Statistical methods are used to confirm or reject the hypothesis.
• The assumption being tested by the statistical test is called the null hypothesis, also called hypothesis zero (H0).
• In other words, the null hypothesis is the existing belief. The violation of this hypothesis is called the first (alternative) hypothesis, H1; this is the hypothesis the researcher is trying to establish.
There are two types of hypothesis tests, parametric and non-parametric.
• Parametric tests are based on parameters such as the mean and standard deviation.
• Non-parametric tests are based on characteristics such as independence of events or whether the data follows a certain distribution.

1.Define null and alternate hypothesis
2.Describe the hypothesis using parameters
3.Identify the statistical test and statistics
4.Decide the criteria called significance value a
5.Compute p-value (probability value)
1.Define Null and Alternate Hypothesis:
• Null Hypothesis (H0): It represents the statement of no effect, no difference, or status
quo. It assumes that any observed differences are due to random chance.
• Alternative Hypothesis (Hα): It represents a statement indicating the presence of an
effect, a difference, or a relationship between variables.
2. Describe the Hypothesis Using Parameters: Hypotheses are expressed using population parameters (e.g., mean μ, proportion p, standard deviation σ). Example: H0: μ = μ0 versus H1: μ ≠ μ0.

3. Identify the Statistical Test and Statistic:


• The choice of statistical test depends on the type of data and hypothesis:
• t-test: Compares means of two groups.
• z-test: Used when population variance is known.
• Chi-square test: Used for categorical data.
• ANOVA: Compares means of more than two groups.
• Regression analysis: Tests relationships between variables.
Test Statistic: A numerical value computed from the sample data to determine whether to
reject H0.

4. Decide the Significance Level (α):


• The significance level (α) represents the probability of rejecting H0 when it is actually
true (Type I error).

5. Compute p-value (Probability Value):
• The p-value represents the probability of obtaining a result at least as extreme as the
observed one, assuming H0 is true.
• It is compared to α:
• If p ≤ α, reject H0 (significant result).
• If p > α, fail to reject H0 (not significant).
• The final decision of accepting or rejecting the null hypothesis is taken based on this comparison.
Two kinds of errors are involved, that are Type I and Type II.
• Type I error is the incorrect rejection of a true null hypothesis and is called a false positive.
• Type II error is the failure to reject a false null hypothesis and is called a false negative.
Hypothesis Testing
Two important errors are involved: the sample error and the true (or actual) error.
1.Sample Error (Sampling Error) – Error Due to Taking a Sample
What is it?
• When we take a sample from a population, it may not perfectly represent the whole
population.
• This difference between the sample result and the actual population result is called
sample error.
Why does it happen?
• Because we are studying only a part (sample) of the population, not the entire
population.
• The sample might accidentally have more extreme values or fewer average values.
Ex: Imagine you want to know the average height of all students in your school.
• The actual average height of all students (population) is 170 cm.
• But you don’t have time to measure everyone, so you randomly select 50 students
(sample).
• You find the average height of these 50 students is 168 cm.
• Sampling Error = 168 cm - 170 cm = -2 cm.
2.Actual Error (True Error) – Mistake in Measurement or Process
What is it?
• Sometimes, errors happen because of incorrect methods or faulty tools.
• This is a real error that should not happen if everything was done correctly.

Why does it happen?
• Systematic errors (like using a broken ruler that adds 2 cm to every height).
• Human mistakes (like reading a scale wrong).
Ex:Suppose your height measuring tool is faulty and adds 2 cm extra to everyone’s height.
• The real average height of students is 170 cm.
• But because of the faulty tool, your measurement shows 172 cm for the whole
population.
• This is Actual Error = 172 cm - 170 cm = 2 cm.
• Even if you take a perfect sample, your results will still be wrong because of the
measurement mistake.
Let us assume that D is the unknown distribution over the instance space X, the target function is f: X → {0, 1}, x is an instance, h(x) is the hypothesis, and S is the sample set of instances drawn from X. Then, the actual (true) error of h is the probability that h misclassifies an instance drawn at random according to D:

error_D(h) = Pr_{x ~ D} [ f(x) ≠ h(x) ]

• The other error is called the sample error (or estimator). The sample error is defined with respect to the sample S: it is the fraction of instances in S that are misclassified by h:

error_S(h) = (1/n) Σ_{x ∈ S} δ( f(x) ≠ h(x) ), where δ(·) is 1 if its argument is true and 0 otherwise.

p-value
• Statistical tests are performed to either accept or reject the null hypothesis. This is done using a value called the p-value (probability value). It indicates the probability of obtaining the observed result (or a more extreme one) if the null hypothesis were true, and is used to interpret or quantify the test.
• For example, a statistical test may give a p-value of 0.03. One can compare it with the level 0.05. As 0.03 < 0.05, the result is considered significant, meaning the variables tested are not independent. Here, 0.05 is called the significance level.
• In general, the significance level is called alpha (α) and the p-value is compared with α. If p-value ≤ α, the null hypothesis H0 is rejected (in favour of H1); if p-value > α, H0 is not rejected.


• 0.05 (5%): Standard significance level.


• 0.01 (1%): More strict, used in highly sensitive studies.
• 0.10 (10%): More lenient, used in exploratory research.
Ex: A company claims their new medicine lowers blood pressure more than the old one. We
conduct a test and calculate a p-value.
• If p-value = 0.02, it means there’s only a 2% chance that we would see this result if the
new medicine was not better.
• Since 0.02 is less than 0.05, we reject the null hypothesis and conclude the new medicine
is likely more effective.
Confidence Interval: A confidence interval helps us estimate where the true mean (average) of a population lies based on a sample. It gives us a range of values that we are fairly confident contains the true mean.
Confidence level = 1 − significance level
• If the significance level is 0.05 (5%), then the confidence level is 95%.
• This means we are 95% sure that the true mean lies within our calculated range.
• For a chosen confidence level, the interval is computed as: Confidence Interval = x̄ ± z · (s / √N).
Key Terms:
• Mean (xˉ) – The average of the sample data.
• Standard Deviation (s) – Tells how spread out the data is.
• Sample Size (N) – The number of observations in our sample.
• Z-score (z) – A value from statistical tables that corresponds to the confidence level
(e.g., 1.96 for 95% confidence).
• Margin of Error – The amount we add and subtract from the mean to get the
confidence interval.

Interpreting the Confidence Interval:


• If we get a 95% confidence interval of (50, 60), we can say:
→ "We are 95% confident that the true mean is between 50 and 60."
• If our hypothesis states the mean should be 55, and 55 is inside this range → We
accept the hypothesis.
If 55 is outside the range → We reject the hypothesis.
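A small sketch of a 95% confidence interval for a mean, using the z-based formula described above. The sample statistics below are assumed values for illustration.

```python
import math

x_bar = 55.0     # sample mean (assumed)
s = 10.0         # sample standard deviation (assumed)
n = 64           # sample size (assumed)
z = 1.96         # z-score for a 95% confidence level

margin_of_error = z * s / math.sqrt(n)           # 1.96 * 10 / 8 = 2.45
ci = (x_bar - margin_of_error, x_bar + margin_of_error)
print(ci)        # (52.55, 57.45) -> we are 95% confident the true mean lies here
```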


Comparing Learning Methods


1.Z test is a statistical test that is conducted on data that approximately follows a normal
distribution. The z test can be performed on one sample, two samples, or on proportions for
hypothesis testing. It checks if the means of two large samples are different or not when the
population variance is known.

2. t-test: The t-test is a hypothesis test that checks whether the difference between two sample means is real or due to chance. The t-statistic follows a t-distribution under the null hypothesis; the test is typically used when the number of samples is small (< 30) or the population variance is unknown.
One-sample t-test:
The mean of one group is checked against a set average, which can be either a theoretical value or the population mean.
• Select a group
• Compute its average
• Compare it with the theoretical value and compute the t-statistic:

t = (x̄ − μ) / ( s / √n )

where x̄ is the sample mean, μ is the hypothesized mean, s is the sample standard deviation and n is the sample size.

Given sample data: X = {1, 2, 3, 4, 5}
• Population mean μ = 12
• Population variance σ² = 2
Steps to Compute the Z-score:
The formula for the Z-test statistic is:

z = (X̄ − μ) / ( σ / √n )

where:
• X̄ = sample mean
• μ = population mean
• σ = population standard deviation
• n = sample size
Here X̄ = (1 + 2 + 3 + 4 + 5) / 5 = 3, σ = √2 ≈ 1.414 and n = 5, so
z = (3 − 12) / (1.414 / √5) ≈ −9 / 0.632 ≈ −14.23
The computed Z-score is approximately −14.23. This indicates that the sample mean (3) is extremely far from the population mean (12) in terms of standard deviations, suggesting a highly significant difference.
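A quick NumPy check of the z-score computed above:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
mu, sigma_sq = 12, 2

z = (X.mean() - mu) / np.sqrt(sigma_sq / X.size)
print(z)    # approximately -14.23
```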
Independent two-sample t-test: The t-statistic is computed for two independent groups A and B to test whether their means differ.

3. Paired t-test: Used to evaluate a hypothesis before and after an intervention (measurements taken on the same subjects), so the two samples are not independent.
4. Chi-Square test: A non-parametric test. The goodness-of-fit test statistic follows a chi-square distribution under the null hypothesis and measures the statistical significance between the observed frequencies and the expected frequencies, assuming each observation is independent of the others and drawn from an approximately normal population:

χ² = Σ (Oi − Ei)² / Ei

where Oi are the observed frequencies and Ei are the expected frequencies.

2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION


TECHNIQUES
• Features are attributes. Feature engineering is about determining the subset of features that
form an important part of the input that improves the performance of the model, be it
classification or any other model in machine learning.
• Feature engineering deals with two problems – Feature Transformation and Feature
Selection.
• The features can be removed based on two aspects:

1. Feature relevancy – Some features contribute more to classification than others. For example, a mole on the face can help more in face detection than a common feature such as the nose. In simple words, the selected features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table already has a Date of Birth field, an Age field is redundant because age can easily be computed from the date of birth.
So, the procedure is:
1. Generate all possible subsets of features
2. Evaluate the subsets and the model performance
3. Evaluate the results for optimal feature selection

2.10.1 Stepwise Forward Selection


This procedure starts with an empty set of attributes. At each step, the attribute with the best statistical significance (quality) is added to the reduced set. This process continues until a good reduced set of attributes is obtained.
2.10.2 Stepwise Backward Elimination
This procedure starts with a complete set of attributes. At every stage, the procedure removes the
worst attribute from the set, leading to the reduced set.
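Both stepwise strategies are available in scikit-learn's SequentialFeatureSelector (scikit-learn >= 0.24). The dataset and estimator below are only an assumed example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print(forward.get_support())    # mask of features kept by forward selection
print(backward.get_support())   # mask of features kept by backward elimination
```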
2.10.3 Principal Component Analysis
The idea of the principal component analysis (PCA) or KL transform is to transform a given set of
measurements to a new set of features so that the features exhibit high information packing
properties.
This leads to a reduced and compact set of features. Consider a group of random vectors of the
form:


The PCA algorithm is as follows:


1.The target dataset x is obtained
2.The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is X – m.
The objective of this process is to transform the dataset with zero mean.
3.The covariance of dataset x is obtained. Let it be C.
4.Eigen values and eigen vectors of the covariance matrix are calculated.
5.The eigen vector of the highest eigen value is the principal component of the dataset. The eigen
values are arranged in a descending order. The feature vector is formed with these eigen vectors in
its columns. Feature vector = {eigen vector1, eigen vector2, … , eigen vectorn}
6. Obtain the transpose of the feature vector. Let it be A.
7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector. Since A is orthogonal, the original data can be retrieved as:

x = Aᵀ × y + m
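A minimal NumPy sketch of the PCA steps listed above, applied to a small made-up 2D dataset:

```python
import numpy as np

x = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

m = x.mean(axis=0)                         # step 2: mean
X_adj = x - m                              # zero-mean data
C = np.cov(X_adj, rowvar=False)            # step 3: covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C)     # step 4: eigenvalues / eigenvectors

order = np.argsort(eig_vals)[::-1]         # step 5: sort eigenvalues descending
feature_vector = eig_vecs[:, order]        # eigenvectors as columns
A = feature_vector.T                       # step 6: transpose of the feature vector

y = A @ X_adj.T                            # step 7: PCA transform y = A(x - m)
x_back = (A.T @ y).T + m                   # inverse transform recovers the data
print(np.allclose(x_back, x))              # True
```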


2.10.4 Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of LDA is to project higher-dimensional data onto a line (lower-dimensional space). LDA is also used to classify the data. Let there be two classes, c1 and c2, and let m1 and m2 be the means of the patterns of the two classes. The means of classes c1 and c2 can be computed as:

m1 = (1/N1) Σ_{x ∈ c1} x  and  m2 = (1/N2) Σ_{x ∈ c2} x

where N1 and N2 are the numbers of patterns in c1 and c2.
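A minimal sketch of LDA as a supervised dimensionality-reduction and classification technique using scikit-learn; the iris dataset is only an assumed example.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)     # project 4-D data onto 2 discriminant axes

print(X_projected.shape)                  # (150, 2)
print(lda.means_)                         # per-class means of the features
print(lda.score(X, y))                    # classification accuracy on the training data
```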

2.10.5 Singular Value Decomposition (SVD) :


SVD is another useful decomposition technique. Let A be a matrix; then A can be decomposed as:

A = U S Vᵀ

Here, A is the given matrix of dimension m × n, U is an orthogonal matrix of dimension m × n, S is the diagonal matrix of dimension n × n, and V is an orthogonal matrix. The procedure for finding the decomposition matrices is as follows:
1. For the given matrix, find AAᵀ.
2. Find the eigenvalues of AAᵀ.
3. Sort the eigenvalues in descending order and pack the corresponding eigenvectors as the columns of a matrix U.
4. Arrange the square roots of the eigenvalues along the diagonal; this gives the diagonal matrix S.
5. Find the eigenvalues and eigenvectors of AᵀA and pack the eigenvectors as the columns of a matrix V.
Thus, A = U S Vᵀ. Here, U and V are orthogonal matrices, and the columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can retain only the largest k singular values and their vectors instead of the original matrix A:

A ≈ Σ_{i=1..k} σi · ui · viᵀ
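A minimal NumPy sketch of SVD and of a truncated (low-rank) reconstruction; the small matrix is an assumed example.

```python
import numpy as np

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, S, Vt = np.linalg.svd(A)                # A = U @ diag(S) @ Vt
print(U.shape, S, Vt.shape)                # singular values come out in descending order

# Keep only the largest k singular values for compression
k = 1
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(A_approx)                            # rank-1 approximation of A
```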

Problems on PCA

Problems on SVD

CHAPTER 2
BASIC LEARNING THEORY
3.3 Design of Learning System:

1.The first step is to select good training data. This is critical because the quality of the data
directly affects the success or failure of the model. The training data should provide feedback
to the algorithm, helping it improve its decisions over time. For example, in chess, the data
would help the algorithm learn which moves lead to success.

2. Next, we define a target function. This is what the algorithm uses to decide the best move
based on the current situation. In chess, the target function would help the algorithm choose its
next move based on the current board state.

3.Target Function Representation: Once the algorithm knows all the possible legal moves, we
need a way to represent the best move. This could be in the form of linear equations, graphs,
or tables. The goal is to identify the move that will lead to the best outcome.

The target function can be represented as a linear combination of board features:

V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm. The learned values of the weights w1 through w6 determine the relative importance of the various board features in determining the value of the board.

4. Choosing an Approximation Algorithm: To pick the best move, the algorithm needs to approximate which moves will be successful. This is done by analyzing past examples and learning from successes and failures; over time, the algorithm gets better at predicting the best moves. The aim is to reduce the error

E = Σ ( Vtrain(b) − V̂(b) )²

taken over the training examples b. Then, for every board feature xi, the weights are updated as

wi ← wi + η ( Vtrain(b) − V̂(b) ) xi

where η is a small learning-rate constant.


5. Finally, the system is refined after many examples, successes, and failures. The algorithm
continues to improve its decision-making process.

3.4 Concept Learning


Concept Learning is a core idea in machine learning where the goal is to learn a general
concept or pattern from specific examples. The concept could be anything the algorithm
needs to identify, like "What is a cat?" or "What is a successful chess move?" Concept learning
involves recognizing patterns in the data and generalizing them to make predictions or
decisions in the future.
1. Input:
• What it is: The data or features fed into the machine learning model for it to process. The data
is labeled with the name of a concept or category it belongs to.
• Purpose: The input provides the model with the information it needs to make decisions,
predictions, or classifications. Use past experience to train and build the model.
2. Output:
• What it is: The target function f, i.e., the prediction, decision, or result generated by the model based on the input data.
• Purpose: The output is the model’s response or action after processing the input. It is the
result the algorithm tries to optimize, depending on the task (e.g., making accurate
predictions).
3. Test:
• What it is: New instances to test the learned model.
• Purpose: Testing allows you to assess the model's generalization ability—how well it can apply
what it has learned to new, unseen data. This helps in understanding the model’s performance
and ensures it doesn't just memorize the training data (overfitting).


3.4.1 Representation of Hypothesis


• Representation of Hypothesis in machine learning refers to how the model or algorithm
represents the general rule or pattern it learns from the training data. This hypothesis
is the model's understanding or learned function that can be used to make predictions on
new, unseen examples.
• In simpler terms, it’s the "learned knowledge" of the model about the relationship
between input and output, and how it applies to new data.
Example: (Tail = Short) ∧ (Color = Black) ∧ …
• The set of hypotheses in the search space is collectively called the hypotheses. Generally, “H” is used to represent the hypotheses and “h” is used to represent a candidate hypothesis.
Each attribute condition is a constraint on the attribute, represented as an attribute–value pair. Each attribute can take the value “?”, the value “∅”, or a single specific value.
• ? – denotes that the attribute can take any value.
• ∅ -Denotes that the attribute cannot take any value.(represents null value)
• Single value denotes a specific single value from acceptable values of attributes.


3.4.2 Hypothesis Space


Refers to the set of all possible hypotheses that a machine learning model can form, given a
certain learning task and the available training data. It includes every possible solution or model
the algorithm can potentially choose from based on its design, the features, and the methods
used.
• The set of hypotheses that can be generated by learning algorithm can be further
reduced by specifying a language bias.
• The subset of hypothesis space that is consistent with all observed training instances
is called version space.(Used for classification)
• Example:

3.4.3 Heuristic Space Search :


Is a search strategy used in machine learning and artificial intelligence (AI) that helps in
finding a solution to a problem by exploring the hypothesis space in a more efficient way
using a heuristic function. A heuristic is essentially a rule of thumb or an approximation to
guide the search toward the most promising solutions.
Several commonly used heuristic space search methods are hill climbing, constraint satisfaction, best-first search, simulated annealing, genetic algorithms, etc.
3.4.4 Generalization and Specialization
Searching the Hypothesis Space :There are two ways of learning the hypothesis, consistent
with all training instances from the large hypothesis space.
1. Specialization – General to Specific learning
2. Generalization – Specific to General learning
Generalization – Specific to General Learning This learning methodology will search through
the hypothesis space for an approximate hypothesis by generalizing the most specific
hypothesis.
Specialization – General to Specific Learning This learning methodology will search through
the hypothesis space for an approximate hypothesis by specializing the most general
hypothesis.
Key Characteristics of Generalization:
1. Simplicity: Simple models tend to generalize better. They do not overfit to noise in the
data, making them more likely to perform well on new data.
2. Overfitting vs. Underfitting:
1. Overfitting occurs when a model learns to fit the training data too closely,
capturing noise and specific details that do not generalize to new data.
2. Underfitting occurs when a model is too simple to capture the underlying
patterns of the data.
Key Characteristics of Specialization:
1. Complexity: Specialized models are often more complex because they incorporate
more details or rules to fit the data.
2. Overfitting: While specialization can improve performance on training data, it often
leads to overfitting, where the model does not generalize well to unseen data.
3. Tailored Hypotheses: Specialized hypotheses focus on learning specific, narrow
patterns in the data, which can be beneficial when the data has very specific features.

3.4.5 Hypothesis Space Search by Find-S Algorithm

Find-S Algorithm is a simple machine learning algorithm used for searching the hypothesis
space to find the most specific hypothesis that correctly classifies all positive examples in a
given training set. It is a supervised learning algorithm that works in the context of concept
learning and specifically finds a maximally specific hypothesis from the hypothesis space.
Limitations of Find S Algorithm:
1. Find-S algorithm tries to find a hypothesis that is consistent with positive instances,
ignoring all negative instances. As long as the training dataset is consistent, the hypothesis
found by this algorithm may be consistent.

2. The algorithm finds only one unique hypothesis, whereas there may be many other hypotheses that are consistent with the training dataset.
3. Many times, the training dataset may contain errors or noise; such inconsistent instances can mislead this algorithm in determining the consistent hypothesis, since it ignores negative instances.
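A compact sketch of the Find-S algorithm on a small hypothetical dataset. Attribute values are strings; '?' means "any value" and None plays the role of the empty constraint (∅).

```python
def find_s(examples, labels):
    # Start with the most specific hypothesis: all attributes empty (None)
    h = [None] * len(examples[0])
    for x, label in zip(examples, labels):
        if label != "Yes":                 # negative instances are ignored
            continue
        for i, value in enumerate(x):
            if h[i] is None:               # first positive example: copy it
                h[i] = value
            elif h[i] != value:            # conflict: generalize to '?'
                h[i] = "?"
    return h

data = [("Sunny", "Warm", "Normal", "Strong"),
        ("Sunny", "Warm", "High",   "Strong"),
        ("Rainy", "Cold", "High",   "Strong"),
        ("Sunny", "Warm", "High",   "Strong")]
labels = ["Yes", "Yes", "No", "Yes"]

print(find_s(data, labels))   # ['Sunny', 'Warm', '?', 'Strong']
```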
3.4.6 Version Space
The version space contains the subset of hypotheses from the hypothesis space that is consistent
with all training instances in the training dataset
List-Then-Eliminate Algorithm

Candidate Elimination Algorithm

1. Generating the Positive Hypothesis (S)
• Start with the most specific hypothesis S (initially set to the most restrictive form, e.g., all
attributes set to "NULL" or "∅").
• For each positive example, generalize S just enough to cover the new example.
• If an attribute in S conflicts with a positive example, replace it with a more general value.
2. Generating the Negative Hypothesis (G)
• Start with the most general hypothesis G (initially set to the most generic form, e.g., all
attributes as "?" allowing any value).
• For each negative example, specialize G by refining the attributes to exclude the negative
case.
• If an example contradicts G, refine G by restricting it to exclude that example.
3. Generating the Version Space (VS)
• The Version Space is the set of hypotheses H that lie between S and G.
• As more examples are seen, S becomes more general, and G becomes more specific.
• The VS shrinks until S = G, at which point a single hypothesis remains, defining the
concept.

3.6 Modelling In Machine Learning
A machine learning model is an abstraction of the training dataset that can perform predictions on new data. Training a model involves feeding instances to the machine learning algorithm. Training datasets are used to fit and tune the model.
After training, a predictive model is generated, to which new data is fed for making predictions.
The modeling process includes:
Training a machine learning algorithm with the training dataset.
Tuning the model to increase performance.
Validating the model.
Making predictions on new, unseen data.
Key concerns in machine learning:
Choosing the right model.
How to train the model effectively.
Time required for training.
Dataset selection.
Expected performance of the model.
Machine Learning Process:
1. Choose a machine learning algorithm that suits the training data and problem domain.
2. Input the training dataset and train the algorithm to learn from the data and capture
patterns.
3. Tune the model parameters to improve the accuracy of learning.
4. Evaluate the learned model once it is built.
3.6.1 Model Selection and Evaluation:
The biggest challenge in machine learning is choosing an algorithm that suits the problem. Hence, model selection and assessment are important.
1. Model performance: How well does the model perform on the training dataset?
2. Model complexity: How much complexity does the model possess after the training phase is over?
Model selection is the process of selecting one good enough model among different machine learning models for the dataset, or of selecting different sets of features or hyperparameters for the same ML model.
It is difficult to find the best model because all models exhibit some predictive error for the problem, so at least a good enough model should be selected that performs fairly well on the dataset.

Some approaches used for selecting a machine learning model are listed below:
1.Use resample methods and split the dataset as training, testing and validation datasets and
observe the performance of the model over all the phases. This is applicable for only small
datasets.
2. The simplest approach is to fit the model on training dataset and to compute measures like
error or accuracy.
3.The use of probabilistic framework and quantification of the performance of the model as
score is the third approach.
3.6.2 Re-sampling Methods
1. Re-sampling is a technique used to select a model by reconstructing the training and test datasets. It involves randomly choosing instances from the given dataset using different methods. The process repeatedly selects different instances from the training dataset to tune the model.
The purpose of re-sampling is to improve the accuracy of the model.
Common re-sampling methods:
• Random train/test splits
• Cross-Validation (e.g., K-fold, LOOCV)
• Bootstrap

2. Cross-Validation is a method used to tune a model using only the training dataset. It is a model evaluation approach that sets aside a portion of the training dataset for validation while using the rest for training. The goal is to find the best model by estimating the average error on different test data.
Popular cross-validation methods include:
• Holdout method
• K-fold cross-validation
• Stratified cross-validation
Leave-One-Out Cross-Validation (LOOCV)
3. The Holdout Method is the simplest form of cross-validation. The dataset is split into two subsets: a training dataset (used to train the model) and a test dataset (used to evaluate the model). The model is trained on the training dataset and then evaluated on the test dataset. There are two variations of the holdout method: the single holdout method (applied once) and the repeated holdout method (applied multiple times with different splits). The model's performance is estimated based on the test dataset.
Limitations of the holdout method:
• Can exhibit high variance.
• Performance depends on how the dataset is split.
4.K-Fold Cross-Validation:
K-Fold Cross-Validation is a technique used to evaluate the performance of a machine
learning model by splitting the dataset into multiple subsets (folds) and training/testing the
model on different partitions.

Steps of K-Fold Cross-Validation:
1. The dataset is randomly split into K equal-sized subsets (folds).
2. The model is trained on K-1 folds and tested on the remaining 1 fold.
3. This process is repeated K times, each time using a different fold as the test set.
4. The final model performance is obtained by averaging the performance metrics from all
K iterations.
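A minimal sketch of k-fold cross-validation (and the stratified variant described in the next item) with scikit-learn; the dataset and estimator are assumed examples.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)          # 5 accuracy scores, one per fold
print(scores, scores.mean())                          # average performance over folds

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=skf).mean())    # preserves class proportions per fold
```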

5. Stratified Cross-Validation: Stratified cross-validation is a variation of k-fold cross-validation with a key difference: when splitting the dataset into k folds, it ensures that each fold contains the same proportion of instances for a given categorical value (class). This method is useful when dealing with imbalanced datasets, as it maintains the class distribution across folds.


Model Performance - Key Metrics


Example: Consider a test for detecting a disease, say cancer. Table 3.4 shows a contingency table for this scenario.

Test vs Disease      Has Cancer            Does Not Have Cancer
Positive             True Positive (TP)    False Positive (FP)
Negative             False Negative (FN)   True Negative (TN)

A contingency table is used to evaluate classifier performance based on:


• True Positive (TP): Correctly classified cancer patients.
• True Negative (TN): Correctly classified healthy individuals.
• False Positive (FP): Incorrectly classified healthy individuals as cancerous.

• False Negative (FN): Incorrectly classified cancer patients as healthy.
Performance Metrics:
1. Sensitivity (Recall / TPR): Probability of detecting actual positives.
   Sensitivity = TP / (TP + FN)
2. Specificity (TNR): Probability of detecting actual negatives.
   Specificity = TN / (TN + FP)
3. Positive Predictive Value (Precision): Probability that a positive result is correct.
   Precision = TP / (TP + FP)
4. Negative Predictive Value (NPV): Probability that a negative result is correct.
   NPV = TN / (TN + FN)
5. Accuracy: Overall correctness of the classifier.
   Accuracy = (TP + TN) / (TP + TN + FP + FN)
6. Precision & Recall: Precision (PPV) measures the accuracy of positive predictions; recall (sensitivity/TPR) measures the ability to detect actual positives.
7. F1 Score: Harmonic mean of precision and recall.
   F1 = 2 × (Precision × Recall) / (Precision + Recall)

Classifier Performance Metrics:
• Distance measures: Euclidean distance between classifier points (range: 0 to 1).
• Receiver Operating Characteristic (ROC) Curve:
  • Plots TPR (Sensitivity) vs. 1 − Specificity (FPR).
  • Area Under Curve (AUC) indicates classifier performance.
  • Ideal classifier: AUC = 1.0; random model: AUC ≈ 0.5.
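A minimal sketch of the contingency-table metrics and ROC-AUC using scikit-learn; the true labels, predictions and scores below are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]      # 1 = has disease, 0 = healthy
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]      # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # predicted scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                          # TP, TN, FP, FN counts
print(recall_score(y_true, y_pred))            # sensitivity / TPR
print(precision_score(y_true, y_pred))         # positive predictive value
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))          # area under the ROC curve
```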


AUC, Precision-Recall & Scoring Methods


AUC & Precision-Recall:
• Area Under Curve (AUC): Measures classifier performance across thresholds.
• AUC = 1: Perfect model
• AUC = 0: Completely incorrect model (AUC ≈ 0.5 corresponds to a random, no-skill model)
• Precision-Recall Curve: Useful for imbalanced datasets.
• ROC vs Precision-Recall:
  • ROC: When the class distribution is balanced.
  • Precision-Recall: When class imbalance exists.
Scoring Methods:
• Minimum Description Length (MDL): Selects models that balance complexity and performance. Based on Occam's Razor (simpler models are preferred).
