MODULE 2
CHAPTER 1
Understanding Data: 2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables, and bivariate analysis deals with the causes of and relationships between them. The aim is to find relationships among the data. Consider Table 2.3, which contains data on the temperature in a shop and the corresponding sales of sweaters.
Here, the aim of bivariate analysis is to find relationships among the variables. The relationships can then be used for comparisons, for finding causes, and for further exploration. To do that, a graphical display of the data is helpful. One such graphical method is the scatter plot. A scatter plot is used to visualize bivariate data; it is useful for plotting two variables, with or without nominal variables, to illustrate trends and to show differences. It is a plot of the explanatory variable against the response variable, i.e., a 2D graph showing the relationship between two variables.
Line graphs are similar to scatter plots. The line chart for the sales data is shown in Figure 2.12.
For this data, the covariance between X and Y, cov(X, Y) = (1/N) Σ (xi − x̄)(yi − ȳ), is 12. The covariance can be normalized to a value between −1 and +1 by dividing it by the product of the standard deviations of the two variables; the normalized value is called the Pearson correlation coefficient.
Sometimes, N − 1 is used instead of N in the denominator. In that case, the covariance is 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
1. If the value is positive, it indicates that the two dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, it indicates that the two dimensions are uncorrelated (there is no linear relationship between them).
If two dimensions are highly correlated, it is often better to remove one of them, as it is redundant. If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation coefficient, denoted r, is given as:
r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² ) = cov(X, Y) / (sX sY)
For the data of Table 2.3, r ≈ 0.832.
Since r is close to 1, this indicates a strong positive correlation between X and Y. As X
increases, Y tends to increase as well.
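As an illustrative sketch (the paired values below are made up and are not the exact entries of Table 2.3), the covariance and the Pearson correlation coefficient can be computed with NumPy:

import numpy as np

# hypothetical paired observations of two variables X and Y
x = np.array([5.0, 9.0, 10.0, 3.0, 5.0, 7.0])
y = np.array([6.0, 11.0, 12.0, 4.0, 6.0, 9.0])

cov_n  = np.cov(x, y, ddof=0)[0, 1]   # covariance with the 1/N convention
cov_n1 = np.cov(x, y, ddof=1)[0, 1]   # covariance with the 1/(N - 1) convention

# Pearson r: covariance divided by the product of the standard deviations
r = cov_n / (x.std(ddof=0) * y.std(ddof=0))
print(cov_n, cov_n1, r)
print(np.corrcoef(x, y)[0, 1])        # same value of r, computed directly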
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate analysis is the analysis of more than two observable variables, and often thousands of measurements need to be taken for one or more subjects.
Multivariate data has three or more variables. The aims of multivariate analysis are correspondingly broader; typical techniques include regression analysis, factor analysis and multivariate analysis of variance (MANOVA).
Heatmap
A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours each cell. Darker colours indicate larger values and lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well, so with colour shading, larger values can be spotted easily.
For example, in vehicle traffic data, heavy-traffic regions can be distinguished from low-traffic regions through a heatmap. In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis shows weights and the Y-axis shows patient counts. The dark-coloured regions highlight patients' weights versus patient counts by health status.
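A minimal sketch of drawing a heatmap with Matplotlib (the matrix values are random placeholders, not the patient data of Figure 2.13):

import numpy as np
import matplotlib.pyplot as plt

# placeholder 2D matrix, e.g. counts per (weight bin, patient group)
data = np.random.randint(0, 50, size=(6, 8))

plt.imshow(data, cmap="Greys")        # darker cells correspond to larger values
plt.colorbar(label="count")
plt.xlabel("weight bin")
plt.ylabel("patient group")
plt.title("Heatmap of a 2D matrix")
plt.show()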
This is true if y is not zero and A is not zero. The logic can be extended to a system of n equations with n unknown variables. It means that if A is the n × n coefficient matrix and y = (y1, y2, …, yn)ᵀ is the right-hand side vector, then the unknown vector x can be obtained as x = A⁻¹y, provided A is invertible.
If there is a unique solution, the system is called consistent independent. If there are many solutions, the system is called consistent dependent. If there are no solutions and the equations are contradictory, the system is called inconsistent. For solving a large system of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is as follows:
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate the entry below it in the second row using the row operation R2 → R2 − (a21/a11) R1. The same logic is used to remove the first-column entries of all the other rows.
4. Repeat the same logic on the remaining rows and columns to reduce the matrix to echelon form. Then, the unknown variables are obtained by solving the last equation first and substituting the results upwards.
This part is called backward substitution. To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping two rows
2. Multiplying or dividing a row by a non-zero constant
3. Replacing a row by adding or subtracting a multiple of another row to it.
A short code sketch of this procedure is given below.
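The following is a minimal sketch of Gaussian elimination with partial pivoting and backward substitution (written only for illustration; in practice numpy.linalg.solve would be used directly):

import numpy as np

def gaussian_elimination(A, y):
    # Solve A x = y by forward elimination followed by backward substitution.
    n = len(y)
    M = np.hstack([A.astype(float), y.astype(float).reshape(-1, 1)])  # augmented matrix [A | y]
    for k in range(n - 1):
        p = k + np.argmax(np.abs(M[k:, k]))   # partial pivoting: pick the largest pivot candidate
        M[[k, p]] = M[[p, k]]                 # swap rows (row operation 1)
        for i in range(k + 1, n):
            factor = M[i, k] / M[k, k]
            M[i, k:] -= factor * M[k, k:]     # R_i -> R_i - (a_ik / a_kk) R_k (row operation 3)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):            # backward substitution
        x[i] = (M[i, -1] - M[i, i + 1:n] @ x[i + 1:n]) / M[i, i]
    return x

# small usage example: x + y = 5, 2x - y = 1
print(gaussian_elimination(np.array([[1, 1], [2, -1]]), np.array([5, 1])))   # [2. 3.]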
Example Problems:
GAUSSIAN ELIMINATION
1. Given the following system of equations:
2x + 3y − z = 5
4x + y + 2z = 6
−2x + 5y + z = 4
Find x, y, z.
Step 1: Convert the Equations into an Augmented Matrix
We write only the coefficients and constants in matrix form:
[  2   3  −1 |  5 ]
[  4   1   2 |  6 ]
[ −2   5   1 |  4 ]
Applying the row operations of Gaussian elimination (normalizing the pivots and eliminating the entries below them) reduces this to the upper triangular system:
[ 1  1.5  −0.5 | 2.5 ]
[ 0  1    −0.8 | 0.8 ]
[ 0  0     6.4 | 2.6 ]
Now we have an upper triangular matrix, and we can solve for z, y, and x using back-substitution.
Step 7: Back-Substitution
We now solve for z, y, and x one by one.
From Row 3: 6.4z = 2.6, so z = 0.40625.
From Row 2: y − 0.8(0.40625) = 0.8, so y = 0.8 + 0.325 = 1.125.
From Row 1: x + 1.5(1.125) − 0.5(0.40625) = 2.5, so x = 2.5 − 1.6875 + 0.203125 ≈ 1.0156.
Final Answer:
x ≈ 1.0156, y = 1.125, z = 0.40625
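The answer can be cross-checked quickly with NumPy:

import numpy as np

A = np.array([[2.0, 3.0, -1.0],
              [4.0, 1.0, 2.0],
              [-2.0, 5.0, 1.0]])
b = np.array([5.0, 6.0, 4.0])
print(np.linalg.solve(A, b))   # [1.015625, 1.125, 0.40625]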
2. Given the following system of equations:
x + y = 5, 2x − y = 1. Find x, y.
Step 1: Write the augmented matrix:
[ 1   1 | 5 ]
[ 2  −1 | 1 ]
Step 2: Make the first pivot element 1. The first pivot is already 1.
Step 3: Make the element below the pivot 0:
R2 → R2 − 2R1, which gives the row (0, −3 | −9).
Back-substitution then gives y = 3 and x = 5 − y = 2.
Eigen Decomposition
A square symmetric matrix A can be factorized as A = QΛQᵀ, where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues, and Qᵀ is the transpose of matrix Q.
LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two matrices:
A = LU
Here, L is a lower triangular matrix and U is an upper triangular matrix.
The decomposition can be carried out using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, the row operations of Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.
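A brief sketch with SciPy (note that scipy.linalg.lu performs row pivoting, so it returns a permutation matrix P and the factorization is A = P L U):

import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 3.0, -1.0],
              [4.0, 1.0, 2.0],
              [-2.0, 5.0, 1.0]])

P, L, U = lu(A)                     # pivoted LU factorization
print(L)                            # lower triangular, unit diagonal
print(U)                            # upper triangular
print(np.allclose(A, P @ L @ U))    # True: the factors reproduce A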
I. Normal Distribution – The PDF of the normal distribution is given as:
f(x) = (1 / (s √(2π))) exp( −(x − m)² / (2s²) )
Here, m is the mean and s is the standard deviation. The normal distribution is characterized by two parameters – mean and variance. One important concept associated with the normal distribution is the z-score. It can be computed as:
z = (x − m) / s
This is useful to normalize the data.
II. Rectangular Distribution – This is also known as the uniform distribution. It has equal probability for all values in the range [a, b]. The PDF of the uniform distribution is given as follows:
f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise.
III. Exponential Distribution – This is a continuous probability distribution used to describe the time between events in a Poisson process. The exponential distribution is a special case of the Gamma distribution with the shape parameter fixed at 1. It is helpful in modelling the time until an event occurs. The PDF is given as follows:
f(x) = λ e^(−λx) for x ≥ 0, and 0 otherwise.
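A small scipy.stats sketch showing how these PDFs and the z-score can be evaluated (the parameter values are arbitrary, chosen only for illustration):

import numpy as np
from scipy import stats

x = 1.5
m, s = 0.0, 1.0
print(stats.norm.pdf(x, loc=m, scale=s))          # normal PDF at x
print((x - m) / s)                                # z-score of x

a, b = 0.0, 4.0
print(stats.uniform.pdf(x, loc=a, scale=b - a))   # uniform PDF on [a, b]

lam = 2.0
print(stats.expon.pdf(x, scale=1.0 / lam))        # exponential PDF with rate lambda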
2. Discrete Distributions
Binomial, Poisson, and Bernoulli distributions fall under this category.
I. Binomial Distribution – The binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes, success or failure; such a trial is called a Bernoulli trial. The objective is to find the probability of getting k successes out of n trials. This probability is given as:
P(X = k) = C(n, k) p^k (1 − p)^(n − k)
Here, p is the probability of success in each trial, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is np.
II. Poisson Distribution – This is another important and quite useful distribution. Given an interval of time, it is used to model the probability of observing a given number of events k in that interval, where the mean rate λ is the average number of events per interval. Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office. The Poisson probability of k events is given as follows:
P(X = k) = (λ^k e^(−λ)) / k!
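A quick scipy.stats sketch for these two discrete distributions (n, p and λ are arbitrary illustration values):

from scipy import stats

n, p = 10, 0.3
print(stats.binom.pmf(3, n, p))    # P(exactly 3 successes in n trials)
print(stats.binom.mean(n, p))      # mean = n * p

lam = 4.0
print(stats.poisson.pmf(2, lam))   # P(exactly 2 events) for rate lambda
print(stats.poisson.mean(lam))     # mean = lambda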
This involves formulating a function called the likelihood function, which is the conditional probability of observing the given samples under a distribution function with its parameters. For example, if the observations are X = {x1, x2, … , xn}, then density estimation is the problem of choosing a PDF with suitable parameters to describe the data.
MLE treats this as a search or optimization problem in which the joint probability of X under the parameter θ (theta) is maximized.
If one assumes that the regression problem can be framed as predicting the output y given the input x, then for p(y | x) the MLE framework can be applied by maximizing Σi log p(yi | xi; b). Here, b is the vector of regression coefficients and xi is the given sample. One can maximize this function, or equivalently minimize the negative log-likelihood, to obtain a solution for the linear regression problem. Eq. (2.37) yields the same answer as the least-squares approach.
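A hedged sketch of this equivalence: fitting a straight line by minimizing the Gaussian negative log-likelihood with scipy.optimize, then comparing against the ordinary least-squares solution (the data below is synthetic):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)   # synthetic data: y = 2x + 1 + noise

def neg_log_likelihood(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                         # keep sigma positive
    resid = y - (b0 + b1 * x)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (resid / sigma) ** 2)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
print(mle.x[:2])                                      # MLE estimates of intercept and slope

# least-squares solution for comparison: the coefficients agree (up to numerical tolerance)
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X, y, rcond=None)[0])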
Generally, there can be many unspecified distributions with different sets of parameters. The EM algorithm has two stages:
1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are estimated for each latent variable.
2. Maximization (M) Stage – In this stage, the parameters are optimized using the MLE function.
This process is iterative, and the iterations continue until all the latent variables are fitted effectively by probability distributions, along with their parameters.
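As one concrete illustration (not from the text), scikit-learn's GaussianMixture fits a mixture of Gaussians using exactly this E-step/M-step iteration; the one-dimensional data below is synthetic:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# synthetic data drawn from two Gaussian components
data = np.concatenate([rng.normal(0.0, 1.0, 300),
                       rng.normal(5.0, 0.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
gmm.fit(data)                  # alternates E and M steps until convergence
print(gmm.means_.ravel())      # estimated component means (approximately 0 and 5)
print(gmm.weights_)            # estimated mixing proportions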
• Another error measure is the sample error (also called the estimator error). The sample error is defined with respect to a sample S: it is the fraction of the instances in S that are misclassified (in contrast to the true error, which is the probability of misclassification for instances drawn from X). The sample error is given as follows:
error_S(h) = (1/n) Σ_{x ∈ S} δ( f(x) ≠ h(x) ), where n is the number of instances in S and δ(·) is 1 when its argument is true and 0 otherwise.
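A tiny sketch of computing the sample error as the fraction of misclassified instances (the labels are made up):

import numpy as np

true_labels = np.array([1, 0, 1, 1, 0, 1])   # f(x): actual labels
predictions = np.array([1, 0, 0, 1, 0, 0])   # h(x): predicted labels

sample_error = np.mean(true_labels != predictions)   # fraction misclassified
print(sample_error)   # 2 of 6 instances misclassified -> 0.333...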
p-value
• Statistical tests can be performed to either accept or reject the null hypothesis. This decision is made using a value called the p-value (probability value). The p-value quantifies the strength of the evidence against the null hypothesis and is used to interpret the test result.
• For example, a statistical test may give a p-value of 0.03. One can compare it with the level 0.05. As 0.03 < 0.05, the result is taken to be significant. This means that the variables tested are not independent. Here, 0.05 is called the significance level.
• In general, the significance level is called alpha, and the p-value is compared with alpha. If p-value ≤ alpha, the null hypothesis H0 is rejected; if p-value > alpha, H0 is not rejected.
2. t-test: The t-test is a hypothesis test that checks whether the difference between two sample means is real or due to chance.
The t-statistic follows a t-distribution under the null hypothesis, and the test is typically used when the number of samples is less than 30.
One-sample test:
The mean of one group is checked against a set average, which can be either a theoretical value or the population mean.
Select a group, compute its average, and compare it with the theoretical value by computing the test statistic:
t = (X̄ − μ) / (σ / √n)
where:
• X̄ = sample mean
• μ = population mean
• σ = population standard deviation
• n = sample size
The computed Z-score is approximately -14.23. This indicates that the sample mean (3) is
extremely far from the population mean (12) in terms of standard deviations, suggesting a
highly significant difference.
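A short sketch of a one-sample test in Python (the numbers are placeholders, not the data behind the −14.23 figure above): when the population standard deviation σ is known, the z-statistic is computed directly; when it is unknown, scipy.stats.ttest_1samp performs the corresponding t-test.

import numpy as np
from scipy import stats

sample = np.array([2.5, 3.1, 2.8, 3.4, 3.2])   # hypothetical measurements
mu = 12.0                                      # population mean to test against

sigma = 2.0                                    # assumed known population standard deviation
z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))
print(z)                                       # z-statistic

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu)
print(t_stat, p_value)                         # t-test using the sample standard deviation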
Independent two-sample t-test: The t-statistic is computed for two independent groups A and B.
3. Paired t-test: Used to evaluate a hypothesis before and after an intervention on the same subjects. In this case, the samples are not independent.
Problems on PCA
Problems on SVD
CHAPTER 2
BASIC LEARNING THEORY
3.3 Design of Learning System:
1. The first step is to select good training data. This is critical because the quality of the data directly affects the success or failure of the model. The training data should provide feedback to the algorithm, helping it improve its decisions over time. For example, in chess, the data would help the algorithm learn which moves lead to success.
2. Next, we define a target function. This is what the algorithm uses to decide the best move
based on the current situation. In chess, the target function would help the algorithm choose its
next move based on the current board state.
3. Target Function Representation: Once the algorithm knows all the possible legal moves, we need a way to represent the best move. This could be in the form of linear equations, graphs, or tables. The goal is to identify the move that will lead to the best outcome.
4. Choosing an Approximation Algorithm: To pick the best move, the algorithm needs to approximate which moves will be successful. This is done by analyzing past examples and learning from successes and failures. Over time, the algorithm gets better at predicting the best moves. The aim is to reduce the error of these approximations.
5. Finally, the system is refined after many examples, successes, and failures. The algorithm
continues to improve its decision-making process.
Find-S Algorithm is a simple machine learning algorithm used for searching the hypothesis
space to find the most specific hypothesis that correctly classifies all positive examples in a
given training set. It is a supervised learning algorithm that works in the context of concept
learning and specifically finds a maximally specific hypothesis from the hypothesis space.
Limitations of the Find-S Algorithm:
1. The Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances. Only when the training dataset is consistent (noise-free) is the hypothesis found by this algorithm guaranteed to be consistent. A small illustrative implementation is sketched below.
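A minimal sketch of Find-S on a toy dataset (the attribute names and values are hypothetical and chosen only for illustration):

def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs, with labels "yes"/"no"
    hypothesis = None
    for attributes, label in examples:
        if label != "yes":                     # Find-S ignores negative examples
            continue
        if hypothesis is None:
            hypothesis = list(attributes)      # start from the first positive example
        else:
            for i, value in enumerate(attributes):
                if hypothesis[i] != value:
                    hypothesis[i] = "?"        # generalize attributes that disagree
    return hypothesis

training_data = [
    (("sunny", "warm", "normal", "strong"), "yes"),
    (("sunny", "warm", "high", "strong"), "yes"),
    (("rainy", "cold", "high", "strong"), "no"),
    (("sunny", "warm", "high", "strong"), "yes"),
]
print(find_s(training_data))   # ['sunny', 'warm', '?', 'strong']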
2. Cross-Validation is a method used to tune a model using only the training dataset. It is a model evaluation approach that sets aside a portion of the training dataset for validation while using the rest for training. The goal is to find the best model by estimating the average error on different test data.
Popular cross-validation methods include:
• Holdout method
• K-fold cross-validation
• Stratified cross-validation
• Leave-One-Out Cross-Validation (LOOCV)
3. The Holdout Method is the simplest form of cross-validation. The dataset is split into two subsets:
• Training dataset (used to train the model)
• Test dataset (used to evaluate the model)
The model is trained on the training dataset and then evaluated on the test dataset. There are two variations of the holdout method:
• Single holdout method: applied once.
• Repeated holdout method: applied multiple times with different splits.
The model's performance is estimated based on the test dataset.
Limitations of the holdout method:
• Can exhibit high variance.
• Performance depends on how the dataset is split.
4. K-Fold Cross-Validation:
K-Fold Cross-Validation is a technique used to evaluate the performance of a machine learning model by splitting the dataset into k subsets (folds) and training/testing the model on different partitions: in each round, one fold is held out for testing and the remaining k − 1 folds are used for training.
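A short sketch of k-fold cross-validation with scikit-learn (the logistic regression model and the built-in iris dataset are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is used once as the held-out test fold
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # accuracy on each fold
print(scores.mean())   # average cross-validated accuracy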