Machine Learning Neeru
The term "Machine Learning" was introduced in 1959 by Arthur Samuel, a pioneering figure in early
American computer gaming and artificial intelligence, during his time at IBM. Samuel characterized
machine learning as "the field of study that gives computers the ability to learn without being explicitly
programmed." However, consensus on a universal definition of machine learning remains elusive,
with different authors providing varying interpretations.
Definition of Learning
Learning in the context of machine learning is defined as follows: A computer program is considered to
learn from experience E with respect to a class of tasks T and a performance measure P if its performance
on tasks in T, as measured by P, improves with experience E.
Examples
Training experience E : A sequence of images and steering commands recorded while observing a
human driver.
Definition
A computer program that learns from experience is termed a machine learning program or simply a
learning program. It is also occasionally referred to as a learner.
The machine learning process comprises four fundamental components, namely data storage,
abstraction, generalization, and evaluation. Each plays a crucial role in the learning journey,
contributing to the understanding and utilization of information. The following outlines these key
components and their respective functions:
1. Data Storage:
Role: Data storage facilities are pivotal in the learning process, serving as repositories for extensive
data. Both humans and machines rely on efficient data storage as a foundational element for advanced
reasoning.
Human Analog: In humans, data is stored in the brain, and retrieval involves electrochemical signals.
Machine Implementation: Computers utilize various storage devices such as hard disk drives, flash memory,
and random access memory, with dedicated hardware handling retrieval.
2. Abstraction:
Role: Abstraction is the second component, involving the extraction of knowledge from stored data.
This process includes forming general concepts that encapsulate the essence of the data. Knowledge
creation encompasses applying existing models and developing new ones, with training being the
process of fitting a model to a dataset.
Transformation: Once trained, the model transforms the data into an abstract representation that
summarizes the original information.
3. Generalization:
Role: Generalization, the third component, is the process of translating knowledge about stored data
into a form applicable for future actions. These actions are designed for tasks that share similarities
with, but are not identical to, those encountered before.
Objective: The goal in generalization is to uncover the properties of the data that are most relevant to
future tasks.
4. Evaluation:
Role: Evaluation is the final step in the learning process, assessing the effectiveness and efficiency of
the acquired knowledge and models.
Feedback Loop: The evaluation results guide adjustments to the learning process, ensuring continuous
refinement and improvement.
In summary, the machine learning process involves storing and retrieving data, abstracting knowledge
from the data through model training, generalizing this knowledge for future tasks, and finally,
evaluating and refining the acquired knowledge for ongoing improvement. This iterative cycle forms
the foundation of the machine learning journey.
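The four-component cycle above (storage, abstraction, generalization, evaluation) can be sketched in miniature. The data points and the linear model y = a·x + b below are illustrative assumptions, not part of the text:

```python
# A minimal sketch of the four components of the learning process.
# The data and the linear model y = a*x + b are illustrative assumptions.

# 1. Data storage: keep observed (x, y) pairs.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1)]

# 2. Abstraction: fit a line to the stored data by ordinary least squares.
n = len(data)
sx = sum(x for x, _ in data)
sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# 3. Generalization: apply the abstraction to an input not seen in training.
predict = lambda x: a * x + b
y_new = predict(5)

# 4. Evaluation: assess the model, here via mean squared error on the data.
mse = sum((predict(x) - y) ** 2 for x, y in data) / n
```

In a real system the evaluation result would feed back into step 2, closing the iterative loop the text describes.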
Machine learning has found diverse applications across various industries, revolutionizing how tasks
are performed, decisions are made, and systems operate. Here are some notable applications of
machine learning:
1. Healthcare:
Disease Diagnosis: Machine learning models analyze medical data to assist in early diagnosis of
diseases.
Drug Discovery: ML aids in identifying potential drug candidates and predicting their effectiveness.
2. Finance:
Fraud Detection: ML algorithms analyze transaction data to detect and prevent fraudulent activities.
3. Retail:
Recommendation Systems: ML powers personalized product recommendations for users.
Demand Forecasting: Predictive models help optimize inventory management and supply chains.
4. Marketing:
Customer Segmentation: ML categorizes customers based on behavior for targeted marketing.
5. Manufacturing:
Predictive Maintenance: ML predicts equipment failures, optimizing maintenance schedules.
Quality Control: Image recognition and analysis enhance product quality inspection.
6. Autonomous Vehicles:
Object Detection: ML enables vehicles to recognize and respond to objects and obstacles.
7. Computer Vision:
Object Identification: Algorithms analyze images and videos for object identification.
8. Cybersecurity:
Anomaly Detection: ML identifies unusual patterns to detect cybersecurity threats.
9. Education:
Personalized Learning: ML tailors educational content based on individual student progress.
These applications showcase the versatility of machine learning, demonstrating its impact on
optimizing processes, enhancing decision-making, and enabling innovations across various domains.
1. Evolutionary Advancements:
2. Ubiquitous Integration:
3. Human-Augmented AI:
Perspective: There is a shift towards human-machine collaboration, where machine
learning augments human capabilities rather than replacing them. This perspective
emphasizes the symbiotic relationship between humans and AI.
4. Data Abundance:
Perspective: The explosion of data availability is seen as a driving force for machine
learning, offering opportunities for more accurate models and insights across diverse
domains.
5. Cross-Disciplinary Collaboration:
6. Explainable AI (XAI):
Perspective: There is a growing need for machine learning models to be explainable and
interpretable, especially in critical applications such as healthcare and finance, to build
trust and transparency.
1. Ethical Considerations:
Issue: The ethical implications of machine learning, including bias, fairness, and the
societal impact of AI applications, raise complex challenges that require careful
consideration.
2. Bias and Fairness:
Issue: Machine learning models can perpetuate and amplify biases present in training
data, leading to unfair and discriminatory outcomes.
3. Data Privacy:
Issue: Concerns about the privacy of personal data used in machine learning models
highlight the need for robust data protection measures and compliance with privacy
regulations.
4. Security Vulnerabilities:
Issue: Machine learning models are susceptible to adversarial attacks, highlighting the
need for enhanced security measures to protect against manipulation and exploitation.
5. Interpretability:
Issue: The lack of interpretability in complex models, such as deep neural networks,
poses challenges in understanding model decisions, particularly in sensitive
applications.
6. Regulatory Frameworks:
7. Resource Intensiveness:
8. Data Quality:
Issue: The assumption that more data always leads to better models may overlook the
quality of that data, and reliance on massive datasets may not be feasible or ethical in
certain situations.
9. Generalization Challenges:
Issue: Ensuring that machine learning models generalize well to unseen data and diverse
scenarios remains a significant challenge, especially in dynamic environments.
Navigating these perspectives and addressing the associated issues requires ongoing interdisciplinary
collaboration, ethical considerations, and a commitment to developing machine learning solutions that
benefit society while minimizing potential risks and biases.
1. Supervised Learning:
Supervised learning involves providing a machine learning algorithm with a training set of examples,
each paired with correct responses or targets. The algorithm generalizes from this training set to
respond correctly to new, unseen inputs.
Characteristics:
Example: Consider patient data with gender, age, and health status labels. The algorithm learns to
predict health status based on gender and age.
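The patient example can be sketched as a tiny supervised learner. The records and the 1-nearest-neighbour rule below are illustrative assumptions (gender encoded as 0/1), not a method prescribed by the text:

```python
# Hedged sketch: 1-nearest-neighbour prediction of health status from
# labeled (gender, age) records. All records are made up for illustration.

patients = [
    # ((gender, age), health status label)
    ((0, 25), "healthy"),
    ((1, 34), "healthy"),
    ((0, 67), "at-risk"),
    ((1, 72), "at-risk"),
]

def predict(gender, age):
    """Return the label of the closest training example (squared distance)."""
    dist = lambda rec: (rec[0][0] - gender) ** 2 + (rec[0][1] - age) ** 2
    return min(patients, key=dist)[1]
```

For instance, `predict(0, 70)` returns the label of the nearest labeled patient, illustrating how the algorithm generalizes from labeled examples to new inputs.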
2. Unsupervised Learning:
Unsupervised learning involves algorithms without labeled responses. The algorithm aims to identify
similarities between inputs, grouping together those with common characteristics. A common approach
is density estimation.
Characteristics:
Example: Using patient data with gender and age but without health status labels, the algorithm seeks
to uncover patterns or groupings.
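A minimal sketch of such grouping is a 1-D two-means clustering on ages alone; the ages below are illustrative and no health-status labels are used:

```python
# Sketch: grouping unlabeled patients by age with a tiny 2-means clustering.
ages = [22, 25, 27, 61, 64, 70]

# Initialise the two cluster centres at the extremes, then refine.
c1, c2 = min(ages), max(ages)
for _ in range(10):  # a few passes suffice for this small example
    g1 = [a for a in ages if abs(a - c1) <= abs(a - c2)]
    g2 = [a for a in ages if abs(a - c1) > abs(a - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
```

The algorithm discovers a "younger" and an "older" group purely from similarity between inputs, with no labels supplied.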
3. Semi-Supervised Learning:
Semi-supervised learning is a type of machine learning where the algorithm is trained on a dataset that
contains both labeled and unlabeled data. While a subset of the training data has explicit labels, a larger
portion remains unlabeled. The algorithm leverages both labeled and unlabeled examples to improve
its learning and generalization capabilities.
Characteristics:
1. Mixed Training Data: Semi-supervised learning utilizes a mix of data with known labels
and data without labels in the training set.
2. Cost-Efficient Labeling:
3. Improved Generalization: By learning from both labeled and unlabeled instances, the
algorithm aims to generalize better to new, unseen data.
4. Broad Applicability: Commonly applied when labeled data is scarce, as is often the case
in domains such as medical imaging, speech recognition, and natural language processing.
Example:
In a dataset of financial transactions, only a small subset is labeled as fraudulent. Collecting labeled
instances of fraud is a costly and time-consuming process. Semi-supervised learning can be employed
by using the limited labeled instances of fraud along with a larger pool of unlabeled transactions.
Labeled Data:
Unlabeled Data:
The semi-supervised learning algorithm learns patterns from the labeled instances of fraud and non-
fraud, and it generalizes this knowledge to make predictions on the unlabeled transactions. This
approach allows for more efficient and cost-effective fraud detection compared to a fully supervised
model, as it benefits from a more extensive dataset while minimizing the need for exhaustive labeling.
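One common semi-supervised strategy (self-training, used here as an illustrative stand-in for the approach the text describes) can be sketched on the fraud example. Transactions are reduced to single amounts, and all values are made up:

```python
# Sketch of self-training on the fraud example: pseudo-label the unlabeled
# pool with a simple model, then fold the pseudo-labels back into training.

labeled = [(20, 0), (35, 0), (30, 0), (900, 1), (950, 1)]  # (amount, is_fraud)
unlabeled = [25, 40, 880, 920, 15]

def nearest_label(amount, examples):
    """Label by the closest labeled amount (a stand-in for a real model)."""
    return min(examples, key=lambda e: abs(e[0] - amount))[1]

for amount in unlabeled:
    labeled.append((amount, nearest_label(amount, labeled)))
```

The few labeled frauds propagate their labels to similar unlabeled transactions, enlarging the effective training set without further manual labeling.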
4. Reinforcement Learning:
Definition: Reinforcement learning lies between supervised and unsupervised learning. The algorithm
receives feedback when its responses are incorrect but is not explicitly told how to correct them. It
explores different possibilities until it learns how to achieve the correct answer.
Characteristics:
Example: Teaching a dog a new trick involves rewarding or punishing based on its actions. Similarly,
reinforcement learning trains computers for tasks where explicit instructions are not provided.
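The dog-trick analogy can be sketched as a tiny trial-and-error learner on a two-action task. The reward function below is an illustrative stand-in for the environment; the learner is only told the reward of the action it tried, never the correct action:

```python
# Sketch: reinforcement learning on a two-action task via reward feedback.
import random

def reward(action):
    return 1.0 if action == 1 else 0.0   # action 1 is the "trick"

q = [0.0, 0.0]   # estimated value of each action
counts = [0, 0]

def update(a):
    counts[a] += 1
    q[a] += (reward(a) - q[a]) / counts[a]   # incremental average of rewards

for a in (0, 1):          # explore every action once
    update(a)
for _ in range(100):      # then mostly exploit, occasionally explore
    a = random.randrange(2) if random.random() < 0.1 else q.index(max(q))
    update(a)
```

After a few trials the value estimate for the rewarded action dominates, so the learner settles on the correct behaviour without ever being told what it was.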
These four types of learning represent fundamental approaches in machine learning, each addressing
distinct challenges and applications. Supervised learning relies on labeled data, unsupervised
learning explores data patterns without labels, semi-supervised learning combines the two, and
reinforcement learning discovers optimal actions through exploration and feedback.
In random experiments, we are often interested in numerical outcomes, i.e., numbers associated with the
outcomes of the experiment. For example, when 50 coins are tossed, we may ask for the number of heads.
Whenever we associate a real number with each outcome of a trial, we are dealing with a function whose
range is a set of real numbers; such a function is called a random variable (r.v.), chance
variable, stochastic variable, or simply a variable.
Definition: Quantities which vary with some probabilities are called random variables.
Definition: By a random variable we mean a real number associated with the outcomes of a random
experiment.
Example 1.1: Suppose two coins are tossed simultaneously; the sample space is S = {HH, HT, TH, TT}.
Let X denote the number of heads. If X = 0 the outcome is {TT} and P(X = 0) = 1/4; if X = 1 the
outcome is {HT, TH} and P(X = 1) = 2/4; if X = 2 the outcome is {HH} and P(X = 2) = 1/4.
The probability distribution of this random variable X is given by the following table:

X = x       0     1     2    Total
P(X = x)   1/4   2/4   1/4     1
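Example 1.1 can be checked by enumerating the sample space directly; exact fractions avoid rounding:

```python
# Enumerate S = {HH, HT, TH, TT} and tabulate P(X = x), X = number of heads.
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=2))
dist = {x: Fraction(sum(1 for o in outcomes if o.count("H") == x), len(outcomes))
        for x in range(3)}
```

`dist` reproduces the table: 1/4, 2/4, 1/4, summing to 1.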
Example 1.2: Out of 24 mangoes, 6 are rotten; 2 mangoes are drawn. Obtain the probability distribution
of the number of rotten mangoes that can be drawn.
Let X denote the number of rotten mangoes drawn; then X can take the values 0, 1, 2.

P(X = 0) = C(18, 2) / C(24, 2) = (18 × 17)/(24 × 23) = 51/92;
P(X = 1) = (C(18, 1) × C(6, 1)) / C(24, 2) = (2 × 18 × 6)/(24 × 23) = 9/23;
P(X = 2) = C(6, 2) / C(24, 2) = (6 × 5)/(24 × 23) = 5/92.

The probability distribution of X is:

X = x        0       1      2     Total
P(X = x)   51/92   9/23   5/92      1
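Example 1.2 follows the hypergeometric pattern P(X = x) = C(6, x)·C(18, 2 − x)/C(24, 2), which can be verified exactly:

```python
# Verify Example 1.2 (rotten mangoes) with exact binomial coefficients.
from math import comb
from fractions import Fraction

def p(x, rotten=6, good=18, drawn=2):
    """P(X = x) for drawing `drawn` mangoes from rotten + good without replacement."""
    return Fraction(comb(rotten, x) * comb(good, drawn - x),
                    comb(rotten + good, drawn))
```

`p(0)`, `p(1)`, `p(2)` reproduce 51/92, 9/23 and 5/92, and they sum to 1.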
Note 3 (tail events): Let x be any real number; then the events {X < x}, {X > x}, {X ≤ x} and
{X ≥ x} are called tail events. For distinction, we may label them open, closed, upper and lower
tails. Often, simple r.v.'s are expanded as linear combinations of tail events.
4) Limit property: F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞.
Conditions (3), (4) and (5) are necessary as well as sufficient for F to be a c.d.f. on R.
Quantities which can take only a countable set of values (such as the integers) are called discrete random variables.
Examples:
Definition: Let X be a discrete random variable taking values x = 0, 1, 2, 3, .... Then P(X = x) is
called the probability mass function of X, and it satisfies the following:
(i) P(X = x) ≥ 0;
(ii) Σ_{x=0}^{∞} P(X = x) = 1.

A r.v. X is said to be discrete if there exist a countable number of points x₁, x₂, x₃, ... and
p(xᵢ) ≥ 0 such that Σᵢ p(xᵢ) = 1, with the c.d.f. given by

F(x) = Σ_{xᵢ ≤ x} p(xᵢ).

A finite equiprobable space is a finite probability distribution in which each sample point
x₁, x₂, x₃, ..., xₙ has the same probability pᵢ = 1/n for all i, so that Σ pᵢ = 1.
Expectation: The behaviour of a r.v., discrete or continuous, is completely characterized by its
distribution function F(x) or its density f(x) (P(xᵢ) in the discrete case). Instead of a function, a
more compact description can be made by a single number such as the mean (expectation), median or
mode, known as measures of central tendency of the r.v. X.

Standard Deviation: The standard deviation, denoted σ (S.D.), is the positive square root of the
variance. For a discrete r.v.,

σ² = E[(X − μ)²] = Σ_x (x² − 2μx + μ²) f(x)
   = Σ_x x² f(x) − 2μ Σ_x x f(x) + μ² Σ_x f(x)
   = E(X²) − 2μ² + μ² = E(X²) − μ²,

since μ = Σ_x x f(x) and Σ_x f(x) = 1.
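The shortcut identity Var(X) = E(X²) − μ² can be checked numerically on the distribution of Example 1.1 (number of heads in two coin tosses), with exact fractions:

```python
# Check Var(X) = E(X^2) - mu^2 on the two-coin distribution of Example 1.1.
from fractions import Fraction

dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
mu = sum(x * p for x, p in dist.items())                      # E(X)
ex2 = sum(x * x * p for x, p in dist.items())                 # E(X^2)
var_direct = sum((x - mu) ** 2 * p for x, p in dist.items())  # E[(X - mu)^2]
var_shortcut = ex2 - mu ** 2
```

Both routes give the same variance, 1/2, illustrating the derivation above.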
Let X be a continuous random variable taking values x, a ≤ x ≤ b. Its probability density function
(p.d.f.) f(x) satisfies the following:
(i) f(x) ≥ 0; (ii) ∫_a^b f(x) dx = 1.
(For a continuous r.v., f(x) is not a probability P(X = x); probabilities are obtained by
integrating f over an interval.)

Moments: If the range of the probability density function is from −∞ to ∞, the r-th moment about
the origin is defined as

μ′_r = ∫_{−∞}^{∞} x^r f(x) dx.

In particular, the mean is μ = μ′₁ = ∫_{−∞}^{∞} x f(x) dx, and the variance is

σ² = ∫_{−∞}^{∞} x² f(x) dx − [ ∫_{−∞}^{∞} x f(x) dx ]².
When the outcome of a random experiment can be characterized in more than one way, the probability
density is a function of more than one variate.
Example 1.4: When a card is drawn from an ordinary deck, it may be characterized according to its
suit and its denomination. Let X be a variate that assumes the values 1, 2, 3, 4 corresponding to
the suits in some order, say clubs, diamonds, hearts and spades, and let Y be a variate that assumes
the values 1, 2, 3, ..., 13 corresponding to the denominations Ace, 2, 3, ..., 10, J, Q, K. Then
(X, Y) is a 2-dimensional variate. The probability of drawing a particular card will be denoted by
f(x, y), and if each card is equally likely to be drawn, the density of (X, Y) is

f(x, y) = 1/52   for all 1 ≤ x ≤ 4, 1 ≤ y ≤ 13.
Trials whose outcomes can be characterized by two (three) variates give rise to bivariate (trivariate)
distributions, etc. Extensions to n-variate distributions are fairly straightforward.
The joint distribution of X and Y is said to be discrete if there exists a non-negative function P
that vanishes everywhere except at a finite or countably infinite number of points (x, y) in the
plane, such that P(x, y) = P(X = x, Y = y) for all x, y ∈ R.
Let X and Y have a joint discrete distribution. A function P which does not vanish on the set
{(xᵢ, yⱼ) : i, j = 1, 2, 3, ...} and satisfies the following properties:
(i) P(xᵢ, yⱼ) ≥ 0 for all i, j = 1, 2, 3, ... and (ii) Σᵢ Σⱼ P(xᵢ, yⱼ) = 1
is called a joint probability function.
Let X and Y be two jointly distributed variables with joint discrete density P(x, y); the individual
variates X and Y are themselves random variables. Writing Pᵢⱼ = P(X = xᵢ, Y = yⱼ), the marginal
probabilities are

P_X(xᵢ) = Σⱼ Pᵢⱼ   and   P_Y(yⱼ) = Σᵢ Pᵢⱼ.
Let X and Y have a joint discrete distribution with associated probability function P. Let the
possible values of X be {x₁, x₂, x₃, ...} and those of Y be {y₁, y₂, y₃, ...} respectively.

The conditional probability function of X, given Y = yⱼ, is defined by

P_{X|Y}(xᵢ | yⱼ) = P(xᵢ, yⱼ) / P_Y(yⱼ)   for i, j = 1, 2, 3, ...,
                 = 0 if P_Y(yⱼ) = 0.

Similarly, the conditional probability function of Y, given X = xᵢ, is defined by

P_{Y|X}(yⱼ | xᵢ) = P(xᵢ, yⱼ) / P_X(xᵢ)   for i, j = 1, 2, 3, ...,
                 = 0 if P_X(xᵢ) = 0.

Here P(xᵢ, yⱼ) = P(X = xᵢ, Y = yⱼ), P_Y(yⱼ) = P(Y = yⱼ) and P_X(xᵢ) = P(X = xᵢ).
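These definitions can be checked on the uniform card distribution of Example 1.4. The dictionary representation below is an illustrative choice, not from the text:

```python
# Marginals and conditionals of P(x, y) = 1/52, x = suit (1..4), y = rank (1..13).
from fractions import Fraction

P = {(x, y): Fraction(1, 52) for x in range(1, 5) for y in range(1, 14)}
P_X = {x: sum(p for (i, _), p in P.items() if i == x) for x in range(1, 5)}
P_Y = {y: sum(p for (_, j), p in P.items() if j == y) for y in range(1, 14)}
cond = lambda y, x: P[(x, y)] / P_X[x]   # P(Y = y | X = x)
```

As expected, each suit has marginal probability 1/4, each rank 1/13, and conditioning on a suit leaves each rank with probability 1/13.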
A 2-dimensional random vector (X, Y) is said to be continuous if there exists a function
f(x, y) ≥ 0 such that, for −∞ < x, y < ∞, the c.d.f. of (X, Y) is given by

F(x, y) = ∫_{−∞}^{x} [ ∫_{−∞}^{y} f(u, v) dv ] du.

Some properties of the joint density: Let f(x, y) ≥ 0 be the joint p.d.f. of the continuous random
vector (X, Y) and let F(x, y) be the c.d.f. of (X, Y). Then the following hold:
(i) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1;
(ii) P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b [ ∫_c^d f(x, y) dy ] dx.
Individual or Marginal Distributions: Let (X, Y) be a continuous random vector with joint c.d.f.
F(x, y).
Definition: Let (X, Y) be a 2-dimensional continuous random vector with joint p.d.f. f(x, y). Then
the individual or marginal distributions of X and Y are defined by the p.d.f.'s

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy   and   f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx.

Note: The conditional p.d.f. of X given Y = y is given by f_{X|Y}(x | y) = f(x, y) / f_Y(y), where
f_Y(y) is the marginal p.d.f. of Y.
Elementary algebra, using the rules of completion and balancing developed by al-Khwarizmi, allows
us to determine the value of an unknown variable x that satisfies an equation like the one below:

10x − 5 = 15 + 5x

An equation like this, which only involves an unknown (like x) and not its higher powers (x², x³),
along with additions (or subtractions) of the unknown multiplied by numbers (like 10x and 5x), is
called a linear equation. We now know, of course, that the equation above can be converted to a
special form ("number multiplied by unknown equals number", or ax = b, where a and b are
numbers):

5x = 20

Once in this form, it becomes easy to see that x = b/a = 4. Linear algebra is, in essence, concerned
with the solution of several linear equations in several unknowns.
Consider a general linear system of m equations in n unknowns:

a₁₁x₁ + a₁₂x₂ + ⋯ + a₁ₙxₙ = b₁
a₂₁x₁ + a₂₂x₂ + ⋯ + a₂ₙxₙ = b₂
⋮
aₘ₁x₁ + aₘ₂x₂ + ⋯ + aₘₙxₙ = bₘ

The numbers aᵢⱼ are called the coefficients of the linear system; because there are m
equations and n unknown variables, there are therefore m × n coefficients. The main
problem with a linear system is of course to solve it:

Problem 6.1: Find a list of n numbers (s₁, s₂, …, sₙ) that satisfies the above system of
linear equations.

For example, consider the linear system

2x₁ + 2x₂ + x₃ = 2
x₁ + 3x₂ − x₃ = 11

The list (1, 2, −4) is a solution, since

2(1) + 2(2) + (−4) = 2
(1) + 3(2) − (−4) = 11

On the other hand, the list (1, −1, 2) is not a solution: it satisfies the first equation,

2(1) + 2(−1) + (2) = 2

but

1 + 3(−1) − 2 = −4 ≠ 11.
A linear system may not have a solution at all. If this is the case, we say that the
linear system is inconsistent:
INCONSISTENT ⇔ NO SOLUTION
We will see shortly that a consistent linear system will have either just one solution
or infinitely many solutions. For example, a linear system cannot have just 4 or 5
solutions. If it has multiple solutions, then it will have infinitely many solutions.
Problem 6.3. Show that the linear system does not have a solution:

−x₁ + x₂ = 3
x₁ − x₂ = 1.

Solution. Adding the two equations gives

0 = 4

which is a contradiction. Therefore, there does not exist a list (s₁, s₂) that satisfies the
system, because this would lead to the contradiction 0 = 4.
Show that for any choice of the parameter t, the list (s₁, s₂, s₃) = (−3/2 − 2t, 3/2 + t, t) is a
solution to the linear system

x₁ + x₂ + x₃ = 0
x₁ + 3x₂ − x₃ = 3.

Solution. Substituting the list into the left-hand side of the first equation gives

(−3/2 − 2t) + (3/2 + t) + (t) = 0,

and into the left-hand side of the second equation,

(−3/2 − 2t) + 3(3/2 + t) − (t) = 6/2 = 3.

Both equations are satisfied for any value of t. Because we can vary t arbitrarily, we get an
infinite number of solutions parameterized by t. For example, compute the list (s₁, s₂, s₃) for t
= 3 and confirm that the resulting list is a solution to the linear system.
We will use matrices to develop systematic methods to solve linear systems and to study the
properties of the solution set of a linear system. Informally speaking, a matrix is an array or
table consisting of rows and columns.
For example

A = [  1  −2   1   0 ]
    [  0   2  −8   8 ]
    [ −4   7  11  −5 ]

is a matrix having m = 3 rows and n = 4 columns. In general, a matrix with m rows and n
columns is an m × n matrix, and the set of all such matrices will be denoted by M_{m×n}. Hence, A
above is a 3 × 4 matrix. The entry of A in the i-th row and j-th column will be denoted by aᵢⱼ. A
matrix containing only one column is called a column vector and a
matrix containing only one row is called a row vector. For example, here is a row vector

u = [ 1  −3  4 ]

and here is a column vector

v = [ −3 ]
    [  1 ]
We can associate to a linear system three matrices: (1) the coefficient matrix, (2) the output
column vector, and (3) the augmented matrix. For example, for the linear system
5x₁ − 3x₂ + 8x₃ = −1
x₁ + 4x₂ − 6x₃ = 0
2x₂ + 4x₃ = 3
the coefficient matrix A, the output vector b, and the augmented matrix [A b] are:
A = [ 5  −3   8 ]      b = [ −1 ]      [A b] = [ 5  −3   8  −1 ]
    [ 1   4  −6 ]          [  0 ]              [ 1   4  −6   0 ]
    [ 0   2   4 ]          [  3 ]              [ 0   2   4   3 ]
If a linear system has m equations and n unknowns, then the coefficient matrix A must be an m × n
matrix, that is, A has m rows and n columns. Using our previously defined notation, we can
write this as A ∈ M_{m×n}.
If we are given an augmented matrix, we can write down the associated linear system in an obvious
way. For example, the augmented matrix

[ 1  4  −2   8  12 ]
[ 0  1  −7   2  −4 ]
[ 0  0   5  −1   7 ]

corresponds to the linear system

x₁ + 4x₂ − 2x₃ + 8x₄ = 12
x₂ − 7x₃ + 2x₄ = −4
5x₃ − x₄ = 7.
We can study matrices without interpreting them as coefficient matrices or augmented matrices
associated to a linear system. Matrix algebra is a fascinating subject with numerous applications
in every branch of engineering, medicine, statistics, mathematics, finance, biology, chemistry,
etc.
There are three elementary operations on a linear system: (1) interchange two equations, (2)
multiply an equation by a nonzero constant, and (3) add a multiple of one equation to another.
These operations do not alter the solution set. The idea is to apply these operations iteratively to
simplify the linear system to a point where one can easily write down the solution set. It is
convenient to apply elementary operations on the augmented matrix [A b] representing the
linear system. In this case, we call the operations elementary row operations, and the process of
simplifying the linear system using these operations is called row reduction. The goal with row
reducing is to transform the original linear system into one having a triangular structure and
then perform back substitution to solve the system. This is best explained via an example.
Problem. Solve the linear system:

−3x₁ + 2x₂ + 4x₃ = 12
x₁ − 2x₃ = −4
2x₁ − 3x₂ + 4x₃ = −3
Solution. Our goal is to perform elementary row operations to obtain a triangular structure
and then use back substitution to solve. The augmented matrix is
[ −3   2   4  12 ]
[  1   0  −2  −4 ]
[  2  −3   4  −3 ]

Interchange R1 and R2 (R1 ↔ R2):

[  1   0  −2  −4 ]
[ −3   2   4  12 ]
[  2  −3   4  −3 ]
As you will see, this first operation will simplify the next step. Add 3R1 to R2:
[ 1   0  −2  −4 ]
[ 0   2  −2   0 ]
[ 2  −3   4  −3 ]
Add −2 R1 to R3 :
[ 1   0  −2  −4 ]
[ 0   2  −2   0 ]
[ 0  −3   8   5 ]
Multiply R2 by 1/2:

[ 1   0  −2  −4 ]
[ 0   1  −1   0 ]
[ 0  −3   8   5 ]
Add 3 R2 to R3:
[ 1  0  −2  −4 ]
[ 0  1  −1   0 ]
[ 0  0   5   5 ]
Multiply R3 by 1/5:

[ 1  0  −2  −4 ]
[ 0  1  −1   0 ]
[ 0  0   1   1 ]
We can continue row reducing but the row reduced augmented matrix is in triangular form. So
now use back substitution to solve. The linear system associated to the row reduced augmented
matrix is
x₁ − 2x₃ = −4
x₂ − x₃ = 0
x₃ = 1
The last equation gives x₃ = 1. From the second equation, x₂ − x₃ = 0, we obtain
x₂ = 1. The first equation then gives x₁ = −4 + 2(1) = −2. Thus, the solution to the original
system is (−2, 1, 1). You should verify that (−2, 1, 1) is a solution to the original system.
M = [ −3   2   4  12 ]      −3x₁ + 2x₂ + 4x₃ = 12
    [  1   0  −2  −4 ]  →   x₁ − 2x₃ = −4
    [  2  −3   4  −3 ]      2x₁ − 3x₂ + 4x₃ = −3

N = [ 1  0  −2  −4 ]      x₁ − 2x₃ = −4
    [ 0  1  −1   0 ]  →   x₂ − x₃ = 0
    [ 0  0   1   1 ]      x₃ = 1
Although the two augmented matrices M and N are clearly distinct, it is a fact that they have the
same solution set.
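The elimination-plus-back-substitution procedure worked through above can be sketched in code. The function name `solve` is an illustrative choice, and exact rational arithmetic is used to avoid floating-point error:

```python
# Gaussian elimination with back substitution on an augmented matrix.
from fractions import Fraction

def solve(aug):
    """Solve a square linear system given as augmented rows [a1, ..., an, b]."""
    m = [[Fraction(v) for v in row] for row in aug]
    n = len(m)
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        piv = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    # Back substitution on the triangular system.
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][-1] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

solution = solve([[-3, 2, 4, 12], [1, 0, -2, -4], [2, -3, 4, -3]])
```

Applied to the augmented matrix M of the worked example, it recovers the solution (−2, 1, 1).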
Problem 6.7. Using elementary row operations, show that the linear system is inconsistent.
x₁ + 2x₃ = 1
x₂ + x₃ = 0
2x₁ + 4x₃ = 1
The augmented matrix is

[ 1  0  2  1 ]
[ 0  1  1  0 ]
[ 2  0  4  1 ]

Adding −2R1 to R3:

[ 1  0  2   1 ]
[ 0  1  1   0 ]
[ 0  0  0  −1 ]

The last row of the row-reduced augmented matrix corresponds to the equation

0x₁ + 0x₂ + 0x₃ = −1
Obviously, there are no numbers x 1 , x 2 , x 3 that satisfy this equation, and therefore, the linear
system is inconsistent, i.e., it has no solution. In general, if we obtain a row in an augmented
matrix of the form
[ 0 0 0 … 0 c ]
where c is a nonzero number, then the linear system is inconsistent. We will call this type of row
an inconsistent row. However, a row of the form
[ 0 1 0 0 0]
corresponds to the equation x 2=0 which is perfectly valid.
The solution set of the linear system

x₁ − 2x₂ = −1
−x₁ + 3x₂ = 3

is the intersection of the two lines determined by the equations of the system. The solution for
this system is (3, 2): the two lines intersect at the point (x₁, x₂) = (3, 2), see Figure 6.1.
Figure 6.1: The intersection point of the two lines is the solution of the linear system
The solution set of the linear system

x₁ − 2x₂ + x₃ = 0
2x₂ − 8x₃ = 8
−4x₁ + 5x₂ + 9x₃ = −9

is the intersection of the three planes determined by the equations of the system. In this case,
there is only one solution: (29, 16, 3). In the case of a consistent system of two equations in
three unknowns, the solution set is the line of intersection of the two planes determined by the
equations of the system, see Figure 6.2.
Figure 6.2: The intersection of the two planes is the solution set of the linear system
For example, the linear system

x₁ + 5x₂ − 2x₄ − x₅ + 7x₆ = −4
2x₂ − 2x₃ + 3x₆ = 0
−9x₄ − x₅ + x₆ = −1
5x₅ + x₆ = 5
0 = 0

has augmented matrix

[ 1  5   0  −2  −1  7  −4 ]
[ 0  2  −2   0   0  3   0 ]
[ 0  0   0  −9  −1  1  −1 ]
[ 0  0   0   0   5  1   5 ]
[ 0  0   0   0   0  0   0 ]
P1. All rows consisting entirely of zeros are at the bottom of the matrix.
P2. The leftmost nonzero entry of a row is to the right of the leftmost nonzero entry of the row
above it.
Any matrix satisfying properties P1 and P2 is said to be in row echelon form (REF). In REF, the
leftmost nonzero entry in a row is called a leading entry:
[ 1  5   0  −2  −1  7  −4 ]
[ 0  2  −2   0   0  3   0 ]
[ 0  0   0  −9  −1  1  −1 ]
[ 0  0   0   0   5  1   5 ]
[ 0  0   0   0   0  0   0 ]

The leading entries in this matrix are 1, 2, −9 and 5.
We can perform elementary row operations, or row reduction, to transform a matrix into REF.
Problem 6.8. Explain why the following matrices are not in REF. Use elementary row
operations to put them in REF.
M = [ 3  −1  0  3 ]      N = [ 7  5   0  −3 ]
    [ 0   0  0  0 ]          [ 0  3  −1   1 ]
    [ 0   1  3  0 ]          [ 0  6  −5   2 ]
Solution. Matrix M fails property P1. To put M in REF we interchange R2 with R3:
[ 3  −1  0  3 ]
[ 0   1  3  0 ]
[ 0   0  0  0 ]
The matrix N fails property P2. To put N in REF we perform the operation − 2 R2 + R3 → R3:
[ 7  5   0  −3 ]
[ 0  3  −1   1 ]
[ 0  0  −3   0 ]
Why is REF useful? Certain properties of a matrix can be easily deduced if it is in REF. For
now, REF is useful to us for solving a linear system of equations. If an augmented matrix is in
REF, we can use back substitution to solve the system, just as we did in the previous problems.
For example, consider the system in REF:

8x₁ − 2x₂ + x₃ = 4
3x₂ − x₃ = 7
2x₃ = 4

with augmented matrix

[ 8  −2   1  4 ]
[ 0   3  −1  7 ]
[ 0   0   2  4 ]
From the last equation we obtain that 2 x3 =4 , and thus x 3=2. Substituting x 3=2 into the second
equation we obtain that x 2=3. Substituting x 3=2 and x 2=3 into the first equation we obtain that
x 1=1.
Although REF simplifies the problem of solving a linear system, later in the course we will need
to completely row reduce matrices into what is called reduced row echelon form (RREF). A
matrix is in RREF if it is in REF (so it satisfies properties P1 and P2) and in addition satisfies
the following properties:
P3. The leading entry in each nonzero row is 1.
P4. Each leading 1 is the only nonzero entry in its column.
A leading 1 in the RREF of a matrix is called a pivot. For example, the following matrix is in
RREF:

[ 1  6  0   3  0  0 ]
[ 0  0  1  −4  0  5 ]
[ 0  0  0   0  1  7 ]
Problem 6.9. Use row reduction to transform the matrix into RREF.
[ 0   3  −6   6  4  −5 ]
[ 3  −7   8  −5  8   9 ]
[ 3  −9  12  −9  6  15 ]
Solution. The first step is to make the top leftmost entry nonzero:
Interchange R1 and R3 (R3 ↔ R1):

[ 3  −9  12  −9  6  15 ]
[ 3  −7   8  −5  8   9 ]
[ 0   3  −6   6  4  −5 ]
Multiply R1 by 1/3:

[ 1  −3   4  −3  2   5 ]
[ 3  −7   8  −5  8   9 ]
[ 0   3  −6   6  4  −5 ]
Add −3R1 to R2, then multiply R2 by 1/2:

[ 1  −3   4  −3  2   5 ]
[ 0   1  −2   2  1  −3 ]
[ 0   3  −6   6  4  −5 ]
Add −3R2 to R3:

[ 1  −3   4  −3  2   5 ]
[ 0   1  −2   2  1  −3 ]
[ 0   0   0   0  1   4 ]
We have now completed the top-to-bottom phase of the row reduction algorithm. In the next
phase, we work bottom-to-top and create zeros above the leading 1’s. Create zeros above the
leading 1 in the third row:
Add −R3 to R2:

[ 1  −3   4  −3  2   5 ]
[ 0   1  −2   2  0  −7 ]
[ 0   0   0   0  1   4 ]
Add −2R3 to R1:

[ 1  −3   4  −3  0  −3 ]
[ 0   1  −2   2  0  −7 ]
[ 0   0   0   0  1   4 ]
Add 3R2 to R1:

[ 1  0  −2  3  0  −24 ]
[ 0  1  −2  2  0   −7 ]
[ 0  0   0  0  1    4 ]
This completes the row reduction algorithm, and the matrix is in RREF.
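The full forward-and-backward algorithm of Problem 6.9 can be sketched as a general RREF routine (the function name `rref` is an illustrative choice); exact fractions reproduce the result above:

```python
# Reduced row echelon form via row reduction, with exact arithmetic.
from fractions import Fraction

def rref(rows):
    m = [[Fraction(v) for v in row] for row in rows]
    nrows, ncols = len(m), len(m[0])
    lead = 0
    for r in range(nrows):
        # Advance to the next column with a nonzero entry at or below row r.
        while lead < ncols and all(m[i][lead] == 0 for i in range(r, nrows)):
            lead += 1
        if lead == ncols:
            break
        piv = next(i for i in range(r, nrows) if m[i][lead] != 0)
        m[r], m[piv] = m[piv], m[r]
        m[r] = [v / m[r][lead] for v in m[r]]        # scale to a leading 1
        for i in range(nrows):
            if i != r and m[i][lead] != 0:           # clear the rest of the column
                f = m[i][lead]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        lead += 1
    return m

result = rref([[0, 3, -6, 6, 4, -5],
               [3, -7, 8, -5, 8, 9],
               [3, -9, 12, -9, 6, 15]])
```

Since the RREF of a matrix is unique, the routine arrives at the same final matrix even if its pivot choices differ from the steps shown above.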
Problem 6.10. Solve the linear system

2x₁ + 4x₂ + 6x₃ = 8
x₁ + 2x₂ + 4x₃ = 8
3x₁ + 6x₂ + 9x₃ = 12

Solution. The augmented matrix is

[ 2  4  6   8 ]
[ 1  2  4   8 ]
[ 3  6  9  12 ]
Multiply R1 by 1/2:

[ 1  2  3   4 ]
[ 1  2  4   8 ]
[ 3  6  9  12 ]
Add −R1 to R2:

[ 1  2  3   4 ]
[ 0  0  1   4 ]
[ 3  6  9  12 ]
Add −3R1 to R3:

[ 1  2  3  4 ]
[ 0  0  1  4 ]
[ 0  0  0  0 ]
The system is consistent; however, there are only 2 nonzero rows but 3 unknown variables. This
means that the solution set will contain 3 − 2 = 1 free parameter. The second row of the
augmented matrix is equivalent to the equation x₃ = 4.
Substituting x₃ = 4 into the first equation, x₁ + 2x₂ + 3x₃ = 4, gives
x₁ + 2x₂ = −8.
We now must choose one of the variables x₁ or x₂ to be a parameter, say t, and solve for the
remaining variable. If we set x₂ = t, then from x₁ + 2x₂ = −8 we obtain

x₁ = −8 − 2t.
We can therefore write the solution set for the linear system as
x₁ = −8 − 2t
x₂ = t
x₃ = 4
where t can be any real number. If we had chosen x₁ to be the parameter, say x₁ = t, then the
solution set can be written as

x₁ = t
x₂ = −4 − t/2
x₃ = 4

Although these are two different parameterizations, they both give the same solution set.
In general, if a linear system has n unknown variables and the row reduced augmented matrix
has r leading entries, then the number of free parameters d in the solution set is d=n−r . Thus,
when performing back substitution, we will have to set d of the unknown variables to arbitrary
parameters. In the previous example, there are n=3 unknown variables and the row reduced
augmented matrix contained r =2 leading entries. The number of free parameters was therefore
d = n − r = 3 − 2 = 1. Because the number of leading entries r in the row reduced coefficient
matrix determines the number of free parameters, we will refer to r as the rank of the coefficient
matrix: r = rank(A).
Problem 6.11. Solve the linear system represented by the augmented matrix
[ 1  −7   2  −5   8  10 ]
[ 0   1  −3   3   1  −5 ]
[ 0   0   0   1  −1   4 ]
Solution. The number of unknowns is n=5 and the augmented matrix has rank r =3 (leading
entries). Thus, the solution set is parameterized by d=5−3=2 free variables, call them t and s.
The last equation of the augmented matrix is x 4 −x5 =4 . We choose x 5 to be the first parameter
so we set x 5=t . Therefore, x 4 =4+ t . The second equation of the augmented matrix is
x₂ − 3x₃ + 3x₄ + x₅ = −5

and the unassigned variables are x₂ and x₃. We choose x₃ to be the second parameter, say
x₃ = s. Then

x₂ = −5 + 3x₃ − 3x₄ − x₅ = −5 + 3s − 3(4 + t) − t = −17 − 4t + 3s.

We now use the first equation of the augmented matrix to write x₁ in terms of the other
variables:

x₁ = 10 + 7x₂ − 2x₃ + 5x₄ − 8x₅ = −89 − 31t + 19s
x₁ = −89 − 31t + 19s
x₂ = −17 − 4t + 3s
x₃ = s
x₄ = 4 + t
x₅ = t
where t and s are arbitrary real numbers. Choose arbitrary numbers for t and s and substitute the
corresponding list (x 1 , x 2 ,... , x5 ) into the system of equations to verify that it is a solution.
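The substitution check suggested above can be sketched in code; the helper name `solution` is an illustrative choice:

```python
# Spot-check the parameterized solution of Problem 6.11 against the
# equations encoded in the augmented matrix.

aug = [[1, -7, 2, -5, 8, 10],
       [0, 1, -3, 3, 1, -5],
       [0, 0, 0, 1, -1, 4]]

def solution(t, s):
    """The parameterized list (x1, ..., x5) derived in the text."""
    return [-89 - 31 * t + 19 * s, -17 - 4 * t + 3 * s, s, 4 + t, t]

for t, s in [(0, 0), (1, 2), (-3, 5)]:
    x = solution(t, s)
    for row in aug:
        # Each equation row[:-1] . x must equal the right-hand side row[-1].
        assert sum(c * v for c, v in zip(row[:-1], x)) == row[-1]
```

Every choice of t and s satisfies all three equations, confirming the two-parameter family of solutions.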
1. The augmented matrix contains an inconsistent row [0 0 … 0 c] with c ≠ 0.
2. All the rows of the augmented matrix are consistent and there are no free parameters.
3. All the rows of the augmented matrix are consistent and there are d ≥ 1 variables
that must be set to arbitrary parameters.
In Case 1., the linear system is inconsistent and thus has no solution. In Case 2., the linear
system is consistent and has only one (and thus unique) solution. This case occurs when r
= rank(A) = n since then the number of free parameters is d = n−r = 0. In Case 3., the
linear system is consistent and has infinitely many solutions. This case occurs when r < n
and thus d = n −r > 0 is the number of free parameters.
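These three cases can be detected mechanically by row-reducing the augmented matrix and counting leading entries. A minimal sketch (the `row_reduce` and `classify` helpers are illustrative, not from the text):

```python
def row_reduce(M):
    """Forward-eliminate a matrix (list of rows) into echelon form."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    pivot_row = 0
    for col in range(cols):
        # Find a row at or below pivot_row with a nonzero entry in this column.
        target = next((r for r in range(pivot_row, rows)
                       if abs(M[r][col]) > 1e-12), None)
        if target is None:
            continue
        M[pivot_row], M[target] = M[target], M[pivot_row]
        for r in range(pivot_row + 1, rows):
            factor = M[r][col] / M[pivot_row][col]
            M[r] = [a - factor * b for a, b in zip(M[r], M[pivot_row])]
        pivot_row += 1
    return M

def classify(augmented, n):
    """Return 'inconsistent', 'unique', or the number of free parameters."""
    R = row_reduce(augmented)
    # A row (0 ... 0 | b) with b != 0 signals inconsistency (Case 1).
    for row in R:
        if all(abs(a) < 1e-12 for a in row[:n]) and abs(row[n]) > 1e-12:
            return "inconsistent"
    r = sum(any(abs(a) > 1e-12 for a in row[:n]) for row in R)  # rank of A
    d = n - r
    return "unique" if d == 0 else f"{d} free parameter(s)"

print(classify([[1, 1, 2], [0, 1, 1]], 2))    # unique
print(classify([[1, 1, 2], [1, 1, 3]], 2))    # inconsistent
# The augmented matrix of Problem 6.11 falls in Case 3:
print(classify([[1, -7, 2, -5, 8, 10],
                [0, 1, -3, 3, 1, -5],
                [0, 0, 0, 1, -1, 4]], 5))     # 2 free parameter(s)
```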
13.1 Vector: A vector V can be considered as an ordered list of numbers. In other words, a vector is a mathematical entity characterized by both magnitude and direction, often represented as an ordered set of values in a multi-dimensional space:
V = (v1, v2, ..., vn)^T
The above vector is an example of an n-dimensional vector, or n-dimensional column vector. The set of all n-tuples of real numbers is denoted R^n. As an equation it is written as
R^n := { (v1, v2, ..., vn)^T | v1, v2, ..., vn ∈ R }.
Thus, a particular n-tuple in R^n, say V = (v1, v2, ..., vn)^T, denotes a point in n-space. The numbers vi are called the coordinates, components, entries, or elements of V. Another vector U = (u1, u2, ..., um) is an example of an m-dimensional row vector, R^m := { (u1, u2, ..., um) | u1, u2, ..., um ∈ R }.
13.2 Vector addition and scalar multiplication: Consider two vectors U = (u1, u2, ..., un)^T and V = (v1, v2, ..., vn)^T in R^n. The sum, written U + V, is the vector in R^n obtained by adding the corresponding elements of U and V. That is,
U + V = (u1 + v1, u2 + v2, ..., un + vn)^T.
The scalar product of the vector V by a real number k is obtained by multiplying each element of V by k. That is,
kV = k(v1, v2, ..., vn)^T = (k v1, k v2, ..., k vn)^T.
Example 1:
(a) Let A = (1, 2, 3, 4)^T and B = (4, 3, 2, 0)^T. Then
A + B = (1+4, 2+3, 3+2, 4+0)^T = (5, 5, 5, 4)^T
and
3A − B = (3(1)−4, 3(2)−3, 3(3)−2, 3(4)−0)^T = (−1, 3, 7, 12)^T.
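The componentwise operations of Example 1 translate directly into code; a minimal sketch with illustrative helper names:

```python
def vec_add(U, V):
    """Componentwise sum of two vectors of equal length."""
    return [u + v for u, v in zip(U, V)]

def scalar_mul(k, V):
    """Multiply every component of V by the scalar k."""
    return [k * v for v in V]

A = [1, 2, 3, 4]
B = [4, 3, 2, 0]
print(vec_add(A, B))                                   # [5, 5, 5, 4]
print(vec_add(scalar_mul(3, A), scalar_mul(-1, B)))    # 3A - B = [-1, 3, 7, 12]
```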
(b) The zero vector O = (0, 0, ..., 0)^T is the one which points in no direction. For any vector A = (a1, a2, ..., an)^T ∈ R^n, A + O = (a1 + 0, a2 + 0, ..., an + 0)^T = A.
A single linear algebraic equation a1 x1 + a2 x2 + ... + an xn = b defines a hyperplane in R^n. A line through a point P in the direction of a non-zero vector U is the set L = { P + tU | t ∈ R }, and a plane through P spanned by two non-parallel vectors U and V is the set { P + tU + sV | t, s ∈ R }.
Example 2:
(a) The set { (1, 2, 5, 4)^T + t(1, 0, 0, 0)^T | t ∈ R } describes a line in R^4 parallel to the x1-axis.
(b) The set { (2, 1, 4, 1, 5, 8)^T + s(1, 0, 0, 0, 0, 0)^T + t(0, 1, 0, 0, 0, 0)^T | s, t ∈ R } describes a plane in R^6.
(c) (Specifying a plane with one linear algebraic equation): The solution set to x1 + x2 + x3 + x4 + x5 = 1 can be represented equivalently as
(x1, x2, x3, x4, x5)^T = (1 − x2 − x3 − x4 − x5, x2, x3, x4, x5)^T,
which further can be written as the hyperplane
{ (1, 0, 0, 0, 0)^T + s2(−1, 1, 0, 0, 0)^T + s3(−1, 0, 1, 0, 0)^T + s4(−1, 0, 0, 1, 0)^T + s5(−1, 0, 0, 0, 1)^T | s2, s3, s4, s5 ∈ R }.
Example 3:
(b) Find an equation of the hyperplane H in R^4 that passes through the point P = (1, 3, −4, 2) and is normal to the vector U = (4, −2, 5, 6). The coefficients of the unknowns of an equation of H are the components of the normal vector U; hence, the equation of H must be of the form 4x1 − 2x2 + 5x3 + 6x4 = k. Substituting P into this equation, we obtain 4(1) − 2(3) + 5(−4) + 6(2) = k, or k = −10.
Thus, 4x1 − 2x2 + 5x3 + 6x4 = −10 is the equation of the hyperplane H.
Definition: The dot product, also known as the inner product or scalar product, of vectors U = (u1, u2, ..., un)^T and V = (v1, v2, ..., vn)^T in R^n is defined as
U · V = u1 v1 + u2 v2 + ... + un vn.
That is, the dot product of vectors U and V is obtained by multiplying corresponding elements and adding the resulting products.
The vectors U and V are said to be perpendicular (or orthogonal) if their dot product is zero, that is, U · V = 0. The inner product is also denoted as ⟨U, V⟩.
Example 4:
(a) Let U = (1, 2, −2)^T, V = (2, −3, 5)^T, and W = (2, 3, 4)^T. Then
U · V = 1(2) + 2(−3) + (−2)(5) = 2 − 6 − 10 = −14
U · W = 1(2) + 2(3) + (−2)(4) = 2 + 6 − 8 = 0
(b) Let U = (1, 2, 3, 5) and V = (2, 3, k, 5). Find k so that U and V are perpendicular.
We have U · V = 1(2) + 2(3) + 3k + 5(5) = 2 + 6 + 3k + 25 = 33 + 3k.
If U and V are perpendicular, then U · V = 0. Therefore, 33 + 3k = 0, or k = −11.
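Both parts of Example 4 can be checked with a one-line dot-product helper (illustrative name):

```python
def dot(U, V):
    """Dot (inner) product of two equal-length vectors."""
    return sum(u * v for u, v in zip(U, V))

U, V, W = [1, 2, -2], [2, -3, 5], [2, 3, 4]
print(dot(U, V))   # -14
print(dot(U, W))   # 0, so U and W are perpendicular

# Example 4(b): U.V = 33 + 3k, which vanishes at k = -11.
assert dot([1, 2, 3, 5], [2, 3, -11, 5]) == 0
```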
Some Properties of Dot Product: Consider the non-zero vectors U , V ,and W in Rn . Then, some
important properties of dot products are mentioned as below:
(i) Symmetry: U ⋅V =V ⋅U
(ii) Distributive: U ⋅ ( V +W )=U ⋅V +U ⋅W
(iii) Bilinear (linear in both U and V): U · (cV + dW) = c(U · V) + d(U · W) and (cU + dW) · V = c(U · V) + d(W · V), where c, d are scalars in R.
(iv) Positive definite: U · U > 0.
The norm (or length) of a vector is the square root of ⟨V, V⟩. For a vector V = (v1, v2, ..., vn)^T in R^n,
‖V‖ = √⟨V, V⟩ = √((v1)² + (v2)² + ⋯ + (vn)²).
Thus, ‖V‖ ≥ 0, and ‖V‖ = 0 if and only if V = (0, 0, ..., 0)^T. A vector is called a unit vector (or unit norm vector) if ‖V‖ = 1, or equivalently if ⟨V, V⟩ = 1.
For any non-zero vector V ∈ R^n, the vector V̂ = V/‖V‖ is the unit norm vector in the direction of V. The process of obtaining V̂ from V is called the normalization of vector V.
Given U = (u1, u2, ..., un)^T and V = (v1, v2, ..., vn)^T, the distance between U and V is denoted and defined as
d(U, V) = ‖U − V‖ = √((u1 − v1)² + (u2 − v2)² + ⋯ + (un − vn)²).
The angle θ between two non-zero vectors U and V satisfies
U · V = ‖U‖‖V‖ cos θ, or cos θ = (U · V)/(‖U‖‖V‖).
The projection of a vector U onto a non-zero vector V is denoted and determined by
proj(U, V) = ((U · V)/‖V‖²) V = ((U · V)/(V · V)) V.
(Cauchy–Schwarz inequality): |U · V| ≤ ‖U‖‖V‖.
Proof: Let α ∈ R be any real number and consider the following non-negative quadratic polynomial in α:
0 ≤ (U + αV) · (U + αV) = U · U + 2α(U · V) + α²(V · V) = ‖U‖² + 2(U · V)α + ‖V‖²α².
Let a = ‖V‖², b = 2(U · V), and c = ‖U‖². Then for every value of α,
aα² + bα + c ≥ 0.
A quadratic that is non-negative for all α cannot have two distinct real roots, so its discriminant satisfies b² − 4ac ≤ 0. Hence 4(U · V)² ≤ 4‖U‖²‖V‖², which gives |U · V| ≤ ‖U‖‖V‖.
(Triangle inequality): ‖U + V‖ ≤ ‖U‖ + ‖V‖.
Proof: ‖U + V‖² = (U + V) · (U + V)
= U · U + 2(U · V) + V · V
= ‖U‖² + ‖V‖² + 2‖U‖‖V‖ cos θ
= (‖U‖ + ‖V‖)² + 2‖U‖‖V‖(cos θ − 1)
≤ (‖U‖ + ‖V‖)²
Thus, ‖U + V‖ ≤ ‖U‖ + ‖V‖.
The triangle inequality is also self-evident when examining a sketch of U, V and U + V as below:
Figure 13.1
Example 5:
Consider two vectors in R³, U = (1, −2, 3)^T and V = (2, 4, 6)^T. Then obtain the distance and angle between U and V.
d(U, V) = √((1 − 2)² + (−2 − 4)² + (3 − 6)²) = √(1 + 36 + 9) = √46.
U · V = (1)(2) + (−2)(4) + (3)(6) = 12,
‖U‖² = (1)² + (−2)² + (3)² = 14,
‖V‖² = (2)² + (4)² + (6)² = 56.
Then, cos θ = (U · V)/(‖U‖‖V‖) = 12/(√14 √56) = 3/7, or θ = cos⁻¹(3/7).
Also, proj(U, V) = ((U · V)/‖V‖²) V = (12/56)(2, 4, 6)^T = (3/7)(1, 2, 3)^T = (3/7, 6/7, 9/7)^T.
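The computations of Example 5 can be reproduced with short helper functions (illustrative names, following the definitions above):

```python
import math

def dot(U, V):
    return sum(u * v for u, v in zip(U, V))

def norm(V):
    return math.sqrt(dot(V, V))

def distance(U, V):
    return norm([u - v for u, v in zip(U, V)])

def angle(U, V):
    """Angle between two non-zero vectors, in radians."""
    return math.acos(dot(U, V) / (norm(U) * norm(V)))

def proj(U, V):
    """Projection of U onto the non-zero vector V."""
    c = dot(U, V) / dot(V, V)
    return [c * v for v in V]

U, V = [1, -2, 3], [2, 4, 6]
print(distance(U, V) ** 2)      # ≈ 46, so d(U, V) = sqrt(46)
print(math.cos(angle(U, V)))    # ≈ 3/7
print(proj(U, V))               # ≈ [3/7, 6/7, 9/7]
```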
13.5 Field: In the context of vector spaces, a field F is a mathematical structure that provides the
scalars used for scalar multiplication in the vector space. A field is a set equipped with two
operations, addition and multiplication, such that the set satisfies certain algebraic properties. The following properties hold for all elements a, b, c in the field, among them commutativity and associativity of both operations (for example, a · (b · c) = (a · b) · c):
(vi). Nonzero element property: a . b=0 implies that either a=0 or b=0.
(vii). Existence of multiplicative identity: There exists an element 1 in F such that a .1=a for all
a in F .
13.6 Vector Space: Formally, a vector space over a field (usually the real numbers or complex
numbers) is a set V equipped with two operations: vector addition and scalar multiplication, such
that for any vectors u , v ∈V and any scalars c , d from the field F , the following properties hold:
(i). Closure under addition: u+ v ∈ V
(iv). Existence of zero vector: There exists a vector 0 ∈V such that u+0=u for all u ∈V
(v). Existence of additive inverses: For every u ∈V , there exists a vector −u ∈V such that
u+ (−u )=0
(vi). Distributivity over scalar addition: (c + d) · u = c · u + d · u
13.7 Spanning Sets: Let V be a vector space over F. Then the vectors u1, u2, ..., un are said to span V, or form a spanning set of V, if every v ∈ V is a linear combination of the vectors u1, u2, ..., un; that is, if there exist scalars a1, a2, ..., an ∈ F such that v = a1 u1 + a2 u2 + ... + an un. The following remarks can be concluded from the above definition.
Remark 1: Suppose u1, u2, ..., un span V. Then, for any vector w, the set w, u1, u2, ..., un also spans V.
Remark 2: Suppose u1, u2, ..., un span V and suppose uk is a linear combination of some of the other u's. Then the u's without uk also span V.
Remark 3: Suppose u1, u2, ..., un span V and suppose one of the u's is the zero vector. Then the u's without the zero vector also span V.
Example 6: Let V = R² (the vector space of all 2-dimensional real vectors) and consider the set S = { (1, 0)^T, (0, 1)^T }. The question is whether S spans V. Any vector (a, b)^T in R² can be written as the linear combination a(1, 0)^T + b(0, 1)^T, so S spans R².
Example 7: Now, let's consider V = R³ and the set S = { (1, 1, 1)^T, (1, 1, 0)^T, (1, 0, 0)^T }.
Again, we want to know if S spans V. For a vector in R³, say (a, b, c)^T, we need to check if there exist scalars x, y and z such that
x(1, 1, 1)^T + y(1, 1, 0)^T + z(1, 0, 0)^T = (a, b, c)^T.
In this case, we can see that the vector (a, b, c)^T can be written as
(a, b, c)^T = c(1, 1, 1)^T + (b − c)(1, 1, 0)^T + (a − b)(1, 0, 0)^T.
So, it is concluded that S spans R3. These examples illustrate the concept of spanning sets,
where a set of vectors spans a vector space if every vector in that space can be expressed as a
linear combination of the vectors in the set.
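The coefficients x = c, y = b − c, z = a − b found in Example 7 can be verified for arbitrary targets; a minimal sketch (illustrative helper names):

```python
def combine(x, y, z):
    """x*(1,1,1) + y*(1,1,0) + z*(1,0,0), written componentwise."""
    return [x + y + z, x + y, x]

def coefficients(a, b, c):
    """Coefficients that reproduce (a, b, c) from the set S of Example 7."""
    return c, b - c, a - b

for target in ([1, 2, 3], [0, 0, 0], [5, -1, 4]):
    a, b, c = target
    x, y, z = coefficients(a, b, c)
    assert combine(x, y, z) == target   # every target vector is reachable
print("S spans R^3")
```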
13.8 Subspaces: In the realm of subspace geometry, envision the conventional three-dimensional
space, R3, and select any plane passing through the origin. This chosen plane constitutes a
distinct vector space. When we scale a vector within this plane by a factor like 2, -3, or any
scalar, the result remains within the same plane. Similarly, the sum of two vectors within the
plane preserves the plane. This specific plane, passing through the origin (0, 0, 0)^T, exemplifies a foundational concept in linear algebra: it serves as a subspace within the original space R³.
Definition. A subspace S of a vector space V is a nonempty subset that itself satisfies the requirements for a vector space: linear combinations stay in the subspace.
(i) The sum of any two vectors, x and y , also resides in the subspace: ( x + y ¿∈ S .
(ii)The product of any vector x in the subspace by any scalar c remains within the subspace:
cx ∈ S .
Example 8. Consider all vectors in R² whose components are positive or zero. This subset is the first quadrant of the x-y plane; the coordinates satisfy x ≥ 0 and y ≥ 0. It is not a subspace, even though it contains zero and addition does leave us within the subset. Rule (ii) is violated, since if the scalar is c = −1 and the vector is x = (1, 2)^T, then the multiple cx = (−1, −2)^T is in the third quadrant instead of the first.
If we include the third quadrant along with the first, scalar multiplication is all right. Every multiple cx will stay in this subset. However, rule (i) is now violated, since (1, 2)^T + (−2, −1)^T = (−1, 1)^T, which is not in either quadrant. The smallest subspace containing the first quadrant is the whole space R².
Example 9. Start from the vector space of 3 by 3 matrices, say R^{3×3}. One possible subspace is the set of lower triangular matrices; another is the set of symmetric matrices. In each case, A + B and cA are lower triangular if A and B are lower triangular, and they are symmetric if A and B are symmetric. Of course, the zero matrix is in both subspaces.
Let's see two fundamental examples: the column space and the nullspace of a matrix A .
Column space: The column space encompasses every possible linear combination that can be
formed using the columns of matrix A .
Explanation: The column space of a matrix refers to the subspace formed by all the possible
linear combinations of its columns. Consider a matrix A with columns [v 1 , v 2 , v 3 ]. The column
space would include every vector that can be expressed as a linear combination of these columns,
such as c 1 v 1 +c 2 v 2+ c 3 v 3 where, c 1 , c2 ,∧c3 are scalars. In essence, it represents the span of the
column vectors of the matrix.
Example 10: If
A = [ 1  2 ]
    [ 3  4 ]
    [ 5  6 ],
then the column space would include all vectors of the form c1(1, 3, 5)^T + c2(2, 4, 6)^T, where c1 and c2 can be any scalars. The column space is the subspace spanned by the columns (1, 3, 5)^T and (2, 4, 6)^T in this example.
Note: A system of linear equations, say Ax = b, has a solution if and only if the vector b can be represented as a linear combination of the columns of matrix A. In other words, b lies within the column space of A.
Nullspace: The null space of a matrix, denoted N(A), comprises all vectors x that satisfy the equation Ax = 0. Therefore, the null space of a matrix A represents the set of all solutions (vectors x) that, when multiplied by A, result in the zero vector.
In other words, it consists of vectors that get "mapped" to the zero vector under the linear
transformation represented by the matrix A .
This null space is a subspace of the vector space Rn , analogous to how the column space is a
subspace of Rm.
Example 11: Let's consider a specific example of a matrix A and find its null space in R³:
A = [ 1  2  −1 ]
    [ 0  1   1 ]
    [ 2  0   3 ]
Now, we are looking for vectors X = (x1, x2, x3)^T such that AX = 0.
Setting up the system of equations:
[ 1  2  −1 ] [x1]   [0]
[ 0  1   1 ] [x2] = [0]
[ 2  0   3 ] [x3]   [0]
Solving this system by row reduction shows that the only solution is x1 = x2 = x3 = 0, so for this particular A the null space N(A) is the trivial subspace {0}. In general, the set of all solutions of AX = 0 forms the null space of the matrix A in R³.
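For the particular A of Example 11, a determinant check confirms what row reduction shows: the homogeneous system has only the trivial solution. A sketch with an illustrative `det3` helper:

```python
def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

A = [[1, 2, -1],
     [0, 1, 1],
     [2, 0, 3]]

# A nonzero determinant means AX = 0 has only the trivial solution,
# so the null space of this particular A is {0}.
assert det3(A) == 9
print("N(A) = {0}, since det(A) =", det3(A))
```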
Linear dependence and independence are important concepts that describe the relationships
between vectors.
A set of vectors is said to be linearly dependent if there exists a nontrivial linear combination of these vectors that equals the zero vector. In simpler terms, vectors v1, v2, ..., vn are linearly dependent if there exist scalars c1, c2, ..., cn, not all zero, such that c1 v1 + c2 v2 + ... + cn vn = 0.
A set of vectors is said to be linearly independent if the only linear combination that equals the zero vector is the trivial one, where all the coefficients are zero. Such vectors are linearly independent because the only way to get the zero vector as a linear combination is to take every coefficient equal to zero.
In general, if you have a set of vectors v 1 , v 2 , .. . v n , and you can express one of them as a linear
combination of the others, the set is linearly dependent. If no vector in the set can be written as a
linear combination of the others, the set is linearly independent. Linear independence is a
desirable property because it means that none of the vectors in the set is redundant; each
contributes something unique to the span of the set.
A basis is a fundamental concept in the context of vector spaces. A basis is a set of vectors that
spans the entire vector space and is linearly independent. The idea is that any vector in the vector
space can be uniquely expressed as a linear combination of the vectors in the basis. The number
of vectors in the basis is known as the dimension of the vector space. Some key properties of a basis are mentioned below:
1. Spanning Property: A basis must span the entire vector space, meaning that any vector in the
space can be represented as a linear combination of the basis vectors.
2. Linear Independence: The vectors in a basis must be linearly independent, ensuring that no
vector in the basis can be written as a combination of the others.
Example 14: Consider the vector space R³, and let's define a basis for this space. A common basis for R³ is the standard basis, which consists of three vectors:
e1 = (1, 0, 0)^T, e2 = (0, 1, 0)^T, and e3 = (0, 0, 1)^T
Linear Independence: The coefficients in a linear combination of these vectors are unique. If a·e1 + b·e2 + c·e3 = 0, then a = b = c = 0. This ensures linear independence.
Now, let's take an example vector W = (2, 3, 1)^T. Then we can express W as a linear combination of the standard basis vectors, as: W = 2·e1 + 3·e2 + 1·e3.
This demonstrates the spanning property of the basis, showing that any vector in ℝ3 can be
represented using the basis vectors.
Thus, we can say that a basis is a set of vectors that not only spans the vector space but also
ensures that the vectors are linearly independent, providing a unique way to represent any vector
in the space. The standard basis for ℝn is a common example that satisfies these properties.
Problem Set
P.1 Let U1 = (1, 2, 4)^T, U2 = (4, −1, 3)^T, and U3 = (1, 3, 5)^T. Find
(a) 3U1 − 2U2
(b) 2U1 + U2 − 7U3
P.2 Write the vector V = (1, 2, 3)^T as a linear combination of the vectors U1 = (1, 0, 0)^T, U2 = (1, −1, 0)^T, and U3 = (1, 0, 1)^T.
P.4 Consider U = (3, 2, −2, 1) and V = (3, k, 5, 4). Then find the value of k such that U and V are perpendicular.
P.8 Check whether the vectors U1 = (1, 2, 4)^T, U2 = (4, −1, 3)^T, and U3 = (3, −3, −1)^T are linearly independent.
P.9 Determine whether the vectors U1 = (1, 0, 4)^T, U2 = (4, −1, 3)^T, and U3 = (3, 1, 0)^T span R³.
Chapter 14: Eigenvalues and Eigenvectors
14.1 Introduction
Eigenvalues and eigenvectors provide a way to understand the inherent properties of linear
transformations represented by matrices. In linear algebra and statistical modelling, eigenvalues and eigenvectors are utilized to uncover the principal components of a dataset, enabling dimensionality reduction and capturing the most significant variability in the data.
Consider an example of multiplying the nonzero vectors (2, 5)^T and (3, 4)^T by the square matrix
A = [ 6  3 ]
    [ 4  7 ]
Case 1: A(2, 5)^T = (27, 43)^T
Case 2: A(3, 4)^T = (30, 40)^T = 10 · (3, 4)^T
We aim to examine the impact of multiplying the given matrix on vectors. In Case 1, the result is a completely different vector with altered direction and length, which is typically unremarkable for our current discussion. However, in Case 2, something noteworthy unfolds. The multiplication yields a scalar multiple of the original vector, signifying that the new vector maintains the same direction as the original one. The scaling factor, denoted λ, is 10. This chapter will delve into the systematic exploration of such scaling factors λ and non-zero vectors X for a given square matrix A by considering the equation of the form AX = λX.
Cayley-Hamilton theorem: The Cayley-Hamilton theorem states that every square matrix satisfies its own characteristic equation. In other words, if p(λ) = det(A − λI) is the characteristic polynomial of the matrix A, then substituting A for λ in the polynomial yields the zero matrix. That is, p(A) = 0.
Example 1: Consider the matrix
A = [ 1  2 ]
    [ 3  4 ]
Its characteristic polynomial is
p(λ) = det(A − λI) = (1 − λ)(4 − λ) − 2 · 3 = λ² − 5λ − 2.
By the Cayley-Hamilton theorem, p(A) = A² − 5A − 2I = 0. Indeed,
A² − 5A − 2I = [  7  10 ] − [  5  10 ] − [ 2  0 ] = [ 0  0 ]
               [ 15  22 ]   [ 15  20 ]   [ 0  2 ]   [ 0  0 ]
As expected, p(A) is the zero matrix, confirming the Cayley-Hamilton theorem for this example.
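The verification is easy to repeat numerically, using the general 2×2 characteristic polynomial p(λ) = λ² − tr(A)λ + det(A); a minimal sketch:

```python
def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
tr = A[0][0] + A[1][1]                       # trace = 5
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]  # determinant = -2
A2 = matmul(A, A)                            # [[7, 10], [15, 22]]

# Cayley-Hamilton for 2x2 matrices: A^2 - tr(A)*A + det(A)*I = 0.
I = [[1, 0], [0, 1]]
pA = [[A2[i][j] - tr * A[i][j] + det * I[i][j] for j in range(2)]
      for i in range(2)]
assert pA == [[0, 0], [0, 0]]
print("Cayley-Hamilton verified for A")
```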
Example 2:
(a) For a general 2×2 matrix
A = [ a11  a12 ]
    [ a21  a22 ],
p(λ) = det(A − λI) = (a11 − λ)(a22 − λ) − a12 · a21 = λ² − (a11 + a22)λ + (a11 · a22 − a12 · a21),
that is,
p(λ) = λ² − tr(A)λ + det(A),
where tr(A) denotes the trace of A, that is, the sum of the diagonal elements of A.
(b) Now, consider a 3×3 matrix
A = [ a11  a12  a13 ]
    [ a21  a22  a23 ]
    [ a31  a32  a33 ].
With a similar approach (as in part (a)), the characteristic polynomial of A is given as
p(λ) = λ³ − tr(A)λ² + (A11 + A22 + A33)λ − det(A),
where A11, A22, A33 denote the cofactors of a11, a22, and a33, respectively.
Definition: Let A be any square matrix. Then a scalar λ is called an eigenvalue of A if there
exists a nonzero (column) vector X such that AX=λX holds. Any vector X satisfying this
relation is called an eigenvector of A associated with the eigenvalue λ .
Note that each scalar multiple of an eigenvector X associated with the eigenvalue λ is also such
an eigenvector, as: A(kX )=k ( AX )=k (λX )=λ(kX ).
Example: Let us find the eigenvalues and eigenvectors of the matrix
A = [ 3  1 ]
    [ 2  2 ]
The characteristic equation det(A − λI) = 0 gives
(3 − λ)(2 − λ) − 2 = 0
λ² − 5λ + 4 = 0
(λ − 1)(λ − 4) = 0
1. For λ = 1:
(A − I)X = [ 3−1   1  ] [x1] = [ 2  1 ] [x1] = [0]
           [  2   2−1 ] [x2]   [ 2  1 ] [x2]   [0]
One solution is X1 = (−1, 2)^T.
2. For λ = 4:
(A − 4I)X = [ 3−4    1  ] [x1] = [ −1   1 ] [x1] = [0]
            [  2    2−4 ] [x2]   [  2  −2 ] [x2]   [0]
One solution is X2 = (1, 1)^T.
Therefore, the eigenvalues are λ1 = 1 and λ2 = 4, and the corresponding eigenvectors are X1 = (−1, 2)^T and X2 = (1, 1)^T.
Theorem 1: Eigenvalues
The eigenvalues of a square matrix A are the roots of the characteristic equation of A. Hence, an n × n matrix has at least one eigenvalue and at most n numerically different eigenvalues.
Let us find the eigenvalues and eigenvectors of the matrix
A = [ −2   2  −3 ]
    [  2   1  −6 ]
    [ −1  −2   0 ]
The characteristic equation is
| −2−λ    2    −3 |
|   2    1−λ   −6 | = 0,
|  −1    −2    −λ |
or λ³ + λ² − 21λ − 45 = 0, which factors as (λ − 5)(λ + 3)² = 0. The eigenvalues are therefore λ = 5 and λ = −3 (a double root).
When λ = 5, then
[ −2−5    2    −3 ] [x1]   [0]
[   2    1−5   −6 ] [x2] = [0]
[  −1    −2    −5 ] [x3]   [0]
which row-reduces to
[ −7     2      −3   ] [x1]   [0]
[  0   −24/7  −48/7  ] [x2] = [0]
[  0     0      0    ] [x3]   [0]
Thus −7x1 + 2x2 − 3x3 = 0 and (−24/7)x2 − (48/7)x3 = 0, so x2 = −2x3 and x1 = −x3. Choosing x3 = −1, the eigenvector of A corresponding to λ = 5 is X = (x1, x2, x3)^T = (1, 2, −1)^T.
Similarly, when λ = −3, then
[ −2+3    2    −3 ] [x1]   [0]
[   2    1+3   −6 ] [x2] = [0]
[  −1    −2     3 ] [x3]   [0]
which row-reduces to
[ 1  2  −3 ] [x1]   [0]
[ 0  0   0 ] [x2] = [0]
[ 0  0   0 ] [x3]   [0]
Let us choose x2 = 1 and x3 = 0; then x1 = −2, that is, (x1, x2, x3)^T = (−2, 1, 0)^T.
And choosing x2 = 0 and x3 = 1, we get x1 = 3, that is, (x1, x2, x3)^T = (3, 0, 1)^T.
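Each claimed eigenpair can be verified by checking AX = λX directly; a minimal sketch (the `matvec` helper is illustrative):

```python
def matvec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[-2, 2, -3],
     [2, 1, -6],
     [-1, -2, 0]]

# (eigenvalue, eigenvector) pairs found above.
pairs = [(5, [1, 2, -1]), (-3, [-2, 1, 0]), (-3, [3, 0, 1])]
for lam, x in pairs:
    assert matvec(A, x) == [lam * xi for xi in x]   # A x = lambda x
print("all eigenpairs verified")
```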
Note that the order of an eigenvalue λ as a root of the characteristic polynomial is called the algebraic multiplicity of λ, denoted Mλ. The number of linearly independent eigenvectors corresponding to λ is called the geometric multiplicity of λ, denoted mλ. Thus mλ is the dimension of the eigenspace corresponding to this λ.
Consider vectors X = (x1, x2, ..., xn)^T and Y = (y1, y2, ..., yn)^T in R^n and an n × n orthogonal matrix A. Then the orthogonal transformations U = AX and V = AY preserve the value of the inner product of the vectors X, Y. That is, ⟨U, V⟩ = ⟨X, Y⟩. Moreover, the transformation also preserves the length or norm of any vector, that is, ‖U‖ = ‖AX‖ = ‖X‖.
x2 = i x1; let x1 = 1, then x2 = i. Therefore, the eigenvector corresponding to λ1 is (x1, x2)^T = (1, i)^T.
Theorem 7: Any real square matrix A may be written as the sum of a symmetric matrix E and a skew-symmetric matrix O, where E and O are given as E = (1/2)(A + A^T) and O = (1/2)(A − A^T).
Example 6: Consider the real square matrix
A = [ 1  2  3 ]
    [ 4  5  6 ]
    [ 7  8  9 ]
so that
A^T = [ 1  4  7 ]
      [ 2  5  8 ]
      [ 3  6  9 ]
Thus,
E = (1/2)(A + A^T) = [ 1  3  5 ]
                     [ 3  5  7 ]
                     [ 5  7  9 ]
and
O = (1/2)(A − A^T) = [ 0  −1  −2 ]
                     [ 1   0  −1 ]
                     [ 2   1   0 ]
We observe that A = E + O, where E is a symmetric matrix and O is a skew-symmetric matrix.
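The decomposition of Example 6 can be checked in a few lines (illustrative helper names):

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def mat_scale_add(s, A, t, B):
    """Entrywise s*A + t*B for same-shape matrices."""
    return [[s * a + t * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
At = transpose(A)
E = mat_scale_add(0.5, A, 0.5, At)    # symmetric part
O = mat_scale_add(0.5, A, -0.5, At)   # skew-symmetric part

assert E == transpose(E)                                   # E is symmetric
assert O == [[-x for x in row] for row in transpose(O)]    # O is skew-symmetric
assert mat_scale_add(1, E, 1, O) == [[float(x) for x in row] for row in A]
print("A = E + O verified")
```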
Theorem 8: (Eigenvalues of Symmetric and Skew-Symmetric Matrices)
(a) The eigenvalues of a symmetric matrix are real.
(b) The eigenvalues of a skew-symmetric matrix are pure imaginary or zero.
The columns A1, A2, ..., An of an orthogonal matrix are orthonormal:
⟨Aj, Ak⟩ = Aj^T Ak = 1 if j = k, and 0 if j ≠ k.
Two square matrices A and B, are said to be similar if there exists an invertible matrix P such
that: B=P−1 AP. Here, P is a non-singular (invertible) matrix that facilitates the
transformation. This transformation is called similarity transformation.
1. Sum of eigenvalues: trace(A) = λ1 + λ2 + ⋯ + λn
2. Product of eigenvalues: det(A) = λ1 × λ2 × ⋯ × λn
3. Eigenvalues of Inverse:
If the eigenvalues of an invertible matrix A are λ1, λ2, ⋯, λn, then the eigenvalues of the matrix A⁻¹ are 1/λ1, 1/λ2, ⋯, 1/λn.
4. Eigenvalues of Product:
The eigenvalues of the product AB are the same as the eigenvalues of BA , where A and B are
square matrices.
5. Eigenvalues of Power:
If λ1, λ2, ⋯, λn are the eigenvalues of A, then the eigenvalues of A^m are λ1^m, λ2^m, ⋯, λn^m.
Diagonalization: If an n × n matrix A possesses a set of eigenvectors forming a basis of R^n, then the matrix X with these eigenvectors as its columns diagonalizes A:
D = X⁻¹AX
Here, the diagonal matrix D has the eigenvalues of A along its main diagonal, and X is the matrix with the eigenvectors of A as column vectors. Also,
D^m = X⁻¹A^m X for m = 2, 3, ⋯
Consider the matrix
A = [ 4  1 ]
    [ 2  3 ]
for which we will find the eigenvalues and eigenvectors, and then demonstrate its diagonalization.
1. Eigenvalues:
The eigenvalues are obtained by solving the characteristic equation det(A − λI) = 0:
det(A − λI) = (4 − λ)(3 − λ) − (2 · 1) = λ² − 7λ + 10 = 0,
so the eigenvalues are λ = 2 and λ = 5.
2. Eigenvectors:
Now, for each eigenvalue, we find the corresponding eigenvector V by solving ( A−λ I )V =0 .
For λ = 2:
(A − 2I)V = [ 4−2   1  ] [x] = [ 2  1 ] [x] = [0]
            [  2   3−2 ] [y]   [ 2  1 ] [y]   [0]
Solving this system of equations, we find an eigenvector V1 = (1, −2)^T.
For λ = 5:
(A − 5I)V = [ 4−5    1  ] [x] = [ −1   1 ] [x] = [0]
            [  2    3−5 ] [y]   [  2  −2 ] [y]   [0]
Solving, we find an eigenvector V2 = (1, 1)^T.
To diagonalize A, we form the matrix P with the eigenvectors as columns and D as the diagonal matrix of eigenvalues. Therefore,
P = [  1  1 ]    and    D = [ 2  0 ]
    [ −2  1 ]               [ 0  5 ]
Then the diagonalization is given by A = PDP⁻¹:
PDP⁻¹ = [  1  1 ] [ 2  0 ] [  1  1 ]⁻¹ = [ 4  1 ] = A
        [ −2  1 ] [ 0  5 ] [ −2  1 ]     [ 2  3 ]
Moreover,
PD²P⁻¹ = [  1  1 ] [ 4   0 ] [  1  1 ]⁻¹ = [ 18   7 ] = A²
         [ −2  1 ] [ 0  25 ] [ −2  1 ]     [ 14  11 ]
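The identities A = PDP⁻¹ and A² = PD²P⁻¹ can be verified exactly with rational arithmetic; a minimal sketch:

```python
from fractions import Fraction

def matmul(A, B):
    """Matrix product (lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[4, 1], [2, 3]]
P = [[1, 1], [-2, 1]]    # eigenvectors as columns
D = [[2, 0], [0, 5]]     # eigenvalues on the diagonal

# Inverse of a 2x2 matrix via the adjugate; Fractions keep it exact.
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]   # det(P) = 3
P_inv = [[Fraction(P[1][1], det), Fraction(-P[0][1], det)],
         [Fraction(-P[1][0], det), Fraction(P[0][0], det)]]

assert matmul(matmul(P, D), P_inv) == A                      # P D P^-1 = A
D2 = [[4, 0], [0, 25]]
assert matmul(matmul(P, D2), P_inv) == [[18, 7], [14, 11]]   # P D^2 P^-1 = A^2
print("diagonalization verified")
```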
Consider the matrix
A = [  2  −1   1 ]
    [ −1   2  −1 ]
    [  1  −1   2 ]
The characteristic equation is given by det(A − λI) = 0:
| 2−λ   −1     1  |
|  −1   2−λ   −1  | = 0
|   1   −1    2−λ |
Expanding along the first row,
(2 − λ)(λ² − 4λ + 3) + (λ − 1) + (λ − 1) = 0
λ³ − 6λ² + 9λ − 4 = 0
By the Cayley-Hamilton theorem, A³ − 6A² + 9A − 4I = 0; multiplying through by A⁻¹ and solving gives A⁻¹ = (1/4)(A² − 6A + 9I). Hence
A⁻¹ = (1/4) ( [  6  −5   5 ]   [ −12    6   −6 ]   [ 9  0  0 ] )
            ( [ −5   6  −5 ] + [   6  −12    6 ] + [ 0  9  0 ] )
            ( [  5  −5   6 ]   [  −6    6  −12 ]   [ 0  0  9 ] )
A⁻¹ = (1/4) [  3  1  −1 ]
            [  1  3   1 ]
            [ −1  1   3 ]
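The same inverse can be computed programmatically from the Cayley-Hamilton relation A⁻¹ = (1/4)(A² − 6A + 9I); a minimal sketch using exact rational arithmetic:

```python
from fractions import Fraction

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[2, -1, 1], [-1, 2, -1], [1, -1, 2]]
I = [[1 if i == j else 0 for j in range(3)] for i in range(3)]
A2 = matmul(A, A)   # [[6, -5, 5], [-5, 6, -5], [5, -5, 6]]

# Characteristic equation: l^3 - 6 l^2 + 9 l - 4 = 0, so by Cayley-Hamilton
# A^3 - 6 A^2 + 9 A - 4 I = 0, which rearranges to A^-1 = (A^2 - 6A + 9I)/4.
A_inv = [[Fraction(A2[i][j] - 6 * A[i][j] + 9 * I[i][j], 4) for j in range(3)]
         for i in range(3)]

assert matmul(A, A_inv) == I   # confirms A * A^-1 = I
print(A_inv)
```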
Problem Set
P2. Determine the real matrix A = B³, where
B = [ 4  6  0 ]
    [ 0  4  6 ]
    [ 0  0  4 ]
P3. (a) A = [ 3  1 ]
            [ 2  2 ]
    (b) A = [ 1  3  4 ]
            [ 1  4  6 ]
            [ 0  1  2 ]
P4. For each of the following matrices, find the eigenvalues and corresponding eigenvectors:
(a) A = [ 6  −1 ]
        [ 2   3 ]
(b) A = [ 2  0  1 ]
        [ 0  3  0 ]
        [ 1  0  2 ]
(a) λ1 = 1, λ2 = 4, and the eigenvector corresponding to λ1 is given as X = (2, 2)^T.
(b) λ1 = 3, λ2 = 2, and the eigenvector corresponding to λ2 is given as X = (1, 2)^T.
P7. Let A be a square matrix with eigenvalues λ 1 , λ2 , … , λn. Prove that the sum of the
eigenvalues (trace) is equal to the sum of the elements on the main diagonal, and the product of
the eigenvalues (determinant) is equal to the determinant of matrix A .
P8. If A is a square matrix with eigenvalues λ1, λ2, …, λn, show that the transpose of A has the same eigenvalues.
P9. If A is invertible with eigenvalues λ1, λ2, …, λn, then find the eigenvalues of the inverse of A.
P10. If A is a square matrix with eigenvalues λ1, λ2, …, λn, then find the eigenvalues of the matrix cA, where c is a scalar.
P13. Determine the inverse of the matrix
A = [  1  2  −2 ]
    [  2  1   2 ]
    [ −2  2   1 ]
using the Cayley-Hamilton theorem.
Dataset in Machine Learning
The Oxford Dictionary defines a dataset as “a collection of data that is treated as a single unit by
a computer.” This implies that a dataset consists of numerous individual pieces of data but is
utilized to train an algorithm with the objective of identifying patterns within the entire dataset.
Data is a fundamental component of any AI model and has played a pivotal role in the surge of
popularity in machine learning. With the availability of data, scalable ML algorithms have
transitioned from theoretical concepts to practical products capable of delivering value to
businesses.
A dataset in machine learning is a structured collection of data used to train, test, and validate
machine learning models. It serves as the foundation for building and evaluating algorithms by
providing examples that the model can learn from. Datasets typically consist of input features
and corresponding output labels or target values. The quality and characteristics of the dataset
significantly influence the performance and generalization ability of the machine learning model.
1. Features:
The input variables (attributes) that describe each example and are fed to the model.
2. Labels or Targets:
The output values that the model is expected to predict for each example.
3. Instances or Samples:
Individual data points in the dataset, each comprising a set of features and corresponding labels.
4. Training Set:
Subset of the dataset used to fit the model's parameters.
5. Testing Set:
Subset of the dataset used to evaluate the model's performance on unseen data.
1. Training Dataset:
Definition: The portion of the dataset used to train the machine learning model.
Characteristics: It consists of labeled examples used to adjust the model's
parameters during training.
2. Testing Dataset:
Definition: The portion of the dataset reserved for evaluating the model's
performance.
3. Validation Dataset:
Definition: A portion of the data held out from training and used to tune hyperparameters and compare candidate models.
4. Unlabeled Dataset:
A dataset containing only input features, without associated output labels.
5. Labeled Dataset:
A dataset in which each instance is paired with its correct output label.
7. Image Dataset:
A dataset whose instances are images, used for tasks such as classification or object detection.
8. Text Dataset:
A dataset whose instances are text documents or snippets, used in natural language processing.
9. Multimodal Dataset:
A dataset combining more than one data type, such as images together with text.
While raw data serves as a starting point, it cannot be directly input into a machine learning
algorithm. Several steps must be taken before the dataset becomes usable.
1. Collect:
Decide on the sources for collecting data, choosing from open-source datasets, the Internet, or generators of artificial data.
2. Preprocess:
Determine if the dataset has been used before; assume it is flawed if not.
3. Annotate:
Label the data so that the machine can understand what each piece of it represents.
Quest for a Dataset in Machine Learning: Where to Find It and What Sources Fit Your
Case Best?
The sources for collecting a dataset vary based on your project, budget, and business size.
Collecting data directly correlated with business goals is ideal but can be resource-intensive. Free
datasets for machine learning offer a cost-effective option, although adjustments may be needed
to align them with specific project requirements.
Tip: Use live data whenever possible to avoid issues with predictability.
Note: Consult with a data scientist for advice on the volume of data needed for a specific AI
project.
Analyzing the dataset before deploying it is a crucial step. Real-life cases emphasize the
dependency of an ML algorithm on the comprehensive analysis of its dataset. Blind spots, biases,
and unexpected consequences may arise if the dataset is not thoroughly examined.
An anecdote about a hospital using an ML algorithm to reduce treatment costs for pneumonia
patients illustrates the importance of dataset analysis. The algorithm, trained on clinic data,
inaccurately classified asthma as a non-aggravating condition due to the absence of recorded
deaths for asthmatics with pneumonia in the historic dataset.
This case underscores the need for human supervision and control over machine learning
algorithms. Machines cannot independently perform the analytic work of humans, and dataset
analysis is essential before using the data for training.
Collecting a dataset for an AI project may seem straightforward, but it can be a time-consuming
task that requires careful consideration. Understanding what a dataset in machine learning is,
how to collect and preprocess the data, and the features of a proper dataset is crucial.
A dataset is a collection of data pieces treated as a single unit for analytic and prediction
purposes by a computer. Preprocessing involves cleaning and annotating the data, making it
understandable for machine processing. Key features of a good dataset include quality,
relevance, coverage, and sufficient quantity.
Collecting data from sources directly related to business goals is ideal, but free datasets for
machine learning offer a cost-effective option. However, adjustments may be necessary to align
them with specific project requirements. Analyzing the dataset before deployment is crucial to
identify blind spots, biases, and potential issues, ensuring a more accurate and reliable machine
learning model.
1. Noise Reduction: Real-world data is susceptible to noise, which refers to irrelevant or random
variations in the data. Noise can arise from various sources, including errors in data collection
instruments or inconsistencies in data entry. By identifying and eliminating noise, data
preprocessing ensures that the model does not learn from irrelevant patterns, contributing to the
model's robustness and accuracy.
2. Handling Missing Values: Datasets often contain missing values, which can adversely impact
model training. Data preprocessing involves strategies for handling missing data, such as
imputation (estimating missing values based on existing data) or removal of instances with
missing values. Proper handling of missing data ensures that the model is trained on complete
and reliable information.
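Mean imputation, one of the strategies mentioned above, can be sketched in plain Python (here `None` marks a missing value; the helper name is illustrative):

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 31, 40, None, 28]
print(impute_mean(ages))   # [25, 31.0, 31, 40, 31.0, 28]
```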
3. Format Standardization: Datasets may come in different formats, including CSV, Excel,
HTML, or others. Data preprocessing involves standardizing the format to ensure consistency
and compatibility with machine learning algorithms. This step facilitates seamless data
manipulation and analysis.
4. Encoding Categorical Data: Machine learning algorithms typically require numerical input,
but real-world datasets often include categorical variables. Data preprocessing includes encoding
categorical variables into a numerical format. This can be achieved through techniques like label
encoding, assigning numerical labels to categories, or one-hot encoding, creating binary columns
for each category.
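Both encodings can be sketched without any library (illustrative helper names; real projects typically use pandas or scikit-learn):

```python
def label_encode(values):
    """Assign each distinct category an integer label, in first-seen order."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """One binary column per category, in first-seen order."""
    _, mapping = label_encode(values)
    return [[1 if mapping[v] == j else 0 for j in range(len(mapping))]
            for v in values]

colors = ["red", "green", "blue", "green"]
print(label_encode(colors)[0])   # [0, 1, 2, 1]
print(one_hot_encode(colors))    # [[1,0,0], [0,1,0], [0,0,1], [0,1,0]]
```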
5. Splitting into Training and Test Sets: To evaluate the performance of a machine learning
model, the dataset is split into training and test sets. The training set is used to train the model,
while the test set assesses its performance on unseen data. Data preprocessing ensures a
representative split, helping the model generalize well to new, unseen instances.
6. Feature Scaling: Machine learning algorithms are sensitive to the scale of input features. Data
preprocessing includes feature scaling techniques to bring features to a similar scale. Common
methods include Min-Max scaling, which scales features to a specified range, and Z-score
normalization, which standardizes features by subtracting the mean and dividing by the standard
deviation.
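The two scaling methods described above can be sketched directly in NumPy (the sample values below are illustrative):

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Min-Max scaling: map a 1-D feature onto [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return new_min + (x - x.min()) * (new_max - new_min) / (x.max() - x.min())

def z_score_normalize(x):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

ages = np.array([18, 22, 25, 30, 45])
print(min_max_scale(ages))      # values lie in [0, 1]
print(z_score_normalize(ages))  # mean approximately 0, std approximately 1
```

Either method brings features to a comparable scale; which one is preferable depends on the algorithm and on whether outliers are present.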
1. Getting the Dataset:
The first step is obtaining the dataset relevant to the machine learning problem. Datasets vary in format and content depending on the nature of the problem, and they are typically stored in files such as CSV or Excel files, or in databases.
2. Importing Libraries:
Python libraries such as Numpy, Matplotlib, and Pandas are commonly used for data
preprocessing. Numpy facilitates mathematical operations, Matplotlib supports data
visualization, and Pandas is instrumental in importing, cleaning, and managing datasets.
3. Importing Datasets:
After setting the working directory, the dataset is imported using Pandas. Pandas provides
powerful data structures like DataFrames, making it easy to manipulate and analyze data.
4. Handling Missing Data:
Data preprocessing involves identifying and handling missing values. Techniques include removing instances with missing values or imputing missing values based on statistical methods.
5. Encoding Categorical Data:
Categorical variables are transformed into numerical format using encoding techniques. Label encoding assigns numerical labels to categories, while one-hot encoding creates binary columns for each category.
6. Splitting into Training and Test Sets:
The dataset is divided into training and test sets to train and evaluate the machine learning model, ensuring a representative balance between the two sets.
7. Feature Scaling:
Feature scaling is applied to standardize the scale of input features, preventing any feature from
dominating the learning process.
1. Numpy:
Numpy is a fundamental library for scientific computing in Python, providing support for large,
multidimensional arrays and matrices. It is essential for performing mathematical operations
during data preprocessing.
import numpy as np
2. Matplotlib:
Matplotlib, along with its sub-library pyplot, is widely used for creating visualizations and plots
in Python.
import matplotlib.pyplot as plt
3. Pandas:
Pandas is a powerful library for data manipulation and analysis. It provides data structures like
DataFrames, simplifying the handling of datasets.
import pandas as pd
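Using these libraries, the missing-value and encoding steps described above can be sketched as follows (the toy dataset here is hypothetical, invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value and one categorical column
df = pd.DataFrame({
    "Country": ["France", "Spain", "France", "Germany"],
    "Age":     [44.0, 27.0, np.nan, 38.0],
    "Salary":  [72000, 48000, 54000, 61000],
})

# Handle missing data: impute the missing Age with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Encode the categorical variable with one-hot encoding
df = pd.get_dummies(df, columns=["Country"])

print(df)
```

After these two steps every column is numeric and complete, ready for splitting and scaling.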
In conclusion, data preprocessing is a vital step in machine learning, contributing to the success
of models by ensuring that they are trained on clean, relevant, and properly formatted data.
Through steps like noise reduction, handling missing values, and encoding categorical data, data
preprocessing sets the stage for robust and accurate machine learning models. Utilizing tools like
Numpy, Matplotlib, and Pandas streamlines the preprocessing workflow, making it accessible
and efficient for data scientists and machine learning practitioners.
In machine learning, achieving optimal model performance requires a delicate balance between
bias and variance. Bias and variance are two sources of error that influence a model's ability to
generalize to unseen data. Understanding these concepts is crucial for fine-tuning models and
making informed decisions during the model development process.
Bias:
Bias refers to the error introduced by approximating a real-world problem, which may be
highly complex, by a simplified model. A high bias model oversimplifies the underlying
patterns in the data and tends to make strong assumptions about the relationship between features
and the target variable. This can lead to systematic errors, causing the model to consistently
underperform.
"Bias is the difference between the average prediction of our model and the correct value which we are trying to predict."
1. Average Prediction: This refers to the predictions made by the machine learning model across the entire dataset.
2. Correct Value: This is the true or actual value of the target variable in the dataset. In supervised learning, the model is trained to predict this target variable.
3. Difference: The disparity or gap between the average prediction of the model and the actual value is what we define as bias.
Formally, Bias = E[f̂(x)] − f(x), where f̂(x) denotes the model's prediction and f(x) the true value.
A high bias implies that the model is making simplistic assumptions about the underlying
patterns in the data and may not be capturing its complexity. This can result in underfitting,
where the model fails to perform well on both the training and unseen data
Overly Simple: High bias models may be too simplistic and unable to capture the
complexity of the underlying data patterns.
Underfitting: Such models may struggle to fit the training data and perform poorly on
both the training and test sets.
Addressing Bias:
Model Complexity: Increasing the complexity of the model by adding more features or
using more sophisticated algorithms can help reduce bias.
Feature Engineering: Carefully selecting and engineering features can provide the
model with more information to learn from.
Variance:
Variance, on the other hand, measures the model's sensitivity to fluctuations in the training
data. A high variance model is overly complex and tends to fit the training data too closely,
capturing noise and outliers. While these models may perform well on the training set, they often
fail to generalize to new data, leading to poor performance on the test set.
Sensitive to Noise: These models can be overly sensitive to variations in the training
data, capturing noise rather than the underlying trends.
Addressing Variance:
Data Augmentation: Increasing the size of the training dataset or applying data
augmentation techniques can help the model generalize better.
Feature Selection: Removing irrelevant or redundant features can reduce the complexity
of the model and mitigate overfitting.
Bulls-eye diagram
Let's visualize the four different cases representing combinations of both high and low bias
and variance using a bulls-eye diagram:
1. Low Bias, Low Variance:
In this case, the model consistently predicts values very close to the correct ones. The hits on
the target are tightly clustered around the bulls-eye, indicating low variability. This scenario
represents a well-fitted model that captures the underlying patterns in the data accurately.
2. Low Bias, High Variance:
The model still predicts values close to the correct ones on average, but the predictions vary more. The hits on the target are more scattered, indicating higher variability. This scenario suggests that the model is sensitive to fluctuations in the training data, potentially overfitting noise.
3. High Bias, Low Variance:
The model's predictions are tightly clustered but consistently far from the bulls-eye. The systematic offset indicates that the model makes the same error repeatedly, typical of an oversimplified, underfitting model.
4. High Bias, High Variance:
In this case, the model consistently predicts values far from the correct ones, and the predictions vary widely. The hits on the target are scattered, indicating both a systematic error (high bias) and sensitivity to fluctuations in the training data (high variance). This scenario represents a poorly fitted model that fails to capture the underlying patterns and is overly sensitive to noise.
Mathematical Definition
If we label the variable we aim to predict as y and our covariates as X, it is reasonable to assume a relationship between them, expressed as y = f(x) + ε. Here, ε represents the error term, assumed to follow a normal distribution with mean zero (ε ∼ N(0, σ_ε²)).
We can derive an estimate of the model ^y = f^ (x ) for f (x) using techniques like linear
regression. In this context, the anticipated squared prediction error at a given point x is
defined as:
Err(x) = E[(ŷ − y)²]

This error can be broken down into bias and variance components. Substituting y = f(x) + ε and expanding:

MSE = E[(f̂(x) − f(x))² + ε² − 2ε(f̂(x) − f(x))]
    = E[(f̂(x) − f(x))²] + E[ε²] − 2E[ε(f̂(x) − f(x))]
    = E[(f̂(x) − f(x))²] + σ_ε² − 2E[ε(f̂(x) − f(x))]

As the error ε has zero mean, i.e. E[ε] = 0, and is independent of f̂(x), the last term vanishes and the equation reduces to

MSE = E[(f̂(x) − f(x))²] + σ_ε²

Adding and subtracting E[f̂(x)] inside the first term:

E[(f̂(x) − f(x))²] = E[(f̂(x) − E[f̂(x)] + E[f̂(x)] − f(x))²]
    = E[(E[f̂(x)] − f(x))²] + E[(f̂(x) − E[f̂(x)])²] + 2E[(E[f̂(x)] − f(x))(f̂(x) − E[f̂(x)])]

Since 2E[(E[f̂(x)] − f(x))(f̂(x) − E[f̂(x)])] = 0,

MSE = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ_ε²
    = Bias² + Variance + σ_ε²

where Bias = E[f̂(x)] − f(x) and Variance = E[(f̂(x) − E[f̂(x)])²].
Here, the third term, irreducible error (σ_ε²), represents the noise inherent in the true
relationship, which cannot be fundamentally diminished by any model. In an ideal scenario
with the true model and an infinite dataset for calibration, both bias and variance terms
should be reducible to zero. However, operating in a world with imperfect models and finite
data introduces a trade-off between minimizing bias and minimizing variance.
Bias-Variance Tradeoff:
The relationship between bias and variance is often depicted as a trade-off. Increasing the
complexity of a model typically reduces bias but increases variance, and vice versa. Striking
the right balance is crucial for creating models that generalize well to new, unseen data. This
balance is often referred to as the bias-variance trade-off.
Optimal Model Complexity:
Underfitting: Models that are too simple and underfit the data.
Optimal: Models with balanced bias and variance, achieving good generalization.
Overfitting: Models that are too complex and overfit the training data.
To evaluate the bias and variance of a model, various techniques can be employed, such as cross-
validation and learning curves. Learning curves provide insights into how a model's performance
changes with the size of the training dataset, offering valuable information about bias and
variance.
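The trade-off can be illustrated with a small NumPy experiment: fitting polynomials of increasing degree to noisy data and comparing held-out error. The sine-plus-noise data and the degrees chosen are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth function plus Gaussian noise
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Split into training and test halves
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

def held_out_mse(degree):
    """Fit a polynomial of the given degree and return test-set MSE."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    pred = np.polyval(coeffs, x_te)
    return np.mean((pred - y_te) ** 2)

mse_underfit = held_out_mse(1)   # high bias: a line cannot follow the sine
mse_balanced = held_out_mse(5)   # balanced bias and variance
mse_complex = held_out_mse(9)    # flexible model, risks chasing noise
print(mse_underfit, mse_balanced, mse_complex)
```

The linear fit shows the high-bias regime; the moderate-degree fit typically achieves the lowest test error, while very high degrees tend to raise it again as variance grows.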
Here are some key concepts related to function approximation in machine learning:
1. Supervised Learning:
2. Regression:
3. Classification:
In classification tasks, the goal is to assign input data to one of several predefined
classes or categories. The algorithm learns a decision boundary that separates
different classes.
4. Model Selection:
5. Overfitting and Underfitting:
Overfitting occurs when a model learns the training data too well, capturing noise
and outliers that do not generalize well to new data. Underfitting occurs when a
model is too simple to capture the underlying patterns in the data. Balancing
between overfitting and underfitting is crucial for effective function
approximation.
6. Hyperparameter Tuning:
Hyperparameters are configuration settings for a machine learning algorithm.
Tuning these hyperparameters is essential to optimize the model's performance.
Techniques such as cross-validation can be employed to find the best
hyperparameter values.
7. Loss Functions:
During training, the model's performance is evaluated using a loss function, which
measures the difference between the predicted outputs and the actual labels. The
objective is to minimize this loss to improve the model's accuracy.
8. Neural Networks:
Deep learning, particularly neural networks, has gained popularity for complex
function approximation tasks. Deep neural networks can automatically learn
hierarchical representations of data, allowing them to capture intricate patterns.
9. Ensemble Methods:
In summary, function approximation in machine learning involves training models to capture the
relationships between input data and output labels. The choice of model, handling
overfitting/underfitting, tuning hyperparameters, and using appropriate evaluation metrics are
critical aspects of successful function approximation.
Overfitting
Overfitting is a common challenge in machine learning, occurring when a model learns the
training data too well to the point that it captures noise and idiosyncrasies rather than the
underlying patterns of the data. It is a crucial concept to understand because overfit models
may perform exceptionally well on the training data but generalize poorly to new, unseen
data. This phenomenon can lead to suboptimal model performance and limits the model's
ability to make accurate predictions in real-world scenarios.
One of the primary causes of overfitting is excessive model complexity. A model with too
many parameters or that is too flexible can fit the training data precisely, even capturing
random fluctuations. As a result, the model becomes overly tailored to the specificities of
the training set, losing its ability to generalize to new, unseen data. Overfitting is often
visualized in a graph where the model's performance on the training data continues to
improve, but its performance on a separate validation or test set starts to decline.
Several factors contribute to overfitting, and understanding them is essential for effective
model development:
1. Model Complexity:
2. Insufficient Data:
Overfitting is more likely to occur when the available dataset is small. In such
cases, the model may find spurious correlations in the limited data, mistaking
them for genuine patterns.
3. Noisy Data:
If the training data contains noise or outliers, an overfit model may capture these anomalies as if they were meaningful patterns. This results in a lack of generalizability to new, clean data.
4. Feature Engineering:
Techniques to mitigate overfitting include:
1. Regularization:
2. Cross-Validation:
4. Early Stopping:
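Of these remedies, regularization has a particularly compact form. A minimal sketch of ridge (L2) regularization, using its closed-form solution β̂ = (XᵀX + λI)⁻¹Xᵀy on synthetic data invented for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: beta = (X^T X + lam*I)^(-1) X^T y.

    lam > 0 shrinks the coefficients, trading a little bias
    for a reduction in variance (less overfitting).
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
true_beta = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(0, 0.1, 30)

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 is ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)  # shrunk coefficients
print(beta_ols)
print(beta_ridge)
```

The ridge solution always has a smaller coefficient norm than the unregularized fit, which is exactly the shrinkage that curbs overly complex models.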
Introduction
Regression analysis is a powerful statistical method used in machine learning to model the
relationship between a dependent variable (target) and one or more independent variables
(predictors). The primary goal of regression is to understand how changes in the
independent variables correspond to changes in the dependent variable while holding
other variables constant. This method is particularly useful for predicting continuous or
real-valued outcomes, such as temperature, age, salary, and prices.
In the context of regression, a graph is plotted between the variables, aiming to find the
best-fitting line or curve that passes through the given data points. This line or curve
minimizes the vertical distance between the data points and the regression line, indicating
the strength of the relationship. The machine learning model can then make predictions
based on this fitted line or curve.
Independent Variable: The factors affecting the dependent variable, used for
prediction, are independent variables (also called predictors).
Outliers: Observations with very low or high values compared to others are outliers
and can impact results.
Multicollinearity: Highly correlated independent variables can lead to
multicollinearity, which may affect the ranking of important variables.
Underfitting and Overfitting: Overfitting occurs when a model fits the training
data too closely but generalizes poorly to new data, while underfitting happens
when the model is too simple and performs poorly on both training and test data.
Factor Importance: Regression enables the identification of the most and least
important factors affecting the dependent variable.
Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each variant holds significance in diverse scenarios. Despite their differences, all
regression methods fundamentally assess how independent variables impact dependent
variables. Here we are discussing some important types of regression which are given
below:
Linear Regression
Logistic Regression
Polynomial Regression
Support Vector Regression
Decision Tree Regression
Random Forest Regression
Ridge Regression
Lasso Regression
Linear Regression
Linear regression is a fundamental statistical method used in machine learning and data
analysis to model the relationship between a dependent variable and one or more
independent variables. The primary goal of linear regression is to find the best-fitting
linear relationship that describes how changes in the independent variables are
associated with changes in the dependent variable.
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called
Simple Linear Regression.
Regression analysis utilizes the statistical relation between two or more variables so that
one variable can be predicted from the other variable(s).
In simple linear regression, we model the relationship between two variables with the equation

y = β₀ + β₁x + ε

where
i) β₀ and β₁ are regression parameters: β₀ is the intercept and β₁ is the slope parameter.
ii) x is the independent (or predictor) variable. We assume that it can be measured
without error (i.e. x is not a random variable in the regression model).
iii) ε is the error term and E( ε )=0 , Var (ε )=σ 2. The error term is not observable. In
this context, error does not mean a mistake but is a statistical term representing
random fluctuations, measurement errors, or the effect of factors outside our
control.
iv) y is the dependent (or response) variable.
The designation simple indicates that there is only one x to predict the response y, and linear means that the model is linear in β₀ and β₁.
The simple linear regression model for n observations can be written as:

yᵢ = β₀ + β₁xᵢ + εᵢ,  i = 1, 2, …, n

In this case, yᵢ and εᵢ are random variables and the xᵢ are known constants.
Therefore, these assumptions hold for the above model:
(a) E[εᵢ] = 0 for all i, so that E[yᵢ] = β₀ + β₁xᵢ.
(b) Var(εᵢ) = σ² for all i.
(c) Cov(εᵢ, εⱼ) = 0 for all i ≠ j.
Assumption (a) implies that yᵢ depends only on xᵢ and that all other variation in yᵢ is random. Assumption (b) is also known as the assumption of homoscedasticity, homogeneous variance, or constant variance. Assumption (c) implies that the εᵢ variables are uncorrelated.
7.3 Estimation of the Regression Parameters (β₀, β₁)
In practical scenarios, the regression parameters are typically not known beforehand. However, using a random sample of n observations, (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), we can estimate the parameters β₀, β₁, and σ². The estimates of β₀ and β₁, denoted β̂₀ and β̂₁, are obtained through the method of least squares. Importantly, this method does not necessitate any assumptions regarding the distribution of the data.
In the method of least squares the point estimates of β₀ and β₁ are the values that
minimize the sum of squared error (SSE) between observed values y and predicted values
ŷ:
SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − β̂₀ − β̂₁xᵢ)²
To find the minimum of SSE, we take partial derivatives with respect to β₀ and β₁ and set
them equal to zero:
∂SSE/∂β₀ = −2Σ(yᵢ − ŷᵢ) = 0
∂SSE/∂β₁ = −2Σxᵢ(yᵢ − ŷᵢ) = 0
Solving the above equations gives us the least squares estimates for β₀ and β₁:
For β̂₀:

β̂₀ = ȳ − β̂₁x̄
where:
ȳ is the mean of the observed values ( y ).
x̄ is the mean of the predictor variable (x ).
For β̂₁:

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = (Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n x̄²)
These equations provide the estimated coefficients β̂₀ and β̂₁ that define the best-fitting
linear regression line, minimizing the sum of squared differences between the observed
and predicted values. These estimates are used to make predictions and describe the linear
relationship between the predictor variable (x ) and the response variable (y).
Example 7.1: Students in a statistics class (taught by one of the authors) claimed that doing
the homework had not helped prepare them for the midterm exam. The exam score y and
homework score x (averaged up to the time of the midterm) for the 11 students in the class
were as follows:
x 5 6 7 8 9 10 11 12 13 14 15
y 1 4 3 8 7 7 13 10 16 17 13
xᵢ  yᵢ  xᵢ²  xᵢyᵢ
5 1 25 5
6 4 36 24
7 3 49 21
8 8 64 64
9 7 81 63
10 7 100 70
11 13 121 143
12 10 144 120
13 16 169 208
14 17 196 238
15 13 225 195
∑ xᵢ=110
∑ yᵢ=99
∑ x 2i =1210
∑ x i yᵢ=1151
x̄ = Σxᵢ/n = 110/11 = 10
ȳ = Σyᵢ/n = 99/11 = 9
n x̄ȳ = 11 × 10 × 9 = 990

β̂₁ = (Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n x̄²) = (1151 − 990)/(1210 − 11·10·10) = 1.4636

β̂₀ = ȳ − β̂₁x̄ = 9 − 1.4636 × 10 = −5.6364
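The hand calculation in Example 7.1 can be checked numerically with NumPy:

```python
import numpy as np

# Homework scores x and exam scores y from Example 7.1
x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13], dtype=float)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates from the formulas above
beta1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
beta0 = y_bar - beta1 * x_bar

print(round(beta1, 4), round(beta0, 4))  # 1.4636 -5.6364
```

The output matches the values β̂₁ = 1.4636 and β̂₀ = −5.6364 obtained by hand.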
[Figure: Simple Linear Regression — scatter plot of the (x, y) data with the fitted regression line]
a. β̂₀ and β̂₁ are unbiased estimators: E[β̂₁] = β₁ and E[β̂₀] = β₀.
b. β̂₀ and β̂₁ have minimum variance among all linear unbiased estimators (Gauss-Markov theorem).
Proof of (a):

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

We have

E[ȳ] = β₀ + β₁x̄
E[yᵢ − ȳ] = β₁(xᵢ − x̄)

Therefore,

E[β̂₁] = Σ(xᵢ − x̄)E[yᵢ − ȳ] / Σ(xᵢ − x̄)² = Σ(xᵢ − x̄)β₁(xᵢ − x̄) / Σ(xᵢ − x̄)² = β₁

E[β̂₀] = E[ȳ − β̂₁x̄] = β₀ + β₁x̄ − x̄E[β̂₁] = β₀
Variances of the estimators:

(i) Var[β̂₁] = σ² / Σᵢ₌₁ⁿ (xᵢ − x̄)²

(ii) Var[β̂₀] = σ² [1/n + x̄² / Σᵢ₌₁ⁿ (xᵢ − x̄)²]
Proof of (i):

Writing β̂₁ = Σ(xᵢ − x̄)yᵢ / Σ(xᵢ − x̄)² and using Var[ay] = a²Var[y] for uncorrelated yᵢ:

Var[β̂₁] = [1 / (Σ(xᵢ − x̄)²)²] Σ(xᵢ − x̄)² Var[yᵢ]
         = [1 / (Σ(xᵢ − x̄)²)²] Σ(xᵢ − x̄)² σ²
         = σ² / Σ(xᵢ − x̄)²

Note that the variance of β̂₁ is minimized when Σᵢ₌₁ⁿ (xᵢ − x̄)² is as large as possible; for this reason, the values of x should be chosen so as to cover the entire range of its values.
Proof of (ii):

Var[ȳ] = σ²/n

Since β̂₀ = ȳ − β̂₁x̄, and ȳ and β̂₁ are uncorrelated,

Var[β̂₀] = Var[ȳ] + x̄² Var[β̂₁] = σ²/n + x̄²σ² / Σᵢ₌₁ⁿ (xᵢ − x̄)² = σ² [1/n + x̄² / Σ(xᵢ − x̄)²]
Coefficient of determination (r²):

r² = SSR/SST = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)² = 1 − SSE/SST

where SSR = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² is the regression sum of squares and SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)² is the total sum of squares. The total sum of squares can be partitioned into SST = SSR + SSE, that is,

Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² + Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Thus r² gives the proportion of variation in y that is explained by the model or, equivalently, accounted for by regression on x.
xᵢ  yᵢ  xᵢ²  xᵢyᵢ  (ŷᵢ − ȳ)²  (yᵢ − ȳ)²
5 1 25 5 53.559 64
6 4 36 24 34.2787 25
7 3 49 21 19.2826 36
8 8 64 64 8.57084 1
9 7 81 63 2.1433 4
10 7 100 70 1.6E-07 4
11 13 121 143 2.14095 16
12 10 144 120 8.56616 1
13 16 169 208 19.2756 49
14 17 196 238 34.2693 64
15 13 225 195 53.5473 16
Σ(ŷᵢ − ȳ)² = 235.64  Σ(yᵢ − ȳ)² = 280

r² = SSR/SST = 235.64/280 = 0.84159
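The r² calculation for Example 7.1 can be reproduced numerically:

```python
import numpy as np

x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13], dtype=float)

# Fitted line from Example 7.1
beta0, beta1 = -5.6364, 1.4636
y_hat = beta0 + beta1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares

r_squared = ssr / sst
print(round(sst, 2), round(ssr, 2), round(r_squared, 4))
```

The values agree with the table: SST = 280, SSR ≈ 235.64, r² ≈ 0.8416, so about 84% of the variation in exam scores is explained by homework scores.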
In simple linear regression, interval estimation involves estimating a range within which a
population parameter, such as the slope or intercept of the regression line, is likely to fall.
This process is essential for understanding the precision and reliability of the regression
model's parameters. Two common types of interval estimation in simple linear regression
are confidence intervals for the regression coefficients (slope and intercept).
The slope of the regression line in simple linear regression represents the rate of change in
the dependent variable for a one-unit change in the independent variable. A confidence
interval for the slope provides a range of values within which the true population slope is
likely to lie. The width of these confidence intervals is a measure of the overall quality of the
regression line. If the errors are normally and independently distributed, then the sampling distributions of the standardized β̂₀ and β̂₁ are t with n − 2 degrees of freedom.
The formula for the confidence interval for the slope (β ₁) is typically given as:
β̂₁ − t_(α/2, n−2) · SE(β̂₁) ≤ β₁ ≤ β̂₁ + t_(α/2, n−2) · SE(β̂₁)
where:
t_(α/2, n−2) is the critical value from the t-distribution for the desired confidence level, with n − 2 degrees of freedom.
This interval provides a range of values for the population slope with a specified level of
confidence.
The formula for the confidence interval for the intercept (β₀) is typically given as:
β̂₀ − t_(α/2, n−2) · SE(β̂₀) ≤ β₀ ≤ β̂₀ + t_(α/2, n−2) · SE(β̂₀)
where:
t is the critical value from the t-distribution based on the desired confidence
level,
This interval provides a range of values for the population intercept with a specified
level of confidence.
In both cases, the critical value t is determined based on the desired confidence level
and degrees of freedom associated with the regression analysis. Common confidence
levels are 95% or 99%, and the degrees of freedom depend on the sample size and the
number of parameters estimated.
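A sketch of a 95% confidence interval for the slope of Example 7.1, assuming the tabulated critical value t_(0.025, 9) ≈ 2.262 (taken from a t-table; not computed here):

```python
import numpy as np

x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([1, 4, 3, 8, 7, 7, 13, 10, 16, 17, 13], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()

# Residual variance estimate with n - 2 degrees of freedom
residuals = y - (beta0 + beta1 * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)

t_crit = 2.262  # assumed value of t_(0.025, 9) for a 95% interval
lower, upper = beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1
print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")
```

Because the interval excludes zero, the slope is statistically significant at the 5% level, contradicting the students' claim that homework did not help.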
Residual
The residual is defined as the disparity between the observed value and the fitted value of
the study variable. For the i-th observation, the residual is computed as follows:

eᵢ = yᵢ − ŷᵢ

Here, yᵢ represents an observed data point, and ŷᵢ is the corresponding predicted value
obtained from the regression model.
Residuals can be seen as a measure of the deviation between the actual data and the
model's predictions. They represent the amount of variability in the response variable that
cannot be accounted for by the regression model.
Residuals can also be interpreted as the observed values of the model's errors. Therefore,
any departure from the assumptions regarding random errors should be reflected in the
residuals. Analysing the residuals aids in identifying inadequacies in the model and
assessing whether the model's assumptions hold.
Residual Analysis: Examining the pattern of residuals can provide insights into the
adequacy of the model. Ideally, residuals should be randomly distributed around
zero, indicating that the model is capturing the underlying relationship.
Outliers and Influential Points: Residual analysis can help identify outliers or
influential points that have a substantial impact on the regression model.
The general linear model, often referred to as the multiple linear regression model, is used
for analysing data involving multiple independent variables. In this context, we consider a
linear multiple regression model with k independent variables. Multiple linear regression
extends the principles of simple linear regression to accommodate situations where there
are more than one independent variable influencing the dependent variable.
8.2 Multiple Linear Regression Model
The representation of the regression model is as follows:
y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
Where
(i) The regression parameters, denoted as β₀, β₁, …, βₖ, represent key coefficients within the model.
(ii) The independent variables, namely x₁, x₂, …, xₖ, are the predictor variables. These variables are considered fixed and can be measured without any error or uncertainty.
(iii) The variable y, which serves as the response variable, is observable and subject to random error ε.
This regression model comes into play when the response variable relies on multiple
independent variables.
The regression function, also known as the response function, associated with the above
model can be expressed as:
E[y] = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ
The outcome y is frequently impacted by multiple predictor variables. For instance, the
crop yield could be influenced by the quantities of nitrogen, potash, and phosphate
fertilizers applied. While these factors can be manipulated by the experimenter, the yield
might also be affected by uncontrollable variables, such as those related to weather
conditions.
yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₖxₖᵢ + εᵢ

The assumptions for εᵢ (or yᵢ) are the same as those for simple linear regression:
1. E[εᵢ] = 0 for all i (the model is correct).
2. Var(εᵢ) = σ² for all i.
3. Cov(εᵢ, εⱼ) = 0 for all i ≠ j.
Assumption 1 states that the model is correct, in other words that all relevant x ’ s are
included, and the model is indeed linear. Assumption 2 asserts that the variance of y is
constant and therefore does not depend on the x ’ s . Assumption 3 states that the y ’ s are
uncorrelated with each other, which usually holds in a random sample (the observations
would typically be correlated in a time series or when repeated measurements are made on
a single plant or animal). Later we will add a normality assumption, under which the y variables will be independent as well as uncorrelated.
When all three assumptions hold, the least-squares estimators of the β ’s have some good
properties. If one or more assumptions do not hold, the estimators may be poor. Under the
normality assumption, the maximum likelihood estimators have excellent properties.
Any of the three assumptions may fail to hold with real data. Several procedures have been
devised for checking the assumptions. These diagnostic techniques are discussed in
Chapter 10.
Writing the model for each observation:

y₁ = β₀ + β₁x₁₁ + β₂x₂₁ + … + βₖxₖ₁ + ε₁
⋮
yₙ = β₀ + β₁x₁ₙ + β₂x₂ₙ + … + βₖxₖₙ + εₙ

In matrix form,

Y = Xβ + ε   (8.4)

[y₁]   [1  x₁₁  x₂₁  ⋯  xₖ₁] [β₀]   [ε₁]
[y₂] = [1  x₁₂  x₂₂  ⋯  xₖ₂] [β₁] + [ε₂]   (8.5)
[⋮ ]   [⋮   ⋮    ⋮       ⋮ ] [⋮ ]   [⋮ ]
[yₙ]   [1  x₁ₙ  x₂ₙ  ⋯  xₖₙ] [βₖ]   [εₙ]
where,
We assume that Rank(X) = p, where p = k + 1 is the number of columns of X. In such a scenario, the matrix X is said to have full rank.
a) E[ε] = 0, or equivalently E[Y] = Xβ.
b) Cov(ε) = σ²I, or equivalently Cov(Y) = σ²I.
When the influence of x i on E( y) remains consistent regardless of the values of the other
variables for all i, it can be said that the variables exhibit an additive effect or do not
interact with each other.
SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − β̂₀ − β̂₁x₁ᵢ − … − β̂ₖxₖᵢ)²   (8.6)
Note that the predicted value ŷᵢ = β̂₀ + β̂₁x₁ᵢ + … + β̂ₖxₖᵢ estimates E(yᵢ), not yᵢ. A better notation would be Ê[yᵢ], but ŷᵢ is commonly used.
To find the values of β̂₀, β̂₁, …, β̂ₖ that minimize SSE, we differentiate SSE with respect to each βⱼ and set the results equal to zero, yielding p = k + 1 equations that can be solved simultaneously for the β̂ⱼ's.
∂SSE/∂β₀ = −2Σ(yᵢ − ŷᵢ) = 0   (1)
∂SSE/∂β₁ = −2Σx₁ᵢ(yᵢ − ŷᵢ) = 0   (2)
⋮
∂SSE/∂βₖ = −2Σxₖᵢ(yᵢ − ŷᵢ) = 0   (p)
These can be written compactly as the normal equations:

XᵀXβ̂ = Xᵀy

Pre-multiplying both sides by (XᵀX)⁻¹ gives

(XᵀX)⁻¹XᵀXβ̂ = (XᵀX)⁻¹Xᵀy
β̂ = (XᵀX)⁻¹Xᵀy

Because we assumed that X is of full rank, the matrix XᵀX has an inverse, and the solution for β̂ is unique.
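Solving the normal equations can be sketched in a few lines of NumPy; the design matrix and coefficients below are synthetic, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic design matrix: intercept column plus two predictors
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([2.0, 1.5, -0.5])
y = X @ beta_true + rng.normal(0, 0.1, n)

# Solve the normal equations X^T X beta = X^T y
# (np.linalg.solve is preferred over forming the explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is numerically more stable than forming XᵀX, but the normal-equation form mirrors the derivation above.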
Example 8.1:
Let us consider,
Here area, rooms, age are features / independent variables and price is the target /
dependent variable.
X = [[ 1, 23, 3, 8],
[ 1, 15, 2, 7],
[ 1, 24, 4, 9],
[ 1, 29, 5, 4],
[ 1, 31, 7, 6],
[ 1, 25, 3, 10]]
Xᵀ = [[ 1,  1,  1,  1,  1,  1],
      [23, 15, 24, 29, 31, 25],
      [ 3,  2,  4,  5,  7,  3],
      [ 8,  7,  9,  4,  6, 10]]
β̂ = (XᵀX)⁻¹Xᵀy = [305.33, 236.86, −4.76, 102.90]ᵀ

Hence

β̂₀ = 305.33
β̂₁ = 236.86
β̂₂ = −4.76
β̂₃ = 102.90
The model can be interpreted as:
ŷᵢ = β̂₀ + β̂₁x₁ᵢ + β̂₂x₂ᵢ + β̂₃x₃ᵢ
[Figure: scatter plots of price against each feature — area (x1), rooms, and age — and of predicted price against actual price (y)]
8.5.1 Unbiasedness of β̂

E[Y] = E[Xβ + ε] = Xβ
E[β̂] = E[(XᵀX)⁻¹XᵀY] = (XᵀX)⁻¹XᵀXβ = β

8.5.2 Cov(β̂) = (XᵀX)⁻¹σ²

Cov(β̂) = (XᵀX)⁻¹Xᵀ(σ²I)X(XᵀX)⁻¹ = (XᵀX)⁻¹σ²

The i-th diagonal element of the matrix Cov(β̂) = (XᵀX)⁻¹σ² gives the variance of β̂ᵢ, and the off-diagonal elements give the covariances between pairs of estimators.
β̂ = (XᵀX)⁻¹Xᵀy is a linear function of the normal vector y. Therefore β̂ is also a normal vector.
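The unbiasedness property E[β̂] = β can be illustrated with a small Monte Carlo simulation (the design, coefficients, and error variance below are synthetic, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Fixed design matrix, true coefficients, and error standard deviation
X = np.column_stack([np.ones(20), np.arange(20, dtype=float)])
beta = np.array([3.0, 0.5])
sigma = 1.0

# Repeatedly draw y = X beta + eps and re-estimate beta each time
estimates = []
for _ in range(2000):
    y = X @ beta + rng.normal(0, sigma, 20)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))  # should be close to [3.0, 0.5]
```

Averaging the estimates over many simulated samples recovers the true β, which is exactly what unbiasedness asserts.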
Example 8.2
(a) Use the matrix method to estimate the parameters of the simple linear regression
model.
(b) Obtain the variance-covariance matrix of the estimators.
Solution
(a)
We have

y = Xβ + ε

where X is the n × 2 design matrix whose i-th row is (1, xᵢ) and β = (β₀, β₁)ᵀ. Then

XᵀX = [ n     Σxᵢ  ]
      [ Σxᵢ   Σxᵢ² ]
XᵀY = [ Σyᵢ   ]
      [ Σxᵢyᵢ ]
(XᵀX)⁻¹ = 1/(nΣxᵢ² − (Σxᵢ)²) [ Σxᵢ²   −Σxᵢ ]
                             [ −Σxᵢ    n   ]

β̂ = [β̂₀] = (XᵀX)⁻¹XᵀY
    [β̂₁]

  = 1/(nΣxᵢ² − (Σxᵢ)²) [ Σxᵢ²   −Σxᵢ ] [ Σyᵢ   ]
                       [ −Σxᵢ    n   ] [ Σxᵢyᵢ ]

  = 1/(nΣxᵢ² − (Σxᵢ)²) [ Σxᵢ²Σyᵢ − ΣxᵢΣxᵢyᵢ ]
                       [ −ΣxᵢΣyᵢ + nΣxᵢyᵢ  ]
We obtain

β̂₀ = (ȳΣxᵢ² − x̄Σxᵢyᵢ) / (Σxᵢ² − nx̄²)

β̂₁ = (Σxᵢyᵢ − nx̄ȳ) / (Σxᵢ² − nx̄²)
(b) Var(β̂) = Var((XᵀX)⁻¹Xᵀy) = (XᵀX)⁻¹σ²

(XᵀX)⁻¹σ² = σ²/(nΣxᵢ² − (Σxᵢ)²) [ Σxᵢ²   −Σxᵢ ] = [ Var(β̂₀)        Cov(β̂₀, β̂₁) ]
                                [ −Σxᵢ    n   ]   [ Cov(β̂₀, β̂₁)   Var(β̂₁)     ]

Var(β̂₀) = σ²Σxᵢ² / (nΣxᵢ² − (Σxᵢ)²) = σ²/n + x̄²σ² / Σ(xᵢ − x̄)²

Var(β̂₁) = nσ² / (nΣxᵢ² − (Σxᵢ)²) = σ² / Σ(xᵢ − x̄)²

Cov(β̂₀, β̂₁) = −σ²Σxᵢ / (nΣxᵢ² − (Σxᵢ)²) = −x̄σ² / Σ(xᵢ − x̄)²
Statistical measures to assess the goodness of fit and the significance
of the model
When interpreting the output of a multiple linear regression model, several key statistical
measures are essential to assess the goodness of fit and the significance of the model. Here
are some common elements found in the output and their interpretations:
R-squared (R²):
R-squared represents the proportion of the variance in the dependent variable that is
explained by the independent variables in the model. It ranges from 0 to 1, where a higher
R-squared value indicates a better fit.
r^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SSE}{SST}
where SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 is the regression sum of squares and SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 is the total sum of squares. The total sum of squares can be partitioned as SST = SSR + SSE, that is,

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Thus r 2 gives the proportion of variation in y that is explained by the model or, equivalently,
accounted for by regression on x.
Example: An R-squared of 0.75 means that 75% of the variability in the dependent
variable is explained by the independent variables in the model.
Adjusted R-squared:
Adjusted R-squared penalizes the addition of predictors that do not improve the model:

R_{adj}^2 = 1 - \frac{SSE/(n-k)}{SST/(n-1)}

where n is the number of observations and k is the number of parameters estimated in the model.
Key Points:
Increased Predictors: As more predictors are added to a model, R-squared tends to increase even if the additional predictors do not contribute significantly to explaining the variance. Adjusted R-squared adjusts for this by penalizing the increase in R-squared due to adding less relevant predictors.
Model Parsimony: Adjusted R-squared allows for a fair comparison of models with different numbers of predictors. A higher Adjusted R-squared suggests a better balance between model complexity and explanatory power.
Adjusted R-squared is a valuable metric for assessing the goodness of fit in multiple
linear regression models, providing a balanced measure that considers both
explanatory power and model complexity. It aids in selecting models that strike an
optimal balance between capturing variance in the data and avoiding unnecessary
complexity.
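The two measures are straightforward to compute side by side. The sketch below uses made-up observed and fitted values (k = 2 assumes an intercept plus one slope; all numbers are illustrative):

```python
# R-squared and adjusted R-squared from observed values y and fitted values y_hat.
# Uses R^2_adj = 1 - (SSE/(n-k)) / (SST/(n-1)) as given in the text.
y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_hat = [3.2, 4.8, 7.1, 8.9, 11.3, 12.7]
n, k = len(y), 2                     # two estimated parameters (intercept + slope)

y_bar = sum(y) / n
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k)) / (sst / (n - 1))
print(r2, r2_adj)
```

Note that the adjusted value is always at or below the plain R-squared, and the gap widens as k grows relative to n.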
F -statistic
The F-statistic in multiple linear regression is a statistical measure that assesses the
overall significance of the regression model. It tests the null hypothesis that all the
coefficients of the independent variables in the model are equal to zero, suggesting that
the model has no explanatory power. In contrast, the alternative hypothesis is that at
least one of the coefficients is not equal to zero, indicating that the model is statistically
significant.
F = \frac{SSR / k}{SSE / (n - k - 1)}

where k is the number of predictors (independent variables) and n is the number of observations. The F-statistic follows an F-distribution with k degrees of freedom for the numerator and n - k - 1 for the denominator.
1. Null Hypothesis: The null hypothesis for the F-test is that all coefficients of the
independent variables are equal to zero, implying that the model has no explanatory
power.
5. Caution: A significant F -statistic does not provide information about which specific
predictors are significant; for that, individual t-tests for each coefficient are typically
conducted.
In summary, the F-statistic in multiple linear regression is a crucial tool for determining whether the overall model is statistically significant. It helps assess whether the inclusion of independent variables in the model contributes meaningfully to explaining the variation in the dependent variable.
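The formula reduces to a one-line function. In the sketch below the sums of squares are illustrative numbers chosen so that, with k = 3 and n = 100, the result lands near the F of 15.5 with 3 and 96 degrees of freedom discussed in this section:

```python
# F-statistic for overall model significance: F = (SSR/k) / (SSE/(n-k-1)).
# SSR, SSE, k, n below are illustrative numbers, not from a real fit.
def f_statistic(ssr: float, sse: float, k: int, n: int) -> float:
    """Ratio of explained variance per predictor to residual variance."""
    return (ssr / k) / (sse / (n - k - 1))

# e.g. 3 predictors, 100 observations
F = f_statistic(ssr=300.0, sse=620.0, k=3, n=100)
print(round(F, 2))
```

To turn F into a p-value you would compare it against the F-distribution with (k, n - k - 1) degrees of freedom, e.g. via a statistics library's F-distribution survival function.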
Significance F ( p-value):
The significance F , often referred to as the p-value associated with the F -statistic, indicates
the probability of observing an F -statistic as extreme as the one calculated if the null
hypothesis were true.
If the significance F is less than a chosen significance level (e.g., 0.05), you reject the null
hypothesis, suggesting that the overall model is statistically significant.
A significant F -statistic implies that at least one independent variable in the model
contributes significantly to explaining the variance in the dependent variable.
The degrees of freedom for the F -statistic are associated with the number of predictors ( k )
and the number of observations (n ) in the model.
Example: Suppose you obtain an F-statistic of 15.5 with 3 and 96 degrees of freedom (k = 3 predictors and n = 100 observations, so n - k - 1 = 96). The Significance F (p-value) is calculated to be 0.0002 (less than 0.05). In this case:
Conclusion: Since the p-value is less than the chosen significance level (e.g., 0.05), you
would reject the null hypothesis.
Interpretation: This suggests that at least one of the predictors in the model is significant,
and the overall model is statistically significant.
The F -statistic and Significance F are crucial for determining whether the regression
model as a whole is statistically significant. A significant F-statistic implies that the model
has explanatory power, and at least one independent variable contributes significantly to
the prediction of the dependent variable.
Standard Error
In multiple linear regression, the standard error (often referred to as the standard error of
the residuals or residual standard error) is a measure of the dispersion of the observed
values around the predicted values. It provides an estimate of how much the actual values
deviate from the predicted values and is a crucial metric for assessing the accuracy and
precision of the regression model.
The standard error in multiple linear regression is mathematically calculated as the square
root of the mean squared residual (MSR), which is the average of the squared differences
between the observed and predicted values. The formula for the standard error is as
follows:
\text{Standard Error} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}}

where y_i are the observed values, \hat{y}_i are the corresponding predicted values, n is the number of observations, and k is the number of predictors.
5. Model Evaluation: In conjunction with other metrics such as R-squared and the F-
statistic, the standard error is useful for evaluating the overall performance of the
multiple linear regression model.
In summary, the standard error in multiple linear regression is a fundamental metric that
quantifies the precision of the model's predictions. A lower standard error indicates a
better fit, but its interpretation should be considered alongside other model evaluation
metrics and assumptions.
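The formula translates directly into code. The sketch below uses illustrative observed and fitted values for a one-predictor model (k = 1):

```python
# Residual standard error: sqrt( sum((y_i - y_hat_i)^2) / (n - k - 1) ).
# Illustrative data: one predictor (k = 1), six observations.
import math

y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_hat = [3.2, 4.8, 7.1, 8.9, 11.3, 12.7]
n, k = len(y), 1

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
std_err = math.sqrt(sse / (n - k - 1))
print(std_err)
```

Because the standard error carries the units of the dependent variable, it is often easier to communicate than R-squared: it says how far, on average, predictions miss.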
Coefficient p-values
In multiple linear regression, the p-values associated with the coefficients of the
independent variables provide information about the statistical significance of each
predictor in the model. Each coefficient p-value is associated with a null hypothesis that the
corresponding coefficient is equal to zero, indicating no effect of that particular predictor
on the dependent variable.
Here's how to interpret the p-values for the coefficients in multiple linear regression:
1. Null Hypothesis: The null hypothesis (H 0 ) for each coefficient is that the
corresponding predictor has no effect on the dependent variable.
4. Interpretation:
p-value < α : If the p-value for a coefficient is less than the chosen
significance level (e.g., 0.05), you reject the null hypothesis. This suggests that
the corresponding predictor is statistically significant in predicting the
dependent variable.
p-value ≥ α : If the p-value is greater than or equal to the significance level,
you fail to reject the null hypothesis. This implies that there is not enough
evidence to conclude that the predictor has a significant effect.
5. Decision Rule: A common decision rule is to reject the null hypothesis when the p-
value is less than α and, conversely, fail to reject the null hypothesis when the p-
value is greater than or equal to α .
Example: Suppose you have a multiple linear regression model with predictors X_1, X_2, and X_3, and corresponding coefficients \beta_1, \beta_2, and \beta_3. The p-values associated with these coefficients are p_1, p_2, and p_3. If p_1 < 0.05, you might conclude that X_1 is a significant predictor, while p_2 and p_3 above 0.05 suggest that X_2 and X_3 are not statistically significant predictors.
Coefficient p-values in multiple linear regression help assess the individual contributions of
each predictor to the model. A low p-value indicates that the corresponding predictor is
likely to be a significant factor in predicting the dependent variable.
Introduction
In machine learning and data analysis, handling datasets with numerous features or
variables is a common challenge. High-dimensional data can lead to increased
computational complexity, overfitting, and difficulties in visualizing and interpreting
results. Feature selection and dimensionality reduction techniques are employed to address
these issues. This discussion focuses on three widely used methods: Principal Component
Analysis (PCA), Linear Discriminant Analysis (LDA), and Independent Component Analysis
(ICA).
Feature Selection: Feature selection involves choosing a subset of relevant features from
the original set. The goal is to retain the most informative variables while discarding
irrelevant or redundant ones. This process simplifies models, improves interpretability, and
reduces the risk of overfitting. Common techniques include filter methods (e.g., correlation
analysis), wrapper methods (e.g., recursive feature elimination), and embedded methods
(e.g., regularization).
Dimensionality Reduction: Dimensionality reduction aims to transform the data into a
lower-dimensional space while preserving its essential characteristics. This is achieved by
creating new variables, known as components or factors, that capture most of the original
data's variance. Dimensionality reduction techniques are particularly useful when the
dataset has a large number of features. PCA, LDA, and ICA are three powerful methods for
dimensionality reduction.
Procedure:
1. Standardization: Standardize the data to ensure that all variables are on the same
scale.
4. Select Principal Components: Sort the eigenvalues in descending order and choose
the top k eigenvectors to form the matrix W .
Applications:
Image Compression: Reduce the dimensionality of image data while retaining key
features.
Example
X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}
5 6
Step 1: Standardize the Data
We start by standardizing the data (subtracting the mean and dividing by the standard
deviation for each feature). In this small example, we won't need to do that since the
data is already small and simple.
Step 2: Compute the Covariance Matrix

\Sigma = \frac{1}{n-1} (X - \bar{X})^T (X - \bar{X}) = \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix}
2 4
Next, we calculate the eigenvectors and eigenvalues of the covariance matrix. The
eigenvectors represent the directions of maximum variance, and the eigenvalues
represent the magnitude of the variance in those directions.
Eigenvalues (λ):{8 , 0 }
Choose the top k eigenvectors to form the matrix W . Since we have only one non-zero
eigenvalue, we choose the corresponding eigenvector as our principal component.
W = \begin{bmatrix} 1 \\ 1 \end{bmatrix}

Project the original data onto the subspace defined by the principal components:

\text{Projected Data } Z = X \cdot W = \begin{bmatrix} 3 \\ 7 \\ 11 \end{bmatrix}
The resulting matrix Z represents the dataset in the reduced space defined by the
principal components.
This is a simplified example, and in real-world scenarios, PCA is applied to datasets with
many more dimensions to capture the most significant sources of variance in the data.
The eigenvalues and eigenvectors are crucial in determining the principal components
and their importance in representing the data.
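The worked example can be reproduced end to end in plain Python: center the data, build the 2x2 covariance matrix, solve the characteristic equation for the eigenvalues, and project onto the (unnormalized) leading eigenvector [1, 1] used above.

```python
# PCA on the worked example X = [[1,2],[3,4],[5,6]].
import math

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n = len(X)
mean = [sum(row[j] for row in X) / n for j in range(2)]      # (3, 4)
C = [[row[j] - mean[j] for j in range(2)] for row in X]      # centered data

# Covariance matrix (1/(n-1)) * C^T C
cov = [[sum(C[i][a] * C[i][b] for i in range(n)) / (n - 1)
        for b in range(2)] for a in range(2)]

# Eigenvalues of a 2x2 matrix via lambda^2 - trace*lambda + det = 0
tr = cov[0][0] + cov[1][1]
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
disc = math.sqrt(tr * tr - 4 * det)
eig = sorted([(tr + disc) / 2, (tr - disc) / 2], reverse=True)

W = [1.0, 1.0]   # eigenvector for the leading eigenvalue (unnormalized, as in the text)
Z = [row[0] * W[0] + row[1] * W[1] for row in X]
print(eig, Z)
```

In practice the eigenvector would be normalized to unit length before projecting, which rescales Z but does not change the direction of maximum variance.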
LDA is a supervised technique that aims to find the linear combinations of features that
best separate different classes in a classification problem. Unlike PCA, LDA considers class
labels, making it particularly useful for feature extraction in the context of classification
tasks.
Linear Discriminant Analysis (LDA) is widely regarded as the preferred method for
effectively distinguishing between two or more classes characterized by multiple features
in classification problems. For instance, in scenarios where there are two classes with
multiple features and a need for efficient separation, LDA emerges as a commonly
employed technique. Attempting classification based on a single feature may result in
overlapping and insufficient discrimination between the classes.
To overcome the overlapping issue in the classification process, we must increase the number of features.
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the figure below:
Nevertheless, drawing an efficient straight line in a 2-dimensional plane to separate these
data points proves to be impossible. However, by employing Linear Discriminant Analysis,
we can achieve dimensional reduction from a 2-dimensional plane to a 1-dimensional
plane. This technique allows us to enhance the separability between multiple classes.
Working of LDA
For instance, let's consider a scenario with two classes represented on a 2-dimensional
plane with an X −Y axis, and the objective is to achieve efficient classification. As
demonstrated in the example, LDA facilitates the creation of a straight line that distinctly
separates the two classes of data points. In this context, LDA utilizes the X −Y axis to
establish a new axis by employing a straight line for segregation and subsequently
projecting the data onto this new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane
into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following two criteria: (1) maximize the distance between the means of the two classes, and (2) minimize the variation within each class.
Using the above two conditions, LDA generates a new axis in such a way that it can
maximize the distance between the means of the two classes and minimizes the variation
within each class.
In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.
Procedure:
1. Compute Class Means: Calculate the mean vectors for each class.
2. Compute Scatter Matrices: Compute the within-class scatter matrix ( SW ) and the
between-class scatter matrix ( S B).
Example
Class 1: Data points X_1 = (x_1, x_2) = {(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)}
Class 2: Data points X_2 = (x_1, x_2) = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}
I. Calculate the within-class scatter matrix (S_W):

S_1 = \frac{1}{n_1 - 1} (X_1 - \bar{x}_1)^T (X_1 - \bar{x}_1), \qquad S_2 = \frac{1}{n_2 - 1} (X_2 - \bar{x}_2)^T (X_2 - \bar{x}_2)

With class means \bar{x}_1 = (3, 3.8) and \bar{x}_2 = (8.4, 7.6):

(X_1 - \bar{x}_1) = \begin{bmatrix} 4-3 & 2-3.8 \\ 2-3 & 4-3.8 \\ 2-3 & 3-3.8 \\ 3-3 & 6-3.8 \\ 4-3 & 4-3.8 \end{bmatrix} = \begin{bmatrix} 1 & -1.8 \\ -1 & 0.2 \\ -1 & -0.8 \\ 0 & 2.2 \\ 1 & 0.2 \end{bmatrix}

(X_2 - \bar{x}_2) = \begin{bmatrix} 9-8.4 & 10-7.6 \\ 6-8.4 & 8-7.6 \\ 9-8.4 & 5-7.6 \\ 8-8.4 & 7-7.6 \\ 10-8.4 & 8-7.6 \end{bmatrix} = \begin{bmatrix} 0.6 & 2.4 \\ -2.4 & 0.4 \\ 0.6 & -2.6 \\ -0.4 & -0.6 \\ 1.6 & 0.4 \end{bmatrix}
S_1 = \frac{1}{5-1} \begin{bmatrix} 1 & -1 & -1 & 0 & 1 \\ -1.8 & 0.2 & -0.8 & 2.2 & 0.2 \end{bmatrix} \begin{bmatrix} 1 & -1.8 \\ -1 & 0.2 \\ -1 & -0.8 \\ 0 & 2.2 \\ 1 & 0.2 \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 4 & -1 \\ -1 & 8.8 \end{bmatrix} = \begin{bmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{bmatrix}
Similarly,

S_2 = \begin{bmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{bmatrix}
Therefore

S_W = S_1 + S_2 = \begin{bmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{bmatrix} + \begin{bmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{bmatrix} = \begin{bmatrix} 3.3 & -0.3 \\ -0.3 & 5.5 \end{bmatrix}
II. Calculate the between-class scatter matrix (S_B):

S_B = (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T

(\bar{x}_1 - \bar{x}_2) = \begin{bmatrix} 3 \\ 3.8 \end{bmatrix} - \begin{bmatrix} 8.4 \\ 7.6 \end{bmatrix} = \begin{bmatrix} -5.4 \\ -3.8 \end{bmatrix}

S_B = \begin{bmatrix} -5.4 \\ -3.8 \end{bmatrix} \begin{bmatrix} -5.4 & -3.8 \end{bmatrix} = \begin{bmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{bmatrix}
Step 3: Compute S_W^{-1} S_B and its eigenvalues and eigenvectors.

S_W^{-1} S_B = \begin{bmatrix} 9.2213 & 6.4890 \\ 4.2339 & 2.9794 \end{bmatrix}
Step 4: Sort Eigenvalues and Choose Top Eigenvectors
Sort the eigenvalues in descending order and choose the top k (in this case, k =1)
eigenvector(s).
|S_W^{-1} S_B - \lambda I| = \begin{vmatrix} 9.2213 - \lambda & 6.4890 \\ 4.2339 & 2.9794 - \lambda \end{vmatrix} = 0

\lambda^2 - 12.2007\lambda = 0

\lambda = 0, \; 12.2007

The corresponding eigenvectors are

v_1 = \begin{bmatrix} -0.5755 \\ 0.8178 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0.9088 \\ 0.4173 \end{bmatrix}

where v_2 corresponds to the larger eigenvalue \lambda = 12.2007 and is therefore the chosen direction.
Project the original data onto the selected eigenvector(s) to obtain the transformed 1-dimensional data. This leads to good separability between the two classes.
The above steps illustrate how LDA can be applied to this simple dataset, resulting in a 1-
dimensional subspace that maximizes the separation between the two classes. In a real-
world scenario with higher-dimensional data, the same principles apply, and LDA aims to
find the linear combinations of features that best discriminate between different classes.
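The scatter-matrix arithmetic above can be verified in plain Python. The sketch below rebuilds S_W and S_B from the two point sets and recovers the leading eigenvalue of S_W^{-1} S_B from the 2x2 characteristic equation:

```python
# LDA worked example: compute S_W, S_B, and the leading eigenvalue of
# S_W^{-1} S_B for the two 2-D classes used in the text (pure Python).
import math

X1 = [(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)]
X2 = [(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)]

def mean(X):
    n = len(X)
    return [sum(p[j] for p in X) / n for j in range(2)]

def scatter(X, m):
    # (1/(n-1)) * sum over points of the outer product of (p - m)
    n = len(X)
    S = [[0.0, 0.0], [0.0, 0.0]]
    for p in X:
        d = [p[0] - m[0], p[1] - m[1]]
        for a in range(2):
            for b in range(2):
                S[a][b] += d[a] * d[b] / (n - 1)
    return S

m1, m2 = mean(X1), mean(X2)
S1, S2 = scatter(X1, m1), scatter(X2, m2)
SW = [[S1[a][b] + S2[a][b] for b in range(2)] for a in range(2)]

d = [m1[0] - m2[0], m1[1] - m2[1]]
SB = [[d[a] * d[b] for b in range(2)] for a in range(2)]

# M = SW^{-1} SB, inverting the 2x2 SW analytically
det = SW[0][0] * SW[1][1] - SW[0][1] * SW[1][0]
SWinv = [[SW[1][1] / det, -SW[0][1] / det], [-SW[1][0] / det, SW[0][0] / det]]
M = [[sum(SWinv[a][k] * SB[k][b] for k in range(2)) for b in range(2)]
     for a in range(2)]

# Leading eigenvalue of M from lambda^2 - trace*lambda + det = 0
tr = M[0][0] + M[1][1]
dM = M[0][0] * M[1][1] - M[0][1] * M[1][0]
lam = (tr + math.sqrt(tr * tr - 4 * dM)) / 2
print(SW, lam)
```

Projecting each point onto the eigenvector for `lam` then yields the 1-dimensional representation discussed above.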
Applications:
The typical scenario used to illustrate Independent Component Analysis (ICA) is known as
the "Cocktail Party Problem." In its simplest form, consider two individuals engaged in a
conversation at a cocktail party, analogous to the red and blue speakers in the scenario
described. In this setup, two microphones are strategically positioned near the partygoers,
resembling the placement of the purple and pink microphones. Both microphones capture
the voices of both individuals at varying volumes determined by their proximity to the
microphones. Essentially, two audio files are recorded, each containing a mix of the two
speakers.
The challenge presented is how to effectively separate the voices within each file, aiming to
acquire distinct recordings for each speaker. Independent Component Analysis (ICA) proves
to be a straightforward solution to this problem. ICA transforms a set of vectors into a
maximally independent set. In the context of the "Cocktail Party Problem," ICA works to
convert the two mixed audio recordings (depicted by the purple and pink waveforms) into
two separate, unmixed recordings, each representing the individual speaker (illustrated by
the blue and red waveforms). It's noteworthy that the number of inputs and outputs
remains the same in this process. Additionally, since the outputs are mutually independent,
there isn't an apparent method for discarding components, as seen in Principal Component
Analysis (PCA).
Figure: Converting mixed signal into independent components using ICA
Procedure:
4. Projection: Project the original data onto the subspace defined by the independent
components.
Applications:
Speech Signal Separation: Separate different speakers in a mixed audio signal.
Objective: PCA focuses on maximizing variance, while LDA aims to maximize the
separation between classes.
Use Case: PCA is suitable for tasks like noise reduction and visualization, while LDA
is effective in classification problems.
Assumption: PCA assumes Gaussian distributions, whereas ICA does not assume a
specific distribution.
At its core, classification is a supervised machine learning task where the primary objective
is to assign predefined categories or labels to input data based on its features. The process
involves training a model on a labeled dataset, where instances are associated with known
class labels. The goal is to develop a predictive model that can generalize well to unseen
instances, thereby accurately assigning appropriate labels. The significance of classification
lies in its ability to automate decision-making processes, enabling systems to categorize
and interpret data in a manner analogous to human reasoning.
1. Problem Definition: Clearly articulate the classification problem at hand. Define the
classes or categories that the algorithm should predict, setting the foundation for
subsequent steps.
2. Data Collection and Preparation: Gather a labeled dataset where each instance is
paired with a corresponding class label. This dataset is then divided into two subsets:
training data, used to train the model, and testing data, reserved for evaluating the model's
performance. Data preprocessing techniques are applied to handle missing values,
normalize features, and encode categorical variables.
3. Feature Selection and Extraction: Identify relevant features that contribute to the
prediction task. Optionally, perform feature extraction or dimensionality reduction
techniques to enhance model efficiency and interpretability.
5. Data Splitting: Divide the dataset into training and testing sets, ensuring that the model
is evaluated on unseen data. This step guards against overfitting and provides a realistic
assessment of the model's generalization capabilities.
6. Model Training: Feed the training data into the chosen classification algorithm, allowing
the model to learn patterns and relationships between features and class labels.
7. Model Evaluation: Assess the model's performance using the testing dataset. Common
evaluation metrics such as accuracy, precision, recall, and the confusion matrix offer
insights into the model's effectiveness. Fine-tune hyperparameters if necessary to optimize
performance.
8. Prediction and Deployment: Once satisfied with the model's performance, deploy it to
make predictions on new, unseen data. Continuous monitoring and potential updates are
essential to ensure the model remains relevant as new data becomes available.
Consider the dance floor as your data space, where each dancer represents a data point.
When you decide to dance, you might instinctively choose the person closest to you and
mimic their moves. This initial approach reflects the essence of the nearest neighbour
method: learning from the proximity of similar instances. However, acknowledging that not
everyone may be an expert dancer, you expand your observation to a few more people,
trying to discern a consensus. This act of observing multiple nearby dancers and adapting
your moves based on what most of them are doing mirrors the core principle of nearest
neighbour methods in machine learning.
In the realm of machine learning, nearest neighbour methods are a family of algorithms
that operate based on the idea of proximity. When faced with a task where a model is not
readily available, and the underlying structure of the data is unknown, these methods
leverage the similarity of data points to make predictions or classifications.
FIGURE 5: The nearest neighbors decision boundary with left: one neighbor and right:
two neighbors
4. Distance Metrics:
Application Scenarios:
1. Classification:
2. Regression:
3. Anomaly Detection:
Strengths:
Nearest neighbour methods are simple, intuitive, and effective for high-
dimensional data or situations where the underlying structure is complex.
Considerations:
The k-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised machine
learning algorithm used for both classification and regression tasks. In KNN, the prediction
for a new data point is based on the majority class (for classification) or the average (for
regression) of its k nearest neighbors in the feature space. The "nearest" is determined by a
distance metric, commonly Euclidean distance.
Dataset: Imagine we have data points representing red and white wines based on the levels
of rutin and myricetin. Each wine sample is a data point in a two-dimensional space.
Wine type | Rutin level | Myricetin level
Red       | 3           | 4
White     | 2           | 3
In the above figure, the x-axis represents rutin levels and the y-axis represents myricetin
levels, the data points for red and white wines can be plotted. This graph visually
represents the trends in data for the two types of wines.
The "K" in KNN refers to the number of nearest neighbors considered in the decision-
making process. In simpler terms, when determining the class of a new data point, KNN
looks at its closest neighbors to make a prediction. The choice of " k " is a crucial parameter
that influences the algorithm's performance.
Classifying a New Wine: Suppose we introduce a new glass of wine into the dataset, and we want to determine whether it is red or white. Let's set k = 5 for this scenario.
To classify the new wine, we identify its five nearest neighbors on the chart. The
classification is then determined by the majority of votes from these neighbors. If four out
of the five nearest neighbors are red wines, the new point would be classified as a red wine.
1. Recommendation Engine:
2. Concept Search:
3. Missing Data Imputation: Datasets often contain missing values, which can be problematic for machine learning models or analysis.
4. Pattern Recognition:
5. Banking:
Benefits:
Considerations:
K-Nearest Neighbors (KNN) relies on distance metrics to determine the proximity of data
points. Different distance metrics are suitable for various types of data and scenarios. Let's
explore some common distance metrics used in KNN:
1. Hamming Distance:
Hamming distance counts the number of positions at which two equal-length vectors (typically binary or categorical) differ.
2. Euclidean Distance:
Euclidean distance is the most popular distance measure used for finding the
distance between two real-valued vectors (e.g., integers or floats). It
measures the straight-line distance between two points in a
multidimensional space.
Euclidean Distance:

d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}
3. Manhattan Distance:
Manhattan Distance:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
4. Minkowski Distance:
Minkowski Distance:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
Choosing the appropriate distance metric is crucial in the KNN algorithm, as it influences
the model's performance and ability to capture relationships within the data. The selection
depends on the nature of the features and the characteristics of the problem at hand.
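Each of these metrics is a one-line function; the sketch below writes them out in plain Python with a couple of spot checks:

```python
# The four distance metrics above, written as plain functions.
from typing import Sequence

def hamming(x: Sequence, y: Sequence) -> int:
    """Number of positions where the two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x: Sequence[float], y: Sequence[float], p: float) -> float:
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(euclidean([0, 0], [3, 4]))      # 5.0
print(manhattan([0, 0], [3, 4]))      # 7
print(hamming("karolin", "kathrin"))  # 3
```

Minkowski generalizes the other two real-valued metrics, which is why libraries often expose only it with a `p` parameter.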
2. Step 2 - Calculate Distances:
For each data point in the dataset, calculate its distance to the new data point using a distance metric (e.g., Euclidean distance). The distance is calculated in the multi-dimensional feature space.
3. Step 3 - Identify Neighbors:
Identify the k data points with the shortest distances to the new data point.
These k data points are considered the "nearest neighbors."
4. Step 4 - Make a Prediction:
For classification tasks, assign the class label that is most common among the k nearest neighbors to the new data point.
For regression tasks, predict the average or weighted average of the target values of the k nearest neighbors.
Example:
Let's consider a simple example for a binary classification problem with two features (X1
and X2) and two classes (Class 0 and Class 1).
Point | X1 | X2 | Class
A     | 1  | 1  | 0
B     | 2  | 0  | 0
C     | 2  | 1  | 1
D     | 3  | 3  | 1
E     | 4  | 2  | 0
Now, suppose we have a new data point with features X1=3 and X2=2, and we want to
classify it using KNN with k=3.
1. Calculate Distances:
Calculate the Euclidean distance between the new data point and each
existing data point in the dataset.
2. Identify Neighbors:
Select the three data points with the shortest distances: D , E , and C .
Since two out of the three nearest neighbors belong to Class 1, we predict
that the new data point belongs to Class 1.
In this example, the KNN algorithm predicts that the new data point with features X1=3 and
X2=2 belongs to Class 1 based on the majority class among its three nearest neighbors (
D , E , and C ).
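The walk-through above takes only a few lines to reproduce; the sketch below classifies the new point (3, 2) with k = 3 using the five labeled points:

```python
# KNN from scratch on the example: points A-E with classes, new point (3, 2), k = 3.
from collections import Counter

data = {
    "A": ((1, 1), 0),
    "B": ((2, 0), 0),
    "C": ((2, 1), 1),
    "D": ((3, 3), 1),
    "E": ((4, 2), 0),
}
new = (3, 2)
k = 3

def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# Rank all points by distance to the new point and take the k closest
ranked = sorted(data.items(), key=lambda kv: euclidean(kv[1][0], new))
neighbors = ranked[:k]
labels = [cls for _, (_, cls) in neighbors]
prediction = Counter(labels).most_common(1)[0][0]

print([name for name, _ in neighbors], prediction)
```

The nearest neighbors come out as D, E, and C with labels 1, 0, 1, so the majority vote assigns Class 1, matching the hand calculation.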
2. Computation cost is quite high because we need to compute the distance of each
query instance to all training samples.
In two dimensions, the hyperplane that separates class A and class B is a line:
Figure: Separating Hyperplane in 2D (A Line)
If you take a closer look at the two-dimensional example, each of the following is a valid hyperplane that separates the classes A and B:
Figure: Separating Hyperplanes
So how do we decide which hyperplane is the most optimal? Enter maximum margin
classifier.
The optimal hyperplane is the one that separates the two classes while maximizing the
margin between them. And a classifier that functions thus is called a maximum margin
classifier.
Figure: Maximum Margin Classifier
But what if your data points were distributed like this? The classes are still perfectly
separable by a hyperplane, and the hyperplane that maximizes the margin will look like
this:
Figure: Is the Maximum Margin Classifier Optimal?
But do you see the problem with this approach? Well, it still achieves class separation.
However, this is a high variance model that is, perhaps, trying to fit the class A points too
well.
Notice, however, that the margin does not have any misclassified data point. Such a
classifier is called a hard margin classifier.
Take a look at this classifier instead. Won't such a classifier perform better? This is a
substantially lower variance model that would do reasonably well on classifying both
points from class A and class B.
Figure: Linear Support Vector Classifier
Notice that we have a misclassified data point inside the margin. Such a classifier that
allows minimal misclassifications is a soft margin classifier.
The soft margin classifier we have is a linear support vector classifier. The points are
separable by a line (or a linear equation). If you’ve been following along so far, it should be
clear what support vectors are and why they are called so.
Each data point is a vector in the feature space. The data points that are closest to the
separating hyperplane are called support vectors because they support or aid the
classification.
It's also interesting to note that if you remove a single data point or a subset of data points that are not support vectors, the separating hyperplane does not change. But if you remove one or more support vectors, the hyperplane changes.
In the examples so far, the data points were linearly separable. So we could fit a soft margin classifier with the least possible error. But what if the data points were distributed like this?
Figure: Non-linearly Separable Data
Problem: The data points are not linearly separable in the original feature space.
Solution: Project the points onto a higher dimensional space where they are linearly separable.
But projecting the points onto a higher dimensional features space requires us to map the data
points from the original feature space to the higher dimensional space.
This recomputation comes with non-negligible overhead, especially when the space that we want to
project onto is of much higher dimensions than the original feature space. Here's where the kernel trick
comes into play.
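To make the kernel trick concrete, consider the degree-2 polynomial kernel k(x, z) = (x·z + 1)² (an illustrative choice, not one prescribed by the text). It equals the inner product under an explicit 6-dimensional feature map φ, yet computing the kernel never constructs φ:

```python
# Kernel trick, illustrated: for the polynomial kernel k(x, z) = (x.z + 1)^2
# there is an explicit 6-dimensional feature map phi with
# phi(x) . phi(z) == k(x, z); the kernel skips building phi entirely.
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    s = math.sqrt(2)
    return [1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2]

def poly_kernel(x, z):
    dot = x[0] * z[0] + x[1] * z[1]
    return (dot + 1) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in 6-D space
rhs = poly_kernel(x, z)                           # computed in the original 2-D space
print(lhs, rhs)
```

The two numbers agree, so a classifier that only needs inner products can work in the higher-dimensional space at the cost of a 2-D dot product.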
Mathematically, the support vector classifier can be represented by the following equation [1]:

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle

where S is the set of support vectors and \langle x, x_i \rangle denotes the inner product between x and the support vector x_i.
Ensemble Learning
Ensemble Learning is a technique that creates multiple models and then combines them to
produce improved results. Ensemble learning usually produces more accurate solutions
than a single model would.
o Ensemble learning for regression creates multiple regressors, i.e. multiple regression
models such as linear, polynomial, etc.
o Ensemble learning for classification creates multiple classifiers, i.e. multiple classification
models such as logistic regression, decision trees, KNN, SVM, etc.
Multiple machine learning models are generated using the same or different machine
learning algorithms. These are called "base models". The prediction is performed on the
basis of these base models.
We discussed many different learning algorithms in the previous chapters. Though these
are generally successful, no one single algorithm is always the most accurate. Now, we are
going to discuss models composed of multiple learners that complement each other so that
by combining them, we attain higher accuracy.
There are also different ways the multiple base-learners are combined to generate the final
output:
Multiexpert combination methods have base-learners that work in parallel. These methods
can in turn be divided into two:
In the global approach, also called learner fusion, given an input, all base-learners generate
an output and all these outputs are used.
In the local approach, or learner selection, for example, in mixture of experts, there is a
gating model, which looks at the input and chooses one (or very few) of the learners as
responsible for generating the output.
Multistage combination
Multistage combination methods use a serial approach where the next base-learner is
trained with or tested on only the instances where the previous base-learners are not
accurate enough. The idea is that the base-learners (or the different representations they
use) are sorted in increasing complexity so that a complex base-learner is not used (or its
complex representation is not extracted) unless the preceding simpler base-learners are
not confident.
An example is cascading.
Let us say that we have L base-learners. We denote by d_j(x) the prediction of base-learner M_j given the arbitrary-dimensional input x. In the case of multiple representations, each M_j uses a different input representation x_j. The final prediction is calculated from the predictions of the base-learners:

y = f(d_1, d_2, \ldots, d_L \mid \Phi)

where f(\cdot) is the combining function and \Phi denotes its parameters.
When there are K outputs, for each learner there are d_{ji}(x), i = 1, \ldots, K, j = 1, \ldots, L, and, combining them, we also generate K values, y_i, i = 1, \ldots, K; then, for example in classification, we choose the class with the maximum y_i value:

Choose C_i if y_i = \max_{k=1}^{K} y_k
3.2 Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners (refer to figure 1 above):

y_i = \sum_j w_j d_{ji}, \quad w_j \geq 0, \quad \sum_j w_j = 1

This is also known as ensembles and linear opinion pools. In the simplest case, all learners are given equal weight and we have simple voting that corresponds to taking an average. Still, taking a (weighted) sum is only one of the possibilities and there are also other combination rules, as shown in table 1. If the outputs are not posterior probabilities, these rules require that outputs be normalized to the same scale.
In weighted sum, d ji is the vote of learner j for class C i and w j is the weight of its vote.
Simple voting is a special case where all voters have equal weight, namely, w j=1/L . In
classification, this is called plurality voting where the class having the maximum number of
votes is the winner.
When there are two classes, this is majority voting where the winning class gets more than
half of the votes. If the voters can also supply the additional information of how much they
vote for each class (e.g., by the posterior probability), then after normalization, these can be
used as weights in a weighted voting scheme. Equivalently, if the d_ji are the class posterior probabilities, P(C_i | x, M_j), then we can just sum them up (w_j = 1/L) and choose the class with maximum y_i.
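As a minimal sketch of the weighted sum rule (the learner outputs and weights below are made-up numbers, not taken from any dataset in the text):

```python
import numpy as np

# d[j, i]: the vote of base-learner j for class C_i (here, posterior
# probabilities from L = 3 hypothetical learners over K = 2 classes).
d = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.4, 0.6]])

# Simple voting: equal weights w_j = 1/L, i.e., averaging the learners.
L = d.shape[0]
w = np.full(L, 1.0 / L)

# Weighted sum rule: y_i = sum_j w_j * d_ji.
y = w @ d

# Choose the class with the maximum y_i.
prediction = int(np.argmax(y))
```

With these numbers the first class collects more total support, so it wins the vote.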
In the case of regression, simple or weighted averaging or median can be used to fuse the
outputs of base-regressors. Median is more robust to noise than the average.
Another possible way to find w j is to assess the accuracies of the learners (regressor or
classifier) on a separate validation set and use that information to compute the weights, so
that we give more weights to more accurate learners.
Voting schemes can be seen as approximations under a Bayesian framework, with weights approximating prior model probabilities and model decisions approximating model-conditional likelihoods:

P(C_i | x) = Σ_{all models M_j} P(C_i | x, M_j) P(M_j)
Let us assume that the d_j are iid with expected value E[d_j] and variance Var(d_j). Then, when we take a simple average with w_j = 1/L, the expected value and variance of the output are:

E[y] = E[(1/L) Σ_j d_j] = (1/L) · L · E[d_j] = E[d_j]

Var(y) = Var((1/L) Σ_j d_j) = (1/L²) Var(Σ_j d_j) = (1/L²) · L · Var(d_j) = (1/L) Var(d_j)
We see that the expected value does not change, so the bias does not change. But variance,
and therefore mean square error, decreases as the number of independent voters, L,
increases. In the general case,

Var(y) = (1/L²) Var(Σ_j d_j) = (1/L²) [ Σ_j Var(d_j) + 2 Σ_j Σ_{i<j} Cov(d_j, d_i) ]

which implies that if the learners are positively correlated, the variance (and error) is increased. We can thus view using different algorithms and input features as efforts to decrease, if not eliminate, this positive correlation.
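The Var(y) = Var(d_j)/L result for independent voters can be checked numerically; the following sketch uses synthetic, independent predictions with unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_trials = 10, 100_000

# L independent voters, each with Var(d_j) = 1, over many trials.
d = rng.normal(0.0, 1.0, size=(n_trials, L))

# Simple average with w_j = 1/L.
y = d.mean(axis=1)

# Empirically, Var(y) is close to Var(d_j) / L = 1/10.
var_y = float(np.var(y))
```

The measured variance comes out close to 0.1, a tenfold reduction over any single voter.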
3.3 Bagging
Bootstrap aggregating, often abbreviated as bagging, involves having each model in the
ensemble vote with equal weight. In order to promote model variance, bagging trains each
model in the ensemble using a randomly drawn subset of the training set. As an example,
the random forest algorithm combines random decision trees with bagging to achieve very
high classification accuracy.
The simplest method of combining classifiers is known as bagging, which stands for
bootstrap aggregating, the statistical description of the method. This is fine if you know
what a bootstrap is, but fairly useless if you don’t. A bootstrap sample is a sample taken
from the original dataset with replacement, so that we may get some data several times and
others not at all. The bootstrap sample is the same size as the original, and lots and lots of
these samples are taken: B of them, where B is at least 50, and could even be in the
thousands. The name bootstrap is more popular in computer science than anywhere else,
since there is also a bootstrap loader, which is the first program to run when a computer is
turned on. It comes from the nonsensical idea of ‘picking yourself up by your bootstraps,’
which means lifting yourself up by your shoelaces, and is meant to imply starting from
nothing.
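Drawing a bootstrap sample is a one-liner in NumPy; the following sketch (indices stand in for data points) also shows that only roughly 63% of the original points appear in any one sample:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000

# Draw N indices with replacement: some points appear several times,
# others not at all, yet the sample is the same size as the original.
boot = rng.integers(0, N, size=N)

unique_fraction = len(np.unique(boot)) / N
```

`unique_fraction` comes out near 1 − 1/e ≈ 0.632, so each bootstrap sample misses roughly a third of the data.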
Bootstrap sampling seems like a very strange thing to do. We’ve taken a perfectly good
dataset, mucked it up by sampling from it, which might be good if we had made a smaller
dataset (since it would be faster), but we still ended up with a dataset the same size. Worse,
we’ve done it lots of times. Surely this is just a way to burn up computer time without
gaining anything. The benefit of it is that we will get lots of learners that perform slightly
differently, which is exactly what we want for an ensemble method. Another benefit is that
estimates of the accuracy of the classification function can be made without complicated
analytic work, by throwing computer resources at the problem (technically, bagging is a
variance reducing algorithm; the meaning of this will become clearer when we talk about
bias and variance). Having taken a set of bootstrap samples, the bagging method simply
requires that we fit a model to each dataset, and then combine them by taking the output to
be the majority vote of all the classifiers. A NumPy implementation is shown next, and then
we will look at a simple example.
# (Assumes numpy imported as np, and data, targets, features,
# nPoints, nSamples already defined.)
# Draw the bootstrap sample indices: column i indexes sample i.
samplePoints = np.random.randint(0, nPoints, (nPoints, nSamples))
classifiers = []
for i in range(nSamples):
    sample = []
    sampleTarget = []
    for j in range(nPoints):
        sample.append(data[samplePoints[j, i]])
        sampleTarget.append(targets[samplePoints[j, i]])
    # Train a decision tree on this bootstrap sample
    classifiers.append(self.tree.make_tree(sample, sampleTarget, features))
The example consists of taking the party data that was used to demonstrate the decision
tree, and restricting the trees to stumps, so that they can make a classification based on just
one variable.
When we want to construct the decision tree to decide what to do in the evening, we start
by listing everything that we’ve done for the past few days to get a suitable dataset (here,
the last ten days):
The output of a decision tree that uses the whole dataset for this is not surprising: it takes
the two largest classes, and separates them. However, using just stumps of trees and 20
samples, bagging can separate the data perfectly, as this output shows:
3.3.1 RANDOM FORESTS
A random forest is an ensemble learning method where multiple decision trees are
constructed and then they are merged to get a more accurate prediction.
If there is one method in machine learning that has grown in popularity over the last few
years, then it is the idea of random forests. The concept has been around for longer than
that, with several different people inventing variations, but the name that is most strongly
attached to it is that of Breiman, who also described the CART algorithm in unit 2.
As well as bagging the data, a random forest considers only a random subset of the features at each split. Besides increasing the randomness in the training of each tree, this also speeds up the training, since there are fewer features to search over at each stage. Of course, it does
introduce a new parameter (how many features to consider), but the random forest does
not seem to be very sensitive to this parameter; in practice, a subset size that is the square
root of the number of features seems to be common. The effect of these two forms of
randomness is to reduce the variance without affecting the bias. Another benefit of this is
that there is no need to prune the trees. There is another parameter that we don’t know
how to choose yet, which is the number of trees to put into the forest. However, this is fairly
easy to pick if we want optimal results: we can keep on building trees until the error stops
decreasing.
Once the set of trees is trained, the output of the forest is the majority vote for
classification, as with the other committee methods that we have seen, or the mean
response for regression. And those are pretty much the main features needed for creating a
random forest. The algorithm is given next before we see some results of using the random
forest.
Algorithm
1. The random forests algorithm generates many classification trees. Each tree is generated as
follows:
a) If the number of examples in the training set is N, take a sample of N examples at random -
but with replacement, from the original data. This sample will be the training set for
generating the tree.
b) If there are M input variables, a number m is specified such that at each node, m variables
are selected at random out of the M and the best split on these m is used to
split the node. The value of m is held constant during the generation of the various trees in
the forest.
2. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest.
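Step 1(b) can be sketched as follows; `random_split_features` is a hypothetical helper, and the m = √M subset size is the rule of thumb mentioned earlier:

```python
import numpy as np

def random_split_features(M, rng):
    # Pick m = round(sqrt(M)) distinct feature indices at random; only
    # these are searched for the best split at the current node.
    m = max(1, int(round(np.sqrt(M))))
    return rng.choice(M, size=m, replace=False)

rng = np.random.default_rng(1)
feats = random_split_features(16, rng)  # 4 distinct indices out of 0..15
```

The same value of m would be reused at every node of every tree in the forest.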
The implementation of this is very easy: we modify the decision tree to take an extra parameter,
which is m, the number of features that should be used in the selection set at each stage. We
will look at an example of using it shortly as a comparison to boosting.
Looking at the algorithm you might be able to see that it is a very unusual machine learning
method because it is embarrassingly parallel: since the trees do not depend upon each
other, you can both create and get decisions from different trees on different individual
processors if you have them. This means that the random forest can run on as many
processors as you have available with nearly linear speedup.
There is one more nice thing to mention about random forests, which is that with a little bit
of programming effort they come with built-in test data: the bootstrap sample will miss out
about 35% of the data on average, the so-called out-of-bootstrap examples. If we keep track
of these datapoints then they can be used as novel samples for that particular tree, giving
an estimated test error that we get without having to use any extra datapoints.
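The figure of "about 35%" can be verified directly: a given point is missed by one draw with probability 1 − 1/N, hence by all N draws with probability (1 − 1/N)^N, which approaches 1/e:

```python
N = 1000
# Chance that a given point is never drawn in N draws with replacement;
# this approaches 1/e ≈ 0.368 as N grows.
p_missed = (1 - 1 / N) ** N
```

So on average a bit over a third of the data is out-of-bootstrap for each tree, free to serve as test data.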
As a brief example of using the random forest, we start by demonstrating that the random
forest gets the correct results on the Party example that has been used in both this and the
previous chapters, based on 10 trees, each trained on 7 samples, and with just two levels
allowed in each tree:
As a rather more involved example, the car evaluation dataset in the UCI Repository
contains 1,728 examples aiming to classify whether or not a car is a good purchase based
on six attributes. The following results compare a single decision tree, bagging, and a
random forest with 50 trees, each based on 100 samples, and with a maximum depth of five
for each tree. It can be seen that the random forest is the most accurate of the three
methods.
Strengths and weaknesses
Strengths
• It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.
• Prototypes are computed that give information about the relation between the variables
and the classification.
• The capabilities of the above can be extended to unlabeled data, leading to unsupervised
clustering, data views and outlier detection.
• They can handle binary, categorical, and numerical features without any need for scaling.
Weaknesses
• A weakness of random forest algorithms is that when used for regression they cannot
predict beyond the range in the training data, and that they may over-fit data sets that are
particularly noisy.
• The sizes of the models created by random forests may be very large. It may take hundreds
of megabytes of memory and may be slow to evaluate.
• Random forest models are black boxes that are very hard to interpret.
3.4 Boosting
Given a large training set, we randomly divide it into three sets: X1, X2, and X3. We use X1 to train d1. We then
take X2 and feed it to d1. We take all instances misclassified by d1 and also as many
instances on which d1 is correct from X2, and these together form the training set of d2. We
then take X3 and feed it to d1 and d2. The instances on which d1 and d2 disagree form the
training set of d3. During testing, given an instance, we give it to d1 and d2; if they agree,
that is the response, otherwise the response of d3 is taken as the output.
1. Randomly divide the training set into three: X1, X2, X3.
2. Train d1 on X1.
3. Test d1 on X2; train d2 on the instances of X2 misclassified by d1, together with as many on which d1 is correct.
4. Test d1 and d2 on X3; train d3 on the instances of X3 on which d1 and d2 disagree.
It can be shown that this overall system has reduced error rate, and the error rate can arbitrarily be reduced by using such systems recursively, that is, a boosting system of three models used as dj in a higher system.
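The test-time rule described above (use d3 only when d1 and d2 disagree) can be sketched as:

```python
def boost3_predict(d1, d2, d3, x):
    # If d1 and d2 agree on x, that is the answer; otherwise use d3.
    y1, y2 = d1(x), d2(x)
    return y1 if y1 == y2 else d3(x)

# Toy classifiers (purely illustrative) exercising both branches.
agree = boost3_predict(lambda x: 1, lambda x: 1, lambda x: 0, x=None)
disagree = boost3_predict(lambda x: 1, lambda x: 0, lambda x: 0, x=None)
```

In the first call d1 and d2 agree, so their shared answer is returned; in the second they disagree, so d3 breaks the tie.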
Though it is quite successful, the disadvantage of the original boosting method is that it
requires a very large training sample. The sample should be divided into three and
furthermore, the second and third classifiers are only trained on a subset on which the
previous ones err. So unless one has a quite large training set, d2 and d3 will not have training sets of reasonable size.
3.4.1 AdaBoost
Freund and Schapire (1996) proposed a variant, named AdaBoost, short for adaptive
boosting, that uses the same training set over and over and thus need not be large, but the
classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base-learners, not just three.
AdaBoost algorithm
The idea is to modify the probabilities of drawing the instances as a function of the error. Let us say p_j^t denotes the probability that the instance pair (x^t, r^t) is drawn to train the jth base-learner. Initially, all p_1^t = 1/N. Then we add new base-learners as follows, starting from j = 1: ε_j denotes the error rate of d_j on the dataset it is trained with. AdaBoost requires that learners are weak, that is, ε_j < 1/2, ∀j; if not, we stop adding new base-learners. Note that this error rate is not on the original problem but on the dataset used at step j. We then define β_j = ε_j / (1 − ε_j) < 1 and set p_{j+1}^t = β_j p_j^t if d_j correctly classifies x^t; otherwise p_{j+1}^t = p_j^t. The probabilities are then normalized so that they sum to 1.
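A sketch of the training loop is given below. It uses the weighted-error form of AdaBoost rather than actual resampling by p, and the `weak_learner` interface and the fixed stump used to exercise it are illustrative assumptions:

```python
import numpy as np

def adaboost_train(X, r, weak_learner, J=10):
    # Assumed interface: weak_learner(X, r, p) returns a fitted classifier.
    N = len(r)
    p = np.full(N, 1.0 / N)            # initially p_1^t = 1/N
    learners, betas = [], []
    for j in range(J):
        d_j = weak_learner(X, r, p)
        miss = d_j(X) != r
        eps = float(np.sum(p[miss]))   # weighted error rate at step j
        if eps == 0 or eps >= 0.5:     # stop if not weak (or already perfect)
            break
        beta = eps / (1 - eps)
        p[~miss] *= beta               # shrink correctly classified instances
        p /= p.sum()                   # renormalize to probabilities
        learners.append(d_j)
        betas.append(beta)
    return learners, betas

# A fixed stump (hypothetical) that errs on the last instance.
X = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([0, 0, 1, 0])
stump = lambda X_, r_, p_: (lambda Z: (Z >= 2).astype(int))
learners, betas = adaboost_train(X, r, stump, J=5)
```

On this toy data the stump's weighted error starts at 1/4, so β_1 = 1/3; after reweighting, the same stump's error rises to 1/2 and the loop stops, having kept one learner.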
3.5 Stacking
The combiner learns what the correct output is when the base-learners give a certain
output combination. We cannot train the combiner function on the training data because
the base-learners may be memorizing the training set; the combiner system should
actually learn how the baselearners make errors. Stacking is a means of estimating and
correcting for the biases of the base-learners. Therefore, the combiner should be trained
on data unused in training the base-learners.
The outputs of the base-learners dj define a new L-dimensional space in which the
output discriminant/regression function is learned by the combiner function.
When we compare a trained combiner, as we have in stacking, with a fixed rule, such as in voting, we see that both have their advantages: a trained rule is more flexible and
may have less bias, but adds extra parameters, risks introducing variance, and needs
extra time and data for training. Note also that there is no need to normalize classifier
outputs before stacking.
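A minimal stacking sketch for regression, with made-up data and two toy polynomial base-regressors; the key point is that the least-squares combiner is fitted on a split the base-learners never saw:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 0.1 * rng.normal(size=200)

# Split: base-learners train on one half, the combiner on the other,
# so the combiner learns how the base-learners err on unseen data.
x_base, y_base = x[:100], y[:100]
x_comb, y_comb = x[100:], y[100:]

# Two toy base-regressors (illustrative): a linear and a cubic fit.
c1 = np.polyfit(x_base, y_base, 1)
c2 = np.polyfit(x_base, y_base, 3)
d = lambda z: np.column_stack([np.polyval(c1, z), np.polyval(c2, z)])

# Combiner: least-squares weights over the L = 2 base outputs.
w, *_ = np.linalg.lstsq(d(x_comb), y_comb, rcond=None)

mse = float(np.mean((d(x) @ w - y) ** 2))
```

The base outputs d_j(x) form the new 2-dimensional space in which the combiner's regression function lives, exactly as described above.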
Summary:
Manhattan distance is preferred for cases where features are in an integer space.
Minkowski distance allows flexibility by adjusting the parameter "p" to suit specific
requirements.